Merge pull request #449 from metamx/igalDruid

Igal druid
2014-03-27 17:47:21 -06:00 · 2014-03-27 17:47:21 -06:00 · aae8f4cf3c
parent 0be29df8ae adfcc3d16f
commit aae8f4cf3c
5 changed files with 38 additions and 11 deletions
--- a/docs/content/Indexing-Service.md
+++ b/docs/content/Indexing-Service.md
@ -39,9 +39,9 @@ Tasks are submitted to the overlord node in the form of JSON objects. Tasks can
 ```
 http://<OVERLORD_IP>:<port>/druid/indexer/v1/task
 ```
-this will return you the taskId of the submitted task.
+This returns the taskId of the submitted task.

-Tasks can cancelled via POST requests to:
+Tasks can be cancelled via POST requests to:

 ```
 http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/shutdown
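
Note on the two endpoints in the hunk above: a minimal Python sketch of how the submit and shutdown requests could be built, assuming only the URL patterns shown in the diff. The host name, port, task id, payload, and the `Content-Type` header are hypothetical placeholders, not values taken from this PR. The requests are constructed but not sent.

```python
import json
from urllib.request import Request


def task_url(overlord_host, port):
    # URL pattern copied from Indexing-Service.md; host and port are placeholders.
    return f"http://{overlord_host}:{port}/druid/indexer/v1/task"


def submit_request(overlord_host, port, task):
    # Tasks are JSON objects submitted via POST.
    # The Content-Type header is an assumption, not stated in the docs above.
    body = json.dumps(task).encode("utf-8")
    return Request(
        task_url(overlord_host, port),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def shutdown_request(overlord_host, port, task_id):
    # Cancelling a task is a POST to .../task/{taskId}/shutdown.
    url = f"{task_url(overlord_host, port)}/{task_id}/shutdown"
    return Request(url, data=b"", method="POST")


# Hypothetical values for illustration only; {"type": "example"} is not a real task spec.
submit = submit_request("overlord.example.com", 8080, {"type": "example"})
cancel = shutdown_request("overlord.example.com", 8080, "task_1")
print(submit.get_method(), submit.full_url)
print(cancel.get_method(), cancel.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; per the docs text above, the submit call is expected to return the taskId.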
--- a/docs/content/Plumber.md
+++ b/docs/content/Plumber.md
@ -3,17 +3,23 @@ layout: doc_page
 ---

 # Druid Plumbers
-The Plumber is the thing that handles generated segments both while they are being generated and when they are "done". This is also technically a pluggable interface and there are multiple implementations, but there are a lot of details handled by the plumber such that it is expected that there will only be a few implementations and only more advanced third-parties will implement their own. 
+The plumber handles generated segments both while they are being generated and when they are "done". This is also technically a pluggable interface and there are multiple implementations. However, plumbers handle numerous complex details, and therefore an advanced understanding of Druid is recommended before implementing your own. 

 |Field|Type|Description|Required|
 |-----|----|-----------|--------|
-|type|String|Specifies the type of plumber. Each value will have its own configuration schema, plumbers packaged with Druid are described below.|yes|
+|type|String|Specifies the type of plumber. Each value will have its own configuration schema. Plumbers packaged with Druid are described below.|yes|

-We provide a brief description of the example to exemplify the types of things that are configured on the plumber.
+The following can be configured on the plumber:
+
+* `windowPeriod` is the amount of lag time to allow events. For example, with a 10 minute window, any event with a timestamp more than 10 minutes in the past is thrown away and not included in the segment generated by the realtime server.
+* `basePersistDirectory` is the directory to put things that need persistence. The plumber is responsible for the actual intermediate persists and this tells it where to store those persists.
+* `maxPendingPersists` is how many persists a plumber can do concurrently without starting to block.
+* `segmentGranularity` specifies the granularity of the segment, or the amount of time a segment will represent.
+* `rejectionPolicy` sets the data acceptance policy for creating and handing off segments. The following policies are available:
+    * `serverTime` &ndash; The default policy, optimal for current data that is generated and ingested in real time. Uses `windowPeriod` to accept only those events that are inside the window looking forward and back.
+    * `none` &ndash; Never hands off data unless shutdown() is called on the configured firehose.
+    * `test` &ndash; Useful for testing that handoff is working, *not useful in terms of data integrity*. It uses the sum of `segmentGranularity` and `windowPeriod` as a window.

-*   `windowPeriod` is the amount of lag time to allow events. This is configured with a 10 minute window, meaning that any event more than 10 minutes ago will be thrown away and not included in the segment generated by the realtime server.
-*   `basePersistDirectory` is the directory to put things that need persistence. The plumber is responsible for the actual intermediate persists and this tells it where to store those persists.
-*   `maxPendingPersists` is how many persists a plumber can do concurrently without starting to block.


 Available Plumbers
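
Note on the `serverTime` policy described in the Plumber.md hunk above: a minimal Python sketch of the acceptance check, assuming it amounts to "the event timestamp lies within `windowPeriod` of the server's current time, looking forward and back". The function name and the sample timestamps are hypothetical; this is an illustration of the documented behaviour, not Druid's implementation.

```python
from datetime import datetime, timedelta, timezone


def server_time_accepts(event_time, now, window_period):
    # Assumption: an event is accepted when it is no more than window_period
    # away from the current server time, in either direction.
    return abs(now - event_time) <= window_period


# Hypothetical values: a 10 minute window around an arbitrary "now".
now = datetime(2014, 3, 27, 17, 47, tzinfo=timezone.utc)
window = timedelta(minutes=10)

print(server_time_accepts(now - timedelta(minutes=5), now, window))
print(server_time_accepts(now - timedelta(minutes=11), now, window))
```

With a 10 minute window, the event that is 5 minutes old falls inside the window and the one that is 11 minutes old does not, which matches the `windowPeriod` description in the list above.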
--- a/docs/content/Realtime-ingestion.md
+++ b/docs/content/Realtime-ingestion.md
@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+
 Realtime Data Ingestion
 =======================
 For general Real-time Node information, see [here](Realtime.html).
@ -11,6 +13,7 @@ For writing your own plugins to the real-time node, see [Firehose](Firehose.html

 Much of the configuration governing Realtime nodes and the ingestion of data is set in the Realtime spec file, discussed on this page.

+
 <a id="realtime-specfile"></a>
 ## Realtime "specFile"

@ -81,6 +84,7 @@ This is a JSON Array so you can give more than one realtime stream to a given no

 There are four parts to a realtime stream specification, `schema`, `config`, `firehose` and `plumber` which we will go into here.

+
 ### Schema

 This describes the data schema for the output Druid segment. More information about concepts in Druid and querying can be found at [Concepts-and-Terminology](Concepts-and-Terminology.html) and [Querying](Querying.html).
@ -92,6 +96,7 @@ This describes the data schema for the output Druid segment. More information ab
 |indexGranularity|String|The granularity of the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it will aggregate values together using the aggregators instead of storing individual rows.|yes|
 |shardSpec|Object|This describes the shard that is represented by this server. This must be specified properly in order to have multiple realtime nodes indexing the same data stream in a sharded fashion.|no|

+
 ### Config

 This provides configuration for the data processing portion of the realtime stream processor.
@ -101,6 +106,22 @@ This provides configuration for the data processing portion of the realtime stre
 |intermediatePersistPeriod|ISO8601 Period String|The period that determines the rate at which intermediate persists occur. These persists determine how often commits happen against the incoming realtime stream. If the realtime data loading process is interrupted at time T, it should be restarted to re-read data that arrived at T minus this period.|yes|
 |maxRowsInMemory|Number|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size.|yes|

+
+### Firehose
+Firehoses describe the data stream source. See [Firehose](Firehose.html) for more information on firehose configuration.
+
+
+### Plumber
+The Plumber handles generated segments both while they are being generated and when they are "done". The configuration parameters in the example are:
+
+* `type` specifies the type of plumber in terms of configuration schema. The Plumber configuration in the example is for the often-used RealtimePlumber.
+* `windowPeriod` is the amount of lag time to allow events. The example configures a 10 minute window, meaning that any event with a timestamp more than 10 minutes in the past will be thrown away and not included in the segment generated by the realtime server.
+* `segmentGranularity` specifies the granularity of the segment, or the amount of time a segment will represent.
+* `basePersistDirectory` is the directory to put things that need persistence. The plumber is responsible for the actual intermediate persists and this tells it where to store those persists.
+
+See [Plumber](Plumber.html) for a fuller discussion of Plumber configuration.
+
+
 Constraints
 -----------

--- a/docs/content/Tasks.md
+++ b/docs/content/Tasks.md
@ -2,7 +2,7 @@
 layout: doc_page
 ---
 # Tasks
-Tasks are run on middle managers and always operate on a single data source.
+Tasks are run on middle managers and always operate on a single data source. Tasks are submitted using [POST requests](Indexing-Service.html).

 There are several different types of tasks.

@ -163,7 +163,7 @@ The indexing service can also run real-time tasks. These tasks effectively trans
 |availabilityGroup|String|An uniqueness identifier for the task. Tasks with the same availability group will always run on different middle managers. Used mainly for replication. |yes|
 |requiredCapacity|Integer|How much middle manager capacity this task will take.|yes|

-For schema, fireDepartmentConfig, windowPeriod, segmentGranularity, and rejectionPolicy, see [Realtime Ingestion](Realtime-ingestion.html). For firehose configuration, see [Firehose](Firehose.html).
+For schema, windowPeriod, segmentGranularity, and other configuration information, see [Realtime Ingestion](Realtime-ingestion.html). For firehose configuration, see [Firehose](Firehose.html).


 Segment Merging Tasks
--- a/docs/content/Tutorial:-Loading-Your-Data-Part-2.md
+++ b/docs/content/Tutorial:-Loading-Your-Data-Part-2.md
@ -167,7 +167,7 @@ You should be comfortable starting Druid nodes at this point. If not, it may be
  ]
  ```

-Note: This config uses a "test" rejection policy which will accept all events and timely hand off, however, we strongly recommend you do not use this in production. Using this rejection policy, segments for events for the same time range will be overridden.
+Note: This config uses a "test" [rejection policy](Plumber.html), which accepts all events and hands off segments in a timely manner. However, we strongly recommend that you do not use it in production: with this rejection policy, segments for events in the same time range will be overridden.

 3. Let's copy and paste some data into the Kafka console producer