From 0f4e5cb1256121e8c736547cdc4c58a01c62fb86 Mon Sep 17 00:00:00 2001
From: Igal Levy
Date: Thu, 27 Mar 2014 16:18:56 -0700
Subject: [PATCH] added firehose and plumber sections, which were being referenced but were missing

---
 docs/content/Realtime-ingestion.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/docs/content/Realtime-ingestion.md b/docs/content/Realtime-ingestion.md
index fec97b4fc20..6bc1c32cf53 100644
--- a/docs/content/Realtime-ingestion.md
+++ b/docs/content/Realtime-ingestion.md
@@ -1,6 +1,8 @@
 ---
 layout: doc_page
 ---
+
+
 Realtime Data Ingestion
 =======================
 For general Real-time Node information, see [here](Realtime.html).
@@ -11,6 +13,7 @@ For writing your own plugins to the real-time node, see [Firehose](Firehose.html

 Much of the configuration governing Realtime nodes and the ingestion of data is set in the Realtime spec file, discussed on this page.

+
 ## Realtime "specFile"

@@ -81,6 +84,7 @@ This is a JSON Array so you can give more than one realtime stream to a given no

 There are four parts to a realtime stream specification, `schema`, `config`, `firehose` and `plumber` which we will go into here.

+
 ### Schema

 This describes the data schema for the output Druid segment. More information about concepts in Druid and querying can be found at [Concepts-and-Terminology](Concepts-and-Terminology.html) and [Querying](Querying.html).
@@ -92,6 +96,7 @@ This describes the data schema for the output Druid segment. More information ab
 |indexGranularity|String|The granularity of the data inside the segment. E.g. a value of "minute" will mean that data is aggregated at minutely granularity. That is, if there are collisions in the tuple (minute(timestamp), dimensions), then it will aggregate values together using the aggregators instead of storing individual rows.|yes|
 |shardSpec|Object|This describes the shard that is represented by this server. This must be specified properly in order to have multiple realtime nodes indexing the same data stream in a sharded fashion.|no|

+
 ### Config

 This provides configuration for the data processing portion of the realtime stream processor.
@@ -101,6 +106,22 @@ This provides configuration for the data processing portion of the realtime stre
 |intermediatePersistPeriod|ISO8601 Period String|The period that determines the rate at which intermediate persists occur. These persists determine how often commits happen against the incoming realtime stream. If the realtime data loading process is interrupted at time T, it should be restarted to re-read data that arrived at T minus this period.|yes|
 |maxRowsInMemory|Number|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size.|yes|

+
+### Firehose
+Firehoses describe the data stream source. See [Firehose](Firehose.html) for more information on firehose configuration.
+
+
+### Plumber
+The Plumber handles generated segments both while they are being generated and when they are "done". The configuration parameters in the example are:
+
+* `type` specifies the type of plumber in terms of configuration schema. The Plumber configuration in the example is for the often-used RealtimePlumber.
+* `windowPeriod` is the amount of lag time to allow events. The example configures a 10 minute window, meaning that any event more than 10 minutes ago will be thrown away and not included in the segment generated by the realtime server.
+* `segmentGranularity` specifies the granularity of the segment, or the amount of time a segment will represent.
+* `basePersistDirectory` is the directory to put things that need persistence. The plumber is responsible for the actual intermediate persists and this tells it where to store those persists.
+
+See [Plumber](Plumber.html) for a fuller discussion of Plumber configuration.
+
+
 Constraints
 -----------