add documentation on segments generated (#3785)

This commit is contained in:
David Lim 2016-12-19 10:41:47 -07:00 committed by Fangjin Yang
parent da007ca3c2
commit 8eee259629
1 changed file with 43 additions and 1 deletion

@@ -292,7 +292,7 @@ shuts down the currently running supervisor. When a supervisor is shut down in t
managed tasks to stop reading and begin publishing their segments immediately. The call to the shutdown endpoint will
return after all tasks have been signalled to stop but before the tasks finish publishing their segments.
-### Schema/Configuration Changes
+## Schema/Configuration Changes
Schema and configuration changes are handled by submitting the new supervisor spec via the same
`POST /druid/indexer/v1/supervisor` endpoint used to initially create the supervisor. The overlord will initiate a
@@ -300,3 +300,45 @@ graceful shutdown of the existing supervisor which will cause the tasks being ma
and begin publishing their segments. A new supervisor will then be started which will create a new set of tasks that
will start reading from the offsets where the previous now-publishing tasks left off, but using the updated schema.
In this way, configuration changes can be applied without requiring any pause in ingestion.
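
As a rough sketch of what gets resubmitted (the dataSource, topic, and other values below are illustrative only), a
complete Kafka supervisor spec with, for example, a newly added dimension would be POSTed to
`/druid/indexer/v1/supervisor` in its entirety:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "timestamp", "format": "auto" },
        "dimensionsSpec": { "dimensions": ["page", "language", "newly-added-dimension"] }
      }
    },
    "metricsSpec": [ { "type": "count", "name": "count" } ],
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": { "bootstrap.servers": "kafkabroker.example.com:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  },
  "tuningConfig": { "type": "kafka" }
}
```

Note that the spec is submitted as a whole rather than as a partial update; the overlord then performs the graceful
task handover described above.
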
## Deployment Notes
### On the Subject of Segments
The Kafka indexing service can generate a large number of segments, which over time will cause query
performance issues if not properly managed. One important characteristic to understand is that the Kafka indexing task
will generate a Druid partition in each segment granularity interval for each partition in the Kafka topic. As an
example, if you are ingesting realtime data and your segment granularity is 15 minutes with 10 partitions in the Kafka
topic, you would generate a minimum of 40 segments an hour. This is a limitation imposed by the Kafka architecture,
which guarantees delivery order within a partition but not across partitions. Therefore, as a consumer of Kafka, the
indexing tasks need to handle each partition separately in order to generate segments deterministically (and to be able
to provide exactly-once ingestion semantics).

Compounding this, if your taskDuration were also set to 15 minutes, you would actually generate 80 segments an hour,
since any given 15 minute interval would be handled by two tasks. For an example of this behavior, let's say we started
the supervisor at 9:05 with a 15 minute segment granularity. The first task would create a segment for 9:00-9:15 and a
segment for 9:15-9:30 before stopping at 9:20. A second task would be created at 9:20 which would create another segment
for 9:15-9:30 and a segment for 9:30-9:45 before stopping at 9:35. Hence, if taskDuration and segmentGranularity are the
same duration, you will get two tasks generating a segment for each segment granularity interval.
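
Putting these two effects together, a rough lower bound on the number of segments created per hour is:

```
segments per hour >= (Kafka partitions) x (segment granularity intervals per hour) x (tasks covering each interval)
                   = 10 x 4 x 2 = 80    (for the example above, where taskDuration equals segmentGranularity)
```
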
Understanding this behavior is the first step to managing the number of segments produced. Some recommendations for
keeping the number of segments low are:

* Keep the number of Kafka partitions to the minimum needed to sustain the required throughput for your event streams.
* Increase segment granularity and task duration so that more events are written into the same segment (see the spec
  fragment after this list). One consideration here is that segments are only handed off to historical nodes after the
  task duration has elapsed. Since workers tend to be configured with fewer query-serving resources than historical
  nodes, query performance may suffer if tasks run excessively long without handing off segments.
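
As a sketch of where those two settings live (the one hour values below are illustrative, not a recommendation), the
relevant fragments of the supervisor spec are `segmentGranularity` in the dataSchema's granularitySpec and
`taskDuration` in the ioConfig; they would be changed inside the full spec rather than submitted on their own:

```json
{
  "dataSchema": {
    "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
  },
  "ioConfig": {
    "taskDuration": "PT1H"
  }
}
```
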
In many production installations that have been ingesting events for a long period of time, these suggestions alone
will not be sufficient to keep the number of segments at an optimal level. It is recommended that scheduled re-indexing
tasks be run to merge segments together into new segments of an ideal size (in the range of ~500-700 MB per segment).
Currently, the recommended way of doing this is by running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks. Details on how to do this can be found under
['Updating Existing Data'](../../ingestion/update-existing-data.html). Note that the Merge Task and Append Task described
[here](../../ingestion/tasks.html) will not work, as they require unsharded segments while Kafka indexing tasks always
generate sharded segments.
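
As a sketch of that approach (the dataSource name and interval are placeholders, and the dataSchema is abbreviated; see
the linked pages for the full task format), an `index_hadoop` task can read its input from existing segments by using
the `dataSource` inputSpec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "metrics-kafka",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2016-12-01/2016-12-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "metrics-kafka",
          "intervals": ["2016-12-01/2016-12-02"]
        }
      }
    }
  }
}
```

Running such a job on a schedule (for example, daily over the previous day's interval) rewrites the many small segments
produced by the Kafka tasks into a smaller number of larger segments at the chosen segment granularity.
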
There is ongoing work to support automatic compaction of sharded segments, as well as compaction that does not require
Hadoop (see [here](https://github.com/druid-io/druid/pull/1998) and [here](https://github.com/druid-io/druid/pull/3611)
for related PRs).