fix segment info in Kafka indexing service docs (#5390)

* fix segment info in Kafka indexing service docs

* review updates
Parag Jain 2018-02-15 11:57:30 -06:00 committed by Gian Merlino
parent be38b18a85
commit b9b3be6965
1 changed file with 13 additions and 30 deletions


@@ -311,35 +311,18 @@ In this way, configuration changes can be applied without requiring any pause in
 ### On the Subject of Segments
 
-The Kafka indexing service may generate a significantly large number of segments which over time will cause query
-performance issues if not properly managed. One important characteristic to understand is that the Kafka indexing task
-will generate a Druid partition in each segment granularity interval for each partition in the Kafka topic. As an
-example, if you are ingesting realtime data and your segment granularity is 15 minutes with 10 partitions in the Kafka
-topic, you would generate a minimum of 40 segments an hour. This is a limitation imposed by the Kafka architecture which
-guarantees delivery order within a partition but not across partitions. Therefore as a consumer of Kafka, in order to
-generate segments deterministically (and be able to provide exactly-once ingestion semantics) partitions need to be
-handled separately.
-
-Compounding this, if your taskDuration was also set to 15 minutes, you would actually generate 80 segments an hour since
-any given 15 minute interval would be handled by two tasks. For an example of this behavior, let's say we started the
-supervisor at 9:05 with a 15 minute segment granularity. The first task would create a segment for 9:00-9:15 and a
-segment for 9:15-9:30 before stopping at 9:20. A second task would be created at 9:20 which would create another segment
-for 9:15-9:30 and a segment for 9:30-9:45 before stopping at 9:35. Hence, if taskDuration and segmentGranularity are the
-same duration, you will get two tasks generating a segment for each segment granularity interval.
-
-Understanding this behavior is the first step to managing the number of segments produced. Some recommendations for
-keeping the number of segments low are:
-
-* Keep the number of Kafka partitions to the minimum required to sustain the required throughput for your event streams.
-* Increase segment granularity and task duration so that more events are written into the same segment. One
-  consideration here is that segments are only handed off to historical nodes after the task duration has elapsed.
-  Since workers tend to be configured with less query-serving resources than historical nodes, query performance may
-  suffer if tasks run excessively long without handing off segments.
-
-In many production installations which have been ingesting events for a long period of time, these suggestions alone
-will not be sufficient to keep the number of segments at an optimal level. It is recommended that scheduled re-indexing
-tasks be run to merge segments together into new segments of an ideal size (in the range of ~500-700 MB per segment).
+Each Kafka Indexing Task puts the events consumed from its assigned Kafka partitions into a single segment per
+segment granularity interval until the maxRowsPerSegment limit is reached; at that point, a new partition for that
+segment granularity interval is created for further events. The Kafka Indexing Task also performs incremental hand-offs,
+meaning that the segments created by a task are not held until the task duration is over. As soon as the
+maxRowsPerSegment limit is hit, all segments held by the task at that point in time are handed off and a new set of
+segments is created for further events. This means a task can run for longer durations without accumulating old
+segments locally on Middle Manager nodes, and doing so is encouraged.
+
+The Kafka Indexing Service may still produce some small segments. For example, suppose the task duration is 4 hours,
+the segment granularity is set to HOUR, and the supervisor was started at 9:10; then, after 4 hours at 13:10, a new set
+of tasks is started and events for the interval 13:00 - 14:00 may be split across the previous and the new set of tasks.
+If this becomes a problem, scheduled re-indexing tasks can be run to merge segments together into new segments of an
+ideal size (in the range of ~500-700 MB per segment).
 Details on how to optimize the segment size can be found on [Segment size optimization](../../operations/segment-optimization.html).
 
-Note that the Merge Task and Append Task described [here](../../ingestion/tasks.html) will not work as they require
-unsharded segments while Kafka indexing tasks always generated sharded segments.
+There is also ongoing work to support automatic segment compaction of sharded segments as well as compaction not requiring
+Hadoop (see [here](https://github.com/druid-io/druid/pull/5102)).
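
The parameters the new text leans on (segment granularity, maxRowsPerSegment, and task duration) all live in the Kafka supervisor spec. As a rough illustration only, and not part of this commit, an abbreviated spec might set them as follows; the data source name, topic, broker address, and the specific values are made up, and required sections such as the parser and metricsSpec are omitted for brevity:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-kafka",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": {
      "bootstrap.servers": "kafka-broker01.example.com:9092"
    },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT4H"
  }
}
```

With settings along these lines, each task runs for roughly four hours, writes HOUR-granularity segments, and hands off the segments it holds whenever the maxRowsPerSegment limit is reached rather than waiting for the task duration to elapse, matching the behavior described in the added text above.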