mirror of https://github.com/apache/druid.git
add documentation on segments generated (#3785)
parent da007ca3c2
commit 8eee259629

@@ -292,7 +292,7 @@ shuts down the currently running supervisor. When a supervisor is shut down in t
managed tasks to stop reading and begin publishing their segments immediately. The call to the shutdown endpoint will
return after all tasks have been signalled to stop but before the tasks finish publishing their segments.

-### Schema/Configuration Changes
+## Schema/Configuration Changes

Schema and configuration changes are handled by submitting the new supervisor spec via the same
`POST /druid/indexer/v1/supervisor` endpoint used to initially create the supervisor. The overlord will initiate a

@@ -300,3 +300,45 @@ graceful shutdown of the existing supervisor which will cause the tasks being ma
and begin publishing their segments. A new supervisor will then be started which will create a new set of tasks that
will start reading from the offsets where the previous now-publishing tasks left off, but using the updated schema.
In this way, configuration changes can be applied without requiring any pause in ingestion.
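As a minimal sketch of this flow, the snippet below POSTs an updated spec to the endpoint mentioned above; the overlord address (`localhost:8090`) and the spec file name are assumptions made for the example.

```python
# Sketch: submit an updated Kafka supervisor spec to the overlord.
# Assumes the overlord listens on localhost:8090 and that
# updated-supervisor-spec.json holds the revised spec (hypothetical names).
import json
import urllib.request

with open("updated-supervisor-spec.json") as f:
    spec = json.load(f)

req = urllib.request.Request(
    "http://localhost:8090/druid/indexer/v1/supervisor",
    data=json.dumps(spec).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    # The overlord gracefully shuts down the old supervisor and starts a new
    # one whose tasks resume from the offsets where the old tasks left off.
    print(resp.status, resp.read().decode())
```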
## Deployment Notes

### On the Subject of Segments
The Kafka indexing service can generate a very large number of segments, which over time will cause query
performance issues if not properly managed. One important characteristic to understand is that the Kafka indexing task
will generate a Druid partition in each segment granularity interval for each partition in the Kafka topic. As an
example, if you are ingesting realtime data and your segment granularity is 15 minutes with 10 partitions in the Kafka
topic, you would generate a minimum of 40 segments an hour. This is a limitation imposed by the Kafka architecture, which
guarantees delivery order within a partition but not across partitions. Therefore, as a consumer of Kafka, in order to
generate segments deterministically (and be able to provide exactly-once ingestion semantics), partitions need to be
handled separately.
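To make the arithmetic concrete, here is a back-of-the-envelope calculation (an illustrative sketch, not a Druid API) of the minimum segment count implied by the example above.

```python
# Minimum segments per hour: one Druid partition per Kafka partition
# per segment granularity interval (values taken from the example above).
kafka_partitions = 10
segment_granularity_minutes = 15

intervals_per_hour = 60 // segment_granularity_minutes         # 4
min_segments_per_hour = kafka_partitions * intervals_per_hour
print(min_segments_per_hour)                                    # 40
```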
Compounding this, if your `taskDuration` was also set to 15 minutes, you would actually generate 80 segments an hour, since
any given 15 minute interval would be handled by two tasks. For an example of this behavior, let's say we started the
supervisor at 9:05 with a 15 minute segment granularity. The first task would create a segment for 9:00-9:15 and a
segment for 9:15-9:30 before stopping at 9:20. A second task would be created at 9:20 which would create another segment
for 9:15-9:30 and a segment for 9:30-9:45 before stopping at 9:35. Hence, if `taskDuration` and `segmentGranularity` are the
same duration, you will get two tasks generating a segment for each segment granularity interval.
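The timeline above can be sketched as a small simulation (illustrative only, not Druid code) showing why each 15 minute interval receives segments from two tasks when `taskDuration` equals the segment granularity.

```python
from datetime import datetime, timedelta

# Illustrative only: walk two consecutive tasks and list the 15 minute
# buckets each one writes segments for (supervisor started at 9:05).
granularity = timedelta(minutes=15)
task_duration = timedelta(minutes=15)
supervisor_start = datetime(2017, 1, 1, 9, 5)

def buckets_written(start, end):
    """Granularity intervals a task covers between its start and stop times."""
    gmin = int(granularity.total_seconds() // 60)
    bucket = start.replace(second=0, microsecond=0) - timedelta(minutes=start.minute % gmin)
    out = []
    while bucket < end:
        out.append(f"{bucket:%H:%M}-{bucket + granularity:%H:%M}")
        bucket += granularity
    return out

for i in range(2):
    start = supervisor_start + i * task_duration
    print(f"task {i + 1}: {buckets_written(start, start + task_duration)}")

# task 1: ['09:00-09:15', '09:15-09:30']
# task 2: ['09:15-09:30', '09:30-09:45']
# The 9:15-9:30 interval gets a segment from each task, doubling the count.
```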
Understanding this behavior is the first step to managing the number of segments produced. Some recommendations for
keeping the number of segments low are:
* Keep the number of Kafka partitions to the minimum required to sustain the required throughput for your event streams.
* Increase segment granularity and task duration so that more events are written into the same segment (see the sketch
  after this list for where these settings live in the supervisor spec). One consideration here is that segments are
  only handed off to historical nodes after the task duration has elapsed. Since workers tend to be configured with
  fewer query-serving resources than historical nodes, query performance may suffer if tasks run excessively long
  without handing off segments.
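The fragment below is a rough, partial sketch (placeholder datasource, topic, and values; other required fields omitted) of where the two settings from the second recommendation live in a Kafka supervisor spec: `segmentGranularity` in the dataSchema's `granularitySpec`, and `taskDuration` in the supervisor's `ioConfig`.

```python
# Partial Kafka supervisor spec (placeholder values) highlighting the two
# knobs discussed above; all other required fields are omitted for brevity.
supervisor_spec_fragment = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "my-datasource",            # placeholder
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",         # larger interval -> fewer segments
            "queryGranularity": "NONE",
        },
    },
    "ioConfig": {
        "topic": "my-topic",                      # placeholder
        "taskDuration": "PT1H",                   # longer tasks -> more events per segment
    },
}
```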
In many production installations which have been ingesting events for a long period of time, these suggestions alone
will not be sufficient to keep the number of segments at an optimal level. It is recommended that scheduled re-indexing
tasks be run to merge segments together into new segments of an ideal size (in the range of ~500-700 MB per segment).
Currently, the recommended way of doing this is by running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks. Details on how to do this can be found under
['Updating Existing Data'](../../ingestion/update-existing-data.html). Note that the Merge Task and Append Task described
[here](../../ingestion/tasks.html) will not work, as they require unsharded segments while Kafka indexing tasks always
generate sharded segments.
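To give a rough idea of what such a re-indexing job references, below is a partial, illustrative sketch of a Hadoop batch job's `ioConfig` using the `dataSource` inputSpec; the datasource name and interval are placeholders, and the full task format is covered in the linked 'Updating Existing Data' page.

```python
# Partial Hadoop index task ioConfig (placeholder values) using the
# `dataSource` inputSpec to re-read segments written by the Kafka tasks.
reindex_io_config_fragment = {
    "type": "hadoop",
    "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
            "dataSource": "my-datasource",                  # placeholder
            "intervals": ["2016-01-01/2016-01-02"],         # interval to re-index
        },
    },
}
```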
There is ongoing work to support automatic segment compaction of sharded segments as well as compaction not requiring
Hadoop (see [here](https://github.com/druid-io/druid/pull/1998) and [here](https://github.com/druid-io/druid/pull/3611)
for related PRs).