mirror of https://github.com/apache/druid.git

Improve doc for auto compaction (#6782)

* Improve doc for auto compaction
* address comments
* address comments
* address comments

This commit is contained in:
parent ffded61f5e
commit 3b020fd81b
@@ -828,14 +828,14 @@ A description of the compaction config is:

 |--------|-----------|--------|
 |`dataSource`|dataSource name to be compacted.|yes|
 |`keepSegmentGranularity`|Set [keepSegmentGranularity](../ingestion/compaction.html) to true for compactionTask.|no (default = true)|
-|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compact task.|no (default = 25)|
+|`taskPriority`|[Priority](../ingestion/tasks.html#task-priorities) of compaction task.|no (default = 25)|
-|`inputSegmentSizeBytes`|Total input segment size of a compactionTask.|no (default = 419430400)|
+|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2GB will result in compaction tasks taking an excessive amount of time.|no (default = 419430400)|
-|`targetCompactionSizeBytes`|The target segment size of compaction. The actual size of a compact segment might be slightly larger or smaller than this value. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400 if `maxRowsPerSegment` is not specified)|
+|`targetCompactionSizeBytes`|The target segment size, for each segment, after compaction. The actual sizes of compacted segments might be slightly larger or smaller than this value. Each compaction task may generate more than one output segment, and it will try to keep each output segment close to this configured size. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400)|
 |`maxRowsPerSegment`|Max number of rows per segment after compaction. This configuration cannot be used together with `targetCompactionSizeBytes`.|no|
-|`maxNumSegmentsToCompact`|Max number of segments to compact together.|no (default = 150)|
+|`maxNumSegmentsToCompact`|Maximum number of segments to compact together per compaction task. Since a time chunk must be processed in its entirety, if a time chunk has a total number of segments greater than this parameter, compaction will not run for that time chunk.|no (default = 150)|
 |`skipOffsetFromLatest`|The offset for searching segments to be compacted. Strongly recommended to set for realtime dataSources.|no (default = "P1D")|
-|`tuningConfig`|Tuning config for compact tasks. See [Compaction TuningConfig](#compaction-tuningconfig).|no|
+|`tuningConfig`|Tuning config for compaction tasks. See below [Compaction Task TuningConfig](#compact-task-tuningconfig).|no|
-|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compact tasks.|no|
+|`taskContext`|[Task context](../ingestion/tasks.html#task-context) for compaction tasks.|no|

 An example of compaction config is:
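The JSON example mentioned above is truncated by the hunk boundary, so here is an illustrative sketch only, not the doc's own example: the helper function and all values are hypothetical, while the field names and the `targetCompactionSizeBytes`/`maxRowsPerSegment` exclusion come from the table above.

```python
# Hypothetical sketch: assemble an auto compaction config from the fields
# documented in the table above. The helper and the chosen values are
# illustrative; only the field names and defaults come from the table.
import json

def make_compaction_config(data_source, **options):
    # The table says targetCompactionSizeBytes and maxRowsPerSegment are
    # mutually exclusive, so reject configs that set both.
    if "targetCompactionSizeBytes" in options and "maxRowsPerSegment" in options:
        raise ValueError(
            "targetCompactionSizeBytes cannot be used together with maxRowsPerSegment")
    config = {"dataSource": data_source}
    config.update(options)
    return config

config = make_compaction_config(
    "wikipedia",                          # hypothetical dataSource name
    taskPriority=25,                      # default compaction task priority
    inputSegmentSizeBytes=419430400,      # default: ~400 MB of input per task
    targetCompactionSizeBytes=419430400,  # target output segment size
    skipOffsetFromLatest="P1D",           # skip the most recent day
)
print(json.dumps(config, indent=2))
```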
@@ -845,7 +845,12 @@ An example of compaction config is:

 }
 ```

-For realtime dataSources, it's recommended to set `skipOffsetFromLatest` to some sufficiently large value to avoid frequent compact task failures.
+Note that compaction tasks can fail if their locks are revoked by other tasks of higher priorities.
+Since realtime tasks have a higher priority than compaction tasks by default,
+it can be problematic if there are frequent conflicts between compaction tasks and realtime tasks.
+If this is the case, the coordinator's automatic compaction might get stuck because of frequent compaction task failures.
+This kind of problem may happen especially in Kafka/Kinesis indexing systems which allow late data arrival.
+If you see this problem, it's recommended to set `skipOffsetFromLatest` to some large enough value to avoid such conflicts between compaction tasks and realtime tasks.

 ##### Compaction TuningConfig

@@ -69,34 +69,46 @@ Each run, the Druid coordinator compacts small segments abutting each other. Thi

 segments which may degrade the query performance as well as increasing the disk space usage.

 The coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy).
-Once some segments are found, it launches a [compact task](../ingestion/tasks.html#compaction-task) to compact those segments.
+Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments.
-The maximum number of running compact tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
+The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
-Note that even though `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compact task is always submitted
+Note that even if `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compaction task is always submitted
 if the compaction is enabled for a dataSource.
 See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) to enable the compaction.

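The task-slot formula above, including the at-least-one-task floor, can be sketched as follows; the function name and argument shapes are mine, only the formula itself comes from the text.

```python
def max_compaction_task_slots(worker_capacities, slot_ratio, max_slots,
                              compaction_enabled):
    """Sketch of the formula described above:
    min(sum of worker capacity * slotRatio, maxSlots),
    with at least one compaction task submitted whenever compaction is
    enabled for some dataSource."""
    slots = min(int(sum(worker_capacities) * slot_ratio), max_slots)
    if compaction_enabled:
        slots = max(slots, 1)  # at least one compaction task is always submitted
    return slots

# Three workers with capacity 10 each, 10% of slots for compaction, capped at 5:
print(max_compaction_task_slots([10, 10, 10], 0.1, 5, True))  # -> 3
# Even with slotRatio = 0, one task is still submitted when compaction is enabled:
print(max_compaction_task_slots([10, 10, 10], 0.0, 5, True))  # -> 1
```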
-Compact tasks might fail due to some reasons.
+Compaction tasks might fail due to the following reasons.

-- If the input segments of a compact task are removed or overshadowed before it starts, that compact task fails immediately.
+- If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately.
-- If a task of a higher priority acquires a lock for an interval overlapping with the interval of a compact task, the compact task fails.
+- If a task of a higher priority acquires a lock for an interval overlapping with the interval of a compaction task, the compaction task fails.

-Once a compact task fails, the coordinator simply finds the segments for the interval of the failed task again, and launches a new compact task in the next run.
+Once a compaction task fails, the coordinator simply finds the segments for the interval of the failed task again, and launches a new compaction task in the next run.

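The second failure mode, a higher-priority task taking a lock on an overlapping interval, can be illustrated with a simplified sketch. The class and the lock bookkeeping are mine; the default priorities used in the example (25 for compaction tasks, 75 for realtime index tasks) follow the task-priorities page linked above.

```python
# Illustrative sketch of priority-based lock revocation: a higher-priority
# task acquiring a lock on an overlapping interval fails the lower-priority
# compaction task holding it. The bookkeeping is deliberately simplified.
def intervals_overlap(a, b):
    # ISO-8601 date strings compare correctly as plain strings.
    return a[0] < b[1] and b[0] < a[1]

class LockTable:
    def __init__(self):
        self.locks = []  # (task_name, priority, interval) triples

    def acquire(self, task, priority, interval):
        # Revoke overlapping locks held by lower-priority tasks; the tasks
        # that held them fail, as described above.
        revoked = [t for t, p, i in self.locks
                   if intervals_overlap(i, interval) and p < priority]
        self.locks = [(t, p, i) for t, p, i in self.locks
                      if not (intervals_overlap(i, interval) and p < priority)]
        self.locks.append((task, priority, interval))
        return revoked

table = LockTable()
table.acquire("compact_foo", 25, ("2017-10-01", "2017-11-01"))
failed = table.acquire("index_realtime_foo", 75, ("2017-10-15", "2017-11-15"))
print(failed)  # prints ['compact_foo']: the compaction task loses its lock
```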
 ### Segment Search Policy

 #### Newest Segment First Policy

-This policy searches the segments of _all dataSources_ in inverse order of their intervals.
-For example, let me assume there are 3 dataSources (`ds1`, `ds2`, `ds3`) and 5 segments (`seg_ds1_2017-10-01_2017-10-02`, `seg_ds1_2017-11-01_2017-11-02`, `seg_ds2_2017-08-01_2017-08-02`, `seg_ds3_2017-07-01_2017-07-02`, `seg_ds3_2017-12-01_2017-12-02`) for those dataSources.
-The segment name indicates its dataSource and interval. The search result of newestSegmentFirstPolicy is [`seg_ds3_2017-12-01_2017-12-02`, `seg_ds1_2017-11-01_2017-11-02`, `seg_ds1_2017-10-01_2017-10-02`, `seg_ds2_2017-08-01_2017-08-02`, `seg_ds3_2017-07-01_2017-07-02`].
+At every coordinator run, this policy searches for segments to compact by iterating segments from the latest to the oldest.
+Once it finds the latest segment among all dataSources, it checks if the segment is _compactible_ with other segments of the same dataSource which have the same or abutting intervals.
+Note that segments are compactible if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.

-Every run, this policy starts searching from the (very latest interval - [skipOffsetFromLatest](../configuration/index.html#compaction-dynamic-configuration)).
-This is to handle the late segments ingested to realtime dataSources.
+Here are some details with an example. Let us assume we have two dataSources (`foo`, `bar`)
+and 5 segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`, `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`, `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION`, `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`, `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`).
+When each segment has the same size of 10 MB and `inputSegmentSizeBytes` is 20 MB, this policy first returns two segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`) to compact together because
+`foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION` is the latest segment and `foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` abuts it.
+
+If the coordinator has enough task slots for compaction, this policy would continue searching for the next segments and return
+`bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`.
+Note that `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION` is not compacted together even though it abuts `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`.
+This is because the total segment size to compact would be greater than `inputSegmentSizeBytes` if it's included.

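The foo/bar walkthrough above can be reproduced with a small simulation. The segment tuples and the function are illustrative, not Druid's API; only the greedy newest-first grouping and the `inputSegmentSizeBytes` cutoff come from the text.

```python
# Sketch of the newest-segment-first selection described above: iterate
# segments from the latest to the oldest, and greedily group abutting
# segments of the same dataSource while the total stays within
# inputSegmentSizeBytes. Illustrative only; not Druid's implementation.
MB = 1024 * 1024

def newest_segment_first(segments, input_segment_size_bytes):
    """segments: list of (dataSource, start, end, size_bytes) tuples with
    ISO-8601 date strings, which compare correctly as plain strings."""
    ordered = sorted(segments, key=lambda s: s[2], reverse=True)  # latest end first
    used, groups = set(), []
    for ds, start, end, size in ordered:
        if (ds, start, end) in used:
            continue
        group, total, cursor = [(ds, start, end)], size, start
        # Extend backwards through unused abutting segments of the same dataSource.
        for ds2, s2, e2, sz2 in ordered:
            if (ds2 == ds and e2 == cursor and (ds2, s2, e2) not in used
                    and total + sz2 <= input_segment_size_bytes):
                group.append((ds2, s2, e2))
                total += sz2
                cursor = s2
        if len(group) > 1:  # a lone segment is not worth compacting
            used.update(group)
            groups.append(group)
    return groups

segments = [
    ("foo", "2017-10-01", "2017-11-01", 10 * MB),
    ("foo", "2017-11-01", "2017-12-01", 10 * MB),
    ("bar", "2017-08-01", "2017-09-01", 10 * MB),
    ("bar", "2017-09-01", "2017-10-01", 10 * MB),
    ("bar", "2017-10-01", "2017-11-01", 10 * MB),
]
groups = newest_segment_first(segments, 20 * MB)
```

With 10 MB segments and a 20 MB `inputSegmentSizeBytes`, the first group is the two latest `foo` segments, the second is the two latest `bar` segments, and the oldest `bar` segment is left out, matching the walkthrough.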
+The search start point can be changed by setting [skipOffsetFromLatest](../configuration/index.html#compaction-dynamic-configuration).
+If this is set, this policy will ignore the segments falling into the interval of (the end time of the very latest segment - `skipOffsetFromLatest`).
+This is to avoid conflicts between compaction tasks and realtime tasks.
+Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.

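The skip interval described above, (end time of the very latest segment - `skipOffsetFromLatest`), can be computed as a small sketch; the hand-rolled parsing below handles only day periods like `P1D`, while Druid itself accepts full ISO 8601 periods.

```python
from datetime import datetime, timedelta

def skip_interval(latest_segment_end, skip_offset_from_latest):
    """Sketch: segments falling into [latest end - offset, latest end) are
    ignored by the policy. Only simple day periods such as "P1D" are parsed
    here; Druid accepts arbitrary ISO 8601 periods."""
    assert skip_offset_from_latest.startswith("P") and skip_offset_from_latest.endswith("D")
    days = int(skip_offset_from_latest[1:-1])
    return latest_segment_end - timedelta(days=days), latest_segment_end

# With the default "P1D" and a latest segment ending at 2017-12-01, the most
# recent day of data is skipped:
start, end = skip_interval(datetime(2017, 12, 1), "P1D")
print(start.isoformat())  # -> 2017-11-30T00:00:00
```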
 <div class="note caution">
 This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
-and their total size exceeds <a href="../configuration/index.html#compaction-dynamic-configuration">targetCompactionSizebytes</a>.
+and their total size exceeds <a href="../configuration/index.html#compaction-dynamic-configuration">inputSegmentSizeBytes</a>.
-If it finds such segments, it simply skips compacting them.
+If it finds such segments, it simply skips them.
 </div>

 ### The Coordinator Console