Update automatic compaction docs with consistent terminology (#12416)

* specify automatic compaction where applicable

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* update for style and consistency

* implement suggested feedback

* remove duplicate example

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/compaction.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/api-reference.md

* update .spelling

* Adopt review suggestions

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Victoria Lim 2022-05-03 16:22:25 -07:00 committed by GitHub
parent 35a7d863b7
commit 0206a2da5c
7 changed files with 84 additions and 101 deletions

View File

@@ -951,14 +951,14 @@ These configuration options control the behavior of the Lookup dynamic configuration
|`druid.manager.lookups.threadPoolSize`|How many processes can be managed concurrently (concurrent POST and DELETE requests). Requests beyond this limit will wait in a queue until a slot becomes available.|10|
|`druid.manager.lookups.period`|How many milliseconds between checks for configuration changes.|30_000|
##### Automatic compaction dynamic configuration
You can set or update automatic compaction properties dynamically using the
[Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) without restarting Coordinators.
For details about segment compaction, see [Segment size optimization](../operations/segment-optimization.md).
You can configure automatic compaction through the following properties:
|Property|Description|Required|
|--------|-----------|--------|
@@ -966,16 +966,16 @@ A description of the compaction config is:
|`taskPriority`|[Priority](../ingestion/tasks.md#priority) of compaction task.|no (default = 25)|
|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 12GB will result in compaction tasks taking an excessive amount of time.|no (default = Long.MAX_VALUE)|
|`maxRowsPerSegment`|Maximum number of rows per segment after compaction.|no|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Strongly recommended to set for realtime dataSources. See [Data handling with compaction](../ingestion/compaction.md#data-handling-with-compaction).|no (default = "P1D")|
|`tuningConfig`|Tuning config for compaction tasks. See [Automatic compaction tuningConfig](#automatic-compaction-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.md#context) for compaction tasks.|no|
|`granularitySpec`|Custom `granularitySpec`. See [Automatic compaction granularitySpec](#automatic-compaction-granularityspec).|No|
|`dimensionsSpec`|Custom `dimensionsSpec`. See [Automatic compaction dimensionsSpec](#automatic-compaction-dimensionsspec).|No|
|`transformSpec`|Custom `transformSpec`. See [Automatic compaction transformSpec](#automatic-compaction-transformspec).|No|
|`metricsSpec`|Custom [`metricsSpec`](../ingestion/ingestion-spec.md#metricsspec). The compaction task preserves any existing metrics regardless of whether `metricsSpec` is specified. If `metricsSpec` is specified, Druid does not reapply any aggregators matching the metric names specified in `metricsSpec` to rows that already have the associated metrics. For rows that do not already have the metric specified in `metricsSpec`, Druid applies the metric aggregator on the source column, then proceeds to combine the metrics across segments as usual. If `metricsSpec` is not specified, Druid automatically discovers the metrics in the existing segments and combines existing metrics with the same metric name across segments. Aggregators for metrics with the same name are assumed to be compatible for combining across segments; otherwise, the compaction task may fail.|No|
|`ioConfig`|IO config for compaction tasks. See [Automatic compaction ioConfig](#automatic-compaction-ioconfig).|no|
Automatic compaction config example (a minimal sketch; the dataSource and values shown are illustrative):
```json
{
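  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P1D",
  "granularitySpec": {
    "segmentGranularity": "day"
  }
}
```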
@@ -989,10 +989,10 @@ An example of compaction config is:
Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Therefore, frequent conflicts between compaction tasks and realtime tasks can cause the Coordinator's automatic compaction to get stuck.
You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data. To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks.
###### Automatic compaction tuningConfig
Auto-compaction supports a subset of the [tuningConfig for Parallel task](../ingestion/native-batch.md#tuningconfig).
The following properties are supported for auto-compaction:
|Property|Description|Required|
|--------|-----------|--------|
@@ -1022,22 +1022,22 @@ The below is a list of the supported configurations for auto compaction.
|`queryGranularity`|The resolution of timestamp storage within each segment. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
|`rollup`|Whether to enable ingestion-time rollup or not. Defaults to 'null', which preserves the original setting. Note that once data is rolled up, individual records can no longer be recovered.|No|
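For instance, a `granularitySpec` that coarsens query granularity and enables rollup might look like the following sketch; the values are illustrative:

```json
{
  "granularitySpec": {
    "queryGranularity": "hour",
    "rollup": true
  }
}
```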
###### Automatic compaction dimensionsSpec
|Field|Description|Required|
|-----|-----------|--------|
|`dimensions`|A list of dimension names or objects. Defaults to 'null', which preserves the original dimensions. Note that setting this will cause segments manually compacted with `dimensionExclusions` to be compacted again.|No|
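As a sketch, a custom `dimensionsSpec` within an automatic compaction config; the dimension names are illustrative:

```json
{
  "dimensionsSpec": {
    "dimensions": ["channel", "countryName", "page"]
  }
}
```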
###### Automatic compaction transformSpec
|Field|Description|Required|
|-----|-----------|--------|
|`filter`|The `filter` conditionally filters input rows during compaction. Only rows that pass the filter will be included in the compacted segments. Any of Druid's standard [query filters](../querying/filters.md) can be used. Defaults to 'null', which will not filter any row.|No|
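For example, a `transformSpec` that keeps only rows matching a selector filter might look like this sketch; the dimension and value are illustrative:

```json
{
  "transformSpec": {
    "filter": {
      "type": "selector",
      "dimension": "channel",
      "value": "#en.wikipedia"
    }
  }
}
```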
###### Automatic compaction ioConfig
Auto-compaction supports a subset of the [ioConfig for Parallel task](../ingestion/native-batch.md).
The following properties are supported for auto-compaction:
|Property|Description|Default|Required|
|--------|-----------|-------|--------|

View File

@@ -79,39 +79,38 @@ If a Historical process restarts or becomes unavailable for any reason, the Druid
To ensure an even distribution of segments across Historical processes in the cluster, the Coordinator process will find the total size of all segments being served by every Historical process each time the Coordinator runs. For every Historical process tier in the cluster, the Coordinator process will determine the Historical process with the highest utilization and the Historical process with the lowest utilization. The percent difference in utilization between the two processes is computed, and if the result exceeds a certain threshold, a number of segments will be moved from the highest utilized process to the lowest utilized process. There is a configurable limit on the number of segments that can be moved from one process to another each time the Coordinator runs. Segments to be moved are selected at random and only moved if the resulting utilization calculation indicates the percentage difference between the highest and lowest servers has decreased.
### Automatic compaction
The Druid Coordinator manages the automatic compaction system.
Each run, the Coordinator compacts segments by merging small segments or splitting a large one. This is useful when the size of your segments is not optimized, which may degrade query performance.
See [Segment size optimization](../operations/segment-optimization.md) for details.
The Coordinator first finds the segments to compact based on the [segment search policy](#segment-search-policy-in-automatic-compaction).
Once some segments are found, it issues a [compaction task](../ingestion/tasks.md#compact) to compact those segments.
The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
Note that even if `min(sum of worker capacity * slotRatio, maxSlots) = 0`, at least one compaction task is always submitted
if compaction is enabled for a dataSource.
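For example, with a total worker capacity of 10, `slotRatio = 0.1`, and `maxSlots = 3`, at most `min(10 * 0.1, 3) = 1` compaction task runs at a time.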
See [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration) and [Automatic compaction configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) to enable and configure automatic compaction.
Compaction tasks might fail for the following reasons:
- If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately.
- If a task of a higher priority acquires a [time chunk lock](../ingestion/tasks.md#locking) for an interval overlapping with the interval of a compaction task, the compaction task fails.
Once a compaction task fails, the Coordinator simply checks the segments in the interval of the failed task again, and issues another compaction task in the next run.
Note that the Compacting Segments Coordinator duty is automatically enabled and run as part of the Indexing Service duties group. However, you can configure the Compacting Segments Coordinator duty to run in isolation as a separate Coordinator duty group. This allows changing the period of the Compacting Segments Coordinator duty without impacting the period of other Indexing Service duties. To do so, set the following properties. For more details, see [custom pluggable Coordinator Duty](../development/modules.md#adding-your-own-custom-pluggable-coordinator-duty).
```
druid.coordinator.dutyGroups=[<SOME_GROUP_NAME>]
druid.coordinator.<SOME_GROUP_NAME>.duties=["compactSegments"]
druid.coordinator.<SOME_GROUP_NAME>.period=<PERIOD_TO_RUN_COMPACTING_SEGMENTS_DUTY>
```
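For example, a duty group that runs the compaction duty every 30 minutes might look like this sketch; the group name `compaction` and the period value are illustrative:

```
druid.coordinator.dutyGroups=["compaction"]
druid.coordinator.compaction.duties=["compactSegments"]
druid.coordinator.compaction.period=PT1800S
```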
### Segment search policy in automatic compaction
At every Coordinator run, this policy looks up time chunks from newest to oldest and checks whether the segments in those time chunks need compaction.
A set of segments needs compaction if all conditions below are satisfied:
1) Total size of segments in the time chunk is smaller than or equal to the configured `inputSegmentSizeBytes`.
2) Segments have never been compacted yet, or the compaction spec has been updated since the last compaction, especially `maxRowsPerSegment`, `maxTotalRows`, and `indexSpec`.
@@ -130,22 +129,22 @@ Assuming that each segment is 10 MB and hasn't been compacted yet, this policy
`foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION` and `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION_1` to compact together because
`2017-11-01T00:00:00.000Z/2017-12-01T00:00:00.000Z` is the most recent time chunk.
If the Coordinator has enough task slots for compaction, this policy will continue searching for the next segments and return
`bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`.
Finally, `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` will be picked up even though there is only one segment in the time chunk of `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z`.
The search start point can be changed by setting [`skipOffsetFromLatest`](../configuration/index.md#automatic-compaction-dynamic-configuration).
If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`).
This is to avoid conflicts between compaction tasks and realtime tasks.
Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.
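For example, if the end time of the most recent segment is `2017-12-01T00:00:00.000Z` and `skipOffsetFromLatest` is set to `P1D`, the policy ignores segments falling in `2017-11-30T00:00:00.000Z/2017-12-01T00:00:00.000Z`.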
> This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
> and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration).
> If it finds such segments, it simply skips them.
### The Coordinator console
The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, see [Coordinator console](../operations/management-uis.md#coordinator-consoles).
### FAQ

View File

@@ -28,7 +28,7 @@ Query performance in Apache Druid depends on optimally sized segments. Compaction
There are several cases in which to consider compaction for segment optimization:
- With streaming ingestion, data can arrive out of chronological order, creating many small segments.
- When you append data using `appendToExisting` for [native batch](native-batch.md) ingestion, creating suboptimal segments.
- When you use `index_parallel` for parallel batch indexing and the parallel ingestion tasks create many small segments.
- When a misconfigured ingestion task creates oversized segments.
@@ -36,7 +36,7 @@ There are several cases to consider compaction for segment optimization:
By default, compaction does not modify the underlying data of the segments. However, there are cases when you may want to modify data during compaction to improve query performance:
- If, after ingestion, you realize that data for the time interval is sparse, you can use compaction to increase the segment granularity.
- If you don't need fine-grained granularity for older data, you can use compaction to change older segments to a coarser query granularity. For example, from `minute` to `hour` or `hour` to `day`. This reduces the storage space required for older data.
- You can change the dimension order to improve sorting and reduce segment size.
- You can remove unused columns in compaction or implement an aggregation metric for older data.
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](./rollup.md#perfect-rollup-vs-best-effort-rollup).
@@ -44,9 +44,10 @@ By default, compaction does not modify the underlying data of the segments. However
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
## Types of compaction
You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using its [segment search policy](../design/coordinator.md#segment-search-policy-in-automatic-compaction), the Coordinator periodically identifies segments for compaction starting from newest to oldest. When the Coordinator discovers segments that have not been compacted or segments that were compacted with a different or changed spec, it submits compaction tasks for the time interval covering those segments.
Automatic compaction works in most use cases and should be your first option. To learn more about automatic compaction, see [Compacting Segments](../design/coordinator.md#automatic-compaction).
In cases where you require more control over compaction, you can manually submit compaction tasks. For example:
@@ -62,7 +63,7 @@ During compaction, Druid overwrites the original set of segments with the compacted
You can set `dropExisting` in `ioConfig` to "true" in the compaction task to configure Druid to replace all existing segments fully contained by the interval. See the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations) for an example.
> WARNING: `dropExisting` in `ioConfig` is a beta feature.
If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time to reduce the chance of conflicts between ingestion and compaction. See [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for more information. Another option is to set the compaction task to higher priority than the ingestion task.
### Segment granularity handling
@@ -82,13 +83,14 @@ If you configure query granularity in compaction to go from a finer granularity
### Dimension handling
Apache Druid supports schema changes. Therefore, dimensions can be different across segments even if they are a part of the same data source. See [Different schemas among segments](../design/segments.md#different-schemas-among-segments). If the input segments have different dimensions, the resulting compacted segment includes all dimensions of the input segments.
Even when the input segments have the same set of dimensions, the dimension order or the data type of dimensions can be different. The dimensions of recent segments precede those of old segments in terms of data types and the ordering because more recent segments are more likely to have the preferred order and data types.
If you want to control dimension ordering or ensure specific values for dimension types, you can configure a custom `dimensionsSpec` in the compaction task spec.
### Rollup
Druid only rolls up the output segment when `rollup` is set for all input segments.
See [Roll-up](../ingestion/rollup.md) for more details.
You can check whether your segments are rolled up by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
@@ -104,6 +106,7 @@ To perform a manual compaction, you submit a compaction task.
"dataSource": <task_datasource>,
"ioConfig": <IO config>,
"dimensionsSpec": <custom dimensionsSpec>,
"transformSpec": <custom transformSpec>,
"metricsSpec": <custom metricsSpec>,
"tuningConfig": <parallel indexing task tuningConfig>,
"granularitySpec": <compaction task granularitySpec>,
@@ -120,14 +123,14 @@ To perform a manual compaction, you submit a compaction task.
|`dimensionsSpec`|Custom `dimensionsSpec`. The compaction task uses the specified `dimensionsSpec` if it exists instead of generating one. See [Compaction dimensionsSpec](#compaction-dimensions-spec) for details.|No|
|`transformSpec`|Custom `transformSpec`. The compaction task uses the specified `transformSpec` rather than using `null`. See [Compaction transformSpec](#compaction-transform-spec) for details.|No|
|`metricsSpec`|Custom `metricsSpec`. The compaction task uses the specified `metricsSpec` rather than generating one.|No|
|`segmentGranularity`|When set, the compaction task changes the segment granularity for the given interval. Deprecated. Use `granularitySpec`.|No|
|`tuningConfig`|[Parallel indexing task tuningConfig](native-batch.md#tuningconfig). `awaitSegmentAvailabilityTimeoutMillis` in the tuning config is not supported for compaction tasks. Leave this parameter at the default value, 0.|No|
|`granularitySpec`|Custom `granularitySpec`. The compaction task uses the specified `granularitySpec` rather than generating one. See [Compaction `granularitySpec`](#compaction-granularity-spec) for details.|No|
|`context`|[Task context](./tasks.md#context).|No|
> Note: Use `granularitySpec` over `segmentGranularity` and only set one of these values. If you specify different values for these in the same compaction spec, the task fails.
To control the number of result segments per time chunk, you can set [`maxRowsPerSegment`](../configuration/index.md#automatic-compaction-dynamic-configuration) or [`numShards`](../ingestion/native-batch.md#tuningconfig).
> You can run multiple compaction tasks in parallel. For example, if you want to compact the data for a year, you are not limited to running a single task for the entire year. You can run 12 compaction tasks with month-long intervals.
@@ -174,7 +177,7 @@ The compaction `ioConfig` requires specifying `inputSpec` as follows:
|-----|-----------|-------|--------|
|`type`|Task type: `compact`|none|Yes|
|`inputSpec`|Specification of the target [intervals](#interval-inputspec) or [segments](#segments-inputspec).|none|Yes|
|`dropExisting`|If `true`, the task replaces all existing segments fully contained by either of the following:<br>- the `interval` in the `interval` type `inputSpec`.<br>- the umbrella interval of the `segments` in the `segment` type `inputSpec`.<br>If compaction fails, Druid does not change any of the existing segments.<br>**WARNING**: `dropExisting` in `ioConfig` is a beta feature.|false|No|
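For example, a compaction `ioConfig` targeting a one-year interval might look like the following sketch; the interval value is illustrative:

```json
{
  "type": "compact",
  "inputSpec": {
    "type": "interval",
    "interval": "2020-01-01/2021-01-01"
  },
  "dropExisting": false
}
```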
Druid supports two `inputSpec` formats:
@@ -214,31 +217,10 @@ Druid supports two supported `inputSpec` formats:
|`queryGranularity`|The resolution of timestamp storage within each segment. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
|`rollup`|Whether to enable ingestion-time rollup or not. Defaults to 'null', which preserves the original setting. Note that once data is rolled up, individual records can no longer be recovered.|No|
## Learn more
See the following topics for more information:
- [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case.
- [Compacting Segments](../design/coordinator.md#automatic-compaction) for details on how the Coordinator manages automatic compaction.
- [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration)
and [Automatic compaction configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for automatic compaction configuration information.

View File

@@ -356,6 +356,9 @@ You can override the task priority by setting your priority in the task context.
The task context is used for various individual task configurations.
Specify task context configurations in the `context` field of the ingestion spec.
When configuring [automatic compaction](../configuration/index.md#automatic-compaction-dynamic-configuration), set the task context configurations in `taskContext` rather than in `context`.
The settings get passed into the `context` field of the compaction tasks issued to MiddleManagers.
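For instance, a fragment of an automatic compaction config that raises the priority of the issued compaction tasks might look like this sketch; the dataSource and priority value are illustrative:

```json
{
  "dataSource": "wikipedia",
  "taskContext": {
    "priority": 50
  }
}
```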
The following parameters apply to all task types.
|property|default|description|

View File

@@ -458,52 +458,52 @@ to filter by interval and limit the number of results respectively.
Updates the Overlord dynamic worker configuration.
#### Automatic compaction status
##### GET
* `/druid/coordinator/v1/compaction/progress?dataSource={dataSource}`
Returns the total size of segments awaiting compaction for the given dataSource.
The specified dataSource must have automatic compaction enabled.
##### GET
* `/druid/coordinator/v1/compaction/status`
Returns the status and statistics from the auto-compaction run of all dataSources which have auto-compaction enabled in the latest run.
The response payload includes a list of `latestStatus` objects. Each `latestStatus` represents the status for a dataSource that has or previously had auto-compaction enabled.
The `latestStatus` object has the following keys:
* `dataSource`: name of the datasource for this status information
* `scheduleStatus`: auto-compaction scheduling status. Possible values are `NOT_ENABLED` and `RUNNING`. Returns `RUNNING` if the dataSource has an active auto-compaction config submitted. Otherwise, returns `NOT_ENABLED`.
* `bytesAwaitingCompaction`: total bytes of this datasource waiting to be compacted by auto-compaction (only considers intervals/segments that are eligible for auto-compaction)
* `bytesCompacted`: total bytes of this datasource that are already compacted with the spec set in the auto-compaction config
* `bytesSkipped`: total bytes of this datasource that are skipped (not eligible for auto-compaction) by auto-compaction
* `segmentCountAwaitingCompaction`: total number of segments of this datasource waiting to be compacted by auto-compaction (only considers intervals/segments that are eligible for auto-compaction)
* `segmentCountCompacted`: total number of segments of this datasource that are already compacted with the spec set in the auto-compaction config
* `segmentCountSkipped`: total number of segments of this datasource that are skipped (not eligible for auto-compaction) by auto-compaction
* `intervalCountAwaitingCompaction`: total number of intervals of this datasource waiting to be compacted by auto-compaction (only considers intervals/segments that are eligible for auto-compaction)
* `intervalCountCompacted`: total number of intervals of this datasource that are already compacted with the spec set in the auto-compaction config
* `intervalCountSkipped`: total number of intervals of this datasource that are skipped (not eligible for auto-compaction) by auto-compaction
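Putting these keys together, a single `latestStatus` entry might look like the following sketch; all numbers are illustrative:

```json
{
  "dataSource": "wikipedia",
  "scheduleStatus": "RUNNING",
  "bytesAwaitingCompaction": 100000,
  "bytesCompacted": 500000,
  "bytesSkipped": 0,
  "segmentCountAwaitingCompaction": 2,
  "segmentCountCompacted": 10,
  "segmentCountSkipped": 0,
  "intervalCountAwaitingCompaction": 1,
  "intervalCountCompacted": 5,
  "intervalCountSkipped": 0
}
```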
##### GET
* `/druid/coordinator/v1/compaction/status?dataSource={dataSource}`
Similar to the API `/druid/coordinator/v1/compaction/status` above, but filters the response to only return information for the specified {dataSource}.
The specified {dataSource} must have auto-compaction enabled currently or previously.
#### Automatic compaction configuration
##### GET
* `/druid/coordinator/v1/config/compaction`
Returns all automatic compaction configs.
* `/druid/coordinator/v1/config/compaction/{dataSource}`
Returns the automatic compaction config for a dataSource.
##### POST
@@ -517,15 +517,15 @@ will be set for them.
* `/druid/coordinator/v1/config/compaction`
Creates or updates the automatic compaction config for a dataSource.
See [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for configuration details.
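A sketch of a request body for this endpoint, built from the automatic compaction properties; the values are illustrative:

```json
{
  "dataSource": "wikipedia",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1D"
}
```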
##### DELETE
* `/druid/coordinator/v1/config/compaction/{dataSource}`
Removes the automatic compaction config for a dataSource.
#### Server information

View File

@@ -1,6 +1,6 @@
---
id: segment-optimization
title: "Segment size optimization"
---
<!--
@@ -87,11 +87,11 @@ In this case, you may want to see only rows of the max version per interval
Once you find that your segments need compaction, consider the following two options:
- Turning on the [automatic compaction of Coordinators](../design/coordinator.md#automatic-compaction).
The Coordinator periodically submits [compaction tasks](../ingestion/tasks.md#compact) to re-index small segments.
To enable automatic compaction, you need to configure it for each dataSource via the Coordinator's dynamic configuration.
See [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration)
and [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for details.
- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found in the [Updating existing data](../ingestion/data-management.md#update) section.

View File

@@ -55,6 +55,7 @@ DRUIDVERSION
DataSketches
DateTime
DateType
dimensionsSpec
DimensionSpec
DimensionSpecs
Dockerfile
@@ -112,6 +113,7 @@ InputFormat
InputSource
InputSources
Integer.MAX_VALUE
ioConfig
JBOD
JDBC
JDK
@@ -671,7 +673,6 @@ baseDataSource
baseDataSource-hashCode
classpathPrefix
derivativeDataSource
druid.extensions.hadoopDependenciesDir
hadoopDependencyCoordinates
maxTaskCount
@@ -1132,7 +1133,6 @@ datetime
f.example.com
filePattern
forceExtendableShardSpecs
ignoreInvalidRows
ignoreWhenNoSegments
indexSpecForIntermediatePersists
@@ -1842,7 +1842,6 @@ cpuacct
dataSourceName
datetime
defaultHistory
doubleMax
doubleMin
doubleSum