diff --git a/docs/assets/compaction-config.png b/docs/assets/compaction-config.png new file mode 100644 index 00000000000..89be4adb567 Binary files /dev/null and b/docs/assets/compaction-config.png differ diff --git a/docs/configuration/index.md b/docs/configuration/index.md index ddf9bb29418..085bdc1ab5c 100644 --- a/docs/configuration/index.md +++ b/docs/configuration/index.md @@ -954,7 +954,7 @@ These configuration options control the behavior of the Lookup dynamic configura ##### Automatic compaction dynamic configuration -You can set or update automatic compaction properties dynamically using the +You can set or update [automatic compaction](../ingestion/automatic-compaction.md) properties dynamically using the [Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) without restarting Coordinators. For details about segment compaction, see [Segment size optimization](../operations/segment-optimization.md). @@ -987,8 +987,12 @@ Automatic compaction config example: } ``` -Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Therefore frequent conflicts between compaction tasks and realtime tasks can cause the coordinator's automatic compaction to get stuck. -You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data. To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks. +Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Frequent conflicts between compaction tasks and realtime tasks can cause the Coordinator's automatic compaction to hang. 
+You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data. + +To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks. +For example, if you want to skip over segments from thirty days prior to the end time of the most recent segment, assign `"skipOffsetFromLatest": "P30D"`. +For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion). ###### Automatic compaction tuningConfig diff --git a/docs/design/coordinator.md b/docs/design/coordinator.md index 44f7297bb5a..5ce849df758 100644 --- a/docs/design/coordinator.md +++ b/docs/design/coordinator.md @@ -81,7 +81,7 @@ To ensure an even distribution of segments across Historical processes in the cl ### Automatic compaction -The Druid Coordinator manages the automatic compaction system. +The Druid Coordinator manages the [automatic compaction system](../ingestion/automatic-compaction.md). Each run, the Coordinator compacts segments by merging small segments or splitting a large one. This is useful when the size of your segments is not optimized which may degrade query performance. See [Segment size optimization](../operations/segment-optimization.md) for details. @@ -133,10 +133,11 @@ If the Coordinator has enough task slots for compaction, this policy will contin `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`. Finally, `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` will be picked up even though there is only one segment in the time chunk of `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z`. 
-The search start point can be changed by setting [`skipOffsetFromLatest`](../configuration/index.md#automatic-compaction-dynamic-configuration). +The search start point can be changed by setting `skipOffsetFromLatest`. If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`). This is to avoid conflicts between compaction tasks and realtime tasks. Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task. +For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion). > This policy currently cannot handle the situation when there are a lot of small segments which have the same interval, > and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration). diff --git a/docs/ingestion/automatic-compaction.md b/docs/ingestion/automatic-compaction.md new file mode 100644 index 00000000000..8768a6035e6 --- /dev/null +++ b/docs/ingestion/automatic-compaction.md @@ -0,0 +1,198 @@ +--- +id: automatic-compaction +title: "Automatic compaction" +--- + + + +In Apache Druid, compaction is a special type of ingestion task that reads data from a Druid datasource and writes it back into the same datasource. A common use case for this is to [optimally size segments](../operations/segment-optimization.md) after ingestion to improve query performance. Automatic compaction, or auto-compaction, refers to the system for automatic execution of compaction tasks managed by the [Druid Coordinator](../design/coordinator.md). + +The Coordinator [indexing period](../configuration/index.md#coordinator-operation), `druid.coordinator.period.indexingPeriod`, controls the frequency of compaction tasks. 
+The default indexing period is 30 minutes, meaning that the Coordinator first checks for segments to compact at most 30 minutes from when auto-compaction is enabled.
+This time period affects other Coordinator duties including merge and conversion tasks.
+To configure the auto-compaction time period without interfering with `indexingPeriod`, see [Set frequency of compaction runs](#set-frequency-of-compaction-runs).
+
+At every invocation of auto-compaction, the Coordinator initiates a [segment search](../design/coordinator.md#segment-search-policy-in-automatic-compaction) to determine eligible segments to compact.
+When there are eligible segments to compact, the Coordinator issues compaction tasks based on available worker capacity.
+If a compaction task takes longer than the indexing period, the Coordinator waits for it to finish before resuming the period for segment search.
+
+As a best practice, you should set up auto-compaction for all Druid datasources. You can run compaction tasks manually for cases where you want to allocate more system resources. For example, you may choose to run multiple compaction tasks in parallel to compact an existing datasource for the first time. See [Compaction](compaction.md) for additional details and use cases.
+
+This topic guides you through setting up automatic compaction for your Druid cluster. See the [examples](#examples) for common use cases for automatic compaction.
+
+## Enable automatic compaction
+
+You can enable automatic compaction for a datasource using the Druid console or programmatically via an API.
+This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the Druid console](../operations/druid-console.md) or the [Tasks API](../operations/api-reference.md#post-5).
+
+### Druid console
+
+Use the Druid console to enable automatic compaction for a datasource as follows.
+
+1. Click **Datasources** in the top-level navigation.
+2. In the **Compaction** column, click the edit icon for the datasource to compact.
+3. In the **Compaction config** dialog, configure the auto-compaction settings. The dialog offers a form view as well as a JSON view. Editing the form updates the JSON specification, and editing the JSON updates the form field, if present. Form fields not present in the JSON indicate default values. You may add additional properties to the JSON for auto-compaction settings not displayed in the form. See [Configure automatic compaction](#configure-automatic-compaction) for supported settings for auto-compaction.
+4. Click **Submit**.
+5. Refresh the **Datasources** view. The **Compaction** column for the datasource changes from “Not enabled” to “Awaiting first run.”
+
+The following screenshot shows the compaction config dialog for a datasource with auto-compaction enabled.
+![Compaction config in web console](../assets/compaction-config.png)
+
+To disable auto-compaction for a datasource, click **Delete** from the **Compaction config** dialog. Druid does not retain your auto-compaction configuration.
+
+### Compaction configuration API
+
+Use the [Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) to configure automatic compaction.
+To enable auto-compaction for a datasource, create a JSON object with the desired auto-compaction settings.
+See [Configure automatic compaction](#configure-automatic-compaction) for the syntax of an auto-compaction spec.
+Send the JSON object as a payload in a [`POST` request](../operations/api-reference.md#post-4) to `/druid/coordinator/v1/config/compaction`.
+The following example configures auto-compaction for the `wikipedia` datasource:
+
+```sh
+curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+    "dataSource": "wikipedia",
+    "granularitySpec": {
+        "segmentGranularity": "DAY"
+    }
+}'
+```
+
+To disable auto-compaction for a datasource, send a [`DELETE` request](../operations/api-reference.md#delete-1) to `/druid/coordinator/v1/config/compaction/{dataSource}`. Replace `{dataSource}` with the name of the datasource for which to disable auto-compaction. For example:
+
+```sh
+curl --location --request DELETE 'http://localhost:8081/druid/coordinator/v1/config/compaction/wikipedia'
+```
+
+## Configure automatic compaction
+
+You can configure automatic compaction dynamically without restarting Druid.
+The automatic compaction system uses the following syntax:
+
+```json
+{
+    "dataSource": ,
+    "ioConfig": ,
+    "dimensionsSpec": ,
+    "transformSpec": ,
+    "metricsSpec": ,
+    "tuningConfig": ,
+    "granularitySpec": ,
+    "skipOffsetFromLatest":