Docs for automatic compaction (#12569)

* docs for auto-compaction
* fix broken links
* another link
* Apply suggestions from code review
  Co-authored-by: Suneet Saldanha <suneet@apache.org>
* Apply suggestions from code review
  Co-authored-by: Suneet Saldanha <suneet@apache.org>
* Apply suggestions from code review
  Co-authored-by: Charles Smith <techdocsmith@gmail.com>
  Co-authored-by: Suneet Saldanha <suneet@apache.org>
* reorg content for skipOffset
* Update docs/ingestion/automatic-compaction.md
  Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Apply suggestions from code review
  Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
2022-06-09 14:55:12 -07:00 · 2022-06-09 14:55:12 -07:00 · 353475bd36
parent a3603ad6b0
commit 353475bd36
9 changed files with 222 additions and 16 deletions
--- a/docs/assets/compaction-config.png
+++ b/docs/assets/compaction-config.png
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@ -954,7 +954,7 @@ These configuration options control the behavior of the Lookup dynamic configura

 ##### Automatic compaction dynamic configuration

-You can set or update automatic compaction properties dynamically using the
+You can set or update [automatic compaction](../ingestion/automatic-compaction.md) properties dynamically using the
 [Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) without restarting Coordinators.

 For details about segment compaction, see [Segment size optimization](../operations/segment-optimization.md).
@ -987,8 +987,12 @@ Automatic compaction config example:
 }
 ```

-Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Therefore frequent conflicts between compaction tasks and realtime tasks can cause the coordinator's automatic compaction to get stuck.
-You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data. To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks.
+Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Frequent conflicts between compaction tasks and realtime tasks can cause the Coordinator's automatic compaction to hang.
+You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data.
+
+To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks.
+For example, if you want to skip over segments from thirty days prior to the end time of the most recent segment, assign `"skipOffsetFromLatest": "P30D"`.
+For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).

 ###### Automatic compaction tuningConfig

--- a/docs/design/coordinator.md
+++ b/docs/design/coordinator.md
@ -81,7 +81,7 @@ To ensure an even distribution of segments across Historical processes in the cl

 ### Automatic compaction

-The Druid Coordinator manages the automatic compaction system.
+The Druid Coordinator manages the [automatic compaction system](../ingestion/automatic-compaction.md).
 Each run, the Coordinator compacts segments by merging small segments or splitting a large one. This is useful when the size of your segments is not optimized which may degrade query performance.
 See [Segment size optimization](../operations/segment-optimization.md) for details.

@ -133,10 +133,11 @@ If the Coordinator has enough task slots for compaction, this policy will contin
 `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`.
 Finally, `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` will be picked up even though there is only one segment in the time chunk of `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z`.

-The search start point can be changed by setting [`skipOffsetFromLatest`](../configuration/index.md#automatic-compaction-dynamic-configuration).
+The search start point can be changed by setting `skipOffsetFromLatest`.
 If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`).
 This is to avoid conflicts between compaction tasks and realtime tasks.
 Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.
+For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).

 > This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
 > and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration).
--- a/docs/ingestion/automatic-compaction.md
+++ b/docs/ingestion/automatic-compaction.md
@ -0,0 +1,198 @@
+---
+id: automatic-compaction
+title: "Automatic compaction"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+In Apache Druid, compaction is a special type of ingestion task that reads data from a Druid datasource and writes it back into the same datasource. A common use case for this is to [optimally size segments](../operations/segment-optimization.md) after ingestion to improve query performance. Automatic compaction, or auto-compaction, refers to the system for automatic execution of compaction tasks managed by the [Druid Coordinator](../design/coordinator.md).
+
+The Coordinator [indexing period](../configuration/index.md#coordinator-operation), `druid.coordinator.period.indexingPeriod`, controls the frequency of compaction tasks.
+The default indexing period is 30 minutes, meaning that the Coordinator first checks for segments to compact at most 30 minutes from when auto-compaction is enabled.
+This time period affects other Coordinator duties including merge and conversion tasks.
+To configure the auto-compaction time period without interfering with `indexingPeriod`, see [Set frequency of compaction runs](#set-frequency-of-compaction-runs).
+
+At every invocation of auto-compaction, the Coordinator initiates a [segment search](../design/coordinator.md#segment-search-policy-in-automatic-compaction) to determine eligible segments to compact.
+When there are eligible segments to compact, the Coordinator issues compaction tasks based on available worker capacity.
+If a compaction task takes longer than the indexing period, the Coordinator waits for it to finish before resuming the period for segment search.
+
+As a best practice, you should set up auto-compaction for all Druid datasources. You can run compaction tasks manually for cases where you want to allocate more system resources. For example, you may choose to run multiple compaction tasks in parallel to compact an existing datasource for the first time. See [Compaction](compaction.md) for additional details and use cases.
+
+This topic guides you through setting up automatic compaction for your Druid cluster. See the [examples](#examples) for common use cases for automatic compaction.
+
+## Enable automatic compaction
+
+You can enable automatic compaction for a datasource using the Druid console or programmatically via an API.
+This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the Druid console](../operations/druid-console.md) or the [Tasks API](../operations/api-reference.md#post-5).
+
+### Druid console
+
+Use the Druid console to enable automatic compaction for a datasource as follows.
+
+1. Click **Datasources** in the top-level navigation.
+2. In the **Compaction** column, click the edit icon for the datasource to compact.
+3. In the **Compaction config** dialog, configure the auto-compaction settings. The dialog offers a form view as well as a JSON view. Editing the form updates the JSON specification, and editing the JSON updates the form field, if present. Form fields not present in the JSON indicate default values. You may add additional properties to the JSON for auto-compaction settings not displayed in the form. See [Configure automatic compaction](#configure-automatic-compaction) for supported settings for auto-compaction.
+4. Click **Submit**.
+5. Refresh the **Datasources** view. The **Compaction** column for the datasource changes from “Not enabled” to “Awaiting first run.”
+
+The following screenshot shows the compaction config dialog for a datasource with auto-compaction enabled.
+![Compaction config in web console](../assets/compaction-config.png)
+
+To disable auto-compaction for a datasource, click **Delete** from the **Compaction config** dialog. Druid does not retain your auto-compaction configuration.
+
+### Compaction configuration API
+
+Use the [Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) to configure automatic compaction.
+To enable auto-compaction for a datasource, create a JSON object with the desired auto-compaction settings.
+See [Configure automatic compaction](#configure-automatic-compaction) for the syntax of an auto-compaction spec.
+Send the JSON object as a payload in a [`POST` request](../operations/api-reference.md#post-4) to `/druid/coordinator/v1/config/compaction`.
+The following example configures auto-compaction for the `wikipedia` datasource:
+
+```sh
+curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+    "dataSource": "wikipedia",
+    "granularitySpec": {
+        "segmentGranularity": "DAY"
+    }
+}'
+```
+
+To disable auto-compaction for a datasource, send a [`DELETE` request](../operations/api-reference.md#delete-1) to `/druid/coordinator/v1/config/compaction/{dataSource}`. Replace `{dataSource}` with the name of the datasource for which to disable auto-compaction. For example:
+
+```sh
+curl --location --request DELETE 'http://localhost:8081/druid/coordinator/v1/config/compaction/wikipedia'
+```
+
+## Configure automatic compaction
+
+You can configure automatic compaction dynamically without restarting Druid.
+The automatic compaction system uses the following syntax:
+
+```json
+{
+    "dataSource": <task_datasource>,
+    "ioConfig": <IO config>,
+    "dimensionsSpec": <custom dimensionsSpec>,
+    "transformSpec": <custom transformSpec>,
+    "metricsSpec": <custom metricsSpec>,
+    "tuningConfig": <parallel indexing task tuningConfig>,
+    "granularitySpec": <compaction task granularitySpec>,
+    "skipOffsetFromLatest": <time period to avoid compaction>,
+    "taskPriority": <compaction task priority>,
+    "taskContext": <task context>
+}
+```
+
+Most fields in the auto-compaction configuration correspond to fields in a typical [Druid ingestion spec](../ingestion/ingestion-spec.md).
+The following properties only apply to auto-compaction:
+* `skipOffsetFromLatest`
+* `taskPriority`
+* `taskContext`
+
+Since the automatic compaction system provides a management layer on top of manual compaction tasks,
+the auto-compaction configuration does not include task-specific properties found in a typical Druid ingestion spec.
+The following properties are automatically set by the Coordinator:
+* `type`: Set to `compact`.
+* `id`: Generated using the task type, datasource name, interval, and timestamp. The task ID is prefixed with `coordinator-issued`.
+* `context`: Set according to the user-provided `taskContext`.
+
+For more details on each of the specs in an auto-compaction configuration, see [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).
+
+### Avoid conflicts with ingestion
+
+Compaction tasks may be interrupted when they interfere with ingestion. For example, this occurs when an ingestion task needs to write data to a segment for a time interval locked for compaction. If there are continuous failures that prevent compaction from making progress, consider one of the following strategies:
+* Set `skipOffsetFromLatest` to reduce the chance of conflicts between ingestion and compaction. The rest of this section describes this setting in more detail.
+* Increase the priority value of compaction tasks relative to ingestion tasks. Only recommended for advanced users. This approach can cause ingestion jobs to fail or lag. To change the priority of compaction tasks, set `taskPriority` to the desired priority value in the auto-compaction configuration. For details on the priority values of different task types, see [Lock priority](../ingestion/tasks.md#lock-priority).
+
+The Coordinator compacts segments from newest to oldest. In the auto-compaction configuration, you can set a time period, relative to the end time of the most recent segment, for segments that should not be compacted. Assign this value to `skipOffsetFromLatest`. Note that this offset is not relative to the current time but to the latest segment time. For example, if you want to skip over segments from five days prior to the end time of the most recent segment, assign `"skipOffsetFromLatest": "P5D"`.
+
+To set `skipOffsetFromLatest`, consider how frequently you expect the stream to receive late-arriving data. If your stream only occasionally receives late-arriving data, the auto-compaction system still compacts your data reliably, even when some data is ingested outside the `skipOffsetFromLatest` window. For most realtime streaming ingestion use cases, it is reasonable to set `skipOffsetFromLatest` to a few hours or a day.
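
As a minimal sketch of the guidance above, the following snippet prepares a request that sets a one-day offset. The datasource name `wikistream`, the Coordinator address `localhost:8081`, and the helper name `submit_compaction_config` are assumptions for illustration; the endpoint is the one described in the compaction configuration API section.

```sh
# Assumed Coordinator address, matching the other examples in this topic.
COORDINATOR_URL='http://localhost:8081'

# Skip segments within one day of the end time of the most recent segment.
PAYLOAD='{
    "dataSource": "wikistream",
    "skipOffsetFromLatest": "P1D"
}'

# Hypothetical helper: POST an auto-compaction config passed as the first argument.
submit_compaction_config() {
  curl --location --request POST "${COORDINATOR_URL}/druid/coordinator/v1/config/compaction" \
    --header 'Content-Type: application/json' \
    --data-raw "$1"
}

# Usage (requires a running Coordinator):
#   submit_compaction_config "$PAYLOAD"
```

Start with an offset that covers how late your data usually arrives, and widen it if compaction tasks continue to fail.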
+
+### Set frequency of compaction runs
+
+If you want the Coordinator to check for compaction more frequently than its indexing period, create a separate group to handle compaction duties.
+Set the time period of the duty group in the `coordinator/runtime.properties` file.
+The following example shows how to create a duty group named `compaction` and set the auto-compaction period to 1 minute:
+```
+druid.coordinator.dutyGroups=["compaction"]
+druid.coordinator.compaction.duties=["compactSegments"]
+druid.coordinator.compaction.period=PT60S
+```
+
+## View automatic compaction statistics
+
+After the Coordinator has initiated auto-compaction, you can view compaction statistics for the datasource, including the number of bytes, segments, and intervals already compacted and those awaiting compaction. The Coordinator also reports the total bytes, segments, and intervals not eligible for compaction in accordance with its [segment search policy](../design/coordinator.md#segment-search-policy-in-automatic-compaction).
+
+In the Druid console, the Datasources view displays auto-compaction statistics. The Tasks view shows the task information for compaction tasks that were triggered by the automatic compaction system.
+
+To get statistics by API, send a [`GET` request](../operations/api-reference.md#get-10) to `/druid/coordinator/v1/compaction/status`. To filter the results to a particular datasource, pass the datasource name as a query parameter to the request—for example, `/druid/coordinator/v1/compaction/status?dataSource=wikipedia`.
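
The status request can be sketched as follows. The Coordinator address `localhost:8081` and the helper names `compaction_status_url` and `compaction_status` are assumptions for illustration; only the endpoint path and the `dataSource` query parameter come from the description above.

```sh
# Assumed Coordinator address, matching the other examples in this topic.
COORDINATOR_URL='http://localhost:8081'

# Build the status URL; an optional first argument filters by datasource.
compaction_status_url() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "${COORDINATOR_URL}/druid/coordinator/v1/compaction/status?dataSource=$1"
  else
    printf '%s\n' "${COORDINATOR_URL}/druid/coordinator/v1/compaction/status"
  fi
}

# Fetch the statistics (requires a running Coordinator).
compaction_status() {
  curl --silent --location --request GET "$(compaction_status_url "${1:-}")"
}

# Usage:
#   compaction_status            # all datasources
#   compaction_status wikipedia  # a single datasource
```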
+
+## Examples
+
+The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../ingestion/compaction.md#compaction-strategies). The examples in this section do not change the underlying data.
+
+### Change segment granularity
+
+You have a stream set up to ingest data with `HOUR` segment granularity into the `wikistream` datasource. You notice that your Druid segments are smaller than the [recommended segment size](../operations/segment-optimization.md) of 5 million rows per segment. You wish to automatically compact segments to `DAY` granularity while leaving the latest week of data _not_ compacted because your stream consistently receives data within that time period.
+
+The following auto-compaction configuration compacts existing `HOUR` segments into `DAY` segments while leaving the latest week of data not compacted:
+
+```json
+{
+  "dataSource": "wikistream",
+  "granularitySpec": {
+    "segmentGranularity": "DAY"
+  },
+  "skipOffsetFromLatest": "P1W",
+}
+```
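
One way to apply a configuration like this is to save it to a file and send the file to the compaction configuration endpoint. The sketch below assumes a Coordinator at `localhost:8081`; the file name `wikistream-compaction.json` and the helper name `apply_compaction_config` are hypothetical.

```sh
# Assumed Coordinator address and a hypothetical file name.
COORDINATOR_URL='http://localhost:8081'
CONFIG_FILE='wikistream-compaction.json'

# Write the auto-compaction configuration to a local file.
cat > "$CONFIG_FILE" <<'EOF'
{
  "dataSource": "wikistream",
  "granularitySpec": {
    "segmentGranularity": "DAY"
  },
  "skipOffsetFromLatest": "P1W"
}
EOF

# Hypothetical helper: POST the file named by the first argument.
apply_compaction_config() {
  curl --location --request POST "${COORDINATOR_URL}/druid/coordinator/v1/config/compaction" \
    --header 'Content-Type: application/json' \
    --data-binary "@$1"
}

# Usage (requires a running Coordinator):
#   apply_compaction_config "$CONFIG_FILE"
```

Keeping the configuration in a file makes it easier to review and version alongside your ingestion specs.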
+
+### Update partitioning scheme
+
+For your `wikipedia` datasource, you want to optimize segment access when regularly ingesting data without compromising compute time when querying the data. Your ingestion spec for batch append uses [dynamic partitioning](../ingestion/native-batch.md#dynamic-partitioning) to optimize for write-time operations, while your stream ingestion partitioning is configured by the stream service. You want to implement auto-compaction to reorganize the data with a suitable read-time partitioning using [multi-dimension range partitioning](../ingestion/native-batch.md#multi-dimension-range-partitioning). Based on the dimensions frequently accessed in queries, you wish to partition on the following dimensions: `channel`, `countryName`, `namespace`.
+
+The following auto-compaction configuration updates the `wikipedia` segments to use multi-dimension range partitioning:
+
+```json
+{
+  "dataSource": "wikipedia",
+  "tuningConfig": {
+    "partitionsSpec": {
+      "type": "range",
+      "partitionDimensions": [
+        "channel",
+        "countryName",
+        "namespace"
+      ],
+      "targetRowsPerSegment": 5000000
+    }
+  }
+}
+```
+
+## Learn more
+
+See the following topics for more information:
+* [Compaction](compaction.md) for an overview of compaction and how to set up manual compaction in Druid.
+* [Segment optimization](../operations/segment-optimization.md) for guidance on evaluating and optimizing Druid segment size.
+* [Coordinator process](../design/coordinator.md#automatic-compaction) for details on how the Coordinator plans compaction tasks.
+
--- a/docs/ingestion/compaction.md
+++ b/docs/ingestion/compaction.md
@ -47,7 +47,7 @@ Compaction does not improve performance in all situations. For example, if you r

 You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using its [segment search policy](../design/coordinator.md#segment-search-policy-in-automatic-compaction), the Coordinator periodically identifies segments for compaction starting from newest to oldest. When the Coordinator discovers segments that have not been compacted or segments that were compacted with a different or changed spec, it submits compaction tasks for the time interval covering those segments.

-Automatic compaction works in most use cases and should be your first option. To learn more about automatic compaction, see [Compacting Segments](../design/coordinator.md#automatic-compaction).
+Automatic compaction works in most use cases and should be your first option. To learn more, see [Automatic compaction](../ingestion/automatic-compaction.md).

 In cases where you require more control over compaction, you can manually submit compaction tasks. For example:

@ -63,7 +63,9 @@ During compaction, Druid overwrites the original set of segments with the compac
 You can set `dropExisting` in `ioConfig` to "true" in the compaction task to configure Druid to replace all existing segments fully contained by the interval. See the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations) for an example.
 > WARNING: `dropExisting` in `ioConfig` is a beta feature.

-If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time to reduce the chance of conflicts between ingestion and compaction. See [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for more information. Another option is to set the compaction task to higher priority than the ingestion task.
+If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to exclude the most recent segments, measured back from the end time of the latest segment, to reduce the chance of conflicts between ingestion and compaction.
+Another option is to set the compaction task to higher priority than the ingestion task.
+For more information, see [Avoid conflicts with ingestion](../ingestion/automatic-compaction.md#avoid-conflicts-with-ingestion).

 ### Segment granularity handling

@ -221,6 +223,5 @@ Druid supports two supported `inputSpec` formats:

 See the following topics for more information:
 - [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case.
- [Compacting Segments](../design/coordinator.md#automatic-compaction) for details on how the Coordinator manages automatic compaction.
- [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration)
-and [Automatic compaction configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for automatic compaction configuration information.
+- [Automatic compaction](../ingestion/automatic-compaction.md) for how to enable and configure automatic compaction.
+
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@ -356,7 +356,7 @@ You can override the task priority by setting your priority in the task context

 The task context is used for various individual task configuration.
 Specify task context configurations in the `context` field of the ingestion spec.
-When configuring [automatic compaction](../configuration/index.md#automatic-compaction-dynamic-configuration), set the task context configurations in `taskContext` rather than in `context`.
+When configuring [automatic compaction](../ingestion/automatic-compaction.md), set the task context configurations in `taskContext` rather than in `context`.
 The settings get passed into the `context` field of the compaction tasks issued to MiddleManagers.

 The following parameters apply to all task types.
--- a/docs/operations/api-reference.md
+++ b/docs/operations/api-reference.md
@ -465,7 +465,7 @@ Update overlord dynamic worker configuration.
 * `/druid/coordinator/v1/compaction/progress?dataSource={dataSource}`

 Returns the total size of segments awaiting compaction for the given dataSource. 
-The specified dataSource must have automatic compaction enabled.
+The specified dataSource must have [automatic compaction](../ingestion/automatic-compaction.md) enabled.

 ##### GET

@ -517,7 +517,7 @@ will be set for them.

 * `/druid/coordinator/v1/config/compaction`

-Creates or updates the automatic compaction config for a dataSource.
+Creates or updates the [automatic compaction](../ingestion/automatic-compaction.md) config for a dataSource.
 See [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for configuration details.


--- a/docs/operations/segment-optimization.md
+++ b/docs/operations/segment-optimization.md
@ -90,12 +90,13 @@ Once you find your segments need compaction, you can consider the below two opti
  - Turning on the [automatic compaction of Coordinators](../design/coordinator.md#automatic-compaction).
  The Coordinator periodically submits [compaction tasks](../ingestion/tasks.md#compact) to re-index small segments.
  To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration.
-  See [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration)
-  and [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) for details.
+  For more information, see [Automatic compaction](../ingestion/automatic-compaction.md).
  - Running periodic Hadoop batch ingestion jobs and using a `dataSource`
  inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel.
  Details on how to do this can be found on the [Updating existing data](../ingestion/data-management.md#update) section
  of the data management page.

 ## Learn more
-For an overview of compaction and how to submit a manual compaction task, see [Compaction](../ingestion/compaction.md).
+* For an overview of compaction and how to submit a manual compaction task, see [Compaction](../ingestion/compaction.md).
+* To learn how to enable and configure automatic compaction, see [Automatic compaction](../ingestion/automatic-compaction.md).
+
--- a/website/sidebars.json
+++ b/website/sidebars.json
@ -39,6 +39,7 @@
      "ingestion/schema-design",
      "ingestion/data-management",
      "ingestion/compaction",
+      "ingestion/automatic-compaction",
      {
        "type": "subcategory",
        "label": "Stream ingestion",