Improve doc for auto compaction (#7117)
* Improve doc for auto compaction
* fix doc
* address comments
parent fa218f5160
commit ded03d9d4c
@@ -835,8 +835,10 @@ These configuration options control the behavior of the Lookup dynamic configuration

 ##### Compaction Dynamic Configuration

-Compaction configurations can also be set or updated dynamically without restarting Coordinators. For segment compaction,
-please see [Compacting Segments](../design/coordinator.html#compacting-segments).
+Compaction configurations can also be set or updated dynamically using
+[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators.
+
+For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.html).

 A description of the compaction config is:
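The config table that "A description of the compaction config is:" introduces falls outside this hunk. As a hedged illustration only, a per-dataSource compaction config payload might look like the sketch below; the field names (`taskPriority`, `inputSegmentSizeBytes`, `targetCompactionSizeBytes`, `skipOffsetFromLatest`) are assumptions drawn from Druid releases of this era, and all values are illustrative, so verify both against the config table for your version.

```json
{
  "dataSource": "your_dataSource",
  "taskPriority": 25,
  "inputSegmentSizeBytes": 419430400,
  "targetCompactionSizeBytes": 419430400,
  "skipOffsetFromLatest": "P1D"
}
```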
@@ -66,7 +66,7 @@ To ensure an even distribution of segments across Historical processes in the cluster

 ### Compacting Segments

 Each run, the Druid Coordinator compacts small segments abutting each other. This is useful when you have a lot of small
-segments which may degrade the query performance as well as increasing the disk space usage.
+segments which may degrade query performance as well as increase disk space usage. See [Segment Size Optimization](../operations/segment-optimization.html) for details.

 The Coordinator first finds the segments to compact together based on the [segment search policy](#segment-search-policy).
 Once some segments are found, it launches a [compaction task](../ingestion/tasks.html#compaction-task) to compact those segments.
@@ -341,7 +341,8 @@ will be set for them.

 * `/druid/coordinator/v1/config/compaction`

-Creates or updates the compaction config for a dataSource. See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.
+Creates or updates the compaction config for a dataSource.
+See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details.

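For illustration, the request body for this endpoint is a JSON compaction config, POSTed with `Content-Type: application/json`. A minimal hedged sketch follows, assuming `dataSource` is the only required field (other fields as in the configuration sketch earlier; both assumptions should be checked against your Druid version):

```json
{
  "dataSource": "your_dataSource",
  "targetCompactionSizeBytes": 419430400
}
```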
 ##### DELETE
@@ -1,6 +1,6 @@
 ---
 layout: doc_page
-title: "Segment size optimization"
+title: "Segment Size Optimization"
 ---

 <!--
@@ -22,25 +22,79 @@ title: "Segment size optimization"
 ~ under the License.
 -->

-# Segment size optimization
+# Segment Size Optimization

 In Druid, it's important to optimize the segment size because

 1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode,
 increasing the segment size might introduce further aggregation which reduces the dataSource size.
-2. When a query is submitted, that query is distributed to all Historicals and realtimes
-which hold the input segments of the query. Each process has a processing threads pool and use one thread per segment to
-process it. If the segment size is too large, data might not be well distributed over the
-whole cluster, thereby decreasing the degree of parallelism. If the segment size is too small,
-each processing thread processes too small data. This might reduce the processing speed of other queries as well as
-the input query itself because the processing threads are shared for executing all queries.
+2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
+which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
+to process a single segment. If segment sizes are too large, data might not be well distributed between data
+servers, decreasing the degree of parallelism possible during query processing.
+At the other extreme where segment sizes are too small, the scheduling
+overhead of processing a larger number of segments per query can reduce
+performance, as the threads that process each segment compete for the fixed
+slots of the processing pool.

 It would be best if you can optimize the segment size at ingestion time, but sometimes it's not easy
-especially for the streaming ingestion because the amount of data ingested might vary over time. In this case,
-you can roughly set the segment size at ingestion time and optimize it later. You have two options:
+especially when it comes to stream ingestion because the amount of data ingested might vary over time. In this case,
+you can create segments with a sub-optimized size first and optimize them later.
+
+You may need to consider the following to optimize your segments.
+
+- Number of rows per segment: it's generally recommended for each segment to have around 5 million rows.
+This setting is usually _more_ important than the "segment byte size" below.
+This is because Druid uses a single thread to process each segment,
+and thus this setting directly controls how many rows each thread processes,
+which in turn determines how well the query execution is parallelized.
+- Segment byte size: it's recommended to be between 300MB and 700MB. If this value
+doesn't match with the "number of rows per segment", please consider optimizing
+the number of rows per segment rather than this value.
+
+<div class="note">
+The above recommendation works in general, but the optimal setting can
+vary based on your workload. For example, if most of your queries
+are heavy and take a long time to process each row, you may want to make
+segments smaller so that the query processing can be more parallelized.
+If you still see performance issues after optimizing segment size,
+you may need to experiment further to find the optimal settings for your workload.
+</div>
+
+There might be several ways to check if compaction is necessary. One way
+is to use the [System Schema](../querying/sql.html#system-schema). The
+system schema provides several tables about the current system status, including the `segments` table.
+By running the below query, you can get the average number of rows and the average size of published segments.
+
+```sql
+SELECT
+  "start",
+  "end",
+  version,
+  COUNT(*) AS num_segments,
+  AVG("num_rows") AS avg_num_rows,
+  SUM("num_rows") AS total_num_rows,
+  AVG("size") AS avg_size,
+  SUM("size") AS total_size
+FROM
+  sys.segments
+WHERE
+  datasource = 'your_dataSource' AND
+  is_published = 1
+GROUP BY 1, 2, 3
+ORDER BY 1, 2, 3 DESC;
+```
+
+Please note that the query result might include overshadowed segments.
+In this case, you may want to see only rows of the max version per interval (pair of `start` and `end`).
+
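A hedged sketch of that max-version filtering is given below. It assumes your Druid version's SQL layer supports joining `sys.segments` against a grouped subquery, which varies by release, so verify before relying on it; the dataSource name is illustrative.

```sql
SELECT
  s."start",
  s."end",
  s.version,
  COUNT(*) AS num_segments,
  AVG(s."num_rows") AS avg_num_rows,
  AVG(s."size") AS avg_size
FROM sys.segments s
INNER JOIN (
  -- keep only the max version per (start, end) interval
  SELECT "start", "end", MAX(version) AS max_version
  FROM sys.segments
  WHERE datasource = 'your_dataSource' AND is_published = 1
  GROUP BY 1, 2
) m
  ON s."start" = m."start" AND s."end" = m."end" AND s.version = m.max_version
WHERE s.datasource = 'your_dataSource' AND s.is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2;
```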
+Once you find that your segments need compaction, you can consider the below two options:
+
+- Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments).
+The Coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments.
+To enable automatic compaction, you need to configure it for each dataSource via the Coordinator's dynamic configuration.
+See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration)
+and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for details.
+- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
+inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel;
+a sketch of such an inputSpec follows this list.
+Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html).
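To make the second option concrete, below is a hedged fragment of a Hadoop ingestion `ioConfig` using the `dataSource` inputSpec, as described in 'Updating Existing Data'. The dataSource name and interval are illustrative, and the exact shape should be checked against that page for your Druid version.

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "your_dataSource",
      "intervals": ["2019-01-01/2019-02-01"]
    }
  }
}
```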