<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

---
layout: doc_page
---

# Segment size optimization

In Druid, it's important to optimize the segment size because

1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode,
increasing the segment size might introduce further aggregation, which reduces the dataSource size.
2. When a query is submitted, it is distributed to all historicals and realtime tasks
holding the input segments of the query. Each node has a pool of processing threads and uses one thread per
segment. If segments are too large, data might not be well distributed over the
whole cluster, thereby decreasing the degree of parallelism. If segments are too small,
each processing thread processes only a small amount of data and the per-segment overhead grows. This might reduce the processing speed of
other queries as well as the query itself, because the processing threads are shared for executing all queries.
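
You can check the current segment sizes for a dataSource with a [segment metadata query](../querying/segmentmetadataquery.html). A minimal sketch is below; the dataSource name and interval are placeholders, and the `size` field in each result is the segment size in bytes.

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "my_datasource",
  "intervals": ["2018-01-01/2018-04-01"],
  "analysisTypes": ["size"]
}
```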

It would be best if you could optimize the segment size at ingestion time, but this is not always easy,
especially for streaming ingestion, because the amount of ingested data can vary over time. In this case,
you can set a rough segment size at ingestion time and optimize it later.
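
For example, the [Kafka indexing service](../development/extensions-core/kafka-ingestion.html) lets you cap the number of rows per segment via `maxRowsPerSegment` in the supervisor spec's `tuningConfig`. A minimal sketch of that portion of the spec (the row count is illustrative, not a recommendation):

```json
{
  "type": "kafka",
  "maxRowsPerSegment": 5000000
}
```

You have two options for optimizing the segment size afterwards: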

- Turning on the [automatic compaction of coordinators](../design/coordinator.html#compacting-segments).
The coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments
(see the first example below).
- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks (see the second example below). This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html).
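
To turn on automatic compaction for a dataSource, you submit a compaction config to the coordinator's compaction config endpoint (`/druid/coordinator/v1/config/compaction`). A minimal sketch, assuming a hypothetical dataSource named `my_datasource` and an illustrative target size of 400MB; see the [coordinator documentation](../design/coordinator.html#compacting-segments) for the full set of options in your version:

```json
{
  "dataSource": "my_datasource",
  "targetCompactionSizeBytes": 419430400
}
```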
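
For the Hadoop batch option, the task's `ioConfig` would contain a `dataSource` inputSpec along these lines (the dataSource name and interval are placeholders):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "my_datasource",
      "intervals": ["2018-01-01/2018-02-01"]
    }
  }
}
```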