<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

---
layout: doc_page
---

# Segment size optimization

In Druid, it's important to optimize the segment size because

1. Druid stores data in segments. If you're using the [best-effort roll-up](../design/index.html#roll-up-modes) mode,
increasing the segment size might introduce further aggregation, which reduces the dataSource size.
2. When a query is submitted, it is distributed to all historicals and realtime tasks
holding the input segments of the query. Each node has a pool of processing threads and uses one thread per
segment. If segments are too large, data might not be well distributed over the
whole cluster, thereby decreasing the degree of parallelism. If segments are too small,
each processing thread processes only a small amount of data and the per-segment overhead grows. This might reduce the processing speed of
other queries as well as the query itself, because the processing threads are shared for executing all queries.
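
You can check the current segment sizes for a dataSource with a [segment metadata query](../querying/segmentmetadataquery.html). A minimal sketch is below; the dataSource name and interval are placeholders, and the `size` field in each result is the segment size in bytes.

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "my_datasource",
  "intervals": ["2018-01-01/2018-04-01"],
  "analysisTypes": ["size"]
}
```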

It would be best if you could optimize the segment size at ingestion time, but this is not always easy,
especially for streaming ingestion, because the amount of ingested data can vary over time. In this case,
you can set a rough segment size at ingestion time and optimize it later.
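
For example, the [Kafka indexing service](../development/extensions-core/kafka-ingestion.html) lets you cap the number of rows per segment via `maxRowsPerSegment` in the supervisor spec's `tuningConfig`. A minimal sketch of that portion of the spec (the row count is illustrative, not a recommendation):

```json
{
  "type": "kafka",
  "maxRowsPerSegment": 5000000
}
```

You have two options for optimizing the segment size afterwards: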

- Turning on the [automatic compaction of coordinators](../design/coordinator.html#compacting-segments).
The coordinator periodically submits [compaction tasks](../ingestion/tasks.html#compaction-task) to re-index small segments
(see the first example below).
- Running periodic Hadoop batch ingestion jobs and using a `dataSource`
inputSpec to read from the segments generated by the Kafka indexing tasks (see the second example below). This might be helpful if you want to compact a lot of segments in parallel.
Details on how to do this can be found under ['Updating Existing Data'](../ingestion/update-existing-data.html).
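
To turn on automatic compaction for a dataSource, you submit a compaction config to the coordinator's compaction config endpoint (`/druid/coordinator/v1/config/compaction`). A minimal sketch, assuming a hypothetical dataSource named `my_datasource` and an illustrative target size of 400MB; see the [coordinator documentation](../design/coordinator.html#compacting-segments) for the full set of options in your version:

```json
{
  "dataSource": "my_datasource",
  "targetCompactionSizeBytes": 419430400
}
```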
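
For the Hadoop batch option, the task's `ioConfig` would contain a `dataSource` inputSpec along these lines (the dataSource name and interval are placeholders):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "my_datasource",
      "intervals": ["2018-01-01/2018-02-01"]
    }
  }
}
```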