---
layout: doc_page
---
# Miscellaneous Tasks

## Version Converter Task

The convert task suite takes active segments and recompresses them using a new IndexSpec. This is handy for activities such as migrating from Concise to Roaring bitmaps, or adding dimension compression to old segments.

Upon success the new segments will have the same version as the old segments, with `_converted` appended. A convert task may be run against the same interval for the same datasource multiple times; each execution appends another `_converted` to the version of the segments.

There are two types of conversion tasks: the Hadoop convert task and the indexing service convert task. The Hadoop convert task runs the conversion on a Hadoop cluster and leaves only a task monitor on the indexing service (similar to the Hadoop batch task). The indexing service convert task runs the actual conversion on the indexing service itself.
### Hadoop Convert Segment Task

```json
{
  "type": "hadoop_convert_segment",
  "dataSource": "some_datasource",
  "interval": "2013/2015",
  "indexSpec": {"bitmap": {"type": "concise"}, "dimensionCompression": "lz4", "metricCompression": "lz4"},
  "force": true,
  "validate": false,
  "distributedSuccessCache": "hdfs://some-hdfs-nn:9000/user/jobrunner/cache",
  "jobPriority": "VERY_LOW",
  "segmentOutputPath": "s3n://somebucket/somekeyprefix"
}
```
The values are described below.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`type`|String|Convert task identifier|Yes: `hadoop_convert_segment`|
|`dataSource`|String|The datasource to search for segments|Yes|
|`interval`|Interval string|The interval in the datasource to look for segments|Yes|
|`indexSpec`|JSON|The compression specification for the index|Yes|
|`force`|Boolean|Forces the convert task to continue even if binary versions indicate it has been updated recently (you probably want to do this)|No|
|`validate`|Boolean|Runs validation between the old and new segment before reporting task success|No|
|`distributedSuccessCache`|URI|A location where Hadoop should put intermediary files|Yes|
|`jobPriority`|`org.apache.hadoop.mapred.JobPriority` as String|The priority to set for the Hadoop job|No|
|`segmentOutputPath`|URI|A base URI where the segment will be placed. Same format as other places a segment output path is needed|Yes|
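
For the Concise-to-Roaring migration mentioned above, the same task shape can simply point at a Roaring indexSpec. A minimal sketch (the datasource, cache, and output paths are illustrative):

```json
{
  "type": "hadoop_convert_segment",
  "dataSource": "some_datasource",
  "interval": "2013/2015",
  "indexSpec": {"bitmap": {"type": "roaring"}, "dimensionCompression": "lz4", "metricCompression": "lz4"},
  "force": true,
  "validate": true,
  "distributedSuccessCache": "hdfs://some-hdfs-nn:9000/user/jobrunner/cache",
  "segmentOutputPath": "s3n://somebucket/somekeyprefix"
}
```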
### Indexing Service Convert Segment Task

```json
{
  "type": "convert_segment",
  "dataSource": "some_datasource",
  "interval": "2013/2015",
  "indexSpec": {"bitmap": {"type": "concise"}, "dimensionCompression": "lz4", "metricCompression": "lz4"},
  "force": true,
  "validate": false
}
```
|Field|Type|Description|Required (default)|
|-----|----|-----------|------------------|
|`type`|String|Convert task identifier|Yes: `convert_segment`|
|`dataSource`|String|The datasource to search for segments|Yes|
|`interval`|Interval string|The interval in the datasource to look for segments|Yes|
|`indexSpec`|JSON|The compression specification for the index|Yes|
|`force`|Boolean|Forces the convert task to continue even if binary versions indicate it has been updated recently (you probably want to do this)|No (false)|
|`validate`|Boolean|Runs validation between the old and new segment before reporting task success|No (true)|

Unlike the Hadoop convert task, the indexing service convert task draws its output path from the indexing service's configuration.
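
As an illustration, a convert task that migrates an interval to Roaring bitmaps might look like the sketch below; since `force` defaults to false and `validate` defaults to true, both fields can be omitted (the datasource and interval are illustrative):

```json
{
  "type": "convert_segment",
  "dataSource": "some_datasource",
  "interval": "2014-01-01/2014-02-01",
  "indexSpec": {"bitmap": {"type": "roaring"}, "dimensionCompression": "lz4", "metricCompression": "lz4"}
}
```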
#### IndexSpec

The indexSpec defines segment storage format options to be used at indexing time, such as bitmap type and column compression formats. The indexSpec is optional; default parameters will be used if it is not specified.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`bitmap`|Object|Compression format for bitmap indexes. Should be a JSON object; see below for options.|No (defaults to Concise)|
|`dimensionCompression`|String|Compression format for dimension columns. Choose from `LZ4`, `LZF`, or `uncompressed`.|No (default == `LZ4`)|
|`metricCompression`|String|Compression format for metric columns. Choose from `LZ4`, `LZF`, `uncompressed`, or `none`.|No (default == `LZ4`)|
|`longEncoding`|String|Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using an offset or lookup table depending on column cardinality, and stores them with variable size. `longs` stores the values as-is, with 8 bytes each.|No (default == `longs`)|
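
As an illustration, an indexSpec that switches to Roaring bitmaps (see the bitmap types below), keeps `lz4` column compression, and opts into `auto` long encoding would look like:

```json
{
  "bitmap": {"type": "roaring"},
  "dimensionCompression": "lz4",
  "metricCompression": "lz4",
  "longEncoding": "auto"
}
```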
##### Bitmap types

For Concise bitmaps:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`type`|String|Must be `concise`.|yes|

For Roaring bitmaps:

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|`type`|String|Must be `roaring`.|yes|
|`compressRunOnSerialization`|Boolean|Use a run-length encoding where it is estimated to be more space efficient.|no (default == `true`)|
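
For example, a Roaring bitmap spec that disables the run-length encoding heuristic would be written as the value of the indexSpec's `bitmap` field:

```json
{"type": "roaring", "compressRunOnSerialization": false}
```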
## Noop Task

These tasks start, sleep for a given amount of time, and then exit; they are used only for testing. The available grammar is:

```json
{
  "type": "noop",
  "id": <optional_task_id>,
  "interval": <optional_segment_interval>,
  "runTime": <optional_millis_to_sleep>,
  "firehose": <optional_firehose_to_test_connect>
}
```
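
All fields are optional, so a minimal test task needs only the type. For example (the id and run time are illustrative):

```json
{
  "type": "noop",
  "id": "noop_test_task",
  "runTime": 2500
}
```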
## Segment Merging Tasks (Deprecated)

### Append Task

Append tasks append a list of segments together into a single segment (one after the other). The grammar is:

```json
{
  "type": "append",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "segments": <JSON list of DataSegment objects to append>,
  "aggregations": <optional list of aggregators>,
  "context": <task context>
}
```
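
For illustration, a sketch of an append task; the entries in `segments` are abbreviated placeholders and would in practice be the full segment metadata JSON for each segment (all other values are also illustrative):

```json
{
  "type": "append",
  "id": "append_some_datasource_2013-01-01",
  "dataSource": "some_datasource",
  "segments": [
    <DataSegment object for 2013-01-01T00:00/2013-01-01T12:00>,
    <DataSegment object for 2013-01-01T12:00/2013-01-02T00:00>
  ]
}
```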
### Merge Task

Merge tasks merge a list of segments together. Any common timestamps are merged.
If rollup is disabled as part of ingestion, common timestamps are not merged and rows are reordered by their timestamp.

The grammar is:

```json
{
  "type": "merge",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "aggregations": <list of aggregators>,
  "rollup": <whether or not to rollup data during a merge>,
  "segments": <JSON list of DataSegment objects to merge>,
  "context": <task context>
}
```
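
A sketch follows, again with the segment descriptors left as placeholders; the aggregator shown is a standard longSum aggregator, and all values are illustrative:

```json
{
  "type": "merge",
  "id": "merge_some_datasource_2013-01-01",
  "dataSource": "some_datasource",
  "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}],
  "rollup": true,
  "segments": [
    <DataSegment object for 2013-01-01T00:00/2013-01-01T12:00>,
    <DataSegment object for 2013-01-01T12:00/2013-01-02T00:00>
  ]
}
```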
### Same Interval Merge Task

The Same Interval Merge task is a shortcut version of the merge task: all segments in the given interval are merged.

The grammar is:

```json
{
  "type": "same_interval_merge",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "aggregations": <list of aggregators>,
  "rollup": <whether or not to rollup data during a merge>,
  "interval": <DataSegment objects in this interval are going to be merged>,
  "context": <task context>
}
```
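
For example, a sketch that merges all segments of a datasource falling in January 2014 (all values are illustrative):

```json
{
  "type": "same_interval_merge",
  "id": "same_interval_merge_some_datasource_2014-01",
  "dataSource": "some_datasource",
  "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}],
  "rollup": true,
  "interval": "2014-01-01/2014-02-01"
}
```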