druid/docs/content/ingestion/misc-tasks.md
2018-09-04 12:54:41 -07:00

6.4 KiB

layout
doc_page

Miscellaneous Tasks

Version Converter Task

The convert task suite takes active segments and will recompress them using a new IndexSpec. This is handy when doing activities like migrating from Concise to Roaring, or adding dimension compression to old segments.

Upon success the new segments will have the same version as the old segment with _converted appended. A convert task may be run against the same interval for the same datasource multiple times. Each execution will append another _converted to the version for the segments

There are two types of conversion tasks. One is the Hadoop convert task, and the other is the indexing service convert task. The Hadoop convert task runs on a hadoop cluster, and simply leaves a task monitor on the indexing service (similar to the hadoop batch task). The indexing service convert task runs the actual conversion on the indexing service.

Hadoop Convert Segment Task

{
  "type": "hadoop_convert_segment",
  "dataSource":"some_datasource",
  "interval":"2013/2015",
  "indexSpec":{"bitmap":{"type":"concise"},"dimensionCompression":"lz4","metricCompression":"lz4"},
  "force": true,
  "validate": false,
  "distributedSuccessCache":"hdfs://some-hdfs-nn:9000/user/jobrunner/cache",
  "jobPriority":"VERY_LOW",
  "segmentOutputPath":"s3n://somebucket/somekeyprefix"
}

The values are described below.

Field Type Description Required
type String Convert task identifier Yes: hadoop_convert_segment
dataSource String The datasource to search for segments Yes
interval Interval string The interval in the datasource to look for segments Yes
indexSpec json The compression specification for the index Yes
force boolean Forces the convert task to continue even if binary versions indicate it has been updated recently (you probably want to do this) No
validate boolean Runs validation between the old and new segment before reporting task success No
distributedSuccessCache URI A location where hadoop should put intermediary files. Yes
jobPriority org.apache.hadoop.mapred.JobPriority as String The priority to set for the hadoop job No
segmentOutputPath URI A base uri for the segment to be placed. Same format as other places a segment output path is needed Yes

Indexing Service Convert Segment Task

{
  "type": "convert_segment",
  "dataSource":"some_datasource",
  "interval":"2013/2015",
  "indexSpec":{"bitmap":{"type":"concise"},"dimensionCompression":"lz4","metricCompression":"lz4"},
  "force": true,
  "validate": false
}
Field Type Description Required (default)
type String Convert task identifier Yes: convert_segment
dataSource String The datasource to search for segments Yes
interval Interval string The interval in the datasource to look for segments Yes
indexSpec json The compression specification for the index Yes
force boolean Forces the convert task to continue even if binary versions indicate it has been updated recently (you probably want to do this) No (false)
validate boolean Runs validation between the old and new segment before reporting task success No (true)

Unlike the hadoop convert task, the indexing service task draws its output path from the indexing service's configuration.

IndexSpec

The indexSpec defines segment storage format options to be used at indexing time, such as bitmap type and column compression formats. The indexSpec is optional and default parameters will be used if not specified.

Field Type Description Required
bitmap Object Compression format for bitmap indexes. Should be a JSON object; see below for options. no (defaults to Concise)
dimensionCompression String Compression format for dimension columns. Choose from LZ4, LZF, or uncompressed. no (default == LZ4)
metricCompression String Compression format for metric columns. Choose from LZ4, LZF, uncompressed, or none. no (default == LZ4)
longEncoding String Encoding format for metric and dimension columns with type long. Choose from auto or longs. auto encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. longs stores the value as is with 8 bytes each. no (default == longs)
Bitmap types

For Concise bitmaps:

Field Type Description Required
type String Must be concise. yes

For Roaring bitmaps:

Field Type Description Required
type String Must be roaring. yes
compressRunOnSerialization Boolean Use a run-length encoding where it is estimated as more space efficient. no (default == true)

Noop Task

These tasks start, sleep for a time and are used only for testing. The available grammar is:

{
    "type": "noop",
    "id": <optional_task_id>,
    "interval" : <optional_segment_interval>,
    "runTime" : <optional_millis_to_sleep>,
    "firehose": <optional_firehose_to_test_connect>
}

Segment Merging Tasks (Deprecated)

Append Task

Append tasks append a list of segments together into a single segment (one after the other). The grammar is:

{
    "type": "append",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "segments": <JSON list of DataSegment objects to append>,
    "aggregations": <optional list of aggregators>,
    "context": <task context>
}

Merge Task

Merge tasks merge a list of segments together. Any common timestamps are merged. If rollup is disabled as part of ingestion, common timestamps are not merged and rows are reordered by their timestamp.

The grammar is:

{
    "type": "merge",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "aggregations": <list of aggregators>,
    "rollup": <whether or not to rollup data during a merge>,
    "segments": <JSON list of DataSegment objects to merge>,
    "context": <task context>
}

Same Interval Merge Task

Same Interval Merge task is a shortcut of merge task, all segments in the interval are going to be merged.

The grammar is:

{
    "type": "same_interval_merge",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "aggregations": <list of aggregators>,
    "rollup": <whether or not to rollup data during a merge>,
    "interval": <DataSegment objects in this interval are going to be merged>,
    "context": <task context>
}