From b1f25a0522cf1a36caf07d3f2b6a06577159e406 Mon Sep 17 00:00:00 2001 From: YuCheng Hu Date: Thu, 5 Aug 2021 18:44:04 -0400 Subject: [PATCH] =?UTF-8?q?=E6=95=B4=E7=90=86=E6=95=B0=E6=8D=AE=E5=AF=BC?= =?UTF-8?q?=E5=85=A5=E6=96=87=E4=BB=B6=E5=A4=B9=E4=B8=AD=E7=9A=84=E6=96=87?= =?UTF-8?q?=E4=BB=B6=E6=A0=BC=E5=BC=8F?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ingestion/compaction.md | 231 +++ ingestion/data-formats.md | 2629 ++++++++++++++++++++++++++ ingestion/dataformats.md | 1140 ------------ ingestion/datamanage.md | 191 -- ingestion/faq.md | 114 ++ ingestion/hadoop.md | 571 ++++++ ingestion/kafka.md | 1 - ingestion/native-batch.md | 2960 ++++++++++++++++++++++++++++++ ingestion/native.md | 1142 ------------ ingestion/schema-design.md | 438 +++++ ingestion/schemadesign.md | 155 -- ingestion/standalone-realtime.md | 45 + ingestion/taskrefer.md | 373 ---- ingestion/tasks.md | 774 ++++++++ ingestion/tranquility.md | 36 + 15 files changed, 7798 insertions(+), 3002 deletions(-) create mode 100644 ingestion/compaction.md create mode 100644 ingestion/data-formats.md delete mode 100644 ingestion/dataformats.md create mode 100644 ingestion/native-batch.md delete mode 100644 ingestion/native.md create mode 100644 ingestion/schema-design.md delete mode 100644 ingestion/schemadesign.md create mode 100644 ingestion/standalone-realtime.md delete mode 100644 ingestion/taskrefer.md create mode 100644 ingestion/tasks.md create mode 100644 ingestion/tranquility.md diff --git a/ingestion/compaction.md b/ingestion/compaction.md new file mode 100644 index 0000000..9f34381 --- /dev/null +++ b/ingestion/compaction.md @@ -0,0 +1,231 @@ +--- +id: compaction +title: "Compaction" +description: "Defines compaction and automatic compaction (auto-compaction or autocompaction) for segment optimization. Use cases and strategies for compaction. Describes compaction task configuration." +--- + + +Query performance in Apache Druid depends on optimally sized segments. Compaction is one strategy you can use to optimize segment size for your Druid database. Compaction tasks read an existing set of segments for a given time interval and combine the data into a new "compacted" set of segments. In some cases the compacted segments are larger, but there are fewer of them. In other cases the compacted segments may be smaller. Compaction tends to increase performance because optimized segments require less per-segment processing and less memory overhead for ingestion and for querying paths. + +## Compaction strategies +There are several cases to consider compaction for segment optimization: +- With streaming ingestion, data can arrive out of chronological order creating lots of small segments. +- If you append data using `appendToExisting` for [native batch](native-batch.md) ingestion creating suboptimal segments. +- When you use `index_parallel` for parallel batch indexing and the parallel ingestion tasks create many small segments. +- When a misconfigured ingestion task creates oversized segments. + +By default, compaction does not modify the underlying data of the segments. However, there are cases when you may want to modify data during compaction to improve query performance: +- If, after ingestion, you realize that data for the time interval is sparse, you can use compaction to increase the segment granularity. +- Over time you don't need fine-grained granularity for older data so you want use compaction to change older segments to a coarser query granularity. 
This reduces the storage space required for older data. For example from `minute` to `hour`, or `hour` to `day`. You cannot go from coarser granularity to finer granularity. +- You can change the dimension order to improve sorting and reduce segment size. +- You can remove unused columns in compaction or implement an aggregation metric for older data. +- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](index.md#perfect-rollup-vs-best-effort-rollup). + +Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment. + +## Types of compaction +You can configure the Druid Coordinator to perform automatic compaction, also called auto-compaction, for a datasource. Using a segment search policy, the coordinator periodically identifies segments for compaction starting with the newest to oldest. When it discovers segments that have not been compacted or segments that were compacted with a different or changed spec, it submits compaction task for those segments and only those segments. + +Automatic compaction works in most use cases and should be your first option. To learn more about automatic compaction, see [Compacting Segments](../design/coordinator.md#compacting-segments). + +In cases where you require more control over compaction, you can manually submit compaction tasks. For example: +- Automatic compaction is running into the limit of task slots available to it, so tasks are waiting for previous automatic compaction tasks to complete. Manual compaction can use all available task slots, therefore you can complete compaction more quickly by submitting more concurrent tasks for more intervals. +- You want to force compaction for a specific time range or you want to compact data out of chronological order. + +See [Setting up a manual compaction task](#setting-up-manual-compaction) for more about manual compaction tasks. + +## Data handling with compaction +During compaction, Druid overwrites the original set of segments with the compacted set. Druid also locks the segments for the time interval being compacted to ensure data consistency. By default, compaction tasks do not modify the underlying data. You can configure the compaction task to change the query granularity or add or remove dimensions in the compaction task. This means that the only changes to query results should be the result of intentional, not automatic, changes. + +For compaction tasks, `dropExisting` in `ioConfig` can be set to "true" for Druid to drop (mark unused) all existing segments fully contained by the interval of the compaction task. For an example of why this is important, see the suggestion for reindexing with finer granularity under [Implementation considerations](native-batch.md#implementation-considerations). WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval. + +If an ingestion task needs to write data to a segment for a time interval locked for compaction, by default the ingestion task supersedes the compaction task and the compaction task fails without finishing. 
For manual compaction tasks, you can adjust the input spec interval to avoid conflicts between ingestion and compaction. For automatic compaction, you can set the `skipOffsetFromLatest` key to adjust the auto-compaction starting point from the current time to reduce the chance of conflicts between ingestion and compaction. See [Compaction dynamic configuration](../configuration/index.md#compaction-dynamic-configuration) for more information. Another option is to set the compaction task to a higher priority than the ingestion task.
+
+### Segment granularity handling
+
+Unless you modify the segment granularity in the [granularity spec](#compaction-granularity-spec), Druid attempts to retain the granularity for the compacted segments. When segments have different segment granularities with no overlap in interval, Druid creates a separate compaction task for each so that the segment granularity is retained in the compacted segments.
+
+If segments have different segment granularities before compaction but there is some overlap in interval, Druid attempts to find the start and end of the overlapping interval and uses the closest segment granularity level for the compacted segment. For example, consider two overlapping segments: segment "A" for the interval 01/01/2021-01/02/2021 with `day` granularity and segment "B" for the interval 01/01/2021-02/01/2021. Druid attempts to combine and compact the overlapping segments. In this example, the earliest start time of the two segments is 01/01/2021 and the latest end time is 02/01/2021, so Druid compacts the segments together even though they have different segment granularities. Druid uses `month` segment granularity for the newly compacted segment even though segment A's original segment granularity was `day`.
+
+### Query granularity handling
+
+Unless you modify the query granularity in the [granularity spec](#compaction-granularity-spec), Druid retains the query granularity for the compacted segments. If segments have different query granularities before compaction, Druid chooses the finest level of granularity for the resulting compacted segment. For example, if a compaction task combines two segments, one with day query granularity and one with minute query granularity, the resulting segment uses minute query granularity.
+
+> In Apache Druid 0.21.0 and prior, Druid sets the granularity for compacted segments to the default granularity of `NONE` regardless of the query granularity of the original segments.
+
+If you configure query granularity in compaction to go from a finer granularity like month to a coarser query granularity like year, then Druid overshadows the original segments with the coarser granularity. Because the new segments have a coarser granularity, running a kill task to remove the overshadowed segments for those intervals will cause you to permanently lose the finer-granularity data.
+
+### Dimension handling
+Apache Druid supports schema changes. Therefore, dimensions can be different across segments even if they are part of the same data source. See [Different schemas among segments](../design/segments.md#different-schemas-among-segments). If the input segments have different dimensions, the resulting compacted segment includes all dimensions of the input segments.
+
+Even when the input segments have the same set of dimensions, the dimension order or the data type of dimensions can be different.
The dimensions of more recent segments take precedence over those of older segments in terms of data types and ordering, because more recent segments are more likely to have the preferred order and data types.
+
+If you want to control dimension ordering or ensure specific values for dimension types, you can configure a custom `dimensionsSpec` in the compaction task spec.
+
+### Rollup
+Druid only rolls up the output segment when `rollup` is set for all input segments.
+See [Roll-up](../ingestion/index.md#rollup) for more details.
+You can check whether your segments are rolled up by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
+
+## Setting up manual compaction
+
+To perform a manual compaction, you submit a compaction task. Compaction tasks merge all segments for the defined interval according to the following syntax:
+
+```json
+{
+  "type": "compact",
+  "id": <task_id>,
+  "dataSource": <task_datasource>,
+  "ioConfig": <IO config>,
+  "dimensionsSpec": <custom dimensionsSpec>,
+  "metricsSpec": <custom metricsSpec>,
+  "tuningConfig": <parallel indexing task tuningConfig>,
+  "granularitySpec": <compaction task granularitySpec>,
+  "context": <task context>
+}
+```
+
+|Field|Description|Required|
+|-----|-----------|--------|
+|`type`|Task type. Should be `compact`|Yes|
+|`id`|Task id|No|
+|`dataSource`|Data source name to compact|Yes|
+|`ioConfig`|I/O configuration for compaction task. See [Compaction I/O configuration](#compaction-io-configuration) for details.|Yes|
+|`dimensionsSpec`|Custom dimensions spec. The compaction task uses the specified dimensions spec if it exists instead of generating one.|No|
+|`metricsSpec`|Custom metrics spec. The compaction task uses the specified metrics spec rather than generating one.|No|
+|`segmentGranularity`|When set, the compaction task changes the segment granularity for the given interval. Deprecated. Use `granularitySpec`.|No|
+|`tuningConfig`|[Parallel indexing task tuningConfig](native-batch.md#tuningconfig). Note that your tuning config cannot contain a non-zero value for `awaitSegmentAvailabilityTimeoutMillis` because it is not supported by compaction tasks at this time.|No|
+|`context`|[Task context](./tasks.md#context)|No|
+|`granularitySpec`|Custom `granularitySpec` to describe the `segmentGranularity` and `queryGranularity` for the compacted segments. See [Compaction granularitySpec](#compaction-granularity-spec).|No|
+
+> Note: Use `granularitySpec` over `segmentGranularity` and only set one of these values. If you specify different values for these in the same compaction spec, the task fails.
+
+To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.md#compaction-dynamic-configuration) or [numShards](../ingestion/native-batch.md#tuningconfig).
+
+> You can run multiple compaction tasks in parallel. For example, if you want to compact the data for a year, you are not limited to running a single task for the entire year. You can run 12 compaction tasks with month-long intervals.
+
+A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. For example, its `inputSource` is always the [DruidInputSource](native-batch.md#druid-input-source), and `dimensionsSpec` and `metricsSpec` include all dimensions and metrics of the input segments by default.
+
+A compaction task exits without doing anything and issues a failure status code in either of the following cases:
+- The interval you specify has no data segments loaded.
+- The interval you specify is empty.
+
+Note that the metadata between input segments and the resulting compacted segments may differ if the metadata among the input segments differs as well. If all input segments have the same metadata, however, the resulting output segment will have the same metadata as all input segments.
+
+### Example compaction task
+The following JSON illustrates a compaction task to compact _all segments_ within the interval `2017-01-01/2018-01-01` and create new segments:
+
+```json
+{
+  "type" : "compact",
+  "dataSource" : "wikipedia",
+  "ioConfig" : {
+    "type": "compact",
+    "inputSpec": {
+      "type": "interval",
+      "interval": "2017-01-01/2018-01-01"
+    }
+  }
+}
+```
+
+This task doesn't specify a `granularitySpec`, so Druid retains the original segment granularity unchanged when compaction is complete.
+
+### Compaction I/O configuration
+
+The compaction `ioConfig` requires specifying `inputSpec` as follows:
+
+|Field|Description|Default|Required?|
+|-----|-----------|-------|--------|
+|`type`|Task type. Should be `compact`|none|Yes|
+|`inputSpec`|Input specification|none|Yes|
+|`dropExisting`|If `true`, then the compaction task drops (marks as unused) all existing segments fully contained by either the `interval` in the `interval` type `inputSpec` or the umbrella interval of the `segments` in the `segment` type `inputSpec` when the task publishes new compacted segments. If compaction fails, Druid does not drop or mark unused any segments. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the compaction task interval.|false|no|
+
+There are two supported `inputSpec` types.
+
+The interval `inputSpec` is:
+
+|Field|Description|Required|
+|-----|-----------|--------|
+|`type`|Task type. Should be `interval`|Yes|
+|`interval`|Interval to compact|Yes|
+
+The segments `inputSpec` is:
+
+|Field|Description|Required|
+|-----|-----------|--------|
+|`type`|Task type. Should be `segments`|Yes|
+|`segments`|A list of segment IDs|Yes|
+
+### Compaction granularity spec
+
+You can optionally use the `granularitySpec` object to configure the segment granularity and the query granularity of the compacted segments. Its syntax is as follows:
+```json
+    "type": "compact",
+    "id": <task_id>,
+    "dataSource": <task_datasource>,
+    ...
+    "granularitySpec": {
+      "segmentGranularity": <time_period>,
+      "queryGranularity": <time_period>
+    },
+    ...
+```
+
+`granularitySpec` takes the following keys:
+
+|Field|Description|Required|
+|-----|-----------|--------|
+|`segmentGranularity`|Time chunking period for the segment granularity. Defaults to 'null', which preserves the original segment granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
+|`queryGranularity`|Time chunking period for the query granularity. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values.
Not supported for automatic compaction.|No| + +For example, to set the segment granularity to "day" and the query granularity to "hour": +```json +{ + "type" : "compact", + "dataSource" : "wikipedia", + "ioConfig" : { + "type": "compact", + "inputSpec": { + "type": "interval", + "interval": "2017-01-01/2018-01-01", + }, + "granularitySpec": { + "segmentGranularity":"day", + "queryGranularity":"hour" + } + } +} +``` + +## Learn more +See the following topics for more information: +- [Segment optimization](../operations/segment-optimization.md) for guidance to determine if compaction will help in your case. +- [Compacting Segments](../design/coordinator.md#compacting-segments) for more on automatic compaction. +- See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration) +and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for automatic compaction configuration information. diff --git a/ingestion/data-formats.md b/ingestion/data-formats.md new file mode 100644 index 0000000..a74d7a0 --- /dev/null +++ b/ingestion/data-formats.md @@ -0,0 +1,2629 @@ +--- +id: data-formats +title: "Data formats" +--- + + + +Apache Druid can ingest denormalized data in JSON, CSV, or a delimited form such as TSV, or any custom format. While most examples in the documentation use data in JSON format, it is not difficult to configure Druid to ingest any other delimited data. +We welcome any contributions to new formats. + +This page lists all default and core extension data formats supported by Druid. +For additional data formats supported with community extensions, +please see our [community extensions list](../development/extensions.md#community-extensions). + +## Formatting the Data + +The following samples show data formats that are natively supported in Druid: + +_JSON_ + +```json +{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143} +{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330} +{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111} +{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900} +{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9} +``` + 
+_CSV_ + +``` +2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143 +2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330 +2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111 +2013-08-31T11:58:39Z,"Crimson Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900 +2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9 +``` + +_TSV (Delimited)_ + +``` +2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false" "false" "article" "North America" "United States" "Bay Area" "San Francisco" 57 200 -143 +2013-08-31T03:32:45Z "Striker Eureka" "en" "speed" "false" "true" "true" "false" "wikipedia" "Australia" "Australia" "Cantebury" "Syndey" 459 129 330 +2013-08-31T07:11:21Z "Cherno Alpha" "ru" "masterYi" "false" "true" "true" "false" "article" "Asia" "Russia" "Oblast" "Moscow" 123 12 111 +2013-08-31T11:58:39Z "Crimson Typhoon" "zh" "triplets" "true" "false" "true" "false" "wikipedia" "Asia" "China" "Shanxi" "Taiyuan" 905 5 900 +2013-08-31T12:41:27Z "Coyote Tango" "ja" "cancer" "true" "false" "true" "false" "wikipedia" "Asia" "Japan" "Kanto" "Tokyo" 1 10 -9 +``` + +Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting. + +Besides text formats, Druid also supports binary formats such as [Orc](#orc) and [Parquet](#parquet) formats. + +## Custom Formats + +Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for +parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers. + +## Input Format + +> The Input Format is a new way to specify the data format of your input data which was introduced in 0.17.0. +Unfortunately, the Input Format doesn't support all data formats or ingestion methods supported by Druid yet. +Especially if you want to use the Hadoop ingestion, you still need to use the [Parser](#parser). +If your data is formatted in some format not listed in this section, please consider using the Parser instead. + +All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](index.md#ioconfig). + +### JSON + +The `inputFormat` to load data of JSON format. An example is: + +```json +"ioConfig": { + "inputFormat": { + "type": "json" + }, + ... +} +``` + +The JSON `inputFormat` has the following components: + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `json`. | yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no | +| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no | + +### CSV + +The `inputFormat` to load data of the CSV format. 
An example is: + +```json +"ioConfig": { + "inputFormat": { + "type": "csv", + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"] + }, + ... +} +``` + +The CSV `inputFormat` has the following components: + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `csv`. | yes | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) | +| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing | +| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) | +| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) | + +### TSV (Delimited) + +```json +"ioConfig": { + "inputFormat": { + "type": "tsv", + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "delimiter":"|" + }, + ... +} +``` + +The `inputFormat` to load data of a delimited format. An example is: + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `tsv`. | yes | +| delimiter | String | A custom delimiter for data values. | no (default = `\t`) | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) | +| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing | +| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) | +| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) | + +Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed. + +### ORC + +> You need to include the [`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension to use the ORC input format. + +> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a higher version, +> please read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension) carefully. + +The `inputFormat` to load data of ORC format. 
An example is: + +```json +"ioConfig": { + "inputFormat": { + "type": "orc", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nested", + "expr": "$.path.to.nested" + } + ] + }, + "binaryAsString": false + }, + ... +} +``` + +The ORC `inputFormat` has the following components: + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `orc`. | yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no | +| binaryAsString | Boolean | Specifies if the binary orc column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false) | + +### Parquet + +> You need to include the [`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an extension to use the Parquet input format. + +The `inputFormat` to load data of Parquet format. An example is: + +```json +"ioConfig": { + "inputFormat": { + "type": "parquet", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nested", + "expr": "$.path.to.nested" + } + ] + }, + "binaryAsString": false + }, + ... +} +``` + +The Parquet `inputFormat` has the following components: + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +|type| String| This should be set to `parquet` to read Parquet file| yes | +|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) | +| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) | + +### Avro Stream + +> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream input format. + +> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid + +The `inputFormat` to load data of Avro format in stream ingestion. An example is: +```json +"ioConfig": { + "inputFormat": { + "type": "avro_stream", + "avroBytesDecoder": { + "type": "schema_inline", + "schema": { + //your schema goes here, for example + "namespace": "org.apache.druid.data", + "name": "User", + "type": "record", + "fields": [ + { "name": "FullName", "type": "string" }, + { "name": "Country", "type": "string" } + ] + } + }, + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "someRecord_subInt", + "expr": "$.someRecord.subInt" + } + ] + }, + "binaryAsString": false + }, + ... +} +``` + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +|type| String| This should be set to `avro_stream` to read Avro serialized data| yes | +|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) | +|`avroBytesDecoder`| JSON Object |Specifies how to decode bytes to Avro record. | yes | +| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. 
| no (default = false) | + +##### Avro Bytes Decoder + +If `type` is not included, the avroBytesDecoder defaults to `schema_repo`. + +###### Inline Schema Based Avro Bytes Decoder + +> The "schema_inline" decoder reads Avro records using a fixed schema and does not support schema migration. If you +> may need to migrate schemas in the future, consider one of the other decoders, all of which use a message header that +> allows the parser to identify the proper Avro schema for reading records. + +This decoder can be used if all the input events can be read using the same schema. In this case, specify the schema in the input task JSON itself, as described below. + +``` +... +"avroBytesDecoder": { + "type": "schema_inline", + "schema": { + //your schema goes here, for example + "namespace": "org.apache.druid.data", + "name": "User", + "type": "record", + "fields": [ + { "name": "FullName", "type": "string" }, + { "name": "Country", "type": "string" } + ] + } +} +... +``` + +###### Multiple Inline Schemas Based Avro Bytes Decoder + +Use this decoder if different input events can have different read schemas. In this case, specify the schema in the input task JSON itself, as described below. + +``` +... +"avroBytesDecoder": { + "type": "multiple_schemas_inline", + "schemas": { + //your id -> schema map goes here, for example + "1": { + "namespace": "org.apache.druid.data", + "name": "User", + "type": "record", + "fields": [ + { "name": "FullName", "type": "string" }, + { "name": "Country", "type": "string" } + ] + }, + "2": { + "namespace": "org.apache.druid.otherdata", + "name": "UserIdentity", + "type": "record", + "fields": [ + { "name": "Name", "type": "string" }, + { "name": "Location", "type": "string" } + ] + }, + ... + ... + } +} +... +``` + +Note that it is essentially a map of integer schema ID to avro schema object. This parser assumes that record has following format. + first 1 byte is version and must always be 1. + next 4 bytes are integer schema ID serialized using big-endian byte order. + remaining bytes contain serialized avro message. + +##### SchemaRepo Based Avro Bytes Decoder + +This Avro bytes decoder first extracts `subject` and `id` from the input message bytes, and then uses them to look up the Avro schema used to decode the Avro record from bytes. For details, see the [schema repo](https://github.com/schema-repo/schema-repo) and [AVRO-1124](https://issues.apache.org/jira/browse/AVRO-1124). You will need an http service like schema repo to hold the avro schema. For information on registering a schema on the message producer side, see `org.apache.druid.data.input.AvroStreamInputRowParserTest#testParse()`. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `schema_repo`. | no | +| subjectAndIdConverter | JSON Object | Specifies how to extract the subject and id from message bytes. | yes | +| schemaRepository | JSON Object | Specifies how to look up the Avro schema from subject and id. | yes | + +###### Avro-1124 Subject And Id Converter + +This section describes the format of the `subjectAndIdConverter` object for the `schema_repo` Avro bytes decoder. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `avro_1124`. | no | +| topic | String | Specifies the topic of your Kafka stream. 
| yes | + + +###### Avro-1124 Schema Repository + +This section describes the format of the `schemaRepository` object for the `schema_repo` Avro bytes decoder. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `avro_1124_rest_client`. | no | +| url | String | Specifies the endpoint url of your Avro-1124 schema repository. | yes | + +###### Confluent Schema Registry-based Avro Bytes Decoder + +This Avro bytes decoder first extracts a unique `id` from input message bytes, and then uses it to look up the schema in the Schema Registry used to decode the Avro record from bytes. +For details, see the Schema Registry [documentation](http://docs.confluent.io/current/schema-registry/docs/) and [repository](https://github.com/confluentinc/schema-registry). + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `schema_registry`. | no | +| url | String | Specifies the url endpoint of the Schema Registry. | yes | +| capacity | Integer | Specifies the max size of the cache (default = Integer.MAX_VALUE). | no | +| urls | Array | Specifies the url endpoints of the multiple Schema Registry instances. | yes(if `url` is not provided) | +| config | Json | To send additional configurations, configured for Schema Registry | no | +| headers | Json | To send headers to the Schema Registry | no | + +For a single schema registry instance, use Field `url` or `urls` for multi instances. + +Single Instance: +```json +... +"avroBytesDecoder" : { + "type" : "schema_registry", + "url" : +} +... +``` + +Multiple Instances: +```json +... +"avroBytesDecoder" : { + "type" : "schema_registry", + "urls" : [, , ...], + "config" : { + "basic.auth.credentials.source": "USER_INFO", + "basic.auth.user.info": "fred:letmein", + "schema.registry.ssl.truststore.location": "/some/secrets/kafka.client.truststore.jks", + "schema.registry.ssl.truststore.password": "", + "schema.registry.ssl.keystore.location": "/some/secrets/kafka.client.keystore.jks", + "schema.registry.ssl.keystore.password": "", + "schema.registry.ssl.key.password": "" + ... + }, + "headers": { + "traceID" : "b29c5de2-0db4-490b-b421", + "timeStamp" : "1577191871865", + ... + } +} +... +``` + +### Avro OCF + +> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro OCF input format. + +> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid + +The `inputFormat` to load data of Avro OCF format. An example is: +```json +"ioConfig": { + "inputFormat": { + "type": "avro_ocf", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "someRecord_subInt", + "expr": "$.someRecord.subInt" + } + ] + }, + "schema": { + "namespace": "org.apache.druid.data.input", + "name": "SomeDatum", + "type": "record", + "fields" : [ + { "name": "timestamp", "type": "long" }, + { "name": "eventType", "type": "string" }, + { "name": "id", "type": "long" }, + { "name": "someRecord", "type": { + "type": "record", "name": "MySubRecord", "fields": [ + { "name": "subInt", "type": "int"}, + { "name": "subLong", "type": "long"} + ] + }}] + }, + "binaryAsString": false + }, + ... 
+} +``` + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes | +|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) | +|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) | +| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) | + +### Protobuf + +> You need to include the [`druid-protobuf-extensions`](../development/extensions-core/protobuf.md) as an extension to use the Protobuf input format. + +The `inputFormat` to load data of Protobuf format. An example is: +```json +"ioConfig": { + "inputFormat": { + "type": "protobuf", + "protoBytesDecoder": { + "type": "file", + "descriptor": "file:///tmp/metrics.desc", + "protoMessageType": "Metrics" + } + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "someRecord_subInt", + "expr": "$.someRecord.subInt" + } + ] + } + }, + ... +} +``` + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes | +|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) | +|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes | + +### FlattenSpec + +The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is responsible for +bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. +An example `flattenSpec` is: + +```json +"flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { "name": "baz", "type": "root" }, + { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" }, + { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" } + ] +} +``` +> Conceptually, after input data records are read, the `flattenSpec` is applied first before +> any other specs such as [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), +> [`dimensionsSpec`](./index.md#dimensionsspec), or [`metricsSpec`](./index.md#metricsspec). Keep this in mind when writing +> your ingestion spec. + +Flattening is only supported for [data formats](data-formats.md) that support nesting, including `avro`, `json`, `orc`, +and `parquet`. + +A `flattenSpec` can have the following components: + +| Field | Description | Default | +|-------|-------------|---------| +| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).

If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
+| fields | Specifies the fields of interest and how they are accessed. [See below for details.](#field-flattening-specifications) | `[]` |
+
+#### Field flattening specifications
+
+Each entry in the `fields` list can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Options are as follows:<br/>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.<br/>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.<br/>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.
| none (required) | +| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).| none (required) | +| expr | Expression for accessing the field while flattening. For type `path`, this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`, this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation. For other types, this parameter is ignored. | none (required for types `path` and `jq`) | + +#### Notes on flattening + +* For convenience, when defining a root-level field, it is possible to define only the field name, as a string, instead of a JSON object. For example, `{"name": "baz", "type": "root"}` is equivalent to `"baz"`. +* Enabling `useFieldDiscovery` will only automatically detect "simple" fields at the root level that correspond to data types that Druid supports. This includes strings, numbers, and lists of strings or numbers. Other types will not be automatically detected, and must be specified explicitly in the `fields` list. +* Duplicate field `name`s are not allowed. An exception will be thrown. +* If `useFieldDiscovery` is enabled, any discovered field with the same name as one already defined in the `fields` list will be skipped, rather than added twice. +* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) is useful for testing `path`-type expressions. +* jackson-jq supports a subset of the full [jq](https://stedolan.github.io/jq/) syntax. Please refer to the [jackson-jq documentation](https://github.com/eiiches/jackson-jq) for details. + +## Parser + +> The Parser is deprecated for [native batch tasks](./native-batch.md), [Kafka indexing service](../development/extensions-core/kafka-ingestion.md), +and [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md). +Consider using the [input format](#input-format) instead for these types of ingestion. + +This section lists all default and core extension parsers. +For community extension parsers, please see our [community extensions list](../development/extensions.md#community-extensions). + +### String Parser + +`string` typed parsers operate on text based inputs that can be split into individual records by newlines. +Each line can be further parsed using [`parseSpec`](#parsespec). + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `string` in general, or `hadoopyString` when used in a Hadoop indexing job. | yes | +| parseSpec | JSON Object | Specifies the format, timestamp, and dimensions of the data. | yes | + +### Avro Hadoop Parser + +> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Hadoop Parser. + +> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid + +This parser is for [Hadoop batch ingestion](./hadoop.md). +The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.druid.data.input.avro.AvroValueInputFormat"`. +You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, +e.g.: `"avro.schema.input.value.path": "/path/to/your/schema.avsc"` or +`"avro.schema.input.value": "your_schema_JSON_object"`. +If the Avro reader's schema is not set, the schema in Avro object container file will be used. 
+See [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution) for more information. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `avro_hadoop`. | yes | +| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an "avro" parseSpec. | yes | +| fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no(default == false) | + +An Avro parseSpec can contain a [`flattenSpec`](#flattenspec) using either the "root" or "path" +field types, which can be used to read nested Avro records. The "jq" field type is not currently supported for Avro. + +For example, using Avro Hadoop parser with custom reader's schema file: + +```json +{ + "type" : "index_hadoop", + "spec" : { + "dataSchema" : { + "dataSource" : "", + "parser" : { + "type" : "avro_hadoop", + "parseSpec" : { + "format": "avro", + "timestampSpec": , + "dimensionsSpec": , + "flattenSpec": + } + } + }, + "ioConfig" : { + "type" : "hadoop", + "inputSpec" : { + "type" : "static", + "inputFormat": "org.apache.druid.data.input.avro.AvroValueInputFormat", + "paths" : "" + } + }, + "tuningConfig" : { + "jobProperties" : { + "avro.schema.input.value.path" : "/path/to/my/schema.avsc" + } + } + } +} +``` + +### ORC Hadoop Parser + +> You need to include the [`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension to use the ORC Hadoop Parser. + +> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a higher version, +> please read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension) carefully. + +This parser is for [Hadoop batch ingestion](./hadoop.md). +The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"org.apache.orc.mapreduce.OrcInputFormat"`. + +|Field | Type | Description | Required| +|----------|-------------|----------------------------------------------------------------------------------------|---------| +|type | String | This should say `orc` | yes| +|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data (`timeAndDims` and `orc` format) and a `flattenSpec` (`orc` format) | yes| + +The parser supports two `parseSpec` formats: `orc` and `timeAndDims`. + +`orc` supports auto field discovery and flattening, if specified with a [`flattenSpec`](#flattenspec). +If no `flattenSpec` is specified, `useFieldDiscovery` will be enabled by default. Specifying a `dimensionSpec` is +optional if `useFieldDiscovery` is enabled: if a `dimensionSpec` is supplied, the list of `dimensions` it defines will be +the set of ingested dimensions, if missing the discovered fields will make up the list. + +`timeAndDims` parse spec must specify which fields will be extracted as dimensions through the `dimensionSpec`. + +[All column types](https://orc.apache.org/docs/types.md) are supported, with the exception of `union` types. Columns of + `list` type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with +`flattenSpec` expressions. Likewise, primitive fields may be extracted from `map` and `struct` types in the same manner. +Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or `list` of +primitives, as well as any flatten expressions defined in the `flattenSpec`. 
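+
+As a minimal illustration of the flattening behavior described above, consider a hypothetical ORC file with a `someStruct` struct column and a `someList` list column (these column and field names are illustrative only, not part of any fixed schema). An `orc` parseSpec could extract individual values from them with a `flattenSpec` along these lines:
+
+```json
+"flattenSpec": {
+  "useFieldDiscovery": true,
+  "fields": [
+    { "type": "path", "name": "structSubField", "expr": "$.someStruct.subField" },
+    { "type": "path", "name": "listFirstItem", "expr": "$.someList[0]" }
+  ]
+}
+```
+
+Auto field discovery still picks up the remaining primitive columns, while the two `path` entries become dimensions named `structSubField` and `listFirstItem`, assuming no explicit `dimensionsSpec` limits the ingested dimension list.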
+ +#### Hadoop job properties +Like most Hadoop jobs, the best outcomes will add `"mapreduce.job.user.classpath.first": "true"` or +`"mapreduce.job.classloader": "true"` to the `jobProperties` section of `tuningConfig`. Note that it is likely if using +`"mapreduce.job.classloader": "true"` that you will need to set `mapreduce.job.classloader.system.classes` to include +`-org.apache.hadoop.hive.` to instruct Hadoop to load `org.apache.hadoop.hive` classes from the application jars instead +of system jars, e.g. + +```json +... + "mapreduce.job.classloader": "true", + "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml", +... +``` + +This is due to the `hive-storage-api` dependency of the +`orc-mapreduce` library, which provides some classes under the `org.apache.hadoop.hive` package. If instead using the +setting `"mapreduce.job.user.classpath.first": "true"`, then this will not be an issue. + +#### Examples + +##### `orc` parser, `orc` parseSpec, auto field discovery, flatten expressions + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "millis" + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +##### `orc` parser, `orc` parseSpec, field discovery with no flattenSpec or dimensionSpec + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "timestampSpec": { + "column": "timestamp", + "format": "millis" + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +##### `orc` parser, `orc` parseSpec, no autodiscovery + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... 
+ }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "flattenSpec": { + "useFieldDiscovery": false, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "millis" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim3", + "nestedDim", + "listDimFirstItem" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +##### `orc` parser, `timeAndDims` parseSpec +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "timeAndDims", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim2", + "dim3", + "listDim" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } +} + +``` + +### Parquet Hadoop Parser + +> You need to include the [`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an extension to use the Parquet Hadoop Parser. + +The Parquet Hadoop parser is for [Hadoop batch ingestion](./hadoop.md) and parses Parquet files directly. +The `inputFormat` of `inputSpec` in `ioConfig` must be set to `org.apache.druid.data.input.parquet.DruidParquetInputFormat`. + +The Parquet Hadoop Parser supports auto field discovery and flattening if provided with a +[`flattenSpec`](#flattenspec) with the `parquet` `parseSpec`. Parquet nested list and map +[logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) _should_ operate correctly with +JSON path expressions for all supported types. + +|Field | Type | Description | Required| +|----------|-------------|----------------------------------------------------------------------------------------|---------| +| type | String | This should say `parquet`.| yes | +| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid parseSpec formats are `timeAndDims` and `parquet` | yes | +| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no(default = false) | + +When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), +a format should not be supplied. When the format is UTF8 (String), either `auto` or a explicitly defined +[format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat) is required. + +#### Parquet Hadoop Parser vs Parquet Avro Hadoop Parser + +Both parsers read from Parquet files, but slightly differently. The main +differences are: + +* The Parquet Hadoop Parser uses a simple conversion while the Parquet Avro Hadoop Parser +converts Parquet data into avro records first with the `parquet-avro` library and then +parses avro data using the `druid-avro-extensions` module to ingest into Druid. 
+* The Parquet Hadoop Parser sets a hadoop job property +`parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive +list elements into multi-value dimensions. +* The Parquet Hadoop Parser supports `int96` Parquet values, while the Parquet Avro Hadoop Parser does not. +There may also be some subtle differences in the behavior of JSON path expression evaluation of `flattenSpec`. + +Based on those differences, we suggest using the Parquet Hadoop Parser over the Parquet Avro Hadoop Parser +to allow ingesting data beyond the schema constraints of Avro conversion. +However, the Parquet Avro Hadoop Parser was the original basis for supporting the Parquet format, and as such it is a bit more mature. + +#### Examples + +##### `parquet` parser, `parquet` parseSpec +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", + "paths": "path/to/file.parquet" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet", + "parseSpec": { + "format": "parquet", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +##### `parquet` parser, `timeAndDims` parseSpec +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", + "paths": "path/to/file.parquet" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet", + "parseSpec": { + "format": "timeAndDims", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim2", + "dim3", + "listDim" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } +} + +``` + +### Parquet Avro Hadoop Parser + +> Consider using the [Parquet Hadoop Parser](#parquet-hadoop-parser) over this parser to ingest +Parquet files. See [Parquet Hadoop Parser vs Parquet Avro Hadoop Parser](#parquet-hadoop-parser-vs-parquet-avro-hadoop-parser) +for the differences between those parsers. + +> You need to include both the [`druid-parquet-extensions`](../development/extensions-core/parquet.md) +[`druid-avro-extensions`] as extensions to use the Parquet Avro Hadoop Parser. + +The Parquet Avro Hadoop Parser is for [Hadoop batch ingestion](./hadoop.md). +This parser first converts the Parquet data into Avro records, and then parses them to ingest into Druid. +The `inputFormat` of `inputSpec` in `ioConfig` must be set to `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat`. + +The Parquet Avro Hadoop Parser supports auto field discovery and flattening if provided with a +[`flattenSpec`](#flattenspec) with the `avro` `parseSpec`. Parquet nested list and map +[logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) _should_ operate correctly with +JSON path expressions for all supported types. 
This parser sets a hadoop job property +`parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive +list elements into multi-value dimensions. + +Note that the `int96` Parquet value type is not supported with this parser. + +|Field | Type | Description | Required| +|----------|-------------|----------------------------------------------------------------------------------------|---------| +| type | String | This should say `parquet-avro`. | yes | +| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Should be `avro`. | yes | +| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no(default = false) | + +When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), +a format should not be supplied. When the format is UTF8 (String), either `auto` or +an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat) is required. + +#### Example + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat", + "paths": "path/to/file.parquet" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet-avro", + "parseSpec": { + "format": "avro", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +### Avro Stream Parser + +> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream Parser. + +> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid + +This parser is for [stream ingestion](./index.md#streaming) and reads Avro data from a stream directly. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `avro_stream`. | no | +| avroBytesDecoder | JSON Object | Specifies [`avroBytesDecoder`](#Avro Bytes Decoder) to decode bytes to Avro record. | yes | +| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Should be an "avro" parseSpec. | yes | + +An Avro parseSpec can contain a [`flattenSpec`](#flattenspec) using either the "root" or "path" +field types, which can be used to read nested Avro records. The "jq" field type is not currently supported for Avro. 
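+
+For instance, a minimal sketch of an `avro` parseSpec that pulls one root-level field and one nested field could look like the following (the field names and the path are illustrative only, not taken from a real schema):
+
+```json
+"parseSpec": {
+  "format": "avro",
+  "flattenSpec": {
+    "useFieldDiscovery": true,
+    "fields": [
+      { "name": "someRootField", "type": "root" },
+      { "name": "someNestedField", "type": "path", "expr": "$.someRecord.someSubField" }
+    ]
+  },
+  "timestampSpec": { "column": "timestamp", "format": "auto" },
+  "dimensionsSpec": { "dimensions": [] }
+}
+```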
+ +For example, using Avro stream parser with schema repo Avro bytes decoder: + +```json +"parser" : { + "type" : "avro_stream", + "avroBytesDecoder" : { + "type" : "schema_repo", + "subjectAndIdConverter" : { + "type" : "avro_1124", + "topic" : "${YOUR_TOPIC}" + }, + "schemaRepository" : { + "type" : "avro_1124_rest_client", + "url" : "${YOUR_SCHEMA_REPO_END_POINT}", + } + }, + "parseSpec" : { + "format": "avro", + "timestampSpec": , + "dimensionsSpec": , + "flattenSpec": + } +} +``` + +### Protobuf Parser + +> You need to include the [`druid-protobuf-extensions`](../development/extensions-core/protobuf.md) as an extension to use the Protobuf Parser. + +This parser is for [stream ingestion](./index.md#streaming) and reads Protocol buffer data from a stream directly. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `protobuf`. | yes | +| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to Protobuf record. | yes | +| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. The format must be JSON. See [JSON ParseSpec](./index.md) for more configuration options. Note that timeAndDims parseSpec is no longer supported. | yes | + +Sample spec: + +```json +"parser": { + "type": "protobuf", + "protoBytesDecoder": { + "type": "file", + "descriptor": "file:///tmp/metrics.desc", + "protoMessageType": "Metrics" + }, + "parseSpec": { + "format": "json", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "unit", + "http_method", + "http_code", + "page", + "metricType", + "server" + ], + "dimensionExclusions": [ + "timestamp", + "value" + ] + } + } +} +``` + +See the [extension description](../development/extensions-core/protobuf.md) for +more details and examples. + +#### Protobuf Bytes Decoder + +If `type` is not included, the `protoBytesDecoder` defaults to `schema_registry`. + +##### File-based Protobuf Bytes Decoder + +This Protobuf bytes decoder first read a descriptor file, and then parse it to get schema used to decode the Protobuf record from bytes. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `file`. | yes | +| descriptor | String | Protobuf descriptor file name in the classpath or URL. | yes | +| protoMessageType | String | Protobuf message type in the descriptor. Both short name and fully qualified name are accepted. The parser uses the first message type found in the descriptor if not specified. | no | + +Sample spec: + +```json +"protoBytesDecoder": { + "type": "file", + "descriptor": "file:///tmp/metrics.desc", + "protoMessageType": "Metrics" +} +``` + +##### Confluent Schema Registry-based Protobuf Bytes Decoder + +This Protobuf bytes decoder first extracts a unique `id` from input message bytes, and then uses it to look up the schema in the Schema Registry used to decode the Avro record from bytes. +For details, see the Schema Registry [documentation](http://docs.confluent.io/current/schema-registry/docs/) and [repository](https://github.com/confluentinc/schema-registry). + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| type | String | This should say `schema_registry`. | yes | +| url | String | Specifies the url endpoint of the Schema Registry. | yes | +| capacity | Integer | Specifies the max size of the cache (default = Integer.MAX_VALUE). 
| no | +| urls | Array | Specifies the url endpoints of the multiple Schema Registry instances. | yes(if `url` is not provided) | +| config | Json | To send additional configurations, configured for Schema Registry | no | +| headers | Json | To send headers to the Schema Registry | no | + +For a single schema registry instance, use Field `url` or `urls` for multi instances. + +Single Instance: + +```json +... +"protoBytesDecoder": { + "url": , + "type": "schema_registry" +} +... +``` + +Multiple Instances: +```json +... +"protoBytesDecoder": { + "urls": [, , ...], + "type": "schema_registry", + "capacity": 100, + "config" : { + "basic.auth.credentials.source": "USER_INFO", + "basic.auth.user.info": "fred:letmein", + "schema.registry.ssl.truststore.location": "/some/secrets/kafka.client.truststore.jks", + "schema.registry.ssl.truststore.password": "", + "schema.registry.ssl.keystore.location": "/some/secrets/kafka.client.keystore.jks", + "schema.registry.ssl.keystore.password": "", + "schema.registry.ssl.key.password": "", + ... + }, + "headers": { + "traceID" : "b29c5de2-0db4-490b-b421", + "timeStamp" : "1577191871865", + ... + } +} +... +``` + +## ParseSpec + +> The Parser is deprecated for [native batch tasks](./native-batch.md), [Kafka indexing service](../development/extensions-core/kafka-ingestion.md), +and [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md). +Consider using the [input format](#input-format) instead for these types of ingestion. + +ParseSpecs serve two purposes: + +- The String Parser use them to determine the format (i.e., JSON, CSV, TSV) of incoming rows. +- All Parsers use them to determine the timestamp and dimensions of incoming rows. + +If `format` is not included, the parseSpec defaults to `tsv`. + +### JSON ParseSpec + +Use this with the String Parser to load JSON. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `json`. | no | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no | + +Sample spec: + +```json +"parseSpec": { + "format" : "json", + "timestampSpec" : { + "column" : "timestamp" + }, + "dimensionSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` + +### JSON Lowercase ParseSpec + +> The _jsonLowercase_ parser is deprecated and may be removed in a future version of Druid. + +This is a special variation of the JSON ParseSpec that lower cases all the column names in the incoming JSON data. This parseSpec is required if you are updating to Druid 0.7.x from Druid 0.6.x, are directly ingesting JSON with mixed case column names, do not have any ETL in place to lower case those column names, and would like to make queries that include the data you created using 0.6.x and 0.7.x. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `jsonLowercase`. | yes | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | + +### CSV ParseSpec + +Use this with the String Parser to load CSV. 
Strings are parsed using the com.opencsv library. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `csv`. | yes | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) | +| columns | JSON array | Specifies the columns of the data. | yes | + +Sample spec: + +```json +"parseSpec": { + "format" : "csv", + "timestampSpec" : { + "column" : "timestamp" + }, + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` + +#### CSV Index Tasks + +If your input files contain a header, the `columns` field is optional and you don't need to set. +Instead, you can set the `hasHeaderRow` field to true, which makes Druid automatically extract the column information from the header. +Otherwise, you must set the `columns` field and ensure that field must match the columns of your input data in the same order. + +Also, you can skip some header rows by setting `skipHeaderRows` in your parseSpec. If both `skipHeaderRows` and `hasHeaderRow` options are set, +`skipHeaderRows` is first applied. For example, if you set `skipHeaderRows` to 2 and `hasHeaderRow` to true, Druid will +skip the first two lines and then extract column information from the third line. + +Note that `hasHeaderRow` and `skipHeaderRows` are effective only for non-Hadoop batch index tasks. Other types of index +tasks will fail with an exception. + +#### Other CSV Ingestion Tasks + +The `columns` field must be included and and ensure that the order of the fields matches the columns of your input data in the same order. + +### TSV / Delimited ParseSpec + +Use this with the String Parser to load any delimited text that does not require special escaping. By default, +the delimiter is a tab, so this will load TSV. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `tsv`. | yes | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | +| delimiter | String | A custom delimiter for data values. | no (default = \t) | +| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) | +| columns | JSON String array | Specifies the columns of the data. | yes | + +Sample spec: + +```json +"parseSpec": { + "format" : "tsv", + "timestampSpec" : { + "column" : "timestamp" + }, + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "delimiter":"|", + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` + +Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed. 
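+
+As a sketch of the `listDelimiter` option (the column names here are made up for illustration), a delimited parseSpec that also splits a multi-value "tags" column on commas might look like:
+
+```json
+"parseSpec": {
+  "format" : "tsv",
+  "timestampSpec" : {
+    "column" : "timestamp"
+  },
+  "columns" : ["timestamp", "page", "tags"],
+  "delimiter" : "|",
+  "listDelimiter" : ",",
+  "dimensionsSpec" : {
+    "dimensions" : ["page", "tags"]
+  }
+}
+```
+
+With such a spec, a field value like `news,sports` in the "tags" column would be ingested as a multi-value dimension.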
+ +#### TSV (Delimited) Index Tasks + +If your input files contain a header, the `columns` field is optional and doesn't need to be set. +Instead, you can set the `hasHeaderRow` field to true, which makes Druid automatically extract the column information from the header. +Otherwise, you must set the `columns` field and ensure that field must match the columns of your input data in the same order. + +Also, you can skip some header rows by setting `skipHeaderRows` in your parseSpec. If both `skipHeaderRows` and `hasHeaderRow` options are set, +`skipHeaderRows` is first applied. For example, if you set `skipHeaderRows` to 2 and `hasHeaderRow` to true, Druid will +skip the first two lines and then extract column information from the third line. + +Note that `hasHeaderRow` and `skipHeaderRows` are effective only for non-Hadoop batch index tasks. Other types of index +tasks will fail with an exception. + +#### Other TSV (Delimited) Ingestion Tasks + +The `columns` field must be included and and ensure that the order of the fields matches the columns of your input data in the same order. + +### Multi-value dimensions + +Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`. + +JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed. + +### Regex ParseSpec + +```json +"parseSpec":{ + "format" : "regex", + "timestampSpec" : { + "column" : "timestamp" + }, + "dimensionsSpec" : { + "dimensions" : [] + }, + "columns" : [], + "pattern" : +} +``` + +The `columns` field must match the columns of your regex matching groups in the same order. If columns are not provided, default +columns names ("column_1", "column2", ... "column_n") will be assigned. Ensure that your column names include all your dimensions. + +### JavaScript ParseSpec + +```json +"parseSpec":{ + "format" : "javascript", + "timestampSpec" : { + "column" : "timestamp" + }, + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + }, + "function" : "function(str) { var parts = str.split(\"-\"); return { one: parts[0], two: parts[1] } }" +} +``` + +Note with the JavaScript parser that data must be fully parsed and returned as a `{key:value}` format in the JS logic. +This means any flattening or parsing multi-dimensional values must be done here. + +> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it. + +### TimeAndDims ParseSpec + +Use this with non-String Parsers to provide them with timestamp and dimensions information. Non-String Parsers +handle all formatting decisions on their own, without using the ParseSpec. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `timeAndDims`. | yes | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | + +### Orc ParseSpec + +Use this with the Hadoop ORC Parser to load ORC files. 
+ +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `orc`. | no | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no | + +### Parquet ParseSpec + +Use this with the Hadoop Parquet Parser to load Parquet files. + +| Field | Type | Description | Required | +|-------|------|-------------|----------| +| format | String | This should say `parquet`. | no | +| timestampSpec | JSON Object | Specifies the column and format of the timestamp. | yes | +| dimensionsSpec | JSON Object | Specifies the dimensions of the data. | yes | +| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no | + + + +## 数据格式 +Apache Druid可以接收JSON、CSV或TSV等分隔格式或任何自定义格式的非规范化数据。尽管文档中的大多数示例使用JSON格式的数据,但将Druid配置为接收任何其他分隔数据并不困难。我们欢迎对新格式的任何贡献。 + +此页列出了Druid支持的所有默认和核心扩展数据格式。有关社区扩展支持的其他数据格式,请参阅我们的 [社区扩展列表](../configuration/logging.md#社区扩展)。 + +### 格式化数据 + +下面的示例显示了在Druid中原生支持的数据格式: + +*JSON* +```json +{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143} +{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330} +{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111} +{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900} +{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9} +``` + +*CSV* +```json +2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143 +2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330 +2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111 +2013-08-31T11:58:39Z,"Crimson 
Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900 +2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9 +``` + +*TSV(Delimited)* +```json +2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false" "false" "article" "North America" "United States" "Bay Area" "San Francisco" 57 200 -143 +2013-08-31T03:32:45Z "Striker Eureka" "en" "speed" "false" "true" "true" "false" "wikipedia" "Australia" "Australia" "Cantebury" "Syndey" 459 129 330 +2013-08-31T07:11:21Z "Cherno Alpha" "ru" "masterYi" "false" "true" "true" "false" "article" "Asia" "Russia" "Oblast" "Moscow" 123 12 111 +2013-08-31T11:58:39Z "Crimson Typhoon" "zh" "triplets" "true" "false" "true" "false" "wikipedia" "Asia" "China" "Shanxi" "Taiyuan" 905 5 900 +2013-08-31T12:41:27Z "Coyote Tango" "ja" "cancer" "true" "false" "true" "false" "wikipedia" "Asia" "Japan" "Kanto" "Tokyo" 1 10 -9 +``` + +请注意,CSV和TSV数据不包含列标题。当您指定要摄取的数据时,这一点就变得很重要。 + +除了文本格式,Druid还支持二进制格式,比如 [Orc](#orc) 和 [Parquet](#parquet) 格式。 + +### 定制格式 + +Druid支持自定义数据格式,可以使用 `Regex` 解析器或 `JavaScript` 解析器来解析这些格式。请注意,使用这些解析器中的任何一个来解析数据都不如编写原生Java解析器或使用外部流处理器那样高效。我们欢迎新解析器的贡献。 + +### InputFormat + +> [!WARNING] +> 输入格式是在0.17.0中引入的指定输入数据的数据格式的新方法。不幸的是,输入格式还不支持Druid支持的所有数据格式或摄取方法。特别是如果您想使用Hadoop接收,您仍然需要使用 [解析器](#parser)。如果您的数据是以本节未列出的某种格式格式化的,请考虑改用解析器。 + +所有形式的Druid摄取都需要某种形式的schema对象。要摄取的数据的格式是使用[`ioConfig`](/ingestion.md#ioConfig) 中的 `inputFormat` 条目指定的。 + +#### JSON + +**JSON** +一个加载JSON格式数据的 `inputFormat` 示例: +```json +"ioConfig": { + "inputFormat": { + "type": "json" + }, + ... +} +``` +JSON `inputFormat` 有以下组件: + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 填 `json` | 是 | +| flattenSpec | JSON对象 | 指定嵌套JSON数据的展平配置。更多信息请参见[flattenSpec](#flattenspec) | 否 | +| featureSpec | JSON对象 | Jackson库支持的 [JSON解析器特性](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) 。这些特性将在解析输入JSON数据时应用。 | 否 | + + +#### CSV +一个加载CSV格式数据的 `inputFormat` 示例: +```json +"ioConfig": { + "inputFormat": { + "type": "csv", + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"] + }, + ... +} +``` + +CSV `inputFormat` 有以下组件: + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 填 `csv` | 是 | +| listDelimiter | String | 多值维度的定制分隔符 | 否(默认ctrl + A) | +| columns | JSON数组 | 指定数据的列。列的顺序应该与数据列的顺序相同。 | 如果 `findColumnsFromHeader` 设置为 `false` 或者缺失, 则为必填项 | +| findColumnsFromHeader | 布尔 | 如果设置了此选项,则任务将从标题行中查找列名。请注意,在从标题中查找列名之前,将首先使用 `skipHeaderRows`。例如,如果将 `skipHeaderRows` 设置为2,将 `findColumnsFromHeader` 设置为 `true`,则任务将跳过前两行,然后从第三行提取列信息。该项如果设置为true,则将忽略 `columns` | 否(如果 `columns` 被设置则默认为 `false`, 否则为null) | +| skipHeaderRows | 整型数值 | 该项如果设置,任务将略过 `skipHeaderRows`配置的行数 | 否(默认为0) | + +#### TSV(Delimited) +```json +"ioConfig": { + "inputFormat": { + "type": "tsv", + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "delimiter":"|" + }, + ... 
+} +``` +TSV `inputFormat` 有以下组件: + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 填 `tsv` | 是 | +| delimiter | String | 数据值的自定义分隔符 | 否(默认为 `\t`) | +| listDelimiter | String | 多值维度的定制分隔符 | 否(默认ctrl + A) | +| columns | JSON数组 | 指定数据的列。列的顺序应该与数据列的顺序相同。 | 如果 `findColumnsFromHeader` 设置为 `false` 或者缺失, 则为必填项 | +| findColumnsFromHeader | 布尔 | 如果设置了此选项,则任务将从标题行中查找列名。请注意,在从标题中查找列名之前,将首先使用 `skipHeaderRows`。例如,如果将 `skipHeaderRows` 设置为2,将 `findColumnsFromHeader` 设置为 `true`,则任务将跳过前两行,然后从第三行提取列信息。该项如果设置为true,则将忽略 `columns` | 否(如果 `columns` 被设置则默认为 `false`, 否则为null) | +| skipHeaderRows | 整型数值 | 该项如果设置,任务将略过 `skipHeaderRows`配置的行数 | 否(默认为0) | + +请确保将分隔符更改为适合于数据的分隔符。与CSV一样,您必须指定要索引的列和列的子集。 + +#### ORC + +> [!WARNING] +> 使用ORC输入格式之前,首先需要包含 [druid-orc-extensions](../development/orc-extensions.md) + +> [!WARNING] +> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。 + +一个加载ORC格式数据的 `inputFormat` 示例: +```json +"ioConfig": { + "inputFormat": { + "type": "orc", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nested", + "expr": "$.path.to.nested" + } + ] + } + "binaryAsString": false + }, + ... +} +``` + +ORC `inputFormat` 有以下组件: + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 填 `orc` | 是 | +| flattenSpec | JSON对象 | 指定嵌套JSON数据的展平配置。更多信息请参见[flattenSpec](#flattenspec) | 否 | +| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | + +#### Parquet + +> [!WARNING] +> 使用Parquet输入格式之前,首先需要包含 [druid-parquet-extensions](../development/parquet-extensions.md) + +一个加载Parquet格式数据的 `inputFormat` 示例: +```json +"ioConfig": { + "inputFormat": { + "type": "parquet", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nested", + "expr": "$.path.to.nested" + } + ] + } + "binaryAsString": false + }, + ... +} +``` + +Parquet `inputFormat` 有以下组件: + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 填 `parquet` | 是 | +| flattenSpec | JSON对象 | 定义一个 [flattenSpec](#flattenspec) 从Parquet文件提取嵌套的值。注意,只支持"path"表达式('jq'不可用)| 否(默认自动发现根级别的属性) | +| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | + +#### FlattenSpec + +`flattenSpec` 位于 `inputFormat` -> `flattenSpec` 中,负责将潜在的嵌套输入数据(如JSON、Avro等)和Druid的平面数据模型之间架起桥梁。 `flattenSpec` 示例如下: +```json +"flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { "name": "baz", "type": "root" }, + { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" }, + { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" } + ] +} +``` +> [!WARNING] +> 概念上,输入数据被读取后,Druid会以一个特定的顺序来对数据应用摄入规范: 首先 `flattenSpec`(如果有),然后 `timestampSpec`, 然后 `transformSpec` ,最后是 `dimensionsSpec` 和 `metricsSpec`。在编写摄入规范时需要牢记这一点 + +展平操作仅仅支持嵌套的 [数据格式](dataformats.md), 包括:`avro`, `json`, `orc` 和 `parquet`。 + +`flattenSpec` 有以下组件: + +| 字段 | 描述 | 默认值 | +|-|-|-| +| useFieldDiscovery | 如果为true,则将所有根级字段解释为可用字段,供 [`timestampSpec`](/ingestion.md#timestampSpec)、[`transformSpec`](/ingestion.md#transformSpec)、[`dimensionsSpec`](/ingestion.md#dimensionsSpec) 和 [`metricsSpec`](/ingestion.md#metricsSpec) 使用。

如果为false，则只有显式指定的字段（请参阅 `fields`）才可供使用。 | true |
+| fields | 指定感兴趣的字段及其访问方式, 详细请见下边 | `[]` |
+
+**字段展平规范**
+
+`fields` 列表中的每个条目都可以包含以下组件：
+
+| 字段 | 描述 | 默认值 |
+|-|-|-|
+| type | 可选项如下：<br/>• `root`：引用记录根级别的字段，只有当 `useFieldDiscovery` 为 false 时才真正有用。<br/>• `path`：引用使用 JsonPath 表示法的字段，支持大多数提供嵌套的数据格式，包括 `avro`、`json`、`orc` 和 `parquet`。<br/>• `jq`：引用使用 jackson-jq 表示法的字段，仅仅支持 `json` 格式。 | none（必填） |
+| name | 展平后的字段名称。这个名称可以被 `timestampSpec`、`transformSpec`、`dimensionsSpec` 和 `metricsSpec` 引用。 | none（必填） |
+| expr | 用于在展平时访问字段的表达式。对于 `path` 类型，这应该是 JsonPath；对于 `jq` 类型，这应该是 jackson-jq 表达式；对于其他类型，将忽略此参数。 | none（对于 `path` 和 `jq` 类型为必填） |
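+
+下面给出一个最小化的示意（输入记录为假设的示例数据）：按照前文 `flattenSpec` 示例中的三个字段定义，读取如下输入时会分别得到 `baz`、`foo_bar` 和 `first_food` 三个可用字段：
+
+```json
+{
+  "timestamp": "2013-08-31T01:02:33Z",
+  "baz": "rootValue",
+  "foo": { "bar": "nestedValue" },
+  "thing": { "food": ["apple", "pizza"] }
+}
+```
+
+其中 `foo_bar` 得到 "nestedValue"，`first_food` 得到 "pizza"（`.thing.food[1]` 取列表中的第二个元素）。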
+ +**展平操作的注意事项** + +* 为了方便起见,在定义根级字段时,可以只将字段名定义为字符串,而不是JSON对象。例如 `{"name": "baz", "type": "root"}` 等价于 `baz` +* 启用 `useFieldDiscovery` 只会在根级别自动检测与Druid支持的数据类型相对应的"简单"字段, 这包括字符串、数字和字符串或数字列表。不会自动检测到其他类型,其他类型必须在 `fields` 列表中显式指定 +* 不允许重复字段名(`name`), 否则将引发异常 +* 如果启用 `useFieldDiscovery`,则将跳过与字段列表中已定义的字段同名的任何已发现字段,而不是添加两次 +* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) 对于测试 `path`-类型表达式非常有用 +* jackson jq支持完整 [`jq`](https://stedolan.github.io/jq/)语法的一个子集。有关详细信息,请参阅 [jackson jq](https://github.com/eiiches/jackson-jq) 文档 + +### Parser + +> [!WARNING] +> parser在 [本地批任务](native.md), [Kafka索引任务](kafka.md) 和 [Kinesis索引任务](kinesis.md) 中已经废弃,在这些类型的摄入方式中考虑使用 [inputFormat](#数据格式) + +该部分列出来了所有默认的以及核心扩展中的解析器。对于社区的扩展解析器,请参见 [社区扩展列表](../development/extensions.md#社区扩展) + +#### String Parser + +`string` 类型的解析器对基于文本的输入进行操作,这些输入可以通过换行符拆分为单独的记录, 可以使用 [`parseSpec`](#parsespec) 进一步分析每一行。 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | string | 一般是 `string`, 在Hadoop索引任务中为 `hadoopyString` | 是 | +| parseSpec | JSON对象 | 指定格式,数据的timestamp和dimensions | 是 | + +#### Avro Hadoop Parser + +> [!WARNING] +> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用 Avro Hadoop解析器 + +该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.avro.AvroValueInputFormat`。您可能想在 `tuningConfig` 中的 `jobProperties` 选项设置Avro reader的schema, 例如:`"avro.schema.input.value.path": "/path/to/your/schema.avsc"` 或者 `"avro.schema.input.value": "your_schema_JSON_object"`。如果未设置Avro读取器的schema,则将使用Avro对象容器文件中的schema,详情可以参见 [avro规范](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution) + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 应该填 `avro_hadoop` | 是 | +| parseSpec | JSON对象 | 指定数据的时间戳和维度。应该是“avro”语法规范。| 是 | + +Avro parseSpec可以包含使用"root"或"path"字段类型的 [flattenSpec](#flattenspec),这些字段类型可用于读取嵌套的Avro记录。Avro当前不支持“jq”字段类型。 + +例如,使用带有自定义读取器schema文件的Avro Hadoop解析器: +```json +{ + "type" : "index_hadoop", + "spec" : { + "dataSchema" : { + "dataSource" : "", + "parser" : { + "type" : "avro_hadoop", + "parseSpec" : { + "format": "avro", + "timestampSpec": , + "dimensionsSpec": , + "flattenSpec": + } + } + }, + "ioConfig" : { + "type" : "hadoop", + "inputSpec" : { + "type" : "static", + "inputFormat": "org.apache.druid.data.input.avro.AvroValueInputFormat", + "paths" : "" + } + }, + "tuningConfig" : { + "jobProperties" : { + "avro.schema.input.value.path" : "/path/to/my/schema.avsc" + } + } + } +} +``` + +#### ORC Hadoop Parser + +> [!WARNING] +> 需要添加 [druid-orc-extensions](../development/orc-extensions.md) 来使用ORC Hadoop解析器 + +> [!WARNING] +> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。 + +该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.orc.mapreduce.OrcInputFormat`。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 应该填 `orc` | 是 | +| parseSpec | JSON对象 | 指定数据(`timeAndDim` 和 `orc` 格式)的时间戳和维度和一个`flattenSpec`(`orc`格式)| 是 | + +解析器支持两种 `parseSpec` 格式: `orc` 和 `timeAndDims` + +`orc` 支持字段的自动发现和展平(如果指定了 [flattenSpec](#flattenspec)。如果未指定展平规范,则默认情况下将启用 `useFieldDiscovery`。如果启用了 `useFieldDiscovery`,则指定`dimensionSpec` 是可选的:如果提供了 `dimensionSpec`,则它定义的维度列表将是摄取维度的集合,如果缺少发现的字段将构成该列表。 + +`timeAndDims` 解析规范必须通过 `dimensionSpec` 指定哪些字段将提取为维度。 + +支持所有 [列类型](https://orc.apache.org/docs/types.html) ,但 `union` 类型除外。`list` 类型的列(如果用基本类型填充)可以用作多值维度,或者可以使用 [flattenSpec](#flattenspec) 表达式提取特定元素。同样,可以用同样的方式从 `map` 和 `struct` 类型中提取基本字段。自动字段发现将自动为每个(非时间戳)基本类型或基本类型 `list` 以及 `flattenSpec` 
中定义的任何展平表达式创建字符串维度。 + +**Hadoop job属性** + +像大多数Hadoop作业,最佳结果是在 `tuningConfig` 中的 `jobProperties` 中添加 `"mapreduce.job.user.classpath.first": "true"` 或者 `"mapreduce.job.classloader": "true"`。 注意,如果使用了 `"mapreduce.job.classloader": "true"`, 需要设置 `mapreduce.job.classloader.system.classes` 包含 `-org.apache.hadoop.hive.` 来让Hadoop从应用jars包中加载 `org.apache.hadoop.hive` 而非从系统jar中,例如: + +```json +... + "mapreduce.job.classloader": "true", + "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml", +... +``` + +这是因为 `orc-mapreduce` 库的配置单元 `hive-storage-api` 依赖关系,它在 `org.apache.hadoop.hive` 包下提供了一些类。如果改为使用`"mapreduce.job.user.classpath.first":"true"`设置,则不会出现此问题。 + +**示例** + +**`orc` parser, `orc` parseSpec, 自动字段发现, 展平表达式** +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "millis" + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` + +**`orc` parser, `orc` parseSpec, 不具有 `flattenSpec` 或者 `dimensionSpec`的字段发现** + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "timestampSpec": { + "column": "timestamp", + "format": "millis" + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` +**`orc` parser, `orc` parseSpec, 非自动发现** + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "orc", + "flattenSpec": { + "useFieldDiscovery": false, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "millis" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim3", + "nestedDim", + "listDimFirstItem" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... 
+ }, + "tuningConfig": + } + } +} +``` + +**`orc` parser, `timeAndDims` parseSpec** + +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", + "paths": "path/to/file.orc" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "orc", + "parseSpec": { + "format": "timeAndDims", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim2", + "dim3", + "listDim" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } +} +``` + +#### Parquet Hadoop Parser + +> [!WARNING] +> 需要添加 [druid-parquet-extensions](../development/parquet-extensions.md) 来使用Parquet Hadoop解析器 + +该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetInputFormat`。 + +Parquet Hadoop 解析器支持自动字段发现,如果提供了一个带有 `parquet` `parquetSpec`的 `flattenSpec` 也支持展平。 Parquet嵌套 list 和 map [逻辑类型](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 应与所有受支持类型的JSON path表达式一起正确操作。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 应该填 `parquet` | 是 | +| parseSpec | JSON对象 | 指定数据的时间戳和维度和一个可选的 `flattenSpec`。有效的 `parseSpec` 格式是 `timeAndDims` 和 `parquet` | 是 | +| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | + +当时间维度是一个 [date类型的列](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), 则无需指定一个格式。 当格式为UTF8的String, 则要么指定为 `auto`,或者显式的指定一个 [时间格式](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)。 + +**Parquet Hadoop解析器 vs Parquet Avro Hadoop解析器** +两者都是从Parquet文件中读取,但是又轻微的不同。主要不同之处是: +* Parquet Hadoop解析器使用简单的转换,而Parquet Avro Hadoop解析器首先使用 `parquet-avro` 库将Parquet数据转换为Avro记录,然后使用 `druid-avro-extensions` 模块将Avro数据解析为druid +* Parquet Hadoop解析器将Hadoop作业属性 `parquet.avro.add-list-element-records` 设置为false(通常默认为true),以便将原始列表元素"展开"为多值维度 +* Parquet Hadoop解析器支持 `int96` Parquet值,而 Parquet Avro Hadoop解析器不支持。`flatteSpec` 的JSON path表达式求值的行为也可能存在一些细微的差异 + +基于这些差异,我们建议在Parquet avro hadoop解析器上使用Parquet Hadoop解析器,以允许摄取超出Avro转换模式约束的数据。然而,Parquet Avro Hadoop解析器是支持Parquet格式的原始基础,因此它更加成熟。 + +**示例** + +`parquet` parser, `parquet` parseSpec +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", + "paths": "path/to/file.parquet" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet", + "parseSpec": { + "format": "parquet", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } + } +} +``` +`parquet` parser, `timeAndDims` parseSpec +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", + "paths": "path/to/file.parquet" + }, + ... 
+ }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet", + "parseSpec": { + "format": "timeAndDims", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "dim1", + "dim2", + "dim3", + "listDim" + ], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... + }, + "tuningConfig": + } +} +``` + +#### Parquet Avro Hadoop Parser + +> [!WARNING] +> 考虑在该解析器之上使用 [Parquet Hadoop Parser](#parquet-hadoop-parser) 来摄取Parquet文件。 两者之间的不同之处参见 [Parquet Hadoop解析器 vs Parquet Avro Hadoop解析器]() 部分 + +> [!WARNING] +> 使用Parquet Avro Hadoop Parser需要同时加入 [druid-parquet-extensions](../development/parquet-extensions.md) 和 [druid-avro-extensions](../development/avro-extensions.md) + +该解析器用于 [Hadoop批摄取](hadoop.md), 该解析器首先将Parquet数据转换为Avro记录,然后再解析它们后摄入到Druid。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat`。 + +Parquet Avro Hadoop 解析器支持自动字段发现,如果提供了一个带有 `avro` `parquetSpec`的 `flattenSpec` 也支持展平。 Parquet嵌套 list 和 map [逻辑类型](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 应与所有受支持类型的JSON path表达式一起正确操作。该解析器将Hadoop作业属性 `parquet.avro.add-list-element-records` 设置为false(通常默认为true),以便将原始列表元素"展开"为多值维度。 + +注意,`int96` Parquet值类型在该解析器中是不支持的。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| type | String | 应该填 `parquet-avro` | 是 | +| parseSpec | JSON对象 | 指定数据的时间戳和维度和一个可选的 `flattenSpec`, 应该是 `avro` | 是 | +| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | + +当时间维度是一个 [date类型的列](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), 则无需指定一个格式。 当格式为UTF8的String, 则要么指定为 `auto`,或者显式的指定一个 [时间格式](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)。 + +**示例** +```json +{ + "type": "index_hadoop", + "spec": { + "ioConfig": { + "type": "hadoop", + "inputSpec": { + "type": "static", + "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat", + "paths": "path/to/file.parquet" + }, + ... + }, + "dataSchema": { + "dataSource": "example", + "parser": { + "type": "parquet-avro", + "parseSpec": { + "format": "avro", + "flattenSpec": { + "useFieldDiscovery": true, + "fields": [ + { + "type": "path", + "name": "nestedDim", + "expr": "$.nestedData.dim1" + }, + { + "type": "path", + "name": "listDimFirstItem", + "expr": "$.listDim[1]" + } + ] + }, + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [], + "dimensionExclusions": [], + "spatialDimensions": [] + } + } + }, + ... 
+ }, + "tuningConfig": + } + } +} +``` + +#### Avro Stream Parser + +> [!WARNING] +> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用Avro Stream解析器 + +该解析器用于 [流式摄取](streamingest.md), 直接从一个流来读取数据。 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `avro_stream` | 否 | +| avroBytesDecoder | JSON对象 | 指定如何对Avro记录进行解码 | 是 | +| parseSpec | JSON对象 | 指定数据的时间戳和维度。 应该是一个 `avro` parseSpec | 是 | + +Avro parseSpec包含一个使用"root"或者"path"类型的 [`flattenSpec`](ingestion.md#flattenspec.md), 以便可以用来读取嵌套的avro数据。 "jq"类型在Avro中目前还不支持。 + +以下示例展示了一个具有**schema repo**avro解码器的 `Avro stream parser`: +```json +"parser" : { + "type" : "avro_stream", + "avroBytesDecoder" : { + "type" : "schema_repo", + "subjectAndIdConverter" : { + "type" : "avro_1124", + "topic" : "${YOUR_TOPIC}" + }, + "schemaRepository" : { + "type" : "avro_1124_rest_client", + "url" : "${YOUR_SCHEMA_REPO_END_POINT}", + } + }, + "parseSpec" : { + "format": "avro", + "timestampSpec": , + "dimensionsSpec": , + "flattenSpec": + } +} +``` + +**Avro Bytes Decoder** + +如果 `type` 未被指定, `avroBytesDecoder` 默认使用 `schema_repo`。 + +**基于Avro Bytes Decoder的 `inline schema`** + +> [!WARNING] +> "schema_inline"解码器使用固定schema读取Avro记录,不支持schema迁移。如果将来可能需要迁移schema,请考虑其他解码器之一,所有解码器都使用一个消息头,该消息头允许解析器识别正确的Avro schema以读取记录。 + +如果可以使用同一schema读取所有输入事件,则可以使用此解码器。在这种情况下,在输入任务JSON本身中指定schema,如下所述: +```json +... +"avroBytesDecoder": { + "type": "schema_inline", + "schema": { + //your schema goes here, for example + "namespace": "org.apache.druid.data", + "name": "User", + "type": "record", + "fields": [ + { "name": "FullName", "type": "string" }, + { "name": "Country", "type": "string" } + ] + } +} +... +``` +**基于Avro Bytes Decoder的 `multiple inline schemas`** + +如果不同的输入事件可以有不同的读取schema,请使用此解码器。在这种情况下,在输入任务JSON本身中指定schema,如下所述: +```json +... +"avroBytesDecoder": { + "type": "multiple_schemas_inline", + "schemas": { + //your id -> schema map goes here, for example + "1": { + "namespace": "org.apache.druid.data", + "name": "User", + "type": "record", + "fields": [ + { "name": "FullName", "type": "string" }, + { "name": "Country", "type": "string" } + ] + }, + "2": { + "namespace": "org.apache.druid.otherdata", + "name": "UserIdentity", + "type": "record", + "fields": [ + { "name": "Name", "type": "string" }, + { "name": "Location", "type": "string" } + ] + }, + ... + ... + } +} +... 
+``` +注意,它本质上是一个整数Schema ID到avro schema对象的映射。此解析器假定记录具有以下格式。第一个1字节是版本,必须始终为1, 接下来的4个字节是使用大端字节顺序序列化的整数模式ID。其余字节包含序列化的avro消息。 + +**基于Avro Bytes Decoder的 `SchemaRepo`** + +Avro Bytes Decorder首先提取输入消息的 `subject` 和 `id`, 然后使用她们去查找用来解码Avro记录的Avro schema,详情可以参见 [Schema repo](https://github.com/schema-repo/schema-repo) 和 [AVRO-1124](https://issues.apache.org/jira/browse/AVRO-1124) 。 您需要一个类似schema repo的http服务来保存avro模式。有关在消息生成器端注册架构的信息,请见 `org.apache.druid.data.input.AvroStreamInputRowParserTest#testParse()` + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `schema_repo` | 否 | +| subjectAndIdConverter | JSON对象 | 指定如何从消息字节中提取subject和id | 是 | +| schemaRepository | JSON对象 | 指定如何从subject和id查找Avro Schema | 是 | + +**Avro-1124 Subject 和 Id 转换器** +这部分描述了 `schema_avro` avro 字节解码器中的 `subjectAndIdConverter` 的格式 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `avro_1124` | 否 | +| topic | String | 指定Kafka流的主题 | 是 | + +**Avro-1124 Schema Repository** +这部分描述了 `schema_avro` avro 字节解码器中的 `schemaRepository` 的格式 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `avro_1124_rest_client` | 否 | +| url | String | 指定Avro-1124 schema repository的http url | 是 | + +**Confluent Schema Registry-based Avro Bytes Decoder** + +这个Avro字节解码器首先从输入消息字节中提取一个唯一的id,然后使用它在用于从字节解码Avro记录的模式注册表中查找模式。有关详细信息,请参阅schema注册 [文档](https://docs.confluent.io/current/schema-registry/index.html) 和 [存储库](https://github.com/confluentinc/schema-registry)。 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `schema_registry` | 否 | +| url | String | 指定架构注册表的url | 是 | +| capacity | 整型数字 | 指定缓存的最大值(默认为 Integer.MAX_VALUE)| 否 | + +```json +... +"avroBytesDecoder" : { + "type" : "schema_registry", + "url" : +} +... +``` + +#### Protobuf Parser + +> [!WARNING] +> 需要添加 [druid-protobuf-extensions](../development/protobuf-extensions.md) 来使用Protobuf解析器 + +此解析器用于 [流接收](streamingest.md),并直接从流中读取协议缓冲区数据。 + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| type | String | `protobuf` | 是 | +| descriptor | String | 类路径或URL中的Protobuf描述符文件名 | 是 | +| protoMessageType | String | 描述符中的Protobuf消息类型。可接受短名称和全限定名称。如果未指定,解析器将使用描述符中找到的第一个消息类型 | 否 | +| parseSpec | JSON对象 | 指定数据的时间戳和维度。格式必须为JSON。有关更多配置选项,请参阅 [JSON ParseSpec](#json)。请注意,不再支持timeAndDims parseSpec | 是 | + +样例规范: +```json +"parser": { + "type": "protobuf", + "descriptor": "file:///tmp/metrics.desc", + "protoMessageType": "Metrics", + "parseSpec": { + "format": "json", + "timestampSpec": { + "column": "timestamp", + "format": "auto" + }, + "dimensionsSpec": { + "dimensions": [ + "unit", + "http_method", + "http_code", + "page", + "metricType", + "server" + ], + "dimensionExclusions": [ + "timestamp", + "value" + ] + } + } +} +``` +有关更多详细信息和示例,请参见 [扩展说明](../development/protobuf-extensions.md)。 + +### ParseSpec + +> [!WARNING] +> Parser 在 [本地批任务](native.md), [kafka索引任务](kafka.md) 和[Kinesis索引任务](kinesis.md) 中已经废弃,在这些类型的摄入中考虑使用 [inputFormat](#InputFormat) + +`parseSpec` 有两个目的: +* String解析器使用 `parseSpec` 来决定输入行的格式(例如: JSON,CSV,TSV) +* 所有的解析器使用 `parseSpec` 来决定输入行的timestamp和dimensions + +如果 `format` 没有被包含,`parseSpec` 默认为 `tsv` + +#### JSON解析规范 +与字符串解析器一起用于加载JSON。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `json` | 否 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | +| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | + +示例规范: +```json +"parseSpec": { + "format" : "json", + "timestampSpec" : { + "column" : "timestamp" + }, + "dimensionSpec" : { + "dimensions" : 
["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` + +#### JSON Lowercase解析规范 + +> [!WARNING] +> `JsonLowerCase` 解析器已经废弃,并可能在Druid将来的版本中移除 + +这是JSON ParseSpec的一个特殊变体,它将传入JSON数据中的所有列名小写。如果您正在从Druid 0.6.x更新到druid0.7.x,正在直接接收具有混合大小写列名的JSON,没有任何ETL来将这些列名转换大小写,并且希望进行包含使用0.6.x和0.7.x创建的数据的查询,则需要此parseSpec。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `jsonLowerCase` | 是 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | + +#### CSV解析规范 + +与字符串解析器一起用于加载CSV, 字符串通过使用 `com.opencsv` 库来进行解析。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `csv` | 是 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | +| listDelimiter | String | 多值维度的定制分隔符 | 否(默认为 `ctrl + A`)| +| columns | JSON数组 | 指定数据的列 | 是 | + +示例规范: +```json +"parseSpec": { + "format" : "csv", + "timestampSpec" : { + "column" : "timestamp" + }, + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` + +**CSV索引任务** + +如果输入文件包含头,则 `columns` 字段是可选的,不需要设置。相反,您可以将 `hasHeaderRow` 字段设置为 `true`,这将使Druid自动从标题中提取列信息。否则,必须设置 `columns` 字段,并确保该字段必须以相同的顺序与输入数据的列匹配。 + +另外,可以通过在parseSpec中设置 `skipHeaderRows` 跳过一些标题行。如果同时设置了 `skipHeaderRows` 和 `HashHeaderRow` 选项,则首先应用`skipHeaderRows` 。例如,如果将 `skipHeaderRows` 设置为2,`hasHeaderRow` 设置为true,Druid将跳过前两行,然后从第三行提取列信息。 + +请注意,`hasHeaderRow` 和 `skipHeaderRows` 仅对非Hadoop批索引任务有效。其他类型的索引任务将失败,并出现异常。 + +**其他CSV摄入任务** + +必须包含 `columns` 字段,并确保字段的顺序与输入数据的列以相同的顺序匹配。 + +#### TSV/Delimited解析规范 + +与字符串解析器一起使用此命令可加载不需要特殊转义的任何分隔文本。默认情况下,分隔符是一个制表符,因此这将加载TSV。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `csv` | 是 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | +| delimiter | String | 数据值的定制分隔符 | 否(默认为 `\t`)| +| listDelimiter | String | 多值维度的定制分隔符 | 否(默认为 `ctrl + A`)| +| columns | JSON数组 | 指定数据的列 | 是 | + +示例规范: +```json +"parseSpec": { + "format" : "tsv", + "timestampSpec" : { + "column" : "timestamp" + }, + "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], + "delimiter":"|", + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + } +} +``` +请确保将 `delimiter` 更改为数据的适当分隔符。与CSV一样,您必须指定要索引的列和列的子集。 + +**TSV(Delimited)索引任务** + +如果输入文件包含头,则 `columns` 字段是可选的,不需要设置。相反,您可以将 `hasHeaderRow` 字段设置为 `true`,这将使Druid自动从标题中提取列信息。否则,必须设置 `columns` 字段,并确保该字段必须以相同的顺序与输入数据的列匹配。 + +另外,可以通过在parseSpec中设置 `skipHeaderRows` 跳过一些标题行。如果同时设置了 `skipHeaderRows` 和 `HashHeaderRow` 选项,则首先应用`skipHeaderRows` 。例如,如果将 `skipHeaderRows` 设置为2,`hasHeaderRow` 设置为true,Druid将跳过前两行,然后从第三行提取列信息。 + +请注意,`hasHeaderRow` 和 `skipHeaderRows` 仅对非Hadoop批索引任务有效。其他类型的索引任务将失败,并出现异常。 + +**其他TSV(Delimited)摄入任务** + +必须包含 `columns` 字段,并确保字段的顺序与输入数据的列以相同的顺序匹配。 + +#### 多值维度 + +对于TSV和CSV数据,维度可以有多个值。要为多值维度指定分隔符,请在`parseSpec` 中设置 `listDelimiter`。 + +JSON数据也可以包含多值维度。维度的多个值必须在接收的数据中格式化为 `JSON数组`,不需要额外的 `parseSpec` 配置。 + +#### 正则解析规范 +```json +"parseSpec":{ + "format" : "regex", + "timestampSpec" : { + "column" : "timestamp" + 
}, + "dimensionsSpec" : { + "dimensions" : [] + }, + "columns" : [], + "pattern" : +} +``` + +`columns` 字段必须以相同的顺序与regex匹配组的列匹配。如果未提供列,则默认列名称(“column_1”、“column2”、…”列“)将被分配, 确保列名包含所有维度 + +#### JavaScript解析规范 +```json +"parseSpec":{ + "format" : "javascript", + "timestampSpec" : { + "column" : "timestamp" + }, + "dimensionsSpec" : { + "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] + }, + "function" : "function(str) { var parts = str.split(\"-\"); return { one: parts[0], two: parts[1] } }" +} +``` + +注意: JavaScript解析器必须完全解析数据,并在JS逻辑中以 `{key:value}` 格式返回。这意味着任何展平或解析多维值都必须在这里完成。 + +> [!WARNING] +> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅 [Druid JavaScript编程指南](../development/JavaScript.md)。 + +#### 时间和维度解析规范 + +与非字符串解析器一起使用,为它们提供时间戳和维度信息。非字符串解析器独立处理所有格式化决策,而不使用ParseSpec。 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `timeAndDims` | 是 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | + +#### Orc解析规范 + +与Hadoop ORC解析器一起使用来加载ORC文件 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `orc` | 否 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | +| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | + +#### Parquet解析规范 + +与Hadoop Parquet解析器一起使用来加载Parquet文件 + +| 字段 | 类型 | 描述 | 是否必填 | +|-|-|-|-| +| format | String | `parquet` | 否 | +| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | +| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | +| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | \ No newline at end of file diff --git a/ingestion/dataformats.md b/ingestion/dataformats.md deleted file mode 100644 index 4b1158b..0000000 --- a/ingestion/dataformats.md +++ /dev/null @@ -1,1140 +0,0 @@ - -## 数据格式 -Apache Druid可以接收JSON、CSV或TSV等分隔格式或任何自定义格式的非规范化数据。尽管文档中的大多数示例使用JSON格式的数据,但将Druid配置为接收任何其他分隔数据并不困难。我们欢迎对新格式的任何贡献。 - -此页列出了Druid支持的所有默认和核心扩展数据格式。有关社区扩展支持的其他数据格式,请参阅我们的 [社区扩展列表](../configuration/logging.md#社区扩展)。 - -### 格式化数据 - -下面的示例显示了在Druid中原生支持的数据格式: - -*JSON* -```json -{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143} -{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330} -{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111} -{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900} -{"timestamp": 
"2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9} -``` - -*CSV* -```json -2013-08-31T01:02:33Z,"Gypsy Danger","en","nuclear","true","true","false","false","article","North America","United States","Bay Area","San Francisco",57,200,-143 -2013-08-31T03:32:45Z,"Striker Eureka","en","speed","false","true","true","false","wikipedia","Australia","Australia","Cantebury","Syndey",459,129,330 -2013-08-31T07:11:21Z,"Cherno Alpha","ru","masterYi","false","true","true","false","article","Asia","Russia","Oblast","Moscow",123,12,111 -2013-08-31T11:58:39Z,"Crimson Typhoon","zh","triplets","true","false","true","false","wikipedia","Asia","China","Shanxi","Taiyuan",905,5,900 -2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9 -``` - -*TSV(Delimited)* -```json -2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false" "false" "article" "North America" "United States" "Bay Area" "San Francisco" 57 200 -143 -2013-08-31T03:32:45Z "Striker Eureka" "en" "speed" "false" "true" "true" "false" "wikipedia" "Australia" "Australia" "Cantebury" "Syndey" 459 129 330 -2013-08-31T07:11:21Z "Cherno Alpha" "ru" "masterYi" "false" "true" "true" "false" "article" "Asia" "Russia" "Oblast" "Moscow" 123 12 111 -2013-08-31T11:58:39Z "Crimson Typhoon" "zh" "triplets" "true" "false" "true" "false" "wikipedia" "Asia" "China" "Shanxi" "Taiyuan" 905 5 900 -2013-08-31T12:41:27Z "Coyote Tango" "ja" "cancer" "true" "false" "true" "false" "wikipedia" "Asia" "Japan" "Kanto" "Tokyo" 1 10 -9 -``` - -请注意,CSV和TSV数据不包含列标题。当您指定要摄取的数据时,这一点就变得很重要。 - -除了文本格式,Druid还支持二进制格式,比如 [Orc](#orc) 和 [Parquet](#parquet) 格式。 - -### 定制格式 - -Druid支持自定义数据格式,可以使用 `Regex` 解析器或 `JavaScript` 解析器来解析这些格式。请注意,使用这些解析器中的任何一个来解析数据都不如编写原生Java解析器或使用外部流处理器那样高效。我们欢迎新解析器的贡献。 - -### InputFormat - -> [!WARNING] -> 输入格式是在0.17.0中引入的指定输入数据的数据格式的新方法。不幸的是,输入格式还不支持Druid支持的所有数据格式或摄取方法。特别是如果您想使用Hadoop接收,您仍然需要使用 [解析器](#parser)。如果您的数据是以本节未列出的某种格式格式化的,请考虑改用解析器。 - -所有形式的Druid摄取都需要某种形式的schema对象。要摄取的数据的格式是使用[`ioConfig`](/ingestion.md#ioConfig) 中的 `inputFormat` 条目指定的。 - -#### JSON - -**JSON** -一个加载JSON格式数据的 `inputFormat` 示例: -```json -"ioConfig": { - "inputFormat": { - "type": "json" - }, - ... -} -``` -JSON `inputFormat` 有以下组件: - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 填 `json` | 是 | -| flattenSpec | JSON对象 | 指定嵌套JSON数据的展平配置。更多信息请参见[flattenSpec](#flattenspec) | 否 | -| featureSpec | JSON对象 | Jackson库支持的 [JSON解析器特性](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) 。这些特性将在解析输入JSON数据时应用。 | 否 | - - -#### CSV -一个加载CSV格式数据的 `inputFormat` 示例: -```json -"ioConfig": { - "inputFormat": { - "type": "csv", - "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"] - }, - ... 
-} -``` - -CSV `inputFormat` 有以下组件: - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 填 `csv` | 是 | -| listDelimiter | String | 多值维度的定制分隔符 | 否(默认ctrl + A) | -| columns | JSON数组 | 指定数据的列。列的顺序应该与数据列的顺序相同。 | 如果 `findColumnsFromHeader` 设置为 `false` 或者缺失, 则为必填项 | -| findColumnsFromHeader | 布尔 | 如果设置了此选项,则任务将从标题行中查找列名。请注意,在从标题中查找列名之前,将首先使用 `skipHeaderRows`。例如,如果将 `skipHeaderRows` 设置为2,将 `findColumnsFromHeader` 设置为 `true`,则任务将跳过前两行,然后从第三行提取列信息。该项如果设置为true,则将忽略 `columns` | 否(如果 `columns` 被设置则默认为 `false`, 否则为null) | -| skipHeaderRows | 整型数值 | 该项如果设置,任务将略过 `skipHeaderRows`配置的行数 | 否(默认为0) | - -#### TSV(Delimited) -```json -"ioConfig": { - "inputFormat": { - "type": "tsv", - "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], - "delimiter":"|" - }, - ... -} -``` -TSV `inputFormat` 有以下组件: - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 填 `tsv` | 是 | -| delimiter | String | 数据值的自定义分隔符 | 否(默认为 `\t`) | -| listDelimiter | String | 多值维度的定制分隔符 | 否(默认ctrl + A) | -| columns | JSON数组 | 指定数据的列。列的顺序应该与数据列的顺序相同。 | 如果 `findColumnsFromHeader` 设置为 `false` 或者缺失, 则为必填项 | -| findColumnsFromHeader | 布尔 | 如果设置了此选项,则任务将从标题行中查找列名。请注意,在从标题中查找列名之前,将首先使用 `skipHeaderRows`。例如,如果将 `skipHeaderRows` 设置为2,将 `findColumnsFromHeader` 设置为 `true`,则任务将跳过前两行,然后从第三行提取列信息。该项如果设置为true,则将忽略 `columns` | 否(如果 `columns` 被设置则默认为 `false`, 否则为null) | -| skipHeaderRows | 整型数值 | 该项如果设置,任务将略过 `skipHeaderRows`配置的行数 | 否(默认为0) | - -请确保将分隔符更改为适合于数据的分隔符。与CSV一样,您必须指定要索引的列和列的子集。 - -#### ORC - -> [!WARNING] -> 使用ORC输入格式之前,首先需要包含 [druid-orc-extensions](../development/orc-extensions.md) - -> [!WARNING] -> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。 - -一个加载ORC格式数据的 `inputFormat` 示例: -```json -"ioConfig": { - "inputFormat": { - "type": "orc", - "flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { - "type": "path", - "name": "nested", - "expr": "$.path.to.nested" - } - ] - } - "binaryAsString": false - }, - ... -} -``` - -ORC `inputFormat` 有以下组件: - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 填 `orc` | 是 | -| flattenSpec | JSON对象 | 指定嵌套JSON数据的展平配置。更多信息请参见[flattenSpec](#flattenspec) | 否 | -| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | - -#### Parquet - -> [!WARNING] -> 使用Parquet输入格式之前,首先需要包含 [druid-parquet-extensions](../development/parquet-extensions.md) - -一个加载Parquet格式数据的 `inputFormat` 示例: -```json -"ioConfig": { - "inputFormat": { - "type": "parquet", - "flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { - "type": "path", - "name": "nested", - "expr": "$.path.to.nested" - } - ] - } - "binaryAsString": false - }, - ... 
-} -``` - -Parquet `inputFormat` 有以下组件: - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 填 `parquet` | 是 | -| flattenSpec | JSON对象 | 定义一个 [flattenSpec](#flattenspec) 从Parquet文件提取嵌套的值。注意,只支持"path"表达式('jq'不可用)| 否(默认自动发现根级别的属性) | -| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | - -#### FlattenSpec - -`flattenSpec` 位于 `inputFormat` -> `flattenSpec` 中,负责将潜在的嵌套输入数据(如JSON、Avro等)和Druid的平面数据模型之间架起桥梁。 `flattenSpec` 示例如下: -```json -"flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { "name": "baz", "type": "root" }, - { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" }, - { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" } - ] -} -``` -> [!WARNING] -> 概念上,输入数据被读取后,Druid会以一个特定的顺序来对数据应用摄入规范: 首先 `flattenSpec`(如果有),然后 `timestampSpec`, 然后 `transformSpec` ,最后是 `dimensionsSpec` 和 `metricsSpec`。在编写摄入规范时需要牢记这一点 - -展平操作仅仅支持嵌套的 [数据格式](dataformats.md), 包括:`avro`, `json`, `orc` 和 `parquet`。 - -`flattenSpec` 有以下组件: - -| 字段 | 描述 | 默认值 | -|-|-|-| -| useFieldDiscovery | 如果为true,则将所有根级字段解释为可用字段,供 [`timestampSpec`](/ingestion.md#timestampSpec)、[`transformSpec`](/ingestion.md#transformSpec)、[`dimensionsSpec`](/ingestion.md#dimensionsSpec) 和 [`metricsSpec`](/ingestion.md#metricsSpec) 使用。

如果为false,则只有显式指定的字段(请参阅 `fields`)才可供使用。 | true |
-| fields | 指定感兴趣的字段及其访问方式, 详细请见下边 | `[]` |
-
-**字段展平规范**
-
-`fields` 列表中的每个条目都可以包含以下组件:
-
-| 字段 | 描述 | 默认值 |
-|-|-|-|
-| type | 可选项如下:<br>• `root`:引用记录根级别的字段,只有当 `useFieldDiscovery` 为 false 时才真正有用。<br>• `path`:引用使用 JsonPath 表示法的字段,支持大多数提供嵌套的数据格式,包括 `avro`、`csv`、`json` 和 `parquet`。<br>• `jq`:引用使用 jackson-jq 表示法的字段,仅支持 `json` 格式。 | none(必填) |
-| name | 展平后的字段名称。这个名称可以被 `timestampSpec`、`transformSpec`、`dimensionsSpec` 和 `metricsSpec` 引用。 | none(必填) |
-| expr | 用于在展平时访问字段的表达式。对于 `path` 类型,这应该是 JsonPath 表达式;对于 `jq` 类型,这应该是 jackson-jq 表达式;对于其他类型,将忽略此参数。 | none(对于 `path` 和 `jq` 类型为必填) |
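-
-下面是一个仅作示意的片段(列名与路径均为假设,且省略了 `flattenSpec` 与 `dimensionsSpec` 各自所在的完整上下文),用来说明展平出的字段如何按其 `name` 在 `dimensionsSpec` 中作为普通维度引用:
-
-```json
-"flattenSpec": {
-  "useFieldDiscovery": true,
-  "fields": [
-    { "name": "user_city", "type": "path", "expr": "$.user.address.city" }
-  ]
-},
-...
-"dimensionsSpec": {
-  "dimensions": [ "user_city" ]
-}
-```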
- -**展平操作的注意事项** - -* 为了方便起见,在定义根级字段时,可以只将字段名定义为字符串,而不是JSON对象。例如 `{"name": "baz", "type": "root"}` 等价于 `baz` -* 启用 `useFieldDiscovery` 只会在根级别自动检测与Druid支持的数据类型相对应的"简单"字段, 这包括字符串、数字和字符串或数字列表。不会自动检测到其他类型,其他类型必须在 `fields` 列表中显式指定 -* 不允许重复字段名(`name`), 否则将引发异常 -* 如果启用 `useFieldDiscovery`,则将跳过与字段列表中已定义的字段同名的任何已发现字段,而不是添加两次 -* [http://jsonpath.herokuapp.com/](http://jsonpath.herokuapp.com/) 对于测试 `path`-类型表达式非常有用 -* jackson jq支持完整 [`jq`](https://stedolan.github.io/jq/)语法的一个子集。有关详细信息,请参阅 [jackson jq](https://github.com/eiiches/jackson-jq) 文档 - -### Parser - -> [!WARNING] -> parser在 [本地批任务](native.md), [Kafka索引任务](kafka.md) 和 [Kinesis索引任务](kinesis.md) 中已经废弃,在这些类型的摄入方式中考虑使用 [inputFormat](#数据格式) - -该部分列出来了所有默认的以及核心扩展中的解析器。对于社区的扩展解析器,请参见 [社区扩展列表](../development/extensions.md#社区扩展) - -#### String Parser - -`string` 类型的解析器对基于文本的输入进行操作,这些输入可以通过换行符拆分为单独的记录, 可以使用 [`parseSpec`](#parsespec) 进一步分析每一行。 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | string | 一般是 `string`, 在Hadoop索引任务中为 `hadoopyString` | 是 | -| parseSpec | JSON对象 | 指定格式,数据的timestamp和dimensions | 是 | - -#### Avro Hadoop Parser - -> [!WARNING] -> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用 Avro Hadoop解析器 - -该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.avro.AvroValueInputFormat`。您可能想在 `tuningConfig` 中的 `jobProperties` 选项设置Avro reader的schema, 例如:`"avro.schema.input.value.path": "/path/to/your/schema.avsc"` 或者 `"avro.schema.input.value": "your_schema_JSON_object"`。如果未设置Avro读取器的schema,则将使用Avro对象容器文件中的schema,详情可以参见 [avro规范](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution) - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 应该填 `avro_hadoop` | 是 | -| parseSpec | JSON对象 | 指定数据的时间戳和维度。应该是“avro”语法规范。| 是 | - -Avro parseSpec可以包含使用"root"或"path"字段类型的 [flattenSpec](#flattenspec),这些字段类型可用于读取嵌套的Avro记录。Avro当前不支持“jq”字段类型。 - -例如,使用带有自定义读取器schema文件的Avro Hadoop解析器: -```json -{ - "type" : "index_hadoop", - "spec" : { - "dataSchema" : { - "dataSource" : "", - "parser" : { - "type" : "avro_hadoop", - "parseSpec" : { - "format": "avro", - "timestampSpec": , - "dimensionsSpec": , - "flattenSpec": - } - } - }, - "ioConfig" : { - "type" : "hadoop", - "inputSpec" : { - "type" : "static", - "inputFormat": "org.apache.druid.data.input.avro.AvroValueInputFormat", - "paths" : "" - } - }, - "tuningConfig" : { - "jobProperties" : { - "avro.schema.input.value.path" : "/path/to/my/schema.avsc" - } - } - } -} -``` - -#### ORC Hadoop Parser - -> [!WARNING] -> 需要添加 [druid-orc-extensions](../development/orc-extensions.md) 来使用ORC Hadoop解析器 - -> [!WARNING] -> 如果您正在考虑从早于0.15.0的版本升级到0.15.0或更高版本,请仔细阅读 [从contrib扩展的迁移](../development/orc-extensions.md#从contrib扩展迁移)。 - -该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.orc.mapreduce.OrcInputFormat`。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 应该填 `orc` | 是 | -| parseSpec | JSON对象 | 指定数据(`timeAndDim` 和 `orc` 格式)的时间戳和维度和一个`flattenSpec`(`orc`格式)| 是 | - -解析器支持两种 `parseSpec` 格式: `orc` 和 `timeAndDims` - -`orc` 支持字段的自动发现和展平(如果指定了 [flattenSpec](#flattenspec)。如果未指定展平规范,则默认情况下将启用 `useFieldDiscovery`。如果启用了 `useFieldDiscovery`,则指定`dimensionSpec` 是可选的:如果提供了 `dimensionSpec`,则它定义的维度列表将是摄取维度的集合,如果缺少发现的字段将构成该列表。 - -`timeAndDims` 解析规范必须通过 `dimensionSpec` 指定哪些字段将提取为维度。 - -支持所有 [列类型](https://orc.apache.org/docs/types.html) ,但 `union` 类型除外。`list` 类型的列(如果用基本类型填充)可以用作多值维度,或者可以使用 [flattenSpec](#flattenspec) 表达式提取特定元素。同样,可以用同样的方式从 `map` 和 `struct` 类型中提取基本字段。自动字段发现将自动为每个(非时间戳)基本类型或基本类型 `list` 以及 `flattenSpec` 
中定义的任何展平表达式创建字符串维度。 - -**Hadoop job属性** - -像大多数Hadoop作业,最佳结果是在 `tuningConfig` 中的 `jobProperties` 中添加 `"mapreduce.job.user.classpath.first": "true"` 或者 `"mapreduce.job.classloader": "true"`。 注意,如果使用了 `"mapreduce.job.classloader": "true"`, 需要设置 `mapreduce.job.classloader.system.classes` 包含 `-org.apache.hadoop.hive.` 来让Hadoop从应用jars包中加载 `org.apache.hadoop.hive` 而非从系统jar中,例如: - -```json -... - "mapreduce.job.classloader": "true", - "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml", -... -``` - -这是因为 `orc-mapreduce` 库的配置单元 `hive-storage-api` 依赖关系,它在 `org.apache.hadoop.hive` 包下提供了一些类。如果改为使用`"mapreduce.job.user.classpath.first":"true"`设置,则不会出现此问题。 - -**示例** - -**`orc` parser, `orc` parseSpec, 自动字段发现, 展平表达式** -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", - "paths": "path/to/file.orc" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "orc", - "parseSpec": { - "format": "orc", - "flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { - "type": "path", - "name": "nestedDim", - "expr": "$.nestedData.dim1" - }, - { - "type": "path", - "name": "listDimFirstItem", - "expr": "$.listDim[1]" - } - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "millis" - } - } - }, - ... - }, - "tuningConfig": - } - } -} -``` - -**`orc` parser, `orc` parseSpec, 不具有 `flattenSpec` 或者 `dimensionSpec`的字段发现** - -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", - "paths": "path/to/file.orc" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "orc", - "parseSpec": { - "format": "orc", - "timestampSpec": { - "column": "timestamp", - "format": "millis" - } - } - }, - ... - }, - "tuningConfig": - } - } -} -``` -**`orc` parser, `orc` parseSpec, 非自动发现** - -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", - "paths": "path/to/file.orc" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "orc", - "parseSpec": { - "format": "orc", - "flattenSpec": { - "useFieldDiscovery": false, - "fields": [ - { - "type": "path", - "name": "nestedDim", - "expr": "$.nestedData.dim1" - }, - { - "type": "path", - "name": "listDimFirstItem", - "expr": "$.listDim[1]" - } - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "millis" - }, - "dimensionsSpec": { - "dimensions": [ - "dim1", - "dim3", - "nestedDim", - "listDimFirstItem" - ], - "dimensionExclusions": [], - "spatialDimensions": [] - } - } - }, - ... 
- }, - "tuningConfig": - } - } -} -``` - -**`orc` parser, `timeAndDims` parseSpec** - -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat", - "paths": "path/to/file.orc" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "orc", - "parseSpec": { - "format": "timeAndDims", - "timestampSpec": { - "column": "timestamp", - "format": "auto" - }, - "dimensionsSpec": { - "dimensions": [ - "dim1", - "dim2", - "dim3", - "listDim" - ], - "dimensionExclusions": [], - "spatialDimensions": [] - } - } - }, - ... - }, - "tuningConfig": - } -} -``` - -#### Parquet Hadoop Parser - -> [!WARNING] -> 需要添加 [druid-parquet-extensions](../development/parquet-extensions.md) 来使用Parquet Hadoop解析器 - -该解析器用于 [Hadoop批摄取](hadoop.md)。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetInputFormat`。 - -Parquet Hadoop 解析器支持自动字段发现,如果提供了一个带有 `parquet` `parquetSpec`的 `flattenSpec` 也支持展平。 Parquet嵌套 list 和 map [逻辑类型](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 应与所有受支持类型的JSON path表达式一起正确操作。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 应该填 `parquet` | 是 | -| parseSpec | JSON对象 | 指定数据的时间戳和维度和一个可选的 `flattenSpec`。有效的 `parseSpec` 格式是 `timeAndDims` 和 `parquet` | 是 | -| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | - -当时间维度是一个 [date类型的列](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), 则无需指定一个格式。 当格式为UTF8的String, 则要么指定为 `auto`,或者显式的指定一个 [时间格式](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)。 - -**Parquet Hadoop解析器 vs Parquet Avro Hadoop解析器** -两者都是从Parquet文件中读取,但是又轻微的不同。主要不同之处是: -* Parquet Hadoop解析器使用简单的转换,而Parquet Avro Hadoop解析器首先使用 `parquet-avro` 库将Parquet数据转换为Avro记录,然后使用 `druid-avro-extensions` 模块将Avro数据解析为druid -* Parquet Hadoop解析器将Hadoop作业属性 `parquet.avro.add-list-element-records` 设置为false(通常默认为true),以便将原始列表元素"展开"为多值维度 -* Parquet Hadoop解析器支持 `int96` Parquet值,而 Parquet Avro Hadoop解析器不支持。`flatteSpec` 的JSON path表达式求值的行为也可能存在一些细微的差异 - -基于这些差异,我们建议在Parquet avro hadoop解析器上使用Parquet Hadoop解析器,以允许摄取超出Avro转换模式约束的数据。然而,Parquet Avro Hadoop解析器是支持Parquet格式的原始基础,因此它更加成熟。 - -**示例** - -`parquet` parser, `parquet` parseSpec -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", - "paths": "path/to/file.parquet" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "parquet", - "parseSpec": { - "format": "parquet", - "flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { - "type": "path", - "name": "nestedDim", - "expr": "$.nestedData.dim1" - }, - { - "type": "path", - "name": "listDimFirstItem", - "expr": "$.listDim[1]" - } - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "auto" - }, - "dimensionsSpec": { - "dimensions": [], - "dimensionExclusions": [], - "spatialDimensions": [] - } - } - }, - ... - }, - "tuningConfig": - } - } -} -``` -`parquet` parser, `timeAndDims` parseSpec -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat", - "paths": "path/to/file.parquet" - }, - ... 
- }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "parquet", - "parseSpec": { - "format": "timeAndDims", - "timestampSpec": { - "column": "timestamp", - "format": "auto" - }, - "dimensionsSpec": { - "dimensions": [ - "dim1", - "dim2", - "dim3", - "listDim" - ], - "dimensionExclusions": [], - "spatialDimensions": [] - } - } - }, - ... - }, - "tuningConfig": - } -} -``` - -#### Parquet Avro Hadoop Parser - -> [!WARNING] -> 考虑在该解析器之上使用 [Parquet Hadoop Parser](#parquet-hadoop-parser) 来摄取Parquet文件。 两者之间的不同之处参见 [Parquet Hadoop解析器 vs Parquet Avro Hadoop解析器]() 部分 - -> [!WARNING] -> 使用Parquet Avro Hadoop Parser需要同时加入 [druid-parquet-extensions](../development/parquet-extensions.md) 和 [druid-avro-extensions](../development/avro-extensions.md) - -该解析器用于 [Hadoop批摄取](hadoop.md), 该解析器首先将Parquet数据转换为Avro记录,然后再解析它们后摄入到Druid。在 `ioConfig` 中,`inputSpec` 中的 `inputFormat` 必须设置为 `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat`。 - -Parquet Avro Hadoop 解析器支持自动字段发现,如果提供了一个带有 `avro` `parquetSpec`的 `flattenSpec` 也支持展平。 Parquet嵌套 list 和 map [逻辑类型](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 应与所有受支持类型的JSON path表达式一起正确操作。该解析器将Hadoop作业属性 `parquet.avro.add-list-element-records` 设置为false(通常默认为true),以便将原始列表元素"展开"为多值维度。 - -注意,`int96` Parquet值类型在该解析器中是不支持的。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| type | String | 应该填 `parquet-avro` | 是 | -| parseSpec | JSON对象 | 指定数据的时间戳和维度和一个可选的 `flattenSpec`, 应该是 `avro` | 是 | -| binaryAsString | 布尔类型 | 指定逻辑上未标记为字符串的二进制orc列是否应被视为UTF-8编码字符串。 | 否(默认为false) | - -当时间维度是一个 [date类型的列](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), 则无需指定一个格式。 当格式为UTF8的String, 则要么指定为 `auto`,或者显式的指定一个 [时间格式](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)。 - -**示例** -```json -{ - "type": "index_hadoop", - "spec": { - "ioConfig": { - "type": "hadoop", - "inputSpec": { - "type": "static", - "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat", - "paths": "path/to/file.parquet" - }, - ... - }, - "dataSchema": { - "dataSource": "example", - "parser": { - "type": "parquet-avro", - "parseSpec": { - "format": "avro", - "flattenSpec": { - "useFieldDiscovery": true, - "fields": [ - { - "type": "path", - "name": "nestedDim", - "expr": "$.nestedData.dim1" - }, - { - "type": "path", - "name": "listDimFirstItem", - "expr": "$.listDim[1]" - } - ] - }, - "timestampSpec": { - "column": "timestamp", - "format": "auto" - }, - "dimensionsSpec": { - "dimensions": [], - "dimensionExclusions": [], - "spatialDimensions": [] - } - } - }, - ... 
- }, - "tuningConfig": - } - } -} -``` - -#### Avro Stream Parser - -> [!WARNING] -> 需要添加 [druid-avro-extensions](../development/avro-extensions.md) 来使用Avro Stream解析器 - -该解析器用于 [流式摄取](streamingest.md), 直接从一个流来读取数据。 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `avro_stream` | 否 | -| avroBytesDecoder | JSON对象 | 指定如何对Avro记录进行解码 | 是 | -| parseSpec | JSON对象 | 指定数据的时间戳和维度。 应该是一个 `avro` parseSpec | 是 | - -Avro parseSpec包含一个使用"root"或者"path"类型的 [`flattenSpec`](ingestion.md#flattenspec.md), 以便可以用来读取嵌套的avro数据。 "jq"类型在Avro中目前还不支持。 - -以下示例展示了一个具有**schema repo**avro解码器的 `Avro stream parser`: -```json -"parser" : { - "type" : "avro_stream", - "avroBytesDecoder" : { - "type" : "schema_repo", - "subjectAndIdConverter" : { - "type" : "avro_1124", - "topic" : "${YOUR_TOPIC}" - }, - "schemaRepository" : { - "type" : "avro_1124_rest_client", - "url" : "${YOUR_SCHEMA_REPO_END_POINT}", - } - }, - "parseSpec" : { - "format": "avro", - "timestampSpec": , - "dimensionsSpec": , - "flattenSpec": - } -} -``` - -**Avro Bytes Decoder** - -如果 `type` 未被指定, `avroBytesDecoder` 默认使用 `schema_repo`。 - -**基于Avro Bytes Decoder的 `inline schema`** - -> [!WARNING] -> "schema_inline"解码器使用固定schema读取Avro记录,不支持schema迁移。如果将来可能需要迁移schema,请考虑其他解码器之一,所有解码器都使用一个消息头,该消息头允许解析器识别正确的Avro schema以读取记录。 - -如果可以使用同一schema读取所有输入事件,则可以使用此解码器。在这种情况下,在输入任务JSON本身中指定schema,如下所述: -```json -... -"avroBytesDecoder": { - "type": "schema_inline", - "schema": { - //your schema goes here, for example - "namespace": "org.apache.druid.data", - "name": "User", - "type": "record", - "fields": [ - { "name": "FullName", "type": "string" }, - { "name": "Country", "type": "string" } - ] - } -} -... -``` -**基于Avro Bytes Decoder的 `multiple inline schemas`** - -如果不同的输入事件可以有不同的读取schema,请使用此解码器。在这种情况下,在输入任务JSON本身中指定schema,如下所述: -```json -... -"avroBytesDecoder": { - "type": "multiple_schemas_inline", - "schemas": { - //your id -> schema map goes here, for example - "1": { - "namespace": "org.apache.druid.data", - "name": "User", - "type": "record", - "fields": [ - { "name": "FullName", "type": "string" }, - { "name": "Country", "type": "string" } - ] - }, - "2": { - "namespace": "org.apache.druid.otherdata", - "name": "UserIdentity", - "type": "record", - "fields": [ - { "name": "Name", "type": "string" }, - { "name": "Location", "type": "string" } - ] - }, - ... - ... - } -} -... 
-``` -注意,它本质上是一个整数Schema ID到avro schema对象的映射。此解析器假定记录具有以下格式。第一个1字节是版本,必须始终为1, 接下来的4个字节是使用大端字节顺序序列化的整数模式ID。其余字节包含序列化的avro消息。 - -**基于Avro Bytes Decoder的 `SchemaRepo`** - -Avro Bytes Decorder首先提取输入消息的 `subject` 和 `id`, 然后使用她们去查找用来解码Avro记录的Avro schema,详情可以参见 [Schema repo](https://github.com/schema-repo/schema-repo) 和 [AVRO-1124](https://issues.apache.org/jira/browse/AVRO-1124) 。 您需要一个类似schema repo的http服务来保存avro模式。有关在消息生成器端注册架构的信息,请见 `org.apache.druid.data.input.AvroStreamInputRowParserTest#testParse()` - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `schema_repo` | 否 | -| subjectAndIdConverter | JSON对象 | 指定如何从消息字节中提取subject和id | 是 | -| schemaRepository | JSON对象 | 指定如何从subject和id查找Avro Schema | 是 | - -**Avro-1124 Subject 和 Id 转换器** -这部分描述了 `schema_avro` avro 字节解码器中的 `subjectAndIdConverter` 的格式 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `avro_1124` | 否 | -| topic | String | 指定Kafka流的主题 | 是 | - -**Avro-1124 Schema Repository** -这部分描述了 `schema_avro` avro 字节解码器中的 `schemaRepository` 的格式 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `avro_1124_rest_client` | 否 | -| url | String | 指定Avro-1124 schema repository的http url | 是 | - -**Confluent Schema Registry-based Avro Bytes Decoder** - -这个Avro字节解码器首先从输入消息字节中提取一个唯一的id,然后使用它在用于从字节解码Avro记录的模式注册表中查找模式。有关详细信息,请参阅schema注册 [文档](https://docs.confluent.io/current/schema-registry/index.html) 和 [存储库](https://github.com/confluentinc/schema-registry)。 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `schema_registry` | 否 | -| url | String | 指定架构注册表的url | 是 | -| capacity | 整型数字 | 指定缓存的最大值(默认为 Integer.MAX_VALUE)| 否 | - -```json -... -"avroBytesDecoder" : { - "type" : "schema_registry", - "url" : -} -... -``` - -#### Protobuf Parser - -> [!WARNING] -> 需要添加 [druid-protobuf-extensions](../development/protobuf-extensions.md) 来使用Protobuf解析器 - -此解析器用于 [流接收](streamingest.md),并直接从流中读取协议缓冲区数据。 - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| type | String | `protobuf` | 是 | -| descriptor | String | 类路径或URL中的Protobuf描述符文件名 | 是 | -| protoMessageType | String | 描述符中的Protobuf消息类型。可接受短名称和全限定名称。如果未指定,解析器将使用描述符中找到的第一个消息类型 | 否 | -| parseSpec | JSON对象 | 指定数据的时间戳和维度。格式必须为JSON。有关更多配置选项,请参阅 [JSON ParseSpec](#json)。请注意,不再支持timeAndDims parseSpec | 是 | - -样例规范: -```json -"parser": { - "type": "protobuf", - "descriptor": "file:///tmp/metrics.desc", - "protoMessageType": "Metrics", - "parseSpec": { - "format": "json", - "timestampSpec": { - "column": "timestamp", - "format": "auto" - }, - "dimensionsSpec": { - "dimensions": [ - "unit", - "http_method", - "http_code", - "page", - "metricType", - "server" - ], - "dimensionExclusions": [ - "timestamp", - "value" - ] - } - } -} -``` -有关更多详细信息和示例,请参见 [扩展说明](../development/protobuf-extensions.md)。 - -### ParseSpec - -> [!WARNING] -> Parser 在 [本地批任务](native.md), [kafka索引任务](kafka.md) 和[Kinesis索引任务](kinesis.md) 中已经废弃,在这些类型的摄入中考虑使用 [inputFormat](#InputFormat) - -`parseSpec` 有两个目的: -* String解析器使用 `parseSpec` 来决定输入行的格式(例如: JSON,CSV,TSV) -* 所有的解析器使用 `parseSpec` 来决定输入行的timestamp和dimensions - -如果 `format` 没有被包含,`parseSpec` 默认为 `tsv` - -#### JSON解析规范 -与字符串解析器一起用于加载JSON。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `json` | 否 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | -| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | - -示例规范: -```json -"parseSpec": { - "format" : "json", - "timestampSpec" : { - "column" : "timestamp" - }, - "dimensionSpec" : { - "dimensions" : 
["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] - } -} -``` - -#### JSON Lowercase解析规范 - -> [!WARNING] -> `JsonLowerCase` 解析器已经废弃,并可能在Druid将来的版本中移除 - -这是JSON ParseSpec的一个特殊变体,它将传入JSON数据中的所有列名小写。如果您正在从Druid 0.6.x更新到druid0.7.x,正在直接接收具有混合大小写列名的JSON,没有任何ETL来将这些列名转换大小写,并且希望进行包含使用0.6.x和0.7.x创建的数据的查询,则需要此parseSpec。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `jsonLowerCase` | 是 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | - -#### CSV解析规范 - -与字符串解析器一起用于加载CSV, 字符串通过使用 `com.opencsv` 库来进行解析。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `csv` | 是 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | -| listDelimiter | String | 多值维度的定制分隔符 | 否(默认为 `ctrl + A`)| -| columns | JSON数组 | 指定数据的列 | 是 | - -示例规范: -```json -"parseSpec": { - "format" : "csv", - "timestampSpec" : { - "column" : "timestamp" - }, - "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], - "dimensionsSpec" : { - "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] - } -} -``` - -**CSV索引任务** - -如果输入文件包含头,则 `columns` 字段是可选的,不需要设置。相反,您可以将 `hasHeaderRow` 字段设置为 `true`,这将使Druid自动从标题中提取列信息。否则,必须设置 `columns` 字段,并确保该字段必须以相同的顺序与输入数据的列匹配。 - -另外,可以通过在parseSpec中设置 `skipHeaderRows` 跳过一些标题行。如果同时设置了 `skipHeaderRows` 和 `HashHeaderRow` 选项,则首先应用`skipHeaderRows` 。例如,如果将 `skipHeaderRows` 设置为2,`hasHeaderRow` 设置为true,Druid将跳过前两行,然后从第三行提取列信息。 - -请注意,`hasHeaderRow` 和 `skipHeaderRows` 仅对非Hadoop批索引任务有效。其他类型的索引任务将失败,并出现异常。 - -**其他CSV摄入任务** - -必须包含 `columns` 字段,并确保字段的顺序与输入数据的列以相同的顺序匹配。 - -#### TSV/Delimited解析规范 - -与字符串解析器一起使用此命令可加载不需要特殊转义的任何分隔文本。默认情况下,分隔符是一个制表符,因此这将加载TSV。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `csv` | 是 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | -| delimiter | String | 数据值的定制分隔符 | 否(默认为 `\t`)| -| listDelimiter | String | 多值维度的定制分隔符 | 否(默认为 `ctrl + A`)| -| columns | JSON数组 | 指定数据的列 | 是 | - -示例规范: -```json -"parseSpec": { - "format" : "tsv", - "timestampSpec" : { - "column" : "timestamp" - }, - "columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"], - "delimiter":"|", - "dimensionsSpec" : { - "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] - } -} -``` -请确保将 `delimiter` 更改为数据的适当分隔符。与CSV一样,您必须指定要索引的列和列的子集。 - -**TSV(Delimited)索引任务** - -如果输入文件包含头,则 `columns` 字段是可选的,不需要设置。相反,您可以将 `hasHeaderRow` 字段设置为 `true`,这将使Druid自动从标题中提取列信息。否则,必须设置 `columns` 字段,并确保该字段必须以相同的顺序与输入数据的列匹配。 - -另外,可以通过在parseSpec中设置 `skipHeaderRows` 跳过一些标题行。如果同时设置了 `skipHeaderRows` 和 `HashHeaderRow` 选项,则首先应用`skipHeaderRows` 。例如,如果将 `skipHeaderRows` 设置为2,`hasHeaderRow` 设置为true,Druid将跳过前两行,然后从第三行提取列信息。 - -请注意,`hasHeaderRow` 和 `skipHeaderRows` 仅对非Hadoop批索引任务有效。其他类型的索引任务将失败,并出现异常。 - -**其他TSV(Delimited)摄入任务** - -必须包含 `columns` 字段,并确保字段的顺序与输入数据的列以相同的顺序匹配。 - -#### 多值维度 - -对于TSV和CSV数据,维度可以有多个值。要为多值维度指定分隔符,请在`parseSpec` 中设置 `listDelimiter`。 - -JSON数据也可以包含多值维度。维度的多个值必须在接收的数据中格式化为 `JSON数组`,不需要额外的 `parseSpec` 配置。 - -#### 正则解析规范 -```json -"parseSpec":{ - "format" : "regex", - "timestampSpec" : { - "column" : "timestamp" - 
}, - "dimensionsSpec" : { - "dimensions" : [] - }, - "columns" : [], - "pattern" : -} -``` - -`columns` 字段必须以相同的顺序与regex匹配组的列匹配。如果未提供列,则默认列名称(“column_1”、“column2”、…”列“)将被分配, 确保列名包含所有维度 - -#### JavaScript解析规范 -```json -"parseSpec":{ - "format" : "javascript", - "timestampSpec" : { - "column" : "timestamp" - }, - "dimensionsSpec" : { - "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] - }, - "function" : "function(str) { var parts = str.split(\"-\"); return { one: parts[0], two: parts[1] } }" -} -``` - -注意: JavaScript解析器必须完全解析数据,并在JS逻辑中以 `{key:value}` 格式返回。这意味着任何展平或解析多维值都必须在这里完成。 - -> [!WARNING] -> 默认情况下禁用基于JavaScript的功能。有关使用Druid的JavaScript功能的指南,包括如何启用它的说明,请参阅 [Druid JavaScript编程指南](../development/JavaScript.md)。 - -#### 时间和维度解析规范 - -与非字符串解析器一起使用,为它们提供时间戳和维度信息。非字符串解析器独立处理所有格式化决策,而不使用ParseSpec。 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `timeAndDims` | 是 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | - -#### Orc解析规范 - -与Hadoop ORC解析器一起使用来加载ORC文件 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `orc` | 否 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | -| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | - -#### Parquet解析规范 - -与Hadoop Parquet解析器一起使用来加载Parquet文件 - -| 字段 | 类型 | 描述 | 是否必填 | -|-|-|-|-| -| format | String | `parquet` | 否 | -| timestampSpec | JSON对象 | 指定timestamp的列和格式 | 是 | -| dimensionsSpec | JSON对象 | 指定数据的dimensions | 是 | -| flattenSpec | JSON对象 | 指定嵌套的JSON数据的展平配置,详情可见 [flattenSpec](#flattenspec) | 否 | \ No newline at end of file diff --git a/ingestion/datamanage.md b/ingestion/datamanage.md index 4075abb..e69de29 100644 --- a/ingestion/datamanage.md +++ b/ingestion/datamanage.md @@ -1,191 +0,0 @@ - - -## 数据管理 - -### schema更新 -数据源的schema可以随时更改,Apache Druid支持不同段之间的有不同的schema。 -#### 替换段文件 -Druid使用数据源、时间间隔、版本号和分区号唯一地标识段。只有在某个时间粒度内创建多个段时,分区号才在段id中可见。例如,如果有小时段,但一小时内的数据量超过单个段的容量,则可以在同一小时内创建多个段。这些段将共享相同的数据源、时间间隔和版本号,但具有线性增加的分区号。 -```json -foo_2015-01-01/2015-01-02_v1_0 -foo_2015-01-01/2015-01-02_v1_1 -foo_2015-01-01/2015-01-02_v1_2 -``` -在上面的示例段中,`dataSource`=`foo`,`interval`=`2015-01-01/2015-01-02`,version=`v1`,partitionNum=`0`。如果在以后的某个时间点,使用新的schema重新索引数据,则新创建的段将具有更高的版本id。 -```json -foo_2015-01-01/2015-01-02_v2_0 -foo_2015-01-01/2015-01-02_v2_1 -foo_2015-01-01/2015-01-02_v2_2 -``` -Druid批索引(基于Hadoop或基于IndexTask)保证了时间间隔内的原子更新。在我们的例子中,直到 `2015-01-01/2015-01-02` 的所有 `v2` 段加载到Druid集群中之前,查询只使用 `v1` 段, 当加载完所有v2段并可查询后,所有查询都将忽略 `v1` 段并切换到 `v2` 段。不久之后,`v1` 段将从集群中卸载。 - -请注意,跨越多个段间隔的更新在每个间隔内都是原子的。在整个更新过程中它们不是原子的。例如,您有如下段: -```json -foo_2015-01-01/2015-01-02_v1_0 -foo_2015-01-02/2015-01-03_v1_1 -foo_2015-01-03/2015-01-04_v1_2 -``` -`v2` 段将在构建后立即加载到集群中,并在段重叠的时间段内替换 `v1` 段。在完全加载 `v2` 段之前,集群可能混合了 `v1` 和 `v2` 段。 -```json -foo_2015-01-01/2015-01-02_v1_0 -foo_2015-01-02/2015-01-03_v2_1 -foo_2015-01-03/2015-01-04_v1_2 -``` -在这种情况下,查询可能会命中 `v1` 和 `v2` 段的混合。 -#### 在段中不同的schema -同一数据源的Druid段可能有不同的schema。如果一个字符串列(维度)存在于一个段中而不是另一个段中,则涉及这两个段的查询仍然有效。对缺少维度的段的查询将表现为该维度只有空值。类似地,如果一个段有一个数值列(metric),而另一个没有,那么查询缺少metric的段通常会"做正确的事情"。在此缺失的Metric上的使用聚合的行为类似于该Metric缺失。 -### 压缩与重新索引 -压缩合并是一种覆盖操作,它读取现有的一组段,将它们组合成一个具有较大但较少段的新集,并用新的压缩集覆盖原始集,而不更改它内部存储的数据。 - -出于性能原因,有时将一组段压缩为一组较大但较少的段是有益的,因为在接收和查询路径中都存在一些段处理和内存开销。 - -压缩任务合并给定时间间隔内的所有段。语法为: -```json -{ - "type": "compact", - "id": , - "dataSource": , - "ioConfig": , - "dimensionsSpec" , - "metricsSpec" , - "segmentGranularity": 
, - "tuningConfig" , - "context": -} -``` - -| 字段 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型,应该是 `compact` | 是 | -| `id` | 任务id | 否 | -| `dataSource` | 将被压缩合并的数据源名称 | 是 | -| `ioConfig` | 压缩合并任务的 `ioConfig`, 详情见 [Compaction ioConfig](#压缩合并的IOConfig) | 是 | -| `dimensionsSpec` | 自定义 `dimensionsSpec`。压缩任务将使用此dimensionsSpec(如果存在),而不是生成dimensionsSpec。更多细节见下文。| 否 | -| `metricsSpec` | 自定义 `metricsSpec`。如果指定了压缩任务,则压缩任务将使用此metricsSpec,而不是生成一个metricsSpec。| 否 | -| `segmentGranularity` | 如果设置了此值,压缩合并任务将更改给定时间间隔内的段粒度。有关详细信息,请参阅 [granularitySpec](ingestion.md#granularityspec) 的 `segmentGranularity`。行为见下表。 | 否 | -| `tuningConfig` | [并行索引任务的tuningConfig](native.md#tuningConfig) | 否 | -| `context` | [任务的上下文](taskrefer.md#上下文参数) | 否 | - -一个压缩合并任务的示例如下: -```json -{ - "type" : "compact", - "dataSource" : "wikipedia", - "ioConfig" : { - "type": "compact", - "inputSpec": { - "type": "interval", - "interval": "2017-01-01/2018-01-01" - } - } -} -``` - -压缩任务读取时间间隔 `2017-01-01/2018-01-01` 的*所有分段*,并生成新分段。由于 `segmentGranularity` 为空,压缩后原始的段粒度将保持不变。要控制每个时间块的结果段数,可以设置 [`maxRowsPerSegment`](../configuration/human-readable-byte.md#Coordinator) 或 [`numShards`](native.md#tuningconfig)。请注意,您可以同时运行多个压缩任务。例如,您可以每月运行12个compactionTasks,而不是一整年只运行一个任务。 - -压缩任务在内部生成 `index` 任务规范,用于使用某些固定参数执行的压缩工作。例如,它的 `inputSource` 始终是 [DruidInputSource](native.md#Druid输入源),`dimensionsSpec` 和 `metricsSpec` 默认包含输入段的所有Dimensions和Metrics。 - -如果指定的时间间隔中没有加载数据段(或者指定的时间间隔为空),则压缩任务将以失败状态代码退出,而不执行任何操作。 - -除非所有输入段具有相同的元数据,否则输出段可以具有与输入段不同的元数据。 - -* Dimensions: 由于Apache Druid支持schema更改,因此即使是同一个数据源的一部分,各个段之间的维度也可能不同。如果输入段具有不同的维度,则输出段基本上包括输入段的所有维度。但是,即使输入段具有相同的维度集,维度顺序或维度的数据类型也可能不同。例如,某些维度的数据类型可以从 `字符串` 类型更改为基本类型,或者可以更改维度的顺序以获得更好的局部性。在这种情况下,在数据类型和排序方面,最近段的维度先于旧段的维度。这是因为最近的段更有可能具有所需的新顺序和数据类型。如果要使用自己的顺序和类型,可以在压缩任务规范中指定自定义 `dimensionsSpec`。 -* Roll-up: 仅当为所有输入段设置了 `rollup` 时,才会汇总输出段。有关详细信息,请参见 [rollup](ingestion.md#rollup)。您可以使用 [段元数据查询](../querying/segmentMetadata.md) 检查段是否已被rollup。 - -#### 压缩合并的IOConfig -压缩IOConfig需要指定 `inputSpec`,如下所示。 - -| 字段 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型,固定为 `compact` | 是 | -| `inputSpec` | 输入规范 | 是 | - -目前有两种支持的 `inputSpec`: - -时间间隔 `inputSpec`: - -| 字段 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型,固定为 `interval` | 是 | -| `interval` | 需要合并压缩的时间间隔 | 是 | - -段 `inputSpec`: - -| 字段 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型,固定为 `segments` | 是 | -| `segments` | 段ID列表 | 是 | - -### 增加新的数据 - -Druid可以通过将新的段追加到现有的段集,来实现新数据插入到现有的数据源中。它还可以通过将现有段集与新数据合并并覆盖原始集来添加新数据。 - -Druid不支持按主键更新单个记录。 - -### 更新现有的数据 - -在数据源中摄取一段时间的数据并创建Apache Druid段之后,您可能需要对摄取的数据进行更改。有几种方法可以做到这一点。 - -#### 使用lookups - -如果有需要经常更新值的维度,请首先尝试使用 [lookups](../querying/lookups.md)。lookups的一个典型用例是,在Druid段中存储一个ID维度,并希望将ID维度映射到一个人类可读的字符串值,该字符串值可能需要定期更新。 - -#### 重新摄取数据 - -如果基于lookups的技术还不够,您需要将想更新的时间块的数据重新索引到Druid中。这可以在覆盖模式(默认模式)下使用 [批处理摄取](ingestion.md#批量摄取) 方法之一来完成。它也可以使用 [流式摄取](ingestion.md#流式摄取) 来完成,前提是您先删除相关时间块的数据。 - -如果在批处理模式下进行重新摄取,Druid的原子更新机制意味着查询将从旧数据无缝地转换到新数据。 - -我们建议保留一份原始数据的副本,以防您需要重新摄取它。 - -#### 使用基于Hadoop的摄取 - -本节假设读者理解如何使用Hadoop进行批量摄取。有关详细信息,请参见 [Hadoop批处理摄取](hadoop.md)。Hadoop批量摄取可用于重新索引数据和增量摄取数据。 - -Druid使用 `ioConfig` 中的 `inputSpec` 来知道要接收的数据位于何处以及如何读取它。对于简单的Hadoop批接收,`static` 或 `granularity` 粒度规范类型允许您读取存储在深层存储中的数据。 - -还有其他类型的 `inputSpec` 可以启用重新索引数据和增量接收数据。 - -#### 使用原生批摄取重新索引 - -本节假设读者了解如何使用 [原生批处理索引](native.md) 而不使用Hadoop的情况下执行批处理摄取(使用 `inputSource` 知道在何处以及如何读取输入数据)。[`DruidInputSource`](native.md#Druid输入源) 可以用来从Druid内部的段读取数据。请注意,**IndexTask**只用于原型设计,因为它必须在一个进程内完成所有处理,并且无法扩展。对于处理超过1GB数据的生产方案,请使用Hadoop批量摄取。 - -### 删除数据 - -Druid支持永久的将标记为"unused"状态(详情可见架构设计中的 
[段的生命周期](../design/Design.md#段生命周期))的段删除掉 - -杀死任务负责从元数据存储和深度存储中删除掉指定时间间隔内的不被使用的段 - -更多详细信息,可以看 [杀死任务](taskrefer.md#kill) - -永久删除一个段需要两步: -1. 段必须首先标记为"未使用"。当用户通过Coordinator API手动禁用段时,就会发生这种情况 -2. 在段被标记为"未使用"之后,一个Kill任务将从Druid的元数据存储和深层存储中删除任何“未使用”的段 - -对于数据保留规则的文档,可以详细看 [数据保留](../operations/retainingOrDropData.md) - -对于通过Coordinator API来禁用段的文档,可以详细看 [Coordinator数据源API](../operations/api.md#coordinator) - -在本文档中已经包含了一个删除删除的教程,请看 [数据删除教程](../tutorials/chapter-9.md) - -### 杀死任务 - -**杀死任务**删除段的所有信息并将其从深层存储中删除。在Druid的段表中,要杀死的段必须是未使用的(used==0)。可用语法为: -```json -{ - "type": "kill", - "id": , - "dataSource": , - "interval" : , - "context": -} -``` -### 数据保留 - -Druid支持保留规则,这些规则用于定义数据应保留的时间间隔和应丢弃数据的时间间隔。 - -Druid还支持将Historical进程分成不同的层,并且可以将保留规则配置为将特定时间间隔的数据分配给特定的层。 - -这些特性对于性能/成本管理非常有用;一个常见的场景是将Historical进程分为"热(hot)"层和"冷(cold)"层。 - -有关详细信息,请参阅 [加载规则](../operations/retainingOrDropData.md)。 diff --git a/ingestion/faq.md b/ingestion/faq.md index e416a44..34a176e 100644 --- a/ingestion/faq.md +++ b/ingestion/faq.md @@ -1,3 +1,117 @@ +--- +id: faq +title: "Ingestion troubleshooting FAQ" +sidebar_label: "Troubleshooting FAQ" +--- + + + +## Batch Ingestion + +If you are trying to batch load historical data but no events are being loaded, make sure the interval of your ingestion spec actually encapsulates the interval of your data. Events outside this interval are dropped. + +## Druid ingested my events but I they are not in my query results + +If the number of ingested events seem correct, make sure your query is correctly formed. If you included a `count` aggregator in your ingestion spec, you will need to query for the results of this aggregate with a `longSum` aggregator. Issuing a query with a count aggregator will count the number of Druid rows, which includes [roll-up](../design/index.md). + +## What types of data does Druid support? + +Druid can ingest JSON, CSV, TSV and other delimited data out of the box. Druid supports single dimension values, or multiple dimension values (an array of strings). Druid supports long, float, and double numeric columns. + +## Where do my Druid segments end up after ingestion? + +Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](../dependencies/deep-storage.md). Local disk is used as the default deep storage. + +## My stream ingest is not handing segments off + +First, make sure there are no exceptions in the logs of the ingestion process. Also make sure that `druid.storage.type` is set to a deep storage that isn't `local` if you are running a distributed cluster. + +Other common reasons that hand-off fails are as follows: + +1) Druid is unable to write to the metadata storage. Make sure your configurations are correct. + +2) Historical processes are out of capacity and cannot download any more segments. You'll see exceptions in the Coordinator logs if this occurs and the Coordinator console will show the Historicals are near capacity. + +3) Segments are corrupt and cannot be downloaded. You'll see exceptions in your Historical processes if this occurs. + +4) Deep storage is improperly configured. Make sure that your segment actually exists in deep storage and that the Coordinator logs have no errors. + +## How do I get HDFS to work? + +Make sure to include the `druid-hdfs-storage` and all the hadoop configuration, dependencies (that can be obtained by running command `hadoop classpath` on a machine where hadoop has been setup) in the classpath. 
And, provide necessary HDFS settings as described in [deep storage](../dependencies/deep-storage.md) . + +## How do I know when I can make query to Druid after submitting batch ingestion task? + +You can verify if segments created by a recent ingestion task are loaded onto historicals and available for querying using the following workflow. +1. Submit your ingestion task. +2. Repeatedly poll the [Overlord's tasks API](../operations/api-reference.md#tasks) ( `/druid/indexer/v1/task/{taskId}/status`) until your task is shown to be successfully completed. +3. Poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with + `forceMetadataRefresh=true` and `interval=` once. + (Note: `forceMetadataRefresh=true` refreshes Coordinator's metadata cache of all datasources. This can be a heavy operation in terms of the load on the metadata store but is necessary to make sure that we verify all the latest segments' load status) + If there are segments not yet loaded, continue to step 4, otherwise you can now query the data. +4. Repeatedly poll the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) (`/druid/coordinator/v1/datasources/{dataSourceName}/loadstatus`) with + `forceMetadataRefresh=false` and `interval=`. + Continue polling until all segments are loaded. Once all segments are loaded you can now query the data. + Note that this workflow only guarantees that the segments are available at the time of the [Segment Loading by Datasource API](../operations/api-reference.md#segment-loading-by-datasource) call. Segments can still become missing because of historical process failures or any other reasons afterward. + +## I don't see my Druid segments on my Historical processes + +You can check the Coordinator console located at `:`. Make sure that your segments have actually loaded on [Historical processes](../design/historical.md). If your segments are not present, check the Coordinator logs for messages about capacity of replication errors. One reason that segments are not downloaded is because Historical processes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example): + +``` +-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}] + ``` + +## My queries are returning empty results + +You can use a [segment metadata query](../querying/segmentmetadataquery.md) for the dimensions and metrics that have been created for your datasource. Make sure that the name of the aggregators you use in your query match one of these metrics. Also make sure that the query interval you specify match a valid time range where data exists. + +## How can I Reindex existing data in Druid with schema changes? + +You can use DruidInputSource with the [Parallel task](../ingestion/native-batch.md) to ingest existing druid segments using a new schema and change the name, dimensions, metrics, rollup, etc. of the segment. +See [DruidInputSource](../ingestion/native-batch.md#druid-input-source) for more details. +Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing. + +See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details. + +## How can I change the query granularity of existing data in Druid? 
+ +In a lot of situations you may want coarser granularity for older data. Example, any data older than 1 month has only hour level granularity but newer data has minute level granularity. This use case is same as re-indexing. + +To do this use the [DruidInputSource](../ingestion/native-batch.md#druid-input-source) and run a [Parallel task](../ingestion/native-batch.md). The DruidInputSource will allow you to take in existing segments from Druid and aggregate them and feed them back into Druid. It will also allow you to filter the data in those segments while feeding it back in. This means if there are rows you want to delete, you can just filter them away during re-ingestion. +Typically the above will be run as a batch job to say everyday feed in a chunk of data and aggregate it. +Or, if you use hadoop based ingestion, then you can use "dataSource" input spec to do reindexing. + +See the [Update existing data](../ingestion/data-management.md#update) section of the data management page for more details. + +You can also change the query granularity using compaction. See [Query granularity handling](../ingestion/compaction.md#query-granularity-handling). + +## Real-time ingestion seems to be stuck + +There are a few ways this can occur. Druid will throttle ingestion to prevent out of memory problems if the intermediate persists are taking too long or if hand-off is taking too long. If your process logs indicate certain columns are taking a very long time to build (for example, if your segment granularity is hourly, but creating a single column takes 30 minutes), you should re-evaluate your configuration or scale up your real-time ingestion. + +## More information + +Data ingestion for Druid can be difficult for first time users. Please don't hesitate to ask questions in the [Druid Forum](https://www.druidforum.org/). + + ## 数据摄取相关问题FAQ diff --git a/ingestion/hadoop.md b/ingestion/hadoop.md index 26c1216..920dcfe 100644 --- a/ingestion/hadoop.md +++ b/ingestion/hadoop.md @@ -1,3 +1,574 @@ +--- +id: hadoop +title: "Hadoop-based ingestion" +sidebar_label: "Hadoop-based" +--- + + + +Apache Hadoop-based batch ingestion in Apache Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running +instance of a Druid [Overlord](../design/overlord.md). Please refer to our [Hadoop-based vs. native batch comparison table](index.md#batch) for +comparisons between Hadoop-based, native batch (simple), and native batch (parallel) ingestion. + +To run a Hadoop-based ingestion task, write an ingestion spec as specified below. Then POST it to the +[`/druid/indexer/v1/task`](../operations/api-reference.md#tasks) endpoint on the Overlord, or use the +`bin/post-index-task` script included with Druid. + +## Tutorial + +This page contains reference documentation for Hadoop-based ingestion. +For a walk-through instead, check out the [Loading from Apache Hadoop](../tutorials/tutorial-batch-hadoop.md) tutorial. 
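+
+If you already have a spec file on hand, one way to reach the `/druid/indexer/v1/task` endpoint mentioned above is a plain `curl` POST. This is only a sketch: the file name and the Overlord host are placeholders, and 8090 is merely the default Overlord port.
+
+```bash
+# Submit the Hadoop ingestion spec in hadoop-index-task.json to the Overlord.
+curl -X POST -H 'Content-Type: application/json' \
+  -d @hadoop-index-task.json \
+  http://OVERLORD_IP:8090/druid/indexer/v1/task
+```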
+ +## Task syntax + +A sample task is shown below: + +```json +{ + "type" : "index_hadoop", + "spec" : { + "dataSchema" : { + "dataSource" : "wikipedia", + "parser" : { + "type" : "hadoopyString", + "parseSpec" : { + "format" : "json", + "timestampSpec" : { + "column" : "timestamp", + "format" : "auto" + }, + "dimensionsSpec" : { + "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], + "dimensionExclusions" : [], + "spatialDimensions" : [] + } + } + }, + "metricsSpec" : [ + { + "type" : "count", + "name" : "count" + }, + { + "type" : "doubleSum", + "name" : "added", + "fieldName" : "added" + }, + { + "type" : "doubleSum", + "name" : "deleted", + "fieldName" : "deleted" + }, + { + "type" : "doubleSum", + "name" : "delta", + "fieldName" : "delta" + } + ], + "granularitySpec" : { + "type" : "uniform", + "segmentGranularity" : "DAY", + "queryGranularity" : "NONE", + "intervals" : [ "2013-08-31/2013-09-01" ] + } + }, + "ioConfig" : { + "type" : "hadoop", + "inputSpec" : { + "type" : "static", + "paths" : "/MyDirectory/example/wikipedia_data.json" + } + }, + "tuningConfig" : { + "type": "hadoop" + } + }, + "hadoopDependencyCoordinates": +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|The task type, this should always be "index_hadoop".|yes| +|spec|A Hadoop Index Spec. See [Ingestion](../ingestion/index.md)|yes| +|hadoopDependencyCoordinates|A JSON array of Hadoop dependency coordinates that Druid will use, this property will override the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`|no| +|classpathPrefix|Classpath that will be prepended for the Peon process.|no| + +Also note that Druid automatically computes the classpath for Hadoop job containers that run in the Hadoop cluster. But in case of conflicts between Hadoop and Druid's dependencies, you can manually specify the classpath by setting `druid.extensions.hadoopContainerDruidClasspath` property. See the extensions config in [base druid configuration](../configuration/index.md#extensions). + +## `dataSchema` + +This field is required. See the [`dataSchema`](index.md#legacy-dataschema-spec) section of the main ingestion page for details on +what it should contain. + +## `ioConfig` + +This field is required. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|type|String|This should always be 'hadoop'.|yes| +|inputSpec|Object|A specification of where to pull the data in from. See below.|yes| +|segmentOutputPath|String|The path to dump segments into.|Only used by the [Command-line Hadoop indexer](#cli). This field must be null otherwise.| +|metadataUpdateSpec|Object|A specification of how to update the metadata for the druid cluster these segments belong to.|Only used by the [Command-line Hadoop indexer](#cli). This field must be null otherwise.| + +### `inputSpec` + +There are multiple types of inputSpecs: + +#### `static` + +A type of inputSpec where a static path to the data files is provided. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|inputFormat|String|Specifies the Hadoop InputFormat class to use. e.g. 
`org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` |no| +|paths|Array of String|A String of input paths indicating where the raw data is located.|yes| + +For example, using the static input paths: + +``` +"paths" : "hdfs://path/to/data/is/here/data.gz,hdfs://path/to/data/is/here/moredata.gz,hdfs://path/to/data/is/here/evenmoredata.gz" +``` + +You can also read from cloud storage such as AWS S3 or Google Cloud Storage. +To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_. +For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/). + +```bash +java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"; +cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/ +``` + +Once you install the Hadoop AWS module in all MiddleManager and Indexer processes, you can put +your S3 paths in the inputSpec with the below job properties. +For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/). + +``` +"paths" : "s3a://billy-bucket/the/data/is/here/data.gz,s3a://billy-bucket/the/data/is/here/moredata.gz,s3a://billy-bucket/the/data/is/here/evenmoredata.gz" +``` + +```json +"jobProperties" : { + "fs.s3a.impl" : "org.apache.hadoop.fs.s3a.S3AFileSystem", + "fs.AbstractFileSystem.s3a.impl" : "org.apache.hadoop.fs.s3a.S3A", + "fs.s3a.access.key" : "YOUR_ACCESS_KEY", + "fs.s3a.secret.key" : "YOUR_SECRET_KEY" +} +``` + +For Google Cloud Storage, you need to install [GCS connector jar](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md) +under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer processes_. +Once you install the GCS Connector jar in all MiddleManager and Indexer processes, you can put +your Google Cloud Storage paths in the inputSpec with the below job properties. +For more configurations, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop), +[GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml) +and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml). + +``` +"paths" : "gs://billy-bucket/the/data/is/here/data.gz,gs://billy-bucket/the/data/is/here/moredata.gz,gs://billy-bucket/the/data/is/here/evenmoredata.gz" +``` + +```json +"jobProperties" : { + "fs.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem", + "fs.AbstractFileSystem.gs.impl" : "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS" +} +``` + +#### `granularity` + +A type of inputSpec that expects data to be organized in directories according to datetime using the path format: `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (where date is represented by lowercase and time is represented by uppercase). + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|dataGranularity|String|Specifies the granularity to expect the data at, e.g. hour means to expect directories `y=XXXX/m=XX/d=XX/H=XX`.|yes| +|inputFormat|String|Specifies the Hadoop InputFormat class to use. e.g. 
`org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat` |no| +|inputPath|String|Base path to append the datetime path to.|yes| +|filePattern|String|Pattern that files should match to be included.|yes| +|pathFormat|String|Joda datetime format for each directory. Default value is `"'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"`, or see [Joda documentation](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html)|no| + +For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths: + +``` +s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00 +s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01 +... +s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23 +``` + +#### `dataSource` + +This is a type of `inputSpec` that reads data already stored inside Druid. This is used to allow "re-indexing" data and for "delta-ingestion" described later in `multi` type inputSpec. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|type|String.|This should always be 'dataSource'.|yes| +|ingestionSpec|JSON object.|Specification of Druid segments to be loaded. See below.|yes| +|maxSplitSize|Number|Enables combining multiple segments into single Hadoop InputSplit according to size of segments. With -1, druid calculates max split size based on user specified number of map task(mapred.map.tasks or mapreduce.job.maps). By default, one split is made for one segment. maxSplitSize is specified in bytes.|no| +|useNewAggs|Boolean|If "false", then list of aggregators in "metricsSpec" of hadoop indexing task must be same as that used in original indexing task while ingesting raw data. Default value is "false". This field can be set to "true" when "inputSpec" type is "dataSource" and not "multi" to enable arbitrary aggregators while reindexing. See below for "multi" type support for delta-ingestion.|no| + +Here is what goes inside `ingestionSpec`: + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|dataSource|String|Druid dataSource name from which you are loading the data.|yes| +|intervals|List|A list of strings representing ISO-8601 Intervals.|yes| +|segments|List|List of segments from which to read data from, by default it is obtained automatically. You can obtain list of segments to put here by making a POST query to Coordinator at url /druid/coordinator/v1/metadata/datasources/segments?full with list of intervals specified in the request payload, e.g. ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"]. You may want to provide this list manually in order to ensure that segments read are exactly same as they were at the time of task submission, task would fail if the list provided by the user does not match with state of database when the task actually runs.|no| +|filter|JSON|See [Filters](../querying/filters.md)|no| +|dimensions|Array of String|Name of dimension columns to load. By default, the list will be constructed from parseSpec. If parseSpec does not have an explicit list of dimensions then all the dimension columns present in stored data will be read.|no| +|metrics|Array of String|Name of metric columns to load. By default, the list will be constructed from the "name" of all the configured aggregators.|no| +|ignoreWhenNoSegments|boolean|Whether to ignore this ingestionSpec if no segments were found. 
Default behavior is to throw error when no segments were found.|no| + +For example + +```json +"ioConfig" : { + "type" : "hadoop", + "inputSpec" : { + "type" : "dataSource", + "ingestionSpec" : { + "dataSource": "wikipedia", + "intervals": ["2014-10-20T00:00:00Z/P2W"] + } + }, + ... +} +``` + +#### `multi` + +This is a composing inputSpec to combine other inputSpecs. This inputSpec is used for delta ingestion. You can also use a `multi` inputSpec to combine data from multiple dataSources. However, each particular dataSource can only be specified one time. +Note that, "useNewAggs" must be set to default value false to support delta-ingestion. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|children|Array of JSON objects|List of JSON objects containing other inputSpecs.|yes| + +For example: + +```json +"ioConfig" : { + "type" : "hadoop", + "inputSpec" : { + "type" : "multi", + "children": [ + { + "type" : "dataSource", + "ingestionSpec" : { + "dataSource": "wikipedia", + "intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"], + "segments": [ + { + "dataSource": "test1", + "interval": "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", + "version": "v2", + "loadSpec": { + "type": "local", + "path": "/tmp/index1.zip" + }, + "dimensions": "host", + "metrics": "visited_sum,unique_hosts", + "shardSpec": { + "type": "none" + }, + "binaryVersion": 9, + "size": 2, + "identifier": "test1_2000-01-01T00:00:00.000Z_3000-01-01T00:00:00.000Z_v2" + } + ] + } + }, + { + "type" : "static", + "paths": "/path/to/more/wikipedia/data/" + } + ] + }, + ... +} +``` + +It is STRONGLY RECOMMENDED to provide list of segments in `dataSource` inputSpec explicitly so that your delta ingestion task is idempotent. You can obtain that list of segments by making following call to the Coordinator. +POST `/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full` +Request Body: [interval1, interval2,...] for example ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000", "2012-01-05T00:00:00.000/2012-01-07T00:00:00.000"] + +## `tuningConfig` + +The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|workingPath|String|The working path to use for intermediate results (results between Hadoop jobs).|Only used by the [Command-line Hadoop indexer](#cli). The default is '/tmp/druid-indexing'. This field must be null otherwise.| +|version|String|The version of created segments. Ignored for HadoopIndexTask unless useExplicitVersion is set to true|no (default == datetime that indexing starts at)| +|partitionsSpec|Object|A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. See [`partitionsSpec`](#partitionsspec) below.|no (default == 'hashed')| +|maxRowsInMemory|Integer|The number of rows to aggregate before persisting. Note that this is the number of post-aggregation rows which may not be equal to the number of input events due to roll-up. This is used to manage the required JVM heap size. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|no (default == 1000000)| +|maxBytesInMemory|Long|The number of bytes to aggregate in heap memory before persisting. 
Normally this is computed internally and user does not need to set it. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). Note that `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.|no (default == One-sixth of max JVM memory)| +|leaveIntermediate|Boolean|Leave behind intermediate files (for debugging) in the workingPath when a job completes, whether it passes or fails.|no (default == false)| +|cleanupOnFailure|Boolean|Clean up intermediate files when a job fails (unless leaveIntermediate is on).|no (default == true)| +|overwriteFiles|Boolean|Override existing files found during indexing.|no (default == false)| +|ignoreInvalidRows|Boolean|DEPRECATED. Ignore rows found to have problems. If false, any exception encountered during parsing will be thrown and will halt ingestion; if true, unparseable rows and fields will be skipped. If `maxParseExceptions` is defined, this property is ignored.|no (default == false)| +|combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)| +|useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)| +|jobProperties|Object|A map of properties to add to the Hadoop job configuration, see below for details.|no (default == null)| +|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](index.md#indexspec) on the main ingestion page for more information.|no| +|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](index.md#indexspec) for possible values.|no (default = same as indexSpec)| +|numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and CPU usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)| +|forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs. Hash-based partitioning always uses an extendable shardSpec. For single-dimension partitioning, this option should be set to true to use an extendable shardSpec. For partitioning, please check [Partitioning specification](#partitionsspec). This option can be useful when you need to append more data to existing dataSource.|no (default = false)| +|useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default = false)| +|logParseExceptions|Boolean|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|no(default = false)| +|maxParseExceptions|Integer|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. 
Overrides `ignoreInvalidRows` if `maxParseExceptions` is defined.|no(default = unlimited)| +|useYarnRMJobStatusFallback|Boolean|If the Hadoop jobs created by the indexing task are unable to retrieve their completion status from the JobHistory server, and this parameter is true, the indexing task will try to fetch the application status from `http:///ws/v1/cluster/apps/`, where `` is the value of `yarn.resourcemanager.webapp.address` in your Hadoop configuration. This flag is intended as a fallback for cases where an indexing task's jobs succeed, but the JobHistory server is unavailable, causing the indexing task to fail because it cannot determine the job statuses.|no (default = true)| +|awaitSegmentAvailabilityTimeoutMillis|Long|Milliseconds to wait for the newly indexed segments to become available for query after ingestion completes. If `<= 0`, no wait will occur. If `> 0`, the task will wait for the Coordinator to indicate that the new segments are available for querying. If the timeout expires, the task will exit as successful, but the segments were not confirmed to have become available for query.|no (default = 0)| + +### `jobProperties` + +```json + "tuningConfig" : { + "type": "hadoop", + "jobProperties": { + "": "", + "": "" + } + } +``` + +Hadoop's [MapReduce documentation](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) lists the possible configuration parameters. + +With some Hadoop distributions, it may be necessary to set `mapreduce.job.classpath` or `mapreduce.job.user.classpath.first` +to avoid class loading issues. See the [working with different Hadoop versions documentation](../operations/other-hadoop.md) +for more details. + +## `partitionsSpec` + +Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in +some other way depending on partition type. Druid supports two types of partitioning strategies: `hashed` (based on the +hash of all dimensions in each row), and `single_dim` (based on ranges of a single dimension). + +Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly +sized data segments relative to single-dimension partitioning. + +### Hash-based partitioning + +```json + "partitionsSpec": { + "type": "hashed", + "targetRowsPerSegment": 5000000 + } +``` + +Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments +according to the hash of all dimensions in each row. The number of segments is determined automatically based on the +cardinality of the input set and a target partition size. + +The configuration options are: + +|Field|Description|Required| +|--------|-----------|---------| +|type|Type of partitionSpec to be used.|"hashed"| +|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB. Defaults to 5000000 if `numShards` is not set.|either this or `numShards`| +|targetPartitionSize|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|either this or `numShards`| +|maxRowsPerSegment|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|either this or `numShards`| +|numShards|Specify the number of partitions directly, instead of a target partition size. 
Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically.|either this or `maxRowsPerSegment`| +|partitionDimensions|The dimensions to partition on. Leave blank to select all dimensions. Only used with `numShards`, will be ignored when `targetRowsPerSegment` is set.|no| +|partitionFunction|A function to compute hash of partition dimensions. See [Hash partition function](#hash-partition-function)|`murmur3_32_abs`|no| + +##### Hash partition function + +In hash partitioning, the partition function is used to compute hash of partition dimensions. The partition dimension +values are first serialized into a byte array as a whole, and then the partition function is applied to compute hash of +the byte array. +Druid currently supports only one partition function. + +|name|description| +|----|-----------| +|`murmur3_32_abs`|Applies an absolute value function to the result of [`murmur3_32`](https://guava.dev/releases/16.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32()).| + +### Single-dimension range partitioning + +```json + "partitionsSpec": { + "type": "single_dim", + "targetRowsPerSegment": 5000000 + } +``` + +Single-dimension range partitioning works by first selecting a dimension to partition on, and then separating that dimension +into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example, +your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and +"f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can +override it with a specific dimension. + +The configuration options are: + +|Field|Description|Required| +|--------|-----------|---------| +|type|Type of partitionSpec to be used.|"single_dim"| +|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|yes| +|targetPartitionSize|Deprecated. Renamed to `targetRowsPerSegment`. Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|no| +|maxRowsPerSegment|Maximum number of rows to include in a partition. Defaults to 50% larger than the `targetRowsPerSegment`.|no| +|maxPartitionSize|Deprecated. Use `maxRowsPerSegment` instead. Maximum number of rows to include in a partition. Defaults to 50% larger than the `targetPartitionSize`.|no| +|partitionDimension|The dimension to partition on. Leave blank to select a dimension automatically.|no| +|assumeGrouped|Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated.|no| + +## Remote Hadoop clusters + +If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in your Druid `_common` configuration folder. + +If you are having dependency problems with your version of Hadoop and the version compiled with Druid, please see [these docs](../operations/other-hadoop.md). + +## Elastic MapReduce + +If your cluster is running on Amazon Web Services, you can use Elastic MapReduce (EMR) to index data +from S3. To do this: + +- Create a persistent, [long-running cluster](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient). +- When creating your cluster, enter the following configuration. 
If you're using the wizard, this + should be in advanced mode under "Edit software settings": + +``` +classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server -Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.map.java.opts=758,mapreduce.map.java.opts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000] +``` + +- Follow the instructions under + [Configure for connecting to Hadoop](../tutorials/cluster.md#hadoop) using the XML files from `/etc/hadoop/conf` + on your EMR master. + +## Kerberized Hadoop clusters + +By default druid can use the existing TGT kerberos ticket available in local kerberos key cache. +Although TGT ticket has a limited life cycle, +therefore you need to call `kinit` command periodically to ensure validity of TGT ticket. +To avoid this extra external cron job script calling `kinit` periodically, +you can provide the principal name and keytab location and druid will do the authentication transparently at startup and job launching time. + +|Property|Possible Values|Description|Default| +|--------|---------------|-----------|-------| +|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty| +|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty| + +### Loading from S3 with EMR + +- In the `jobProperties` field in the `tuningConfig` section of your Hadoop indexing task, add: + +``` +"jobProperties" : { + "fs.s3.awsAccessKeyId" : "YOUR_ACCESS_KEY", + "fs.s3.awsSecretAccessKey" : "YOUR_SECRET_KEY", + "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem", + "fs.s3n.awsAccessKeyId" : "YOUR_ACCESS_KEY", + "fs.s3n.awsSecretAccessKey" : "YOUR_SECRET_KEY", + "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem", + "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec" +} +``` + +Note that this method uses Hadoop's built-in S3 filesystem rather than Amazon's EMRFS, and is not compatible +with Amazon-specific features such as S3 encryption and consistent views. If you need to use these +features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the +mechanisms described in the [Using other Hadoop distributions](#using-other-hadoop-distributions) section. + +## Using other Hadoop distributions + +Druid works out of the box with many Hadoop distributions. + +If you are having dependency conflicts between Druid and your version of Hadoop, you can try +searching for a solution in the [Druid user groups](https://groups.google.com/forum/#!forum/druid-user), or reading the +Druid [Different Hadoop Versions](../operations/other-hadoop.md) documentation. + + + +## Command line (non-task) version + +To run: + +``` +java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*: org.apache.druid.cli.Main index hadoop +``` + +### Options + +- "--coordinate" - provide a version of Apache Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Apache Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`. 
+- "--no-default-hadoop" - don't pull down the default hadoop version + +### Spec file + +The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task. See [Hadoop Batch Ingestion](../ingestion/hadoop.md) for details on the spec format. + +In addition, a `metadataUpdateSpec` and `segmentOutputPath` field needs to be added to the ioConfig: + +``` + "ioConfig" : { + ... + "metadataUpdateSpec" : { + "type":"mysql", + "connectURI" : "jdbc:mysql://localhost:3306/druid", + "password" : "diurd", + "segmentTable" : "druid_segments", + "user" : "druid" + }, + "segmentOutputPath" : "/MyDirectory/data/index/output" + }, +``` + +and a `workingPath` field needs to be added to the tuningConfig: + +``` + "tuningConfig" : { + ... + "workingPath": "/tmp", + ... + } +``` + +#### Metadata Update Job Spec + +This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them. + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|type|String|"metadata" is the only value available.|yes| +|connectURI|String|A valid JDBC url to metadata storage.|yes| +|user|String|Username for db.|yes| +|password|String|password for db.|yes| +|segmentTable|String|Table to use in DB.|yes| + +These properties should parrot what you have configured for your [Coordinator](../design/coordinator.md). + +#### segmentOutputPath Config + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|segmentOutputPath|String|the path to dump segments into.|yes| + +#### workingPath Config + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|workingPath|String|the working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')| + +Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it, +you have to take caution to not override segments created by real-time processing (if you that a real-time pipeline set up). + + + # Hadoop 导入数据 Apache Hadoop-based batch ingestion in Apache Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running diff --git a/ingestion/kafka.md b/ingestion/kafka.md index 1880c18..01529b9 100644 --- a/ingestion/kafka.md +++ b/ingestion/kafka.md @@ -1,4 +1,3 @@ - ## Apache Kafka 摄取数据 Kafka索引服务支持在Overlord上配置*supervisors*,supervisors通过管理Kafka索引任务的创建和生存期来便于从Kafka摄取数据。这些索引任务使用Kafka自己的分区和偏移机制读取事件,因此能够保证只接收一次(**exactly-once**)。supervisor监视索引任务的状态,以便于协调切换、管理故障,并确保维护可伸缩性和复制要求。 diff --git a/ingestion/native-batch.md b/ingestion/native-batch.md new file mode 100644 index 0000000..1dd134f --- /dev/null +++ b/ingestion/native-batch.md @@ -0,0 +1,2960 @@ +--- +id: native-batch +title: "Native batch ingestion" +sidebar_label: "Native batch" +--- + + + + +Apache Druid currently has two types of native batch indexing tasks, `index_parallel` which can run +multiple tasks in parallel, and `index` which will run a single indexing task. Please refer to our +[Hadoop-based vs. native batch comparison table](index.md#batch) for comparisons between Hadoop-based, native batch +(simple), and native batch (parallel) ingestion. + +To run either kind of native batch indexing task, write an ingestion spec as specified below. Then POST it to the +[`/druid/indexer/v1/task`](../operations/api-reference.md#tasks) endpoint on the Overlord, or use the +`bin/post-index-task` script included with Druid. 
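The request body you POST to that endpoint is the task JSON itself. As a rough orientation, the skeleton below shows its overall shape (the `type` is either `index_parallel` or `index`; the elided sections are described throughout the rest of this page), and complete, runnable examples appear under each task type:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": { ... }
  }
}
```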
## Tutorial

This page contains reference documentation for native batch ingestion.
For a walk-through instead, check out the [Loading a file](../tutorials/tutorial-batch.md) tutorial, which
demonstrates the "simple" (single-task) mode.

## Parallel task

The Parallel task (type `index_parallel`) is a task for parallel batch indexing. This task uses only Druid's own resources and
does not depend on external systems such as Hadoop. The `index_parallel` task is a supervisor task that orchestrates
the whole indexing process. The supervisor task splits the input data and creates worker tasks to process those splits.
The created worker tasks are issued to the Overlord so that they can be scheduled and run on MiddleManagers or Indexers.
Once a worker task successfully processes its assigned input split, it reports the generated segment list to the supervisor task.
The supervisor task periodically checks the status of the worker tasks. If one of them fails, it retries the failed task
until the number of retries reaches the configured limit. If all worker tasks succeed, it publishes the reported segments at once and finalizes ingestion.

The detailed behavior of the Parallel task differs depending on the [`partitionsSpec`](#partitionsspec).
See each `partitionsSpec` below for more details.

To use this task, the [`inputSource`](#input-sources) in the `ioConfig` must be _splittable_ and `maxNumConcurrentSubTasks` must be set to a value larger than 1 in the `tuningConfig`.
Otherwise, this task runs sequentially; the `index_parallel` task reads each input file one by one and creates segments by itself.
The currently supported splittable input sources are:

- [`s3`](#s3-input-source) reads data from AWS S3 storage.
- [`gs`](#google-cloud-storage-input-source) reads data from Google Cloud Storage.
- [`azure`](#azure-input-source) reads data from Azure Blob Storage.
- [`hdfs`](#hdfs-input-source) reads data from HDFS storage.
- [`http`](#http-input-source) reads data from HTTP servers.
- [`local`](#local-input-source) reads data from local storage.
- [`druid`](#druid-input-source) reads data from a Druid datasource.
- [`sql`](#sql-input-source) reads data from an RDBMS source.

Some other cloud storage types are supported with the legacy [`firehose`](#firehoses-deprecated).
The following `firehose` type is also splittable. Note that only text formats are supported
with the `firehose`.

- [`static-cloudfiles`](../development/extensions-contrib/cloudfiles.md#firehose)

### Compression formats supported
The supported compression formats for native batch ingestion are `bz2`, `gz`, `xz`, `zip`, `sz` (Snappy), and `zst` (ZSTD).

### Implementation considerations

- You may want to control the amount of input data each worker task processes. This can be
  controlled using different configurations depending on the phase in parallel ingestion (see [`partitionsSpec`](#partitionsspec) for more details).
  For the tasks that read data from the `inputSource`, you can set the [Split hint spec](#split-hint-spec) in the `tuningConfig`.
  For the tasks that merge shuffled segments, you can set the `totalNumMergeTasks` in the `tuningConfig`.
- The number of concurrent worker tasks in parallel ingestion is determined by `maxNumConcurrentSubTasks` in the `tuningConfig`.
  The supervisor task checks the number of currently running worker tasks and creates more if it is smaller than `maxNumConcurrentSubTasks`,
  no matter how many task slots are currently available.
  This may affect the performance of other ingestion tasks. See the [Capacity planning](#capacity-planning) section below for more details.
- By default, batch ingestion replaces all data (in your `granularitySpec`'s intervals) in any segment that it writes to.
  If you'd like to add to the segment instead, set the `appendToExisting` flag in the `ioConfig`. Note that it only replaces
  data in segments where it actively adds data: if there are segments in your `granularitySpec`'s intervals that have
  no data written by this task, they will be left alone. If any existing segments partially overlap with the
  `granularitySpec`'s intervals, the portion of those segments outside the new segments' intervals will still be visible.
- You can set the `dropExisting` flag in the `ioConfig` to true if you want the ingestion task to drop all existing segments that
  start and end within your `granularitySpec`'s intervals. This applies whether or not the new data covers all existing segments.
  `dropExisting` only applies when `appendToExisting` is false and the `granularitySpec` contains an `interval`. WARNING: this
  functionality is still in beta and can result in temporary data unavailability for data within the specified `interval`.

  The following examples demonstrate when to set the `dropExisting` property to true in the `ioConfig` (a configuration sketch for Example 1 follows this list):

  - Example 1: Consider an existing segment with an interval of 2020-01-01 to 2021-01-01 and YEAR `segmentGranularity`. You want to
    overwrite the whole interval of 2020-01-01 to 2021-01-01 with new data using the finer `segmentGranularity` of MONTH.
    If the replacement data does not have a record for every month from 2020-01-01 to 2021-01-01,
    Druid cannot drop the original YEAR segment, even if the new data fully replaces it. Set `dropExisting` to true in this case to drop
    the original segment at year `segmentGranularity`, since you no longer need it.
  - Example 2: Consider the case where you want to re-ingest or overwrite a datasource and the new data does not contain some time intervals that exist
    in the datasource. For example, a datasource contains the following data at MONTH `segmentGranularity`:
    - January: 1 record
    - February: 10 records
    - March: 10 records

    You want to re-ingest and overwrite with new data as follows:
    - January: 0 records
    - February: 10 records
    - March: 9 records

    Unless you set `dropExisting` to true, the result after ingestion with overwrite using the same MONTH `segmentGranularity` would be:
    - January: 1 record
    - February: 10 records
    - March: 9 records

    This is incorrect because the new data has 0 records for January. Set `dropExisting` to true to drop the original
    segment for January, which is no longer needed since the newly ingested data has no records for January.
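As a concrete illustration of Example 1, the sketch below shows the relevant parts of a hypothetical `index_parallel` spec that re-ingests the 2020 interval at MONTH `segmentGranularity` with `dropExisting` enabled. The datasource name, input source, and paths are placeholders, and required fields such as `timestampSpec`, `dimensionsSpec`, and `metricsSpec` are elided:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      ...
      "granularitySpec": {
        "segmentGranularity": "MONTH",
        "intervals": ["2020-01-01/2021-01-01"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/replacement/data/",
        "filter": "*.json"
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false,
      "dropExisting": true
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

Because `appendToExisting` is false and the `granularitySpec` contains an interval, `dropExisting` takes effect: when the new MONTH segments are published, the original YEAR segment fully contained by that interval is dropped (marked unused).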
+ +### Task syntax + +A sample task is shown below: + +```json +{ + "type": "index_parallel", + "spec": { + "dataSchema": { + "dataSource": "wikipedia_parallel_index_test", + "timestampSpec": { + "column": "timestamp" + }, + "dimensionsSpec": { + "dimensions": [ + "page", + "language", + "user", + "unpatrolled", + "newPage", + "robot", + "anonymous", + "namespace", + "continent", + "country", + "region", + "city" + ] + }, + "metricsSpec": [ + { + "type": "count", + "name": "count" + }, + { + "type": "doubleSum", + "name": "added", + "fieldName": "added" + }, + { + "type": "doubleSum", + "name": "deleted", + "fieldName": "deleted" + }, + { + "type": "doubleSum", + "name": "delta", + "fieldName": "delta" + } + ], + "granularitySpec": { + "segmentGranularity": "DAY", + "queryGranularity": "second", + "intervals" : [ "2013-08-31/2013-09-02" ] + } + }, + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "baseDir": "examples/indexing/", + "filter": "wikipedia_index_data*" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningConfig": { + "type": "index_parallel", + "maxNumConcurrentSubTasks": 2 + } + } +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|The task type, this should always be `index_parallel`.|yes| +|id|The task ID. If this is not explicitly specified, Druid generates the task ID using task type, data source name, interval, and date-time stamp. |no| +|spec|The ingestion spec including the data schema, IOConfig, and TuningConfig. See below for more details. |yes| +|context|Context containing various task configuration parameters. See below for more details.|no| +|awaitSegmentAvailabilityTimeoutMillis|Long|Milliseconds to wait for the newly indexed segments to become available for query after ingestion completes. If `<= 0`, no wait will occur. If `> 0`, the task will wait for the Coordinator to indicate that the new segments are available for querying. If the timeout expires, the task will exit as successful, but the segments were not confirmed to have become available for query. Note for compaction tasks: you should not set this to a non-zero value because it is not supported by the compaction task type at this time.|no (default = 0)| + +### `dataSchema` + +This field is required. + +See [Ingestion Spec DataSchema](../ingestion/index.md#dataschema) + +If you specify `intervals` explicitly in your dataSchema's `granularitySpec`, batch ingestion will lock the full intervals +specified when it starts up, and you will learn quickly if the specified interval overlaps with locks held by other +tasks (e.g., Kafka ingestion). Otherwise, batch ingestion will lock each interval as it is discovered, so you may only +learn that the task overlaps with a higher-priority task later in ingestion. If you specify `intervals` explicitly, any +rows outside the specified intervals will be thrown away. We recommend setting `intervals` explicitly if you know the +time range of the data so that locking failure happens faster, and so that you don't accidentally replace data outside +that range if there's some stray data with unexpected timestamps. 
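For example, a `granularitySpec` like the following (values taken from the sample above as placeholders) causes the task to lock the full interval when it starts and to discard any rows whose timestamps fall outside it:

```json
"granularitySpec": {
  "segmentGranularity": "DAY",
  "queryGranularity": "second",
  "intervals": ["2013-08-31/2013-09-02"]
}
```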
+ +### `ioConfig` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|The task type, this should always be `index_parallel`.|none|yes| +|inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes| +|appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no| +|dropExisting|If `true` and `appendToExisting` is `false` and the `granularitySpec` contains an`interval`, then the ingestion task drops (mark unused) all existing segments fully contained by the specified `interval` when the task publishes new segments. If ingestion fails, Druid does not drop or mark unused any segments. In the case of misconfiguration where either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec`, Druid does not drop any segments even if `dropExisting` is `true`. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the specified `interval`.|false|no| + +### `tuningConfig` + +The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. See below for more details. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|The task type, this should always be `index_parallel`.|none|yes| +|maxRowsPerSegment|Deprecated. Use `partitionsSpec` instead. Used in sharding. Determines how many rows are in each segment.|5000000|no| +|maxRowsInMemory|Used in determining when intermediate persists to disk should occur. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|1000000|no| +|maxBytesInMemory|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). Note that `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.|1/6 of max JVM memory|no| +|maxColumnsToMerge|A parameter that limits how many segments can be merged in a single phase when merging segments for publishing. This limit is imposed on the total number of columns present in a set of segments being merged. If the limit is exceeded, segment merging will occur in multiple phases. At least 2 segments will be merged in a single phase, regardless of this setting.|-1 (unlimited)|no| +|maxTotalRows|Deprecated. Use `partitionsSpec` instead. Total number of rows in segments waiting for being pushed. Used in determining when intermediate pushing should occur.|20000000|no| +|numShards|Deprecated. Use `partitionsSpec` instead. 
Directly specify the number of shards to create when using a `hashed` `partitionsSpec`. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no| +|splitHintSpec|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](#split-hint-spec) for more details.|size-based split hint spec|no| +|partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` or `single_dim` if `forceGuaranteedRollup` = true|no| +|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no| +|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no| +|maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no| +|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, `intervals` in `granularitySpec` must be set and `hashed` or `single_dim` must be used for `partitionsSpec`. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no| +|reportParseExceptions|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|false|no| +|pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no| +|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no| +|maxNumConcurrentSubTasks|Maximum number of worker tasks which can be run in parallel at the same time. The supervisor task would spawn worker tasks up to `maxNumConcurrentSubTasks` regardless of the current available task slots. If this value is set to 1, the supervisor task processes data ingestion on its own instead of spawning worker tasks. If this value is set to too large, too many worker tasks can be created which might block other ingestion. 
Check [Capacity Planning](#capacity-planning) for more details.|1|no| +|maxRetry|Maximum number of retries on task failures.|3|no| +|maxNumSegmentsToMerge|Max limit for the number of segments that a single task can merge at the same time in the second phase. Used only `forceGuaranteedRollup` is set.|100|no| +|totalNumMergeTasks|Total number of tasks to merge segments in the merge phase when `partitionsSpec` is set to `hashed` or `single_dim`.|10|no| +|taskStatusCheckPeriodMs|Polling period in milliseconds to check running task statuses.|1000|no| +|chatHandlerTimeout|Timeout for reporting the pushed segments in worker tasks.|PT10S|no| +|chatHandlerNumRetries|Retries for reporting the pushed segments in worker tasks.|5|no| +|awaitSegmentAvailabilityTimeoutMillis|Long|Milliseconds to wait for the newly indexed segments to become available for query after ingestion completes. If `<= 0`, no wait will occur. If `> 0`, the task will wait for the Coordinator to indicate that the new segments are available for querying. If the timeout expires, the task will exit as successful, but the segments were not confirmed to have become available for query.|no (default = 0)| + +### Split Hint Spec + +The split hint spec is used to give a hint when the supervisor task creates input splits. +Note that each worker task processes a single input split. You can control the amount of data each worker task will read during the first phase. + +#### Size-based Split Hint Spec + +The size-based split hint spec is respected by all splittable input sources except for the HTTP input source and SQL input source. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `maxSize`.|none|yes| +|maxSplitSize|Maximum number of bytes of input files to process in a single subtask. If a single file is larger than this number, it will be processed by itself in a single subtask (Files are never split across tasks yet). Note that one subtask will not process more files than `maxNumFiles` even when their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no| +|maxNumFiles|Maximum number of input files to process in a single subtask. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxSplitSize` even when the total number of files is smaller than `maxNumFiles`.|1000|no| + +#### Segments Split Hint Spec + +The segments split hint spec is used only for [`DruidInputSource`](#druid-input-source) (and legacy [`IngestSegmentFirehose`](#ingestsegmentfirehose)). + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `segments`.|none|yes| +|maxInputSegmentBytesPerTask|Maximum number of bytes of input segments to process in a single subtask. If a single segment is larger than this number, it will be processed by itself in a single subtask (input segments are never split across tasks). Note that one subtask will not process more segments than `maxNumSegments` even when their total size is smaller than `maxInputSegmentBytesPerTask`. 
[Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no| +|maxNumSegments|Maximum number of input segments to process in a single subtask. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxInputSegmentBytesPerTask` even when the total number of segments is smaller than `maxNumSegments`.|1000|no| + +### `partitionsSpec` + +PartitionsSpec is used to describe the secondary partitioning method. +You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want. +For perfect rollup, you should use either `hashed` (partitioning based on the hash of dimensions in each row) or +`single_dim` (based on ranges of a single dimension). For best-effort rollup, you should use `dynamic`. + +The three `partitionsSpec` types have different characteristics. + +| PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Secondary partition pruning at query time | +|----------------|-----------------|---------------------|-----------------------|-------------------------------| +| `dynamic` | Fastest | Partitioning based on number of rows in segment. | Best-effort rollup | N/A | +| `hashed` | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.

Note that `partitionDimensions` must be set at ingestion time to enable secondary partition pruning at query time.| +| `single_dim` | Slowest | Range partitioning based on the value of the partition dimension. Segment sizes may be skewed depending on the partition key distribution. This may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows the range of `partitionDimension` values in each segment, given a query including a filter on the `partitionDimension`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimension` for query processing. | + +The recommended use case for each partitionsSpec is: +- If your data has a uniformly distributed column which is frequently used in your queries, +consider using `single_dim` partitionsSpec to maximize the performance of most of your queries. +- If your data doesn't have a uniformly distributed column, but is expected to have a [high rollup ratio](./index.md#maximizing-rollup-ratio) +when you roll up with some dimensions, consider using `hashed` partitionsSpec. +It could reduce the size of datasource and query latency by improving data locality. +- If the above two scenarios are not the case or you don't need to roll up your datasource, +consider using `dynamic` partitionsSpec. + +#### Dynamic partitioning + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `dynamic`|none|yes| +|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no| +|maxTotalRows|Total number of rows across all segments waiting for being pushed. Used in determining when intermediate segment push should occur.|20000000|no| + +With the Dynamic partitioning, the parallel index task runs in a single phase: +it will spawn multiple worker tasks (type `single_phase_sub_task`), each of which creates segments. +How the worker task creates segments is: + +- The task creates a new segment whenever the number of rows in the current segment exceeds + `maxRowsPerSegment`. +- Once the total number of rows in all segments across all time chunks reaches to `maxTotalRows`, + the task pushes all segments created so far to the deep storage and creates new ones. + +#### Hash-based partitioning + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `hashed`|none|yes| +|numShards|Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. This property and `targetRowsPerSegment` cannot both be set.|none|no| +|targetRowsPerSegment|A target row count for each partition. If `numShards` is left unspecified, the Parallel task will determine a partition count automatically such that each partition has a row count close to the target, assuming evenly distributed keys in the input data. A target per-segment row count of 5 million is used if both `numShards` and `targetRowsPerSegment` are null. |null (or 5,000,000 if both `numShards` and `targetRowsPerSegment` are null)|no| +|partitionDimensions|The dimensions to partition on. Leave blank to select all dimensions.|null|no| +|partitionFunction|A function to compute hash of partition dimensions. 
See [Hash partition function](#hash-partition-function)|`murmur3_32_abs`|no| + +The Parallel task with hash-based partitioning is similar to [MapReduce](https://en.wikipedia.org/wiki/MapReduce). +The task runs in up to 3 phases: `partial dimension cardinality`, `partial segment generation` and `partial segment merge`. +- The `partial dimension cardinality` phase is an optional phase that only runs if `numShards` is not specified. +The Parallel task splits the input data and assigns them to worker tasks based on the split hint spec. +Each worker task (type `partial_dimension_cardinality`) gathers estimates of partitioning dimensions cardinality for +each time chunk. The Parallel task will aggregate these estimates from the worker tasks and determine the highest +cardinality across all of the time chunks in the input data, dividing this cardinality by `targetRowsPerSegment` to +automatically determine `numShards`. +- In the `partial segment generation` phase, just like the Map phase in MapReduce, +the Parallel task splits the input data based on the split hint spec +and assigns each split to a worker task. Each worker task (type `partial_index_generate`) reads the assigned split, +and partitions rows by the time chunk from `segmentGranularity` (primary partition key) in the `granularitySpec` +and then by the hash value of `partitionDimensions` (secondary partition key) in the `partitionsSpec`. +The partitioned data is stored in local storage of +the [middleManager](../design/middlemanager.md) or the [indexer](../design/indexer.md). +- The `partial segment merge` phase is similar to the Reduce phase in MapReduce. +The Parallel task spawns a new set of worker tasks (type `partial_index_generic_merge`) to merge the partitioned data +created in the previous phase. Here, the partitioned data is shuffled based on +the time chunk and the hash value of `partitionDimensions` to be merged; each worker task reads the data +falling in the same time chunk and the same hash value from multiple MiddleManager/Indexer processes and merges +them to create the final segments. Finally, they push the final segments to the deep storage at once. + +##### Hash partition function + +In hash partitioning, the partition function is used to compute hash of partition dimensions. The partition dimension +values are first serialized into a byte array as a whole, and then the partition function is applied to compute hash of +the byte array. +Druid currently supports only one partition function. + +|name|description| +|----|-----------| +|`murmur3_32_abs`|Applies an absolute value function to the result of [`murmur3_32`](https://guava.dev/releases/16.0/api/docs/com/google/common/hash/Hashing.html#murmur3_32()).| + +#### Single-dimension range partitioning + +> Single dimension range partitioning is currently not supported in the sequential mode of the Parallel task. +The Parallel task will use one subtask when you set `maxNumConcurrentSubTasks` to 1. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `single_dim`|none|yes| +|partitionDimension|The dimension to partition on. 
Only rows with a single dimension value are allowed.|none|yes| +|targetRowsPerSegment|Target number of rows to include in a partition, should be a number that targets segments of 500MB\~1GB.|none|either this or `maxRowsPerSegment`| +|maxRowsPerSegment|Soft max for the number of rows to include in a partition.|none|either this or `targetRowsPerSegment`| +|assumeGrouped|Assume that input data has already been grouped on time and dimensions. Ingestion will run faster, but may choose sub-optimal partitions if this assumption is violated.|false|no| + +With `single-dim` partitioning, the Parallel task runs in 3 phases, +i.e., `partial dimension distribution`, `partial segment generation`, and `partial segment merge`. +The first phase is to collect some statistics to find +the best partitioning and the other 2 phases are to create partial segments +and to merge them, respectively, as in hash-based partitioning. +- In the `partial dimension distribution` phase, the Parallel task splits the input data and +assigns them to worker tasks based on the split hint spec. Each worker task (type `partial_dimension_distribution`) reads +the assigned split and builds a histogram for `partitionDimension`. +The Parallel task collects those histograms from worker tasks and finds +the best range partitioning based on `partitionDimension` to evenly +distribute rows across partitions. Note that either `targetRowsPerSegment` +or `maxRowsPerSegment` will be used to find the best partitioning. +- In the `partial segment generation` phase, the Parallel task spawns new worker tasks (type `partial_range_index_generate`) +to create partitioned data. Each worker task reads a split created as in the previous phase, +partitions rows by the time chunk from the `segmentGranularity` (primary partition key) in the `granularitySpec` +and then by the range partitioning found in the previous phase. +The partitioned data is stored in local storage of +the [middleManager](../design/middlemanager.md) or the [indexer](../design/indexer.md). +- In the `partial segment merge` phase, the parallel index task spawns a new set of worker tasks (type `partial_index_generic_merge`) to merge the partitioned +data created in the previous phase. Here, the partitioned data is shuffled based on +the time chunk and the value of `partitionDimension`; each worker task reads the segments +falling in the same partition of the same range from multiple MiddleManager/Indexer processes and merges +them to create the final segments. Finally, they push the final segments to the deep storage. + +> Because the task with single-dimension range partitioning makes two passes over the input +> in `partial dimension distribution` and `partial segment generation` phases, +> the task may fail if the input changes in between the two passes. + +### HTTP status endpoints + +The supervisor task provides some HTTP endpoints to get running status. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/mode` + +Returns 'parallel' if the indexing task is running in parallel. Otherwise, it returns 'sequential'. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/phase` + +Returns the name of the current phase if the task running in the parallel mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/progress` + +Returns the estimated progress of the current phase if the supervisor task is running in the parallel mode. 
+ +An example of the result is + +```json +{ + "running":10, + "succeeded":0, + "failed":0, + "complete":0, + "total":10, + "estimatedExpectedSucceeded":10 +} +``` + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtasks/running` + +Returns the task IDs of running worker tasks, or an empty list if the supervisor task is running in the sequential mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs` + +Returns all worker task specs, or an empty list if the supervisor task is running in the sequential mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/running` + +Returns running worker task specs, or an empty list if the supervisor task is running in the sequential mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/complete` + +Returns complete worker task specs, or an empty list if the supervisor task is running in the sequential mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}` + +Returns the worker task spec of the given id, or HTTP 404 Not Found error if the supervisor task is running in the sequential mode. + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/state` + +Returns the state of the worker task spec of the given id, or HTTP 404 Not Found error if the supervisor task is running in the sequential mode. +The returned result contains the worker task spec, a current task status if exists, and task attempt history. + +An example of the result is + +```json +{ + "spec": { + "id": "index_parallel_lineitem_2018-04-20T22:12:43.610Z_2", + "groupId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", + "supervisorTaskId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", + "context": null, + "inputSplit": { + "split": "/path/to/data/lineitem.tbl.5" + }, + "ingestionSpec": { + "dataSchema": { + "dataSource": "lineitem", + "timestampSpec": { + "column": "l_shipdate", + "format": "yyyy-MM-dd" + }, + "dimensionsSpec": { + "dimensions": [ + "l_orderkey", + "l_partkey", + "l_suppkey", + "l_linenumber", + "l_returnflag", + "l_linestatus", + "l_shipdate", + "l_commitdate", + "l_receiptdate", + "l_shipinstruct", + "l_shipmode", + "l_comment" + ] + }, + "metricsSpec": [ + { + "type": "count", + "name": "count" + }, + { + "type": "longSum", + "name": "l_quantity", + "fieldName": "l_quantity", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_extendedprice", + "fieldName": "l_extendedprice", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_discount", + "fieldName": "l_discount", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_tax", + "fieldName": "l_tax", + "expression": null + } + ], + "granularitySpec": { + "type": "uniform", + "segmentGranularity": "YEAR", + "queryGranularity": { + "type": "none" + }, + "rollup": true, + "intervals": [ + "1980-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z" + ] + }, + "transformSpec": { + "filter": null, + "transforms": [] + } + }, + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "baseDir": "/path/to/data/", + "filter": "lineitem.tbl.5" + }, + "inputFormat": { + "format": "tsv", + "delimiter": "|", + "columns": [ + "l_orderkey", + "l_partkey", + "l_suppkey", + "l_linenumber", + "l_quantity", + "l_extendedprice", + "l_discount", + "l_tax", + "l_returnflag", + "l_linestatus", + "l_shipdate", + 
"l_commitdate", + "l_receiptdate", + "l_shipinstruct", + "l_shipmode", + "l_comment" + ] + }, + "appendToExisting": false, + "dropExisting": false + }, + "tuningConfig": { + "type": "index_parallel", + "maxRowsPerSegment": 5000000, + "maxRowsInMemory": 1000000, + "maxTotalRows": 20000000, + "numShards": null, + "indexSpec": { + "bitmap": { + "type": "roaring" + }, + "dimensionCompression": "lz4", + "metricCompression": "lz4", + "longEncoding": "longs" + }, + "indexSpecForIntermediatePersists": { + "bitmap": { + "type": "roaring" + }, + "dimensionCompression": "lz4", + "metricCompression": "lz4", + "longEncoding": "longs" + }, + "maxPendingPersists": 0, + "reportParseExceptions": false, + "pushTimeout": 0, + "segmentWriteOutMediumFactory": null, + "maxNumConcurrentSubTasks": 4, + "maxRetry": 3, + "taskStatusCheckPeriodMs": 1000, + "chatHandlerTimeout": "PT10S", + "chatHandlerNumRetries": 5, + "logParseExceptions": false, + "maxParseExceptions": 2147483647, + "maxSavedParseExceptions": 0, + "forceGuaranteedRollup": false, + "buildV9Directly": true + } + } + }, + "currentStatus": { + "id": "index_sub_lineitem_2018-04-20T22:16:29.922Z", + "type": "index_sub", + "createdTime": "2018-04-20T22:16:29.925Z", + "queueInsertionTime": "2018-04-20T22:16:29.929Z", + "statusCode": "RUNNING", + "duration": -1, + "location": { + "host": null, + "port": -1, + "tlsPort": -1 + }, + "dataSource": "lineitem", + "errorMsg": null + }, + "taskHistory": [] +} +``` + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/history` + +Returns the task attempt history of the worker task spec of the given id, or HTTP 404 Not Found error if the supervisor task is running in the sequential mode. + +### Capacity planning + +The supervisor task can create up to `maxNumConcurrentSubTasks` worker tasks no matter how many task slots are currently available. +As a result, total number of tasks which can be run at the same time is `(maxNumConcurrentSubTasks + 1)` (including the supervisor task). +Please note that this can be even larger than total number of task slots (sum of the capacity of all workers). +If `maxNumConcurrentSubTasks` is larger than `n (available task slots)`, then +`maxNumConcurrentSubTasks` tasks are created by the supervisor task, but only `n` tasks would be started. +Others will wait in the pending state until any running task is finished. + +If you are using the Parallel Index Task with stream ingestion together, +we would recommend to limit the max capacity for batch ingestion to prevent +stream ingestion from being blocked by batch ingestion. Suppose you have +`t` Parallel Index Tasks to run at the same time, but want to limit +the max number of tasks for batch ingestion to `b`. Then, (sum of `maxNumConcurrentSubTasks` +of all Parallel Index Tasks + `t` (for supervisor tasks)) must be smaller than `b`. + +If you have some tasks of a higher priority than others, you may set their +`maxNumConcurrentSubTasks` to a higher value than lower priority tasks. +This may help the higher priority tasks to finish earlier than lower priority tasks +by assigning more task slots to them. + +## Simple task + +The simple task (type `index`) is designed to be used for smaller data sets. The task executes within the indexing service. 
+ +### Task syntax + +A sample task is shown below: + +```json +{ + "type" : "index", + "spec" : { + "dataSchema" : { + "dataSource" : "wikipedia", + "timestampSpec" : { + "column" : "timestamp", + "format" : "auto" + }, + "dimensionsSpec" : { + "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], + "dimensionExclusions" : [] + }, + "metricsSpec" : [ + { + "type" : "count", + "name" : "count" + }, + { + "type" : "doubleSum", + "name" : "added", + "fieldName" : "added" + }, + { + "type" : "doubleSum", + "name" : "deleted", + "fieldName" : "deleted" + }, + { + "type" : "doubleSum", + "name" : "delta", + "fieldName" : "delta" + } + ], + "granularitySpec" : { + "type" : "uniform", + "segmentGranularity" : "DAY", + "queryGranularity" : "NONE", + "intervals" : [ "2013-08-31/2013-09-01" ] + } + }, + "ioConfig" : { + "type" : "index", + "inputSource" : { + "type" : "local", + "baseDir" : "examples/indexing/", + "filter" : "wikipedia_data.json" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningConfig" : { + "type" : "index", + "maxRowsPerSegment" : 5000000, + "maxRowsInMemory" : 1000000 + } + } +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|The task type, this should always be `index`.|yes| +|id|The task ID. If this is not explicitly specified, Druid generates the task ID using task type, data source name, interval, and date-time stamp. |no| +|spec|The ingestion spec including the data schema, IOConfig, and TuningConfig. See below for more details. |yes| +|context|Context containing various task configuration parameters. See below for more details.|no| + +### `dataSchema` + +This field is required. + +See the [`dataSchema`](../ingestion/index.md#dataschema) section of the ingestion docs for details. + +If you do not specify `intervals` explicitly in your dataSchema's granularitySpec, the Local Index Task will do an extra +pass over the data to determine the range to lock when it starts up. If you specify `intervals` explicitly, any rows +outside the specified intervals will be thrown away. We recommend setting `intervals` explicitly if you know the time +range of the data because it allows the task to skip the extra pass, and so that you don't accidentally replace data outside +that range if there's some stray data with unexpected timestamps. + +### `ioConfig` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|The task type, this should always be "index".|none|yes| +|inputFormat|[`inputFormat`](./data-formats.md#input-format) to specify how to parse input data.|none|yes| +|appendToExisting|Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the `dynamic` partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error.|false|no| +|dropExisting|If `true` and `appendToExisting` is `false` and the `granularitySpec` contains an`interval`, then the ingestion task drops (mark unused) all existing segments fully contained by the specified `interval` when the task publishes new segments. If ingestion fails, Druid does not drop or mark unused any segments. 
In the case of misconfiguration where either `appendToExisting` is `true` or `interval` is not specified in `granularitySpec`, Druid does not drop any segments even if `dropExisting` is `true`. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the specified `interval`.|false|no| + +### `tuningConfig` + +The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. See below for more details. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|The task type, this should always be "index".|none|yes| +|maxRowsPerSegment|Deprecated. Use `partitionsSpec` instead. Used in sharding. Determines how many rows are in each segment.|5000000|no| +|maxRowsInMemory|Used in determining when intermediate persists to disk should occur. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|1000000|no| +|maxBytesInMemory|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). Note that `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.|1/6 of max JVM memory|no| +|maxTotalRows|Deprecated. Use `partitionsSpec` instead. Total number of rows in segments waiting for being pushed. Used in determining when intermediate pushing should occur.|20000000|no| +|numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no| +|partitionDimensions|Deprecated. Use `partitionsSpec` instead. The dimensions to partition on. Leave blank to select all dimensions. Only used with `forceGuaranteedRollup` = true, will be ignored otherwise.|null|no| +|partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` if `forceGuaranteedRollup` = true|no| +|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no| +|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no| +|maxPendingPersists|Maximum number of persists that can be pending but not started. 
If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no| +|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: one for finding the optimal number of partitions per time chunk and one for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no| +|reportParseExceptions|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|false|no| +|pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no| +|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no| +|logParseExceptions|If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred.|false|no| +|maxParseExceptions|The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if `reportParseExceptions` is set.|unlimited|no| +|maxSavedParseExceptions|When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the [task completion report](tasks.md#task-reports). Overridden if `reportParseExceptions` is set.|0|no| + +### `partitionsSpec` + +PartitionsSpec is to describe the secondary partitioning method. +You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want. +For perfect rollup, you should use `hashed`. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `hashed`|none|yes| +|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no| +|numShards|Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no| +|partitionDimensions|The dimensions to partition on. Leave blank to select all dimensions.|null|no| +|partitionFunction|A function to compute hash of partition dimensions. See [Hash partition function](#hash-partition-function)|`murmur3_32_abs`|no| + +For best-effort rollup, you should use `dynamic`. 
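+
+To make the difference concrete, a `tuningConfig` for the simple task that requests perfect rollup with hash partitioning might look like the following sketch; the shard count and the choice of `partitionDimensions` are illustrative only and should be tuned to your data, while a `dynamic` spec, described in the table below, needs only the row limits.
+
+```json
+"tuningConfig" : {
+  "type" : "index",
+  "forceGuaranteedRollup" : true,
+  "partitionsSpec" : {
+    "type" : "hashed",
+    "numShards" : 4,
+    "partitionDimensions" : ["page", "language"],
+    "partitionFunction" : "murmur3_32_abs"
+  }
+}
+```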
+ +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should always be `dynamic`|none|yes| +|maxRowsPerSegment|Used in sharding. Determines how many rows are in each segment.|5000000|no| +|maxTotalRows|Total number of rows in segments waiting for being pushed.|20000000|no| + +### `segmentWriteOutMediumFactory` + +|Field|Type|Description|Required| +|-----|----|-----------|--------| +|type|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../configuration/index.md#segmentwriteoutmediumfactory) for explanation and available options.|yes| + +### Segment pushing modes + +While ingesting data using the Index task, it creates segments from the input data and pushes them. For segment pushing, +the Index task supports two segment pushing modes, i.e., _bulk pushing mode_ and _incremental pushing mode_ for +[perfect rollup and best-effort rollup](../ingestion/index.md#rollup), respectively. + +In the bulk pushing mode, every segment is pushed at the very end of the index task. Until then, created segments +are stored in the memory and local storage of the process running the index task. As a result, this mode might cause a +problem due to limited storage capacity, and is not recommended to use in production. + +On the contrary, in the incremental pushing mode, segments are incrementally pushed, that is they can be pushed +in the middle of the index task. More precisely, the index task collects data and stores created segments in the memory +and disks of the process running that task until the total number of collected rows exceeds `maxTotalRows`. Once it exceeds, +the index task immediately pushes all segments created until that moment, cleans all pushed segments up, and +continues to ingest remaining data. + +To enable bulk pushing mode, `forceGuaranteedRollup` should be set in the TuningConfig. Note that this option cannot +be used with `appendToExisting` of IOConfig. + +## Input Sources + +The input source is the place to define from where your index task reads data. +Only the native Parallel task and Simple task support the input source. + +### S3 Input Source + +> You need to include the [`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension to use the S3 input source. + +The S3 input source is to support reading objects directly from S3. +Objects can be specified either via a list of S3 URI strings or a list of +S3 location prefixes, which will attempt to list the contents and ingest +all objects contained in the locations. The S3 input source is splittable +and can be used by the [Parallel task](#parallel-task), +where each worker task of `index_parallel` will read one or multiple objects. + +Sample specs: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "prefixes": ["s3://foo/bar/", "s3://bar/foo/"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "objects": [ + { "bucket": "foo", "path": "bar/file1.json"}, + { "bucket": "bar", "path": "foo/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... 
+``` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `s3`.|None|yes| +|uris|JSON array of URIs where S3 objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set| +|prefixes|JSON array of URI prefixes for the locations of S3 objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set| +|objects|JSON array of S3 Objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set| +|properties|Properties Object for overriding the default S3 configuration. See below for more information.|None|No (defaults will be used if not given) + +Note that the S3 input source will skip all empty objects only when `prefixes` is specified. + +S3 Object: + +|property|description|default|required?| +|--------|-----------|-------|---------| +|bucket|Name of the S3 bucket|None|yes| +|path|The path where data is located.|None|yes| + +Properties Object: + +|property|description|default|required?| +|--------|-----------|-------|---------| +|accessKeyId|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 InputSource's access key|None|yes if secretAccessKey is given| +|secretAccessKey|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 InputSource's secret key|None|yes if accessKeyId is given| + +**Note :** *If accessKeyId and secretAccessKey are not given, the default [S3 credentials provider chain](../development/extensions-core/s3.md#s3-authentication-methods) is used.* + +### Google Cloud Storage Input Source + +> You need to include the [`druid-google-extensions`](../development/extensions-core/google.md) as an extension to use the Google Cloud Storage input source. + +The Google Cloud Storage input source is to support reading objects directly +from Google Cloud Storage. Objects can be specified as list of Google +Cloud Storage URI strings. The Google Cloud Storage input source is splittable +and can be used by the [Parallel task](#parallel-task), where each worker task of `index_parallel` will read +one or multiple objects. + +Sample specs: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "uris": ["gs://foo/bar/file.json", "gs://bar/foo/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "prefixes": ["gs://foo/bar/", "gs://bar/foo/"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "objects": [ + { "bucket": "foo", "path": "bar/file1.json"}, + { "bucket": "bar", "path": "foo/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `google`.|None|yes| +|uris|JSON array of URIs where Google Cloud Storage objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set| +|prefixes|JSON array of URI prefixes for the locations of Google Cloud Storage objects to be ingested. 
Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set| +|objects|JSON array of Google Cloud Storage objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set| + +Note that the Google Cloud Storage input source will skip all empty objects only when `prefixes` is specified. + +Google Cloud Storage object: + +|property|description|default|required?| +|--------|-----------|-------|---------| +|bucket|Name of the Google Cloud Storage bucket|None|yes| +|path|The path where data is located.|None|yes| + +### Azure Input Source + +> You need to include the [`druid-azure-extensions`](../development/extensions-core/azure.md) as an extension to use the Azure input source. + +The Azure input source is to support reading objects directly from Azure Blob store. Objects can be +specified as list of Azure Blob store URI strings. The Azure input source is splittable and can be used +by the [Parallel task](#parallel-task), where each worker task of `index_parallel` will read +a single object. + +Sample specs: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "uris": ["azure://container/prefix1/file.json", "azure://container/prefix2/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "prefixes": ["azure://container/prefix1/", "azure://container/prefix2/"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "objects": [ + { "bucket": "container", "path": "prefix1/file1.json"}, + { "bucket": "container", "path": "prefix2/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `azure`.|None|yes| +|uris|JSON array of URIs where Azure Blob objects to be ingested are located. Should be in form "azure://\/\"|None|`uris` or `prefixes` or `objects` must be set| +|prefixes|JSON array of URI prefixes for the locations of Azure Blob objects to be ingested. Should be in the form "azure://\/\". Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set| +|objects|JSON array of Azure Blob objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set| + +Note that the Azure input source will skip all empty objects only when `prefixes` is specified. + +Azure Blob object: + +|property|description|default|required?| +|--------|-----------|-------|---------| +|bucket|Name of the Azure Blob Storage container|None|yes| +|path|The path where data is located.|None|yes| + +### HDFS Input Source + +> You need to include the [`druid-hdfs-storage`](../development/extensions-core/hdfs.md) as an extension to use the HDFS input source. + +The HDFS input source is to support reading files directly +from HDFS storage. File paths can be specified as an HDFS URI string or a list +of HDFS URI strings. The HDFS input source is splittable and can be used by the [Parallel task](#parallel-task), +where each worker task of `index_parallel` will read one or multiple files. + +Sample specs: + +```json +... 
+
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "hdfs",
+      "paths": "hdfs://namenode_host/foo/bar/,hdfs://namenode_host/bar/foo"
+    },
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+  },
+...
+```
+
+```json
+...
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "hdfs",
+      "paths": ["hdfs://namenode_host/foo/bar/", "hdfs://namenode_host/bar/foo"]
+    },
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+  },
+...
+```
+
+```json
+...
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "hdfs",
+      "paths": "hdfs://namenode_host/foo/bar/file.json,hdfs://namenode_host/bar/foo/file2.json"
+    },
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+  },
+...
+```
+
+```json
+...
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "hdfs",
+      "paths": ["hdfs://namenode_host/foo/bar/file.json", "hdfs://namenode_host/bar/foo/file2.json"]
+    },
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+  },
+...
+```
+
+|property|description|default|required?|
+|--------|-----------|-------|---------|
+|type|This should be `hdfs`.|None|yes|
+|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths. Empty files located under one of the given paths will be skipped.|None|yes|
+
+You can also ingest from other storage using the HDFS input source if the HDFS client supports that storage.
+However, if you want to ingest from cloud storage, consider using the service-specific input source for your data storage.
+If you want to use a non-hdfs protocol with the HDFS input source, include the protocol
+in `druid.ingestion.hdfs.allowedProtocols`. See [HDFS input source security configuration](../configuration/index.md#hdfs-input-source) for more details.
+
+### HTTP Input Source
+
+The HTTP input source is to support reading files directly from remote sites via HTTP.
+
+> **NOTE:** Ingestion tasks run under the operating system account that runs the Druid processes, for example the Indexer, Middle Manager, and Peon. This means any user who can submit an ingestion task can specify an `HTTPInputSource` at any location where the Druid process has permissions. For example, using `HTTPInputSource`, a console user has access to internal network locations where they would otherwise be denied access.
+
+> **WARNING:** `HTTPInputSource` is not limited to the HTTP or HTTPS protocols. It uses the Java `URI` class that supports HTTP, HTTPS, FTP, file, and jar protocols by default. This means you should never run Druid under the `root` account, because a user can use the file protocol to access any files on the local disk.
+
+For more information about security best practices, see [Security overview](../operations/security-overview.md#best-practices).
+
+The HTTP input source is _splittable_ and can be used by the [Parallel task](#parallel-task),
+where each worker task of `index_parallel` will read only one file. This input source does not support Split Hint Spec.
+
+Sample specs:
+
+```json
+...
+  "ioConfig": {
+    "type": "index_parallel",
+    "inputSource": {
+      "type": "http",
+      "uris": ["http://example.com/uri1", "http://example2.com/uri2"]
+    },
+    "inputFormat": {
+      "type": "json"
+    },
+    ...
+  },
+...
+```
+
+Example with authentication fields using the DefaultPassword provider (this requires the password to be in the ingestion spec):
+
+```json
+...
+ "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": "password123" + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +You can also use the other existing Druid PasswordProviders. Here is an example using the EnvironmentVariablePasswordProvider: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": { + "type": "environment", + "variable": "HTTP_INPUT_SOURCE_PW" + } + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +} +``` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `http`|None|yes| +|uris|URIs of the input files. See below for the protocols allowed for URIs.|None|yes| +|httpAuthenticationUsername|Username to use for authentication with specified URIs. Can be optionally used if the URIs specified in the spec require a Basic Authentication Header.|None|no| +|httpAuthenticationPassword|PasswordProvider to use with specified URIs. Can be optionally used if the URIs specified in the spec require a Basic Authentication Header.|None|no| + +You can only use protocols listed in the `druid.ingestion.http.allowedProtocols` property as HTTP input sources. +The `http` and `https` protocols are allowed by default. See [HTTP input source security configuration](../configuration/index.md#http-input-source) for more details. + +### Inline Input Source + +The Inline input source can be used to read the data inlined in its own spec. +It can be used for demos or for quickly testing out parsing and schema. + +Sample spec: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "inline", + "data": "0,values,formatted\n1,as,CSV" + }, + "inputFormat": { + "type": "csv" + }, + ... + }, +... +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "inline".|yes| +|data|Inlined data to ingest.|yes| + +### Local Input Source + +The Local input source is to support reading files directly from local storage, +and is mainly intended for proof-of-concept testing. +The Local input source is _splittable_ and can be used by the [Parallel task](#parallel-task), +where each worker task of `index_parallel` will read one or multiple files. + +Sample spec: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "filter" : "*.csv", + "baseDir": "/data/directory", + "files": ["/bar/foo", "/foo/bar"] + }, + "inputFormat": { + "type": "csv" + }, + ... + }, +... +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "local".|yes| +|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information.|yes if `baseDir` is specified| +|baseDir|Directory to search recursively for files to be ingested. Empty files under the `baseDir` will be skipped.|At least one of `baseDir` or `files` should be specified| +|files|File paths to ingest. Some files can be ignored to avoid ingesting duplicate files if they are located under the specified `baseDir`. 
Empty files will be skipped.|At least one of `baseDir` or `files` should be specified| + +### Druid Input Source + +The Druid input source is to support reading data directly from existing Druid segments, +potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment. +The Druid input source is _splittable_ and can be used by the [Parallel task](#parallel-task). +This input source has a fixed input format for reading from Druid segments; +no `inputFormat` field needs to be specified in the ingestion spec when using this input source. + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "druid".|yes| +|dataSource|A String defining the Druid datasource to fetch rows from|yes| +|interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes| +|filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no| + +The Druid input source can be used for a variety of purposes, including: + +- Creating new datasources that are rolled-up copies of existing datasources. +- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance. +- Updating or removing rows using a [`transformSpec`](index.md#transformspec). + +When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number +of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you +want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time` +and the format to `auto` or `millis`. + +It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the +previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a +good idea to test out your reindexing by writing to a separate datasource before overwriting your main one. +Alternatively, if your goals can be satisfied by [compaction](compaction.md), consider that instead as a simpler +approach. + +An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new +rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and "page". + +```json +{ + "type": "index_parallel", + "spec": { + "dataSchema": { + "dataSource": "wikipedia_rollup", + "timestampSpec": { + "column": "__time", + "format": "millis" + }, + "dimensionsSpec": { + "dimensions": [ + "countryName", + "page" + ] + }, + "metricsSpec": [ + { + "type": "count", + "name": "cnt" + } + ], + "granularitySpec": { + "type": "uniform", + "queryGranularity": "HOUR", + "segmentGranularity": "DAY", + "intervals": ["2016-06-27/P1D"], + "rollup": true + } + }, + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "druid", + "dataSource": "wikipedia_raw", + "interval": "2016-06-27/P1D" + } + }, + "tuningConfig": { + "type": "index_parallel", + "partitionsSpec": { + "type": "hashed" + }, + "forceGuaranteedRollup": true, + "maxNumConcurrentSubTasks": 1 + } + } +} +``` + +> Note: Older versions (0.19 and earlier) did not respect the timestampSpec when using the Druid input source. 
If you
+> have ingestion specs that rely on this and cannot rewrite them, set
+> [`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`](../configuration/index.md#indexer-general-configuration)
+> to `true` to enable a compatibility mode where the timestampSpec is ignored.
+
+### SQL Input Source
+
+The SQL input source is used to read data directly from an RDBMS.
+The SQL input source is _splittable_ and can be used by the [Parallel task](#parallel-task), where each worker task reads one SQL query from the list of queries.
+This input source does not support Split Hint Spec.
+Since this input source has a fixed input format for reading events, no `inputFormat` field needs to be specified in the ingestion spec when using this input source.
+Please refer to the Recommended practices section below before using this input source.
+
+|property|description|required?|
+|--------|-----------|---------|
+|type|This should be "sql".|Yes|
+|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support. The specified extension must be loaded into Druid:

  • [mysql-metadata-storage](../development/extensions-core/mysql.md) for `mysql`
  • [postgresql-metadata-storage](../development/extensions-core/postgresql.md) for `postgresql`.

You can selectively allow JDBC properties in `connectURI`. See [JDBC connections security config](../configuration/index.md#jdbc-connections-to-external-databases) for more details.|Yes| +|foldCase|Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.|No| +|sqls|List of SQL queries where each SQL query would retrieve the data to be indexed.|Yes| + +An example SqlInputSource spec is shown below: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "sql", + "database": { + "type": "mysql", + "connectorConfig": { + "connectURI": "jdbc:mysql://host:port/schema", + "user": "user", + "password": "password" + } + }, + "sqls": ["SELECT * FROM table1 WHERE timestamp BETWEEN '2013-01-01 00:00:00' AND '2013-01-01 11:59:59'", "SELECT * FROM table2 WHERE timestamp BETWEEN '2013-01-01 00:00:00' AND '2013-01-01 11:59:59'"] + } + }, +... +``` + +The spec above will read all events from two separate SQLs for the interval `2013-01-01/2013-01-02`. +Each of the SQL queries will be run in its own sub-task and thus for the above example, there would be two sub-tasks. + +**Recommended practices** + +Compared to the other native batch InputSources, SQL InputSource behaves differently in terms of reading the input data and so it would be helpful to consider the following points before using this InputSource in a production environment: + +* During indexing, each sub-task would execute one of the SQL queries and the results are stored locally on disk. The sub-tasks then proceed to read the data from these local input files and generate segments. Presently, there isn’t any restriction on the size of the generated files and this would require the MiddleManagers or Indexers to have sufficient disk capacity based on the volume of data being indexed. + +* Filtering the SQL queries based on the intervals specified in the `granularitySpec` can avoid unwanted data being retrieved and stored locally by the indexing sub-tasks. For example, if the `intervals` specified in the `granularitySpec` is `["2013-01-01/2013-01-02"]` and the SQL query is `SELECT * FROM table1`, `SqlInputSource` will read all the data for `table1` based on the query, even though only data between the intervals specified will be indexed into Druid. + +* Pagination may be used on the SQL queries to ensure that each query pulls a similar amount of data, thereby improving the efficiency of the sub-tasks. + +* Similar to file-based input formats, any updates to existing data will replace the data in segments specific to the intervals specified in the `granularitySpec`. + + +### Combining Input Source + +The Combining input source is used to read data from multiple InputSources. This input source should be only used if all the delegate input sources are + _splittable_ and can be used by the [Parallel task](#parallel-task). This input source will identify the splits from its delegates and each split will be processed by a worker task. Similar to other input sources, this input source supports a single `inputFormat`. Therefore, please note that delegate input sources requiring an `inputFormat` must have the same format for input data. + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "combining".|Yes| +|delegates|List of _splittable_ InputSources to read data from.|Yes| + +Sample spec: + + +```json +... 
+ "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "combining", + "delegates" : [ + { + "type": "local", + "filter" : "*.csv", + "baseDir": "/data/directory", + "files": ["/bar/foo", "/foo/bar"] + }, + { + "type": "druid", + "dataSource": "wikipedia", + "interval": "2013-01-01/2013-01-02" + } + ] + }, + "inputFormat": { + "type": "csv" + }, + ... + }, +... +``` + + +### + +## Firehoses (Deprecated) + +Firehoses are deprecated in 0.17.0. It's highly recommended to use the [Input source](#input-sources) instead. +There are several firehoses readily available in Druid, some are meant for examples, others can be used directly in a production environment. + +### StaticS3Firehose + +> You need to include the [`druid-s3-extensions`](../development/extensions-core/s3.md) as an extension to use the StaticS3Firehose. + +This firehose ingests events from a predefined list of S3 objects. +This firehose is _splittable_ and can be used by the [Parallel task](#parallel-task). +Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object. + +Sample spec: + +```json +"firehose" : { + "type" : "static-s3", + "uris": ["s3://foo/bar/file.gz", "s3://bar/foo/file2.gz"] +} +``` + +This firehose provides caching and prefetching features. In the Simple task, a firehose can be read twice if intervals or +shardSpecs are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scan of objects is slow. +Note that prefetching or caching isn't that useful in the Parallel task. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `static-s3`.|None|yes| +|uris|JSON array of URIs where s3 files to be ingested are located.|None|`uris` or `prefixes` must be set| +|prefixes|JSON array of URI prefixes for the locations of s3 files to be ingested.|None|`uris` or `prefixes` must be set| +|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no| +|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no| +|prefetchTriggerBytes|Threshold to trigger prefetching s3 objects.|maxFetchCapacityBytes / 2|no| +|fetchTimeout|Timeout for fetching an s3 object.|60000|no| +|maxFetchRetry|Maximum retry for fetching an s3 object.|3|no| + +#### StaticGoogleBlobStoreFirehose + +> You need to include the [`druid-google-extensions`](../development/extensions-core/google.md) as an extension to use the StaticGoogleBlobStoreFirehose. + +This firehose ingests events, similar to the StaticS3Firehose, but from an Google Cloud Store. + +As with the S3 blobstore, it is assumed to be gzipped if the extension ends in .gz + +This firehose is _splittable_ and can be used by the [Parallel task](#parallel-task). +Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object. + +Sample spec: + +```json +"firehose" : { + "type" : "static-google-blobstore", + "blobs": [ + { + "bucket": "foo", + "path": "/path/to/your/file.json" + }, + { + "bucket": "bar", + "path": "/another/path.json" + } + ] +} +``` + +This firehose provides caching and prefetching features. In the Simple task, a firehose can be read twice if intervals or +shardSpecs are not specified, and, in this case, caching can be useful. 
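+
+As an illustration, the sketch below caps the cache and prefetch space of a `static-s3` firehose well below the defaults; the URI and the byte sizes are placeholders rather than recommendations, and the individual properties are described in the table that follows.
+
+```json
+"firehose" : {
+  "type" : "static-s3",
+  "uris": ["s3://foo/bar/file.gz"],
+  "maxCacheCapacityBytes": 536870912,
+  "maxFetchCapacityBytes": 536870912,
+  "prefetchTriggerBytes": 268435456,
+  "fetchTimeout": 60000
+}
+```
+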
Prefetching is preferred when direct scan of objects is slow. +Note that prefetching or caching isn't that useful in the Parallel task. + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be `static-google-blobstore`.|None|yes| +|blobs|JSON array of Google Blobs.|None|yes| +|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no| +|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no| +|prefetchTriggerBytes|Threshold to trigger prefetching Google Blobs.|maxFetchCapacityBytes / 2|no| +|fetchTimeout|Timeout for fetching a Google Blob.|60000|no| +|maxFetchRetry|Maximum retry for fetching a Google Blob.|3|no| + +Google Blobs: + +|property|description|default|required?| +|--------|-----------|-------|---------| +|bucket|Name of the Google Cloud bucket|None|yes| +|path|The path where data is located.|None|yes| + +### HDFSFirehose + +> You need to include the [`druid-hdfs-storage`](../development/extensions-core/hdfs.md) as an extension to use the HDFSFirehose. + +This firehose ingests events from a predefined list of files from the HDFS storage. +This firehose is _splittable_ and can be used by the [Parallel task](#parallel-task). +Since each split represents an HDFS file, each worker task of `index_parallel` will read files. + +Sample spec: + +```json +"firehose" : { + "type" : "hdfs", + "paths": "/foo/bar,/foo/baz" +} +``` + +This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if +`intervals` are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning +of files is slow. +Note that prefetching or caching isn't that useful in the Parallel task. + +|Property|Description|Default| +|--------|-----------|-------| +|type|This should be `hdfs`.|none (required)| +|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.|none (required)| +|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824| +|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824| +|prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2| +|fetchTimeout|Timeout for fetching each file.|60000| +|maxFetchRetry|Maximum number of retries for fetching each file.|3| + +You can also ingest from other storage using the HDFS firehose if the HDFS client supports that storage. +However, if you want to ingest from cloud storage, consider using the service-specific input source for your data storage. +If you want to use a non-hdfs protocol with the HDFS firehose, you need to include the protocol you want +in `druid.ingestion.hdfs.allowedProtocols`. See [HDFS firehose security configuration](../configuration/index.md#hdfs-input-source) for more details. + +### LocalFirehose + +This Firehose can be used to read the data from files on local disk, and is mainly intended for proof-of-concept testing, and works with `string` typed parsers. +This Firehose is _splittable_ and can be used by [native parallel index tasks](native-batch.md#parallel-task). 
+Since each split represents a file in this Firehose, each worker task of `index_parallel` will read a file. +A sample local Firehose spec is shown below: + +```json +{ + "type": "local", + "filter" : "*.csv", + "baseDir": "/data/directory" +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "local".|yes| +|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information.|yes| +|baseDir|directory to search recursively for files to be ingested. |yes| + + + +### HttpFirehose + +This Firehose can be used to read the data from remote sites via HTTP, and works with `string` typed parsers. +This Firehose is _splittable_ and can be used by [native parallel index tasks](native-batch.md#parallel-task). +Since each split represents a file in this Firehose, each worker task of `index_parallel` will read a file. +A sample HTTP Firehose spec is shown below: + +```json +{ + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"] +} +``` + +You can only use protocols listed in the `druid.ingestion.http.allowedProtocols` property as HTTP firehose input sources. +The `http` and `https` protocols are allowed by default. See [HTTP firehose security configuration](../configuration/index.md#http-input-source) for more details. + +The below configurations can be optionally used if the URIs specified in the spec require a Basic Authentication Header. +Omitting these fields from your spec will result in HTTP requests with no Basic Authentication Header. + +|property|description|default| +|--------|-----------|-------| +|httpAuthenticationUsername|Username to use for authentication with specified URIs|None| +|httpAuthenticationPassword|PasswordProvider to use with specified URIs|None| + +Example with authentication fields using the DefaultPassword provider (this requires the password to be in the ingestion spec): + +```json +{ + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": "password123" +} +``` + +You can also use the other existing Druid PasswordProviders. Here is an example using the EnvironmentVariablePasswordProvider: + +```json +{ + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": { + "type": "environment", + "variable": "HTTP_FIREHOSE_PW" + } +} +``` + +The below configurations can optionally be used for tuning the Firehose performance. +Note that prefetching or caching isn't that useful in the Parallel task. + +|property|description|default| +|--------|-----------|-------| +|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824| +|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. 
Prefetched files are removed immediately once they are read.|1073741824| +|prefetchTriggerBytes|Threshold to trigger prefetching HTTP objects.|maxFetchCapacityBytes / 2| +|fetchTimeout|Timeout for fetching an HTTP object.|60000| +|maxFetchRetry|Maximum retries for fetching an HTTP object.|3| + + + +### IngestSegmentFirehose + +This Firehose can be used to read the data from existing druid segments, potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment. +This Firehose is _splittable_ and can be used by [native parallel index tasks](native-batch.md#parallel-task). +This firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification. + A sample ingest Firehose spec is shown below: + +```json +{ + "type": "ingestSegment", + "dataSource": "wikipedia", + "interval": "2013-01-01/2013-01-02" +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "ingestSegment".|yes| +|dataSource|A String defining the data source to fetch rows from, very similar to a table in a relational database|yes| +|interval|A String representing the ISO-8601 interval. This defines the time range to fetch the data over.|yes| +|dimensions|The list of dimensions to select. If left empty, no dimensions are returned. If left null or not defined, all dimensions are returned. |no| +|metrics|The list of metrics to select. If left empty, no metrics are returned. If left null or not defined, all metrics are selected.|no| +|filter| See [Filters](../querying/filters.md)|no| +|maxInputSegmentBytesPerTask|Deprecated. Use [Segments Split Hint Spec](#segments-split-hint-spec) instead. When used with the native parallel index task, the maximum number of bytes of input segments to process in a single task. If a single segment is larger than this number, it will be processed by itself in a single task (input segments are never split across tasks). Defaults to 150MB.|no| + + + +### SqlFirehose + +This Firehose can be used to ingest events residing in an RDBMS. The database connection information is provided as part of the ingestion spec. +For each query, the results are fetched locally and indexed. +If there are multiple queries from which data needs to be indexed, queries are prefetched in the background, up to `maxFetchCapacityBytes` bytes. +This Firehose is _splittable_ and can be used by [native parallel index tasks](native-batch.md#parallel-task). +This firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification. See the extension documentation for more detailed ingestion examples. + +Requires one of the following extensions: + * [MySQL Metadata Store](../development/extensions-core/mysql.md). + * [PostgreSQL Metadata Store](../development/extensions-core/postgresql.md). + + +```json +{ + "type": "sql", + "database": { + "type": "mysql", + "connectorConfig": { + "connectURI": "jdbc:mysql://host:port/schema", + "user": "user", + "password": "password" + } + }, + "sqls": ["SELECT * FROM table1", "SELECT * FROM table2"] +} +``` + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|This should be "sql".||Yes| +|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support. The specified extension must be loaded into Druid:

  • [mysql-metadata-storage](../development/extensions-core/mysql.md) for `mysql`
  • [postgresql-metadata-storage](../development/extensions-core/postgresql.md) for `postgresql`.

You can selectively allow JDBC properties in `connectURI`. See [JDBC connections security config](../configuration/index.md#jdbc-connections-to-external-databases) for more details.||Yes| +|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|No| +|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|No| +|prefetchTriggerBytes|Threshold to trigger prefetching SQL result objects.|maxFetchCapacityBytes / 2|No| +|fetchTimeout|Timeout for fetching the result set.|60000|No| +|foldCase|Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.|false|No| +|sqls|List of SQL queries where each SQL query would retrieve the data to be indexed.||Yes| + +#### Database + +|property|description|default|required?| +|--------|-----------|-------|---------| +|type|The type of database to query. Valid values are `mysql` and `postgresql`_||Yes| +|connectorConfig|Specify the database connection properties via `connectURI`, `user` and `password`||Yes| + +### InlineFirehose + +This Firehose can be used to read the data inlined in its own spec. +It can be used for demos or for quickly testing out parsing and schema, and works with `string` typed parsers. +A sample inline Firehose spec is shown below: + +```json +{ + "type": "inline", + "data": "0,values,formatted\n1,as,CSV" +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "inline".|yes| +|data|Inlined data to ingest.|yes| + +### CombiningFirehose + +This Firehose can be used to combine and merge data from a list of different Firehoses. + +```json +{ + "type": "combining", + "delegates": [ { firehose1 }, { firehose2 }, ... 
] +} +``` + +|property|description|required?| +|--------|-----------|---------| +|type|This should be "combining"|yes| +|delegates|List of Firehoses to combine data from|yes| + + + +## 本地批摄入 + +Apache Druid当前支持两种类型的本地批量索引任务, `index_parallel` 可以并行的运行多个任务, `index`运行单个索引任务。 详情可以查看 [基于Hadoop的摄取vs基于本地批摄取的对比](ingestion.md#批量摄取) 来了解基于Hadoop的摄取、本地简单批摄取、本地并行摄取三者的比较。 + +要运行这两种类型的本地批索引任务,请按以下指定编写摄取规范。然后将其发布到Overlord的 `/druid/indexer/v1/task` 接口,或者使用druid附带的 `bin/post-index-task`。 + +### 教程 + +此页包含本地批处理摄取的参考文档。相反,如果要进行演示,请查看 [加载文件教程](../GettingStarted/chapter-1.md),该教程演示了"简单"(单任务)模式 + +### 并行任务 + +并行任务(`index_parallel`类型)是用于并行批索引的任务。此任务只使用Druid的资源,不依赖于其他外部系统,如Hadoop。`index_parallel` 任务是一个supervisor任务,它协调整个索引过程。supervisor分割输入数据并创建辅助任务来处理这些分割, 创建的worker将发布给Overlord,以便在MiddleManager或Indexer上安排和运行。一旦worker成功处理分配的输入拆分,它就会将生成的段列表报告给supervisor任务。supervisor定期检查工作任务的状态。如果其中一个失败,它将重试失败的任务,直到重试次数达到配置的限制。如果所有工作任务都成功,它将立即发布报告的段并完成摄取。 + +并行任务的详细行为是不同的,取决于 [`partitionsSpec`](#partitionsspec),详情可以查看 `partitionsSpec` + +要使用此任务,`ioConfig` 中的 [`inputSource`](#输入源) 应为*splittable(可拆分的)*,`tuningConfig` 中的 `maxNumConcurrentSubTasks` 应设置为大于1。否则,此任务将按顺序运行;`index_parallel` 任务将逐个读取每个输入文件并自行创建段。目前支持的可拆分输入格式有: + +* [`s3`](#s3%e8%be%93%e5%85%a5%e6%ba%90) 从AWS S3存储读取数据 +* [`gs`](#谷歌云存储输入源) 从谷歌云存储读取数据 +* [`azure`](#azure%e8%be%93%e5%85%a5%e6%ba%90) 从Azure Blob存储读取数据 +* [`hdfs`](#hdfs%e8%be%93%e5%85%a5%e6%ba%90) 从HDFS存储中读取数据 +* [`http`](#HTTP输入源) 从HTTP服务中读取数据 +* [`local`](#local%e8%be%93%e5%85%a5%e6%ba%90) 从本地存储中读取数据 +* [`druid`](#druid%e8%be%93%e5%85%a5%e6%ba%90) 从Druid数据源中读取数据 + +传统的 [`firehose`](#firehoses%e5%b7%b2%e5%ba%9f%e5%bc%83) 支持其他一些云存储类型。下面的 `firehose` 类型也是可拆分的。请注意,`firehose` 只支持文本格式。 + +* [`static-cloudfiles`](../development/rackspacecloudfiles.md) + +您可能需要考虑以下事项: +* 您可能希望控制每个worker进程的输入数据量。这可以使用不同的配置进行控制,具体取决于并行摄取的阶段(有关更多详细信息,请参阅 [`partitionsSpec`](#partitionsspec)。对于从 `inputSource` 读取数据的任务,可以在 `tuningConfig` 中设置 [分割提示规范](#分割提示规范)。对于合并无序段的任务,可以在 `tuningConfig` 中设置`totalNumMergeTasks`。 +* 并行摄取中并发worker的数量由 `tuningConfig` 中的`maxNumConcurrentSubTasks` 确定。supervisor检查当前正在运行的worker的数量,如果小于 `maxNumConcurrentSubTasks`,则无论当前有多少任务槽可用,都会创建更多的worker。这可能会影响其他摄取性能。有关更多详细信息,请参阅下面的 [容量规划部分](#容量规划)。 +* 默认情况下,批量摄取将替换它写入的任何段中的所有数据(在`granularitySpec` 的间隔中)。如果您想添加到段中,请在 `ioConfig` 中设置 `appendToExisting` 标志。请注意,它只替换主动添加数据的段中的数据:如果 `granularitySpec` 的间隔中有段没有此任务写入的数据,则它们将被单独保留。如果任何现有段与 `granularitySpec` 的间隔部分重叠,则新段间隔之外的那些段的部分仍将可见。 + +#### 任务符号 + +一个简易的任务如下所示: + +```json +{ + "type": "index_parallel", + "spec": { + "dataSchema": { + "dataSource": "wikipedia_parallel_index_test", + "timestampSpec": { + "column": "timestamp" + }, + "dimensionsSpec": { + "dimensions": [ + "page", + "language", + "user", + "unpatrolled", + "newPage", + "robot", + "anonymous", + "namespace", + "continent", + "country", + "region", + "city" + ] + }, + "metricsSpec": [ + { + "type": "count", + "name": "count" + }, + { + "type": "doubleSum", + "name": "added", + "fieldName": "added" + }, + { + "type": "doubleSum", + "name": "deleted", + "fieldName": "deleted" + }, + { + "type": "doubleSum", + "name": "delta", + "fieldName": "delta" + } + ], + "granularitySpec": { + "segmentGranularity": "DAY", + "queryGranularity": "second", + "intervals" : [ "2013-08-31/2013-09-02" ] + } + }, + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "baseDir": "examples/indexing/", + "filter": "wikipedia_index_data*" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningconfig": { + "type": "index_parallel", + "maxNumConcurrentSubTasks": 2 + } + } +} 
+``` + +| 属性 | 描述 | 是否必须 | +|-|-|-| +| `type` | 任务类型,应当总是 `index_parallel` | 是 | +| `id` | 任务ID。 如果该项没有显式的指定,Druid将使用任务类型、数据源名称、时间间隔、日期时间戳生成一个任务ID | 否 | +| `spec` | 摄取规范包括数据schema、IOConfig 和 TuningConfig。详情见下边详细描述 | 是 | +| `context` | Context包括了多个任务配置参数。详情见下边详细描述 | 否 | + +##### `dataSchema` + +该字段为必须字段。 + +可以参见 [摄取规范中的dataSchema](ingestion.md#dataSchema) + +如果在dataSchema的 `granularitySpec` 中显式地指定了 `intervals`,则批处理摄取将锁定启动时指定的完整间隔,并且您将快速了解指定间隔是否与其他任务(例如Kafka摄取)持有的锁重叠。否则,在发现每个间隔时,批处理摄取将锁定该间隔,因此您可能只会在摄取过程中了解到该任务与较高优先级的任务重叠。如果显式指定 `intervals`,则指定间隔之外的任何行都将被丢弃。如果您知道数据的时间范围,我们建议显式地设置`intervals`,以便锁定失败发生得更快,并且如果有一些带有意外时间戳的杂散数据,您不会意外地替换该范围之外的数据。 + +##### `ioConfig` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 任务类型, 应当总是 `index_parallel` | none | 是 | +| `inputFormat` | [`inputFormat`](dataformats.md#InputFormat) 用来指定如何解析输入数据 | none | 是 | +| `appendToExisting` | 创建段作为最新版本的附加分片,有效地附加到段集而不是替换它。仅当现有段集具有可扩展类型 `shardSpec`时,此操作才有效。 | false | 否 | + +##### `tuningConfig` + +tuningConfig是一个可选项,如果未指定则使用默认的参数。 详情如下: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 任务类型,应当总是 `index_parallel` | none | 是 | +| `maxRowsPerSegment` | 已废弃。使用 `partitionsSpec` 替代,被用来分片。 决定在每个段中有多少行。 | 5000000 | 否 | +| `maxRowsInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常用户不需要设置此值,但根据数据的性质,如果行的字节数较短,则用户可能不希望在内存中存储一百万行,应设置此值。 | 1000000 | 否 | +| `maxBytesInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常这是在内部计算的,用户不需要设置它。此值表示在持久化之前要在堆内存中聚合的字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。用于索引的最大堆内存使用量为 `maxBytesInMemory *(2 + maxPendingResistent)` | 最大JVM内存的1/6 | 否 | +| `maxTotalRows` | 已废弃。使用 `partitionsSpec` 替代。等待推送的段中的总行数。用于确定何时应进行中间推送。| 20000000 | 否 | +| `numShards` | 已废弃。使用 `partitionsSpec` 替代。当使用 `hashed` `partitionsSpec`时直接指定要创建的分片数。如果该值被指定了且在 `granularitySpec`中指定了 `intervals`,那么索引任务可以跳过确定间隔/分区传递数据。如果设置了 `maxRowsPerSegment`,则无法指定 `numShards`。 | null | 否 | +| `splitHintSpec` | 用于提供提示以控制每个第一阶段任务读取的数据量。根据输入源的实现,可以忽略此提示。有关更多详细信息,请参见 [分割提示规范](#分割提示规范)。 | 基于大小的分割提示规范 | 否 | +| `partitionsSpec` | 定义在每个时间块中如何分区数据。 参见 [partitionsSpec](#partitionsspec) | 如果 `forceGuaranteedRollup` = false, 则为 `dynamic`; 如果 `forceGuaranteedRollup` = true, 则为 `hashed` 或者 `single_dim` | 否 | +| `indexSpec` | 定义段在索引阶段的存储格式相关选项,参见 [IndexSpec](ingestion.md#tuningConfig) | null | 否 | +| `indexSpecForIntermediatePersists` | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅 [IndexSpec](ingestion.md#tuningConfig)。 | 与 `indexSpec` 相同 | 否 | +| `maxPendingPersists` | 可挂起但未启动的最大持久化任务数。如果新的中间持久化将超过此限制,则在当前运行的持久化完成之前,摄取将被阻止。使用`maxRowsInMemory * (2 + maxPendingResistents)` 索引扩展的最大堆内存使用量。 | 0 (这意味着一个持久化任务只可以与摄取同时运行,而没有一个可以排队) | 否 | +| `forceGuaranteedRollup` | 强制保证 [最佳Rollup](ingestion.md#Rollup)。最佳rollup优化了生成的段的总大小和查询时间,同时索引时间将增加。如果设置为true,则必须设置 `granularitySpec` 中的 `intervals` ,同时必须对 `partitionsSpec` 使用 `single_dim` 或者 `hashed` 。此标志不能与 `IOConfig` 的 `appendToExisting` 一起使用。有关更多详细信息,请参见下面的 ["分段推送模式"](#分段推送模式) 部分。 | false | 否 | +| `reportParseExceptions` | 如果为true,则将引发解析期间遇到的异常并停止摄取;如果为false,则将跳过不可解析的行和字段。 | false | 否 | +| `pushTimeout` | 段推送的超时毫秒时间。 该值必须设置为 >= 0, 0意味着永不超时 | 0 | 否 | +| `segmentWriteOutMediumFactory` | 创建段时使用的段写入介质。 参见 [segmentWriteOutMediumFactory](#segmentWriteOutMediumFactory) | 未指定, 值来源于 `druid.peon.defaultSegmentWriteOutMediumFactory.type` | 否 | +| `maxNumConcurrentSubTasks` | 可同时并行运行的最大worker数。无论当前可用的任务槽如何,supervisor都将生成最多为 `maxNumConcurrentSubTasks` 的worker。如果此值设置为1,supervisor将自行处理数据摄取,而不是生成worker。如果将此值设置为太大,则可能会创建太多的worker,这可能会阻止其他摄取。查看 [容量规划](#容量规划) 以了解更多详细信息。 | 1 | 否 | +| `maxRetry` | 任务失败后最大重试次数 | 3 | 否 | +| `maxNumSegmentsToMerge` 
| 单个任务在第二阶段可同时合并的段数的最大限制。仅在 `forceGuaranteedRollup` 被设置的时候使用。 | 100 | 否 | +| `totalNumMergeTasks` | 当 `partitionsSpec` 被设置为 `hashed` 或者 `single_dim`时, 在合并阶段用来合并段的最大任务数。 | 10 | 否 | +| `taskStatusCheckPeriodMs` | 检查运行任务状态的轮询周期(毫秒)。| 1000 | 否 | +| `chatHandlerTimeout` | 报告worker中的推送段超时。| PT10S | 否 | +| `chatHandlerNumRetries` | 重试报告worker中的推送段 | 5 | 否 | + +#### 分割提示规范 + +分割提示规范用于在supervisor创建输入分割时给出提示。请注意,每个worker处理一个输入拆分。您可以控制每个worker在第一阶段读取的数据量。 + +**基于大小的分割提示规范** +除HTTP输入源外,所有可拆分输入源都遵循基于大小的拆分提示规范。 + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应当总是 `maxSize` | none | 是 | +| `maxSplitSize` | 单个任务中要处理的输入文件的最大字节数。如果单个文件大于此数字,则它将在单个任务中自行处理(文件永远不会跨任务拆分)。 | 500MB | 否 | + +**段分割提示规范** + +段分割提示规范仅仅用在 [`DruidInputSource`](#Druid输入源)(和过时的 [`IngestSegmentFirehose`](#IngestSegmentFirehose)) + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应当总是 `segments` | none | 是 | +| `maxInputSegmentBytesPerTask` | 单个任务中要处理的输入段的最大字节数。如果单个段大于此数字,则它将在单个任务中自行处理(输入段永远不会跨任务拆分)。 | 500MB | 否 | + +##### `partitionsSpec` + +PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模式使用不同的partitionsSpec。为了实现 [最佳rollup](ingestion.md#rollup),您应该使用 `hashed`(基于每行中维度的哈希进行分区)或 `single_dim`(基于单个维度的范围)。对于"尽可能rollup"模式,应使用 `dynamic`。 + +三种 `partitionsSpec` 类型有着不同的特征。 + +| PartitionsSpec | 摄入速度 | 分区方式 | 支持的rollup模式 | 查询时的段修剪 | +|-|-|-|-|-| +| `dynamic` | 最快 | 基于段中的行数来进行分区 | 尽可能rollup | N/A | +| `hashed` | 中等 | 基于分区维度的哈希值进行分区。此分区可以通过改进数据位置性来减少数据源大小和查询延迟。有关详细信息,请参见 [分区](ingestion.md#分区)。 | 最佳rollup | N/A | +| `single_dim` | 最慢 | 基于分区维度值的范围分区。段大小可能会根据分区键分布而倾斜。这可能通过改善数据位置性来减少数据源大小和查询延迟。有关详细信息,请参见 [分区](ingestion.md#分区)。 | 最佳rollup | Broker可以使用分区信息提前修剪段以加快查询速度。由于Broker知道每个段中 `partitionDimension` 值的范围,因此,给定一个包含`partitionDimension` 上的筛选器的查询,Broker只选取包含满足 `partitionDimension` 上的筛选器的行的段进行查询处理。| + +对于每一种partitionSpec,推荐的使用场景是: + +* 如果数据有一个在查询中经常使用的均匀分布列,请考虑使用 `single_dim` partitionsSpec来最大限度地提高大多数查询的性能。 +* 如果您的数据不是均匀分布的列,但在使用某些维度进行rollup时,需要具有较高的rollup汇总率,请考虑使用 `hashed` partitionsSpec。通过改善数据的局部性,可以减小数据源的大小和查询延迟。 +* 如果以上两个场景不是这样,或者您不需要rollup数据源,请考虑使用 `dynamic` partitionsSpec。 + +**Dynamic分区** + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 `dynamic` | none | 是 | +| `maxRowsPerSegment` | 用来分片。决定在每一个段中有多少行 | 5000000 | 否 | +| `maxTotalRows` | 等待推送的所有段的总行数。用于确定中间段推送的发生时间。 | 20000000 | 否 | + +使用Dynamic分区,并行索引任务在一个阶段中运行:它将生成多个worker(`single_phase_sub_task` 类型),每个worker都创建段。worker创建段的方式是: + +* 每当当前段中的行数超过 `maxRowsPerSegment` 时,任务将创建一个新段。 +* 一旦所有时间块中所有段中的行总数达到 `maxTotalRows`,任务就会将迄今为止创建的所有段推送到深层存储并创建新段。 + +**基于哈希的分区** + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 `hashed` | none | 是 | +| `numShards` | 直接指定要创建的分片数。如果该值被指定了,同时在 `granularitySpec` 中指定了 `intervals`,那么索引任务可以跳过确定通过数据的间隔/分区 | null | 是 | +| `partitionDimensions` | 要分区的维度。留空可选择所有维度。| null | 否 | + +基于哈希分区的并行任务类似于 [MapReduce](https://en.wikipedia.org/wiki/MapReduce)。任务分为两个阶段运行,即 `部分段生成` 和 `部分段合并`。 + +* 在 `部分段生成` 阶段,与MapReduce中的Map阶段一样,并行任务根据分割提示规范分割输入数据,并将每个分割分配给一个worker。每个worker(`partial_index_generate` 类型)从 `granularitySpec` 中的`segmentGranularity(主分区键)` 读取分配的分割,然后按`partitionsSpec` 中 `partitionDimensions(辅助分区键)`的哈希值对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md) 的本地存储中。 +* `部分段合并` 阶段类似于MapReduce中的Reduce阶段。并行任务生成一组新的worker(`partial_index_merge` 类型)来合并在前一阶段创建的分区数据。这里,分区数据根据要合并的时间块和分区维度的散列值进行洗牌;每个worker从多个MiddleManager/Indexer进程中读取落在同一时间块和同一散列值中的数据,并将其合并以创建最终段。最后,它们将最后的段一次推送到深层存储。 + +**基于单一维度范围分区** + +> [!WARNING] +> 在并行任务的顺序模式下,当前不支持单一维度范围分区。尝试将`maxNumConcurrentSubTasks` 设置为大于1以使用此分区方式。 + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 
`single_dim` | none | 是 | +| `partitionDimension` | 要分区的维度。 仅仅允许具有单一维度值的行 | none | 是 | +| `targetRowsPerSegment` | 在一个分区中包含的目标行数,应当是一个500MB ~ 1GB目标段的数值。 | none | 要么该值被设置,或者 `maxRowsPerSegment`被设置。 | +| `maxRowsPerSegment` | 分区中要包含的行数的软最大值。| none | 要么该值被设置,或者 `targetRowsPerSegment`被设置。| +| `assumeGrouped` | 假设输入数据已经按时间和维度分组。摄取将运行得更快,但如果违反此假设,则可能会选择次优分区 | false | 否 | + +在 `single-dim` 分区下,并行任务分为3个阶段进行,即 `部分维分布`、`部分段生成` 和 `部分段合并`。第一个阶段是收集一些统计数据以找到最佳分区,另外两个阶段是创建部分段并分别合并它们,就像在基于哈希的分区中那样。 + +* 在 `部分维度分布` 阶段,并行任务分割输入数据,并根据分割提示规范将其分配给worker。每个worker任务(`partial_dimension_distribution` 类型)读取分配的分割并为 `partitionDimension` 构建直方图。并行任务从worker任务收集这些直方图,并根据 `partitionDimension` 找到最佳范围分区,以便在分区之间均匀分布行。请注意,`targetRowsPerSegment` 或 `maxRowsPerSegment` 将用于查找最佳分区。 +* 在 `部分段生成` 阶段,并行任务生成新的worker任务(`partial_range_index_generate` 类型)以创建分区数据。每个worker任务都读取在前一阶段中创建的分割,根据 `granularitySpec` 中的`segmentGranularity(主分区键)`的时间块对行进行分区,然后根据在前一阶段中找到的范围分区对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md)的本地存储中。 +* 在 `部分段合并` 阶段,并行索引任务生成一组新的worker任务(`partial_index_generic_merge`类型)来合并在上一阶段创建的分区数据。这里,分区数据根据时间块和 `partitionDimension` 的值进行洗牌;每个工作任务从多个MiddleManager/Indexer进程中读取属于同一范围的同一分区中的段,并将它们合并以创建最后的段。最后,它们将最后的段推到深层存储。 + +> [!WARNING] +> 由于单一维度范围分区的任务在 `部分维度分布` 和 `部分段生成` 阶段对输入进行两次传递,因此如果输入在两次传递之间发生变化,任务可能会失败 + +#### HTTP状态接口 + +supervisor任务提供了一些HTTP接口来获取任务状态。 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/mode` + +如果索引任务以并行的方式运行,则返回 "parallel", 否则返回 "sequential" + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/phase` + +如果任务以并行的方式运行,则返回当前阶段的名称 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/progress` + +如果supervisor任务以并行的方式运行,则返回当前阶段的预估进度 + +一个示例结果如下: +```json +{ + "running":10, + "succeeded":0, + "failed":0, + "complete":0, + "total":10, + "estimatedExpectedSucceeded":10 +} +``` + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtasks/running` + +返回正在运行的worker任务的任务IDs,如果该supervisor任务以序列模式运行则返回一个空的列表 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs` + +返回所有的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/running` + +返回正在运行的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/complete` + +返回已经完成的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}` + +返回指定ID的worker任务规范,如果该supervisor任务以序列模式运行则返回一个HTTP 404 + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/state` + +返回指定ID的worker任务规范的状态,如果该supervisor任务以序列模式运行则返回一个HTTP 404。 返回的结果集中包括worker任务规范,当前任务状态(如果存在的话) 以及任务尝试历史记录。 + +一个示例结果如下: +```json +{ + "spec": { + "id": "index_parallel_lineitem_2018-04-20T22:12:43.610Z_2", + "groupId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", + "supervisorTaskId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", + "context": null, + "inputSplit": { + "split": "/path/to/data/lineitem.tbl.5" + }, + "ingestionSpec": { + "dataSchema": { + "dataSource": "lineitem", + "timestampSpec": { + "column": "l_shipdate", + "format": "yyyy-MM-dd" + }, + "dimensionsSpec": { + "dimensions": [ + "l_orderkey", + "l_partkey", + "l_suppkey", + "l_linenumber", + "l_returnflag", + "l_linestatus", + "l_shipdate", + "l_commitdate", + "l_receiptdate", + 
"l_shipinstruct", + "l_shipmode", + "l_comment" + ] + }, + "metricsSpec": [ + { + "type": "count", + "name": "count" + }, + { + "type": "longSum", + "name": "l_quantity", + "fieldName": "l_quantity", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_extendedprice", + "fieldName": "l_extendedprice", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_discount", + "fieldName": "l_discount", + "expression": null + }, + { + "type": "doubleSum", + "name": "l_tax", + "fieldName": "l_tax", + "expression": null + } + ], + "granularitySpec": { + "type": "uniform", + "segmentGranularity": "YEAR", + "queryGranularity": { + "type": "none" + }, + "rollup": true, + "intervals": [ + "1980-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z" + ] + }, + "transformSpec": { + "filter": null, + "transforms": [] + } + }, + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "baseDir": "/path/to/data/", + "filter": "lineitem.tbl.5" + }, + "inputFormat": { + "format": "tsv", + "delimiter": "|", + "columns": [ + "l_orderkey", + "l_partkey", + "l_suppkey", + "l_linenumber", + "l_quantity", + "l_extendedprice", + "l_discount", + "l_tax", + "l_returnflag", + "l_linestatus", + "l_shipdate", + "l_commitdate", + "l_receiptdate", + "l_shipinstruct", + "l_shipmode", + "l_comment" + ] + }, + "appendToExisting": false + }, + "tuningConfig": { + "type": "index_parallel", + "maxRowsPerSegment": 5000000, + "maxRowsInMemory": 1000000, + "maxTotalRows": 20000000, + "numShards": null, + "indexSpec": { + "bitmap": { + "type": "roaring" + }, + "dimensionCompression": "lz4", + "metricCompression": "lz4", + "longEncoding": "longs" + }, + "indexSpecForIntermediatePersists": { + "bitmap": { + "type": "roaring" + }, + "dimensionCompression": "lz4", + "metricCompression": "lz4", + "longEncoding": "longs" + }, + "maxPendingPersists": 0, + "reportParseExceptions": false, + "pushTimeout": 0, + "segmentWriteOutMediumFactory": null, + "maxNumConcurrentSubTasks": 4, + "maxRetry": 3, + "taskStatusCheckPeriodMs": 1000, + "chatHandlerTimeout": "PT10S", + "chatHandlerNumRetries": 5, + "logParseExceptions": false, + "maxParseExceptions": 2147483647, + "maxSavedParseExceptions": 0, + "forceGuaranteedRollup": false, + "buildV9Directly": true + } + } + }, + "currentStatus": { + "id": "index_sub_lineitem_2018-04-20T22:16:29.922Z", + "type": "index_sub", + "createdTime": "2018-04-20T22:16:29.925Z", + "queueInsertionTime": "2018-04-20T22:16:29.929Z", + "statusCode": "RUNNING", + "duration": -1, + "location": { + "host": null, + "port": -1, + "tlsPort": -1 + }, + "dataSource": "lineitem", + "errorMsg": null + }, + "taskHistory": [] +} +``` + +* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/history + ` + +返回被指定ID的worker任务规范的任务尝试历史记录,如果该supervisor任务以序列模式运行则返回一个HTTP 404 + +#### 容量规划 + +不管当前有多少任务槽可用,supervisor任务最多可以创建 `maxNumConcurrentSubTasks` worker任务, 因此,可以同时运行的任务总数为 `(maxNumConcurrentSubTasks+1)(包括supervisor任务)`。请注意,这甚至可以大于任务槽的总数(所有worker的容量之和)。如果`maxNumConcurrentSubTasks` 大于 `n(可用任务槽)`,则`maxNumConcurrentSubTasks` 任务由supervisor任务创建,但只有 `n` 个任务将启动, 其他人将在挂起状态下等待,直到任何正在运行的任务完成。 + +如果将并行索引任务与流摄取一起使用,我们建议**限制批摄取的最大容量**,以防止流摄取被批摄取阻止。假设您同时有 `t` 个并行索引任务要运行, 但是想将批摄取的最大任务数限制在 `b`。 那么, 所有并行索引任务的 `maxNumConcurrentSubTasks` 之和 + `t`(supervisor任务数) 必须小于 `b`。 + +如果某些任务的优先级高于其他任务,则可以将其`maxNumConcurrentSubTasks` 设置为高于低优先级任务的值。这可能有助于高优先级任务比低优先级任务提前完成,方法是为它们分配更多的任务槽。 + +### 简单任务 + +简单任务(`index`类型)设计用于较小的数据集。任务在索引服务中执行。 + +#### 任务符号 + +一个示例任务如下: + +```json +{ + 
"type" : "index", + "spec" : { + "dataSchema" : { + "dataSource" : "wikipedia", + "timestampSpec" : { + "column" : "timestamp", + "format" : "auto" + }, + "dimensionsSpec" : { + "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], + "dimensionExclusions" : [] + }, + "metricsSpec" : [ + { + "type" : "count", + "name" : "count" + }, + { + "type" : "doubleSum", + "name" : "added", + "fieldName" : "added" + }, + { + "type" : "doubleSum", + "name" : "deleted", + "fieldName" : "deleted" + }, + { + "type" : "doubleSum", + "name" : "delta", + "fieldName" : "delta" + } + ], + "granularitySpec" : { + "type" : "uniform", + "segmentGranularity" : "DAY", + "queryGranularity" : "NONE", + "intervals" : [ "2013-08-31/2013-09-01" ] + } + }, + "ioConfig" : { + "type" : "index", + "inputSource" : { + "type" : "local", + "baseDir" : "examples/indexing/", + "filter" : "wikipedia_data.json" + }, + "inputFormat": { + "type": "json" + } + }, + "tuningConfig" : { + "type" : "index", + "maxRowsPerSegment" : 5000000, + "maxRowsInMemory" : 1000000 + } + } +} +``` + +| 属性 | 描述 | 是否必须 | +|-|-|-| +| `type` | 任务类型, 应该总是 `index` | 是 | +| `id` | 任务ID。如果该值为显式的指定,Druid将会使用任务类型、数据源名称、时间间隔以及日期时间戳生成一个任务ID | 否 | +| `spec` | 摄入规范,包括dataSchema、IOConfig 和 TuningConfig。 详情见下边的描述 | 是 | +| `context` | 包含多个任务配置参数的上下文。 详情见下边的描述 | 否 | + +##### `dataSchema` + +**该字段为必须字段。** + +详情可以见摄取文档中的 [`dataSchema`](ingestion.md#dataSchema) 部分。 + +如果没有在 `dataSchema` 的 `granularitySpec` 中显式指定 `intervals`,本地索引任务将对数据执行额外的传递,以确定启动时要锁定的范围。如果显式指定 `intervals`,则指定间隔之外的任何行都将被丢弃。如果您知道数据的时间范围,我们建议显式设置 `intervals`,因为它允许任务跳过额外的过程,并且如果有一些带有意外时间戳的杂散数据,您不会意外地替换该范围之外的数据。 + +##### `ioConfig` + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 任务类型,应该总是 `index` | none | 是 | +| `inputFormat` | [`inputFormat`](dataformats.md#InputFormat) 指定如何解析输入数据 | none | 是 | +| `appendToExisting` | 创建段作为最新版本的附加分片,有效地附加到段集而不是替换它。仅当现有段集具有可扩展类型 `shardSpec`时,此操作才有效。 | false | 否 | + +##### `tuningConfig` + +tuningConfig是一个可选项,如果未指定则使用默认的参数。 详情如下: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 任务类型,应当总是 `index` | none | 是 | +| `maxRowsPerSegment` | 已废弃。使用 `partitionsSpec` 替代,被用来分片。 决定在每个段中有多少行。 | 5000000 | 否 | +| `maxRowsInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常用户不需要设置此值,但根据数据的性质,如果行的字节数较短,则用户可能不希望在内存中存储一百万行,应设置此值。 | 1000000 | 否 | +| `maxBytesInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常这是在内部计算的,用户不需要设置它。此值表示在持久化之前要在堆内存中聚合的字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。用于索引的最大堆内存使用量为 `maxBytesInMemory *(2 + maxPendingResistent)` | 最大JVM内存的1/6 | 否 | +| `maxTotalRows` | 已废弃。使用 `partitionsSpec` 替代。等待推送的段中的总行数。用于确定何时应进行中间推送。| 20000000 | 否 | +| `numShards` | 已废弃。使用 `partitionsSpec` 替代。当使用 `hashed` `partitionsSpec`时直接指定要创建的分片数。如果该值被指定了且在 `granularitySpec`中指定了 `intervals`,那么索引任务可以跳过确定间隔/分区传递数据。如果设置了 `maxRowsPerSegment`,则无法指定 `numShards`。 | null | 否 | +| `partitionsSpec` | 定义在每个时间块中如何分区数据。 参见 [partitionsSpec](#partitionsspec) | 如果 `forceGuaranteedRollup` = false, 则为 `dynamic`; 如果 `forceGuaranteedRollup` = true, 则为 `hashed` 或者 `single_dim` | 否 | +| `indexSpec` | 定义段在索引阶段的存储格式相关选项,参见 [IndexSpec](ingestion.md#tuningConfig) | null | 否 | +| `indexSpecForIntermediatePersists` | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅 [IndexSpec](ingestion.md#tuningConfig)。 | 与 `indexSpec` 相同 | 否 | +| `maxPendingPersists` | 可挂起但未启动的最大持久化任务数。如果新的中间持久化将超过此限制,则在当前运行的持久化完成之前,摄取将被阻止。使用`maxRowsInMemory * (2 + maxPendingResistents)` 索引扩展的最大堆内存使用量。 | 0 (这意味着一个持久化任务只可以与摄取同时运行,而没有一个可以排队) | 否 | +| 
`forceGuaranteedRollup` | 强制保证 [最佳Rollup](ingestion.md#Rollup)。最佳rollup优化了生成的段的总大小和查询时间,同时索引时间将增加。如果设置为true,则必须设置 `granularitySpec` 中的 `intervals` ,同时必须对 `partitionsSpec` 使用 `single_dim` 或者 `hashed` 。此标志不能与 `IOConfig` 的 `appendToExisting` 一起使用。有关更多详细信息,请参见下面的 ["分段推送模式"](#分段推送模式) 部分。 | false | 否 | +| `reportParseExceptions` | 已废弃。如果为true,则将引发解析期间遇到的异常并停止摄取;如果为false,则将跳过不可解析的行和字段。将 `reportParseExceptions` 设置为true将覆盖`maxParseExceptions` 和 `maxSavedParseExceptions` 的现有配置,将 `maxParseExceptions` 设置为0并将 `maxSavedParseExceptions` 限制为不超过1。 | false | 否 | +| `pushTimeout` | 段推送的超时毫秒时间。 该值必须设置为 >= 0, 0意味着永不超时 | 0 | 否 | +| `segmentWriteOutMediumFactory` | 创建段时使用的段写入介质。 参见 [segmentWriteOutMediumFactory](#segmentWriteOutMediumFactory) | 未指定, 值来源于 `druid.peon.defaultSegmentWriteOutMediumFactory.type` | 否 | +| `logParseExceptions` | 如果为true,则在发生解析异常时记录错误消息,其中包含有关发生错误的行的信息。 | false | 否 | +| `maxParseExceptions` | 任务停止接收并失败之前可能发生的最大分析异常数。如果设置了`reportParseExceptions`,则该配置被覆盖。 | unlimited | 否 | +| `maxSavedParseExceptions` | 当出现解析异常时,Druid可以跟踪最新的解析异常。"maxSavedParseExceptions" 限制将保存多少个异常实例。这些保存的异常将在任务完成报告中的任务完成后可用。如果设置了 `reportParseExceptions` ,则该配置被覆盖。 | 0 | 否 | + + +##### `partitionsSpec` + +PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模式使用不同的partitionsSpec。为了实现 [最佳rollup](ingestion.md#rollup),您应该使用 `hashed`(基于每行中维度的哈希进行分区) + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 `hashed` | none | 是 | +| `maxRowsPerSegment` | 用在分片中,决定在每个段中有多少行 | 5000000 | 否 | +| `numShards` | 直接指定要创建的分片数。如果该值被指定了,同时在 `granularitySpec` 中指定了 `intervals`,那么索引任务可以跳过确定通过数据的间隔/分区 | null | 是 | +| `partitionDimensions` | 要分区的维度。留空可选择所有维度。| null | 否 | + +对于尽可能rollup模式,您应该使用 `dynamic` + +| 属性 | 描述 | 默认值 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 `dynamic` | none | 是 | +| `maxRowsPerSegment` | 用来分片。决定在每一个段中有多少行 | 5000000 | 否 | +| `maxTotalRows` | 等待推送的所有段的总行数。用于确定中间段推送的发生时间。 | 20000000 | 否 | + +##### `segmentWriteOutMediumFactory` + +| 字段 | 类型 | 描述 | 是否必须 | +|-|-|-|-| +| `type` | String | 配置解释和可选项可以参见 [额外的Peon配置:SegmentWriteOutMediumFactory](../configuration/human-readable-byte.md#SegmentWriteOutMediumFactory) | 是 | + +#### 分段推送模式 + +当使用简单任务摄取数据时,它从输入数据创建段并推送它们。对于分段推送,索引任务支持两种分段推送模式,分别是*批量推送模式*和*增量推送模式*,以实现 [最佳rollup和尽可能rollup](ingestion.md#rollup)。 + +在批量推送模式下,在索引任务的最末端推送每个段。在此之前,创建的段存储在运行索引任务的进程的内存和本地存储中。因此,此模式可能由于存储容量有限而导致问题,建议不要在生产中使用。 + +相反,在增量推送模式下,分段是增量推送的,即可以在索引任务的中间推送。更准确地说,索引任务收集数据并将创建的段存储在运行该任务的进程的内存和磁盘中,直到收集的行总数超过 `maxTotalRows`。一旦超过,索引任务将立即推送创建的所有段,直到此时为止,清除所有推送的段,并继续接收剩余的数据。 + +要启用批量推送模式,应在 `TuningConfig` 中设置`forceGuaranteedRollup`。请注意,此选项不能与 `IOConfig` 的`appendToExisting`一起使用。 + +### 输入源 + +输入源是定义索引任务读取数据的位置。只有本地并行任务和简单任务支持输入源。 + +#### S3输入源 + +> [!WARNING] +> 您需要添加 [`druid-s3-extensions`](../development/S3-compatible.md) 扩展以便使用S3输入源。 + +S3输入源支持直接从S3读取对象。可以通过S3 URI字符串列表或S3位置前缀列表指定对象,该列表将尝试列出内容并摄取位置中包含的所有对象。S3输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "prefixes": ["s3://foo/bar", "s3://bar/foo"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +```json +... 
+ "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "s3", + "objects": [ + { "bucket": "foo", "path": "bar/file1.json"}, + { "bucket": "bar", "path": "foo/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 应该是 `s3` | None | 是 | +| `uris` | 指定被摄取的S3对象位置的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| +| `prefixes` | 指定被摄取的S3对象所在的路径前缀的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | +| `objects` | 指定被摄取的S3对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| +| `properties` | 指定用来覆盖默认S3配置的对象属性,详情见下边 | None | 否(未指定则使用默认)| + +注意:只有当 `prefixes` 被指定时,S3输入源将略过空的对象。 + +S3对象: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `bucket` | S3 Bucket的名称 | None | 是 | +| `path` | 数据路径 | None | 是 | + +属性对象: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `accessKeyId` | S3输入源访问密钥的 [Password Provider](../operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `secretAccessKey` 被提供的话,则为必须 | +| `secretAccessKey` | S3输入源访问密钥的 [Password Provider](../operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `accessKeyId` 被提供的话,则为必须 | + +**注意**: *如果 `accessKeyId` 和 `secretAccessKey` 未被指定的话, 则将使用默认的 [S3认证](../development/S3-compatible.md#S3认证方式)* + +#### 谷歌云存储输入源 + +> [!WARNING] +> 您需要添加 [`druid-google-extensions`](../configuration/core-ext/google-cloud-storage.md) 扩展以便使用谷歌云存储输入源。 + +谷歌云存储输入源支持直接从谷歌云存储读取对象,可以通过谷歌云存储URI字符串列表指定对象。谷歌云存储输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "uris": ["gs://foo/bar/file.json", "gs://bar/foo/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "prefixes": ["gs://foo/bar", "gs://bar/foo"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "google", + "objects": [ + { "bucket": "foo", "path": "bar/file1.json"}, + { "bucket": "bar", "path": "foo/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 应该是 `google` | None | 是 | +| `uris` | 指定被摄取的谷歌云存储对象位置的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| +| `prefixes` | 指定被摄取的谷歌云存储对象所在的路径前缀的URI JSON数组。 以被给定的前缀开头的空对象将被略过 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | +| `objects` | 指定被摄取的谷歌云存储对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| + +注意:只有当 `prefixes` 被指定时,谷歌云存储输入源将略过空的对象。 + +谷歌云存储对象: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `bucket` | 谷歌云存储 Bucket的名称 | None | 是 | +| `path` | 数据路径 | None | 是 | + +#### Azure输入源 + +> [!WARNING] +> 您需要添加 [`druid-azure-extensions`](../configuration/core-ext/microsoft-azure.md) 扩展以便使用Azure输入源。 + +Azure输入源支持直接从Azure读取对象,可以通过Azure URI字符串列表指定对象。Azure输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "uris": ["azure://container/prefix1/file.json", "azure://container/prefix2/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "prefixes": ["azure://container/prefix1", "azure://container/prefix2"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... 
+ "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azure", + "objects": [ + { "bucket": "container", "path": "prefix1/file1.json"}, + { "bucket": "container", "path": "prefix2/file2.json"} + ] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 应该是 `azure` | None | 是 | +| `uris` | 指定被摄取的azure对象位置的URI JSON数组, 格式必须为 `azure:///` | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| +| `prefixes` | 指定被摄取的azure对象所在的路径前缀的URI JSON数组, 格式必须为 `azure:///`, 以被给定的前缀开头的空对象将被略过 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | +| `objects` | 指定被摄取的azure对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| + +注意:只有当 `prefixes` 被指定时,azure输入源将略过空的对象。 + +azure对象: + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `bucket` | azure Bucket的名称 | None | 是 | +| `path` | 数据路径 | None | 是 | + +#### HDFS输入源 + +> [!WARNING] +> 您需要添加 [`druid-hdfs-extensions`](../configuration/core-ext/hdfs.md) 扩展以便使用HDFS输入源。 + +HDFS输入源支持直接从HDFS存储中读取文件,文件路径可以指定为HDFS URI字符串或者HDFS URI字符串列表。HDFS输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个文件。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "hdfs", + "paths": "hdfs://foo/bar/", "hdfs://bar/foo" + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "hdfs", + "paths": ["hdfs://foo/bar", "hdfs://bar/foo"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "hdfs", + "paths": "hdfs://foo/bar/file.json", "hdfs://bar/foo/file2.json" + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "hdfs", + "paths": ["hdfs://foo/bar/file.json", "hdfs://bar/foo/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 应该总是 `hdfs` | None | 是 | +| `paths` | HDFS路径。可以是JSON数组或逗号分隔的路径字符串,这些路径支持类似*的通配符。给定路径之下的空文件将会被跳过。 | None | 是 | + +您还可以使用HDFS输入源从云存储摄取数据。但是,如果您想从AWS S3或谷歌云存储读取数据,可以考虑使用 [S3输入源](../configuration/core-ext/s3.md) 或 [谷歌云存储输入源](../configuration/core-ext/google-cloud-storage.md)。 + +#### HTTP输入源 + +HTTP输入源支持直接通过HTTP从远程站点直接读取文件。 HTTP输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务只能读取一个文件。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +使用DefaultPassword Provider的身份验证字段示例(这要求密码位于摄取规范中): +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": "password123" + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +您还可以使用其他现有的Druid PasswordProvider。下面是使用EnvironmentVariablePasswordProvider的示例: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "http", + "uris": ["http://example.com/uri1", "http://example2.com/uri2"], + "httpAuthenticationUsername": "username", + "httpAuthenticationPassword": { + "type": "environment", + "variable": "HTTP_INPUT_SOURCE_PW" + } + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... 
+} +``` + +| 属性 | 描述 | 默认 | 是否必须 | +|-|-|-|-| +| `type` | 应该是 `http` | None | 是 | +| `uris` | 输入文件的uris | None | 是 | +| `httpAuthenticationUsername` | 用于指定uri的身份验证的用户名。如果规范中指定的uri需要基本身份验证头,则改属性是可选的。 | None | 否 | +| `httpAuthenticationPassword` | 用于指定uri的身份验证的密码。如果规范中指定的uri需要基本身份验证头,则改属性是可选的。 | None | 否 | + +#### Inline输入源 + +Inline输入源可用于读取其规范内联的数据。它可用于演示或用于快速测试数据解析和schema。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "inline", + "data": "0,values,formatted\n1,as,CSV" + }, + "inputFormat": { + "type": "csv" + }, + ... + }, +... +``` +| 属性 | 描述 | 是否必须 | +|-|-|-| +| `type` | 应该是 `inline` | 是 | +| `data` | 要摄入的内联数据 | 是 + + +#### Local输入源 + +Local输入源支持直接从本地存储中读取文件,主要目的用于PoC测试。 Local输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务读取一个或者多个文件。 + +样例规范: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "local", + "filter" : "*.csv", + "baseDir": "/data/directory", + "files": ["/bar/foo", "/foo/bar"] + }, + "inputFormat": { + "type": "csv" + }, + ... + }, +... +``` + +| 属性 | 描述 | 是否必须 | +|-|-|-| +| `type` | 应该是 `local` | 是 | +| `filter` | 文件的通配符筛选器, 详细信息 [点击此处](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) 查看 | 如果 `baseDir` 指定了,则为必须 | +| `baseDir` | 递归搜索要接收的文件的目录, 将跳过 `baseDir` 下的空文件。 | `baseDir` 或者 `files` 至少需要被指定一个 | +| `files` | 要摄取的文件路径。如果某些文件位于指定的 `baseDir` 下,则可以忽略它们以避免摄取重复文件。该选项会跳过空文件。| `baseDir` 或者 `files` 至少需要被指定一个 | + +#### Druid输入源 + +Druid输入源支持直接从现有的Druid段读取数据,可能使用新的模式,并更改段的名称、维度、Metrics、Rollup等。Druid输入源是可拆分的,可以由 [并行任务](#并行任务) 使用。这个输入源有一个固定的从Druid段读取的输入格式;当使用这个输入源时,不需要在摄取规范中指定输入格式字段。 + +| 属性 | 描述 | 是否必须 | +|-|-|-| +| `type` | 应该是 `druid` | 是 | +| `dataSource` | 定义要从中获取行的Druid数据源 | 是 | +| `interval` | ISO-8601时间间隔的字符串,它定义了获取数据的时间范围。 | 是 | +| `dimensions` | 包含要从Druid数据源中选择的维度列名称的字符串列表。如果列表为空,则不返回维度。如果为空,则返回所有维度。 | 否 | +| `metrics` | 包含要选择的Metric列名称的字符串列表。如果列表为空,则不返回任何度量。如果为空,则返回所有Metric。 | 否 | +| `filter` | 详情请查看 [filters](../querying/filters.html) 如果指定,则只返回与筛选器匹配的行。 | 否 | + +DruidInputSource规范的最小示例如下所示: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "druid", + "dataSource": "wikipedia", + "interval": "2013-01-01/2013-01-02" + } + ... + }, +... +``` + +上面的规范将从 `wikipedia` 数据源中读取所有现有dimension和metric列,包括 `2013-01-01/2013-01-02` 时间间隔内带有时间戳( `__time` 列)的所有行。 + +以下规范使用了筛选器并读取原始数据源列子集: +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "druid", + "dataSource": "wikipedia", + "interval": "2013-01-01/2013-01-02", + "dimensions": [ + "page", + "user" + ], + "metrics": [ + "added" + ], + "filter": { + "type": "selector", + "dimension": "page", + "value": "Druid" + } + } + ... + }, +... 
+``` + +上面的规范只返回 `page`、`user` 维度和 `added` 的Metric。只返回`page` = `Druid` 的行。 + +### Firehoses(已废弃) +#### StaticS3Firehose +#### HDFSFirehose +#### LocalFirehose +#### HttpFirehose +#### IngestSegmentFirehose +#### SqlFirehose +#### InlineFirehose +#### CombiningFirehose \ No newline at end of file diff --git a/ingestion/native.md b/ingestion/native.md deleted file mode 100644 index 1a02b3e..0000000 --- a/ingestion/native.md +++ /dev/null @@ -1,1142 +0,0 @@ - -## 本地批摄入 - -Apache Druid当前支持两种类型的本地批量索引任务, `index_parallel` 可以并行的运行多个任务, `index`运行单个索引任务。 详情可以查看 [基于Hadoop的摄取vs基于本地批摄取的对比](ingestion.md#批量摄取) 来了解基于Hadoop的摄取、本地简单批摄取、本地并行摄取三者的比较。 - -要运行这两种类型的本地批索引任务,请按以下指定编写摄取规范。然后将其发布到Overlord的 `/druid/indexer/v1/task` 接口,或者使用druid附带的 `bin/post-index-task`。 - -### 教程 - -此页包含本地批处理摄取的参考文档。相反,如果要进行演示,请查看 [加载文件教程](../GettingStarted/chapter-1.md),该教程演示了"简单"(单任务)模式 - -### 并行任务 - -并行任务(`index_parallel`类型)是用于并行批索引的任务。此任务只使用Druid的资源,不依赖于其他外部系统,如Hadoop。`index_parallel` 任务是一个supervisor任务,它协调整个索引过程。supervisor分割输入数据并创建辅助任务来处理这些分割, 创建的worker将发布给Overlord,以便在MiddleManager或Indexer上安排和运行。一旦worker成功处理分配的输入拆分,它就会将生成的段列表报告给supervisor任务。supervisor定期检查工作任务的状态。如果其中一个失败,它将重试失败的任务,直到重试次数达到配置的限制。如果所有工作任务都成功,它将立即发布报告的段并完成摄取。 - -并行任务的详细行为是不同的,取决于 [`partitionsSpec`](#partitionsspec),详情可以查看 `partitionsSpec` - -要使用此任务,`ioConfig` 中的 [`inputSource`](#输入源) 应为*splittable(可拆分的)*,`tuningConfig` 中的 `maxNumConcurrentSubTasks` 应设置为大于1。否则,此任务将按顺序运行;`index_parallel` 任务将逐个读取每个输入文件并自行创建段。目前支持的可拆分输入格式有: - -* [`s3`](#s3%e8%be%93%e5%85%a5%e6%ba%90) 从AWS S3存储读取数据 -* [`gs`](#谷歌云存储输入源) 从谷歌云存储读取数据 -* [`azure`](#azure%e8%be%93%e5%85%a5%e6%ba%90) 从Azure Blob存储读取数据 -* [`hdfs`](#hdfs%e8%be%93%e5%85%a5%e6%ba%90) 从HDFS存储中读取数据 -* [`http`](#HTTP输入源) 从HTTP服务中读取数据 -* [`local`](#local%e8%be%93%e5%85%a5%e6%ba%90) 从本地存储中读取数据 -* [`druid`](#druid%e8%be%93%e5%85%a5%e6%ba%90) 从Druid数据源中读取数据 - -传统的 [`firehose`](#firehoses%e5%b7%b2%e5%ba%9f%e5%bc%83) 支持其他一些云存储类型。下面的 `firehose` 类型也是可拆分的。请注意,`firehose` 只支持文本格式。 - -* [`static-cloudfiles`](../development/rackspacecloudfiles.md) - -您可能需要考虑以下事项: -* 您可能希望控制每个worker进程的输入数据量。这可以使用不同的配置进行控制,具体取决于并行摄取的阶段(有关更多详细信息,请参阅 [`partitionsSpec`](#partitionsspec)。对于从 `inputSource` 读取数据的任务,可以在 `tuningConfig` 中设置 [分割提示规范](#分割提示规范)。对于合并无序段的任务,可以在 `tuningConfig` 中设置`totalNumMergeTasks`。 -* 并行摄取中并发worker的数量由 `tuningConfig` 中的`maxNumConcurrentSubTasks` 确定。supervisor检查当前正在运行的worker的数量,如果小于 `maxNumConcurrentSubTasks`,则无论当前有多少任务槽可用,都会创建更多的worker。这可能会影响其他摄取性能。有关更多详细信息,请参阅下面的 [容量规划部分](#容量规划)。 -* 默认情况下,批量摄取将替换它写入的任何段中的所有数据(在`granularitySpec` 的间隔中)。如果您想添加到段中,请在 `ioConfig` 中设置 `appendToExisting` 标志。请注意,它只替换主动添加数据的段中的数据:如果 `granularitySpec` 的间隔中有段没有此任务写入的数据,则它们将被单独保留。如果任何现有段与 `granularitySpec` 的间隔部分重叠,则新段间隔之外的那些段的部分仍将可见。 - -#### 任务符号 - -一个简易的任务如下所示: - -```json -{ - "type": "index_parallel", - "spec": { - "dataSchema": { - "dataSource": "wikipedia_parallel_index_test", - "timestampSpec": { - "column": "timestamp" - }, - "dimensionsSpec": { - "dimensions": [ - "page", - "language", - "user", - "unpatrolled", - "newPage", - "robot", - "anonymous", - "namespace", - "continent", - "country", - "region", - "city" - ] - }, - "metricsSpec": [ - { - "type": "count", - "name": "count" - }, - { - "type": "doubleSum", - "name": "added", - "fieldName": "added" - }, - { - "type": "doubleSum", - "name": "deleted", - "fieldName": "deleted" - }, - { - "type": "doubleSum", - "name": "delta", - "fieldName": "delta" - } - ], - "granularitySpec": { - "segmentGranularity": "DAY", - "queryGranularity": "second", - "intervals" : [ "2013-08-31/2013-09-02" ] - } - }, - "ioConfig": { - 
"type": "index_parallel", - "inputSource": { - "type": "local", - "baseDir": "examples/indexing/", - "filter": "wikipedia_index_data*" - }, - "inputFormat": { - "type": "json" - } - }, - "tuningconfig": { - "type": "index_parallel", - "maxNumConcurrentSubTasks": 2 - } - } -} -``` - -| 属性 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型,应当总是 `index_parallel` | 是 | -| `id` | 任务ID。 如果该项没有显式的指定,Druid将使用任务类型、数据源名称、时间间隔、日期时间戳生成一个任务ID | 否 | -| `spec` | 摄取规范包括数据schema、IOConfig 和 TuningConfig。详情见下边详细描述 | 是 | -| `context` | Context包括了多个任务配置参数。详情见下边详细描述 | 否 | - -##### `dataSchema` - -该字段为必须字段。 - -可以参见 [摄取规范中的dataSchema](ingestion.md#dataSchema) - -如果在dataSchema的 `granularitySpec` 中显式地指定了 `intervals`,则批处理摄取将锁定启动时指定的完整间隔,并且您将快速了解指定间隔是否与其他任务(例如Kafka摄取)持有的锁重叠。否则,在发现每个间隔时,批处理摄取将锁定该间隔,因此您可能只会在摄取过程中了解到该任务与较高优先级的任务重叠。如果显式指定 `intervals`,则指定间隔之外的任何行都将被丢弃。如果您知道数据的时间范围,我们建议显式地设置`intervals`,以便锁定失败发生得更快,并且如果有一些带有意外时间戳的杂散数据,您不会意外地替换该范围之外的数据。 - -##### `ioConfig` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 任务类型, 应当总是 `index_parallel` | none | 是 | -| `inputFormat` | [`inputFormat`](dataformats.md#InputFormat) 用来指定如何解析输入数据 | none | 是 | -| `appendToExisting` | 创建段作为最新版本的附加分片,有效地附加到段集而不是替换它。仅当现有段集具有可扩展类型 `shardSpec`时,此操作才有效。 | false | 否 | - -##### `tuningConfig` - -tuningConfig是一个可选项,如果未指定则使用默认的参数。 详情如下: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 任务类型,应当总是 `index_parallel` | none | 是 | -| `maxRowsPerSegment` | 已废弃。使用 `partitionsSpec` 替代,被用来分片。 决定在每个段中有多少行。 | 5000000 | 否 | -| `maxRowsInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常用户不需要设置此值,但根据数据的性质,如果行的字节数较短,则用户可能不希望在内存中存储一百万行,应设置此值。 | 1000000 | 否 | -| `maxBytesInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常这是在内部计算的,用户不需要设置它。此值表示在持久化之前要在堆内存中聚合的字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。用于索引的最大堆内存使用量为 `maxBytesInMemory *(2 + maxPendingResistent)` | 最大JVM内存的1/6 | 否 | -| `maxTotalRows` | 已废弃。使用 `partitionsSpec` 替代。等待推送的段中的总行数。用于确定何时应进行中间推送。| 20000000 | 否 | -| `numShards` | 已废弃。使用 `partitionsSpec` 替代。当使用 `hashed` `partitionsSpec`时直接指定要创建的分片数。如果该值被指定了且在 `granularitySpec`中指定了 `intervals`,那么索引任务可以跳过确定间隔/分区传递数据。如果设置了 `maxRowsPerSegment`,则无法指定 `numShards`。 | null | 否 | -| `splitHintSpec` | 用于提供提示以控制每个第一阶段任务读取的数据量。根据输入源的实现,可以忽略此提示。有关更多详细信息,请参见 [分割提示规范](#分割提示规范)。 | 基于大小的分割提示规范 | 否 | -| `partitionsSpec` | 定义在每个时间块中如何分区数据。 参见 [partitionsSpec](#partitionsspec) | 如果 `forceGuaranteedRollup` = false, 则为 `dynamic`; 如果 `forceGuaranteedRollup` = true, 则为 `hashed` 或者 `single_dim` | 否 | -| `indexSpec` | 定义段在索引阶段的存储格式相关选项,参见 [IndexSpec](ingestion.md#tuningConfig) | null | 否 | -| `indexSpecForIntermediatePersists` | 定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅 [IndexSpec](ingestion.md#tuningConfig)。 | 与 `indexSpec` 相同 | 否 | -| `maxPendingPersists` | 可挂起但未启动的最大持久化任务数。如果新的中间持久化将超过此限制,则在当前运行的持久化完成之前,摄取将被阻止。使用`maxRowsInMemory * (2 + maxPendingResistents)` 索引扩展的最大堆内存使用量。 | 0 (这意味着一个持久化任务只可以与摄取同时运行,而没有一个可以排队) | 否 | -| `forceGuaranteedRollup` | 强制保证 [最佳Rollup](ingestion.md#Rollup)。最佳rollup优化了生成的段的总大小和查询时间,同时索引时间将增加。如果设置为true,则必须设置 `granularitySpec` 中的 `intervals` ,同时必须对 `partitionsSpec` 使用 `single_dim` 或者 `hashed` 。此标志不能与 `IOConfig` 的 `appendToExisting` 一起使用。有关更多详细信息,请参见下面的 ["分段推送模式"](#分段推送模式) 部分。 | false | 否 | -| `reportParseExceptions` | 如果为true,则将引发解析期间遇到的异常并停止摄取;如果为false,则将跳过不可解析的行和字段。 | false | 否 | -| `pushTimeout` | 段推送的超时毫秒时间。 该值必须设置为 >= 0, 0意味着永不超时 | 0 | 否 | -| `segmentWriteOutMediumFactory` | 创建段时使用的段写入介质。 参见 [segmentWriteOutMediumFactory](#segmentWriteOutMediumFactory) | 未指定, 值来源于 `druid.peon.defaultSegmentWriteOutMediumFactory.type` | 否 | -| 
`maxNumConcurrentSubTasks` | 可同时并行运行的最大worker数。无论当前可用的任务槽如何,supervisor都将生成最多为 `maxNumConcurrentSubTasks` 的worker。如果此值设置为1,supervisor将自行处理数据摄取,而不是生成worker。如果将此值设置为太大,则可能会创建太多的worker,这可能会阻止其他摄取。查看 [容量规划](#容量规划) 以了解更多详细信息。 | 1 | 否 | -| `maxRetry` | 任务失败后最大重试次数 | 3 | 否 | -| `maxNumSegmentsToMerge` | 单个任务在第二阶段可同时合并的段数的最大限制。仅在 `forceGuaranteedRollup` 被设置的时候使用。 | 100 | 否 | -| `totalNumMergeTasks` | 当 `partitionsSpec` 被设置为 `hashed` 或者 `single_dim`时, 在合并阶段用来合并段的最大任务数。 | 10 | 否 | -| `taskStatusCheckPeriodMs` | 检查运行任务状态的轮询周期(毫秒)。| 1000 | 否 | -| `chatHandlerTimeout` | 报告worker中的推送段超时。| PT10S | 否 | -| `chatHandlerNumRetries` | 重试报告worker中的推送段 | 5 | 否 | - -#### 分割提示规范 - -分割提示规范用于在supervisor创建输入分割时给出提示。请注意,每个worker处理一个输入拆分。您可以控制每个worker在第一阶段读取的数据量。 - -**基于大小的分割提示规范** -除HTTP输入源外,所有可拆分输入源都遵循基于大小的拆分提示规范。 - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应当总是 `maxSize` | none | 是 | -| `maxSplitSize` | 单个任务中要处理的输入文件的最大字节数。如果单个文件大于此数字,则它将在单个任务中自行处理(文件永远不会跨任务拆分)。 | 500MB | 否 | - -**段分割提示规范** - -段分割提示规范仅仅用在 [`DruidInputSource`](#Druid输入源)(和过时的 [`IngestSegmentFirehose`](#IngestSegmentFirehose)) - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应当总是 `segments` | none | 是 | -| `maxInputSegmentBytesPerTask` | 单个任务中要处理的输入段的最大字节数。如果单个段大于此数字,则它将在单个任务中自行处理(输入段永远不会跨任务拆分)。 | 500MB | 否 | - -##### `partitionsSpec` - -PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模式使用不同的partitionsSpec。为了实现 [最佳rollup](ingestion.md#rollup),您应该使用 `hashed`(基于每行中维度的哈希进行分区)或 `single_dim`(基于单个维度的范围)。对于"尽可能rollup"模式,应使用 `dynamic`。 - -三种 `partitionsSpec` 类型有着不同的特征。 - -| PartitionsSpec | 摄入速度 | 分区方式 | 支持的rollup模式 | 查询时的段修剪 | -|-|-|-|-|-| -| `dynamic` | 最快 | 基于段中的行数来进行分区 | 尽可能rollup | N/A | -| `hashed` | 中等 | 基于分区维度的哈希值进行分区。此分区可以通过改进数据位置性来减少数据源大小和查询延迟。有关详细信息,请参见 [分区](ingestion.md#分区)。 | 最佳rollup | N/A | -| `single_dim` | 最慢 | 基于分区维度值的范围分区。段大小可能会根据分区键分布而倾斜。这可能通过改善数据位置性来减少数据源大小和查询延迟。有关详细信息,请参见 [分区](ingestion.md#分区)。 | 最佳rollup | Broker可以使用分区信息提前修剪段以加快查询速度。由于Broker知道每个段中 `partitionDimension` 值的范围,因此,给定一个包含`partitionDimension` 上的筛选器的查询,Broker只选取包含满足 `partitionDimension` 上的筛选器的行的段进行查询处理。| - -对于每一种partitionSpec,推荐的使用场景是: - -* 如果数据有一个在查询中经常使用的均匀分布列,请考虑使用 `single_dim` partitionsSpec来最大限度地提高大多数查询的性能。 -* 如果您的数据不是均匀分布的列,但在使用某些维度进行rollup时,需要具有较高的rollup汇总率,请考虑使用 `hashed` partitionsSpec。通过改善数据的局部性,可以减小数据源的大小和查询延迟。 -* 如果以上两个场景不是这样,或者您不需要rollup数据源,请考虑使用 `dynamic` partitionsSpec。 - -**Dynamic分区** - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `dynamic` | none | 是 | -| `maxRowsPerSegment` | 用来分片。决定在每一个段中有多少行 | 5000000 | 否 | -| `maxTotalRows` | 等待推送的所有段的总行数。用于确定中间段推送的发生时间。 | 20000000 | 否 | - -使用Dynamic分区,并行索引任务在一个阶段中运行:它将生成多个worker(`single_phase_sub_task` 类型),每个worker都创建段。worker创建段的方式是: - -* 每当当前段中的行数超过 `maxRowsPerSegment` 时,任务将创建一个新段。 -* 一旦所有时间块中所有段中的行总数达到 `maxTotalRows`,任务就会将迄今为止创建的所有段推送到深层存储并创建新段。 - -**基于哈希的分区** - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `hashed` | none | 是 | -| `numShards` | 直接指定要创建的分片数。如果该值被指定了,同时在 `granularitySpec` 中指定了 `intervals`,那么索引任务可以跳过确定通过数据的间隔/分区 | null | 是 | -| `partitionDimensions` | 要分区的维度。留空可选择所有维度。| null | 否 | - -基于哈希分区的并行任务类似于 [MapReduce](https://en.wikipedia.org/wiki/MapReduce)。任务分为两个阶段运行,即 `部分段生成` 和 `部分段合并`。 - -* 在 `部分段生成` 阶段,与MapReduce中的Map阶段一样,并行任务根据分割提示规范分割输入数据,并将每个分割分配给一个worker。每个worker(`partial_index_generate` 类型)从 `granularitySpec` 中的`segmentGranularity(主分区键)` 读取分配的分割,然后按`partitionsSpec` 中 `partitionDimensions(辅助分区键)`的哈希值对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md) 的本地存储中。 -* `部分段合并` 阶段类似于MapReduce中的Reduce阶段。并行任务生成一组新的worker(`partial_index_merge` 
类型)来合并在前一阶段创建的分区数据。这里,分区数据根据要合并的时间块和分区维度的散列值进行洗牌;每个worker从多个MiddleManager/Indexer进程中读取落在同一时间块和同一散列值中的数据,并将其合并以创建最终段。最后,它们将最后的段一次推送到深层存储。 - -**基于单一维度范围分区** - -> [!WARNING] -> 在并行任务的顺序模式下,当前不支持单一维度范围分区。尝试将`maxNumConcurrentSubTasks` 设置为大于1以使用此分区方式。 - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `single_dim` | none | 是 | -| `partitionDimension` | 要分区的维度。 仅仅允许具有单一维度值的行 | none | 是 | -| `targetRowsPerSegment` | 在一个分区中包含的目标行数,应当是一个500MB ~ 1GB目标段的数值。 | none | 要么该值被设置,或者 `maxRowsPerSegment`被设置。 | -| `maxRowsPerSegment` | 分区中要包含的行数的软最大值。| none | 要么该值被设置,或者 `targetRowsPerSegment`被设置。| -| `assumeGrouped` | 假设输入数据已经按时间和维度分组。摄取将运行得更快,但如果违反此假设,则可能会选择次优分区 | false | 否 | - -在 `single-dim` 分区下,并行任务分为3个阶段进行,即 `部分维分布`、`部分段生成` 和 `部分段合并`。第一个阶段是收集一些统计数据以找到最佳分区,另外两个阶段是创建部分段并分别合并它们,就像在基于哈希的分区中那样。 - -* 在 `部分维度分布` 阶段,并行任务分割输入数据,并根据分割提示规范将其分配给worker。每个worker任务(`partial_dimension_distribution` 类型)读取分配的分割并为 `partitionDimension` 构建直方图。并行任务从worker任务收集这些直方图,并根据 `partitionDimension` 找到最佳范围分区,以便在分区之间均匀分布行。请注意,`targetRowsPerSegment` 或 `maxRowsPerSegment` 将用于查找最佳分区。 -* 在 `部分段生成` 阶段,并行任务生成新的worker任务(`partial_range_index_generate` 类型)以创建分区数据。每个worker任务都读取在前一阶段中创建的分割,根据 `granularitySpec` 中的`segmentGranularity(主分区键)`的时间块对行进行分区,然后根据在前一阶段中找到的范围分区对行进行分区。分区数据存储在 [MiddleManager](../design/MiddleManager.md) 或 [Indexer](../design/Indexer.md)的本地存储中。 -* 在 `部分段合并` 阶段,并行索引任务生成一组新的worker任务(`partial_index_generic_merge`类型)来合并在上一阶段创建的分区数据。这里,分区数据根据时间块和 `partitionDimension` 的值进行洗牌;每个工作任务从多个MiddleManager/Indexer进程中读取属于同一范围的同一分区中的段,并将它们合并以创建最后的段。最后,它们将最后的段推到深层存储。 - -> [!WARNING] -> 由于单一维度范围分区的任务在 `部分维度分布` 和 `部分段生成` 阶段对输入进行两次传递,因此如果输入在两次传递之间发生变化,任务可能会失败 - -#### HTTP状态接口 - -supervisor任务提供了一些HTTP接口来获取任务状态。 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/mode` - -如果索引任务以并行的方式运行,则返回 "parallel", 否则返回 "sequential" - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/phase` - -如果任务以并行的方式运行,则返回当前阶段的名称 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/progress` - -如果supervisor任务以并行的方式运行,则返回当前阶段的预估进度 - -一个示例结果如下: -```json -{ - "running":10, - "succeeded":0, - "failed":0, - "complete":0, - "total":10, - "estimatedExpectedSucceeded":10 -} -``` - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtasks/running` - -返回正在运行的worker任务的任务IDs,如果该supervisor任务以序列模式运行则返回一个空的列表 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs` - -返回所有的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/running` - -返回正在运行的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspecs/complete` - -返回已经完成的worker任务规范,如果该supervisor任务以序列模式运行则返回一个空的列表 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}` - -返回指定ID的worker任务规范,如果该supervisor任务以序列模式运行则返回一个HTTP 404 - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/state` - -返回指定ID的worker任务规范的状态,如果该supervisor任务以序列模式运行则返回一个HTTP 404。 返回的结果集中包括worker任务规范,当前任务状态(如果存在的话) 以及任务尝试历史记录。 - -一个示例结果如下: -```json -{ - "spec": { - "id": "index_parallel_lineitem_2018-04-20T22:12:43.610Z_2", - "groupId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", - "supervisorTaskId": "index_parallel_lineitem_2018-04-20T22:12:43.610Z", - "context": null, - "inputSplit": { - "split": "/path/to/data/lineitem.tbl.5" - }, - "ingestionSpec": { - "dataSchema": { - 
"dataSource": "lineitem", - "timestampSpec": { - "column": "l_shipdate", - "format": "yyyy-MM-dd" - }, - "dimensionsSpec": { - "dimensions": [ - "l_orderkey", - "l_partkey", - "l_suppkey", - "l_linenumber", - "l_returnflag", - "l_linestatus", - "l_shipdate", - "l_commitdate", - "l_receiptdate", - "l_shipinstruct", - "l_shipmode", - "l_comment" - ] - }, - "metricsSpec": [ - { - "type": "count", - "name": "count" - }, - { - "type": "longSum", - "name": "l_quantity", - "fieldName": "l_quantity", - "expression": null - }, - { - "type": "doubleSum", - "name": "l_extendedprice", - "fieldName": "l_extendedprice", - "expression": null - }, - { - "type": "doubleSum", - "name": "l_discount", - "fieldName": "l_discount", - "expression": null - }, - { - "type": "doubleSum", - "name": "l_tax", - "fieldName": "l_tax", - "expression": null - } - ], - "granularitySpec": { - "type": "uniform", - "segmentGranularity": "YEAR", - "queryGranularity": { - "type": "none" - }, - "rollup": true, - "intervals": [ - "1980-01-01T00:00:00.000Z/2020-01-01T00:00:00.000Z" - ] - }, - "transformSpec": { - "filter": null, - "transforms": [] - } - }, - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "local", - "baseDir": "/path/to/data/", - "filter": "lineitem.tbl.5" - }, - "inputFormat": { - "format": "tsv", - "delimiter": "|", - "columns": [ - "l_orderkey", - "l_partkey", - "l_suppkey", - "l_linenumber", - "l_quantity", - "l_extendedprice", - "l_discount", - "l_tax", - "l_returnflag", - "l_linestatus", - "l_shipdate", - "l_commitdate", - "l_receiptdate", - "l_shipinstruct", - "l_shipmode", - "l_comment" - ] - }, - "appendToExisting": false - }, - "tuningConfig": { - "type": "index_parallel", - "maxRowsPerSegment": 5000000, - "maxRowsInMemory": 1000000, - "maxTotalRows": 20000000, - "numShards": null, - "indexSpec": { - "bitmap": { - "type": "roaring" - }, - "dimensionCompression": "lz4", - "metricCompression": "lz4", - "longEncoding": "longs" - }, - "indexSpecForIntermediatePersists": { - "bitmap": { - "type": "roaring" - }, - "dimensionCompression": "lz4", - "metricCompression": "lz4", - "longEncoding": "longs" - }, - "maxPendingPersists": 0, - "reportParseExceptions": false, - "pushTimeout": 0, - "segmentWriteOutMediumFactory": null, - "maxNumConcurrentSubTasks": 4, - "maxRetry": 3, - "taskStatusCheckPeriodMs": 1000, - "chatHandlerTimeout": "PT10S", - "chatHandlerNumRetries": 5, - "logParseExceptions": false, - "maxParseExceptions": 2147483647, - "maxSavedParseExceptions": 0, - "forceGuaranteedRollup": false, - "buildV9Directly": true - } - } - }, - "currentStatus": { - "id": "index_sub_lineitem_2018-04-20T22:16:29.922Z", - "type": "index_sub", - "createdTime": "2018-04-20T22:16:29.925Z", - "queueInsertionTime": "2018-04-20T22:16:29.929Z", - "statusCode": "RUNNING", - "duration": -1, - "location": { - "host": null, - "port": -1, - "tlsPort": -1 - }, - "dataSource": "lineitem", - "errorMsg": null - }, - "taskHistory": [] -} -``` - -* `http://{PEON_IP}:{PEON_PORT}/druid/worker/v1/chat/{SUPERVISOR_TASK_ID}/subtaskspec/{SUB_TASK_SPEC_ID}/history -` - -返回被指定ID的worker任务规范的任务尝试历史记录,如果该supervisor任务以序列模式运行则返回一个HTTP 404 - -#### 容量规划 - -不管当前有多少任务槽可用,supervisor任务最多可以创建 `maxNumConcurrentSubTasks` worker任务, 因此,可以同时运行的任务总数为 `(maxNumConcurrentSubTasks+1)(包括supervisor任务)`。请注意,这甚至可以大于任务槽的总数(所有worker的容量之和)。如果`maxNumConcurrentSubTasks` 大于 `n(可用任务槽)`,则`maxNumConcurrentSubTasks` 任务由supervisor任务创建,但只有 `n` 个任务将启动, 其他人将在挂起状态下等待,直到任何正在运行的任务完成。 - -如果将并行索引任务与流摄取一起使用,我们建议**限制批摄取的最大容量**,以防止流摄取被批摄取阻止。假设您同时有 `t` 个并行索引任务要运行, 
但是想将批摄取的最大任务数限制在 `b`。 那么, 所有并行索引任务的 `maxNumConcurrentSubTasks` 之和 + `t`(supervisor任务数) 必须小于 `b`。 - -如果某些任务的优先级高于其他任务,则可以将其`maxNumConcurrentSubTasks` 设置为高于低优先级任务的值。这可能有助于高优先级任务比低优先级任务提前完成,方法是为它们分配更多的任务槽。 - -### 简单任务 - -简单任务(`index`类型)设计用于较小的数据集。任务在索引服务中执行。 - -#### 任务符号 - -一个示例任务如下: - -```json -{ - "type" : "index", - "spec" : { - "dataSchema" : { - "dataSource" : "wikipedia", - "timestampSpec" : { - "column" : "timestamp", - "format" : "auto" - }, - "dimensionsSpec" : { - "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], - "dimensionExclusions" : [] - }, - "metricsSpec" : [ - { - "type" : "count", - "name" : "count" - }, - { - "type" : "doubleSum", - "name" : "added", - "fieldName" : "added" - }, - { - "type" : "doubleSum", - "name" : "deleted", - "fieldName" : "deleted" - }, - { - "type" : "doubleSum", - "name" : "delta", - "fieldName" : "delta" - } - ], - "granularitySpec" : { - "type" : "uniform", - "segmentGranularity" : "DAY", - "queryGranularity" : "NONE", - "intervals" : [ "2013-08-31/2013-09-01" ] - } - }, - "ioConfig" : { - "type" : "index", - "inputSource" : { - "type" : "local", - "baseDir" : "examples/indexing/", - "filter" : "wikipedia_data.json" - }, - "inputFormat": { - "type": "json" - } - }, - "tuningConfig" : { - "type" : "index", - "maxRowsPerSegment" : 5000000, - "maxRowsInMemory" : 1000000 - } - } -} -``` - -| 属性 | 描述 | 是否必须 | -|-|-|-| -| `type` | 任务类型, 应该总是 `index` | 是 | -| `id` | 任务ID。如果该值为显式的指定,Druid将会使用任务类型、数据源名称、时间间隔以及日期时间戳生成一个任务ID | 否 | -| `spec` | 摄入规范,包括dataSchema、IOConfig 和 TuningConfig。 详情见下边的描述 | 是 | -| `context` | 包含多个任务配置参数的上下文。 详情见下边的描述 | 否 | - -##### `dataSchema` - -**该字段为必须字段。** - -详情可以见摄取文档中的 [`dataSchema`](ingestion.md#dataSchema) 部分。 - -如果没有在 `dataSchema` 的 `granularitySpec` 中显式指定 `intervals`,本地索引任务将对数据执行额外的传递,以确定启动时要锁定的范围。如果显式指定 `intervals`,则指定间隔之外的任何行都将被丢弃。如果您知道数据的时间范围,我们建议显式设置 `intervals`,因为它允许任务跳过额外的过程,并且如果有一些带有意外时间戳的杂散数据,您不会意外地替换该范围之外的数据。 - -##### `ioConfig` - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 任务类型,应该总是 `index` | none | 是 | -| `inputFormat` | [`inputFormat`](dataformats.md#InputFormat) 指定如何解析输入数据 | none | 是 | -| `appendToExisting` | 创建段作为最新版本的附加分片,有效地附加到段集而不是替换它。仅当现有段集具有可扩展类型 `shardSpec`时,此操作才有效。 | false | 否 | - -##### `tuningConfig` - -tuningConfig是一个可选项,如果未指定则使用默认的参数。 详情如下: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 任务类型,应当总是 `index` | none | 是 | -| `maxRowsPerSegment` | 已废弃。使用 `partitionsSpec` 替代,被用来分片。 决定在每个段中有多少行。 | 5000000 | 否 | -| `maxRowsInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常用户不需要设置此值,但根据数据的性质,如果行的字节数较短,则用户可能不希望在内存中存储一百万行,应设置此值。 | 1000000 | 否 | -| `maxBytesInMemory` | 用于确定何时应该从中间层持久化到磁盘。通常这是在内部计算的,用户不需要设置它。此值表示在持久化之前要在堆内存中聚合的字节数。这是基于对内存使用量的粗略估计,而不是实际使用量。用于索引的最大堆内存使用量为 `maxBytesInMemory *(2 + maxPendingResistent)` | 最大JVM内存的1/6 | 否 | -| `maxTotalRows` | 已废弃。使用 `partitionsSpec` 替代。等待推送的段中的总行数。用于确定何时应进行中间推送。| 20000000 | 否 | -| `numShards` | 已废弃。使用 `partitionsSpec` 替代。当使用 `hashed` `partitionsSpec`时直接指定要创建的分片数。如果该值被指定了且在 `granularitySpec`中指定了 `intervals`,那么索引任务可以跳过确定间隔/分区传递数据。如果设置了 `maxRowsPerSegment`,则无法指定 `numShards`。 | null | 否 | -| `partitionsSpec` | 定义在每个时间块中如何分区数据。 参见 [partitionsSpec](#partitionsspec) | 如果 `forceGuaranteedRollup` = false, 则为 `dynamic`; 如果 `forceGuaranteedRollup` = true, 则为 `hashed` 或者 `single_dim` | 否 | -| `indexSpec` | 定义段在索引阶段的存储格式相关选项,参见 [IndexSpec](ingestion.md#tuningConfig) | null | 否 | -| `indexSpecForIntermediatePersists` | 
定义要在索引时用于中间持久化临时段的段存储格式选项。这可用于禁用中间段上的维度/度量压缩,以减少最终合并所需的内存。但是,在中间段上禁用压缩可能会增加页缓存的使用,而在它们被合并到发布的最终段之前使用它们,有关可能的值,请参阅 [IndexSpec](ingestion.md#tuningConfig)。 | 与 `indexSpec` 相同 | 否 | -| `maxPendingPersists` | 可挂起但未启动的最大持久化任务数。如果新的中间持久化将超过此限制,则在当前运行的持久化完成之前,摄取将被阻止。使用`maxRowsInMemory * (2 + maxPendingResistents)` 索引扩展的最大堆内存使用量。 | 0 (这意味着一个持久化任务只可以与摄取同时运行,而没有一个可以排队) | 否 | -| `forceGuaranteedRollup` | 强制保证 [最佳Rollup](ingestion.md#Rollup)。最佳rollup优化了生成的段的总大小和查询时间,同时索引时间将增加。如果设置为true,则必须设置 `granularitySpec` 中的 `intervals` ,同时必须对 `partitionsSpec` 使用 `single_dim` 或者 `hashed` 。此标志不能与 `IOConfig` 的 `appendToExisting` 一起使用。有关更多详细信息,请参见下面的 ["分段推送模式"](#分段推送模式) 部分。 | false | 否 | -| `reportParseExceptions` | 已废弃。如果为true,则将引发解析期间遇到的异常并停止摄取;如果为false,则将跳过不可解析的行和字段。将 `reportParseExceptions` 设置为true将覆盖`maxParseExceptions` 和 `maxSavedParseExceptions` 的现有配置,将 `maxParseExceptions` 设置为0并将 `maxSavedParseExceptions` 限制为不超过1。 | false | 否 | -| `pushTimeout` | 段推送的超时毫秒时间。 该值必须设置为 >= 0, 0意味着永不超时 | 0 | 否 | -| `segmentWriteOutMediumFactory` | 创建段时使用的段写入介质。 参见 [segmentWriteOutMediumFactory](#segmentWriteOutMediumFactory) | 未指定, 值来源于 `druid.peon.defaultSegmentWriteOutMediumFactory.type` | 否 | -| `logParseExceptions` | 如果为true,则在发生解析异常时记录错误消息,其中包含有关发生错误的行的信息。 | false | 否 | -| `maxParseExceptions` | 任务停止接收并失败之前可能发生的最大分析异常数。如果设置了`reportParseExceptions`,则该配置被覆盖。 | unlimited | 否 | -| `maxSavedParseExceptions` | 当出现解析异常时,Druid可以跟踪最新的解析异常。"maxSavedParseExceptions" 限制将保存多少个异常实例。这些保存的异常将在任务完成报告中的任务完成后可用。如果设置了 `reportParseExceptions` ,则该配置被覆盖。 | 0 | 否 | - - -##### `partitionsSpec` - -PartitionsSpec用于描述辅助分区方法。您应该根据需要的rollup模式使用不同的partitionsSpec。为了实现 [最佳rollup](ingestion.md#rollup),您应该使用 `hashed`(基于每行中维度的哈希进行分区) - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `hashed` | none | 是 | -| `maxRowsPerSegment` | 用在分片中,决定在每个段中有多少行 | 5000000 | 否 | -| `numShards` | 直接指定要创建的分片数。如果该值被指定了,同时在 `granularitySpec` 中指定了 `intervals`,那么索引任务可以跳过确定通过数据的间隔/分区 | null | 是 | -| `partitionDimensions` | 要分区的维度。留空可选择所有维度。| null | 否 | - -对于尽可能rollup模式,您应该使用 `dynamic` - -| 属性 | 描述 | 默认值 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `dynamic` | none | 是 | -| `maxRowsPerSegment` | 用来分片。决定在每一个段中有多少行 | 5000000 | 否 | -| `maxTotalRows` | 等待推送的所有段的总行数。用于确定中间段推送的发生时间。 | 20000000 | 否 | - -##### `segmentWriteOutMediumFactory` - -| 字段 | 类型 | 描述 | 是否必须 | -|-|-|-|-| -| `type` | String | 配置解释和可选项可以参见 [额外的Peon配置:SegmentWriteOutMediumFactory](../configuration/human-readable-byte.md#SegmentWriteOutMediumFactory) | 是 | - -#### 分段推送模式 - -当使用简单任务摄取数据时,它从输入数据创建段并推送它们。对于分段推送,索引任务支持两种分段推送模式,分别是*批量推送模式*和*增量推送模式*,以实现 [最佳rollup和尽可能rollup](ingestion.md#rollup)。 - -在批量推送模式下,在索引任务的最末端推送每个段。在此之前,创建的段存储在运行索引任务的进程的内存和本地存储中。因此,此模式可能由于存储容量有限而导致问题,建议不要在生产中使用。 - -相反,在增量推送模式下,分段是增量推送的,即可以在索引任务的中间推送。更准确地说,索引任务收集数据并将创建的段存储在运行该任务的进程的内存和磁盘中,直到收集的行总数超过 `maxTotalRows`。一旦超过,索引任务将立即推送创建的所有段,直到此时为止,清除所有推送的段,并继续接收剩余的数据。 - -要启用批量推送模式,应在 `TuningConfig` 中设置`forceGuaranteedRollup`。请注意,此选项不能与 `IOConfig` 的`appendToExisting`一起使用。 - -### 输入源 - -输入源是定义索引任务读取数据的位置。只有本地并行任务和简单任务支持输入源。 - -#### S3输入源 - -> [!WARNING] -> 您需要添加 [`druid-s3-extensions`](../development/S3-compatible.md) 扩展以便使用S3输入源。 - -S3输入源支持直接从S3读取对象。可以通过S3 URI字符串列表或S3位置前缀列表指定对象,该列表将尝试列出内容并摄取位置中包含的所有对象。S3输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "s3", - "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -```json -... 
- "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "s3", - "prefixes": ["s3://foo/bar", "s3://bar/foo"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "s3", - "objects": [ - { "bucket": "foo", "path": "bar/file1.json"}, - { "bucket": "bar", "path": "foo/file2.json"} - ] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 应该是 `s3` | None | 是 | -| `uris` | 指定被摄取的S3对象位置的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| -| `prefixes` | 指定被摄取的S3对象所在的路径前缀的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | -| `objects` | 指定被摄取的S3对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| -| `properties` | 指定用来覆盖默认S3配置的对象属性,详情见下边 | None | 否(未指定则使用默认)| - -注意:只有当 `prefixes` 被指定时,S3输入源将略过空的对象。 - -S3对象: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `bucket` | S3 Bucket的名称 | None | 是 | -| `path` | 数据路径 | None | 是 | - -属性对象: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `accessKeyId` | S3输入源访问密钥的 [Password Provider](../operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `secretAccessKey` 被提供的话,则为必须 | -| `secretAccessKey` | S3输入源访问密钥的 [Password Provider](../operations/passwordproviders.md) 或纯文本字符串 | None | 如果 `accessKeyId` 被提供的话,则为必须 | - -**注意**: *如果 `accessKeyId` 和 `secretAccessKey` 未被指定的话, 则将使用默认的 [S3认证](../development/S3-compatible.md#S3认证方式)* - -#### 谷歌云存储输入源 - -> [!WARNING] -> 您需要添加 [`druid-google-extensions`](../configuration/core-ext/google-cloud-storage.md) 扩展以便使用谷歌云存储输入源。 - -谷歌云存储输入源支持直接从谷歌云存储读取对象,可以通过谷歌云存储URI字符串列表指定对象。谷歌云存储输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "google", - "uris": ["gs://foo/bar/file.json", "gs://bar/foo/file2.json"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "google", - "prefixes": ["gs://foo/bar", "gs://bar/foo"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "google", - "objects": [ - { "bucket": "foo", "path": "bar/file1.json"}, - { "bucket": "bar", "path": "foo/file2.json"} - ] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 应该是 `google` | None | 是 | -| `uris` | 指定被摄取的谷歌云存储对象位置的URI JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| -| `prefixes` | 指定被摄取的谷歌云存储对象所在的路径前缀的URI JSON数组。 以被给定的前缀开头的空对象将被略过 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | -| `objects` | 指定被摄取的谷歌云存储对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| - -注意:只有当 `prefixes` 被指定时,谷歌云存储输入源将略过空的对象。 - -谷歌云存储对象: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `bucket` | 谷歌云存储 Bucket的名称 | None | 是 | -| `path` | 数据路径 | None | 是 | - -#### Azure输入源 - -> [!WARNING] -> 您需要添加 [`druid-azure-extensions`](../configuration/core-ext/microsoft-azure.md) 扩展以便使用Azure输入源。 - -Azure输入源支持直接从Azure读取对象,可以通过Azure URI字符串列表指定对象。Azure输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个对象。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "azure", - "uris": ["azure://container/prefix1/file.json", "azure://container/prefix2/file2.json"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... 
- "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "azure", - "prefixes": ["azure://container/prefix1", "azure://container/prefix2"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "azure", - "objects": [ - { "bucket": "container", "path": "prefix1/file1.json"}, - { "bucket": "container", "path": "prefix2/file2.json"} - ] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 应该是 `azure` | None | 是 | -| `uris` | 指定被摄取的azure对象位置的URI JSON数组, 格式必须为 `azure:///` | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| -| `prefixes` | 指定被摄取的azure对象所在的路径前缀的URI JSON数组, 格式必须为 `azure:///`, 以被给定的前缀开头的空对象将被略过 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。 | -| `objects` | 指定被摄取的azure对象的JSON数组 | None | `uris` 或者 `prefixes` 或者 `objects` 必须被设置。| - -注意:只有当 `prefixes` 被指定时,azure输入源将略过空的对象。 - -azure对象: - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `bucket` | azure Bucket的名称 | None | 是 | -| `path` | 数据路径 | None | 是 | - -#### HDFS输入源 - -> [!WARNING] -> 您需要添加 [`druid-hdfs-extensions`](../configuration/core-ext/hdfs.md) 扩展以便使用HDFS输入源。 - -HDFS输入源支持直接从HDFS存储中读取文件,文件路径可以指定为HDFS URI字符串或者HDFS URI字符串列表。HDFS输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务将读取一个或多个文件。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "hdfs", - "paths": "hdfs://foo/bar/", "hdfs://bar/foo" - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "hdfs", - "paths": ["hdfs://foo/bar", "hdfs://bar/foo"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "hdfs", - "paths": "hdfs://foo/bar/file.json", "hdfs://bar/foo/file2.json" - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "hdfs", - "paths": ["hdfs://foo/bar/file.json", "hdfs://bar/foo/file2.json"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 应该总是 `hdfs` | None | 是 | -| `paths` | HDFS路径。可以是JSON数组或逗号分隔的路径字符串,这些路径支持类似*的通配符。给定路径之下的空文件将会被跳过。 | None | 是 | - -您还可以使用HDFS输入源从云存储摄取数据。但是,如果您想从AWS S3或谷歌云存储读取数据,可以考虑使用 [S3输入源](../configuration/core-ext/s3.md) 或 [谷歌云存储输入源](../configuration/core-ext/google-cloud-storage.md)。 - -#### HTTP输入源 - -HTTP输入源支持直接通过HTTP从远程站点直接读取文件。 HTTP输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务只能读取一个文件。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "http", - "uris": ["http://example.com/uri1", "http://example2.com/uri2"] - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -使用DefaultPassword Provider的身份验证字段示例(这要求密码位于摄取规范中): -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "http", - "uris": ["http://example.com/uri1", "http://example2.com/uri2"], - "httpAuthenticationUsername": "username", - "httpAuthenticationPassword": "password123" - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -``` - -您还可以使用其他现有的Druid PasswordProvider。下面是使用EnvironmentVariablePasswordProvider的示例: -```json -... 
- "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "http", - "uris": ["http://example.com/uri1", "http://example2.com/uri2"], - "httpAuthenticationUsername": "username", - "httpAuthenticationPassword": { - "type": "environment", - "variable": "HTTP_INPUT_SOURCE_PW" - } - }, - "inputFormat": { - "type": "json" - }, - ... - }, -... -} -``` - -| 属性 | 描述 | 默认 | 是否必须 | -|-|-|-|-| -| `type` | 应该是 `http` | None | 是 | -| `uris` | 输入文件的uris | None | 是 | -| `httpAuthenticationUsername` | 用于指定uri的身份验证的用户名。如果规范中指定的uri需要基本身份验证头,则改属性是可选的。 | None | 否 | -| `httpAuthenticationPassword` | 用于指定uri的身份验证的密码。如果规范中指定的uri需要基本身份验证头,则改属性是可选的。 | None | 否 | - -#### Inline输入源 - -Inline输入源可用于读取其规范内联的数据。它可用于演示或用于快速测试数据解析和schema。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "inline", - "data": "0,values,formatted\n1,as,CSV" - }, - "inputFormat": { - "type": "csv" - }, - ... - }, -... -``` -| 属性 | 描述 | 是否必须 | -|-|-|-| -| `type` | 应该是 `inline` | 是 | -| `data` | 要摄入的内联数据 | 是 - - -#### Local输入源 - -Local输入源支持直接从本地存储中读取文件,主要目的用于PoC测试。 Local输入源是可拆分的,可以由 [并行任务](#并行任务) 使用,其中 `index_parallel` 的每个worker任务读取一个或者多个文件。 - -样例规范: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "local", - "filter" : "*.csv", - "baseDir": "/data/directory", - "files": ["/bar/foo", "/foo/bar"] - }, - "inputFormat": { - "type": "csv" - }, - ... - }, -... -``` - -| 属性 | 描述 | 是否必须 | -|-|-|-| -| `type` | 应该是 `local` | 是 | -| `filter` | 文件的通配符筛选器, 详细信息 [点击此处](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) 查看 | 如果 `baseDir` 指定了,则为必须 | -| `baseDir` | 递归搜索要接收的文件的目录, 将跳过 `baseDir` 下的空文件。 | `baseDir` 或者 `files` 至少需要被指定一个 | -| `files` | 要摄取的文件路径。如果某些文件位于指定的 `baseDir` 下,则可以忽略它们以避免摄取重复文件。该选项会跳过空文件。| `baseDir` 或者 `files` 至少需要被指定一个 | - -#### Druid输入源 - -Druid输入源支持直接从现有的Druid段读取数据,可能使用新的模式,并更改段的名称、维度、Metrics、Rollup等。Druid输入源是可拆分的,可以由 [并行任务](#并行任务) 使用。这个输入源有一个固定的从Druid段读取的输入格式;当使用这个输入源时,不需要在摄取规范中指定输入格式字段。 - -| 属性 | 描述 | 是否必须 | -|-|-|-| -| `type` | 应该是 `druid` | 是 | -| `dataSource` | 定义要从中获取行的Druid数据源 | 是 | -| `interval` | ISO-8601时间间隔的字符串,它定义了获取数据的时间范围。 | 是 | -| `dimensions` | 包含要从Druid数据源中选择的维度列名称的字符串列表。如果列表为空,则不返回维度。如果为空,则返回所有维度。 | 否 | -| `metrics` | 包含要选择的Metric列名称的字符串列表。如果列表为空,则不返回任何度量。如果为空,则返回所有Metric。 | 否 | -| `filter` | 详情请查看 [filters](../querying/filters.html) 如果指定,则只返回与筛选器匹配的行。 | 否 | - -DruidInputSource规范的最小示例如下所示: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "druid", - "dataSource": "wikipedia", - "interval": "2013-01-01/2013-01-02" - } - ... - }, -... -``` - -上面的规范将从 `wikipedia` 数据源中读取所有现有dimension和metric列,包括 `2013-01-01/2013-01-02` 时间间隔内带有时间戳( `__time` 列)的所有行。 - -以下规范使用了筛选器并读取原始数据源列子集: -```json -... - "ioConfig": { - "type": "index_parallel", - "inputSource": { - "type": "druid", - "dataSource": "wikipedia", - "interval": "2013-01-01/2013-01-02", - "dimensions": [ - "page", - "user" - ], - "metrics": [ - "added" - ], - "filter": { - "type": "selector", - "dimension": "page", - "value": "Druid" - } - } - ... - }, -... 
-``` - -上面的规范只返回 `page`、`user` 维度和 `added` 的Metric。只返回`page` = `Druid` 的行。 - -### Firehoses(已废弃) -#### StaticS3Firehose -#### HDFSFirehose -#### LocalFirehose -#### HttpFirehose -#### IngestSegmentFirehose -#### SqlFirehose -#### InlineFirehose -#### CombiningFirehose \ No newline at end of file diff --git a/ingestion/schema-design.md b/ingestion/schema-design.md new file mode 100644 index 0000000..3bf9199 --- /dev/null +++ b/ingestion/schema-design.md @@ -0,0 +1,438 @@ +--- +id: schema-design +title: "Schema design tips" +--- + + + +## Druid's data model + +For general information, check out the documentation on [Druid's data model](index.md#data-model) on the main +ingestion overview page. The rest of this page discusses tips for users coming from other kinds of systems, as well as +general tips and common practices. + +* Druid data is stored in [datasources](index.md#datasources), which are similar to tables in a traditional RDBMS. +* Druid datasources can be ingested with or without [rollup](#rollup). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation. +* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on. +* All columns in Druid datasources, other than the timestamp column, are either dimensions or metrics. This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems) of OLAP data. +* Typical production datasources have tens to hundreds of columns. +* [Dimension columns](index.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats. +* [Metric columns](index.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled. + + +## If you're coming from a... + +### Relational model + +(Like Hive or PostgreSQL.) + +Druid datasources are generally equivalent to tables in a relational database. Druid [lookups](../querying/lookups.md) +can act similarly to data-warehouse-style dimension tables, but as you'll see below, denormalization is often +recommended if you can get away with it. + +Common practice for relational data modeling involves [normalization](https://en.wikipedia.org/wiki/Database_normalization): +the idea of splitting up data into multiple tables such that data redundancy is reduced or eliminated. For example, in a +"sales" table, best-practices relational modeling calls for a "product id" column that is a foreign key into a separate +"products" table, which in turn has "product id", "product name", and "product category" columns. 
This prevents the +product name and category from needing to be repeated on different rows in the "sales" table that refer to the same +product. + +In Druid, on the other hand, it is common to use totally flat datasources that do not require joins at query time. In +the example of the "sales" table, in Druid it would be typical to store "product_id", "product_name", and +"product_category" as dimensions directly in a Druid "sales" datasource, without using a separate "products" table. +Totally flat schemas substantially increase performance, since the need for joins is eliminated at query time. As an +an added speed boost, this also allows Druid's query layer to operate directly on compressed dictionary-encoded data. +Perhaps counter-intuitively, this does _not_ substantially increase storage footprint relative to normalized schemas, +since Druid uses dictionary encoding to effectively store just a single integer per row for string columns. + +If necessary, Druid datasources can be partially normalized through the use of [lookups](../querying/lookups.md), +which are the rough equivalent of dimension tables in a relational database. At query time, you would use Druid's SQL +`LOOKUP` function, or native lookup extraction functions, instead of using the JOIN keyword like you would in a +relational database. Since lookup tables impose an increase in memory footprint and incur more computational overhead +at query time, it is only recommended to do this if you need the ability to update a lookup table and have the changes +reflected immediately for already-ingested rows in your main table. + +Tips for modeling relational data in Druid: + +- Druid datasources do not have primary or unique keys, so skip those. +- Denormalize if possible. If you need to be able to update dimension / lookup tables periodically and have those +changes reflected in already-ingested data, consider partial normalization with [lookups](../querying/lookups.md). +- If you need to join two large distributed tables with each other, you must do this before loading the data into Druid. +Druid does not support query-time joins of two datasources. Lookups do not help here, since a full copy of each lookup +table is stored on each Druid server, so they are not a good choice for large tables. +- Consider whether you want to enable [rollup](#rollup) for pre-aggregation, or whether you want to disable +rollup and load your existing data as-is. Rollup in Druid is similar to creating a summary table in a relational model. + +### Time series model + +(Like OpenTSDB or InfluxDB.) + +Similar to time series databases, Druid's data model requires a timestamp. Druid is not a timeseries database, but +it is a natural choice for storing timeseries data. Its flexible data model allows it to store both timeseries and +non-timeseries data, even in the same datasource. + +To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and +sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details. + +Tips for modeling timeseries data in Druid: + +- Druid does not think of data points as being part of a "time series". Instead, Druid treats each point separately +for ingestion and aggregation. +- Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called +"metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. 
Place this first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality;
see [partitioning and sorting](index.md#partitioning) below for details).
- Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries
database systems.
- Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able
to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to
be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx).
- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one
row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is
naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource.
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
[automatic detection of dimension columns](#schema-less-dimensions).

### Log aggregation model

(Like Elasticsearch or Splunk.)

Similar to log aggregation systems, Druid offers inverted indexes for fast searching and filtering. Druid's search
capabilities are generally less developed than these systems, and its analytical capabilities are generally more
developed. The main data modeling difference between Druid and these systems is that ingesting data into Druid requires
you to be more explicit: Druid columns have types specified upfront, and Druid does not, at this time, natively support
nested data.

Tips for modeling log data in Druid:

- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
[automatic detection of dimension columns](#schema-less-dimensions).
- If you have nested data, flatten it using a [`flattenSpec`](index.md#flattenspec).
- Consider enabling [rollup](#rollup) if you have mainly analytical use cases for your log data. This will
mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
query performance boosts.

## General tips and best practices

### Rollup

Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This is a form
of summarization or pre-aggregation. For more details, see the [Rollup](index.md#rollup) section of the ingestion
documentation.

### Partitioning and sorting

Optimally partitioning and sorting your data can have substantial impact on footprint and performance. For more details,
see the [Partitioning](index.md#partitioning) section of the ingestion documentation.

### Sketches for high cardinality columns

When dealing with high cardinality columns like user IDs or other unique IDs, consider using sketches for approximate
analysis rather than operating on the actual values. When you ingest data using a sketch, Druid does not store the
original raw data, but instead stores a "sketch" of it that it can feed into a later computation at query time. Popular
use cases for sketches include count-distinct and quantile computation. Each sketch is designed for just one particular
kind of computation.
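As a purely illustrative example — assuming the `druid-datasketches` extension is loaded and a hypothetical high-cardinality
`user_id` input column — a count-distinct sketch could be declared at ingestion time in the "metricsSpec":

```
"metricsSpec" : [
  { "type" : "count", "name" : "count" },
  { "type" : "thetaSketch", "name" : "unique_users", "fieldName" : "user_id" }
]
```

Because `user_id` feeds the sketch instead of being stored as a dimension, rows that differ only in `user_id` can still
roll up into a single row.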
+ +In general using sketches serves two main purposes: improving rollup, and reducing memory footprint at +query time. + +Sketches improve rollup ratios because they allow you to collapse multiple distinct values into the same sketch. For +example, if you have two rows that are identical except for a user ID (perhaps two users did the same action at the +same time), storing them in a count-distinct sketch instead of as-is means you can store the data in one row instead of +two. You won't be able to retrieve the user IDs or compute exact distinct counts, but you'll still be able to compute +approximate distinct counts, and you'll reduce your storage footprint. + +Sketches reduce memory footprint at query time because they limit the amount of data that needs to be shuffled between +servers. For example, in a quantile computation, instead of needing to send all data points to a central location +so they can be sorted and the quantile can be computed, Druid instead only needs to send a sketch of the points. This +can reduce data transfer needs to mere kilobytes. + +For details about the sketches available in Druid, see the +[approximate aggregators](../querying/aggregations.md#approx) page. + +If you prefer videos, take a look at [Not exactly!](https://www.youtube.com/watch?v=Hpd3f_MLdXo), a conference talk +about sketches in Druid. + +### String vs numeric dimensions + +If the user wishes to ingest a column as a numeric-typed dimension (Long, Double or Float), it is necessary to specify the type of the column in the `dimensions` section of the `dimensionsSpec`. If the type is omitted, Druid will ingest a column as the default String type. + +There are performance tradeoffs between string and numeric columns. Numeric columns are generally faster to group on +than string columns. But unlike string columns, numeric columns don't have indexes, so they can be slower to filter on. +You may want to experiment to find the optimal choice for your use case. + +For details about how to configure numeric dimensions, see the [`dimensionsSpec`](index.md#dimensionsspec) documentation. + + + +### Secondary timestamps + +Druid schemas must always include a primary timestamp. The primary timestamp is used for +[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on. +Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. + +If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this +is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in milliseconds format. +If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and +[expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps. + +At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions) +like `MILLIS_TO_TIMESTAMP`, `TIME_FLOOR`, and others. If you're using native Druid queries, you can use +[expressions](../misc/math-expr.md). + +### Nested dimensions + +At the time of this writing, Druid does not support nested dimensions. Nested dimensions need to be flattened. For example, +if you have data of the following form: + +``` +{"foo":{"bar": 3}} +``` + +then before indexing it, you should transform it to: + +``` +{"foo_bar": 3} +``` + +Druid is capable of flattening JSON, Avro, or Parquet input data. 
+Please read about [`flattenSpec`](index.md#flattenspec) for more details. + + + +### Counting the number of ingested events + +When rollup is enabled, count aggregators at query time do not actually tell you the number of rows that have been +ingested. They tell you the number of rows in the Druid datasource, which may be smaller than the number of rows +ingested. + +In this case, a count aggregator at _ingestion_ time can be used to count the number of events. However, it is important to note +that when you query for this metric, you should use a `longSum` aggregator. A `count` aggregator at query time will return +the number of Druid rows for the time interval, which can be used to determine what the roll-up ratio was. + +To clarify with an example, if your ingestion spec contains: + +``` +... +"metricsSpec" : [ + { + "type" : "count", + "name" : "count" + }, +... +``` + +You should query for the number of ingested rows with: + +``` +... +"aggregations": [ + { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }, +... +``` + + + +### Schema-less dimensions + +If the `dimensions` field is left empty in your ingestion spec, Druid will treat every column that is not the timestamp column, +a dimension that has been excluded, or a metric column as a dimension. + +Note that when using schema-less ingestion, all dimensions will be ingested as String-typed dimensions. + +### Including the same column as a dimension and a metric + +One workflow with unique IDs is to be able to filter on a particular ID, while still being able to do fast unique counts on the ID column. +If you are not using schema-less dimensions, this use case is supported by setting the `name` of the metric to something different than the dimension. +If you are using schema-less dimensions, the best practice here is to include the same column twice, once as a dimension, and as a `hyperUnique` metric. This may involve +some work at ETL time. + +As an example, for schema-less dimensions, repeat the same column: + +``` +{"device_id_dim":123, "device_id_met":123} +``` + +and in your `metricsSpec`, include: + +``` +{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" } +``` + +`device_id_dim` should automatically get picked up as a dimension. 
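If you are declaring dimensions explicitly rather than using schema-less dimensions, a minimal sketch of the same pattern —
with a hypothetical `device_id` input column listed once as a dimension and referenced again by a metric with a different
`name` — looks like:

```
"dimensionsSpec" : {
  "dimensions" : [ "device_id" ]
},
"metricsSpec" : [
  { "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id" }
]
```

This lets you filter and group on `device_id` while the `devices` metric provides fast approximate distinct counts.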
+ + + + +## Schema设计 +### Druid数据模型 + +有关一般信息,请查看摄取概述页面上有关 [Druid数据模型](ingestion.md#Druid数据模型) 的文档。本页的其余部分将讨论来自其他类型系统的用户的提示,以及一般提示和常见做法。 + +* Druid数据存储在 [数据源](ingestion.md#数据源) 中,与传统RDBMS中的表类似。 +* Druid数据源可以在摄取过程中使用或不使用 [rollup](ingestion.md#rollup) 。启用rollup后,Druid会在接收期间部分聚合您的数据,这可能会减少其行数,减少存储空间,并提高查询性能。禁用rollup后,Druid为输入数据中的每一行存储一行,而不进行任何预聚合。 +* Druid的每一行都必须有时间戳。数据总是按时间进行分区,每个查询都有一个时间过滤器。查询结果也可以按时间段(如分钟、小时、天等)进行细分。 +* 除了timestamp列之外,Druid数据源中的所有列都是dimensions或metrics。这遵循 [OLAP数据的标准命名约定](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)。 +* 典型的生产数据源有几十到几百列。 +* [dimension列](ingestion.md#维度) 按原样存储,因此可以在查询时对其进行筛选、分组或聚合。它们总是单个字符串、字符串数组、单个long、单个double或单个float。 +* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/Aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/Aggregations.md)。即使禁用了rollup,也可以在接收时配置metrics,但在启用汇总时最有用。 + +### 与其他设计模式类比 +#### 关系模型 +(如 Hive 或者 PostgreSQL) + +Druid数据源通常相当于关系数据库中的表。Druid的 [lookups特性](../querying/lookups.md) 可以类似于数据仓库样式的维度表,但是正如您将在下面看到的,如果您能够摆脱它,通常建议您进行非规范化。 + +关系数据建模的常见实践涉及 [规范化](https://en.wikipedia.org/wiki/Database_normalization) 的思想:将数据拆分为多个表,从而减少或消除数据冗余。例如,在"sales"表中,最佳实践关系建模要求将"product id"列作为外键放入单独的"products"表中,该表依次具有"product id"、"product name"和"product category"列, 这可以防止产品名称和类别需要在"sales"表中引用同一产品的不同行上重复。 + +另一方面,在Druid中,通常使用在查询时不需要连接的完全平坦的数据源。在"sales"表的例子中,在Druid中,通常直接将"product_id"、"product_name"和"product_category"作为维度存储在Druid "sales"数据源中,而不使用单独的"products"表。完全平坦的模式大大提高了性能,因为查询时不需要连接。作为一个额外的速度提升,这也允许Druid的查询层直接操作压缩字典编码的数据。因为Druid使用字典编码来有效地为字符串列每行存储一个整数, 所以可能与直觉相反,这并*没有*显著增加相对于规范化模式的存储空间。 + +如果需要的话,可以通过使用 [lookups](../querying/lookups.md) 规范化Druid数据源,这大致相当于关系数据库中的维度表。在查询时,您将使用Druid的SQL `LOOKUP` 查找函数或者原生 `lookup` 提取函数,而不是像在关系数据库中那样使用JOIN关键字。由于lookup表会增加内存占用并在查询时产生更多的计算开销,因此仅当需要更新lookup表并立即反映主表中已摄取行的更改时,才建议执行此操作。 + +在Druid中建模关系数据的技巧: +* Druid数据源没有主键或唯一键,所以跳过这些。 +* 如果可能的话,去规格化。如果需要定期更新dimensions/lookup并将这些更改反映在已接收的数据中,请考虑使用 [lookups](../querying/lookups.md) 进行部分规范化。 +* 如果需要将两个大型的分布式表连接起来,则必须在将数据加载到Druid之前执行此操作。Druid不支持两个数据源的查询时间连接。lookup在这里没有帮助,因为每个lookup表的完整副本存储在每个Druid服务器上,所以对于大型表来说,它们不是一个好的选择。 +* 考虑是否要为预聚合启用[rollup](ingestion.md#rollup),或者是否要禁用rollup并按原样加载现有数据。Druid中的Rollup类似于在关系模型中创建摘要表。 + +#### 时序模型 +(如 OpenTSDB 或者 InfluxDB) + +与时间序列数据库类似,Druid的数据模型需要时间戳。Druid不是时序数据库,但它同时也是存储时序数据的自然选择。它灵活的数据模型允许它同时存储时序和非时序数据,甚至在同一个数据源中。 + +为了在Druid中实现时序数据的最佳压缩和查询性能,像时序数据库经常做的一样,按照metric名称进行分区和排序很重要。有关详细信息,请参见 [分区和排序](ingestion.md#分区)。 + +在Druid中建模时序数据的技巧: +* Druid并不认为数据点是"时间序列"的一部分。相反,Druid对每一点分别进行摄取和聚合 +* 创建一个维度,该维度指示数据点所属系列的名称。这个维度通常被称为"metric"或"name"。不要将名为"metric"的维度与Druid Metrics的概念混淆。将它放在"dimensionsSpec"中维度列表的第一个位置,以获得最佳性能(这有助于提高局部性;有关详细信息,请参阅下面的 [分区和排序](ingestion.md#分区)) +* 为附着到数据点的属性创建其他维度。在时序数据库系统中,这些通常称为"标签" +* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"(在long、float或double中的一种)。如果你想计算百分位数或分位数,可以使用Druid的 [近似聚合器](../querying/Aggregations.md) +* 考虑启用 [rollup](ingestion.md#rollup),这将允许Druid潜在地将多个点合并到Druid数据源中的一行中。如果希望以不同于原始发出的时间粒度存储数据,则这可能非常有用。如果要在同一个数据源中组合时序和非时序数据,它也很有用 +* 如果您提前不知道要摄取哪些列,请使用空的维度列表来触发 [维度列的自动检测](#无schema的维度列) + +#### 日志聚合模型 +(如 ElasticSearch 或者 Splunk) + +与日志聚合系统类似,Druid提供反向索引,用于快速搜索和筛选。Druid的搜索能力通常不如这些系统发达,其分析能力通常更为发达。Druid和这些系统之间的主要数据建模差异在于,在将数据摄取到Druid中时,必须更加明确。Druid列具有特定的类型,而Druid目前不支持嵌套数据。 + +在Druid中建模日志数据的技巧: +* 如果您提前不知道要摄取哪些列,请使用空维度列表来触发 [维度列的自动检测](#无schema的维度列) +* 如果有嵌套数据,请使用 [展平规范](ingestion.md#flattenspec) 将其扁平化 +* 如果您主要有日志数据的分析场景,请考虑启用 [rollup](ingestion.md#rollup),这意味着您将失去从Druid中检索单个事件的能力,但您可能获得大量的压缩和查询性能提升 + +### 
一般提示以及最佳实践 +#### Rollup + +Druid可以在接收数据时将其汇总,以最小化需要存储的原始数据量。这是一种汇总或预聚合的形式。有关更多详细信息,请参阅摄取文档的 [汇总部分](ingestion.md#rollup)。 + +#### 分区与排序 + +对数据进行最佳分区和排序会对占用空间和性能产生重大影响。有关更多详细信息,请参阅摄取文档的 [分区部分](ingestion.md#分区)。 + +#### Sketches高基维处理 + +在处理高基数列(如用户ID或其他唯一ID)时,请考虑使用草图(sketches)进行近似分析,而不是对实际值进行操作。当您使用草图(sketches)摄取数据时,Druid不存储原始原始数据,而是存储它的"草图(sketches)",它可以在查询时输入到以后的计算中。草图(sketches)的常用场景包括 `count-distinct` 和分位数计算。每个草图都是为一种特定的计算而设计的。 + +一般来说,使用草图(sketches)有两个主要目的:改进rollup和减少查询时的内存占用。 + +草图(sketches)可以提高rollup比率,因为它们允许您将多个不同的值折叠到同一个草图(sketches)中。例如,如果有两行除了用户ID之外都是相同的(可能两个用户同时执行了相同的操作),则将它们存储在 `count-distinct sketch` 中而不是按原样,这意味着您可以将数据存储在一行而不是两行中。您将无法检索用户id或计算精确的非重复计数,但您仍将能够计算近似的非重复计数,并且您将减少存储空间。 + +草图(sketches)减少了查询时的内存占用,因为它们限制了需要在服务器之间洗牌的数据量。例如,在分位数计算中,Druid不需要将所有数据点发送到中心位置,以便对它们进行排序和计算分位数,而只需要发送点的草图。这可以将数据传输需要减少到仅千字节。 + +有关Druid中可用的草图的详细信息,请参阅 [近似聚合器页面](../querying/Aggregations.md)。 + +如果你更喜欢 [视频](https://www.youtube.com/watch?v=Hpd3f_MLdXo),那就看一看吧!,一个讨论Druid Sketches的会议。 + +#### 字符串 VS 数值维度 + +如果用户希望将列摄取为数值类型的维度(Long、Double或Float),则需要在 `dimensionsSpec` 的 `dimensions` 部分中指定列的类型。如果省略了该类型,Druid会将列作为默认的字符串类型。 + +字符串列和数值列之间存在性能折衷。数值列通常比字符串列更快分组。但与字符串列不同,数值列没有索引,因此可以更慢地进行筛选。您可能想尝试为您的用例找到最佳选择。 + +有关如何配置数值维度的详细信息,请参阅 [`dimensionsSpec`文档](ingestion.md#dimensionsSpec) + +#### 辅助时间戳 + +Druid schema必须始终包含一个主时间戳, 主时间戳用于对数据进行 [分区和排序](ingestion.md#分区),因此它应该是您最常筛选的时间戳。Druid能够快速识别和检索与主时间戳列的时间范围相对应的数据。 + +如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。 + +在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../misc/expression.md)。 + +#### 嵌套维度 + +在编写本文时,Druid不支持嵌套维度。嵌套维度需要展平,例如,如果您有以下数据: +```json +{"foo":{"bar": 3}} +``` + +然后在编制索引之前,应将其转换为: +```json +{"foo_bar": 3} +``` + +Druid能够将JSON、Avro或Parquet输入数据展平化。请阅读 [展平规格](ingestion.md#flattenspec) 了解更多细节。 + +#### 计数接收事件数 + +启用rollup后,查询时的计数聚合器(count aggregator)实际上不会告诉您已摄取的行数。它们告诉您Druid数据源中的行数,可能小于接收的行数。 + +在这种情况下,可以使用*摄取时*的计数聚合器来计算事件数。但是,需要注意的是,在查询此Metrics时,应该使用 `longSum` 聚合器。查询时的 `count` 聚合器将返回时间间隔的Druid行数,该行数可用于确定rollup比率。 + +为了举例说明,如果摄取规范包含: +```json +... +"metricsSpec" : [ + { + "type" : "count", + "name" : "count" + }, +... +``` + +您应该使用查询: +```json +... +"aggregations": [ + { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }, +... 
+``` + +#### 无schema的维度列 + +如果摄取规范中的 `dimensions` 字段为空,Druid将把不是timestamp列、已排除的维度和metric列之外的每一列都视为维度。 + +注意,当使用无schema摄取时,所有维度都将被摄取为字符串类型的维度。 + +##### 包含与Dimension和Metric相同的列 + +一个具有唯一ID的工作流能够对特定ID进行过滤,同时仍然能够对ID列进行快速的唯一计数。如果不使用无schema维度,则通过将Metric的 `name` 设置为与维度不同的值来支持此场景。如果使用无schema维度,这里的最佳实践是将同一列包含两次,一次作为维度,一次作为 `hyperUnique` Metric。这可能涉及到ETL时的一些工作。 + +例如,对于无schema维度,请重复同一列: +```json +{"device_id_dim":123, "device_id_met":123} +``` +同时在 `metricsSpec` 中包含: +```json +{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" } +``` +`device_id_dim` 将自动作为维度来被选取 \ No newline at end of file diff --git a/ingestion/schemadesign.md b/ingestion/schemadesign.md deleted file mode 100644 index b534d35..0000000 --- a/ingestion/schemadesign.md +++ /dev/null @@ -1,155 +0,0 @@ - - -## Schema设计 -### Druid数据模型 - -有关一般信息,请查看摄取概述页面上有关 [Druid数据模型](ingestion.md#Druid数据模型) 的文档。本页的其余部分将讨论来自其他类型系统的用户的提示,以及一般提示和常见做法。 - -* Druid数据存储在 [数据源](ingestion.md#数据源) 中,与传统RDBMS中的表类似。 -* Druid数据源可以在摄取过程中使用或不使用 [rollup](ingestion.md#rollup) 。启用rollup后,Druid会在接收期间部分聚合您的数据,这可能会减少其行数,减少存储空间,并提高查询性能。禁用rollup后,Druid为输入数据中的每一行存储一行,而不进行任何预聚合。 -* Druid的每一行都必须有时间戳。数据总是按时间进行分区,每个查询都有一个时间过滤器。查询结果也可以按时间段(如分钟、小时、天等)进行细分。 -* 除了timestamp列之外,Druid数据源中的所有列都是dimensions或metrics。这遵循 [OLAP数据的标准命名约定](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems)。 -* 典型的生产数据源有几十到几百列。 -* [dimension列](ingestion.md#维度) 按原样存储,因此可以在查询时对其进行筛选、分组或聚合。它们总是单个字符串、字符串数组、单个long、单个double或单个float。 -* [Metrics列](ingestion.md#指标) 是 [预聚合](../querying/Aggregations.md) 存储的,因此它们只能在查询时聚合(不能按筛选或分组)。它们通常存储为数字(整数或浮点数),但也可以存储为复杂对象,如[HyperLogLog草图或近似分位数草图](../querying/Aggregations.md)。即使禁用了rollup,也可以在接收时配置metrics,但在启用汇总时最有用。 - -### 与其他设计模式类比 -#### 关系模型 -(如 Hive 或者 PostgreSQL) - -Druid数据源通常相当于关系数据库中的表。Druid的 [lookups特性](../querying/lookups.md) 可以类似于数据仓库样式的维度表,但是正如您将在下面看到的,如果您能够摆脱它,通常建议您进行非规范化。 - -关系数据建模的常见实践涉及 [规范化](https://en.wikipedia.org/wiki/Database_normalization) 的思想:将数据拆分为多个表,从而减少或消除数据冗余。例如,在"sales"表中,最佳实践关系建模要求将"product id"列作为外键放入单独的"products"表中,该表依次具有"product id"、"product name"和"product category"列, 这可以防止产品名称和类别需要在"sales"表中引用同一产品的不同行上重复。 - -另一方面,在Druid中,通常使用在查询时不需要连接的完全平坦的数据源。在"sales"表的例子中,在Druid中,通常直接将"product_id"、"product_name"和"product_category"作为维度存储在Druid "sales"数据源中,而不使用单独的"products"表。完全平坦的模式大大提高了性能,因为查询时不需要连接。作为一个额外的速度提升,这也允许Druid的查询层直接操作压缩字典编码的数据。因为Druid使用字典编码来有效地为字符串列每行存储一个整数, 所以可能与直觉相反,这并*没有*显著增加相对于规范化模式的存储空间。 - -如果需要的话,可以通过使用 [lookups](../querying/lookups.md) 规范化Druid数据源,这大致相当于关系数据库中的维度表。在查询时,您将使用Druid的SQL `LOOKUP` 查找函数或者原生 `lookup` 提取函数,而不是像在关系数据库中那样使用JOIN关键字。由于lookup表会增加内存占用并在查询时产生更多的计算开销,因此仅当需要更新lookup表并立即反映主表中已摄取行的更改时,才建议执行此操作。 - -在Druid中建模关系数据的技巧: -* Druid数据源没有主键或唯一键,所以跳过这些。 -* 如果可能的话,去规格化。如果需要定期更新dimensions/lookup并将这些更改反映在已接收的数据中,请考虑使用 [lookups](../querying/lookups.md) 进行部分规范化。 -* 如果需要将两个大型的分布式表连接起来,则必须在将数据加载到Druid之前执行此操作。Druid不支持两个数据源的查询时间连接。lookup在这里没有帮助,因为每个lookup表的完整副本存储在每个Druid服务器上,所以对于大型表来说,它们不是一个好的选择。 -* 考虑是否要为预聚合启用[rollup](ingestion.md#rollup),或者是否要禁用rollup并按原样加载现有数据。Druid中的Rollup类似于在关系模型中创建摘要表。 - -#### 时序模型 -(如 OpenTSDB 或者 InfluxDB) - -与时间序列数据库类似,Druid的数据模型需要时间戳。Druid不是时序数据库,但它同时也是存储时序数据的自然选择。它灵活的数据模型允许它同时存储时序和非时序数据,甚至在同一个数据源中。 - -为了在Druid中实现时序数据的最佳压缩和查询性能,像时序数据库经常做的一样,按照metric名称进行分区和排序很重要。有关详细信息,请参见 [分区和排序](ingestion.md#分区)。 - -在Druid中建模时序数据的技巧: -* Druid并不认为数据点是"时间序列"的一部分。相反,Druid对每一点分别进行摄取和聚合 -* 创建一个维度,该维度指示数据点所属系列的名称。这个维度通常被称为"metric"或"name"。不要将名为"metric"的维度与Druid Metrics的概念混淆。将它放在"dimensionsSpec"中维度列表的第一个位置,以获得最佳性能(这有助于提高局部性;有关详细信息,请参阅下面的 [分区和排序](ingestion.md#分区)) -* 
为附着到数据点的属性创建其他维度。在时序数据库系统中,这些通常称为"标签" -* 创建与您希望能够查询的聚合类型相对应的 [Druid Metrics](ingestion.md#指标)。通常这包括"sum"、"min"和"max"(在long、float或double中的一种)。如果你想计算百分位数或分位数,可以使用Druid的 [近似聚合器](../querying/Aggregations.md) -* 考虑启用 [rollup](ingestion.md#rollup),这将允许Druid潜在地将多个点合并到Druid数据源中的一行中。如果希望以不同于原始发出的时间粒度存储数据,则这可能非常有用。如果要在同一个数据源中组合时序和非时序数据,它也很有用 -* 如果您提前不知道要摄取哪些列,请使用空的维度列表来触发 [维度列的自动检测](#无schema的维度列) - -#### 日志聚合模型 -(如 ElasticSearch 或者 Splunk) - -与日志聚合系统类似,Druid提供反向索引,用于快速搜索和筛选。Druid的搜索能力通常不如这些系统发达,其分析能力通常更为发达。Druid和这些系统之间的主要数据建模差异在于,在将数据摄取到Druid中时,必须更加明确。Druid列具有特定的类型,而Druid目前不支持嵌套数据。 - -在Druid中建模日志数据的技巧: -* 如果您提前不知道要摄取哪些列,请使用空维度列表来触发 [维度列的自动检测](#无schema的维度列) -* 如果有嵌套数据,请使用 [展平规范](ingestion.md#flattenspec) 将其扁平化 -* 如果您主要有日志数据的分析场景,请考虑启用 [rollup](ingestion.md#rollup),这意味着您将失去从Druid中检索单个事件的能力,但您可能获得大量的压缩和查询性能提升 - -### 一般提示以及最佳实践 -#### Rollup - -Druid可以在接收数据时将其汇总,以最小化需要存储的原始数据量。这是一种汇总或预聚合的形式。有关更多详细信息,请参阅摄取文档的 [汇总部分](ingestion.md#rollup)。 - -#### 分区与排序 - -对数据进行最佳分区和排序会对占用空间和性能产生重大影响。有关更多详细信息,请参阅摄取文档的 [分区部分](ingestion.md#分区)。 - -#### Sketches高基维处理 - -在处理高基数列(如用户ID或其他唯一ID)时,请考虑使用草图(sketches)进行近似分析,而不是对实际值进行操作。当您使用草图(sketches)摄取数据时,Druid不存储原始原始数据,而是存储它的"草图(sketches)",它可以在查询时输入到以后的计算中。草图(sketches)的常用场景包括 `count-distinct` 和分位数计算。每个草图都是为一种特定的计算而设计的。 - -一般来说,使用草图(sketches)有两个主要目的:改进rollup和减少查询时的内存占用。 - -草图(sketches)可以提高rollup比率,因为它们允许您将多个不同的值折叠到同一个草图(sketches)中。例如,如果有两行除了用户ID之外都是相同的(可能两个用户同时执行了相同的操作),则将它们存储在 `count-distinct sketch` 中而不是按原样,这意味着您可以将数据存储在一行而不是两行中。您将无法检索用户id或计算精确的非重复计数,但您仍将能够计算近似的非重复计数,并且您将减少存储空间。 - -草图(sketches)减少了查询时的内存占用,因为它们限制了需要在服务器之间洗牌的数据量。例如,在分位数计算中,Druid不需要将所有数据点发送到中心位置,以便对它们进行排序和计算分位数,而只需要发送点的草图。这可以将数据传输需要减少到仅千字节。 - -有关Druid中可用的草图的详细信息,请参阅 [近似聚合器页面](../querying/Aggregations.md)。 - -如果你更喜欢 [视频](https://www.youtube.com/watch?v=Hpd3f_MLdXo),那就看一看吧!,一个讨论Druid Sketches的会议。 - -#### 字符串 VS 数值维度 - -如果用户希望将列摄取为数值类型的维度(Long、Double或Float),则需要在 `dimensionsSpec` 的 `dimensions` 部分中指定列的类型。如果省略了该类型,Druid会将列作为默认的字符串类型。 - -字符串列和数值列之间存在性能折衷。数值列通常比字符串列更快分组。但与字符串列不同,数值列没有索引,因此可以更慢地进行筛选。您可能想尝试为您的用例找到最佳选择。 - -有关如何配置数值维度的详细信息,请参阅 [`dimensionsSpec`文档](ingestion.md#dimensionsSpec) - -#### 辅助时间戳 - -Druid schema必须始终包含一个主时间戳, 主时间戳用于对数据进行 [分区和排序](ingestion.md#分区),因此它应该是您最常筛选的时间戳。Druid能够快速识别和检索与主时间戳列的时间范围相对应的数据。 - -如果数据有多个时间戳,则可以将其他时间戳作为辅助时间戳摄取。最好的方法是将它们作为 [毫秒格式的Long类型维度](ingestion.md#dimensionsspec) 摄取。如有必要,可以使用 [`transformSpec`](ingestion.md#transformspec) 和 `timestamp_parse` 等 [表达式](../misc/expression.md) 将它们转换成这种格式,后者返回毫秒时间戳。 - -在查询时,可以使用诸如 `MILLIS_TO_TIMESTAMP`、`TIME_FLOOR` 等 [SQL时间函数](../querying/druidsql.md) 查询辅助时间戳。如果您使用的是原生Druid查询,那么可以使用 [表达式](../misc/expression.md)。 - -#### 嵌套维度 - -在编写本文时,Druid不支持嵌套维度。嵌套维度需要展平,例如,如果您有以下数据: -```json -{"foo":{"bar": 3}} -``` - -然后在编制索引之前,应将其转换为: -```json -{"foo_bar": 3} -``` - -Druid能够将JSON、Avro或Parquet输入数据展平化。请阅读 [展平规格](ingestion.md#flattenspec) 了解更多细节。 - -#### 计数接收事件数 - -启用rollup后,查询时的计数聚合器(count aggregator)实际上不会告诉您已摄取的行数。它们告诉您Druid数据源中的行数,可能小于接收的行数。 - -在这种情况下,可以使用*摄取时*的计数聚合器来计算事件数。但是,需要注意的是,在查询此Metrics时,应该使用 `longSum` 聚合器。查询时的 `count` 聚合器将返回时间间隔的Druid行数,该行数可用于确定rollup比率。 - -为了举例说明,如果摄取规范包含: -```json -... -"metricsSpec" : [ - { - "type" : "count", - "name" : "count" - }, -... -``` - -您应该使用查询: -```json -... -"aggregations": [ - { "type": "longSum", "name": "numIngestedEvents", "fieldName": "count" }, -... 
-``` - -#### 无schema的维度列 - -如果摄取规范中的 `dimensions` 字段为空,Druid将把不是timestamp列、已排除的维度和metric列之外的每一列都视为维度。 - -注意,当使用无schema摄取时,所有维度都将被摄取为字符串类型的维度。 - -##### 包含与Dimension和Metric相同的列 - -一个具有唯一ID的工作流能够对特定ID进行过滤,同时仍然能够对ID列进行快速的唯一计数。如果不使用无schema维度,则通过将Metric的 `name` 设置为与维度不同的值来支持此场景。如果使用无schema维度,这里的最佳实践是将同一列包含两次,一次作为维度,一次作为 `hyperUnique` Metric。这可能涉及到ETL时的一些工作。 - -例如,对于无schema维度,请重复同一列: -```json -{"device_id_dim":123, "device_id_met":123} -``` -同时在 `metricsSpec` 中包含: -```json -{ "type" : "hyperUnique", "name" : "devices", "fieldName" : "device_id_met" } -``` -`device_id_dim` 将自动作为维度来被选取 \ No newline at end of file diff --git a/ingestion/standalone-realtime.md b/ingestion/standalone-realtime.md new file mode 100644 index 0000000..7a3a9e0 --- /dev/null +++ b/ingestion/standalone-realtime.md @@ -0,0 +1,45 @@ +--- +id: standalone-realtime +layout: doc_page +title: "Realtime Process" +--- + + + +Older versions of Apache Druid supported a standalone 'Realtime' process to query and index 'stream pull' +modes of real-time ingestion. These processes would periodically build segments for the data they had collected over +some span of time and then set up hand-off to [Historical](../design/historical.md) servers. + +This processes could be invoked by + +``` +org.apache.druid.cli.Main server realtime +``` + +This model of stream pull ingestion was deprecated for a number of both operational and architectural reasons, and +removed completely in Druid 0.16.0. Operationally, realtime nodes were difficult to configure, deploy, and scale because +each node required an unique configuration. The design of the stream pull ingestion system for realtime nodes also +suffered from limitations which made it not possible to achieve exactly once ingestion. + +The extensions `druid-kafka-eight`, `druid-kafka-eight-simpleConsumer`, `druid-rabbitmq`, and `druid-rocketmq` were also +removed at this time, since they were built to operate on the realtime nodes. + +Please consider using the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or +[Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md) for stream pull ingestion instead. 
diff --git a/ingestion/taskrefer.md b/ingestion/taskrefer.md deleted file mode 100644 index 31833f2..0000000 --- a/ingestion/taskrefer.md +++ /dev/null @@ -1,373 +0,0 @@ - - -## 任务参考文档 - -任务在Druid中完成所有与 [摄取](ingestion.md) 相关的工作。 - -对于批量摄取,通常使用 [任务api](../operations/api.md#Overlord) 直接将任务提交给Druid。对于流式接收,任务通常被提交给supervisor。 - -### 任务API - -任务API主要在两个地方是可用的: - -* [Overlord](../design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../operations/api.md) 可以看到完整列表 -* Druid SQL包括了一个 [`sys.tasks`](../querying/druidsql.md#系统Schema) ,保存了当前任务运行的信息。 此表是只读的,并且可以通过Overlord API查询完整信息的有限制的子集。 - -### 任务报告 - -报告包含已完成的任务和正在运行的任务中有关接收的行数和发生的任何分析异常的信息的报表。 - -报告功能支持 [简单的本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和Kinesis摄取任务支持报告功能。 - -#### 任务结束报告 - -任务运行完成后,一个完整的报告可以在以下接口获取: - -```json -http://:/druid/indexer/v1/task//reports -``` - -一个示例输出如下: - -```json -{ - "ingestionStatsAndErrors": { - "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", - "payload": { - "ingestionState": "COMPLETED", - "unparseableEvents": {}, - "rowStats": { - "determinePartitions": { - "processed": 0, - "processedWithError": 0, - "thrownAway": 0, - "unparseable": 0 - }, - "buildSegments": { - "processed": 5390324, - "processedWithError": 0, - "thrownAway": 0, - "unparseable": 0 - } - }, - "errorMsg": null - }, - "type": "ingestionStatsAndErrors" - } -} -``` - -#### 任务运行报告 - -当一个任务正在运行时, 任务运行报告可以通过以下接口获得,包括摄取状态、未解析事件和过去1分钟、5分钟、15分钟内处理的平均事件数。 - -```json -http://:/druid/indexer/v1/task//reports -``` -和 -```json -http://:/druid/worker/v1/chat//liveReports -``` - -一个示例输出如下: - -```json -{ - "ingestionStatsAndErrors": { - "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", - "payload": { - "ingestionState": "RUNNING", - "unparseableEvents": {}, - "rowStats": { - "movingAverages": { - "buildSegments": { - "5m": { - "processed": 3.392158326408501, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - }, - "15m": { - "processed": 1.736165476881023, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - }, - "1m": { - "processed": 4.206417693750045, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - } - } - }, - "totals": { - "buildSegments": { - "processed": 1994, - "processedWithError": 0, - "thrownAway": 0, - "unparseable": 0 - } - } - }, - "errorMsg": null - }, - "type": "ingestionStatsAndErrors" - } -} -``` -字段的描述信息如下: - -`ingestionStatsAndErrors` 提供了行数和错误数的信息 - -`ingestionState` 标识了摄取任务当前达到了哪一步,可能的取值包括: -* `NOT_STARTED`: 任务还没有读取任何行 -* `DETERMINE_PARTITIONS`: 任务正在处理行来决定分区信息 -* `BUILD_SEGMENTS`: 任务正在处理行来构建段 -* `COMPLETED`: 任务已经完成 - -只有批处理任务具有 `DETERMINE_PARTITIONS` 阶段。实时任务(如由Kafka索引服务创建的任务)没有 `DETERMINE_PARTITIONS` 阶段。 - -`unparseableEvents` 包含由不可解析输入引起的异常消息列表。这有助于识别有问题的输入行。对于 `DETERMINE_PARTITIONS` 和 `BUILD_SEGMENTS` 阶段,每个阶段都有一个列表。请注意,Hadoop批处理任务不支持保存不可解析事件。 - -`rowStats` map包含有关行计数的信息。每个摄取阶段有一个条目。不同行计数的定义如下所示: - -* `processed`: 成功摄入且没有报错的行数 -* `processedWithErro`: 摄取但在一列或多列中包含解析错误的行数。这通常发生在输入行具有可解析的结构但列的类型无效的情况下,例如为数值列传入非数值字符串值 -* `thrownAway`: 跳过的行数。 这包括时间戳在摄取任务定义的时间间隔之外的行,以及使用 [`transformSpec`](ingestion.md#transformspec) 过滤掉的行,但不包括显式用户配置跳过的行。例如,CSV格式的 `skipHeaderRows` 或 `hasHeaderRow` 跳过的行不计算在内 -* `unparseable`: 完全无法分析并被丢弃的行数。这将跟踪没有可解析结构的输入行,例如在使用JSON解析器时传入非JSON数据。 - -`errorMsg` 字段显示一条消息,描述导致任务失败的错误。如果任务成功,则为空 - -### 实时报告 -#### 行画像 - -非并行的 [简单本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和kinesis摄取任务支持在任务运行时检索行统计信息。 - -可以通过运行任务的Peon上的以下URL访问实时报告: - -```json -http://:/druid/worker/v1/chat//rowStats -``` - -示例报告如下所示。`movingAverages` 
部分包含四行计数器的1分钟、5分钟和15分钟移动平均增量,其定义与结束报告中的定义相同。`totals` 部分显示当前总计。 - -```json -{ - "movingAverages": { - "buildSegments": { - "5m": { - "processed": 3.392158326408501, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - }, - "15m": { - "processed": 1.736165476881023, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - }, - "1m": { - "processed": 4.206417693750045, - "unparseable": 0, - "thrownAway": 0, - "processedWithError": 0 - } - } - }, - "totals": { - "buildSegments": { - "processed": 1994, - "processedWithError": 0, - "thrownAway": 0, - "unparseable": 0 - } - } -} -``` -对于Kafka索引服务,向Overlord API发送一个GET请求,将从supervisor管理的每个任务中检索实时行统计报告,并提供一个组合报告。 - -```json -http://:/druid/indexer/v1/supervisor//stats -``` - -#### 未解析的事件 - -可以对Peon API发起一次Get请求,从正在运行的任务中检索最近遇到的不可解析事件的列表: - -```json -http://:/druid/worker/v1/chat//unparseableEvents -``` -注意:并不是所有的任务类型支持该功能。 当前,该功能只支持非并行的 [本地批任务](native.md) (`index`类型) 和由Kafka、Kinesis索引服务创建的任务。 - -### 任务锁系统 - -本节介绍Druid中的任务锁定系统。Druid的锁定系统和版本控制系统是紧密耦合的,以保证接收数据的正确性。 - -### 段与段之间的"阴影" - -可以运行任务覆盖现有数据。覆盖任务创建的段将*覆盖*现有段。请注意,覆盖关系只适用于**同一时间块和同一数据源**。在过滤过时数据的查询处理中,不考虑这些被遮盖的段。 - -每个段都有一个*主*版本和一个*次*版本。主版本表示为时间戳,格式为["yyyy-MM-dd'T'hh:MM:ss"](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html),次版本表示为整数。这些主版本和次版本用于确定段之间的阴影关系,如下所示。 - -在以下条件下,段 `s1` 将会覆盖另一个段 `s2`: -* `s1` 比 `s2` 有一个更高的主版本 -* `s1` 和 `s2` 有相同的主版本,但是有更高的次版本 - -以下是一些示例: -* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段将覆盖另一个主版本为 `2018-01-01T00:00:00.000Z` 且次版本为 `1` 的段 -* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `1` 的段将覆盖另一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段 - -### 锁 - -如果您正在运行两个或多个 [Druid任务](taskrefer.md),这些任务为同一数据源和同一时间块生成段,那么生成的段可能会相互覆盖,从而导致错误的查询结果。 - -为了避免这个问题,任务将在Druid中创建任何段之前尝试获取锁, 有两种类型的锁,即 *时间块锁* 和 *段锁*。 - -使用时间块锁时,任务将锁定生成的段将写入数据源的整个时间块。例如,假设我们有一个任务将数据摄取到 `wikipedia` 数据源的时间块 `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` 中。使用时间块锁,此任务将在创建段之前锁定wikipedia数据源的 `2019-01-01T00:00.000Z/2019-01-02T00:00:00.000Z` 整个时间块。只要它持有锁,任何其他任务都将无法为同一数据源的同一时间块创建段。使用时间块锁创建的段的主版本*高于*现有段, 它们的次版本总是 `0`。 - -使用段锁时,任务锁定单个段而不是整个时间块。因此,如果两个或多个任务正在读取不同的段,则它们可以同时为同一时间创建同一数据源的块。例如,Kafka索引任务和压缩合并任务总是可以同时将段写入同一数据源的同一时间块中。原因是,Kafka索引任务总是附加新段,而压缩合并任务总是覆盖现有段。使用段锁创建的段具有*相同的*主版本和较高的次版本。 - -> [!WARNING] -> 段锁仍然是实验性的。它可能有未知的错误,这可能会导致错误的查询结果。 - -要启用段锁定,可能需要在 [task context(任务上下文)](#上下文参数) 中将 `forceTimeChunkLock` 设置为 `false`。一旦 `forceTimeChunkLock` 被取消设置,任务将自动选择正确的锁类型。**请注意**,段锁并不总是可用的。使用时间块锁的最常见场景是当覆盖任务更改段粒度时。此外,只有本地索引任务和Kafka/kinesis索引任务支持段锁。Hadoop索引任务和索引实时(`index_realtime`)任务(被 [Tranquility](tranquility.md)使用)还不支持它。 - -任务上下文中的 `forceTimeChunkLock` 仅应用于单个任务。如果要为所有任务取消设置,则需要在 [Overlord配置](../configuration/human-readable-byte.md#overlord) 中设置 `druid.indexer.tasklock.forceTimeChunkLock` 为false。 - -如果两个或多个任务尝试为同一数据源的重叠时间块获取锁,则锁请求可能会相互冲突。**请注意,**锁冲突可能发生在不同的锁类型之间。 - -锁冲突的行为取决于 [任务优先级](#锁优先级)。如果冲突锁请求的所有任务具有相同的优先级,则首先请求的任务将获得锁, 其他任务将等待任务释放锁。 - -如果优先级较低的任务请求锁的时间晚于优先级较高的任务,则此任务还将等待优先级较高的任务释放锁。如果优先级较高的任务比优先级较低的任务请求锁的时间晚,则此任务将*抢占*优先级较低的另一个任务。优先级较低的任务的锁将被撤销,优先级较高的任务将获得一个新锁。 - -锁抢占可以在任务运行时随时发生,除非它在关键的*段发布阶段*。一旦发布段完成,它的锁将再次成为可抢占的。 - -**请注意**,锁由同一groupId的任务共享。例如,同一supervisor的Kafka索引任务具有相同的groupId,并且彼此共享所有锁。 - -### 锁优先级 - -每个任务类型都有不同的默认锁优先级。下表显示了不同任务类型的默认优先级。数字越高,优先级越高。 - -| 任务类型 | 默认优先级 | -|-|-| -| 实时索引任务 | 75 | -| 批量索引任务 | 50 | -| 合并/追加/压缩任务 | 25 | -| 其他任务 | 0 | - -通过在任务上下文中设置优先级,可以覆盖任务优先级,如下所示。 - -```json -"context" : { - "priority" : 100 -} -``` - -### 上下文参数 - -任务上下文用于各种单独的任务配置。以下参数适用于所有任务类型。 - -| 属性 | 默认值 | 描述 | -|-|-|-| -| `taskLockTimeout` | 300000 | 任务锁定超时(毫秒)。更多详细信息,可以查看 [锁](#锁) 部分 | 
-| `forceTimeChunkLock` | true | *将此设置为false仍然是实验性的* 。强制始终使用时间块锁。如果未设置,则每个任务都会自动选择要使用的锁类型。如果设置了,它将覆盖 [Overlord配置](../Configuration/configuration.md#overlord] 的 `druid.indexer.tasklock.forceTimeChunkLock` 配置。有关详细信息,可以查看 [锁](#锁) 部分。| -| `priority` | 不同任务类型是不同的。 参见 [锁优先级](#锁优先级) | 任务优先级 | - -> [!WARNING] -> 当任务获取锁时,它通过HTTP发送请求并等待,直到它收到包含锁获取结果的响应为止。因此,如果 `taskLockTimeout` 大于 Overlord的`druid.server.http.maxIdleTime` 将会产生HTTP超时错误。 - -### 所有任务类型 -#### `index` - -参见 [本地批量摄取(简单任务)](native.md#简单任务) - -#### `index_parallel` - -参见 [本地批量社区(并行任务)](native.md#并行任务) - -#### `index_sub` - -由 [`index_parallel`](#index_parallel) 代表您自动提交的任务。 - -#### `index_hadoop` - -参见 [基于Hadoop的摄取](hadoop.md) - -#### `index_kafka` - -由 [`Kafka摄取supervisor`](kafka.md) 代表您自动提交的任务。 - -#### `index_kinesis` - -由 [`Kinesis摄取supervisor`](kinesis.md) 代表您自动提交的任务。 - -#### `index_realtime` - -由 [`Tranquility`](tranquility.md) 代表您自动提交的任务。 - -#### `compact` - -压缩任务合并给定间隔的所有段。有关详细信息,请参见有关 [压缩](datamanage.md#压缩与重新索引) 的文档。 - -#### `kill` - -Kill tasks删除有关某些段的所有元数据,并将其从深层存储中删除。有关详细信息,请参阅有关 [删除数据](datamanage.md#删除数据) 的文档。 - -#### `append` - -附加任务将段列表附加到单个段中(一个接一个)。语法是: - -```json -{ - "type": "append", - "id": , - "dataSource": , - "segments": , - "aggregations": , - "context": -} -``` - -#### `merge` - -合并任务将段列表合并在一起。合并任何公共时间戳。如果在接收过程中禁用了rollup,则不会合并公共时间戳,并按其时间戳对行重新排序。 - -> [!WARNING] -> [`compact`](#compact) 任务通常是比 `merge` 任务更好的选择。 - -语法是: - -```json -{ - "type": "merge", - "id": , - "dataSource": , - "aggregations": , - "rollup": , - "segments": , - "context": -} -``` - -#### `same_interval_merge` - -同一间隔合并任务是合并任务的快捷方式,间隔中的所有段都将被合并。 - -> [!WARNING] -> [`compact`](#compact) 任务通常是比 `same_interval_merge` 任务更好的选择。 - -语法是: - -```json -{ - "type": "same_interval_merge", - "id": , - "dataSource": , - "aggregations": , - "rollup": , - "interval": , - "context": -} -``` diff --git a/ingestion/tasks.md b/ingestion/tasks.md new file mode 100644 index 0000000..7874ba7 --- /dev/null +++ b/ingestion/tasks.md @@ -0,0 +1,774 @@ +--- +id: tasks +title: "Task reference" +--- + + + +Tasks do all [ingestion](index.md)-related work in Druid. + +For batch ingestion, you will generally submit tasks directly to Druid using the +[Task APIs](../operations/api-reference.md#tasks). For streaming ingestion, tasks are generally submitted for you by a +supervisor. + +## Task API + +Task APIs are available in two main places: + +- The [Overlord](../design/overlord.md) process offers HTTP APIs to submit tasks, cancel tasks, check their status, +review logs and reports, and more. Refer to the [Tasks API reference page](../operations/api-reference.md#tasks) for a +full list. +- Druid SQL includes a [`sys.tasks`](../querying/sql.md#tasks-table) table that provides information about currently +running tasks. This table is read-only, and has a limited (but useful!) subset of the full information available through +the Overlord APIs. + + + +## Task reports + +A report containing information about the number of rows ingested, and any parse exceptions that occurred is available for both completed tasks and running tasks. + +The reporting feature is supported by the [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks. 
+ +### Completion report + +After a task completes, a completion report can be retrieved at: + +``` +http://:/druid/indexer/v1/task//reports +``` + +An example output is shown below: + +```json +{ + "ingestionStatsAndErrors": { + "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", + "payload": { + "ingestionState": "COMPLETED", + "unparseableEvents": {}, + "rowStats": { + "determinePartitions": { + "processed": 0, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + }, + "buildSegments": { + "processed": 5390324, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + }, + "errorMsg": null + }, + "type": "ingestionStatsAndErrors" + } +} +``` + +### Live report + +When a task is running, a live report containing ingestion state, unparseable events and moving average for number of events processed for 1 min, 5 min, 15 min time window can be retrieved at: + +``` +http://:/druid/indexer/v1/task//reports +``` + +and + +``` +http://:/druid/worker/v1/chat//liveReports +``` + +An example output is shown below: + +```json +{ + "ingestionStatsAndErrors": { + "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", + "payload": { + "ingestionState": "RUNNING", + "unparseableEvents": {}, + "rowStats": { + "movingAverages": { + "buildSegments": { + "5m": { + "processed": 3.392158326408501, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "15m": { + "processed": 1.736165476881023, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "1m": { + "processed": 4.206417693750045, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + } + } + }, + "totals": { + "buildSegments": { + "processed": 1994, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + } + }, + "errorMsg": null + }, + "type": "ingestionStatsAndErrors" + } +} +``` + +A description of the fields: + +The `ingestionStatsAndErrors` report provides information about row counts and errors. + +The `ingestionState` shows what step of ingestion the task reached. Possible states include: +* `NOT_STARTED`: The task has not begun reading any rows +* `DETERMINE_PARTITIONS`: The task is processing rows to determine partitioning +* `BUILD_SEGMENTS`: The task is processing rows to construct segments +* `COMPLETED`: The task has finished its work. + +Only batch tasks have the DETERMINE_PARTITIONS phase. Realtime tasks such as those created by the Kafka Indexing Service do not have a DETERMINE_PARTITIONS phase. + +`unparseableEvents` contains lists of exception messages that were caused by unparseable inputs. This can help with identifying problematic input rows. There will be one list each for the DETERMINE_PARTITIONS and BUILD_SEGMENTS phases. Note that the Hadoop batch task does not support saving of unparseable events. + +the `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below: +* `processed`: Number of rows successfully ingested without parsing errors +* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column. +* `thrownAway`: Number of rows skipped. 
This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](index.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted. +* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser. + +The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful. + +## Live reports + +### Row stats + +The non-parallel [simple native batch task](../ingestion/native-batch.md#simple-task), the Hadoop batch task, and Kafka and Kinesis ingestion tasks support retrieval of row stats while the task is running. + +The live report can be accessed with a GET to the following URL on a Peon running a task: + +``` +http://:/druid/worker/v1/chat//rowStats +``` + +An example report is shown below. The `movingAverages` section contains 1 minute, 5 minute, and 15 minute moving averages of increases to the four row counters, which have the same definitions as those in the completion report. The `totals` section shows the current totals. + +``` +{ + "movingAverages": { + "buildSegments": { + "5m": { + "processed": 3.392158326408501, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "15m": { + "processed": 1.736165476881023, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "1m": { + "processed": 4.206417693750045, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + } + } + }, + "totals": { + "buildSegments": { + "processed": 1994, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + } +} +``` + +For the Kafka Indexing Service, a GET to the following Overlord API will retrieve live row stat reports from each task being managed by the supervisor and provide a combined report. + +``` +http://:/druid/indexer/v1/supervisor//stats +``` + +### Unparseable events + +Lists of recently-encountered unparseable events can be retrieved from a running task with a GET to the following Peon API: + +``` +http://:/druid/worker/v1/chat//unparseableEvents +``` + +Note that this functionality is not supported by all task types. Currently, it is only supported by the +non-parallel [native batch task](../ingestion/native-batch.md) (type `index`) and the tasks created by the Kafka +and Kinesis indexing services. + + + +## Task lock system + +This section explains the task locking system in Druid. Druid's locking system +and versioning system are tightly coupled with each other to guarantee the correctness of ingested data. + +## "Overshadowing" between segments + +You can run a task to overwrite existing data. The segments created by an overwriting task _overshadows_ existing segments. +Note that the overshadow relation holds only for the same time chunk and the same data source. +These overshadowed segments are not considered in query processing to filter out stale data. + +Each segment has a _major_ version and a _minor_ version. The major version is +represented as a timestamp in the format of [`"yyyy-MM-dd'T'hh:mm:ss"`](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat) +while the minor version is an integer number. 
These major and minor versions +are used to determine the overshadow relation between segments as seen below. + +A segment `s1` overshadows another `s2` if + +- `s1` has a higher major version than `s2`, or +- `s1` has the same major version and a higher minor version than `s2`. + +Here are some examples. + +- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0` overshadows + another of the major version of `2018-01-01T00:00:00.000Z` and the minor version of `1`. +- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `1` overshadows + another of the major version of `2019-01-01T00:00:00.000Z` and the minor version of `0`. + +## Locking + +If you are running two or more [druid tasks](./tasks.md) which generate segments for the same data source and the same time chunk, +the generated segments could potentially overshadow each other, which could lead to incorrect query results. + +To avoid this problem, tasks will attempt to get locks prior to creating any segment in Druid. +There are two types of locks, i.e., _time chunk lock_ and _segment lock_. + +When the time chunk lock is used, a task locks the entire time chunk of a data source where generated segments will be written. +For example, suppose we have a task ingesting data into the time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source. +With the time chunk locking, this task will lock the entire time chunk of `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` of the `wikipedia` data source +before it creates any segments. As long as it holds the lock, any other tasks will be unable to create segments for the same time chunk of the same data source. +The segments created with the time chunk locking have a _higher_ major version than existing segments. Their minor version is always `0`. + +When the segment lock is used, a task locks individual segments instead of the entire time chunk. +As a result, two or more tasks can create segments for the same time chunk of the same data source simultaneously +if they are reading different segments. +For example, a Kafka indexing task and a compaction task can always write segments into the same time chunk of the same data source simultaneously. +The reason for this is because a Kafka indexing task always appends new segments, while a compaction task always overwrites existing segments. +The segments created with the segment locking have the _same_ major version and a _higher_ minor version. + +> The segment locking is still experimental. It could have unknown bugs which potentially lead to incorrect query results. + +To enable segment locking, you may need to set `forceTimeChunkLock` to `false` in the [task context](#context). +Once `forceTimeChunkLock` is unset, the task will choose a proper lock type to use automatically. +Please note that segment lock is not always available. The most common use case where time chunk lock is enforced is +when an overwriting task changes the segment granularity. +Also, the segment locking is supported by only native indexing tasks and Kafka/Kinesis indexing tasks. +Hadoop indexing tasks don't support it. + +`forceTimeChunkLock` in the task context is only applied to individual tasks. +If you want to unset it for all tasks, you would want to set `druid.indexer.tasklock.forceTimeChunkLock` to false in the [overlord configuration](../configuration/index.md#overlord-operations). 
+ +Lock requests can conflict with each other if two or more tasks try to get locks for the overlapped time chunks of the same data source. +Note that the lock conflict can happen between different locks types. + +The behavior on lock conflicts depends on the [task priority](#lock-priority). +If all tasks of conflicting lock requests have the same priority, then the task who requested first will get the lock. +Other tasks will wait for the task to release the lock. + +If a task of a lower priority asks a lock later than another of a higher priority, +this task will also wait for the task of a higher priority to release the lock. +If a task of a higher priority asks a lock later than another of a lower priority, +then this task will _preempt_ the other task of a lower priority. The lock +of the lower-prioritized task will be revoked and the higher-prioritized task will acquire a new lock. + +This lock preemption can happen at any time while a task is running except +when it is _publishing segments_ in a critical section. Its locks become preemptible again once publishing segments is finished. + +Note that locks are shared by the tasks of the same groupId. +For example, Kafka indexing tasks of the same supervisor have the same groupId and share all locks with each other. + + + +## Lock priority + +Each task type has a different default lock priority. The below table shows the default priorities of different task types. Higher the number, higher the priority. + +|task type|default priority| +|---------|----------------| +|Realtime index task|75| +|Batch index task|50| +|Merge/Append/Compaction task|25| +|Other tasks|0| + +You can override the task priority by setting your priority in the task context as below. + +```json +"context" : { + "priority" : 100 +} +``` + + + +## Context parameters + +The task context is used for various individual task configuration. The following parameters apply to all task types. + +|property|default|description| +|--------|-------|-----------| +|`taskLockTimeout`|300000|task lock timeout in millisecond. For more details, see [Locking](#locking).| +|`forceTimeChunkLock`|true|_Setting this to false is still experimental_
Force to always use time chunk lock. If not set, each task automatically chooses a lock type to use. If this set, it will overwrite the `druid.indexer.tasklock.forceTimeChunkLock` [configuration for the overlord](../configuration/index.md#overlord-operations). See [Locking](#locking) for more details.| +|`priority`|Different based on task types. See [Priority](#priority).|Task priority| +|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or later|Enable the new lineage-based segment allocation protocol for the native Parallel task with dynamic partitioning. This option should be off during the replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21 to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to ensure data correctness.| + +> When a task acquires a lock, it sends a request via HTTP and awaits until it receives a response containing the lock acquisition result. +> As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of Overlords. + +## All task types + +### `index` + +See [Native batch ingestion (simple task)](native-batch.md#simple-task). + +### `index_parallel` + +See [Native batch ingestion (parallel task)](native-batch.md#parallel-task). + +### `index_sub` + +Submitted automatically, on your behalf, by an [`index_parallel`](#index_parallel) task. + +### `index_hadoop` + +See [Hadoop-based ingestion](hadoop.md). + +### `index_kafka` + +Submitted automatically, on your behalf, by a +[Kafka-based ingestion supervisor](../development/extensions-core/kafka-ingestion.md). + +### `index_kinesis` + +Submitted automatically, on your behalf, by a +[Kinesis-based ingestion supervisor](../development/extensions-core/kinesis-ingestion.md). + +### `index_realtime` + +Submitted automatically, on your behalf, by [Tranquility](tranquility.md). + +### `compact` + +Compaction tasks merge all segments of the given interval. See the documentation on +[compaction](compaction.md) for details. + +### `kill` + +Kill tasks delete all metadata about certain segments and removes them from deep storage. +See the documentation on [deleting data](../ingestion/data-management.md#delete) for details. 
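As a rough, hedged sketch, a manually submitted kill task needs only the task type, the datasource, and the interval in
which already-unused segments should be removed (the datasource and interval below are placeholders):

```json
{
  "type" : "kill",
  "dataSource" : "wikipedia",
  "interval" : "2019-01-01/2019-02-01"
}
```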
+ + + + + +## 任务参考文档 + +任务在Druid中完成所有与 [摄取](ingestion.md) 相关的工作。 + +对于批量摄取,通常使用 [任务api](../operations/api.md#Overlord) 直接将任务提交给Druid。对于流式接收,任务通常被提交给supervisor。 + +### 任务API + +任务API主要在两个地方是可用的: + +* [Overlord](../design/Overlord.md) 进程提供HTTP API接口来进行提交任务、取消任务、检查任务状态、查看任务日志与报告等。 查看 [任务API文档](../operations/api.md) 可以看到完整列表 +* Druid SQL包括了一个 [`sys.tasks`](../querying/druidsql.md#系统Schema) ,保存了当前任务运行的信息。 此表是只读的,并且可以通过Overlord API查询完整信息的有限制的子集。 + +### 任务报告 + +报告包含已完成的任务和正在运行的任务中有关接收的行数和发生的任何分析异常的信息的报表。 + +报告功能支持 [简单的本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和Kinesis摄取任务支持报告功能。 + +#### 任务结束报告 + +任务运行完成后,一个完整的报告可以在以下接口获取: + +```json +http://:/druid/indexer/v1/task//reports +``` + +一个示例输出如下: + +```json +{ + "ingestionStatsAndErrors": { + "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", + "payload": { + "ingestionState": "COMPLETED", + "unparseableEvents": {}, + "rowStats": { + "determinePartitions": { + "processed": 0, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + }, + "buildSegments": { + "processed": 5390324, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + }, + "errorMsg": null + }, + "type": "ingestionStatsAndErrors" + } +} +``` + +#### 任务运行报告 + +当一个任务正在运行时, 任务运行报告可以通过以下接口获得,包括摄取状态、未解析事件和过去1分钟、5分钟、15分钟内处理的平均事件数。 + +```json +http://:/druid/indexer/v1/task//reports +``` +和 +```json +http://:/druid/worker/v1/chat//liveReports +``` + +一个示例输出如下: + +```json +{ + "ingestionStatsAndErrors": { + "taskId": "compact_twitter_2018-09-24T18:24:23.920Z", + "payload": { + "ingestionState": "RUNNING", + "unparseableEvents": {}, + "rowStats": { + "movingAverages": { + "buildSegments": { + "5m": { + "processed": 3.392158326408501, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "15m": { + "processed": 1.736165476881023, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "1m": { + "processed": 4.206417693750045, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + } + } + }, + "totals": { + "buildSegments": { + "processed": 1994, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + } + }, + "errorMsg": null + }, + "type": "ingestionStatsAndErrors" + } +} +``` +字段的描述信息如下: + +`ingestionStatsAndErrors` 提供了行数和错误数的信息 + +`ingestionState` 标识了摄取任务当前达到了哪一步,可能的取值包括: +* `NOT_STARTED`: 任务还没有读取任何行 +* `DETERMINE_PARTITIONS`: 任务正在处理行来决定分区信息 +* `BUILD_SEGMENTS`: 任务正在处理行来构建段 +* `COMPLETED`: 任务已经完成 + +只有批处理任务具有 `DETERMINE_PARTITIONS` 阶段。实时任务(如由Kafka索引服务创建的任务)没有 `DETERMINE_PARTITIONS` 阶段。 + +`unparseableEvents` 包含由不可解析输入引起的异常消息列表。这有助于识别有问题的输入行。对于 `DETERMINE_PARTITIONS` 和 `BUILD_SEGMENTS` 阶段,每个阶段都有一个列表。请注意,Hadoop批处理任务不支持保存不可解析事件。 + +`rowStats` map包含有关行计数的信息。每个摄取阶段有一个条目。不同行计数的定义如下所示: + +* `processed`: 成功摄入且没有报错的行数 +* `processedWithErro`: 摄取但在一列或多列中包含解析错误的行数。这通常发生在输入行具有可解析的结构但列的类型无效的情况下,例如为数值列传入非数值字符串值 +* `thrownAway`: 跳过的行数。 这包括时间戳在摄取任务定义的时间间隔之外的行,以及使用 [`transformSpec`](ingestion.md#transformspec) 过滤掉的行,但不包括显式用户配置跳过的行。例如,CSV格式的 `skipHeaderRows` 或 `hasHeaderRow` 跳过的行不计算在内 +* `unparseable`: 完全无法分析并被丢弃的行数。这将跟踪没有可解析结构的输入行,例如在使用JSON解析器时传入非JSON数据。 + +`errorMsg` 字段显示一条消息,描述导致任务失败的错误。如果任务成功,则为空 + +### 实时报告 +#### 行画像 + +非并行的 [简单本地批处理任务](native.md#简单任务)、Hadoop批处理任务以及Kafka和kinesis摄取任务支持在任务运行时检索行统计信息。 + +可以通过运行任务的Peon上的以下URL访问实时报告: + +```json +http://:/druid/worker/v1/chat//rowStats +``` + +示例报告如下所示。`movingAverages` 部分包含四行计数器的1分钟、5分钟和15分钟移动平均增量,其定义与结束报告中的定义相同。`totals` 部分显示当前总计。 + +```json +{ + "movingAverages": { + "buildSegments": { + "5m": { + "processed": 3.392158326408501, + "unparseable": 0, + "thrownAway": 0, + 
"processedWithError": 0 + }, + "15m": { + "processed": 1.736165476881023, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + }, + "1m": { + "processed": 4.206417693750045, + "unparseable": 0, + "thrownAway": 0, + "processedWithError": 0 + } + } + }, + "totals": { + "buildSegments": { + "processed": 1994, + "processedWithError": 0, + "thrownAway": 0, + "unparseable": 0 + } + } +} +``` +对于Kafka索引服务,向Overlord API发送一个GET请求,将从supervisor管理的每个任务中检索实时行统计报告,并提供一个组合报告。 + +```json +http://:/druid/indexer/v1/supervisor//stats +``` + +#### 未解析的事件 + +可以对Peon API发起一次Get请求,从正在运行的任务中检索最近遇到的不可解析事件的列表: + +```json +http://:/druid/worker/v1/chat//unparseableEvents +``` +注意:并不是所有的任务类型支持该功能。 当前,该功能只支持非并行的 [本地批任务](native.md) (`index`类型) 和由Kafka、Kinesis索引服务创建的任务。 + +### 任务锁系统 + +本节介绍Druid中的任务锁定系统。Druid的锁定系统和版本控制系统是紧密耦合的,以保证接收数据的正确性。 + +### 段与段之间的"阴影" + +可以运行任务覆盖现有数据。覆盖任务创建的段将*覆盖*现有段。请注意,覆盖关系只适用于**同一时间块和同一数据源**。在过滤过时数据的查询处理中,不考虑这些被遮盖的段。 + +每个段都有一个*主*版本和一个*次*版本。主版本表示为时间戳,格式为["yyyy-MM-dd'T'hh:MM:ss"](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html),次版本表示为整数。这些主版本和次版本用于确定段之间的阴影关系,如下所示。 + +在以下条件下,段 `s1` 将会覆盖另一个段 `s2`: +* `s1` 比 `s2` 有一个更高的主版本 +* `s1` 和 `s2` 有相同的主版本,但是有更高的次版本 + +以下是一些示例: +* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段将覆盖另一个主版本为 `2018-01-01T00:00:00.000Z` 且次版本为 `1` 的段 +* 一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `1` 的段将覆盖另一个主版本为 `2019-01-01T00:00:00.000Z` 且次版本为 `0` 的段 + +### 锁 + +如果您正在运行两个或多个 [Druid任务](taskrefer.md),这些任务为同一数据源和同一时间块生成段,那么生成的段可能会相互覆盖,从而导致错误的查询结果。 + +为了避免这个问题,任务将在Druid中创建任何段之前尝试获取锁, 有两种类型的锁,即 *时间块锁* 和 *段锁*。 + +使用时间块锁时,任务将锁定生成的段将写入数据源的整个时间块。例如,假设我们有一个任务将数据摄取到 `wikipedia` 数据源的时间块 `2019-01-01T00:00:00.000Z/2019-01-02T00:00:00.000Z` 中。使用时间块锁,此任务将在创建段之前锁定wikipedia数据源的 `2019-01-01T00:00.000Z/2019-01-02T00:00:00.000Z` 整个时间块。只要它持有锁,任何其他任务都将无法为同一数据源的同一时间块创建段。使用时间块锁创建的段的主版本*高于*现有段, 它们的次版本总是 `0`。 + +使用段锁时,任务锁定单个段而不是整个时间块。因此,如果两个或多个任务正在读取不同的段,则它们可以同时为同一时间创建同一数据源的块。例如,Kafka索引任务和压缩合并任务总是可以同时将段写入同一数据源的同一时间块中。原因是,Kafka索引任务总是附加新段,而压缩合并任务总是覆盖现有段。使用段锁创建的段具有*相同的*主版本和较高的次版本。 + +> [!WARNING] +> 段锁仍然是实验性的。它可能有未知的错误,这可能会导致错误的查询结果。 + +要启用段锁定,可能需要在 [task context(任务上下文)](#上下文参数) 中将 `forceTimeChunkLock` 设置为 `false`。一旦 `forceTimeChunkLock` 被取消设置,任务将自动选择正确的锁类型。**请注意**,段锁并不总是可用的。使用时间块锁的最常见场景是当覆盖任务更改段粒度时。此外,只有本地索引任务和Kafka/kinesis索引任务支持段锁。Hadoop索引任务和索引实时(`index_realtime`)任务(被 [Tranquility](tranquility.md)使用)还不支持它。 + +任务上下文中的 `forceTimeChunkLock` 仅应用于单个任务。如果要为所有任务取消设置,则需要在 [Overlord配置](../configuration/human-readable-byte.md#overlord) 中设置 `druid.indexer.tasklock.forceTimeChunkLock` 为false。 + +如果两个或多个任务尝试为同一数据源的重叠时间块获取锁,则锁请求可能会相互冲突。**请注意,**锁冲突可能发生在不同的锁类型之间。 + +锁冲突的行为取决于 [任务优先级](#锁优先级)。如果冲突锁请求的所有任务具有相同的优先级,则首先请求的任务将获得锁, 其他任务将等待任务释放锁。 + +如果优先级较低的任务请求锁的时间晚于优先级较高的任务,则此任务还将等待优先级较高的任务释放锁。如果优先级较高的任务比优先级较低的任务请求锁的时间晚,则此任务将*抢占*优先级较低的另一个任务。优先级较低的任务的锁将被撤销,优先级较高的任务将获得一个新锁。 + +锁抢占可以在任务运行时随时发生,除非它在关键的*段发布阶段*。一旦发布段完成,它的锁将再次成为可抢占的。 + +**请注意**,锁由同一groupId的任务共享。例如,同一supervisor的Kafka索引任务具有相同的groupId,并且彼此共享所有锁。 + +### 锁优先级 + +每个任务类型都有不同的默认锁优先级。下表显示了不同任务类型的默认优先级。数字越高,优先级越高。 + +| 任务类型 | 默认优先级 | +|-|-| +| 实时索引任务 | 75 | +| 批量索引任务 | 50 | +| 合并/追加/压缩任务 | 25 | +| 其他任务 | 0 | + +通过在任务上下文中设置优先级,可以覆盖任务优先级,如下所示。 + +```json +"context" : { + "priority" : 100 +} +``` + +### 上下文参数 + +任务上下文用于各种单独的任务配置。以下参数适用于所有任务类型。 + +| 属性 | 默认值 | 描述 | +|-|-|-| +| `taskLockTimeout` | 300000 | 任务锁定超时(毫秒)。更多详细信息,可以查看 [锁](#锁) 部分 | +| `forceTimeChunkLock` | true | *将此设置为false仍然是实验性的* 。强制始终使用时间块锁。如果未设置,则每个任务都会自动选择要使用的锁类型。如果设置了,它将覆盖 [Overlord配置](../Configuration/configuration.md#overlord] 的 `druid.indexer.tasklock.forceTimeChunkLock` 
配置。有关详细信息,可以查看 [锁](#锁) 部分。| +| `priority` | 不同任务类型是不同的。 参见 [锁优先级](#锁优先级) | 任务优先级 | + +> [!WARNING] +> 当任务获取锁时,它通过HTTP发送请求并等待,直到它收到包含锁获取结果的响应为止。因此,如果 `taskLockTimeout` 大于 Overlord的`druid.server.http.maxIdleTime` 将会产生HTTP超时错误。 + +### 所有任务类型 +#### `index` + +参见 [本地批量摄取(简单任务)](native.md#简单任务) + +#### `index_parallel` + +参见 [本地批量社区(并行任务)](native.md#并行任务) + +#### `index_sub` + +由 [`index_parallel`](#index_parallel) 代表您自动提交的任务。 + +#### `index_hadoop` + +参见 [基于Hadoop的摄取](hadoop.md) + +#### `index_kafka` + +由 [`Kafka摄取supervisor`](kafka.md) 代表您自动提交的任务。 + +#### `index_kinesis` + +由 [`Kinesis摄取supervisor`](kinesis.md) 代表您自动提交的任务。 + +#### `index_realtime` + +由 [`Tranquility`](tranquility.md) 代表您自动提交的任务。 + +#### `compact` + +压缩任务合并给定间隔的所有段。有关详细信息,请参见有关 [压缩](datamanage.md#压缩与重新索引) 的文档。 + +#### `kill` + +Kill tasks删除有关某些段的所有元数据,并将其从深层存储中删除。有关详细信息,请参阅有关 [删除数据](datamanage.md#删除数据) 的文档。 + +#### `append` + +附加任务将段列表附加到单个段中(一个接一个)。语法是: + +```json +{ + "type": "append", + "id": , + "dataSource": , + "segments": , + "aggregations": , + "context": +} +``` + +#### `merge` + +合并任务将段列表合并在一起。合并任何公共时间戳。如果在接收过程中禁用了rollup,则不会合并公共时间戳,并按其时间戳对行重新排序。 + +> [!WARNING] +> [`compact`](#compact) 任务通常是比 `merge` 任务更好的选择。 + +语法是: + +```json +{ + "type": "merge", + "id": , + "dataSource": , + "aggregations": , + "rollup": , + "segments": , + "context": +} +``` + +#### `same_interval_merge` + +同一间隔合并任务是合并任务的快捷方式,间隔中的所有段都将被合并。 + +> [!WARNING] +> [`compact`](#compact) 任务通常是比 `same_interval_merge` 任务更好的选择。 + +语法是: + +```json +{ + "type": "same_interval_merge", + "id": , + "dataSource": , + "aggregations": , + "rollup": , + "interval": , + "context": +} +``` diff --git a/ingestion/tranquility.md b/ingestion/tranquility.md new file mode 100644 index 0000000..f664644 --- /dev/null +++ b/ingestion/tranquility.md @@ -0,0 +1,36 @@ +--- +id: tranquility +title: "Tranquility" +--- + + + +[Tranquility](https://github.com/druid-io/tranquility/) is a separately distributed package for pushing +streams to Druid in real-time. + +Tranquility has not been built against a version of Druid later than Druid 0.9.2 +release. It may still work with the latest Druid servers, but not all features and functionality will be available +due to limitations of older Druid APIs on the Tranquility side. + +For new projects that require streaming ingestion, we recommend using Druid's native support for +[Apache Kafka](../development/extensions-core/kafka-ingestion.md) or +[Amazon Kinesis](../development/extensions-core/kinesis-ingestion.md). + +For more details, check out the [Tranquility GitHub page](https://github.com/druid-io/tranquility/).