mirror of https://github.com/apache/druid.git
* Docs refactor of ingestion. Carries #11541
* Update docs/misc/math-expr.md
* add Apache license
* fix header, add topics to sidebar
* Update docs/ingestion/partitioning.md
* pick up changes to and md from c7fdf1d
, #11479
Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
This commit is contained in:
parent
aaf0aaad8f
commit
6524d838d7
|
@ -927,8 +927,8 @@ The below is a list of the supported configurations for auto compaction.
|
|||
|`maxBytesInMemory`|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is `maxBytesInMemory` * (2 + `maxPendingPersists`)|no (default = 1/6 of max JVM memory)|
|
||||
|`splitHintSpec`|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](../ingestion/native-batch.md#split-hint-spec) for more details.|no (default = size-based split hint spec)|
|
||||
|`partitionsSpec`|Defines how to partition data in each time chunk, see [`PartitionsSpec`](../ingestion/native-batch.md#partitionsspec)|no (default = `dynamic`)|
|
||||
|`indexSpec`|Defines segment storage format options to be used at indexing time, see [IndexSpec](../ingestion/index.md#indexspec)|no|
|
||||
|`indexSpecForIntermediatePersists`|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](../ingestion/index.md#indexspec) for possible values.|no|
|
||||
|`indexSpec`|Defines segment storage format options to be used at indexing time, see [IndexSpec](../ingestion/ingestion-spec.md#indexspec)|no|
|
||||
|`indexSpecForIntermediatePersists`|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](../ingestion/ingestion-spec.md#indexspec) for possible values.|no|
|
||||
|`maxPendingPersists`|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with `maxRowsInMemory` * (2 + `maxPendingPersists`).|no (default = 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)|
|
||||
|`pushTimeout`|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|no (default = 0)|
|
||||
|`segmentWriteOutMediumFactory`|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](../ingestion/native-batch.md#segmentwriteoutmediumfactory).|no (default is the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used)|
|
||||
|
|
|
@ -27,7 +27,7 @@ Apache Druid stores its index in *segment files*, which are partitioned by
|
|||
time. In a basic setup, one segment file is created for each time
|
||||
interval, where the time interval is configurable in the
|
||||
`segmentGranularity` parameter of the
|
||||
[`granularitySpec`](../ingestion/index.md#granularityspec). For Druid to
|
||||
[`granularitySpec`](../ingestion/ingestion-spec.md#granularityspec). For Druid to
|
||||
operate well under heavy query load, it is important for the segment
|
||||
file size to be within the recommended range of 300MB-700MB. If your
|
||||
segment files are larger than this range, then consider either
|
||||
|
@ -187,7 +187,7 @@ Each column is stored as two parts:
|
|||
A ColumnDescriptor is essentially an object that allows us to use Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serialization/deserialization logic that can deserialize the rest of the binary.
|
||||
|
||||
### Compression
|
||||
Druid compresses blocks of values for string, long, float, and double columns, using [LZ4](https://github.com/lz4/lz4-java) by default, and bitmaps for string columns and numeric null values are compressed using [Roaring](https://github.com/RoaringBitmap/RoaringBitmap). We recommend sticking with these defaults unless experimental verification with your own data and query patterns suggest that non-default options will perform better in your specific case. For example, for bitmap in string columns, the differences between using Roaring and CONCISE are most pronounced for high cardinality columns. In this case, Roaring is substantially faster on filters that match a lot of values, but in some cases CONCISE can have a lower footprint due to the overhead of the Roaring format (but is still slower when lots of values are matched). Currently, compression is configured on at the segment level rather than individual columns, see [IndexSpec](../ingestion/index.md#indexspec) for more details.
|
||||
Druid compresses blocks of values for string, long, float, and double columns, using [LZ4](https://github.com/lz4/lz4-java) by default, and bitmaps for string columns and numeric null values are compressed using [Roaring](https://github.com/RoaringBitmap/RoaringBitmap). We recommend sticking with these defaults unless experimental verification with your own data and query patterns suggest that non-default options will perform better in your specific case. For example, for bitmap in string columns, the differences between using Roaring and CONCISE are most pronounced for high cardinality columns. In this case, Roaring is substantially faster on filters that match a lot of values, but in some cases CONCISE can have a lower footprint due to the overhead of the Roaring format (but is still slower when lots of values are matched). Currently, compression is configured on at the segment level rather than individual columns, see [IndexSpec](../ingestion/ingestion-spec.md#indexspec) for more details.
|
||||
|
||||
## Sharding Data to Create Segments
|
||||
|
||||
|
|
|
@ -59,7 +59,7 @@ druid.extensions.loadList=["druid-datasketches"]
|
|||
}
|
||||
```
|
||||
|
||||
> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/index.html#rollup) to create a [metric](../../ingestion/index.html#metricsspec) on high-cardinality columns. In this example, a metric called `userid_hll` is included in the `metricsSpec`. This will perform a HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of the `dimensionsSpec`.
|
||||
> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/rollup.md) to create a [metric](../../ingestion/ingestion-spec.html#metricsspec) on high-cardinality columns. In this example, a metric called `userid_hll` is included in the `metricsSpec`. This will perform a HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of the `dimensionsSpec`.
|
||||
>
|
||||
> ```
|
||||
> :
|
||||
|
|
|
@ -126,7 +126,7 @@ A sample supervisor spec is shown below:
|
|||
|Field|Description|Required|
|
||||
|--------|-----------|---------|
|
||||
|`type`|The supervisor type, this should always be `kafka`.|yes|
|
||||
|`dataSchema`|The schema that will be used by the Kafka indexing task during ingestion. See [`dataSchema`](../../ingestion/index.md#dataschema) for details.|yes|
|
||||
|`dataSchema`|The schema that will be used by the Kafka indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema) for details.|yes|
|
||||
|`ioConfig`|A KafkaSupervisorIOConfig object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task. See [KafkaSupervisorIOConfig](#kafkasupervisorioconfig) below.|yes|
|
||||
|`tuningConfig`|A KafkaSupervisorTuningConfig object for configuring performance-related settings for the supervisor and indexing tasks. See [KafkaSupervisorTuningConfig](#kafkasupervisortuningconfig) below.|no|
|
||||
|
||||
|
|
|
@ -114,7 +114,7 @@ A sample supervisor spec is shown below:
|
|||
|Field|Description|Required|
|
||||
|--------|-----------|---------|
|
||||
|`type`|The supervisor type, this should always be `kinesis`.|yes|
|
||||
|`dataSchema`|The schema that will be used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/index.md#dataschema).|yes|
|
||||
|`dataSchema`|The schema that will be used by the Kinesis indexing task during ingestion. See [`dataSchema`](../../ingestion/ingestion-spec.md#dataschema).|yes|
|
||||
|`ioConfig`|A KinesisSupervisorIOConfig object for configuring Kafka connection and I/O-related settings for the supervisor and indexing task. See [KinesisSupervisorIOConfig](#kinesissupervisorioconfig) below.|yes|
|
||||
|`tuningConfig`|A KinesisSupervisorTuningConfig object for configuring performance-related settings for the supervisor and indexing tasks. See [KinesisSupervisorTuningConfig](#kinesissupervisortuningconfig) below.|no|
|
||||
|
||||
|
|
|
@ -44,7 +44,7 @@ To migrate to 0.15.0+:
|
|||
* The 'contrib' extension supported a `typeString` property, which provided the schema of the
|
||||
ORC file, of which was essentially required to have the types correct, but notably _not_ the column names, which
|
||||
facilitated column renaming. In the 'core' extension, column renaming can be achieved with
|
||||
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
|
||||
[`flattenSpec`](../../ingestion/ingestion-spec.md#flattenspec). For example, `"typeString":"struct<time:string,name:string>"`
|
||||
with the actual schema `struct<_col0:string,_col1:string>`, to preserve Druid schema would need replaced with:
|
||||
|
||||
```json
|
||||
|
@ -67,7 +67,7 @@ with the actual schema `struct<_col0:string,_col1:string>`, to preserve Druid sc
|
|||
|
||||
* The 'contrib' extension supported a `mapFieldNameFormat` property, which provided a way to specify a dimension to
|
||||
flatten `OrcMap` columns with primitive types. This functionality has also been replaced with
|
||||
[`flattenSpec`](../../ingestion/index.md#flattenspec). For example: `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
|
||||
[`flattenSpec`](../../ingestion/ingestion-spec.md#flattenspec). For example: `"mapFieldNameFormat": "<PARENT>_<CHILD>"`
|
||||
for a dimension `nestedData_dim1`, to preserve Druid schema could be replaced with
|
||||
|
||||
```json
|
||||
|
|
|
@ -36,7 +36,7 @@ By default, compaction does not modify the underlying data of the segments. Howe
|
|||
- Over time you don't need fine-grained granularity for older data so you want use compaction to change older segments to a coarser query granularity. This reduces the storage space required for older data. For example from `minute` to `hour`, or `hour` to `day`. You cannot go from coarser granularity to finer granularity.
|
||||
- You can change the dimension order to improve sorting and reduce segment size.
|
||||
- You can remove unused columns in compaction or implement an aggregation metric for older data.
|
||||
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](index.md#perfect-rollup-vs-best-effort-rollup).
|
||||
- You can change segment rollup from dynamic partitioning with best-effort rollup to hash or range partitioning with perfect rollup. For more information on rollup, see [perfect vs best-effort rollup](./rollup.md#perfect-rollup-vs-best-effort-rollup).
|
||||
|
||||
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
|
||||
|
||||
|
@ -82,7 +82,7 @@ If you want to control dimension ordering or ensure specific values for dimensio
|
|||
|
||||
### Rollup
|
||||
Druid only rolls up the output segment when `rollup` is set for all input segments.
|
||||
See [Roll-up](../ingestion/index.md#rollup) for more details.
|
||||
See [Roll-up](../ingestion/rollup.md) for more details.
|
||||
You can check that your segments are rolled up or not by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes).
|
||||
|
||||
## Setting up manual compaction
|
||||
|
|
|
@ -79,12 +79,19 @@ Unfortunately, the Input Format doesn't support all data formats or ingestion me
|
|||
Especially if you want to use the Hadoop ingestion, you still need to use the [Parser](#parser).
|
||||
If your data is formatted in some format not listed in this section, please consider using the Parser instead.
|
||||
|
||||
All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](index.md#ioconfig).
|
||||
All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `inputFormat` entry in your [`ioConfig`](ingestion-spec.md#ioconfig).
|
||||
|
||||
### JSON
|
||||
|
||||
The `inputFormat` to load data of JSON format. An example is:
|
||||
Configure the JSON `inputFormat` to load JSON data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `json`. | yes |
|
||||
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
|
||||
| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -94,18 +101,19 @@ The `inputFormat` to load data of JSON format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
The JSON `inputFormat` has the following components:
|
||||
### CSV
|
||||
|
||||
Configure the CSV `inputFormat` to load CSV data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `json`. | yes |
|
||||
| flattenSpec | JSON Object | Specifies flattening configuration for nested JSON data. See [`flattenSpec`](#flattenspec) for more info. | no |
|
||||
| featureSpec | JSON Object | [JSON parser features](https://github.com/FasterXML/jackson-core/wiki/JsonParser-Features) supported by Jackson library. Those features will be applied when parsing the input JSON data. | no |
|
||||
|
||||
### CSV
|
||||
|
||||
The `inputFormat` to load data of the CSV format. An example is:
|
||||
| type | String | This should say `csv`. | yes |
|
||||
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
|
||||
| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
|
||||
| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) |
|
||||
| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -116,30 +124,9 @@ The `inputFormat` to load data of the CSV format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
The CSV `inputFormat` has the following components:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `csv`. | yes |
|
||||
| listDelimiter | String | A custom delimiter for multi-value dimensions. | no (default = ctrl+A) |
|
||||
| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
|
||||
| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) |
|
||||
| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) |
|
||||
|
||||
### TSV (Delimited)
|
||||
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
"type": "tsv",
|
||||
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
|
||||
"delimiter":"|"
|
||||
},
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The `inputFormat` to load data of a delimited format. An example is:
|
||||
Configure the TSV `inputFormat` to load TSV data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|
@ -152,15 +139,32 @@ The `inputFormat` to load data of a delimited format. An example is:
|
|||
|
||||
Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
"type": "tsv",
|
||||
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city","added","deleted","delta"],
|
||||
"delimiter":"|"
|
||||
},
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
### ORC
|
||||
|
||||
> You need to include the [`druid-orc-extensions`](../development/extensions-core/orc.md) as an extension to use the ORC input format.
|
||||
To use the ORC input format, load the Druid Orc extension ( [`druid-orc-extensions`](../development/extensions-core/orc.md)).
|
||||
> To upgrade from versions earlier than 0.15.0 to 0.15.0 or new, read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension).
|
||||
|
||||
> If you are considering upgrading from earlier than 0.15.0 to 0.15.0 or a higher version,
|
||||
> please read [Migration from 'contrib' extension](../development/extensions-core/orc.md#migration-from-contrib-extension) carefully.
|
||||
Configure the ORC `inputFormat` to load ORC data as follows:
|
||||
|
||||
The `inputFormat` to load data of ORC format. An example is:
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `orc`. | yes |
|
||||
| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
|
||||
| binaryAsString | Boolean | Specifies if the binary orc column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -181,20 +185,19 @@ The `inputFormat` to load data of ORC format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
The ORC `inputFormat` has the following components:
|
||||
### Parquet
|
||||
|
||||
To use the Parquet input format load the Druid Parquet extension ([`druid-parquet-extensions`](../development/extensions-core/parquet.md)).
|
||||
|
||||
Configure the Parquet `inputFormat` to load Parquet data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `orc`. | yes |
|
||||
| flattenSpec | JSON Object | Specifies flattening configuration for nested ORC data. See [`flattenSpec`](#flattenspec) for more info. | no |
|
||||
| binaryAsString | Boolean | Specifies if the binary orc column which is not logically marked as a string should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
### Parquet
|
||||
|
||||
> You need to include the [`druid-parquet-extensions`](../development/extensions-core/parquet.md) as an extension to use the Parquet input format.
|
||||
|
||||
The `inputFormat` to load data of Parquet format. An example is:
|
||||
|type| String| This should be set to `parquet` to read Parquet file| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -215,21 +218,22 @@ The `inputFormat` to load data of Parquet format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
The Parquet `inputFormat` has the following components:
|
||||
### Avro Stream
|
||||
|
||||
To use the Avro Stream input format load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
|
||||
|
||||
For more information on how Druid handles Avro types, see [Avro Types](../development/extensions-core/avro.md#avro-types) section for
|
||||
|
||||
Configure the Avro `inputFormat` to load Avro data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `parquet` to read Parquet file| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|`avroBytesDecoder`| JSON Object |Specifies how to decode bytes to Avro record. | yes |
|
||||
| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
### Avro Stream
|
||||
|
||||
> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream input format.
|
||||
|
||||
> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
|
||||
|
||||
The `inputFormat` to load data of Avro format in stream ingestion. An example is:
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -263,13 +267,6 @@ The `inputFormat` to load data of Avro format in stream ingestion. An example is
|
|||
}
|
||||
```
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|`avroBytesDecoder`| JSON Object |Specifies how to decode bytes to Avro record. | yes |
|
||||
| binaryAsString | Boolean | Specifies if the bytes Avro column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
##### Avro Bytes Decoder
|
||||
|
||||
If `type` is not included, the avroBytesDecoder defaults to `schema_repo`.
|
||||
|
@ -430,11 +427,20 @@ Multiple Instances:
|
|||
|
||||
### Avro OCF
|
||||
|
||||
> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro OCF input format.
|
||||
To load the Avro OCF input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
|
||||
|
||||
> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
|
||||
See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
|
||||
|
||||
The `inputFormat` to load data of Avro OCF format. An example is:
|
||||
Configure the Avro OCF `inputFormat` to load Avro OCF data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
|
||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -470,18 +476,19 @@ The `inputFormat` to load data of Avro OCF format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|schema| JSON Object |Define a reader schema to be used when parsing Avro records, this is useful when parsing multiple versions of Avro OCF file data | no (default will decode using the writer schema contained in the OCF file) |
|
||||
| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
|
||||
|
||||
### Protobuf
|
||||
|
||||
> You need to include the [`druid-protobuf-extensions`](../development/extensions-core/protobuf.md) as an extension to use the Protobuf input format.
|
||||
|
||||
The `inputFormat` to load data of Protobuf format. An example is:
|
||||
Configure the Protobuf `inputFormat` to load Protobuf data as follows:
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"ioConfig": {
|
||||
"inputFormat": {
|
||||
|
@ -506,18 +513,18 @@ The `inputFormat` to load data of Protobuf format. An example is:
|
|||
}
|
||||
```
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
|
||||
|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
|
||||
|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes |
|
||||
|
||||
### FlattenSpec
|
||||
|
||||
The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is responsible for
|
||||
bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
|
||||
An example `flattenSpec` is:
|
||||
The `flattenSpec` bridges the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. It is an object within the `inputFormat` object.
|
||||
|
||||
Configure your `flattenSpec` as follows:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
|
||||
| fields | Specifies the fields of interest and how they are accessed. See [Field flattening specifications](#field-flattening-specifications) for more detail. | `[]` |
|
||||
|
||||
For example:
|
||||
```json
|
||||
"flattenSpec": {
|
||||
"useFieldDiscovery": true,
|
||||
|
@ -528,21 +535,12 @@ An example `flattenSpec` is:
|
|||
]
|
||||
}
|
||||
```
|
||||
> Conceptually, after input data records are read, the `flattenSpec` is applied first before
|
||||
> any other specs such as [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec),
|
||||
> [`dimensionsSpec`](./index.md#dimensionsspec), or [`metricsSpec`](./index.md#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
After Druid reads the input data records, it applies the flattenSpec before applying any other specs such as [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec),
|
||||
> [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), or [`metricsSpec`](./ingestion-spec.md#metricsspec). Keep this in mind when writing your ingestion spec.
|
||||
|
||||
Flattening is only supported for [data formats](data-formats.md) that support nesting, including `avro`, `json`, `orc`,
|
||||
and `parquet`.
|
||||
|
||||
A `flattenSpec` can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
|
||||
| fields | Specifies the fields of interest and how they are accessed. [See below for details.](#field-flattening-specifications) | `[]` |
|
||||
|
||||
#### Field flattening specifications
|
||||
|
||||
Each entry in the `fields` list can have the following components:
|
||||
|
@ -550,7 +548,7 @@ Each entry in the `fields` list can have the following components:
|
|||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| type | Options are as follows:<br><br><ul><li>`root`, referring to a field at the root level of the record. Only really useful if `useFieldDiscovery` is false.</li><li>`path`, referring to a field using [JsonPath](https://github.com/jayway/JsonPath) notation. Supported by most data formats that offer nesting, including `avro`, `json`, `orc`, and `parquet`.</li><li>`jq`, referring to a field using [jackson-jq](https://github.com/eiiches/jackson-jq) notation. Only supported for the `json` format.</li></ul> | none (required) |
|
||||
| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).| none (required) |
|
||||
| name | Name of the field after flattening. This name can be referred to by the [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).| none (required) |
|
||||
| expr | Expression for accessing the field while flattening. For type `path`, this should be [JsonPath](https://github.com/jayway/JsonPath). For type `jq`, this should be [jackson-jq](https://github.com/eiiches/jackson-jq) notation. For other types, this parameter is ignored. | none (required for types `path` and `jq`) |
|
||||
|
||||
#### Notes on flattening
|
||||
|
@ -1157,7 +1155,7 @@ This parser is for [stream ingestion](./index.md#streaming) and reads Protocol b
|
|||
|-------|------|-------------|----------|
|
||||
| type | String | This should say `protobuf`. | yes |
|
||||
| `protoBytesDecoder` | JSON Object | Specifies how to decode bytes to Protobuf record. | yes |
|
||||
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. The format must be JSON. See [JSON ParseSpec](./index.md) for more configuration options. Note that timeAndDims parseSpec is no longer supported. | yes |
|
||||
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. The format must be JSON. See [JSON ParseSpec](#json-parsespec) for more configuration options. Note that timeAndDims parseSpec is no longer supported. | yes |
|
||||
|
||||
Sample spec:
|
||||
|
||||
|
|
|
@ -0,0 +1,58 @@
|
|||
---
|
||||
id: data-model
|
||||
title: "Druid data model"
|
||||
sidebar_label: Data model
|
||||
description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
|
||||
---
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
|
||||
Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares similarities with both relational and timeseries data models.
|
||||
|
||||
## Primary timestamp
|
||||
|
||||
Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
|
||||
for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
|
||||
|
||||
Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
|
||||
|
||||
You can control other important operations that are based on the primary timestamp in the
|
||||
[`granularitySpec`](./ingestion-spec.md#granularityspec). If you have more than one timestamp column, you can store the others as
|
||||
[secondary timestamps](./schema-design.md#secondary-timestamps).
|
||||
|
||||
## Dimensions
|
||||
|
||||
Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time when necessary.
|
||||
|
||||
If you disable [rollup](./rollup.md), then Druid treats the set of
|
||||
dimensions like a set of columns to ingest. The dimensions behave exactly as you would expect from any database that does not support a rollup feature.
|
||||
|
||||
At ingestion time, you configure dimensions in the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
|
||||
|
||||
## Metrics
|
||||
|
||||
Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you specify a metric, you can apply an aggregation function to each row during ingestion. This
|
||||
has the following benefits:
|
||||
|
||||
Rollup is a form of aggregation that collapses dimensions while aggregating the values in the metrics, that is, it collapses rows but retains its summary information."
|
||||
- [Rollup](rollup.md) is a form of aggregation that combines multiple rows with the same timestamp value and dimension values. For example, the [rollup tutorial](../tutorials/tutorial-rollup.md) demonstrates using rollup to collapse netflow data to a single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
|
||||
- Druid can compute some aggregators, especially approximate ones, more quickly at query time if they are partially computed at ingestion time, including data that has not been rolled up.
|
||||
|
||||
At ingestion time, you configure Metrics in the [`metricsSpec`](./ingestion-spec.md#metricsspec).
|
|
@ -115,7 +115,7 @@ Also note that Druid automatically computes the classpath for Hadoop job contain
|
|||
|
||||
## `dataSchema`
|
||||
|
||||
This field is required. See the [`dataSchema`](index.md#legacy-dataschema-spec) section of the main ingestion page for details on
|
||||
This field is required. See the [`dataSchema`](ingestion-spec.md#legacy-dataschema-spec) section of the main ingestion page for details on
|
||||
what it should contain.
|
||||
|
||||
## `ioConfig`
|
||||
|
@ -328,8 +328,8 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|
|||
|combineText|Boolean|Use CombineTextInputFormat to combine multiple files into a file split. This can speed up Hadoop jobs when processing a large number of small files.|no (default == false)|
|
||||
|useCombiner|Boolean|Use Hadoop combiner to merge rows at mapper if possible.|no (default == false)|
|
||||
|jobProperties|Object|A map of properties to add to the Hadoop job configuration, see below for details.|no (default == null)|
|
||||
|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](index.md#indexspec) on the main ingestion page for more information.|no|
|
||||
|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](index.md#indexspec) for possible values.|no (default = same as indexSpec)|
|
||||
|indexSpec|Object|Tune how data is indexed. See [`indexSpec`](ingestion-spec.md#indexspec) on the main ingestion page for more information.|no|
|
||||
|indexSpecForIntermediatePersists|Object|defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [`indexSpec`](ingestion-spec.md#indexspec) for possible values.|no (default = same as indexSpec)|
|
||||
|numBackgroundPersistThreads|Integer|The number of new background threads to use for incremental persists. Using this feature causes a notable increase in memory pressure and CPU usage but will make the job finish more quickly. If changing from the default of 0 (use current thread for persists), we recommend setting it to 1.|no (default == 0)|
|
||||
|forceExtendableShardSpecs|Boolean|Forces use of extendable shardSpecs. Hash-based partitioning always uses an extendable shardSpec. For single-dimension partitioning, this option should be set to true to use an extendable shardSpec. For partitioning, please check [Partitioning specification](#partitionsspec). This option can be useful when you need to append more data to existing dataSource.|no (default = false)|
|
||||
|useExplicitVersion|Boolean|Forces HadoopIndexTask to use version.|no (default = false)|
|
||||
|
|
|
@ -22,33 +22,24 @@ title: "Ingestion"
|
|||
~ under the License.
|
||||
-->
|
||||
|
||||
All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
|
||||
Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
|
||||
segments based on that data.
|
||||
Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
|
||||
|
||||
In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
|
||||
(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
|
||||
Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
|
||||
processes are still involved in starting and monitoring the Hadoop jobs).
|
||||
For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
|
||||
Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs.
|
||||
|
||||
Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes.
|
||||
For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section
|
||||
of Druid's design documentation.
|
||||
During ingestion Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md). Historical nodes load the segments into memory to respond to queries. For streaming ingestion, the Middle Managers and indexers can respond to queries in real-time with arriving data. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
|
||||
|
||||
## How to use this documentation
|
||||
This topic introduces streaming and batch ingestion methods. The following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
|
||||
- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
|
||||
- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
|
||||
- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
|
||||
- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
|
||||
|
||||
This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
|
||||
configurations that are common to all [ingestion methods](#ingestion-methods).
|
||||
|
||||
The **individual pages for each ingestion method** provide additional information about concepts and configurations
|
||||
that are unique to each ingestion method.
|
||||
|
||||
We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
|
||||
ingestion method or methods that you have chosen.
|
||||
For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.
|
||||
|
||||
## Ingestion methods
|
||||
|
||||
The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
|
||||
The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose
|
||||
the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details
|
||||
about how each method works, as well as configuration properties specific to that method, check out its documentation
|
||||
page.
|
||||
|
@ -59,6 +50,8 @@ The most recommended, and most popular, method of streaming ingestion is the
|
|||
[Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. Alternatively, the Kinesis
|
||||
indexing service works with Amazon Kinesis Data Streams.
|
||||
|
||||
Streaming ingestion uses an ongoing process called a supervisor that reads from the data stream to ingest data into Druid.
|
||||
|
||||
This table compares the options:
|
||||
|
||||
| **Method** | [Kafka](../development/extensions-core/kafka-ingestion.md) | [Kinesis](../development/extensions-core/kinesis-ingestion.md) |
|
||||
|
@ -88,656 +81,6 @@ This table compares the three available options:
|
|||
| **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
|
||||
| **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
|
||||
| **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
|
||||
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
|
||||
| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
|
||||
| **[Rollup modes](./rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
|
||||
| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [partitionsSpec](./native-batch.md#partitionsspec-1) for details. |
|
||||
|
||||
<a name="data-model"></a>
|
||||
|
||||
## Druid's data model
|
||||
|
||||
### Datasources
|
||||
|
||||
Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
|
||||
offers a unique data modeling system that bears similarity to both relational and timeseries models.
|
||||
|
||||
### Primary timestamp
|
||||
|
||||
Druid schemas must always include a primary timestamp. The primary timestamp is used for
|
||||
[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
|
||||
corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
|
||||
for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
|
||||
and time-based retention rules.
|
||||
|
||||
The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
|
||||
[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
|
||||
Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
|
||||
in your Druid datasource.
|
||||
|
||||
If you have more than one timestamp column, you can store the others as
|
||||
[secondary timestamps](schema-design.md#secondary-timestamps).
|
||||
|
||||
### Dimensions
|
||||
|
||||
Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
|
||||
aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
|
||||
dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
|
||||
database that does not support a rollup feature.
|
||||
|
||||
Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
|
||||
|
||||
### Metrics
|
||||
|
||||
Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
|
||||
Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
|
||||
has two benefits:
|
||||
|
||||
1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
|
||||
information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
|
||||
single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
|
||||
2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
|
||||
they are partially computed at ingestion time.
|
||||
|
||||
Metrics are configured through the [`metricsSpec`](#metricsspec).
|
||||
|
||||
## Rollup
|
||||
|
||||
### What is rollup?
|
||||
|
||||
Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
|
||||
a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
|
||||
needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
|
||||
as we roll up data, we lose the ability to query individual events.
|
||||
|
||||
When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
|
||||
to what you would expect from a typical database that does not support a rollup feature.
|
||||
|
||||
When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
|
||||
to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
|
||||
single row in Druid.
|
||||
|
||||
By default, rollup is enabled.
|
||||
|
||||
### Enabling or disabling rollup
|
||||
|
||||
Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
|
||||
(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
|
||||
|
||||
### Example of rollup
|
||||
|
||||
For an example of how to configure rollup, and of how the feature will modify your data, check out the
|
||||
[rollup tutorial](../tutorials/tutorial-rollup.md).
|
||||
|
||||
### Maximizing rollup ratio
|
||||
|
||||
You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
|
||||
events. One way to do this is with a
|
||||
[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
|
||||
|
||||
```sql
|
||||
SELECT SUM("count") / (COUNT(*) * 1.0)
|
||||
FROM datasource
|
||||
```
|
||||
|
||||
The higher this number is, the more benefit you are gaining from rollup.
|
||||
|
||||
> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
|
||||
how counting works when rollup is enabled.
|
||||
|
||||
Tips for maximizing rollup:
|
||||
|
||||
- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
|
||||
you will achieve.
|
||||
- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
|
||||
- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
|
||||
likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
|
||||
- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
|
||||
datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
|
||||
has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
|
||||
that datasource leads to much faster query times. This can often be done with just a small increase in storage
|
||||
footprint, since abbreviated datasources tend to be substantially smaller.
|
||||
- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
|
||||
rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
|
||||
[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
|
||||
|
||||
### Perfect rollup vs Best-effort rollup
|
||||
|
||||
Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
|
||||
time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
|
||||
be multiple segments holding rows with the same timestamp and dimension values.
|
||||
|
||||
In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
|
||||
without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
|
||||
segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
|
||||
cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
|
||||
run in this mode.
|
||||
|
||||
Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
|
||||
and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
|
||||
generally increases the time required for ingestion, but provides information necessary for perfect rollup.
|
||||
|
||||
The following table shows how each method handles rollup:
|
||||
|
||||
|Method|How it works|
|
||||
|------|------------|
|
||||
|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
|
||||
|[Hadoop](hadoop.md)|Always perfect.|
|
||||
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
|
||||
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
|
||||
|
||||
## Partitioning
|
||||
|
||||
### Why partition?
|
||||
|
||||
Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
|
||||
performance.
|
||||
|
||||
Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
|
||||
This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
|
||||
ingestion spec's `dataSchema`.
|
||||
|
||||
The segments within a particular time chunk may also be partitioned further, using options that vary based on the
|
||||
ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
|
||||
improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
|
||||
quickly.
|
||||
|
||||
You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
|
||||
dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
|
||||
storage size decreases - and it also tends to improve query performance as well.
|
||||
|
||||
> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
|
||||
> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
|
||||
> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
|
||||
>
|
||||
> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
|
||||
> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
|
||||
> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
|
||||
> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
|
||||
> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
|
||||
> in a future version of Druid.
|
||||
|
||||
### How to set up partitioning
|
||||
|
||||
Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
|
||||
flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like
|
||||
Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
|
||||
is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
|
||||
threshold is optimally partitioned, even as you continuously add new data from a stream.
|
||||
|
||||
The following table shows how each ingestion method handles partitioning:
|
||||
|
||||
|Method|How it works|
|
||||
|------|------------|
|
||||
|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
|
||||
|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
|
||||
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|
||||
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|
||||
|
||||
> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
|
||||
> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
|
||||
> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
|
||||
> _within a single datasource_.
|
||||
>
|
||||
> For more details on splitting data up into separate datasources, and potential operational considerations, refer
|
||||
> to the [Multitenancy considerations](../querying/multitenancy.md) page.
|
||||
|
||||
<a name="spec"></a>
|
||||
|
||||
## Ingestion specs
|
||||
|
||||
No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
|
||||
ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
|
||||
definition is an _ingestion spec_.
|
||||
|
||||
Ingestion specs consists of three main components:
|
||||
|
||||
- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
|
||||
[primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
|
||||
- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
|
||||
documentation for each [ingestion method](#ingestion-methods).
|
||||
- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
|
||||
[ingestion method](#ingestion-methods).
|
||||
|
||||
Example ingestion spec for task type `index_parallel` (native batch):
|
||||
|
||||
```
|
||||
{
|
||||
"type": "index_parallel",
|
||||
"spec": {
|
||||
"dataSchema": {
|
||||
"dataSource": "wikipedia",
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
"page",
|
||||
"language",
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
},
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
],
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
]
|
||||
}
|
||||
},
|
||||
"ioConfig": {
|
||||
"type": "index_parallel",
|
||||
"inputSource": {
|
||||
"type": "local",
|
||||
"baseDir": "examples/indexing/",
|
||||
"filter": "wikipedia_data.json"
|
||||
},
|
||||
"inputFormat": {
|
||||
"type": "json",
|
||||
"flattenSpec": {
|
||||
"useFieldDiscovery": true,
|
||||
"fields": [
|
||||
{ "type": "path", "name": "userId", "expr": "$.user.id" }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"tuningConfig": {
|
||||
"type": "index_parallel"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
|
||||
For more examples, refer to the documentation for each ingestion method.
|
||||
|
||||
You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
|
||||
available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
|
||||
[Kafka](../development/extensions-core/kafka-ingestion.md),
|
||||
[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
|
||||
[native batch](native-batch.md) mode.
|
||||
|
||||
## `dataSchema`
|
||||
|
||||
> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
|
||||
except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
|
||||
|
||||
The `dataSchema` is a holder for the following components:
|
||||
|
||||
- [datasource name](#datasource), [primary timestamp](#timestampspec),
|
||||
[dimensions](#dimensionsspec), [metrics](#metricsspec), and
|
||||
[transforms and filters](#transformspec) (if needed).
|
||||
|
||||
An example `dataSchema` is:
|
||||
|
||||
```
|
||||
"dataSchema": {
|
||||
"dataSource": "wikipedia",
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
"page",
|
||||
"language",
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
},
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
],
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### `dataSource`
|
||||
|
||||
The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
|
||||
[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
|
||||
`dataSource` is:
|
||||
|
||||
```
|
||||
"dataSource": "my-first-datasource"
|
||||
```
|
||||
|
||||
### `timestampSpec`
|
||||
|
||||
The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
|
||||
configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
|
||||
|
||||
```
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
A `timestampSpec` can have the following components:
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
|
||||
|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
|
||||
|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
|
||||
|
||||
### `dimensionsSpec`
|
||||
|
||||
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
|
||||
configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
|
||||
|
||||
```
|
||||
"dimensionsSpec" : {
|
||||
"dimensions": [
|
||||
"page",
|
||||
"language",
|
||||
{ "type": "long", "name": "userId" }
|
||||
],
|
||||
"dimensionExclusions" : [],
|
||||
"spatialDimensions" : []
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
A `dimensionsSpec` can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
|
||||
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
|
||||
| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
|
||||
|
||||
#### Dimension objects
|
||||
|
||||
Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
|
||||
a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
|
||||
|
||||
Dimension objects can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| type | Either `string`, `long`, `float`, or `double`. | `string` |
|
||||
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
|
||||
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
|
||||
| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
|
||||
|
||||
#### Inclusions and exclusions
|
||||
|
||||
Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
|
||||
|
||||
Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
|
||||
|
||||
Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
|
||||
|
||||
1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
|
||||
2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
|
||||
3. Any field listed in `dimensionExclusions` is excluded.
|
||||
4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
|
||||
5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
|
||||
6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
|
||||
7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
|
||||
|
||||
> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
|
||||
> schemaless dimension interpretation.
|
||||
|
||||
### `metricsSpec`
|
||||
|
||||
The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
|
||||
to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
|
||||
ingestion-time aggregation.
|
||||
|
||||
An example `metricsSpec` is:
|
||||
|
||||
```
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
]
|
||||
```
|
||||
|
||||
> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
|
||||
> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
|
||||
> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
|
||||
> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
|
||||
> be done by defining a metric in a `metricsSpec`.
|
||||
|
||||
### `granularitySpec`
|
||||
|
||||
The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
|
||||
the following operations:
|
||||
|
||||
1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
|
||||
2. Truncating the timestamp, if desired (via `queryGranularity`).
|
||||
3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
|
||||
4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
|
||||
|
||||
Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
|
||||
|
||||
An example `granularitySpec` is:
|
||||
|
||||
```
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
],
|
||||
"rollup": true
|
||||
}
|
||||
```
|
||||
|
||||
A `granularitySpec` can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
|
||||
| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
|
||||
| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
|
||||
| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
|
||||
| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
|
||||
|
||||
### `transformSpec`
|
||||
|
||||
The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
|
||||
records during ingestion time. It is optional. An example `transformSpec` is:
|
||||
|
||||
```
|
||||
"transformSpec": {
|
||||
"transforms": [
|
||||
{ "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
|
||||
],
|
||||
"filter": {
|
||||
"type": "selector",
|
||||
"dimension": "country",
|
||||
"value": "San Serriffe"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
#### Transforms
|
||||
|
||||
The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
|
||||
"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
|
||||
|
||||
If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
|
||||
shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
|
||||
|
||||
Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
|
||||
they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
|
||||
with another field containing all nulls, which will act similarly to removing the field.
|
||||
|
||||
Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
|
||||
They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
|
||||
a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
|
||||
`timestampSpec`.
|
||||
|
||||
Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
|
||||
|
||||
```
|
||||
{
|
||||
"type": "expression",
|
||||
"name": "<output name>",
|
||||
"expression": "<expr>"
|
||||
}
|
||||
```
|
||||
|
||||
The `expression` is a [Druid query expression](../misc/math-expr.md).
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
#### Filter
|
||||
|
||||
The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
|
||||
ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
|
||||
`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
|
||||
|
||||
### Legacy `dataSchema` spec
|
||||
|
||||
> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
|
||||
except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
|
||||
|
||||
The legacy `dataSchema` spec has below two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above.
|
||||
|
||||
- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
|
||||
|
||||
#### `parser` (Deprecated)
|
||||
|
||||
In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
|
||||
items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
|
||||
For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
|
||||
|
||||
For details about major components of the `parseSpec`, refer to their subsections:
|
||||
|
||||
- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
|
||||
- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
|
||||
- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
|
||||
|
||||
An example `parser` is:
|
||||
|
||||
```
|
||||
"parser": {
|
||||
"type": "string",
|
||||
"parseSpec": {
|
||||
"format": "json",
|
||||
"flattenSpec": {
|
||||
"useFieldDiscovery": true,
|
||||
"fields": [
|
||||
{ "type": "path", "name": "userId", "expr": "$.user.id" }
|
||||
]
|
||||
},
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
"page",
|
||||
"language",
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `flattenSpec`
|
||||
|
||||
In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
|
||||
bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
|
||||
See [Flatten spec](./data-formats.md#flattenspec) for more details.
|
||||
|
||||
## `ioConfig`
|
||||
|
||||
The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
|
||||
filesystem, or any other supported source system. The `inputFormat` property applies to all
|
||||
[ingestion method](#ingestion-methods) except for Hadoop ingestion. The Hadoop ingestion still
|
||||
uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
|
||||
The rest of `ioConfig` is specific to each individual ingestion method.
|
||||
An example `ioConfig` to read JSON data is:
|
||||
|
||||
```json
|
||||
"ioConfig": {
|
||||
"type": "<ingestion-method-specific type code>",
|
||||
"inputFormat": {
|
||||
"type": "json"
|
||||
},
|
||||
...
|
||||
}
|
||||
```
|
||||
For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
|
||||
|
||||
## `tuningConfig`
|
||||
|
||||
Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
|
||||
properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
|
||||
ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
|
||||
is:
|
||||
|
||||
```plaintext
|
||||
"tuningConfig": {
|
||||
"type": "<ingestion-method-specific type code>",
|
||||
"maxRowsInMemory": 1000000,
|
||||
"maxBytesInMemory": <one-sixth of JVM memory>,
|
||||
"indexSpec": {
|
||||
"bitmap": { "type": "roaring" },
|
||||
"dimensionCompression": "lz4",
|
||||
"metricCompression": "lz4",
|
||||
"longEncoding": "longs"
|
||||
},
|
||||
<other ingestion-method-specific properties>
|
||||
}
|
||||
```
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
|
||||
|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
|
||||
|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
|
||||
|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from maxBytesInMemory check.|false|
|
||||
|indexSpec|Tune how data is indexed. See below for more information.|See table below|
|
||||
|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
|
||||
|
||||
#### `indexSpec`
|
||||
|
||||
The `indexSpec` object can include the following properties:
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
|
||||
|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
|
||||
|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
|
||||
|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
|
||||
|
||||
Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
|
||||
[ingestion method](#ingestion-methods) for details.
|
||||
|
|
|
@ -0,0 +1,483 @@
|
|||
---
|
||||
id: ingestion-spec
|
||||
title: Ingestion spec reference
|
||||
sidebar_label: Ingestion spec
|
||||
description: Reference for the configuration options in the ingestion spec.
|
||||
---
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
|
||||
All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). All types of ingestion use an _ingestion spec_ to configure ingestion.
|
||||
|
||||
Ingestion specs consists of three main components:
|
||||
|
||||
- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
|
||||
[primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
|
||||
- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
|
||||
documentation for each [ingestion method](./index.md#ingestion-methods).
|
||||
- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
|
||||
[ingestion method](./index.md#ingestion-methods).
|
||||
|
||||
Example ingestion spec for task type `index_parallel` (native batch):
|
||||
|
||||
```
|
||||
{
|
||||
"type": "index_parallel",
|
||||
"spec": {
|
||||
"dataSchema": {
|
||||
"dataSource": "wikipedia",
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
{ "page" },
|
||||
{ "language" },
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
},
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
],
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
]
|
||||
}
|
||||
},
|
||||
"ioConfig": {
|
||||
"type": "index_parallel",
|
||||
"inputSource": {
|
||||
"type": "local",
|
||||
"baseDir": "examples/indexing/",
|
||||
"filter": "wikipedia_data.json"
|
||||
},
|
||||
"inputFormat": {
|
||||
"type": "json",
|
||||
"flattenSpec": {
|
||||
"useFieldDiscovery": true,
|
||||
"fields": [
|
||||
{ "type": "path", "name": "userId", "expr": "$.user.id" }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"tuningConfig": {
|
||||
"type": "index_parallel"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The specific options supported by these sections will depend on the [ingestion method](./index.md#ingestion-methods) you have chosen.
|
||||
For more examples, refer to the documentation for each ingestion method.
|
||||
|
||||
You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
|
||||
available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
|
||||
[Kafka](../development/extensions-core/kafka-ingestion.md),
|
||||
[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
|
||||
[native batch](native-batch.md) mode.
|
||||
|
||||
## `dataSchema`
|
||||
|
||||
> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
|
||||
except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
|
||||
|
||||
The `dataSchema` is a holder for the following components:
|
||||
|
||||
- [datasource name](#datasource)
|
||||
- [primary timestamp](#timestampspec)
|
||||
- [dimensions](#dimensionsspec)
|
||||
- [metrics](#metricsspec)
|
||||
- [transforms and filters](#transformspec) (if needed).
|
||||
|
||||
An example `dataSchema` is:
|
||||
|
||||
```
|
||||
"dataSchema": {
|
||||
"dataSource": "wikipedia",
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
{ "page" },
|
||||
{ "language" },
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
},
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
],
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### `dataSource`
|
||||
|
||||
The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
|
||||
[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
|
||||
`dataSource` is:
|
||||
|
||||
```
|
||||
"dataSource": "my-first-datasource"
|
||||
```
|
||||
|
||||
### `timestampSpec`
|
||||
|
||||
The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
|
||||
configuring the [primary timestamp](./data-model.md#primary-timestamp). An example `timestampSpec` is:
|
||||
|
||||
```
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
A `timestampSpec` can have the following components:
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
|
||||
|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
|
||||
|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
|
||||
|
||||
### `dimensionsSpec`
|
||||
|
||||
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
|
||||
configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec` is:
|
||||
|
||||
```
|
||||
"dimensionsSpec" : {
|
||||
"dimensions": [
|
||||
"page",
|
||||
"language",
|
||||
{ "type": "long", "name": "userId" }
|
||||
],
|
||||
"dimensionExclusions" : [],
|
||||
"spatialDimensions" : []
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
A `dimensionsSpec` can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
|
||||
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
|
||||
| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
|
||||
|
||||
#### Dimension objects
|
||||
|
||||
Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
|
||||
a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
|
||||
|
||||
Dimension objects can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| type | Either `string`, `long`, `float`, or `double`. | `string` |
|
||||
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
|
||||
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
|
||||
| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
|
||||
|
||||
#### Inclusions and exclusions
|
||||
|
||||
Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
|
||||
|
||||
Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
|
||||
|
||||
Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
|
||||
|
||||
1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
|
||||
2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the `flattenSpec`. The `useFieldDiscovery` parameter determines whether the original root-level fields will be retained or discarded.
|
||||
3. Any field listed in `dimensionExclusions` is excluded.
|
||||
4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
|
||||
5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
|
||||
6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
|
||||
7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
|
||||
|
||||
> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
|
||||
> schemaless dimension interpretation.
|
||||
|
||||
### `metricsSpec`
|
||||
|
||||
The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
|
||||
to apply at ingestion time. This is most useful when [rollup](./rollup.md) is enabled, since it's how you configure
|
||||
ingestion-time aggregation.
|
||||
|
||||
An example `metricsSpec` is:
|
||||
|
||||
```
|
||||
"metricsSpec": [
|
||||
{ "type": "count", "name": "count" },
|
||||
{ "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
|
||||
{ "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
|
||||
]
|
||||
```
|
||||
|
||||
> Generally, when [rollup](./rollup.md) is disabled, you should have an empty `metricsSpec` (because without rollup,
|
||||
> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
|
||||
> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
|
||||
> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
|
||||
> be done by defining a metric in a `metricsSpec`.
|
||||
|
||||
### `granularitySpec`
|
||||
|
||||
The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
|
||||
the following operations:
|
||||
|
||||
1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
|
||||
2. Truncating the timestamp, if desired (via `queryGranularity`).
|
||||
3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
|
||||
4. Specifying whether ingestion-time [rollup](./rollup.md) should be used or not (via `rollup`).
|
||||
|
||||
Other than `rollup`, these operations are all based on the [primary timestamp](./data-model.md#primary-timestamp).
|
||||
|
||||
An example `granularitySpec` is:
|
||||
|
||||
```
|
||||
"granularitySpec": {
|
||||
"segmentGranularity": "day",
|
||||
"queryGranularity": "none",
|
||||
"intervals": [
|
||||
"2013-08-31/2013-09-01"
|
||||
],
|
||||
"rollup": true
|
||||
}
|
||||
```
|
||||
|
||||
A `granularitySpec` can have the following components:
|
||||
|
||||
| Field | Description | Default |
|
||||
|-------|-------------|---------|
|
||||
| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
|
||||
| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
|
||||
| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
|
||||
| rollup | Whether to use ingestion-time [rollup](./rollup.md) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
|
||||
| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
|
||||
|
||||
### `transformSpec`
|
||||
|
||||
The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
|
||||
records during ingestion time. It is optional. An example `transformSpec` is:
|
||||
|
||||
```
|
||||
"transformSpec": {
|
||||
"transforms": [
|
||||
{ "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
|
||||
],
|
||||
"filter": {
|
||||
"type": "selector",
|
||||
"dimension": "country",
|
||||
"value": "San Serriffe"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
#### Transforms
|
||||
|
||||
The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
|
||||
"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
|
||||
|
||||
If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
|
||||
shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
|
||||
|
||||
Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
|
||||
they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
|
||||
with another field containing all nulls, which will act similarly to removing the field.
|
||||
|
||||
Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
|
||||
They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
|
||||
a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
|
||||
`timestampSpec`.
|
||||
|
||||
Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
|
||||
|
||||
```
|
||||
{
|
||||
"type": "expression",
|
||||
"name": "<output name>",
|
||||
"expression": "<expr>"
|
||||
}
|
||||
```
|
||||
|
||||
The `expression` is a [Druid query expression](../misc/math-expr.md).
|
||||
|
||||
> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
|
||||
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
|
||||
> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
|
||||
> your ingestion spec.
|
||||
|
||||
#### Filter
|
||||
|
||||
The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
|
||||
ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
|
||||
`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
|
||||
|
||||
### Legacy `dataSchema` spec
|
||||
|
||||
> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
|
||||
except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
|
||||
|
||||
The legacy `dataSchema` spec has below two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above.
|
||||
|
||||
- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
|
||||
|
||||
#### `parser` (Deprecated)
|
||||
|
||||
In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
|
||||
items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
|
||||
For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
|
||||
|
||||
For details about major components of the `parseSpec`, refer to their subsections:
|
||||
|
||||
- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](./data-model.md#primary-timestamp).
|
||||
- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](./data-model.md#dimensions).
|
||||
- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
|
||||
|
||||
An example `parser` is:
|
||||
|
||||
```
|
||||
"parser": {
|
||||
"type": "string",
|
||||
"parseSpec": {
|
||||
"format": "json",
|
||||
"flattenSpec": {
|
||||
"useFieldDiscovery": true,
|
||||
"fields": [
|
||||
{ "type": "path", "name": "userId", "expr": "$.user.id" }
|
||||
]
|
||||
},
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "auto"
|
||||
},
|
||||
"dimensionsSpec": {
|
||||
"dimensions": [
|
||||
{ "page" },
|
||||
{ "language" },
|
||||
{ "type": "long", "name": "userId" }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### `flattenSpec`
|
||||
|
||||
In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
|
||||
bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
|
||||
See [Flatten spec](./data-formats.md#flattenspec) for more details.
|
||||
|
||||
## `ioConfig`
|
||||
|
||||
The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
|
||||
filesystem, or any other supported source system. The `inputFormat` property applies to all
|
||||
[ingestion method](./index.md#ingestion-methods) except for Hadoop ingestion. The Hadoop ingestion still
|
||||
uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
|
||||
The rest of `ioConfig` is specific to each individual ingestion method.
|
||||
An example `ioConfig` to read JSON data is:
|
||||
|
||||
```json
|
||||
"ioConfig": {
|
||||
"type": "<ingestion-method-specific type code>",
|
||||
"inputFormat": {
|
||||
"type": "json"
|
||||
},
|
||||
...
|
||||
}
|
||||
```
|
||||
For more details, see the documentation provided by each [ingestion method](./index.md#ingestion-methods).
|
||||
|
||||
## `tuningConfig`
|
||||
|
||||
Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
|
||||
properties apply to all [ingestion methods](./index.md#ingestion-methods), but most are specific to each individual
|
||||
ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
|
||||
is:
|
||||
|
||||
```plaintext
|
||||
"tuningConfig": {
|
||||
"type": "<ingestion-method-specific type code>",
|
||||
"maxRowsInMemory": 1000000,
|
||||
"maxBytesInMemory": <one-sixth of JVM memory>,
|
||||
"indexSpec": {
|
||||
"bitmap": { "type": "roaring" },
|
||||
"dimensionCompression": "lz4",
|
||||
"metricCompression": "lz4",
|
||||
"longEncoding": "longs"
|
||||
},
|
||||
<other ingestion-method-specific properties>
|
||||
}
|
||||
```
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
|
||||
|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
|
||||
|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
|
||||
|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from maxBytesInMemory check.|false|
|
||||
|indexSpec|Tune how data is indexed. See below for more information.|See table below|
|
||||
|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
|
||||
|
||||
#### `indexSpec`
|
||||
|
||||
The `indexSpec` object can include the following properties:
|
||||
|
||||
|Field|Description|Default|
|
||||
|-----|-----------|-------|
|
||||
|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
|
||||
|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
|
||||
|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
|
||||
|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
|
||||
|
||||
Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
|
||||
[ingestion method](./index.md#ingestion-methods) for details.
|
|
@ -204,7 +204,7 @@ A sample task is shown below:
|
|||
|
||||
This field is required.
|
||||
|
||||
See [Ingestion Spec DataSchema](../ingestion/index.md#dataschema)
|
||||
See [Ingestion Spec DataSchema](../ingestion/ingestion-spec.md#dataschema)
|
||||
|
||||
If you specify `intervals` explicitly in your dataSchema's `granularitySpec`, batch ingestion will lock the full intervals
|
||||
specified when it starts up, and you will learn quickly if the specified interval overlaps with locks held by other
|
||||
|
@ -238,10 +238,10 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|
|||
|numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create when using a `hashed` `partitionsSpec`. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no|
|
||||
|splitHintSpec|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](#split-hint-spec) for more details.|size-based split hint spec|no|
|
||||
|partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` or `single_dim` if `forceGuaranteedRollup` = true|no|
|
||||
|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no|
|
||||
|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no|
|
||||
|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](ingestion-spec.md#indexspec)|null|no|
|
||||
|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](ingestion-spec.md#indexspec) for possible values.|same as indexSpec|no|
|
||||
|maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
|
||||
|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, `intervals` in `granularitySpec` must be set and `hashed` or `single_dim` must be used for `partitionsSpec`. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
|
||||
|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](rollup.md). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, `intervals` in `granularitySpec` must be set and `hashed` or `single_dim` must be used for `partitionsSpec`. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
|
||||
|reportParseExceptions|If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped.|false|no|
|
||||
|pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
|
||||
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
|
||||
|
@ -282,7 +282,7 @@ The segments split hint spec is used only for [`DruidInputSource`](#druid-input-
|
|||
### `partitionsSpec`
|
||||
|
||||
PartitionsSpec is used to describe the secondary partitioning method.
|
||||
You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want.
|
||||
You should use different partitionsSpec depending on the [rollup mode](rollup.md) you want.
|
||||
For perfect rollup, you should use either `hashed` (partitioning based on the hash of dimensions in each row) or
|
||||
`single_dim` (based on ranges of a single dimension). For best-effort rollup, you should use `dynamic`.
|
||||
|
||||
|
@ -291,13 +291,14 @@ The three `partitionsSpec` types have different characteristics.
|
|||
| PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Secondary partition pruning at query time |
|
||||
|----------------|-----------------|---------------------|-----------------------|-------------------------------|
|
||||
| `dynamic` | Fastest | [Dynamic partitioning](#dynamic-partitioning) based on the number of rows in a segment. | Best-effort rollup | N/A |
|
||||
| `hashed` | Moderate | Multiple dimension [hash-based partitioning](#hash-based-partitioning) may reduce both your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` must be set at ingestion time to enable secondary partition pruning at query time.|
|
||||
| `single_dim` | Slowest | Single dimension [range partitioning](#single-dimension-range-partitioning) may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows the range of `partitionDimension` values in each segment, given a query including a filter on the `partitionDimension`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimension` for query processing. |
|
||||
| `hashed` | Moderate | Multiple dimension [hash-based partitioning](#hash-based-partitioning) may reduce both your datasource size and query latency by improving data locality. See [Partitioning](./partitioning.md) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` must be set at ingestion time to enable secondary partition pruning at query time.|
|
||||
| `single_dim` | Slowest | Single dimension [range partitioning](#single-dimension-range-partitioning) may reduce your datasource size and query latency by improving data locality. See [Partitioning](./partitioning.md) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries. Since the broker knows the range of `partitionDimension` values in each segment, given a query including a filter on the `partitionDimension`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimension` for query processing. |
|
||||
|
||||
|
||||
The recommended use case for each partitionsSpec is:
|
||||
- If your data has a uniformly distributed column which is frequently used in your queries,
|
||||
consider using `single_dim` partitionsSpec to maximize the performance of most of your queries.
|
||||
- If your data doesn't have a uniformly distributed column, but is expected to have a [high rollup ratio](./index.md#maximizing-rollup-ratio)
|
||||
- If your data doesn't have a uniformly distributed column, but is expected to have a [high rollup ratio](./rollup.md#maximizing-rollup-ratio)
|
||||
when you roll up with some dimensions, consider using `hashed` partitionsSpec.
|
||||
It could reduce the size of datasource and query latency by improving data locality.
|
||||
- If the above two scenarios are not the case or you don't need to roll up your datasource,
|
||||
|
@ -741,7 +742,7 @@ A sample task is shown below:
|
|||
|
||||
This field is required.
|
||||
|
||||
See the [`dataSchema`](../ingestion/index.md#dataschema) section of the ingestion docs for details.
|
||||
See the [`dataSchema`](./ingestion-spec.md#dataschema) section of the ingestion docs for details.
|
||||
|
||||
If you do not specify `intervals` explicitly in your dataSchema's granularitySpec, the Local Index Task will do an extra
|
||||
pass over the data to determine the range to lock when it starts up. If you specify `intervals` explicitly, any rows
|
||||
|
@ -772,10 +773,10 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|
|||
|numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no|
|
||||
|partitionDimensions|Deprecated. Use `partitionsSpec` instead. The dimensions to partition on. Leave blank to select all dimensions. Only used with `forceGuaranteedRollup` = true, will be ignored otherwise.|null|no|
|
||||
|partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` if `forceGuaranteedRollup` = true|no|
|
||||
|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no|
|
||||
|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no|
|
||||
|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](ingestion-spec.md#indexspec)|null|no|
|
||||
|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](ingestion-spec.md#indexspec) for possible values.|same as indexSpec|no|
|
||||
|maxPendingPersists|Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists).|0 (meaning one persist can be running concurrently with ingestion, and none can be queued up)|no|
|
||||
|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](../ingestion/index.md#rollup). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: one for finding the optimal number of partitions per time chunk and one for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
|
||||
|forceGuaranteedRollup|Forces guaranteeing the [perfect rollup](rollup.md). The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: one for finding the optimal number of partitions per time chunk and one for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with `appendToExisting` of IOConfig. For more details, see the below __Segment pushing modes__ section.|false|no|
|
||||
|reportParseExceptions|DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1.|false|no|
|
||||
|pushTimeout|Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever.|0|no|
|
||||
|segmentWriteOutMediumFactory|Segment write-out medium to use when creating segments. See [SegmentWriteOutMediumFactory](#segmentwriteoutmediumfactory).|Not specified, the value from `druid.peon.defaultSegmentWriteOutMediumFactory.type` is used|no|
|
||||
|
@ -786,7 +787,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|
|||
### `partitionsSpec`
|
||||
|
||||
PartitionsSpec is to describe the secondary partitioning method.
|
||||
You should use different partitionsSpec depending on the [rollup mode](../ingestion/index.md#rollup) you want.
|
||||
You should use different partitionsSpec depending on the [rollup mode](rollup.md) you want.
|
||||
For perfect rollup, you should use `hashed`.
|
||||
|
||||
|property|description|default|required?|
|
||||
|
@ -815,7 +816,7 @@ For best-effort rollup, you should use `dynamic`.
|
|||
|
||||
While ingesting data using the Index task, it creates segments from the input data and pushes them. For segment pushing,
|
||||
the Index task supports two segment pushing modes, i.e., _bulk pushing mode_ and _incremental pushing mode_ for
|
||||
[perfect rollup and best-effort rollup](../ingestion/index.md#rollup), respectively.
|
||||
[perfect rollup and best-effort rollup](rollup.md), respectively.
|
||||
|
||||
In the bulk pushing mode, every segment is pushed at the very end of the index task. Until then, created segments
|
||||
are stored in the memory and local storage of the process running the index task. As a result, this mode might cause a
|
||||
|
@ -1376,8 +1377,8 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
|
|||
The Druid input source can be used for a variety of purposes, including:
|
||||
|
||||
- Creating new datasources that are rolled-up copies of existing datasources.
|
||||
- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
|
||||
- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
|
||||
- Changing the [partitioning or sorting](./partitioning.md) of a datasource to improve performance.
|
||||
- Updating or removing rows using a [`transformSpec`](./ingestion-spec.md#transformspec).
|
||||
|
||||
When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
|
||||
of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
|
||||
|
|
|
@ -0,0 +1,69 @@
|
|||
---
|
||||
id: partitioning
|
||||
title: Partitioning
|
||||
sidebar_label: Partitioning
|
||||
description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
|
||||
---
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
|
||||
You can use segment partitioning and sorting within your Druid datasources to reduce the size of your data and increase performance.
|
||||
|
||||
One way to partition is to load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads.
|
||||
|
||||
This topic describes how to set up partitions within a single datasource. It does not cover how to use multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
|
||||
|
||||
## Time chunk partitioning
|
||||
|
||||
Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
|
||||
|
||||
## Secondary partitioning
|
||||
|
||||
Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
|
||||
|
||||
To achieve the best performance and smallest overall footprint, partition your data on a "natural"
|
||||
dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
|
||||
|
||||
## Partitioning and sorting
|
||||
|
||||
Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
|
||||
|
||||
> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
|
||||
|
||||
## How to configure partitioning
|
||||
|
||||
Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
|
||||
Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain time threshold while you continuously add new data from a stream.
|
||||
|
||||
The following table shows how each ingestion method handles partitioning:
|
||||
|
||||
|Method|How it works|
|
||||
|------|------------|
|
||||
|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
|
||||
|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
|
||||
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|
||||
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
|
||||
|
||||
|
||||
## Learn more
|
||||
See the following topics for more information:
|
||||
* [`partitionsSpec`](native-batch.md#partitionsspec) for more detail on partitioning with Native Batch ingestion.
|
||||
* [Reindexing](data-management.md#reingesting-data) and [Compaction](compaction.md) for information on how to repartition existing data in Druid.
|
||||
|
|
@ -0,0 +1,79 @@
|
|||
---
|
||||
id: rollup
|
||||
title: "Data rollup"
|
||||
sidebar_label: Data rollup
|
||||
description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
|
||||
---
|
||||
|
||||
<!--
|
||||
~ Licensed to the Apache Software Foundation (ASF) under one
|
||||
~ or more contributor license agreements. See the NOTICE file
|
||||
~ distributed with this work for additional information
|
||||
~ regarding copyright ownership. The ASF licenses this file
|
||||
~ to you under the Apache License, Version 2.0 (the
|
||||
~ "License"); you may not use this file except in compliance
|
||||
~ with the License. You may obtain a copy of the License at
|
||||
~
|
||||
~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~
|
||||
~ Unless required by applicable law or agreed to in writing,
|
||||
~ software distributed under the License is distributed on an
|
||||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
~ KIND, either express or implied. See the License for the
|
||||
~ specific language governing permissions and limitations
|
||||
~ under the License.
|
||||
-->
|
||||
|
||||
Druid can roll up data at ingestion time to reduce the amount of raw data to store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade-off for the efficiency of rollup, you lose the ability to query individual events.
|
||||
|
||||
At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
|
||||
|
||||
When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
|
||||
|
||||
## Maximizing rollup ratio
|
||||
|
||||
To measure the rollup ratio of a datasource compare the number of rows in Druid (`COUNT`) with the number of ingested events. For example, run a [Druid SQL](../querying/sql.md) query where "count" refers to a `count`-type metric generated at ingestion time as follows:
|
||||
|
||||
```sql
|
||||
SELECT SUM("cnt") / (COUNT(*) * 1.0) FROM datasource
|
||||
```
|
||||
The higher the result, the greater the benefit you gain from rollup. See [Counting the number of ingested events](schema-design.md#counting) for more details about how counting works with rollup is enabled.
|
||||
|
||||
Tips for maximizing rollup:
|
||||
|
||||
- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
|
||||
- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
|
||||
- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid having matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
|
||||
- You can optionally load the same data into more than one Druid datasource. For example:
|
||||
- Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio.
|
||||
- Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
|
||||
When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
|
||||
- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
|
||||
- Switch to a guaranteed perfect rollup option.
|
||||
- [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
|
||||
|
||||
## Perfect rollup vs best-effort rollup
|
||||
|
||||
Depending on the ingestion method, Druid has the following rollup options:
|
||||
- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
|
||||
- _Best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
|
||||
|
||||
In general, ingestion methods that offer best-effort rollup do this for one of the following reasons:
|
||||
- The ingestion method parallelizes ingestion without a shuffling step required for perfect rollup.
|
||||
- The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received.
|
||||
In both of these cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion run in this mode.
|
||||
|
||||
Ingestion methods that guarantee perfect rollup use an additional preprocessing step to determine intervals and partitioning before data ingestion. This preprocessing step scans the entire input dataset. While this step increases the time required for ingestion, it provides information necessary for perfect rollup.
|
||||
|
||||
The following table shows how each method handles rollup:
|
||||
|
||||
|Method|How it works|
|
||||
|------|------------|
|
||||
|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
|
||||
|[Hadoop](hadoop.md)|Always perfect.|
|
||||
|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
|
||||
|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
|
||||
|
||||
## Learn more
|
||||
See the following topic for more information:
|
||||
* [Rollup tutorial](../tutorials/tutorial-rollup.md) for an example of how to configure rollup, and of how the feature modifies your data.
|
|
@ -24,17 +24,17 @@ title: "Schema design tips"
|
|||
|
||||
## Druid's data model
|
||||
|
||||
For general information, check out the documentation on [Druid's data model](index.md#data-model) on the main
|
||||
For general information, check out the documentation on [Druid's data model](./data-model.md) on the main
|
||||
ingestion overview page. The rest of this page discusses tips for users coming from other kinds of systems, as well as
|
||||
general tips and common practices.
|
||||
|
||||
* Druid data is stored in [datasources](index.md#datasources), which are similar to tables in a traditional RDBMS.
|
||||
* Druid datasources can be ingested with or without [rollup](#rollup). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
|
||||
* Druid data is stored in [datasources](./data-model.md), which are similar to tables in a traditional RDBMS.
|
||||
* Druid datasources can be ingested with or without [rollup](./rollup.md). With rollup enabled, Druid partially aggregates your data during ingestion, potentially reducing its row count, decreasing storage footprint, and improving query performance. With rollup disabled, Druid stores one row for each row in your input data, without any pre-aggregation.
|
||||
* Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on.
|
||||
* All columns in Druid datasources, other than the timestamp column, are either dimensions or metrics. This follows the [standard naming convention](https://en.wikipedia.org/wiki/Online_analytical_processing#Overview_of_OLAP_systems) of OLAP data.
|
||||
* Typical production datasources have tens to hundreds of columns.
|
||||
* [Dimension columns](index.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
|
||||
* [Metric columns](index.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
|
||||
* [Dimension columns](./data-model.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
|
||||
* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
|
||||
|
||||
|
||||
## If you're coming from a...
|
||||
|
@ -89,7 +89,7 @@ it is a natural choice for storing timeseries data. Its flexible data model allo
|
|||
non-timeseries data, even in the same datasource.
|
||||
|
||||
To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and
|
||||
sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details.
|
||||
sort by metric name, like timeseries databases often do. See [Partitioning and sorting](./partitioning.md) for more details.
|
||||
|
||||
Tips for modeling timeseries data in Druid:
|
||||
|
||||
|
@ -98,13 +98,13 @@ for ingestion and aggregation.
|
|||
- Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called
|
||||
"metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. Place this
|
||||
first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality;
|
||||
see [partitioning and sorting](index.md#partitioning) below for details).
|
||||
see [partitioning and sorting](./partitioning.md) below for details).
|
||||
- Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries
|
||||
database systems.
|
||||
- Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able
|
||||
to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to
|
||||
be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx).
|
||||
- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one
|
||||
- Consider enabling [rollup](./rollup.md), which will allow Druid to potentially combine multiple points into one
|
||||
row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is
|
||||
naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource.
|
||||
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
|
||||
|
@ -124,8 +124,8 @@ Tips for modeling log data in Druid:
|
|||
|
||||
- If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
|
||||
[automatic detection of dimension columns](#schema-less-dimensions).
|
||||
- If you have nested data, flatten it using a [`flattenSpec`](index.md#flattenspec).
|
||||
- Consider enabling [rollup](#rollup) if you have mainly analytical use cases for your log data. This will
|
||||
- If you have nested data, flatten it using a [`flattenSpec`](./ingestion-spec.md#flattenspec).
|
||||
- Consider enabling [rollup](./rollup.md) if you have mainly analytical use cases for your log data. This will
|
||||
mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
|
||||
query performance boosts.
|
||||
|
||||
|
@ -134,13 +134,13 @@ query performance boosts.
|
|||
### Rollup
|
||||
|
||||
Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. This is a form
|
||||
of summarization or pre-aggregation. For more details, see the [Rollup](index.md#rollup) section of the ingestion
|
||||
of summarization or pre-aggregation. For more details, see the [Rollup](./rollup.md) section of the ingestion
|
||||
documentation.
|
||||
|
||||
### Partitioning and sorting
|
||||
|
||||
Optimally partitioning and sorting your data can have substantial impact on footprint and performance. For more details,
|
||||
see the [Partitioning](index.md#partitioning) section of the ingestion documentation.
|
||||
see the [Partitioning](./partitioning.md) section of the ingestion documentation.
|
||||
|
||||
<a name="sketches"></a>
|
||||
|
||||
|
@ -180,19 +180,19 @@ There are performance tradeoffs between string and numeric columns. Numeric colu
|
|||
than string columns. But unlike string columns, numeric columns don't have indexes, so they can be slower to filter on.
|
||||
You may want to experiment to find the optimal choice for your use case.
|
||||
|
||||
For details about how to configure numeric dimensions, see the [`dimensionsSpec`](index.md#dimensionsspec) documentation.
|
||||
For details about how to configure numeric dimensions, see the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec) documentation.
|
||||
|
||||
|
||||
|
||||
### Secondary timestamps
|
||||
|
||||
Druid schemas must always include a primary timestamp. The primary timestamp is used for
|
||||
[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on.
|
||||
[partitioning and sorting](./partitioning.md) your data, so it should be the timestamp that you will most often filter on.
|
||||
Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column.
|
||||
|
||||
If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this
|
||||
is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in milliseconds format.
|
||||
If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and
|
||||
is to ingest them as [long-typed dimensions](./ingestion-spec.md#dimensionsspec) in milliseconds format.
|
||||
If necessary, you can get them into this format using a [`transformSpec`](./ingestion-spec.md#transformspec) and
|
||||
[expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps.
|
||||
|
||||
At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions)
|
||||
|
@ -215,7 +215,7 @@ then before indexing it, you should transform it to:
|
|||
```
|
||||
|
||||
Druid is capable of flattening JSON, Avro, or Parquet input data.
|
||||
Please read about [`flattenSpec`](index.md#flattenspec) for more details.
|
||||
Please read about [`flattenSpec`](./ingestion-spec.md#flattenspec) for more details.
|
||||
|
||||
<a name="counting"></a>
|
||||
|
||||
|
|
|
@ -164,7 +164,7 @@ Only batch tasks have the DETERMINE_PARTITIONS phase. Realtime tasks such as tho
|
|||
the `rowStats` map contains information about row counts. There is one entry for each ingestion phase. The definitions of the different row counts are shown below:
|
||||
* `processed`: Number of rows successfully ingested without parsing errors
|
||||
* `processedWithError`: Number of rows that were ingested, but contained a parsing error within one or more columns. This typically occurs where input rows have a parseable structure but invalid types for columns, such as passing in a non-numeric String value for a numeric column.
|
||||
* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](index.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
|
||||
* `thrownAway`: Number of rows skipped. This includes rows with timestamps that were outside of the ingestion task's defined time interval and rows that were filtered out with a [`transformSpec`](./ingestion-spec.md#transformspec), but doesn't include the rows skipped by explicit user configurations. For example, the rows skipped by `skipHeaderRows` or `hasHeaderRow` in the CSV format are not counted.
|
||||
* `unparseable`: Number of rows that could not be parsed at all and were discarded. This tracks input rows without a parseable structure, such as passing in non-JSON data when using a JSON parser.
|
||||
|
||||
The `errorMsg` field shows a message describing the error that caused a task to fail. It will be null if the task was successful.
|
||||
|
|
|
@ -25,7 +25,7 @@ title: "Segment Size Optimization"
|
|||
|
||||
In Apache Druid, it's important to optimize the segment size because
|
||||
|
||||
1. Druid stores data in segments. If you're using the [best-effort roll-up](../ingestion/index.md#rollup) mode,
|
||||
1. Druid stores data in segments. If you're using the [best-effort roll-up](../ingestion/rollup.md) mode,
|
||||
increasing the segment size might introduce further aggregation which reduces the dataSource size.
|
||||
2. When a query is submitted, that query is distributed to all Historicals and realtime tasks
|
||||
which hold the input segments of the query. Each process and task picks a thread from its own processing thread pool
|
||||
|
|
|
@ -181,7 +181,7 @@ If you change the granularity to `none`, you will get the same results as settin
|
|||
```
|
||||
|
||||
Having a query time `granularity` that is smaller than the `queryGranularity` parameter set at
|
||||
[ingestion time](../ingestion/index.md#granularityspec) is unreasonable because information about that
|
||||
[ingestion time](../ingestion/ingestion-spec.md#granularityspec) is unreasonable because information about that
|
||||
smaller granularity is not present in the indexed data. So, if the query time granularity is smaller than the ingestion
|
||||
time query granularity, Druid produces results that are equivalent to having set `granularity` to `queryGranularity`.
|
||||
|
||||
|
|
|
@ -59,7 +59,7 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr
|
|||
* `SORTED_SET`: results in the removal of duplicate values
|
||||
* `ARRAY`: retains the original order of the values
|
||||
|
||||
See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
|
||||
See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.
|
||||
|
||||
|
||||
## Querying multi-value dimensions
|
||||
|
|
|
@ -176,11 +176,11 @@ in the Druid root directory represents Wikipedia page edits for a given day.
|
|||
You do not need to adjust transformation or filtering settings, as applying ingestion time transforms and
|
||||
filters are out of scope for this tutorial.
|
||||
|
||||
8. The Configure schema settings are where you configure what [dimensions](../ingestion/index.md#dimensions)
|
||||
and [metrics](../ingestion/index.md#metrics) are ingested. The outcome of this configuration represents exactly how the
|
||||
8. The Configure schema settings are where you configure what [dimensions](../ingestion/data-model.md#dimensions)
|
||||
and [metrics](../ingestion/data-model.md#metrics) are ingested. The outcome of this configuration represents exactly how the
|
||||
data will appear in Druid after ingestion.
|
||||
|
||||
Since our dataset is very small, you can turn off [rollup](../ingestion/index.md#rollup)
|
||||
Since our dataset is very small, you can turn off [rollup](../ingestion/rollup.md)
|
||||
by unsetting the **Rollup** switch and confirming the change when prompted.
|
||||
|
||||
![Data loader schema](../assets/tutorial-batch-data-loader-05.png "Data loader schema")
|
||||
|
|
|
@ -112,9 +112,9 @@ You do not need to enter anything in these steps as applying ingestion time tran
|
|||
|
||||
![Data loader schema](../assets/tutorial-kafka-data-loader-05.png "Data loader schema")
|
||||
|
||||
In the `Configure schema` step, you can configure which [dimensions](../ingestion/index.md#dimensions) and [metrics](../ingestion/index.md#metrics) will be ingested into Druid.
|
||||
In the `Configure schema` step, you can configure which [dimensions](../ingestion/data-model.md#dimensions) and [metrics](../ingestion/data-model.md#metrics) will be ingested into Druid.
|
||||
This is exactly what the data will appear like in Druid once it is ingested.
|
||||
Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/index.md#rollup) by clicking on the switch and confirming the change.
|
||||
Since our dataset is very small, go ahead and turn off [`Rollup`](../ingestion/rollup.md) by clicking on the switch and confirming the change.
|
||||
|
||||
Once you are satisfied with the schema, click `Next` to go to the `Partition` step where you can fine tune how the data will be partitioned into segments.
|
||||
|
||||
|
|
|
@ -282,6 +282,10 @@
|
|||
"ingestion/data-management": {
|
||||
"title": "Data management"
|
||||
},
|
||||
"ingestion/data-model": {
|
||||
"title": "Druid data model",
|
||||
"sidebar_label": "Data model"
|
||||
},
|
||||
"ingestion/faq": {
|
||||
"title": "Ingestion troubleshooting FAQ",
|
||||
"sidebar_label": "Troubleshooting FAQ"
|
||||
|
@ -293,10 +297,22 @@
|
|||
"ingestion/index": {
|
||||
"title": "Ingestion"
|
||||
},
|
||||
"ingestion/ingestion-spec": {
|
||||
"title": "Ingestion spec reference",
|
||||
"sidebar_label": "Ingestion spec"
|
||||
},
|
||||
"ingestion/native-batch": {
|
||||
"title": "Native batch ingestion",
|
||||
"sidebar_label": "Native batch"
|
||||
},
|
||||
"ingestion/partitioning": {
|
||||
"title": "Partitioning",
|
||||
"sidebar_label": "Partitioning"
|
||||
},
|
||||
"ingestion/rollup": {
|
||||
"title": "Data rollup",
|
||||
"sidebar_label": "Data rollup"
|
||||
},
|
||||
"ingestion/schema-design": {
|
||||
"title": "Schema design tips"
|
||||
},
|
||||
|
|
|
@ -32,6 +32,10 @@
|
|||
"Ingestion": [
|
||||
"ingestion/index",
|
||||
"ingestion/data-formats",
|
||||
"ingestion/data-model",
|
||||
"ingestion/rollup",
|
||||
"ingestion/partitioning",
|
||||
"ingestion/ingestion-spec",
|
||||
"ingestion/schema-design",
|
||||
"ingestion/data-management",
|
||||
"ingestion/compaction",
|
||||
|
|
Loading…
Reference in New Issue