mirror of https://github.com/apache/druid.git
Consolidate multi-value dimension doc and highlight configurability (#11428)
* Clarify options for multi-value dims * Add first example
This commit is contained in:
parent
8d7d60d18e
commit
a366753ba5
|
@ -1408,11 +1408,6 @@ tasks will fail with an exception.
|
|||
|
||||
The `columns` field must be included and and ensure that the order of the fields matches the columns of your input data in the same order.
|
||||
|
||||
### Multi-value dimensions
|
||||
|
||||
Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`.
|
||||
|
||||
JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed.
|
||||
|
||||
### Regex ParseSpec
|
||||
|
||||
|
|
|
@ -23,20 +23,48 @@ title: "Multi-value dimensions"
|
|||
-->
|
||||
|
||||
|
||||
Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
|
||||
array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
|
||||
characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
|
||||
Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
|
||||
array of values instead of a single value, such as the `tags` values in the following JSON array example:
|
||||
|
||||
This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
|
||||
are used as a dimension being grouped by. See the section on multi-value columns in
|
||||
[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document
|
||||
```
|
||||
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
|
||||
```
|
||||
|
||||
This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
|
||||
[segments documentation](../design/segments.md#multi-value-columns). Examples in this document
|
||||
are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details
|
||||
about using multi-value string dimensions in SQL.
|
||||
|
||||
## Overview
|
||||
|
||||
At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as multi-value dimensions.
|
||||
|
||||
For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `parseSpec` configuration.
|
||||
|
||||
The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`:
|
||||
|
||||
```
|
||||
"dimensions": [
|
||||
{
|
||||
"type": "string",
|
||||
"name": "tags",
|
||||
"multiValueHandling": "SORTED_ARRAY",
|
||||
"createBitmapIndex": true
|
||||
}
|
||||
],
|
||||
```
|
||||
|
||||
By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:
|
||||
|
||||
* `SORTED_SET`: results in the removal of duplicate values
|
||||
* `ARRAY`: retains the original order of the values
|
||||
|
||||
See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
|
||||
|
||||
|
||||
## Querying multi-value dimensions
|
||||
|
||||
Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
|
||||
called `tags`.
|
||||
The following sections describe filtering and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`.
|
||||
|
||||
```
|
||||
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
|
||||
|
@ -44,6 +72,7 @@ called `tags`.
|
|||
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
|
||||
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
|
||||
```
|
||||
> Be sure to remove the comments before trying out the sample data.
|
||||
|
||||
### Filtering
|
||||
|
||||
|
@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value dimensions:
|
|||
underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row
|
||||
if the underlying filter does not match the row.
|
||||
|
||||
For example, this "or" filter would match row1 and row2 of the dataset above, but not row3:
|
||||
The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3:
|
||||
|
||||
```
|
||||
{
|
||||
|
@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
|
|||
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
|
||||
improve performance.
|
||||
|
||||
### Example: GroupBy query with no filtering
|
||||
## Example: GroupBy query with no filtering
|
||||
|
||||
See [GroupBy querying](groupbyquery.md) for details.
|
||||
|
||||
|
@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details.
|
|||
}
|
||||
```
|
||||
|
||||
returns following result.
|
||||
This query returns the following result:
|
||||
|
||||
```json
|
||||
[
|
||||
|
@ -204,9 +233,9 @@ returns following result.
|
|||
]
|
||||
```
|
||||
|
||||
notice how original rows are "exploded" into multiple rows and merged.
|
||||
Notice that original rows are "exploded" into multiple rows and merged.
|
||||
|
||||
### Example: GroupBy query with a selector query filter
|
||||
## Example: GroupBy query with a selector query filter
|
||||
|
||||
See [query filters](filters.md) for details of selector query filter.
|
||||
|
||||
|
@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector query filter.
|
|||
}
|
||||
```
|
||||
|
||||
returns following result.
|
||||
This query returns the following result:
|
||||
|
||||
```json
|
||||
[
|
||||
|
@ -283,17 +312,16 @@ returns following result.
|
|||
]
|
||||
```
|
||||
|
||||
You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is
|
||||
applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2,
|
||||
after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside
|
||||
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is
|
||||
applied on the row before explosion. For multi-value dimensions, a selector filter for "t3" would match row1 and row2,
|
||||
after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside
|
||||
the multiple values matches the query filter.
|
||||
|
||||
### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes
|
||||
## Example: GroupBy query with selector query and dimension filters
|
||||
|
||||
To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
|
||||
in the query below.
|
||||
To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below.
|
||||
|
||||
See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
|
||||
See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
|
||||
|
||||
```json
|
||||
{
|
||||
|
@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#fil
|
|||
}
|
||||
```
|
||||
|
||||
returns the following result.
|
||||
This query returns the following result:
|
||||
|
||||
```json
|
||||
[
|
||||
|
@ -345,5 +373,5 @@ returns the following result.
|
|||
```
|
||||
|
||||
Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
|
||||
dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline.
|
||||
`dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
|
||||
Having specs are applied at the outermost level of groupBy query processing.
|
||||
|
|
|
@ -220,7 +220,7 @@
|
|||
"title": "Amazon Kinesis ingestion",
|
||||
"sidebar_label": "Amazon Kinesis"
|
||||
},
|
||||
"development/extensions-core/druid-kubernetes": {
|
||||
"development/extensions-core/kubernetes": {
|
||||
"title": "Kubernetes"
|
||||
},
|
||||
"development/extensions-core/lookups-cached-global": {
|
||||
|
@ -327,6 +327,10 @@
|
|||
"operations/basic-cluster-tuning": {
|
||||
"title": "Basic cluster tuning"
|
||||
},
|
||||
"operations/clean-metadata-store": {
|
||||
"title": "Automated cleanup for metadata records",
|
||||
"sidebar_label": "Automated metadata cleanup"
|
||||
},
|
||||
"operations/deep-storage-migration": {
|
||||
"title": "Deep storage migration"
|
||||
},
|
||||
|
|
Loading…
Reference in New Issue