Consolidate multi-value dimension doc and highlight configurability (#11428)

* Clarify options for multi-value dims
* Add first example
sthetland 2021-07-15 10:19:10 -07:00 committed by GitHub
parent 8d7d60d18e
commit a366753ba5
3 changed files with 56 additions and 29 deletions

View File

@ -1408,11 +1408,6 @@ tasks will fail with an exception.
The `columns` field must be included, and the order of the fields must match the order of the columns in your input data.
### Multi-value dimensions
Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`.
JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed.
### Regex ParseSpec

View File

@ -23,20 +23,48 @@ title: "Multi-value dimensions"
-->
Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
array of values instead of a single value, such as the `tags` values in the following JSON array example:
This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
are used as a dimension being grouped by. See the section on multi-value columns in
[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
```
This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
[segments documentation](../design/segments.md#multi-value-columns). Examples in this document
are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details
about using multi-value string dimensions in SQL.
## Overview
At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays and delimited CSV/TSV fields as multi-value dimensions.
For TSV or CSV data, specify the multi-value delimiter using the `listDelimiter` field in the `parseSpec`. JSON data needs no additional `parseSpec` configuration: any field formatted as a JSON array is ingested as a multi-value dimension.
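For example, a minimal sketch of a TSV `parseSpec` that splits a pipe-delimited `tags` field into multiple values (the column names and the `|` delimiter here are illustrative):
```
"parseSpec": {
  "format": "tsv",
  "timestampSpec": {
    "column": "timestamp",
    "format": "iso"
  },
  "listDelimiter": "|",
  "columns": ["timestamp", "tags"],
  "dimensionsSpec": {
    "dimensions": ["tags"]
  }
}
```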
The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`:
```
"dimensions": [
{
"type": "string",
"name": "tags",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": true
}
],
```
By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:
* `SORTED_SET`: results in the removal of duplicate values
* `ARRAY`: retains the original order of the values
See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
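For illustration, given an input row where `tags` arrives as `["t3", "t3", "t1"]`, each mode would store:
```
SORTED_ARRAY: ["t1", "t3", "t3"]
SORTED_SET:   ["t1", "t3"]
ARRAY:        ["t3", "t3", "t1"]
```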
## Querying multi-value dimensions
Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
called `tags`.
The following sections describe filtering and grouping behavior based on this example data, which includes a multi-value dimension named `tags`.
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
@ -44,6 +72,7 @@ called `tags`.
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
```
> Be sure to remove the comments before trying out the sample data.
### Filtering
@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value dimensions:
underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row
if the underlying filter does not match the row.
For example, this "or" filter would match row1 and row2 of the dataset above, but not row3:
The following example illustrates these rules. The query applies an "or" filter that matches row1 and row2 of the dataset above, but not row3:
```
{
@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
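For example, a minimal sketch of a `listFiltered` dimension spec that keeps only the value "t3" (the full query using it appears later in this document):
```json
{
  "type": "listFiltered",
  "delegate": {
    "type": "default",
    "dimension": "tags",
    "outputName": "tags"
  },
  "values": ["t3"]
}
```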
### Example: GroupBy query with no filtering
## Example: GroupBy query with no filtering
See [GroupBy querying](groupbyquery.md) for details.
@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details.
}
```
returns following result.
This query returns the following result:
```json
[
@ -204,9 +233,9 @@ returns following result.
]
```
notice how original rows are "exploded" into multiple rows and merged.
Notice that the original rows are "exploded" into multiple rows and then merged.
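Conceptually, grouping on `tags` treats row1 as if it were three single-valued rows before merging. The following is a sketch of that intermediate view, not actual stored data:
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t1"}
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t2"}
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t3"}
```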
### Example: GroupBy query with a selector query filter
## Example: GroupBy query with a selector query filter
See [query filters](filters.md) for details of the selector query filter.
@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector query filter.
}
```
returns following result.
This query returns the following result:
```json
[
@ -283,17 +312,16 @@ returns following result.
]
```
You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is
applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2,
after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This happens because the query filter is applied to the row before explosion. For multi-value dimensions, a selector filter for "t3" matches row1 and row2, and explosion happens afterward. A query filter matches a row if any individual value inside
the multiple values matches the query filter.
### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes
## Example: GroupBy query with selector query and dimension filters
To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
in the query below.
To solve the problem above and return only the rows for "t3", use a "filtered dimension spec", as in the query below.
See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
```json
{
@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#fil
}
```
returns the following result.
This query returns the following result:
```json
[
@ -345,5 +373,5 @@ returns the following result.
```
Note that for groupBy queries, you could get a similar result with a [having spec](having.md), but using a filtered
dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline.
`dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
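For comparison, a minimal sketch of the having-spec alternative, using a `dimSelector` having spec to keep only groups where `tags` is "t3":
```json
"having": {
  "type": "dimSelector",
  "dimension": "tags",
  "value": "t3"
}
```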

View File

@ -220,7 +220,7 @@
"title": "Amazon Kinesis ingestion",
"sidebar_label": "Amazon Kinesis"
},
"development/extensions-core/druid-kubernetes": {
"development/extensions-core/kubernetes": {
"title": "Kubernetes"
},
"development/extensions-core/lookups-cached-global": {
@ -327,6 +327,10 @@
"operations/basic-cluster-tuning": {
"title": "Basic cluster tuning"
},
"operations/clean-metadata-store": {
"title": "Automated cleanup for metadata records",
"sidebar_label": "Automated metadata cleanup"
},
"operations/deep-storage-migration": {
"title": "Deep storage migration"
},