---
id: multi-value-dimensions
title: "Multi-value dimensions"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
array of values instead of a single value, such as the `tags` values in the following JSON array example:
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
```
Multi-value dimensions are distinct from [array types](arrays.md). While array types behave like standard SQL arrays, multi-value dimensions do not. Some additional details on multi-value behavior can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior).

This document describes inserting, filtering, and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see the [segments documentation](../design/segments.md#multi-value-columns). Examples in this document are in the form of both [SQL](sql.md) and [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql-multivalue-string-functions.md) for details about the functions available for using multi-value string dimensions in SQL.

The following sections are based on the following example data, which includes a multi-value dimension, `tags`.
```json lines
{"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]}
{"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]}
{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]}
{"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []}
```
## Ingestion
### Native batch and streaming ingestion
When using native [batch](../ingestion/native-batch.md) or streaming ingestion, such as with [Apache Kafka](../ingestion/kafka-ingestion.md), the Druid web console data loader can detect multi-value dimensions and configure the `dimensionsSpec` accordingly.

For TSV or CSV data, you can specify the multi-value delimiter using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `inputFormat` configuration.
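As a rough illustration of what a list delimiter does, the following Python sketch splits one delimited field of a TSV row into multiple values. The `|` delimiter, the sample row, and the `split_multi_value` helper are assumptions made for this example. This is not Druid's parser.

```python
import csv
import io

# Hypothetical TSV input: the "tags" field packs several values separated by "|".
TSV_DATA = "timestamp\tlabel\ttags\n2011-01-12T00:00:00.000Z\trow1\tt1|t2|t3\n"


def split_multi_value(field, list_delimiter="|"):
    """Split a delimited field into a list of values (empty field -> empty list)."""
    return field.split(list_delimiter) if field else []


reader = csv.DictReader(io.StringIO(TSV_DATA), delimiter="\t")
# Replace the raw "tags" string of each row with the list of its values.
rows = [dict(row, tags=split_multi_value(row["tags"])) for row in reader]
print(rows[0]["tags"])
```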
The following shows an example `dimensionsSpec` for native ingestion of the data used in this document:
```
"dimensions": [
  {
    "type": "string",
    "name": "label"
  },
  {
    "type": "string",
    "name": "tags",
    "multiValueHandling": "SORTED_ARRAY",
    "createBitmapIndex": true
  }
],
```
By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:

* `SORTED_SET`: results in the removal of duplicate values
* `ARRAY`: retains the original order of the values

See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.
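The three `multiValueHandling` modes can be modeled with a few lines of Python. This is only a sketch of the ordering and de-duplication semantics described above. The `apply_multi_value_handling` function and the sample values are made up for illustration and say nothing about how Druid implements the modes internally.

```python
def apply_multi_value_handling(values, mode="SORTED_ARRAY"):
    """Illustrative model of the three multiValueHandling modes."""
    if mode == "SORTED_ARRAY":
        return sorted(values)        # sorted, duplicates kept
    if mode == "SORTED_SET":
        return sorted(set(values))   # sorted, duplicates removed
    if mode == "ARRAY":
        return list(values)          # original order, duplicates kept
    raise ValueError(f"unknown multiValueHandling mode: {mode}")


raw = ["t3", "t1", "t3", "t2"]  # hypothetical input with a duplicate value
for mode in ("SORTED_ARRAY", "SORTED_SET", "ARRAY"):
    print(mode, apply_multi_value_handling(raw, mode))
```

With the sample input, only `ARRAY` keeps `t3` in first position, and only `SORTED_SET` drops the repeated `t3`.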
### SQL-based ingestion
Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The functions `MV_TO_ARRAY` and `ARRAY_TO_MV` convert `VARCHAR` to `VARCHAR ARRAY` and `VARCHAR ARRAY` to `VARCHAR`, respectively. `multiValueHandling` is not available when using the multi-stage query engine to insert data.

For example, to insert the data used in this document:
```sql
REPLACE INTO "mvd_example" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
      '{"type":"json"}',
      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
    )
  )
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  ARRAY_TO_MV("tags") AS "tags"
FROM "ext"
PARTITIONED BY DAY
```
### SQL-based ingestion with rollup
These input arrays can also be grouped prior to converting into a multi-value dimension:
```sql
REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
      '{"type":"json"}',
      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
    )
  )
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  ARRAY_TO_MV("tags") AS "tags",
  COUNT(*) AS "count"
FROM "ext"
GROUP BY 1, 2, "tags"
PARTITIONED BY DAY
```
Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping.

The `EXTERN` can also declare the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. In this case, you must use `MV_TO_ARRAY`, because the multi-stage query engine only supports grouping on multi-value dimensions as arrays, so they must be coerced to arrays first. These arrays must then be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`.
```sql
REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
      '{"type":"json"}'
    )
  ) EXTEND ("timestamp" VARCHAR, "label" VARCHAR, "tags" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tags",
  COUNT(*) AS "count"
FROM "ext"
GROUP BY 1, 2, MV_TO_ARRAY("tags")
PARTITIONED BY DAY
```
## Querying multi-value dimensions
### Filtering
All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on multi-value dimensions. Filters follow these rules on multi-value dimensions:

- Value filters (like "selector", "bound", and "in") match a row if any of the values of a multi-value dimension match the filter.
- The Column Comparison filter will match a row if the dimensions have any overlap.
- Value filters that match `null` or `""` (empty string) will match empty cells in a multi-value dimension.
- Logical expression filters behave the same way they do on single-value dimensions: "and" matches a row if all underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row if the underlying filter does not match the row.

The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3:
```sql
SELECT *
FROM "mvd_example_rollup"
WHERE tags = 't1' OR tags = 't3'
```
This query returns:
```json lines
{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1}
{"__time":"2011-01-13T00:00:00.000Z","label":"row2","tags":"[\"t3\",\"t4\",\"t5\"]","count":1}
```
Native queries can also perform filtering that would be considered a "contradiction" in SQL, such as this "and" filter which would match only row1 of the dataset above:
```
{
  "type": "and",
  "fields": [
    {
      "type": "selector",
      "dimension": "tags",
      "value": "t1"
    },
    {
      "type": "selector",
      "dimension": "tags",
      "value": "t3"
    }
  ]
}
```
This filter returns:
```json lines
{"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1}
```
Multi-value dimensions also treat an empty row as `null`. Consider the following query:
```sql
SELECT *
FROM "mvd_example_rollup"
WHERE tags IS NULL
```
which results in:
```json lines
{"__time":"2011-01-14T00:00:00.000Z","label":"row4","tags":null,"count":1}
```
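The filtering rules in this section can be approximated in plain Python. The sketch below models only the "any value matches" rule, the `null` match on empty rows, and the "and"/"or" combinators, applied to the example data. The helper names (`selector`, `and_filter`, `or_filter`, `labels`) are hypothetical and only loosely mirror the native filter types.

```python
ROWS = [
    {"label": "row1", "tags": ["t1", "t2", "t3"]},
    {"label": "row2", "tags": ["t3", "t4", "t5"]},
    {"label": "row3", "tags": ["t5", "t6", "t7"]},
    {"label": "row4", "tags": []},
]


def selector(dimension, value):
    """Match a row if ANY value of the dimension equals `value`.
    A value of None matches rows whose dimension is empty."""
    def matches(row):
        values = row[dimension]
        if value is None:
            return len(values) == 0
        return value in values
    return matches


def and_filter(*filters):
    """Match a row only if every underlying filter matches it."""
    return lambda row: all(f(row) for f in filters)


def or_filter(*filters):
    """Match a row if at least one underlying filter matches it."""
    return lambda row: any(f(row) for f in filters)


def labels(row_filter):
    """Return the labels of the rows accepted by the filter."""
    return [row["label"] for row in ROWS if row_filter(row)]


print(labels(or_filter(selector("tags", "t1"), selector("tags", "t3"))))   # ['row1', 'row2']
print(labels(and_filter(selector("tags", "t1"), selector("tags", "t3"))))  # ['row1']
print(labels(selector("tags", None)))                                      # ['row4']
```

The "and" case shows why such a filter is not a contradiction for multi-value dimensions: each selector is evaluated against the whole list of values, so a single row can satisfy both.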
### Grouping
When grouping on a multi-value dimension with SQL or with a native [topN](topnquery.md) or [groupBy](groupbyquery.md) query, _all_ values from matching rows are used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST` operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: `t1`, `t2`, and `t3`.

If you only need to include values that match your filter, you can use the SQL functions [`MV_FILTER_ONLY`/`MV_FILTER_NONE`](sql-multivalue-string-functions.md), a [filtered virtual column](virtual-columns.md#list-filtered-virtual-column), or a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance.
#### Example: SQL grouping query with no filtering
```sql
SELECT label, tags
FROM "mvd_example_rollup"
GROUP BY 1,2
```
results in:
```json lines
{"label":"row1","tags":"t1"}
{"label":"row1","tags":"t2"}
{"label":"row1","tags":"t3"}
{"label":"row2","tags":"t3"}
{"label":"row2","tags":"t4"}
{"label":"row2","tags":"t5"}
{"label":"row3","tags":"t5"}
{"label":"row3","tags":"t6"}
{"label":"row3","tags":"t7"}
{"label":"row4","tags":null}
```
#### Example: SQL grouping query with a filter
```sql
SELECT label, tags
FROM "mvd_example_rollup"
WHERE tags = 't3'
GROUP BY 1,2
```
results in:
```json lines
{"label":"row1","tags":"t1"}
{"label":"row1","tags":"t2"}
{"label":"row1","tags":"t3"}
{"label":"row2","tags":"t3"}
{"label":"row2","tags":"t4"}
{"label":"row2","tags":"t5"}
```
#### Example: native GroupBy query with no filtering
See [GroupBy querying](groupbyquery.md) for details.
```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "tags",
      "outputName": "tags"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
This query returns the following result:
```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t1"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t2"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t4"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t5"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t6"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t7"
    }
  }
]
```
Notice that original rows are "exploded" into multiple rows and merged.
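The "explode, then merge" behavior can be reproduced with a short Python sketch. It assumes the three non-empty example rows (row4 is left out to keep the focus on non-empty values), and `group_by_tags` is a hypothetical helper, not part of any Druid API.

```python
from collections import Counter

ROWS = [
    {"label": "row1", "tags": ["t1", "t2", "t3"]},
    {"label": "row2", "tags": ["t3", "t4", "t5"]},
    {"label": "row3", "tags": ["t5", "t6", "t7"]},
]


def group_by_tags(rows):
    """Explode each row into one entry per tag, then merge entries by tag."""
    counts = Counter()
    for row in rows:
        for tag in row["tags"]:  # one exploded row per value
            counts[tag] += 1     # merge: count rows per group
    return dict(sorted(counts.items()))


print(group_by_tags(ROWS))
# {'t1': 1, 't2': 1, 't3': 2, 't4': 1, 't5': 2, 't6': 1, 't7': 1}
```

Three input rows produce seven groups, and the shared values `t3` and `t5` each get a count of 2, matching the query result above.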
#### Example: native GroupBy query with a selector query filter
See [query filters](filters.md) for details of the selector query filter.
```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "filter": {
    "type": "selector",
    "dimension": "tags",
    "value": "t3"
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "tags",
      "outputName": "tags"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
This query returns the following result:
```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t1"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t2"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t4"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "t5"
    }
  }
]
```
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is applied to each row before explosion, and a query filter matches a multi-value row if any individual value in it matches. A filter for value "t3" therefore matches row1 and row2 in full, and only then are those rows exploded into one group per value.
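The order of operations, filter whole rows first and explode afterwards, can be sketched in Python as follows. The helper names are hypothetical and the data is the three non-empty example rows. The sketch only mirrors the semantics described here, not Druid's execution engine.

```python
from collections import Counter

ROWS = [
    {"label": "row1", "tags": ["t1", "t2", "t3"]},
    {"label": "row2", "tags": ["t3", "t4", "t5"]},
    {"label": "row3", "tags": ["t5", "t6", "t7"]},
]


def filter_rows(rows, value):
    """Step 1: keep every row in which ANY tag equals `value`."""
    return [row for row in rows if value in row["tags"]]


def explode_and_count(rows):
    """Step 2: explode the surviving rows into one group per tag."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    return dict(sorted(counts.items()))


print(explode_and_count(filter_rows(ROWS, "t3")))
# {'t1': 1, 't2': 1, 't3': 2, 't4': 1, 't5': 1}
```

Because whole rows survive the filter, the sibling values of "t3" in row1 and row2 still produce their own groups.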
#### Example: native GroupBy query with selector query and dimension filters
To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below.

See the filtered `dimensionSpecs` section in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
```json
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "filter": {
    "type": "selector",
    "dimension": "tags",
    "value": "t3"
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "listFiltered",
      "delegate": {
        "type": "default",
        "dimension": "tags",
        "outputName": "tags"
      },
      "values": ["t3"]
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
```
This query returns the following result:
```json
[
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "t3"
    }
  }
]
```
Note that, for groupBy queries, you could get a similar result with a [having spec](having.md), but using a filtered `dimensionSpec` is much more efficient because it is applied at the lowest level of the query processing pipeline, whereas having specs are applied at the outermost level of groupBy query processing.
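To make the efficiency argument concrete, the Python sketch below contrasts the two approaches on the rows matched by the "t3" filter. It is a simplified model with hypothetical helper names: the first function discards unwanted values while exploding, and the second builds every group and discards the unwanted ones afterwards, which is roughly what a having-style post-filter amounts to.

```python
from collections import Counter

ROWS = [
    {"label": "row1", "tags": ["t1", "t2", "t3"]},
    {"label": "row2", "tags": ["t3", "t4", "t5"]},
    {"label": "row3", "tags": ["t5", "t6", "t7"]},
]


def matching_rows(rows, value):
    """Query filter: keep rows in which any tag equals `value`."""
    return [row for row in rows if value in row["tags"]]


def group_with_filtered_dimension(rows, allowed):
    """Drop unwanted values while exploding, so only wanted groups are built."""
    counts = Counter()
    for row in rows:
        for tag in row["tags"]:
            if tag in allowed:
                counts[tag] += 1
    return dict(counts)


def group_then_having(rows, allowed):
    """Build every group first, then discard unwanted groups afterwards."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    groups_built = len(counts)
    kept = {tag: n for tag, n in counts.items() if tag in allowed}
    return kept, groups_built


selected = matching_rows(ROWS, "t3")
filtered = group_with_filtered_dimension(selected, {"t3"})
having_result, groups_built = group_then_having(selected, {"t3"})
print(filtered, having_result, groups_built)
```

Both approaches end with the single group `t3` with a count of 2, but the second one builds five intermediate groups before throwing four of them away.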
## Disable GroupBy on multi-value columns
You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your [query context](query-context.md). In this mode, the groupBy engine returns an error instead of completing the query when it encounters a multi-value dimension. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included.
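As a sketch of where the flag goes, the following Python snippet builds a native groupBy query as a dictionary, sets `groupByEnableMultiValueUnnesting` in its `context` object, and serializes it to JSON. The datasource and dimension reuse the earlier examples. How the payload is submitted is left out on purpose.

```python
import json

# Native groupBy query from the earlier example, with unnesting disabled
# through the query context.
query = {
    "queryType": "groupBy",
    "dataSource": "test",
    "intervals": ["1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"],
    "granularity": {"type": "all"},
    "dimensions": [
        {"type": "default", "dimension": "tags", "outputName": "tags"}
    ],
    "aggregations": [{"type": "count", "name": "count"}],
    "context": {"groupByEnableMultiValueUnnesting": False},
}

payload = json.dumps(query, indent=2)  # Python False serializes to JSON false
print(payload)
```

Because `tags` is a multi-value dimension, a query like this one is expected to be rejected rather than silently unnested.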
## Differences between arrays and multi-value dimensions
Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensions.md). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL-compliant behavior.

Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers.

To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings.

You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like:
```sql
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'mytable'
```
Arrays are type `ARRAY`; multi-value strings are type `VARCHAR`.