diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md index d431fead16e..002d6d24b7e 100644 --- a/docs/ingestion/data-formats.md +++ b/docs/ingestion/data-formats.md @@ -1408,11 +1408,6 @@ tasks will fail with an exception. The `columns` field must be included and and ensure that the order of the fields matches the columns of your input data in the same order. -### Multi-value dimensions - -Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`. - -JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed. ### Regex ParseSpec diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md index 09d319b31b5..b5c23ecd055 100644 --- a/docs/querying/multi-value-dimensions.md +++ b/docs/querying/multi-value-dimensions.md @@ -23,20 +23,48 @@ title: "Multi-value dimensions" --> -Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an -array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter` -characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration. +Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an +array of values instead of a single value, such as the `tags` values in the following JSON array example: -This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they -are used as a dimension being grouped by. See the section on multi-value columns in -[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document +``` +{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} +``` + +This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see +[segments documentation](../design/segments.md#multi-value-columns). Examples in this document are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details about using multi-value string dimensions in SQL. +## Overview + +At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as multi-value dimensions. + +For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `parseSpec` configuration. + +The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`: + +``` +"dimensions": [ + { + "type": "string", + "name": "tags", + "multiValueHandling": "SORTED_ARRAY", + "createBitmapIndex": true + } +], +``` + +By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as: + +* `SORTED_SET`: results in the removal of duplicate values +* `ARRAY`: retains the original order of the values + +See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling. + + ## Querying multi-value dimensions -Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension -called `tags`. +The following sections describe filtering and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`. ``` {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1 @@ -44,6 +72,7 @@ called `tags`. {"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3 {"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4 ``` +> Be sure to remove the comments before trying out the sample data. ### Filtering @@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value dimensions: underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row if the underlying filter does not match the row. -For example, this "or" filter would match row1 and row2 of the dataset above, but not row3: +The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3: ``` { @@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. -### Example: GroupBy query with no filtering +## Example: GroupBy query with no filtering See [GroupBy querying](groupbyquery.md) for details. @@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details. } ``` -returns following result. +This query returns the following result: ```json [ @@ -204,9 +233,9 @@ returns following result. ] ``` -notice how original rows are "exploded" into multiple rows and merged. +Notice that original rows are "exploded" into multiple rows and merged. -### Example: GroupBy query with a selector query filter +## Example: GroupBy query with a selector query filter See [query filters](filters.md) for details of selector query filter. @@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector query filter. } ``` -returns following result. +This query returns the following result: ```json [ @@ -283,17 +312,16 @@ returns following result. ] ``` -You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is -applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2, -after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside +You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is +applied on the row before explosion. For multi-value dimensions, a selector filter for "t3" would match row1 and row2, +after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside the multiple values matches the query filter. -### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes +## Example: GroupBy query with selector query and dimension filters -To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as -in the query below. +To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below. -See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details. +See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details. ```json { @@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#fil } ``` -returns the following result. +This query returns the following result: ```json [ @@ -345,5 +373,5 @@ returns the following result. ``` Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered -dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline. +`dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline. Having specs are applied at the outermost level of groupBy query processing. diff --git a/website/i18n/en.json b/website/i18n/en.json index e3c8b34f816..a777e1774fe 100644 --- a/website/i18n/en.json +++ b/website/i18n/en.json @@ -220,7 +220,7 @@ "title": "Amazon Kinesis ingestion", "sidebar_label": "Amazon Kinesis" }, - "development/extensions-core/druid-kubernetes": { + "development/extensions-core/kubernetes": { "title": "Kubernetes" }, "development/extensions-core/lookups-cached-global": { @@ -327,6 +327,10 @@ "operations/basic-cluster-tuning": { "title": "Basic cluster tuning" }, + "operations/clean-metadata-store": { + "title": "Automated cleanup for metadata records", + "sidebar_label": "Automated metadata cleanup" + }, "operations/deep-storage-migration": { "title": "Deep storage migration" },