Consolidate multi-value dimension doc and highlight configurability (#11428)

* Clarify options for multi-value dims
* Add first example
sthetland 2021-07-15 10:19:10 -07:00 committed by GitHub
parent 8d7d60d18e
commit a366753ba5
3 changed files with 56 additions and 29 deletions

View File

@ -1408,11 +1408,6 @@ tasks will fail with an exception.
The `columns` field must be included, and the order of the fields must match the order of the columns in your input data.
### Multi-value dimensions
Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`.
JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed.
### Regex ParseSpec

View File

@ -23,20 +23,48 @@ title: "Multi-value dimensions"
-->
Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
array of values instead of a single value, such as the `tags` values in the following JSON array example:
This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
are used as a dimension being grouped by. See the section on multi-value columns in
[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
```
This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
[segments documentation](../design/segments.md#multi-value-columns). Examples in this document
are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details
about using multi-value string dimensions in SQL.
## Overview
At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays and delimited CSV/TSV fields as multi-value dimensions.
For TSV or CSV data, specify the multi-value delimiter using the `listDelimiter` field in the `parseSpec`. JSON data needs no additional `parseSpec` configuration: any field formatted as a JSON array is ingested as a multi-value dimension.
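For example, a minimal sketch of a TSV `parseSpec` that splits a pipe-delimited `tags` field into multiple values (the column names and the `|` delimiter here are illustrative):
```
"parseSpec": {
  "format": "tsv",
  "timestampSpec": {
    "column": "timestamp",
    "format": "iso"
  },
  "listDelimiter": "|",
  "columns": ["timestamp", "tags"],
  "dimensionsSpec": {
    "dimensions": ["tags"]
  }
}
```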
The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`:
```
"dimensions": [
{
"type": "string",
"name": "tags",
"multiValueHandling": "SORTED_ARRAY",
"createBitmapIndex": true
}
],
```
By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:
* `SORTED_SET`: results in the removal of duplicate values
* `ARRAY`: retains the original order of the values
See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
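For illustration, given an input row where `tags` arrives as `["t3", "t3", "t1"]`, each mode would store:
```
SORTED_ARRAY: ["t1", "t3", "t3"]
SORTED_SET:   ["t1", "t3"]
ARRAY:        ["t3", "t3", "t1"]
```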
## Querying multi-value dimensions
Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
called `tags`.
The following sections describe filtering and grouping behavior based on this example data, which includes a multi-value dimension named `tags`.
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
@ -44,6 +72,7 @@ called `tags`.
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
```
> Be sure to remove the comments before trying out the sample data.
### Filtering
@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value dimensions:
underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row
if the underlying filter does not match the row.
For example, this "or" filter would match row1 and row2 of the dataset above, but not row3:
The following example illustrates these rules. The query applies an "or" filter that matches row1 and row2 of the dataset above, but not row3:
```
{
@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
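For example, a minimal sketch of a `listFiltered` dimension spec that keeps only the value "t3" (the full query using it appears later in this document):
```json
{
  "type": "listFiltered",
  "delegate": {
    "type": "default",
    "dimension": "tags",
    "outputName": "tags"
  },
  "values": ["t3"]
}
```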
### Example: GroupBy query with no filtering
## Example: GroupBy query with no filtering
See [GroupBy querying](groupbyquery.md) for details.
@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details.
}
```
returns following result.
This query returns the following result:
```json
[
@ -204,9 +233,9 @@ returns following result.
]
```
notice how original rows are "exploded" into multiple rows and merged.
Notice that the original rows are "exploded" into multiple rows and then merged.
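Conceptually, grouping on `tags` treats row1 as if it were three single-valued rows before merging. The following is a sketch of that intermediate view, not actual stored data:
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t1"}
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t2"}
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": "t3"}
```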
### Example: GroupBy query with a selector query filter
## Example: GroupBy query with a selector query filter
See [query filters](filters.md) for details of the selector query filter.
@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector query filter.
}
```
returns following result.
This query returns the following result:
```json
[
@ -283,17 +312,16 @@ returns following result.
]
```
You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is
applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2,
after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside
You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This happens because the query filter is applied to the row before explosion. For multi-value dimensions, a selector filter for "t3" matches row1 and row2, and explosion happens afterward. A query filter matches a row if any individual value inside
the multiple values matches the query filter.
### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes
## Example: GroupBy query with selector query and dimension filters
To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
in the query below.
To solve the problem above and return only the rows for "t3", use a "filtered dimension spec", as in the query below.
See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
```json
{
@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#fil
}
```
returns the following result.
This query returns the following result:
```json
[
@ -345,5 +373,5 @@ returns the following result.
```
Note that for groupBy queries, you could get a similar result with a [having spec](having.md), but using a filtered
dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline.
`dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
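For comparison, a minimal sketch of the having-spec alternative, using a `dimSelector` having spec to keep only groups where `tags` is "t3":
```json
"having": {
  "type": "dimSelector",
  "dimension": "tags",
  "value": "t3"
}
```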

View File

@ -220,7 +220,7 @@
"title": "Amazon Kinesis ingestion",
"sidebar_label": "Amazon Kinesis"
},
"development/extensions-core/druid-kubernetes": {
"development/extensions-core/kubernetes": {
"title": "Kubernetes"
},
"development/extensions-core/lookups-cached-global": {
@ -327,6 +327,10 @@
"operations/basic-cluster-tuning": {
"title": "Basic cluster tuning"
},
"operations/clean-metadata-store": {
"title": "Automated cleanup for metadata records",
"sidebar_label": "Automated metadata cleanup"
},
"operations/deep-storage-migration": {
"title": "Deep storage migration"
},