Improved docs for multi-value dimensions.

- Add central doc for multi-value dimensions, with some content from other docs.
- Link to multi-value dimension doc from topN and groupBy docs.
- Fixes a broken link from dimensionspecs.md, which was presciently already linking to this nonexistent doc.
- Resolve inconsistent naming in docs & code (sometimes "multi-valued", sometimes "multi-value") in favor of "multi-value".
2025-02-15 22:44:53 +00:00 · 2016-03-22 14:16:34 -07:00 · 2016-03-22 14:16:34 -07:00 · ff25325f3b
commit ff25325f3b
parent a6e9ff48ec
8 changed files with 64 additions and 25 deletions
--- a/docs/content/design/segments.md
+++ b/docs/content/design/segments.md
@@ -163,7 +163,7 @@ Each column is stored as two parts:
 1.  A Jackson-serialized ColumnDescriptor
 2.  The rest of the binary for the column

-A ColumnDescriptor is essentially an object that allows us to use jackson’s polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-valued, etc.) and then a list of serde logic that can deserialize the rest of the binary.
+A ColumnDescriptor is essentially an object that allows us to use jackson’s polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serde logic that can deserialize the rest of the binary.

 Sharding Data to Create Segments
 --------------------------------
--- a/docs/content/querying/dimensionspecs.md
+++ b/docs/content/querying/dimensionspecs.md
@@ -351,14 +351,14 @@ Returns the dimension value formatted according to the given format string.

 For example if you want to concat "[" and "]" before and after the actual dimension value, you need to specify "[%s]" as format string.

-### Filtering DimensionSpecs
+### Filtered DimensionSpecs

-These are only valid for multi-valued dimensions. If you have a row in druid that has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with [query filter](filter.html) for value "v1". In the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.
+These are only valid for multi-value dimensions. If you have a row in druid that has a multi-value dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with [query filter](filter.html) for value "v1". In the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.

-It happens because `query filter` is internally used on the bitmaps and only used to match the row to be included in the query result processing. With multivalued dimensions, "query filter" behaves like a contains check, which will match the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.html) for more details.
-Then groupBy/topN processing pipeline "explodes" all multi-valued dimensions resulting 3 rows for "v1", "v2" and "v3" each.
+It happens because "query filter" is internally used on the bitmaps and only used to match the row to be included in the query result processing. With multi-value dimensions, "query filter" behaves like a contains check, which will match the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.html) for more details.
+Then groupBy/topN processing pipeline "explodes" all multi-value dimensions resulting 3 rows for "v1", "v2" and "v3" each.

-In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtering dimension spec to filter for specific values within the values of a multi-valued dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.
+In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtered dimension spec to filter for specific values within the values of a multi-value dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.

 The following filtered dimension spec acts as a whitelist or blacklist for values as per the "isWhitelist" attribute value.

@@ -372,7 +372,7 @@ Following filtered dimension spec retains only the values matching regex. Note t
 { "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
 ```

-For more details and examples, see [multi-valued dimensions](multi-valued-dimensions.html).
+For more details and examples, see [multi-value dimensions](multi-value-dimensions.html).

 ### Upper and Lower extraction functions.

--- a/docs/content/querying/groupbyquery.md
+++ b/docs/content/querying/groupbyquery.md
@@ -95,3 +95,14 @@ To pull it all together, the above query would return *n\*m* data points, up to
 ...
 ]
 ```
+
+### Behavior on multi-value dimensions
+
+groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
+from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
+there are rows. For example, a groupBy on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
+generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
+your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
+improve performance.
+
+See [Multi-value dimensions](multi-value-dimensions.html) for more details.
--- a/docs/content/querying/multi-value-dimensions.md
+++ b/docs/content/querying/multi-value-dimensions.md
@@ -1,22 +1,38 @@
 ---
 layout: doc_page
 ---
+# Multi-value dimensions

-This document contains additional query optimizations for certain types of queries.
+Druid supports "multi-value" string dimensions. These are generated when an input field contains an array of values
+instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter` characters).

-# Multi-value Dimensions
+This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
+are used as a dimension being grouped by. See the section on multi-value columns in
+[segments](../design/segments.html#multi-value-columns) for internal representation details.

-Druid supports "multi-valued" dimensions. See the section on multi-valued columns in [segments](../design/segments.html) for internal representation details. This document describes the behavior of groupBy(topN has similar behavior) queries on multi-valued dimensions when they are used as a dimension being grouped by.
+## Querying multi-value dimensions

-Suppose, you have a dataSource with a segment that contains following rows with a multi-valued dimension called tags.
+Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
+called `tags`.

 ```
-2772011-01-12T00:00:00.000Z,["t1","t2","t3"],  #row1
-2782011-01-13T00:00:00.000Z,["t3","t4","t5"],  #row2
-2792011-01-14T00:00:00.000Z,["t5","t6","t7"]   #row3
+{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
+{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
+{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
 ```

-### Group-By query with no filtering
+All query types can filter on multi-value dimensions. Filters operate independently on each value of a multi-value
+dimension. For example, a `"t1" OR "t3"` filter would match row1 and row2 but not row3. A `"t1" AND "t3"` filter
+would only match row1.
+
+topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
+from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
+there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
+generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
+your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
+improve performance.
+
+### Example: GroupBy query with no filtering

 See [GroupBy querying](groupbyquery.html) for details.

@@ -104,7 +120,7 @@ returns following result.

 notice how original rows are "exploded" into multiple rows and merged.

-### Group-By query with a selector query filter
+### Example: GroupBy query with a selector query filter

 See [query filters](filters.html) for details of selector query filter.

@@ -181,13 +197,13 @@ returns following result.
 ]
 ```

-You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is applied on the row before explosion. For multi-valued dimensions, selector filter for "t3" would match row1 and row2, after which exploding is done. For multi-valued dimensions, query filter matches a row if any individual value inside the multiple values matches the query filter.
+You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2, after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside the multiple values matches the query filter.

-### Group-By query with a selector query filter and additional filter in "dimensions" attributes
+### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes

 To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as in the query below.

-See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html) for details.
+See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details.

 ```json
 {
@@ -224,7 +240,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html)
 }
 ```

-returns following result.
+returns the following result.

 ```json
 [
@@ -238,5 +254,4 @@ returns following result.
 ]
 ```

-Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered dimensionSpec would be much more efficient because that gets applied at the lowest level in the query processing pipeline while having spec is applied at the highest level of groupBy query processing.
-
+Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline. Having specs are applied at the outermost level of groupBy query processing.
--- a/docs/content/querying/topnquery.md
+++ b/docs/content/querying/topnquery.md
@@ -128,7 +128,20 @@ The format of the results would look like so:
  }
 ]
 ```
+
+### Behavior on multi-value dimensions
+
+topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
+from matching rows will be used to generate one group per value. It's possible for a query to return more groups than
+there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and
+generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
+your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
+improve performance.
+
+See [Multi-value dimensions](multi-value-dimensions.html) for more details.
+
 ### Aliasing
+
 The current TopN algorithm is an approximate algorithm. The top 1000 local results from each segment are returned for merging to determine the global topN. As such, the topN algorithm is approximate in both rank and results. Approximate results *ONLY APPLY WHEN THERE ARE MORE THAN 1000 DIM VALUES*. A topN over a dimension with fewer than 1000 unique dimension values can be considered accurate in rank and accurate in aggregates.

 The threshold can be modified from it's default 1000 via the server parameter `druid.query.topN.minTopNThreshold` which need to restart servers to take effect or set `minTopNThreshold` in query context which take effect per query. 
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -40,9 +40,9 @@
    * [Granularities](../querying/granularities.html)
    * [DimensionSpecs](../querying/dimensionspecs.html)
    * [Context](../querying/query-context.html)
+  * [Multi-value dimensions](../querying/multi-value-dimensions.html)
  * [SQL](../querying/sql.html)
  * [Joins](../querying/joins.html)
-  * [Optimizations](../querying/optimizations.html)
  * [Multitenancy](../querying/multitenancy.html)
  * [Caching](../querying/caching.html)

--- a/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java
+++ b/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java
@@ -601,7 +601,7 @@ public class DeterminePartitionsJob implements Jobby

        // Respect poisoning
        if (!currentDimSkip && dvc.numRows < 0) {
-          log.info("Cannot partition on multi-valued dimension: %s", dvc.dim);
+          log.info("Cannot partition on multi-value dimension: %s", dvc.dim);
          currentDimSkip = true;
        }

--- a/processing/src/main/java/io/druid/segment/QueryableIndexStorageAdapter.java
+++ b/processing/src/main/java/io/druid/segment/QueryableIndexStorageAdapter.java
@@ -614,7 +614,7 @@ public class QueryableIndexStorageAdapter implements StorageAdapter

                        if (columnVals.hasMultipleValues()) {
                          throw new UnsupportedOperationException(
-                              "makeObjectColumnSelector does not support multivalued GenericColumns"
+                              "makeObjectColumnSelector does not support multi-value GenericColumns"
                          );
                        }
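
Note on the documented behavior: the filter-then-explode semantics described in multi-value-dimensions.md can be illustrated with a small simulation. This is only a sketch in plain Python, not Druid code. The helper names (`matches_any`, `matches_all`, `group_by_tags`) and the `keep_value` parameter are hypothetical, and counting rows per group stands in for whatever aggregator a real query would use.

```python
from collections import Counter

# The three example rows from multi-value-dimensions.md.
ROWS = [
    {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1", "t2", "t3"]},  # row1
    {"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3", "t4", "t5"]},  # row2
    {"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5", "t6", "t7"]},  # row3
]


def matches_any(row, wanted):
    """OR-style filter: the row matches if any of its tags is in `wanted`."""
    return any(value in wanted for value in row["tags"])


def matches_all(row, wanted):
    """AND-style filter: the row matches if every wanted value is among its tags."""
    return all(value in row["tags"] for value in wanted)


def group_by_tags(rows, row_filter, keep_value=lambda value: True):
    """Filter whole rows first, then "explode" each matching row into one group per tag.

    `keep_value` plays the role of a filtered dimensionSpec (hypothetical
    stand-in): it drops exploded values that should not form groups.
    """
    counts = Counter()
    for row in rows:
        if not row_filter(row):
            continue  # the query filter works on rows, not on individual values
        for value in row["tags"]:
            if keep_value(value):
                counts[value] += 1
    return dict(counts)


# A selector filter on "t3" matches row1 and row2, and all of their tags become groups.
unfiltered = group_by_tags(ROWS, lambda row: matches_any(row, {"t3"}))

# Adding the value-level filter keeps only the "t3" group.
filtered = group_by_tags(
    ROWS,
    lambda row: matches_any(row, {"t3"}),
    keep_value=lambda value: value == "t3",
)
```

Under these assumptions, `unfiltered` contains groups for `t1` through `t5`, while `filtered` contains only `t3`. This mirrors the difference between the second and third GroupBy examples in the doc.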