mirror of https://github.com/apache/druid.git
add SQL docs for multi-value string dimensions (#8011)
* add SQL docs for multi-value string dimensions
* formatting consistency
* fix typo
* adjust

This commit is contained in:
parent c556d44a19
commit 3b84246cd6
@@ -54,14 +54,14 @@ begin with a digit. To escape other special characters, you can quote it with do

For logical operators, a number is true if and only if it is positive (a zero or negative value means false). For the string
type, it's the evaluation result of 'Boolean.valueOf(string)'.

[Multi-value string dimensions](../querying/multi-value-dimensions.html) are supported and may be treated as either
scalar or array typed values. When treated as a scalar type, an expression will automatically be transformed to apply
the scalar operation across all values of the multi-valued type, to mimic Druid's native behavior. Values that result in
arrays will be coerced back into the native Druid string type for aggregation. Druid aggregations on multi-value string
dimensions operate on the individual values, _not_ the 'array', behaving similarly to the `UNNEST` operator available in many SQL
dialects. However, by using the `array_to_string` function, aggregations may be done on a stringified version of the
complete array, allowing the complete row to be preserved. Using `string_to_array` in an expression post-aggregator
allows transforming the stringified dimension back into the true native array type.
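As a quick illustration of the scalar transformation described above, here is a minimal sketch using the Druid SQL counterparts documented in the SQL section later in this diff; the datasource `my_datasource` and the multi-value column `tags` are hypothetical names, not taken from these docs.

```sql
-- A scalar string operation applied to a multi-value column is mapped across
-- every value of the row, mimicking the native expression behavior.
SELECT CONCAT(tags, '-suffix') AS decorated_tags
FROM my_datasource
LIMIT 5;
```

Aggregating on `tags` itself would still operate on the individual values; stringifying with `array_to_string` (or `MV_TO_STRING` in SQL) is the way to aggregate on the complete array.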

The following built-in functions are available.
@@ -168,31 +168,30 @@ See javadoc of java.lang.Math for detailed explanation for each function.

| function | description |
| --- | --- |
| array(expr1,expr ...) | constructs an array from the expression arguments, using the type of the first argument as the output array type |
| array_length(arr) | returns length of array expression |
| array_offset(arr,long) | returns the array element at the 0 based index supplied, or null for an out of range index |
| array_ordinal(arr,long) | returns the array element at the 1 based index supplied, or null for an out of range index |
| array_contains(arr,expr) | returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0 |
| array_overlap(arr1,arr2) | returns 1 if arr1 and arr2 have any elements in common, else 0 |
| array_offset_of(arr,expr) | returns the 0 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `-1`, or `null` if `druid.generic.useDefaultValueForNull=false` |
| array_ordinal_of(arr,expr) | returns the 1 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `-1`, or `null` if `druid.generic.useDefaultValueForNull=false` |
| array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting array type determined by the type of the array |
| array_append(arr1,expr) | appends expr to arr, the resulting array type determined by the type of the first array |
| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type determined by the type of the first array |
| array_slice(arr,start,end) | returns the subarray of arr from the 0 based index start (inclusive) to end (exclusive), or `null` if start is less than 0, greater than the length of arr, or greater than end |
| array_to_string(arr,str) | joins all elements of arr by the delimiter specified by str |
| string_to_array(str1,str2) | splits str1 into an array on the delimiter specified by str2 |

## Apply Functions

| function | description |
| --- | --- |
| map(lambda,arr) | applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array |
| cartesian_map(lambda,arr1,arr2,...) | applies a transform specified by a multi argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same |
| filter(lambda,arr) | filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match |
| fold(lambda,arr) | folds a 2 argument lambda across arr. The first argument of the lambda is the array element and the second is the accumulator, returning a single accumulated value. |
| cartesian_fold(lambda,arr1,arr2,...) | folds a multi argument lambda across the cartesian product of all input arrays. The first arguments of the lambda are the array elements and the last is the accumulator, returning a single accumulated value. |
| any(lambda,arr) | returns 1 if any element in the array matches the lambda expression, else 0 |
| all(lambda,arr) | returns 1 if all elements in the array match the lambda expression, else 0 |

@@ -24,12 +24,15 @@ title: "Multi-value dimensions"

# Multi-value dimensions

Apache Druid (incubating) supports "multi-value" string dimensions. These are generated when an input field contains an
array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
characters).

This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
are used as a dimension being grouped by. See the section on multi-value columns in
[segments](../design/segments.html#multi-value-columns) for internal representation details. Examples in this document
are in the form of [native Druid queries](querying.html). Refer to the [Druid SQL documentation](sql.html) for details
about using multi-value string dimensions in SQL.

## Querying multi-value dimensions

@@ -109,9 +112,10 @@ This "selector" filter would match row4 of the dataset above:

### Grouping

topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, _all_ values
from matching rows will be used to generate one group per value. This can be thought of as the equivalent of the
`UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This means it's possible for a query to return
more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match
only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match
your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
improve performance.
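For readers who prefer SQL, a hedged sketch of the same grouping behavior follows; the datasource `my_datasource` and column `tags` are illustrative names, not from this page's examples.

```sql
-- Grouping on a multi-value VARCHAR column yields one group per individual
-- value, so the result can contain more groups than there are rows.
SELECT tags, COUNT(*) AS cnt
FROM my_datasource
GROUP BY tags
ORDER BY cnt DESC;
```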
@@ -280,11 +284,15 @@ returns following result.
]
```

You might be surprised to see the inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because the query filter is
applied on the row before explosion. For multi-value dimensions, a selector filter for "t3" would match row1 and row2,
after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside
the multiple values matches the query filter.
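The same effect is visible in Druid SQL; a hedged sketch with illustrative names:

```sql
-- The WHERE clause matches whole rows whose "tags" contains 't3' (row1 and
-- row2 above), but grouping still explodes all values of those rows, so
-- "t1", "t2", "t4" and "t5" appear alongside "t3".
SELECT tags, COUNT(*) AS cnt
FROM my_datasource
WHERE tags = 't3'
GROUP BY tags;
```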

### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes

To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
in the query below.

See the section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details.

@@ -337,4 +345,6 @@ returns the following result.
]
```

Note that, for groupBy queries, you could get a similar result with a [having spec](having.html), but using a filtered
dimensionSpec is much more efficient because it gets applied at the lowest level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
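In Druid SQL, the having-spec style workaround might look like the sketch below (illustrative names); per the note above, the native filtered dimensionSpec remains the more efficient option.

```sql
-- Filter rows containing 't3', then keep only the 't3' group after grouping,
-- analogous to a having spec applied at the outermost level.
SELECT tags, COUNT(*) AS cnt
FROM my_datasource
WHERE tags = 't3'
GROUP BY tags
HAVING tags = 't3';
```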
@@ -42,6 +42,61 @@ queries on the query Broker (the first process you query), which are then passed

queries. Other than the (slight) overhead of translating SQL on the Broker, there isn't an additional performance
penalty versus native queries.

## Data types and casts

Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit
float), "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like
hyperUnique and approxHistogram columns).

Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of
milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any
timezone information, but only carry information about the exact moment in time they represent. See the
[Time functions](#time-functions) section for more information about timestamp handling.

Druid generally treats NULLs and empty strings interchangeably, rather than according to the SQL standard. As such,
Druid SQL only has partial support for NULLs. For example, the expressions `col IS NULL` and `col = ''` are equivalent,
and both will evaluate to true if `col` contains an empty string. Similarly, the expression `COALESCE(col1, col2)` will
return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator counts all rows, the `COUNT(expr)`
aggregator will count the number of rows where expr is neither null nor the empty string. String columns in Druid are
NULLable. Numeric columns are NOT NULL; if you query a numeric column that is not present in all segments of your Druid
datasource, then it will be treated as zero for rows from those segments.
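A hedged sketch of this NULL and empty-string behavior; the datasource `my_datasource` and string column `comment` are illustrative names.

```sql
-- COUNT(*) counts every row; COUNT(comment) skips rows where "comment" is
-- NULL or ''; the IS NULL filter also matches empty strings.
SELECT
  COUNT(*) AS all_rows,
  COUNT(comment) AS rows_with_comment,
  COUNT(*) FILTER (WHERE comment IS NULL) AS rows_without_comment
FROM my_datasource;
```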

For mathematical operations, Druid SQL will use integer math if all operands involved in an expression are integers.
Otherwise, Druid will switch to floating point math. You can force floating point math by casting one of your operands
to FLOAT. At runtime, Druid may widen 32-bit floats to 64-bit for certain operators, like SUM aggregators.
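For example, a hedged sketch (`my_datasource` is an illustrative name):

```sql
-- With only integer operands Druid uses integer math; casting one operand
-- to FLOAT forces floating point math.
SELECT
  3 / 2                 AS integer_division,  -- 1
  CAST(3 AS FLOAT) / 2  AS float_division     -- 1.5
FROM my_datasource
LIMIT 1;
```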

Druid [multi-value string dimensions](multi-value-dimensions.html) will appear in the table schema as `VARCHAR` typed,
and may be interacted with in expressions as such. Additionally, they can be treated as `ARRAY`-like via a handful of
special multi-value operators. Expressions against multi-value string dimensions will apply the expression to all values
of the row. The caveat, however, is that aggregations on these multi-value string columns will observe the native Druid
multi-value aggregation behavior, which is equivalent to the `UNNEST` function available in many dialects.
Refer to the [multi-value string dimensions](multi-value-dimensions.html) and
[Druid expressions](../misc/math-expr.html) documentation for additional details.
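A hedged sketch of the `MV_TO_STRING`/`STRING_TO_MV` pattern described in those docs, which sidesteps the UNNEST-like aggregation behavior (illustrative names):

```sql
-- Group on the stringified array so each row's complete value set is
-- preserved, then convert the grouped string back to a multi-value string.
SELECT
  STRING_TO_MV(MV_TO_STRING(tags, ','), ',') AS tags,
  COUNT(*) AS cnt
FROM my_datasource
GROUP BY MV_TO_STRING(tags, ',');
```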

The following table describes how SQL types map onto Druid types during query runtime. Casts between two SQL types
that have the same Druid runtime type will have no effect, other than exceptions noted in the table. Casts between two
SQL types that have different Druid runtime types will generate a runtime cast in Druid. If a value cannot be properly
cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will substitute a default value. NULL values cast
to non-nullable types will also be substituted with a default value (for example, nulls cast to numbers will be
converted to zeroes).

|SQL type|Druid runtime type|Default value|Notes|
|--------|------------------|-------------|-----|
|CHAR|STRING|`''`||
|VARCHAR|STRING|`''`|Druid STRING columns are reported as VARCHAR|
|DECIMAL|DOUBLE|`0.0`|DECIMAL uses floating point, not fixed point math|
|FLOAT|FLOAT|`0.0`|Druid FLOAT columns are reported as FLOAT|
|REAL|DOUBLE|`0.0`||
|DOUBLE|DOUBLE|`0.0`|Druid DOUBLE columns are reported as DOUBLE|
|BOOLEAN|LONG|`false`||
|TINYINT|LONG|`0`||
|SMALLINT|LONG|`0`||
|INTEGER|LONG|`0`||
|BIGINT|LONG|`0`|Druid LONG columns (except `__time`) are reported as BIGINT|
|TIMESTAMP|LONG|`0`, meaning 1970-01-01 00:00:00 UTC|Druid's `__time` column is reported as TIMESTAMP. Casts between string and timestamp types assume standard SQL formatting, e.g. `2000-01-02 03:04:05`, _not_ ISO8601 formatting. For handling other formats, use one of the [time functions](#time-functions)|
|DATE|LONG|`0`, meaning 1970-01-01|Casting TIMESTAMP to DATE rounds down the timestamp to the nearest day. Casts between string and date types assume standard SQL formatting, e.g. `2000-01-02`. For handling other formats, use one of the [time functions](#time-functions)|
|OTHER|COMPLEX|none|May represent various Druid column types such as hyperUnique, approxHistogram, etc.|
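A hedged illustration of these casting rules (illustrative names):

```sql
-- Failed casts fall back to the Druid type's default value instead of erroring.
SELECT
  CAST('foo' AS BIGINT)                    AS failed_cast,   -- 0, the BIGINT default
  CAST('2000-01-02 03:04:05' AS TIMESTAMP) AS sql_style_ts,  -- standard SQL format, not ISO8601
  CAST(__time AS DATE)                     AS event_date     -- rounds down to the day
FROM my_datasource
LIMIT 1;
```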

## Query syntax

Each Druid datasource appears as a table in the "druid" schema. This is also the default schema, so Druid datasources
@@ -269,6 +324,27 @@ simplest way to write literal timestamps in other time zones is to use TIME_PARS

|`x OR y`|Boolean OR.|
|`NOT x`|Boolean NOT.|

### Multi-value string functions

All 'array' references in the multi-value string function documentation can refer to multi-value string columns or
`ARRAY` literals.

|Function|Notes|
|--------|-----|
|`ARRAY(expr1,expr ...)`|Constructs an SQL ARRAY literal from the expression arguments, using the type of the first argument as the output array type|
|`MV_LENGTH(arr)`|Returns length of array expression|
|`MV_OFFSET(arr,long)`|Returns the array element at the 0 based index supplied, or null for an out of range index|
|`MV_ORDINAL(arr,long)`|Returns the array element at the 1 based index supplied, or null for an out of range index|
|`MV_CONTAINS(arr,expr)`|Returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0|
|`MV_OVERLAP(arr1,arr2)`|Returns 1 if arr1 and arr2 have any elements in common, else 0|
|`MV_OFFSET_OF(arr,expr)`|Returns the 0 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `-1`, or `null` if `druid.generic.useDefaultValueForNull=false`|
|`MV_ORDINAL_OF(arr,expr)`|Returns the 1 based index of the first occurrence of expr in the array; if no matching elements exist in the array, returns `-1`, or `null` if `druid.generic.useDefaultValueForNull=false`|
|`MV_PREPEND(expr,arr)`|Adds expr to arr at the beginning, the resulting array type determined by the type of the array|
|`MV_APPEND(arr1,expr)`|Appends expr to arr, the resulting array type determined by the type of the first array|
|`MV_CONCAT(arr1,arr2)`|Concatenates 2 arrays, the resulting array type determined by the type of the first array|
|`MV_SLICE(arr,start,end)`|Returns the subarray of arr from the 0 based index start (inclusive) to end (exclusive), or `null` if start is less than 0, greater than the length of arr, or greater than end|
|`MV_TO_STRING(arr,str)`|Joins all elements of arr by the delimiter specified by str|
|`STRING_TO_MV(str1,str2)`|Splits str1 into an array on the delimiter specified by str2|
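For example, a hedged sketch combining a few of these functions; the datasource `my_datasource` and multi-value column `tags` are illustrative names.

```sql
-- Filter to rows whose "tags" contains 't3', then inspect the row's values.
SELECT
  MV_LENGTH(tags)         AS tag_count,
  MV_OFFSET(tags, 0)      AS first_tag,
  MV_TO_STRING(tags, ',') AS all_tags
FROM my_datasource
WHERE MV_CONTAINS(tags, 't3')
LIMIT 10;
```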

### Other functions

|Function|Notes|
@@ -280,6 +356,7 @@ simplest way to write literal timestamps in other time zones is to use TIME_PARS

|`COALESCE(value1, value2, ...)`|Returns the first value that is neither NULL nor empty string.|
|`NVL(expr,expr-for-null)`|Returns 'expr-for-null' if 'expr' is null (or empty string for string type).|
|`BLOOM_FILTER_TEST(<expr>, <serialized-filter>)`|Returns true if the value is contained in the base64 serialized bloom filter. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.|

### Unsupported features

Druid does not support all SQL features, including:
@@ -291,57 +368,10 @@ Druid does not support all SQL features, including:

Additionally, some Druid features are not supported by the SQL language. Some unsupported Druid features include:

- [Set operations on DataSketches aggregators](../development/extensions-core/datasketches-extension.html).
- [Spatial filters](../development/geo.html).
- [Query cancellation](querying.html#query-cancellation).

## Query execution