basic docs for nested column query functions (#12922)

* basic docs for nested column query functions
Clint Wylie 2022-08-19 17:12:19 -07:00 committed by GitHub
parent 69fe1f04e5
commit f8097ccfaa
7 changed files with 313 additions and 5 deletions

View File

@ -170,7 +170,6 @@ See javadoc of java.lang.Math for detailed explanation for each function.
|toradians|toradians(x) converts an angle measured in degrees to an approximately equivalent angle measured in radians|
|ulp|ulp(x) returns the size of an ulp of the argument x|
## Array functions
| function | description |
@ -227,6 +226,34 @@ map((x) -> x + 1, x)
```
in this case, the `x` when evaluating `x + 1` is the lambda argument, thus an element of the multi-valued column `x`, rather than the column `x` itself.
## JSON functions
JSON functions provide facilities to extract, transform, and create `COMPLEX<json>` values.
| function | description |
|---|---|
| json_value(expr, path) | Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from `expr` at the JSONPath given by `path`. |
| json_query(expr, path) | Extract a `COMPLEX<json>` value from `expr` at the JSONPath given by `path`. |
| json_object(expr1, expr2[, expr3, expr4 ...]) | Construct a `COMPLEX<json>` object from alternating 'key' and 'value' arguments. |
| parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX<json>` value. If the input is not a `STRING` or is invalid JSON, this function results in an error. |
| try_parse_json(expr) | Deserialize a JSON `STRING` into a `COMPLEX<json>` value. If the input is not a `STRING` or is invalid JSON, this function results in a `NULL` value. |
| to_json_string(expr) | Convert `expr` into a JSON `STRING` value. |
| json_keys(expr, path) | Get an array of field names from `expr` at the specified JSONPath `path`, or `null` if the data does not exist or has no fields. |
| json_paths(expr) | Get an array of all JSONPath paths available from `expr`. |
### JSONPath syntax
Druid supports a small, simplified subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures.
|Operator|Description|
| --- | --- |
|`$`| Root element. All JSONPath expressions start with this operator. |
|`.<name>`| Child element in dot notation. |
|`['<name>']`| Child element in bracket notation. |
|`[<number>]`| Array index. |
See [SQL JSON documentation](../querying/sql-json-functions.md#jsonpath-syntax) for examples.
## Reduction functions
Reduction functions operate on zero or more expressions and return a single expression. If no expressions are passed as

View File

@ -33,7 +33,7 @@ Columns in Druid are associated with a specific data type. This topic describes
Druid natively supports five basic column types: "long" (64 bit signed int), "float" (32 bit float), "double" (64 bit
float) "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like
hyperUnique and approxHistogram columns).
json, hyperUnique, and approxHistogram columns).
Timestamps (including the `__time` column) are treated by Druid as longs, with the value being the number of
milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. Therefore, timestamps in Druid do not carry any
@ -112,3 +112,17 @@ When `druid.expressions.useStrictBooleans = false` (the default mode), Druid use
When `druid.expressions.useStrictBooleans = true`, Druid uses three-valued logic for
[expressions](../misc/math-expr.md) evaluation, such as `expression` virtual columns or `expression` filters.
However, even in this mode, Druid uses two-valued logic for filter types other than `expression`.
## Nested columns
Druid supports storing nested data structures in segments using the native `COMPLEX<json>` type. You can interact
with this data using [JSON functions](sql-json-functions.md), which can extract nested values, parse from string,
serialize to string, and create new `COMPLEX<json>` structures.
`COMPLEX` types currently have limited functionality outside of the specialized functions that understand them, and so
have undefined behavior when:
* grouping on complex values
* filtering directly on complex values, such as `WHERE json is NULL`
* using complex values as inputs to aggregators without specialized handling for a specific complex type
In many cases, functions are provided to translate `COMPLEX` values to `STRING`, which serves as a workaround until
`COMPLEX` type functionality can be improved.
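For example, a minimal sketch of this workaround, assuming a hypothetical datasource `orders` with a nested `shipTo` column, groups on a serialized form of a nested object rather than on the `COMPLEX<json>` value itself:
```sql
-- Serialize the nested object to a STRING so it can be grouped on directly
SELECT
  TO_JSON_STRING(JSON_QUERY("shipTo", '$.address')) AS ship_to_address,
  COUNT(*) AS order_count
FROM "orders"
GROUP BY 1
```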

View File

@ -647,6 +647,46 @@ Parses `address` into an IPv4 address stored as an integer.
Converts `address` into an IPv4 address in dot-decimal notation.
## JSON_KEYS
**Function type:** [JSON](sql-json-functions.md)
`JSON_KEYS(expr, path)`
Returns an array of field names from `expr` at the specified `path`.
## JSON_OBJECT
**Function type:** [JSON](sql-json-functions.md)
`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])`
Constructs a new `COMPLEX<json>` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX<json>` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.
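For example, a minimal sketch using constant values, showing both the `KEY`/`VALUE` form and the equivalent colon form:
```sql
SELECT
  JSON_OBJECT(KEY 'x' VALUE 1, KEY 'y' VALUE JSON_OBJECT(KEY 'z' VALUE 'hello')) AS obj_with_keywords,
  JSON_OBJECT('x':1, 'y':JSON_OBJECT('z':'hello')) AS obj_with_colons
```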
## JSON_PATHS
**Function type:** [JSON](sql-json-functions.md)
`JSON_PATHS(expr)`
Returns an array of all paths which refer to literal values in `expr` in JSONPath format.
## JSON_QUERY
**Function type:** [JSON](sql-json-functions.md)
`JSON_QUERY(expr, path)`
Extracts a `COMPLEX<json>` value from `expr`, at the specified `path`.
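For example, a minimal sketch contrasting `JSON_QUERY` with `JSON_VALUE`, assuming a hypothetical datasource `orders` whose nested `shipTo` column contains an `address` object:
```sql
SELECT
  JSON_QUERY("shipTo", '$.address') AS address_object,         -- returns a COMPLEX<json> object
  JSON_VALUE("shipTo", '$.address.country') AS address_country -- returns a literal value
FROM "orders"
```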
## JSON_VALUE
**Function type:** [JSON](sql-json-functions.md)
`JSON_VALUE(expr, path [RETURNING sqlType])`
Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc.), the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`.
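For example, a minimal sketch assuming a hypothetical datasource `orders` with a nested `details` column containing a numeric `price` field:
```sql
SELECT
  JSON_VALUE("details", '$.price') AS price_inferred,               -- type inferred from context, VARCHAR if it cannot be inferred
  JSON_VALUE("details", '$.price' RETURNING DOUBLE) AS price_double -- planned as DOUBLE
FROM "orders"
```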
## LATEST
`LATEST(expr)`
@ -899,6 +939,14 @@ Returns NULL if two values are equal, else returns the first value.
Returns `e2` if `e1` is null, else returns `e1`.
## PARSE_JSON
**Function type:** [JSON](sql-json-functions.md)
`PARSE_JSON(expr)`
Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error.
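For example, a minimal sketch using a constant JSON string:
```sql
SELECT PARSE_JSON('{"x": 1, "y": [1, 2, 3]}') AS parsed  -- a COMPLEX<json> value
```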
## PARSE_LONG
`PARSE_LONG(<CHARACTER>, [<INTEGER>])`
@ -1267,6 +1315,15 @@ Adds a certain amount of time to a given timestamp.
Takes the difference between two timestamps, returning the results in the given units.
## TO_JSON_STRING
**Function type:** [JSON](sql-json-functions.md)
`TO_JSON_STRING(expr)`
Serializes `expr` into a JSON string.
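For example, a minimal sketch that round-trips a constant value back to a JSON string:
```sql
SELECT TO_JSON_STRING(PARSE_JSON('{"x": 1}')) AS json_string  -- a VARCHAR containing the serialized JSON
```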
## TRIM
`TRIM([BOTH|LEADING|TRAILING] [<chars> FROM] expr)`
@ -1291,6 +1348,16 @@ Alias for [`TRUNCATE`](#truncate).
Truncates a numerical expression to a specific number of decimal digits.
## TRY_PARSE_JSON
**Function type:** [JSON](sql-json-functions.md)
`TRY_PARSE_JSON(expr)`
Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.
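For example, a minimal sketch contrasting this with `PARSE_JSON`, which would raise an error for the same input:
```sql
SELECT TRY_PARSE_JSON('not valid json') AS parsed  -- returns NULL instead of an error
```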
## UPPER
`UPPER(expr)`

View File

@ -0,0 +1,71 @@
---
id: sql-json-functions
title: "SQL JSON functions"
sidebar_label: "JSON functions"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
<!--
The format of the tables that describe the functions and operators
should not be changed without updating the script create-sql-docs
in web-console/script/create-sql-docs, because the script detects
patterns in this markdown file and parse it to TypeScript file for web console
-->
Druid supports nested columns, which provide optimized storage and indexes for nested data structures. Use
the following JSON functions to extract, transform, and create `COMPLEX<json>` values.
| Function | Notes |
| --- | --- |
|`JSON_KEYS(expr, path)`| Returns an array of field names from `expr` at the specified `path`.|
|`JSON_OBJECT(KEY expr1 VALUE expr2[, KEY expr3 VALUE expr4, ...])` | Constructs a new `COMPLEX<json>` object. The `KEY` expressions must evaluate to string types. The `VALUE` expressions can be composed of any input type, including other `COMPLEX<json>` values. `JSON_OBJECT` can accept colon-separated key-value pairs. The following syntax is equivalent: `JSON_OBJECT(expr1:expr2[, expr3:expr4, ...])`.|
|`JSON_PATHS(expr)`| Returns an array of all paths which refer to literal values in `expr` in JSONPath format. |
|`JSON_QUERY(expr, path)`| Extracts a `COMPLEX<json>` value from `expr`, at the specified `path`. |
|`JSON_VALUE(expr, path [RETURNING sqlType])`| Extracts a literal value from `expr` at the specified `path`. If you specify `RETURNING` and an SQL type name (such as `VARCHAR`, `BIGINT`, `DOUBLE`, etc.), the function plans the query using the suggested type. Otherwise, it attempts to infer the type based on the context. If it can't infer the type, it defaults to `VARCHAR`.|
|`PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in an error.|
|`TRY_PARSE_JSON(expr)`|Parses `expr` into a `COMPLEX<json>` object. This operator deserializes JSON values when processing them, translating stringified JSON into a nested structure. If the input is not a `VARCHAR` or it is invalid JSON, this function will result in a `NULL` value.|
|`TO_JSON_STRING(expr)`|Serializes `expr` into a JSON string.|
### JSONPath syntax
Druid supports a subset of the [JSONPath syntax](https://github.com/json-path/JsonPath/blob/master/README.md) operators, primarily limited to extracting individual values from nested data structures.
|Operator|Description|
| --- | --- |
|`$`| Root element. All JSONPath expressions start with this operator. |
|`.<name>`| Child element in dot notation. |
|`['<name>']`| Child element in bracket notation. |
|`[<number>]`| Array index. |
Consider the following example input JSON:
```json
{"x":1, "y":[1, 2, 3]}
```
- To return the entire JSON object:<br>
`$` -> `{"x":1, "y":[1, 2, 3]}`
- To return the value of the key "x":<br>
`$.x` -> `1`
- For a key that contains an array, to return the entire array:<br>
`$['y']` -> `[1, 2, 3]`
- For a key that contains an array, to return an item in the array:<br>
`$.y[1]` -> `2`
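The same paths can be used with the JSON functions above. For example, a minimal sketch assuming a hypothetical datasource `example` with a nested column `nested` containing values like the input JSON above:
```sql
SELECT
  JSON_VALUE("nested", '$.x') AS x,                           -- 1
  JSON_VALUE("nested", '$.y[1]') AS y_second_element,         -- 2
  TO_JSON_STRING(JSON_QUERY("nested", '$.y')) AS y_as_string  -- the array serialized as a JSON string
FROM "example"
```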

View File

@ -31,7 +31,7 @@ Virtual columns are queryable column "views" created from a set of columns durin
A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.
Virtual columns can be used as dimensions or as inputs to aggregators.
Virtual columns can be referenced by their output names to be used as [dimensions](./dimensionspecs.md) or as inputs to [filters](./filters.md) and [aggregators](./aggregations.md).
Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:
@ -64,6 +64,8 @@ Each Apache Druid query can accept a list of virtual columns as a parameter. The
## Virtual column types
### Expression virtual column
Expression virtual columns use Druid's native [expression](../misc/math-expr.md) system to allow defining query-time
transforms of inputs from one or more columns.
The expression virtual column has the following syntax:
@ -78,6 +80,111 @@ The expression virtual column has the following syntax:
|property|description|required?|
|--------|-----------|---------|
|type|Must be `"expression"` to indicate that this is an expression virtual column.|yes|
|name|The name of the virtual column.|yes|
|expression|An [expression](../misc/math-expr.md) that takes a row as input and outputs a value for the virtual column.|yes|
|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, or STRING.|no, default is FLOAT|
|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types.|no, default is FLOAT|
### Nested field virtual column
The nested field virtual column is an optimized virtual column that provides direct access to paths within a
`COMPLEX<json>` column, including the use of their indexes.
This virtual column backs the SQL operators `JSON_VALUE` (when `processFromRaw` is false) and `JSON_QUERY`
(when `processFromRaw` is true). It accepts a 'JSONPath' or 'jq' string representation of a path, or a parsed
list of "path parts", to determine what to select from the column.
You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce
the same output value, with each example showing a different way to specify how to access the nested value: the first
uses a JSONPath `path`, the second a jq `path`, and the third `pathParts`.
```json
{
"type": "nested-field",
"columnName": "shipTo",
"outputName": "v0",
"expectedType": "STRING",
"path": "$.phoneNumbers[1].number"
}
```
```json
{
"type": "nested-field",
"columnName": "shipTo",
"outputName": "v1",
"expectedType": "STRING",
"path": ".phoneNumbers[1].number",
"useJqSyntax": true
}
```
```json
{
"type": "nested-field",
"columnName": "shipTo",
"outputName": "v2",
"expectedType": "STRING",
"pathParts": [
{
"type": "field",
"field": "phoneNumbers"
},
{
"type": "arrayElement",
"index": 1
},
{
"type": "field",
"field": "number"
}
]
}
```
|property|description|required?|
|--------|-----------|---------|
|type|Must be `"nested-field"` to indicate that this is a nested field virtual column.|yes|
|columnName|The name of the `COMPLEX<json>` input column.|yes|
|outputName|The name of the virtual column.|yes|
|expectedType|The native Druid output type of the column. Druid coerces the output to this type if it does not match the underlying data. This can be `STRING`, `LONG`, `FLOAT`, `DOUBLE`, or `COMPLEX<json>`. Extracting `ARRAY` types is not yet supported.|no, default `STRING`|
|pathParts|The parsed path parts used to locate the nested values. `path` is translated into `pathParts` internally. One of `path` or `pathParts` must be set.|no, if `path` is defined|
|processFromRaw|If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX<json>` at the cost of much slower performance.|no, default false|
|path|'JSONPath' (or 'jq') syntax path. One of `path` or `pathParts` must be set. |no, if `pathParts` is defined|
|useJqSyntax|If true, parse `path` using 'jq' syntax instead of 'JSONPath'.|no, default is false|
#### Nested path part
Specify `pathParts` as an array of objects that describe each component of the path to traverse. Each object can take the following properties:
|property|description|required?|
|--------|-----------|---------|
|type|Must be 'field' or 'arrayElement'. Use `field` when accessing a specific field in a nested structure. Use `arrayElement` when accessing a specific integer position of an array (zero based).|yes|
|field|The name of the field to access when `type` is 'field'.|yes, if `type` is 'field'|
|index|The zero-based array element index to access when `type` is 'arrayElement'.|yes, if `type` is 'arrayElement'|
### List filtered virtual column
This virtual column provides an alternative way to use a
['list filtered' dimension spec](./dimensionspecs.md#filtered-dimensionspecs) as a virtual column. It has optimized
access to the underlying column's value indexes, which can provide a small performance improvement in some cases.
```json
{
"type": "mv-filtered",
"name": "filteredDim3",
"delegate": "dim3",
"values": ["hello", "world"],
"isAllowList": true
}
```
|property|description|required?|
|--------|-----------|---------|
|type|Must be `"mv-filtered"` to indicate that this is a list filtered virtual column.|yes|
|name|The output name of the virtual column|yes|
|delegate|The name of the multi-value STRING input column to filter|yes|
|values|Set of STRING values to allow or deny|yes|
|isAllowList|If true, the output of the virtual column is limited to the set specified by `values`; otherwise it provides all values _except_ those specified.|no, default true|

View File

@ -23,7 +23,7 @@ const snarkdown = require('snarkdown');
const writefile = 'lib/sql-docs.js';
const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 150;
const MINIMUM_EXPECTED_NUMBER_OF_FUNCTIONS = 158;
const MINIMUM_EXPECTED_NUMBER_OF_DATA_TYPES = 14;
function hasHtmlTags(str) {
@ -63,6 +63,7 @@ const readDoc = async () => {
await fs.readFile('../docs/querying/sql-scalar.md', 'utf-8'),
await fs.readFile('../docs/querying/sql-aggregations.md', 'utf-8'),
await fs.readFile('../docs/querying/sql-multivalue-string-functions.md', 'utf-8'),
await fs.readFile('../docs/querying/sql-json-functions.md', 'utf-8'),
await fs.readFile('../docs/querying/sql-operators.md', 'utf-8'),
].join('\n');

View File

@ -125,6 +125,7 @@ JRE
JS
JSON
JsonPath
JSONPath
JSSE
JVM
JVMs
@ -209,6 +210,7 @@ aggregator
aggregators
ambari
analytics
arrayElement
assumeRoleArn
assumeRoleExternalId
async
@ -225,6 +227,7 @@ backfills
backpressure
base64
big-endian
bigint
blobstore
boolean
breakpoint
@ -261,6 +264,7 @@ dequeued
deserialization
deserialize
deserialized
deserializes
downtimes
druid
druidkubernetes-extensions
@ -269,6 +273,7 @@ encodings
endian
endpointConfig
enum
expectedType
expr
failover
featureSpec
@ -301,9 +306,15 @@ injective
inlined
inSubQueryThreshold
interruptible
isAllowList
jackson-jq
javadoc
joinable
json_keys
json_object
json_paths
json_query
json_value
kerberos
keystore
keytool
@ -343,10 +354,12 @@ noop
numerics
numShards
parameterized
parse_json
parseable
partitioner
partitionFunction
partitionsSpec
pathParts
performant
plaintext
pluggable
@ -377,6 +390,7 @@ prepopulated
preprocessing
priori
procs
processFromRaw
programmatically
proto
proxied
@ -432,12 +446,15 @@ subtask
subtasks
supervisorTaskId
symlink
syntaxes
tiering
timeseries
timestamp
timestamps
to_json_string
tradeoffs
transformSpec
try_parse_json
tsv
ulimit
unannounce
@ -456,6 +473,7 @@ unparsed
unsetting
untrusted
useFilterCNF
useJqSyntax
useSSL
uptime
uris
@ -465,6 +483,7 @@ v1
v2
vCPUs
validator
varchar
vectorizable
vectorize
vectorizeVirtualColumns
@ -1331,6 +1350,8 @@ expm1
expr
expr1
expr2
expr3
expr4
fromIndex
getExponent
hypot