---
id: virtual-columns
title: "Virtual columns"
---

Apache Druid supports two query languages: Druid SQL and native queries. This document describes the native language. For information about functions available in SQL, refer to the SQL documentation.

Virtual columns are queryable column "views" created from a set of columns during a query.

A virtual column can draw from multiple underlying columns, but it always presents itself as a single column.

Virtual columns can be referenced by their output names to be used as dimensions or as inputs to filters and aggregators.

Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query defines two expression virtual columns:

{
  "queryType": "scan",
  "dataSource": "page_data",
  "columns": [],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo', page)",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "tripleWordCount",
      "expression": "wordCount * 3",
      "outputType": "LONG"
    }
  ],
  "intervals": [
    "2013-01-01/2019-01-02"
  ]
}
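
The scan query above only declares virtual columns. As a hedged sketch of how they can be referenced by output name elsewhere in a query, the following groupBy query uses fooPage as a dimension and tripleWordCount as the input to both a bound filter and a longSum aggregator. The lower bound of 30 and the aggregator name sumTripleWordCount are illustrative assumptions, not values from the original example.

```json
{
  "queryType": "groupBy",
  "dataSource": "page_data",
  "granularity": "all",
  "intervals": ["2013-01-01/2019-01-02"],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo', page)",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "tripleWordCount",
      "expression": "wordCount * 3",
      "outputType": "LONG"
    }
  ],
  "dimensions": ["fooPage"],
  "filter": {
    "type": "bound",
    "dimension": "tripleWordCount",
    "lower": "30",
    "ordering": "numeric"
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "sumTripleWordCount",
      "fieldName": "tripleWordCount"
    }
  ]
}
```

Each virtual column is defined once in virtualColumns and then referred to purely by name, exactly as a physical column would be.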

Virtual column types

Expression virtual column

Expression virtual columns use Druid's native expression system to define query-time transforms of inputs from one or more columns.

The expression virtual column has the following syntax:

{
  "type": "expression",
  "name": <name of the virtual column>,
  "expression": <row expression>,
  "outputType": <output value type of expression>
}

| property | description | required? |
|---|---|---|
| `type` | Must be `"expression"` to indicate that this is an expression virtual column. | yes |
| `name` | The name of the virtual column. | yes |
| `expression` | An expression that takes a row as input and outputs a value for the virtual column. | yes |
| `outputType` | The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types. | no, default is FLOAT |
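
For illustration, the sketch below shows two expression virtual columns that rely on the properties above. The first combines two input columns; the second sets outputType explicitly so that a string result is not coerced to the FLOAT default. The input columns added and deleted and both output names are hypothetical.

```json
[
  {
    "type": "expression",
    "name": "netChange",
    "expression": "added - deleted",
    "outputType": "LONG"
  },
  {
    "type": "expression",
    "name": "pageLengthBucket",
    "expression": "if(wordCount > 100, 'long', 'short')",
    "outputType": "STRING"
  }
]
```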

Nested field virtual column

The nested field virtual column is an optimized virtual column that provides direct access to paths within a COMPLEX<json> column, including use of their indexes.

This virtual column backs the SQL operators JSON_VALUE (when processFromRaw is false) and JSON_QUERY (when processFromRaw is true). To determine what to select from the column, it accepts a path string in 'JSONPath' or 'jq' syntax, or a parsed list of "path parts".

You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce the same output value; each shows a different way to specify the nested value to access. The first uses a JSONPath path, the second a jq path, and the third pathParts.

    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v0",
      "expectedType": "STRING",
      "path": "$.phoneNumbers[1].number"
    }
    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v1",
      "expectedType": "STRING",
      "path": ".phoneNumbers[1].number",
      "useJqSyntax": true
    }
    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v2",
      "expectedType": "STRING",
      "pathParts": [
        {
          "type": "field",
          "field": "phoneNumbers"
        },
        {
          "type": "arrayElement",
          "index": 1
        },
        {
          "type": "field",
          "field": "number"
        }
      ]
    }

| property | description | required? |
|---|---|---|
| `type` | Must be `"nested-field"` to indicate that this is a nested field virtual column. | yes |
| `columnName` | The name of the `COMPLEX<json>` input column. | yes |
| `outputName` | The name of the virtual column. | yes |
| `expectedType` | The native Druid output type of the column. Druid coerces the output to this type if it does not match the underlying data. This can be `STRING`, `LONG`, `FLOAT`, `DOUBLE`, or `COMPLEX<json>`. Extracting `ARRAY` types is not yet supported. | no, default `STRING` |
| `pathParts` | The parsed path parts used to locate the nested values. `path` will be translated into `pathParts` internally. One of `path` or `pathParts` must be set. | no, if `path` is defined |
| `processFromRaw` | If set to true, the virtual column will process the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as a `COMPLEX<json>` at the cost of much slower performance. | no, default false |
| `path` | 'JSONPath' (or 'jq') syntax path. One of `path` or `pathParts` must be set. | no, if `pathParts` is defined |
| `useJqSyntax` | If true, parse `path` using 'jq' syntax instead of 'JSONPath'. | no, default is false |
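
The examples above all extract a literal STRING value. As a sketch of the processFromRaw option described in the table, the following definition selects the whole phoneNumbers array as a COMPLEX<json> value, accepting the slower raw-processing path. The output name v3 is an arbitrary choice.

```json
{
  "type": "nested-field",
  "columnName": "shipTo",
  "outputName": "v3",
  "expectedType": "COMPLEX<json>",
  "path": "$.phoneNumbers",
  "processFromRaw": true
}
```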

Nested path part

Specify pathParts as an array of objects that describe each component of the path to traverse. Each object can take the following properties:

| property | description | required? |
|---|---|---|
| `type` | Must be `'field'` or `'arrayElement'`. Use `field` when accessing a specific field in a nested structure. Use `arrayElement` when accessing a specific integer position of an array (zero-based). | yes |
| `field` | The name of the field in a `'field'` type path part. | yes, if `type` is `'field'` |
| `index` | The array element index in an `'arrayElement'` type path part. | yes, if `type` is `'arrayElement'` |
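
To make the zero-based arrayElement index concrete, assume a row whose shipTo column holds the hypothetical value below (the name and numbers are invented for illustration). The path parts field phoneNumbers, arrayElement 1, field number would then select the number of the second entry, 555-0102, rather than the first.

```json
{
  "name": "Alice",
  "phoneNumbers": [
    { "type": "home", "number": "555-0101" },
    { "type": "work", "number": "555-0102" }
  ]
}
```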

See Nested columns for more information on ingesting and storing nested data.

List filtered virtual column

This virtual column provides an alternative way to use the 'list filtered' dimension spec as a virtual column. It has optimized access to the underlying column value indexes, which can provide a small performance improvement in some cases.

    {
      "type": "mv-filtered",
      "name": "filteredDim3",
      "delegate": "dim3",
      "values": ["hello", "world"],
      "isAllowList": true
    }

| property | description | required? |
|---|---|---|
| `type` | Must be `"mv-filtered"` to indicate that this is a list filtered virtual column. | yes |
| `name` | The output name of the virtual column. | yes |
| `delegate` | The name of the multi-value STRING input column to filter. | yes |
| `values` | Set of STRING values to allow or deny. | yes |
| `isAllowList` | If true, the output of the virtual column is limited to the set specified by `values`; otherwise it provides all values except those specified. | no, default true |
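
As a sketch of the deny-list form, setting isAllowList to false inverts the earlier example. The output name filteredDim3Deny is a hypothetical choice.

```json
{
  "type": "mv-filtered",
  "name": "filteredDim3Deny",
  "delegate": "dim3",
  "values": ["hello", "world"],
  "isAllowList": false
}
```

For a hypothetical row where dim3 holds ["hello", "world", "druid"], the allow-list definition would expose hello and world, while this definition would expose only druid.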