druid/docs/querying/filters.md

18 KiB
Raw Blame History

id title sidebar_label
filters Query filters Filters

Apache Druid supports two query languages: Druid SQL and native queries. This document describes the native language. For information about aggregators available in SQL, refer to the SQL documentation.

A filter is a JSON object indicating which rows of data should be included in the computation for a query. Its essentially the equivalent of the WHERE clause in SQL. Filters are commonly applied on dimensions, but can be applied on aggregated metrics, for example, see Filtered aggregator and Having filters.

Apache Druid supports the following types of filters.

Selector filter

The simplest filter is a selector filter. The selector filter will match a specific dimension with a specific value. Selector filters can be used as the base filters for more complex Boolean expressions of filters.

The grammar for a SELECTOR filter is as follows:

"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }

This is the equivalent of WHERE <dimension_string> = '<dimension_value_string>' or WHERE <dimension_string> IS NULL (if the value is null).

The selector filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Column comparison filter

The column comparison filter is similar to the selector filter, but instead compares dimensions to each other. For example:

"filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is the equivalent of WHERE <dimension_a> = <dimension_b>.

dimensions is list of DimensionSpecs, making it possible to apply an extraction function if needed.

Regular expression filter

The regular expression filter is similar to the selector filter, but using regular expressions. It matches the specified dimension with the given pattern. The pattern can be any standard Java regular expression.

"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

The regex filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Logical expression filters

AND

The grammar for an AND filter is as follows:

"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }

The filters in fields can be any other filter defined on this page.

OR

The grammar for an OR filter is as follows:

"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }

The filters in fields can be any other filter defined on this page.

NOT

The grammar for a NOT filter is as follows:

"filter": { "type": "not", "field": <filter> }

The filter specified at field can be any other filter defined on this page.

JavaScript filter

The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.

The function takes a single argument, the dimension value, and returns either true or false.

{
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

Example The following matches any dimension values for the dimension name between 'bar' and 'foo'

{
  "type" : "javascript",
  "dimension" : "name",
  "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"
}

The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

JavaScript-based functionality is disabled by default. Please refer to the Druid JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Extraction filter

The extraction filter is now deprecated. The selector filter with an extraction function specified provides identical functionality and should be used instead.

Extraction filter matches a dimension using some specific Extraction function. The following filter matches the values for which the extraction function has transformation entry input_key=output_value where output_value is equal to the filter value and input_key is present as dimension.

Example The following matches dimension values in [product_1, product_3, product_5] for the column product

{
    "filter": {
        "type": "extraction",
        "dimension": "product",
        "value": "bar_1",
        "extractionFn": {
            "type": "lookup",
            "lookup": {
                "type": "map",
                "map": {
                    "product_1": "bar_1",
                    "product_5": "bar_1",
                    "product_3": "bar_1"
                }
            }
        }
    }
}

Search filter

Search filters can be used to filter on partial string matches.

{
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "insensitive_contains",
          "value": "foo"
        }
    }
}
property description required?
type This String should always be "search". yes
dimension The dimension to perform the search over. yes
query A JSON object for the type of search. See search query spec for more information. yes
extractionFn Extraction function to apply to the dimension no

The search filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Search query spec

Contains

property description required?
type This String should always be "contains". yes
value A String value to run the search over. yes
caseSensitive Whether two string should be compared as case sensitive or not no (default == false)

Insensitive Contains

property description required?
type This String should always be "insensitive_contains". yes
value A String value to run the search over. yes

Note that an "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).

Fragment

property description required?
type This String should always be "fragment". yes
values A JSON array of String values to run the search over. yes
caseSensitive Whether strings should be compared as case sensitive or not. Default: false(insensitive) no

In filter

In filter can be used to express the following SQL query:

 SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')

The grammar for a "in" filter is as follows:

{
    "type": "in",
    "dimension": "outlaw",
    "values": ["Good", "Bad", "Ugly"]
}

The "in" filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

If an empty values array is passed to the "in" filter, it will simply return an empty result.

If the dimension is a multi-valued dimension, the "in" filter will return true if one of the dimension values is in the values array.

If the values array contains null, the "in" filter matches null values. This differs from the SQL IN filter, which does not match NULL values.

Like filter

Like filters can be used for basic wildcard searches. They are equivalent to the SQL LIKE operator. Special characters supported are "%" (matches any number of characters) and "_" (matches any one character).

property type description required?
type String This should always be "like". yes
dimension String The dimension to filter on yes
pattern String LIKE pattern, such as "foo%" or "___bar". yes
escape String An escape character that can be used to escape special characters. no
extractionFn Extraction function Extraction function to apply to the dimension no

Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.

This Like filter expresses the condition last_name LIKE "D%" (i.e. last_name starts with "D").

{
    "type": "like",
    "dimension": "last_name",
    "pattern": "D%"
}

Bound filter

Bound filters can be used to filter on ranges of dimension values. It can be used for comparison filtering like greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).

property type description required?
type String This should always be "bound". yes
dimension String The dimension to filter on yes
lower String The lower bound for the filter no
upper String The upper bound for the filter no
lowerStrict Boolean Perform strict comparison on the lower bound (">" instead of ">=") no, default: false
upperStrict Boolean Perform strict comparison on the upper bound ("<" instead of "<=") no, default: false
ordering String Specifies the sorting order to use when comparing values against the bound. Can be one of the following values: "lexicographic", "alphanumeric", "numeric", "strlen", "version". See Sorting Orders for more details. no, default: "lexicographic"
extractionFn Extraction function Extraction function to apply to the dimension no

Bound filters support the use of extraction functions, see Filtering with Extraction Functions for details.

The following bound filter expresses the condition 21 <= age <= 31:

{
    "type": "bound",
    "dimension": "age",
    "lower": "21",
    "upper": "31" ,
    "ordering": "numeric"
}

This filter expresses the condition foo <= name <= hoo, using the default lexicographic sorting order.

{
    "type": "bound",
    "dimension": "name",
    "lower": "foo",
    "upper": "hoo"
}

Using strict bounds, this filter expresses the condition 21 < age < 31

{
    "type": "bound",
    "dimension": "age",
    "lower": "21",
    "lowerStrict": true,
    "upper": "31" ,
    "upperStrict": true,
    "ordering": "numeric"
}

The user can also specify a one-sided bound by omitting "upper" or "lower". This filter expresses age < 31.

{
    "type": "bound",
    "dimension": "age",
    "upper": "31" ,
    "upperStrict": true,
    "ordering": "numeric"
}

Likewise, this filter expresses age >= 18

{
    "type": "bound",
    "dimension": "age",
    "lower": "18" ,
    "ordering": "numeric"
}

Interval filter

The Interval filter enables range filtering on columns that contain long millisecond values, with the boundaries specified as ISO 8601 time intervals. It is suitable for the __time column, long metric columns, and dimensions with values that can be parsed as long milliseconds.

This filter converts the ISO 8601 intervals to long millisecond start/end ranges and translates to an OR of Bound filters on those millisecond ranges, with numeric comparison. The Bound filters will have left-closed and right-open matching (i.e., start <= time < end).

property type description required?
type String This should always be "interval". yes
dimension String The dimension to filter on yes
intervals Array A JSON array containing ISO-8601 interval strings. This defines the time ranges to filter on. yes
extractionFn Extraction function Extraction function to apply to the dimension no

The interval filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

If an extraction function is used with this filter, the extraction function should output values that are parseable as long milliseconds.

The following example filters on the time ranges of October 1-7, 2014 and November 15-16, 2014.

{
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}

The filter above is equivalent to the following OR of Bound filters:

{
    "type": "or",
    "fields": [
      {
        "type": "bound",
        "dimension": "__time",
        "lower": "1412121600000",
        "lowerStrict": false,
        "upper": "1412640000000" ,
        "upperStrict": true,
        "ordering": "numeric"
      },
      {
         "type": "bound",
         "dimension": "__time",
         "lower": "1416009600000",
         "lowerStrict": false,
         "upper": "1416096000000" ,
         "upperStrict": true,
         "ordering": "numeric"
      }
    ]
}

Filtering with extraction functions

All filters except the "spatial" filter support extraction functions. An extraction function is defined by setting the "extractionFn" field on a filter. See Extraction function for more details on extraction functions.

If specified, the extraction function will be used to transform input values before the filter is applied. The example below shows a selector filter combined with an extraction function. This filter will transform input values according to the values defined in the lookup map; transformed values will then be matched with the string "bar_1".

Example The following matches dimension values in [product_1, product_3, product_5] for the column product

{
    "filter": {
        "type": "selector",
        "dimension": "product",
        "value": "bar_1",
        "extractionFn": {
            "type": "lookup",
            "lookup": {
                "type": "map",
                "map": {
                    "product_1": "bar_1",
                    "product_5": "bar_1",
                    "product_3": "bar_1"
                }
            }
        }
    }
}

Column types

Druid supports filtering on timestamp, string, long, and float columns.

Note that only string columns have bitmap indexes. Therefore, queries that filter on other column types will need to scan those columns.

Filtering on numeric columns

When filtering on numeric columns, you can write filters as if they were strings. In most cases, your filter will be converted into a numeric predicate and will be applied to the numeric column values directly. In some cases (such as the "regex" filter) the numeric column values will be converted to strings during the scan.

For example, filtering on a specific value, myFloatColumn = 10.1:

"filter": {
  "type": "selector",
  "dimension": "myFloatColumn",
  "value": "10.1"
}

Filtering on a range of values, 10 <= myFloatColumn < 20:

"filter": {
  "type": "bound",
  "dimension": "myFloatColumn",
  "ordering": "numeric",
  "lower": "10",
  "lowerStrict": false,
  "upper": "20",
  "upperStrict": true
}

Filtering on the timestamp column

Query filters can also be applied to the timestamp column. The timestamp column has long millisecond values. To refer to the timestamp column, use the string __time as the dimension name. Like numeric dimensions, timestamp filters should be specified as if the timestamp values were strings.

If the user wishes to interpret the timestamp with a specific format, timezone, or locale, the Time Format Extraction Function is useful.

For example, filtering on a long timestamp value:

"filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "124457387532"
}

Filtering on day of week:

"filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "Friday",
  "extractionFn": {
    "type": "timeFormat",
    "format": "EEEE",
    "timeZone": "America/New_York",
    "locale": "en"
  }
}

Filtering on a set of ISO 8601 intervals:

{
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}

True filter

The true filter is a filter which matches all values. It can be used to temporarily disable other filters without removing the filter.


{ "type" : "true" }

Expression filter

The expression filter allows for the implementation of arbitrary conditions, leveraging the Druid expression system.

This filter allows for more flexibility, but it might be less performant than a combination of the other filters on this page due to the fact that not all filter optimizations are in place yet.


{ 
    "type" : "expression" ,
    "expression" : "((product_type == 42) && (!is_deleted))"
}

See the Druid expression system for more details.