druid/math-expr.md at 77e0c16bce04fc0f9503d323accdbbe14691c9a2

mirror of https://github.com/apache/druid.git synced 2025-02-18 16:12:23 +00:00

* Add REGEXP_REPLACE function.

Replaces all instances of a pattern with a replacement string.

* Fixes.

* Improve test coverage.

* Adjust behavior.

2023-06-29 13:47:57 -07:00

30 KiB

Raw Blame History

id	title
math-expr	Expressions

Apache Druid supports two query languages: native queries and Druid SQL. This document describes the native language. For information about functions available in SQL, refer to the SQL documentation.

Expressions are used in various places in the native query language, including virtual columns and join conditions. They are also generated by most Druid SQL functions during the query translation process.

Expressions are also used at ingestion time inside transformations.

This expression language supports the following operators (listed in decreasing order of precedence).

Operators	Description
!, -	Unary NOT and Minus
^	Binary power op
*, /, %	Binary multiplicative
+, -	Binary additive
<, <=, >, >=, ==, !=	Binary Comparison
&&, \|\|	Binary Logical AND, OR

Long, double, and string data types are supported. If a number contains a dot, it is interpreted as a double, otherwise it is interpreted as a long. That means, always add a '.' to your number if you want it interpreted as a double value. String literals should be quoted by single quotation marks.

Additionally, the expression language supports long, double, and string arrays. Array literals are created by wrapping square brackets around a list of scalar literals values delimited by a comma or space character. All values in an array literal must be the same type, however null values are accepted. Typed empty arrays may be defined by prefixing with their type in angle brackets: <STRING>[], <DOUBLE>[], or <LONG>[].

Expressions can contain variables. Variable names may contain letters, digits, '_' and '$'. Variable names must not begin with a digit. To escape other special characters, you can quote it with double quotation marks.

For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string type, it's the evaluation result of 'Boolean.valueOf(string)'.

Multi-value string dimensions are supported and may be treated as either scalar or array typed values, as follows:

When treated as a scalar type, the expression is automatically transformed so that the scalar operation is applied across all values of the multi-valued type, mimicking Druid's native behavior.
Druid coerces values that result in arrays back into the native Druid string type for grouping and aggregation. Grouping on multi-value string dimensions in Druid groups by the individual values, not the 'array'. This behavior produces results similar to an implicit SQL UNNEST operation. Alternatively, you can use the array_to_string function to perform the aggregation on a stringified version of the complete array and therefore preserve the complete row. To transform the stringified dimension back into the true native array type, use string_to_array in an expression post-aggregator.

The following built-in functions are available.

General functions

name	description
cast	cast(expr,LONG or DOUBLE or STRING or ARRAY<LONG>, or ARRAY<DOUBLE> or ARRAY<STRING>) returns expr with specified type. exception can be thrown. Scalar types may be cast to array types and will take the form of a single element list (null will still be null).
if	if(predicate,then,else) returns 'then' if 'predicate' evaluates to a positive number, otherwise it returns 'else'
nvl	nvl(expr,expr-for-null) returns 'expr-for-null' if 'expr' is null (or empty string for string type)
like	like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`
case_searched	case_searched(expr1, result1, [[expr2, result2, ...], else-result]) is similar to `CASE WHEN expr1 THEN result1 [ELSE else_result] END` in SQL
case_simple	case_simple(expr, value1, result1, [[value2, result2, ...], else-result]) is similar to `CASE expr WHEN value THEN result [ELSE else_result] END` in SQL
isnull	isnull(expr) returns 1 if the value is null, else 0
notnull	notnull(expr) returns 1 if the value is not null, else 0
bloom_filter_test	bloom_filter_test(expr, filter) tests the value of 'expr' against 'filter', a bloom filter serialized as a base64 string. See bloom filter extension documentation for additional details.

String functions

name	description
concat	concat(expr, expr...) concatenate a list of strings
format	format(pattern[, args...]) returns a string formatted in the manner of Java's String.format.
like	like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`
lookup	lookup(expr, lookup-name) looks up expr in a registered query-time lookup
parse_long	parse_long(string[, radix]) parses a string as a long with the given radix, or 10 (decimal) if a radix is not provided.
regexp_extract	regexp_extract(expr, pattern[, index]) applies a regular expression pattern and extracts a capture group index, or null if there is no match. If index is unspecified or zero, returns the substring that matched the pattern. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.
regexp_like	regexp_like(expr, pattern) returns whether `expr` matches regular expression `pattern`. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.
regexp_replace	regexp_replace(expr, pattern, replacement) replaces all instances of a regular expression pattern with a given replacement string. The pattern may match anywhere inside `expr`; if you want to match the entire string instead, use the `^` and `$` markers at the start and end of your pattern.
contains_string	contains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-sensitive.
icontains_string	contains_string(expr, string) returns whether `expr` contains `string` as a substring. This method is case-insensitive.
replace	replace(expr, pattern, replacement) replaces pattern with replacement
substring	substring(expr, index, length) behaves like java.lang.String's substring
right	right(expr, length) returns the rightmost length characters from a string
left	left(expr, length) returns the leftmost length characters from a string
strlen	strlen(expr) returns length of a string in UTF-16 code units
strpos	strpos(haystack, needle[, fromIndex]) returns the position of the needle within the haystack, with indexes starting from 0. The search will begin at fromIndex, or 0 if fromIndex is not specified. If the needle is not found then the function returns -1.
trim	trim(expr[, chars]) remove leading and trailing characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.
ltrim	ltrim(expr[, chars]) remove leading characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.
rtrim	rtrim(expr[, chars]) remove trailing characters from `expr` if they are present in `chars`. `chars` defaults to ' ' (space) if not provided.
lower	lower(expr) converts a string to lowercase
upper	upper(expr) converts a string to uppercase
reverse	reverse(expr) reverses a string
repeat	repeat(expr, N) repeats a string N times
lpad	lpad(expr, length, chars) returns a string of `length` from `expr` left-padded with `chars`. If `length` is shorter than the length of `expr`, the result is `expr` which is truncated to `length`. The result will be null if either `expr` or `chars` is null. If `chars` is an empty string, no padding is added, however `expr` may be trimmed if necessary.
rpad	rpad(expr, length, chars) returns a string of `length` from `expr` right-padded with `chars`. If `length` is shorter than the length of `expr`, the result is `expr` which is truncated to `length`. The result will be null if either `expr` or `chars` is null. If `chars` is an empty string, no padding is added, however `expr` may be trimmed if necessary.

Time functions

name	description
timestamp	timestamp(expr[,format-string]) parses string expr into date then returns milliseconds from java epoch. without 'format-string' it's regarded as ISO datetime format
unix_timestamp	same with 'timestamp' function but returns seconds instead
timestamp_ceil	timestamp_ceil(expr, period, [origin, [timezone]]) rounds up a timestamp, returning it as a new timestamp. Period can be any ISO8601 period, like P3M (quarters) or PT12H (half-days). The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".
timestamp_floor	timestamp_floor(expr, period, [origin, [timezone]]) rounds down a timestamp, returning it as a new timestamp. Period can be any ISO8601 period, like P3M (quarters) or PT12H (half-days). The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".
timestamp_shift	timestamp_shift(expr, period, step, [timezone]) shifts a timestamp by a period (step times), returning it as a new timestamp. Period can be any ISO8601 period. Step may be negative. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".
timestamp_extract	timestamp_extract(expr, unit, [timezone]) extracts a time part from expr, returning it as a number. Unit can be EPOCH (number of seconds since 1970-01-01 00:00:00 UTC), SECOND, MINUTE, HOUR, DAY (day of month), DOW (day of week), DOY (day of year), WEEK (week of week year), MONTH (1 through 12), QUARTER (1 through 4), or YEAR. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00"
timestamp_parse	timestamp_parse(string expr, [pattern, [timezone]]) parses a string into a timestamp using a given Joda DateTimeFormat pattern. If the pattern is not provided, this parses time strings in either ISO8601 or SQL format. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00", and will be used as the time zone for strings that do not include a time zone offset. Pattern and time zone must be literals. Strings that cannot be parsed as timestamps will be returned as nulls.
timestamp_format	timestamp_format(expr, [pattern, [timezone]]) formats a timestamp as a string with a given Joda DateTimeFormat pattern, or ISO8601 if the pattern is not provided. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00". Pattern and time zone must be literals.

Math functions

See javadoc of java.lang.Math for detailed explanation for each function.

name	description
abs	abs(x) returns the absolute value of x
acos	acos(x) returns the arc cosine of x
asin	asin(x) returns the arc sine of x
atan	atan(x) returns the arc tangent of x
bitwiseAnd	bitwiseAnd(x,y) returns the result of x & y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
bitwiseComplement	bitwiseComplement(x) returns the result of ~x. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
bitwiseConvertDoubleToLongBits	bitwiseConvertDoubleToLongBits(x) converts the bits of an IEEE 754 floating-point double value to a long. If the input is not a double, it is implicitly cast to a double prior to conversion
bitwiseConvertLongBitsToDouble	bitwiseConvertLongBitsToDouble(x) converts a long to the IEEE 754 floating-point double specified by the bits stored in the long. If the input is not a long, it is implicitly cast to a long prior to conversion
bitwiseOr	bitwiseOr(x,y) returns the result of x [PIPE] y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
bitwiseShiftLeft	bitwiseShiftLeft(x,y) returns the result of x << y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
bitwiseShiftRight	bitwiseShiftRight(x,y) returns the result of x >> y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
bitwiseXor	bitwiseXor(x,y) returns the result of x ^ y. Double values will be implicitly cast to longs, use `bitwiseConvertDoubleToLongBits` to perform bitwise operations directly with doubles
atan2	atan2(y, x) returns the angle theta from the conversion of rectangular coordinates (x, y) to polar * coordinates (r, theta)
cbrt	cbrt(x) returns the cube root of x
ceil	ceil(x) returns the smallest (closest to negative infinity) double value that is greater than or equal to x and is equal to a mathematical integer
copysign	copysign(x) returns the first floating-point argument with the sign of the second floating-point argument
cos	cos(x) returns the trigonometric cosine of x
cosh	cosh(x) returns the hyperbolic cosine of x
cot	cot(x) returns the trigonometric cotangent of an angle x
div	div(x,y) is integer division of x by y
exp	exp(x) returns Euler's number raised to the power of x
expm1	expm1(x) returns e^x-1
floor	floor(x) returns the largest (closest to positive infinity) double value that is less than or equal to x and is equal to a mathematical integer
getExponent	getExponent(x) returns the unbiased exponent used in the representation of x
hypot	hypot(x, y) returns sqrt(x^2+y^2) without intermediate overflow or underflow
log	log(x) returns the natural logarithm of x
log10	log10(x) returns the base 10 logarithm of x
log1p	log1p(x) will the natural logarithm of x + 1
max	max(x, y) returns the greater of two values
min	min(x, y) returns the smaller of two values
nextafter	nextafter(x, y) returns the floating-point number adjacent to the x in the direction of the y
nextUp	nextUp(x) returns the floating-point value adjacent to x in the direction of positive infinity
pi	pi returns the constant value of the π
pow	pow(x, y) returns the value of the x raised to the power of y
remainder	remainder(x, y) returns the remainder operation on two arguments as prescribed by the IEEE 754 standard
rint	rint(x) returns value that is closest in value to x and is equal to a mathematical integer
round	round(x, y) returns the value of the x rounded to the y decimal places. While x can be an integer or floating-point number, y must be an integer. The type of the return value is specified by that of x. y defaults to 0 if omitted. When y is negative, x is rounded on the left side of the y decimal points. If x is `NaN`, x returns 0. If x is infinity, x will be converted to the nearest finite double.
safe_divide	safe_divide(x,y) returns the division of x by y if y is not equal to 0. In case y is 0 it returns 0 or `null` if `druid.generic.useDefaultValueForNull=false`
scalb	scalb(d, sf) returns d * 2^sf rounded as if performed by a single correctly rounded floating-point multiply to a member of the double value set
signum	signum(x) returns the signum function of the argument x
sin	sin(x) returns the trigonometric sine of an angle x
sinh	sinh(x) returns the hyperbolic sine of x
sqrt	sqrt(x) returns the correctly rounded positive square root of x
tan	tan(x) returns the trigonometric tangent of an angle x
tanh	tanh(x) returns the hyperbolic tangent of x
todegrees	todegrees(x) converts an angle measured in radians to an approximately equivalent angle measured in degrees
toradians	toradians(x) converts an angle measured in degrees to an approximately equivalent angle measured in radians
ulp	ulp(x) returns the size of an ulp of the argument x

Array functions

function	description
array(expr1,expr ...)	constructs an array from the expression arguments, using the type of the first argument as the output array type
array_length(arr)	returns length of array expression
array_offset(arr,long)	returns the array element at the 0 based index supplied, or null for an out of range index
array_ordinal(arr,long)	returns the array element at the 1 based index supplied, or null for an out of range index
array_contains(arr,expr)	returns 1 if the array contains the element specified by expr, or contains all elements specified by expr if expr is an array, else 0
array_overlap(arr1,arr2)	returns 1 if arr1 and arr2 have any elements in common, else 0
array_offset_of(arr,expr)	returns the 0 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false`if no matching elements exist in the array.
array_ordinal_of(arr,expr)	returns the 1 based index of the first occurrence of expr in the array, or `-1` or `null` if `druid.generic.useDefaultValueForNull=false` if no matching elements exist in the array.
array_prepend(expr,arr)	adds expr to arr at the beginning, the resulting array type determined by the type of the array
array_append(arr,expr)	appends expr to arr, the resulting array type determined by the type of the first array
array_concat(arr1,arr2)	concatenates 2 arrays, the resulting array type determined by the type of the first array
array_set_add(arr,expr)	adds expr to arr and converts the array to a new array composed of the unique set of elements. The resulting array type determined by the type of the array
array_set_add_all(arr1,arr2)	combines the unique set of elements of 2 arrays, the resulting array type determined by the type of the first array
array_slice(arr,start,end)	return the subarray of arr from the 0 based index start(inclusive) to end(exclusive), or `null`, if start is less than 0, greater than length of arr or less than end
array_to_string(arr,str)	joins all elements of arr by the delimiter specified by str
string_to_array(str1,str2)	splits str1 into an array on the delimiter specified by str2

Apply functions

Apply functions allow for special 'lambda' expressions to be defined and applied to array inputs to enable free-form transformations.

function	description
map(lambda,arr)	applies a transform specified by a single argument lambda expression to all elements of arr, returning a new array
cartesian_map(lambda,arr1,arr2,...)	applies a transform specified by a multi argument lambda expression to all elements of the Cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same
filter(lambda,arr)	filters arr by a single argument lambda, returning a new array with all matching elements, or null if no elements match
fold(lambda,arr,acc)	folds a 2 argument lambda across arr using acc as the initial input value. The first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value.
cartesian_fold(lambda,arr1,arr2,...,acc)	folds a multi argument lambda across the Cartesian product of all input arrays using acc as the initial input value. The first arguments of the lambda are the array elements of each array and the last is the accumulator, returning a single accumulated value.
any(lambda,arr)	returns 1 if any element in the array matches the lambda expression, else 0
all(lambda,arr)	returns 1 if all elements in the array matches the lambda expression, else 0

Lambda expressions syntax

Lambda expressions are a sort of function definition, where new identifiers can be defined and passed as input to the expression body

(identifier1 ...) -> expr

e.g.

(x, y) -> x + y

The identifier arguments of a lambda expression correspond to the elements of the array it is being applied to. For example:

map((x) -> x + 1, some_multi_value_column)

will map each element of some_multi_value_column to the identifier x so that the lambda expression body can be evaluated for each x. The scoping rules are that lambda arguments will override identifiers which are defined externally from the lambda expression body. Using the same example:

map((x) -> x + 1, x)

in this case, the x when evaluating x + 1 is the lambda argument, thus an element of the multi-valued column x, rather than the column x itself.

JSON functions

JSON functions provide facilities to extract, transform, and create COMPLEX<json> values.

function	description
json_value(expr, path[, type])	Extract a Druid literal (`STRING`, `LONG`, `DOUBLE`) value from `expr` using JSONPath syntax of `path`. The optional `type` argument can be set to `'LONG'`,`'DOUBLE'` or `'STRING'` to cast values to that type.
json_query(expr, path)	Extract a `COMPLEX<json>` value from `expr` using JSONPath syntax of `path`
json_object(expr1, expr2[, expr3, expr4 ...])	Construct a `COMPLEX<json>` with alternating 'key' and 'value' arguments
parse_json(expr)	Deserialize a JSON `STRING` into a `COMPLEX<json>`. If the input is not a `STRING` or it is invalid JSON, this function will result in an error.
try_parse_json(expr)	Deserialize a JSON `STRING` into a `COMPLEX<json>`. If the input is not a `STRING` or it is invalid JSON, this function will result in a `NULL` value.
to_json_string(expr)	Convert `expr` into a JSON `STRING` value
json_keys(expr, path)	Get array of field names from `expr` at the specified JSONPath `path`, or null if the data does not exist or have any fields
json_paths(expr)	Get array of all JSONPath paths available from `expr`

JSONPath syntax

Druid supports a small, simplified subset of the JSONPath syntax operators, primarily limited to extracting individual values from nested data structures.

Operator	Description
`$`	Root element. All JSONPath expressions start with this operator.
`.<name>`	Child element in dot notation.
`['<name>']`	Child element in bracket notation.
`[<number>]`	Array index.

See SQL JSON documentation for examples and Nested columns for more information on ingesting and storing nested data.

Reduction functions

Reduction functions operate on zero or more expressions and return a single expression. If no expressions are passed as arguments, then the result is NULL. The expressions must all be convertible to a common data type, which will be the type of the result:

If all arguments are NULL, the result is NULL. Otherwise, NULL arguments are ignored.
If the arguments comprise a mix of numbers and strings, the arguments are interpreted as strings.
If all arguments are integer numbers, the arguments are interpreted as longs.
If all arguments are numbers and at least one argument is a double, the arguments are interpreted as doubles.

function	description
greatest([expr1, ...])	Evaluates zero or more expressions and returns the maximum value based on comparisons as described above.
least([expr1, ...])	Evaluates zero or more expressions and returns the minimum value based on comparisons as described above.

IP address functions

For the IPv4 address functions, the address argument accepts either an IPv4 dotted-decimal string (e.g., "192.168.0.1") or an IP address represented as a long (e.g., 3232235521). Format the subnet argument as an IPv4 address subnet in CIDR notation (e.g., "192.168.0.0/16").

function	description
ipv4_match(address, subnet)	Returns 1 if the `address` belongs to the `subnet` literal, else 0. If `address` is not a valid IPv4 address, then 0 is returned. This function is more efficient if `address` is a long instead of a string.
ipv4_parse(address)	Parses `address` into an IPv4 address stored as a long. Returns `address` if it is already a valid IPv4 integer address. Returns null if `address` cannot be represented as an IPv4 address.
ipv4_stringify(address)	Converts `address` into an IPv4 address dotted-decimal string. Returns `address` if it is already a valid IPv4 dotted-decimal string. Returns null if `address` cannot be represented as an IPv4 address.

Other functions

function	description
human_readable_binary_byte_format(value[, precision])	Format a number in human-readable IEC format. `precision` must be in the range of [0,3] (default: 2). For example: human_readable_binary_byte_format(1048576) returns `1.00 MiB` human_readable_binary_byte_format(1048576, 3) returns `1.000 MiB`
human_readable_decimal_byte_format(value[, precision])	Format a number in human-readable SI format. `precision` must be in the range of [0,3] (default: 2). For example: human_readable_decimal_byte_format(1000000) returns `1.00 MB` human_readable_decimal_byte_format(1000000, 3) returns `1.000 MB`
human_readable_decimal_format(value[, precision])	Format a number in human-readable SI format. `precision` must be in the range of [0,3] (default: 2). For example: human_readable_decimal_format(1000000) returns `1.00 M` human_readable_decimal_format(1000000, 3) returns `1.000 M`

Vectorization support

A number of expressions support 'vectorized' query engines

Supported features:

constants and identifiers are supported for any column type
cast is supported for numeric and string types
math operators: +,-,*,/,%,^ are supported for numeric types
logical operators: !, &&, ||, are supported for string and numeric types (if druid.expressions.useStrictBooleans=true)
comparison operators: =, !=, >, >=, <, <= are supported for string and numeric types
math functions: abs, acos, asin, atan, cbrt, ceil, cos, cosh, cot, exp, expm1, floor, getExponent, log, log10, log1p, nextUp, rint, signum, sin, sinh, sqrt, tan, tanh, toDegrees, toRadians, ulp, atan2, copySign, div, hypot, max, min, nextAfter, pow, remainder, scalb are supported for numeric types
time functions: timestamp_floor (with constant granularity argument) is supported for numeric types
boolean functions: isnull, notnull are supported for string and numeric types
conditional functions: nvl is supported for string and numeric types
string functions: the concatenation operator (+) and concat function are supported for string and numeric types
other: parse_long is supported for numeric and string types

Logical operator modes

Prior to the 0.23 release of Apache Druid, boolean function expressions have inconsistent handling of true and false values, and the logical 'and' and 'or' operators behave in a manner that is incompatible with SQL, even if SQL compatible null handling mode (druid.generic.useDefaultValueForNull=false) is enabled. Logical operators also pass through their input values similar to many scripting languages, and treat null as false, which can result in some rather strange behavior. Other boolean operations, such as comparisons and equality, retain their input types (e.g. DOUBLE comparison would produce 1.0 for true and 0.0 for false), while many other boolean functions strictly produce LONG typed values of 1 for true and 0 for false.

After 0.23, while the inconsistent legacy behavior is still the default, it can be optionally be changed by setting druid.expressions.useStrictBooleans=true, so that these operations will allow correctly treating null values as "unknown" for SQL compatible behavior, and all boolean output functions will output 'homogeneous' LONG typed boolean values of 1 for true and 0 for false. Additionally,

For the "or" operator:

true || null, null || true, -> 1
false || null, null || false, null || null-> null

For the "and" operator:

true && null, null && true, null && null -> null
false && null, null && false -> 0

Druid currently still retains implicit conversion of LONG, DOUBLE, and STRING types into boolean values in both modes:

LONG or DOUBLE - any value greater than 0 is considered true, else false
STRING - the value 'true' (case insensitive) is considered true, everything else is false.

Legacy behavior:

100 && 11 -> 11
0.7 || 0.3 -> 0.3
100 && 0 -> 0
'troo' && 'true' -> 'troo'
'troo' || 'true' -> 'true'

SQL compatible behavior:

100 && 11 -> 1
0.7 || 0.3 -> 1
100 && 0 -> 0
'troo' && 'true' -> 0
'troo' || 'true' -> 1

30 KiB Raw Blame History