OpenSearch/docs/reference/sql/functions/grouping.asciidoc

102 lines
4.0 KiB
Plaintext

[role="xpack"]
[testenv="basic"]
[[sql-functions-grouping]]
=== Grouping Functions
Functions for creating special __grouping__s (also known as _bucketing_); as such these need to be used
as part of the <<sql-syntax-group-by, grouping>>.
[[sql-functions-grouping-histogram]]
==== `HISTOGRAM`
.Synopsis:
[source, sql]
----
HISTOGRAM(
numeric_exp, <1>
numeric_interval) <2>
HISTOGRAM(
date_exp, <3>
date_time_interval) <4>
----
*Input*:
<1> numeric expression (typically a field)
<2> numeric interval
<3> date/time expression (typically a field)
<4> date/time <<sql-functions-datetime-interval, interval>>
*Output*: non-empty buckets or groups of the given expression divided according to the given interval
.Description
The histogram function takes all matching values and divides them into buckets with fixed size matching the given interval, using (roughly) the following formula:
[source, sql]
----
bucket_key = Math.floor(value / interval) * interval
----
NOTE:: The histogram in SQL does *NOT* return empty buckets for missing intervals as the traditional <<search-aggregations-bucket-histogram-aggregation, histogram>> and <<search-aggregations-bucket-datehistogram-aggregation, date histogram>>. Such behavior does not fit conceptually in SQL which treats all missing values as `NULL`; as such the histogram places all missing values in the `NULL` group.
`Histogram` can be applied on either numeric fields:
[source, sql]
----
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumeric]
----
or date/time fields:
[source, sql]
----
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTime]
----
Expressions inside the histogram are also supported as long as the
return type is numeric:
[source, sql]
----
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumericExpression]
----
Do note that histograms (and grouping functions in general) allow custom expressions but cannot have any functions applied to them in the `GROUP BY`. In other words, the following statement is *NOT* allowed:
[source, sql]
----
include-tagged::{sql-specs}/docs/docs.csv-spec[expressionOnHistogramNotAllowed]
----
as it requires two groupings (one for histogram followed by a second for applying the function on top of the histogram groups).
Instead one can rewrite the query to move the expression on the histogram _inside_ of it:
[source, sql]
----
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTimeExpression]
----
[IMPORTANT]
When the histogram in SQL is applied on **DATE** type instead of **DATETIME**, the interval specified is truncated to
the multiple of a day. E.g.: for `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '2 3:04' DAY TO MINUTE)` the interval
actually used will be `INTERVAL '2' DAY`. If the interval specified is less than 1 day, e.g.:
`HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)` then the interval used will be `INTERVAL '1' DAY`.
[IMPORTANT]
All intervals specified for a date/time HISTOGRAM will use a <<search-aggregations-bucket-datehistogram-aggregation,fixed interval>>
in their `date_histogram` aggregation definition, with the notable exception of `INTERVAL '1' YEAR` where a calendar interval is used.
The choice for a calendar interval was made for having a more intuitive result for YEAR groupings. Calendar intervals consider a one year
bucket as the one starting on January 1st that specific year, whereas a fixed interval one-year-bucket considers one year as a number
of milliseconds (for example, `31536000000ms` corresponding to 365 days, 24 hours per day, 60 minutes per hour etc.). With fixed intervals,
the day of February 5th, 2019 for example, belongs to a bucket that starts on December 20th, 2018 and {es} (and implicitly {es-sql}) would
have returned the year 2018 for a date that's actually in 2019. With calendar interval this behavior is more intuitive, having the day of
February 5th, 2019 actually belonging to the 2019 year bucket.
[IMPORTANT]
Histogram in SQL cannot be applied applied on **TIME** type.
E.g.: `HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES)` is currently not supported.