2018-12-14 11:20:37 -05:00
[role="xpack"]
[testenv="basic"]
[[sql-functions-grouping]]
=== Grouping Functions
Functions for creating special __grouping__s (also known as _bucketing_); as such these need to be used
as part of the <<sql-syntax-group-by, grouping>>.
[[sql-functions-grouping-histogram]]
==== `HISTOGRAM`
2018-12-21 16:25:54 -05:00
.Synopsis:
2018-12-14 11:20:37 -05:00
[source, sql]
----
2019-04-22 09:33:55 -04:00
HISTOGRAM(
numeric_exp, <1>
numeric_interval) <2>
HISTOGRAM(
date_exp, <3>
date_time_interval) <4>
2018-12-14 11:20:37 -05:00
----
*Input*:
<1> numeric expression (typically a field)
<2> numeric interval
<3> date/time expression (typically a field)
<4> date/time <<sql-functions-datetime-interval, interval>>
*Output*: non-empty buckets or groups of the given expression divided according to the given interval
2020-02-14 11:58:45 -05:00
*Description*: The histogram function takes all matching values and divides them into buckets with fixed size matching the given interval, using (roughly) the following formula:
2018-12-14 11:20:37 -05:00
[source, sql]
----
bucket_key = Math.floor(value / interval) * interval
----
2020-02-14 11:58:45 -05:00
[NOTE]
The histogram in SQL does *NOT* return empty buckets for missing intervals as the traditional <<search-aggregations-bucket-histogram-aggregation, histogram>> and <<search-aggregations-bucket-datehistogram-aggregation, date histogram>>. Such behavior does not fit conceptually in SQL which treats all missing values as `NULL`; as such the histogram places all missing values in the `NULL` group.
2018-12-20 16:11:56 -05:00
2018-12-14 11:20:37 -05:00
`Histogram` can be applied on either numeric fields:
2019-05-31 13:03:41 -04:00
[source, sql]
2018-12-14 11:20:37 -05:00
----
2019-03-25 09:22:59 -04:00
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumeric]
2018-12-14 11:20:37 -05:00
----
or date/time fields:
2019-05-31 13:03:41 -04:00
[source, sql]
2018-12-14 11:20:37 -05:00
----
2019-03-25 09:22:59 -04:00
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTime]
2018-12-14 11:20:37 -05:00
----
2018-12-20 16:11:56 -05:00
Expressions inside the histogram are also supported as long as the
return type is numeric:
2019-05-31 13:03:41 -04:00
[source, sql]
2018-12-20 16:11:56 -05:00
----
2019-03-25 09:22:59 -04:00
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramNumericExpression]
2018-12-20 16:11:56 -05:00
----
Do note that histograms (and grouping functions in general) allow custom expressions but cannot have any functions applied to them in the `GROUP BY`. In other words, the following statement is *NOT* allowed:
2018-12-14 11:20:37 -05:00
2019-05-31 13:03:41 -04:00
[source, sql]
2018-12-20 16:11:56 -05:00
----
2019-03-25 09:22:59 -04:00
include-tagged::{sql-specs}/docs/docs.csv-spec[expressionOnHistogramNotAllowed]
2018-12-20 16:11:56 -05:00
----
as it requires two groupings (one for histogram followed by a second for applying the function on top of the histogram groups).
Instead one can rewrite the query to move the expression on the histogram _inside_ of it:
2019-05-31 13:03:41 -04:00
[source, sql]
2018-12-20 16:11:56 -05:00
----
2019-03-25 09:22:59 -04:00
include-tagged::{sql-specs}/docs/docs.csv-spec[histogramDateTimeExpression]
2018-12-20 16:11:56 -05:00
----
2019-01-24 06:41:58 -05:00
[IMPORTANT]
When the histogram in SQL is applied on **DATE** type instead of **DATETIME**, the interval specified is truncated to
the multiple of a day. E.g.: for `HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '2 3:04' DAY TO MINUTE)` the interval
actually used will be `INTERVAL '2' DAY`. If the interval specified is less than 1 day, e.g.:
`HISTOGRAM(CAST(birth_date AS DATE), INTERVAL '20' HOUR)` then the interval used will be `INTERVAL '1' DAY`.
2019-04-01 17:30:39 -04:00
2019-10-09 04:22:41 -04:00
[IMPORTANT]
All intervals specified for a date/time HISTOGRAM will use a <<search-aggregations-bucket-datehistogram-aggregation,fixed interval>>
2020-02-25 11:44:03 -05:00
in their `date_histogram` aggregation definition, with the notable exceptions of `INTERVAL '1' YEAR`, `INTERVAL '1' MONTH` and `INTERVAL '1' DAY` where a calendar interval is used.
The choice for a calendar interval was made for having a more intuitive result for YEAR, MONTH and DAY groupings. In the case of YEAR, for example, the calendar intervals consider a one year
2019-10-09 04:22:41 -04:00
bucket as the one starting on January 1st that specific year, whereas a fixed interval one-year-bucket considers one year as a number
of milliseconds (for example, `31536000000ms` corresponding to 365 days, 24 hours per day, 60 minutes per hour etc.). With fixed intervals,
the day of February 5th, 2019 for example, belongs to a bucket that starts on December 20th, 2018 and {es} (and implicitly {es-sql}) would
have returned the year 2018 for a date that's actually in 2019. With calendar interval this behavior is more intuitive, having the day of
February 5th, 2019 actually belonging to the 2019 year bucket.
2019-04-01 17:30:39 -04:00
[IMPORTANT]
Histogram in SQL cannot be applied applied on **TIME** type.
E.g.: `HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES)` is currently not supported.