350 lines
12 KiB
Plaintext
350 lines
12 KiB
Plaintext
[[search-aggregations-bucket-histogram-aggregation]]
|
|
=== Histogram Aggregation
|
|
|
|
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
|
|
It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
|
|
that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
|
|
(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
|
|
evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
|
|
then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
|
|
To make this more formal, here is the rounding function that is used:
|
|
|
|
[source,java]
|
|
--------------------------------------------------
|
|
rem = value % interval
|
|
if (rem < 0) {
|
|
rem += interval
|
|
}
|
|
bucket_key = value - rem
|
|
--------------------------------------------------
|
|
|
|
From the rounding function above it can be seen that the intervals themselves **must** be integers.
|
|
|
|
WARNING: Currently, values are cast to integers before being bucketed, which
|
|
might cause negative floating-point values to fall into the wrong bucket. For
|
|
instance, `-4.5` with an interval of `2` would be cast to `-4`, and so would
|
|
end up in the `-4 <= val < -2` bucket instead of the `-6 <= val < -4` bucket.
|
|
|
|
The following snippet "buckets" the products based on their `price` by interval of `50`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
And the following may be the response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations": {
|
|
"prices" : {
|
|
"buckets": [
|
|
{
|
|
"key": 0,
|
|
"doc_count": 2
|
|
},
|
|
{
|
|
"key": 50,
|
|
"doc_count": 4
|
|
},
|
|
{
|
|
"key": 100,
|
|
"doc_count": 0
|
|
},
|
|
{
|
|
"key": 150,
|
|
"doc_count": 3
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Minimum document count
|
|
|
|
The response above show that no documents has a price that falls within the range of `[100 - 150)`. By default the
|
|
response will fill gaps in the histogram with empty buckets. It is possible change that and request buckets with
|
|
a higher minimum count thanks to the `min_doc_count` setting:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"min_doc_count" : 1
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations": {
|
|
"prices" : {
|
|
"buckets": [
|
|
{
|
|
"key": 0,
|
|
"doc_count": 2
|
|
},
|
|
{
|
|
"key": 50,
|
|
"doc_count": 4
|
|
},
|
|
{
|
|
"key": 150,
|
|
"doc_count": 3
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
[[search-aggregations-bucket-histogram-aggregation-extended-bounds]]
|
|
By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with
|
|
the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the
|
|
documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when
|
|
requesting empty buckets, this causes a confusion, specifically, when the data is also filtered.
|
|
|
|
To understand why, let's look at an example:
|
|
|
|
Lets say the you're filtering your request to get all docs with values between `0` and `500`, in addition you'd like
|
|
to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd
|
|
like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than `100`,
|
|
the first bucket you'll get will be the one with `100` as its key. This is confusing, as many times, you'd also like
|
|
to get those buckets between `0 - 100`.
|
|
|
|
With `extended_bounds` setting, you now can "force" the histogram aggregation to start building buckets on a specific
|
|
`min` values and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using
|
|
`extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count`
|
|
is greater than 0).
|
|
|
|
Note that (as the name suggest) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher
|
|
than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the
|
|
same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation
|
|
under a range `filter` aggregation with the appropriate `from`/`to` settings.
|
|
|
|
Example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"query" : {
|
|
"constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
|
|
},
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"extended_bounds" : {
|
|
"min" : 0,
|
|
"max" : 500
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Order
|
|
|
|
By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled
|
|
using the `order` setting.
|
|
|
|
Ordering the buckets by their key - descending:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"order" : { "_key" : "desc" }
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Ordering the buckets by their `doc_count` - ascending:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"order" : { "_count" : "asc" }
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"order" : { "price_stats.min" : "asc" } <1>
|
|
},
|
|
"aggs" : {
|
|
"price_stats" : { "stats" : {} } <2>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> The `{ "price_stats.min" : asc" }` will sort the buckets based on `min` value of their `price_stats` sub-aggregation.
|
|
|
|
<2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation.
|
|
|
|
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
|
|
as the aggregations path are of a single-bucket type, where the last aggregation in the path may either by a single-bucket
|
|
one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`),
|
|
in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
|
|
a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
|
|
|
|
The path must be defined in the following form:
|
|
|
|
--------------------------------------------------
|
|
AGG_SEPARATOR := '>'
|
|
METRIC_SEPARATOR := '.'
|
|
AGG_NAME := <the name of the aggregation>
|
|
METRIC := <the name of the metric (in case of multi-value metrics aggregation)>
|
|
PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
|
|
--------------------------------------------------
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"order" : { "promoted_products>rating_stats.avg" : "desc" } <1>
|
|
},
|
|
"aggs" : {
|
|
"promoted_products" : {
|
|
"filter" : { "term" : { "promoted" : true }},
|
|
"aggs" : {
|
|
"rating_stats" : { "stats" : { "field" : "rating" }}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The above will sort the buckets based on the avg rating among the promoted products
|
|
|
|
|
|
==== Offset
|
|
|
|
By default the bucket keys start with 0 and then continue in even spaced steps of `interval`, e.g. if the interval is 10 the first buckets
|
|
(assuming there is data inside them) will be [0 - 9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option.
|
|
|
|
This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in
|
|
two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10
|
|
documents.
|
|
|
|
==== Response Format
|
|
|
|
By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash
|
|
instead keyed by the buckets keys:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"prices" : {
|
|
"histogram" : {
|
|
"field" : "price",
|
|
"interval" : 50,
|
|
"keyed" : true
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggregations": {
|
|
"prices": {
|
|
"buckets": {
|
|
"0": {
|
|
"key": 0,
|
|
"doc_count": 2
|
|
},
|
|
"50": {
|
|
"key": 50,
|
|
"doc_count": 4
|
|
},
|
|
"150": {
|
|
"key": 150,
|
|
"doc_count": 3
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Missing value
|
|
|
|
The `missing` parameter defines how documents that are missing a value should be treated.
|
|
By default they will be ignored but it is also possible to treat them as if they
|
|
had a value.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"quantity" : {
|
|
"histogram" : {
|
|
"field" : "quantity",
|
|
"interval": 10,
|
|
"missing": 0 <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Documents without a value in the `quantity` field will fall into the same bucket as documents that have the value `0`.
|