[[search-aggregations-bucket-composite-aggregation]]
|
|
=== Composite aggregation
|
|
|
|
A multi-bucket aggregation that creates composite buckets from different sources.
|
|
|
|
Unlike the other `multi-bucket` aggregations, the `composite` aggregation can be used
to paginate **all** buckets from a multi-level aggregation efficiently. This aggregation
provides a way to stream **all** buckets of a specific aggregation, similar to what
<<request-body-search-scroll, scroll>> does for documents.

The composite buckets are built from the combinations of the
values extracted or created for each document, and each combination is considered
a composite bucket.
|
|
|
|
//////////////////////////
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT /sales
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"product": {
|
|
"type": "keyword"
|
|
},
|
|
"timestamp": {
|
|
"type": "date"
|
|
},
|
|
"price": {
|
|
"type": "long"
|
|
},
|
|
"shop": {
|
|
"type": "keyword"
|
|
},
|
|
"nested": {
|
|
"type": "nested",
|
|
"properties": {
|
|
"product": {
|
|
"type": "keyword"
|
|
},
|
|
"timestamp": {
|
|
"type": "date"
|
|
},
|
|
"price": {
|
|
"type": "long"
|
|
},
|
|
"shop": {
|
|
"type": "keyword"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
POST /sales/_bulk?refresh
|
|
{"index":{"_id":0}}
|
|
{"product": "mad max", "price": "20", "timestamp": "2017-05-09T14:35"}
|
|
{"index":{"_id":1}}
|
|
{"product": "mad max", "price": "25", "timestamp": "2017-05-09T12:35"}
|
|
{"index":{"_id":2}}
|
|
{"product": "rocky", "price": "10", "timestamp": "2017-05-08T09:10"}
|
|
{"index":{"_id":3}}
|
|
{"product": "mad max", "price": "27", "timestamp": "2017-05-10T07:07"}
|
|
{"index":{"_id":4}}
|
|
{"product": "apocalypse now", "price": "10", "timestamp": "2017-05-11T08:35"}
|
|
-------------------------------------------------
|
|
// NOTCONSOLE
|
|
// TESTSETUP
|
|
|
|
//////////////////////////
|
|
|
|
For instance, the following document:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"keyword": ["foo", "bar"],
|
|
"number": [23, 65, 76]
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
\... creates the following composite buckets when `keyword` and `number` are used as value sources
for the aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{ "keyword": "foo", "number": 23 }
|
|
{ "keyword": "foo", "number": 65 }
|
|
{ "keyword": "foo", "number": 76 }
|
|
{ "keyword": "bar", "number": 23 }
|
|
{ "keyword": "bar", "number": 65 }
|
|
{ "keyword": "bar", "number": 76 }
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
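
The expansion above is the Cartesian product of the values that each source extracts from the document.
As a minimal sketch in Python (this is not Elasticsearch code, and `composite_keys` is a hypothetical helper
used only for illustration), the same six combinations can be produced like this:

[source,python]
--------------------------------------------------
from itertools import product


def composite_keys(doc, sources):
    """Return one composite key per combination of source values."""
    value_lists = []
    for name in sources:
        values = doc[name]
        # Treat a single value the same way as a one-element array.
        value_lists.append(values if isinstance(values, list) else [values])
    # product() varies the last source fastest, which matches the listing above.
    return [dict(zip(sources, combo)) for combo in product(*value_lists)]


doc = {"keyword": ["foo", "bar"], "number": [23, 65, 76]}
for key in composite_keys(doc, ["keyword", "number"]):
    print(key)
--------------------------------------------------

The sketch only shows how the combinations are formed for one document. It does not model sorting or
the merging of buckets across documents.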
|
|
|
|
==== Values source
|
|
|
|
The `sources` parameter defines the value sources used to build the composite buckets.
The order in which the `sources` are defined is important because it also controls the order
in which the keys are returned.

The name given to each source must be unique.

There are three different types of value sources:
|
|
|
|
[[_terms]]
|
|
===== Terms
|
|
|
|
The `terms` value source is equivalent to a simple `terms` aggregation.
|
|
The values are extracted from a field or a script exactly like the `terms` aggregation.
|
|
|
|
Example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "product": { "terms" : { "field": "product" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Like the `terms` aggregation, it is also possible to use a script to create the values for the composite buckets:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{
|
|
"product": {
|
|
"terms" : {
|
|
"script" : {
|
|
"source": "doc['product'].value",
|
|
"lang": "painless"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
[[_histogram]]
|
|
===== Histogram
|
|
|
|
The `histogram` value source can be applied to numeric values to build fixed-size
intervals over the values. The `interval` parameter defines how the numeric values are
transformed. For instance, an `interval` set to 5 rounds each numeric value down to the start of its interval:
a value of `101` is translated to `100`, which is the key for the interval between 100 (inclusive) and 105 (exclusive).
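
The rounding described above can be written as a one-line formula. The following Python sketch is an
illustration only; `histogram_key` is a hypothetical name, and the sketch assumes no offset is applied
to the intervals:

[source,python]
--------------------------------------------------
import math


def histogram_key(value, interval):
    """Round a numeric value down to the start of its interval."""
    return math.floor(value / interval) * interval


for price in (10, 20, 25, 27, 101):
    print(price, "->", histogram_key(price, 5))
--------------------------------------------------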
|
|
|
|
Example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "histo": { "histogram" : { "field": "price", "interval": 5 } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The values are built from a numeric field or a script that returns numerical values:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{
|
|
"histo": {
|
|
"histogram" : {
|
|
"interval": 5,
|
|
"script" : {
|
|
"source": "doc['price'].value",
|
|
"lang": "painless"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
[[_date_histogram]]
|
|
===== Date histogram
|
|
|
|
The `date_histogram` value source is similar to the `histogram` value source except that the interval
is specified by a date/time expression:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "date": { "date_histogram" : { "field": "timestamp", "calendar_interval": "1d" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The example above creates an interval per day and translates each `timestamp` value to the start of the day that contains it.
Available expressions for the interval: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second`

Time values can also be specified via abbreviations supported by <<time-units,time units>> parsing.
Note that fractional time values are not supported, but you can address this by shifting to another
time unit (e.g., `1.5h` could instead be specified as `90m`).
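
To illustrate what a one-day interval does to a timestamp, the following Python sketch rounds
epoch-millisecond values down to the start of their UTC day. It is a simplified model, not the
Elasticsearch implementation; `utc_day_key` and `to_epoch_millis` are hypothetical helpers, and the
sample timestamps are assumed to be in UTC:

[source,python]
--------------------------------------------------
from datetime import datetime, timezone

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def to_epoch_millis(iso_text):
    """Parse a 'YYYY-MM-DDTHH:MM' string as UTC and return epoch milliseconds."""
    parsed = datetime.strptime(iso_text, "%Y-%m-%dT%H:%M").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def utc_day_key(epoch_millis):
    """Round a timestamp down to the start of its UTC day."""
    return epoch_millis - (epoch_millis % MILLIS_PER_DAY)


for text in ("2017-05-09T14:35", "2017-05-09T12:35", "2017-05-08T09:10"):
    print(text, "->", utc_day_key(to_epoch_millis(text)))
--------------------------------------------------

Both timestamps on 2017-05-09 map to the same key, so they end up in the same bucket.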
|
|
|
|
*Format*
|
|
|
|
Internally, a date is represented as a 64-bit number representing a timestamp in milliseconds since the epoch.
These timestamps are returned as the bucket keys. It is possible to return a formatted date string instead by
specifying a pattern with the `format` parameter:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{
|
|
"date": {
|
|
"date_histogram" : {
|
|
"field": "timestamp",
|
|
"calendar_interval": "1d",
|
|
"format": "yyyy-MM-dd" <1>
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Supports expressive date <<date-format-pattern,format pattern>>
|
|
|
|
*Time Zone*
|
|
|
|
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
|
|
rounding is also done in UTC. The `time_zone` parameter can be used to indicate
|
|
that bucketing should use a different time zone.
|
|
|
|
Time zones may be specified either as an ISO 8601 UTC offset (e.g. `+01:00` or
`-08:00`) or as a time zone ID, an identifier used in the TZ database like
`America/Los_Angeles`.
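
The effect of a time zone on daily bucketing can be sketched in Python as follows. This is an
illustration under simplifying assumptions: it only handles fixed UTC offsets (no daylight saving
rules, which a TZ database ID would bring in), and `day_bucket_key` is a hypothetical helper:

[source,python]
--------------------------------------------------
from datetime import datetime, timedelta, timezone


def day_bucket_key(epoch_millis, utc_offset):
    """Return the epoch-millisecond key of the local day containing the timestamp."""
    zone = timezone(utc_offset)
    local = datetime.fromtimestamp(epoch_millis / 1000, zone)
    # Midnight in the requested zone marks the start of the daily bucket.
    start_of_day = local.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(start_of_day.timestamp() * 1000)


timestamp = 1494400020000  # 2017-05-10T07:07:00Z
print(day_bucket_key(timestamp, timedelta(0)))         # bucketed in UTC
print(day_bucket_key(timestamp, timedelta(hours=-8)))  # bucketed at -08:00
--------------------------------------------------

In UTC the timestamp belongs to the day starting 2017-05-10T00:00Z, while at `-08:00` it is still
23:07 on 2017-05-09 local time, so it falls into the bucket that starts at 2017-05-09T08:00Z.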
|
|
|
|
===== Mixing different values source
|
|
|
|
The `sources` parameter accepts an array of value sources.
It is possible to mix different value sources to create composite buckets.
For example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
|
|
{ "product": { "terms": {"field": "product" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
This will create composite buckets from the values created by two value sources, a `date_histogram` and a `terms`.
Each bucket is composed of two values, one for each value source defined in the aggregation.
Any combination of value sources is allowed, and the order in the array is preserved
in the composite buckets.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "shop": { "terms": {"field": "shop" } } },
|
|
{ "product": { "terms": { "field": "product" } } },
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Order
|
|
|
|
By default, the composite buckets are sorted by their natural ordering: values are sorted
in ascending order. When multiple value sources are requested, the ordering is done per value
source: the first value of a composite bucket is compared to the first value of the other composite bucket, and if they are equal, the
next values in the composite buckets are used for tie-breaking. This means that the composite bucket
`[foo, 100]` is considered smaller than `[foobar, 0]` because `foo` is considered smaller than `foobar`.
It is possible to define the direction of the sort for each value source by setting `order` to `asc` (default value)
or `desc` (descending order) directly in the value source definition.
For example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
|
|
{ "product": { "terms": {"field": "product", "order": "asc" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
\... will sort the composite buckets in descending order when comparing values from the `date_histogram` source
and in ascending order when comparing values from the `terms` source.
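
The per-source ordering can be modelled in Python with a stable sort applied once per source, starting
with the least significant one. This is a sketch of the comparison rule only, not of how Elasticsearch
implements it; `sort_composite_keys` is a hypothetical helper and the `alien` key is invented to show
tie-breaking:

[source,python]
--------------------------------------------------
def sort_composite_keys(keys, orders):
    """Sort key tuples with one direction per source; the first source is most significant."""
    ordered = list(keys)
    # Python's sort is stable, so sorting by the last source first and the
    # first source last yields the combined lexicographic ordering.
    for position in reversed(range(len(orders))):
        descending = orders[position] == "desc"
        ordered.sort(key=lambda key: key[position], reverse=descending)
    return ordered


keys = [
    (1494288000000, "mad max"),
    (1494201600000, "rocky"),
    (1494460800000, "apocalypse now"),
    (1494288000000, "alien"),
    (1494374400000, "mad max"),
]
for key in sort_composite_keys(keys, ["desc", "asc"]):
    print(key)
--------------------------------------------------

Dates come out newest first, and the two keys that share the date `1494288000000` are ordered by
product name in ascending order.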
|
|
|
|
==== Missing bucket
|
|
|
|
By default documents without a value for a given source are ignored.
|
|
It is possible to include them in the response by setting `missing_bucket` to
|
|
`true` (defaults to `false`):
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "product_name": { "terms" : { "field": "product", "missing_bucket": true } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
In the example above the source `product_name` will emit an explicit `null` value
|
|
for documents without a value for the field `product`.
|
|
The `order` specified in the source dictates whether the `null` values should rank
|
|
first (ascending order, `asc`) or last (descending order, `desc`).
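
The ranking of `null` values can be sketched in Python, where `None` stands in for `null`. This is an
illustration of the rule stated above only; `sort_with_missing` is a hypothetical helper:

[source,python]
--------------------------------------------------
def sort_with_missing(values, order="asc"):
    """Sort values so that None ranks first for asc and last for desc."""
    # Sort the real values on their own so None is never compared with them.
    present = sorted((v for v in values if v is not None), reverse=(order == "desc"))
    missing = [v for v in values if v is None]
    return missing + present if order == "asc" else present + missing


products = ["rocky", None, "mad max"]
print(sort_with_missing(products, "asc"))
print(sort_with_missing(products, "desc"))
--------------------------------------------------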
|
|
|
|
==== Size
|
|
|
|
The `size` parameter can be set to define how many composite buckets should be returned.
Each composite bucket counts as a single bucket, so setting a size of 10 returns the
first 10 composite buckets created from the value sources.
The response contains the key of each composite bucket, which holds the values extracted
from each value source.
|
|
|
|
==== Pagination
|
|
|
|
If the number of composite buckets is too high (or unknown) to be returned in a single response,
it is possible to split the retrieval into multiple requests.
Since the composite buckets are flat by nature, the requested `size` is exactly the number of composite buckets
that will be returned in the response (assuming that there are at least `size` composite buckets to return).
If all composite buckets should be retrieved, it is preferable to use a small size (`100` or `1000` for instance)
and then use the `after` parameter to retrieve the next results.
For example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"size": 2,
|
|
"sources" : [
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
|
|
{ "product": { "terms": {"field": "product" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[s/_search/_search\?filter_path=aggregations/]
|
|
|
|
\... returns:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
...
|
|
"aggregations": {
|
|
"my_buckets": {
|
|
"after_key": { <1>
|
|
"date": 1494288000000,
|
|
"product": "mad max"
|
|
},
|
|
"buckets": [
|
|
{
|
|
"key": {
|
|
"date": 1494201600000,
|
|
"product": "rocky"
|
|
},
|
|
"doc_count": 1
|
|
},
|
|
{
|
|
"key": {
|
|
"date": 1494288000000,
|
|
"product": "mad max"
|
|
},
|
|
"doc_count": 2
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/\.\.\.//]
|
|
|
|
<1> The last composite bucket returned by the query.
|
|
|
|
NOTE: The `after_key` is equal to the last bucket returned in the response before
any filtering that could be done by <<search-aggregations-pipeline, Pipeline aggregations>>.
If all buckets are filtered/removed by a pipeline aggregation, the `after_key` will contain
the last bucket before filtering.
|
|
|
|
The `after` parameter can be used to retrieve the composite buckets that are **after**
the last composite bucket returned in a previous round.
In the example above, the last bucket can be found in `after_key`, and the next
round of results can be retrieved with:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"size": 2,
|
|
"sources" : [
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } } },
{ "product": { "terms": {"field": "product" } } }
|
|
],
|
|
"after": { "date": 1494288000000, "product": "mad max" } <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Should restrict the aggregation to buckets that sort **after** the provided values.
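
The paging loop described above can be sketched in Python as follows. This is a minimal sketch, not
client code: `search` is a hypothetical callable that takes a request body and returns the parsed JSON
response, and `fake_search` exists only so that the example runs without a cluster:

[source,python]
--------------------------------------------------
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of one composite aggregation, page by page."""
    request = copy.deepcopy(body)  # leave the caller's request untouched
    composite = request["aggs"][agg_name]["composite"]
    while True:
        result = search(request)["aggregations"][agg_name]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break  # an empty page means everything has been read
        composite["after"] = after_key  # resume after the last bucket seen


def fake_search(request):
    """Stand-in for a real search call: serves three sorted keys in pages."""
    all_keys = [{"n": 1}, {"n": 2}, {"n": 3}]
    composite = request["aggs"]["my_buckets"]["composite"]
    after = composite.get("after")
    remaining = [k for k in all_keys if after is None or k["n"] > after["n"]]
    page = remaining[: composite["size"]]
    result = {"buckets": [{"key": k, "doc_count": 1} for k in page]}
    if page:
        result["after_key"] = page[-1]
    return {"aggregations": {"my_buckets": result}}


body = {
    "size": 0,
    "aggs": {
        "my_buckets": {
            "composite": {"size": 2, "sources": [{"n": {"terms": {"field": "n"}}}]}
        }
    },
}
print([b["key"]["n"] for b in iter_composite_buckets(fake_search, body, "my_buckets")])
--------------------------------------------------

The sources and their `order` stay identical on every page. Only the `after` value changes between
requests.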
|
|
|
|
==== Early termination
|
|
|
|
For optimal performance, the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
part or all of the source order in the composite aggregation.
For instance, the following index sort:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT twitter
|
|
{
    "settings" : {
        "index" : {
            "sort.field" : ["user_name", "timestamp"], <1>
            "sort.order" : ["asc", "desc"] <2>
        }
    },
    "mappings": {
        "properties": {
            "user_name": {
                "type": "keyword",
                "doc_values": true
            },
            "timestamp": {
                "type": "date"
            }
        }
    }
}
--------------------------------------------------

<1> This index is sorted by `user_name` first then by `timestamp`.
<2> ... in ascending order for the `user_name` field and in descending order for the `timestamp` field.
|
|
|
|
\... could be used to optimize these composite aggregations:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "user_name": { "terms" : { "field": "user_name" } } } <1>
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> `user_name` is a prefix of the index sort and the order matches (`asc`).
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "user_name": { "terms" : { "field": "user_name" } } }, <1>
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> `user_name` is a prefix of the index sort and the order matches (`asc`).
|
|
<2> `timestamp` also matches the prefix and the order matches (`desc`).
|
|
|
|
To optimize the early termination, it is advised to set `track_total_hits` in the request
to `false`. The number of total hits that match the request can be retrieved on the first request,
and it would be costly to compute this number on every page:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"track_total_hits": false,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "user_name": { "terms" : { "field": "user_name" } } },
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Note that the order of the sources is important: in the example above, switching the `user_name` with the `timestamp`
would deactivate the sort optimization since this configuration wouldn't match the index sort specification.
If the order of sources does not matter for your use case, you can follow these simple guidelines:
|
|
|
|
* Put the fields with the highest cardinality first.
|
|
* Make sure that the order of the field matches the order of the index sort.
|
|
* Put multi-valued fields last since they cannot be used for early termination.
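
The idea of sources lining up with the index sort can be sketched in Python. This is a deliberately
simplified model: it only compares field names and directions position by position, and the actual
optimization inside Elasticsearch may apply additional conditions. `sorted_prefix_length` is a
hypothetical helper:

[source,python]
--------------------------------------------------
def sorted_prefix_length(sources, index_sort):
    """Count the leading sources whose field and order match the index sort."""
    matched = 0
    for (field, order), (sort_field, sort_order) in zip(sources, index_sort):
        if field != sort_field or order != sort_order:
            break  # the first mismatch ends the usable prefix
        matched += 1
    return matched


index_sort = [("user_name", "asc"), ("timestamp", "desc")]

print(sorted_prefix_length([("user_name", "asc")], index_sort))
print(sorted_prefix_length([("user_name", "asc"), ("timestamp", "desc")], index_sort))
print(sorted_prefix_length([("timestamp", "desc"), ("user_name", "asc")], index_sort))
--------------------------------------------------

In this model, swapping the two sources leaves no matching prefix at all, which corresponds to the
case where the sort optimization is deactivated.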
|
|
|
|
WARNING: <<index-modules-index-sorting,index sort>> can slow down indexing. It is very important to test index sorting
with your specific use case and dataset to ensure that it matches your requirements. If it doesn't, note that `composite`
aggregations will also try to early terminate on non-sorted indices if the query matches all documents (`match_all` query).
|
|
|
|
==== Sub-aggregations
|
|
|
|
Like any `multi-bucket` aggregation, the `composite` aggregation can hold sub-aggregations.
These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this
parent aggregation.
For instance, the following example computes the average value of a field
per composite bucket:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"size": 0,
|
|
"aggs" : {
|
|
"my_buckets": {
|
|
"composite" : {
|
|
"sources" : [
|
|
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
|
|
{ "product": { "terms": {"field": "product" } } }
|
|
]
|
|
},
|
|
"aggregations": {
|
|
"the_avg": {
|
|
"avg": { "field": "price" }
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[s/_search/_search\?filter_path=aggregations/]
|
|
|
|
\... returns:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
...
|
|
"aggregations": {
|
|
"my_buckets": {
|
|
"after_key": {
|
|
"date": 1494201600000,
|
|
"product": "rocky"
|
|
},
|
|
"buckets": [
|
|
{
|
|
"key": {
|
|
"date": 1494460800000,
|
|
"product": "apocalypse now"
|
|
},
|
|
"doc_count": 1,
|
|
"the_avg": {
|
|
"value": 10.0
|
|
}
|
|
},
|
|
{
|
|
"key": {
|
|
"date": 1494374400000,
|
|
"product": "mad max"
|
|
},
|
|
"doc_count": 1,
|
|
"the_avg": {
|
|
"value": 27.0
|
|
}
|
|
},
|
|
{
|
|
"key": {
|
|
"date": 1494288000000,
|
|
"product" : "mad max"
|
|
},
|
|
"doc_count": 2,
|
|
"the_avg": {
|
|
"value": 22.5
|
|
}
|
|
},
|
|
{
|
|
"key": {
|
|
"date": 1494201600000,
|
|
"product": "rocky"
|
|
},
|
|
"doc_count": 1,
|
|
"the_avg": {
|
|
"value": 10.0
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/\.\.\.//]
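
The averages in the response above can be checked by hand. The following Python sketch groups documents
by day and product and averages their prices. It assumes the five sample documents from this page's test
setup, with prices written as numbers and timestamps taken to be in UTC; `average_price_per_bucket` is a
hypothetical helper, not part of any client library:

[source,python]
--------------------------------------------------
from collections import defaultdict

docs = [
    {"product": "mad max", "price": 20, "timestamp": "2017-05-09T14:35"},
    {"product": "mad max", "price": 25, "timestamp": "2017-05-09T12:35"},
    {"product": "rocky", "price": 10, "timestamp": "2017-05-08T09:10"},
    {"product": "mad max", "price": 27, "timestamp": "2017-05-10T07:07"},
    {"product": "apocalypse now", "price": 10, "timestamp": "2017-05-11T08:35"},
]


def average_price_per_bucket(documents):
    """Group by (day, product) and return the average price per group."""
    prices = defaultdict(list)
    for doc in documents:
        day = doc["timestamp"][:10]  # the date part of the ISO timestamp
        prices[(day, doc["product"])].append(doc["price"])
    # Order by date descending, then product ascending, as in the request above.
    ordered = sorted(prices, key=lambda key: key[1])
    ordered.sort(key=lambda key: key[0], reverse=True)
    return [(key, sum(prices[key]) / len(prices[key])) for key in ordered]


for key, average in average_price_per_bucket(docs):
    print(key, average)
--------------------------------------------------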
|
|
|
|
==== Pipeline aggregations
|
|
|
|
The composite aggregation is not currently compatible with pipeline aggregations, nor does it make sense in most cases.
For example, due to the paging nature of composite aggregations, a single logical partition (one day, for example) might be spread
over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets,
running something like a derivative on a composite page could lead to inaccurate results as it only takes into
account a "partial" result on that page.

Pipeline aggregations that are self-contained to a single bucket (such as `bucket_selector`) might be supported in the future.
|