[[search-aggregations-bucket-terms-aggregation]]
=== Terms Aggregation

A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.

Example:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
    ...

    "aggregations" : {
        "genders" : {
            "doc_count_error_upper_bound": 0, <1>
            "sum_other_doc_count": 0, <2>
            "buckets" : [ <3>
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                }
            ]
        }
    }
}
--------------------------------------------------

<1> an upper bound of the error on the document counts for each term, see <<search-aggregations-bucket-terms-aggregation-approximate-counts,below>>
<2> when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response
<3> the list of the top buckets, the meaning of `top` being defined by the <<search-aggregations-bucket-terms-aggregation-order,order>>

By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can
change this default behaviour by setting the `size` parameter.

==== Size

The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top `size` term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
This means that if the number of unique terms is greater than `size`, the returned list may not be accurate:
the term counts can be slightly off, and a term that should have been in the top `size` buckets may even be
missing. If set to `0`, the `size` will be set to `Integer.MAX_VALUE`.

[[search-aggregations-bucket-terms-aggregation-approximate-counts]]
==== Document counts are approximate

As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
combined to give a final view. Consider the following scenario:

A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with
3 shards. In this case each shard is asked to give its top 5 terms.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5
            }
        }
    }
}
--------------------------------------------------

The terms for each of the three shards are shown below with their
respective document counts in brackets:

[width="100%",cols="^2,^2,^2,^2",options="header"]
|=========================================================
|    | Shard A        | Shard B        | Shard C

| 1  | Product A (25) | Product A (30) | Product A (45)
| 2  | Product B (18) | Product B (25) | Product C (44)
| 3  | Product C (6)  | Product F (17) | Product Z (36)
| 4  | Product D (3)  | Product Z (16) | Product G (30)
| 5  | Product E (2)  | Product G (15) | Product E (29)
| 6  | Product F (2)  | Product H (14) | Product H (28)
| 7  | Product G (2)  | Product I (10) | Product Q (2)
| 8  | Product H (2)  | Product Q (6)  | Product D (1)
| 9  | Product I (1)  | Product J (8)  |
| 10 | Product J (1)  | Product C (4)  |

|=========================================================

The shards will return their top 5 terms so the results from the shards will be:

[width="100%",cols="^2,^2,^2,^2",options="header"]
|=========================================================
|    | Shard A        | Shard B        | Shard C

| 1  | Product A (25) | Product A (30) | Product A (45)
| 2  | Product B (18) | Product B (25) | Product C (44)
| 3  | Product C (6)  | Product F (17) | Product Z (36)
| 4  | Product D (3)  | Product Z (16) | Product G (30)
| 5  | Product E (2)  | Product G (15) | Product E (29)

|=========================================================

Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
the following:

[width="40%",cols="^2,^2"]
|=========================================================

| 1 | Product A (100)
| 2 | Product Z (52)
| 3 | Product C (50)
| 4 | Product G (45)
| 5 | Product B (43)

|=========================================================
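The combining step can be sketched in a few lines of Python. This is only an illustration built on the per-shard top 5 lists from the example; `SHARD_RESULTS` and `reduce_shard_results` are hypothetical names and not part of Elasticsearch.

```python
from collections import Counter

# Top 5 terms returned by each shard, copied from the per-shard table in the example.
SHARD_RESULTS = {
    "A": {"Product A": 25, "Product B": 18, "Product C": 6, "Product D": 3, "Product E": 2},
    "B": {"Product A": 30, "Product B": 25, "Product F": 17, "Product Z": 16, "Product G": 15},
    "C": {"Product A": 45, "Product C": 44, "Product Z": 36, "Product G": 30, "Product E": 29},
}


def reduce_shard_results(shard_results, size):
    """Sum the counts each shard reported per term and keep the `size` largest."""
    merged = Counter()
    for counts in shard_results.values():
        merged.update(counts)  # adds the counts term by term
    return merged.most_common(size)


for rank, (term, count) in enumerate(reduce_shard_results(SHARD_RESULTS, 5), start=1):
    print(rank, term, count)
```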

Because Product A was returned from all shards we know that its document count value is accurate. Product C was only
returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on
shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of
combining the results to produce the final list of terms, that there is an error in the document count for Product C and
not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
terms because it did not make it into the top five terms on any of the shards.

==== Shard Size

The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).

The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards have responded, the
coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
the client. If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`.

NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will
override it and reset it to be equal to `size`.

It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
on high-cardinality fields, as it puts a heavy load on both your CPU, since terms need to be returned sorted, and your network.

==== Calculating Document Count Error

There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
a whole which represents the maximum potential document count for a term which did not make it into the final list of
terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example
given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
could have the 4th highest document count.

[source,js]
--------------------------------------------------
{
    ...

    "aggregations" : {
        "products" : {
            "doc_count_error_upper_bound" : 46,
            "buckets" : [
                {
                    "key" : "Product A",
                    "doc_count" : 100
                },
                {
                    "key" : "Product Z",
                    "doc_count" : 52
                },
                ...
            ]
        }
    }
}
--------------------------------------------------

The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value
for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when
deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned
by all shards which did not return the term. In the example above the error in the document count for Product C would be 15 as
Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document
count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that it would be off by
15. Product A, however, has an error of 0 for its document count: since every shard returned it, we can be confident that the count
returned is accurate.

[source,js]
--------------------------------------------------
{
    ...

    "aggregations" : {
        "products" : {
            "doc_count_error_upper_bound" : 46,
            "buckets" : [
                {
                    "key" : "Product A",
                    "doc_count" : 100,
                    "doc_count_error_upper_bound" : 0
                },
                {
                    "key" : "Product Z",
                    "doc_count" : 52,
                    "doc_count_error_upper_bound" : 2
                },
                ...
            ]
        }
    }
}
--------------------------------------------------
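Both error values can be reproduced from the example data with a small simulation, which also shows why a larger `shard_size` helps. The sketch below is not how Elasticsearch is implemented: `SHARDS`, `shard_top` and `terms_agg` are made-up names, product names are abbreviated to their letter, ties on a shard are assumed to keep the order of the example table, and a shard that returns every term it holds is assumed to contribute no error. Under these assumptions, `size` 5 gives an overall bound of 46 and a bound of 15 for Product C, while `shard_size` 10 makes every shard return all of its terms, so the bounds drop to 0.

```python
from collections import Counter

# Full per-shard term counts from the example ("A" stands for "Product A", and so on).
SHARDS = [
    {"A": 25, "B": 18, "C": 6, "D": 3, "E": 2, "F": 2, "G": 2, "H": 2, "I": 1, "J": 1},
    {"A": 30, "B": 25, "F": 17, "Z": 16, "G": 15, "H": 14, "I": 10, "J": 8, "Q": 6, "C": 4},
    {"A": 45, "C": 44, "Z": 36, "G": 30, "E": 29, "H": 28, "Q": 2, "D": 1},
]


def shard_top(counts, shard_size):
    """Top `shard_size` (term, count) pairs of one shard, by descending count."""
    # sorted() is stable, so tied terms keep the order in which they are listed.
    return sorted(counts.items(), key=lambda kv: -kv[1])[:shard_size]


def terms_agg(shards, size, shard_size=None):
    """Simulate the reduce step together with both error values."""
    shard_size = size if shard_size is None else shard_size
    results = [shard_top(counts, shard_size) for counts in shards]
    # Count of the last term each shard returned; assumed 0 when nothing was cut off.
    last_counts = [
        result[-1][1] if len(result) < len(counts) else 0
        for result, counts in zip(results, shards)
    ]
    totals = Counter()
    for result in results:
        totals.update(dict(result))
    buckets = []
    for term, count in totals.most_common(size):
        # Per-term error: last counts of the shards that did not return the term.
        error = sum(last for result, last in zip(results, last_counts)
                    if term not in dict(result))
        buckets.append({"key": term, "doc_count": count,
                        "doc_count_error_upper_bound": error})
    return {"doc_count_error_upper_bound": sum(last_counts), "buckets": buckets}


print(terms_agg(SHARDS, size=5))
print(terms_agg(SHARDS, size=5, shard_size=10))
```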

These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard
does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the
aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be
determined and is given a value of -1 to indicate this.

[[search-aggregations-bucket-terms-aggregation-order]]
==== Order

The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. It is also possible to change this behaviour as follows:

Ordering the buckets by their `doc_count` in an ascending manner:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets alphabetically by their terms in an ascending manner:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_term" : "asc" }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "height_stats.avg" : "desc" }
            },
            "aggs" : {
                "height_stats" : { "stats" : { "field" : "height" } }
            }
        }
    }
}
--------------------------------------------------

It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
as the aggregations in the path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket
one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`);
if it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).

The path must be defined in the following form:

--------------------------------------------------
AGG_SEPARATOR       :=  '>'
METRIC_SEPARATOR    :=  '.'
AGG_NAME            :=  <the name of the aggregation>
METRIC              :=  <the name of the metric (in case of multi-value metrics aggregation)>
PATH                :=  <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
--------------------------------------------------
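To make the grammar concrete, here is a minimal Python sketch that splits a path into its aggregation names and optional metric. `OrderPath` and `parse_order_path` are hypothetical helpers written for this illustration, and the sketch assumes that aggregation names contain neither `>` nor `.`.

```python
from typing import NamedTuple, Optional


class OrderPath(NamedTuple):
    aggs: list             # aggregation names, outermost first
    metric: Optional[str]  # metric name, or None if the path has no METRIC part


def parse_order_path(path: str) -> OrderPath:
    """Split PATH := AGG_NAME['>'AGG_NAME]*['.'METRIC] into its parts."""
    head, separator, metric = path.partition(".")
    names = head.split(">")
    if not all(names) or (separator and not metric):
        raise ValueError(f"malformed order path: {path!r}")
    return OrderPath(names, metric or None)


print(parse_order_path("females>height_stats.avg"))
print(parse_order_path("avg_height"))
```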

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "address.country",
                "order" : { "females>height_stats.avg" : "desc" }
            },
            "aggs" : {
                "females" : {
                    "filter" : { "term" : { "gender" : "female" }},
                    "aggs" : {
                        "height_stats" : { "stats" : { "field" : "height" }}
                    }
                }
            }
        }
    }
}
--------------------------------------------------

The above will sort the countries buckets based on the average height among the female population.

Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "address.country",
                "order" : [ { "females>height_stats.avg" : "desc" }, { "_count" : "desc" } ]
            },
            "aggs" : {
                "females" : {
                    "filter" : { "term" : { "gender" : "female" }},
                    "aggs" : {
                        "height_stats" : { "stats" : { "field" : "height" }}
                    }
                }
            }
        }
    }
}
--------------------------------------------------

The above will sort the countries buckets based on the average height among the female population and then by
their `doc_count` in descending order.

NOTE: In the event that two buckets share the same values for all order criteria the bucket's term value is used as a
tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.

==== Minimum document count

It is possible to only return terms that match at least a configured number of hits using the `min_doc_count` option:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tag",
                "min_doc_count": 10
            }
        }
    }
}
--------------------------------------------------

The above aggregation would only return tags which have been found in 10 hits or more. Default value is `1`.

Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about whether the term will actually reach the required `min_doc_count`. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

`shard_min_doc_count` parameter

The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` by default and has no effect unless you explicitly set it.

NOTE:    Setting `min_doc_count`=`0` will also return buckets for terms that didn't match any hit. However, some of
         the returned terms which have a document count of zero might only belong to deleted documents, so there is
         no guarantee that a `match_all` query would find a positive document count for those terms.

WARNING: When NOT sorting on `doc_count` descending, high values of `min_doc_count` may return a number of buckets
         which is less than `size` because not enough data was gathered from the shards. Missing buckets can be
         brought back by increasing `shard_size`.
         Setting `shard_min_doc_count` too high will cause terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`.

[[search-aggregations-bucket-terms-aggregation-script]]
==== Script

Generating the terms using a script:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "script" : "doc['gender'].value"
            }
        }
    }
}
--------------------------------------------------

==== Value Script

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "script" : "'Gender: ' +_value"
            }
        }
    }
}
--------------------------------------------------

==== Filtering Values

It is possible to filter the values for which buckets will be created. This can be done using the `include` and
`exclude` parameters which are based on regular expression strings or arrays of exact values.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}
--------------------------------------------------

In the above example, buckets will be created for all the tags that have the word `sport` in them, except those starting
with `water_` (so the tag `water_sports` will not be aggregated). The `include` regular expression will determine what
values are "allowed" to be aggregated, while the `exclude` determines the values that should not be aggregated. When
both are defined, the `exclude` has precedence, meaning, the `include` is evaluated first and only then the `exclude`.

The regular expressions are based on the Java(TM) http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html[Pattern],
and as such, it is also possible to pass in flags that will determine how the compiled regular expression will work:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : {
                    "pattern" : ".*sport.*",
                    "flags" : "CANON_EQ|CASE_INSENSITIVE" <1>
                },
                "exclude" : {
                    "pattern" : "water_.*",
                    "flags" : "CANON_EQ|CASE_INSENSITIVE"
                }
            }
        }
    }
}
--------------------------------------------------

<1> the flags are concatenated using the `|` character as a separator

The possible flags that can be used are:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CANON_EQ[`CANON_EQ`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE[`CASE_INSENSITIVE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#COMMENTS[`COMMENTS`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#DOTALL[`DOTALL`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#LITERAL[`LITERAL`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE[`MULTILINE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CASE[`UNICODE_CASE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS[`UNICODE_CHARACTER_CLASS`] and
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNIX_LINES[`UNIX_LINES`]

For matching based on exact values the `include` and `exclude` parameters can simply take an array of
strings that represent the terms as they are found in the index:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "JapaneseCars" : {
            "terms" : {
                "field" : "make",
                "include" : ["mazda", "honda"]
            }
        },
        "ActiveCarManufacturers" : {
            "terms" : {
                "field" : "make",
                "exclude" : ["rover", "jensen"]
            }
        }
    }
}
--------------------------------------------------
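The include-then-exclude evaluation described in this section can be approximated with the following Python sketch. It is an illustration only: `is_aggregated` is a hypothetical helper, Python's `re` module stands in for the Java `Pattern` class, and the sketch assumes a pattern must match the whole term, as the `.*sport.*` pattern in the example suggests.

```python
import re


def _matches(rule, term, flags):
    """A rule is either a regular expression string or a list of exact values."""
    if isinstance(rule, str):
        return re.fullmatch(rule, term, flags) is not None
    return term in rule


def is_aggregated(term, include=None, exclude=None, flags=0):
    """Apply `include` first and only then `exclude`, so `exclude` has precedence."""
    if include is not None and not _matches(include, term, flags):
        return False
    if exclude is not None and _matches(exclude, term, flags):
        return False
    return True


for tag in ["sports", "water_sports", "news"]:
    print(tag, is_aggregated(tag, include=".*sport.*", exclude="water_.*"))
```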

==== Multi-field terms aggregation

The `terms` aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the `terms` agg doesn't collect the
string term values themselves, but rather uses
<<search-aggregations-bucket-terms-aggregation-execution-hint,global ordinals>>
to produce a list of all of the unique values in the field. Global ordinals
result in an important performance boost which would not be possible across
multiple fields.

There are two approaches that you can use to perform a `terms` agg across
multiple fields:

<<search-aggregations-bucket-terms-aggregation-script,Script>>::

Use a script to retrieve terms from multiple fields. This disables the global
ordinals optimization and will be slower than collecting terms from a single
field, but it gives you the flexibility to implement this option at search
time.

<<copy-to,`copy_to` field>>::

If you know ahead of time that you want to collect the terms from two or more
fields, then use `copy_to` in your mapping to create a new dedicated field at
index time which contains the values from both fields. You can aggregate on
this single field, which will benefit from the global ordinals optimization.

==== Collect mode

Deferring calculation of child aggregations

For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation
of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree
are expanded in one depth-first pass and only then any pruning occurs. In some rare scenarios this can be very wasteful and can hit memory constraints.
An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "actors" : {
            "terms" : {
                "field" : "actors",
                "size" : 10
            },
            "aggs" : {
                "costars" : {
                    "terms" : {
                        "field" : "actors",
                        "size" : 5
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets
during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine
the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the `breadth_first` collection
mode as opposed to the default `depth_first` mode:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "actors" : {
            "terms" : {
                "field" : "actors",
                "size" : 10,
                "collect_mode" : "breadth_first"
            },
            "aggs" : {
                "costars" : {
                    "terms" : {
                        "field" : "actors",
                        "size" : 5
                    }
                }
            }
        }
    }
}
--------------------------------------------------
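The difference between the two strategies can be illustrated by counting how many co-star buckets each one creates for a toy data set. This is a rough model rather than the actual collection logic: `MOVIES`, `depth_first_buckets` and `breadth_first_buckets` are invented for the example, and each actor is counted as their own co-star because both aggregations read the same `actors` field.

```python
from collections import Counter

# Hypothetical data: the cast list of each movie.
MOVIES = [
    ["alice", "bob", "carol"],
    ["alice", "bob"],
    ["alice", "dave", "erin", "frank"],
]


def depth_first_buckets(movies):
    """Create every (actor, co-star) bucket before any pruning takes place."""
    pairs = Counter()
    for cast in movies:
        for actor in cast:
            for costar in cast:  # n * n buckets touched per movie
                pairs[(actor, costar)] += 1
    return len(pairs)


def breadth_first_buckets(movies, size):
    """Pick the top `size` actors first, then build co-star buckets only for them."""
    actor_counts = Counter(actor for cast in movies for actor in cast)
    top_actors = {actor for actor, _ in actor_counts.most_common(size)}
    pairs = Counter()
    for cast in movies:  # second pass, standing in for the replay of cached documents
        for actor in cast:
            if actor in top_actors:
                for costar in cast:
                    pairs[(actor, costar)] += 1
    return len(pairs)


print(depth_first_buckets(MOVIES), breadth_first_buckets(MOVIES, size=1))
```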

When using `breadth_first` mode the set of documents that fall into the uppermost buckets are
cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents.
In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default `depth_first`
collection mode is normally the best bet but occasionally the `breadth_first` strategy can be significantly more efficient. Currently
Elasticsearch will always use the `depth_first` collect_mode unless explicitly instructed to use `breadth_first` as in the above example.
Note that the `order` parameter can still be used to refer to data from a child aggregation when using the `breadth_first` setting - the parent
aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.

WARNING: It is not possible to nest aggregations such as `top_hits` which require access to match score information under an aggregation that uses
the `breadth_first` collection mode. This is because this would require a RAM buffer to hold the float score value for every document and
this would typically be too costly in terms of RAM.

[[search-aggregations-bucket-terms-aggregation-execution-hint]]
==== Execution hint

There are different mechanisms by which terms aggregations can be executed:

 - by using field values directly in order to aggregate data per-bucket (`map`)
 - by using ordinals of the field and preemptively allocating one bucket per ordinal value (`global_ordinals`)
 - by using ordinals of the field and dynamically allocating one bucket per ordinal value (`global_ordinals_hash`)
 - by using per-segment ordinals to compute counts and remap these counts to global counts using global ordinals (`global_ordinals_low_cardinality`)

Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured.

`map` should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes
are significantly faster. By default, `map` is only used when running an aggregation on scripts, since they don't have
ordinals.

`global_ordinals_low_cardinality` only works for leaf terms aggregations but is usually the fastest execution mode. Memory
usage is linear with the number of unique values in the field, so it is only enabled by default on low-cardinality fields.

`global_ordinals` is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive,
especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations.

`global_ordinals_hash`, in contrast to `global_ordinals` and `global_ordinals_low_cardinality`, allocates buckets dynamically
so memory usage is linear to the number of values of the documents that are part of the aggregation scope. It is used by default
in inner aggregations.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "execution_hint": "map" <1>
            }
        }
    }
}
--------------------------------------------------

<1> the possible values are `map`, `global_ordinals`, `global_ordinals_hash` and `global_ordinals_low_cardinality`

Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.