Docs: Added explanation of how to do multi-field terms agg
Closes #5100
This commit is contained in:
parent 7c2490b2ad
commit 1bdf79e527
@ -54,19 +54,19 @@ size buckets was not returned). If set to `0`, the `size` will be set to `Intege

==== Document counts are approximate

As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
combined to give a final view. Consider the following scenario:

A request is made to obtain the top 5 terms in the field `product`, ordered by descending document count from an index with
3 shards. In this case each shard is asked to give its top 5 terms.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5
            }
        }
    }
}
--------------------------------------------------

The terms for each of the three shards are shown below with their
respective document counts in brackets:

[width="100%",cols="^2,^2,^2,^2",options="header"]
|=========================================================
| | Shard A | Shard B | Shard C

| 1 | Product A (25) | Product A (30) | Product A (45)
| 2 | Product B (18) | Product B (25) | Product C (44)
| 3 | Product C (6)  | Product F (17) | Product Z (36)
| 4 | Product D (3)  | Product Z (16) | Product G (30)
| 5 | Product E (2)  | Product G (15) | Product E (29)
| 6 | Product F (2)  | Product H (14) | Product H (28)
| 7 | Product G (2)  | Product I (10) | Product Q (2)
| 8 | Product H (2)  | Product Q (6)  | Product D (1)
| 9 | Product I (1)  | Product J (8)  |
| 10 | Product J (1) | Product C (4)  |

|=========================================================


The shards will return their top 5 terms so the results from the shards will be:

[width="100%",cols="^2,^2,^2,^2",options="header"]
|=========================================================
| | Shard A | Shard B | Shard C

| 1 | Product A (25) | Product A (30) | Product A (45)
| 2 | Product B (18) | Product B (25) | Product C (44)
| 3 | Product C (6)  | Product F (17) | Product Z (36)
| 4 | Product D (3)  | Product Z (16) | Product G (30)
| 5 | Product E (2)  | Product G (15) | Product E (29)

|=========================================================

Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
the following:

[width="40%",cols="^2,^2"]
|=========================================================

| 1 | Product A (100)
| 2 | Product Z (52)
| 3 | Product C (50)
| 4 | Product G (45)
| 5 | Product B (43)

|=========================================================

Because Product A was returned from all shards we know that its document count value is accurate. Product C was only
returned by shards A and C so its document count is shown as 50, but this is not an accurate count. Product C exists on
shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
returned only by 2 shards, but the third shard does not contain the term. There is no way of knowing, at the point of
combining the results to produce the final list of terms, that there is an error in the document count for Product C and
not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
terms because it did not make it into the top 5 terms on any of the shards.

==== Shard Size

The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).

The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
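The `shard_size` behaviour described above can be illustrated with a request sketch; the field name and the values of `size` and `shard_size` are illustrative assumptions, not recommendations:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5,
                "shard_size" : 20
            }
        }
    }
}
--------------------------------------------------

Here each shard is asked for its top 20 terms rather than its top 5, which makes the final counts more accurate at the cost of larger per-shard priority queues and larger shard responses.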
@ -153,12 +153,12 @@ on high-cardinality fields as this will kill both your CPU since terms need to b
==== Calculating Document Count Error

coming[1.4.0]

There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
a whole which represents the maximum potential document count for a term which did not make it into the final list of
terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example
given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
could have the 4th highest document count.

@ -185,13 +185,13 @@ could have the 4th highest document count.

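The example that accompanies this paragraph is elided in this view. As a hedged sketch, the whole-aggregation error is reported in the response as `doc_count_error_upper_bound`; the abbreviated response below reuses the counts from the example above, with `...` standing for the usual response envelope:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations" : {
        "products" : {
            "doc_count_error_upper_bound" : 46,
            "buckets" : [
                { "key" : "Product A", "doc_count" : 100 },
                { "key" : "Product Z", "doc_count" : 52 },
                ...
            ]
        }
    }
}
--------------------------------------------------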

The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value
for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when
deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned
by all shards which did not return the term. In the example above the error in the document count for Product C would be 15 as
Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document
count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that it would be off by
15. Product A, however, has an error of 0 for its document count; since every shard returned it, we can be confident that the count
returned is accurate.

@ -220,10 +220,10 @@ returned is accurate.

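The example that accompanies this paragraph is elided in this view. A hedged sketch of a request that enables the per-term error (field name and size are illustrative):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5,
                "show_term_doc_count_error" : true
            }
        }
    }
}
--------------------------------------------------

Each bucket in the response then carries its own `doc_count_error_upper_bound`, e.g. 0 for Product A and 15 for Product C in the example above.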

These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
ordered by the terms values themselves (either ascending or descending) there is no error in the document count since, if a shard
does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the
aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be
determined and is given a value of -1 to indicate this.

==== Order
@ -342,7 +342,7 @@ PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR

The above will sort the countries buckets based on the average height among the female population.

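The example that this sentence refers to is elided in this view; a hedged sketch of such a request, with the index fields and the aggregation names (`countries`, `females`, `height_stats`) assumed for illustration:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "address.country",
                "order" : { "females>height_stats.avg" : "desc" }
            },
            "aggs" : {
                "females" : {
                    "filter" : { "term" : { "gender" : "female" }},
                    "aggs" : {
                        "height_stats" : { "stats" : { "field" : "height" }}
                    }
                }
            }
        }
    }
}
--------------------------------------------------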
==== Minimum document count

It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option:
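The example that normally follows this sentence is elided in this view; a hedged sketch, with the field name and threshold chosen purely for illustration:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "min_doc_count" : 10
            }
        }
    }
}
--------------------------------------------------

Only terms that occur in 10 or more documents would be returned in this case.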
@ -380,6 +380,7 @@ WARNING: When NOT sorting on `doc_count` descending, high values of `min_doc_cou

back by increasing `shard_size`.

Setting `shard_min_doc_count` too high will cause terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`.

[[search-aggregations-bucket-terms-aggregation-script]]
==== Script

Generating the terms using a script:

@ -476,6 +477,33 @@ http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CA

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS[`UNICODE_CHARACTER_CLASS`] and
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNIX_LINES[`UNIX_LINES`]

==== Multi-field terms aggregation

The `terms` aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the `terms` agg doesn't collect the
string term values themselves, but rather uses
<<search-aggregations-bucket-terms-aggregation-execution-hint,global ordinals>>
to produce a list of all of the unique values in the field. The use of global
ordinals results in an important performance boost which would not be possible
across multiple fields.

There are two approaches that you can use to perform a `terms` agg across
multiple fields:

<<search-aggregations-bucket-terms-aggregation-script,Script>>::

Use a script to retrieve terms from multiple fields. This disables the global
ordinals optimization and will be slower than collecting terms from a single
field, but it gives you the flexibility to implement this option at search
time.

<<copy-to,`copy_to` field>>::

If you know ahead of time that you want to collect the terms from two or more
fields, then use `copy_to` in your mapping to create a new dedicated field at
index time which contains the values from both fields. You can aggregate on
this single field, which will benefit from the global ordinals optimization.

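A hedged sketch of the script approach described above; the field names are assumptions and the exact script syntax depends on the configured scripting language:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "full_names" : {
            "terms" : {
                "script" : "doc['first_name'].value + ' ' + doc['last_name'].value"
            }
        }
    }
}
--------------------------------------------------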
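A hedged sketch of the `copy_to` approach: the mapping below (type and field names are assumptions) copies two source fields into a dedicated `full_name` field at index time, which can then be used in an ordinary single-field `terms` aggregation:

[source,js]
--------------------------------------------------
{
    "mappings" : {
        "person" : {
            "properties" : {
                "first_name" : { "type" : "string", "copy_to" : "full_name" },
                "last_name" :  { "type" : "string", "copy_to" : "full_name" },
                "full_name" :  { "type" : "string" }
            }
        }
    }
}
--------------------------------------------------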
==== Collect mode
added[1.3.0] Deferring calculation of child aggregations
@ -483,7 +511,7 @@ added[1.3.0] Deferring calculation of child aggregations

For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation
of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree
are expanded in one depth-first pass and only then any pruning occurs. In some rare scenarios this can be very wasteful and can hit memory constraints.
An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:

@ -507,11 +535,11 @@ An example problem scenario is querying a movie database for the 10 most popular

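The request for this scenario is elided in this view; a hedged reconstruction, with the index field and aggregation names assumed for illustration:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "actors" : {
            "terms" : {
                "field" : "actors",
                "size" : 10
            },
            "aggs" : {
                "costars" : {
                    "terms" : {
                        "field" : "actors",
                        "size" : 5
                    }
                }
            }
        }
    }
}
--------------------------------------------------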

Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets
during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine
the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the `breadth_first` collection
mode as opposed to the default `depth_first` mode:

@ -537,18 +565,18 @@ mode as opposed to the default `depth_first` mode:

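The `breadth_first` request is elided in this view; a hedged reconstruction (field and aggregation names are assumptions), where the relevant addition is the `collect_mode` setting on the parent `terms` aggregation:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "actors" : {
            "terms" : {
                "field" : "actors",
                "size" : 10,
                "collect_mode" : "breadth_first"
            },
            "aggs" : {
                "costars" : {
                    "terms" : {
                        "field" : "actors",
                        "size" : 5
                    }
                }
            }
        }
    }
}
--------------------------------------------------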

When using `breadth_first` mode the set of documents that fall into the uppermost buckets is
cached for subsequent replay, so there is a memory overhead in doing this which is linear with the number of matching documents.
In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default `depth_first`
collection mode is normally the best bet, but occasionally the `breadth_first` strategy can be significantly more efficient. Currently
elasticsearch will always use the `depth_first` collect_mode unless explicitly instructed to use `breadth_first` as in the above example.
Note that the `order` parameter can still be used to refer to data from a child aggregation when using the `breadth_first` setting - the parent
aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.

WARNING: It is not possible to nest aggregations such as `top_hits` which require access to match score information under an aggregation that uses
the `breadth_first` collection mode. This is because it would require a RAM buffer to hold the float score value for every document and
this would typically be too costly in terms of RAM.

[[search-aggregations-bucket-terms-aggregation-execution-hint]]
==== Execution hint
added[1.2.0] Added the `global_ordinals`, `global_ordinals_hash` and `global_ordinals_low_cardinality` execution modes