[DOCS] Remove approximate document counts example from term agg docs (#55442)
Removes an example from the "Document counts are approximate" section of the terms agg documentation. As #52377 details, the example was no longer accurate in 7.x or 6.8. Document counts were more precise than the example presented. We've opened issue #56025 to discuss re-adding an example later.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: AB Prashanth <panuradh@buffalo.edu>
parent 17b904def5
commit e4e02e133e

docs/reference/aggregations/bucket
@@ -66,7 +66,7 @@ GET /_search
 --------------------------------------------------
 // TEST[s/_search/_search\?filter_path=aggregations/]
 
 <1> `terms` aggregation should be a field of type `keyword` or any other data type suitable for bucket aggregations. In order to use it with `text` you will need to enable
 <<fielddata, fielddata>>.
 
 Response:
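The `<1>` callout above mentions enabling <<fielddata, fielddata>> before a `terms` aggregation can run on a `text` field. A minimal sketch of doing so on an existing mapping follows; the index name `my-index-000001` and the field name `product` are assumptions for illustration, not part of this commit:

[source,console]
--------------------------------------------------
PUT my-index-000001/_mapping
{
  "properties": {
    "product": {
      "type": "text",
      "fielddata": true    <1>
    }
  }
}
--------------------------------------------------
<1> Enables in-memory fielddata so the `text` field can be aggregated. Fielddata can be memory-hungry; a `keyword` sub-field is usually the better choice when possible.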
|
@@ -124,84 +124,10 @@ NOTE: If you want to retrieve **all** terms or all combinations of terms in a ne
 [[search-aggregations-bucket-terms-aggregation-approximate-counts]]
 ==== Document counts are approximate
 
-As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
-accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
-combined to give a final view. Consider the following scenario:
+Document counts (and the results of any sub aggregations) in the terms
+aggregation are not always accurate. Each shard provides its own view of what
+the ordered list of terms should be. These views are combined to give a final
+view.
 
-A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with
-3 shards. In this case each shard is asked to give its top 5 terms.
-
-[source,console]
---------------------------------------------------
-GET /_search
-{
-    "aggs" : {
-        "products" : {
-            "terms" : {
-                "field" : "product",
-                "size" : 5
-            }
-        }
-    }
-}
---------------------------------------------------
-// TEST[s/_search/_search\?filter_path=aggregations/]
-
-The terms for each of the three shards are shown below with their
-respective document counts in brackets:
-
-[width="100%",cols="^2,^2,^2,^2",options="header"]
-|=========================================================
-| | Shard A | Shard B | Shard C
-
-| 1 | Product A (25) | Product A (30) | Product A (45)
-| 2 | Product B (18) | Product B (25) | Product C (44)
-| 3 | Product C (6) | Product F (17) | Product Z (36)
-| 4 | Product D (3) | Product Z (16) | Product G (30)
-| 5 | Product E (2) | Product G (15) | Product E (29)
-| 6 | Product F (2) | Product H (14) | Product H (28)
-| 7 | Product G (2) | Product I (10) | Product Q (2)
-| 8 | Product H (2) | Product Q (6) | Product D (1)
-| 9 | Product I (1) | Product J (6) |
-| 10 | Product J (1) | Product C (4) |
-
-|=========================================================
-
-The shards will return their top 5 terms so the results from the shards will be:
-
-[width="100%",cols="^2,^2,^2,^2",options="header"]
-|=========================================================
-| | Shard A | Shard B | Shard C
-
-| 1 | Product A (25) | Product A (30) | Product A (45)
-| 2 | Product B (18) | Product B (25) | Product C (44)
-| 3 | Product C (6) | Product F (17) | Product Z (36)
-| 4 | Product D (3) | Product Z (16) | Product G (30)
-| 5 | Product E (2) | Product G (15) | Product E (29)
-
-|=========================================================
-
-Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
-the following:
-
-[width="40%",cols="^2,^2"]
-|=========================================================
-
-| 1 | Product A (100)
-| 2 | Product Z (52)
-| 3 | Product C (50)
-| 4 | Product G (45)
-| 5 | Product B (43)
-
-|=========================================================
-
-Because Product A was returned from all shards we know that its document count value is accurate. Product C was only
-returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on
-shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
-returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of
-combining the results to produce the final list of terms, that there is an error in the document count for Product C and
-not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
-terms because it did not make it into the top five terms on any of the shards.
-
 ==== Shard Size
 
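The error the section above describes shrinks when each shard returns more candidate terms than the final `size`. A minimal sketch using the terms aggregation's `shard_size` parameter follows; the value 25 is illustrative, not a recommendation:

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "shard_size": 25    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Each shard returns its top 25 candidates instead of the default `(size * 1.5 + 10)`, making the merged top 5 counts more accurate at the cost of extra work per shard.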
|
@@ -226,35 +152,7 @@ The default `shard_size` is `(size * 1.5 + 10)`.
 
 There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
 a whole which represents the maximum potential document count for a term which did not make it into the final list of
-terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example
-given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
-could have the 4th highest document count.
-
-[source,console-result]
---------------------------------------------------
-{
-    ...
-    "aggregations" : {
-        "products" : {
-            "doc_count_error_upper_bound" : 46,
-            "sum_other_doc_count" : 79,
-            "buckets" : [
-                {
-                    "key" : "Product A",
-                    "doc_count" : 100
-                },
-                {
-                    "key" : "Product Z",
-                    "doc_count" : 52
-                }
-                ...
-            ]
-        }
-    }
-}
---------------------------------------------------
-// TESTRESPONSE[s/\.\.\.//]
-// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
+terms. This is calculated as the sum of the document count from the last term returned from each shard.
 
 ==== Per bucket document count error
 
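For the aggregation-level error value discussed in the hunk above, a `filter_path` request can isolate `doc_count_error_upper_bound` in the response. A sketch (response filtering only; it does not change how the error is computed):

[source,console]
--------------------------------------------------
GET /_search?filter_path=aggregations.products.doc_count_error_upper_bound
{
  "aggs": {
    "products": {
      "terms": { "field": "product", "size": 5 }
    }
  }
}
--------------------------------------------------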
|
@@ -280,39 +178,7 @@ GET /_search
 
 This shows an error value for each term returned by the aggregation which represents the 'worst case' error in the document count
 and can be useful when deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for
-the last term returned by all shards which did not return the term. In the example above the error in the document count for Product C
-would be 15 as Shard B was the only shard not to return the term and the document count of the last term it did return was 15.
-The actual document count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that
-it would be off by 15. Product A, however has an error of 0 for its document count, since every shard returned it we can be confident
-that the count returned is accurate.
-
-[source,console-result]
---------------------------------------------------
-{
-    ...
-    "aggregations" : {
-        "products" : {
-            "doc_count_error_upper_bound" : 46,
-            "sum_other_doc_count" : 79,
-            "buckets" : [
-                {
-                    "key" : "Product A",
-                    "doc_count" : 100,
-                    "doc_count_error_upper_bound" : 0
-                },
-                {
-                    "key" : "Product Z",
-                    "doc_count" : 52,
-                    "doc_count_error_upper_bound" : 2
-                }
-                ...
-            ]
-        }
-    }
-}
---------------------------------------------------
-// TESTRESPONSE[s/\.\.\.//]
-// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
+the last term returned by all shards which did not return the term.
 
 These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
 ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard
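The per-bucket error shown in the removed `console-result` block above is only included when explicitly requested. A minimal sketch using the `show_term_doc_count_error` flag:

[source,console]
--------------------------------------------------
GET /_search
{
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "show_term_doc_count_error": true    <1>
      }
    }
  }
}
--------------------------------------------------
<1> Adds a `doc_count_error_upper_bound` to each bucket in the response, alongside the aggregation-level value.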
|
@@ -673,7 +539,7 @@ GET /_search
 
 ===== Filtering Values with partitions
 
 Sometimes there are too many unique terms to process in a single request/response pair so
 it can be useful to break the analysis up into multiple requests.
 This can be achieved by grouping the field's values into a number of partitions at query-time and processing
 only one partition in each request.
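A partitioned request of the kind described in the next hunk might look like the following sketch; the `account_id` and `access_date` field names, and the surrounding structure, are assumptions for illustration:

[source,console]
--------------------------------------------------
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 0,         <1>
          "num_partitions": 20    <2>
        },
        "size": 10000,
        "order": { "last_access": "asc" }
      },
      "aggs": {
        "last_access": { "max": { "field": "access_date" } }
      }
    }
  }
}
--------------------------------------------------
<1> Only terms hashed into partition 0 are considered in this request.
<2> The field's unique values are hashed evenly into 20 partitions (0 to 19).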
|
@@ -712,10 +578,10 @@ GET /_search
 This request is finding the last logged access date for a subset of customer accounts because we
 might want to expire some customer accounts who haven't been seen for a long while.
 The `num_partitions` setting has requested that the unique account_ids are organized evenly into twenty
 partitions (0 to 19), and the `partition` setting in this request filters to only consider account_ids falling
 into partition 0. Subsequent requests should ask for partitions 1 then 2, etc. to complete the expired-account analysis.
 
 Note that the `size` setting for the number of results returned needs to be tuned with the `num_partitions`.
 For this particular account-expiration example the process for balancing values for `size` and `num_partitions` would be as follows:
 
 1. Use the `cardinality` aggregation to estimate the total number of unique account_id values (sketched below)
||||||
|
@ -724,8 +590,8 @@ For this particular account-expiration example the process for balancing values
|
||||||
4. Run a test request
|
4. Run a test request
|
||||||
|
|
||||||
If we have a circuit-breaker error we are trying to do too much in one request and must increase `num_partitions`.
|
If we have a circuit-breaker error we are trying to do too much in one request and must increase `num_partitions`.
|
||||||
If the request was successful but the last account ID in the date-sorted test response was still an account we might want to
|
If the request was successful but the last account ID in the date-sorted test response was still an account we might want to
|
||||||
expire then we may be missing accounts of interest and have set our numbers too low. We must either
|
expire then we may be missing accounts of interest and have set our numbers too low. We must either
|
||||||
|
|
||||||
* increase the `size` parameter to return more results per partition (could be heavy on memory) or
|
* increase the `size` parameter to return more results per partition (could be heavy on memory) or
|
||||||
* increase the `num_partitions` to consider less accounts per request (could increase overall processing time as we need to make more requests)
|
* increase the `num_partitions` to consider less accounts per request (could increase overall processing time as we need to make more requests)
|
||||||
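As a worked example of this balancing, with hypothetical numbers: if the cardinality estimate is about 100,000 unique `account_id` values and `num_partitions` is 20, each partition covers roughly 100,000 / 20 = 5,000 values, so a `size` of, say, 10,000 leaves comfortable headroom per request:

[source,console]
--------------------------------------------------
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 1,    <1>
          "num_partitions": 20
        },
        "size": 10000        <2>
      }
    }
  }
}
--------------------------------------------------
<1> The follow-up to the partition-0 request; later requests walk through partitions 2 to 19.
<2> Roughly double the expected ~5,000 terms per partition, so no partition silently drops terms.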
|
|