[DOCS] Remove approximate document counts example from term agg docs (#55442)
Removes an example from the "Document counts are approximate" section of the terms agg documentation. As #52377 details, the example was no longer accurate in 7.x or 6.8. Document counts were more precise than the example presented. We've opened issue #56025 to discuss re-adding an example later.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: AB Prashanth <panuradh@buffalo.edu>
parent 17b904def5
commit e4e02e133e
@@ -66,7 +66,7 @@ GET /_search
 --------------------------------------------------
 // TEST[s/_search/_search\?filter_path=aggregations/]

-<1> `terms` aggregation should be a field of type `keyword` or any other data type suitable for bucket aggregations. In order to use it with `text` you will need to enable
+<1> `terms` aggregation should be a field of type `keyword` or any other data type suitable for bucket aggregations. In order to use it with `text` you will need to enable
 <<fielddata, fielddata>>.

 Response:
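(For context on the `<<fielddata, fielddata>>` callout above: enabling fielddata on a `text` field is a mapping update along the lines of the sketch below. The `my-index` index name is illustrative, not part of this commit.)

[source,console]
--------------------------------------------------
PUT /my-index/_mapping
{
    "properties" : {
        "product" : {
            "type" : "text",
            "fielddata" : true
        }
    }
}
--------------------------------------------------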
@@ -124,84 +124,10 @@ NOTE: If you want to retrieve **all** terms or all combinations of terms in a ne
 [[search-aggregations-bucket-terms-aggregation-approximate-counts]]
 ==== Document counts are approximate

-As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always
-accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are
-combined to give a final view. Consider the following scenario:
-
-A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with
-3 shards. In this case each shard is asked to give its top 5 terms.
-
-[source,console]
---------------------------------------------------
-GET /_search
-{
-    "aggs" : {
-        "products" : {
-            "terms" : {
-                "field" : "product",
-                "size" : 5
-            }
-        }
-    }
-}
---------------------------------------------------
-// TEST[s/_search/_search\?filter_path=aggregations/]
-
-The terms for each of the three shards are shown below with their
-respective document counts in brackets:
-
-[width="100%",cols="^2,^2,^2,^2",options="header"]
-|=========================================================
-| | Shard A | Shard B | Shard C
-
-| 1 | Product A (25) | Product A (30) | Product A (45)
-| 2 | Product B (18) | Product B (25) | Product C (44)
-| 3 | Product C (6) | Product F (17) | Product Z (36)
-| 4 | Product D (3) | Product Z (16) | Product G (30)
-| 5 | Product E (2) | Product G (15) | Product E (29)
-| 6 | Product F (2) | Product H (14) | Product H (28)
-| 7 | Product G (2) | Product I (10) | Product Q (2)
-| 8 | Product H (2) | Product Q (6) | Product D (1)
-| 9 | Product I (1) | Product J (6) |
-| 10 | Product J (1) | Product C (4) |
-
-|=========================================================
-
-The shards will return their top 5 terms so the results from the shards will be:
-
-[width="100%",cols="^2,^2,^2,^2",options="header"]
-|=========================================================
-| | Shard A | Shard B | Shard C
-
-| 1 | Product A (25) | Product A (30) | Product A (45)
-| 2 | Product B (18) | Product B (25) | Product C (44)
-| 3 | Product C (6) | Product F (17) | Product Z (36)
-| 4 | Product D (3) | Product Z (16) | Product G (30)
-| 5 | Product E (2) | Product G (15) | Product E (29)
-
-|=========================================================
-
-Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces
-the following:
-
-[width="40%",cols="^2,^2"]
-|=========================================================
-
-| 1 | Product A (100)
-| 2 | Product Z (52)
-| 3 | Product C (50)
-| 4 | Product G (45)
-| 5 | Product B (43)
-
-|=========================================================
-
-Because Product A was returned from all shards we know that its document count value is accurate. Product C was only
-returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on
-shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also
-returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of
-combining the results to produce the final list of terms, that there is an error in the document count for Product C and
-not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of
-terms because it did not make it into the top five terms on any of the shards.
+Document counts (and the results of any sub aggregations) in the terms
+aggregation are not always accurate. Each shard provides its own view of what
+the ordered list of terms should be. These views are combined to give a final
+view.

 ==== Shard Size

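(Even with the worked example removed, the practical remedy it motivated still applies: asking each shard for more terms than `size` tightens the merged counts. A minimal sketch using the documented `shard_size` parameter; the value 25 is arbitrary.)

[source,console]
--------------------------------------------------
GET /_search
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5,
                "shard_size" : 25
            }
        }
    }
}
--------------------------------------------------

(With the default `(size * 1.5 + 10)` noted in the next hunk, `"size" : 5` already implies roughly `5 * 1.5 + 10 ≈ 17` terms fetched per shard, so an explicit `shard_size` only helps when set above that.)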
@@ -226,35 +152,7 @@ The default `shard_size` is `(size * 1.5 + 10)`.

 There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as
 a whole which represents the maximum potential document count for a term which did not make it into the final list of
-terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example
-given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned
-could have the 4th highest document count.
-
-[source,console-result]
---------------------------------------------------
-{
-    ...
-    "aggregations" : {
-        "products" : {
-            "doc_count_error_upper_bound" : 46,
-            "sum_other_doc_count" : 79,
-            "buckets" : [
-                {
-                    "key" : "Product A",
-                    "doc_count" : 100
-                },
-                {
-                    "key" : "Product Z",
-                    "doc_count" : 52
-                }
-                ...
-            ]
-        }
-    }
-}
---------------------------------------------------
-// TESTRESPONSE[s/\.\.\.//]
-// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
+terms. This is calculated as the sum of the document count from the last term returned from each shard.

 ==== Per bucket document count error

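(The per-bucket errors discussed in the next hunk are only reported when requested. A minimal sketch using the documented `show_term_doc_count_error` flag, reusing the illustrative `product` field.)

[source,console]
--------------------------------------------------
GET /_search
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5,
                "show_term_doc_count_error" : true
            }
        }
    }
}
--------------------------------------------------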
@@ -280,39 +178,7 @@ GET /_search

 This shows an error value for each term returned by the aggregation which represents the 'worst case' error in the document count
 and can be useful when deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for
-the last term returned by all shards which did not return the term. In the example above the error in the document count for Product C
-would be 15 as Shard B was the only shard not to return the term and the document count of the last term it did return was 15.
-The actual document count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that
-it would be off by 15. Product A, however has an error of 0 for its document count, since every shard returned it we can be confident
-that the count returned is accurate.
-
-[source,console-result]
---------------------------------------------------
-{
-    ...
-    "aggregations" : {
-        "products" : {
-            "doc_count_error_upper_bound" : 46,
-            "sum_other_doc_count" : 79,
-            "buckets" : [
-                {
-                    "key" : "Product A",
-                    "doc_count" : 100,
-                    "doc_count_error_upper_bound" : 0
-                },
-                {
-                    "key" : "Product Z",
-                    "doc_count" : 52,
-                    "doc_count_error_upper_bound" : 2
-                }
-                ...
-            ]
-        }
-    }
-}
---------------------------------------------------
-// TESTRESPONSE[s/\.\.\.//]
-// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
+the last term returned by all shards which did not return the term.

 These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is
 ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard
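(As the context above notes, these error bounds only hold when buckets are ordered by descending document count. Ordering by the term values themselves, where counts carry no such error, uses the documented `_key` order; a minimal sketch.)

[source,console]
--------------------------------------------------
GET /_search
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "order" : { "_key" : "asc" }
            }
        }
    }
}
--------------------------------------------------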
@@ -673,7 +539,7 @@ GET /_search

 ===== Filtering Values with partitions

-Sometimes there are too many unique terms to process in a single request/response pair so
+Sometimes there are too many unique terms to process in a single request/response pair so
 it can be useful to break the analysis up into multiple requests.
 This can be achieved by grouping the field's values into a number of partitions at query-time and processing
 only one partition in each request.
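(The partition-filtering request itself is elided between these hunks. A minimal sketch of its shape using the documented `include` partitioning parameter; the `expired_sessions`, `account_id`, and `last_access` names follow the narrative in the next hunk but are illustrative here.)

[source,console]
--------------------------------------------------
GET /_search
{
    "size" : 0,
    "aggs" : {
        "expired_sessions" : {
            "terms" : {
                "field" : "account_id",
                "include" : {
                    "partition" : 0,
                    "num_partitions" : 20
                },
                "size" : 10000
            },
            "aggs" : {
                "last_access" : {
                    "max" : { "field" : "last_access" }
                }
            }
        }
    }
}
--------------------------------------------------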
@@ -712,10 +578,10 @@ GET /_search
 This request is finding the last logged access date for a subset of customer accounts because we
 might want to expire some customer accounts who haven't been seen for a long while.
 The `num_partitions` setting has requested that the unique account_ids are organized evenly into twenty
-partitions (0 to 19). and the `partition` setting in this request filters to only consider account_ids falling
+partitions (0 to 19). and the `partition` setting in this request filters to only consider account_ids falling
 into partition 0. Subsequent requests should ask for partitions 1 then 2 etc to complete the expired-account analysis.

-Note that the `size` setting for the number of results returned needs to be tuned with the `num_partitions`.
+Note that the `size` setting for the number of results returned needs to be tuned with the `num_partitions`.
 For this particular account-expiration example the process for balancing values for `size` and `num_partitions` would be as follows:

 1. Use the `cardinality` aggregation to estimate the total number of unique account_id values
@@ -724,8 +590,8 @@ For this particular account-expiration example the process for balancing values
 4. Run a test request

 If we have a circuit-breaker error we are trying to do too much in one request and must increase `num_partitions`.
-If the request was successful but the last account ID in the date-sorted test response was still an account we might want to
-expire then we may be missing accounts of interest and have set our numbers too low. We must either
+If the request was successful but the last account ID in the date-sorted test response was still an account we might want to
+expire then we may be missing accounts of interest and have set our numbers too low. We must either

 * increase the `size` parameter to return more results per partition (could be heavy on memory) or
 * increase the `num_partitions` to consider less accounts per request (could increase overall processing time as we need to make more requests)
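(Step 1 in the checklist above uses the `cardinality` aggregation to size the problem. A minimal sketch of that estimate, with the `account_id` field name taken from the narrative.)

[source,console]
--------------------------------------------------
GET /_search
{
    "size" : 0,
    "aggs" : {
        "unique_account_ids" : {
            "cardinality" : {
                "field" : "account_id"
            }
        }
    }
}
--------------------------------------------------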