OpenSearch/docs/reference/search/aggregations
Britta Weber 7944369fd1 Add `shard_min_doc_count` parameter for significant terms similar to `shard_size`
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as the criterion to determine
whether a bucket is kept. If many terms with low subsetDF score very high
but `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low-frequency terms, all of which are
filtered out in the end.
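
The failure mode described above can be reproduced with a small sketch. It is an illustration only, not the actual implementation: the function names, the tuple layout `(term, subset_df, score)`, and the sample data are all hypothetical.

```python
import heapq

def shard_top_terms(terms, shard_size):
    """Keep the shard_size highest-scoring terms, like a bounded priority queue."""
    # Each entry is a hypothetical (term, subset_df, score) tuple.
    return heapq.nlargest(shard_size, terms, key=lambda t: t[2])

def reduce_terms(per_shard, min_doc_count):
    """Sum subset_df per term across shards, then apply min_doc_count."""
    totals = {}
    for shard in per_shard:
        for term, subset_df, _score in shard:
            totals[term] = totals.get(term, 0) + subset_df
    return {t: df for t, df in totals.items() if df >= min_doc_count}

# Made-up shard contents: three rare terms outscore the one frequent term.
shard = [
    ("rare_a", 1, 0.99),
    ("rare_b", 1, 0.98),
    ("rare_c", 2, 0.97),
    ("common", 40, 0.60),
]

kept = shard_top_terms(shard, shard_size=3)         # only the rare terms survive
result = reduce_terms([kept], min_doc_count=10)     # all of them are filtered out
print(result)
```

With `shard_size=3` the queue holds only the three rare terms, so the final `min_doc_count` filter leaves nothing, even though `common` would have qualified.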

This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear which value this parameter must be set to
because we cannot know how many terms with low frequency are scored higher than
the high-frequency terms that we are actually interested in.

On the other hand, if there is no routing of docs to shards involved, we can perhaps
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add terms to the pq that have
subsetDF <= `shard_min_doc_count`, which can be set to something like
`min_doc_count`/number of shards, because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.
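
As a rough sketch of this heuristic, assuming an even distribution of terms across shards: the helper names, the tuple layout `(term, subset_df, score)`, and the sample data below are hypothetical, and the `<=` comparison simply follows the wording of this message rather than any confirmed implementation detail.

```python
import heapq

def suggested_shard_min_doc_count(min_doc_count, number_of_shards):
    """Heuristic from the text: min_doc_count divided by the number of shards."""
    return min_doc_count // number_of_shards

def shard_top_terms(terms, shard_size, shard_min_doc_count):
    """Bounded queue that never admits terms with subset_df <= shard_min_doc_count."""
    # Each entry is a hypothetical (term, subset_df, score) tuple.
    eligible = [t for t in terms if t[1] > shard_min_doc_count]
    return heapq.nlargest(shard_size, eligible, key=lambda t: t[2])

# Made-up shard contents: rare terms score higher than the frequent one.
shard = [
    ("rare_a", 1, 0.99),
    ("rare_b", 1, 0.98),
    ("rare_c", 2, 0.97),
    ("common", 40, 0.60),
]

threshold = suggested_shard_min_doc_count(min_doc_count=10, number_of_shards=5)
kept = shard_top_terms(shard, shard_size=3, shard_min_doc_count=threshold)
print(threshold, kept)
```

Here the rare terms never enter the queue, so the frequent term is kept without having to guess a larger `shard_size`.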

closes #5998
closes #6041
2014-05-07 18:02:56 +02:00
..
bucket Add `shard_min_doc_count` parameter for significant terms similar to `shard_size` 2014-05-07 18:02:56 +02:00
metrics Update Documentation Feature Flags [1.1.0] 2014-03-25 17:51:30 +01:00
bucket.asciidoc Added `reverse_nested` aggregation. 2014-05-01 00:23:05 +07:00
metrics.asciidoc Cardinality aggregation. 2014-03-13 19:19:56 +01:00