OpenSearch/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc

76 lines
2.3 KiB
Plaintext
Raw Normal View History

[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation
experimental[]
A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`
Example:
[source,js]
--------------------------------------------------
{
"query": {
"match": {
"text": "iphone"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 200
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "text"
}
}
}
}
}
}
--------------------------------------------------
Response:
[source,js]
--------------------------------------------------
{
...
"aggregations": {
"sample": {
"doc_count": 1000,<1>
"keywords": {
"doc_count": 1000,
"buckets": [
...
{
"key": "bend",
"doc_count": 58,
"score": 37.982536582524276,
"bg_count": 103
},
....
}
--------------------------------------------------
<1> 1000 documents were sampled in total because we asked for a maximum of 200 from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
==== shard_size
The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
==== Limitations
===== Cannot be nested under `breadth_first` aggregations
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores.
In this situation an error will be thrown.