[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term `kibana` will match many documents - most of them missing the word Kibana.
To focus the `significant_terms` aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query we use a sample.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:stackoverflow]

Response:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 200,<1>
            "keywords": {
                "doc_count": 200,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "elasticsearch",
                        "doc_count": 150,
                        "score": 1.078125,
                        "bg_count": 200
                    },
                    {
                        "key": "logstash",
                        "doc_count": 50,
                        "score": 0.5625,
                        "bg_count": 50
                    }
                ]
            }
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was therefore limited rather than unbounded.

Without the `sampler` aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "low_quality_keywords": {
            "significant_terms": {
                "field": "tags",
                "size": 3,
                "exclude": ["kibana", "javascript"]
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:stackoverflow]

Response:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "low_quality_keywords": {
            "doc_count": 600,
            "bg_count": 650,
            "buckets": [
                {
                    "key": "angular",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "jquery",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "logstash",
                    "doc_count": 50,
                    "score": 0.0069,
                    "bg_count": 50
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]

==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.

==== Limitations

===== Cannot be nested under `breadth_first` aggregations

Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores.
In this situation an error will be thrown.
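
As a minimal sketch of the arrangement described above, the following request nests a `sampler` under a `terms` aggregation whose `collect_mode` is set to `breadth_first`, so it is expected to be rejected rather than return results. The aggregation names `by_tag` and `sample` are illustrative only, and the `tags` field and query are borrowed from the earlier examples; the exact error message is not reproduced here.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "by_tag": {
            "terms": {
                "field": "tags",
                "collect_mode": "breadth_first"
            },
            "aggs": {
                "sample": {
                    "sampler": {
                        "shard_size": 200
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE

Leaving `collect_mode` at its default `depth_first` value keeps the relevance scores available to the nested `sampler`.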