[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation used to limit the processing of any sub-aggregations to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results using only samples, e.g. `significant_terms`

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term
`kibana` will match many documents, most of them missing the word Kibana. To focus
the `significant_terms` aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query, we use a sample.

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": [ "kibana", "javascript" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 200, <1>
      "keywords": {
        "doc_count": 200,
        "bg_count": 650,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 150,
            "score": 1.078125,
            "bg_count": 200
          },
          {
            "key": "logstash",
            "doc_count": 50,
            "score": 0.5625,
            "bg_count": 50
          }
        ]
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
therefore limited rather than unbounded.

Without the `sampler` aggregation, the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": [ "kibana", "javascript" ]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "low_quality_keywords": {
      "doc_count": 600,
      "bg_count": 650,
      "buckets": [
        {
          "key": "angular",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "jquery",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "logstash",
          "doc_count": 50,
          "score": 0.0069,
          "bg_count": 50
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]
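
The `score` values in the two responses above can be checked by hand. The following is a minimal sketch, assuming the scores come from the JLH heuristic (absolute change in frequency multiplied by relative change), which is understood to be the `significant_terms` default. The function name `jlh_score` and its parameter names are hypothetical and are not part of any Elasticsearch API.

```python
def jlh_score(doc_count, subset_size, bg_count, superset_size):
    """JLH-style significance: absolute change times relative change."""
    fg = doc_count / subset_size    # term frequency in the foreground set
    bg = bg_count / superset_size   # term frequency in the whole index
    if fg <= bg:
        return 0.0                  # term is not more common in the foreground
    return (fg - bg) * (fg / bg)


# Sampled request: foreground is the 200 sampled docs, background is 650 docs.
sampled_elasticsearch = jlh_score(150, 200, 200, 650)
sampled_logstash = jlh_score(50, 200, 50, 650)

# Unsampled request: foreground is all 600 matching docs.
unsampled_angular = jlh_score(200, 600, 200, 650)
unsampled_logstash = jlh_score(50, 600, 50, 650)

print(round(sampled_elasticsearch, 6), round(sampled_logstash, 6))
print(round(unsampled_angular, 5), round(unsampled_logstash, 5))
```

Under this assumption, the sampled foreground of 200 documents yields much larger scores than the 600-document foreground, because terms such as `elasticsearch` are far more concentrated in the top-scoring sample than in the index as a whole.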

==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
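
Because the limit applies per shard, the total sample can be as large as the number of shards multiplied by `shard_size`. The following is a conceptual sketch only, not how Elasticsearch implements sampling: the helper `sample_per_shard` and the three-shard data set are hypothetical, and each hit is modelled as a `(score, doc_id)` pair.

```python
import heapq


def sample_per_shard(shards, shard_size=100):
    """Keep the shard_size highest-scoring (score, doc_id) hits from each shard."""
    sample = []
    for hits in shards:
        # nlargest returns at most shard_size hits, best score first
        sample.extend(heapq.nlargest(shard_size, hits))
    return sample


# Hypothetical index: three shards holding 5, 2 and 4 matching documents.
shards = [
    [(0.9, "a1"), (0.4, "a2"), (0.7, "a3"), (0.1, "a4"), (0.8, "a5")],
    [(0.6, "b1"), (0.2, "b2")],
    [(0.5, "c1"), (0.3, "c2"), (0.95, "c3"), (0.05, "c4")],
]

sample = sample_per_shard(shards, shard_size=3)
print(len(sample))  # 3 + 2 + 3 = 8 documents in the combined sample
```

A shard with fewer matches than `shard_size` contributes everything it has, so the combined sample may be smaller than the upper bound.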

==== Limitations

[[sampler-breadth-first-nested-agg]]
===== Cannot be nested under `breadth_first` aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation that has its `collect_mode` switched from the default `depth_first` mode to `breadth_first`, as this discards scores.
In this situation an error will be thrown.