161 lines
4.9 KiB
Plaintext
161 lines
4.9 KiB
Plaintext
[[search-aggregations-bucket-sampler-aggregation]]
|
|
=== Sampler Aggregation
|
|
|
|
A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
|
|
|
|
.Example use cases:
|
|
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
|
|
* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms`
|
|
|
|
|
|
Example:
|
|
|
|
A query on StackOverflow data for the popular term `javascript` OR the rarer term
|
|
`kibana` will match many documents - most of them missing the word Kibana. To focus
|
|
the `significant_terms` aggregation on top-scoring documents that are more likely to match
|
|
the most interesting parts of our query we use a sample.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /stackoverflow/_search?size=0
|
|
{
|
|
"query": {
|
|
"query_string": {
|
|
"query": "tags:kibana OR tags:javascript"
|
|
}
|
|
},
|
|
"aggs": {
|
|
"sample": {
|
|
"sampler": {
|
|
"shard_size": 200
|
|
},
|
|
"aggs": {
|
|
"keywords": {
|
|
"significant_terms": {
|
|
"field": "tags",
|
|
"exclude": ["kibana", "javascript"]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:stackoverflow]
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
...
|
|
"aggregations": {
|
|
"sample": {
|
|
"doc_count": 200,<1>
|
|
"keywords": {
|
|
"doc_count": 200,
|
|
"bg_count": 650,
|
|
"buckets": [
|
|
{
|
|
"key": "elasticsearch",
|
|
"doc_count": 150,
|
|
"score": 1.078125,
|
|
"bg_count": 200
|
|
},
|
|
{
|
|
"key": "logstash",
|
|
"doc_count": 50,
|
|
"score": 0.5625,
|
|
"bg_count": 50
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
|
|
|
|
<1> 200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was
|
|
therefore limited rather than unbounded.
|
|
|
|
|
|
Without the `sampler` aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies
|
|
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.
|
|
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /stackoverflow/_search?size=0
|
|
{
|
|
"query": {
|
|
"query_string": {
|
|
"query": "tags:kibana OR tags:javascript"
|
|
}
|
|
},
|
|
"aggs": {
|
|
"low_quality_keywords": {
|
|
"significant_terms": {
|
|
"field": "tags",
|
|
"size": 3,
|
|
"exclude":["kibana", "javascript"]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:stackoverflow]
|
|
|
|
Response:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
...
|
|
"aggregations": {
|
|
"low_quality_keywords": {
|
|
"doc_count": 600,
|
|
"bg_count": 650,
|
|
"buckets": [
|
|
{
|
|
"key": "angular",
|
|
"doc_count": 200,
|
|
"score": 0.02777,
|
|
"bg_count": 200
|
|
},
|
|
{
|
|
"key": "jquery",
|
|
"doc_count": 200,
|
|
"score": 0.02777,
|
|
"bg_count": 200
|
|
},
|
|
{
|
|
"key": "logstash",
|
|
"doc_count": 50,
|
|
"score": 0.0069,
|
|
"bg_count": 50
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
|
|
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
|
|
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]
|
|
|
|
|
|
|
|
==== shard_size
|
|
|
|
The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
|
|
The default value is 100.
|
|
|
|
==== Limitations
|
|
|
|
===== Cannot be nested under `breadth_first` aggregations
|
|
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
|
|
It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores.
|
|
In this situation an error will be thrown. |