[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results from a sample alone, e.g. `significant_terms`
 

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term
`kibana` will match many documents, most of which do not contain the word Kibana. To focus
the `significant_terms` aggregation on the top-scoring documents, which are more likely to match
the most interesting parts of our query, we use a sample.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:stackoverflow]

Response:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 200,<1>
            "keywords": {
                "doc_count": 200,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "elasticsearch",
                        "doc_count": 150,
                        "score": 1.078125,
                        "bg_count": 200
                    },
                    {
                        "key": "logstash",
                        "doc_count": 50,
                        "score": 0.5625,
                        "bg_count": 50
                    }
                ]
            }
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
therefore bounded by the sample size rather than by the number of matching documents.


Without the `sampler` aggregation, the request considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.


[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "low_quality_keywords": {
            "significant_terms": {
                "field": "tags",
                "size": 3,
                "exclude": ["kibana", "javascript"]
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:stackoverflow]

Response:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "low_quality_keywords": {
            "doc_count": 600,
            "bg_count": 650,
            "buckets": [
                {
                    "key": "angular",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "jquery",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "logstash",
                    "doc_count": 50,
                    "score": 0.0069,
                    "bg_count": 50
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]


==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
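
As a minimal sketch, assuming the same `stackoverflow` index used in the examples above, the request below lowers `shard_size` to 50. The value 50 is arbitrary and chosen only for illustration. Because the limit applies to each shard separately, the `doc_count` reported for the sample can be as large as `shard_size` multiplied by the number of shards searched.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 50
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE

Omitting the `shard_size` setting altogether falls back to the default of 100 documents per shard.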

==== Limitations

===== Cannot be nested under `breadth_first` aggregations
Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation whose `collect_mode` has been switched from the default `depth_first` mode to `breadth_first`, as this mode discards scores.
In this situation an error will be thrown.
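
The following sketch illustrates the kind of request this limitation rules out. It assumes the same `stackoverflow` index as the earlier examples, and the aggregation name `by_tag` is hypothetical. The exact error returned is not shown here because it may differ between versions.

[source,js]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "by_tag": {
            "terms": {
                "field": "tags",
                "collect_mode": "breadth_first"
            },
            "aggs": {
                "sample": {
                    "sampler": {
                        "shard_size": 200
                    },
                    "aggs": {
                        "keywords": {
                            "significant_terms": {
                                "field": "tags"
                            }
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE

Leaving `collect_mode` unset on the parent `terms` aggregation, or setting it to `depth_first`, keeps the scores available to the nested `sampler`.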