[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

experimental[]

A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results from a sample alone, e.g. `significant_terms`

Example:

[source,js]
--------------------------------------------------
{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "text"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 1000,<1>
            "keywords": {
                "doc_count": 1000,
                "buckets": [
                    ...
                    {
                        "key": "bend",
                        "doc_count": 58,
                        "score": 37.982536582524276,
                        "bg_count": 103
                    },
                    ...
}
--------------------------------------------------

<1> 1000 documents were sampled in total because we asked for a maximum of 200 per shard from an index with 5 shards (5 × 200 = 1000). The cost of performing the nested `significant_terms` aggregation was therefore limited rather than unbounded.


==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
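
As a minimal sketch, assuming the same `text` field used in the example above, the request below omits `shard_size`, so each shard should contribute at most the default 100 top-scoring documents to the sample:

[source,js]
--------------------------------------------------
{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {},
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "text"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

On an index with 5 shards, this would cap the sample at 500 documents in total.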

==== Limitations

===== Cannot be nested under `breadth_first` aggregations
Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation whose `collect_mode` has been switched from the default `depth_first` mode to `breadth_first`, as this discards scores.
In this situation an error will be thrown.
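
The following sketch illustrates the kind of request that would be rejected for this reason. The `category` field and the aggregation names `categories` and `sample` are hypothetical, chosen only for illustration:

[source,js]
--------------------------------------------------
{
    "query": {
        "match": {
            "text": "iphone"
        }
    },
    "aggs": {
        "categories": {
            "terms": {
                "field": "category",
                "collect_mode": "breadth_first"
            },
            "aggs": {
                "sample": {
                    "sampler": {
                        "shard_size": 200
                    }
                }
            }
        }
    }
}
--------------------------------------------------

Leaving `collect_mode` unset, or setting it to `depth_first`, should keep scores available to the nested sampler.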