OpenSearch/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc

[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation that limits the processing of any sub-aggregations to a sample of the top-scoring documents.

.Example use cases:
* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
* Reducing the running cost of aggregations that can produce useful results from samples alone, e.g. `significant_terms`
 

Example:

A query on StackOverflow data for the popular term `javascript` OR the rarer term
`kibana` matches many documents, most of them missing the word Kibana. To focus
the `significant_terms` aggregation on the top-scoring documents, which are more likely
to match the most interesting parts of our query, we use a sample.

[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": [ "kibana", "javascript" ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 200, <1>
      "keywords": {
        "doc_count": 200,
        "bg_count": 650,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 150,
            "score": 1.078125,
            "bg_count": 200
          },
          {
            "key": "logstash",
            "doc_count": 50,
            "score": 0.5625,
            "bg_count": 50
          }
        ]
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

<1> 200 documents were sampled in total. The cost of performing the nested `significant_terms` aggregation was
therefore limited rather than unbounded.


Without the `sampler` aggregation, the request considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as `jquery` and `angular` rather than focusing on the more insightful Kibana-related terms.


[source,console]
--------------------------------------------------
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": [ "kibana", "javascript" ]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:stackoverflow]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "low_quality_keywords": {
      "doc_count": 600,
      "bg_count": 650,
      "buckets": [
        {
          "key": "angular",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "jquery",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "logstash",
          "doc_count": 50,
          "score": 0.0069,
          "bg_count": 50
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/0.02777/$body.aggregations.low_quality_keywords.buckets.0.score/]
// TESTRESPONSE[s/0.0069/$body.aggregations.low_quality_keywords.buckets.2.score/]
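
The scores in the two responses can be checked by hand. The following is a minimal sketch that assumes the requests use the default JLH significance heuristic, which multiplies the absolute change in a term's frequency (foreground minus background) by the relative change (foreground divided by background). The function name `jlh_score` and its parameter names are illustrative only and are not part of any API.

```python
def jlh_score(subset_count, subset_size, superset_count, superset_size):
    """Sketch of the JLH heuristic: (fg% - bg%) * (fg% / bg%).

    Returns 0.0 when the term is no more frequent in the subset than in
    the background. A term found in the subset also occurs in the
    superset, so the background frequency is assumed to be non-zero.
    """
    fg = subset_count / subset_size      # foreground frequency
    bg = superset_count / superset_size  # background frequency
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


# Sampled response: the 200 sampled documents form the foreground set,
# measured against the 650-document background.
sampled_elasticsearch = jlh_score(150, 200, 200, 650)
sampled_logstash = jlh_score(50, 200, 50, 650)

# Unsampled response: all 600 matching documents form the foreground set.
unsampled_angular = jlh_score(200, 600, 200, 650)
unsampled_logstash = jlh_score(50, 600, 50, 650)

print(sampled_elasticsearch, sampled_logstash)
print(unsampled_angular, unsampled_logstash)
```

Under that assumption the sketch reproduces the `score` values shown above. Because the unsampled foreground set covers almost the whole background (600 of 650 documents), no term can stand out much, which is why every score in the second response is so low.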


==== shard_size

The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
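
Because the limit applies per shard, the combined sample can hold up to `shard_size` documents for each shard searched. The sketch below is a simplified simulation of that behaviour, not the actual implementation: shards are modelled as plain lists of `(doc_id, score)` pairs, tie-breaking between equal scores is not modelled, and the name `sample_per_shard` is hypothetical.

```python
import heapq


def sample_per_shard(shards, shard_size=100):
    """Keep the `shard_size` highest-scoring documents from every shard.

    `shards` is a list of shards, each a list of (doc_id, score) tuples.
    The combined sample therefore holds at most
    len(shards) * shard_size documents.
    """
    sample = []
    for docs in shards:
        # nlargest returns the top entries ordered from highest score down.
        sample.extend(heapq.nlargest(shard_size, docs, key=lambda doc: doc[1]))
    return sample


# Two toy shards with made-up documents and scores.
shards = [
    [("a", 1.0), ("b", 3.0), ("c", 2.0)],
    [("d", 0.5)],
]
sample = sample_per_shard(shards, shard_size=2)
print(sample)
```

With `shard_size=2`, the first shard contributes its two best documents and the second shard contributes its only document, so a shard holding fewer matches than `shard_size` simply contributes everything it has.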

==== Limitations

[[sampler-breadth-first-nested-agg]]
===== Cannot be nested under `breadth_first` aggregations
Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a `terms` aggregation whose `collect_mode` has been switched from the default `depth_first` mode to `breadth_first`, because that mode discards scores.
In this situation, an error is thrown.
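
A client can catch this combination before sending a request. The following is a hypothetical client-side check, not something the server or any client library provides: it walks the `aggs` section of a request body and reports every `sampler` found beneath a `terms` aggregation that sets `collect_mode` to `breadth_first`. It assumes the restriction applies at any depth below such a `terms` aggregation, and the helper name and the `by_tag` aggregation name are invented for illustration.

```python
def find_unsupported_samplers(aggs, under_breadth_first=False, path=()):
    """Return the paths of sampler aggregations nested under breadth_first terms.

    `aggs` is the dict found under "aggs" (or "aggregations") in a
    search request body.
    """
    problems = []
    for name, body in aggs.items():
        here = path + (name,)
        if "sampler" in body and under_breadth_first:
            problems.append(" > ".join(here))
        # A terms aggregation in breadth_first mode discards scores for
        # everything collected beneath it.
        terms = body.get("terms", {})
        discards_scores = (
            under_breadth_first or terms.get("collect_mode") == "breadth_first"
        )
        sub_aggs = body.get("aggs") or body.get("aggregations") or {}
        problems.extend(find_unsupported_samplers(sub_aggs, discards_scores, here))
    return problems


request_aggs = {
    "by_tag": {
        "terms": {"field": "tags", "collect_mode": "breadth_first"},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": 200},
                "aggs": {
                    "keywords": {"significant_terms": {"field": "tags"}}
                },
            }
        },
    }
}
print(find_unsupported_samplers(request_aggs))
```

Removing `collect_mode` from the `terms` aggregation, or moving the `sampler` above it, makes the check return an empty list.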