Add documentation for collapse, oversample, truncate_hits processors (#5881)
* Add documentation for collapse, oversample, truncate_hits processors
* Apply suggestions from code review
* Update _search-plugins/search-pipelines/oversample-processor.md
* Update _search-plugins/search-pipelines/collapse-processor.md
* Update _search-plugins/search-pipelines/truncate-hits-processor.md
* More editorial comments and link fixes
* Add oversample and deduplicate to vale and format files nicely

---------

Signed-off-by: Michael Froh <froh@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
This commit is contained in:
parent 6af66500eb
commit 83f91acd5b
@@ -75,6 +75,7 @@ Levenshtein
 [Mm]ultivalued
 [Mm]ultiword
 [Nn]amespace
+[Oo]versamples?
 pebibyte
 [Pp]luggable
 [Pp]reconfigure
@@ -0,0 +1,144 @@
---
layout: default
title: Collapse
nav_order: 7
has_children: false
parent: Search processors
grand_parent: Search pipelines
---

# Collapse processor

The `collapse` response processor discards hits that have the same value for a particular field as a previous document in the result set. This is similar to passing the `collapse` parameter in a search request, but the response processor is applied to the response after fetching from all shards. The `collapse` response processor may be used in conjunction with the `rescore` search request parameter or may be applied after a reranking response processor.

Using the `collapse` response processor will likely result in fewer than `size` results being returned because hits are discarded from a set whose size is already less than or equal to `size`. To increase the likelihood of returning `size` hits, use the `oversample` request processor and `truncate_hits` response processor, as shown in [this example]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/truncate-hits-processor/#oversample-collapse-and-truncate-hits).
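Conceptually, the processor walks the ranked hit list once and keeps only the first hit for each distinct value of the collapse field. The following Python sketch illustrates that deduplication logic (an illustrative approximation, not the plugin's actual implementation):

```python
def collapse_hits(hits, field):
    """Keep only the first hit for each distinct value of `field`,
    preserving the original ranked order of hits."""
    seen = set()
    collapsed = []
    for hit in hits:
        value = hit["_source"].get(field)
        if value not in seen:
            seen.add(value)
            collapsed.append(hit)
    return collapsed

hits = [
    {"_id": "1", "_source": {"title": "document 1", "color": "blue"}},
    {"_id": "2", "_source": {"title": "document 2", "color": "blue"}},
    {"_id": "3", "_source": {"title": "document 3", "color": "red"}},
]
print([h["_id"] for h in collapse_hits(hits, "color")])  # ['1', '3']
```

Because duplicates are discarded rather than replaced, the collapsed list can be shorter than the requested `size`, which is why oversampling first increases the chance of returning a full page of results.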

## Request fields

The following table lists all request fields.

Field | Data type | Description
:--- | :--- | :---
`field` | String | The field whose value will be read from each returned search hit. Only the first hit for each given field value will be returned in the search response. Required.
`context_prefix` | String | May be used to read the `original_size` variable from a specific scope in order to avoid collisions. Optional.
`tag` | String | The processor's identifier. Optional.
`description` | String | A description of the processor. Optional.
`ignore_failure` | Boolean | If `true`, OpenSearch [ignores any failure]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/#ignoring-processor-failures) of this processor and continues to run the remaining processors in the search pipeline. Optional. Default is `false`.
## Example

The following example demonstrates using a search pipeline with a `collapse` processor.

### Setup

Create many documents containing a field to use for collapsing:

```json
POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "title" : "document 1", "color":"blue" }
{ "create":{"_index":"my_index","_id":2}}
{ "title" : "document 2", "color":"blue" }
{ "create":{"_index":"my_index","_id":3}}
{ "title" : "document 3", "color":"red" }
{ "create":{"_index":"my_index","_id":4}}
{ "title" : "document 4", "color":"red" }
{ "create":{"_index":"my_index","_id":5}}
{ "title" : "document 5", "color":"yellow" }
{ "create":{"_index":"my_index","_id":6}}
{ "title" : "document 6", "color":"yellow" }
{ "create":{"_index":"my_index","_id":7}}
{ "title" : "document 7", "color":"orange" }
{ "create":{"_index":"my_index","_id":8}}
{ "title" : "document 8", "color":"orange" }
{ "create":{"_index":"my_index","_id":9}}
{ "title" : "document 9", "color":"green" }
{ "create":{"_index":"my_index","_id":10}}
{ "title" : "document 10", "color":"green" }
```
{% include copy-curl.html %}

Create a pipeline that collapses on the `color` field:

```json
PUT /_search/pipeline/collapse_pipeline
{
  "response_processors": [
    {
      "collapse" : {
        "field": "color"
      }
    }
  ]
}
```
{% include copy-curl.html %}

### Using a search pipeline

In this example, you request the top three documents before collapsing on the `color` field. Because the first two documents have the same `color`, the second one is discarded, and the request returns the first and third documents:

```json
POST /my_index/_search?search_pipeline=collapse_pipeline
{
  "size": 3
}
```
{% include copy-curl.html %}

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 1",
          "color" : "blue"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 3",
          "color" : "red"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [ ]
  }
}
```
</details>
@@ -0,0 +1,292 @@
---
layout: default
title: Oversample
nav_order: 17
has_children: false
parent: Search processors
grand_parent: Search pipelines
---

# Oversample processor

The `oversample` request processor multiplies the `size` parameter of the search request by a specified `sample_factor` (>= 1.0), saving the original value in the `original_size` pipeline variable. The `oversample` processor is designed to work with the [`truncate_hits` response processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/truncate-hits-processor/) but may be used on its own.
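The size adjustment itself is simple arithmetic: the new `size` is the original `size` multiplied by `sample_factor` and rounded up, while the original value is stashed for later processors to read. The following Python sketch models that behavior (function and variable names are illustrative, not taken from the plugin source):

```python
import math

def oversample(size, sample_factor):
    """Return the enlarged request size plus the saved pipeline state."""
    if sample_factor < 1.0:
        raise ValueError("sample_factor must be >= 1.0")
    pipeline_state = {"original_size": size}  # read later by truncate_hits
    return math.ceil(size * sample_factor), pipeline_state

print(oversample(5, 1.5))  # (8, {'original_size': 5})
```

Note that the product is rounded up, so a `size` of 5 with a `sample_factor` of 1.5 requests 8 hits, matching the example later on this page.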

## Request fields

The following table lists all request fields.

Field | Data type | Description
:--- | :--- | :---
`sample_factor` | Float | The multiplicative factor (>= 1.0) that will be applied to the `size` parameter before processing the search request. Required.
`context_prefix` | String | May be used to scope the `original_size` variable in order to avoid collisions. Optional.
`tag` | String | The processor's identifier. Optional.
`description` | String | A description of the processor. Optional.
`ignore_failure` | Boolean | If `true`, OpenSearch [ignores any failure]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/#ignoring-processor-failures) of this processor and continues to run the remaining processors in the search pipeline. Optional. Default is `false`.
## Example

The following example demonstrates using a search pipeline with an `oversample` processor.

### Setup

Create an index named `my_index` containing many documents:

```json
POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "doc": { "title" : "document 1" }}
{ "create":{"_index":"my_index","_id":2}}
{ "doc": { "title" : "document 2" }}
{ "create":{"_index":"my_index","_id":3}}
{ "doc": { "title" : "document 3" }}
{ "create":{"_index":"my_index","_id":4}}
{ "doc": { "title" : "document 4" }}
{ "create":{"_index":"my_index","_id":5}}
{ "doc": { "title" : "document 5" }}
{ "create":{"_index":"my_index","_id":6}}
{ "doc": { "title" : "document 6" }}
{ "create":{"_index":"my_index","_id":7}}
{ "doc": { "title" : "document 7" }}
{ "create":{"_index":"my_index","_id":8}}
{ "doc": { "title" : "document 8" }}
{ "create":{"_index":"my_index","_id":9}}
{ "doc": { "title" : "document 9" }}
{ "create":{"_index":"my_index","_id":10}}
{ "doc": { "title" : "document 10" }}
```
{% include copy-curl.html %}

### Creating a search pipeline

The following request creates a search pipeline named `my_pipeline` with an `oversample` request processor that requests 50% more hits than specified in `size`:

```json
PUT /_search/pipeline/my_pipeline
{
  "request_processors": [
    {
      "oversample" : {
        "tag" : "oversample_1",
        "description" : "This processor will multiply `size` by 1.5.",
        "sample_factor" : 1.5
      }
    }
  ]
}
```
{% include copy-curl.html %}

### Using a search pipeline

Search for documents in `my_index` without a search pipeline:

```json
POST /my_index/_search
{
  "size": 5
}
```
{% include copy-curl.html %}

The response contains five hits:

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      }
    ]
  }
}
```
</details>

To search with a pipeline, specify the pipeline name in the `search_pipeline` query parameter:

```json
POST /my_index/_search?search_pipeline=my_pipeline
{
  "size": 5
}
```
{% include copy-curl.html %}

The response contains eight documents (5 * 1.5 = 7.5, rounded up to 8):

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 6"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "7",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 7"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "8",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 8"
          }
        }
      }
    ]
  }
}
```
</details>
@@ -1,7 +1,7 @@
 ---
 layout: default
 title: Personalize search ranking
-nav_order: 40
+nav_order: 18
 has_children: false
 parent: Search processors
 grand_parent: Search pipelines
@@ -26,6 +26,8 @@ Processor | Description | Earliest available version
 [`filter_query`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/filter-query-processor/) | Adds a filtering query that is used to filter requests. | 2.8
 [`neural_query_enricher`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/neural-query-enricher/) | Sets a default model for neural search at the index or field level. | 2.11
 [`script`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/script-processor/) | Adds a script that is run on newly indexed documents. | 2.8
+[`oversample`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/oversample-processor/) | Increases the search request `size` parameter, storing the original value in the pipeline state. | 2.12

 ## Search response processors

@@ -37,6 +39,8 @@ Processor | Description | Earliest available version
 :--- | :--- | :---
 [`personalize_search_ranking`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/personalize-search-ranking/) | Uses [Amazon Personalize](https://aws.amazon.com/personalize/) to rerank search results (requires setting up the Amazon Personalize service). | 2.9
 [`rename_field`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/rename-field-processor/) | Renames an existing field. | 2.8
+[`collapse`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/collapse-processor/) | Deduplicates search hits based on a field value, similarly to `collapse` in a search request. | 2.12
+[`truncate_hits`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/truncate-hits-processor/) | Discards search hits after a specified target count is reached. Can undo the effect of the `oversample` request processor. | 2.12

 ## Search phase results processors
@@ -0,0 +1,516 @@
---
layout: default
title: Truncate hits
nav_order: 35
has_children: false
parent: Search processors
grand_parent: Search pipelines
---

# Truncate hits processor

The `truncate_hits` response processor discards returned search hits after a given hit count is reached. The `truncate_hits` processor is designed to work with the [`oversample` request processor]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/oversample-processor/) but may be used on its own.

The `target_size` parameter (which specifies where to truncate) is optional. If it is not specified, then OpenSearch uses the `original_size` variable set by the `oversample` processor (if available).

The following is a common usage pattern:

1. Add the `oversample` processor to a request pipeline to fetch a larger set of results.
1. In the response pipeline, apply a reranking processor (which may promote results from beyond the originally requested top N) or the `collapse` processor (which may discard results after deduplication).
1. Apply the `truncate_hits` processor to return (at most) the originally requested number of hits.
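In pseudocode terms, the processor resolves the cutoff from `target_size` or, failing that, from the saved `original_size`, and then drops everything past it. The following minimal Python sketch shows that fallback logic (illustrative only, not the plugin's actual implementation):

```python
def truncate_hits(hits, pipeline_state, target_size=None):
    """Truncate `hits` to `target_size`, falling back to the
    `original_size` value saved by an earlier oversample processor."""
    if target_size is None:
        if "original_size" not in pipeline_state:
            raise ValueError("no target_size and no original_size available")
        target_size = pipeline_state["original_size"]
    if target_size < 0:
        raise ValueError("target_size must be >= 0")
    return hits[:target_size]

eight_hits = [{"_id": str(i)} for i in range(1, 9)]
print(len(truncate_hits(eight_hits, {"original_size": 5})))  # 5
```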

## Request fields

The following table lists all request fields.

Field | Data type | Description
:--- | :--- | :---
`target_size` | Integer | The maximum number of search hits to return (>= 0). If not specified, the processor will try to read the `original_size` variable and will fail if it is not available. Optional.
`context_prefix` | String | May be used to read the `original_size` variable from a specific scope in order to avoid collisions. Optional.
`tag` | String | The processor's identifier. Optional.
`description` | String | A description of the processor. Optional.
`ignore_failure` | Boolean | If `true`, OpenSearch [ignores any failure]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/creating-search-pipeline/#ignoring-processor-failures) of this processor and continues to run the remaining processors in the search pipeline. Optional. Default is `false`.
## Example

The following example demonstrates using a search pipeline with a `truncate_hits` processor.

### Setup

Create an index named `my_index` containing many documents:

```json
POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "doc": { "title" : "document 1" }}
{ "create":{"_index":"my_index","_id":2}}
{ "doc": { "title" : "document 2" }}
{ "create":{"_index":"my_index","_id":3}}
{ "doc": { "title" : "document 3" }}
{ "create":{"_index":"my_index","_id":4}}
{ "doc": { "title" : "document 4" }}
{ "create":{"_index":"my_index","_id":5}}
{ "doc": { "title" : "document 5" }}
{ "create":{"_index":"my_index","_id":6}}
{ "doc": { "title" : "document 6" }}
{ "create":{"_index":"my_index","_id":7}}
{ "doc": { "title" : "document 7" }}
{ "create":{"_index":"my_index","_id":8}}
{ "doc": { "title" : "document 8" }}
{ "create":{"_index":"my_index","_id":9}}
{ "doc": { "title" : "document 9" }}
{ "create":{"_index":"my_index","_id":10}}
{ "doc": { "title" : "document 10" }}
```
{% include copy-curl.html %}

### Creating a search pipeline

The following request creates a search pipeline named `my_pipeline` with a `truncate_hits` response processor that discards hits after the first five:

```json
PUT /_search/pipeline/my_pipeline
{
  "response_processors": [
    {
      "truncate_hits" : {
        "tag" : "truncate_1",
        "description" : "This processor will discard results after the first 5.",
        "target_size" : 5
      }
    }
  ]
}
```
{% include copy-curl.html %}

### Using a search pipeline

Search for documents in `my_index` without a search pipeline:

```json
POST /my_index/_search
{
  "size": 8
}
```
{% include copy-curl.html %}

The response contains eight hits:

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 6"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "7",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 7"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "8",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 8"
          }
        }
      }
    ]
  }
}
```
</details>

To search with a pipeline, specify the pipeline name in the `search_pipeline` query parameter:

```json
POST /my_index/_search?search_pipeline=my_pipeline
{
  "size": 8
}
```
{% include copy-curl.html %}

The response contains only five hits, even though eight were requested and 10 were available:

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      }
    ]
  }
}
```
</details>

## Oversample, collapse, and truncate hits

The following is a more realistic example in which you use `oversample` to request many candidate documents, use `collapse` to remove documents that duplicate a particular field (to get more diverse results), and then use `truncate_hits` to return the originally requested document count (to avoid returning a large result payload from the cluster).
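Putting the three processors together, the request is enlarged on the way in, deduplicated on the way out, and cut back to the caller's original `size` at the end. The following Python sketch chains the three steps over an in-memory document list (a conceptual model with hypothetical helper names, not the plugin API):

```python
import math

def search_with_pipeline(docs, size, sample_factor, collapse_field):
    """Model of an oversample -> collapse -> truncate_hits pipeline."""
    # Request processor: enlarge the fetch size, remember the original.
    original_size = size
    fetched = docs[: math.ceil(size * sample_factor)]
    # Response processor 1: keep the first hit per distinct field value.
    seen, collapsed = set(), []
    for doc in fetched:
        if doc[collapse_field] not in seen:
            seen.add(doc[collapse_field])
            collapsed.append(doc)
    # Response processor 2: cut back to the originally requested size.
    return collapsed[:original_size]

docs = [{"id": i, "color": c} for i, c in enumerate(
    ["blue", "blue", "red", "red", "yellow", "yellow",
     "orange", "orange", "green", "green"], start=1)]
# With size=3 and sample_factor=3, nine candidates are fetched,
# collapsed to one document per color, then truncated back to three.
print(search_with_pipeline(docs, size=3, sample_factor=3,
                           collapse_field="color"))  # ids 1, 3, 5
```

This mirrors the end-to-end example below, where documents 1, 3, and 5 (blue, red, and yellow) survive the pipeline.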

### Setup

Create many documents containing a field that you'll use for collapsing:

```json
POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "title" : "document 1", "color":"blue" }
{ "create":{"_index":"my_index","_id":2}}
{ "title" : "document 2", "color":"blue" }
{ "create":{"_index":"my_index","_id":3}}
{ "title" : "document 3", "color":"red" }
{ "create":{"_index":"my_index","_id":4}}
{ "title" : "document 4", "color":"red" }
{ "create":{"_index":"my_index","_id":5}}
{ "title" : "document 5", "color":"yellow" }
{ "create":{"_index":"my_index","_id":6}}
{ "title" : "document 6", "color":"yellow" }
{ "create":{"_index":"my_index","_id":7}}
{ "title" : "document 7", "color":"orange" }
{ "create":{"_index":"my_index","_id":8}}
{ "title" : "document 8", "color":"orange" }
{ "create":{"_index":"my_index","_id":9}}
{ "title" : "document 9", "color":"green" }
{ "create":{"_index":"my_index","_id":10}}
{ "title" : "document 10", "color":"green" }
```
{% include copy-curl.html %}

Create a pipeline that collapses on the `color` field:

```json
PUT /_search/pipeline/collapse_pipeline
{
  "response_processors": [
    {
      "collapse" : {
        "field": "color"
      }
    }
  ]
}
```
{% include copy-curl.html %}

Create another pipeline that oversamples, collapses, and then truncates results:

```json
PUT /_search/pipeline/oversampling_collapse_pipeline
{
  "request_processors": [
    {
      "oversample": {
        "sample_factor": 3
      }
    }
  ],
  "response_processors": [
    {
      "collapse" : {
        "field": "color"
      }
    },
    {
      "truncate_hits": {
        "description": "Truncates back to the original size before oversample increased it."
      }
    }
  ]
}
```
{% include copy-curl.html %}

### Collapse without oversample

In this example, you request the top three documents before collapsing on the `color` field. Because the first two documents have the same `color`, the second one is discarded, and the request returns the first and third documents:

```json
POST /my_index/_search?search_pipeline=collapse_pipeline
{
  "size": 3
}
```
{% include copy-curl.html %}

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 1",
          "color" : "blue"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 3",
          "color" : "red"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [ ]
  }
}
```
</details>

### Oversample, collapse, and truncate

Now you will use the `oversampling_collapse_pipeline`, which requests the top nine documents (multiplying the size by 3), deduplicates by `color`, and then returns the top three hits:

```json
POST /my_index/_search?search_pipeline=oversampling_collapse_pipeline
{
  "size": 3
}
```
{% include copy-curl.html %}

<details open markdown="block">
  <summary>
    Response
  </summary>
  {: .text-delta}

```json
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 1",
          "color" : "blue"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 3",
          "color" : "red"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 5",
          "color" : "yellow"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [ ]
  }
}
```
</details>