[7.x] [DOCS] Add PIT to search after docs (#61593) (#62101)

2020-09-14 09:13:23 -04:00 · 2020-09-14 09:13:23 -04:00 · af13c9802d
parent 95766da345
commit af13c9802d
5 changed files with 308 additions and 176 deletions
--- a/docs/reference/search.asciidoc
+++ b/docs/reference/search.asciidoc
@@ -14,6 +14,7 @@ exception of the <<search-explain,explain API>>.
 * <<search-search>>
 * <<search-multi-search>>
 * <<async-search>>
+* <<point-in-time-api>>
 * <<scroll-api>>
 * <<clear-scroll-api>>
 * <<search-suggesters>>
@@ -51,6 +52,8 @@ include::search/search.asciidoc[]

 include::search/async-search.asciidoc[]

+include::search/point-in-time-api.asciidoc[]
+
 include::search/scroll-api.asciidoc[]

 include::search/clear-scroll-api.asciidoc[]
--- a/docs/reference/search/point-in-time-api.asciidoc
+++ b/docs/reference/search/point-in-time-api.asciidoc
@@ -0,0 +1,120 @@
+[role="xpack"]
+[testenv="basic"]
+[[point-in-time-api]]
+=== Point in time API
++++
+<titleabbrev>Point in time</titleabbrev>
++++
+
+By default, a search request executes against the most recent visible data of
+the target indices, which is called a point in time. An Elasticsearch point in
+time (PIT) is a lightweight view into the state of the data as it existed when
+the PIT was opened. In some cases, it's preferable to run multiple search requests
+against the same point in time. For example, if <<indices-refresh,refreshes>> happen
+between `search_after` requests, the results of those requests might be inconsistent,
+because changes made between searches are only visible to the more recent point in time.
+
+A point in time must be opened explicitly before being used in search requests. The
+`keep_alive` parameter tells Elasticsearch how long it should keep a point in time alive,
+for example, `?keep_alive=5m`.
+
+[source,console]
+--------------------------------------------------
+POST /my-index-000001/_pit?keep_alive=1m
+--------------------------------------------------
+// TEST[setup:my_index]
+
+The response to the above request includes an `id`, which should
+be passed as the `id` of the `pit` parameter in a search request.
+
+[source,console]
+--------------------------------------------------
+POST /_search <1>
+{
+    "size": 100,
+    "query": {
+        "match" : {
+            "title" : "elasticsearch"
+        }
+    },
+    "pit": {
+	    "id":  "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", <2>
+	    "keep_alive": "1m"  <3>
+    }
+}
+--------------------------------------------------
+// TEST[catch:missing]
+
+<1> A search request with the `pit` parameter must not specify `index`, `routing`,
+or {ref}/search-request-body.html#request-body-search-preference[`preference`],
+as these parameters are copied from the point in time.
+<2> The `id` parameter tells Elasticsearch to execute the request using contexts
+from this point in time.
+<3> The `keep_alive` parameter tells Elasticsearch how long it should extend
+the time to live of the point in time.
+
+IMPORTANT: The open point in time request and each subsequent search request can
+return a different `id`. Always use the most recently received `id` for the
+next search request.
+
+[[point-in-time-keep-alive]]
+==== Keeping point in time alive
+The `keep_alive` parameter, which is passed to an open point in time request and
+to each search request, extends the time to live of the corresponding point in time.
+The value (for example, `1m`; see <<time-units>>) does not need to be long enough to
+process all data. It only needs to be long enough for the next request.
+
+Normally, the background merge process optimizes the index by merging together
+smaller segments to create new, bigger segments. Once the smaller segments are
+no longer needed they are deleted. However, open point-in-times prevent the
+old segments from being deleted since they are still in use.
+
+TIP: Keeping older segments alive means that more disk space and file handles
+are needed. Ensure that you have configured your nodes to have ample free file
+handles. See <<file-descriptors>>.
+
+Additionally, if a segment contains deleted or updated documents then the
+point in time must keep track of whether each document in the segment was live at
+the time of the initial search request. Ensure that your nodes have sufficient heap
+space if you have many open point-in-times on an index that is subject to ongoing
+deletes or updates.
+
+You can check how many point-in-times (i.e., search contexts) are open with the
+<<cluster-nodes-stats,nodes stats API>>:
+
+[source,console]
+---------------------------------------
+GET /_nodes/stats/indices/search
+---------------------------------------
+
+[[close-point-in-time-api]]
+==== Close point in time API
+
+A point in time is automatically closed when its `keep_alive` period has
+elapsed. However, keeping point-in-times open has a cost, as discussed in the
+<<point-in-time-keep-alive,previous section>>. Close point-in-times
+as soon as they are no longer used in search requests.
+
+[source,console]
+---------------------------------------
+DELETE /_pit
+{
+    "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
+}
+---------------------------------------
+// TEST[catch:missing]
+
+The API returns the following response:
+
+[source,console-result]
+--------------------------------------------------
+{
+   "succeeded": true, <1>
+   "num_freed": 3     <2>
+}
+--------------------------------------------------
+// TESTRESPONSE[s/"succeeded": true/"succeeded": $body.succeeded/]
+// TESTRESPONSE[s/"num_freed": 3/"num_freed": $body.num_freed/]
+
+<1> If `true`, all search contexts associated with the point-in-time ID were successfully closed.
+<2> The number of search contexts that were successfully closed.
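The point in time lifecycle documented in `point-in-time-api.asciidoc` above (open, search, close) can be sketched as a small client-side helper. This is a minimal illustration and not part of the commit. The `Send` transport type and the function names `open_pit`, `search_pit`, and `close_pit` are hypothetical. Only the request paths and the `id`, `pit_id`, and `succeeded` fields come from the documentation in this diff.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical transport: takes (method, path, body) and returns the parsed
# JSON response. Swap in a real HTTP client of your choice.
Send = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def open_pit(send: Send, index: str, keep_alive: str = "1m") -> str:
    """Open a point in time on `index` and return its ID."""
    response = send("POST", f"/{index}/_pit?keep_alive={keep_alive}", None)
    return response["id"]


def search_pit(
    send: Send,
    pit_id: str,
    query: Dict[str, Any],
    size: int = 100,
    keep_alive: str = "1m",
) -> Tuple[List[Dict[str, Any]], str]:
    """Run one search against a PIT; return (hits, most recent PIT ID)."""
    body = {
        "size": size,
        "query": query,
        # keep_alive here extends the PIT's time to live.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    # The path names no index: the target comes from the PIT itself.
    response = send("POST", "/_search", body)
    # The ID can change between requests, so carry the newest one forward.
    return response["hits"]["hits"], response.get("pit_id", pit_id)


def close_pit(send: Send, pit_id: str) -> bool:
    """Close the PIT; return True if all of its search contexts were freed."""
    response = send("DELETE", "/_pit", {"id": pit_id})
    return bool(response.get("succeeded", False))
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and makes the ID hand-off between requests easy to exercise with a stub.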
--- a/docs/reference/search/scroll-api.asciidoc
+++ b/docs/reference/search/scroll-api.asciidoc
@@ -4,6 +4,10 @@
 <titleabbrev>Scroll</titleabbrev>
 ++++

+IMPORTANT: We no longer recommend using the scroll API for deep pagination. If
+you need to preserve the index state while paging through more than 10,000 hits,
+use the <<search-after,`search_after`>> parameter with a point in time (PIT).
+
 Retrieves the next batch of results for a <<scroll-search-results,scrolling
 search>>.

--- a/docs/reference/search/search-your-data/paginate-search-results.asciidoc
+++ b/docs/reference/search/search-your-data/paginate-search-results.asciidoc
@@ -1,18 +1,11 @@
 [[paginate-search-results]]
 == Paginate search results

-By default, the <<search-search,search API>> returns the top 10 matching documents.
-
-To paginate through a larger set of results, you can use the search API's `size`
-and `from` parameters. The `size` parameter is the number of matching documents
-to return. The `from` parameter is a zero-indexed offset from the beginning of
-the complete result set that indicates the document you want to start with.
-
-The following search API request sets the `from` offset to `5`, meaning the
-request offsets, or skips, the first five matching documents.
-
-The `size` parameter is `20`, meaning the request can return up to 20 documents,
-starting at the offset.
+By default, searches return the top 10 matching hits. To page through a larger
+set of results, you can use the <<search-search,search API>>'s `from` and `size`
+parameters. The `from` parameter defines the number of hits to skip, defaulting
+to `0`. The `size` parameter is the maximum number of hits to return. Together,
+these two parameters define a page of results.

 [source,console]
 ----
@@ -28,29 +21,177 @@ GET /_search
 }
 ----

-By default, you cannot page through more than 10,000 documents using the `from`
-and `size` parameters. This limit is set using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
+Avoid using `from` and `size` to page too deeply or request too many results at
+once. Search requests usually span multiple shards. Each shard must load its
+requested hits and the hits for any previous pages into memory. For deep pages
+or large sets of results, these operations can significantly increase memory and
+CPU usage, resulting in degraded performance or node failures.

-Deep paging or requesting many results at once can result in slow searches.
-Results are sorted before being returned. Because search requests usually span
-multiple shards, each shard must generate its own sorted results. These separate
-results must then be combined and sorted to ensure that the overall sort order
-is correct.
+By default, you cannot use `from` and `size` to page through more than 10,000
+hits. This limit is a safeguard set by the
+<<index-max-result-window,`index.max_result_window`>> index setting. If you need
+to page through more than 10,000 hits, use the <<search-after,`search_after`>>
+parameter instead.

-As an alternative to deep paging, we recommend using
-<<scroll-search-results,scroll>> or the
-<<search-after,`search_after`>> parameter.
+WARNING: {es} uses Lucene's internal doc IDs as tie-breakers. These internal doc
+IDs can be completely different across replicas of the same data. When paging
+search hits, you might occasionally see that documents with the same sort values
+are not ordered consistently.
+
+[discrete]
+[[search-after]]
+=== Search after
+
+You can use the `search_after` parameter to retrieve the next page of hits
+using a set of <<sort-search-results,sort values>> from the previous page.
+
+Using `search_after` requires multiple search requests with the same `query` and
+`sort` values. If a <<near-real-time,refresh>> occurs between these requests,
+the order of your results may change, causing inconsistent results across pages. To
+prevent this, you can create a <<point-in-time-api,point in time (PIT)>> to
+preserve the current index state over your searches.
+
+[source,console]
+----
+POST /my-index-000001/_pit?keep_alive=1m
+----
+// TEST[setup:my_index]
+
+The API returns a PIT ID.
+
+[source,console-result]
+----
+{
+  "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
+}
+----
+// TESTRESPONSE[s/"id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="/"id": $body.id/]
+
+To get the first page of results, submit a search request with a `sort`
+argument. If using a PIT, specify the PIT ID in the `pit.id` parameter and omit
+the target data stream or index from the request path.
+
+IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
+tiebreaker field should contain a unique value for each document. If you don't
+include a tiebreaker field, your paged results could miss or duplicate hits.
+
+[source,console]
+----
+GET /_search
+{
+  "size": 10000,
+  "query": {
+    "match" : {
+      "user.id" : "elkbee"
+    }
+  },
+  "pit": {
+	    "id":  "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", <1>
+	    "keep_alive": "1m"
+  },
+  "sort": [ <2>
+    {"@timestamp": "asc"},
+    {"tie_breaker_id": "asc"}
+  ]
+}
+----
+// TEST[catch:missing]
+
+<1> PIT ID for the search.
+<2> Sorts hits for the search.
+
+The search response includes an array of `sort` values for each hit. If you used
+a PIT, the response's `pit_id` parameter contains an updated PIT ID.
+
+[source,console-result]
+----
+{
+  "pit_id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
+  "took" : 17,
+  "timed_out" : false,
+  "_shards" : ...,
+  "hits" : {
+    "total" : ...,
+    "max_score" : null,
+    "hits" : [
+      ...
+      {
+        "_index" : "my-index-000001",
+        "_id" : "FaslK3QBySSL_rrj9zM5",
+        "_score" : null,
+        "_source" : ...,
+        "sort" : [                                <2>
+          4098435132000,
+          "FaslK3QBySSL_rrj9zM5"
+        ]
+      }
+    ]
+  }
+}
+----
+// TESTRESPONSE[skip: unable to access PIT ID]
+
+<1> Updated `id` for the point in time.
+<2> Sort values for the last returned hit.
+
+To get the next page of results, rerun the previous search using the last hit's
+sort values as the `search_after` argument. If using a PIT, use the latest PIT
+ID in the `pit.id` parameter. The search's `query` and `sort` arguments must
+remain unchanged. If provided, the `from` argument must be `0` (default) or `-1`.
+
+[source,console]
+----
+GET /_search
+{
+  "size": 10000,
+  "query": {
+    "match" : {
+      "user.id" : "elkbee"
+    }
+  },
+  "pit": {
+	    "id":  "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
+	    "keep_alive": "1m"
+  },
+  "sort": [
+    {"@timestamp": "asc"},
+    {"tie_breaker_id": "asc"}
+  ],
+  "search_after": [                                <2>
+    4098435132000,
+    "FaslK3QBySSL_rrj9zM5"
+  ]
+}
+----
+// TEST[catch:missing]
+
+<1> PIT ID returned by the previous search.
+<2> Sort values from the previous search's last hit.
+
+You can repeat this process to get additional pages of results. If using a PIT,
+you can extend the PIT's retention period using the
+`keep_alive` parameter of each search request.
+
+When you're finished, you should delete your PIT.
+
+[source,console]
+----
+DELETE /_pit
+{
+    "id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA=="
+}
+----
+// TEST[catch:missing]

-WARNING: {es} uses Lucene's internal doc IDs as tie-breakers. These internal
-doc IDs can be completely different across replicas of the same
-data. When paginating, you might occasionally see that documents with the same
-sort values are not ordered consistently.

 [discrete]
 [[scroll-search-results]]
 === Scroll search results

+IMPORTANT: We no longer recommend using the scroll API for deep pagination. If
+you need to preserve the index state while paging through more than 10,000 hits,
+use the <<search-after,`search_after`>> parameter with a point in time (PIT).
+
 While a `search` request returns a single ``page'' of results, the `scroll`
 API can be used to retrieve large numbers of results (or even all results)
 from a single search request, in much the same way as you would use a cursor
@@ -340,85 +481,3 @@ For append only time-based indices, the `timestamp` field can be used safely.

 NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
 You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
-
-[discrete]
-[[search-after]]
-=== Search after
-
-Pagination of results can be done by using the `from` and `size` but the cost becomes prohibitive when the deep pagination is reached.
-The `index.max_result_window` which defaults to 10,000 is a safeguard, search requests take heap memory and time proportional to `from + size`.
-The <<scroll-search-results,scroll>> API is recommended for efficient deep scrolling but scroll contexts are costly and it is not
-recommended to use it for real time user requests.
-The `search_after` parameter circumvents this problem by providing a live cursor.
-The idea is to use the results from the previous page to help the retrieval of the next page.
-
-Suppose that the query to retrieve the first page looks like this:
-
-[source,console]
--------------------------------------------------
-GET my-index-000001/_search
-{
-  "size": 10,
-  "query": {
-    "match" : {
-      "message" : "foo"
-    }
-  },
-  "sort": [
-    {"@timestamp": "asc"},
-    {"tie_breaker_id": "asc"}      <1>
-  ]
-}
--------------------------------------------------
-// TEST[setup:my_index]
-// TEST[s/"tie_breaker_id": "asc"/"tie_breaker_id": {"unmapped_type": "keyword"}/]
-
-<1> A copy of the `_id` field with `doc_values` enabled
-
-[IMPORTANT]
-A field with one unique value per document should be used as the tiebreaker
-of the sort specification. Otherwise the sort order for documents that have
-the same sort values would be undefined and could lead to missing or duplicate
-results. The <<mapping-id-field,`_id` field>> has a unique value per document
-but it is not recommended to use it as a tiebreaker directly.
-Beware that `search_after` looks for the first document which fully or partially
-matches tiebreaker's provided value. Therefore if a document has a tiebreaker value of
-`"654323"` and you `search_after` for `"654"` it would still match that document
-and return results found after it.
-<<doc-values,doc value>> are disabled on this field so sorting on it requires
-to load a lot of data in memory. Instead it is advised to duplicate (client side
- or with a <<ingest-processors,set ingest processor>>) the content
-of the <<mapping-id-field,`_id` field>> in another field that has
-<<doc-values,doc value>> enabled and to use this new field as the tiebreaker
-for the sort.
-
-The result from the above request includes an array of `sort values` for each document.
-These `sort values` can be used in conjunction with the `search_after` parameter to start returning results "after" any
-document in the result list.
-For instance we can use the `sort values` of the last document and pass it to `search_after` to retrieve the next page of results:
-
-[source,console]
--------------------------------------------------
-GET my-index-000001/_search
-{
-  "size": 10,
-  "query": {
-    "match" : {
-      "message" : "foo"
-    }
-  },
-  "search_after": [1463538857, "654323"],
-  "sort": [
-    {"@timestamp": "asc"},
-    {"tie_breaker_id": "asc"}
-  ]
-}
--------------------------------------------------
-// TEST[setup:my_index]
-// TEST[s/"tie_breaker_id": "asc"/"tie_breaker_id": {"unmapped_type": "keyword"}/]
-
-NOTE: The parameter `from` must be set to 0 (or -1) when `search_after` is used.
-
-`search_after` is not a solution to jump freely to a random page but rather to scroll many queries in parallel.
-It is very similar to the `scroll` API but unlike it, the `search_after` parameter is stateless, it is always resolved against the latest
- version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.
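The `search_after` paging procedure added to `paginate-search-results.asciidoc` above can be sketched as a loop. This is a minimal illustration and not part of the commit. It assumes a hypothetical `send` transport and a hypothetical helper name `fetch_all`, and it stops when a page comes back empty, which is one possible stopping rule rather than something the documentation prescribes.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical transport: (method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def fetch_all(
    send: Send,
    pit_id: str,
    query: Dict[str, Any],
    sort: List[Dict[str, Any]],
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Tuple[List[Dict[str, Any]], str]:
    """Page through every hit with search_after; return (hits, latest PIT ID)."""
    all_hits: List[Dict[str, Any]] = []
    search_after: Optional[List[Any]] = None
    while True:
        # query and sort stay identical on every request.
        body: Dict[str, Any] = {
            "size": page_size,
            "query": query,
            "sort": sort,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = send("GET", "/_search", body)
        # Use the most recent PIT ID for the next request.
        pit_id = response.get("pit_id", pit_id)
        hits = response["hits"]["hits"]
        if not hits:
            return all_hits, pit_id
        all_hits.extend(hits)
        # The last hit's sort values mark where the next page starts.
        search_after = hits[-1]["sort"]
```

The function returns the latest PIT ID alongside the hits so the caller can pass it to `DELETE /_pit` when finished. For very large result sets, processing each page as it arrives would avoid holding every hit in memory.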
--- a/docs/reference/search/search.asciidoc
+++ b/docs/reference/search/search.asciidoc
@@ -89,21 +89,9 @@ computation as part of a hit. Defaults to `false`.

 include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=from]
 +
--
-By default, you cannot page through more than 10,000 documents using the `from`
-and `size` parameters. This limit is set using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
-
-Deep paging or requesting many results at once can result in slow searches.
-Results are sorted before being returned. Because search requests usually span
-multiple shards, each shard must generate its own sorted results. These separate
-results must then be combined and sorted to ensure that the overall order is
-correct.
-
-As an alternative to deep paging, we recommend using
-<<scroll-search-results,scroll>> or the
+By default, you cannot page through more than 10,000 hits using the `from` and
+`size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.
--

 `ignore_throttled`::
 (Optional, boolean) If `true`, concrete, expanded or aliased indices will be
@@ -229,25 +217,10 @@ last modification of each hit. See <<optimistic-concurrency-control>>.
 `size`::
 (Optional, integer) Defines the number of hits to return. Defaults to `10`.
 +
--
-By default, you cannot page through more than 10,000 documents using the `from`
-and `size` parameters. This limit is set using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
-
-Deep paging or requesting many results at once can result in slow searches.
-Results are sorted before being returned. Because search requests usually span
-multiple shards, each shard must generate its own sorted results. These separate
-results must then be combined and sorted to ensure that the overall order is
-correct.
-
-As an alternative to deep paging, we recommend using
-<<scroll-search-results,scroll>> or the
+By default, you cannot page through more than 10,000 hits using the `from` and
+`size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.

-If the <<search-api-scroll-query-param,`scroll` parameter>> is specified, this
-value cannot be `0`.
--
-
 `sort`::
 (Optional, string) A comma-separated list of <field>:<direction> pairs.

@@ -366,21 +339,9 @@ computation as part of a hit. Defaults to `false`.

 include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=from]
 +
--
-By default, you cannot page through more than 10,000 documents using the `from`
-and `size` parameters. This limit is set using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
-
-Deep paging or requesting many results at once can result in slow searches.
-Results are sorted before being returned. Because search requests usually span
-multiple shards, each shard must generate its own sorted results. These separate
-results must then be combined and sorted to ensure that the overall order is
-correct.
-
-As an alternative to deep paging, we recommend using
-<<scroll-search-results,scroll>> or the
+By default, you cannot page through more than 10,000 hits using the `from` and
+`size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.
--

 `indices_boost`::
 (Optional, array of objects)
@@ -419,25 +380,10 @@ last modification of each hit. See <<optimistic-concurrency-control>>.
 `size`:: 
 (Optional, integer) The number of hits to return. Defaults to `10`.
 +
--
-By default, you cannot page through more than 10,000 documents using the `from`
-and `size` parameters. This limit is set using the
-<<index-max-result-window,`index.max_result_window`>> index setting.
-
-Deep paging or requesting many results at once can result in slow searches.
-Results are sorted before being returned. Because search requests usually span
-multiple shards, each shard must generate its own sorted results. These separate
-results must then be combined and sorted to ensure that the overall order is
-correct.
-
-As an alternative to deep paging, we recommend using
-<<scroll-search-results,scroll>> or the
+By default, you cannot page through more than 10,000 hits using the `from` and
+`size` parameters. To page through more hits, use the
 <<search-after,`search_after`>> parameter.

-If the <<search-api-scroll-query-param,`scroll` parameter>> is specified, this
-value cannot be `0`.
--
-
 `_source`::
 (Optional)
 Indicates which <<mapping-source-field,source fields>> are returned for matching
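The 10,000-hit limit mentioned in the `from` and `size` notes above applies to the sum `from + size`, which is also what drives per-shard cost. A small worked sketch, not part of the commit: the function names are hypothetical, the default of 10,000 is the `index.max_result_window` default cited in the docs, and the per-shard figure is only a rough upper bound for illustration.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window


def within_result_window(
    from_: int, size: int, max_result_window: int = DEFAULT_MAX_RESULT_WINDOW
) -> bool:
    """Return True if a from/size page fits inside the result window."""
    return from_ + size <= max_result_window


def max_hits_loaded(from_: int, size: int, shards: int) -> int:
    """Rough upper bound on hits held in memory across all shards.

    Each shard loads the requested page plus every hit on the pages
    before it, so the bound is shards * (from + size).
    """
    return shards * (from_ + size)


# The last page allowed by default: hits 9,990 through 9,999.
print(within_result_window(9990, 10))   # True
# One hit further exceeds the window.
print(within_result_window(9991, 10))   # False
# On 5 shards, that last page can touch up to 50,000 hits.
print(max_hits_loaded(9990, 10, 5))     # 50000
```

This is why the docs point to `search_after` for deeper paging: its cost per page does not grow with how far into the result set the page lies.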