[[paginate-search-results]]
== Paginate search results

By default, searches return the top 10 matching hits. To page through a larger
set of results, you can use the <<search-search,search API>>'s `from` and `size`
parameters. The `from` parameter defines the number of hits to skip, defaulting
to `0`. The `size` parameter is the maximum number of hits to return. Together,
these two parameters define a page of results.

[source,console]
----
GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
----

Avoid using `from` and `size` to page too deeply or request too many results at
once. Search requests usually span multiple shards. Each shard must load its
requested hits and the hits for any previous pages into memory. For deep pages
or large sets of results, these operations can significantly increase memory and
CPU usage, resulting in degraded performance or node failures.

By default, you cannot use `from` and `size` to page through more than 10,000
hits. This limit is a safeguard set by the
<<index-max-result-window,`index.max_result_window`>> index setting. If you need
to page through more than 10,000 hits, use the <<search-after,`search_after`>>
parameter instead.
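
If you want to confirm which limit applies to a particular index, one way is to
retrieve the index's settings with defaults included. The following is a minimal
sketch that assumes the `my-index-000001` example index used elsewhere on this
page; the `filter_path` parameter only trims the response down to the setting of
interest.

[source,console]
----
GET /my-index-000001/_settings?include_defaults=true&filter_path=**.max_result_window
----
// TEST[setup:my_index]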

WARNING: {es} uses Lucene's internal doc IDs as tie-breakers. These internal doc
IDs can be completely different across replicas of the same data. When paging
search hits, you might occasionally see that documents with the same sort values
are not ordered consistently.

[discrete]
[[search-after]]
=== Search after

You can use the `search_after` parameter to retrieve the next page of hits
using a set of <<sort-search-results,sort values>> from the previous page.

Using `search_after` requires multiple search requests with the same `query` and
`sort` values. If a <<near-real-time,refresh>> occurs between these requests,
the order of your results may change, causing inconsistent results across pages. To
prevent this, you can create a <<point-in-time-api,point in time (PIT)>> to
preserve the current index state over your searches.

[source,console]
----
POST /my-index-000001/_pit?keep_alive=1m
----
// TEST[setup:my_index]

The API returns a PIT ID.

[source,console-result]
----
{
  "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
----
// TESTRESPONSE[s/"id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="/"id": $body.id/]

To get the first page of results, submit a search request with a `sort`
argument. If using a PIT, specify the PIT ID in the `pit.id` parameter and omit
the target data stream or index from the request path.

IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
tiebreaker field should contain a unique value for each document. If you don't
include a tiebreaker field, your paged results could miss or duplicate hits.

[source,console]
----
GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", <1>
    "keep_alive": "1m"
  },
  "sort": [ <2>
    {"@timestamp": "asc"},
    {"tie_breaker_id": "asc"}
  ]
}
----
// TEST[catch:missing]

<1> PIT ID for the search.
<2> Sorts hits for the search.

The search response includes an array of `sort` values for each hit. If you used
a PIT, the response's `pit_id` parameter contains an updated PIT ID.

[source,console-result]
----
{
  "pit_id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
  "took" : 17,
  "timed_out" : false,
  "_shards" : ...,
  "hits" : {
    "total" : ...,
    "max_score" : null,
    "hits" : [
      ...
      {
        "_index" : "my-index-000001",
        "_id" : "FaslK3QBySSL_rrj9zM5",
        "_score" : null,
        "_source" : ...,
        "sort" : [ <2>
          4098435132000,
          "FaslK3QBySSL_rrj9zM5"
        ]
      }
    ]
  }
}
----
// TESTRESPONSE[skip: unable to access PIT ID]

<1> Updated `id` for the point in time.
<2> Sort values for the last returned hit.

To get the next page of results, rerun the previous search using the last hit's
sort values as the `search_after` argument. If using a PIT, use the latest PIT
ID in the `pit.id` parameter. The search's `query` and `sort` arguments must
remain unchanged. If provided, the `from` argument must be `0` (default) or `-1`.

[source,console]
----
GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id": "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA==", <1>
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": "asc"},
    {"tie_breaker_id": "asc"}
  ],
  "search_after": [ <2>
    4098435132000,
    "FaslK3QBySSL_rrj9zM5"
  ]
}
----
// TEST[catch:missing]

<1> PIT ID returned by the previous search.
<2> Sort values from the previous search's last hit.

You can repeat this process to get additional pages of results. If using a PIT,
you can extend the PIT's retention period using the
`keep_alive` parameter of each search request.

When you're finished, you should delete your PIT.

[source,console]
----
DELETE /_pit
{
  "id" : "46ToAwEPbXktaW5kZXgtMDAwMDAxFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAFldicVdzOFFtVHZTZDFoWWowTGkwS0EAAAAAAAAAAAQURzZzcUszUUJ5U1NMX3Jyak5ET0wBFnVzaTVuenpUVGQ2TFNheUxVUG5LVVEAAA=="
}
----
// TEST[catch:missing]

[discrete]
[[scroll-search-results]]
=== Scroll search results

IMPORTANT: We no longer recommend using the scroll API for deep pagination. If
you need to preserve the index state while paging through more than 10,000 hits,
use the <<search-after,`search_after`>> parameter with a point in time (PIT).

While a `search` request returns a single ``page'' of results, the `scroll`
API can be used to retrieve large numbers of results (or even all results)
from a single search request, in much the same way as you would use a cursor
on a traditional database.

Scrolling is not intended for real-time user requests, but rather for
processing large amounts of data, e.g. in order to reindex the contents of one
data stream or index into a new data stream or index with a different
configuration.

.Client support for scrolling and reindexing
*********************************************

Some of the officially supported clients provide helpers to assist with
scrolled searches and reindexing:

Perl::

See https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Bulk[Search::Elasticsearch::Client::5_0::Bulk]
and https://metacpan.org/pod/Search::Elasticsearch::Client::5_0::Scroll[Search::Elasticsearch::Client::5_0::Scroll]

Python::

See https://elasticsearch-py.readthedocs.org/en/master/helpers.html[elasticsearch.helpers.*]

JavaScript::

See {jsclient-current}/client-helpers.html[client.helpers.*]

*********************************************

NOTE: The results that are returned from a scroll request reflect the state of
the data stream or index at the time that the initial `search` request was made, like a
snapshot in time. Subsequent changes to documents (index, update or delete)
will only affect later search requests.

In order to use scrolling, the initial search request should specify the
`scroll` parameter in the query string, which tells Elasticsearch how long it
should keep the ``search context'' alive (see <<scroll-search-context>>), e.g. `?scroll=1m`.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index]

The result from the above request includes a `_scroll_id`, which should
be passed to the `scroll` API in order to retrieve the next batch of
results.

[source,console]
--------------------------------------------------
POST /_search/scroll <1>
{
  "scroll" : "1m", <2>
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" <3>
}
--------------------------------------------------
// TEST[continued s/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==/$body._scroll_id/]

<1> `GET` or `POST` can be used and the URL should not include the `index`
name -- this is specified in the original `search` request instead.
<2> The `scroll` parameter tells Elasticsearch to keep the search context open
for another `1m`.
<3> The `scroll_id` parameter specifies the `_scroll_id` returned by the
previous request.

The `size` parameter allows you to configure the maximum number of hits to be
returned with each batch of results. Each call to the `scroll` API returns the
next batch of results until there are no more results left to return, i.e. the
`hits` array is empty.

IMPORTANT: The initial search request and each subsequent scroll request
return a `_scroll_id`. While the `_scroll_id` may change between requests, it doesn't
always change -- in any case, only the most recently received `_scroll_id` should be used.

NOTE: If the request specifies aggregations, only the initial search response
will contain the aggregations results.

NOTE: Scroll requests have optimizations that make them faster when the sort
order is `_doc`. If you want to iterate over all documents regardless of the
order, this is the most efficient option:

[source,console]
--------------------------------------------------
GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
--------------------------------------------------
// TEST[setup:my_index]

[discrete]
[[scroll-search-context]]
==== Keeping the search context alive

A scroll returns all the documents which matched the search at the time of the
initial search request. It ignores any subsequent changes to these documents.
The `scroll_id` identifies a _search context_ which keeps track of everything
that {es} needs to return the correct documents. The search context is created
by the initial request and kept alive by subsequent requests.

The `scroll` parameter (passed to the `search` request and to every `scroll`
request) tells Elasticsearch how long it should keep the search context alive.
Its value (e.g. `1m`, see <<time-units>>) does not need to be long enough to
process all data -- it just needs to be long enough to process the previous
batch of results. Each `scroll` request (with the `scroll` parameter) sets a
new expiry time. If a `scroll` request doesn't pass in the `scroll`
parameter, then the search context will be freed as part of _that_ `scroll`
request.
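
For example, a final `scroll` request that omits the `scroll` parameter, as
sketched below with the placeholder scroll ID from the earlier example, fetches
one more batch of hits and lets the search context be freed as part of that
request instead of extending its expiry time.

[source,console]
--------------------------------------------------
POST /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
--------------------------------------------------
// TEST[catch:missing]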

Normally, the background merge process optimizes the index by merging together
smaller segments to create new, bigger segments. Once the smaller segments are
no longer needed they are deleted. This process continues during scrolling, but
an open search context prevents the old segments from being deleted since they
are still in use.

TIP: Keeping older segments alive means that more disk space and file handles
are needed. Ensure that you have configured your nodes to have ample free file
handles. See <<file-descriptors>>.

Additionally, if a segment contains deleted or updated documents then the
search context must keep track of whether each document in the segment was live
at the time of the initial search request. Ensure that your nodes have
sufficient heap space if you have many open scrolls on an index that is subject
to ongoing deletes or updates.

NOTE: To protect against issues caused by having too many scrolls open, the
user is not allowed to open scrolls past a certain limit. By default, the
maximum number of open scrolls is 500. This limit can be updated with the
`search.max_open_scroll_context` cluster setting.
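
If you need to change this limit, a minimal sketch of updating the cluster
setting looks like the following; the value `1024` is purely illustrative, not
a recommendation.

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 1024
  }
}
----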

You can check how many search contexts are open with the
<<cluster-nodes-stats,nodes stats API>>:

[source,console]
---------------------------------------
GET /_nodes/stats/indices/search
---------------------------------------

[discrete]
[[clear-scroll]]
==== Clear scroll

Search contexts are automatically removed when the `scroll` timeout has been
exceeded. However, keeping scrolls open has a cost, as discussed in the
<<scroll-search-context,previous section>>, so scrolls should be explicitly
cleared as soon as they are no longer needed, using the
`clear-scroll` API:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
---------------------------------------
// TEST[catch:missing]

Multiple scroll IDs can be passed as an array:

[source,console]
---------------------------------------
DELETE /_search/scroll
{
  "scroll_id" : [
    "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
    "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
  ]
}
---------------------------------------
// TEST[catch:missing]

All search contexts can be cleared with the `_all` parameter:

[source,console]
---------------------------------------
DELETE /_search/scroll/_all
---------------------------------------

The `scroll_id` can also be passed as a query string parameter or in the request body.
Multiple scroll IDs can be passed as comma-separated values:

[source,console]
---------------------------------------
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB
---------------------------------------
// TEST[catch:missing]

[discrete]
[[slice-scroll]]
==== Sliced scroll

For scroll queries that return a lot of documents it is possible to split the scroll into multiple slices which
can be consumed independently:

[source,console]
--------------------------------------------------
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 0, <1>
    "max": 2 <2>
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index_big]

<1> The id of the slice
<2> The maximum number of slices

The result from the first request returned documents that belong to the first slice (id: 0) and the result from the
second request returned documents that belong to the second slice. Since the maximum number of slices is set to 2,
the union of the results of the two requests is equivalent to the results of a scroll query without slicing.
By default the splitting is done on the shards first and then locally on each shard using the `_id` field
with the following formula:
`slice(doc) = floorMod(hashCode(doc._id), max)`
For instance, if the number of shards is equal to 2 and the user requested 4 slices, then slices 0 and 2 are assigned
to the first shard and slices 1 and 3 are assigned to the second shard.

Each scroll is independent and can be processed in parallel like any scroll request.

NOTE: If the number of slices is bigger than the number of shards, the slice filter is very slow on the first calls. It has a complexity of O(N) and a memory cost equal
to N bits per slice, where N is the total number of documents in the shard.
After a few calls the filter should be cached and subsequent calls should be faster, but you should limit the number of
sliced queries you perform in parallel to avoid a memory explosion.

To avoid this cost entirely it is possible to use the `doc_values` of another field to do the slicing,
but the user must ensure that the field has the following properties:

* The field is numeric.

* `doc_values` are enabled on that field.

* Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.

* The value for each document should be set once when the document is created and never updated. This ensures that each
slice gets deterministic results.

* The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents.

[source,console]
--------------------------------------------------
GET /my-index-000001/_search?scroll=1m
{
  "slice": {
    "field": "@timestamp",
    "id": 0,
    "max": 10
  },
  "query": {
    "match": {
      "message": "foo"
    }
  }
}
--------------------------------------------------
// TEST[setup:my_index_big]

For append-only time-based indices, the `@timestamp` field can be used safely.

NOTE: By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the `index.max_slices_per_scroll` index setting to bypass this limit.
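
A minimal sketch of raising that limit for the `my-index-000001` example index
follows; the value `2048` is purely illustrative.

[source,console]
----
PUT /my-index-000001/_settings
{
  "index.max_slices_per_scroll": 2048
}
----
// TEST[setup:my_index]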