lucene/solr/solr-ref-guide/src/pagination-of-results.adoc

270 lines
15 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

= Pagination of Results
:page-shortname: pagination-of-results
:page-permalink: pagination-of-results.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
In most search applications, the "top" matching results (sorted by score, or some other criteria) are displayed to some human user.
In many applications the UI for these sorted results are displayed to the user in "pages" containing a fixed number of matching results, and users don't typically look at results past the first few pages worth of results.
== Basic Pagination
In Solr, this basic paginated searching is supported using the `start` and `rows` parameters, and performance of this common behaviour can be tuned by utilizing the <<query-settings-in-solrconfig.adoc#QuerySettingsinSolrConfig-queryResultCache,`queryResultCache`>> and adjusting the <<query-settings-in-solrconfig.adoc#QuerySettingsinSolrConfig-queryResultWindowSize,`queryResultWindowSize`>> configuration options based on your expected page sizes.
=== Basic Pagination Examples
The easiest way to think about simple pagination, is to simply multiply the page number you want (treating the "first" page number as "0") by the number of rows per page; such as in the following pseudo-code:
[source,plain]
----
function fetch_solr_page($page_number, $rows_per_page) {
$start = $page_number * $rows_per_page
$params = [ q = $some_query, rows = $rows_per_page, start = $start ]
return fetch_solr($params)
}
----
=== How Basic Pagination is Affected by Index Updates
The `start` param specified in a request to Solr indicates an *absolute* "offset" in the complete sorted list of matches that the client wants Solr to use as the beginning of the current "page".
If an index modification (such as adding or removing documents) which affects the sequence of ordered documents matching a query occurs in between two requests from a client for subsequent pages of results, then it is possible that these modifications can result in the same document being returned on multiple pages, or documents being "skipped" as the result set shrinks or grows.
For example, consider an index containing 26 documents like so:
[options="header",%autowidth,width="30%"]
|===
|id |name
|1 |A
|2 |B
| ... |
|26 |Z
|===
Followed by the following requests & index modifications interleaved:
* A client requests `q=*:*&rows=5&start=0&sort=name asc`
** documents with the ids `1-5` will be returned to the client
* Document id `3` is deleted
* The client requests "page #2" using `q=*:*&rows=5&start=5&sort=name asc`
** Documents `7-11` will be returned
** Document `6` has been skipped, since it is now the 5th document in the sorted set of all matching results it would be returned on a new request for "page #1"
* 3 new documents are now added with the ids `90`, `91`, and `92`; All three documents have a name of `A`
* The client requests "page #3" using `q=*:*&rows=5&start=10&sort=name asc`
** Documents `9-13` will be returned
** Documents `9`, `10`, and `11` have now been returned on both page #2 and page #3 since they moved farther back in the list of sorted results
In typical situations these impacts from index changes on paginated searching don't significantly affect user experience -- either because they happen extremely infrequently in fairly static collections, or because the users recognize that the collection of data is constantly evolving and expect to see documents shift up and down in the result sets.
=== Performance Problems with "Deep Paging"
In some situations, the results of a Solr search are not destined for a simple paginated user interface.
When you wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values for the `start` or `rows` parameters can be very inefficient. Pagination using `start` and `rows` not only require Solr to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also all of the documents that would have appeared on previous pages.
While a request for `start=0&rows=1000000` may be obviously inefficient because it requires Solr to maintain & sort in memory a set of 1 million documents, likewise a request for `start=999000&rows=1000` is equally inefficient for the same reasons. Solr can't compute which matching document is the 999001st result in sorted order, without first determining what the first 999000 matching sorted results are.
If the index is distributed, which is common when running in SolrCloud mode, then 1 million documents are retrieved from *each shard*. For a ten shard index, ten million entries must be retrieved and sorted to figure out the 1000 documents that match those query parameters.
== Fetching A Large Number of Sorted Results: Cursors
As an alternative to increasing the "start" parameter to request subsequent pages of sorted results, Solr supports using a "Cursor" to scan through results.
Cursors in Solr are a logical concept that doesn't involve caching any state information on the server. Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.
=== Using Cursors
To use a cursor with Solr, specify a `cursorMark` parameter with the value of `\*`. You can think of this being analogous to `start=0` as a way to tell Solr "start at the beginning of my sorted results" except that it also informs Solr that you want to use a Cursor.
In addition to returning the top N sorted results (where you can control N using the `rows` parameter) the Solr response will also include an encoded String named `nextCursorMark`. You then take the `nextCursorMark` String value from the response, and pass it back to Solr as the `cursorMark` parameter for your next request. You can repeat this process until you've fetched as many docs as you want, or until the `nextCursorMark` returned matches the `cursorMark` you've already specified -- indicating that there are no more results.
=== Constraints when using Cursors
There are a few important constraints to be aware of when using `cursorMark` parameter in a Solr request
. `cursorMark` and `start` are mutually exclusive parameters.
* Your requests must either not include a `start` parameter, or it must be specified with a value of "```0```".
. `sort` clauses must include the uniqueKey field (either `asc` or `desc`).
* If `id` is your uniqueKey field, then sort params like `id asc` and `name asc, id desc` would both work fine, but `name asc` by itself would not
. Sorts including <<working-with-dates.adoc#working-with-dates,Date Math>> based functions that involve calculations relative to `NOW` will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over even if the documents are never updated.
+
In this situation, choose & re-use a fixed value for the <<working-with-dates.adoc#WorkingwithDates-NOW,`NOW` request param>> in all of your cursor requests.
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that `cursorMark` would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every `cursorMark` value will identify a unique point in the sequence of documents.
=== Cursor Examples
==== Fetch All Docs
The pseudo-code shown here shows the basic logic involved in fetching all documents matching a query using a cursor:
[source,plain]
----
// when fetching all docs, you might as well use a simple id sort
// unless you really need the docs to come back in a specific order
$params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
$done = false
while (not $done) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$done = true
}
$params[cursorMark] = $results[nextCursorMark]
}
----
Using SolrJ, this pseudo-code would be:
[source,java]
----
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
doCustomProcessingOfResults(rsp);
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
----
If you wanted to do this by hand using curl, the sequence of requests would look something like this:
[source,plain]
----
$ curl '...&rows=10&sort=id+asc&cursorMark=*'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 docs here ...
]},
"nextCursorMark":"AoEjR0JQ"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEpVkRCREIxQTE2"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEmbWF4dG9y"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 2 docs here because we've reached the end.
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
{
"response":{"numFound":32,"start":0,"docs":[
// no more docs here, and note that the nextCursorMark
// matches the cursorMark param we used
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
----
==== Fetch First _N_ docs, Based on Post Processing
Since the cursor is stateless from Solr's perspective, your client code can stop fetching additional results as soon as you have decided you have enough information:
[source,java]
----
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
boolean hadEnough = doCustomProcessingOfResults(rsp);
if (hadEnough || cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
----
=== How Cursors are Affected by Index Updates
Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the completed sorted list of matching documents. Instead, the `cursorMark` specified in a request encapsulates information about the *relative* position of the last document returned, based on the *absolute* sort values of that document. This means that the impact of index modifications is much smaller when using a cursor compared to basic pagination. Consider the same example index described when discussing basic pagination:
[options="header",%autowidth,width="30%"]
|===
|id |name
|1 |A
|2 |B
| ... |
|26 |Z
|===
* A client requests `q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*`
** Documents with the ids `1-5` will be returned to the client in order
* Document id `3` is deleted
* The client requests 5 more documents using the `nextCursorMark` from the previous response
** Documents `6-10` will be returned -- the deletion of a document that's already been returned doesn't affect the relative position of the cursor
* 3 new documents are now added with the ids `90`, `91`, and `92`; All three documents have a name of `A`
* The client requests 5 more documents using the `nextCursorMark` from the previous response
** Documents `11-15` will be returned -- the addition of new documents with sort values already past does not affect the relative position of the cursor
* Document id `1` is updated to change its 'name' to `Q`
* Document id 17 is updated to change its 'name' to `A`
* The client requests 5 more documents using the `nextCursorMark` from the previous response
** The resulting documents are `16,1,18,19,20` in that order
** Because the sort value of document `1` changed so that it is _after_ the cursor position, the document is returned to the client twice
** Because the sort value of document `17` changed so that it is _before_ the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress
In a nutshell: When fetching all results matching a query using `cursorMark`, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.
[TIP]
====
One way to ensure that a document will never be returned more then once, is to use the uniqueKey field as the primary (and therefore: only significant) sort criterion.
In this situation, you will be guaranteed that each document is only returned once, no matter how it may be be modified during the use of the cursor.
====
=== "Tailing" a Cursor
Because Cursor requests are stateless, and the cursorMark values encapsulate the absolute sort values of the last document returned from a search, it's possible to "continue" fetching additional results from a cursor that has already reached its end. If new documents are added (or existing documents are updated) to the end of the results.
You can think of this as similar to using something like "tail -f" in Unix. The most common examples of how this can be useful is when you have a "timestamp" field recording when a document has been added/updated in your index. Client applications can continuously poll a cursor using a `sort=timestamp asc, id asc` for documents matching a query, and always be notified when a document is added or updated matching the request criteria.
Another common example is when you have uniqueKey values that always increase as new documents are created, and you can continuously poll a cursor using `sort=id asc` to be notified about new documents.
The pseudo-code for tailing a cursor is only a slight modification from our early example for processing all docs matching a query:
[source,plain]
----
while (true) {
$doneForNow = false
while (not $doneForNow) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$doneForNow = true
}
$params[cursorMark] = $results[nextCursorMark]
}
sleep($some_configured_delay)
}
----
TIP: For certain specialized cases, the <<exporting-result-sets.adoc#exporting-result-sets,/export handler>> may be an option.