215 lines
7.0 KiB
Plaintext
215 lines
7.0 KiB
Plaintext
[[index-modules-index-sorting]]
|
|
== Index Sorting
|
|
|
|
When creating a new index in Elasticsearch it is possible to configure how the Segments
|
|
inside each Shard will be sorted. By default Lucene does not apply any sort.
|
|
The `index.sort.*` settings define which fields should be used to sort the documents inside each Segment.
|
|
|
|
[WARNING]
|
|
nested fields are not compatible with index sorting because they rely on the assumption
|
|
that nested documents are stored in contiguous doc ids, which can be broken by index sorting.
|
|
An error will be thrown if index sorting is activated on an index that contains nested fields.
|
|
|
|
For instance the following example shows how to define a sort on a single field:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT twitter
|
|
{
|
|
"settings" : {
|
|
"index" : {
|
|
"sort.field" : "date", <1>
|
|
"sort.order" : "desc" <2>
|
|
}
|
|
},
|
|
"mappings": {
|
|
"properties": {
|
|
"date": {
|
|
"type": "date"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> This index is sorted by the `date` field
|
|
<2> ... in descending order.
|
|
|
|
It is also possible to sort the index by more than one field:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT twitter
|
|
{
|
|
"settings" : {
|
|
"index" : {
|
|
"sort.field" : ["username", "date"], <1>
|
|
"sort.order" : ["asc", "desc"] <2>
|
|
}
|
|
},
|
|
"mappings": {
|
|
"properties": {
|
|
"username": {
|
|
"type": "keyword",
|
|
"doc_values": true
|
|
},
|
|
"date": {
|
|
"type": "date"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> This index is sorted by `username` first then by `date`
|
|
<2> ... in ascending order for the `username` field and in descending order for the `date` field.
|
|
|
|
|
|
Index sorting supports the following settings:
|
|
|
|
`index.sort.field`::
|
|
|
|
The list of fields used to sort the index.
|
|
Only `boolean`, `numeric`, `date` and `keyword` fields with `doc_values` are allowed here.
|
|
|
|
`index.sort.order`::
|
|
|
|
The sort order to use for each field.
|
|
The order option can have the following values:
|
|
* `asc`: For ascending order
|
|
* `desc`: For descending order.
|
|
|
|
`index.sort.mode`::
|
|
|
|
Elasticsearch supports sorting by multi-valued fields.
|
|
The mode option controls what value is picked to sort the document.
|
|
The mode option can have the following values:
|
|
* `min`: Pick the lowest value.
|
|
* `max`: Pick the highest value.
|
|
|
|
`index.sort.missing`::
|
|
|
|
The missing parameter specifies how docs which are missing the field should be treated.
|
|
The missing value can have the following values:
|
|
* `_last`: Documents without value for the field are sorted last.
|
|
* `_first`: Documents without value for the field are sorted first.
|
|
|
|
[WARNING]
|
|
Index sorting can be defined only once at index creation. It is not allowed to add or update
|
|
a sort on an existing index. Index sorting also has a cost in terms of indexing throughput since
|
|
documents must be sorted at flush and merge time. You should test the impact on your application
|
|
before activating this feature.
|
|
|
|
[float]
|
|
[[early-terminate]]
|
|
=== Early termination of search request
|
|
|
|
By default in Elasticsearch a search request must visit every document that match a query to
|
|
retrieve the top documents sorted by a specified sort.
|
|
Though when the index sort and the search sort are the same it is possible to limit
|
|
the number of documents that should be visited per segment to retrieve the N top ranked documents globally.
|
|
For example, let's say we have an index that contains events sorted by a timestamp field:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT events
|
|
{
|
|
"settings" : {
|
|
"index" : {
|
|
"sort.field" : "timestamp",
|
|
"sort.order" : "desc" <1>
|
|
}
|
|
},
|
|
"mappings": {
|
|
"properties": {
|
|
"timestamp": {
|
|
"type": "date"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> This index is sorted by timestamp in descending order (most recent first)
|
|
|
|
You can search for the last 10 events with:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /events/_search
|
|
{
|
|
"size": 10,
|
|
"sort": [
|
|
{ "timestamp": "desc" }
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
|
|
Elasticsearch will detect that the top docs of each segment are already sorted in the index
|
|
and will only compare the first N documents per segment.
|
|
The rest of the documents matching the query are collected to count the total number of results
|
|
and to build aggregations.
|
|
|
|
If you're only looking for the last 10 events and have no interest in
|
|
the total number of documents that match the query you can set `track_total_hits`
|
|
to false:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET /events/_search
|
|
{
|
|
"size": 10,
|
|
"sort": [ <1>
|
|
{ "timestamp": "desc" }
|
|
],
|
|
"track_total_hits": false
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
|
|
<1> The index sort will be used to rank the top documents and each segment will early terminate the collection after the first 10 matches.
|
|
|
|
This time, Elasticsearch will not try to count the number of documents and will be able to terminate the query
|
|
as soon as N documents have been collected per segment.
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"_shards": ...
|
|
"hits" : { <1>
|
|
"max_score" : null,
|
|
"hits" : []
|
|
},
|
|
"took": 20,
|
|
"timed_out": false
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
|
|
// TESTRESPONSE[s/"took": 20,/"took": "$body.took",/]
|
|
|
|
<1> The total number of hits matching the query is unknown because of early termination.
|
|
|
|
NOTE: Aggregations will collect all documents that match the query regardless
|
|
of the value of `track_total_hits`
|
|
|
|
[[index-modules-index-sorting-conjunctions]]
|
|
=== Use index sorting to speed up conjunctions
|
|
|
|
Index sorting can be useful in order to organize Lucene doc ids (not to be
|
|
conflated with `_id`) in a way that makes conjunctions (a AND b AND ...) more
|
|
efficient. In order to be efficient, conjunctions rely on the fact that if any
|
|
clause does not match, then the entire conjunction does not match. By using
|
|
index sorting, we can put documents that do not match together, which will
|
|
help skip efficiently over large ranges of doc IDs that do not match the
|
|
conjunction.
|
|
|
|
This trick only works with low-cardinality fields. A rule of thumb is that
|
|
you should sort first on fields that both have a low cardinality and are
|
|
frequently used for filtering. The sort order (`asc` or `desc`) does not
|
|
matter as we only care about putting values that would match the same clauses
|
|
close to each other.
|
|
|
|
For instance if you were indexing cars for sale, it might be interesting to
|
|
sort by fuel type, body type, make, year of registration and finally mileage.
|