
[[index-modules-index-sorting]]
== Index Sorting

beta[]

When creating a new index in Elasticsearch, it is possible to configure how the segments
inside each shard will be sorted. By default, Lucene does not apply any sort.
The `index.sort.*` settings define which fields should be used to sort the documents inside each segment.

[WARNING]
Nested fields are not compatible with index sorting because they rely on the assumption
that nested documents are stored in contiguous doc ids, which can be broken by index sorting.
An error will be thrown if index sorting is activated on an index that contains nested fields.

For instance, the following example shows how to define a sort on a single field:

[source,js]
--------------------------------------------------
PUT twitter
{
    "settings" : {
        "index" : {
            "sort.field" : "date", <1>
            "sort.order" : "desc" <2>
        }
    },
    "mappings": {
        "tweet": {
            "properties": {
                "date": {
                    "type": "date"
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

<1> This index is sorted by the `date` field
<2> ... in descending order.

It is also possible to sort the index by more than one field:

[source,js]
--------------------------------------------------
PUT twitter
{
    "settings" : {
        "index" : {
            "sort.field" : ["username", "date"], <1>
            "sort.order" : ["asc", "desc"] <2>
        }
    },
    "mappings": {
        "tweet": {
            "properties": {
                "username": {
                    "type": "keyword",
                    "doc_values": true
                },
                "date": {
                    "type": "date"
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

<1> This index is sorted by `username` first then by `date`
<2> ... in ascending order for the `username` field and in descending order for the `date` field.
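
As an illustration only, the ordering that such a multi-field sort defines can be modeled in a few lines of Python.
This is a conceptual sketch, not how Lucene sorts segments: the `index_sort` helper and the sample tweets are hypothetical.

[source,python]
--------------------------------------------------
from operator import itemgetter


def index_sort(docs, fields, orders):
    """Order docs by several fields, each with its own asc/desc order."""
    result = list(docs)
    # Python's sort is stable, so sorting by the last field first and by the
    # first field last yields the combined multi-field ordering.
    for field, order in reversed(list(zip(fields, orders))):
        result.sort(key=itemgetter(field), reverse=(order == "desc"))
    return result


# Hypothetical documents; ISO dates compare correctly as plain strings.
tweets = [
    {"username": "bob", "date": "2017-04-01"},
    {"username": "alice", "date": "2017-04-02"},
    {"username": "alice", "date": "2017-04-19"},
]
ordered = index_sort(tweets, ["username", "date"], ["asc", "desc"])
print([(t["username"], t["date"]) for t in ordered])
--------------------------------------------------

Documents are grouped by `username` in ascending order, and within one `username` the most recent `date` comes first.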


Index sorting supports the following settings:

`index.sort.field`::

    The list of fields used to sort the index.
    Only `boolean`, `numeric`, `date` and `keyword` fields with `doc_values` are allowed here.

`index.sort.order`::

    The sort order to use for each field.
    The order option can have the following values:
        * `asc`: For ascending order.
        * `desc`: For descending order.

`index.sort.mode`::

    Elasticsearch supports sorting by multi-valued fields.
    The mode option controls what value is picked to sort the document.
    The mode option can have the following values:
        * `min`: Pick the lowest value.
        * `max`: Pick the highest value.

`index.sort.missing`::

    The missing parameter specifies how documents that are missing the field should be treated.
    The missing option can have the following values:
        * `_last`: Documents without a value for the field are sorted last.
        * `_first`: Documents without a value for the field are sorted first.
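
The effect of the `mode` and `missing` options can be sketched with a small Python model.
It is a simplified illustration under stated assumptions, not Elasticsearch code: the `sort_with_mode_and_missing` helper and the `price` field are hypothetical, and every option is passed explicitly rather than relying on any default.

[source,python]
--------------------------------------------------
def sort_with_mode_and_missing(docs, field, order, mode, missing):
    """Model how `order`, `mode` and `missing` decide the position of each doc."""
    # `mode` picks the single value that represents a multi-valued field.
    pick = min if mode == "min" else max
    with_value = [d for d in docs if d.get(field)]
    without_value = [d for d in docs if not d.get(field)]
    with_value.sort(key=lambda d: pick(d[field]), reverse=(order == "desc"))
    # `missing` places docs without a value first or last, whatever the order.
    if missing == "_first":
        return without_value + with_value
    return with_value + without_value


# Hypothetical documents: doc 1 is multi-valued, doc 3 has no `price` at all.
docs = [
    {"id": 1, "price": [5, 20]},
    {"id": 2, "price": [10]},
    {"id": 3},
]

by_min = sort_with_mode_and_missing(docs, "price", "asc", "min", "_last")
by_max = sort_with_mode_and_missing(docs, "price", "asc", "max", "_last")
missing_first = sort_with_mode_and_missing(docs, "price", "desc", "max", "_first")
print([d["id"] for d in by_min], [d["id"] for d in by_max], [d["id"] for d in missing_first])
--------------------------------------------------

With `min`, document 1 is represented by its lowest value (5) and sorts before document 2. With `max`, it is represented by 20 and sorts after it.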

[WARNING]
Index sorting can be defined only once, at index creation. A sort cannot be added to or updated on
an existing index. Index sorting also has a cost in terms of indexing throughput since
documents must be sorted at flush and merge time. You should test the impact on your application
before activating this feature.

[float]
[[early-terminate]]
=== Early termination of search request

By default in Elasticsearch, a search request must visit every document that matches a query to
retrieve the top documents sorted by a specified sort.
However, when the index sort and the search sort are the same, it is possible to limit
the number of documents that should be visited per segment to retrieve the N top-ranked documents globally.
For example, let's say we have an index that contains events sorted by a timestamp field:

[source,js]
--------------------------------------------------
PUT events
{
    "settings" : {
        "index" : {
            "sort.field" : "timestamp",
            "sort.order" : "desc" <1>
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "timestamp": {
                    "type": "date"
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

<1> This index is sorted by timestamp in descending order (most recent first)

You can search for the last 10 events with:

[source,js]
--------------------------------------------------
GET /events/_search
{
    "size": 10,
    "sort": [
        { "timestamp": "desc" }
    ]
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

Elasticsearch will detect that the top docs of each segment are already sorted in the index
and will only compare the first N documents per segment.
The rest of the documents matching the query are collected to count the total number of results
and to build aggregations.

If you're only looking for the last 10 events and have no interest in
the total number of documents that match the query, you can set `track_total_hits`
to `false`:

[source,js]
--------------------------------------------------
GET /events/_search
{
    "size": 10,
    "sort": [ <1>
        { "timestamp": "desc" }
    ],
    "track_total_hits": false
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

<1> The index sort will be used to rank the top documents, and collection will terminate early on each segment after the first 10 matches.

This time, Elasticsearch will not try to count the number of documents and will be able to terminate the query
as soon as N documents have been collected per segment.

[source,js]
--------------------------------------------------
{
  "_shards": ...
  "hits" : {
    "total" : -1,     <1>
    "max_score" : null,
    "hits" : []
  },
  "took": 20,
  "timed_out": false
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"took": 20,/"took": "$body.took",/]

<1> The total number of hits matching the query is unknown because of early termination.

NOTE: Aggregations will collect all documents that match the query regardless of the value of `track_total_hits`.
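
The per-segment behavior described in this section can be modeled with a short Python sketch.
It is a conceptual illustration, not the actual query phase: the `search_top_n` helper is hypothetical, each segment is assumed to be a list of timestamps already stored in the index sort order (descending), and aggregations are left out.

[source,python]
--------------------------------------------------
import heapq


def search_top_n(segments, matches, n, track_total_hits=True):
    """Return (top n docs, total hits, docs visited) over pre-sorted segments."""
    candidates = []
    visited = 0
    total = 0 if track_total_hits else -1  # -1 stands for "unknown"
    for segment in segments:
        collected = 0
        for doc in segment:
            # Without hit counting, stop as soon as this segment gave n matches.
            if collected == n and not track_total_hits:
                break
            visited += 1
            if not matches(doc):
                continue
            # Only the first n matches of a segment can reach the global top n.
            if collected < n:
                candidates.append(doc)
                collected += 1
            if track_total_hits:
                total += 1
    # Merge the per-segment candidates into the global top n.
    return heapq.nlargest(n, candidates), total, visited


# Two hypothetical segments, each sorted by timestamp in descending order.
segments = [[90, 70, 50, 30, 10], [80, 60, 40, 20]]

counted = search_top_n(segments, lambda doc: True, 2, track_total_hits=True)
early = search_top_n(segments, lambda doc: True, 2, track_total_hits=False)
print(counted)
print(early)
--------------------------------------------------

Both calls return the same top documents. Only the second one stops visiting a segment once it has collected its first two matches, at the price of an unknown total.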

[[index-modules-index-sorting-conjunctions]]
=== Use index sorting to speed up conjunctions

Index sorting can be useful to organize Lucene doc ids (not to be
confused with `_id`) in a way that makes conjunctions (a AND b AND ...) more
efficient. To be efficient, conjunctions rely on the fact that if any
clause does not match, then the entire conjunction does not match. With
index sorting, documents that do not match a given clause end up next to
each other, which helps skip efficiently over large ranges of doc ids that
do not match the conjunction.
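
The following Python sketch gives a rough feel for why clustering helps.
It is a simplified model, not Lucene's implementation: the `conjunction` and `postings` helpers, the `fuel` and `body` fields and the 12 sample documents are all hypothetical, and a binary search stands in for Lucene's skipping over postings.

[source,python]
--------------------------------------------------
from bisect import bisect_left


def postings(docs, field, value):
    """Sorted doc ids (positions in the index) of the docs matching one clause."""
    return [doc_id for doc_id, doc in enumerate(docs) if doc[field] == value]


def conjunction(a, b):
    """Intersect two sorted doc id lists; return (matches, number of skips)."""
    matches, skips = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Clause `a` is behind: jump to its first doc id >= b[j].
            i = bisect_left(a, b[j], i + 1)
            skips += 1
        else:
            # Clause `b` is behind: jump to its first doc id >= a[i].
            j = bisect_left(b, a[i], j + 1)
            skips += 1
    return matches, skips


# Unsorted index: values of both fields are interleaved across doc ids.
unsorted_docs = [
    {"fuel": "diesel" if i % 2 == 0 else "petrol",
     "body": "suv" if i in (1, 3, 5, 7, 9, 10) else "sedan"}
    for i in range(12)
]
# Sorted index: the same documents, ordered by fuel and then by body.
sorted_docs = sorted(unsorted_docs, key=lambda d: (d["fuel"], d["body"]))

for docs in (unsorted_docs, sorted_docs):
    print(conjunction(postings(docs, "fuel", "diesel"), postings(docs, "body", "suv")))
--------------------------------------------------

In this toy data set the conjunction `fuel:diesel AND body:suv` matches a single document in both layouts, but the sorted layout reaches it after one skip instead of ten.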

This trick only works with low-cardinality fields. A rule of thumb is that
you should sort first on fields that both have a low cardinality and are
frequently used for filtering. The sort order (`asc` or `desc`) does not
matter as we only care about putting values that would match the same clauses
close to each other.

For instance, if you were indexing cars for sale, it might be interesting to
sort by fuel type, body type, make, year of registration and finally mileage.