[float]
[[breaking_70_search_changes]]
=== Search and Query DSL changes
[float]
==== Off-heap terms index
The terms dictionary is the part of the inverted index that records all terms
that occur within a segment in sorted order. In order to provide fast retrieval,
terms dictionaries come with a small terms index that allows for efficient
random access by term. Until now this terms index had always been loaded
on-heap.
As of 7.0, the terms index is loaded on-heap for fields that only have unique
values such as `_id` fields, and off-heap otherwise, which is likely the case
for most other fields. This is expected to reduce memory requirements but might
slow down search requests if both of the following conditions are met:
* The size of the data directory on each node is significantly larger than the
amount of memory that is available to the filesystem cache.
* The number of matches of the query is not several orders of magnitude greater
than the number of terms that the query tries to match, either explicitly via
`term` or `terms` queries, or implicitly via multi-term queries such as
`prefix`, `wildcard` or `fuzzy` queries.
This change affects both existing indices created with Elasticsearch 6.x and new
indices created with Elasticsearch 7.x.
[float]
==== Changes to queries
* The default value for the `transpositions` parameter of the `fuzzy` query
has been changed to `true`.
* The `query_string` options `use_dismax`, `split_on_whitespace`,
`all_fields`, `locale`, `auto_generate_phrase_query` and
`lowercase_expanded_terms` deprecated in 6.x have been removed.
* Purely negative queries (only MUST_NOT clauses) now return a score of `0`
rather than `1`.
* The boundary specified using geohashes in the `geo_bounding_box` query
now includes the entire geohash cell, instead of just the geohash center.
* Attempts to generate multi-term phrase queries against non-text fields
with a custom analyzer will now throw an exception.
* An `envelope` crossing the dateline in a `geo_shape` query is now processed
correctly when specified using the REST API instead of having its left and
right corners flipped.
* Attempts to set `boost` on inner span queries will now throw a parsing exception.
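
If an application depended on the previous `fuzzy` matching behavior, the
parameter can be set explicitly rather than relying on the default. The request
below is a minimal sketch; the index `my-index`, the field `title` and the
search value are hypothetical.

```js
# Hypothetical index and field names, for illustration only.
GET /my-index/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "serach",
        "transpositions": false
      }
    }
  }
}
```
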
[float]
==== Adaptive replica selection enabled by default
Adaptive replica selection has been enabled by default. If you wish to return to
the older round robin of search requests, you can use the
`cluster.routing.use_adaptive_replica_selection` setting:
[source,js]
--------------------------------------------------
PUT /_cluster/settings
{
    "transient": {
        "cluster.routing.use_adaptive_replica_selection": false
    }
}
--------------------------------------------------
// CONSOLE
[float]
==== Search API returns `400` for invalid requests
The Search API now returns `400 - Bad request` where it previously returned
`500 - Internal Server Error` for the following kinds of invalid request:
* the result window is too large
* sort is used in combination with rescore
* the rescore window is too large
* the number of slices is too large
* keep alive for scroll is too large
* number of filters in the adjacency matrix aggregation is too large
* script compilation errors
[float]
==== Scroll queries cannot use the `request_cache` anymore
Setting `request_cache:true` on a query that creates a scroll (`scroll=1m`)
was deprecated in 6.x and will now return a `400 - Bad request`.
Scroll queries are not meant to be cached.
[float]
==== Scroll queries cannot use `rescore` anymore
Including a rescore clause on a query that creates a scroll (`scroll=1m`) has
been deprecated in 6.5 and will now return a `400 - Bad request`. Allowing
rescore on scroll queries would break the scroll sort. In the 6.x line, the
rescore clause was silently ignored (for scroll queries), and it was allowed in
the 5.x line.
[float]
==== Term Suggesters supported distance algorithms
The following string distance algorithms were given additional names in 6.2 and
their existing names were deprecated. The deprecated names have now been
removed.
* `levenstein` - replaced by `levenshtein`
* `jarowinkler` - replaced by `jaro_winkler`
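
Requests that still use a removed name need to switch to its replacement. As a
sketch, a term suggester that selects the distance algorithm through
`string_distance` might look like this; the index, field, suggestion name and
input text are hypothetical.

```js
# Hypothetical index, field and suggestion names.
POST /my-index/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "serach",
      "term": {
        "field": "title",
        "string_distance": "levenshtein"
      }
    }
  }
}
```
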
[float]
==== `popular` mode for Suggesters
The `popular` mode for Suggesters (`term` and `phrase`) now uses the doc frequency
(instead of the sum of the doc frequency) of the input terms to compute the frequency
threshold for candidate suggestions.
[float]
==== Limiting the number of terms that can be used in a Terms Query request
Executing a Terms Query with a lot of terms may degrade the cluster performance,
as each additional term demands extra processing and memory.
To safeguard against this, the maximum number of terms that can be used in a
Terms Query request has been limited to 65536. This default maximum can be changed
for a particular index with the index setting `index.max_terms_count`.
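
For an index that genuinely needs larger Terms Queries, the limit could be
raised along these lines. This is a sketch only: `my-index` is a hypothetical
index name and `100000` is an arbitrary illustrative value, not a
recommendation.

```js
# Hypothetical index name; the value is illustrative only.
PUT /my-index/_settings
{
  "index.max_terms_count": 100000
}
```
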
[float]
==== Limiting the length of regex that can be used in a Regexp Query request
Executing a Regexp Query with a long regex string may degrade search performance.
To safeguard against this, the maximum length of regex that can be used in a
Regexp Query request has been limited to 1000. This default maximum can be changed
for a particular index with the index setting `index.max_regex_length`.
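
The setting can also be supplied when an index is created. In the sketch below
the index name `my-index` and the value `2000` are hypothetical and chosen only
for illustration.

```js
# Hypothetical index name; the value is illustrative only.
PUT /my-index
{
  "settings": {
    "index.max_regex_length": 2000
  }
}
```
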
[float]
==== Limiting the number of auto-expanded fields
Executing queries that use automatic expansion of fields (e.g. `query_string`, `simple_query_string`
or `multi_match`) can cause performance issues for indices with a large number of fields.
To safeguard against this, a hard limit of 1024 fields has been introduced for queries
using the "all fields" mode (`"default_field": "*"`) or other field name expansions (e.g. `"foo*"`).
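
One way to stay clear of the limit is to name the fields to search instead of
relying on expansion. A minimal sketch, assuming a hypothetical index
`my-index` with text fields `title` and `body`:

```js
# Hypothetical index and field names.
GET /my-index/_search
{
  "query": {
    "query_string": {
      "query": "quick fox",
      "fields": ["title", "body"]
    }
  }
}
```
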
[float]
==== Invalid `_search` request body
Search requests with extra content after the main object will no longer be accepted
by the `_search` endpoint. A parsing exception will be thrown instead.
[float]
==== Doc-value fields default format
The format of doc-value fields is changing to be the same as what could be
obtained in 6.x with the special `use_field_mapping` format. This is mostly a
change for date fields, which are now formatted based on the format that is
configured in the mappings by default. This behavior can be changed by
specifying a <<search-request-docvalue-fields,`format`>> within the doc-value
field.
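
As an illustration, a request that asks for a date field in a specific format
rather than the mapped one might look like the following. The index `my-index`
and the date field `timestamp` are hypothetical.

```js
# Hypothetical index and field names.
GET /my-index/_search
{
  "query": { "match_all": {} },
  "docvalue_fields": [
    {
      "field": "timestamp",
      "format": "epoch_millis"
    }
  ]
}
```
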
[float]
==== Context Completion Suggester
The ability to query and index context-enabled suggestions without context,
deprecated in 6.x, has been removed. Context-enabled suggestion queries
without contexts have to visit every suggestion, which degrades search performance
considerably.
For geo context the value of the `path` parameter is now validated against the mapping,
and the context is only accepted if `path` points to a field with `geo_point` type.
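
A mapping that satisfies this validation could be sketched as follows. The
index name `places`, the field names, the context name `near` and the
`precision` value are all hypothetical; the relevant part is that `path` refers
to a `geo_point` field.

```js
# Hypothetical index, field and context names.
PUT /places
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" },
      "suggest": {
        "type": "completion",
        "contexts": [
          {
            "name": "near",
            "type": "geo",
            "precision": 4,
            "path": "location"
          }
        ]
      }
    }
  }
}
```
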
[float]
==== Semantics changed for `max_concurrent_shard_requests`
`max_concurrent_shard_requests` used to limit the total number of concurrent shard
requests a single high level search request can execute. In 7.0 this changed to be the
max number of concurrent shard requests per node. The default is now `5`.
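
The value can be overridden per request. In this sketch the index name
`my-index` and the value `3` are hypothetical; with the new semantics the value
caps concurrent shard requests on each node rather than for the whole search.

```js
# Hypothetical index name; the value is illustrative only.
GET /my-index/_search?max_concurrent_shard_requests=3
{
  "query": { "match_all": {} }
}
```
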
[float]
==== `max_score` set to `null` when scores are not tracked
`max_score` used to be set to `0` whenever scores were not tracked. `null` is now used
instead, which is a more appropriate value when scores are not available.
[float]
==== Negative boosts are not allowed
Setting a negative `boost` for a query or a field, deprecated in 6.x, is not allowed in this version.
To deboost a specific query or field you can use a `boost` between 0 and 1.
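
For example, a clause can be given less weight than its siblings with a
fractional boost. The index and field names below are hypothetical and `0.2` is
an arbitrary illustrative value.

```js
# Hypothetical index and field names; the boost value is illustrative.
GET /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "search" } } },
        { "match": { "body": { "query": "search", "boost": 0.2 } } }
      ]
    }
  }
}
```
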
[float]
==== Negative scores are not allowed in Function Score Query
Negative scores in the Function Score Query were deprecated in 6.x, and are
not allowed in this version. If a negative score is produced as a result
of computation (e.g. in `script_score` or `field_value_factor` functions),
an error will be thrown.
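
A script that could otherwise go negative can clamp its result at zero. The
following is a sketch under assumptions: `my-index` and the numeric field
`rating` are hypothetical, and the expression is only one possible way to keep
the score non-negative.

```js
# Hypothetical index and field names; the formula is illustrative.
GET /my-index/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "source": "Math.max(0.0, doc['rating'].value - 3)"
        }
      }
    }
  }
}
```
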
[float]
==== The filter context has been removed
The `filter` context has been removed from Elasticsearch's query builders.
The distinction between queries and filters is now decided in Lucene depending
on whether queries need to access the score or not. As a result, `bool` queries with
`should` clauses that don't need to access the score will no longer set their
`minimum_should_match` to 1. This behavior was deprecated in the previous
major version.
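
Queries that relied on the implicit value can state it explicitly. A minimal
sketch, assuming a hypothetical index `my-index` with keyword fields `type` and
`status`:

```js
# Hypothetical index and field names.
GET /my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "type": "article" } }
          ],
          "should": [
            { "term": { "status": "published" } },
            { "term": { "status": "archived" } }
          ],
          "minimum_should_match": 1
        }
      }
    }
  }
}
```

Without the explicit `minimum_should_match`, the `should` clauses in this inner
`bool` query would be optional.
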
[float]
==== `hits.total` is now an object in the search response
The total hits that match the search request is now returned as an object
with a `value` and a `relation`. `value` indicates the number of hits that
match and `relation` indicates whether the value is accurate (`eq`) or a lower bound
(`gte`):
```
{
    "hits": {
        "total": {
            "value": 1000,
            "relation": "eq"
        },
        ...
    }
}
```
The `total` object in the response indicates that the query matches exactly 1000
documents (`"relation": "eq"`). The `value` is always accurate (`"relation": "eq"`) when
`track_total_hits` is set to `true` in the request.
You can also retrieve `hits.total` as a number in the REST response by adding
`rest_total_hits_as_int=true` as a request parameter of the search request.
This parameter has been added to ease the transition to the new format and
will be removed in the next major version (8.0).
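
As a sketch, a client that has not yet been updated to read the object form
could send its searches like this; the index name `my-index` is hypothetical.

```js
# Hypothetical index name.
GET /my-index/_search?rest_total_hits_as_int=true
{
  "query": { "match_all": {} }
}
```
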
[float]
==== `hits.total` is omitted in the response if `track_total_hits` is disabled (false)
If `track_total_hits` is set to `false` in the search request, the search response
will set `hits.total` to null and the object will be omitted from the REST
response. You can add `rest_total_hits_as_int=true` in the search request parameters
to get the old format back (`"total": -1`).
[float]
==== `track_total_hits` defaults to 10,000
By default, a search request counts the total hits accurately up to `10,000`
documents. If the total number of hits that match the query is greater than this
value, the response will indicate that the returned value is a lower bound:
[source,js]
--------------------------------------------------
{
    "_shards": ...
    "timed_out": false,
    "took": 100,
    "hits": {
        "max_score": 1.0,
        "total" : {
            "value": 10000,    <1>
            "relation": "gte"  <2>
        },
        "hits": ...
    }
}
--------------------------------------------------
// NOTCONSOLE
<1> There are at least 10000 documents that match the query.
<2> This is a lower bound (`"gte"`).

You can force the count to always be accurate by setting `track_total_hits`
to `true` explicitly in the search request.
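
A request that asks for an exact count might be sketched as follows; the index
`my-index` and the field `title` are hypothetical.

```js
# Hypothetical index and field names.
GET /my-index/_search
{
  "track_total_hits": true,
  "query": {
    "match": { "title": "search" }
  }
}
```
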