The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly.
Some comments about the change:
* Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309Closes#32899
We used to set `maxScore` to `0` within `TopDocs` in situations where there is really no score as the size was set to `0` and scores were not even tracked. In such scenarios, `Float.Nan` is more appropriate, which gets converted to `max_score: null` on the REST layer. That's also more consistent with lucene which set `maxScore` to `Float.Nan` when merging empty `TopDocs` (see `TopDocs#merge`).
Currently docs don't explain how `ignore_above` behaves with arrays of
strings.
Clarify how `ignore_above` applies for arrays of strings and
also note that all string(s) will still be visible in the
`_source` field.
Relates #33057
* Add basic support for field aliases in index mappings. (#31287)
* Allow for aliases when fetching stored fields. (#31411)
* Add tests around accessing field aliases in scripts. (#31417)
* Add documentation around field aliases. (#31538)
* Add validation for field alias mappings. (#31518)
* Return both concrete fields and aliases in DocumentFieldMappers#getMapper. (#31671)
* Make sure that field-level security is enforced when using field aliases. (#31807)
* Add more comprehensive tests for field aliases in queries + aggregations. (#31565)
* Remove the deprecated method DocumentFieldMappers#getFieldMapper. (#32148)
The range docs had an introductory section that described how to set up
and index *and* a test setup section in `docs/build.gradle` that
duplicated that section. This is bad because these section can (and do)
drift from one another. This change removes the setup in build.gradle
and marks the introductor snippet with `// TESTSETUP` so it is used on
all the snippets.
A link to the ip_range datatype page provides a way for newer users to know
it exists if they land directly on the ip datatype page first via a search.
This field is similar to the `feature` field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.
Relates #27552
* [DOCS] Rewords _field_names documentation
Corrects the language around when we write to `_field_names` and when you might want to disable it given that n recent versions it does not carry the indexing overhead it once did.
Relates to #30862
* Update wording following review
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.
At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.
This should trade faster phrase querying for a larger index and longer indexing
times.
Relates to #27049
This change adds an option named `split_queries_on_whitespace` to the `keyword`
field type. When set to true full text queries (`match`, `multi_match`, `query_string`, ...) that target the field will split the input on whitespace to build the query terms. Defaults to `false`.
Closes#30393
Lucene has a new `FeatureField` which gives the ability to record numeric
features as term frequencies. Its main benefit is that it allows to boost
queries with the values of these features and efficiently skip non-competitive
documents at the same time using block-max WAND and indexed impacts.
This commit changes the default out-of-the-box configuration for the
number of shards from five to one. We think this will help address a
common problem of oversharding. For users with time-based indices that
need a different default, this can be managed with index templates. For
users with non-time-based indices that find they need to re-shard with
the split API in place they no longer need to resort only to
reindexing.
Since this has the impact of changing the default number of shards used
in REST tests, we want to ensure that we still have coverage for issues
that could arise from multiple shards. As such, we randomize (rarely)
the default number of shards in REST tests to two. This is managed via a
global index template. However, some tests check the templates that are
in the cluster state during the test. Since this template is randomly
there, we need a way for tests to skip adding the template used to set
the number of shards to two. For this we add the default_shards feature
skip. To avoid having to write our docs in a complicated way because
sometimes they might be behind one shard, and sometimes they might be
behind two shards we apply the default_shards feature skip to all docs
tests. That is, these tests will always run with the default number of
shards (one).
This adds a new `_ignored` meta field which indexes and stores fields that have
been ignored at index time because of the `ignore_malformed` option. It makes
malformed documents easier to identify by using `exists` or `term(s)` queries
on the `_ignored` field.
Closes#29494
Some features have been deprecated since `6.0` like the `_parent` field or the
ability to have multiple types per index. This allows to remove quite some
code, which in-turn will hopefully make it easier to proceed with the removal
of types.
This improves the way similarities are plugged in in order to:
- reject the classic similarity on 7.x indices and emit a deprecation
warning otherwise
- reject unkwown parameters on 7.x indices and emit a deprecation
warning otherwise
Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.
Closes#23208Closes#29035
Today this part of the documentation just says that Geo queries are not 100%
accurate, but in fact we can be more precise about which kinds of queries see
which kinds of error. This commit clarifies this point.
At time of writing, GeoJSON did not enforce a specific ordering of vertices in
a polygon, but it now does. We occasionally get reports of Elasticsearch
rejecting apparently-valid GeoJSON because of badly oriented polygons, and it's
helpful to be able to point at this bit of the documentation when responding.
This enhancement adds Z value support (source only) to geo_shape fields. If vertices are provided with a third dimension, the third dimension is ignored for indexing but returned as part of source. Like beofre, any values greater than the 3rd dimension are ignored.
closes#23747
This will reject mapping updates to the `_default_` mapping with 7.x indices
and still emit a deprecation warning with 6.x indices.
Relates #15613
Supersedes #28248
Currently the callouts for this section are below all the examples, making it
harder to relate them to the snippets. Instead they should be moved closer
to the examples.
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.
The mapping looks like this:
"my_text_field" : {
"type" : "text",
"analyzer" : "english",
"index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}
Relates to #27049
Since #25826 we reject infinite values for float, double and half_float
datatypes. This change adds this restriction to the documentation for the
supported datatypes.
Closes#27653
Allowing `_doc` as a type will enable users to make the transition to 7.0
smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`.
This also moves most of the documentation to `_doc` as a type name.
Closes#27750Closes#27751
* Caps
* Fix awkward wording that took multiple passes to parse
* Floating point _number_
* Something more descriptive about the `scaled_float` scaling factor.
Add an index level setting `index.mapping.nested_objects.limit` to control
the number of nested json objects that can be in a single document
across all fields. Defaults to 10000.
Throw an error if the number of created nested documents exceed this
limit during the parsing of a document.
Closes#26962
This section was removed to hide this ability to new users.
This change restores the section and adds a warning regarding the expected performance.
Closes#27336
extract all clauses from a conjunction query.
When clauses from a conjunction are extracted the number of clauses is
also stored in an internal doc values field (minimum_should_match field).
This field is used by the CoveringQuery and allows the percolator to
reduce the number of false positives when selecting candidate matches and
in certain cases be absolutely sure that a conjunction candidate match
will match and then skip MemoryIndex validation. This can greatly improve
performance.
Before this change only a single clause was extracted from a conjunction
query. The percolator tried to extract the clauses that was rarest in order
(based on term length) to attempt less candidate queries to be selected
in the first place. However this still method there is still a very high
chance that candidate query matches are false positives.
This change also removes the influencing query extraction added via #26081
as this is no longer needed because now all conjunction clauses are extracted.
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extractionCloses#26307
Numeric fields no longer support the index_options parameter. This changes the parameter
to be rejected in numeric field types after it was deprecated in 6.0.
Closes#21475
The percolator will add a `_percolator_document_slot` field to all percolator
hits to indicate with what document it has matched. This number matches with
the order in which the documents have been specified in the percolate query.
Also improved the support for multiple percolate queries in a search request.
The `index.percolator.map_unmapped_fields_as_text` is a more better name, because unmapped fields are mapped to a text field with default settings
and string is no longer a field type (it is either keyword or text).
* Remove the _all metadata field
This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses.
In order to help to default extraction behavior, boost fields can be configured, so that fields that are known for not being
selective enough can be ignored in favor for other fields or clauses with specific fields can forcefully take precedence over other clauses.
This can help selecting clauses for fields that don't match with a lot of percolator queries over other clauses and thus improving performance of the percolate query.
For example a status like field is something that should configured as an ignore field.
Queries on this field tend to match with more documents and so if clauses for this fields
get selected as best clause then that isn't very helpful for the candidate query that the
percolate query generates to filter out percolator queries that are likely not going to match.
The Writeble representation is less heavy to parse and that will benefit percolate performance and throughput.
The query builder's binary format has now the same bwc guarentees as the xcontent format.
Added a qa test that verifies that percolator queries written in older versions are still readable by the current version.
Today if we search across a large amount of shards we hit every shard. Yet, it's quite
common to search across an index pattern for time based indices but filtering will exclude
all results outside a certain time range ie. `now-3d`. While the search can potentially hit
hundreds of shards the majority of the shards might yield 0 results since there is not document
that is within this date range. Kibana for instance does this regularly but used `_field_stats`
to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results.
This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards
and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
Indexing a join field on a document requires a value of type "object" and two sub fields "name"
and "parent". The "parent" field is only required on child documents, but the "name" field which
denotes the name of the relation is always needed. Previously, only the short-hand version of the
join field was documented. This adds documentation for the long-hand join field data, and
explicitly points out that just specifying the name of the relation for the field value is a
convenience shortcut.
* Add documentation for the new parent-join field
This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.
Relates #20257
This commit adds back "id" as the key within a script to specify a
stored script (which with file scripts now gone is no longer ambiguous).
It also adds "source" as a replacement for "code". This is in an attempt
to normalize how scripts are specified across both put stored scripts and script usages, including search template requests. This also deprecates the old inline/stored keys.
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815
It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.