Backport to 7x
Enable geo_shape query to work on geo_point fields for shapes: circle, polygon, multipolygon, rectangle see: #48928
Co-Authored-By: @iverase
Adds conceptual docs for token graphs.
These docs cover:
* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs
Also makes the following supporting changes:
* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams
Removes the `flat_settings` and `timeout` query parameters from the JSON
spec and asciidoc docs for the put index template API.
These parameters are not supported by the API.
It is useful to be able to delay state recovery until enough data nodes have
joined the cluster, since this gives the shard allocator a decent opportunity
to re-use as much existing data as possible. However we also have the option to
delay state recovery until a certain number of master-eligible nodes have
joined, and this is unnecessary: we require a majority of master-eligible nodes
for state recovery, and there is no advantage in waiting for more.
This commit deprecates the unnecessary settings in preparation for their
removal.
Relates #51806
This commit adjusts a `deprecation[...]` message in the docs since such
messages must be on a single line. It also moves this message to the start of
the description of the deprecated setting as is the case with other such
messages.
* Submit async search to work only with POST (#53368)
Currently the submit async search API can be called using both GET and POST at REST, but given that it submits a call and creates internal state, POST should be the only allowed method.
* Refine SearchProgressListener internal API (#53373)
The following cumulative improvements have been made:
- rename `onReduce` and `notifyReduce` to `onFinalReduce` and `notifyFinalReduce`
- add unit test for `SearchShard`
- on* methods in `SearchProgressListener` shouldn't need to be public as they should never be called directly, they only need to be overridden hence they can be made protected. They are actually called directly from a test which required some adapting, like making `AsyncSearchTask.Listener` class package private instead of private
- Instead of overriding `getProgressListener` in `AsyncSearchTask`, as it feels weird to override a getter method, added a specific method that allows to retrieve the Listener directly without needing to cast it. Made the getter and setter for the listener final in the base class.
- rename `SearchProgressListener#searchShards` methods to `buildSearchShards` and make it static given that it accesses no instance members
- make `SearchShard` and `SearchShardTask` classes final
* Move async search yaml tests to x-pack yaml test folder (#53537)
The yaml tests for async search currently sit in its qa folder. There is no reason though for them to live in a separate folder as they don't require particular setup. This commit moves them to the main folder together with the other x-pack yaml tests so that they will be run by the client test runners too.
* [DOCS] Add temporary redirect for async-search (#53454)
The following API spec files contain a link to a not-yet-created
async search docs page:
* [async_search.delete.json][0]
* [async_search.get.json][1]
* [async_search.submit.json][2]
The Elaticsearch-js client uses these spec files to create their docs.
This created a broken link in the Elaticsearch-js docs, which has broken
the docs build.
This PR adds a temporary redirect for the docs page. This redirect
should be removed when the actual API docs are added.
[0]: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/src/test/resources/rest-api-spec/api/async_search.delete.json
[1]: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/src/test/resources/rest-api-spec/api/async_search.get.json
[2]: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/src/test/resources/rest-api-spec/api/async_search.submit.json
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Removes experimental.
* Replaces `"v"` (for value) with `"m"` (for metric).
* Move the note about tiebreaking into the list of limitations of the
sort.
* Explain how you ask for `metrics`.
* Clean up some wording.
* Link to the docs from `top_metrics`.
Closes#51813
Makes the following changes to the `remove_duplicates` token filter
docs:
* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
* New wildcard field optimised for wildcard queries (#49993)
Indexes values using size 3 ngrams and also stores the full original as a binary doc value.
Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values. Also supports aggregations and sorting.
Keyword field values with length more than ignore_above are not
indexed. But highlighters still were retrieving these values
from _source and were trying to highlight them. This sometimes lead to
errors if a field length exceeded max_analyzed_offset. But also this
is an overall wrong behaviour to attempt to highlight something that was
ignored during indexing.
This PR checks if a keyword value was ignored because of its length,
and if yes, skips highlighting it.
Backport: #53408Closes#43800
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.
Fixes#52427.
This changes the `top_metrics` aggregation to return metrics in their
original type. Since it only supports numerics, that means that dates,
longs, and doubles will come back as stored, with their appropriate
formatter applied.
This change removes the Lucene's experimental flag from the documentations of the following
tokenizer/filters:
* Simple Pattern Split Tokenizer
* Simple Pattern tokenizer
* Flatten Graph Token Filter
* Word Delimiter Graph Token Filter
The flag is still present in Lucene codebase but we're fully supporting these tokenizers/filters
in ES for a long time now so the docs flag is misleading.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Restructures the 'Update an enrich policy' section to:
* Migrate the content to the section. It was previously stored in the
Put Enrich Policy API docs.
* Remove the warning tag admonition from the section content.
* Replace a reused section earlier in the "Set up an enrich processor"
page with a link.
No substantive changes were made to the content.
Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
Makes the following changes to the `word_delimiter` token filter docs:
* Adds a warning admonition recommending the `word_delimiter_graph`
filter instead. This warning includes a link to the deprecated Lucene
`WordDelimiterFilter`.
* Updates the description
* Adds detailed analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter documentation
In a tip admonition, we recommend using the `keyword` tokenizer with the
`word_delimiter_graph` token filter. However, we only use the
`whitespace` tokenizer in the example snippets. This updates those
snippets to use the `keyword` tokenizer instead.
Also corrects several spacing issues for arrays in these docs.
Updates the SVG for a token graph to make the layout consistent with
other graphs. This means moving the text directly above the edge lines.
Previously, the text was above the edge line.
Adds a tip admonition to the basic example in the EQL search docs.
This tip lets users know they can set up a Beat to automatically
index data in ES, rather than manually indexing using the bulk or index
APIs.
Documents the `nodes` response parameters returned by the
`_cluster/stats` API.
Also adds collapsible attributes for the `indices` and `nodes`
sections.
Makes the following changes to the `word_delimiter_graph` token filter
docs:
* Updates the Lucene experimental admonition.
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter`
and `word_delimiter_graph`
This commit introduces hidden aliases. These are similar to hidden
indices, in that they are not visible by default, unless explicitly
specified by name or by indicating that hidden indices/aliases are
desired.
The new alias property, `is_hidden` is implemented similarly to
`is_write_index`, except that it must be consistent across all indices
with a given alias - that is, all indices with a given alias must
specify the alias as either hidden, or all specify it as non-hidden,
either explicitly or by omitting the `is_hidden` property.
7.5 and 7.6 had a regression that allowed for
script_score queries to have negative scores.
We have corrected this regression in #52478.
This is an addition to #52478 that adds
a test and release notes.
Adds documentation for the `any` keyword to the EQL syntax docs.
Includes:
* Definition of an event category and its relationship to the event
category field.
* Example matching all event categories using `any` keyword
* Example using `any` with `where true`
Updates the documented default `event_category_field` and `timestamp_field`
values for the EQL search API. Also updates related guidance in the
EQL requirement docs.
Relates to #53073.
Per the [Asciidoctor docs][0], Asciidoctor replaces the following
syntax with double arrows in the rendered HTML:
* => renders as ⇒
* <= renders as ⇐
This escapes several unintended replacements, such as in the Painless
docs.
Where appropriate, it also replaces some double arrow instances with
single arrows for consistency.
[0]: https://asciidoctor.org/docs/user-manual/#replacements
Makes the following changes to the `stop` token filter docs:
* Updates description
* Adds a link to the related Lucene filter
* Adds detailed analyze snippet
* Updates custom analyzer and custom filter snippets
* Adds a list of predefined stop words by language
Co-authored-by: ScottieL <36999642+ScottieL@users.noreply.github.com>
This field is a specialization of the `keyword` field for the case when all
documents have the same value. It typically performs more efficiently than
keywords at query time by figuring out whether all or none of the documents
match at rewrite time, like `term` queries on `_index`.
The name is up for discussion. I liked including `keyword` in it, so that we
still have room for a `singleton_numeric` in the future. However I'm unsure
whether to call it `singleton`, `constant` or something else, any opinions?
For this field there is a choice between
1. accepting values in `_source` when they are equal to the value configured
in mappings, but rejecting mapping updates
2. rejecting values in `_source` but then allowing updates to the value that
is configured in the mapping
This commit implements option 1, so that it is possible to reindex from/to an
index that has the field mapped as a keyword with no changes to the source.
Backport of #49713
implement transform node attributes to disable transform on certain nodes and
test which nodes are allowed to do remote connections
closes#52200closes#50033closes#48734
backport #52712
Makes the following updates to the EQL search tutorial:
* Adds an API response to the basic tutorial
* Adds an example using the `event_type_field` parm
* Adds an example using the `timestamp_field`parm
* Adds an example using the `query` parm
* Updates example dataset to support more EQL query variety
Makes the following changes to the `trim` token filter docs:
* Updates description
* Adds a link to the related Lucene filter
* Adds tip about removing whitespace using tokenizers
* Adds detailed analyze snippets
* Adds custom analyzer snippet
Adds a warning admonition stating that the `index_options` mapping
parameter is intended only for `text` fields.
Removes an outdated statement regarding default values for numeric
and other datatypes.
Adds reporting of memory usage for data frame analytics jobs.
This commit introduces a new index pattern `.ml-stats-*` whose
first concrete index will be `.ml-stats-000001`. This index serves
to store instrumentation information for those jobs.
Backport of #52778 and #52958
Closes#43990. Describe how to change the default GC settings without changing
the default `jvm.options`. Give examples using `jvm.options.d`, and
`ES_JAVA_OPTS` with Docker.
Backport of #51233 to the seven dot x branch.
Tries to load a `Mapper` instance for the mapping snippet of a dynamic template.
This should catch things like using an analyzer that is undefined or mapping attributes that are unused.
This is best effort:
* If `{{name}}` placeholder is used in the mapping snippet then validation is skipped.
* If `match_mapping_type` is not specified then validation is performed for all mapping types.
If parsing succeeds with a single mapping type then this the dynamic mapping is considered valid.
If is detected that a dynamic template mapping snippet is invalid at mapping update time then the mapping update is failed for indices created on 8.0.0-alpha1 and later. For indices created on prior version a deprecation warning is omitted instead. In 7.x clusters the mapping update will never fail in case of an invalid dynamic template mapping snippet and a deprecation warning will always be omitted.
Closes#17411Closes#24419
Co-authored-by: Adrien Grand <jpountz@gmail.com>
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.
This is necessary for the following use cases:
- Reading from frozen indices
- Allowing certain indices in multiple index patterns to not exist yet
These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.
closes https://github.com/elastic/elasticsearch/issues/48056
This change adds the recall@k metric and refactors precision@k to match
the new metric.
Recall@k is an important metric to use for learning to rank (LTR)
use-cases. Candidate generation or first ranking phase ranking functions
are often optimized for high recall, in order to generate as many
relevant candidates in the top-k as possible for a second phase of
ranking. Adding this metric allows tuning that base query for LTR.
See: https://github.com/elastic/elasticsearch/issues/51676
Backports: https://github.com/elastic/elasticsearch/pull/52577
Add query execution and return actual results returned from
Elasticsearch inside the tests
(cherry picked from commit 3e039282bf991af87604a6d4f8eada19d5e33842)
The introductory sections of the reference manual contains some simplified
instructions for adding a node to the cluster. Unfortunately they are a little
too simplified and only really work for clusters running on `localhost`. If you
try and follow these instructions for a distributed cluster then the new node
will, confusingly, auto-bootstrap itself into a distinct one-node cluster.
Multiple nodes running on localhost is a valid config, of course, but we should
spell out that these instructions are really only for experimentation and that
it takes a bit more work to add nodes to a distributed cluster. This commit
does so.
Also, the "important config" instructions for discovery say that you MUST set
`discovery.seed_hosts` whereas in fact it is fine to ignore this setting and
use a dynamic discovery mechanism instead. This commit weakens this statement
and links to the docs for dynamic discovery mechanisms.
Finally, this section is also overloaded with some technical details that are
not important for this context and are adequately covered elsewhere, and
completely fails to note that the default discovery port is 9300. This commit
addresses this.
Adds the `?refresh=wait_for` query argument to an index API snippet in
the term vectors API docs.
This should ensure the document is indexed and available before a
subsequent term vectors API request executes.
Fixes#52814.
* Smarter copying of the rest specs and tests (#52114)
This PR addresses the unnecessary copying of the rest specs and allows
for better semantics for which specs and tests are copied. By default
the rest specs will get copied if the project applies
`elasticsearch.standalone-rest-test` or `esplugin` and the project
has rest tests or you configure the custom extension `restResources`.
This PR also removes the need for dozens of places where the x-pack
specs were copied by supporting copying of the x-pack rest specs too.
The plugin/task introduced here can also copy the rest tests to the
local project through a similar configuration.
The new plugin/task allows a user to minimize the surface area of
which rest specs are copied. Per project can be configured to include
only a subset of the specs (or tests). Configuring a project to only
copy the specs when actually needed should help with build cache hit
rates since we can better define what is actually in use.
However, project level optimizations for build cache hit rates are
not included with this PR.
Also, with this PR you can no longer use the includePackaged flag on
integTest task.
The following items are included in this PR:
* new plugin: `elasticsearch.rest-resources`
* new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy
* new extension 'restResources'
```
restResources {
restApi {
includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar
includeXpack 'baz' //will include x-pack specs that start with baz
}
restTests {
includeCore 'foo', 'bar' //will include the core tests that start with foo and bar
includeXpack 'baz' //will include the x-pack tests that start with baz
}
}
```
Remove reference to an "SQL API" which could suggest that one needs to
treat this in a special way when configuring the ODBC driver.
(cherry picked from commit 451c341e0193b542409e8891ec2a31e62529a5e7)
Adds an explicit "important" admonition discouraging apps from using
cat APIs.
cat APIs are intended for human consumption via the command line or
Kibana console only. They are not intended for consumption by
applications.
Indices open with the `niofs` store type load much more data on-heap than
indices open with the `mmapfs` store type. This limitation is now documented
and examples have been updated to show how to update settings to use the
`mmapfs` store type rather than `niofs`.
We should be more explicit about the downsides of disabling replicas and
explain that users should be ready to re-do the entire load in case of
issues mid-way.
One architecture that we have recommended to several users to speed up
indexing involved using CCR to prevent searching from stealing resources
from indexing.
Before boost in script_score query was wrongly applied only to the subquery.
This commit makes sure that the boost is applied to the whole score
that comes out of script.
Closes#48465
Explicitly notes the Elasticsearch API endpoints that support CCS.
This should deter users from attempting to use CCS with other API
endpoints, such as `GET <index>/_doc/<_id>`.
* Adds an example request to the top of the page.
* Relocates several parameters erroneously listed under "Request body"
to the appropriate "Query parameters" section.
* Updates the "Request body" section to better document the NDJSON
structure of msearch requests.
Add default value to each one of the usages of `allow_no_indices`
since it differs between different APIs.
Relates to: #52534
(cherry picked from commit 2eb986488ac326d6da6ab8ad0203a94e08684a36)
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
Re-adds several redirects removed with #50510.
These redirects were related to the relocation of several API docs to
new pages under the 'REST APIs' chapter.
We've since decided to only remove such redirects with major releases.
When `PUT` is called to store a trained model, it is useful to return the newly create model config. But, it is NOT useful to return the inflated definition.
These definitions can be large and returning the inflated definition causes undo work on the server and client side.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This commit updates the enrich.get_policy API to specify name
as a list, in line with other URL parts that accept a comma-separated
list of values.
In addition, update the get enrich policy API docs
to align the URL part name in the documentation with
the name used in the REST API specs.
(cherry picked from commit 94f6f946ef283dc93040e052b4676c5bc37f4bde)