Backport of #49076
If an exception occurs inside a pipeline processor,
the pipeline stack is kept around as a header in the exception.
Then in the on_failure processor the id of the pipeline in which the
exception occurred is made accessible via the `on_failure_pipeline`
ingest metadata.
Closes #44920
Add TRUNC as an alias for the already implemented TRUNCATE
numeric function. TRUNC is the flavour supported by
Oracle and PostgreSQL.
Relates to: #41195
(cherry picked from commit f2aa7f0779bc5cce40cc0c1f5e5cf1a5bb7d84f0)
The default is set to Integer.MAX_VALUE but is reported to be `0` in the docs.
With the current implementation a value of 0 would mean all terms are filtered
out, which is the opposite of "unbounded".
Closes #49520
The breaking changes cover the removal of TLSv1 from the default
protocols, but assume that users who need to retain TLSv1 support will
understand all the places where they may have used it.
This has proven not to be true, as it is easy to be unaware that (for
example) an LDAP server is using TLSv1.
This change explicitly lists all the places where TLS protocols may
need to be configured.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
Co-Authored-By: Pius <pius@elastic.co>
* Adds a title abbreviation
* Relocates the older name deprecation warning
* Updates the description and adds a Lucene link
* Adds a note to explain payloads and how to store them
* Adds analyze and custom analyzer snippets
* Adds a 'Return stored payloads' example
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.
The API consolidates information that is useful before
creating a data frame analytics job.
It includes:
- memory estimation
- field selection explanation
Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.
Field selection is a new feature that explains to the user
whether each available field was selected for inclusion in
the analysis. If a field was not included, it also
explains the reason why.
Backport of #49455
This commit enhances the required pipeline functionality by changing it
so that default/request pipelines can also be executed, but the required
pipeline is always executed last. This gives users the flexibility to
execute their own indexing pipelines, but also ensure that any required
pipelines are also executed. Since such pipelines are executed last, we
change the name of required pipelines to final pipelines.
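
The ordering described above can be sketched in a few lines of Python. This is only an illustration, not the Elasticsearch implementation: the function and parameter names are hypothetical, and the rule that a request pipeline takes precedence over a default pipeline is an assumption that this message does not spell out.

```python
def pipelines_to_run(request_pipeline=None, default_pipeline=None, final_pipeline=None):
    """Return pipeline names in execution order (hypothetical helper)."""
    order = []
    # Assumption: a pipeline named on the request replaces the index default.
    first = request_pipeline if request_pipeline is not None else default_pipeline
    if first is not None:
        order.append(first)
    # The final (formerly "required") pipeline always runs last.
    if final_pipeline is not None:
        order.append(final_pipeline)
    return order
```

Under these assumptions, a request that names its own pipeline still ends with the final pipeline, so both the user's pipeline and the mandatory one are applied.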
Reformats the edge n-gram and n-gram token filter docs. Changes include:
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram
filters
Additional changes:
* Switches titles to use "n-gram" throughout.
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
All document scores are positive 32-bit floating point numbers. However, this
wasn't previously documented.
This can result in surprising behavior, such as precision loss, for users when
customizing scores using the function score query.
This commit updates an existing admonition in the function score query docs to
document the 32-bit precision limit. It also updates the search API reference
docs to note that `_score` is a 32-bit float.
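
To illustrate the kind of precision loss meant here, the following Python sketch rounds a value to a 32-bit float using the standard `struct` module. The helper name `to_float32` is hypothetical and the sketch is independent of Elasticsearch; it only shows what a 32-bit float can and cannot represent.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# A 32-bit float has a 24-bit significand, so integers above 2**24
# cannot all be represented exactly.
exact = to_float32(16777216.0)    # 2**24, representable
rounded = to_float32(16777217.0)  # 2**24 + 1, not representable
```

Here `rounded` comes back equal to `exact`, so two custom scores that differ only at this magnitude would compare as the same `_score`.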
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds a `custom` option to the `token_chars` setting. It
requires an additional `custom_token_chars` setting to be present, which
is interpreted as the set of characters to include in a token.
Closes #25894
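
A rough Python sketch of the splitting behaviour this enables is shown below. It is not the tokenizer's actual code: the `tokenize` function is hypothetical, it treats letters and digits as the only predefined classes, and it leaves out n-gram generation entirely.

```python
def tokenize(text, custom_token_chars=""):
    """Split text on any character that is not a letter, a digit,
    or one of the explicitly allowed custom characters."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in custom_token_chars:
            current.append(ch)
        elif current:
            # A non-token character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

without_custom = tokenize("foo_bar baz")                      # splits on "_"
with_custom = tokenize("foo_bar baz", custom_token_chars="_")  # keeps "_"
```

With the underscore listed as a custom character, `foo_bar` stays a single token without pulling in the rest of the `punctuation` class.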
Previously, DATEDIFF for minutes and hours performed a rounding
calculation using all the time fields (secs, msecs/micros/nanos).
Instead, it should first truncate the two dates to the respective field (mins or hours),
zeroing out all the more detailed time fields, and then perform the subtraction.
(cherry picked from commit 124cd18e20429e19d52fd8dc383827ea5132d428)
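
The truncate-then-subtract approach can be sketched in Python as follows. This is an illustration of the arithmetic only, not the SQL plugin's code; the function names are hypothetical.

```python
from datetime import datetime

def datediff_minutes(start, end):
    """Truncate both timestamps to the minute, then subtract."""
    s = start.replace(second=0, microsecond=0)
    e = end.replace(second=0, microsecond=0)
    return int((e - s).total_seconds() // 60)

def datediff_hours(start, end):
    """Truncate both timestamps to the hour, then subtract."""
    s = start.replace(minute=0, second=0, microsecond=0)
    e = end.replace(minute=0, second=0, microsecond=0)
    return int((e - s).total_seconds() // 3600)
```

With this approach, 10:00:59 to 10:01:00 counts as one minute because a minute boundary was crossed, while 10:00:00 to 10:00:59 counts as zero.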
The `string` type (with option `analyzed`) has been replaced by `text` after `6.0`.
Also, the `annotated_text` field does not support doc values, which should be mentioned.
Backport of #47573.
Closes #43603. Allow environment variables to be passed to ES in a Docker
container via a file, by setting an environment variable with the `_FILE`
suffix that points to the file with the intended value of the env var.
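
The mechanism can be sketched in Python as below. This is not the Docker entrypoint's actual logic: the function name is hypothetical, and both the trimming of surrounding whitespace and the error when a variable is set directly and via `_FILE` are assumptions made for the illustration.

```python
def resolve_file_env(env):
    """Return a copy of env where each NAME_FILE entry is replaced by
    NAME, set to the contents of the file that NAME_FILE points to."""
    resolved = dict(env)
    for name, path in env.items():
        if not name.endswith("_FILE"):
            continue
        target = name[: -len("_FILE")]
        if not target:
            continue
        # Assumption: setting both NAME and NAME_FILE is treated as an error.
        if target in env:
            raise ValueError(f"both {target} and {name} are set")
        with open(path) as f:
            # Assumption: surrounding whitespace (e.g. a trailing newline) is dropped.
            resolved[target] = f.read().strip()
        del resolved[name]
    return resolved
```

For example, an environment containing a hypothetical `PASSWORD_FILE=/run/secrets/pw` would resolve to `PASSWORD` holding the contents of that file.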
Backport of #47468 to 7.x
This PR adds a new metric aggregation called `string_stats` that operates on string terms of a document and returns the following:
* `min_length`: The length of the shortest term
* `max_length`: The length of the longest term
* `avg_length`: The average length of all terms
* `distribution`: The probability distribution of all characters appearing in all terms
* `entropy`: The total Shannon entropy value calculated for all terms
This aggregation has been implemented as an analytics plugin.
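
As a rough illustration of what these values mean, here is a Python sketch that computes them for a list of terms. It is not the plugin's implementation: the function is hypothetical, it assumes at least one non-empty term, and the use of a base-2 logarithm for the entropy is an assumption.

```python
import math
from collections import Counter

def string_stats(terms):
    """Compute length statistics, character distribution and Shannon
    entropy over all characters of all terms (terms must be non-empty)."""
    lengths = [len(t) for t in terms]
    counts = Counter("".join(terms))
    total = sum(counts.values())
    # Probability of each character across all terms.
    distribution = {ch: n / total for ch, n in counts.items()}
    # Shannon entropy in bits (base-2 logarithm is an assumption here).
    entropy = -sum(p * math.log2(p) for p in distribution.values())
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "distribution": distribution,
        "entropy": entropy,
    }

stats = string_stats(["aa", "bbbb"])
```

In this example, `a` and `b` make up one third and two thirds of all characters respectively, and the average length is 3.0.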
Makes a few changes to better align the update license API docs with
the [API reference template][0].
Changes:
* Replaces POST with PUT in several snippet examples.
While both are valid, PUT is a bit more RESTful.
* Removes leading slashes (/) from all snippets.
* Relocates and retitles the 'Authorization' section to 'Prerequisites'.
* Replaces explicit titles with the appropriate API reference template
attributes.
* Replaces unneeded `[float]` tags with explicit anchors.
Closes #35341
[0]: https://github.com/elastic/docs/blob/master/shared/api-ref-ex.asciidoc
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.
To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.
This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.
Closes #48956.
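
The tradeoff can be seen in a toy Python model of the analysis chain. This is a simplified sketch, not how Lucene indexes or searches: `edge_ngrams` is a hypothetical helper, and matching is reduced to set membership.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of token with lengths from min_gram up to max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

MAX_GRAM = 5
indexed = set(edge_ngrams("elasticsearch", 1, MAX_GRAM))

# A search term longer than max_gram never matches an indexed prefix.
long_term_matches = "elastic" in indexed

# Truncating the search term to max_gram makes it match again ...
truncated_matches = "elastic"[:MAX_GRAM] in indexed

# ... but unrelated terms sharing the first max_gram characters match too.
irrelevant_matches = "elastomer"[:MAX_GRAM] in indexed
```

In this model the untruncated search finds nothing, while the truncated one matches both the intended term and an unrelated one, which is the relevance cost of the `truncate` approach.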
This PR makes the following two fixes around updating flattened fields:
* Make sure that the new value for ignore_above takes effect immediately.
Previously we recorded the new value but did not use it when parsing
documents.
* Allow depth_limit to be updated dynamically. It seems plausible that a user
might want to tweak this setting as they encounter more data.
* [ML] Add new geo_results.(actual_point|typical_point) fields for `lat_long` results (#47050)
Related PR: https://github.com/elastic/ml-cpp/pull/809
* adjusting bwc version
The realtime GET API currently has erratic performance when the accessed document has just
been indexed but not yet refreshed, as the implementation will currently force an
internal refresh in that case. Refreshing can be an expensive operation, and it also blocks the
thread that executes the GET operation, preventing other GETs from being processed. With
frequent access to recently indexed documents, this can lead to a refresh storm and terrible
GET performance.
While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted
to read from the translog in case of realtime GET API or update API, this was removed in 5.0
(#20102) to avoid inconsistencies between values that were returned from the translog and
those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and
upsert to read from the translog again as it was easier to guarantee consistency for these, and
also brought back more predictable performance characteristics of this API. Calls to the realtime
GET API, however, would still always do a refresh if necessary to return consistent results. This
means that users that were calling realtime GET APIs to coordinate updates on client side
(realtime GET + CAS for conditional index of updated doc) would still see very erratic
performance.
This PR (together with #48707) resolves the inconsistencies between reading from translog and
index. In particular, it fixes the inconsistencies that happen when requesting stored fields, which
were not available when reading from translog. When stored fields are requested, this
PR will reparse the _source from the translog and derive the stored fields to be returned. With
this, it changes the realtime GET API to allow reading from the translog again, which avoids refresh
storms and blocking of the GET threadpool, and provides much better and more predictable
performance for this API overall.
The first example of splitting rules for the `word_delimiter` token filter was spread across two bullet points. This made it look like two separate splitting rules.
CCR follower stats can return information for persistent tasks that are in the process of being cleaned up. This is problematic for tests where CCR follower indices have been deleted, but their persistent follower task is only cleaned up asynchronously afterwards. If one of the following tests then accesses the follower stats, it might still get the stats for that follower task.
In addition, some tests were not cleaning up their auto-follow patterns, leaving orphaned patterns behind, while other tests did clean them up. Because the same name was always used, whether this led to a failure depended only on the test execution order. This commit fixes the offending tests, and will also automatically remove auto-follow patterns at the end of tests, like we do for many other features.
Closes #48700