This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.
This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.
Closes#56113
* [ML] adds new for_export flag to GET _ml/inference API (#57351)
Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API.
This flag is useful for moving models between clusters.
This commit adds the `expand_wildcards` parameter documentation to the
`_cat/indices` and `_cat/aliases` docs, as those APIs now support
`expand_wildcards`. Additionally, clarifies the `expand_wildcards` docs with
respect to hidden indices.
This adds a max_model_memory setting to forecast requests.
This setting can take a string value that is formatted according to byte sizes (i.e. "50mb", "150mb").
The default value is `20mb`.
There is a HARD limit at `500mb` which will throw an error if used.
If the limit is larger than 40% the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited.
related native change: https://github.com/elastic/ml-cpp/pull/1238
closes: https://github.com/elastic/elasticsearch/issues/56420
Implement TIME_PARSE(<time_str>, <pattern_str>) function
which allows to parse a time string according to the specified
pattern into a time object. The patterns allowed are those of
java.time.format.DateTimeFormatter.
Closes#54963
Co-authored-by: Andrei Stefan <astefan@users.noreply.github.com>
Co-authored-by: Patrick Jiang(白泽) <patrickjiang0530@gmail.com>
(cherry picked from commit 1fe1188d449cad7d0782a202372edc52a4014135)
Backporting #56888 to 7.x branch.
Limit the creation of data streams only for namespaces that have a composable template with a data stream definition.
This way we ensure that mappings/settings have been specified and will be used at data stream creation and data stream rollover.
Also remove `timestamp_field` parameter from create data stream request and
let the create data stream api resolve the timestamp field
from the data stream definition snippet inside a composable template.
Relates to #53100
Changes:
* Rewrites description and adds a Lucene link
* Reformats the configurable parameters as a definition list
* Changes the `Theory` heading to `Using the min_hash token filter for
similarity search`
* Adds some additional detail to the analyzer example
This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.
As discussed at https://elastic.slack.com/archives/C0D1XEXEZ/p1586939752242300 is it not possible to restore snapshots taken on newer versions into clusters running lower versions. For example, a snapshot created in a 7.6.0 cluster cannot be restored on a 7.5.0 cluster. This needs to be documented.
Makes following changes to better clarify docs for read-only URL
snapshot repositories:
* Adds an example snippet for registering a URL repository
* Rewrites the protocols paragraph
* Adds a note to explicitly point out that only URLs using the `ftp`,
`http`, `http`, and `jar` protocols do not need the `path.repo`
setting.
Fixes#16280
Previously, the restore API snippet included a `include_global_state` value of `true`.
Some users copy and paste the code example verbatim, updating only the index and
snapshot value names. Running the snippet could inadvertently wipe out a
cluster's current ILM policies, index templates, and ingest pipelines.
This change updates the snippet to use a `include_global_state` value of
`false`. It also adds a callout that better describes impacts of
using a `include_global_state` argument of `true`.
Co-authored-by: Mike Wong <mike.wong@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: David Turner <david.turner@elastic.co>
Move the JDBC functionality integration tests from `:sql:qa` to a separate
module `:sql:qa:jdbc`. This way the tests are isolated from the rest of the
integration tests and they only depend to the `:sql:jdbc` module, thus
removing the danger of accidentally pulling in some dependency that may
hide bugs.
Moreover this is a preparation for #56722, so that we can run those tests
between different JDBC and ES node versions and ensure forward
compatibility.
Move the rest of existing tests inside a new `:sql:qa:server` project, so that
the `:sql:qa` becomes the parent project for both and one can run all the integration
tests by using this parent project.
(cherry picked from commit c09f4a04484b8a43934fe58fbc41bd90b7dbcc76)
Changes:
* Adds API reference docs for the delete snapshot repo API.
* Corrects an error in the delete snapshot repo API spec. Comma-separated
repository names are not supported.
* Relocates the existing delete snapshot repo API example docs.
Elasticsearch enables HTTP compression by default. However, to mitigate
potential security risks like the BREACH attack, compression is disabled by
default if HTTPS is enabled.
This updates the `http.compression` setting definition accordingly and adds
additional context.
Co-authored-by: Leaf-Lin <39002973+Leaf-Lin@users.noreply.github.com>
* Changes for #52239.
* Incorporating review feedback from Julie T. Also single-sourcing nexted options in the Mapping page and referencing them in the Nested page.
* Moving tip after the introduction and clarifying limits.
* Update docs/reference/mapping.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Update docs/reference/mapping/types/nested.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Throttling nightly cleanup as much as we do has been over cautious.
Night cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.
Additionally, the API now has `requests_per_second` and `timeout` set.
So users calling the API directly can set the throttling.
This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
* [Transform] add support for terms agg in transforms (#56696)
This adds support for `terms` and `rare_terms` aggs in transforms.
The default behavior is that the results are collapsed in the following manner:
`<AGG_NAME>.<BUCKET_NAME>.<SUBAGGS...>...`
Or if no sub aggs exist
`<AGG_NAME>.<BUCKET_NAME>.<_doc_count>`
The mapping is also defined as `flattened` by default. This is to avoid field explosion while still providing (limited) search and aggregation capabilities.
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.
The aggregations supports the following normalizations
- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization
To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.
For example:
```
{
"normalize": {
"buckets_path": <>,
"normalizer": "percent"
}
}
```
* [DOCS] Add info about ILM and unallocated shards.
* Incorporated review feedback.
* Update docs/reference/ilm/actions/ilm-allocate.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Apply suggestions from code review
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Fix xref
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
This adds a few things to the `breakdown` of the profiler:
* `histogram` aggregations now contain `total_buckets` which is the
count of buckets that they collected. This could be useful when
debugging a histogram inside of another bucketing agg that is fairly
selective.
* All bucketing aggs that can delay their sub-aggregations will now add
a list of delayed sub-aggregations. This is useful because we
sometimes have fairly involved logic around which sub-aggregations get
delayed and this will save you from having to guess.
* Aggregtations wrapped in the `MultiBucketAggregatorWrapper` can't
accurately add anything to the breakdown. Instead they the wrapper
adds a marker entry `"multi_bucket_aggregator_wrapper": true` so we
can be quickly pick out such aggregations when debugging.
It also fixes a bug where `_count` breakdown entries were contributing
to the overall `time_in_nanos`. They didn't add a large amount of time
so it is unlikely that this caused a big problem, but I was there.
To support the arbitrary breakdown data this reworks the profiler so
that the `breakdown` can contain any data that is supported by
`StreamOutput#writeGenericValue(Object)` and
`XContentBuilder#value(Object)`.
This optional parameter can only be a string. To test out a transient custom
analysis chain, users are expected to use the 'tokenizer', 'filter', and
'char_filter' parameters.
Today we report some statistics in terms of Lucene-level documents, which
differ from Elasticsearch-level documents in a number of ways and include
things like document tombstones which users cannot directly observe. This
commit clarifies the internal nature of these statistics.
Closes#56497
The docs pattern url was using `*` which means zero or many instead
of `?` which means zero or one. The pattern url returned in error
messages was not in sync with the one in the docs.
Fixes: #56476
(cherry picked from commit 1a5945c3962cdda21482f4b0b3e0ca508534c2c4)
* [DOCS] Promote cron expressions info from Watcher to a separate topic.
* Fix table error
* Fixed xref
* Apply suggestions from code review
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Incorporated review feedback
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
'system' indices will carry special meaning in the future this commit
removes the system from the name to avoid confusion. (technically
these indices will be hidden not system)
* QL: case sensitive support in EQL (#56404)
* adds a generic startsWith function to QL
* modifies the existent EQL startsWith function to be case sensitive
aware
* improves the existent EQL startsWith function to use a prefix query
when the function is used in a case sensitive context. Same improvement
is used in SQL's newly added STARTS_WITH function.
* adds case sensitivity to EQL configuration through a case_sensitive
parameter in the eql request, as established in #54411.
The case_sensitive parameter can be specified when running queries
(default is case insensitive)
(cherry picked from commit ee5a09ea840167566e34c28c8225dc38bc6a7ae8)
Similar to what the moving function aggregation does, except merging windows of percentiles
sketches together instead of cumulatively merging final metrics
This commit removes the `prefer_v2_templates` flag and setting. This was a brief setting that
allowed specifying whether V1 or V2 template should be used when an index is created. It has been
removed in favor of V2 templates always having priority.
Relates to #53101Resolves#56528
This is not a breaking change because this flag was never in a released version.
* [DOCS] Align with ILM changes.
* Apply suggestions from code review
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
* Incorporated review comments.
Changes:
* Moves the document request body parameters for the search API
from the Request body search page to the Search API reference page.
* Relocates a search request body example from the Request body search
page to the Search API reference page.
* Adds a note to any duplicated query and request body parameters.
* Add xpack setting deprecations to deprecation API
The deprecated settings showed up in the deprecation log file by
default, but I did not add them to the deprecation API. This commit
fixes that. Now if you use one of the deprecated basic feature
enablement settings, calling _monitoring/deprecations will inform you of
that fact.
* Remove incorrectly backported settings documents
It seems that I backported these docs to the wrong place in #56061,
in #55980, and in #56167. I hope they're in the right place now.
Co-authored-by: debadair <debadair@elastic.co>
Notes that you cannot use EQL in ES to search the values of `nested`
fields or their sub-fields. However, indices containing `nested` field
mappings are otherwise supported.
This is a backport of #55884 with redirects removed.
Changes:
* Adds an abbreviated title for the search API page.
* Removes the following invalid query parameters:
* `analyzer`
* `analyze_wildcard`
* `default_operator`
* `df`
* `lenient`
* `suggest_mode`
* `suggest_size`
* Replaces the URI search page's query parameter docs with a xref
* Updates the headings of several examples
The following settings are now no-ops:
* xpack.flattened.enabled
* xpack.logstash.enabled
* xpack.rollup.enabled
* xpack.slm.enabled
* xpack.sql.enabled
* xpack.transform.enabled
* xpack.vectors.enabled
Since these settings no longer need to be checked, we can remove settings
parameters from a number of constructors and methods, and do so in this
commit.
We also update documentation to remove references to these settings.
Our upgrade docs don't mention the upgrade assistant, which can be
helpful when migrating across major versions. The docs also don't
mention deprecation logs, which can highlight other functionality that
may change.
This adds a related tip admonition to the upgrade docs.
This PR implements the following changes to make ML model snapshot
retention more flexible in advance of adding a UI for the feature in
an upcoming release.
- The default for `model_snapshot_retention_days` for new jobs is now
10 instead of 1
- There is a new job setting, `daily_model_snapshot_retention_after_days`,
that defaults to 1 for new jobs and `model_snapshot_retention_days`
for pre-7.8 jobs
- For days that are older than `model_snapshot_retention_days`, all
model snapshots are deleted as before
- For days that are in between `daily_model_snapshot_retention_after_days`
and `model_snapshot_retention_days` all but the first model snapshot
for that day are deleted
- The `retain` setting of model snapshots is still respected to allow
selected model snapshots to be retained indefinitely
Backport of #56125
Per #54411, we plan to handle case sensitivity via a parameter for the
EQL search API (with the possible exception of the `between` function).
This removes references and examples related to case sensitivity from
the EQL functions docs.
Previously, when the timezone was missing from the datetime string
and the pattern, UTC was used, instead of the session defined timezone.
Moreover, if a timezone was included in the datetime string and the
pattern then this timezone was used. To have a consistent behaviour
the resulting datetime will always be converted to the session defined
timezone, e.g.:
```
SELECT DATETIME_PARSE('2020-05-04 10:20:30.123 +02:00', 'HH:mm:ss dd/MM/uuuu VV') AS datetime;
```
with `time_zone` set to `-03:00` will result in
```
2020-05-04T05:20:40.123-03:00
```
Follows: #54960
(cherry picked from commit 8810ed03a209cc8fe1bad309a81e85b56a39da27)
* Delay warning about missing x-pack (#54265)
Currently, when monitoring is enabled in a freshly-installed cluster,
the non-master nodes log a warning message indicating that master may
not have x-pack installed. The message is often printed even when the
master does have x-pack installed but takes some time to setup the local
exporter for monitoring. This commit adds the local exporter setting
`wait_master.timeout` which defaults to 30 seconds. The setting
configures the time that the non-master nodes should wait for master to
setup monitoring. After the time elapses, they log a message to the user
about possible missing x-pack installation on master.
The logging of this warning was moved from `resolveBulk()` to
`openBulk()` since `resolveBulk()` is called only on cluster updates and
the message might not be logged until a new cluster update occurs.
Closes#40898
It's possible for a constant_keyword to have a 'null' value before any documents
are seen that contain a value for the field. In this case, no documents have a
value for the field, and 'exists' queries should return no documents.
Makes the following changes to the `porter_stem` token filter docs:
* Rewrites description and adds a Lucene link
* Adds detailed analyze example
* Adds an analyzer example
Backports #55933 to 7.x
Implements value_count and avg aggregations over Histogram fields as discussed in #53285
- value_count returns the sum of all counts array of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
* Allow Deleting Multiple Snapshots at Once (#55474)
Adds deleting multiple snapshots in one go without significantly changing the mechanics of snapshot deletes otherwise.
This change does not yet allow mixing snapshot delete and abort. Abort is still only allowed for a single snapshot delete by exact name.
* Make xpack.monitoring.enabled setting a no-op
This commit turns xpack.monitoring.enabled into a no-op. Mostly, this involved
removing the setting from the setup for integration tests. Monitoring may
introduce some complexity for test setup and teardown, so we should keep an eye
out for turbulence and failures
* Docs for making deprecated setting a no-op
I see occasional confusion about the explanations emitted by the same-shard
allocation decider, particularly amongst new users setting up a single-node
cluster and trying to determine why their cluster has `yellow` health. For
example:
the shard cannot be allocated to the same node on which a copy of the shard
already exists
This is technically correct but it's quite a complicated sentence. Also, by
starting with "the shard cannot be allocated" it makes it sound like this is
the problem, whereas in fact this message is a good thing and users should
typically focus their attention elsewhere.
This commit simplifies the wording of these messages and makes them sound more
positive, for example:
a copy of this shard is already allocated to this node
Adds a concise list of EQL advantages, based on the "EQL Advantages"
section in the [EQL for the masses][0] blog post.
The intent is to inform users how EQL could benefit at a high level.
[0]: https://www.elastic.co/blog/eql-for-the-masses
Co-Authored-By: Ross Wolf <31489089+rw-access@users.noreply.github.com>
Removes an example from the "Document counts are approximate" section of the
terms agg documentation.
As #52377 details, the example was no longer accurate in 7.x or 6.8. Document
counts were more precise than the example presented.
We've opened issue #56025 to discuss re-adding an example later.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: AB Prashanth <panuradh@buffalo.edu>
* Make xpack.ilm.enabled setting a no-op
* Add watcher setting to not use ILM
* Update documentation for no-op setting
* Remove NO_ILM ml index templates
* Remove unneeded setting from test setup
* Inline variable definitions for ML templates
* Use identical parameter names in templates
* New ILM/watcher setting falls back to old setting
* Add fallback unit test for watcher/ilm setting
Makes the following changes to the `kstem` token filter docs:
* Rewrite description and adds a Lucene work
* Adds detailed analyze example
* Adds an analyzer example
Implements Sum aggregation over Histogram fields by summing the value of each bucket multiplied by their count as requested in #53285
Backports #55681 to 7.x
* [DOCS] Rework conceptual info for ILM. (#52181)
* [DOCS] Rework conceptual info for ILM.
* Split the actions out of concepts.
* Added xpack role to actions.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
* Apply suggestions from code review
* Edit actions for consistency and add action template. (#55632)
* Edit actions for consistency and add action template.
* Update docs/reference/ilm/actions/ilm-readonly.asciidoc
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
* Apply suggestions from code review
The disk decider had special handling for the single data node case,
allowing any allocation (skipping watermark checks) for such clusters.
This special handling can now be avoided via a setting.
Warn about potential performance impact when a large number of fields
is used with query string query and no default field.
Re-adds content from #35570.
That content was erroneously removed in #45296.
Co-authored-by: Peter Dyson <peter.dyson@geekpete.com>
This has no practical impact on users since frozen indices are the only
throttled indices today. However this has an impact on upcoming features
that would use search throttling.
Filtering out throttled indices made sense a couple years ago, but as
we're now improving support for slow requests with `_async_search` and
exploring ways to reduce storage costs, this feature has most likely
become a trap, that we'd like to not have with upcoming features that
would use search throttling.
Relates #54058
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.
Closes#55767
The failed_category_count statistic records the number of times
categorization wanted to create a new category but couldn't
because the job had reached its model_memory_limit.
Backport of #55716
Makes the following changes to the `stemmer` token filter docs:
* Adds detailed analyze example
* Rewrites parameter definitions
* Adds custom analyzer example
* Adds a `language` value for the `estonian` stemmer
* Reorders the `language` values to show recommended algorithms first,
followed by other values alphabetically
Adds conceptual documentation for stemming, including:
* An overview of why stemming is helpful in search
* Algorithmic vs. dictionary stemming
* Token filters used to control stemming, such as `stemmer_override`, `keyword_marker`, and `conditional`
The example looks the same as in the previous section although it should use the
"fuzziness" parameter. This seems to be okay on 6.8 and master and was probably
only forgotten to port to 7.x branches.
This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user. This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)
Adds a important admonition to the EQL syntax page noting that
the equal (`==`) operator should not be used to match `text` field
values.
Relates to #52709 and #53020
Documents several parameters missing from the bulk API's response body
docs. Also moves several response-related chunks of text to the response
body section.
Relates to #55237
The ML info endpoint returns the max_model_memory_limit setting
if one is configured. However, it is still possible to create
a job that cannot run anywhere in the current cluster because
no node in the cluster has enough memory to accommodate it.
This change adds an extra piece of information,
limits.effective_max_model_memory_limit, to the ML info
response that returns the biggest model memory limit that could
be run in the current cluster assuming no other jobs were
running.
The idea is that the ML UI will be able to warn users who try to
create jobs with higher model memory limits that their jobs will
not be able to start unless they add a bigger ML node to their
cluster.
Backport of #55529
Adds a "node" field to the response from the following endpoints:
1. Open anomaly detection job
2. Start datafeed
3. Start data frame analytics job
If the job or datafeed is assigned to a node immediately then
this field will return the ID of that node.
In the case where a job or datafeed is opened or started lazily
the node field will contain an empty string. Clients that want
to test whether a job or datafeed was opened or started lazily
can therefore check for this.
Backport of #55473
Adds an example for bulk API requests that include failures.
Also documents guidance on use the `filter_path` parameter
to narrow the bulk API response for errors.
Closes#55237
Removes the 'Testing' chapter from the Elasticsearch Reference guide.
This chapter was originally written for so that users using the Java HLRC client could
use the same test classes when testing Elasticsearch in their own applications.
However, this is no longer the case or recommended.
Closes#55257.
This paves the data layer way so that exceptionally large models are partitioned across multiple documents.
This change means that nodes before 7.8.0 will not be able to use trained inference models created on nodes on or after 7.8.0.
I chose the definition document limit to be 100. This *SHOULD* be plenty for any large model. One of the largest models that I have created so far had the following stats:
~314MB of inflated JSON, ~66MB when compressed, ~177MB of heap.
With the chunking sizes of `16 * 1024 * 1024` its compressed string could be partitioned to 5 documents.
Supporting models 20 times this size (compressed) seems adequate for now.
This commit adds a new querystring parameter on the following APIs:
- Index
- Update
- Bulk
- Create Index
- Rollover
These APIs now support a `?prefer_v2_templates=true|false` flag. This flag changes the preference
creation to use either V2 index templates or V1 templates. This flag defaults to `false` and will be
changed to `true` for 8.0+ in subsequent work.
Additionally, setting this flag internally sets the `index.prefer_v2_templates` index-level setting.
This setting is used so that actions that automatically create a new index (things like rollover
initiated by ILM) will inherit the preference from the original index. This setting is dynamic so
that a transition from v1 to v2 templates can occur for long-running indices grouped by an alias
performing periodic rollover.
This also adds support for sending this parameter to the High Level Rest Client.
Relates to #53101
We believe there's no longer a need to be able to disable basic-license
features completely using the "xpack.*.enabled" settings. If users don't
want to use those features, they simply don't need to use them. Having
such features always available lets us build more complex features that
assume basic-license features are present.
This commit deprecates settings of the form "xpack.*.enabled" for
basic-license features, excluding "security", which is a special case.
It also removes deprecated settings from integration tests and unit
tests where they're not directly relevant; e.g. monitoring and ILM are
no longer disabled in many integration tests.
PR #51260 moved usage counts about mapping field types and analysis to
the `_cluster/stats` API.
This documents those stats in the response section of the cluster stats
API docs.
Implement the use of scalar functions inside aggregate functions.
This allows for complex expressions inside aggregations, with or without
GROUBY as well as with or without a HAVING clause. e.g.:
```
SELECT MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) AS max, b
FROM test
GROUP BY b
HAVING MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) > 5
```
Scalar functions are still not allowed for `KURTOSIS` and `SKEWNESS` as
this is currently not implemented on the ElasticSearch side.
Fixes: #29980Fixes: #36865Fixes: #37271
(cherry picked from commit 506d1beea7abb2b45de793bba2e349090a78f2f9)
The main changes are:
1. Throw an error when updating `include_in_parent` or `include_in_root` attribute of nested field dynamically by the PUT mapping API.
2. Add a test for the change.
Closes#53792
Co-authored-by: bellengao <gbl_long@163.com>
* [DOCS] Reformat `flatten_graph` token filter
Makes the following changes to the `flatten_graph` token filter docs:
* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds analyzer example