This adds a `rare_terms` aggregation. It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.
This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).
This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again. If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter. If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".
The map keys are the "rare" terms after collection is done.
Following the removal of the `unzip` package from the Elasticsearch
Docker image in #39040, update setup instructions for TLS in Docker.
Also avoid cross-platform ownership+permission issues by not relying
on local bind mounts for storing generated certs and don't require
`curl` locally installed.
Backport of #43748
Removes the suggestion to use IP addresses for `cluster.initial_master_nodes`
in the "important settings" discovery docs, leaving only the suggestion to use
node names.
Relates #41179, #41569
This commit merges the `object-fields` feature branch. The new 'flattened
object' field type allows an entire JSON object to be indexed into a field, and
provides limited search functionality over the field's contents.
This commit adds a wildcard intervals source, similar to the prefix. It
also changes the term parameter in prefix to read prefix, to bring it
in to line with the pattern parameter in wildcard.
Closes#43198
Currently changing resources (like dictionaries, synonym files etc...) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.
This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
markes as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.
Relates to #29051
We should throw an exception at construction time if a list of
articles is not provided, otherwise we can get random NPEs during
indexing.
Relates to #43002
It is possible for internal ML indices like `.data-frame-notifications-1` to leak,
causing other docs tests to fail when they accidentally search over these
indices. This PR updates the ignore_above tests to only search a specific index.
This commit adds a prefix intervals source, allowing you to search
for intervals that contain terms starting with a given prefix. The source
can make use of the index_prefixes mapping option.
Relates to #43198
* [ML][Data Frame] Add support for allow_no_match for endpoints (#43490)
* [ML][Data Frame] Add support for allow_no_match parameter in endpoints
Adds support for:
* Get Transforms
* Get Transforms stats
* stop transforms
* Update DataFrameTransformDocumentationIT.java
Given a nested structure composed of Lists and Maps, getByPath will return the value
keyed by path. getByPath is a method on Lists and Maps.
The path is string Map keys and integer List indices separated by dot. An optional third
argument returns a default value if the path lookup fails due to a missing value.
Eg.
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key1') = ['c', 'd']
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key1.0') = 'c'
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key2', 'x') = 'x'
[['key0': 'value0'], ['key1': 'value1']].getByPath('1.key1') = 'value1'
Throws IllegalArgumentException if an item cannot be found and a default is not given.
Throws NumberFormatException if a path element operating on a List is not an integer.
Fixes#42769
This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
A voting-only master-eligible node is a node that can participate in master elections but will not act
as a master in the cluster. In particular, a voting-only node can help elect another master-eligible
node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at
least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two
can still elect a master amongst them-selves. This only requires one of the two remaining nodes to
have the capability to act as master, but both need to have voting powers. This means that one of
the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated
master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a
voting-only non-dedicated master node can play the role of the third master-eligible node, which
allows running an HA cluster with only two dedicated master nodes.
Closes#14340
Co-authored-by: David Turner <david.turner@elastic.co>
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The existing language was misleading about the model snapshots and where they are located. Saying "to disk" sounds like files external to Elasticsearch IMO. It raises the obvious question, where on disk? which node? Is it in the Elasticsearch snapshot repo? The model snapshots are held in an internal index.
* Example of how to set slow logs dynamically per-index
* Make _settings API example more explicit
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
* Add TEST directive to fix CI
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
the geo-bounding-box and phrase-suggest docs were susceptible to
failing due to other indices in the cluster. This change restricts
the queries to the index that is set up for the test.
relates to #43271.
This commit tweaks the docs for secure settings to ensure the user is
aware adding non secure settings to the keystore will result in
elasticsearch not starting.
fixes#43328
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>