Currently when a document misses a vector value, vector function
returns 0 as a score for this document. We think this is incorrect
behaviour.
With this change, an error will be thrown if vector functions are
used with docs that are missing vector doc values.
Also VectorScriptDocValues is modified to allow size() function,
which can be used to check if a document has a value for the
vector field.
This PR adds the reference documentation pages of the data frame analytics APIs (PUT, START, STOP, GET, GET stats, DELETE, Evaluate) to the ML APIs pool.
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.
As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.
The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
This commit deprecates the `transport.profiles.*.xpack.security.type`
setting. This setting is used to configure a profile that would only
allow client actions. With the upcoming removal of the transport client
the setting should also be deprecated so that it may be removed in
a future version.
Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions.
* [ML][Data Frame] add node attr to GET _stats (#43842)
* [ML][Data Frame] add node attr to GET _stats
* addressing testing issues with node.attributes
* adjusting for backport
This change explains why Painless doesn't natively support datetime now, and
gives examples of how to create a version of now through user-defined
parameters.
Currently the repsonse of the "_reload_search_analyzer" endpoint contains the
index names and nodeIds of indices were analyzers reloading was triggered. This
change add the names of the search-time analyzers that were reloaded.
Closes#43804
Clarifies the roles of a dedicated voting-only master-eligible node.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
Co-Authored-By: David Turner <david.turner@elastic.co>
A few places in the documentation had mentioned 6.7 as the version to
upgrade from, when doing an upgrade to 7.0. While this is technically
possible, this commit will replace all those mentions to 6.8, as this is
the latest version with the latest bugfixes, deprecation checks and
ugprade assistant features - which should be the one used for upgrades.
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
This adds a `rare_terms` aggregation. It is an aggregation designed
to identify the long-tail of keywords, e.g. terms that are "rare" or
have low doc counts.
This aggregation is designed to be more memory efficient than the
alternative, which is setting a terms aggregation to size: LONG_MAX
(or worse, ordering a terms agg by count ascending, which has
unbounded error).
This aggregation works by maintaining a map of terms that have
been seen. A counter associated with each value is incremented
when we see the term again. If the counter surpasses a predefined
threshold, the term is removed from the map and inserted into a cuckoo
filter. If a future term is found in the cuckoo filter we assume it
was previously removed from the map and is "common".
The map keys are the "rare" terms after collection is done.
Following the removal of the `unzip` package from the Elasticsearch
Docker image in #39040, update setup instructions for TLS in Docker.
Also avoid cross-platform ownership+permission issues by not relying
on local bind mounts for storing generated certs and don't require
`curl` locally installed.
Backport of #43748
Removes the suggestion to use IP addresses for `cluster.initial_master_nodes`
in the "important settings" discovery docs, leaving only the suggestion to use
node names.
Relates #41179, #41569
This commit merges the `object-fields` feature branch. The new 'flattened
object' field type allows an entire JSON object to be indexed into a field, and
provides limited search functionality over the field's contents.
This commit adds a wildcard intervals source, similar to the prefix. It
also changes the term parameter in prefix to read prefix, to bring it
in to line with the pattern parameter in wildcard.
Closes#43198
Currently changing resources (like dictionaries, synonym files etc...) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.
This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
markes as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.
Relates to #29051
We should throw an exception at construction time if a list of
articles is not provided, otherwise we can get random NPEs during
indexing.
Relates to #43002
It is possible for internal ML indices like `.data-frame-notifications-1` to leak,
causing other docs tests to fail when they accidentally search over these
indices. This PR updates the ignore_above tests to only search a specific index.
This commit adds a prefix intervals source, allowing you to search
for intervals that contain terms starting with a given prefix. The source
can make use of the index_prefixes mapping option.
Relates to #43198
* [ML][Data Frame] Add support for allow_no_match for endpoints (#43490)
* [ML][Data Frame] Add support for allow_no_match parameter in endpoints
Adds support for:
* Get Transforms
* Get Transforms stats
* stop transforms
* Update DataFrameTransformDocumentationIT.java
Given a nested structure composed of Lists and Maps, getByPath will return the value
keyed by path. getByPath is a method on Lists and Maps.
The path is string Map keys and integer List indices separated by dot. An optional third
argument returns a default value if the path lookup fails due to a missing value.
Eg.
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key1') = ['c', 'd']
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key1.0') = 'c'
['key0': ['a', 'b'], 'key1': ['c', 'd']].getByPath('key2', 'x') = 'x'
[['key0': 'value0'], ['key1': 'value1']].getByPath('1.key1') = 'value1'
Throws IllegalArgumentException if an item cannot be found and a default is not given.
Throws NumberFormatException if a path element operating on a List is not an integer.
Fixes#42769