The count should match the number of all df-analytics that
matched the id in the request. However, we set the count
to the number of df-analytics returned which was bound to the
`size` parameter. This commit fixes this by setting the count
to the count of the `get` response.
A bug was introduced in 6.6.0 when we added support for
rollup indices. Rollup caps does NOT support looking at
remote indices, consequently, since we always look up rollup
caps, the datafeed fails with an error if its config
includes a concrete remote index. (When all remote indices
in a datafeed config are wildcards the problem did not
occur.)
The rollups feature does not support remote indices, so if
there is any remote index in a datafeed config (wildcarded
or not), we can skip the rollup cap checks. This PR
implements that change.
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.
As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.
The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
Renames `_id_copy` to `ml__id_copy` as field names starting with
underscore are deprecated. The new field name `ml__id_copy` was
chosen as an obscure enough field that users won't have in their data.
Otherwise, this field is only intented to be used by df-analytics.
If a job is opened and then closed and does nothing in
between then it should not persist any results or state
documents. This change adapts the no-op job test to
assert no results in addition to no state, and to log
any documents that cause this assertion to fail.
Relates elastic/ml-cpp#512
Relates #43680
The Action base class currently works for both Streamable and Writeable
response types. This commit intorduces StreamableResponseAction, for
which only the legacy Action implementions which provide newResponse()
will extend. This eliminates the need for overriding newResponse() with
an UnsupportedOperationException.
relates #34389
Since #41817 was merged the ml-cpp zip file for any
given version has been cached indefinitely by Gradle.
This is problematic, particularly in the case of the
master branch where the version 8.0.0-SNAPSHOT will
be in use for more than a year.
This change tells Gradle that the ml-cpp zip file is
a "changing" dependency, and to check whether it has
changed every two hours. Two hours is a compromise
between checking on every build and annoying developers
with slow internet connections and checking rarely
causing bug fixes in the ml-cpp code to take a long
time to propagate through to elasticsearch PRs that
rely on them.
This commit adds support for multiple source indices.
In order to deal with multiple indices having different mappings,
it attempts a best-effort approach to merge the mappings assuming
there are no conflicts. In case conflicts exists an error will be
returned.
To allow users creating custom mappings for special use cases,
the destination index is now allowed to exist before the analytics
job runs. In addition, settings are no longer copied except for
the `index.number_of_shards` and `index.number_of_replicas`.
* Deduplicate org.elasticsearch.xpack.core.dataframe.utils.TimeUtils and org.elasticsearch.xpack.core.ml.utils.time.TimeUtils into a common class: org.elasticsearch.xpack.core.common.time.TimeUtils.
* Add unit tests for parseTimeField and parseTimeFieldToInstant methods
This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The error message if the native controller failed to run
(for example due to running Elasticsearch on an unsupported
platform) was not easy to understand. This change removes
pointless detail from the message and adds some hints about
likely causes.
Fixes#42341
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
After the network disruption a partition is created,
one side of which can form a cluster the other can't.
Ensure requests are sent to a node on the correct side
of the cluster
This commit removes some very old test logging annotations that appeared
to be added to investigate test failures that are long since closed. If
these are needed, they can be added back on a case-by-case basis with a
comment associating them to a test failure.
* Return 0 for negative "free" and "total" memory reported by the OS
We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.
In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.
Resolves#42157
* Fix test passing in invalid memory value
* Fix another test passing in invalid memory value
* Also change mem check in MachineLearning.machineMemoryFromStats
* Add background documentation for why we prevent negative return values
* Clarify comment a bit more
This trace logging looks like it was copy/pasted from another test,
where the logging in that test was only added to investigate a test
failure. This commit removes the trace logging.
The ML failover tests sometimes need to wait for jobs to be
assigned to new nodes following a node failure. They wait
10 seconds for this to happen. However, if the node that
failed was the master node and a new master was elected then
this 10 seconds might not be long enough as a refresh of the
memory stats will delay job assignment. Once the memory
refresh completes the persistent task will be assigned when
the next cluster state update occurs or after the periodic
recheck interval, which defaults to 30 seconds. Rather than
increase the length of the wait for assignment to 31 seconds,
this change decreases the periodic recheck interval to 1
second.
Fixes#43289
We were stopping a node in the cluster at a time when
the replica shards of the .ml-state index might not
have been created. This change moves the wait for
green status to a point where the .ml-state index
exists.
Fixes#40546Fixes#41742
Forward port of #43111
A static code analysis revealed that we are not closing
the input stream in the post_data endpoint. This
actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
The machine learning feature of xpack has native binaries with a
different commit id than the rest of code. It is currently exposed in
the xpack info api. This commit adds that commit information to the ML
info api, so that it may be removed from the info api.
Previously 10 digit numbers were considered candidates to be
timestamps recorded as seconds since the epoch and 13 digit
numbers as timestamps recorded as milliseconds since the epoch.
However, this meant that we could detect these formats for
numbers that would represent times far in the future. As an
example ISBN numbers starting with 9 were detected as milliseconds
since the epoch since they had 13 digits.
This change tweaks the logic for detecting such timestamps to
require that they begin with 1 or 2. This means that numbers
that would represent times beyond about 2065 are no longer
detected as epoch timestamps. (We can add 3 to the definition
as we get closer to the cutoff date.)
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
The tests for the ML TimeoutChecker rely on threads
not being interrupted after the TimeoutChecker is
closed. This change ensures this by making the
close() and setTimeoutExceeded() methods synchronized
so that the code inside them cannot execute
simultaneously.
Fixes#43097
* [ML] Adding support for geo_shape, geo_centroid, geo_point in datafeeds
* only supporting doc_values for geo_point fields
* moving validation into GeoPointField ctor
Get resources action sorts on the resource id. When there are no resources at
all, then it is possible the index does not contain a mapping for the resource
id field. In that case, the search api fails by default.
This commit adjusts the search request to ignore unmapped fields.
Closeselastic/kibana#37870
Both TransportAnalyzeAction and CategorizationAnalyzer have logic to build
custom analyzers for index-independent analysis. A lot of this code is duplicated,
and it requires the AnalysisRegistry to expose a number of internal provider
classes, as well as making some assumptions about when analysis components are
constructed.
This commit moves the build logic directly into AnalysisRegistry, reducing the
registry's API surface considerably.
Previously, a reindex request had two different size specifications in the body:
* Outer level, determining the maximum documents to process
* Inside the source element, determining the scroll/batch size.
The outer level size has now been renamed to max_docs to
avoid confusion and clarify its semantics, with backwards compatibility and
deprecation warnings for using size.
Similarly, the size parameter has been renamed to max_docs for
update/delete-by-query to keep the 3 interfaces consistent.
Finally, all 3 endpoints now support max_docs in both body and URL.
Relates #24344
A static code analysis revealed that we are not closing
the input stream in the find_file_structure endpoint.
This actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
This change adds the earliest and latest timestamps into
the field stats for fields of type "date" in the output of
the ML find_file_structure endpoint. This will enable the
cards for date fields in the file data visualizer in the UI
to be made to look more similar to the cards for date
fields in the index data visualizer in the UI.
Dots in the column names cause an error in the ingest
pipeline, as dots are special characters in ingest pipeline.
This PR changes dots into underscores in CSV field names
suggested by the ML find_file_structure endpoint _unless_
the field names are specifically overridden. The reason for
allowing them in overrides is that fields that are not
mentioned in the ingest pipeline can contain dots. But it's
more consistent that the default behaviour is to replace
them all.
Fixeselastic/kibana#26800