This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.
This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.
Backport of #44350
* [ML][Data Frame] treat bulk index failures as an indexing failure
* removing redundant public modifier
* changing to an ElasticsearchException
* fixing redundant public modifier
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
* [ML][Data Frame] adds new pipeline field to dest config (#43124)
* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc
* adjusting for java 8 compatibility
* adjusting bwc serialization version to 7.3.0
* [ML][Data Frame] cleaning up usage test since tasks are cancelled onfinish
* Update DataFrameUsageIT.java
* Fixing additional test, waiting for task to complete
* removing unused import
* unmuting test
* stop data frame task after it finishes
* test auto stop
* adapt tests
* persist the state correctly and move stop into listener
* Calling `onStop` even if persistence fails, changing `stop` to rely on doSaveState
* add support for fixed_interval, calendar_interval, remove interval
* adapt HLRC
* checkstyle
* add a hlrc to server test
* adapt yml test
* improve naming and doc
* improve interface and add test code for hlrc to server
* address review comments
* repair merge conflict
* fix date patterns
* address review comments
* remove assert for warning
* improve exception message
* use constants
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.
This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed. And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit). This arrangement is very
error-prone for users.
This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.
The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.
The change applies to both REST and java clients.
* [ML] adding pivot.size option for setting paging size
* Changing field name to address PR comments
* fixing ctor usage
* adjust hlrc for field name change
* [ML] Adds progress reporting for transforms
* fixing after master merge
* Addressing PR comments
* removing unused imports
* Adjusting afterKey handling and percentage to be 100*
* Making sure it is a linked hashmap for serialization
* removing unused import
* addressing PR comments
* removing unused import
* simplifying code, only storing total docs and decrementing
* adjusting for rewrite
* removing initial progress gathering from executor
* [ML] Add mappings, serialization, and hooks to persist stats
* Adding tests for transforms without tasks having stats persisted
* intermittent commit
* Adjusting usage stats to account for stored stats docs
* Adding tests for id expander
* Addressing PR comments
* removing unused import
* adding shard failures to the task response