Adds index versioning for the internal data frame transform index. Allows for new indices to be created and referenced, `GET` requests now query over the index pattern and takes the latest doc (based on INDEX name).
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
* [ML][Data frame] fixing failure state transitions and race condition (#45627)
There is a small window for a race condition while we are flagging a task as failed.
Here are the steps where the race condition occurs:
1. A failure occurs
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
a. `finishAndSetState()` which sets the IndexerState to STARTED
b. `doSaveState(...)` which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.
The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so if state is indeed marked failed, the "second trigger" stops ASAP.
Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should gives us some guarantees that one will not switch states out from underneath another.
I also flag the task as `failed` BEFORE we successfully write it to cluster state, this is to allow us to make the task fail more quickly. But, this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn’t always have the latest cluster state.
closes#45609
Relates to #45562
* [ML][Data Frame] moves failure state transition for MT safety (#45676)
* [ML][Data Frame] moves failure state transition for MT safety
* removing unused imports
This commit replaces task_state and indexer_state in the
data frame _stats output with a single top level state
that combines the two. It is defined as:
- failed if what's currently reported as task_state is failed
- stopped if there is no persistent task
- Otherwise what's currently reported as indexer_state
Backport of #45276
* [ML][Data Frame] Add update transform api endpoint (#45154)
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts. The exception being `description`.
This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
This adds support for `geo_bounds` aggregation inside the `pivot.aggregations` configuration.
The two points returned from the `geo_bounds` aggregation are transformed into `geo_shape` whose types are dynamic given the point's similarity.
* `point` if the two points are identical
* `linestring` if the two points share either a latitude or longitude
* `polygon` if the two points are completely different
The automatically deduced mapping for the resulting field is a `geo_shape`.
introduces an abstraction for how checkpointing and synchronization works, covering
- retrieval of checkpoints
- check for updates
- retrieving stats information
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.
This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.
Backport of #44350
* [ML][Data Frame] treat bulk index failures as an indexing failure
* removing redundant public modifier
* changing to an ElasticsearchException
* fixing redundant public modifier
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
* [ML][Data Frame] Add version and create_time to transform config (#43384)
* [ML][Data Frame] Add version and create_time to transform config
* s/transform_version/version s/Date/Instant
* fixing getter/setter for version
* adjusting for backport
* [ML][Data Frame] adds new pipeline field to dest config (#43124)
* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc
* adjusting for java 8 compatibility
* adjusting bwc serialization version to 7.3.0
* [ML][Data Frame] only complete task after state persistence
There is a race condition where the task could be completed, but there
is still a pending document write. This change moves
the task cancellation into the actionlistener of the state persistence.
intermediate commit
intermediate commit
* removing unused import
* removing unused const
* refreshing internal index after waiting for task to complete
* adjusting test data generation
* [ML][Data Frame] cleaning up usage test since tasks are cancelled onfinish
* Update DataFrameUsageIT.java
* Fixing additional test, waiting for task to complete
* removing unused import
* unmuting test
* stop data frame task after it finishes
* test auto stop
* adapt tests
* persist the state correctly and move stop into listener
* Calling `onStop` even if persistence fails, changing `stop` to rely on doSaveState
* add support for fixed_interval, calendar_interval, remove interval
* adapt HLRC
* checkstyle
* add a hlrc to server test
* adapt yml test
* improve naming and doc
* improve interface and add test code for hlrc to server
* address review comments
* repair merge conflict
* fix date patterns
* address review comments
* remove assert for warning
* improve exception message
* use constants
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.
This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed. And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit). This arrangement is very
error-prone for users.
This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.
The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.
The change applies to both REST and java clients.
* [ML] adding pivot.size option for setting paging size
* Changing field name to address PR comments
* fixing ctor usage
* adjust hlrc for field name change