* [ML][Data Frame] adds new pipeline field to dest config (#43124)
* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc
* adjusting for java 8 compatibility
* adjusting bwc serialization version to 7.3.0
* [ML][Data Frame] cleaning up usage test since tasks are cancelled onfinish
* Update DataFrameUsageIT.java
* Fixing additional test, waiting for task to complete
* removing unused import
* unmuting test
* stop data frame task after it finishes
* test auto stop
* adapt tests
* persist the state correctly and move stop into listener
* Calling `onStop` even if persistence fails, changing `stop` to rely on doSaveState
* add support for fixed_interval, calendar_interval, remove interval
* adapt HLRC
* checkstyle
* add a hlrc to server test
* adapt yml test
* improve naming and doc
* improve interface and add test code for hlrc to server
* address review comments
* repair merge conflict
* fix date patterns
* address review comments
* remove assert for warning
* improve exception message
* use constants
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.
This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed. And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit). This arrangement is very
error-prone for users.
This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.
The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.
The change applies to both REST and java clients.
* [ML] adding pivot.size option for setting paging size
* Changing field name to address PR comments
* fixing ctor usage
* adjust hlrc for field name change
* [ML] Adds progress reporting for transforms
* fixing after master merge
* Addressing PR comments
* removing unused imports
* Adjusting afterKey handling and percentage to be 100*
* Making sure it is a linked hashmap for serialization
* removing unused import
* addressing PR comments
* removing unused import
* simplifying code, only storing total docs and decrementing
* adjusting for rewrite
* removing initial progress gathering from executor
* [ML] Add mappings, serialization, and hooks to persist stats
* Adding tests for transforms without tasks having stats persisted
* intermittent commit
* Adjusting usage stats to account for stored stats docs
* Adding tests for id expander
* Addressing PR comments
* removing unused import
* adding shard failures to the task response
* [ML] Add data frame task state object and field
* A new state item is added so that the overall task state can be
accoutned for
* A new FAILED state and reason have been added as well so that failures
can be shown to the user for optional correction
* Addressing PR comments
* adjusting after master merge
* addressing pr comment
* Adjusting auditor usage with failure state
* Refactor, renamed state items to task_state and indexer_state
* Adding todo and removing redundant auditor call
* Address HLRC changes and PR comment
* adjusting hlrc IT test
* [ML] make source and dest objects in the transform config
* addressing PR comments
* Fixing compilation post merge
* adding comment for Arrays.hashCode
* addressing changes for moving dest to object
* fixing data_frame yml tests
* fixing API test
Add a checkpoint service for data frame transforms, which allows to ask for a checkpoint of the
source. In future these checkpoints will be stored in the internal index to
- detect upstream changes
- updating the data frame without a full re-run
- allow data frame clients to checkpoint themselves
* [Data Frame] Refactor PUT transform such that:
* POST _start creates the task and starts it
* GET transforms queries docs instead of tasks
* POST _stop verifies the stored config exists before trying to stop
the task
* Addressing PR comments
* Refactoring DataFrameFeatureSet#usage, decreasing size returned getTransformConfigurations
* fixing failing usage test
This change adds two new cluster privileges:
* manage_data_frame_transforms
* monitor_data_frame_transforms
And two new built-in roles:
* data_frame_transforms_admin
* data_frame_transforms_user
These permit access to the data frame transform endpoints.
(Index privileges are also required on the source and
destination indices for each data frame transform, but
since these indices are configurable they it is not
appropriate to grant them via built-in roles.)
fix a couple of odd behaviors of data frame transforms REST API's:
- check if id from body and id from URL match if both are specified
- do not allow a body for delete
- allow get and stats without specifying an id
The data frame plugin allows users to create feature indexes by pivoting a source index. In a
nutshell this can be understood as reindex supporting aggregations or similar to the so called entity
centric indexing.
Full history is provided in: feature/data-frame-transforms