Commit Graph

160 Commits

Author SHA1 Message Date
Hendrik Muhs 42ada88c5d cleanup static member 2019-09-06 08:55:57 +02:00
Hendrik Muhs ab5f09d29b [ML-DataFrame] improve error message for timeout case in stop (#46131)
improve error message if stopping of transform times out.

related #45610
2019-09-06 08:36:01 +02:00
Hendrik Muhs 78824cce81 reuse mock client to avoid probles with thread context closed errors (#46398) 2019-09-06 08:17:25 +02:00
Benjamin Trent caf3e4d654
[7.x] [ML][Transforms] fixing rolling upgrade continuous transform test (#45823) (#46347) (#46337)
* [ML][Transforms] fixing rolling upgrade continuous transform test (#45823)

* [ML][Transforms] fixing rolling upgrade continuous transform test

* adjusting wait assert logic

* adjusting wait conditions

* [ML][Transforms] allow executor to call start on started task (#46347)

* making sure we only upgrade from 7.4.0 in test
2019-09-05 10:31:11 -05:00
Benjamin Trent 8a4ff7d57d
[ML][Transforms] protecting doSaveState with optimistic concurrency (#46156) (#46281)
* [ML][Transforms] protecting doSaveState with optimistic concurrency

* task code cleanup
2019-09-03 15:27:24 -05:00
Benjamin Trent c3c1dcd2ac
[ML][Transforms] fixing listener being called twice (#46284) (#46292) 2019-09-03 15:26:56 -05:00
Benjamin Trent 53df54c703
[ML][Transforms] fixing stop on changes check bug (#46162) (#46273)
* [ML][Transforms] fixing stop on changes check bug

* Adding new method finishAndCheckState to cover race conditions in early terminations

* changing stopping conditions in `onStart`

* allow indexer to finish when exiting early
2019-09-03 11:04:18 -05:00
David Roberts ab045744ac [ML-DataFrame] Fix off-by-one error in checkpoint operations_behind (#46235)
Fixes a problem where operations_behind would be one less than
expected per shard in a new index matched by the data frame
transform source pattern.

For example, if a data frame transform had a source of foo*
and a new index foo-new was created with 2 shards and 7 documents
indexed in it then operations_behind would be 5 prior to this
change.

The problem was that an empty index has a global checkpoint
number of -1 and the sequence number of the first document that
is indexed into an index is 0, not 1.  This doesn't matter for
indices included in both the last and next checkpoints, as the
off-by-one errors cancelled, but for a new index it affected
the observed result.
2019-09-03 12:45:02 +01:00
Przemysław Witek b8a0379057
Refactor auditor-related classes (#45893) (#46120) 2019-08-29 14:21:03 +02:00
Benjamin Trent b756e1b9be
[ML][Transforms] adjusting when and what to audit (#45876) (#45916)
* [ML][Transforms] adjusting when and what to audit

* Update DataFrameTransformTask.java

* removing unnecessary audit message
2019-08-23 13:53:02 -05:00
Benjamin Trent 94c2de65b9
[ML][Transforms] fix doSaveState check (#45882) (#45902)
* [ML][Transforms] fix doSaveState check

* removing unnecessary log statement
2019-08-23 09:38:52 -05:00
Benjamin Trent dff3e636c2
[ML][Transforms] unifying logging, adding some more logging (#45788) (#45859)
* [ML][Transforms] unifying logging, adding some more logging

* using parameterizedMessage instead of string concat

* fixing bracket closure
2019-08-22 13:15:07 -05:00
Benjamin Trent e50a78cf50
[ML-DataFrame] version data frame transform internal index (#45375) (#45837)
Adds index versioning for the internal data frame transform index. Allows for new indices to be created and referenced, `GET` requests now query over the index pattern and takes the latest doc (based on INDEX name).
2019-08-22 11:46:30 -05:00
Benjamin Trent 43bb5924e6
[ML][Data Frame] fixing _start?force=true bug (#45660) (#45734)
* [ML][Data Frame] fixing _start?force=true bug

* removing unused import

* removing old TODO
2019-08-20 09:23:07 -05:00
Przemysław Witek b37ebd1adf
Prepare the codebase for new Auditor subclasses (#45716) (#45731) 2019-08-20 16:03:50 +02:00
Benjamin Trent 88641a08af
[ML][Data frame] fixing failure state transitions and race condition (#45627) (#45656)
* [ML][Data frame] fixing failure state transitions and race condition (#45627)

There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:
1. A failure occurs
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
   a. `finishAndSetState()` which sets the IndexerState to STARTED
   b. `doSaveState(...)` which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so if state is indeed marked failed, the "second trigger" stops ASAP.

Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should gives us some guarantees that one will not switch states out from underneath another.

I also flag the task as `failed` BEFORE we successfully write it to cluster state, this is to allow us to make the task fail more quickly. But, this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn’t always have the latest cluster state.

closes #45609

Relates to #45562

* [ML][Data Frame] moves failure state transition for MT safety (#45676)

* [ML][Data Frame] moves failure state transition for MT safety

* removing unused imports
2019-08-20 07:30:17 -05:00
Benjamin Trent fde5dae387
[ML][Data Frame] adjusting change detection workflow (#45511) (#45580)
* [ML][Data Frame] adjusting change detection workflow

* adjusting for PR comment

* disallowing null as an argument value
2019-08-14 17:26:24 -05:00
Benjamin Trent 0c343d8443
[7.x] [ML][Transforms] adjusting stats.progress for cont. transforms (#45361) (#45551)
* [ML][Transforms] adjusting stats.progress for cont. transforms (#45361)

* [ML][Transforms] adjusting stats.progress for cont. transforms

* addressing PR comments

* rename fix

* Adjusting bwc serialization versions
2019-08-14 13:08:27 -05:00
Armin Braun a9e1402189
Remove Settings from BaseRestRequest Constructor (#45418) (#45429)
* Resolving the todo, cleaning up the unused `settings` parameter
* Cleaning up some other minor dead code in affected classes
2019-08-12 05:14:45 +02:00
Hendrik Muhs bf4da6c6ad
[ML-DataFrame] fix starting a batch data frame after stopping at runtime (#45340) (#45381)
fix loading of next checkpoint after data frame transform has been stopped/started within one run

closes #45339
2019-08-09 20:30:11 +02:00
Hendrik Muhs 7d0aff0ed5 [ML-DataFrame] fix test failure in checkpoint retrieval (#45297)
gracefully handle if index response returns null, increase and assert timeout

closes #45238
2019-08-09 09:04:53 +02:00
Hendrik Muhs 68f9102550 [ML-DataFrame] audit changes in the source index (#45282)
add audits when the set of source indexes changes and in a special case runs empty
2019-08-08 23:31:55 +02:00
David Roberts 14545f8958
[ML-DataFrame] Combine task_state and indexer_state in _stats (#45324)
This commit replaces task_state and indexer_state in the
data frame _stats output with a single top level state
that combines the two. It is defined as:

- failed if what's currently reported as task_state is failed
- stopped if there is no persistent task
- Otherwise what's currently reported as indexer_state

Backport of #45276
2019-08-08 16:24:26 +01:00
Benjamin Trent 5db9982f71
[7.x] [ML][Data Frame] Add update transform api endpoint (#45154) (#45279)
* [ML][Data Frame] Add update transform api endpoint (#45154)

This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts. The exception being `description`.

This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
2019-08-07 10:37:35 -05:00
Benjamin Trent 3a71b91dca
[ML][Data Frame] add support for geo_bounds aggregation (#44441) (#45281)
This adds support for `geo_bounds` aggregation inside the `pivot.aggregations` configuration. 

The two points returned from the `geo_bounds` aggregation are transformed into `geo_shape` whose types are dynamic given the point's similarity.

* `point` if the two points are identical
* `linestring` if the two points share either a latitude or longitude 
* `polygon` if the two points are completely different

The automatically deduced mapping for the resulting field is a `geo_shape`.
2019-08-07 10:37:09 -05:00
Benjamin Trent be911e6a53
[ML][Data Frames] Fix null aggregation handling in indexer (#45061) (#45257)
* [ML][Data Frames] Fix null aggregation handling in indexer

* addressing PR comments

* adjusting error messages
2019-08-07 07:01:13 -05:00
Hendrik Muhs 6b5a2513a9 [ML-DataFrame] introduce an abstraction for checkpointing (#44900)
introduces an abstraction for how checkpointing and synchronization works, covering

 - retrieval of checkpoints
 - check for updates
 - retrieving stats information
2019-08-06 07:38:59 +02:00
Benjamin Trent 3f48720d41
[ML][Data Frames] unify validation exceptions between PUT/_preview (#44983) (#45012)
* [ML][Data Frames] unify validation exceptions between PUT/_preview

* addressing PR comments
2019-07-30 13:05:07 -05:00
Benjamin Trent 22feedf289
[ML][Data Frame] add support for bucket_selector (#44718) (#45008) 2019-07-30 11:32:58 -05:00
David Roberts b2e969f4ba
[ML-DataFrame] Remove ID field from data frame indexer stats (#44848)
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.

Backport of #44768
2019-07-25 15:19:32 +01:00
David Roberts caf9411a72
[ML] Improve response format of data frame stats endpoint (#44743)
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.

This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.

Backport of #44350
2019-07-23 18:00:50 +01:00
Benjamin Trent 6f53865fde
[ML][Data Frame] Fixes failure state tests and failure setting handling (#44645) (#44698)
* [ML][Data Frame] fixing flaky test

* adjusting frequency

* fixing tests

* addressing PR comments
2019-07-23 08:33:12 -05:00
Benjamin Trent 4456850a8e
[7.x] [ML][Data Frame] Add optional defer_validation param to PUT (#44455) (#44697)
* [ML][Data Frame] Add optional defer_validation param to PUT (#44455)

* [ML][Data Frame] Add optional defer_validation param to PUT

* addressing PR comments

* reverting bad replace

* addressing pr comments

* Update put-transform.asciidoc

* Update put-transform.asciidoc

* Update put-transform.asciidoc

* adjusting for backport

* fixing imports

* [DOCS] Fixes formatting in  create data frame transform API
2019-07-22 15:12:55 -05:00
Benjamin Trent 06e21f7902
[7.x] [ML][Data Frame] adding force delete (#44590) (#44696)
* [ML][Data Frame] adding force delete (#44590)

* [ML][Data Frame] adding force delete

* Update delete-transform.asciidoc

* adjusting for backport
2019-07-22 13:13:25 -05:00
Benjamin Trent a948362d0a
[7.x] [ML][Data Frame] deregister scheduler on transform failure (#44569) (#44576)
* [ML][Data Frame] deregister scheduler on transform failure (#44569)

* fixing test

* Update DataFrameRestTestCase.java

* Update DataFrameTaskFailedStateIT.java

* Update DataFramePivotRestIT.java
2019-07-22 09:06:48 -05:00
David Roberts 6d27eec30f [ML-DataFrame] Use lenient expand open in data frame searches (#44633)
Since #44344 we use IndicesOptions.LENIENT_EXPAND_OPEN
when deciding which indices to include in checkpoint
calculation. This change uses the same option when
deciding which indices to search for data and which
indices to get mappings from, otherwise there is a
potential mismatch between the checkpoint details and
what is searched elsewhere.
2019-07-22 11:30:33 +01:00
Benjamin Trent 2e303fc5f7
[ML][Data Frame] adding dynamic cluster setting for failure retries (#44577) (#44639)
This adds a new dynamic cluster setting `xpack.data_frame.num_transform_failure_retries`.

This setting indicates how many times non-critical failures should be retried before a data frame transform is marked as failed and should stop executing. At the time of this commit; Min: 0, Max: 100, Default: 10
2019-07-19 16:17:39 -05:00
Ryan Ernst f193d14764
Convert remaining Action Response/Request to writeable.reader (#44528) (#44607)
This commit converts readFrom to ctor with StreamInput on the remaining
ActionResponse and ActionRequest classes.

relates #34389
2019-07-19 13:33:38 -07:00
Benjamin Trent d5ca72740e
[ML][Data Frame] adjust onFinish audit frequency (#44450) (#44508) 2019-07-18 14:28:34 -05:00
Benjamin Trent 858dbfc074
[ML][Data Frame] treat bulk index failures as an indexing failure (#44351) (#44427)
* [ML][Data Frame] treat bulk index failures as an indexing failure

* removing redundant public modifier

* changing to an ElasticsearchException

* fixing redundant public modifier
2019-07-16 10:04:28 -05:00
Hendrik Muhs 6c1f740759
[ML-DataFrame] make checkpointing more robust (#44344) (#44414)
make checkpointing more robust:

- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error

fixes #43992
2019-07-16 13:43:13 +02:00
Ryan Ernst 59658daef9
Separate streamable based master node actions (#44313)
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.

relates #34389
2019-07-15 09:20:20 -07:00
Hendrik Muhs 684b562381
[7.x][ML-DataFrame] Rewrite continuous logic to prevent terms count limit (#44287)
Rewrites how continuous data frame transforms calculates and handles buckets that require an update. Instead of storing the whole set in memory, it pages through the updates using a 2nd cursor. This lowers memory consumption and prevents problems with limits at query time (max_terms_count). The list of updates can be re-retrieved in a failure case (#43662)
2019-07-13 06:58:04 +02:00
Benjamin Trent 51ff6b420a
[ML][Data Frame] prevent task from attempting to run when failed (#44239) (#44292) 2019-07-12 15:24:49 -05:00
Benjamin Trent 79c62fd724
[ML][Data Frame] Fixing default delay set in timesync (#44281) (#44293)
* [ML][Data Frame] Fixing default delay set in timesync

* disallowing explicit null, don't do duration check on write
2019-07-12 15:21:47 -05:00
Benjamin Trent 68cd675892
[ML][Data Frame] responding with 409 status code when failing _stop (#44231) (#44276)
* [ML][Data Frame] responding with appropriate status code when failing _stop

* adding null checks for persistent task data

* addressing PR comments
2019-07-12 10:10:24 -05:00
Benjamin Trent 40cc081ad3
[ML][Data Frame] adds index validations to _start data frame transform (#44191) (#44227)
* [ML][Data Frame] adds index validations to _start data frame transform

* addressing pr comments
2019-07-11 12:50:50 -05:00
David Roberts cb62d4acdf [ML-DataFrame] Add a frequency option to transform config, default 1m (#44120)
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
2019-07-10 09:59:00 +01:00
David Kyle 5fc12917c3 Data frame task failure does not make a 500 response (#44058)
Data frame task responses had logic to return a HTTP 500 status code if there was 
any node or task failures even if other tasks in the same request reported correctly. 
This is different to how other task responses are handled where a 200 is always 
returned leaving the client should check for failures. Returning a 500 also breaks
the high level rest client so always return a 200

Closes #44011
2019-07-08 11:53:11 +01:00
Hendrik Muhs 4128b9b4f7 audit message missing for autostop
call onStop when auto stopping (#43984)

fixes #43977
2019-07-04 21:40:42 +02:00