Commit Graph

548 Commits

Author SHA1 Message Date
Przemysław Witek c62fe8c344
Require that the dependent variable column has at most 2 distinct values in classfication analysis. (#47858) (#47906) 2019-10-11 14:57:08 +02:00
Igor Motov b5afa95fd8 Fix Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart
Tracked by #47612
2019-10-10 18:17:01 +04:00
Igor Motov 17433e79d8 Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart
Tracked by #47612
2019-10-10 17:56:23 +04:00
Dimitris Athanasiou c1b0bfd74a
[7.x][ML] Unwrap exception causes before calling instanceof (#47676) (#47724)
When exceptions could be returned from another node, the exception
might be wrapped in a `RemoteTransportException`. In places where
we handled specific exceptions using `instanceof` we ought to unwrap
the cause first.

This commit attempts to fix this issue after searching code in the ML
plugin.

Backport of #47676
2019-10-08 16:02:47 +03:00
Dimitris Athanasiou 7667ea5f6f
[7.x][ML] Additional outlier detection parameters (#47600) (#47669)
Adds the following parameters to `outlier_detection`:

- `compute_feature_influence` (boolean): whether to compute or not
   feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
   to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
   to the feature values

Backport of #47600
2019-10-07 18:21:33 +03:00
Dimitris Athanasiou ffacfc642c
[7.x][ML] Mute RegressionIT.testStopAndRestart (#47624) (#47625)
Relates #47612
2019-10-05 23:58:32 +03:00
Przemysław Witek ee952da2e2
[7.x] Implement evaluation API for multiclass classification problem (#47126) (#47343) 2019-10-04 17:54:51 +02:00
Przemysław Witek ec9b77deaa
[7.x] Implement new analysis type: classification (#46537) (#47559) 2019-10-04 13:47:19 +02:00
David Roberts 31a5e1c7ee [ML] More accurate job memory overhead (#47516)
When an ML job runs the memory required can be
broken down into:

1. Memory required to load the executable code
2. Instrumented model memory
3. Other memory used by the job's main process or
   ancilliary processes that is not instrumented

Previously we added a simple fixed overhead to
account for 1 and 3. This was 100MB for anomaly
detection jobs (large because of the completely
uninstrumented categorization function and
normalize process), and 20MB for data frame
analytics jobs.

However, this was an oversimplification because
the executable code only needs to be loaded once
per machine.  Also the 100MB overhead for anomaly
detection jobs was probably too high in most cases
because categorization and normalization don't use
_that_ much memory.

This PR therefore changes the calculation of memory
requirements as follows:

1. A per-node overhead of 30MB for _only_ the first
   job of any type to be run on a given node - this
   is to account for loading the executable code
2. The established model memory (if applicable) or
   model memory limit of the job
3. A per-job overhead of 10MB for anomaly detection
   jobs and 5MB for data frame analytics jobs, to
   account for the uninstrumented memory usage

This change will enable more jobs to be run on the
same node.  It will be particularly beneficial when
there are a large number of small jobs.  It will
have less of an effect when there are a small number
of large jobs.
2019-10-04 09:57:31 +01:00
Dimitris Athanasiou b9541eb3af
[7.x][ML] Make PUT data frame analytics action a master node action (… (#47433)
While it seemed like the PUT data frame analytics action did not
have to be a master node action as the config is stored in an index
rather than the cluster state, there are other subtle nuances which
make it worthwhile to convert it. In particular, it helps maintain
order of execution for put actions which are anyhow user driven and
are expected to have low volume.

This commit converts `TransportPutDataFrameAnalyticsAction` from
a handled transport action to a master node action.

Note this means that the action might fail in a mixed cluster
but as the API is still experimental and not widely used there will
be few moments more suitable to make this change than now.
2019-10-02 16:24:21 +03:00
David Roberts 4379a3c52b [ML] Throttle the delete-by-query of expired results (#47177)
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.

This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)

Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)

Relates #47103
2019-10-02 11:16:34 +01:00
Dimitris Athanasiou 36884a3c32
[7.x][ML] Restore analytics state if available (#47128) (#47393)
This commit restores the model state if available in data
frame analytics jobs.

In addition, this changes the start API so that a stopped job
can be restarted. As we now store the progress in the state index
when the task is stopped, we can use it to determine what state
the job was in when it got stopped.

Note that in order to be able to distinguish between a job
that runs for the first time and another that is restarting,
we ensure reindexing progress is reported to be at least 1
for a running task.
2019-10-02 10:24:05 +03:00
Benjamin Trent f5fe5e7cd6
[7.x] [ML][Inference] Adding preprocessors to definition object (#47320) (#47370)
* [ML][Inference] Adding preprocessors to definition object (#47320)

* [ML][Inference] Adding preprocessors to definition object

* Update TrainedModelConfig.java

* adjusting for backport
2019-10-01 13:31:25 -04:00
Benjamin Trent 4335e07716
[7.x] [ML][Inference] adding .ml-inference* index and storage (#47267) (#47310)
* [ML][Inference] adding .ml-inference* index and storage (#47267)

* [ML][Inference] adding .ml-inference* index and storage

* Addressing PR comments

* Allowing null definition, adding validation tests for model config

* fixing line length

* adjusting for backport
2019-10-01 08:20:33 -04:00
David Roberts 0807d409bf [ML] Reinstate ML daily maintenance actions (#47103)
A refactoring in 6.6 meant that the ML daily
maintenance actions have not been run at all
since then. This change installs the local
master listener that schedules the ML daily
maintenance, and also defends against some
subtle race conditions that could occur in the
future if a node flipped very quickly between
master and non-master.

Fixes #47003
2019-09-30 13:12:32 +01:00
Rory Hunter 53a4d2176f
Convert most awaitBusy calls to assertBusy (#45794) (#47112)
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.

There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.

I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.

Other changes:

  * Rework `waitForDocs` to scale its timeout. Instead of calling
    `assertBusy` in a loop, work out a reasonable overall timeout and await
    just once.
  * Some tests failed after switching to `assertBusy` and had to be fixed.
  * Correct the expect templates in AbstractUpgradeTestCase.  The ES
    Security team confirmed that they don't use templates any more, so
    remove this from the expected templates. Also rewrite how the setup
    code checks for templates, in order to give more information.
  * Remove an expected ML template from XPackRestTestConstants The ML team
    advised that the ML tests shouldn't be waiting for any
    `.ml-notifications*` templates, since such checks should happen in the
    production code instead.
  * Also rework the template checking code in `XPackRestTestHelper` to give
    more helpful failure messages.
  * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
2019-09-29 12:21:46 +01:00
Przemysław Witek 3fbd58d156
[7.x] Allow evaluation to consist of multiple steps. (#46653) (#47194) 2019-09-27 13:01:51 +02:00
David Roberts 77cc6d5bad [TEST] Work around _cat/indices bug with security enabled (#47160)
When the ML native multi-node tests use _cat/indices/_all
and the request goes to a non-master node, _all is
translated to a list of concrete indices by the authz layer
on the coordinating node before the request is forwarded
to the master node. Then it is possible for the master
node to return an index_not_found_exception if one of
the concrete indices that was expanded on the
coordinating node has been deleted in the meantime.
(#47159 has been opened to track the underlying problem.)

It has been observed that the index that gets deleted when
the problem affects the ML native multi-node tests is
always the ML notifications index. The tests that fail are
only interested in the presence or absense of ML results
indices. Therefore the workaround is to only _cat indices
that match the ML results index pattern.

Fixes #45652
2019-09-26 13:29:40 +01:00
Dimitris Athanasiou 0765bd4bf7
[7.x][ML] Ensure data frame analytics task is only marked completed once (#47119) (#47157)
Closes #46907
2019-09-26 15:26:06 +03:00
Tanguy Leroux 95e2ca741e
Remove unused private methods and fields (#47154)
This commit removes a bunch of unused private fields and unused
private methods from the code base.

Backport of (#47115)
2019-09-26 12:49:21 +02:00
Benjamin Trent 05fb7be571
[7.x] [ML][Inference] Feature pre-processing objects and functions (#46777) (#47040)
* [ML][Inference] Feature pre-processing objects and functions (#46777)

To support inference on pre-trained machine learning models, some basic feature encoding will be necessary. I am using a named object serialization approach so new encodings/pre-processing steps could be added in the future. 

This PR lays down the ground work for 3 basic encodings:

* HotOne
* Target Mean
* Frequency

More feature encodings or pre-processings could be added in the future:

* Handling missing columns
* Standardization
* Label encoding
* etc....

* fixing compilation for namedxcontent tests
2019-09-25 08:16:24 -04:00
Yannick Welsch eb86d71edd Mute MlJobIT.testDeleteJob
Relates #45652
2019-09-25 12:53:09 +02:00
Yannick Welsch 7a5b5af171 Mute MlJobIT.testDeleteJobAsync
Relates #45652
2019-09-25 12:53:05 +02:00
Benjamin Trent 00c1c0132b
[ML] fix two datafeed flush lockup bugs (#46982) (#47024)
* [ML] fix two flush lockup bugs

* Addressing PR comments

* moving debug logging line so it is only written on success
2019-09-24 13:03:20 -04:00
Yannick Welsch 9638ca20b0 Allow dropping documents with auto-generated ID (#46773)
When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat
as well) + coordinating nodes that do not have the ingest processor functionality, this can lead
to a NullPointerException.

The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when
the request contains auto-generated IDs. The response serialization is lenient for our
REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for
communication between nodes) assumes for this field to be non-null, which means that it can't
be serialized between nodes. Bulk requests with ingest functionality are processed on the
coordinating node if the node has the ingest capability, and only otherwise sent to a different
node. This means that, in order to reproduce this, one needs two nodes, with the coordinating
node not having the ingest functionality.

Closes #46678
2019-09-19 16:46:33 +02:00
Dimitris Athanasiou 02a5e153dc
[7.x][ML] Parse and index data frame analytics state (#46804) (#46820)
This commit reuses the same state processor that is used for autodetect
to parse state output from data frame analytics jobs. We then index the
state document into the state index.

Backport of #46804
2019-09-18 20:37:40 +03:00
Dimitris Athanasiou cebe8da617
[7.x][ML] MlMemoryTracker should ignore analytics tasks without config (#46789) (#46811)
It is possible for a running analytics job that its config is removed
from the '.ml-config' index (perhaps the user deleted the entire index,
etc.). In that case the task remains without a matching config. I have
raised #46781 to discuss how to deal with this issue.

This commit focuses on `MlMemoryTracker` and changes it so that when
we get the configs for the running tasks we leniently ignore missing ones.
This at least means memory tracking will keep working for other jobs
if one or more are missing.

In addition, this commit makes the cleanup code for native analytics
tests more robust by explicitly stopping all jobs and force-stopping
if an error occurs. This helps so that a single failing test does
not cause other tests fail due to pending tasks.

Backport of #46789
2019-09-18 16:35:25 +03:00
Przemysław Witek e49be611ad
[7.x] Add audit messages for Data Frame Analytics (#46521) (#46738) 2019-09-16 21:21:38 +02:00
Dimitris Athanasiou 63eb0d9081
[7.x][ML] Avoid marking data frame analytics task completed twice (#46721) (#46724)
When the stop API is called while the task is running there is
a chance the task gets marked completed twice. This may cause
undesired side effects, like indexing the progress document a second
time after the stop API has returned (the cause for #46705).

This commit adds a check that the task has not been completed before
proceeding to mark it so. In addition, when we update the task's state
we could get some warnings that the task was missing if the stop API
has been called in the meantime. We now check the errors are
`ResourceNotFoundException` and ignore them if so.

Closes #46705

Backports #46721
2019-09-15 17:25:26 +03:00
Dimitris Athanasiou 0bc8acaf5b
[7.x][ML] Create state index and alias before starting an analytics job (#46602) (#46648)
This is fixing a bug where if an analytics job is started before any
anomaly detection job is opened, we create an index after the state
write alias.

Instead, we should create the state index and alias before starting
an analytics job and this commit makes sure this is the case.

Backport of #46602
2019-09-13 10:34:12 +03:00
David Roberts 461de5b58e [TEST] Remove incorrect data frame analytics state assertion (#46597)
After starting the analytics job and checking its state
the state can be any of "started", "reindexing" or
"analyzing" depending on how quickly the work is done.
2019-09-11 16:33:14 +01:00
Dimitris Athanasiou 579af626f5
[7.x][ML] No error when datafeed stops during updating to started (#46495) (#46542)
Investigating the test failure reported in #45518 it appears that
the datafeed task was not found during a tast state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes #45518

Backport of #46495
2019-09-11 13:18:42 +03:00
Przemysław Witek e38e631dac
[7.x] Implement DataFrameAnalyticsAuditMessage and DataFrameAnalyticsAuditor (#45967) (#46519) 2019-09-11 12:17:26 +02:00
Przemysław Witek e21deae535
Disallow persisting any documents when datafeed is isolated (#46485) (#46490) 2019-09-09 21:01:27 +02:00
David Roberts 7c7fb7e32d [ML] Tolerate total_search_time_ms not mapped in get datafeed stats (#46432)
ML users who upgrade from versions prior to 7.4 to 7.4 or later
will have ML results indices that do not have mappings for the
total_search_time_ms field.  Therefore, when searching these
indices we must tolerate this field not having a mapping.

Fixes #46437
2019-09-06 14:31:15 +01:00
Dimitris Athanasiou a6834068e3
[7.x][ML] Extract DataFrameAnalyticsTask into its own class (#46402) (#46426)
This refactors `DataFrameAnalyticsTask` into its own class.
The task has quite a lot of functionality now and I believe it would
make code more readable to have it live as its own class rather than
an inner class of the start action class.

Backport of #46402
2019-09-06 14:13:46 +03:00
Benjamin Trent 457ff3e2fb
7.x/ml fix instance serialization bwc (#46404)
* [ML] Fixing instance serialization version for bwc

* fixing CppLogMessage
2019-09-05 13:23:26 -05:00
Benjamin Trent 5201386232
[ML] testFullClusterRestart waiting for stable cluster (#46280) (#46335)
* [ML] waiting for ml indices before waiting task assignment testFullClusterRestart

* waiting for a stable cluster after fullrestart

* removing unused imports
2019-09-05 06:57:33 -05:00
Dimitris Athanasiou 8fca5b5204
[7.x][ML] Unmute testStopOutlierDetectionWithEnoughDocumentsToScroll (#46271) (#46282)
The test seems to have been failing due to a race condition between
stopping the task and refreshing the destination index. In particular,
we were going forward with refreshing the destination index even
though the task stopped in the meantime. This was fixed in
request.

Closes #43960

Backport of #46271
2019-09-04 10:57:01 +03:00
Benjamin Trent d0c5573a51
[ML] Throw an error when a datafeed needs CCS but it is not enabled for the node (#46044) (#46096)
Though we allow CCS within datafeeds, users could prevent nodes from accessing remote clusters. This can cause mysterious errors and difficult to troubleshoot.

This commit adds a check to verify that `cluster.remote.connect` is enabled on the current node when a datafeed is configured with a remote index pattern.
2019-08-30 09:27:07 -05:00
Dimitris Athanasiou 5921ae53d8
[7.x][ML] Regression dependent variable must be numeric (#46072) (#46136)
* [ML] Regression dependent variable must be numeric

This adds a validation that the dependent variable of a regression
analysis must be numeric.

* Address review comments and fix some problems

In addition to addressing the review comments, this
commit fixes a few issues I found during testing.

In particular:

- if there were mappings for required fields but they were
not included we were not reporting the error
- if explicitly included fields had unsupported types we were
not reporting the error

Unfortunately, I couldn't get those fixed without refactoring
the code in `ExtractedFieldsDetector`.
2019-08-30 09:57:43 +03:00
Przemysław Witek b8a0379057
Refactor auditor-related classes (#45893) (#46120) 2019-08-29 14:21:03 +02:00
Przemysław Witek fbe9e8a530
Do not throw an exception if the process finished quickly but without any error. (#46073) (#46113) 2019-08-29 10:47:17 +02:00
Dimitris Athanasiou 25d64508f6
[7.x][ML] Support boolean fields for DF analytics (#46037) (#46054)
This commit adds support for `boolean` fields in data frame
analytics (and currently both outlier detection and regression).
The analytics process expects `boolean` fields to be encoded as
integers with 0 or 1 value.
2019-08-28 12:02:29 +03:00
Dimitris Athanasiou 873ad3f942
[7.x][ML] Add option to regression to randomize training set (#45969) (#46017)
Adds a parameter `training_percent` to regression. The default
value is `100`. When the parameter is set to a value less than `100`,
from the rows that can be used for training (ie. those that have a
value for the dependent variable) we randomly choose whether to actually
use for training. This enables splitting the data into a training set and
the rest, usually called testing, validation or holdout set, which allows
for validating the model on data that have not been used for training.

Technically, the analytics process considers as training the data that
have a value for the dependent variable. Thus, when we decide a training
row is not going to be used for training, we simply clear the row's
dependent variable.
2019-08-27 17:53:11 +03:00
Benjamin Trent a3a4ae0ac2
[ML] fixing bug where analytics process starts with 0 rows (#45879) (#45988)
The native process requires that there be a non-zero number of rows to analyze. If the flag --rows 0 is passed to the executable, it throws and does not start.

When building the configuration for the process we should not start the native process if there are no rows.

Adding some logging to indicate what is occurring.
2019-08-26 14:18:17 -05:00
Benjamin Trent d64018f8e1
[ML] add supported types to no fields error message (#45926) (#45987)
* [ML] add supported types to no fields error message

* adding supported types to logger debug
2019-08-26 14:18:00 -05:00
Dimitris Athanasiou be554fe5f0
[7.x][ML] Improve progress reportings for DF analytics (#45856) (#45910)
Previously, the stats API reports a progress percentage
for DF analytics tasks that are running and are in the
`reindexing` or `analyzing` state.

This means that when the task is `stopped` there is no progress
reported. Thus, one cannot distinguish between a task that never
run to one that completed.

In addition, there are blind spots in the progress reporting.
In particular, we do not account for when data is loaded into the
process. We also do not account for when results are written.

This commit addresses the above issues. It changes progress
to being a list of objects, each one describing the phase
and its progress as a percentage. We currently have 4 phases:
reindexing, loading_data, analyzing, writing_results.

When the task stops, progress is persisted as a document in the
state index. The stats API now reports progress from in-memory
if the task is running, or returns the persisted document
(if there is one).
2019-08-23 23:04:39 +03:00
Przemysław Witek 85d55e30d0
Add test that proves _timing_stats document is deleted when the job is deleted (#45840) (#45854) 2019-08-23 07:03:09 +02:00
Przemysław Witek 2ed19b2c81
Put error message from inside the process into the exception that is thrown when the process doesn't start correctly. (#45846) (#45875) 2019-08-23 07:02:50 +02:00
Benjamin Trent 8e3c54fff7
[7.x] [ML] Adding data frame analytics stats to _usage API (#45820) (#45872)
* [ML] Adding data frame analytics stats to _usage API (#45820)

* [ML] Adding data frame analytics stats to _usage API

* making the size of analytics stats 10k

* adjusting backport
2019-08-22 15:15:41 -05:00
Przemysław Witek 7512337922
[7.x] Allow the user to specify 'query' in Evaluate Data Frame request (#45775) (#45825) 2019-08-22 11:14:26 +02:00
Przemysław Witek bf701b83d2
Shorten field names in EstimateMemoryUsageResponse (#45719) (#45772) 2019-08-21 12:45:09 +02:00
Przemysław Witek c6709f0979
Mute tests affected by renaming fields in Estimate memory usage response (#45743) (#45766) 2019-08-21 09:57:23 +02:00
Dimitris Athanasiou d5c3d9b50f
[7.x][ML] Do not skip rows with missing values for regression (#45751) (#45754)
Regression analysis support missing fields. Even more, it is expected
that the dependent variable has missing fields to the part of the
data frame that is not for training.

This commit allows to declare that an analysis supports missing values.
For such analysis, rows with missing values are not skipped. Instead,
they are written as normal with empty strings used for the missing values.

This also contains a fix to the integration test.

Closes #45425
2019-08-21 08:15:38 +03:00
Benjamin Trent ba7b677618
[ML] better handle empty results when evaluating regression (#45745) (#45759)
* [ML] better handle empty results when evaluating regression

* adding new failure test to ml_security black list

* fixing equality check for regression results
2019-08-20 17:37:04 -05:00
Dimitris Athanasiou 49edf9e5b5
[7.x][ML] Remove timeout on waiting for DF analytics result processor to complete (#45724) (#45733)
We cannot know how long the analysis will take to complete thus we should not have
a timeout. Note that if the process crashes, the result processor will pick the
exception due to the stream closing.

Closes #45723
2019-08-20 17:21:40 +03:00
Przemysław Witek b37ebd1adf
Prepare the codebase for new Auditor subclasses (#45716) (#45731) 2019-08-20 16:03:50 +02:00
Przemysław Witek 80dd0a0948
Get rid of EstimateMemoryUsageRequest and EstimateMemoryUsageAction.Request. (#45718) (#45725) 2019-08-20 15:49:17 +02:00
Przemysław Witek 7bc8400222
Call the new _estimate_memory_usage API endpoint on df analytics _start (#45536) (#45701) 2019-08-19 21:37:55 +02:00
Igor Motov 98c850c08b
Geo: Change order of parameter in Geometries to lon, lat 7.x (#45618)
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes are moved to the
org.elasticsearch.geomtery package.

Backport of #45332

Closes #45048
2019-08-16 14:42:02 -04:00
Przemysław Witek df574e5168
[7.x] Implement ml/data_frame/analytics/_estimate_memory_usage API endpoint (#45188) (#45510) 2019-08-14 08:26:03 +02:00
Armin Braun 90803a5caf
Reenable Integ Tests in native-multi-node-tests (#45482) (#45496)
* Reenable Integ Tests in native-multi-node-tests

* The tests broken here were likely fixed by #45463 => let's reenable them and see if things run fine again
* Relates #45405, #45455
2019-08-13 15:55:54 +02:00
Przemysław Witek 1aed388a24
Add view_index_metadata to roles.yml and remove as many df analytics test cases from build.gradle blacklist as possible. (#45451) (#45465) 2019-08-13 08:31:58 +02:00
Mark Vieira 7e3379444b
Fix build failure due to unknown task and disable test conventions
(cherry picked from commit 8ed84bc5cef9bcfae6c817059f764d97e4451a4a)
2019-08-12 09:18:39 -07:00
Przemyslaw Gomulka 421e9b8e8b
Mute integ tests in native-multi-node-tests (#45457)
Tracked at #45405
2019-08-12 17:42:24 +02:00
Przemyslaw Gomulka d11ae08467
Muting ForecastIT.testOverflowToDisk (#45435) (#45438)
awaits #45405
2019-08-12 11:01:32 +02:00
Dimitris Athanasiou d02d6e40c2 [ML] Mute regression integ test
Relates #45425
2019-08-12 10:59:24 +03:00
Armin Braun a9e1402189
Remove Settings from BaseRestRequest Constructor (#45418) (#45429)
* Resolving the todo, cleaning up the unused `settings` parameter
* Cleaning up some other minor dead code in affected classes
2019-08-12 05:14:45 +02:00
Dimitris Athanasiou 27497ff75f
[7.x][ML] Add regression analysis to DF analytics (#45292) (#45388)
This commit adds a first draft of a regression analysis
to data frame analytics. There is high probability that
the exact syntax might change.

This commit adds the new analysis type and its parameters as
well as appropriate validation. It also modifies the extractor
and the fields detector to be able to handle categorical fields
as regression analysis supports them.
2019-08-09 19:31:13 +03:00
Jason Tedor 9a142ff25c
Introduce formal node ML role (#45174)
This commit builds on the ability for plugins to introduce new roles to
add a formal node ML role.
2019-08-06 13:00:05 -04:00
David Roberts a1f0285f0e [TEST] Only test US locale in day/month order test in FIPS JVM (#45141)
In the FIPS JVM the JVM default locale seems to leak into places
where it should be overridden. This change skips assertions
in TimestampFormatFinderTests.testGuessIsDayFirstFromLocale
that may be impacted.

Fixes #45140
2019-08-02 15:04:47 +01:00
David Roberts f617585dbd [ML] Improve CSV header row detection in find_file_structure (#45099)
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.

This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.

Fixes #45047
2019-08-02 09:08:21 +01:00
Dimitris Athanasiou 8a6675b994
[7.x][ML] Check dest index is empty when starting DF analytics (#45094) (#45112)
If one tries to start a DF analytics job that has already run,
the result will be that the task will fail after reindexing the
dest index from the source index. The results of the prior run
will be gone and the task state is not properly set to failed
with the failure reason.

This commit improves the behavior in this scenario. First, we
set the task state to `failed` in a set of failures that were
missed. Second, a validation is added that if the destination
index exists, it must be empty.
2019-08-02 00:19:48 +03:00
Przemysław Witek 6c87845fc1
Persist DatafeedTimingStats with RefreshPolicy.NONE by default (#44940) (#45079) 2019-08-01 14:36:59 +02:00
Dimitris Athanasiou aef419c0b0
[7.x][ML] Catch any error thrown while closing data frame analytics process (#44958) (#44968)
In case closing the process throws an exception we should be catching
it no matter its type. The process may have terminated because of a
fatal error in which case closing the process will throw a server
error, not an `IOException`. If this happens we fail to mark the
persistent task as failed and the task gets in limbo.
2019-07-29 21:59:10 +03:00
Benjamin Trent 3b514f0dae
[ML] update Instant serialization (#44765) (#44954)
* [ML] update Instant serialization

* addressing PR comments

* removing unused import
2019-07-29 13:06:56 -05:00
Dimitris Athanasiou 9dd527328a
[ML] Outlier detection should only fetch docs that have the analyzed … (#44944) (#44959)
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.

This also adds an integration test that runs outlier detection on data
with missing fields.
2019-07-29 18:23:56 +03:00
Luca Cavanna a3cc32da64 TaskListener#onFailure to accept Exception instead of Throwable (#44946)
TaskListener accepts today Throwable in its onFailure method. Though
looking at where it is called (TransportAction), it can never be
notified of a Throwable.

This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
2019-07-29 16:47:19 +02:00
David Kyle d05f12dadb [ML] Close any opened pipes if there is an error connecting to the process (#44869) 2019-07-29 10:48:31 +01:00
Przemysław Witek 79121ea127
[7.x] Implement exponential average search time per hour statistics. (#44683) (#44897) 2019-07-26 15:56:34 +02:00
Przemysław Witek 8bb8543fdf
Treat PostDataActionResponse.DataCounts.bucketCount as incremental rather than absolute (total). (#44803) (#44856) 2019-07-25 20:46:56 +02:00
Przemysław Witek 53f409e5ae
Add result_type field to TimingStats and DatafeedTimingStats documents (#44812) (#44841) 2019-07-25 10:11:55 +02:00
Andrei Stefan 2633d11eb7
Switch from using docvalue_fields to extracting values from _source (#44062) (#44804)
* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.

(cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)
2019-07-25 10:02:41 +03:00
Alpar Torok b34ac66d96
Mute multiple tests on Windows (7.x) (#44676)
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Fix line endings in ESJsonLayoutTests

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute failure unconfigured node name

* mute x-pack internal cluster test windows

tracking #44610

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
2019-07-22 11:32:29 +03:00
David Kyle 0fc091f166
Enable XLint warnings for ML (#44346)
Removes the warning suppression -Xlint:-deprecation,-rawtypes,-serial,-try,-unchecked.
Many warnings were unchecked warnings in the test code often because of the use of mocks.
These are suppressed with @SuppressWarning
2019-07-18 09:33:37 +01:00
Tal Levy a5ad59451c
migrate more ML actions off of using Request suppliers (#44462) (#44529)
many classes still use the Streamable constructors of HandledTransportAction,
this commit moves more of those classes to the new Writeable constructors.

relates #34389.
2019-07-17 20:28:29 -07:00
Ryan Ernst 0755a13c9f
Convert AcknowledgedRequest to Writeable.Reader (#44412) (#44454)
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.

relates #34389
2019-07-17 11:17:36 -07:00
Tal Levy 901310a826
[7.x] Migrate ML Actions to use writeable ActionType (#44302) (#44391)
* Migrate ML Actions to use writeable ActionType (#44302)

This commit converts all the StreamableResponseActionType
actions in the ML core module to be ActionType and leverage
the Writeable infrastructure.
2019-07-16 12:41:10 -07:00
Przemysław Witek 34bf6bcec0
Treat big changes in searchCount as significant and persist the document after such changes (#44413) (#44424) 2019-07-16 16:15:32 +02:00
Lee Hinman fb0461ac76
[7.x] Add Snapshot Lifecycle Management (#44382)
* Add Snapshot Lifecycle Management (#43934)

* Add SnapshotLifecycleService and related CRUD APIs

This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.

This also includes the get, put, and delete APIs for these policies

Relates to #38461

* Make scheduledJobIds return an immutable set

* Use Object.equals for SnapshotLifecyclePolicy

* Remove unneeded TODO

* Implement ToXContentFragment on SnapshotLifecyclePolicyItem

* Copy contents of the scheduledJobIds

* Handle snapshot lifecycle policy updates and deletions (#40062)

(Note this is a PR against the `snapshot-lifecycle-management` feature branch)

This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.

Relates to #38461

* Take a snapshot for the policy when the SLM policy is triggered (#40383)

(This is a PR for the `snapshot-lifecycle-management` branch)

This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.

This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.

Relates to #38461

* Record most recent snapshot policy success/failure (#40619)

Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.

This is the first step toward writing history in a more permanent
fashion.

* Validate snapshot lifecycle policies (#40654)

(This is a PR against the `snapshot-lifecycle-management` branch)

With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.

Part of #38461

* Hook SLM into ILM's start and stop APIs (#40871)

(This pull request is for the `snapshot-lifecycle-management` branch)

This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.

Relates to #38461

* Add tests for SnapshotLifecyclePolicyItem (#40912)

Adds serialization tests for SnapshotLifecyclePolicyItem.

* Fix improper import in build.gradle after master merge

* Add human readable version of modified date for snapshot lifecycle policy (#41035)

* Add human readable version of modified date for snapshot lifecycle policy

This small change changes it from:

```
...
"modified_date": 1554843903242,
...
```

To

```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```

Including the `"modified_date"` field when the `?human` field is used.

Relates to #38461

* Fix test

* Add API to execute SLM policy on demand (#41038)

This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.

```json
PUT /_ilm/snapshot/<policy>/_execute
```

And it returns the response with the generated snapshot name:

```json
{
  "snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```

Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.

Relates to #38461

* Add next_execution to SLM policy metadata (#41221)

* Add next_execution to SLM policy metadata

This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:

```json
GET /_ilm/snapshot?human
{
  "production" : {
    "version" : 1,
    "modified_date" : "2019-04-15T21:16:21.865Z",
    "modified_date_millis" : 1555362981865,
    "policy" : {
      "name" : "<production-snap-{now/d}>",
      "schedule" : "*/30 * * * * ?",
      "repository" : "repo",
      "config" : {
        "indices" : [
          "foo-*",
          "important"
        ],
        "ignore_unavailable" : true,
        "include_global_state" : false
      }
    },
    "next_execution" : "2019-04-15T21:16:30.000Z",
    "next_execution_millis" : 1555362990000
  },
  "other" : {
    "version" : 1,
    "modified_date" : "2019-04-15T21:12:19.959Z",
    "modified_date_millis" : 1555362739959,
    "policy" : {
      "name" : "<other-snap-{now/d}>",
      "schedule" : "0 30 2 * * ?",
      "repository" : "repo",
      "config" : {
        "indices" : [
          "other"
        ],
        "ignore_unavailable" : false,
        "include_global_state" : true
      }
    },
    "next_execution" : "2019-04-16T02:30:00.000Z",
    "next_execution_millis" : 1555381800000
  }
}
```

Relates to #38461

* Fix and enhance tests

* Figured out how to Cron

* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)

This commit changes the endpoint for snapshot lifecycle management from:

```
GET /_ilm/snapshot/<policy>
```

to:

```
GET /_slm/policy/<policy>
```

It mimics the ILM path only using `slm` instead of `ilm`.

Relates to #38461

* Add initial documentation for SLM (#41510)

* Add initial documentation for SLM

This adds the initial documentation for snapshot lifecycle management.

It also includes the REST spec API json files since they're sort of
documentation.

Relates to #38461

* Add `manage_slm` and `read_slm` roles (#41607)

* Add `manage_slm` and `read_slm` roles

This adds two more built in roles -

`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.

`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.

Relates to #38461

* Add execute to the test

* Fix ilm -> slm typo in test

* Record SLM history into an index (#41707)

It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.

This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.

Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s.  Watcher would cause the same problem, but it
is already disabled (for the same reason).

* High Level Rest Client support for SLM (#41767)

* High Level Rest Client support for SLM

This commit add HLRC support for SLM.

Relates to #38461

* Fill out documentation tests with tags

* Add more callouts and asciidoc for HLRC

* Update javadoc links to real locations

* Add security test testing SLM cluster privileges (#42678)

* Add security test testing SLM cluster privileges

This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.

Relates to #38461

* Don't redefine vars

*  Add Getting Started Guide for SLM  (#42878)

This commit adds a basic Getting Started Guide for SLM.

* Include SLM policy name in Snapshot metadata (#43132)

Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.

* Fix compilation after master merge

* [TEST] Move exception wrapping for devious exception throwing

Fixes an issue where an exception was created from one line and thrown in another.

* Fix SLM for the change to AcknowledgedResponse

* Add Snapshot Lifecycle Management Package Docs (#43535)

* Fix compilation for transport actions now that task is required

* Add a note mentioning the privileges needed for SLM (#43708)

* Add a note mentioning the privileges needed for SLM

This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.

Relates to #38461

* Mention that you can create snapshots for indices you can't read

* Fix REST tests for new number of cluster privileges

* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)

* Fix SnapshotHistoryStoreTests after merge

* Remove overridden newResponse functions that have been removed

* Fix compilation for backport

* Fix get snapshot output parsing in test

* [DOCS] Add redirects for removed autogen anchors (#44380)

* Switch <tt>...</tt> in javadocs for {@code ...}
2019-07-16 07:37:13 -06:00
Przemysław Witek 3f3a3d3f2b
[7.x] Add DatafeedTimingStats.average_search_time_per_bucket_ms and TimingStats.total_bucket_processing_time_ms stats (#44125) (#44404) 2019-07-16 12:51:29 +02:00
Ryan Ernst 7e06888bae
Convert testclusters to use distro download plugin (#44253) (#44362)
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
2019-07-15 17:53:05 -07:00
Ryan Ernst 59658daef9
Separate streamable based master node actions (#44313)
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.

relates #34389
2019-07-15 09:20:20 -07:00
Armin Braun d73e2f9c56
HLRC: Fix '+' Not Correctly Encoded in GET Req. (#33164) (#44324)
* HLRC: Fix '+' Not Correctly Encoded in GET Req.

* Encode `+` correctly as `%2B` in URL paths
* Keep encoding `+` as space in URL parameters
* Closes #33077
2019-07-15 10:21:54 +02:00
Ryan Ernst 1dcf53465c Reorder HandledTransportAction ctor args (#44291)
This commit moves the Supplier variant of HandledTransportAction to have
a different ordering than the Writeable.Reader variant. The Supplier
version is used for the legacy Streamable, and currently having the
location of the Writeable.Reader vs Supplier in the same place forces
using casts of Writeable.Reader to select the correct super constructor.
This change in ordering allows easier migration to Writeable.Reader.

relates #34389
2019-07-12 13:45:09 -07:00
Przemysław Witek dd5f4ae00e
Update .ml-config mappings before indexing job, datafeed or df analytics config (#44216) (#44273) 2019-07-12 16:49:48 +02:00
Przemysław Witek 40d3c60d7a
Make testDatafeedTimingStats_DatafeedJobIdUpdated test easier to debug (#44206) (#44268) 2019-07-12 13:52:26 +02:00
Benjamin Trent c82d9c5b50
[ML] Adds support for regression.mean_squared_error to eval API (#44140) (#44218)
* [ML] Adds support for regression.mean_squared_error to eval API

* addressing PR comments

* fixing tests
2019-07-11 09:22:52 -05:00
David Roberts 5886aefeed [ML] Wait for .ml-config primary before assigning persistent tasks (#44170)
Now that ML job configs are stored in an index rather than
cluster state, availability of the .ml-config index is very
important to the operation of ML.  When a cluster starts up
the ML persistent tasks will be considered for node
assignment very early on.  It is best in this case if
assignment is deferred until after the .ml-config index is
available.

The introduction of data frame analytics jobs has made this
problem worse, because anomaly detection jobs already waited
for the primary shards of the .ml-state, .ml-anomalies-shared
and .ml-meta indices to be available before doing node
assignment, and by coincidence this would probably lead to
the primary shards of .ml-config also being searchable.  But
data frame analytics jobs had no other index checks prior to
this change.

This fixes problem 2 of #44156
2019-07-11 11:43:39 +01:00