Commit Graph

306 Commits

Author SHA1 Message Date
Przemysław Witek d40afc7871
[7.x] Do not fail Evaluate API when the actual and predicted fields' types differ (#54255) (#54319) 2020-03-27 10:05:19 +01:00
Dimitris Athanasiou cc981fa377
[7.x][ML] Get ML filters size should default to 100 (#54207) (#54278)
When get filters is called without setting the `size`
paramter only up to 10 filters are returned. However,
100 filters should be returned. This commit fixes this
and adds an integ test to guard it.

It seems this was accidentally broken in #39976.

Closes #54206

Backport of #54207
2020-03-26 17:51:43 +02:00
Mark Vieira 7728ccd920
Encore consistent compile options across all projects (#54120)
(cherry picked from commit ddd068a7e92dc140774598664efdc15155ab05c2)
2020-03-25 08:24:21 -07:00
Dimitris Athanasiou ba09a778dc
[7.x][ML] Unmute classification cardinality integ test (#54165) (#54173)
Adjusts test to work for new cardinality limit.

Backport of #54165
2020-03-25 15:00:34 +02:00
Dimitris Athanasiou 5ce7c99e74
[7.x][ML] Data frame analytics data counts (#53998) (#54031)
This commit instruments data frame analytics
with stats for the data that are being analyzed.
In particular, we count training docs, test docs,
and skipped docs.

In order to account docs with missing values as skipped
docs for analyses that do not support missing values,
this commit changes the extractor so that it only ignores
docs with missing values when it collects the data summary,
which is used to estimate memory usage.

Backport of #53998
2020-03-24 11:30:43 +02:00
Benjamin Trent 19af869243
[ML] adds multi-class feature importance support (#53803) (#54024)
Adds multi-class feature importance calculation. 

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```

For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
 "ml.inference.feature_importance": {
          "type": "nested",
          "dynamic": true,
          "properties": {
            "feature_name": {
              "type": "keyword"
            },
            "importance": {
              "type": "double"
            }
          }
        }
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?

If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
2020-03-23 18:49:07 -04:00
Benjamin Trent d276058c6c
[ML] adjusting feature importance mapping for multi-class support (#53821) (#54013)
Feature importance storage format is changing to encompass multi-class.

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
   “feature_name”: “feature_0”,
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0,
   “bar”: 0.5,
   “baz”: -0.5
},
```

This change adjusts the mapping creation for analytics so that the field is mapped as a `nested` type.

Native side change: https://github.com/elastic/ml-cpp/pull/1071
2020-03-23 15:50:12 -04:00
Przemysław Witek 88c5d520b3
[7.x] Verify that the field is aggregatable before attempting cardinality aggregation (#53874) (#54004) 2020-03-23 19:36:33 +01:00
Przemysław Witek a68071dbba
[7.x] Delete empty .ml-state* indices during nightly maintenance task. (#53587) (#53849) 2020-03-20 13:08:36 +01:00
Jake Landis db3420d757
[7.x] Optimize which Rest resources are used by the Rest tests… (#53766)
This should help with Gradle's incremental compile such that projects
only depend upon the resources they use.

related #52114
2020-03-19 12:28:59 -05:00
Przemysław Witek ec13c093df
Make ML index aliases hidden (#53160) (#53710) 2020-03-18 10:28:45 +01:00
Przemysław Witek 376b2ae735
[7.x] Make classification evaluation metrics work when there is field mapping type mismatch (#53458) (#53601) 2020-03-16 15:38:56 +01:00
Dimitris Athanasiou 94da4ca3fc
[7.x][ML] Extend classification to support multiple classes (#53539) (#53597)
Prepares classification analysis to support more than just
two classes. It introduces a new parameter to the process config
which dictates the `num_classes` to the process. It also
changes the max classes limit to `30` provisionally.

Backport of #53539
2020-03-16 15:00:54 +02:00
Benjamin Trent 4e43ede735
[ML] renaming inference processor field field_mappings to new name field_map (#53433) (#53502)
This renames the `inference` processor configuration field `field_mappings` to `field_map`.

`field_mappings` is now deprecated.
2020-03-13 15:40:57 -04:00
Tom Veasey 690099553c
[7.x][ML] Adds the class_assignment_objective parameter to classification (#53552)
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.

Fixes #52427.
2020-03-13 17:35:51 +00:00
Benjamin Trent 89668c5ea0
[ML][Inference] adds new default_field_map field to trained models (#53294) (#53419)
Adds a new `default_field_map` field to trained model config objects.

This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.

The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
2020-03-11 13:49:39 -04:00
Przemysław Witek 8c4c19d310
Perform evaluation in multiple steps when necessary (#53295) (#53409) 2020-03-11 15:36:38 +01:00
Przemysław Witek 063957b7d8
Simplify "refresh" calls. (#53385) (#53393) 2020-03-11 12:26:11 +01:00
Dimitris Athanasiou 0fd0516d0d
[7.x][ML] Rename data frame analytics maximum_number_trees to max_trees (#53300) (#53390)
Deprecates `maximum_number_trees` parameter of classification and
regression and replaces it with `max_trees`.

Backport of #53300
2020-03-11 12:45:27 +02:00
Przemysław Witek d54d7f2be0
[7.x] Implement ILM policy for .ml-state* indices (#52356) (#53327) 2020-03-10 14:24:18 +01:00
Benjamin Trent 856d9bfbc1
[ML] fixing data frame analysis test when two jobs are started in succession quickly (#53192) (#53332)
A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While this it is good for jobs to wait, they could fail early on _start. 

This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.

If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. 

For tasks seeking a new assignment after a node failure, that behavior is unchanged.


closes #53188
2020-03-10 08:30:47 -04:00
Mayya Sharipova f96ad5c32d Mute testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows 2020-03-06 12:48:05 -05:00
Mark Vieira 09a3f45880
Mute ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet
Signed-off-by: Mark Vieira <portugee@gmail.com>
2020-03-06 07:38:04 -08:00
James Baiera 01f00df5cd
Mute RegressionIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet 2020-03-06 07:37:57 -08:00
Dimitris Athanasiou 9abf537527
[7.x][ML] Improve DF analytics audits and logging (#53179) (#53218)
Adds audits for when the job starts reindexing, loading data,
analyzing, writing results. Also adds some info logging.

Backport of #53179
2020-03-06 13:47:27 +02:00
Benjamin Trent af0b1c2860
[ML] Fix minor race condition in dataframe analytics _stop (#53029) (#53164)
Tests have been periodically failing due to a race condition on checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task removed BEFORE the `.ml-state|stats` indices are created/updated.

This change moves towards the task cleaning up itself in its main execution thread. `stop` flips the flag of the task to `isStopping` and now we check `isStopping` at every necessary method. Allowing the task to gracefully stop.

closes #53007
2020-03-05 09:59:18 -05:00
Yang Wang 70814daa86
Allow _rollup_search with read privilege (#52043) (#53047)
Currently _rollup_search requires manage privilege to access. It should really be
a read only operation. This PR changes the requirement to be read indices privilege.

Resolves: #50245
2020-03-03 22:29:54 +11:00
Mark Vieira f8396e8d15
Mute RunDataFrameAnalyticsIT.testStopOutlierDetectionWithEnoughDocumentsToScroll
Signed-off-by: Mark Vieira <portugee@gmail.com>
2020-03-02 09:21:55 -08:00
Benjamin Trent 19a6c5d980
[7.x] [ML][Inference] Add support for multi-value leaves to the tree model (#52531) (#52901)
* [ML][Inference] Add support for multi-value leaves to the tree model (#52531)

This adds support for multi-value leaves. This is a prerequisite for multi-class boosted tree classification.
2020-02-27 14:05:28 -05:00
Benjamin Trent eac38e9847
[ML] Add indices_options to datafeed config and update (#52793) (#52905)
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.

This is necessary for the following use cases:
 - Reading from frozen indices
 - Allowing certain indices in multiple index patterns to not exist yet

These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.

closes https://github.com/elastic/elasticsearch/issues/48056
2020-02-27 13:43:25 -05:00
David Kyle d8bdf31110 Revert "Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart"
This reverts commit ad3a3b1af9.
2020-02-27 12:38:13 +00:00
Jake Landis b4179a8814
[7.x] Refactor watcher tests (#52799) (#52844)
This PR moves the majority of the Watcher REST tests under
the Watcher x-pack plugin.

Specifically, moves the Watcher tests from:
x-pack/plugin/test
x-pack/qa/smoke-test-watcher
x-pack/qa/smoke-test-watcher-with-security
x-pack/qa/smoke-test-monitoring-with-watcher

to:
x-pack/plugin/watcher/qa/rest (/test and /qa/smoke-test-watcher)
x-pack/plugin/watcher/qa/with-security
x-pack/plugin/watcher/qa/with-monitoring

Additionally, this disables Watcher from the main
x-pack test cluster and consolidates the stop/start logic
for the tests listed.

No changes to the tests (beyond moving them) are included.

3rd party tests and doc tests (which also touch Watcher)
are not included in the changes here.
2020-02-26 15:57:10 -06:00
David Kyle ad3a3b1af9 Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart 2020-02-26 14:31:00 +00:00
Jake Landis 8d311297ca
[7.x] Smarter copying of the rest specs and tests (#52114) (#52798)
* Smarter copying of the rest specs and tests (#52114)

This PR addresses the unnecessary copying of the rest specs and allows
for better semantics for which specs and tests are copied. By default
the rest specs will get copied if the project applies
`elasticsearch.standalone-rest-test` or `esplugin` and the project
has rest tests or you configure the custom extension `restResources`.

This PR also removes the need for dozens of places where the x-pack
specs were copied by supporting copying of the x-pack rest specs too.

The plugin/task introduced here can also copy the rest tests to the
local project through a similar configuration.

The new plugin/task allows a user to minimize the surface area of
which rest specs are copied. Per project can be configured to include
only a subset of the specs (or tests). Configuring a project to only
copy the specs when actually needed should help with build cache hit
rates since we can better define what is actually in use.
However, project level optimizations for build cache hit rates are
not included with this PR.

Also, with this PR you can no longer use the includePackaged flag on
integTest task.

The following items are included in this PR:
* new plugin: `elasticsearch.rest-resources`
* new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy
* new extension 'restResources'
```
restResources {
  restApi {
    includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar
    includeXpack 'baz' //will include x-pack specs that start with baz
  }
  restTests {
    includeCore 'foo', 'bar' //will include the core tests that start with foo and bar
    includeXpack 'baz' //will include the x-pack tests that start with baz
  }
}

```
2020-02-26 08:13:41 -06:00
David Kyle de3d674bb7 Revert "Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart"
This reverts commit c4d91143ac.
2020-02-24 15:22:49 +00:00
Benjamin Trent afd90647c9
[ML] Adds feature importance to option to inference processor (#52218) (#52666)
This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}
```

This will write to the document as follows:
```
"inference" : {
   "feature_importance" : {
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
```

This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: https://github.com/elastic/ml-cpp/pull/991
2020-02-21 18:42:31 -05:00
Jack Conradson c4d91143ac Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart
Relates: #52654
2020-02-21 09:32:19 -08:00
Przemysław Witek b84e8db7b5
[7.x] Rename .ml-state index to .ml-state-000001 to support rollover (#52510) (#52595) 2020-02-21 08:55:59 +01:00
Benjamin Trent 2a5c181dda
[ML][Inference] don't return inflated definition when storing trained models (#52573) (#52580)
When `PUT` is called to store a trained model, it is useful to return the newly create model config. But, it is NOT useful to return the inflated definition.

These definitions can be large and returning the inflated definition causes undo work on the server and client side.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-02-20 19:47:29 -05:00
Benjamin Trent 013d5c2d24
[ML] Adds support for a global calendar via `_all` (#50372) (#52578)
This adds `_all` to Calendar searches. This enables users to supply the `_all` string in the `job_ids` array when creating a Calendar. That calendar will now be applied to all jobs (existing and newly created).

Closes #45013

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-02-20 17:22:59 -05:00
David Kyle 7bbe5c8464
[Ml] Validate tree feature index is within range (#52514)
This changes the tree validation code to ensure no node in the tree has a
feature index that is beyond the bounds of the feature_names array.
Specifically this handles the situation where the C++ emits a tree containing
a single node and an empty feature_names list. This is valid tree used to
centre the data in the ensemble but the validation code would reject this
as feature_names is empty. This meant a broken workflow as you cannot GET
the model and PUT it back
2020-02-19 14:41:43 +00:00
Przemysław Witek 7cd997df84
[ML] Make ml internal indices hidden (#52423) (#52509) 2020-02-19 14:02:32 +01:00
Dimitris Athanasiou ad56802ac6
[7.x][ML] Refactor ML mappings and templates into JSON resources (#51… (#52353)
ML mappings and index templates have so far been created
programmatically. While this had its merits due to static typing,
there is consensus it would be clear to maintain those in json files.
In addition, we are going to adding ILM policies to these indices
and the component for a plugin to register ILM policies is
`IndexTemplateRegistry`. It expects the templates to be in resource
json files.

For the above reasons this commit refactors ML mappings and index
templates into json resource files that are registered via
`MlIndexTemplateRegistry`.

Backport of #51765
2020-02-14 17:16:06 +02:00
David Roberts 473468d763 [ML] Better error when persistent task assignment disabled (#52014)
Changes the misleading error message when attempting to open
a job while the "cluster.persistent_tasks.allocation.enable"
setting is set to "none" to a clearer message that names the
setting.

Closes #51956
2020-02-11 15:23:21 +00:00
Benjamin Trent 846f87a26e
[ML] allow close/stop for jobs/datafeeds with missing configs (#51888) (#51997)
If the configs are removed (by some horrific means), we should still allow tasks to be cleaned up easily.

Datafeeds and jobs with missing configs are now visible in their respective _stats calls and can be stopped/closed.
2020-02-06 12:10:18 -05:00
Benjamin Trent 1380dd439a
[7.x] [ML][Inference] Fix weighted mode definition (#51648) (#51695)
* [ML][Inference] Fix weighted mode definition (#51648)

Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense. 

Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`. This indicates what the maximum class value to be expected.
2020-01-30 15:33:25 -05:00
Przemysław Witek 683170b007
Increase the number of indexed documents to increase a chance that there are at least 2 training rows. (#51607) (#51615) 2020-01-29 17:17:19 +01:00
Benjamin Trent fc994d9ce1
[ML][Inference] Adds validations for model PUT (#51376) (#51409)
Adds validations making sure that

* `input.field_names` is not empty
* `ensemble.trained_models` is not empty
* `tree.feature_names` is not empty

closes https://github.com/elastic/elasticsearch/issues/51354
2020-01-24 09:29:12 -05:00
David Kyle ca4b90a001
[ML] Calculate results and snapshot retention using latest bucket timestamps (#51061) (#51301)
The retention period is calculated relative to the last bucket result or snapshot
time rather than wall clock
2020-01-22 14:52:33 +00:00
Dimitris Athanasiou 59687a9384
[7.x][ML] Validate classification dependent_variable cardinality is at lea… (#51232) (#51309)
Data frame analytics classification currently only supports 2 classes for the
dependent variable. We were checking that the field's cardinality is not higher
than 2 but we should also check it is not less than that as otherwise the process
fails.

Backport of #51232
2020-01-22 16:51:16 +02:00
Benjamin Trent 2a73e849d6
[ML][Inference] fixing ingest IT tests (#51267) (#51311)
Converts InferenceIngestIT into a `ESRestTestCase`.

closes #51201
2020-01-22 09:50:17 -05:00
Przemysław Witek bfcfcdee33
[7.x] Do not copy mapping from dependent variable to prediction field in regression analysis (#51227) (#51288) 2020-01-22 12:36:24 +01:00
David Roberts 0fa7db9a95 [ML] Make datafeeds work with nanosecond time fields (#51180)
Allows ML datafeeds to work with time fields that have
the "date_nanos" type _and make use of the extra precision_.
(Previously datafeeds only worked with time fields that were
exact multiples of milliseconds.  So datafeeds would work
with "date_nanos" only if the extra precision over "date" was
not used.)

Relates #49889
2020-01-21 09:59:50 +00:00
Tom Veasey 32ec934b15
[7.x][ML] Assert top classes are ordered by score (#51028)
Backport #51003.
2020-01-16 12:23:15 +00:00
Benjamin Trent 72c270946f
[ML][Inference] Adding classification_weights to ensemble models (#50874) (#50994)
* [ML][Inference] Adding classification_weights to ensemble models

classification_weights are a way to allow models to
prefer specific classification results over others
this might be advantageous if classification value
probabilities are a known quantity and can improve
model error rates.
2020-01-14 12:40:25 -05:00
Tom Veasey de5713fa4b
[ML] Disable invalid assertion (#50988)
Backport #50986.
2020-01-14 17:35:00 +00:00
Dimitris Athanasiou 1d8cb3c741
[7.x][ML] Add num_top_feature_importance_values param to regression and classi… (#50914) (#50976)
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on SHAP (SHapley Additive exPlanations) method.

Backport of #50914
2020-01-14 16:46:09 +02:00
Przemysław Witek 9c6ffdc2be
[7.x] Handle nested and aliased fields correctly when copying mapping. (#50918) (#50968) 2020-01-14 14:43:39 +01:00
Benjamin Trent fa116a6d26
[7.x] [ML][Inference] PUT API (#50852) (#50887)
* [ML][Inference] PUT API (#50852)

This adds the `PUT` API for creating trained models that support our format.

This includes

* HLRC change for the API
* API creation
* Validations of model format and call

* fixing backport
2020-01-12 10:59:11 -05:00
Dimitris Athanasiou 422422a2bc
[7.x][ML] Reuse SourceDestValidator for data frame analytics (#50841) (#50850)
This commit removes validation logic of source and dest indices
for data frame analytics and replaces it with using the common
`SourceDestValidator` class which is already used by transforms.
This way the validations and their messages become consistent
while we reduce code.

This means that where these validations fail the error messages
will be slightly different for data frame analytics.

Backport of #50841
2020-01-10 14:24:13 +02:00
Benjamin Trent cc0e64572a
[ML][Inference][HLRC] Add necessary lang ident classes (#50705) (#50794)
This adds the necessary named XContent classes to the HLRC for the lang ident model. This is so the HLRC can call `GET _ml/inference/lang_ident_model_1?include_definition=true` without XContent parsing errors.

The constructors are package private as since this classes are used exclusively within the pre-packaged model (and require the specific weights, etc. to be of any use).
2020-01-09 10:33:38 -05:00
Benjamin Trent 060e0a6277
[ML][Inference] Add support for models shipped as resources (#50680) (#50700)
This adds support for models that are shipped as resources in the ML plugin. The first of which is the `lang_ident` model.
2020-01-07 09:21:59 -05:00
Przemysław Witek 4116452d90
Implement testStopAndRestart for ClassificationIT (#50585) (#50698) 2020-01-07 13:41:37 +01:00
Przemysław Witek 8917c05df8
[7.x] Synchronize processInStream.close() call (#50581) 2020-01-03 10:23:51 +01:00
Przemysław Witek 4ecabe496f
Mute testStopAndRestart test case (#50551) 2020-01-02 15:28:20 +01:00
Christoph Büscher 1599af8428 Fix type conversion problem in Eclipse (#50549)
Eclipse 4.13 shows a type mismatch error in the affected line because it cannot
correctly infer the boolean return type for the method call. Assigning return
value to a local variable resolves this problem.
2020-01-02 14:29:20 +01:00
Przemysław Witek 3e3a93002f
[7.x] Fix accuracy metric (#50310) (#50433) 2019-12-20 15:34:38 +01:00
Przemysław Witek 14d95aae46
[7.x] Make each analysis report desired field mappings to be copied (#50219) (#50428) 2019-12-20 15:10:33 +01:00
Przemysław Witek 5bb668b866
[7.x] Get rid of maxClassesCardinality internal parameter (#50418) (#50423) 2019-12-20 14:24:23 +01:00
Przemysław Witek cc4bc797f9
[7.x] Implement `precision` and `recall` metrics for classification evaluation (#49671) (#50378) 2019-12-19 18:55:05 +01:00
Dimitris Athanasiou 447bac27d2
[7.x][ML] Delete unused data frame analytics state (#50243) (#50280)
This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and in extend th ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.

Backport of #50243
2019-12-18 12:30:11 +00:00
Dimitris Athanasiou fe3c9e71d1
[7.x][ML] Fix DFA explain API timeout when source index is missing (#50176) (#50180)
This commit fixes a bug that caused the data frame analytics
_explain API to time out in a multi-node setup when the source
index was missing. When we try to create the extracted fields detector,
we check the index settings. If the index is missing that responds
with a failure that could be wrapped as a remote exception.
While we unwrapped correctly to check if the cause was an
`IndexNotFoundException`, we then proceeded to cast the original
exception instead of the cause.

Backport of #50176
2019-12-13 17:00:55 +02:00
Dimitris Athanasiou e6cbcf7f7c
[7.x] [ML] Persist/restore state for DFA classification (#50040) (#50147)
This commit adds state persist/restore for data frame analytics classification jobs.

Backport of #50040
2019-12-13 10:33:19 +02:00
Benjamin Trent d7ffa7f8f7
[7.x][ML] Add graceful retry for anomaly detector result indexing failures(#49508) (#50145)
* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)

All results indexing now retry the amount of times configured in `xpack.ml.persist_results_max_retries`. The retries are done in a semi-random, exponential backoff.

* fixing test
2019-12-12 12:24:58 -05:00
Benjamin Trent c043aa887f
[ML][Inference] Simplify inference processor options (#50105) (#50146)
* [ML][Inference] Simplify inference processor options

* addressing pr comments
2019-12-12 11:13:55 -05:00
Dimitris Athanasiou 03ecaae221
[7.x][ML] Avoid classification integ test training on single class (#50072) (#50078)
The `ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet`
test was previously set up to just have 10 rows. With `training_percent`
of 50%, only 5 rows will be used for training. There is a good chance that
all 5 rows will be of one class which results to failure.

This commit increases the rows to 100. Now 50 rows should be used for training
and the chance of failure should be very small.

Backport of #50072
2019-12-11 18:50:26 +02:00
Dimitris Athanasiou 8891f4db88
[7.x][ML] Introduce randomize_seed setting for regression and classification (#49990) (#50023)
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.

Backport of #49990
2019-12-10 15:29:19 +02:00
Benjamin Trent 0b6ce9683c
[ML] Use query in cardinality check (#49939) (#49984)
When checking the cardinality of a field, the query should be take into account. The user might know about some bad data in their index and want to filter down to the target_field values they care about.
2019-12-09 10:14:41 -05:00
Przemysław Witek 0965a10468
[7.x] Pass `prediction_field_type` to C++ analytics process (#49861) (#49981) 2019-12-09 14:43:01 +01:00
Benjamin Trent 049d854360
[ML][Inference] adjust so target_field always has inference result and optionally allow new top classes field in the classification config (#49923) (#49982) 2019-12-09 08:29:45 -05:00
Dimitris Athanasiou e4f838e764
[7.x][ML] Update expected mem estimate in explain API integ test (#49924) (#49979)
Work in progress in the c++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the mem
estimate when there is no source query is a about 2Mb. So I
am relaxing the test to assert memory estimate is less than 1Mb
instead of 500Kb.

Backport of #49924
2019-12-09 11:52:06 +02:00
Przemysław Witek e60837aa3b
[7.x] Log whole analytics stats when the state assertion fails (#49906) (#49911) 2019-12-06 14:31:17 +01:00
Dimitris Athanasiou 4edb2e7bb6
[7.x][ML] Add optional source filtering during data frame reindexing (#49690) (#49718)
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes #49531

Backport of #49690
2019-11-29 16:10:44 +02:00
Dimitris Athanasiou c149c64dc4
[7.x][ML] Apply source query on data frame analytics memory estimation (#49517) (#49532)
Closes #49454

Backport of #49517
2019-11-25 12:51:57 +02:00
Dimitris Athanasiou 8eaee7cbdc
[7.x][ML] Explain data frame analytics API (#49455) (#49504)
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.

Backport of #49455
2019-11-22 22:06:10 +02:00
Benjamin Trent a7477ad7c3
[7.x] [ML][Inference] compressing model definition and lazy parsing (#49269) (#49446)
* [ML][Inference] compressing model definition and lazy parsing (#49269)

* [ML][Inference] compressing model definition and lazy parsing

* addressing PR comments

* adding commons io

* implementing simplified bounded stream

* adjusting for type inclusion
2019-11-21 15:32:32 -05:00
Benjamin Trent d41b2e3f38
[ML][Inference] allowing per-model licensing (#49398) (#49435)
* [ML][Inference] allowing per-model licensing

* changing to internal action + removing pre-mature opt
2019-11-21 09:46:34 -05:00
Przemysław Witek c7ac2011eb
[7.x] Implement accuracy metric for multiclass classification (#47772) (#49430) 2019-11-21 15:01:18 +01:00
Przemysław Witek 9c0ec7ce23
[7.x] Make AnalyticsProcessManager class more robust (#49282) (#49356) 2019-11-20 10:08:16 +01:00
Przemysław Witek 42bb8ae525
[7.x] Extract indexData method out of RegressionIT tests (#49306) (#49313) 2019-11-19 22:47:12 +01:00
Benjamin Trent eefe7688ce
[7.x][ML] ML Model Inference Ingest Processor (#49052) (#49257)
* [ML] ML Model Inference Ingest Processor (#49052)

* [ML][Inference] adds lazy model loader and inference (#47410)

This adds a couple of things:

- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:

* [ML][Inference] Adjust inference configuration option API (#47812)

* [ML][Inference] adds logistic_regression output aggregator (#48075)

* [ML][Inference] Adding read/del trained models (#47882)

* [ML][Inference] Adding inference ingest processor (#47859)

* [ML][Inference] fixing classification inference for ensemble (#48463)

* [ML][Inference] Adding model memory estimations (#48323)

* [ML][Inference] adding more options to inference processor (#48545)

* [ML][Inference] handle string values better in feature extraction (#48584)

* [ML][Inference] Adding _stats endpoint for inference (#48492)

* [ML][Inference] add inference processors and trained models to usage (#47869)

* [ML][Inference] add new flag for optionally including model definition (#48718)

* [ML][Inference] adding license checks (#49056)

* [ML][Inference] Adding memory and compute estimates to inference (#48955)

* fixing version of indexed docs for model inference
2019-11-18 13:19:17 -05:00
Przemysław Witek 150db2b544
Throw an exception when memory usage estimation endpoint encounters empty data frame. (#49143) (#49164) 2019-11-18 07:52:57 +01:00
Rory Hunter c46a0e8708
Apply 2-space indent to all gradle scripts (#49071)
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
2019-11-14 11:01:23 +00:00
Dimitris Athanasiou dfc6a13b44
[7.x][ML] Handle nested arrays in source fields (#48885) (#48889)
Backport of #48885
2019-11-07 07:30:50 +02:00
Dimitris Athanasiou 1f662e0b12
[7.x][ML] Prevent fetching multi-field from source (#48770) (#48797)
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates #48756

Backport of #48770
2019-11-01 14:18:03 +02:00
Przemysław Witek 7c944d26c5
[7.x] Assert that the results of classification analysis can be evaluated using _evaluate API. (#48626) (#48634) 2019-10-29 16:20:56 +01:00
Przemysław Witek 7e30277a37
Mute RegressionIT.testStopAndRestart (#48575) (#48576) 2019-10-28 13:08:11 +01:00
Przemysław Witek 149537a165
Assert that inference model has been persisted (#48332) (#48453) 2019-10-24 14:18:43 +02:00
Przemysław Witek 60d8ecb2b7
Mute ClassificationIT tests (#48338) (#48339) 2019-10-22 12:45:50 +02:00
Przemysław Witek 2db2b945ec
[7.x] Change format of MulticlassConfusionMatrix result to be more self-explanatory (#48174) (#48294) 2019-10-21 22:07:19 +02:00