255 Commits

Author SHA1 Message Date
Przemysław Witek
88c5d520b3
[7.x] Verify that the field is aggregatable before attempting cardinality aggregation (#53874) (#54004) 2020-03-23 19:36:33 +01:00
Przemysław Witek
a68071dbba
[7.x] Delete empty .ml-state* indices during nightly maintenance task. (#53587) (#53849) 2020-03-20 13:08:36 +01:00
Przemysław Witek
ec13c093df
Make ML index aliases hidden (#53160) (#53710) 2020-03-18 10:28:45 +01:00
Przemysław Witek
376b2ae735
[7.x] Make classification evaluation metrics work when there is field mapping type mismatch (#53458) (#53601) 2020-03-16 15:38:56 +01:00
Dimitris Athanasiou
94da4ca3fc
[7.x][ML] Extend classification to support multiple classes (#53539) (#53597)
Prepares classification analysis to support more than just
two classes. It introduces a new parameter to the process config
which dictates the `num_classes` to the process. It also
changes the max classes limit to `30` provisionally.

Backport of #53539
2020-03-16 15:00:54 +02:00
Benjamin Trent
4e43ede735
[ML] renaming inference processor field field_mappings to new name field_map (#53433) (#53502)
This renames the `inference` processor configuration field `field_mappings` to `field_map`.

`field_mappings` is now deprecated.
2020-03-13 15:40:57 -04:00
Tom Veasey
690099553c
[7.x][ML] Adds the class_assignment_objective parameter to classification (#53552)
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.

Fixes #52427.
2020-03-13 17:35:51 +00:00
Benjamin Trent
89668c5ea0
[ML][Inference] adds new default_field_map field to trained models (#53294) (#53419)
Adds a new `default_field_map` field to trained model config objects.

This allows the model creator to supply a field map if it knows that there should be some map for inference to work directly against the training data.

The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
2020-03-11 13:49:39 -04:00
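The `default_field_map` idea from #53294 above can be sketched roughly as follows. This is a minimal Python illustration, not the plugin's Java implementation. The helper name `apply_field_map` and the sample document are hypothetical, and the sketch assumes the map goes from the `_source` field name to the field name the model was trained on.

```python
def apply_field_map(source, field_map):
    """Return a copy of `source` with mapped fields also exposed under the model's names.

    Hypothetical helper: `field_map` maps a document field name to the
    field name the model was trained on.
    """
    features = dict(source)  # copy so the original document is not modified
    for doc_field, model_field in field_map.items():
        if doc_field in source:
            features[model_field] = source[doc_field]
    return features


# Assumed example: the model was trained on `foo.keyword`,
# but the ingested document only carries `foo`.
default_field_map = {"foo": "foo.keyword"}
doc = {"foo": "bar", "count": 3}
features = apply_field_map(doc, default_field_map)
```

With a map like this the model can look up `foo.keyword` even though the incoming document only contains `foo`.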
Przemysław Witek
8c4c19d310
Perform evaluation in multiple steps when necessary (#53295) (#53409) 2020-03-11 15:36:38 +01:00
Przemysław Witek
063957b7d8
Simplify "refresh" calls. (#53385) (#53393) 2020-03-11 12:26:11 +01:00
Dimitris Athanasiou
0fd0516d0d
[7.x][ML] Rename data frame analytics maximum_number_trees to max_trees (#53300) (#53390)
Deprecates `maximum_number_trees` parameter of classification and
regression and replaces it with `max_trees`.

Backport of #53300
2020-03-11 12:45:27 +02:00
Przemysław Witek
d54d7f2be0
[7.x] Implement ILM policy for .ml-state* indices (#52356) (#53327) 2020-03-10 14:24:18 +01:00
Benjamin Trent
856d9bfbc1
[ML] fixing data frame analysis test when two jobs are started in succession quickly (#53192) (#53332)
A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start.

This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.

If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. 

For tasks seeking a new assignment after a node failure, that behavior is unchanged.


closes #53188
2020-03-10 08:30:47 -04:00
Mayya Sharipova
f96ad5c32d Mute testSingleNumericFeatureAndMixedTrainingAndNonTrainingRows 2020-03-06 12:48:05 -05:00
Mark Vieira
09a3f45880
Mute ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet
Signed-off-by: Mark Vieira <portugee@gmail.com>
2020-03-06 07:38:04 -08:00
James Baiera
01f00df5cd
Mute RegressionIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet 2020-03-06 07:37:57 -08:00
Dimitris Athanasiou
9abf537527
[7.x][ML] Improve DF analytics audits and logging (#53179) (#53218)
Adds audits for when the job starts reindexing, loading data,
analyzing, writing results. Also adds some info logging.

Backport of #53179
2020-03-06 13:47:27 +02:00
Benjamin Trent
af0b1c2860
[ML] Fix minor race condition in dataframe analytics _stop (#53029) (#53164)
Tests have been periodically failing due to a race condition on checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task could be removed BEFORE the `.ml-state|stats` indices are created/updated.

This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and we now check `isStopping` in every necessary method, allowing the task to stop gracefully.

closes #53007
2020-03-05 09:59:18 -05:00
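The cooperative-stop pattern described in #53029 above can be illustrated with a small Python sketch. It is not the actual Java task code: the class `AnalyticsTask`, its attributes, and the step functions are all hypothetical. The point it shows is that `stop()` only sets a flag, while the main execution path checks that flag before each step and does its own cleanup.

```python
import threading


class AnalyticsTask:
    """Hypothetical task that stops cooperatively and cleans up after itself."""

    def __init__(self):
        self._is_stopping = threading.Event()  # plays the role of `isStopping`
        self.completed_steps = []
        self.cleaned_up = False

    def stop(self):
        # Only flip the flag; the running task decides when to exit.
        self._is_stopping.set()

    def run(self, steps):
        try:
            for name, step in steps:
                if self._is_stopping.is_set():
                    break  # checked before every step
                step(self)
                self.completed_steps.append(name)
        finally:
            # Cleanup happens on the task's own execution path.
            self.cleaned_up = True


def reindex(task):
    pass


def load_data(task):
    task.stop()  # simulate a `_stop` request arriving mid-run


def analyze(task):
    pass


task = AnalyticsTask()
task.run([("reindex", reindex), ("load_data", load_data), ("analyze", analyze)])
```

In this run the `analyze` step is skipped because the flag was set during `load_data`, and cleanup still runs exactly once in the `finally` block.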
Yang Wang
70814daa86
Allow _rollup_search with read privilege (#52043) (#53047)
Currently _rollup_search requires manage privilege to access. It should really be
a read only operation. This PR changes the requirement to be read indices privilege.

Resolves: #50245
2020-03-03 22:29:54 +11:00
Mark Vieira
f8396e8d15
Mute RunDataFrameAnalyticsIT.testStopOutlierDetectionWithEnoughDocumentsToScroll
Signed-off-by: Mark Vieira <portugee@gmail.com>
2020-03-02 09:21:55 -08:00
Benjamin Trent
19a6c5d980
[7.x] [ML][Inference] Add support for multi-value leaves to the tree model (#52531) (#52901)
* [ML][Inference] Add support for multi-value leaves to the tree model (#52531)

This adds support for multi-value leaves. This is a prerequisite for multi-class boosted tree classification.
2020-02-27 14:05:28 -05:00
Benjamin Trent
eac38e9847
[ML] Add indices_options to datafeed config and update (#52793) (#52905)
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.

This is necessary for the following use cases:
 - Reading from frozen indices
 - Allowing certain indices in multiple index patterns to not exist yet

These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.

closes https://github.com/elastic/elasticsearch/issues/48056
2020-02-27 13:43:25 -05:00
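As a rough illustration of the `indices_options` field from #52793 above, the Python snippet below builds a datafeed configuration body. The job ID and index patterns are made up, and the four option names are the standard Elasticsearch indices options, assumed here to be the ones accepted by the datafeed; check the datafeed API documentation for your version before relying on them.

```python
import json

# Hypothetical datafeed configuration; `my_job` and the index patterns are examples.
datafeed_config = {
    "job_id": "my_job",
    "indices": ["logs-*", "metrics-*"],
    "indices_options": {
        "expand_wildcards": ["open"],
        "ignore_unavailable": True,   # tolerate indices that do not exist yet
        "allow_no_indices": True,     # a pattern may match nothing
        "ignore_throttled": False,    # assumed setting for reading frozen indices
    },
}

# JSON body that could be sent when creating or updating the datafeed.
body = json.dumps(datafeed_config, indent=2)
```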
David Kyle
d8bdf31110 Revert "Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart"
This reverts commit ad3a3b1af984bc051e7af01b50d1f4c78120e44d.
2020-02-27 12:38:13 +00:00
David Kyle
ad3a3b1af9 Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart 2020-02-26 14:31:00 +00:00
David Kyle
de3d674bb7 Revert "Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart"
This reverts commit c4d91143acc8edaf2895b1d464510e92eb7e16a2.
2020-02-24 15:22:49 +00:00
Benjamin Trent
afd90647c9
[ML] Adds feature importance option to inference processor (#52218) (#52666)
This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}
```

This will write to the document as follows:
```
"inference" : {
   "feature_importance" : {
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
```

This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config requests feature_importance and not all nodes have been upgraded yet, the pipeline cannot be created. This is a safeguard for mixed-version environments where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: https://github.com/elastic/ml-cpp/pull/991
2020-02-21 18:42:31 -05:00
Jack Conradson
c4d91143ac Mute RunDataFrameAnalyticsIT.testOutlierDetectionStopAndRestart
Relates: #52654
2020-02-21 09:32:19 -08:00
Przemysław Witek
b84e8db7b5
[7.x] Rename .ml-state index to .ml-state-000001 to support rollover (#52510) (#52595) 2020-02-21 08:55:59 +01:00
Benjamin Trent
2a5c181dda
[ML][Inference] don't return inflated definition when storing trained models (#52573) (#52580)
When `PUT` is called to store a trained model, it is useful to return the newly created model config. But, it is NOT useful to return the inflated definition.

These definitions can be large and returning the inflated definition causes undue work on the server and client side.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-02-20 19:47:29 -05:00
Benjamin Trent
013d5c2d24
[ML] Adds support for a global calendar via _all (#50372) (#52578)
This adds `_all` to Calendar searches. This enables users to supply the `_all` string in the `job_ids` array when creating a Calendar. That calendar will now be applied to all jobs (existing and newly created).

Closes #45013

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-02-20 17:22:59 -05:00
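The `_all` behaviour from #50372 above amounts to a simple membership rule, sketched below in Python. The function name `calendar_applies_to` and the job IDs are hypothetical, and the sketch ignores everything else a real calendar lookup involves.

```python
GLOBAL_CALENDAR = "_all"


def calendar_applies_to(calendar_job_ids, job_id):
    """Hypothetical check: a calendar applies if it lists the job or the `_all` marker."""
    return GLOBAL_CALENDAR in calendar_job_ids or job_id in calendar_job_ids


global_calendar = ["_all"]
team_calendar = ["job-a", "job-b"]

applies_global = calendar_applies_to(global_calendar, "job-created-later")
applies_team = calendar_applies_to(team_calendar, "job-created-later")
```

Because the rule is evaluated per job, a calendar containing `_all` also covers jobs created after the calendar itself.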
Przemysław Witek
7cd997df84
[ML] Make ml internal indices hidden (#52423) (#52509) 2020-02-19 14:02:32 +01:00
Dimitris Athanasiou
ad56802ac6
[7.x][ML] Refactor ML mappings and templates into JSON resources (#51… (#52353)
ML mappings and index templates have so far been created
programmatically. While this had its merits due to static typing,
there is consensus it would be clearer to maintain those in json files.
In addition, we are going to add ILM policies to these indices
and the component for a plugin to register ILM policies is
`IndexTemplateRegistry`. It expects the templates to be in resource
json files.

For the above reasons this commit refactors ML mappings and index
templates into json resource files that are registered via
`MlIndexTemplateRegistry`.

Backport of #51765
2020-02-14 17:16:06 +02:00
David Roberts
473468d763 [ML] Better error when persistent task assignment disabled (#52014)
Changes the misleading error message when attempting to open
a job while the "cluster.persistent_tasks.allocation.enable"
setting is set to "none" to a clearer message that names the
setting.

Closes #51956
2020-02-11 15:23:21 +00:00
Benjamin Trent
846f87a26e
[ML] allow close/stop for jobs/datafeeds with missing configs (#51888) (#51997)
If the configs are removed (by some horrific means), we should still allow tasks to be cleaned up easily.

Datafeeds and jobs with missing configs are now visible in their respective _stats calls and can be stopped/closed.
2020-02-06 12:10:18 -05:00
Benjamin Trent
1380dd439a
[7.x] [ML][Inference] Fix weighted mode definition (#51648) (#51695)
* [ML][Inference] Fix weighted mode definition (#51648)

Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense. 

Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`, which indicates the maximum class value to expect.
2020-01-30 15:33:25 -05:00
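A minimal Python sketch of a weighted mode that takes `num_classes` explicitly, as the fix in #51648 above describes. It is an illustration under assumptions, not the plugin's Java code: the function name and the convention that class values are integers in `0..num_classes-1` are assumed here.

```python
def weighted_mode(values, weights, num_classes):
    """Return the class with the largest total weight.

    Assumes class values are integers in the range 0..num_classes-1.
    The totals array is sized from `num_classes`, not from max(values).
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    totals = [0.0] * num_classes
    for value, weight in zip(values, weights):
        if not 0 <= value < num_classes:
            raise ValueError(f"class value {value} is outside 0..{num_classes - 1}")
        totals[value] += weight
    # Ties resolve to the lowest class index in this sketch.
    return max(range(num_classes), key=totals.__getitem__)


# Three classes exist, but the inputs happen to mention only classes 0 and 1.
winner = weighted_mode([0, 1, 1], [0.9, 0.3, 0.3], num_classes=3)
```

Sizing the totals from `num_classes` means a class that no input voted for still has a slot, which is what inferring the size from the largest input value could not guarantee.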
Przemysław Witek
683170b007
Increase the number of indexed documents to increase the chance that there are at least 2 training rows. (#51607) (#51615) 2020-01-29 17:17:19 +01:00
David Kyle
ca4b90a001
[ML] Calculate results and snapshot retention using latest bucket timestamps (#51061) (#51301)
The retention period is calculated relative to the last bucket result or snapshot
time rather than the wall clock.
2020-01-22 14:52:33 +00:00
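The retention change in #51061 above can be sketched in Python as below. The function names and sample timestamps are hypothetical. The only point is that the cutoff is derived from the latest bucket (or snapshot) timestamp instead of from the current time.

```python
from datetime import datetime, timedelta


def retention_cutoff(latest_timestamp, retention_days):
    """Cutoff measured back from the newest result, not from the wall clock."""
    return latest_timestamp - timedelta(days=retention_days)


def expired(timestamps, retention_days):
    """Return the timestamps that fall before the cutoff (hypothetical helper)."""
    if not timestamps:
        return []
    cutoff = retention_cutoff(max(timestamps), retention_days)
    return [t for t in timestamps if t < cutoff]


# Assumed example: a job whose newest bucket is from 2019, long before "now".
buckets = [
    datetime(2019, 1, 1),
    datetime(2019, 1, 20),
    datetime(2019, 2, 1),
]
to_delete = expired(buckets, retention_days=30)
```

With a wall-clock cutoff all three buckets would already be past a 30-day retention period. Measured from the latest bucket, only the first one is.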
Dimitris Athanasiou
59687a9384
[7.x][ML] Validate classification dependent_variable cardinality is at lea… (#51232) (#51309)
Data frame analytics classification currently only supports 2 classes for the
dependent variable. We were checking that the field's cardinality is not higher
than 2 but we should also check it is not less than that as otherwise the process
fails.

Backport of #51232
2020-01-22 16:51:16 +02:00
Benjamin Trent
2a73e849d6
[ML][Inference] fixing ingest IT tests (#51267) (#51311)
Converts InferenceIngestIT into a `ESRestTestCase`.

closes #51201
2020-01-22 09:50:17 -05:00
Przemysław Witek
bfcfcdee33
[7.x] Do not copy mapping from dependent variable to prediction field in regression analysis (#51227) (#51288) 2020-01-22 12:36:24 +01:00
Tom Veasey
32ec934b15
[7.x][ML] Assert top classes are ordered by score (#51028)
Backport #51003.
2020-01-16 12:23:15 +00:00
Benjamin Trent
72c270946f
[ML][Inference] Adding classification_weights to ensemble models (#50874) (#50994)
* [ML][Inference] Adding classification_weights to ensemble models

classification_weights are a way to allow models to
prefer specific classification results over others.
This might be advantageous if classification value
probabilities are a known quantity, and it can improve
model error rates.
2020-01-14 12:40:25 -05:00
Tom Veasey
de5713fa4b
[ML] Disable invalid assertion (#50988)
Backport #50986.
2020-01-14 17:35:00 +00:00
Dimitris Athanasiou
1d8cb3c741
[7.x][ML] Add num_top_feature_importance_values param to regression and classi… (#50914) (#50976)
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on SHAP (SHapley Additive exPlanations) method.

Backport of #50914
2020-01-14 16:46:09 +02:00
Przemysław Witek
9c6ffdc2be
[7.x] Handle nested and aliased fields correctly when copying mapping. (#50918) (#50968) 2020-01-14 14:43:39 +01:00
Benjamin Trent
fa116a6d26
[7.x] [ML][Inference] PUT API (#50852) (#50887)
* [ML][Inference] PUT API (#50852)

This adds the `PUT` API for creating trained models that support our format.

This includes

* HLRC change for the API
* API creation
* Validations of model format and call

* fixing backport
2020-01-12 10:59:11 -05:00
Benjamin Trent
cc0e64572a
[ML][Inference][HLRC] Add necessary lang ident classes (#50705) (#50794)
This adds the necessary named XContent classes to the HLRC for the lang ident model. This is so the HLRC can call `GET _ml/inference/lang_ident_model_1?include_definition=true` without XContent parsing errors.

The constructors are package private since these classes are used exclusively within the pre-packaged model (and require the specific weights, etc. to be of any use).
2020-01-09 10:33:38 -05:00
Benjamin Trent
060e0a6277
[ML][Inference] Add support for models shipped as resources (#50680) (#50700)
This adds support for models that are shipped as resources in the ML plugin. The first of which is the `lang_ident` model.
2020-01-07 09:21:59 -05:00
Przemysław Witek
4116452d90
Implement testStopAndRestart for ClassificationIT (#50585) (#50698) 2020-01-07 13:41:37 +01:00
Przemysław Witek
8917c05df8
[7.x] Synchronize processInStream.close() call (#50581) 2020-01-03 10:23:51 +01:00