Commit Graph

18 Commits

Author SHA1 Message Date
Przemysław Witek acbd48f834
[ML] Allow setting num_top_classes to a special value -1 (#63587) (#63602) 2020-10-13 13:57:50 +02:00
Lisa Cawley 57ea5d27ae [DOCS] Add experimental tag to data frame analytics APIs (#63153) 2020-10-02 09:44:40 -07:00
Benjamin Trent 1ae2923632
[7.x] [ML] adding docs + hlrc for data frame analysis feature_processors (#61149) (#61493)
* [ML] adding docs + hlrc for data frame analysis feature_processors (#61149)

Adds HLRC and some docs for the new feature_processors field in Data frame analytics.

Co-authored-by: Przemysław Witek <przemyslaw.witek@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-08-24 12:56:21 -04:00
Dimitris Athanasiou b2243337d8
[7.x][ML] Data frame analytics max_num_threads setting (#59254) (#59308)
This adds a setting to data frame analytics jobs called
`max_number_threads`. The setting expects a positive integer.
When used the user specifies the max number of threads that may
be used by the analysis. Note that the actual number of threads
used is limited by the number of processors on the node where
the job is assigned. Also, the process may use a couple more threads
for operational functionality that is not the analysis itself.

This setting may also be updated for a stopped job.

More threads may reduce the time it takes to complete the job at the cost
of using more CPU.

Backport of #59254 and #57274
2020-07-09 19:15:46 +03:00
Dimitris Athanasiou 75dadb7a6d
[7.x][ML] Add loss_function to regression (#56118) (#56187)
Adds parameters `loss_function` and `loss_function_parameter`
to regression.

Backport of #56118
2020-05-05 14:59:51 +03:00
Tom Veasey 690099553c
[7.x][ML] Adds the class_assignment_objective parameter to classification (#53552)
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.

Fixes #52427.
2020-03-13 17:35:51 +00:00
Dimitris Athanasiou 1d8cb3c741
[7.x][ML] Add num_top_feature_importance_values param to regression and classi… (#50914) (#50976)
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on SHAP (SHapley Additive exPlanations) method.

Backport of #50914
2020-01-14 16:46:09 +02:00
Dimitris Athanasiou 8891f4db88
[7.x][ML] Introduce randomize_seed setting for regression and classification (#49990) (#50023)
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.

Backport of #49990
2019-12-10 15:29:19 +02:00
Dimitris Athanasiou 4edb2e7bb6
[7.x][ML] Add optional source filtering during data frame reindexing (#49690) (#49718)
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes #49531

Backport of #49690
2019-11-29 16:10:44 +02:00
Przemysław Witek 28f68fa221
Make num_top_classes parameter's default value equal to 2 (#48119) (#48201) 2019-10-17 18:43:15 +02:00
Przemysław Witek d210bfa888
[7.x] Add MlClientDocumentationIT tests for classification. (#47569) (#47896) 2019-10-11 10:19:55 +02:00
Dimitris Athanasiou 7667ea5f6f
[7.x][ML] Additional outlier detection parameters (#47600) (#47669)
Adds the following parameters to `outlier_detection`:

- `compute_feature_influence` (boolean): whether to compute or not
   feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
   to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
   to the feature values

Backport of #47600
2019-10-07 18:21:33 +03:00
Lisa Cawley d62e1a3d8b [DOCS] Fixes data frame analytics job terminology in HLRC (#46758) 2019-09-16 10:07:59 -07:00
Lisa Cawley dddc9b3d73 [DOCS] Updates dataframe transform terminology (#46642) 2019-09-16 08:32:13 -07:00
Lisa Cawley 7461259ba6 [DOCS] Adds missing icons to ML HLRC APIs (#46515) 2019-09-10 08:28:02 -07:00
Dimitris Athanasiou bb8fcb3cac
[7.x][ML][HLRC] Add data frame analytics regression analysis (#46024) (#46053) 2019-08-28 12:02:14 +03:00
Dimitris Athanasiou dd6c13fdf9
[ML] Add description to DF analytics (#45774) (#46019) 2019-08-27 15:48:59 +03:00
Dimitris Athanasiou 126c2fd2d5
[7.x][ML] Machine learning data frame analytics (#43544) (#43592)
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.

A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.

The APIs are:

- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}

When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:

1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index

In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:

- POST _ml/data_frame/_evaluate
2019-06-25 20:29:11 +03:00