[role="xpack"]
[testenv="platinum"]
[[put-dfanalytics]]
=== Create {dfanalytics-jobs} API
[subs="attributes"]
++++
<titleabbrev>Create {dfanalytics-jobs}</titleabbrev>
++++

Instantiates a {dfanalytics-job}.

experimental[]

[[ml-put-dfanalytics-request]]
==== {api-request-title}

`PUT _ml/data_frame/analytics/<data_frame_analytics_id>`

[[ml-put-dfanalytics-prereq]]
==== {api-prereq-title}

If the {es} {security-features} are enabled, you must have the following
built-in roles and privileges:

* `machine_learning_admin`
* `kibana_admin` (UI only)

* source indices: `read`, `view_index_metadata`
* destination index: `read`, `create_index`, `manage` and `index`
* cluster: `monitor` (UI only)

For more information, see <<security-privileges>> and <<built-in-roles>>.

NOTE: The {dfanalytics-job} remembers which roles the user who created it had at
the time of creation. When you start the job, it performs the analysis using
those same roles. If you provide
<<http-clients-secondary-authorization,secondary authorization headers>>,
those credentials are used instead.

[[ml-put-dfanalytics-desc]]
==== {api-description-title}

This API creates a {dfanalytics-job} that performs an analysis on the source
indices and stores the outcome in a destination index.

If the destination index does not exist, it is created automatically when you
start the job. See <<start-dfanalytics>>.

[[ml-hyperparam-optimization]]
If you supply only a subset of the {regression} or {classification} parameters,
_hyperparameter optimization_ occurs. It determines a value for each of the
undefined parameters.
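
As a minimal sketch of this behavior, the following request sets only `eta` and
`max_trees` explicitly, so hyperparameter optimization would determine values
for the remaining parameters, such as `gamma`, `lambda`, and
`feature_bag_fraction`. The job ID, index names, and field name are
hypothetical, and the parameter values are arbitrary placeholders rather than
recommendations.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example-partial-hyperparameters
{
  "source": {
    "index": "example-source-index" <1>
  },
  "dest": {
    "index": "example-dest-index"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "example_numeric_field",
      "eta": 0.1, <2>
      "max_trees": 100
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical indices]

<1> Hypothetical index and field names, used for illustration only.
<2> Only `eta` and `max_trees` are supplied. The other hyperparameters are left
undefined so that they are determined by the optimization.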

////
The starting point is calculated for data-dependent parameters by examining the
loss on the training data. Subject to the size constraint, this operation
provides an upper bound on the improvement in validation loss.

The optimization starts with a random search, then performs Bayesian
optimization that targets maximum expected improvement. If you override a
parameter by explicitly setting it, the optimization uses the value you provided
for that parameter and calculates the values of the remaining parameters
accordingly. The number of rounds is reduced accordingly. The validation error
is estimated in each round by using 4-fold cross validation.
////

[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}

`<data_frame_analytics_id>`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]

[role="child_attributes"]
[[ml-put-dfanalytics-request-body]]
==== {api-request-body-title}

`allow_lazy_start`::
(Optional, boolean)
Specifies whether this job can start when there is insufficient {ml} node
capacity for it to be immediately assigned to a node. The default is `false`; if
a {ml} node with capacity to run the job cannot immediately be found, the
<<start-dfanalytics>> API returns an error. However, this is also subject to the
cluster-wide `xpack.ml.max_lazy_ml_nodes` setting. See <<advanced-ml-settings>>.
If this option is set to `true`, the API does not return an error and the job
waits in the `starting` state until sufficient {ml} node capacity is available.

//Begin analysis
`analysis`::
(Required, object)
The analysis configuration, which contains the information necessary to perform
one of the following types of analysis: {classification}, {oldetection}, or
{regression}.
+
.Properties of `analysis`
[%collapsible%open]
====
//Begin classification
`classification`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-classification.html[{classification}].
+
TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by hyperparameter optimization to give the minimum validation
error. It is highly recommended to use the default values unless you fully
understand the function of these parameters.
+
.Properties of `classification`
[%collapsible%open]
=====
`class_assignment_objective`::::
(Optional, string)
Defines the objective to optimize when assigning class labels:
`maximize_accuracy` or `maximize_minimum_recall`. When maximizing accuracy,
class labels are chosen to maximize the number of correct predictions. When
maximizing minimum recall, labels are chosen to maximize the minimum recall
for any class. Defaults to `maximize_minimum_recall`.

`dependent_variable`::::
(Required, string)
+
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
categorical (`ip` or `keyword`), or boolean. There must be no more than 30
different values in this field.

`eta`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

`feature_bag_fraction`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]

`gamma`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

`lambda`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

`max_trees`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]

`num_top_classes`::::
(Optional, integer)
Defines the number of categories for which the predicted probabilities are
reported. It must be non-negative. If it is greater than the total number of
categories, the API reports all category probabilities. Defaults to 2.

`num_top_feature_importance_values`::::
(Optional, integer)
Advanced configuration option. Specifies the maximum number of
{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
By default, it is zero and no {feat-imp} calculation occurs.

`prediction_field_name`::::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]

`randomize_seed`::::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]

`training_percent`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
//End classification
=====
//Begin outlier_detection
`outlier_detection`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-outlier-detection.html[{oldetection}].
+
.Properties of `outlier_detection`
[%collapsible%open]
=====
`compute_feature_influence`::::
(Optional, boolean)
If `true`, the feature influence calculation is enabled. Defaults to `true`.

`feature_influence_threshold`::::
(Optional, double)
The minimum {olscore} that a document needs to have in order to calculate its
{fiscore}. Value range: 0-1 (`0.1` by default).

`method`::::
(Optional, string)
Sets the method that {oldetection} uses. If the method is not set, {oldetection}
uses an ensemble of different methods, then normalizes and combines their
individual {olscores} to obtain the overall {olscore}. We recommend using the
ensemble method. Available methods are `lof`, `ldof`, `distance_kth_nn`, and
`distance_knn`.

`n_neighbors`::::
(Optional, integer)
Defines how many nearest neighbors each method of {oldetection} uses to
calculate its {olscore}. When the value is not set, different values are used
for different ensemble members, which helps improve diversity in the ensemble.
Only override this value if you are confident that the value you choose is
appropriate for the data set.

`outlier_fraction`::::
(Optional, double)
Sets the proportion of the data set that is assumed to be outlying prior to
{oldetection}. For example, 0.05 means it is assumed that 5% of values are real
outliers and 95% are inliers.

`standardization_enabled`::::
(Optional, boolean)
If `true`, then the following operation is performed on the columns before
computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to `true`. For
more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
//End outlier_detection
=====
//Begin regression
`regression`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-regression.html[{regression}].
+
TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
automatically by hyperparameter optimization to give minimum validation error.
It is highly recommended to use the default values unless you fully understand
the function of these parameters.
+
.Properties of `regression`
[%collapsible%open]
=====
`dependent_variable`::::
(Required, string)
+
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
The data type of the field must be numeric.

`eta`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

`feature_bag_fraction`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]

`gamma`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

`lambda`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

`max_trees`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]

`num_top_feature_importance_values`::::
(Optional, integer)
Advanced configuration option. Specifies the maximum number of
{ml-docs}/ml-feature-importance.html[{feat-imp}] values per document to return.
By default, it is zero and no {feat-imp} calculation occurs.

`prediction_field_name`::::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]

`randomize_seed`::::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]

`training_percent`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
=====
//End regression
====
//End analysis

//Begin analyzed_fields
`analyzed_fields`::
(Optional, object)
Specify `includes` and/or `excludes` patterns to select which fields will be
included in the analysis. The patterns specified in `excludes` are applied last;
therefore, `excludes` takes precedence. In other words, if the same field is
specified in both `includes` and `excludes`, then the field will not be included
in the analysis.
+
--
[[dfa-supported-fields]]
The supported fields for each type of analysis are as follows:

* {oldetection-cap} requires numeric or boolean data to analyze. The algorithms
don't support missing values; therefore, fields that have data types other than
numeric or boolean are ignored. Documents where included fields contain missing
values, null values, or an array are also ignored. Therefore, the `dest` index
may contain documents that don't have an {olscore}.
* {regression-cap} supports fields that are numeric, `boolean`, `text`,
`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
supported are included in the analysis; other fields are ignored. Documents
where included fields contain an array with two or more values are also
ignored. Documents in the `dest` index that don't contain a results field are
not included in the {reganalysis}.
* {classification-cap} supports fields that are numeric, `boolean`, `text`,
`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
supported are included in the analysis; other fields are ignored. Documents
where included fields contain an array with two or more values are also ignored.
Documents in the `dest` index that don't contain a results field are not
included in the {classanalysis}. {classanalysis-cap} can be improved by mapping
ordinal variable values to a single number. For example, in the case of age
ranges, you can model the values as "0-14" = 0, "15-24" = 1, "25-34" = 2, and
so on.

If `analyzed_fields` is not set, only the relevant fields will be included. For
example, all the numeric fields are included for {oldetection}. For more
information about field selection, see <<explain-dfanalytics>>.
--
+
.Properties of `analyzed_fields`
[%collapsible%open]
====
`excludes`:::
(Optional, array)
An array of strings that defines the fields that will be excluded from the
analysis. You do not need to add fields with unsupported data types to
`excludes`; these fields are excluded from the analysis automatically.

`includes`:::
(Optional, array)
An array of strings that defines the fields that will be included in the
analysis.
//End analyzed_fields
====

`description`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=description-dfa]

`dest`::
(Required, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=dest]

`model_memory_limit`::
(Optional, string)
The approximate maximum amount of memory resources that are permitted for
analytical processing. The default value for {dfanalytics-jobs} is `1gb`. If
your `elasticsearch.yml` file contains an `xpack.ml.max_model_memory_limit`
setting, an error occurs when you try to create {dfanalytics-jobs} that have
`model_memory_limit` values greater than that setting. For more information, see
<<ml-settings>>.

`source`::
(object)
The configuration of how to source the analysis data. It requires an `index`.
Optionally, `query` and `_source` may be specified.
+
.Properties of `source`
[%collapsible%open]
====
`index`:::
(Required, string or array) Index or indices on which to perform the analysis.
It can be a single index or index pattern as well as an array of indices or
patterns.
+
WARNING: If your source indices contain documents with the same IDs, only the
document that is indexed last appears in the destination index.

`query`:::
(Optional, object) The {es} query domain-specific language (<<query-dsl,DSL>>).
This value corresponds to the query object in an {es} search POST body. All the
options that are supported by {es} can be used, as this object is passed
verbatim to {es}. By default, this property has the following value:
`{"match_all": {}}`.

`_source`:::
(Optional, object) Specify `includes` and/or `excludes` patterns to select which
fields will be present in the destination. Fields that are excluded cannot be
included in the analysis.
+
.Properties of `_source`
[%collapsible%open]
=====
`includes`::::
(array) An array of strings that defines the fields that will be included in the
destination.

`excludes`::::
(array) An array of strings that defines the fields that will be excluded from
the destination.
=====
====

[[ml-put-dfanalytics-example]]
==== {api-examples-title}

[[ml-put-dfanalytics-example-preprocess]]
===== Preprocessing actions example

The following example shows how to limit the scope of the analysis to certain
fields, specify excluded fields in the destination index, and use a query to
filter your data before analysis.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/model-flight-delays-pre
{
  "source": {
    "index": [
      "kibana_sample_data_flights" <1>
    ],
    "query": { <2>
      "range": {
        "DistanceKilometers": {
          "gt": 0
        }
      }
    },
    "_source": { <3>
      "includes": [],
      "excludes": [
        "FlightDelay",
        "FlightDelayType"
      ]
    }
  },
  "dest": { <4>
    "index": "df-flight-delays",
    "results_field": "ml-results"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90
    }
  },
  "analyzed_fields": { <5>
    "includes": [],
    "excludes": [
      "FlightNum"
    ]
  },
  "model_memory_limit": "100mb"
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]

<1> The source index to analyze.
<2> This query filters out entire documents. Documents that do not match the
query are not present in the destination index.
<3> The `_source` object defines fields in the data set that will be included in
or excluded from the destination index. In this case, `includes` does not
specify any fields, so the default behavior takes place: all the fields of the
source index are included except the ones that are explicitly specified in
`excludes`.
<4> Defines the destination index that contains the results of the analysis and
the fields of the source index specified in the `_source` object. Also defines
the name of the `results_field`.
<5> Specifies fields to be included in or excluded from the analysis. This does
not affect whether the fields will be present in the destination index; it only
affects whether they are used in the analysis.

In this example, all the fields of the source index are included in the
destination index except `FlightDelay` and `FlightDelayType`, because these are
defined as excluded fields by the `excludes` parameter of the `_source` object.
The `FlightNum` field is included in the destination index; however, it is not
included in the analysis because it is explicitly specified as an excluded field
by the `excludes` parameter of the `analyzed_fields` object.
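
The precedence rule for `analyzed_fields` can also be sketched with a request
in which the same field appears in both arrays. Because the `excludes` patterns
are applied last, `example_field_b` would not be included in the analysis. All
names in this sketch (job ID, indices, and fields) are hypothetical.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example-field-selection
{
  "source": {
    "index": "example-source-index" <1>
  },
  "dest": {
    "index": "example-dest-index"
  },
  "analysis": {
    "outlier_detection": {
      "outlier_fraction": 0.05
    }
  },
  "analyzed_fields": {
    "includes": [ "example_field_a", "example_field_b" ],
    "excludes": [ "example_field_b" ] <2>
  }
}
--------------------------------------------------
// TEST[skip:hypothetical indices]

<1> Hypothetical index and field names, used for illustration only.
<2> `example_field_b` is listed in both arrays, so `excludes` takes precedence
and only `example_field_a` remains a candidate for the analysis.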

[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example

The following example creates the `loganalytics` {dfanalytics-job} with the
`outlier_detection` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "description": "Outlier detection on log data",
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

The API returns the following result:

[source,console-result]
----
{
  "id": "loganalytics",
  "description": "Outlier detection on log data",
  "source": {
    "index": ["logdata"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "logdata_out",
    "results_field": "ml"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  },
  "model_memory_limit": "1gb",
  "create_time" : 1562265491319,
  "version" : "7.6.0",
  "allow_lazy_start" : false
}
----
// TESTRESPONSE[s/1562265491319/$body.$_path/]
// TESTRESPONSE[s/"version" : "7.6.0"/"version" : $body.version/]
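
As a further sketch, the `method`, `n_neighbors`, and
`feature_influence_threshold` properties from the request body can be set
explicitly instead of relying on the defaults. The job ID and destination index
name are hypothetical, and the values are arbitrary examples, not tuned
recommendations. Note that the property description recommends the ensemble
method, which is used when `method` is omitted.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example-loganalytics-lof
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "example-logdata-lof-out" <1>
  },
  "analysis": {
    "outlier_detection": {
      "method": "lof", <2>
      "n_neighbors": 20, <3>
      "feature_influence_threshold": 0.1 <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical job and destination index]

<1> Hypothetical destination index name.
<2> One of the available methods: `lof`, `ldof`, `distance_kth_nn`, or
`distance_knn`.
<3> An arbitrary illustrative value. When omitted, different values are used
for different ensemble members.
<4> The documented default, shown here only to make the setting visible.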

[[ml-put-dfanalytics-example-r]]
===== {regression-cap} examples

The following example creates the `house_price_regression_analysis`
{dfanalytics-job} with the `regression` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price"
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following result:

[source,console-result]
----
{
  "id" : "house_price_regression_analysis",
  "source" : {
    "index" : [
      "houses_sold_last_10_yrs"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "house_price_predictions",
    "results_field" : "ml"
  },
  "analysis" : {
    "regression" : {
      "dependent_variable" : "price",
      "training_percent" : 100
    }
  },
  "model_memory_limit" : "1gb",
  "create_time" : 1567168659127,
  "version" : "8.0.0",
  "allow_lazy_start" : false
}
----
// TESTRESPONSE[s/1567168659127/$body.$_path/]
// TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]

The following example creates a job and specifies a training percent:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
{
  "source": {
    "index": "student_performance_mathematics"
  },
  "dest": {
    "index": "student_performance_mathematics_reg"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "G3",
      "training_percent": 70, <1>
      "randomize_seed": 19673948271 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> The `training_percent` defines the percentage of the data set that will be
used for training the model.
<2> The `randomize_seed` is the seed used to randomly pick which data is used
for training.
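
The `allow_lazy_start` and `model_memory_limit` properties can be combined with
any analysis type. The following sketch reuses the source index from the
previous example with a hypothetical job ID and destination index. With
`allow_lazy_start` set to `true`, starting the job would not return an error
when no {ml} node has capacity; instead, the job would wait in the `starting`
state, subject to the cluster-wide `xpack.ml.max_lazy_ml_nodes` setting. The
memory limit is an arbitrary example value.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example-student-performance-lazy
{
  "source": {
    "index": "student_performance_mathematics"
  },
  "dest": {
    "index": "example-student-performance-lazy" <1>
  },
  "analysis": {
    "regression": {
      "dependent_variable": "G3"
    }
  },
  "model_memory_limit": "200mb", <2>
  "allow_lazy_start": true <3>
}
--------------------------------------------------
// TEST[skip:hypothetical job and destination index]

<1> Hypothetical destination index name.
<2> An arbitrary illustrative value. It must not exceed
`xpack.ml.max_model_memory_limit` if that setting is configured.
<3> The default is `false`.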

[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job} with
the `classification` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
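
As a variation on the previous request, the following sketch changes the class
assignment objective and requests {feat-imp} values. The job ID and destination
index name are hypothetical, and the chosen values are illustrative only.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example-loan-classification-tuned
{
  "source": {
    "index": "loan-applicants"
  },
  "dest": {
    "index": "example-loan-applicants-tuned" <1>
  },
  "analysis": {
    "classification": {
      "dependent_variable": "label",
      "class_assignment_objective": "maximize_accuracy", <2>
      "num_top_feature_importance_values": 3 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical job and destination index]

<1> Hypothetical destination index name.
<2> Overrides the default, `maximize_minimum_recall`.
<3> An arbitrary illustrative value. The default is zero, in which case no
{feat-imp} calculation occurs.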