[role="xpack"]
[testenv="platinum"]
[[put-dfanalytics]]
=== Create {dfanalytics-jobs} API
[subs="attributes"]
++++
<titleabbrev>Create {dfanalytics-jobs}</titleabbrev>
++++

Instantiates a {dfanalytics-job}.

experimental[]

[[ml-put-dfanalytics-request]]
==== {api-request-title}

`PUT _ml/data_frame/analytics/<data_frame_analytics_id>`

[[ml-put-dfanalytics-prereq]]
==== {api-prereq-title}

* You must have the `machine_learning_admin` built-in role to use this API. You
must also have `read` and `view_index_metadata` privileges on the source index
and `read`, `create_index`, and `index` privileges on the destination index. For
more information, see <<security-privileges>> and <<built-in-roles>>.

[[ml-put-dfanalytics-desc]]
==== {api-description-title}

This API creates a {dfanalytics-job} that performs an analysis on the source
index and stores the outcome in a destination index.

The destination index is created automatically if it does not exist. The
`index.number_of_shards` and `index.number_of_replicas` settings of the source
index are copied to the destination index. When the source index matches
multiple indices, these settings are set to the maximum values found in the
source indices.

The API also attempts to copy the mappings of the source indices to the
destination index. However, if the mappings of any field don't match among the
source indices, the attempt fails with an error message.

If the destination index already exists, it is used as is. This makes it
possible to set up the destination index in advance with custom settings and
mappings.
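
As a minimal sketch of setting up the destination index in advance, the
following request creates an index with custom settings before the job is
created. The index name `logdata_out` matches the {oldetection} example in this
topic; the shard and replica counts are illustrative assumptions, not
recommendations. Custom mappings could be supplied in the same request body.

[source,console]
--------------------------------------------------
PUT logdata_out
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}
--------------------------------------------------
// TEST[skip:TBD]

Because the index already exists when the job is created, these settings are
kept instead of being copied from the source index.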

[[ml-put-dfanalytics-supported-fields]]
===== Supported fields

====== {oldetection-cap}

{oldetection-cap} requires numeric or boolean data to analyze. Fields that have
data types other than numeric or boolean are ignored. The algorithms don't
support missing values, so documents where included fields contain missing
values, null values, or an array are also ignored. Therefore, the `dest` index
may contain documents that don't have an {olscore}.

====== {regression-cap}

{regression-cap} supports fields that are numeric, boolean, text, keyword, and
ip. It also tolerates missing values. Supported fields are included in the
analysis; other fields are ignored. Documents where included fields contain an
array with two or more values are also ignored. Documents in the `dest` index
that don't contain a results field are not included in the {reganalysis}.

====== {classification-cap}

{classification-cap} supports fields that are numeric, boolean, text, keyword,
and ip. It also tolerates missing values. Supported fields are included in the
analysis; other fields are ignored. Documents where included fields contain an
array with two or more values are also ignored. Documents in the `dest` index
that don't contain a results field are not included in the {classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, for age ranges, you can model the values as
"0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
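
One possible way to apply such an ordinal mapping is sketched below with an
update by query request and a Painless script. This is only an illustration
under stated assumptions: the index name `loan-applicants` is borrowed from the
{classification} example in this topic, while the `age_range` and `age_rank`
field names and the rank values are hypothetical.

[source,console]
--------------------------------------------------
POST loan-applicants/_update_by_query
{
  "query": {
    "exists": { "field": "age_range" }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.age_rank = params.ranks[ctx._source.age_range]",
    "params": {
      "ranks": { "0-14": 0, "15-24": 1, "25-34": 2 }
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

The script looks up each document's age range in the `ranks` parameter and
stores the matching number in a separate numeric field, which can then be
included in the analysis in place of the original text value.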

[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}

`<data_frame_analytics_id>`::
(Required, string) A string that uniquely identifies the {dfanalytics-job}.
This identifier can contain lowercase alphanumeric characters (a-z and 0-9),
hyphens, and underscores. It must start and end with alphanumeric characters.

[[ml-put-dfanalytics-request-body]]
==== {api-request-body-title}

`analysis`::
(Required, object) Defines the type of {dfanalytics} you want to perform on
your source index. For example: `outlier_detection`. See
<<dfanalytics-types>>.

`analyzed_fields`::
(Optional, object) You can specify `includes` patterns, `excludes` patterns, or
both. If `analyzed_fields` is not set, only the relevant fields are included;
for example, all the numeric fields for {oldetection}. For the supported field
types, see <<ml-put-dfanalytics-supported-fields>>. If you specify fields
(either in `includes` or in `excludes`) that have a data type that is not
supported, an error occurs.

`includes`:::
(Optional, array) An array of strings that defines the fields that are
included in the analysis.

`excludes`:::
(Optional, array) An array of strings that defines the fields that are
excluded from the analysis. You do not need to add fields with unsupported
data types to `excludes`; these fields are excluded from the analysis
automatically.

`description`::
(Optional, string) A description of the job.

`dest`::
(Required, object) The destination configuration, consisting of `index` and
optionally `results_field` (`ml` by default).

`index`:::
(Required, string) Defines the _destination index_ to store the results of
the {dfanalytics-job}.

`results_field`:::
(Optional, string) Defines the name of the field in which to store the
results of the analysis. Defaults to `ml`.

`model_memory_limit`::
(Optional, string) The approximate maximum amount of memory resources that are
permitted for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.

`source`::
(Required, object) The source configuration, consisting of `index` and
optionally a `query`.

`index`:::
(Required, string or array) Index or indices on which to perform the
analysis. It can be a single index or index pattern, or an array of indices or
patterns.

`query`:::
(Optional, object) The {es} query domain-specific language
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
search POST body. All the options that are supported by {es} can be used,
as this object is passed verbatim to {es}. By default, this property has
the following value: `{"match_all": {}}`.

`allow_lazy_start`::
(Optional, boolean) Whether this job should be allowed to start when there
is insufficient {ml} node capacity for it to be immediately assigned to a node.
The default is `false`, which means that the <<start-dfanalytics>> returns an
error if a {ml} node with capacity to run the job cannot immediately be found.
(However, this is also subject to the cluster-wide `xpack.ml.max_lazy_ml_nodes`
setting; see <<advanced-ml-settings>>.) If this option is set to `true`, the
<<start-dfanalytics>> does not return an error, and the job waits in the
`starting` state until sufficient {ml} node capacity is available.
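
The optional properties described above can be combined in a single request.
The following is a sketch rather than a recommended configuration: the job ID,
the destination index, the `outliers` results field, the `@timestamp`, `bytes`,
`latency_*`, and `latency_raw` field names, and the `500mb` limit are all
hypothetical values chosen for illustration. Only the `logdata` source index is
taken from the examples in this topic.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_recent
{
  "description": "Outlier detection on recent log data",
  "source": {
    "index": "logdata",
    "query": {
      "range": {
        "@timestamp": { "gte": "now-30d" }
      }
    }
  },
  "dest": {
    "index": "logdata_recent_out",
    "results_field": "outliers"
  },
  "analyzed_fields": {
    "includes": [ "bytes", "latency_*" ],
    "excludes": [ "latency_raw" ]
  },
  "analysis": {
    "outlier_detection": {}
  },
  "model_memory_limit": "500mb",
  "allow_lazy_start": true
}
--------------------------------------------------
// TEST[skip:TBD]

In this sketch, the `query` restricts the analysis to documents from the last
30 days, `analyzed_fields` narrows the analysis to a few numeric fields, and
`allow_lazy_start` lets the job wait in the `starting` state when no {ml} node
has capacity for it.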

[[ml-put-dfanalytics-example]]
==== {api-examples-title}

[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example

The following example creates the `loganalytics` {dfanalytics-job} with the
`outlier_detection` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "description": "Outlier detection on log data",
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

The API returns the following result:

[source,console-result]
----
{
  "id" : "loganalytics",
  "description": "Outlier detection on log data",
  "source" : {
    "index" : [
      "logdata"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "logdata_out",
    "results_field" : "ml"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  },
  "model_memory_limit" : "1gb",
  "create_time" : 1562351429434,
  "version" : "7.3.0",
  "allow_lazy_start" : false
}
----
// TESTRESPONSE[s/1562351429434/$body.$_path/]
// TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]


[[ml-put-dfanalytics-example-r]]
===== {regression-cap} examples

The following example creates the `house_price_regression_analysis`
{dfanalytics-job} with the `regression` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price"
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]


The API returns the following result:

[source,console-result]
----
{
  "id" : "house_price_regression_analysis",
  "source" : {
    "index" : [
      "houses_sold_last_10_yrs"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "house_price_predictions",
    "results_field" : "ml"
  },
  "analysis" : {
    "regression" : {
      "dependent_variable" : "price",
      "training_percent" : 100
    }
  },
  "model_memory_limit" : "1gb",
  "create_time" : 1567168659127,
  "version" : "8.0.0",
  "allow_lazy_start" : false
}
----
// TESTRESPONSE[s/1567168659127/$body.$_path/]
// TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]


The following example creates a job and specifies a training percent:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
{
  "source": {
    "index": "student_performance_mathematics"
  },
  "dest": {
    "index": "student_performance_mathematics_reg"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "G3",
      "training_percent": 70 <1>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> The `training_percent` defines the percentage of the data set that is used
for training the model.
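
After such a job has run, the documents that were held out of training can be
inspected in the destination index. The following search is a sketch that
assumes the job above has completed, that the results are stored under the
default `ml` results field, and that each result document carries an
`is_training` flag under that field.

[source,console]
--------------------------------------------------
GET student_performance_mathematics_reg/_search
{
  "query": {
    "term": { "ml.is_training": false }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

With a `training_percent` of 70, roughly 30 percent of the eligible documents
would be expected to match this query.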


[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job} with
the `classification` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
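
Creating a job only stores its configuration; the analysis does not run until
the job is started. As a sketch, assuming the `loan_classification` job above
was created successfully, it could then be started and its progress checked
with the following requests:

[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/loan_classification/_start

GET _ml/data_frame/analytics/loan_classification/_stats
--------------------------------------------------
// TEST[skip:TBD]

If the job was created with `allow_lazy_start` set to `true` and no {ml} node
has capacity for it, the job is expected to wait in the `starting` state
rather than fail.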