[role="xpack"]
[testenv="platinum"]
[[put-dfanalytics]]
=== Create {dfanalytics-jobs} API
[subs="attributes"]
++++
<titleabbrev>Create {dfanalytics-jobs}</titleabbrev>
++++

Instantiates a {dfanalytics-job}.

experimental[]

[[ml-put-dfanalytics-request]]
==== {api-request-title}

`PUT _ml/data_frame/analytics/<data_frame_analytics_id>`


[[ml-put-dfanalytics-prereq]]
==== {api-prereq-title}

* You must have the `machine_learning_admin` built-in role to use this API. You
must also have `read` and `view_index_metadata` privileges on the source index
and `read`, `create_index`, and `index` privileges on the destination index. For
more information, see <<security-privileges>> and <<built-in-roles>>.


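As a minimal sketch of the index privileges listed in the prerequisites, the
following request creates a custom role with the create or update roles API.
The role name and the index names are illustrative assumptions, and a user with
this role would still need the `machine_learning_admin` built-in role.

[source,console]
--------------------------------------------------
PUT _security/role/dfa_logdata_access <1>
{
  "indices": [
    {
      "names": [ "logdata" ], <2>
      "privileges": [ "read", "view_index_metadata" ]
    },
    {
      "names": [ "logdata_out" ], <3>
      "privileges": [ "read", "create_index", "index" ]
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch]

<1> `dfa_logdata_access` is a hypothetical role name.
<2> Assumed name of the source index.
<3> Assumed name of the destination index.
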
[[ml-put-dfanalytics-desc]]
==== {api-description-title}

This API creates a {dfanalytics-job} that performs an analysis on the source
index and stores the outcome in a destination index.

The destination index is automatically created if it does not exist. The
`index.number_of_shards` and `index.number_of_replicas` settings of the source
index are copied over to the destination index. When the source index matches
multiple indices, these settings are set to the maximum values found in the
source indices.

The API also attempts to copy the mappings of the source indices to the
destination index. However, if the mappings of any of the fields don't match
among the source indices, the attempt fails with an error message.

If the destination index already exists, it is used as is. This makes it
possible to set up the destination index in advance with custom settings and
mappings.

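For instance, a destination index could be set up in advance along the
following lines. This is only a sketch: the index name, the setting values, and
the mapped field are assumptions chosen for illustration, not values that the
API requires.

[source,console]
--------------------------------------------------
PUT logdata_out <1>
{
  "settings": {
    "number_of_shards": 1, <2>
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "request": {
        "properties": {
          "bytes": { "type": "long" } <3>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative sketch]

<1> Assumed name of the destination index. It must match the `dest.index` value
of the {dfanalytics-job}.
<2> Example values only. Choose settings that suit your cluster.
<3> A hypothetical field mapping. Map the fields that you expect in the
destination index.
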
[[ml-put-dfanalytics-supported-fields]]
===== Supported fields

====== {oldetection-cap}

{oldetection-cap} requires numeric or boolean data to analyze. Fields that have
data types other than numeric or boolean are ignored. The algorithms don't
support missing values, so documents where included fields contain missing
values, null values, or an array are also ignored. Therefore the `dest` index
may contain documents that don't have an {olscore}.


====== {regression-cap}

{regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
and `ip`. It is also tolerant of missing values. Supported fields are included
in the analysis; other fields are ignored. Documents where included fields
contain an array with two or more values are also ignored. Documents in the
`dest` index that don't contain a results field are not included in the
{reganalysis}.


====== {classification-cap}

{classification-cap} supports fields that are numeric, `boolean`, `text`,
`keyword`, and `ip`. It is also tolerant of missing values. Supported fields are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don't contain a results field are not included in the
{classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.


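One possible way to apply the ordinal mapping described above is at ingest time
with a script processor. The following pipeline is a sketch under stated
assumptions: the pipeline name and the `age_range` and `age_ordinal` field names
are hypothetical, and this page does not prescribe any particular mechanism for
the mapping.

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age-range-to-ordinal <1>
{
  "description": "Maps an age range label to an ordinal number",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.age_ordinal = params.ranks.get(ctx.age_range)", <2>
        "params": {
          "ranks": { "0-14": 0, "15-24": 1, "25-34": 2 } <3>
        }
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:illustrative sketch]

<1> Hypothetical pipeline name.
<2> `age_range` and `age_ordinal` are assumed field names. A label that is not
listed in `ranks` results in a null `age_ordinal` value.
<3> The mapping from the age range example. Extend it with the remaining ranges.
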
[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}

`<data_frame_analytics_id>`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]

[[ml-put-dfanalytics-request-body]]
==== {api-request-body-title}

`analysis`::
(Required, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]

`analyzed_fields`::
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields]

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "analyzed_fields": {
    "includes": [ "request.bytes", "response.counts.error" ],
    "excludes": [ "source.geo" ]
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]


`description`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=description-dfa]

`dest`::
(Required, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=dest]

`model_memory_limit`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit-dfa]

`source`::
(Required, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=source-put-dfa]

`allow_lazy_start`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]


[[ml-put-dfanalytics-example]]
==== {api-examples-title}


[[ml-put-dfanalytics-example-preprocess]]
===== Preprocessing actions example

The following example shows how to limit the scope of the analysis to certain
fields, specify excluded fields in the destination index, and use a query to
filter your data before analysis.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/model-flight-delays-pre
{
  "source": {
    "index": [
      "kibana_sample_data_flights" <1>
    ],
    "query": { <2>
      "range": {
        "DistanceKilometers": {
          "gt": 0
        }
      }
    },
    "_source": { <3>
      "includes": [],
      "excludes": [
        "FlightDelay",
        "FlightDelayType"
      ]
    }
  },
  "dest": { <4>
    "index": "df-flight-delays",
    "results_field": "ml-results"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90
    }
  },
  "analyzed_fields": { <5>
    "includes": [],
    "excludes": [
      "FlightNum"
    ]
  },
  "model_memory_limit": "100mb"
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]

<1> The source index to analyze.
<2> This query filters the source data. Documents that don't match the query are
excluded from the analysis and are not present in the destination index.
<3> The `_source` object defines the fields in the dataset that are included in
or excluded from the destination index. In this case, `includes` does not
specify any fields, so the default behavior applies: all the fields of the
source index are included except the ones that are explicitly specified in
`excludes`.
<4> Defines the destination index that contains the results of the analysis and
the fields of the source index specified in the `_source` object. Also defines
the name of the `results_field`.
<5> Specifies the fields to be included in or excluded from the analysis. This
does not affect whether the fields are present in the destination index; it
only affects whether they are used in the analysis.

In this example, all the fields of the source index are included in the
destination index except `FlightDelay` and `FlightDelayType`, because these are
defined as excluded fields by the `excludes` parameter of the `_source` object.
The `FlightNum` field is included in the destination index. However, it is not
included in the analysis because it is explicitly specified as an excluded field
by the `excludes` parameter of the `analyzed_fields` object.


[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example

The following example creates the `loganalytics` {dfanalytics-job} with the
`outlier_detection` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "description": "Outlier detection on log data",
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]


The API returns the following result:

[source,console-result]
----
{
  "id": "loganalytics",
  "description": "Outlier detection on log data",
  "source": {
    "index": ["logdata"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "logdata_out",
    "results_field": "ml"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  },
  "model_memory_limit": "1gb",
  "create_time": 1562265491319,
  "version": "7.6.0",
  "allow_lazy_start": false
}
----
// TESTRESPONSE[s/1562265491319/$body.$_path/]
// TESTRESPONSE[s/"version": "7.6.0"/"version": $body.version/]


[[ml-put-dfanalytics-example-r]]
===== {regression-cap} examples

The following example creates the `house_price_regression_analysis`
{dfanalytics-job} with the `regression` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price"
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]


The API returns the following result:

[source,console-result]
----
{
  "id" : "house_price_regression_analysis",
  "source" : {
    "index" : [
      "houses_sold_last_10_yrs"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "house_price_predictions",
    "results_field" : "ml"
  },
  "analysis" : {
    "regression" : {
      "dependent_variable" : "price",
      "training_percent" : 100
    }
  },
  "model_memory_limit" : "1gb",
  "create_time" : 1567168659127,
  "version" : "8.0.0",
  "allow_lazy_start" : false
}
----
// TESTRESPONSE[s/1567168659127/$body.$_path/]
// TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]


The following example creates a job and specifies a training percent:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
{
  "source": {
    "index": "student_performance_mathematics"
  },
  "dest": {
    "index": "student_performance_mathematics_reg"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "G3",
      "training_percent": 70, <1>
      "randomize_seed": 19673948271 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> The `training_percent` defines the percentage of the data set that will be
used for training the model.
<2> The `randomize_seed` is the seed used to randomly pick which data is used
for training.


[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job} with
the `classification` analysis type:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source": {
    "index": "loan-applicants"
  },
  "dest": {
    "index": "loan-applicants-classified"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]