[DOCS] Moves analysis resources to PUT DFA API docs (#50704)

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
István Zoltán Szabó 2020-01-09 13:57:11 +01:00
parent acd73dda1c
commit 4e1107d5d7
6 changed files with 237 additions and 314 deletions

@ -1,217 +0,0 @@
[role="xpack"]
[testenv="platinum"]
[[ml-dfa-analysis-objects]]
=== Analysis configuration objects
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.
This page lists all the available parameters that you can use in the `analysis`
object grouped by {dfanalytics} types.
[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects
An `outlier_detection` configuration object has the following properties:
`compute_feature_influence`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
`feature_influence_threshold`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
`method`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=method]
`n_neighbors`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
`outlier_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
`standardization_enabled`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
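As an illustration, the following request sketches how some of these properties
could be set when creating an {oldetection} job. The job ID, the index names, and
all parameter values are assumptions chosen for this example, not
recommendations:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/logdata_outliers <1>
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true, <2>
      "n_neighbors": 20,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> Hypothetical job ID and index names.
<2> Example values only. Any property that is omitted keeps its default.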
[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
"source": {
"index": "houses_sold_last_10_yrs" <1>
},
"dest": {
"index": "house_price_predictions" <2>
},
"analysis":
{
"regression": { <3>
"dependent_variable": "price" <4>
}
}
}
--------------------------------------------------
// TEST[skip:TBD]
<1> Training data is taken from source index `houses_sold_last_10_yrs`.
<2> Analysis results will be output to destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> The {reganalysis} uses the `price` field as the dependent variable. Because
no other parameters are specified, it trains on 100% of the eligible data,
stores its predictions in the `price_prediction` field of the destination index,
and uses built-in hyperparameter optimization to minimize the validation error.
[float]
[[regression-resources-standard]]
===== Standard parameters
`dependent_variable`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
--
The data type of the field must be numeric.
--
`prediction_field_name`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`training_percent`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`randomize_seed`::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
[float]
[[regression-resources-advanced]]
===== Advanced parameters
Advanced parameters are for fine-tuning {reganalysis}. If you do not supply
them, <<ml-hyperparam-optimization,hyperparameter optimization>> sets their
values automatically to give minimum validation error. It is highly recommended
to use the default values unless you fully understand the function of these
parameters.
`eta`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`feature_bag_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`maximum_number_trees`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`gamma`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`lambda`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects
[float]
[[classification-resources-standard]]
===== Standard parameters
`dependent_variable`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
--
The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
categorical (`ip`, `keyword`, `text`), or boolean.
--
`num_top_classes`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
`prediction_field_name`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`training_percent`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`randomize_seed`::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
[float]
[[classification-resources-advanced]]
===== Advanced parameters
Advanced parameters are for fine-tuning {classanalysis}. If you do not supply
them, <<ml-hyperparam-optimization,hyperparameter optimization>> sets their
values automatically to give minimum validation error. It is highly recommended
to use the default values unless you fully understand the function of these
parameters.
`eta`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`feature_bag_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`maximum_number_trees`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`gamma`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`lambda`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
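For illustration, a {classification} job that sets only standard parameters
might look like the following sketch. The job ID, index names, dependent
variable, and all values are hypothetical. Because no advanced parameters are
supplied, they are left to hyperparameter optimization:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification <1>
{
  "source": {
    "index": "loans"
  },
  "dest": {
    "index": "loans_classified"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "defaulted", <2>
      "num_top_classes": 2,
      "training_percent": 80,
      "randomize_seed": 42
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> Hypothetical job ID and index names.
<2> Assumed to be a boolean field in the hypothetical `loans` index. The
remaining values are examples only.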
[discrete]
[[ml-hyperparam-optimization]]
==== Hyperparameter optimization
If you don't supply {regression} or {classification} parameters, hyperparameter
optimization is performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.
A fixed number of rounds is used for optimization, which depends on the number
of parameters being optimized. The optimization starts with a random search,
followed by Bayesian optimization that targets maximum expected improvement. If
you override any parameters, the optimization uses the values you provided for
them and calculates values for the remaining parameters. The number of rounds is
reduced accordingly. The validation error is estimated in each round by using
4-fold cross-validation.
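As a sketch of overriding parameters, the following request reuses the house
price example and sets `eta` and `maximum_number_trees` explicitly. The chosen
values are assumptions for illustration only. The parameters that are not set
(`feature_bag_fraction`, `gamma`, and `lambda`) remain subject to hyperparameter
optimization:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05, <1>
      "maximum_number_trees": 500 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> Example value, not a recommendation. The optimization uses it as provided.
<2> Example value. The maximum value is 2000.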

@ -14,8 +14,6 @@ You can use the following APIs to perform {ml} {dfanalytics} activities.
 * <<evaluate-dfanalytics,Evaluate {dfanalytics}>>
 * <<explain-dfanalytics,Explain {dfanalytics}>>
-For the `analysis` object resources, check <<ml-dfa-analysis-objects>>.
 You can use the following APIs to perform {infer} operations.

@ -53,41 +53,25 @@ If the destination index already exists, then it will be use as is. This makes
 it possible to set up the destination index in advance with custom settings
 and mappings.
-[[ml-put-dfanalytics-supported-fields]]
-===== Supported fields
-====== {oldetection-cap}
-{oldetection-cap} requires numeric or boolean data to analyze. The algorithms
-don't support missing values therefore fields that have data types other than
-numeric or boolean are ignored. Documents where included fields contain missing
-values, null values, or an array are also ignored. Therefore the `dest` index
-may contain documents that don't have an {olscore}.
-====== {regression-cap}
-{regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
-and `ip`. It is also tolerant of missing values. Fields that are supported are
-included in the analysis, other fields are ignored. Documents where included
-fields contain an array with two or more values are also ignored. Documents in
-the `dest` index that don't contain a results field are not included in the
-{reganalysis}.
-====== {classification-cap}
-{classification-cap} supports fields that are numeric, `boolean`, `text`,
-`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
-supported are included in the analysis, other fields are ignored. Documents
-where included fields contain an array with two or more values are also ignored.
-Documents in the `dest` index that don't contain a results field are not
-included in the {classanalysis}.
-{classanalysis-cap} can be improved by mapping ordinal variable values to a
-single number. For example, in case of age ranges, you can model the values as
-"0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
+[discrete]
+[[ml-hyperparam-optimization]]
+===== Hyperparameter optimization
+If you don't supply {regression} or {classification} parameters, _hyperparameter
+optimization_ occurs, which sets a value for the undefined parameters. The
+starting point is calculated for data dependent parameters by examining the loss
+on the training data. Subject to the size constraint, this operation provides an
+upper bound on the improvement in validation loss.
+A fixed number of rounds is used for optimization which depends on the number of
+parameters being optimized. The optimization starts with random search, then
+Bayesian optimization is performed that is targeting maximum expected
+improvement. If you override any parameters,
+//TBD: What is meant by overriding them? Explicitly setting the parameter instead of letting it take the default?
+the optimization calculates the value of the remaining parameters accordingly
+and uses the value you provided for the overridden parameter. The number of
+rounds is reduced accordingly. The validation error is estimated in each round
+by using 4-fold cross validation.
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}
@ -99,36 +83,170 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]
 [[ml-put-dfanalytics-request-body]]
 ==== {api-request-body-title}
+`allow_lazy_start`::
+(Optional, boolean)
+include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
 `analysis`::
 (Required, object)
-include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
+The analysis configuration, which contains the information necessary to perform
+one of the following types of analysis: {classification}, {oldetection}, or
+{regression}.
+//include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
+`analysis`.`classification`:::
+(Required^*^, object)
+The configuration information necessary to perform
+{ml-docs}/dfa-classification.html[{classification}].
++
+--
+TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
+automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
+to give minimum validation error. It is highly recommended to use the default
+values unless you fully understand the function of these parameters.
+--
+`analysis`.`classification`.`dependent_variable`::::
+(Required, string)
++
+--
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
+categorical (`ip`, `keyword`, `text`), or boolean.
+--
+`analysis`.`classification`.`eta`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+`analysis`.`classification`.`feature_bag_fraction`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
+`analysis`.`classification`.`maximum_number_trees`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
+`analysis`.`classification`.`gamma`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+`analysis`.`classification`.`lambda`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
+`analysis`.`classification`.`num_top_classes`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
+`analysis`.`classification`.`prediction_field_name`::::
+(Optional, string)
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
+`analysis`.`classification`.`randomize_seed`::::
+(Optional, long)
+include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
+`analysis`.`classification`.`training_percent`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
+`analysis`.`outlier_detection`:::
+(Required^*^, object)
+The configuration information necessary to perform
+{ml-docs}/dfa-outlier-detection.html[{oldetection}]:
+`analysis`.`outlier_detection`.`compute_feature_influence`::::
+(Optional, boolean)
+include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
+`analysis`.`outlier_detection`.`feature_influence_threshold`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
+`analysis`.`outlier_detection`.`method`::::
+(Optional, string)
+include::{docdir}/ml/ml-shared.asciidoc[tag=method]
+`analysis`.`outlier_detection`.`n_neighbors`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
+`analysis`.`outlier_detection`.`outlier_fraction`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
+`analysis`.`outlier_detection`.`standardization_enabled`::::
+(Optional, boolean)
+include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
+`analysis`.`regression`:::
+(Required^*^, object)
+The configuration information necessary to perform
+{ml-docs}/dfa-regression.html[{regression}].
++
+--
+TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
+automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
+to give minimum validation error. It is highly recommended to use the default
+values unless you fully understand the function of these parameters.
+--
+`analysis`.`regression`.`dependent_variable`::::
+(Required, string)
++
+--
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+The data type of the field must be numeric.
+--
+`analysis`.`regression`.`eta`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+`analysis`.`regression`.`feature_bag_fraction`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
+`analysis`.`regression`.`maximum_number_trees`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
+`analysis`.`regression`.`gamma`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+`analysis`.`regression`.`lambda`::::
+(Optional, double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
+`analysis`.`regression`.`prediction_field_name`::::
+(Optional, string)
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
+`analysis`.`regression`.`training_percent`::::
+(Optional, integer)
+include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
+`analysis`.`regression`.`randomize_seed`::::
+(Optional, long)
+include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
 `analyzed_fields`::
 (Optional, object)
 include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields]
-[source,console]
---------------------------------------------------
-PUT _ml/data_frame/analytics/loganalytics
-{
-"source": {
-"index": "logdata"
-},
-"dest": {
-"index": "logdata_out"
-},
-"analysis": {
-"outlier_detection": {
-}
-},
-"analyzed_fields": {
-"includes": [ "request.bytes", "response.counts.error" ],
-"excludes": [ "source.geo" ]
-}
-}
---------------------------------------------------
-// TEST[setup:setup_logdata]
+`analyzed_fields`.`excludes`:::
+(Optional, array)
+include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-excludes]
+`analyzed_fields`.`includes`:::
+(Optional, array)
+include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-includes]
 `description`::
 (Optional, string)
@ -146,15 +264,9 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit-dfa]
 (object)
 include::{docdir}/ml/ml-shared.asciidoc[tag=source-put-dfa]
-`allow_lazy_start`::
-(Optional, boolean)
-include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 [[ml-put-dfanalytics-example-preprocess]]
 ===== Preprocessing actions example

@ -93,22 +93,47 @@ end::analysis-limits[]
 tag::analyzed-fields[]
 Specify `includes` and/or `excludes` patterns to select which fields will be
-included in the analysis. If `analyzed_fields` is not set, only the relevant
-fields will be included. For example, all the numeric fields for {oldetection}.
-For the supported field types, see <<ml-put-dfanalytics-supported-fields>>. Also
-see the <<explain-dfanalytics>> which helps understand field selection.
-`includes`:::
-(Optional, array) An array of strings that defines the fields that will be
-included in the analysis.
-`excludes`:::
-(Optional, array) An array of strings that defines the fields that will be
-excluded from the analysis. You do not need to add fields with unsupported
-data types to `excludes`, these fields are excluded from the analysis
-automatically.
+included in the analysis.
++
+--
+The supported fields for each type of analysis are as follows:
+* {oldetection-cap} requires numeric or boolean data to analyze. The algorithms
+don't support missing values therefore fields that have data types other than
+numeric or boolean are ignored. Documents where included fields contain missing
+values, null values, or an array are also ignored. Therefore the `dest` index
+may contain documents that don't have an {olscore}.
+* {regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
+and `ip`. It is also tolerant of missing values. Fields that are supported are
+included in the analysis, other fields are ignored. Documents where included
+fields contain an array with two or more values are also ignored. Documents in
+the `dest` index that don't contain a results field are not included in the
+{reganalysis}.
+* {classification-cap} supports fields that are numeric, `boolean`, `text`,
+`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
+supported are included in the analysis, other fields are ignored. Documents
+where included fields contain an array with two or more values are also ignored.
+Documents in the `dest` index that don't contain a results field are not
+included in the {classanalysis}. {classanalysis-cap} can be improved by mapping
+ordinal variable values to a single number. For example, in case of age ranges,
+you can model the values as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
+If `analyzed_fields` is not set, only the relevant fields will be included. For
+example, all the numeric fields for {oldetection}. For more information about
+field selection, see <<explain-dfanalytics>>.
+--
 end::analyzed-fields[]
+tag::analyzed-fields-excludes[]
+An array of strings that defines the fields that will be excluded from the
+analysis. You do not need to add fields with unsupported data types to
+`excludes`, these fields are excluded from the analysis automatically.
+end::analyzed-fields-excludes[]
+tag::analyzed-fields-includes[]
+An array of strings that defines the fields that will be included in the analysis.
+end::analyzed-fields-includes[]
 tag::background-persist-interval[]
 Advanced configuration option. The time between each periodic persistence of the
 model. The default value is a randomized value between 3 to 4 hours, which
@ -511,11 +536,11 @@ identifier when you want to update a specific detector.
 end::detector-index[]
 tag::eta[]
-The shrinkage applied to the weights. Smaller values result
-in larger forests which have better generalization error. However, the smaller
-the value the longer the training will take. For more information, see
-https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
-about shrinkage.
+Advanced configuration option. The shrinkage applied to the weights. Smaller
+values result in larger forests which have better generalization error. However,
+the smaller the value the longer the training will take. For more information
+about shrinkage, see
+https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article].
 end::eta[]
 tag::exclude-frequent[]
@ -532,8 +557,8 @@ included.
 end::exclude-interim-results[]
 tag::feature-bag-fraction[]
-Defines the fraction of features that will be used when
-selecting a random bag for each candidate split.
+Advanced configuration option. Defines the fraction of features that will be
+used when selecting a random bag for each candidate split.
 end::feature-bag-fraction[]
 tag::feature-influence-threshold[]
@ -594,10 +619,10 @@ The analysis function that is used. For example, `count`, `rare`, `mean`, `min`,
 end::function[]
 tag::gamma[]
-Regularization parameter to prevent overfitting on the
-training dataset. Multiplies a linear penalty associated with the size of
+Advanced configuration option. Regularization parameter to prevent overfitting
+on the training dataset. Multiplies a linear penalty associated with the size of
 individual trees in the forest. The higher the value the more training will
 prefer smaller trees. The smaller this parameter the larger individual trees
 will be and the longer training will take.
 end::gamma[]
@ -691,10 +716,10 @@ For more information, see <<ml-jobstats>>.
 end::jobs-stats-anomaly-detection[]
 tag::lambda[]
-Regularization parameter to prevent overfitting on the
-training dataset. Multiplies an L2 regularisation term which applies to leaf
-weights of the individual trees in the forest. The higher the value the more
-training will attempt to keep leaf weights small. This makes the prediction
+Advanced configuration option. Regularization parameter to prevent overfitting
+on the training dataset. Multiplies an L2 regularisation term which applies to
+leaf weights of the individual trees in the forest. The higher the value the
+more training will attempt to keep leaf weights small. This makes the prediction
 function smoother at the expense of potentially not being able to capture
 relevant relationships between the features and the {depvar}. The smaller this
 parameter the larger individual trees will be and the longer training will take.
@ -723,8 +748,8 @@ until it is explicitly stopped. By default this setting is not set.
 end::max-empty-searches[]
 tag::maximum-number-trees[]
-Defines the maximum number of trees the forest is allowed
-to contain. The maximum value is 2000.
+Advanced configuration option. Defines the maximum number of trees the forest is
+allowed to contain. The maximum value is 2000.
 end::maximum-number-trees[]
 tag::memory-estimation[]

@ -298,3 +298,9 @@ See <<ml-get-bucket>>,
 <<ml-get-category>>, and
 [[ml-results-overall-buckets]]
 <<ml-get-overall-buckets>>.
+[role="exclude",id="ml-dfa-analysis-objects"]
+=== Analysis configuration objects
+This page was deleted.
+See <<put-dfanalytics>>.

@ -2,11 +2,10 @@
 [[api-definitions]]
 == Definitions
-These resource definitions are used in APIs related to {ml-features} and
-{security-features} and in {kib} advanced {ml} job configuration options.
+The role mappings resource definition you can find below is used in APIs related
+to security features.
+* <<role-mapping-resources,Role mappings>>
-* <<ml-dfa-analysis-objects>>
-* <<role-mapping-resources,Role mappings>>
-include::{es-repo-dir}/ml/df-analytics/apis/analysisobjects.asciidoc[]
 include::{xes-repo-dir}/rest-api/security/role-mapping-resources.asciidoc[]