Revert "[DOCS] Moves analysis resources to PUT DFA API docs (#50704)"

This reverts commit 4e1107d5d7.
István Zoltán Szabó 2020-01-09 14:31:35 +01:00
parent 4e1107d5d7
commit 71afeec7d0
6 changed files with 314 additions and 237 deletions

@ -0,0 +1,217 @@
[role="xpack"]
[testenv="platinum"]
[[ml-dfa-analysis-objects]]
=== Analysis configuration objects
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.
This page lists all the available parameters that you can use in the `analysis`
object grouped by {dfanalytics} types.
[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects
An `outlier_detection` configuration object has the following properties:
`compute_feature_influence`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
`feature_influence_threshold`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
`method`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=method]
`n_neighbors`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
`outlier_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
`standardization_enabled`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
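The following request is a minimal sketch of how the properties above might be
combined in one `outlier_detection` object. The job ID and index names are
hypothetical, and the values are chosen only for illustration; they are not
tuning recommendations.
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_outliers
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_outliers"
  },
  "analysis": {
    "outlier_detection": {
      "method": "lof", <1>
      "n_neighbors": 20,
      "compute_feature_influence": true,
      "feature_influence_threshold": 0.1,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> Assumes that the local outlier factor method (`lof`) suits the data. Omit
`method` to use the default behavior.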
[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
"source": {
"index": "houses_sold_last_10_yrs" <1>
},
"dest": {
"index": "house_price_predictions" <2>
},
"analysis":
{
"regression": { <3>
"dependent_variable": "price" <4>
}
}
}
--------------------------------------------------
// TEST[skip:TBD]
<1> Training data is taken from source index `houses_sold_last_10_yrs`.
<2> Analysis results will be output to destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis will use field `price` to train on. Because no other
parameters are specified, it trains on 100% of eligible data, stores its
prediction in the destination index field `price_prediction`, and uses built-in
hyperparameter optimization to minimize the validation error.
[float]
[[regression-resources-standard]]
===== Standard parameters
`dependent_variable`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
--
The data type of the field must be numeric.
--
`prediction_field_name`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`training_percent`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`randomize_seed`::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
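As a sketch of how the standard parameters fit together, the following request
sets each of them explicitly. The job ID, the destination index, and the
`predicted_price` field name are hypothetical, and the values are only
illustrative.
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_holdout
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_holdout"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price", <1>
      "training_percent": 80, <2>
      "randomize_seed": 42 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> Results are written to `predicted_price` instead of the default
`price_prediction`.
<2> Only 80% of the eligible documents are used for training.
<3> An arbitrary seed value, supplied so that the same documents are picked for
training on each run.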
[float]
[[regression-resources-advanced]]
===== Advanced parameters
Advanced parameters are for fine-tuning {reganalysis}. If you do not supply
them, <<ml-hyperparam-optimization,hyperparameter optimization>> sets their
values automatically to give minimum validation error. It is highly recommended
to use the default values unless you fully understand the function of these
parameters.
`eta`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`feature_bag_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`maximum_number_trees`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`gamma`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`lambda`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects
[float]
[[classification-resources-standard]]
===== Standard parameters
`dependent_variable`::
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
+
--
The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
categorical (`ip`, `keyword`, `text`), or boolean.
--
`num_top_classes`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
`prediction_field_name`::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`training_percent`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`randomize_seed`::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
[float]
[[classification-resources-advanced]]
===== Advanced parameters
Advanced parameters are for fine-tuning {classanalysis}. If you do not supply
them, <<ml-hyperparam-optimization,hyperparameter optimization>> sets their
values automatically to give minimum validation error. It is highly recommended
to use the default values unless you fully understand the function of these
parameters.
`eta`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`feature_bag_fraction`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`maximum_number_trees`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`gamma`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`lambda`::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
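The following request is a minimal sketch of a `classification` object that
relies on the default advanced parameters. The job ID, the index names, and the
`churned` field are hypothetical; the example assumes that `churned` is a
boolean or keyword field.
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/customer_churn_classification
{
  "source": {
    "index": "customer_activity"
  },
  "dest": {
    "index": "customer_churn_predictions"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "churned", <1>
      "num_top_classes": 2, <2>
      "training_percent": 75
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> The field whose class the analysis learns to predict.
<2> Requests results for the two most probable classes per document.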
[discrete]
[[ml-hyperparam-optimization]]
==== Hyperparameter optimization
If you don't supply {regression} or {classification} parameters, hyperparameter
optimization is performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.
A fixed number of rounds is used for optimization; the number depends on the
number of parameters being optimized. The optimization starts with a random
search, followed by Bayesian optimization that targets maximum expected
improvement. If you override any parameters, the optimization uses the values
you provided for the overridden parameters and calculates the values of the
remaining parameters accordingly. The number of rounds is reduced
correspondingly. The validation error is estimated in each round by using 4-fold
cross-validation.
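As an illustration of overriding part of the search, the following sketch fixes
two advanced parameters and leaves `feature_bag_fraction`, `gamma`, and `lambda`
unset so that the optimization chooses them. The job ID and destination index
are hypothetical, and the values are examples only.
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_fixed_eta
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_fixed_eta"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05, <1>
      "maximum_number_trees": 500 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
<1> The supplied value is used as is and is not optimized.
<2> Must not exceed the documented maximum of 2000.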

@ -14,6 +14,8 @@ You can use the following APIs to perform {ml} {dfanalytics} activities.
* <<evaluate-dfanalytics,Evaluate {dfanalytics}>>
* <<explain-dfanalytics,Explain {dfanalytics}>>
For the `analysis` object resources, check <<ml-dfa-analysis-objects>>.
You can use the following APIs to perform {infer} operations.

@ -53,25 +53,41 @@ If the destination index already exists, then it will be used as is. This makes
it possible to set up the destination index in advance with custom settings
and mappings.
[discrete]
[[ml-hyperparam-optimization]]
===== Hyperparameter optimization
[[ml-put-dfanalytics-supported-fields]]
===== Supported fields
If you don't supply {regression} or {classification} parameters, _hyperparameter
optimization_ occurs, which sets a value for the undefined parameters. The
starting point is calculated for data-dependent parameters by examining the loss
on the training data. Subject to the size constraint, this operation provides an
upper bound on the improvement in validation loss.
====== {oldetection-cap}
{oldetection-cap} requires numeric or boolean data to analyze. The algorithms
don't support missing values; therefore, fields that have data types other than
numeric or boolean are ignored. Documents where included fields contain missing
values, null values, or an array are also ignored. Therefore, the `dest` index
may contain documents that don't have an {olscore}.
====== {regression-cap}
{regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
and `ip`. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don't contain a results field are not included in the
{reganalysis}.
====== {classification-cap}
{classification-cap} supports fields that are numeric, `boolean`, `text`,
`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
supported are included in the analysis; other fields are ignored. Documents
where included fields contain an array with two or more values are also ignored.
Documents in the `dest` index that don't contain a results field are not
included in the {classanalysis}.
{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
A fixed number of rounds is used for optimization; the number depends on the
number of parameters being optimized. The optimization starts with a random
search, followed by Bayesian optimization that targets maximum expected
improvement. If you override any parameters,
//TBD: What is meant by overriding them? Explicitly setting the parameter instead of letting it take the default?
the optimization calculates the value of the remaining parameters accordingly
and uses the value you provided for the overridden parameter. The number of
rounds is reduced correspondingly. The validation error is estimated in each
round by using 4-fold cross-validation.
[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}
@ -83,170 +99,36 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-data-frame-analytics-define]
[[ml-put-dfanalytics-request-body]]
==== {api-request-body-title}
`allow_lazy_start`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
`analysis`::
(Required, object)
The analysis configuration, which contains the information necessary to perform
one of the following types of analysis: {classification}, {oldetection}, or
{regression}.
//include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
`analysis`.`classification`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-classification.html[{classification}].
+
--
TIP: Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters.
--
`analysis`.`classification`.`dependent_variable`::::
(Required, string)
+
--
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
The data type of the field must be numeric (`integer`, `short`, `long`, `byte`),
categorical (`ip`, `keyword`, `text`), or boolean.
--
`analysis`.`classification`.`eta`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`analysis`.`classification`.`feature_bag_fraction`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`analysis`.`classification`.`maximum_number_trees`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`analysis`.`classification`.`gamma`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`analysis`.`classification`.`lambda`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
`analysis`.`classification`.`num_top_classes`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=num-top-classes]
`analysis`.`classification`.`prediction_field_name`::::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`analysis`.`classification`.`randomize_seed`::::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
`analysis`.`classification`.`training_percent`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`analysis`.`outlier_detection`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-outlier-detection.html[{oldetection}]:
`analysis`.`outlier_detection`.`compute_feature_influence`::::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
`analysis`.`outlier_detection`.`feature_influence_threshold`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
`analysis`.`outlier_detection`.`method`::::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=method]
`analysis`.`outlier_detection`.`n_neighbors`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
`analysis`.`outlier_detection`.`outlier_fraction`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
`analysis`.`outlier_detection`.`standardization_enabled`::::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
`analysis`.`regression`:::
(Required^*^, object)
The configuration information necessary to perform
{ml-docs}/dfa-regression.html[{regression}].
+
--
TIP: Advanced parameters are for fine-tuning {reganalysis}. They are set
automatically by <<ml-hyperparam-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters.
--
`analysis`.`regression`.`dependent_variable`::::
(Required, string)
+
--
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent-variable]
The data type of the field must be numeric.
--
`analysis`.`regression`.`eta`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
`analysis`.`regression`.`feature_bag_fraction`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
`analysis`.`regression`.`maximum_number_trees`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=maximum-number-trees]
`analysis`.`regression`.`gamma`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
`analysis`.`regression`.`lambda`::::
(Optional, double)
include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
`analysis`.`regression`.`prediction_field_name`::::
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction-field-name]
`analysis`.`regression`.`training_percent`::::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=training-percent]
`analysis`.`regression`.`randomize_seed`::::
(Optional, long)
include::{docdir}/ml/ml-shared.asciidoc[tag=randomize-seed]
include::{docdir}/ml/ml-shared.asciidoc[tag=analysis]
`analyzed_fields`::
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields]
`analyzed_fields`.`excludes`:::
(Optional, array)
include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-excludes]
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
"source": {
"index": "logdata"
},
"dest": {
"index": "logdata_out"
},
"analysis": {
"outlier_detection": {
}
},
"analyzed_fields": {
"includes": [ "request.bytes", "response.counts.error" ],
"excludes": [ "source.geo" ]
}
}
--------------------------------------------------
// TEST[setup:setup_logdata]
`analyzed_fields`.`includes`:::
(Optional, array)
include::{docdir}/ml/ml-shared.asciidoc[tag=analyzed-fields-includes]
`description`::
(Optional, string)
@ -264,9 +146,15 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit-dfa]
(object)
include::{docdir}/ml/ml-shared.asciidoc[tag=source-put-dfa]
`allow_lazy_start`::
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-lazy-start]
[[ml-put-dfanalytics-example]]
==== {api-examples-title}
[[ml-put-dfanalytics-example-preprocess]]
===== Preprocessing actions example

@ -93,47 +93,22 @@ end::analysis-limits[]
tag::analyzed-fields[]
Specify `includes` and/or `excludes` patterns to select which fields will be
included in the analysis.
+
--
The supported fields for each type of analysis are as follows:
included in the analysis. If `analyzed_fields` is not set, only the relevant
fields will be included. For example, all the numeric fields for {oldetection}.
For the supported field types, see <<ml-put-dfanalytics-supported-fields>>. Also
see <<explain-dfanalytics>>, which helps you understand field selection.
* {oldetection-cap} requires numeric or boolean data to analyze. The algorithms
don't support missing values; therefore, fields that have data types other than
numeric or boolean are ignored. Documents where included fields contain missing
values, null values, or an array are also ignored. Therefore, the `dest` index
may contain documents that don't have an {olscore}.
* {regression-cap} supports fields that are numeric, `boolean`, `text`, `keyword`,
and `ip`. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don't contain a results field are not included in the
{reganalysis}.
* {classification-cap} supports fields that are numeric, `boolean`, `text`,
`keyword`, and `ip`. It is also tolerant of missing values. Fields that are
supported are included in the analysis; other fields are ignored. Documents
where included fields contain an array with two or more values are also ignored.
Documents in the `dest` index that don't contain a results field are not
included in the {classanalysis}. {classanalysis-cap} can be improved by mapping
ordinal variable values to a single number. For example, in the case of age
ranges, you can model the values as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
If `analyzed_fields` is not set, only the relevant fields will be included. For
example, all the numeric fields for {oldetection}. For more information about
field selection, see <<explain-dfanalytics>>.
--
`includes`:::
(Optional, array) An array of strings that defines the fields that will be
included in the analysis.
`excludes`:::
(Optional, array) An array of strings that defines the fields that will be
excluded from the analysis. You do not need to add fields with unsupported
data types to `excludes`; these fields are excluded from the analysis
automatically.
end::analyzed-fields[]
tag::analyzed-fields-excludes[]
An array of strings that defines the fields that will be excluded from the
analysis. You do not need to add fields with unsupported data types to
`excludes`; these fields are excluded from the analysis automatically.
end::analyzed-fields-excludes[]
tag::analyzed-fields-includes[]
An array of strings that defines the fields that will be included in the analysis.
end::analyzed-fields-includes[]
tag::background-persist-interval[]
Advanced configuration option. The time between each periodic persistence of the
model. The default value is a randomized value between 3 and 4 hours, which
@ -536,11 +511,11 @@ identifier when you want to update a specific detector.
end::detector-index[]
tag::eta[]
Advanced configuration option. The shrinkage applied to the weights. Smaller
values result in larger forests which have better generalization error. However,
the smaller the value the longer the training will take. For more information
about shrinkage, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article].
The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
end::eta[]
tag::exclude-frequent[]
@ -557,8 +532,8 @@ included.
end::exclude-interim-results[]
tag::feature-bag-fraction[]
Advanced configuration option. Defines the fraction of features that will be
used when selecting a random bag for each candidate split.
Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
end::feature-bag-fraction[]
tag::feature-influence-threshold[]
@ -619,10 +594,10 @@ The analysis function that is used. For example, `count`, `rare`, `mean`, `min`,
end::function[]
tag::gamma[]
Advanced configuration option. Regularization parameter to prevent overfitting
on the training dataset. Multiplies a linear penalty associated with the size of
Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value the more training will
prefer smaller trees. The smaller this parameter the larger individual trees
prefer smaller trees. The smaller this parameter the larger individual trees
will be and the longer training will take.
end::gamma[]
@ -716,10 +691,10 @@ For more information, see <<ml-jobstats>>.
end::jobs-stats-anomaly-detection[]
tag::lambda[]
Advanced configuration option. Regularization parameter to prevent overfitting
on the training dataset. Multiplies an L2 regularization term which applies to
leaf weights of the individual trees in the forest. The higher the value the
more training will attempt to keep leaf weights small. This makes the prediction
Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to leaf
weights of the individual trees in the forest. The higher the value the more
training will attempt to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller this
parameter the larger individual trees will be and the longer training will take.
@ -748,8 +723,8 @@ until it is explicitly stopped. By default this setting is not set.
end::max-empty-searches[]
tag::maximum-number-trees[]
Advanced configuration option. Defines the maximum number of trees the forest is
allowed to contain. The maximum value is 2000.
Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
end::maximum-number-trees[]
tag::memory-estimation[]

@ -298,9 +298,3 @@ See <<ml-get-bucket>>,
<<ml-get-category>>, and
[[ml-results-overall-buckets]]
<<ml-get-overall-buckets>>.
[role="exclude",id="ml-dfa-analysis-objects"]
=== Analysis configuration objects
This page was deleted.
See <<put-dfanalytics>>.

@ -2,10 +2,11 @@
[[api-definitions]]
== Definitions
The role mappings resource definition you can find below is used in APIs related
to security features.
* <<role-mapping-resources,Role mappings>>
These resource definitions are used in APIs related to {ml-features} and
{security-features} and in {kib} advanced {ml} job configuration options.
* <<ml-dfa-analysis-objects>>
* <<role-mapping-resources,Role mappings>>
include::{es-repo-dir}/ml/df-analytics/apis/analysisobjects.asciidoc[]
include::{xes-repo-dir}/rest-api/security/role-mapping-resources.asciidoc[]