[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)

István Zoltán Szabó 2019-11-06 07:40:27 -05:00
parent 70765dfb05
commit 3c9bd13dca
3 changed files with 193 additions and 54 deletions


@@ -18,13 +18,14 @@

`analyzed_fields`::
(object) You can specify both `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example, all the numeric fields for {oldetection}. For the supported field
types, see <<ml-put-dfanalytics-supported-fields>>.

`includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.

[[regression-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
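For illustration, a minimal {reganalysis} configuration that uses only these
standard parameters might look like the following sketch; the `house-prices`
index, `price` field, and `predicted_price` name are hypothetical and not part
of this commit:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source" : {
    "index": "house-prices"
  },
  "dest" : {
    "index": "house-prices-predictions"
  },
  "analysis" : {
    "regression": {
      "dependent_variable": "price", <1>
      "prediction_field_name": "predicted_price", <2>
      "training_percent": 80 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> The numeric field that the model learns to predict.
<2> The name of the prediction field in the results. If omitted, it defaults to
`price_prediction`.
<3> 80 percent of the eligible documents are used for training.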
[float]
@@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects

[float]
[[classification-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--

`num_top_classes`::
(Optional, integer) Defines the number of categories for which the predicted
probabilities are reported. It must be non-negative. If it is greater than the
total number of categories to predict (in the {version} version of the {stack},
this is two), all category probabilities are reported. Defaults to `2`.

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]

[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
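As a hedged sketch only (the job name, index, `label` field, and the specific
values below are illustrative assumptions, not recommendations), a
{classanalysis} that supplies these advanced parameters explicitly might look
like this; a parameter that is supplied is not tuned by the hyperparameter
optimization described below:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_tuned
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified-tuned"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "eta": 0.1,
      "feature_bag_fraction": 0.8,
      "maximum_number_trees": 500,
      "gamma": 0.5,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]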
[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} or {classification} parameters, hyperparameter
optimization will be performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization, which depends on the number
of parameters being optimized. The optimization starts with random search, then


@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
that don't contain a results field are not included in the {reganalysis}.
====== {classification-cap}

{classification-cap} supports fields that are numeric, boolean, text, keyword,
and ip. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don't contain a results field are not included in the
{classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
Fields that are highly correlated with the `dependent_variable` should be
excluded from the analysis. For example, if you have a multi-value field as the
`dependent_variable`, {es} maps it both as `text` and `keyword`, which results
in two fields (`field` and `field.keyword`). You must exclude the field with
the `text` mapping to get exact results from the analysis.
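For example, assuming a hypothetical multi-field `label` that is mapped as both
`text` and `keyword`, the `text` variant can be excluded as in this sketch:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_keyword_only
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified-keyword"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label.keyword"
    }
  },
  "analyzed_fields": {
    "excludes": ["label"] <1>
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Excludes the `text` mapping so that only the `keyword` variant,
`label.keyword`, is seen by the analysis.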
[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}
@@ -154,6 +174,7 @@ that don't contain a results field are not included in the {reganalysis}.
[[ml-put-dfanalytics-example]]
==== {api-examples-title}

[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example
@@ -305,3 +326,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
<1> The `training_percent` defines the percentage of the data set that will be used
for training the model.
[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job}; the
analysis type is `classification`:
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
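Creating the job does not run the analysis. As a minimal follow-up sketch, the
job can then be started by ID with the start {dfanalytics-jobs} API:

[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/loan_classification/_start
--------------------------------------------------
// TEST[skip:TBD]

When the job completes, the classification results are written to the
`loan-applicants-classified` destination index.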


@@ -0,0 +1,70 @@
tag::dependent_variable[]
`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. It is also known as the continuous target
variable.
end::dependent_variable[]
tag::eta[]
`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests, which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
end::eta[]
tag::feature_bag_fraction[]
`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
end::feature_bag_fraction[]
tag::gamma[]
`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer the training will take.
end::gamma[]
tag::lambda[]
`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to leaf
weights of the individual trees in the forest. The higher the value, the more
training will attempt to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller this
parameter, the larger individual trees will be and the longer the training
will take.
end::lambda[]
tag::maximum_number_trees[]
`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
end::maximum_number_trees[]
tag::prediction_field_name[]
`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
end::prediction_field_name[]
tag::training_percent[]
`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example,
those that contain arrays) won't be included in the calculation of the used
percentage. Defaults to `100`.
end::training_percent[]