[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)
This commit is contained in:
parent
70765dfb05
commit
3c9bd13dca
@ -18,13 +18,14 @@
`analyzed_fields`::
(object) You can specify both `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example all the numeric fields for {oldetection}.

`analyzed_fields.includes`:::
example, all the numeric fields for {oldetection}. For the supported field
types, see <<ml-put-dfanalytics-supported-fields>>.

`includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`analyzed_fields.excludes`:::

`excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.

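As a sketch of how these patterns fit into a job definition (the job, index, and field names here are hypothetical), `analyzed_fields` is set at the top level of the request body:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/weather_regression
{
  "source": { "index": "weather" },
  "dest": { "index": "weather-predicted" },
  "analysis": {
    "regression": { "dependent_variable": "temperature" }
  },
  "analyzed_fields": {
    "includes": [ "temperature", "humidity", "wind_*" ],
    "excludes": [ "station_id" ]
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]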
@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
[[regression-resources-standard]]
===== Standard parameters

`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. The data type of the field must be numeric. It
is also known as the continuous target variable.
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.

`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example
those that contain arrays) won’t be included in the calculation of the used
percentage. Defaults to `100`.
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]


[float]
@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.

`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.

`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer training will take.
include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

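Putting these together, a {reganalysis} job that pins the advanced parameters explicitly could look like the sketch below. The index names and parameter values here are hypothetical; the parameter names are those documented above, and setting them explicitly means they will not be tuned automatically:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source": { "index": "house-prices" },
  "dest": { "index": "house-prices-predicted" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05,
      "feature_bag_fraction": 0.7,
      "maximum_number_trees": 500,
      "gamma": 1.0,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]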
[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects

`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to the
leaf weights of the individual trees in the forest. The higher the value, the
more training will attempt to keep leaf weights small. This makes the
prediction function smoother at the expense of potentially not being able to
capture relevant relationships between the features and the {depvar}. The
smaller this parameter, the larger individual trees will be and the longer
training will take.

[float]
[[classification-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--

`num_top_classes`::
|
||||
(Optional, integer) Defines the number of categories for which the predicted
|
||||
probabilities are reported. It must be non-negative. If it is greater than the
|
||||
total number of categories (in the {version} version of the {stack}, it's two)
|
||||
to predict then we will report all category probabilities. Defaults to 2.
|
||||
|
||||
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]


[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} parameters, hyperparameter optimization will be
performed by default to set a value for the undefined parameters. The starting
point is calculated for data-dependent parameters by examining the loss on the
training data. Subject to the size constraint, this operation provides an upper
bound on the improvement in validation loss.
If you don't supply {regression} or {classification} parameters, hyperparameter
optimization will be performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization, which depends on the number
of parameters being optimized. The optimization starts with random search, then
@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
that don’t contain a results field are not included in the {reganalysis}.


====== {classification-cap}

{classification-cap} supports fields that are numeric, boolean, text, keyword,
and ip. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don’t contain a results field are not included in the
{classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
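As a sketch of one way to apply such a mapping before indexing (the pipeline name, field names, and range values here are hypothetical), an ingest pipeline with a script processor can convert the range into its ordinal:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age-range-to-ordinal
{
  "description": "Maps the hypothetical age_range field to an ordinal number",
  "processors": [
    {
      "script": {
        "source": "ctx.age_ordinal = ['0-14': 0, '15-24': 1, '25-34': 2, '35-44': 3].get(ctx.age_range)"
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:hypothetical example]

The resulting `age_ordinal` field can then be used in the analysis instead of the string-valued `age_range`.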

Fields that are highly correlated with the `dependent_variable` should be
excluded from the analysis. For example, if you have a multi-value field as the
`dependent_variable`, {es} maps it both as text and keyword, which results in
two fields (`field` and `field.keyword`). You must exclude the field with the
text mapping to get exact results from the analysis.
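For instance, assuming a hypothetical `airline` field that is mapped both as text (`airline`) and keyword (`airline.keyword`), the text mapping might be excluded like this (all job, index, and field names here are illustrative):

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/flight_classification
{
  "source": { "index": "flights" },
  "dest": { "index": "flights-classified" },
  "analysis": {
    "classification": { "dependent_variable": "airline.keyword" }
  },
  "analyzed_fields": {
    "excludes": [ "airline" ]
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]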


[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}

@ -154,6 +174,7 @@ that don’t contain a results field are not included in the {reganalysis}.
[[ml-put-dfanalytics-example]]
==== {api-examples-title}


[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example

@ -305,3 +326,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3

<1> The `training_percent` defines the percentage of the data set that will be
used for training the model.


[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job} with
the analysis type `classification`:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
@ -0,0 +1,70 @@
tag::dependent_variable[]
`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. It is also known as the continuous target
variable.
end::dependent_variable[]


tag::eta[]
`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
end::eta[]


tag::feature_bag_fraction[]
`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
end::feature_bag_fraction[]


tag::gamma[]
`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer training will take.
end::gamma[]


tag::lambda[]
`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to the
leaf weights of the individual trees in the forest. The higher the value, the
more training will attempt to keep leaf weights small. This makes the
prediction function smoother at the expense of potentially not being able to
capture relevant relationships between the features and the {depvar}. The
smaller this parameter, the larger individual trees will be and the longer
training will take.
end::lambda[]


tag::maximum_number_trees[]
`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
end::maximum_number_trees[]


tag::prediction_field_name[]
`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
end::prediction_field_name[]


tag::training_percent[]
`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example
those that contain arrays) won’t be included in the calculation of the used
percentage. Defaults to `100`.
end::training_percent[]