[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)
parent 70765dfb05
commit 3c9bd13dca
@@ -18,13 +18,14 @@
 `analyzed_fields`::
 (object) You can specify both `includes` and/or `excludes` patterns. If
 `analyzed_fields` is not set, only the relevant fields will be included. For
-example all the numeric fields for {oldetection}.
+example, all the numeric fields for {oldetection}. For the supported field
+types, see <<ml-put-dfanalytics-supported-fields>>.

-`analyzed_fields.includes`:::
+`includes`:::
 (array) An array of strings that defines the fields that will be included in
 the analysis.

-`analyzed_fields.excludes`:::
+`excludes`:::
 (array) An array of strings that defines the fields that will be excluded
 from the analysis.

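For context, a request that uses these patterns might look like the following sketch; the job, index, and field names are hypothetical and not part of this commit:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/weblog_outliers
{
  "source": { "index": "weblogs" },
  "dest": { "index": "weblogs-outliers" },
  "analysis": { "outlier_detection": {} },
  "analyzed_fields": {
    "includes": [ "response_time", "bytes_sent" ],
    "excludes": [ "*_id" ]
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]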
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
 [[regression-resources-standard]]
 ===== Standard parameters

-`dependent_variable`::
-(Required, string) Defines which field of the document is to be predicted.
-This parameter is supplied by field name and must match one of the fields in
-the index being used to train. If this field is missing from a document, then
-that document will not be used for training, but a prediction with the trained
-model will be generated for it. The data type of the field must be numeric. It
-is also known as continuous target variable.
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric.
+--

-`prediction_field_name`::
-(Optional, string) Defines the name of the prediction field in the results.
-Defaults to `<dependent_variable>_prediction`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

-`training_percent`::
-(Optional, integer) Defines what percentage of the eligible documents that will
-be used for training. Documents that are ignored by the analysis (for example
-those that contain arrays) won't be included in the calculation for used
-percentage. Defaults to `100`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]

 [float]
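Taken together, the standard {regression} parameters above might be used as in this sketch (index and field names are illustrative only, not from this commit):

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source": { "index": "houses" },
  "dest": { "index": "houses-predictions" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price",
      "training_percent": 80
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]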
@@ -209,46 +202,73 @@ values unless you fully understand the function of these
 parameters are not supplied, their values are automatically tuned to give
 minimum validation error.

-`eta`::
-(Optional, double) The shrinkage applied to the weights. Smaller values result
-in larger forests which have better generalization error. However, the smaller
-the value the longer the training will take. For more information, see
-https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
-about shrinkage.
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

-`feature_bag_fraction`::
-(Optional, double) Defines the fraction of features that will be used when
-selecting a random bag for each candidate split.
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

-`maximum_number_trees`::
-(Optional, integer) Defines the maximum number of trees the forest is allowed
-to contain. The maximum value is 2000.
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

-`gamma`::
-(Optional, double) Regularization parameter to prevent overfitting on the
-training dataset. Multiplies a linear penalty associated with the size of
-individual trees in the forest. The higher the value the more training will
-prefer smaller trees. The smaller this parameter the larger individual trees
-will be and the longer train will take.
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

-`lambda`::
-(Optional, double) Regularization parameter to prevent overfitting on the
-training dataset. Multiplies an L2 regularisation term which applies to leaf
-weights of the individual trees in the forest. The higher the value the more
-training will attempt to keep leaf weights small. This makes the prediction
-function smoother at the expense of potentially not being able to capture
-relevant relationships between the features and the {depvar}. The smaller this
-parameter the larger individual trees will be and the longer train will take.
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

+[discrete]
+[[classification-resources]]
+==== {classification-cap} configuration objects
+
+[float]
+[[classification-resources-standard]]
+===== Standard parameters
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
++
+--
+The data type of the field must be numeric or boolean.
+--
+
+`num_top_classes`::
+(Optional, integer) Defines the number of categories for which the predicted
+probabilities are reported. It must be non-negative. If it is greater than the
+total number of categories to predict (two, in the {version} version of the
+{stack}), all category probabilities are reported. Defaults to `2`.
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
+
+[float]
+[[classification-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {classanalysis}. They are set
+automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
+to give minimum validation error. It is highly recommended to use the default
+values unless you fully understand the function of these parameters. If these
+parameters are not supplied, their values are automatically tuned to give
+minimum validation error.
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
+
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

 [[ml-hyperparameter-optimization]]
 ===== Hyperparameter optimization

-If you don't supply {regression} parameters, hyperparameter optimization will be
-performed by default to set a value for the undefined parameters. The starting
-point is calculated for data dependent parameters by examining the loss on the
-training data. Subject to the size constraint, this operation provides an upper
-bound on the improvement in validation loss.
+If you don't supply {regression} or {classification} parameters, hyperparameter
+optimization will be performed by default to set a value for the undefined
+parameters. The starting point is calculated for data dependent parameters by
+examining the loss on the training data. Subject to the size constraint, this
+operation provides an upper bound on the improvement in validation loss.

 A fixed number of rounds is used for optimization which depends on the number of
 parameters being optimized. The optimization starts with random search, then
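As a sketch of where these advanced hyperparameters sit in a request body (the values shown are arbitrary placeholders, not tuned recommendations; as the section notes, it is usually better to omit them and let hyperparameter optimization choose):

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_tuned
{
  "source": { "index": "houses" },
  "dest": { "index": "houses-predictions-tuned" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05,
      "feature_bag_fraction": 0.7,
      "maximum_number_trees": 500,
      "gamma": 0.1,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]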
@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
 that don't contain a results field are not included in the {reganalysis}.

+====== {classification-cap}
+
+{classification-cap} supports fields that are numeric, boolean, text, keyword,
+and ip. It is also tolerant of missing values. Fields that are supported are
+included in the analysis; other fields are ignored. Documents where included
+fields contain an array with two or more values are also ignored. Documents in
+the `dest` index that don't contain a results field are not included in the
+{classanalysis}.
+
+{classanalysis-cap} can be improved by mapping ordinal variable values to a
+single number. For example, in the case of age ranges, you can model the values
+as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
+
+Fields that are highly correlated with the `dependent_variable` should be
+excluded from the analysis. For example, if the `dependent_variable` is a
+string that {es} maps as both text and keyword, the mapping results in two
+fields (`field` and `field.keyword`). You must exclude the field with the text
+mapping to get exact results from the analysis.
+
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}

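One way to apply the ordinal mapping described above is an ingest pipeline with a script processor, as in this minimal sketch; the pipeline name and the `age_range`/`age_rank` fields are hypothetical, not part of this commit:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age_range_ordinal
{
  "description": "Replace age range labels with ordinal numbers",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "Map ranks = ['0-14': 0, '15-24': 1, '25-34': 2]; ctx.age_rank = ranks.get(ctx.age_range);"
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:hypothetical example]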
@@ -154,6 +174,7 @@ that don't contain a results field are not included in the {reganalysis}.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}

 [[ml-put-dfanalytics-example-od]]
 ===== {oldetection-cap} example

@@ -305,3 +326,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
 <1> The `training_percent` defines the percentage of the data set that will be used
 for training the model.

+[[ml-put-dfanalytics-example-c]]
+===== {classification-cap} example
+
+The following example creates the `loan_classification` {dfanalytics-job}; the
+analysis type is `classification`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/loan_classification
+{
+  "source" : {
+    "index": "loan-applicants"
+  },
+  "dest" : {
+    "index": "loan-applicants-classified"
+  },
+  "analysis" : {
+    "classification": {
+      "dependent_variable": "label",
+      "training_percent": 75,
+      "num_top_classes": 2
+    }
+  }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
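A created {dfanalytics-job} does not analyze anything until it is started; a follow-up call for the example above could look like this sketch (not part of this commit):

[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/loan_classification/_start
--------------------------------------------------
// TEST[skip:hypothetical example]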
@@ -0,0 +1,70 @@
+tag::dependent_variable[]
+`dependent_variable`::
+(Required, string) Defines which field of the document is to be predicted.
+This parameter is supplied by field name and must match one of the fields in
+the index being used to train. If this field is missing from a document, then
+that document will not be used for training, but a prediction with the trained
+model will be generated for it. It is also known as the continuous target
+variable.
+end::dependent_variable[]
+
+tag::eta[]
+`eta`::
+(Optional, double) The shrinkage applied to the weights. Smaller values result
+in larger forests which have better generalization error. However, the smaller
+the value, the longer the training will take. For more information, see
+https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
+about shrinkage.
+end::eta[]
+
+tag::feature_bag_fraction[]
+`feature_bag_fraction`::
+(Optional, double) Defines the fraction of features that will be used when
+selecting a random bag for each candidate split.
+end::feature_bag_fraction[]
+
+tag::gamma[]
+`gamma`::
+(Optional, double) Regularization parameter to prevent overfitting on the
+training dataset. Multiplies a linear penalty associated with the size of
+individual trees in the forest. The higher the value, the more training will
+prefer smaller trees. The smaller this parameter, the larger individual trees
+will be and the longer training will take.
+end::gamma[]
+
+tag::lambda[]
+`lambda`::
+(Optional, double) Regularization parameter to prevent overfitting on the
+training dataset. Multiplies an L2 regularization term which applies to leaf
+weights of the individual trees in the forest. The higher the value, the more
+training will attempt to keep leaf weights small. This makes the prediction
+function smoother at the expense of potentially not being able to capture
+relevant relationships between the features and the {depvar}. The smaller this
+parameter, the larger individual trees will be and the longer training will
+take.
+end::lambda[]
+
+tag::maximum_number_trees[]
+`maximum_number_trees`::
+(Optional, integer) Defines the maximum number of trees the forest is allowed
+to contain. The maximum value is 2000.
+end::maximum_number_trees[]
+
+tag::prediction_field_name[]
+`prediction_field_name`::
+(Optional, string) Defines the name of the prediction field in the results.
+Defaults to `<dependent_variable>_prediction`.
+end::prediction_field_name[]
+
+tag::training_percent[]
+`training_percent`::
+(Optional, integer) Defines what percentage of the eligible documents will be
+used for training. Documents that are ignored by the analysis (for example,
+those that contain arrays) won't be included in the calculation of the used
+percentage. Defaults to `100`.
+end::training_percent[]
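For readers unfamiliar with the mechanism this commit relies on: the `include::...[tag=...]` lines added throughout are AsciiDoctor tagged regions. ml-shared.asciidoc defines each snippet between `tag::name[]` and `end::name[]` markers, and each API page pulls it in with an `include` directive. A minimal sketch of the pattern (file names as in the commit, snippet body abridged):

[source,asciidoc]
--------------------------------------------------
// In ml-shared.asciidoc: define the reusable snippet.
tag::eta[]
`eta`::
(Optional, double) The shrinkage applied to the weights. ...
end::eta[]

// In put-dfanalytics.asciidoc: pull the snippet in where it is needed.
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
--------------------------------------------------

The `+` and `--` lines that follow some includes in the diff are AsciiDoc list continuations and open blocks; they let the including page append analysis-specific text (such as the differing data type notes for {regression} versus {classification}) to the shared definition.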