[DOCS] Adds regression analytics resources and examples to the data frame analytics APIs and the evaluation API (#46176)
* [DOCS] Adds regression analytics resources and examples to the data frame analytics APIs. Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com> Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
Commit 3be51fbdf7 (parent 65fffcc9c1)
|
@ -12,7 +12,8 @@
|
||||||
|
|
||||||
`analysis`::
|
`analysis`::
|
||||||
(object) The type of analysis that is performed on the `source`. For example:
|
(object) The type of analysis that is performed on the `source`. For example:
|
||||||
`outlier_detection`. For more information, see <<dfanalytics-types>>.
|
`outlier_detection` or `regression`. For more information, see
|
||||||
|
<<dfanalytics-types>>.
|
||||||
|
|
||||||
`analyzed_fields`::
|
`analyzed_fields`::
|
||||||
(object) You can specify both `includes` and/or `excludes` patterns. If
|
(object) You can specify both `includes` and/or `excludes` patterns. If
|
||||||
|
@ -99,14 +100,12 @@ PUT _ml/data_frame/analytics/loganalytics
|
||||||
|
|
||||||
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
|
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
|
||||||
create a {dfanalytics-job}, you must define the type of analysis it performs.
|
create a {dfanalytics-job}, you must define the type of analysis it performs.
|
||||||
Currently, `outlier_detection` is the only available type of analysis, however,
|
|
||||||
other types will be added, for example `regression`.
|
|
||||||
|
|
||||||
[discrete]
|
[discrete]
|
||||||
[[oldetection-resources]]
|
[[oldetection-resources]]
|
||||||
==== {oldetection-cap} configuration objects
|
==== {oldetection-cap} configuration objects
|
||||||
|
|
||||||
An {oldetection} configuration object has the following properties:
|
An `outlier_detection` configuration object has the following properties:
|
||||||
|
|
||||||
`compute_feature_influence`::
|
`compute_feature_influence`::
|
||||||
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
|
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
|
||||||
|
@ -123,7 +122,7 @@ An {oldetection} configuration object has the following properties:
|
||||||
recommend to use the ensemble method. Available methods are `lof`, `ldof`,
|
recommend to use the ensemble method. Available methods are `lof`, `ldof`,
|
||||||
`distance_kth_nn`, `distance_knn`.
|
`distance_kth_nn`, `distance_knn`.
|
||||||
|
|
||||||
`n_neighbors`::
|
`n_neighbors`::
|
||||||
(integer) Defines the value for how many nearest neighbors each method of
|
(integer) Defines the value for how many nearest neighbors each method of
|
||||||
{oldetection} will use to calculate its {olscore}. When the value is not set,
|
{oldetection} will use to calculate its {olscore}. When the value is not set,
|
||||||
different values will be used for different ensemble members. This helps
|
different values will be used for different ensemble members. This helps
|
||||||
|
@ -140,3 +139,122 @@ An {oldetection} configuration object has the following properties:
|
||||||
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
|
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
|
||||||
`true`. For more information, see
|
`true`. For more information, see
|
||||||
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
|
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
|
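The standardization formula above, (x_i - mean(x_i)) / sd(x_i), can be sketched for a single feature column as follows. This is a minimal illustration only: the function name `standardize` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is applied.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (x_i - mean) / sd for each value in one feature column."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(x - mu) / sd for x in values]


scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

After this transformation the column has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to distance-based outlier scores.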
||||||
|
|
||||||
|
|
||||||
|
[discrete]
|
||||||
|
[[regression-resources]]
|
||||||
|
==== {regression-cap} configuration objects
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
--------------------------------------------------
|
||||||
|
PUT _ml/data_frame/analytics/house_price_regression_analysis
|
||||||
|
{
|
||||||
|
"source": {
|
||||||
|
"index": "houses_sold_last_10_yrs" <1>
|
||||||
|
},
|
||||||
|
"dest": {
|
||||||
|
"index": "house_price_predictions" <2>
|
||||||
|
},
|
||||||
|
"analysis":
|
||||||
|
{
|
||||||
|
"regression": { <3>
|
||||||
|
"dependent_variable": "price" <4>
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
--------------------------------------------------
|
||||||
|
// TEST[skip:TBD]
|
||||||
|
|
||||||
|
<1> Training data is taken from source index `houses_sold_last_10_yrs`.
|
||||||
|
<2> Analysis results will be output to destination index
|
||||||
|
`house_price_predictions`.
|
||||||
|
<3> The regression analysis configuration object.
|
||||||
|
<4> Regression analysis will use field `price` to train on. As no other
|
||||||
|
parameters have been specified it will train on 100% of eligible data, store its
|
||||||
|
prediction in destination index field `ml.price_prediction` and use built-in
|
||||||
|
hyperparameter optimization to give minimum validation errors.
|
||||||
|
|
||||||
|
|
||||||
|
[float]
|
||||||
|
[[regression-resources-standard]]
|
||||||
|
===== Standard parameters
|
||||||
|
|
||||||
|
`dependent_variable`::
|
||||||
|
(Required, string) Defines which field of the {dataframe} is to be predicted.
|
||||||
|
This parameter is supplied by field name and must match one of the fields in
|
||||||
|
the index being used to train. If this field is missing from a document, then
|
||||||
|
that document will not be used for training, but a prediction with the trained
|
||||||
|
model will be generated for it. The data type of the field must be numeric. It
|
||||||
|
is also known as the continuous target variable.
|
||||||
|
|
||||||
|
`prediction_field_name`::
|
||||||
|
(Optional, string) Defines the name of the prediction field in the results.
|
||||||
|
Defaults to `<dependent_variable>_prediction`.
|
||||||
|
|
||||||
|
`training_percent`::
|
||||||
|
(Optional, integer) Defines what percentage of the eligible documents will
|
||||||
|
be used for training. Documents that are ignored by the analysis (for example
|
||||||
|
those that contain arrays) are not included when this percentage is
|
||||||
|
calculated. Defaults to `100`.
|
||||||
|
|
||||||
|
|
||||||
|
[float]
|
||||||
|
[[regression-resources-advanced]]
|
||||||
|
===== Advanced parameters
|
||||||
|
|
||||||
|
Advanced parameters are for fine-tuning {reganalysis}. They are set
|
||||||
|
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
|
||||||
|
to give minimum validation error. It is highly recommended to use the default
|
||||||
|
values unless you fully understand the function of these parameters. If these
|
||||||
|
parameters are not supplied, their values are automatically tuned to give
|
||||||
|
minimum validation error.
|
||||||
|
|
||||||
|
`eta`::
|
||||||
|
(Optional, double) The shrinkage applied to the weights. Smaller values result
|
||||||
|
in larger forests which have better generalization error. However, the smaller
|
||||||
|
the value the longer the training will take. For more information, see
|
||||||
|
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
|
||||||
|
about shrinkage.
|
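As a hedged illustration of shrinkage in gradient boosting in general (the textbook form from the linked article, not a statement about the exact internals of this analysis): each new tree h_m is added to the current model F_{m-1} after being scaled by `eta`, so smaller values take smaller steps and typically need more trees.

```latex
F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad 0 < \eta \le 1
```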
||||||
|
|
||||||
|
`feature_bag_fraction`::
|
||||||
|
(Optional, double) Defines the fraction of features that will be used when
|
||||||
|
selecting a random bag for each candidate split.
|
||||||
|
|
||||||
|
`maximum_number_trees`::
|
||||||
|
(Optional, integer) Defines the maximum number of trees the forest is allowed
|
||||||
|
to contain. The maximum value is 2000.
|
||||||
|
|
||||||
|
`gamma`::
|
||||||
|
(Optional, double) Regularization parameter to prevent overfitting on the
|
||||||
|
training dataset. Multiplies a linear penalty associated with the size of
|
||||||
|
individual trees in the forest. The higher the value the more training will
|
||||||
|
prefer smaller trees. The smaller this parameter the larger individual trees
|
||||||
|
will be and the longer training will take.
|
||||||
|
|
||||||
|
`lambda`::
|
||||||
|
(Optional, double) Regularization parameter to prevent overfitting on the
|
||||||
|
training dataset. Multiplies an L2 regularization term which applies to leaf
|
||||||
|
weights of the individual trees in the forest. The higher the value the more
|
||||||
|
training will attempt to keep leaf weights small. This makes the prediction
|
||||||
|
function smoother at the expense of potentially not being able to capture
|
||||||
|
relevant relationships between the features and the {depvar}. The smaller this
|
||||||
|
parameter the larger individual trees will be and the longer training will take.
|
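The parameters above can be combined into a request body for the create {dfanalytics-job} API. The following Python sketch only builds and prints that body. The helper name `regression_job_body` and the override values (`0.05`, `1.0`, `80`) are illustrative assumptions, not recommendations, and the only validation it performs is the documented maximum of 2000 trees.

```python
import json

MAX_TREES = 2000  # documented maximum for maximum_number_trees

# Parameter names taken from the standard and advanced parameter lists.
ALLOWED = {
    "prediction_field_name", "training_percent", "eta",
    "feature_bag_fraction", "maximum_number_trees", "gamma", "lambda",
}


def regression_job_body(source_index, dest_index, dependent_variable, **overrides):
    """Build a body for PUT _ml/data_frame/analytics/<job_id> (hypothetical helper)."""
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise ValueError(f"unknown regression parameters: {sorted(unknown)}")
    if overrides.get("maximum_number_trees", 1) > MAX_TREES:
        raise ValueError("maximum_number_trees must not exceed 2000")
    regression = {"dependent_variable": dependent_variable, **overrides}
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {"regression": regression},
    }


# "lambda" is a Python keyword, so the overrides are passed through a dict.
body = regression_job_body(
    "houses_sold_last_10_yrs",
    "house_price_predictions",
    "price",
    **{"eta": 0.05, "lambda": 1.0, "training_percent": 80},
)
print(json.dumps(body, indent=2))
```

Any advanced parameter left out of the body stays undefined, which leaves it to be set automatically as described in this section.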
||||||
|
|
||||||
|
|
||||||
|
[[ml-hyperparameter-optimization]]
|
||||||
|
===== Hyperparameter optimization
|
||||||
|
|
||||||
|
If you don't supply {regression} parameters, hyperparameter optimization will be
|
||||||
|
performed by default to set a value for the undefined parameters. The starting
|
||||||
|
point is calculated for data-dependent parameters by examining the loss on the
|
||||||
|
training data. Subject to the size constraint, this operation provides an upper
|
||||||
|
bound on the improvement in validation loss.
|
||||||
|
|
||||||
|
A fixed number of rounds is used for optimization which depends on the number of
|
||||||
|
parameters being optimized. The optimization starts with a random search, then
|
||||||
|
Bayesian optimization is performed, targeting maximum expected
|
||||||
|
improvement. If you override any parameters, then the optimization will
|
||||||
|
calculate the value of the remaining parameters accordingly and use the value
|
||||||
|
you provided for the overridden parameter. The number of rounds is reduced
|
||||||
|
accordingly. The validation error is estimated in each round by using 4-fold
|
||||||
|
cross validation.
|
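To make the 4-fold cross validation step concrete, the following Python sketch shows the general technique: the rows are split into four folds, each fold is held out once for validation, and the per-fold errors are averaged. It is an illustration under stated assumptions, not the implementation used by the analysis. The function names, the contiguous fold layout, and the `fit_and_score` callback are all hypothetical.

```python
def k_fold_indices(n_rows, k=4):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


def cross_validation_error(n_rows, fit_and_score, k=4):
    """Average the validation error returned by fit_and_score over k folds."""
    errors = [fit_and_score(train, val) for train, val in k_fold_indices(n_rows, k)]
    return sum(errors) / len(errors)


folds = list(k_fold_indices(10, k=4))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

In an optimization round, `fit_and_score` would train a model with the candidate hyperparameters on the `train` rows and return its error on the `validation` rows.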
|
@ -27,15 +27,11 @@ information, see {stack-ov}/security-privileges.html[Security privileges] and
|
||||||
[[ml-evaluate-dfanalytics-desc]]
|
[[ml-evaluate-dfanalytics-desc]]
|
||||||
==== {api-description-title}
|
==== {api-description-title}
|
||||||
|
|
||||||
This API evaluates the executed analysis on an index that is already annotated
|
The API packages together commonly used evaluation metrics for various types of
|
||||||
with a field that contains the results of the analytics (the `ground truth`)
|
machine learning features. This has been designed for use on indexes created by
|
||||||
for each {dataframe} row.
|
{dfanalytics}. Evaluation requires both a ground truth field and an analytics
|
||||||
|
result field to be present.
|
||||||
|
|
||||||
Evaluation is typically done by calculating a set of metrics that capture various aspects of the quality of the results over the data for which you have the
|
|
||||||
`ground truth`.
|
|
||||||
|
|
||||||
For different types of analyses different metrics are suitable. This API
|
|
||||||
packages together commonly used metrics for various analyses.
|
|
||||||
|
|
||||||
[[ml-evaluate-dfanalytics-request-body]]
|
[[ml-evaluate-dfanalytics-request-body]]
|
||||||
==== {api-request-body-title}
|
==== {api-request-body-title}
|
||||||
|
@ -45,14 +41,19 @@ packages together commonly used metrics for various analyses.
|
||||||
performed.
|
performed.
|
||||||
|
|
||||||
`query`::
|
`query`::
|
||||||
(Optional, object) Query used to select data from the index.
|
(Optional, object) A query clause that retrieves a subset of data from the
|
||||||
The {es} query domain-specific language (DSL). This value corresponds to the query
|
source index. See <<query-dsl>>.
|
||||||
object in an {es} search POST body. By default, this property has the following
|
|
||||||
value: `{"match_all": {}}`.
|
|
||||||
|
|
||||||
`evaluation`::
|
`evaluation`::
|
||||||
(Required, object) Defines the type of evaluation you want to perform. For example:
|
(Required, object) Defines the type of evaluation you want to perform. See
|
||||||
`binary_soft_classification`. See <<ml-evaluate-dfanalytics-resources>>.
|
<<ml-evaluate-dfanalytics-resources>>.
|
||||||
|
+
|
||||||
|
--
|
||||||
|
Available evaluation types:
|
||||||
|
* `binary_soft_classification`
|
||||||
|
* `regression`
|
||||||
|
--
|
||||||
|
|
||||||
|
|
||||||
////
|
////
|
||||||
[[ml-evaluate-dfanalytics-results]]
|
[[ml-evaluate-dfanalytics-results]]
|
||||||
|
@ -74,6 +75,8 @@ packages together commonly used metrics for various analyses.
|
||||||
[[ml-evaluate-dfanalytics-example]]
|
[[ml-evaluate-dfanalytics-example]]
|
||||||
==== {api-examples-title}
|
==== {api-examples-title}
|
||||||
|
|
||||||
|
===== Binary soft classification
|
||||||
|
|
||||||
[source,console]
|
[source,console]
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
POST _ml/data_frame/_evaluate
|
POST _ml/data_frame/_evaluate
|
||||||
|
@ -131,3 +134,40 @@ The API returns the following results:
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
|
|
||||||
|
===== {regression-cap}
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
--------------------------------------------------
|
||||||
|
POST _ml/data_frame/_evaluate
|
||||||
|
{
|
||||||
|
"index": "house_price_predictions", <1>
|
||||||
|
"query": {
|
||||||
|
"bool": {
|
||||||
|
"filter": [
|
||||||
|
{ "term": { "ml.is_training": false } } <2>
|
||||||
|
]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"evaluation": {
|
||||||
|
"regression": {
|
||||||
|
"actual_field": "price", <3>
|
||||||
|
"predicted_field": "ml.price_prediction", <4>
|
||||||
|
"metrics": {
|
||||||
|
"r_squared": {},
|
||||||
|
"mean_squared_error": {}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
--------------------------------------------------
|
||||||
|
// TEST[skip:TBD]
|
||||||
|
|
||||||
|
<1> The output destination index from a {dfanalytics} {reganalysis}.
|
||||||
|
<2> In this example, a test/train split (`training_percent`) was defined for the
|
||||||
|
{reganalysis}. This query limits evaluation to be performed on the test split
|
||||||
|
only.
|
||||||
|
<3> The ground truth value for the actual house price. This is required in order
|
||||||
|
to evaluate results.
|
||||||
|
<4> The predicted value for house price calculated by the {reganalysis}.
|
||||||
|
|
|
@ -12,7 +12,19 @@ Evaluation configuration objects relate to the <<evaluate-dfanalytics>>.
|
||||||
`evaluation`::
|
`evaluation`::
|
||||||
(object) Defines the type of evaluation you want to perform. The value of this
|
(object) Defines the type of evaluation you want to perform. The value of this
|
||||||
object can be different depending on the type of evaluation you want to
|
object can be different depending on the type of evaluation you want to
|
||||||
perform. For example, it can contain <<binary-sc-resources>>.
|
perform.
|
||||||
|
+
|
||||||
|
--
|
||||||
|
Available evaluation types:
|
||||||
|
* `binary_soft_classification`
|
||||||
|
* `regression`
|
||||||
|
--
|
||||||
|
|
||||||
|
`query`::
|
||||||
|
(object) A query clause that retrieves a subset of data from the source index.
|
||||||
|
See <<query-dsl>>. The evaluation only applies to those documents of the index
|
||||||
|
that match the query.
|
||||||
|
|
||||||
|
|
||||||
[[binary-sc-resources]]
|
[[binary-sc-resources]]
|
||||||
==== Binary soft classification configuration objects
|
==== Binary soft classification configuration objects
|
||||||
|
@ -27,18 +39,18 @@ probability whether each row is an outlier.
|
||||||
===== {api-definitions-title}
|
===== {api-definitions-title}
|
||||||
|
|
||||||
`actual_field`::
|
`actual_field`::
|
||||||
(string) The field of the `index` which contains the `ground
|
(string) The field of the `index` which contains the `ground truth`.
|
||||||
truth`. The data type of this field can be boolean or integer. If the data
|
The data type of this field can be boolean or integer. If the data type is
|
||||||
type is integer, the value has to be either `0` (false) or `1` (true).
|
integer, the value has to be either `0` (false) or `1` (true).
|
||||||
|
|
||||||
`predicted_probability_field`::
|
`predicted_probability_field`::
|
||||||
(string) The field of the `index` that defines the probability of whether the
|
(string) The field of the `index` that defines the probability of
|
||||||
item belongs to the class in question or not. It's the field that contains the
|
whether the item belongs to the class in question or not. It's the field that
|
||||||
results of the analysis.
|
contains the results of the analysis.
|
||||||
|
|
||||||
`metrics`::
|
`metrics`::
|
||||||
(object) Specifies the metrics that are used for the evaluation. Available
|
(object) Specifies the metrics that are used for the evaluation.
|
||||||
metrics:
|
Available metrics:
|
||||||
|
|
||||||
`auc_roc`::
|
`auc_roc`::
|
||||||
(object) The AUC ROC (area under the curve of the receiver operating
|
(object) The AUC ROC (area under the curve of the receiver operating
|
||||||
|
@ -61,3 +73,26 @@ probability whether each row is an outlier.
|
||||||
false negative) are calculated.
|
false negative) are calculated.
|
||||||
Default value is {"at": [0.25, 0.50, 0.75]}.
|
Default value is {"at": [0.25, 0.50, 0.75]}.
|
||||||
|
|
||||||
|
|
||||||
|
[[regression-evaluation-resources]]
|
||||||
|
==== {regression-cap} evaluation objects
|
||||||
|
|
||||||
|
{regression-cap} evaluation evaluates the results of a {regression} analysis
|
||||||
|
which outputs a prediction of values.
|
||||||
|
|
||||||
|
|
||||||
|
[discrete]
|
||||||
|
[[regression-evaluation-resources-properties]]
|
||||||
|
===== {api-definitions-title}
|
||||||
|
|
||||||
|
`actual_field`::
|
||||||
|
(string) The field of the `index` which contains the `ground truth`. The data
|
||||||
|
type of this field must be numerical.
|
||||||
|
|
||||||
|
`predicted_field`::
|
||||||
|
(string) The field in the `index` that contains the predicted value,
|
||||||
|
in other words the results of the {regression} analysis.
|
||||||
|
|
||||||
|
`metrics`::
|
||||||
|
(object) Specifies the metrics that are used for the evaluation. Available
|
||||||
|
metrics are `r_squared` and `mean_squared_error`.
|
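For readers unfamiliar with the two metrics, the following Python sketch computes them from their standard textbook definitions. It illustrates what `mean_squared_error` and `r_squared` measure, under the assumption that the API follows these usual definitions. The sample prices are made up for the example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares).

    Undefined when all actual values are identical (total sum of squares is 0).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical values standing in for `actual_field` and `predicted_field`.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(mean_squared_error(actual, predicted))
print(r_squared(actual, predicted))
```

A lower mean squared error and an R-squared value closer to 1 both indicate predictions that track the ground truth more closely.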
|
@ -121,6 +121,9 @@ and mappings.
|
||||||
[[ml-put-dfanalytics-example]]
|
[[ml-put-dfanalytics-example]]
|
||||||
==== {api-examples-title}
|
==== {api-examples-title}
|
||||||
|
|
||||||
|
[[ml-put-dfanalytics-example-od]]
|
||||||
|
===== {oldetection-cap} example
|
||||||
|
|
||||||
The following example creates the `loganalytics` {dfanalytics-job}, the analysis
|
The following example creates the `loganalytics` {dfanalytics-job}, the analysis
|
||||||
type is `outlier_detection`:
|
type is `outlier_detection`:
|
||||||
|
|
||||||
|
@ -173,3 +176,63 @@ The API returns the following result:
|
||||||
----
|
----
|
||||||
// TESTRESPONSE[s/1562351429434/$body.$_path/]
|
// TESTRESPONSE[s/1562351429434/$body.$_path/]
|
||||||
// TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]
|
// TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]
|
||||||
|
|
||||||
|
|
||||||
|
[[ml-put-dfanalytics-example-r]]
|
||||||
|
===== {regression-cap} example
|
||||||
|
|
||||||
|
The following example creates the `house_price_regression_analysis`
|
||||||
|
{dfanalytics-job}, the analysis type is `regression`:
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
--------------------------------------------------
|
||||||
|
PUT _ml/data_frame/analytics/house_price_regression_analysis
|
||||||
|
{
|
||||||
|
"source": {
|
||||||
|
"index": "houses_sold_last_10_yrs"
|
||||||
|
},
|
||||||
|
"dest": {
|
||||||
|
"index": "house_price_predictions"
|
||||||
|
},
|
||||||
|
"analysis":
|
||||||
|
{
|
||||||
|
"regression": {
|
||||||
|
"dependent_variable": "price"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
--------------------------------------------------
|
||||||
|
// TEST[skip:TBD]
|
||||||
|
|
||||||
|
|
||||||
|
The API returns the following result:
|
||||||
|
|
||||||
|
[source,console-result]
|
||||||
|
----
|
||||||
|
{
|
||||||
|
"id" : "house_price_regression_analysis",
|
||||||
|
"source" : {
|
||||||
|
"index" : [
|
||||||
|
"houses_sold_last_10_yrs"
|
||||||
|
],
|
||||||
|
"query" : {
|
||||||
|
"match_all" : { }
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"dest" : {
|
||||||
|
"index" : "house_price_predictions",
|
||||||
|
"results_field" : "ml"
|
||||||
|
},
|
||||||
|
"analysis" : {
|
||||||
|
"regression" : {
|
||||||
|
"dependent_variable" : "price",
|
||||||
|
"training_percent" : 100
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"model_memory_limit" : "1gb",
|
||||||
|
"create_time" : 1567168659127,
|
||||||
|
"version" : "8.0.0"
|
||||||
|
}
|
||||||
|
----
|
||||||
|
// TESTRESPONSE[s/1567168659127/$body.$_path/]
|
||||||
|
// TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]
|