From 3be51fbdf7a8ba151b6615a83881a9e285fc0a60 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Thu, 19 Sep 2019 09:10:11 +0200 Subject: [PATCH] [DOCS] Adds regression analytics resources and examples to the data frame analytics APIs and the evaluation API (#46176) * [DOCS] Adds regression analytics resources and examples to the data frame analytics APIs. Co-Authored-By: Benjamin Trent Co-Authored-By: Tom Veasey --- .../apis/dfanalyticsresources.asciidoc | 132 +++++++++++++++++- .../apis/evaluate-dfanalytics.asciidoc | 70 ++++++++-- .../apis/evaluateresources.asciidoc | 55 ++++++-- .../apis/put-dfanalytics.asciidoc | 65 ++++++++- 4 files changed, 289 insertions(+), 33 deletions(-) diff --git a/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc b/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc index 4151e4c413c..2b666a54022 100644 --- a/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc +++ b/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc @@ -12,7 +12,8 @@ `analysis`:: (object) The type of analysis that is performed on the `source`. For example: - `outlier_detection`. For more information, see <>. + `outlier_detection` or `regression`. For more information, see + <>. `analyzed_fields`:: (object) You can specify both `includes` and/or `excludes` patterns. If @@ -98,15 +99,13 @@ PUT _ml/data_frame/analytics/loganalytics ==== Analysis objects {dfanalytics-cap} resources contain `analysis` objects. For example, when you -create a {dfanalytics-job}, you must define the type of analysis it performs. -Currently, `outlier_detection` is the only available type of analysis, however, -other types will be added, for example `regression`. - +create a {dfanalytics-job}, you must define the type of analysis it performs. 
+
 [discrete]
 [[oldetection-resources]]
 ==== {oldetection-cap} configuration objects
 
-An {oldetection} configuration object has the following properties:
+An `outlier_detection` configuration object has the following properties:
 
 `compute_feature_influence`::
   (boolean) If `true`, the feature influence calculation is enabled. Defaults to
@@ -123,7 +122,7 @@ An {oldetection} configuration object has the following properties:
   recommend to use the ensemble method. Available methods are `lof`, `ldof`,
   `distance_kth_nn`, `distance_knn`.
 
-`n_neighbors`::
+`n_neighbors`::
   (integer) Defines the value for how many nearest neighbors each method of
   {oldetection} will use to calculate its {olscore}. When the value is not set,
   different values will be used for different ensemble members. This helps
@@ -140,3 +139,122 @@ An {oldetection} configuration object has the following properties:
   before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
   `true`. For more information, see
   https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this
   wiki page about standardization].
+
+
+[discrete]
+[[regression-resources]]
+==== {regression-cap} configuration objects
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_analysis
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs" <1>
+  },
+  "dest": {
+    "index": "house_price_predictions" <2>
+  },
+  "analysis":
+    {
+      "regression": { <3>
+        "dependent_variable": "price" <4>
+      }
+    }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+<1> Training data is taken from source index `houses_sold_last_10_yrs`.
+<2> Analysis results will be output to destination index
+`house_price_predictions`.
+<3> The regression analysis configuration object.
+<4> Regression analysis will use field `price` to train on. As no other
+parameters have been specified, it will train on 100% of eligible data, store
+its prediction in the destination index field `price_prediction`, and use
+built-in hyperparameter optimization to minimize the validation error.
+
+
+[float]
+[[regression-resources-standard]]
+===== Standard parameters
+
+`dependent_variable`::
+  (Required, string) Defines which field of the {dataframe} is to be predicted.
+  This parameter is supplied by field name and must match one of the fields in
+  the index being used to train. If this field is missing from a document, then
+  that document will not be used for training, but a prediction with the
+  trained model will be generated for it. The data type of the field must be
+  numeric. This is also known as the continuous target variable.
+
+`prediction_field_name`::
+  (Optional, string) Defines the name of the prediction field in the results.
+  Defaults to `<dependent_variable>_prediction`.
+
+`training_percent`::
+  (Optional, integer) Defines what percentage of the eligible documents will be
+  used for training. Documents that are ignored by the analysis (for example
+  those that contain arrays) are not included in the calculation of this
+  percentage. Defaults to `100`.
+
+
+[float]
+[[regression-resources-advanced]]
+===== Advanced parameters
+
+Advanced parameters are for fine-tuning {reganalysis}. If they are not
+supplied, their values are set automatically by
+<<ml-hyperparameter-optimization,hyperparameter optimization>> to give the
+minimum validation error. It is highly recommended to use the default values
+unless you fully understand the function of these parameters.
+
+`eta`::
+  (Optional, double) The shrinkage applied to the weights. Smaller values
+  result in larger forests, which have better generalization error. However,
+  the smaller the value, the longer the training will take. For more
+  information, see
+  https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
+  about shrinkage.
+
+`feature_bag_fraction`::
+  (Optional, double) Defines the fraction of features that will be used when
+  selecting a random bag for each candidate split.
+
+`maximum_number_trees`::
+  (Optional, integer) Defines the maximum number of trees the forest is allowed
+  to contain. The maximum value is 2000.
+
+`gamma`::
+  (Optional, double) Regularization parameter to prevent overfitting on the
+  training dataset. Multiplies a linear penalty associated with the size of
+  individual trees in the forest. The higher the value, the more training will
+  prefer smaller trees. The smaller this parameter, the larger individual trees
+  will be and the longer training will take.
+
+`lambda`::
+  (Optional, double) Regularization parameter to prevent overfitting on the
+  training dataset. Multiplies an L2 regularization term which applies to leaf
+  weights of the individual trees in the forest. The higher the value, the more
+  training will attempt to keep leaf weights small. This makes the prediction
+  function smoother at the expense of potentially not being able to capture
+  relevant relationships between the features and the {depvar}. The smaller
+  this parameter, the larger individual trees will be and the longer training
+  will take.
+
+
+[[ml-hyperparameter-optimization]]
+===== Hyperparameter optimization
+
+If you don't supply {regression} parameters, hyperparameter optimization will
+be performed by default to set a value for the undefined parameters. The
+starting point is calculated for data-dependent parameters by examining the
+loss on the training data. Subject to the size constraint, this operation
+provides an upper bound on the improvement in validation loss.
+
+A fixed number of optimization rounds is used, which depends on the number of
+parameters being optimized. The optimization starts with a random search, then
+Bayesian optimization is performed, targeting maximum expected
+improvement. 
If you override any parameters, then the optimization will
+calculate the value of the remaining parameters accordingly and use the values
+you provided for the overridden parameters. The number of rounds is reduced
+accordingly. The validation error is estimated in each round by using 4-fold
+cross-validation.
\ No newline at end of file
diff --git a/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc b/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
index 38d72ced401..3c855b18289 100644
--- a/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
@@ -27,15 +27,11 @@ information, see {stack-ov}/security-privileges.html[Security privileges] and
 [[ml-evaluate-dfanalytics-desc]]
 ==== {api-description-title}
 
-This API evaluates the executed analysis on an index that is already annotated
-with a field that contains the results of the analytics (the `ground truth`)
-for each {dataframe} row.
+The API packages together commonly used evaluation metrics for various types of
+machine learning features. This has been designed for use on indexes created by
+{dfanalytics}. Evaluation requires both a ground truth field and an analytics
+result field to be present.
 
-Evaluation is typically done by calculating a set of metrics that capture various aspects of the quality of the results over the data for which you have the
-`ground truth`.
-
-For different types of analyses different metrics are suitable. This API
-packages together commonly used metrics for various analyses.
 
 [[ml-evaluate-dfanalytics-request-body]]
 ==== {api-request-body-title}
@@ -45,15 +41,20 @@ packages together commonly used metrics for various analyses.
   performed.
 
 `query`::
-  (Optional, object) Query used to select data from the index.
-  The {es} query domain-specific language (DSL). This value corresponds to the query
-  object in an {es} search POST body. 
By default, this property has the following - value: `{"match_all": {}}`. + (Optional, object) A query clause that retrieves a subset of data from the + source index. See <>. `evaluation`:: - (Required, object) Defines the type of evaluation you want to perform. For example: - `binary_soft_classification`. See <>. - + (Required, object) Defines the type of evaluation you want to perform. See + <>. ++ +-- +Available evaluation types: +* `binary_soft_classification` +* `regression` +-- + + //// [[ml-evaluate-dfanalytics-results]] ==== {api-response-body-title} @@ -74,6 +75,8 @@ packages together commonly used metrics for various analyses. [[ml-evaluate-dfanalytics-example]] ==== {api-examples-title} +===== Binary soft classification + [source,console] -------------------------------------------------- POST _ml/data_frame/_evaluate @@ -131,3 +134,40 @@ The API returns the following results: } } ---- + + +===== {regression-cap} + +[source,console] +-------------------------------------------------- +POST _ml/data_frame/_evaluate +{ + "index": "house_price_predictions", <1> + "query": { + "bool": { + "filter": [ + { "term": { "ml.is_training": false } } <2> + ] + } + }, + "evaluation": { + "regression": { + "actual_field": "price", <3> + "predicted_field": "ml.price_prediction", <4> + "metrics": { + "r_squared": {}, + "mean_squared_error": {} + } + } + } +} +-------------------------------------------------- +// TEST[skip:TBD] + +<1> The output destination index from a {dfanalytics} {reganalysis}. +<2> In this example, a test/train split (`training_percent`) was defined for the +{reganalysis}. This query limits evaluation to be performed on the test split +only. +<3> The ground truth value for the actual house price. This is required in order +to evaluate results. +<4> The predicted value for house price calculated by the {reganalysis}. 
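The two {regression} metrics requested in the evaluate example above, `mean_squared_error` and `r_squared`, follow their standard statistical definitions. As a sanity-check sketch (the helper functions and the sample house prices below are illustrative inventions, not part of the {es} API), they can be recomputed from actual/predicted pairs like this:

```python
# Recompute the regression evaluation metrics from actual/predicted pairs
# using their standard definitions. Field names mirror the example above
# (`price` as ground truth, `ml.price_prediction` as the prediction);
# the numeric values are made up for illustration.

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Proportion of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [250000.0, 320000.0, 180000.0, 410000.0]     # ground truth: price
predicted = [245000.0, 330000.0, 175000.0, 400000.0]  # ml.price_prediction

print(mean_squared_error(actual, predicted))  # prints 62500000.0
print(r_squared(actual, predicted))
```

A value of `r_squared` close to 1 means the predictions capture most of the variance in the ground truth; `mean_squared_error` is in squared units of the {depvar}.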
diff --git a/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc b/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc index 186e54bb378..caf05f97c0b 100644 --- a/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc +++ b/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc @@ -12,7 +12,19 @@ Evaluation configuration objects relate to the <>. `evaluation`:: (object) Defines the type of evaluation you want to perform. The value of this object can be different depending on the type of evaluation you want to - perform. For example, it can contain <>. + perform. ++ +-- +Available evaluation types: +* `binary_soft_classification` +* `regression` +-- + +`query`:: + (object) A query clause that retrieves a subset of data from the source index. + See <>. The evaluation only applies to those documents of the index + that match the query. + [[binary-sc-resources]] ==== Binary soft classification configuration objects @@ -27,18 +39,18 @@ probability whether each row is an outlier. ===== {api-definitions-title} `actual_field`:: - (string) The field of the `index` which contains the `ground - truth`. The data type of this field can be boolean or integer. If the data - type is integer, the value has to be either `0` (false) or `1` (true). + (string) The field of the `index` which contains the `ground truth`. + The data type of this field can be boolean or integer. If the data type is + integer, the value has to be either `0` (false) or `1` (true). `predicted_probability_field`:: - (string) The field of the `index` that defines the probability of whether the - item belongs to the class in question or not. It's the field that contains the - results of the analysis. + (string) The field of the `index` that defines the probability of + whether the item belongs to the class in question or not. It's the field that + contains the results of the analysis. `metrics`:: - (object) Specifies the metrics that are used for the evaluation. 
Available
-  metrics:
+  (object) Specifies the metrics that are used for the evaluation.
+  Available metrics:
 
 `auc_roc`::
   (object) The AUC ROC (area under the curve of the receiver operating
@@ -60,4 +72,27 @@ probability whether each row is an outlier.
   (`tp` - true positive, `fp` - false positive, `tn` - true negative, `fn` -
   false negative) are calculated. Default value is {"at": [0.25, 0.50, 0.75]}.
-  
\ No newline at end of file
+
+
+[[regression-evaluation-resources]]
+==== {regression-cap} evaluation objects
+
+{regression-cap} evaluation evaluates the results of a {regression} analysis,
+which outputs a predicted value.
+
+
+[discrete]
+[[regression-evaluation-resources-properties]]
+===== {api-definitions-title}
+
+`actual_field`::
+  (string) The field of the `index` which contains the `ground truth`. The data
+  type of this field must be numerical.
+
+`predicted_field`::
+  (string) The field in the `index` that contains the predicted value, in other
+  words, the results of the {regression} analysis.
+
+`metrics`::
+  (object) Specifies the metrics that are used for the evaluation. Available
+  metrics are `r_squared` and `mean_squared_error`.
\ No newline at end of file
diff --git a/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc b/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
index 490e749ce96..f9884626ae5 100644
--- a/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
@@ -121,6 +121,9 @@ and mappings.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 
+[[ml-put-dfanalytics-example-od]]
+===== {oldetection-cap} example
+
 The following example creates the `loganalytics` {dfanalytics-job}, the
 analysis type is `outlier_detection`:
 
@@ -172,4 +175,64 @@ The API returns the following result:
 }
 ----
 // TESTRESPONSE[s/1562351429434/$body.$_path/]
-// TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]
\ No newline at end of file
+// TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]
+
+
+[[ml-put-dfanalytics-example-r]]
+===== {regression-cap} example
+
+The following example creates the `house_price_regression_analysis`
+{dfanalytics-job}; the analysis type is `regression`:
+
+[source,console]
+--------------------------------------------------
+PUT _ml/data_frame/analytics/house_price_regression_analysis
+{
+  "source": {
+    "index": "houses_sold_last_10_yrs"
+  },
+  "dest": {
+    "index": "house_price_predictions"
+  },
+  "analysis":
+    {
+      "regression": {
+        "dependent_variable": "price"
+      }
+    }
+}
+--------------------------------------------------
+// TEST[skip:TBD]
+
+
+The API returns the following result:
+
+[source,console-result]
+----
+{
+  "id" : "house_price_regression_analysis",
+  "source" : {
+    "index" : [
+      "houses_sold_last_10_yrs"
+    ],
+    "query" : {
+      "match_all" : { }
+    }
+  },
+  "dest" : {
+    "index" : "house_price_predictions",
+    "results_field" : "ml"
+  },
+  "analysis" : {
+    "regression" : {
+      "dependent_variable" : "price",
+      "training_percent" : 100
+    }
+  },
+  "model_memory_limit" : "1gb",
+  "create_time" : 1567168659127,
+  "version" : "8.0.0"
+}
+----
+// TESTRESPONSE[s/1567168659127/$body.$_path/]
+// TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]
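The `training_percent` parameter of the {regression} configuration and the `ml.is_training` flag filtered on in the evaluate example work together: a percentage of the eligible documents is used for training, and the held-out remainder can be evaluated separately. A rough sketch of that split semantics, with a hypothetical helper name (`split_for_training` is not an {es} API, and the actual {ml} implementation is not shown in this patch):

```python
import random

# Illustrative sketch: assign each eligible document to the training set
# according to `training_percent`, recording the assignment the way the
# results index records `ml.is_training`. Documents with is_training=false
# form the test split that the evaluate API example filters on.

def split_for_training(docs, training_percent=100, seed=42):
    """Shuffle docs and mark the first `training_percent`% as training."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * training_percent / 100)
    for i, doc in enumerate(shuffled):
        doc.setdefault("ml", {})["is_training"] = i < n_train
    return shuffled

docs = [{"price": 100000 + 1000 * i} for i in range(10)]
out = split_for_training(docs, training_percent=70)
test_split = [d for d in out if not d["ml"]["is_training"]]
print(len(test_split))  # prints 3
```

With `training_percent: 100` (the default, as in the `house_price_regression_analysis` example) every eligible document is used for training, so there is no held-out split to evaluate against.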