[DOCS] Adds regression analytics resources and examples to the data frame analytics APIs and the evaluation API (#46176)

* [DOCS] Adds regression analytics resources and examples to the data frame analytics APIs. Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com> Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
2019-09-19 09:10:11 +02:00 · 2019-09-19 09:10:11 +02:00 · 3be51fbdf7
parent 65fffcc9c1
commit 3be51fbdf7
4 changed files with 289 additions and 33 deletions
--- a/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc
@ -12,7 +12,8 @@
 `analysis`::
  (object) The type of analysis that is performed on the `source`. For example: 
-  `outlier_detection`. For more information, see <<dfanalytics-types>>.
+  `outlier_detection` or `regression`. For more information, see 
  <<dfanalytics-types>>.
 `analyzed_fields`::
  (object) You can specify both `includes` and/or `excludes` patterns. If 
@ -99,14 +100,12 @@ PUT _ml/data_frame/analytics/loganalytics
 {dfanalytics-cap} resources contain `analysis` objects. For example, when you
 create a {dfanalytics-job}, you must define the type of analysis it performs.
 Currently, `outlier_detection` is the only available type of analysis, however, 
 other types will be added, for example `regression`.
 [discrete]
 [[oldetection-resources]]
 ==== {oldetection-cap} configuration objects 
-An {oldetection} configuration object has the following properties:
+An `outlier_detection` configuration object has the following properties:
 `compute_feature_influence`::
  (boolean) If `true`, the feature influence calculation is enabled. Defaults to 
@ -123,7 +122,7 @@ An {oldetection} configuration object has the following properties:
  recommend to use the ensemble method. Available methods are `lof`, `ldof`, 
  `distance_kth_nn`, `distance_knn`.
-`n_neighbors`::
+  `n_neighbors`::
  (integer) Defines the value for how many nearest neighbors each method of 
  {oldetection} will use to calculate its {olscore}. When the value is not set, 
  different values will be used for different ensemble members. This helps 
@ -140,3 +139,122 @@ An {oldetection} configuration object has the following properties:
  before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to 
  `true`. For more information, see 
  https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
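The standardization formula above, (x_i - mean(x_i)) / sd(x_i), can be illustrated with a minimal Python sketch. It is not the {oldetection} implementation. The function name `standardize` is hypothetical, and using the population standard deviation is an assumption, because the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Apply (x - mean) / sd to every value of a single feature column."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sd for x in values]


scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After standardization, the column has a mean of zero and a standard deviation of one, so features measured on different scales contribute comparably to distance-based scores.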
 [discrete]
 [[regression-resources]]
 ==== {regression-cap} configuration objects
 [source,console]
 --------------------------------------------------
 PUT _ml/data_frame/analytics/house_price_regression_analysis
 {
  "source": {
    "index": "houses_sold_last_10_yrs" <1>
  },
  "dest": {
    "index": "house_price_predictions" <2>
  },
  "analysis": 
    {
      "regression": { <3>
        "dependent_variable": "price" <4>
      }
    }
 }
 --------------------------------------------------
 // TEST[skip:TBD]
 <1> Training data is taken from source index `houses_sold_last_10_yrs`.
 <2> Analysis results will be output to destination index 
 `house_price_predictions`.
 <3> The regression analysis configuration object.
 <4> Regression analysis will use field `price` to train on. As no other 
 parameters have been specified, it will train on 100% of eligible data, store 
 its prediction in destination index field `price_prediction`, and use built-in 
 hyperparameter optimization to minimize validation error.
 [float]
 [[regression-resources-standard]]
 ===== Standard parameters
 `dependent_variable`::
  (Required, string) Defines which field of the {dataframe} is to be predicted. 
  This parameter is supplied by field name and must match one of the fields in 
  the index being used to train. If this field is missing from a document, then 
  that document will not be used for training, but a prediction with the trained 
  model will be generated for it. The data type of the field must be numeric. 
  This field is also known as the continuous target variable.
 `prediction_field_name`::
 (Optional, string) Defines the name of the prediction field in the results. 
 Defaults to `<dependent_variable>_prediction`.
 `training_percent`::
 (Optional, integer) Defines what percentage of the eligible documents will be 
 used for training. Documents that are ignored by the analysis (for example 
 those that contain arrays) are not counted when this percentage is calculated. 
 Defaults to `100`.
 [float]
 [[regression-resources-advanced]]
 ===== Advanced parameters
 Advanced parameters are for fine-tuning {reganalysis}. If you do not supply 
 them, their values are set automatically by 
 <<ml-hyperparameter-optimization,hyperparameter optimization>> to give minimum 
 validation error. It is highly recommended to use the default values unless you 
 fully understand the function of these parameters.
 `eta`::
 (Optional, double) The shrinkage applied to the weights. Smaller values result 
 in larger forests which have better generalization error. However, the smaller 
 the value the longer the training will take. For more information, see 
 https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article] 
 about shrinkage.
 `feature_bag_fraction`::
 (Optional, double) Defines the fraction of features that will be used when 
 selecting a random bag for each candidate split. 
 `maximum_number_trees`::
 (Optional, integer) Defines the maximum number of trees the forest is allowed 
 to contain. The maximum value is 2000.
 `gamma`::
 (Optional, double) Regularization parameter to prevent overfitting on the 
 training dataset. Multiplies a linear penalty associated with the size of 
 individual trees in the forest. The higher the value the more training will 
 prefer smaller trees. The smaller this parameter the larger individual trees 
 will be and the longer training will take.
 `lambda`::
 (Optional, double) Regularization parameter to prevent overfitting on the 
 training dataset. Multiplies an L2 regularization term which applies to leaf 
 weights of the individual trees in the forest. The higher the value the more 
 training will attempt to keep leaf weights small. This makes the prediction 
 function smoother at the expense of potentially not being able to capture 
 relevant relationships between the features and the {depvar}. The smaller this 
 parameter the larger individual trees will be and the longer training will take.
 [[ml-hyperparameter-optimization]]
 ===== Hyperparameter optimization
 If you don't supply {regression} parameters, hyperparameter optimization will be 
 performed by default to set a value for the undefined parameters. The starting 
 point is calculated for data dependent parameters by examining the loss on the 
 training data. Subject to the size constraint, this operation provides an upper 
 bound on the improvement in validation loss.
 A fixed number of rounds is used for optimization. This number depends on the 
 number of parameters being optimized. The optimization starts with a random 
 search, then performs Bayesian optimization that targets maximum expected 
 improvement. If you override any parameters, the optimization uses the values 
 you provided for the overridden parameters and calculates the values of the 
 remaining parameters accordingly. The number of rounds is reduced accordingly. 
 The validation error is estimated in each round by using 4-fold cross 
 validation.
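To make the idea of estimating validation error with 4-fold cross validation more concrete, here is a minimal Python sketch. It is an illustration only, not the {reganalysis} implementation. The function name `k_fold_validation_error` is hypothetical, the folds are assigned by index position, and a mean predictor stands in for the real boosted-tree model.

```python
def k_fold_validation_error(targets, k=4):
    """Estimate validation error as the average held-out MSE over k folds.

    A mean predictor is a stand-in for the real model: it "trains" by
    averaging the target values of the training folds.
    """
    fold_errors = []
    for fold in range(k):
        # Every k-th document (offset by `fold`) is held out for validation.
        held_out = [y for i, y in enumerate(targets) if i % k == fold]
        training = [y for i, y in enumerate(targets) if i % k != fold]
        prediction = sum(training) / len(training)
        mse = sum((y - prediction) ** 2 for y in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k


error = k_fold_validation_error([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Each document is used for validation exactly once and for training in the other three folds, which gives a less noisy error estimate than a single train/test split.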
--- a/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
@ -27,15 +27,11 @@ information, see {stack-ov}/security-privileges.html[Security privileges] and
 [[ml-evaluate-dfanalytics-desc]]
 ==== {api-description-title}
-This API evaluates the executed analysis on an index that is already annotated
+The API packages together commonly used evaluation metrics for various types of 
-with a field that contains the results of the analytics (the `ground truth`)
+machine learning features. This has been designed for use on indexes created by 
-for each {dataframe} row.
+{dfanalytics}. Evaluation requires both a ground truth field and an analytics 
 result field to be present.
 Evaluation is typically done by calculating a set of metrics that capture various aspects of the quality of the results over the data for which you have the
 `ground truth`.
 For different types of analyses different metrics are suitable. This API
 packages together commonly used metrics for various analyses.
 [[ml-evaluate-dfanalytics-request-body]]
 ==== {api-request-body-title}
@ -45,14 +41,19 @@ packages together commonly used metrics for various analyses.
  performed.
 `query`::
-  (Optional, object) Query used to select data from the index.
+  (Optional, object) A query clause that retrieves a subset of data from the 
-  The {es} query domain-specific language (DSL). This value corresponds to the query
+  source index. See <<query-dsl>>.
  object in an {es} search POST body. By default, this property has the following
  value: `{"match_all": {}}`.
 `evaluation`::
-  (Required, object) Defines the type of evaluation you want to perform. For example: 
+  (Required, object) Defines the type of evaluation you want to perform. See 
-  `binary_soft_classification`. See <<ml-evaluate-dfanalytics-resources>>.
+  <<ml-evaluate-dfanalytics-resources>>.
 +
 --
 Available evaluation types:
 * `binary_soft_classification`
 * `regression`
 --
 ////
 [[ml-evaluate-dfanalytics-results]]
@ -74,6 +75,8 @@ packages together commonly used metrics for various analyses.
 [[ml-evaluate-dfanalytics-example]]
 ==== {api-examples-title}
 ===== Binary soft classification
 [source,console]
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
@ -131,3 +134,40 @@ The API returns the following results:
  }
 }
 ----
 ===== {regression-cap}
 [source,console]
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
  "index": "house_price_predictions", <1>
  "query": {
      "bool": {
        "filter": [
          { "term":  { "ml.is_training": false } } <2>
        ]
      }
  },
  "evaluation": {
    "regression": { 
      "actual_field": "price", <3>
      "predicted_field": "ml.price_prediction", <4>
      "metrics": {  
        "r_squared": {},
        "mean_squared_error": {}                             
      }
    }
  }
 }
 --------------------------------------------------
 // TEST[skip:TBD]
 <1> The output destination index from a {dfanalytics} {reganalysis}.
 <2> In this example, a test/train split (`training_percent`) was defined for the 
 {reganalysis}. This query limits evaluation to be performed on the test split 
 only. 
 <3> The ground truth value for the actual house price. This is required in order 
 to evaluate results.
 <4> The predicted value for house price calculated by the {reganalysis}.
--- a/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/evaluateresources.asciidoc
@ -12,7 +12,19 @@ Evaluation configuration objects relate to the <<evaluate-dfanalytics>>.
 `evaluation`::
  (object) Defines the type of evaluation you want to perform. The value of this 
  object can be different depending on the type of evaluation you want to 
-  perform. For example, it can contain <<binary-sc-resources>>.
+  perform.
 +
 --
 Available evaluation types:
 * `binary_soft_classification`
 * `regression`
 --
 `query`::
  (object) A query clause that retrieves a subset of data from the source index. 
  See <<query-dsl>>. The evaluation only applies to those documents of the index 
  that match the query.
 [[binary-sc-resources]]
 ==== Binary soft classification configuration objects
@ -27,18 +39,18 @@ probability whether each row is an outlier.
 ===== {api-definitions-title}
 `actual_field`::
-  (string) The field of the `index` which contains the `ground 
+  (string) The field of the `index` which contains the `ground truth`. 
-  truth`. The data type of this field can be boolean or integer. If the data 
+  The data type of this field can be boolean or integer. If the data type is 
-  type is integer, the value has to be either `0` (false) or `1` (true).
+  integer, the value has to be either `0` (false) or `1` (true).
 `predicted_probability_field`::
-  (string) The field of the `index` that defines the probability of whether the 
+  (string) The field of the `index` that defines the probability of 
-  item belongs to the class in question or not. It's the field that contains the 
+  whether the item belongs to the class in question or not. It's the field that 
-  results of the analysis.
+  contains the results of the analysis.
 `metrics`::
-  (object) Specifies the metrics that are used for the evaluation. Available 
+  (object) Specifies the metrics that are used for the evaluation. 
-  metrics:
+  Available metrics:
  `auc_roc`::
    (object) The AUC ROC (area under the curve of the receiver operating 
@ -61,3 +73,26 @@ probability whether each row is an outlier.
    false negative) are calculated.
    Default value is {"at": [0.25, 0.50, 0.75]}.
 [[regression-evaluation-resources]]
 ==== {regression-cap} evaluation objects
 {regression-cap} evaluation evaluates the results of a {regression} analysis 
 which outputs a prediction of values.
 [discrete]
 [[regression-evaluation-resources-properties]]
 ===== {api-definitions-title}
 `actual_field`::
  (string) The field of the `index` which contains the `ground truth`. The data 
  type of this field must be numerical.
 `predicted_field`::
  (string) The field in the `index` that contains the predicted value, 
  in other words the results of the {regression} analysis.
 `metrics`::
  (object) Specifies the metrics that are used for the evaluation. Available 
  metrics are `r_squared` and `mean_squared_error`.
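As an illustration of what the two metrics measure, the following Python sketch computes them from a list of actual and predicted values using their standard textbook definitions. It is not the evaluation API's implementation. The function names mirror the metric names for readability, and the sample house prices are made up.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical ground truth (`actual_field`) and predictions (`predicted_field`).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

A lower mean squared error and an R-squared value closer to 1 both indicate predictions that track the ground truth more closely.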
--- a/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
+++ b/docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
@ -121,6 +121,9 @@ and mappings.
 [[ml-put-dfanalytics-example]]
 ==== {api-examples-title}
 [[ml-put-dfanalytics-example-od]]
 ===== {oldetection-cap} example
 The following example creates the `loganalytics` {dfanalytics-job}, the analysis 
 type is `outlier_detection`:
@ -173,3 +176,63 @@ The API returns the following result:
 ----
 // TESTRESPONSE[s/1562351429434/$body.$_path/]
 // TESTRESPONSE[s/"version" : "7.3.0"/"version" : $body.version/]
 [[ml-put-dfanalytics-example-r]]
 ===== {regression-cap} example
 The following example creates the `house_price_regression_analysis` 
 {dfanalytics-job}, the analysis type is `regression`:
 [source,console]
 --------------------------------------------------
 PUT _ml/data_frame/analytics/house_price_regression_analysis
 {
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": 
    {
      "regression": {
        "dependent_variable": "price"
      }
    }
 }
 --------------------------------------------------
 // TEST[skip:TBD]
 The API returns the following result:
 [source,console-result]
 ----
 {
  "id" : "house_price_regression_analysis",
  "source" : {
    "index" : [
      "houses_sold_last_10_yrs"
    ],
    "query" : {
      "match_all" : { }
    }
  },
  "dest" : {
    "index" : "house_price_predictions",
    "results_field" : "ml"
  },
  "analysis" : {
    "regression" : {
      "dependent_variable" : "price",
      "training_percent" : 100
    }
  },
  "model_memory_limit" : "1gb",
  "create_time" : 1567168659127,
  "version" : "8.0.0"
 }
 ----
 // TESTRESPONSE[s/1567168659127/$body.$_path/]
 // TESTRESPONSE[s/"version" : "8.0.0"/"version" : $body.version/]