[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)

István Zoltán Szabó 2019-11-06 07:40:27 -05:00
parent 70765dfb05
commit 3c9bd13dca
3 changed files with 193 additions and 54 deletions


@@ -18,13 +18,14 @@

`analyzed_fields`::
(object) You can specify both `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example, all the numeric fields for {oldetection}. For the supported field
types, see <<ml-put-dfanalytics-supported-fields>>.

`includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.
@@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.

[[regression-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
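For illustration, a minimal {reganalysis} configuration that uses only these
standard parameters might look like the following sketch; the `house-prices`
index, `price` field, and `predicted_price` name are hypothetical and not part
of this commit:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source" : {
    "index": "house-prices"
  },
  "dest" : {
    "index": "house-prices-predictions"
  },
  "analysis" : {
    "regression": {
      "dependent_variable": "price", <1>
      "prediction_field_name": "predicted_price", <2>
      "training_percent": 80 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> The numeric field that the model learns to predict.
<2> The name of the prediction field in the results. If omitted, it defaults to
`price_prediction`.
<3> 80 percent of the eligible documents are used for training.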
[float]
@@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects

[float]
[[classification-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--

`num_top_classes`::
(Optional, integer) Defines the number of categories for which the predicted
probabilities are reported. It must be non-negative. If it is greater than the
total number of categories to predict (in the {version} version of the {stack},
this is two), all category probabilities are reported. Defaults to `2`.

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]

[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
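As a hedged sketch only (the job name, index, `label` field, and the specific
values below are illustrative assumptions, not recommendations), a
{classanalysis} that supplies these advanced parameters explicitly might look
like this; a parameter that is supplied is not tuned by the hyperparameter
optimization described below:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_tuned
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified-tuned"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "eta": 0.1,
      "feature_bag_fraction": 0.8,
      "maximum_number_trees": 500,
      "gamma": 0.5,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]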
[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} or {classification} parameters, hyperparameter
optimization will be performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization, which depends on the number
of parameters being optimized. The optimization starts with random search, then


@@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
that don't contain a results field are not included in the {reganalysis}.
====== {classification-cap}

{classification-cap} supports fields that are numeric, boolean, text, keyword,
and ip. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don't contain a results field are not included in the
{classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
Fields that are highly correlated with the `dependent_variable` should be
excluded from the analysis. For example, if you have a multi-value field as the
`dependent_variable`, {es} maps it both as `text` and `keyword`, which results
in two fields (`field` and `field.keyword`). You must exclude the field with
the `text` mapping to get exact results from the analysis.
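For example, assuming a hypothetical multi-field `label` that is mapped as both
`text` and `keyword`, the `text` variant can be excluded as in this sketch:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification_keyword_only
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified-keyword"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label.keyword"
    }
  },
  "analyzed_fields": {
    "excludes": ["label"] <1>
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Excludes the `text` mapping so that only the `keyword` variant,
`label.keyword`, is seen by the analysis.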
[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}
@@ -154,6 +174,7 @@ that don't contain a results field are not included in the {reganalysis}.
[[ml-put-dfanalytics-example]]
==== {api-examples-title}

[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example
@@ -305,3 +326,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3
<1> The `training_percent` defines the percentage of the data set that will be used
for training the model.
[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job}; the
analysis type is `classification`:
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
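Creating the job does not run the analysis. As a minimal follow-up sketch, the
job can then be started by ID with the start {dfanalytics-jobs} API:

[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/loan_classification/_start
--------------------------------------------------
// TEST[skip:TBD]

When the job completes, the classification results are written to the
`loan-applicants-classified` destination index.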


@@ -0,0 +1,70 @@
tag::dependent_variable[]
`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. It is also known as the continuous target
variable.
end::dependent_variable[]
tag::eta[]
`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests, which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
end::eta[]
tag::feature_bag_fraction[]
`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
end::feature_bag_fraction[]
tag::gamma[]
`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer the training will take.
end::gamma[]
tag::lambda[]
`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to leaf
weights of the individual trees in the forest. The higher the value, the more
training will attempt to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller this
parameter, the larger individual trees will be and the longer the training
will take.
end::lambda[]
tag::maximum_number_trees[]
`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
end::maximum_number_trees[]
tag::prediction_field_name[]
`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
end::prediction_field_name[]
tag::training_percent[]
`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example,
those that contain arrays) won't be included in the calculation of the used
percentage. Defaults to `100`.
end::training_percent[]