[DOCS] Adds classification type DFA API docs and ml-shared.asciidoc (#48241)
This commit is contained in:
parent
70765dfb05
commit
3c9bd13dca
@ -18,13 +18,14 @@
`analyzed_fields`::
(object) You can specify both `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example all the numeric fields for {oldetection}.

`analyzed_fields.includes`:::
example, all the numeric fields for {oldetection}. For the supported field
types, see <<ml-put-dfanalytics-supported-fields>>.

`includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`analyzed_fields.excludes`:::

`excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.

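As a sketch of how these patterns fit into a job definition (the job, index, and field names here are hypothetical), `analyzed_fields` is set at the top level of the request body:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/weather_regression
{
  "source": { "index": "weather" },
  "dest": { "index": "weather-predicted" },
  "analysis": {
    "regression": { "dependent_variable": "temperature" }
  },
  "analyzed_fields": {
    "includes": [ "temperature", "humidity", "wind_*" ],
    "excludes": [ "station_id" ]
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]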
@ -179,23 +180,15 @@ hyperparameter optimization to give minimum validation errors.
[[regression-resources-standard]]
===== Standard parameters

`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. The data type of the field must be numeric. It
is also known as the continuous target variable.
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.

`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example
those that contain arrays) won’t be included in the calculation of the used
percentage. Defaults to `100`.
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]


[float]
@ -209,46 +202,73 @@ values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.

`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.

`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer training will take.
include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

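Putting these together, a {reganalysis} job that pins the advanced parameters explicitly could look like the sketch below. The index names and parameter values here are hypothetical; the parameter names are those documented above, and setting them explicitly means they will not be tuned automatically:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression
{
  "source": { "index": "house-prices" },
  "dest": { "index": "house-prices-predicted" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05,
      "feature_bag_fraction": 0.7,
      "maximum_number_trees": 500,
      "gamma": 1.0,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]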
[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects

`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to the
leaf weights of the individual trees in the forest. The higher the value, the
more training will attempt to keep leaf weights small. This makes the
prediction function smoother at the expense of potentially not being able to
capture relevant relationships between the features and the {depvar}. The
smaller this parameter, the larger individual trees will be and the longer
training will take.

[float]
[[classification-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--

`num_top_classes`::
|
||||
(Optional, integer) Defines the number of categories for which the predicted
|
||||
probabilities are reported. It must be non-negative. If it is greater than the
|
||||
total number of categories (in the {version} version of the {stack}, it's two)
|
||||
to predict then we will report all category probabilities. Defaults to 2.
|
||||
|
||||
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]


[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. They are set
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]

[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} parameters, hyperparameter optimization will be
performed by default to set a value for the undefined parameters. The starting
point is calculated for data-dependent parameters by examining the loss on the
training data. Subject to the size constraint, this operation provides an upper
bound on the improvement in validation loss.
If you don't supply {regression} or {classification} parameters, hyperparameter
optimization will be performed by default to set a value for the undefined
parameters. The starting point is calculated for data-dependent parameters by
examining the loss on the training data. Subject to the size constraint, this
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization, which depends on the number
of parameters being optimized. The optimization starts with random search, then
@ -67,6 +67,26 @@ an array with two or more values are also ignored. Documents in the `dest` index
that don’t contain a results field are not included in the {reganalysis}.


====== {classification-cap}

{classification-cap} supports fields that are numeric, boolean, text, keyword,
and ip. It is also tolerant of missing values. Fields that are supported are
included in the analysis; other fields are ignored. Documents where included
fields contain an array with two or more values are also ignored. Documents in
the `dest` index that don’t contain a results field are not included in the
{classanalysis}.

{classanalysis-cap} can be improved by mapping ordinal variable values to a
single number. For example, in the case of age ranges, you can model the values
as "0-14" = 0, "15-24" = 1, "25-34" = 2, and so on.
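As a sketch of one way to apply such a mapping before indexing (the pipeline name, field names, and range values here are hypothetical), an ingest pipeline with a script processor can convert the range into its ordinal:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/age-range-to-ordinal
{
  "description": "Maps the hypothetical age_range field to an ordinal number",
  "processors": [
    {
      "script": {
        "source": "ctx.age_ordinal = ['0-14': 0, '15-24': 1, '25-34': 2, '35-44': 3].get(ctx.age_range)"
      }
    }
  ]
}
--------------------------------------------------
// TEST[skip:hypothetical example]

The resulting `age_ordinal` field can then be used in the analysis instead of the string-valued `age_range`.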

Fields that are highly correlated with the `dependent_variable` should be
excluded from the analysis. For example, if you have a multi-value field as the
`dependent_variable`, {es} maps it both as text and keyword, which results in
two fields (`field` and `field.keyword`). You must exclude the field with the
text mapping to get exact results from the analysis.
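For instance, assuming a hypothetical `airline` field that is mapped both as text (`airline`) and keyword (`airline.keyword`), the text mapping might be excluded like this (all job, index, and field names here are illustrative):

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/flight_classification
{
  "source": { "index": "flights" },
  "dest": { "index": "flights-classified" },
  "analysis": {
    "classification": { "dependent_variable": "airline.keyword" }
  },
  "analyzed_fields": {
    "excludes": [ "airline" ]
  }
}
--------------------------------------------------
// TEST[skip:hypothetical example]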


[[ml-put-dfanalytics-path-params]]
==== {api-path-parms-title}

@ -154,6 +174,7 @@ that don’t contain a results field are not included in the {reganalysis}.
[[ml-put-dfanalytics-example]]
==== {api-examples-title}


[[ml-put-dfanalytics-example-od]]
===== {oldetection-cap} example

@ -305,3 +326,31 @@ PUT _ml/data_frame/analytics/student_performance_mathematics_0.3

<1> The `training_percent` defines the percentage of the data set that will be
used for training the model.


[[ml-put-dfanalytics-example-c]]
===== {classification-cap} example

The following example creates the `loan_classification` {dfanalytics-job} with
the analysis type `classification`:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source" : {
    "index": "loan-applicants"
  },
  "dest" : {
    "index": "loan-applicants-classified"
  },
  "analysis" : {
    "classification": {
      "dependent_variable": "label",
      "training_percent": 75,
      "num_top_classes": 2
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
@ -0,0 +1,70 @@
tag::dependent_variable[]
`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. It is also known as the continuous target
variable.
end::dependent_variable[]


tag::eta[]
`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value, the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
end::eta[]


tag::feature_bag_fraction[]
`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
end::feature_bag_fraction[]


tag::gamma[]
`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training will
prefer smaller trees. The smaller this parameter, the larger individual trees
will be and the longer training will take.
end::gamma[]


tag::lambda[]
`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to the
leaf weights of the individual trees in the forest. The higher the value, the
more training will attempt to keep leaf weights small. This makes the
prediction function smoother at the expense of potentially not being able to
capture relevant relationships between the features and the {depvar}. The
smaller this parameter, the larger individual trees will be and the longer
training will take.
end::lambda[]


tag::maximum_number_trees[]
`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
end::maximum_number_trees[]


tag::prediction_field_name[]
`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
end::prediction_field_name[]


tag::training_percent[]
`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents will be
used for training. Documents that are ignored by the analysis (for example
those that contain arrays) won’t be included in the calculation of the used
percentage. Defaults to `100`.
end::training_percent[]