[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} job resources

{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
<<get-dfanalytics>>.

[discrete]
[[ml-dfanalytics-properties]]
==== {api-definitions-title}

`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
`outlier_detection` or `regression`. For more information, see
<<dfanalytics-types>>.

`analyzed_fields`::
(object) You can specify `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the fields that are relevant to the
analysis are included. For example, all the numeric fields for {oldetection}.

`analyzed_fields.includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.

`analyzed_fields.excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {}
  },
  "analyzed_fields": {
    "includes": [ "request.bytes", "response.counts.error" ],
    "excludes": [ "source.geo" ]
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

`description`::
(Optional, string) A description of the job.

`dest`::
(object) The destination configuration of the analysis.

`index`:::
(Required, string) Defines the _destination index_ to store the results of
the {dfanalytics-job}.

`results_field`:::
(Optional, string) Defines the name of the field in which to store the
results of the analysis. Defaults to `ml`.
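
For example, the following sketch of a create request (the job and index
names are illustrative placeholders) stores its results in a `custom_ml`
field instead of the default `ml`:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_custom_results
{
  "source": {
    "index": "example_source_index"
  },
  "dest": {
    "index": "example_dest_index",
    "results_field": "custom_ml"
  },
  "analysis": {
    "outlier_detection": {}
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
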
`id`::
(string) The unique identifier for the {dfanalytics-job}. This identifier can
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
underscores. It must start and end with alphanumeric characters. This property
is informational; you cannot change the identifier for existing jobs.

`model_memory_limit`::
(string) The approximate maximum amount of memory resources that are
permitted for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.
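
For example, the following sketch (the names are placeholders) doubles the
default memory limit to `2gb`; the request fails if
`xpack.ml.max_model_memory_limit` is set to a lower value:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_memory_limit
{
  "source": {
    "index": "example_source_index"
  },
  "dest": {
    "index": "example_dest_index"
  },
  "analysis": {
    "outlier_detection": {}
  },
  "model_memory_limit": "2gb"
}
--------------------------------------------------
// TEST[skip:illustrative example]
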
`source`::
(object) The source configuration, consisting of an `index` and optionally a
`query` object.

`index`:::
(Required, string or array) Index or indices on which to perform the
analysis. It can be a single index or index pattern as well as an array of
indices or patterns.

`query`:::
(Optional, object) The {es} query domain-specific language
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
search POST body. All the options that are supported by {es} can be used,
as this object is passed verbatim to {es}. By default, this property has
the following value: `{"match_all": {}}`.
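
For example, the following sketch (index and field names are placeholders)
restricts the analysis to documents that match a `range` query instead of the
default `match_all`:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_filtered_source
{
  "source": {
    "index": "example_source_index",
    "query": {
      "range": {
        "response.bytes": {
          "gte": 100
        }
      }
    }
  },
  "dest": {
    "index": "example_dest_index"
  },
  "analysis": {
    "outlier_detection": {}
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
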
[[dfanalytics-types]]
==== Analysis objects

{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.

[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects

An `outlier_detection` configuration object has the following properties:

`compute_feature_influence`::
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
`true`.

`feature_influence_threshold`::
(double) The minimum {olscore} that a document needs to have in order to
calculate its {fiscore}. Value range: 0-1 (`0.1` by default).

`method`::
(string) Sets the method that {oldetection} uses. If the method is not set,
{oldetection} uses an ensemble of different methods and normalizes and
combines their individual {olscores} to obtain the overall {olscore}. We
recommend using the ensemble method. Available methods are `lof`, `ldof`,
`distance_kth_nn`, and `distance_knn`.

`n_neighbors`::
(integer) Defines how many nearest neighbors each method of {oldetection}
uses to calculate its {olscore}. When the value is not set, different values
are used for different ensemble members, which helps improve the diversity of
the ensemble. Therefore, only override this value if you are confident that
the value you choose is appropriate for the data set.

`outlier_fraction`::
(double) Sets the proportion of the data set that is assumed to be outlying
prior to {oldetection}. For example, 0.05 means it is assumed that 5% of
values are real outliers and 95% are inliers.

`standardization_enabled`::
(boolean) If `true`, the following operation is performed on the columns
before computing outlier scores: `(x_i - mean(x_i)) / sd(x_i)`. Defaults to
`true`. For more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
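
To show how these properties fit together, the following sketch (the names
are placeholders and the values are illustrative, not recommendations)
replaces the default ensemble with an explicit method and neighbor count:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_oldetection_config
{
  "source": {
    "index": "example_source_index"
  },
  "dest": {
    "index": "example_dest_index"
  },
  "analysis": {
    "outlier_detection": {
      "method": "distance_knn",
      "n_neighbors": 5,
      "feature_influence_threshold": 0.2,
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
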
[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs" <1>
  },
  "dest": {
    "index": "house_price_predictions" <2>
  },
  "analysis": {
    "regression": { <3>
      "dependent_variable": "price" <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Training data is taken from the source index `houses_sold_last_10_yrs`.
<2> Analysis results are output to the destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis uses the `price` field to train on. As no other
parameters have been specified, the analysis trains on 100% of the eligible
data, stores its prediction in the destination index field `price_prediction`,
and uses the built-in hyperparameter optimization to give minimum validation
errors.

[float]
[[regression-resources-standard]]
===== Standard parameters

`dependent_variable`::
(Required, string) Defines which field of the document is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document is not used for training, but a prediction with the trained
model is generated for it. The data type of the field must be numeric. It is
also known as the continuous target variable.

`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.

`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents are
used for training. Documents that are ignored by the analysis (for example,
those that contain arrays) are not included in the calculation of this
percentage. Defaults to `100`.
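
For example, the following sketch extends the earlier house price example
(the values are illustrative) so that the analysis trains on 80% of the
eligible documents and writes its predictions to a `predicted_price` field:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_regression_standard
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price",
      "training_percent": 80
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
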
[float]
[[regression-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {reganalysis}. If they are not
supplied, their values are automatically set by
<<ml-hyperparameter-optimization,hyperparameter optimization>> to give the
minimum validation error. It is highly recommended to use the default values
unless you fully understand the function of these parameters.

`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value, the longer the training takes. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.

`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that is used when
selecting a random bag for each candidate split.

`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.

`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value, the more training
prefers smaller trees. The smaller this parameter, the larger individual
trees will be and the longer training will take.

`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularization term which applies to the
leaf weights of the individual trees in the forest. The higher the value, the
more training attempts to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller
this parameter, the larger individual trees will be and the longer training
will take.
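
The following sketch shows where these parameters sit in the `regression`
object; the values are arbitrary illustrations rather than recommended
settings, and any parameter you omit is still tuned automatically:

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/example_regression_advanced
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.1,
      "feature_bag_fraction": 0.5,
      "maximum_number_trees": 500,
      "gamma": 0.5,
      "lambda": 1.0
    }
  }
}
--------------------------------------------------
// TEST[skip:illustrative example]
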
[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} parameters, hyperparameter optimization is
performed by default to set a value for the undefined parameters. The starting
point is calculated for data-dependent parameters by examining the loss on the
training data. Subject to the size constraint, this operation provides an
upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization which depends on the number
of parameters being optimized. The optimization starts with random search,
then Bayesian optimization is performed, targeting maximum expected
improvement. If you override any parameters, then the optimization calculates
the value of the remaining parameters accordingly and uses the value you
provided for the overridden parameter. The number of rounds is reduced
accordingly. The validation error is estimated in each round by using 4-fold
cross-validation.