
[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} job resources

{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
<<get-dfanalytics>>.

[discrete]
[[ml-dfanalytics-properties]]
==== {api-definitions-title}

`analysis`::
  (object) The type of analysis that is performed on the `source`. For example: 
  `outlier_detection` or `regression`. For more information, see 
  <<dfanalytics-types>>.
  
`analyzed_fields`::
  (Optional, object) Specify `includes` and/or `excludes` patterns to select
  which fields are included in the analysis. If `analyzed_fields` is not set,
  only the relevant fields are included; for example, all the numeric fields
  for {oldetection}. For the supported field types, see
  <<ml-put-dfanalytics-supported-fields>>. See also <<explain-dfanalytics>>,
  which helps you understand field selection.
    
  `includes`:::
    (Optional, array) An array of strings that defines the fields that will be included in 
    the analysis.
      
  `excludes`:::
    (Optional, array) An array of strings that defines the fields that will be excluded 
    from the analysis.
  

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "analyzed_fields": {
    "includes": [ "request.bytes", "response.counts.error" ],
    "excludes": [ "source.geo" ]
  }
}
--------------------------------------------------
// TEST[setup:setup_logdata]

`description`::
  (Optional, string) A description of the job.

`dest`::
  (object) The destination configuration of the analysis.
  
  `index`:::
    (Required, string) Defines the _destination index_ to store the results of 
    the {dfanalytics-job}.
  
  `results_field`:::
    (Optional, string) Defines the name of the field in which to store the 
    results of the analysis. Defaults to `ml`.

`id`::
  (string) The unique identifier for the {dfanalytics-job}. This identifier can 
  contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and 
  underscores. It must start and end with alphanumeric characters. This property 
  is informational; you cannot change the identifier for existing jobs.
  
`model_memory_limit`::
  (string) The approximate maximum amount of memory resources that are 
  permitted for analytical processing. The default value for {dfanalytics-jobs} 
  is `1gb`. If your `elasticsearch.yml` file contains an 
  `xpack.ml.max_model_memory_limit` setting, an error occurs when you try to 
  create {dfanalytics-jobs} that have `model_memory_limit` values greater than 
  that setting. For more information, see <<ml-settings>>.

`source`::
  (object) The configuration of how to source the analysis data. It requires an `index`.
  Optionally, `query` and `_source` may be specified.
  
  `index`:::
    (Required, string or array) Index or indices on which to perform the 
    analysis. It can be a single index or index pattern as well as an array of 
    indices or patterns.
    
  `query`:::
    (Optional, object) The {es} query domain-specific language 
    (<<query-dsl,DSL>>). This value corresponds to the query object in an {es} 
    search POST body. All the options that are supported by {es} can be used, 
    as this object is passed verbatim to {es}. By default, this property has 
    the following value: `{"match_all": {}}`.

  `_source`:::
    (Optional, object) Specify `includes` and/or `excludes` patterns to select
    which fields will be present in the destination. Fields that are excluded
    cannot be included in the analysis.
        
      `includes`::::
        (array) An array of strings that defines the fields that will be included in 
        the destination.
          
      `excludes`::::
        (array) An array of strings that defines the fields that will be excluded 
        from the destination.
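
The following request is a minimal sketch of how the `source`, `dest`, and
`model_memory_limit` properties described above might be combined. The job ID,
the `logdata-archive-*` pattern, the destination index name, the query, and the
memory value are hypothetical and chosen only for illustration.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_errors
{
  "description": "Outlier detection on documents that report errors",
  "source": {
    "index": [ "logdata", "logdata-archive-*" ], <1>
    "query": { <2>
      "range": {
        "response.counts.error": { "gte": 1 }
      }
    },
    "_source": { <3>
      "excludes": [ "source.geo" ]
    }
  },
  "dest": {
    "index": "logdata_errors_out",
    "results_field": "ml" <4>
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "model_memory_limit": "500mb" <5>
}
--------------------------------------------------
// TEST[skip:TBD]

<1> An array that mixes a single index and an index pattern. The pattern is an
assumed name, not an index defined elsewhere in this documentation.
<2> A hypothetical query that replaces the default `{"match_all": {}}`.
<3> Fields excluded here are not present in the destination and therefore
cannot be included in the analysis.
<4> Set explicitly to the default value for illustration.
<5> An assumed value that is lower than the `1gb` default.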

[[dfanalytics-types]]
==== Analysis objects

{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.

[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects 

An `outlier_detection` configuration object has the following properties:

`compute_feature_influence`::
  (boolean) If `true`, the feature influence calculation is enabled. Defaults to 
  `true`.
  
`feature_influence_threshold`:: 
  (double) The minimum {olscore} that a document needs to have in order to 
  calculate its {fiscore}. Value range: 0-1 (`0.1` by default).

`method`::
  (string) Sets the method that {oldetection} uses. If the method is not set,
  {oldetection} uses an ensemble of different methods and normalizes and
  combines their individual {olscores} to obtain the overall {olscore}. We
  recommend using the ensemble method. Available methods are `lof`, `ldof`,
  `distance_kth_nn`, and `distance_knn`.
  
`n_neighbors`::
  (integer) Defines the value for how many nearest neighbors each method of 
  {oldetection} will use to calculate its {olscore}. When the value is not set, 
  different values will be used for different ensemble members. This helps 
  improve diversity in the ensemble. Therefore, only override this if you are 
  confident that the value you choose is appropriate for the data set.
  
`outlier_fraction`::
  (double) Sets the proportion of the data set that is assumed to be outlying prior to 
  {oldetection}. For example, 0.05 means it is assumed that 5% of values are real outliers 
  and 95% are inliers.
  
`standardization_enabled`::
  (boolean) If `true`, then the following operation is performed on the columns 
  before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to 
  `true`. For more information, see 
  https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
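
As a rough illustration of the properties above, the following sketch sets
several `outlier_detection` options explicitly. The job ID, destination index,
and all numeric values are hypothetical. Overriding `method` and `n_neighbors`
is shown only to demonstrate the syntax; the ensemble default remains the
recommended choice.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics_lof
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_lof_out"
  },
  "analysis": {
    "outlier_detection": {
      "method": "lof", <1>
      "n_neighbors": 20, <2>
      "compute_feature_influence": true,
      "feature_influence_threshold": 0.2, <3>
      "outlier_fraction": 0.05, <4>
      "standardization_enabled": true
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Uses a single method instead of the ensemble.
<2> An assumed value; when it is omitted, different ensemble members use
different values.
<3> A hypothetical threshold within the documented 0-1 range.
<4> Assumes that 5% of the values are real outliers.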


[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs" <1>
  },
  "dest": {
    "index": "house_price_predictions" <2>
  },
  "analysis": {
    "regression": { <3>
      "dependent_variable": "price" <4>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> Training data is taken from source index `houses_sold_last_10_yrs`.
<2> Analysis results will be output to destination index 
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis trains on the `price` field. Because no other 
parameters are specified, the job trains on 100% of the eligible data, stores 
its prediction in the destination index field `price_prediction`, and uses 
built-in hyperparameter optimization to minimize the validation error.


[float]
[[regression-resources-standard]]
===== Standard parameters

include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric.
--

include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
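
The following sketch shows how the standard parameters might be set on a
{regression} job. It assumes that the request property names match the shared
tag names above (`prediction_field_name` and `training_percent`). The job ID,
destination index, prediction field name, and percentage are hypothetical.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_80
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_80"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "prediction_field_name": "predicted_price", <1>
      "training_percent": 80 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> An assumed name that replaces the default `price_prediction` field.
<2> An assumed value; trains on 80% of the eligible data instead of 100%.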


[float]
[[regression-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {reganalysis}. If you do not supply 
them, <<ml-hyperparameter-optimization,hyperparameter optimization>> sets 
their values automatically to minimize the validation error. It is highly 
recommended to use the default values unless you fully understand the function 
of these parameters.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
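
If you do decide to override advanced parameters, a request might look like the
following sketch. It assumes that the request property names match the shared
tag names above (`eta` and `maximum_number_trees`). The job ID, destination
index, and both values are hypothetical and are not tuning recommendations.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_tuned
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "dest": {
    "index": "house_price_predictions_tuned"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "eta": 0.05, <1>
      "maximum_number_trees": 200 <2>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> An assumed value, supplied by the user and therefore not optimized.
<2> An assumed value. The parameters that are not listed (`feature_bag_fraction`,
`gamma`, and `lambda`) are still set by hyperparameter optimization.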


[discrete]
[[classification-resources]]
==== {classification-cap} configuration objects 
 
 
[float]
[[classification-resources-standard]]
===== Standard parameters
 
include::{docdir}/ml/ml-shared.asciidoc[tag=dependent_variable]
+
--
The data type of the field must be numeric or boolean.
--
  
`num_top_classes`::
  (Optional, integer) Defines the number of categories for which the predicted 
  probabilities are reported. It must be non-negative. If it is greater than the 
  total number of categories to predict (two in the {version} version of the 
  {stack}), all category probabilities are reported. Defaults to 2.
 
include::{docdir}/ml/ml-shared.asciidoc[tag=prediction_field_name]

include::{docdir}/ml/ml-shared.asciidoc[tag=training_percent]
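
A {classification} job that uses these standard parameters might be sketched as
follows. The job ID, index names, and the `defaulted` field are hypothetical;
the field is assumed to be a boolean, in line with the data type requirement
above.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loan_classification
{
  "source": {
    "index": "loan-applicants"
  },
  "dest": {
    "index": "loan-applicants-classified"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "defaulted", <1>
      "num_top_classes": 2, <2>
      "training_percent": 75 <3>
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]

<1> An assumed boolean field in the hypothetical source index.
<2> Set explicitly to the default value; probabilities are reported for both
categories.
<3> An assumed value; trains on 75% of the eligible data.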


[float]
[[classification-resources-advanced]]
===== Advanced parameters

Advanced parameters are for fine-tuning {classanalysis}. If you do not supply 
them, <<ml-hyperparameter-optimization,hyperparameter optimization>> sets 
their values automatically to minimize the validation error. It is highly 
recommended to use the default values unless you fully understand the function 
of these parameters.

include::{docdir}/ml/ml-shared.asciidoc[tag=eta]

include::{docdir}/ml/ml-shared.asciidoc[tag=feature_bag_fraction]

include::{docdir}/ml/ml-shared.asciidoc[tag=maximum_number_trees]

include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]

include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]


[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization

If you don't supply {regression} or {classification} parameters, hyperparameter 
optimization is performed by default to set a value for the undefined 
parameters. The starting point for data-dependent parameters is calculated by 
examining the loss on the training data. Subject to the size constraint, this 
operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization; the number depends on how 
many parameters are being optimized. The optimization starts with random 
search, followed by Bayesian optimization that targets maximum expected 
improvement. If you override any parameters, the optimization uses the values 
you provided for the overridden parameters and tunes only the remaining 
parameters; the number of rounds is reduced correspondingly. The validation 
error is estimated in each round by using 4-fold cross validation.