OpenSearch/docs/reference/ml/df-analytics/apis/dfanalyticsresources.asciidoc

260 lines
10 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} job resources
{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
<<get-dfanalytics>>.
[discrete]
[[ml-dfanalytics-properties]]
==== {api-definitions-title}
`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
`outlier_detection` or `regression`. For more information, see
<<dfanalytics-types>>.
`analyzed_fields`::
(object) You can specify both `includes` and/or `excludes` patterns. If
`analyzed_fields` is not set, only the relevant fields will be included. For
example all the numeric fields for {oldetection}.
`analyzed_fields.includes`:::
(array) An array of strings that defines the fields that will be included in
the analysis.
`analyzed_fields.excludes`:::
(array) An array of strings that defines the fields that will be excluded
from the analysis.
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/loganalytics
{
"source": {
"index": "logdata"
},
"dest": {
"index": "logdata_out"
},
"analysis": {
"outlier_detection": {
}
},
"analyzed_fields": {
"includes": [ "request.bytes", "response.counts.error" ],
"excludes": [ "source.geo" ]
}
}
--------------------------------------------------
// TEST[setup:setup_logdata]
`description`::
(Optional, string) A description of the job.
`dest`::
(object) The destination configuration of the analysis.
`index`:::
(Required, string) Defines the _destination index_ to store the results of
the {dfanalytics-job}.
`results_field`:::
(Optional, string) Defines the name of the field in which to store the
results of the analysis. Default to `ml`.
`id`::
(string) The unique identifier for the {dfanalytics-job}. This identifier can
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
underscores. It must start and end with alphanumeric characters. This property
is informational; you cannot change the identifier for existing jobs.
`model_memory_limit`::
(string) The approximate maximum amount of memory resources that are
permitted for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.
`source`::
(object) The source configuration consisting an `index` and optionally a
`query` object.
`index`:::
(Required, string or array) Index or indices on which to perform the
analysis. It can be a single index or index pattern as well as an array of
indices or patterns.
`query`:::
(Optional, object) The {es} query domain-specific language
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
search POST body. All the options that are supported by {es} can be used,
as this object is passed verbatim to {es}. By default, this property has
the following value: `{"match_all": {}}`.
[[dfanalytics-types]]
==== Analysis objects
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
create a {dfanalytics-job}, you must define the type of analysis it performs.
[discrete]
[[oldetection-resources]]
==== {oldetection-cap} configuration objects
An `outlier_detection` configuration object has the following properties:
`compute_feature_influence`::
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
`true`.
`feature_influence_threshold`::
(double) The minimum {olscore} that a document needs to have in order to
calculate its {fiscore}. Value range: 0-1 (`0.1` by default).
`method`::
(string) Sets the method that {oldetection} uses. If the method is not set
{oldetection} uses an ensemble of different methods and normalises and
combines their individual {olscores} to obtain the overall {olscore}. We
recommend to use the ensemble method. Available methods are `lof`, `ldof`,
`distance_kth_nn`, `distance_knn`.
`n_neighbors`::
(integer) Defines the value for how many nearest neighbors each method of
{oldetection} will use to calculate its {olscore}. When the value is not set,
different values will be used for different ensemble members. This helps
improve diversity in the ensemble. Therefore, only override this if you are
confident that the value you choose is appropriate for the data set.
`outlier_fraction`::
(double) Sets the proportion of the data set that is assumed to be outlying prior to
{oldetection}. For example, 0.05 means it is assumed that 5% of values are real outliers
and 95% are inliers.
`standardize_columns`::
(boolean) If `true`, then the following operation is performed on the columns
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
`true`. For more information, see
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
[discrete]
[[regression-resources]]
==== {regression-cap} configuration objects
[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
"source": {
"index": "houses_sold_last_10_yrs" <1>
},
"dest": {
"index": "house_price_predictions" <2>
},
"analysis":
{
"regression": { <3>
"dependent_variable": "price" <4>
}
}
}
--------------------------------------------------
// TEST[skip:TBD]
<1> Training data is taken from source index `houses_sold_last_10_yrs`.
<2> Analysis results will be output to destination index
`house_price_predictions`.
<3> The regression analysis configuration object.
<4> Regression analysis will use field `price` to train on. As no other
parameters have been specified it will train on 100% of eligible data, store its
prediction in destination index field `price_prediction` and use in-built
hyperparameter optimization to give minimum validation errors.
[float]
[[regression-resources-standard]]
===== Standard parameters
`dependent_variable`::
(Required, string) Defines which field of the {dataframe} is to be predicted.
This parameter is supplied by field name and must match one of the fields in
the index being used to train. If this field is missing from a document, then
that document will not be used for training, but a prediction with the trained
model will be generated for it. The data type of the field must be numeric. It
is also known as continuous target variable.
`prediction_field_name`::
(Optional, string) Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
`training_percent`::
(Optional, integer) Defines what percentage of the eligible documents that will
be used for training. Documents that are ignored by the analysis (for example
those that contain arrays) wont be included in the calculation for used
percentage. Defaults to `100`.
[float]
[[regression-resources-advanced]]
===== Advanced parameters
Advanced parameters are for fine-tuning {reganalysis}. They are set
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
to give minimum validation error. It is highly recommended to use the default
values unless you fully understand the function of these parameters. If these
parameters are not supplied, their values are automatically tuned to give
minimum validation error.
`eta`::
(Optional, double) The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
the value the longer the training will take. For more information, see
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
about shrinkage.
`feature_bag_fraction`::
(Optional, double) Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
`maximum_number_trees`::
(Optional, integer) Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
`gamma`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies a linear penalty associated with the size of
individual trees in the forest. The higher the value the more training will
prefer smaller trees. The smaller this parameter the larger individual trees
will be and the longer train will take.
`lambda`::
(Optional, double) Regularization parameter to prevent overfitting on the
training dataset. Multiplies an L2 regularisation term which applies to leaf
weights of the individual trees in the forest. The higher the value the more
training will attempt to keep leaf weights small. This makes the prediction
function smoother at the expense of potentially not being able to capture
relevant relationships between the features and the {depvar}. The smaller this
parameter the larger individual trees will be and the longer train will take.
[[ml-hyperparameter-optimization]]
===== Hyperparameter optimization
If you don't supply {regression} parameters, hyperparameter optimization will be
performed by default to set a value for the undefined parameters. The starting
point is calculated for data dependent parameters by examining the loss on the
training data. Subject to the size constraint, this operation provides an upper
bound on the improvement in validation loss.
A fixed number of rounds is used for optimization which depends on the number of
parameters being optimized. The optimitazion starts with random search, then
Bayesian Optimisation is performed that is targeting maximum expected
improvement. If you override any parameters, then the optimization will
calculate the value of the remaining parameters accordingly and use the value
you provided for the overridden parameter. The number of rounds are reduced
respectively. The validation error is estimated in each round by using 4-fold
cross validation.