260 lines
10 KiB
Plaintext
260 lines
10 KiB
Plaintext
[role="xpack"]
|
||
[testenv="platinum"]
|
||
[[ml-dfanalytics-resources]]
|
||
=== {dfanalytics-cap} job resources
|
||
|
||
{dfanalytics-cap} resources relate to APIs such as <<put-dfanalytics>> and
|
||
<<get-dfanalytics>>.
|
||
|
||
[discrete]
|
||
[[ml-dfanalytics-properties]]
|
||
==== {api-definitions-title}
|
||
|
||
`analysis`::
|
||
(object) The type of analysis that is performed on the `source`. For example:
|
||
`outlier_detection` or `regression`. For more information, see
|
||
<<dfanalytics-types>>.
|
||
|
||
`analyzed_fields`::
|
||
(object) You can specify both `includes` and/or `excludes` patterns. If
|
||
`analyzed_fields` is not set, only the relevant fields will be included. For
|
||
example all the numeric fields for {oldetection}.
|
||
|
||
`analyzed_fields.includes`:::
|
||
(array) An array of strings that defines the fields that will be included in
|
||
the analysis.
|
||
|
||
`analyzed_fields.excludes`:::
|
||
(array) An array of strings that defines the fields that will be excluded
|
||
from the analysis.
|
||
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT _ml/data_frame/analytics/loganalytics
|
||
{
|
||
"source": {
|
||
"index": "logdata"
|
||
},
|
||
"dest": {
|
||
"index": "logdata_out"
|
||
},
|
||
"analysis": {
|
||
"outlier_detection": {
|
||
}
|
||
},
|
||
"analyzed_fields": {
|
||
"includes": [ "request.bytes", "response.counts.error" ],
|
||
"excludes": [ "source.geo" ]
|
||
}
|
||
}
|
||
--------------------------------------------------
|
||
// TEST[setup:setup_logdata]
|
||
|
||
`description`::
|
||
(Optional, string) A description of the job.
|
||
|
||
`dest`::
|
||
(object) The destination configuration of the analysis.
|
||
|
||
`index`:::
|
||
(Required, string) Defines the _destination index_ to store the results of
|
||
the {dfanalytics-job}.
|
||
|
||
`results_field`:::
|
||
(Optional, string) Defines the name of the field in which to store the
|
||
results of the analysis. Default to `ml`.
|
||
|
||
`id`::
|
||
(string) The unique identifier for the {dfanalytics-job}. This identifier can
|
||
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
|
||
underscores. It must start and end with alphanumeric characters. This property
|
||
is informational; you cannot change the identifier for existing jobs.
|
||
|
||
`model_memory_limit`::
|
||
(string) The approximate maximum amount of memory resources that are
|
||
permitted for analytical processing. The default value for {dfanalytics-jobs}
|
||
is `1gb`. If your `elasticsearch.yml` file contains an
|
||
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
|
||
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
|
||
that setting. For more information, see <<ml-settings>>.
|
||
|
||
`source`::
|
||
(object) The source configuration consisting an `index` and optionally a
|
||
`query` object.
|
||
|
||
`index`:::
|
||
(Required, string or array) Index or indices on which to perform the
|
||
analysis. It can be a single index or index pattern as well as an array of
|
||
indices or patterns.
|
||
|
||
`query`:::
|
||
(Optional, object) The {es} query domain-specific language
|
||
(<<query-dsl,DSL>>). This value corresponds to the query object in an {es}
|
||
search POST body. All the options that are supported by {es} can be used,
|
||
as this object is passed verbatim to {es}. By default, this property has
|
||
the following value: `{"match_all": {}}`.
|
||
|
||
[[dfanalytics-types]]
|
||
==== Analysis objects
|
||
|
||
{dfanalytics-cap} resources contain `analysis` objects. For example, when you
|
||
create a {dfanalytics-job}, you must define the type of analysis it performs.
|
||
|
||
[discrete]
|
||
[[oldetection-resources]]
|
||
==== {oldetection-cap} configuration objects
|
||
|
||
An `outlier_detection` configuration object has the following properties:
|
||
|
||
`compute_feature_influence`::
|
||
(boolean) If `true`, the feature influence calculation is enabled. Defaults to
|
||
`true`.
|
||
|
||
`feature_influence_threshold`::
|
||
(double) The minimum {olscore} that a document needs to have in order to
|
||
calculate its {fiscore}. Value range: 0-1 (`0.1` by default).
|
||
|
||
`method`::
|
||
(string) Sets the method that {oldetection} uses. If the method is not set
|
||
{oldetection} uses an ensemble of different methods and normalises and
|
||
combines their individual {olscores} to obtain the overall {olscore}. We
|
||
recommend to use the ensemble method. Available methods are `lof`, `ldof`,
|
||
`distance_kth_nn`, `distance_knn`.
|
||
|
||
`n_neighbors`::
|
||
(integer) Defines the value for how many nearest neighbors each method of
|
||
{oldetection} will use to calculate its {olscore}. When the value is not set,
|
||
different values will be used for different ensemble members. This helps
|
||
improve diversity in the ensemble. Therefore, only override this if you are
|
||
confident that the value you choose is appropriate for the data set.
|
||
|
||
`outlier_fraction`::
|
||
(double) Sets the proportion of the data set that is assumed to be outlying prior to
|
||
{oldetection}. For example, 0.05 means it is assumed that 5% of values are real outliers
|
||
and 95% are inliers.
|
||
|
||
`standardize_columns`::
|
||
(boolean) If `true`, then the following operation is performed on the columns
|
||
before computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to
|
||
`true`. For more information, see
|
||
https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
|
||
|
||
|
||
[discrete]
|
||
[[regression-resources]]
|
||
==== {regression-cap} configuration objects
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT _ml/data_frame/analytics/house_price_regression_analysis
|
||
{
|
||
"source": {
|
||
"index": "houses_sold_last_10_yrs" <1>
|
||
},
|
||
"dest": {
|
||
"index": "house_price_predictions" <2>
|
||
},
|
||
"analysis":
|
||
{
|
||
"regression": { <3>
|
||
"dependent_variable": "price" <4>
|
||
}
|
||
}
|
||
}
|
||
--------------------------------------------------
|
||
// TEST[skip:TBD]
|
||
|
||
<1> Training data is taken from source index `houses_sold_last_10_yrs`.
|
||
<2> Analysis results will be output to destination index
|
||
`house_price_predictions`.
|
||
<3> The regression analysis configuration object.
|
||
<4> Regression analysis will use field `price` to train on. As no other
|
||
parameters have been specified it will train on 100% of eligible data, store its
|
||
prediction in destination index field `price_prediction` and use in-built
|
||
hyperparameter optimization to give minimum validation errors.
|
||
|
||
|
||
[float]
|
||
[[regression-resources-standard]]
|
||
===== Standard parameters
|
||
|
||
`dependent_variable`::
|
||
(Required, string) Defines which field of the {dataframe} is to be predicted.
|
||
This parameter is supplied by field name and must match one of the fields in
|
||
the index being used to train. If this field is missing from a document, then
|
||
that document will not be used for training, but a prediction with the trained
|
||
model will be generated for it. The data type of the field must be numeric. It
|
||
is also known as continuous target variable.
|
||
|
||
`prediction_field_name`::
|
||
(Optional, string) Defines the name of the prediction field in the results.
|
||
Defaults to `<dependent_variable>_prediction`.
|
||
|
||
`training_percent`::
|
||
(Optional, integer) Defines what percentage of the eligible documents that will
|
||
be used for training. Documents that are ignored by the analysis (for example
|
||
those that contain arrays) won’t be included in the calculation for used
|
||
percentage. Defaults to `100`.
|
||
|
||
|
||
[float]
|
||
[[regression-resources-advanced]]
|
||
===== Advanced parameters
|
||
|
||
Advanced parameters are for fine-tuning {reganalysis}. They are set
|
||
automatically by <<ml-hyperparameter-optimization,hyperparameter optimization>>
|
||
to give minimum validation error. It is highly recommended to use the default
|
||
values unless you fully understand the function of these parameters. If these
|
||
parameters are not supplied, their values are automatically tuned to give
|
||
minimum validation error.
|
||
|
||
`eta`::
|
||
(Optional, double) The shrinkage applied to the weights. Smaller values result
|
||
in larger forests which have better generalization error. However, the smaller
|
||
the value the longer the training will take. For more information, see
|
||
https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article]
|
||
about shrinkage.
|
||
|
||
`feature_bag_fraction`::
|
||
(Optional, double) Defines the fraction of features that will be used when
|
||
selecting a random bag for each candidate split.
|
||
|
||
`maximum_number_trees`::
|
||
(Optional, integer) Defines the maximum number of trees the forest is allowed
|
||
to contain. The maximum value is 2000.
|
||
|
||
`gamma`::
|
||
(Optional, double) Regularization parameter to prevent overfitting on the
|
||
training dataset. Multiplies a linear penalty associated with the size of
|
||
individual trees in the forest. The higher the value the more training will
|
||
prefer smaller trees. The smaller this parameter the larger individual trees
|
||
will be and the longer train will take.
|
||
|
||
`lambda`::
|
||
(Optional, double) Regularization parameter to prevent overfitting on the
|
||
training dataset. Multiplies an L2 regularisation term which applies to leaf
|
||
weights of the individual trees in the forest. The higher the value the more
|
||
training will attempt to keep leaf weights small. This makes the prediction
|
||
function smoother at the expense of potentially not being able to capture
|
||
relevant relationships between the features and the {depvar}. The smaller this
|
||
parameter the larger individual trees will be and the longer train will take.
|
||
|
||
|
||
[[ml-hyperparameter-optimization]]
|
||
===== Hyperparameter optimization
|
||
|
||
If you don't supply {regression} parameters, hyperparameter optimization will be
|
||
performed by default to set a value for the undefined parameters. The starting
|
||
point is calculated for data dependent parameters by examining the loss on the
|
||
training data. Subject to the size constraint, this operation provides an upper
|
||
bound on the improvement in validation loss.
|
||
|
||
A fixed number of rounds is used for optimization which depends on the number of
|
||
parameters being optimized. The optimitazion starts with random search, then
|
||
Bayesian Optimisation is performed that is targeting maximum expected
|
||
improvement. If you override any parameters, then the optimization will
|
||
calculate the value of the remaining parameters accordingly and use the value
|
||
you provided for the overridden parameter. The number of rounds are reduced
|
||
respectively. The validation error is estimated in each round by using 4-fold
|
||
cross validation. |