--
:api: put-data-frame-analytics
:request: PutDataFrameAnalyticsRequest
:response: PutDataFrameAnalyticsResponse
--
[role="xpack"]
[id="{upid}-{api}"]
=== Put {dfanalytics-jobs} API
Creates a new {dfanalytics-job}.
The API accepts a +{request}+ object as a request and returns a +{response}+.
[id="{upid}-{api}-request"]
==== Put {dfanalytics-jobs} request
A +{request}+ requires the following argument:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> The configuration of the {dfanalytics-job} to create
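For illustration only, a minimal sketch of building such a request; the `config` variable is a hypothetical `DataFrameAnalyticsConfig` assembled as shown in the next section:

["source","java"]
--------------------------------------------------
// Sketch: wrap an already prepared configuration in the request object.
// `config` is assumed to be a DataFrameAnalyticsConfig built as shown below.
PutDataFrameAnalyticsRequest request = new PutDataFrameAnalyticsRequest(config);
--------------------------------------------------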
[id="{upid}-{api}-config"]
==== {dfanalytics-cap} configuration
The `DataFrameAnalyticsConfig` object contains all the details of the {dfanalytics-job}
configuration and takes the following arguments:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-config]
--------------------------------------------------
<1> The {dfanalytics-job} ID
<2> The source index and query from which to gather data
<3> The destination index
<4> The analysis to be performed
<5> The fields to be included in / excluded from the analysis
<6> The memory limit for the model created as part of the analysis process
<7> Optionally, a human-readable description
<8> The maximum number of threads to be used by the analysis. Defaults to 1.
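For illustration only, a sketch that mirrors the callouts above; the setter names (for example `setMaxNumThreads`), index names, and values are assumptions rather than an excerpt of the tested snippet:

["source","java"]
--------------------------------------------------
// Sketch of a DataFrameAnalyticsConfig; names and values are illustrative.
DataFrameAnalyticsConfig config = DataFrameAnalyticsConfig.builder()
    .setId("my-analytics-job")                                   // job ID
    .setSource(sourceConfig)                                     // DataFrameAnalyticsSource (see below)
    .setDest(destConfig)                                         // DataFrameAnalyticsDest (see below)
    .setAnalysis(OutlierDetection.createDefault())               // the analysis to perform
    .setAnalyzedFields(analyzedFields)                           // optional field inclusion/exclusion
    .setModelMemoryLimit(new ByteSizeValue(5, ByteSizeUnit.MB))  // model memory limit
    .setDescription("my first analytics job")                    // optional description
    .setMaxNumThreads(1)                                         // max threads used by the analysis
    .build();
--------------------------------------------------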
[id="{upid}-{api}-query-config"]
==== SourceConfig
The index and the query from which to collect data.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-source-config]
--------------------------------------------------
<1> Constructing a new DataFrameAnalyticsSource
<2> The source index
<3> The query used to gather the data. If no query is set, a `match_all` query is used by default.
<4> Source filtering to select which fields will exist in the destination index.
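For illustration only, a sketch of a source configuration mirroring the callouts above; the setter names, index name, and field names are assumptions:

["source","java"]
--------------------------------------------------
// Sketch of a DataFrameAnalyticsSource; index and field names are placeholders.
DataFrameAnalyticsSource sourceConfig = DataFrameAnalyticsSource.builder()
    .setIndex("my-source-index")                                    // source index
    .setQueryConfig(new QueryConfig(QueryBuilders.matchAllQuery())) // optional query
    .setSourceFiltering(new FetchSourceContext(true,
        new String[] { "included_field" },
        new String[] { "excluded_field" }))                         // optional source filtering
    .build();
--------------------------------------------------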
===== QueryConfig
The query with which to select data from the source.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-query-config]
--------------------------------------------------
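A `QueryConfig` wraps a standard query builder; a minimal sketch, using a `match_all` query as an example:

["source","java"]
--------------------------------------------------
// Sketch: a QueryConfig wrapping a regular query builder.
QueryConfig queryConfig = new QueryConfig(QueryBuilders.matchAllQuery());
--------------------------------------------------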
==== DestinationConfig
The index to which data should be written by the {dfanalytics-job}.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-dest-config]
--------------------------------------------------
<1> Constructing a new DataFrameAnalyticsDest
<2> The destination index
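For illustration only, a sketch of a destination configuration; the index name is a placeholder:

["source","java"]
--------------------------------------------------
// Sketch of a DataFrameAnalyticsDest; the index name is illustrative.
DataFrameAnalyticsDest destConfig = DataFrameAnalyticsDest.builder()
    .setIndex("my-dest-index")
    .build();
--------------------------------------------------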
==== Analysis
The analysis to be performed.
Currently, the supported analyses are +OutlierDetection+, +Classification+, and +Regression+.
===== Outlier detection
+OutlierDetection+ analysis can be created in one of two ways:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-outlier-detection-default]
--------------------------------------------------
<1> Constructing a new OutlierDetection object with the default strategy for determining outliers
or
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-outlier-detection-customized]
--------------------------------------------------
<1> Constructing a new OutlierDetection object
<2> The method used to perform the analysis
<3> Number of neighbors taken into account during analysis
<4> The minimum `outlier_score` required to compute feature influence
<5> Whether to compute feature influence
<6> The proportion of the data set that is assumed to be outlying prior to outlier detection
<7> Whether to apply standardization to feature values
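For illustration only, a sketch of both variants; the setter names and enum constant mirror the callouts above and are assumptions, and the parameter values are illustrative:

["source","java"]
--------------------------------------------------
// Default outlier detection.
OutlierDetection outlierDetection = OutlierDetection.createDefault();

// Customized outlier detection; all values are illustrative.
OutlierDetection customizedOutlierDetection = OutlierDetection.builder()
    .setMethod(OutlierDetection.Method.DISTANCE_KNN) // analysis method
    .setNNeighbors(5)                                // number of neighbors
    .setFeatureInfluenceThreshold(0.1)               // min outlier_score for feature influence
    .setComputeFeatureInfluence(true)                // compute feature influence
    .setOutlierFraction(0.05)                        // assumed proportion of outliers
    .setStandardizationEnabled(true)                 // standardize feature values
    .build();
--------------------------------------------------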
===== Classification
+Classification+ analysis requires setting the +dependent_variable+ and
has a number of other optional parameters:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-classification]
--------------------------------------------------
<1> Constructing a new Classification builder object with the required dependent variable
<2> The lambda regularization parameter. A non-negative double.
<3> The gamma regularization parameter. A non-negative double.
<4> The applied shrinkage. A double in [0.001, 1].
<5> The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].
<6> The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].
<7> If set, feature importance is computed for the given number of top features.
<8> The name of the prediction field in the results object.
<9> The percentage of training-eligible rows to be used in training. Defaults to 100%.
<10> The seed to be used by the random generator that picks which rows are used in training.
<11> The optimization objective to target when assigning class labels. Defaults to `maximize_minimum_recall`.
<12> The number of top classes to be reported in the results. Defaults to 2.
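For illustration only, a sketch of a classification analysis mirroring the callouts above; everything except the dependent variable is optional, and the setter names and values shown are assumptions:

["source","java"]
--------------------------------------------------
// Sketch of a Classification analysis; values are illustrative.
Classification classification = Classification.builder("my_dependent_variable")
    .setLambda(1.0)
    .setGamma(5.5)
    .setEta(0.5)
    .setMaxTrees(50)
    .setFeatureBagFraction(0.4)
    .setNumTopFeatureImportanceValues(3)
    .setPredictionFieldName("my_prediction")
    .setTrainingPercent(50.0)
    .setRandomizeSeed(1234L)
    .setClassAssignmentObjective(Classification.ClassAssignmentObjective.MAXIMIZE_ACCURACY)
    .setNumTopClasses(1)
    .build();
--------------------------------------------------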
===== Regression
+Regression+ analysis requires setting the +dependent_variable+ and
has a number of other optional parameters:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-regression]
--------------------------------------------------
<1> Constructing a new Regression builder object with the required dependent variable
<2> The lambda regularization parameter. A non-negative double.
<3> The gamma regularization parameter. A non-negative double.
<4> The applied shrinkage. A double in [0.001, 1].
<5> The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].
<6> The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].
<7> If set, feature importance is computed for the given number of top features.
<8> The name of the prediction field in the results object.
<9> The percentage of training-eligible rows to be used in training. Defaults to 100%.
<10> The seed to be used by the random generator that picks which rows are used in training.
<11> The loss function used for regression. Defaults to `mse`.
<12> An optional parameter to the loss function.
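For illustration only, a sketch of a regression analysis mirroring the callouts above; the setter names, loss function constant, and values shown are assumptions:

["source","java"]
--------------------------------------------------
// Sketch of a Regression analysis; values are illustrative.
Regression regression = Regression.builder("my_dependent_variable")
    .setLambda(1.0)
    .setGamma(5.5)
    .setEta(0.5)
    .setMaxTrees(50)
    .setFeatureBagFraction(0.4)
    .setNumTopFeatureImportanceValues(3)
    .setPredictionFieldName("my_prediction")
    .setTrainingPercent(50.0)
    .setRandomizeSeed(1234L)
    .setLossFunction(Regression.LossFunction.MSE)
    .setLossFunctionParameter(1.0)
    .build();
--------------------------------------------------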
==== Analyzed fields
A `FetchSourceContext` object containing the fields to be included in or excluded from the analysis.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-analyzed-fields]
--------------------------------------------------
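For illustration only, a sketch of such a context; the field names are placeholders:

["source","java"]
--------------------------------------------------
// Sketch: fields to include in and exclude from the analysis.
FetchSourceContext analyzedFields = new FetchSourceContext(true,
    new String[] { "included_field_1", "included_field_2" },
    new String[] { "excluded_field" });
--------------------------------------------------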
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ contains the newly created {dfanalytics-job}.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-response]
--------------------------------------------------
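For illustration only, a sketch of reading the created configuration back from the response; the accessor names are assumptions:

["source","java"]
--------------------------------------------------
// Sketch: retrieve the newly created configuration from the response.
DataFrameAnalyticsConfig createdConfig = response.getConfig();
String createdJobId = createdConfig.getId();
--------------------------------------------------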