OpenSearch/docs/reference/ml/df-analytics/apis/explain-dfanalytics.asciidoc
Dimitris Athanasiou 8eaee7cbdc
[7.x][ML] Explain data frame analytics API (#49455) (#49504)
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.

Backport of #49455
2019-11-22 22:06:10 +02:00


[role="xpack"]
[testenv="platinum"]
[[explain-dfanalytics]]
=== Explain {dfanalytics} API
[subs="attributes"]
++++
<titleabbrev>Explain {dfanalytics} API</titleabbrev>
++++
Explains a {dataframe-analytics-config}.
experimental[]
[[ml-explain-dfanalytics-request]]
==== {api-request-title}
`GET _ml/data_frame/analytics/_explain` +
`POST _ml/data_frame/analytics/_explain` +
`GET _ml/data_frame/analytics/<data_frame_analytics_id>/_explain` +
`POST _ml/data_frame/analytics/<data_frame_analytics_id>/_explain`
[[ml-explain-dfanalytics-prereq]]
==== {api-prereq-title}
* You must have the `monitor_ml` privilege to use this API. For more
information, see <<security-privileges>> and <<built-in-roles>>.
[[ml-explain-dfanalytics-desc]]
==== {api-description-title}
This API explains a {dataframe-analytics-config}, either one that already exists or one that has not been created yet.
The following explanations are provided:

* which fields are included in or excluded from the analysis, and why
* how much memory the analysis is estimated to require. You can use the estimate later when choosing a value for the `model_memory_limit` setting.
[[ml-explain-dfanalytics-path-params]]
==== {api-path-parms-title}
`<data_frame_analytics_id>`::
(Optional, string) Identifier of the existing
{dfanalytics-job} to explain. This identifier can contain lowercase alphanumeric
characters (a-z and 0-9), hyphens, and underscores. It must start and end with
alphanumeric characters.
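
The character rules above can be checked on the client before a request is sent. The following is a minimal sketch in Python, not the server's own validation: the pattern is derived only from the prose on this page, the helper name `looks_like_valid_id` is hypothetical, and the server may enforce further constraints (for example a maximum length) that are not modeled here.

```python
import re

# Pattern derived from the documented rules: lowercase alphanumerics, hyphens,
# and underscores, starting and ending with an alphanumeric character.
# Assumption: any additional server-side limits are NOT covered.
_ID_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def looks_like_valid_id(candidate: str) -> bool:
    """Return True if candidate satisfies the documented character rules."""
    return _ID_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ("houses-2019", "a", "_houses", "Houses", "houses-"):
        print(candidate, looks_like_valid_id(candidate))
```

A check like this only catches malformed identifiers early. It does not tell you whether a job with that identifier exists.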
[[ml-explain-dfanalytics-request-body]]
==== {api-request-body-title}
`data_frame_analytics_config`::
(Optional, object) The intended configuration of the {dfanalytics-job}. For more information, see
<<ml-dfanalytics-resources>>.
You don't need to provide `id` and `dest` when calling this API.
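
If you already hold a full job configuration, one way to reuse it here is to drop the properties this API does not need and wrap the rest in the `data_frame_analytics_config` object. The sketch below assumes the request body shape shown in the example on this page. The helper name `to_explain_body` and the sample configuration values are hypothetical.

```python
import json


def to_explain_body(job_config: dict) -> dict:
    """Build an explain request body from a job configuration.

    `id` and `dest` are dropped because this API does not need them.
    The input dictionary is left unmodified.
    """
    trimmed = {key: value for key, value in job_config.items()
               if key not in ("id", "dest")}
    return {"data_frame_analytics_config": trimmed}


if __name__ == "__main__":
    # Hypothetical job configuration, for illustration only.
    job_config = {
        "id": "house-prices",
        "source": {"index": "houses_sold_last_10_yrs"},
        "dest": {"index": "house_price_predictions"},
        "analysis": {"regression": {"dependent_variable": "price"}},
    }
    print(json.dumps(to_explain_body(job_config), indent=2))
```

Removing `id` and `dest` is optional. The sketch does it only to keep the request minimal.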
[[ml-explain-dfanalytics-results]]
==== {api-response-body-title}
The API returns a response that contains the following:
`field_selection`::
(array) An array of objects, sorted by field name, that explain whether each field was selected for the analysis.
Each object in the array has the following properties:
`name`:::
(string) The field name.
`mapping_types`:::
(array of strings) The mapping types of the field.
`is_included`:::
(boolean) Whether the field is selected to be included in the analysis.
`is_required`:::
(boolean) Whether the field is required.
`feature_type`:::
(string) The feature type of this field for the analysis. May be `categorical` or `numerical`.
`reason`:::
(string) The reason a field is not selected to be included in the analysis.
`memory_estimation`::
(object) An object containing the memory estimates. The object has the following properties:
`expected_memory_without_disk`:::
(string) Estimated memory usage, assuming that the whole {dfanalytics} runs in memory
(that is, without overflowing to disk).
`expected_memory_with_disk`:::
(string) Estimated memory usage, assuming that overflowing to disk is allowed during {dfanalytics}.
`expected_memory_with_disk` is usually smaller than `expected_memory_without_disk` because using disk
limits the main memory needed to perform {dfanalytics}.
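
A common use of `field_selection` is to list the fields that were left out and the reason given for each. The following Python sketch works on a response that has already been parsed into a dictionary. It relies only on the `name`, `is_included`, and `reason` properties described above. The helper name `excluded_fields` and the sample data are hypothetical.

```python
def excluded_fields(explanation: dict) -> dict:
    """Map each field that was not included in the analysis to its reason.

    Fields without a `reason` property map to an empty string.
    """
    return {
        entry["name"]: entry.get("reason", "")
        for entry in explanation.get("field_selection", [])
        if not entry["is_included"]
    }


if __name__ == "__main__":
    # Hand-written sample that follows the property names documented above.
    sample = {
        "field_selection": [
            {"name": "postcode", "mapping_types": ["text"],
             "is_included": False, "is_required": False,
             "reason": "[postcode.keyword] is preferred because it is aggregatable"},
            {"name": "price", "mapping_types": ["float"],
             "is_included": True, "is_required": True,
             "feature_type": "numerical"},
        ]
    }
    for name, reason in excluded_fields(sample).items():
        print(f"{name}: {reason}")
```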
[[ml-explain-dfanalytics-example]]
==== {api-examples-title}
[source,console]
--------------------------------------------------
POST _ml/data_frame/analytics/_explain
{
  "data_frame_analytics_config": {
    "source": {
      "index": "houses_sold_last_10_yrs"
    },
    "analysis": {
      "regression": {
        "dependent_variable": "price"
      }
    }
  }
}
--------------------------------------------------
// TEST[skip:TBD]
The API returns the following results:
[source,console-result]
----
{
  "field_selection": [
    {
      "name": "number_of_bedrooms",
      "mapping_types": ["integer"],
      "is_included": true,
      "is_required": false,
      "feature_type": "numerical"
    },
    {
      "name": "postcode",
      "mapping_types": ["text"],
      "is_included": false,
      "is_required": false,
      "reason": "[postcode.keyword] is preferred because it is aggregatable"
    },
    {
      "name": "postcode.keyword",
      "mapping_types": ["keyword"],
      "is_included": true,
      "is_required": false,
      "feature_type": "categorical"
    },
    {
      "name": "price",
      "mapping_types": ["float"],
      "is_included": true,
      "is_required": true,
      "feature_type": "numerical"
    }
  ],
  "memory_estimation": {
    "expected_memory_without_disk": "128MB",
    "expected_memory_with_disk": "32MB"
  }
}
----
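
The memory estimates are returned as strings such as `128MB`, so they must be converted to numbers before you can compare them with a candidate `model_memory_limit`. The following Python sketch does that conversion. It assumes that the units are binary multiples (1 MB = 1024 * 1024 bytes) and that values are whole numbers followed by a unit. The helper names `parse_size` and `memory_bounds` are hypothetical.

```python
import re

# Assumed unit multipliers (powers of 1024). Adjust if the server reports
# sizes in other units or with fractional values.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def parse_size(text: str) -> int:
    """Convert a size string such as '128MB' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([a-zA-Z]+)", text.strip())
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def memory_bounds(explanation: dict) -> tuple:
    """Return (with_disk_bytes, without_disk_bytes) from an explain response."""
    estimation = explanation["memory_estimation"]
    return (
        parse_size(estimation["expected_memory_with_disk"]),
        parse_size(estimation["expected_memory_without_disk"]),
    )


if __name__ == "__main__":
    response = {
        "memory_estimation": {
            "expected_memory_without_disk": "128MB",
            "expected_memory_with_disk": "32MB",
        }
    }
    with_disk, without_disk = memory_bounds(response)
    print(with_disk, without_disk)
```

How to choose a limit from the two estimates is a separate decision. The sketch only makes them comparable.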