Benjamin Trent afd90647c9
[ML] Adds feature importance to option to inference processor (#52218) (#52666)
This adds machine learning model feature importance calculations to the inference processor.

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}
```

This will write to the document as follows:
```
"inference" : {
   "feature_importance" : {
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
```

This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.

Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: https://github.com/elastic/ml-cpp/pull/991
2020-02-21 18:42:31 -05:00

119 lines
3.7 KiB
Plaintext

[role="xpack"]
[testenv="basic"]
[[inference-processor]]
=== {infer-cap} Processor
Uses a pre-trained {dfanalytics} model to infer against the data that is being
ingested in the pipeline.
[[inference-options]]
.{infer-cap} Options
[options="header"]
|======
| Name | Required | Default | Description
| `model_id` | yes | - | (String) The ID of the model to load and infer against.
| `target_field` | no | `ml.inference.<processor_tag>` | (String) Field added to incoming documents to contain results objects.
| `field_mappings` | yes | - | (Object) Maps the document field names to the known field names of the model.
| `inference_config` | yes | - | (Object) Contains the inference type and its options. There are two types: <<inference-processor-regression-opt,`regression`>> and <<inference-processor-classification-opt,`classification`>>.
include::common-options.asciidoc[]
|======
[source,js]
--------------------------------------------------
{
"inference": {
"model_id": "flight_delay_regression-1571767128603",
"target_field": "FlightDelayMin_prediction_infer",
"field_mappings": {},
"inference_config": { "regression": {} }
}
}
--------------------------------------------------
// NOTCONSOLE
[discrete]
[[inference-processor-regression-opt]]
==== {regression-cap} configuration options
`results_field`::
(Optional, string)
Specifies the field to which the inference prediction is written. Defaults to
`predicted_value`.
`num_top_feature_importance_values`::::
(Optional, integer)
Specifies the maximum number of
{ml-docs}/dfa-regression.html#dfa-regression-feature-importance[feature
importance] values per document. By default, it is zero and no feature importance
calculation occurs.
[discrete]
[[inference-processor-classification-opt]]
==== {classification-cap} configuration options
`results_field`::
(Optional, string)
The field that is added to incoming documents to contain the inference prediction. Defaults to
`predicted_value`.
`num_top_classes`::
(Optional, integer)
Specifies the number of top class predictions to return. Defaults to 0.
`top_classes_results_field`::
(Optional, string)
Specifies the field to which the top classes are written. Defaults to
`top_classes`.
`num_top_feature_importance_values`::::
(Optional, integer)
Specifies the maximum number of
{ml-docs}/dfa-classification.html#dfa-classification-feature-importance[feature
importance] values per document. By default, it is zero and no feature importance
calculation occurs.
[discrete]
[[inference-processor-config-example]]
==== `inference_config` examples
[source,js]
--------------------------------------------------
{
"inference_config": {
"regression": {
"results_field": "my_regression"
}
}
}
--------------------------------------------------
// NOTCONSOLE
This configuration specifies a `regression` inference and the results are
written to the `my_regression` field contained in the `target_field` results
object.
[source,js]
--------------------------------------------------
{
"inference_config": {
"classification": {
"num_top_classes": 2,
"results_field": "prediction",
"top_classes_results_field": "probabilities"
}
}
}
--------------------------------------------------
// NOTCONSOLE
This configuration specifies a `classification` inference. The number of
categories for which the predicted probabilities are reported is 2
(`num_top_classes`). The result is written to the `prediction` field and the top
classes to the `probabilities` field. Both fields are contained in the
`target_field` results object.