Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
The changes add more granularity for identiying the data ingestion user.
The ingest pipeline can now be configure to record authentication realm and
type. It can also record API key name and ID when one is in use.
This improves traceability when data are being ingested from multiple agents
and will become more relevant with the incoming support of required
pipelines (#46847)
Resolves: #49106
* Add empty_value parameter to CSV processor
This change adds `empty_value` parameter to the CSV processor.
This value is used to fill empty fields. Fields will be skipped
if this parameter is ommited. This behavior is the same for both
quoted and unquoted fields.
* docs updated
* Fix compilation problem
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Backport: #50467
This commit adds the name of the current pipeline to ingest metadata.
This pipeline name is accessible under the following key: '_ingest.pipeline'.
Example usage in pipeline:
PUT /_ingest/pipeline/2
{
"processors": [
{
"set": {
"field": "pipeline_name",
"value": "{{_ingest.pipeline}}"
}
}
]
}
Closes#42106
* CSV ingest processor (#49509)
This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.
Closes#49113
* Allow list of IPs in geoip ingest processor
This change lets you use array of IPs in addition to string in geoip processor source field.
It will set array containing geoip data for each element in source, unless first_only parameter
option is enabled, then only first found will be returned.
Closes#46193
The documentation contained a small error, as bytes and duration was not
properly converted to a number and thus remained a string.
The documentation is now also properly tested by providing a full blown
simulate pipeline example.
When the enrich processor appends enrich data to an incoming document,
it adds a `target_field` to contain the enrich data.
This `target_field` contains both the `match_field` AND `enrich_fields`
specified in the enrich policy.
Previously, this was reflected in the documented example but not
explicitly stated. This adds several explicit statements to the docs.
For the user agent ingest processor, custom regex files must end
with the `.yml` file extension.
This corrects the docs which said the `.yaml` extension was required.
Prior to this change the `target_field` would always be a json array
field in the document being ingested. This to take into account that
multiple enrich documents could be inserted into the `target_field`.
However the default `max_matches` is `1`. Meaning that by default
only a single enrich document would be added to `target_field` json
array field.
This commit changes this; if `max_matches` is set to `1` then the single
document would be added as a json object to the `target_field` and
if it is configured to a higher value then the enrich documents will be
added as a json array (even if a single enrich document happens to be
enriched).
Besides a rename, this changes allows to processor to attach multiple
enrich docs to the document being ingested.
Also in order to control the maximum number of enrich docs to be
included in the document being ingested, the `max_matches` setting
is added to the enrich processor.
Relates #32789
Enrich processor configuration changes:
* Renamed `enrich_key` option to `field` option.
* Replaced `set_from` and `targets` options with `target_field`.
The `target_field` option behaves different to how `set_from` and
`targets` worked. The `target_field` is the field that will contain
the looked up document.
Relates to #32789
If a pipeline that refrences the policy exists, we should not allow the
policy to be deleted. The user will need to remove the processor from
the pipeline before deleting the policy. This commit adds a check to
ensure that the policy cannot be deleted if it is referenced by any
pipeline in the system.
These docs were misleading for package installations of
Elasticsearch. Instead, we should refer to $ES_CONFIG/ingest-geoip as
the path to place the custom database files. For non-package
installations, this is the same as $ES_HOME/config, but for package
installations this is not the case as the config directory for package
installations is /etc/elasticsearch, and is not relative to
$ES_HOME. This commit corrects the docs.
Add an explanatory NOTE section to draw attention to the difference
between small and capital letters used for the index date patterns.
e.g.: HH vs hh, MM vs mm.
Closes: #22322
(cherry picked from commit c8125417dc33215651f9bb76c9b1ffaf25f41caf)
This processor uses the lucene HTMLStripCharFilter class to remove HTML
entities from a field. This adds to the char filter, so that there is
possibility to store the stripped version as well.
Note, that the characeter filter replaces tags with a newline, so that
the produced HTML will look slightly different than the incoming HTML
with regards to newlines.
This commit is a correction of a doc bug in the docs for the ingest
date-index-name processor. The correct pattern is
yyyy-MM-dd'T'HH:mm:ss.SSSXX. This is due to the transition from Joda
time to Java time where Z does not mean the same thing between the two.
Forward port of https://github.com/elastic/elasticsearch/pull/38757
This change reverts the initial 7.0 commits and replaces them
with the 6.7 variant that still allows for the ecs flag.
This commit differs from the 6.7 variants in that ecs flag will
now default to true.
6.7: `ecs` : default `false`
7.x: `ecs` : default `true`
8.0: no option, but behaves as `true`
* Revert "Ingest node - user agent, move device to an object (#38115)"
This reverts commit 5b008a34aa.
* Revert "Add ECS schema for user-agent ingest processor (#37727) (#37984)"
This reverts commit cac6b8e06f.
* cherry-pick 5dfe1935345da3799931fd4a3ebe0b6aa9c17f57
Add ECS schema for user-agent ingest processor (#37727)
* cherry-pick ec8ddc890a34853ee8db6af66f608b0ad0cd1099
Ingest node - user agent, move device to an object (#38115) (#38121)
* cherry-pick f63cbdb9b426ba24ee4d987ca767ca05a22f2fbb (with manual merge fixes)
Dep. check for ECS changes to User Agent processor (#38362)
* make true the default for the ecs option, and update 7.0 references and tests