When get filters is called without setting the `size`
paramter only up to 10 filters are returned. However,
100 filters should be returned. This commit fixes this
and adds an integ test to guard it.
It seems this was accidentally broken in #39976.
Closes#54206
Backport of #54207
This commit instruments data frame analytics
with stats for the data that are being analyzed.
In particular, we count training docs, test docs,
and skipped docs.
In order to account docs with missing values as skipped
docs for analyses that do not support missing values,
this commit changes the extractor so that it only ignores
docs with missing values when it collects the data summary,
which is used to estimate memory usage.
Backport of #53998
Adds multi-class feature importance calculation.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
"ml.inference.feature_importance": {
"type": "nested",
"dynamic": true,
"properties": {
"feature_name": {
"type": "keyword"
},
"importance": {
"type": "double"
}
}
}
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?
If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Feature importance storage format is changing to encompass multi-class.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
This change adjusts the mapping creation for analytics so that the field is mapped as a `nested` type.
Native side change: https://github.com/elastic/ml-cpp/pull/1071
Prepares classification analysis to support more than just
two classes. It introduces a new parameter to the process config
which dictates the `num_classes` to the process. It also
changes the max classes limit to `30` provisionally.
Backport of #53539
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.
Fixes#52427.
Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While this it is good for jobs to wait, they could fail early on _start.
This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.
If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled.
For tasks seeking a new assignment after a node failure, that behavior is unchanged.
closes#53188
Tests have been periodically failing due to a race condition on checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task removed BEFORE the `.ml-state|stats` indices are created/updated.
This change moves towards the task cleaning up itself in its main execution thread. `stop` flips the flag of the task to `isStopping` and now we check `isStopping` at every necessary method. Allowing the task to gracefully stop.
closes#53007
Currently _rollup_search requires manage privilege to access. It should really be
a read only operation. This PR changes the requirement to be read indices privilege.
Resolves: #50245
* [ML][Inference] Add support for multi-value leaves to the tree model (#52531)
This adds support for multi-value leaves. This is a prerequisite for multi-class boosted tree classification.
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.
This is necessary for the following use cases:
- Reading from frozen indices
- Allowing certain indices in multiple index patterns to not exist yet
These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.
closes https://github.com/elastic/elasticsearch/issues/48056
This PR moves the majority of the Watcher REST tests under
the Watcher x-pack plugin.
Specifically, moves the Watcher tests from:
x-pack/plugin/test
x-pack/qa/smoke-test-watcher
x-pack/qa/smoke-test-watcher-with-security
x-pack/qa/smoke-test-monitoring-with-watcher
to:
x-pack/plugin/watcher/qa/rest (/test and /qa/smoke-test-watcher)
x-pack/plugin/watcher/qa/with-security
x-pack/plugin/watcher/qa/with-monitoring
Additionally, this disables Watcher from the main
x-pack test cluster and consolidates the stop/start logic
for the tests listed.
No changes to the tests (beyond moving them) are included.
3rd party tests and doc tests (which also touch Watcher)
are not included in the changes here.
* Smarter copying of the rest specs and tests (#52114)
This PR addresses the unnecessary copying of the rest specs and allows
for better semantics for which specs and tests are copied. By default
the rest specs will get copied if the project applies
`elasticsearch.standalone-rest-test` or `esplugin` and the project
has rest tests or you configure the custom extension `restResources`.
This PR also removes the need for dozens of places where the x-pack
specs were copied by supporting copying of the x-pack rest specs too.
The plugin/task introduced here can also copy the rest tests to the
local project through a similar configuration.
The new plugin/task allows a user to minimize the surface area of
which rest specs are copied. Per project can be configured to include
only a subset of the specs (or tests). Configuring a project to only
copy the specs when actually needed should help with build cache hit
rates since we can better define what is actually in use.
However, project level optimizations for build cache hit rates are
not included with this PR.
Also, with this PR you can no longer use the includePackaged flag on
integTest task.
The following items are included in this PR:
* new plugin: `elasticsearch.rest-resources`
* new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy
* new extension 'restResources'
```
restResources {
restApi {
includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar
includeXpack 'baz' //will include x-pack specs that start with baz
}
restTests {
includeCore 'foo', 'bar' //will include the core tests that start with foo and bar
includeXpack 'baz' //will include the x-pack tests that start with baz
}
}
```
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
When `PUT` is called to store a trained model, it is useful to return the newly create model config. But, it is NOT useful to return the inflated definition.
These definitions can be large and returning the inflated definition causes undo work on the server and client side.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This adds `_all` to Calendar searches. This enables users to supply the `_all` string in the `job_ids` array when creating a Calendar. That calendar will now be applied to all jobs (existing and newly created).
Closes#45013
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This changes the tree validation code to ensure no node in the tree has a
feature index that is beyond the bounds of the feature_names array.
Specifically this handles the situation where the C++ emits a tree containing
a single node and an empty feature_names list. This is valid tree used to
centre the data in the ensemble but the validation code would reject this
as feature_names is empty. This meant a broken workflow as you cannot GET
the model and PUT it back
ML mappings and index templates have so far been created
programmatically. While this had its merits due to static typing,
there is consensus it would be clear to maintain those in json files.
In addition, we are going to adding ILM policies to these indices
and the component for a plugin to register ILM policies is
`IndexTemplateRegistry`. It expects the templates to be in resource
json files.
For the above reasons this commit refactors ML mappings and index
templates into json resource files that are registered via
`MlIndexTemplateRegistry`.
Backport of #51765
Changes the misleading error message when attempting to open
a job while the "cluster.persistent_tasks.allocation.enable"
setting is set to "none" to a clearer message that names the
setting.
Closes#51956
If the configs are removed (by some horrific means), we should still allow tasks to be cleaned up easily.
Datafeeds and jobs with missing configs are now visible in their respective _stats calls and can be stopped/closed.
* [ML][Inference] Fix weighted mode definition (#51648)
Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense.
Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`. This indicates what the maximum class value to be expected.
Data frame analytics classification currently only supports 2 classes for the
dependent variable. We were checking that the field's cardinality is not higher
than 2 but we should also check it is not less than that as otherwise the process
fails.
Backport of #51232
Allows ML datafeeds to work with time fields that have
the "date_nanos" type _and make use of the extra precision_.
(Previously datafeeds only worked with time fields that were
exact multiples of milliseconds. So datafeeds would work
with "date_nanos" only if the extra precision over "date" was
not used.)
Relates #49889
* [ML][Inference] Adding classification_weights to ensemble models
classification_weights are a way to allow models to
prefer specific classification results over others
this might be advantageous if classification value
probabilities are a known quantity and can improve
model error rates.
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on SHAP (SHapley Additive exPlanations) method.
Backport of #50914
* [ML][Inference] PUT API (#50852)
This adds the `PUT` API for creating trained models that support our format.
This includes
* HLRC change for the API
* API creation
* Validations of model format and call
* fixing backport
This commit removes validation logic of source and dest indices
for data frame analytics and replaces it with using the common
`SourceDestValidator` class which is already used by transforms.
This way the validations and their messages become consistent
while we reduce code.
This means that where these validations fail the error messages
will be slightly different for data frame analytics.
Backport of #50841
This adds the necessary named XContent classes to the HLRC for the lang ident model. This is so the HLRC can call `GET _ml/inference/lang_ident_model_1?include_definition=true` without XContent parsing errors.
The constructors are package private as since this classes are used exclusively within the pre-packaged model (and require the specific weights, etc. to be of any use).
Eclipse 4.13 shows a type mismatch error in the affected line because it cannot
correctly infer the boolean return type for the method call. Assigning return
value to a local variable resolves this problem.
This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and in extend th ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.
Backport of #50243
This commit fixes a bug that caused the data frame analytics
_explain API to time out in a multi-node setup when the source
index was missing. When we try to create the extracted fields detector,
we check the index settings. If the index is missing that responds
with a failure that could be wrapped as a remote exception.
While we unwrapped correctly to check if the cause was an
`IndexNotFoundException`, we then proceeded to cast the original
exception instead of the cause.
Backport of #50176
* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)
All results indexing now retry the amount of times configured in `xpack.ml.persist_results_max_retries`. The retries are done in a semi-random, exponential backoff.
* fixing test
The `ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet`
test was previously set up to just have 10 rows. With `training_percent`
of 50%, only 5 rows will be used for training. There is a good chance that
all 5 rows will be of one class which results to failure.
This commit increases the rows to 100. Now 50 rows should be used for training
and the chance of failure should be very small.
Backport of #50072
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.
Backport of #49990
When checking the cardinality of a field, the query should be take into account. The user might know about some bad data in their index and want to filter down to the target_field values they care about.
Work in progress in the c++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the mem
estimate when there is no source query is a about 2Mb. So I
am relaxing the test to assert memory estimate is less than 1Mb
instead of 500Kb.
Backport of #49924
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.
Closes#49531
Backport of #49690
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.
The API consolidates information that is useful before
creating a data frame analytics job.
It includes:
- memory estimation
- field selection explanation
Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.
Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
Backport of #49455
* [ML] ML Model Inference Ingest Processor (#49052)
* [ML][Inference] adds lazy model loader and inference (#47410)
This adds a couple of things:
- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:
* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* fixing version of indexed docs for model inference
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.
This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.
Relates #48756
Backport of #48770