As classification now works for multiple classes, randomly
picking training/test data frame rows is not good enough.
This commit introduces a stratified cross validation splitter
that maintains the proportion of the each class in the dataset
in the sample that is used for training the model.
Backport of #54087
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true. Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.
This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started. The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.
Backport of #53918
This commit instruments data frame analytics
with stats for the data that are being analyzed.
In particular, we count training docs, test docs,
and skipped docs.
In order to account docs with missing values as skipped
docs for analyses that do not support missing values,
this commit changes the extractor so that it only ignores
docs with missing values when it collects the data summary,
which is used to estimate memory usage.
Backport of #53998
Adds multi-class feature importance calculation.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
"ml.inference.feature_importance": {
"type": "nested",
"dynamic": true,
"properties": {
"feature_name": {
"type": "keyword"
},
"importance": {
"type": "double"
}
}
}
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?
If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Feature importance storage format is changing to encompass multi-class.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
This change adjusts the mapping creation for analytics so that the field is mapped as a `nested` type.
Native side change: https://github.com/elastic/ml-cpp/pull/1071
Since a data frame analytics job may have associated docs
in the .ml-stats-* indices, when the job is deleted we
should delete those docs too.
Backport of #53933
While `CustomProcessor` is generic and allows for flexibility, there
are new requirements that make cross validation a concept it's hard
to abstract behind custom processor. In particular, we would like to
add data_counts to the DFA jobs stats. Counting training VS. test
docs would be a useful statistic. We would also want to add a
different cross validation strategy for multiclass classification.
This commit renames custom processors to cross validation splitters
which allows for those enhancements without cryptically doing
things as a side effect of the abstract custom processing.
Backport of #53915
It's simple to deprecate a field used in an ObjectParser just by adding deprecation
markers to the relevant ParseField objects. The warnings themselves don't currently
have any context - they simply say that a deprecated field has been used, but not
where in the input xcontent it appears. This commit adds the parent object parser
name and XContentLocation to these deprecation messages.
Note that the context is automatically stripped from warning messages when they
are asserted on by integration tests and REST tests, because randomization of
xcontent type during these tests means that the XContentLocation is not constant
Adds parsing and indexing of analysis instrumentation stats.
The latest one is also returned from the get-stats API.
Note that we chose to duplicate objects even where they are currently
similar. There are already ideas on how these will diverge in the future
and while the duplication looks ugly at the moment, it is the option
that offers the highest flexibility.
Backport of #53788
* [ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping (#53725)
This fixes two issues:
- Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem.
- Data frame analytics would continue to retry to persist results even after the job is stopped.
closes https://github.com/elastic/elasticsearch/issues/53687
Prepares classification analysis to support more than just
two classes. It introduces a new parameter to the process config
which dictates the `num_classes` to the process. It also
changes the max classes limit to `30` provisionally.
Backport of #53539
the ML portion of the x-pack info API was erroneously counting configuration documents and definition documents. The underlying implementation of our storage separates the two out.
This PR filters the query so that only trained model config documents are counted.
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.
Fixes#52427.
Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
This is a partial implementation of an endpoint for anomaly
detector model memory estimation.
It is not complete, lacking docs, HLRC and sensible numbers
for many anomaly detector configurations. These will be
added in a followup PR in time for 7.7 feature freeze.
A skeleton endpoint is useful now because it allows work on
the UI side of the change to commence. The skeleton endpoint
handles the same cases that the old UI code used to handle,
and produces very similar estimates for these cases.
Backport of #53333
A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While this it is good for jobs to wait, they could fail early on _start.
This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.
If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled.
For tasks seeking a new assignment after a node failure, that behavior is unchanged.
closes#53188
Tests have been periodically failing due to a race condition on checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task removed BEFORE the `.ml-state|stats` indices are created/updated.
This change moves towards the task cleaning up itself in its main execution thread. `stop` flips the flag of the task to `isStopping` and now we check `isStopping` at every necessary method. Allowing the task to gracefully stop.
closes#53007
For analytics, we need a consistent way of indicating when a value is missing. Inheriting from anomaly detection, analysis sent `""` when a field is missing. This works fine with numbers, but the underlying analytics process actually treats `""` as a category in categorical values.
Consequently, you end up with this situation in the resulting model
```
{
"frequency_encoding" : {
"field" : "RainToday",
"feature_name" : "RainToday_frequency",
"frequency_map" : {
"" : 0.009844409027270245,
"No" : 0.6472019970785184,
"Yes" : 0.6472019970785184
}
}
}
```
For inference this is a problem, because inference will treat missing values as `null`. And thus not include them on the infer call against the model.
This PR takes advantage of our new `missing_field_value` option and supplies `\0` as the value.
The assumption added in #52631 skips a problematic test
if it fails to create the required conditions for the
scenario it is supposed to be testing. (This happens
very rarely.)
However, before skipping the test it needs to remove the
failed job it has created because the standard test
cleanup code treats failed jobs as fatal errors.
Closes#52608
This commit introduces a module for Kibana that exposes REST APIs that
will be used by Kibana for access to its system indices. These APIs are wrapped
versions of the existing REST endpoints. A new setting is also introduced since
the Kibana system indices' names are allowed to be changed by a user in case
multiple instances of Kibana use the same instance of Elasticsearch.
Additionally, the ThreadContext has been extended to indicate that the use of
system indices may be allowed in a request. This will be built upon in the future
for the protection of system indices.
Backport of #52385
Currently _rollup_search requires manage privilege to access. It should really be
a read only operation. This PR changes the requirement to be read indices privilege.
Resolves: #50245
Adds reporting of memory usage for data frame analytics jobs.
This commit introduces a new index pattern `.ml-stats-*` whose
first concrete index will be `.ml-stats-000001`. This index serves
to store instrumentation information for those jobs.
Backport of #52778 and #52958
* [ML][Inference] Add support for multi-value leaves to the tree model (#52531)
This adds support for multi-value leaves. This is a prerequisite for multi-class boosted tree classification.
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.
This is necessary for the following use cases:
- Reading from frozen indices
- Allowing certain indices in multiple index patterns to not exist yet
These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.
closes https://github.com/elastic/elasticsearch/issues/48056
This change removes TrainedModelConfig#isAvailableWithLicense method with calls to
XPackLicenseState#isAllowedByLicense.
Please note there are subtle changes to the code logic. But they are the right changes:
* Instead of Platinum license, Enterprise license nows guarantees availability.
* No explicit check when the license requirement is basic. Since basic license is always available, this check is unnecessary.
* Trial license is always allowed.
This PR moves the majority of the Watcher REST tests under
the Watcher x-pack plugin.
Specifically, moves the Watcher tests from:
x-pack/plugin/test
x-pack/qa/smoke-test-watcher
x-pack/qa/smoke-test-watcher-with-security
x-pack/qa/smoke-test-monitoring-with-watcher
to:
x-pack/plugin/watcher/qa/rest (/test and /qa/smoke-test-watcher)
x-pack/plugin/watcher/qa/with-security
x-pack/plugin/watcher/qa/with-monitoring
Additionally, this disables Watcher from the main
x-pack test cluster and consolidates the stop/start logic
for the tests listed.
No changes to the tests (beyond moving them) are included.
3rd party tests and doc tests (which also touch Watcher)
are not included in the changes here.
* Smarter copying of the rest specs and tests (#52114)
This PR addresses the unnecessary copying of the rest specs and allows
for better semantics for which specs and tests are copied. By default
the rest specs will get copied if the project applies
`elasticsearch.standalone-rest-test` or `esplugin` and the project
has rest tests or you configure the custom extension `restResources`.
This PR also removes the need for dozens of places where the x-pack
specs were copied by supporting copying of the x-pack rest specs too.
The plugin/task introduced here can also copy the rest tests to the
local project through a similar configuration.
The new plugin/task allows a user to minimize the surface area of
which rest specs are copied. Per project can be configured to include
only a subset of the specs (or tests). Configuring a project to only
copy the specs when actually needed should help with build cache hit
rates since we can better define what is actually in use.
However, project level optimizations for build cache hit rates are
not included with this PR.
Also, with this PR you can no longer use the includePackaged flag on
integTest task.
The following items are included in this PR:
* new plugin: `elasticsearch.rest-resources`
* new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy
* new extension 'restResources'
```
restResources {
restApi {
includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar
includeXpack 'baz' //will include x-pack specs that start with baz
}
restTests {
includeCore 'foo', 'bar' //will include the core tests that start with foo and bar
includeXpack 'baz' //will include the x-pack tests that start with baz
}
}
```
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
This commit modifies the codebase so that our production code uses a
single instance of the IndexNameExpressionResolver class. This change
is being made in preparation for allowing name expression resolution
to be augmented by a plugin.
In order to remove some instances of IndexNameExpressionResolver, the
single instance is added as a parameter of Plugin#createComponents and
PersistentTaskPlugin#getPersistentTasksExecutor.
Backport of #52596
Add enterprise operation mode to properly map enterprise license.
Aslo refactor XPackLicenstate class to consolidate license status and mode checks.
This class has many sychronised methods to check basically three things:
* Minimum operation mode required
* Whether security is enabled
* Whether current license needs to be active
Depends on the actual feature, either 1, 2 or all of above checks are performed.
These are now consolidated in to 3 helper methods (2 of them are new).
The synchronization is pushed down to the helper methods so actual checking
methods no longer need to worry about it.
resolves: #51081
When `PUT` is called to store a trained model, it is useful to return the newly create model config. But, it is NOT useful to return the inflated definition.
These definitions can be large and returning the inflated definition causes undo work on the server and client side.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This adds `_all` to Calendar searches. This enables users to supply the `_all` string in the `job_ids` array when creating a Calendar. That calendar will now be applied to all jobs (existing and newly created).
Closes#45013
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This changes the tree validation code to ensure no node in the tree has a
feature index that is beyond the bounds of the feature_names array.
Specifically this handles the situation where the C++ emits a tree containing
a single node and an empty feature_names list. This is valid tree used to
centre the data in the ensemble but the validation code would reject this
as feature_names is empty. This meant a broken workflow as you cannot GET
the model and PUT it back
When changing a job state using a mechanism that doesn't
wait for the desired state to be reached within the production
code the test code needs to loop until the cluster state has
been updated.
Closes#52451
Following the change to store cluster state in Lucene indices
(#50907) it can take longer for all the cluster state updates
associated with node failure scenarios to be processed during
internal cluster tests where several nodes all run in the same
JVM.
ML mappings and index templates have so far been created
programmatically. While this had its merits due to static typing,
there is consensus it would be clear to maintain those in json files.
In addition, we are going to adding ILM policies to these indices
and the component for a plugin to register ILM policies is
`IndexTemplateRegistry`. It expects the templates to be in resource
json files.
For the above reasons this commit refactors ML mappings and index
templates into json resource files that are registered via
`MlIndexTemplateRegistry`.
Backport of #51765
This commit removes the need for DeprecatedRoute and ReplacedRoute to
have an instance of a DeprecationLogger. Instead the RestController now
has a DeprecationLogger that will be used for all deprecated and
replaced route messages.
Relates #51950
Backport of #52278
During a bug hunt, I caught a handful of things (unrelated to the bug) that could be potential issues:
1. Needlessly wrapping in exception handling (minor cleanup)
2. Potential of notifying listeners of a failure multiple times + even trying to notify of a success after a failure notification
In #51146 a rudimentary check for poor categorization was added to
7.6.
This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets. categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.
Closes#50749
Changes the misleading error message when attempting to open
a job while the "cluster.persistent_tasks.allocation.enable"
setting is set to "none" to a clearer message that names the
setting.
Closes#51956
Refactors `DataFrameAnalyticsTask` to hold a `StatsHolder` object.
That just has a `ProgressTracker` for now but this is paving the
way to add additional stats like memory usage, analysis stats, etc.
Backport #52134
Employs `ResultsPersisterService` from `DataFrameRowsJoiner` in order
to add retries when a data frame analytics job is persisting the results
to the destination data frame.
Backport of #52048
This change adds support for the following new model_size_stats
fields:
- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status
Backport of #51879
This commit changes how RestHandlers are registered with the
RestController so that a RestHandler no longer needs to register itself
with the RestController. Instead the RestHandler interface has new
methods which when called provide information about the routes
(method and path combinations) that are handled by the handler
including any deprecated and/or replaced combinations.
This change also makes the publication of RestHandlers safe since they
no longer publish a reference to themselves within their constructors.
Closes#51622
Co-authored-by: Jason Tedor <jason@tedor.me>
Backport of #51950
* [ML] Add bwc serialization unit test scaffold (#51889)
Adds new `AbstractBWCSerializationTestCase` which provides easy scaffolding for BWC serialization unit tests.
These are no replacement for true BWC tests (which execute actual old code). These tests do provide some good coverage for the current code when serializing to/from old versions.
* removing unnecessary override for 7.series branch
* adding necessary import
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
If the configs are removed (by some horrific means), we should still allow tasks to be cleaned up easily.
Datafeeds and jobs with missing configs are now visible in their respective _stats calls and can be stopped/closed.
Currently, the same class `FieldCapabilities` is used both to represent the
capabilities for one index, and also the merged capabilities across indices. To
help clarify the logic, this PR proposes to create a separate class
`IndexFieldCapabilities` for the capabilities in one index. The refactor will
also help when adding `source_path` information in #49264, since the merged
source path field will have a different structure from the field for a single index.
Individual changes:
* Add a new class IndexFieldCapabilities.
* Remove extra constructor from FieldCapabilities.
* Combine the add and merge methods in FieldCapabilities.Builder.
The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.
Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.
This change improves the multi-line start pattern by:
1. Never creating a multi-line pattern if the sample contained
only single line records. This improves the import
efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
whether that be the time field or a boolean/numeric field.
This reduces the risk of a field with newlines occurring
earlier, and also means the algorithm doesn't automatically
fail for data without a timestamp.
This setting was introduced with the purpose of reducing the time took by
tests that shut nodes down. Tests like `MlDistributedFailureIT` and
`NetworkDisruptionIT`. However, it is unfortunate to have to set the value
to an explicit value in production. In addition, and most important, the dynamically
choosing the value for this setting makes it impossible to adopt static index template configs
that we register via `IndexTemplateRegistry`, which we need to use in order to start
registering ILM policies for the ML indices.
This commit removes this setting from our templates. I run the tests a few times and could
not see execution time differing significantly.
Backport of #51740
This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources.
Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
* [ML][Inference] Fix weighted mode definition (#51648)
Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense.
Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`. This indicates what the maximum class value to be expected.
Datafeeds being closed while starting could result in and NPE. This was
handled as any other failure, masking out the NPE. However, this
conflicts with the changes in #50886.
Related to #50886 and #51302
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.
This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
Changes the find_file_structure response to include a CSV
ingest processor in the ingest pipeline it suggests.
Previously the Kibana file upload functionality parsed CSV
in the browser, but by parsing CSV in the ingest pipeline
it makes the Kibana file upload functionality more easily
interchangable with Filebeat such that the configurations
it creates can more easily be used to import data with the
same structure repeatedly in production.
The DATE and DATESTAMP Grok patterns match 2 digit years
as well as 4 digit years. The pattern determination in
find_file_structure worked correctly in this case, but
the regex used to create a multi-line start pattern was
assuming a 4 digit year. Also, the quick rule-out
patterns did not always correctly consider 2 digit years,
meaning that detection was inconsistent.
This change fixes both problems, and also extends the
tests for DATE and DATESTAMP to check both 2 and 4 digit
years.
* [ML][Inference] add tags url param to GET (#51330)
Adds a new URL parameter, `tags` to the GET _ml/inference/<model_id> endpoint.
This parameter allows the list of models to be further reduced to those who contain all the provided tags.
As we prepare to introduce a new index for storing additional
information about data frame analytics jobs (e.g. intrumentation),
renaming this class to `DestinationIndex` better captures what it does
and leaves its prior name available for a more suitable use.
Backport of #51353
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>