Deleting expired data can take a long time leading to timeouts if there
are many jobs. Often the problem is due to a few large jobs which
prevent the regular maintenance of the remaining jobs. This change adds
a job_id parameter to the delete expired data endpoint to help clean up
those problematic jobs.
This PR adds the initial Java side changes to enable
use of the per-partition categorization functionality
added in elastic/ml-cpp#1293.
There will be a followup change to complete the work,
as there cannot be any end-to-end integration tests
until elastic/ml-cpp#1293 is merged, and also
elastic/ml-cpp#1293 does not implement some of the
more peripheral functionality, like stop_on_warn and
per-partition stats documents.
The changes so far cover REST APIs, results object
formats, HLRC and docs.
Backport of #57683
This is a major refactor of the underlying inference logic.
The main refactor is now we are separating the model configuration and
the inference interfaces.
This has the following benefits:
- we can store extra things with the model that are not
necessary for inference (i.e. treenode split information gain)
- we can optimize inference separate from model serialization and storage.
- The user is oblivious to the optimizations (other than seeing the benefits).
A major part of this commit is removing all inference related methods from the
trained model configurations (ensemble, tree, etc.) and moving them to a new class.
This new class satisfies a new interface that is ONLY for inference.
The optimizations applied currently are:
- feature maps are flattened once
- feature extraction only happens once at the highest level
(improves inference + feature importance through put)
- Only storing what we need for inference + feature importance on heap
When we force delete a DF analytics job, we currently first force
stop it and then we proceed with deleting the job config.
This may result in logging errors if the job config is deleted
before it is retrieved while the job is starting.
Instead of force stopping the job, it would make more sense to
try to stop the job gracefully first. So we now try that out first.
If normal stop fails, then we resort to force stopping the job to
ensure we can go through with the delete.
In addition, this commit introduces `timeout` for the delete action
and makes use of it in the child requests.
Backport of #57680
In #55592 and #55416, we deprecated the settings for enabling and disabling
basic license features and turned those settings into no-ops. Since doing so,
we've had feedback that this change may not give users enough time to cleanly
switch from non-ILM index management tools to ILM. If two index managers
operate simultaneously, results could be strange and difficult to
reconstruct. We don't know of any cases where SLM will cause a problem, but we
are restoring that setting as well, to be on the safe side.
This PR is not a strict commit reversion. First, we are keeping the new
xpack.watcher.use_ilm_index_management setting, introduced when
xpack.ilm.enabled was made a no-op, so that users can begin migrating to using
it. Second, the SLM setting was modified in the same commit as a group of other
settings, so I have taken just the changes relating to SLM.
* [ML] mark forecasts for force closed/failed jobs as failed (#57143)
forecasts that are still running should be marked as failed/finished in the following scenarios:
- Job is force closed
- Job is re-assigned to another node.
Forecasts are not "resilient". Their execution does not continue after a node failure. Consequently, forecasts marked as STARTED or SCHEDULED should be flagged as failed. These forecasts can then be deleted.
Additionally, force closing a job kills the native task directly. This means that if a forecast was running, it is not allowed to complete and could still have the status of `STARTED` in the index.
relates to https://github.com/elastic/elasticsearch/issues/56419
* [ML] adds new for_export flag to GET _ml/inference API (#57351)
Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API.
This flag is useful for moving models between clusters.
This adds a max_model_memory setting to forecast requests.
This setting can take a string value that is formatted according to byte sizes (i.e. "50mb", "150mb").
The default value is `20mb`.
There is a HARD limit at `500mb` which will throw an error if used.
If the limit is larger than 40% the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited.
related native change: https://github.com/elastic/ml-cpp/pull/1238
closes: https://github.com/elastic/elasticsearch/issues/56420
Allows geo fields (`geo_point`, `geo_shape`) to have missing values.
Fixes a bug where such missing values would result in an error.
Closes#57299
Backport of #57300
Since #51888 the ML job stats endpoint has returned entries for
jobs that have a persistent task but not job config. Such
orphaned tasks caused monitoring to fail.
This change ignores any such corrupt jobs for monitoring purposes.
Backport of #57235
If a job is NOT opened, forecasts should be able to be deleted, no matter their state.
This also fixes a bug with expanding forecast IDs. We should check for wildcard `*` and `_all` when expanding the ids
closes https://github.com/elastic/elasticsearch/issues/56419
Fix delete_expired_data/nightly maintenance when
many model snapshots need deleting (#57041)
The queries performed by the expired data removers pull back entire
documents when only a few fields are required. For ModelSnapshots in
particular this is a problem as they contain quantiles which may be
100s of KB and the search size is set to 10,000.
This change makes the search more efficient by only requesting the
fields needed to work out which expired data should be deleted.
Field mapping detection is done via grok patterns.
This commit adds well-known text (WKT) formatted geometry detection.
If everything is a `POINT`, then a `geo_point` mapping is preferred.
Otherwise, if all the fields are WKT geometries a `geo_shape` mapping is preferred.
This does **NOT** detect other types of formatted geometries (geohash, comma delimited points, etc.)
closes https://github.com/elastic/elasticsearch/issues/56967
Merging logic is currently split between FieldMapper, with its merge() method, and
MappedFieldType, which checks for merging compatibility. The compatibility checks
are called from a third class, MappingMergeValidator. This makes it difficult to reason
about what is or is not compatible in updates, and even what is in fact updateable - we
have a number of tests that check compatibility on changes in mapping configuration
that are not in fact possible.
This commit refactors the compatibility logic so that it all sits on FieldMapper, and
makes it called at merge time. It adds a new FieldMapperTestCase base class that
FieldMapper tests can extend, and moves the compatibility testing machinery from
FieldTypeTestCase to here.
Relates to #56814
Throttling nightly cleanup as much as we do has been over cautious.
Night cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.
Additionally, the API now has `requests_per_second` and `timeout` set.
So users calling the API directly can set the throttling.
This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
In DF analytics classification, it is possible to use no samples
of a class if its cardinality is too low.
This commit fixes this by ensuring the target sample count can never be zero.
Backport of #56783
This is a followup to #56632. Tests that had to be changed
to mock the C++ log handler more accurately need to be more
careful about when that stream ends, as ending of that
stream is used to detect crashes in the production system.
Fixes#56796
Adds the conflicting types and an example of an index which specifies
them in order to make it easier for the user to understand the conflict.
Backport of #56700
Prior to this change the named pipes that connect the ML C++
processes to the Elasticsearch JVM were all opened before any
of them were read from or written to.
This created a problem, where if the C++ process logged more
messages between opening the log pipe and opening the last
pipe to be connected than there was space for in the named
pipe's buffer then the C++ process would block. This would
mean it never got as far as opening the last named pipe, so
the JVM would never get as far as reading from the log pipe,
hence a deadlock.
This change alters the connection order so that the JVM
starts reading from the logging pipe immediately after opening
it so that if the C++ process logs messages while opening the
other named pipes they are captured in a timely manner and
there is no danger of a deadlock.
Backport of #56632
Two spots that allow for some optimization:
* We are often creating a composite reference of just a single item in
the transport layer => special cased via static constructor to make sure we never do that
* Also removed the pointless case of an empty composite bytes ref
* `ByteBufferReference` is practically always created from a heap buffer these days so there
is no point of dealing with all the bounds checks and extra references to sliced buffers from that
and we can just use the underlying array directly
We have been using a zero timeout in the case that DF analytics
is stopped. This may cause a timeout when we cancel, for example,
the reindex task.
This commit fixes this by using the default timeout instead.
Backport of #56423
It is possible that the config document for a data frame
analytics job is deleted from the config index. If that is
the case the user is unable to stop a running job because
we attempt to retrieve the config and that will throw.
This commit changes that. When the request is forced,
we do not expand the requested ids based on the existing
configs but from the list of running tasks instead.
Backport of #56360
Due to multi-threading it is possible that phase progress
updates written from the c++ process arrive reordered.
We can address this by ensuring that progress may only increase.
Closes#56282
Backport of #56339
* [ML] lay ground work for handling >1 result indices (#55892)
This commit removes all but one reference to `getInitialResultsIndexName`.
This is to support more than one result index for a single job.
The following settings are now no-ops:
* xpack.flattened.enabled
* xpack.logstash.enabled
* xpack.rollup.enabled
* xpack.slm.enabled
* xpack.sql.enabled
* xpack.transform.enabled
* xpack.vectors.enabled
Since these settings no longer need to be checked, we can remove settings
parameters from a number of constructors and methods, and do so in this
commit.
We also update documentation to remove references to these settings.
This PR implements the following changes to make ML model snapshot
retention more flexible in advance of adding a UI for the feature in
an upcoming release.
- The default for `model_snapshot_retention_days` for new jobs is now
10 instead of 1
- There is a new job setting, `daily_model_snapshot_retention_after_days`,
that defaults to 1 for new jobs and `model_snapshot_retention_days`
for pre-7.8 jobs
- For days that are older than `model_snapshot_retention_days`, all
model snapshots are deleted as before
- For days that are in between `daily_model_snapshot_retention_after_days`
and `model_snapshot_retention_days` all but the first model snapshot
for that day are deleted
- The `retain` setting of model snapshots is still respected to allow
selected model snapshots to be retained indefinitely
Backport of #56125
In #55763 I thought I could remove the flag that marks
reindexing was finished on a data frame analytics task.
However, that exposed a race condition. It is possible that
between updating reindexing progress to 100 because we
have called `DataFrameAnalyticsManager.startAnalytics()` and
a call to the _stats API which updates reindexing progress via the
method `DataFrameAnalyticsTask.updateReindexTaskProgress()` we
end up overwriting the 100 with a lower progress value.
This commit fixes this issue by bringing back the help of
a `isReindexingFinished` flag as it was prior to #55763.
Closes#56128
Backport of #56135
Backport of #56034.
Move includeDataStream flag from an IndicesOptions to IndexNameExpressionResolver.Context
as a dedicated field that callers to IndexNameExpressionResolver can set.
Also alter indices stats api to support data streams.
The rollover api uses this api and otherwise rolling over data stream does no longer work.
Relates to #53100
If there are ill-formed pipelines, or other pipelines are not ready to be parsed, `InferenceProcessor.Factory::accept(ClusterState)` logs warnings. This can be confusing and cause log spam.
It might lead folks to think there an issue with the inference processor. Also, they would see logs for the inference processor even though they might not be using the inference processor. Leading to more confusion.
Additionally, pipelines might not be parseable in this method as some processors require the new cluster state metadata before construction (e.g. `enrich` requires cluster metadata to be set before creating the processor).
closes https://github.com/elastic/elasticsearch/issues/55985
Backport of #55858 to 7.x branch.
Currently the TransportBulkAction detects whether an index is missing and
then decides whether it should be auto created. The coordination of the
index creation also happens in the TransportBulkAction on the coordinating node.
This change adds a new transport action that the TransportBulkAction delegates to
if missing indices need to be created. The reasons for this change:
* Auto creation of data streams can't occur on the coordinating node.
Based on the index template (v2) either a regular index or a data stream should be created.
However if the coordinating node is slow in processing cluster state updates then it may be
unaware of the existence of certain index templates, which then can load to the
TransportBulkAction creating an index instead of a data stream. Therefor the coordination of
creating an index or data stream should occur on the master node. See #55377
* From a security perspective it is useful to know whether index creation originates from the
create index api or from auto creating a new index via the bulk or index api. For example
a user would be allowed to auto create an index, but not to use the create index api. The
auto create action will allow security to distinguish these two different patterns of
index creation.
This change adds the following new transport actions:
AutoCreateAction, the TransportBulkAction redirects to this action and this action will actually create the index (instead of the TransportCreateIndexAction). Later via #55377, can improve the AutoCreateAction to also determine whether an index or data stream should be created.
The create_index index privilege is also modified, so that if this permission is granted then a user is also allowed to auto create indices. This change does not yet add an auto_create index privilege. A future change can introduce this new index privilege or modify an existing index / write index privilege.
Relates to #53100
* Make xpack.monitoring.enabled setting a no-op
This commit turns xpack.monitoring.enabled into a no-op. Mostly, this involved
removing the setting from the setup for integration tests. Monitoring may
introduce some complexity for test setup and teardown, so we should keep an eye
out for turbulence and failures
* Docs for making deprecated setting a no-op
This commit converts the remaining isXXXAllowed methods to instead of
use isAllowed with a Feature value. There are a couple other methods
that are static, as well as some licensed features that check the
license directly, but those will be dealt with in other followups.
This commit correctly sets the maxLinesPerRow in the CsvPreference for delimited files given the file structure finder settings.
Previously, it was silently ignored.
This refactors native integ tests to assert progress without
expecting explicit phases for analyses. We can test those with
yaml tests in a single place.
Backport of #55925
* Make xpack.ilm.enabled setting a no-op
* Add watcher setting to not use ILM
* Update documentation for no-op setting
* Remove NO_ILM ml index templates
* Remove unneeded setting from test setup
* Inline variable definitions for ML templates
* Use identical parameter names in templates
* New ILM/watcher setting falls back to old setting
* Add fallback unit test for watcher/ilm setting
Fixes test by exposing the method ModelLoadingService::addModelLoadedListener()
so that the test class can be notified when a model is loaded which happens in
a background thread
On second thought, this check does not seem to be adding value.
We can test that the phases are as we expect them for each analysis
by adding yaml tests. Those would fail if we introduce new phases
from c++ accidentally or without coordination. This would achieve
the same thing. At the same time we would not have to comment out
this code each time a new phase is introduced. Instead we can just
temporarily mute those yaml tests. Note I will add those tests
right after the imminent new phases are added to the c++ side.
Backport of #55926
While it is good to not be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (via some entry error, etc.).
This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format.
This percentage of "allows bad rows" is only applied when the user has specified delimited formatting options. As the structure finder needs some guidance on what a "bad row" actually means.
related to https://github.com/elastic/elasticsearch/issues/38890
We were previously checking at least one supported field existed
when the _explain API was called. However, in the case of analyses
with required fields (e.g. regression) we were not accounting that
the dependent variable is not a feature and thus if the source index
only contains the dependent variable field there are no features to
train a model on.
This commit adds a validation that at least one feature is available
for analysis. Note that we also move that validation away from
`ExtractedFieldsDetector` and the _explain API and straight into
the _start API. The reason for doing this is to allow the user to use
the _explain API in order to understand why they would be seeing an
error like this one.
For example, the user might be using an index that has fields but
they are of unsupported types. If they start the job and get
an error that there are no features, they will wonder why that is.
Calling the _explain API will show them that all their fields are
unsupported. If the _explain API was failing instead, there would
be no way for the user to understand why all those fields are
ignored.
Closes#55593
Backport of #55876
This change adds a new setting, daily_model_snapshot_retention_after_days,
to the anomaly detection job config.
Initially this has no effect, the effect will be added in a followup PR.
This PR gets the complexities of making changes that interact with BWC
over well before feature freeze.
Backport of #55878
Adding to #55659, we missed another way we could set the task to
failed due to task cancellation. CI revealed that we might also
get a `SearchPhaseExecutionException` whose cause is a
`TaskCancelledException`. That exception is not wrapped so
unwrapping it will not return the underlying `TaskCancelledException`.
Thus to be complete in catching this, we also need to check the
error's cause.
Closes#55068
Backport of #55797
This is a continuation from #55580.
Now that we're parsing phase progresses from the analytics process
we change `ProgressTracker` to allow for custom phases between
the `loading_data` and `writing_results` phases. Each `DataFrameAnalysis`
may declare its own phases.
This commit sets things in place for the analytics process to start
reporting different phases per analysis type. However, this is
still preserving existing behaviour as all analyses currently
declare a single `analyzing` phase.
Backport of #55763
The failed_category_count statistic records the number of times
categorization wanted to create a new category but couldn't
because the job had reached its model_memory_limit.
Backport of #55716
Since #55580 we've introduced a new format for parsing progress
from the data frame analytics process. As the process is now
writing out progress in this new way, we can remove the parsing
of the old format.
Backport of #55711
Also unmutes the integ test that stops and restarts
an outlier detection job with the hope of learning more
of the failure in #55068.
Backport of #55545
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Previously audit messages were indexed when datafeeds that were
assigned to a node were stopped, but not datafeeds that were
unassigned at the time they were stopped.
This change adds auditing for the unassigned case.
Backport of #55656
While we were catching `TaskCancelledException` while we wait for
reindexing to complete, we missed the fact that this exception
may be wrapped in a multi-node cluster. This is the reason
we may still fail the task when stop is called while reindexing.
Some times we're lucky and the exception is thrown by the same
node that runs the job. Then the exception is not wrapped and
things work fine. But when that is not the case the exception is
wrapped, we fail to catch it, and set the task to failed.
The fix is to simply unwrap the exception when we check it it
is `TaskCancelledException`.
Closes#55068
Backport of #55659
Backport of #55115.
Replace calls to deprecate(String,Object...) with deprecateAndMaybeLog(...),
with an appropriate key, so that all messages can potentially be deduplicated.
The Kibana CSV export feature uses a non-standard timestamp format.
This change adds it to the formats the find_file_structure endpoint
recognizes out-of-the-box, to make round-tripping data from Kibana
back to Kibana via CSV files easier.
Fixes#55586
Data frame analytics process currently reports progress as
an integer `progress_percent`. We parse that and report it
from the _stats API as the progress of the `analyzing` phase.
However, we want to allow the DFA process to report progress
for more than one phase. This commit prepares for this by
parsing `phase_progress` from the process, an object that
contains the `phase` name plus the `progress_percent` for that
phase.
Backport of #55580
Instead of doing our own checks against REST status, shard counts, and shard failures, this commit changes all our extractor search requests to set `.setAllowPartialSearchResults(false)`.
- Scrolls are automatically cleared when a search failure occurs with `.setAllowPartialSearchResults(false)` set.
- Code error handling is simplified
closes https://github.com/elastic/elasticsearch/issues/40793
Issue #55521 suggested that audit messages were not written when
closing an unassigned job. This is not the case, but we didn't
have a test to prove it.
Backport of #55571
The ML info endpoint returns the max_model_memory_limit setting
if one is configured. However, it is still possible to create
a job that cannot run anywhere in the current cluster because
no node in the cluster has enough memory to accommodate it.
This change adds an extra piece of information,
limits.effective_max_model_memory_limit, to the ML info
response that returns the biggest model memory limit that could
be run in the current cluster assuming no other jobs were
running.
The idea is that the ML UI will be able to warn users who try to
create jobs with higher model memory limits that their jobs will
not be able to start unless they add a bigger ML node to their
cluster.
Backport of #55529
Adds a "node" field to the response from the following endpoints:
1. Open anomaly detection job
2. Start datafeed
3. Start data frame analytics job
If the job or datafeed is assigned to a node immediately then
this field will return the ID of that node.
In the case where a job or datafeed is opened or started lazily
the node field will contain an empty string. Clients that want
to test whether a job or datafeed was opened or started lazily
can therefore check for this.
Backport of #55473
In the test after the first load event is is not known which models are cached as
loading a later one will evict an earlier one and the order is not known.
The models could have been loaded 1 or 2 times not exactly twice
`updateAndGet` could actually call the internal method more than once on contention.
If I read the JavaDocs, it says:
```* @param updateFunction a side-effect-free function```
So, it could be getting multiple updates on contention, thus having a race condition where stats are double counted.
To fix, I am going to use a `ReadWriteLock`. The `LongAdder` objects allows fast thread safe writes in high contention environments. These can be protected by the `ReadWriteLock::readLock`.
When stats are persisted, I need to call reset on all these adders. This is NOT thread safe if additions are taking place concurrently. So, I am going to protect with `ReadWriteLock::writeLock`.
This should prevent race conditions while allowing high (ish) throughput in the highly contention paths in inference.
I did some simple throughput tests and this change is not significantly slower and is simpler to grok (IMO).
closes https://github.com/elastic/elasticsearch/issues/54786
This paves the data layer way so that exceptionally large models are partitioned across multiple documents.
This change means that nodes before 7.8.0 will not be able to use trained inference models created on nodes on or after 7.8.0.
I chose the definition document limit to be 100. This *SHOULD* be plenty for any large model. One of the largest models that I have created so far had the following stats:
~314MB of inflated JSON, ~66MB when compressed, ~177MB of heap.
With the chunking sizes of `16 * 1024 * 1024` its compressed string could be partitioned to 5 documents.
Supporting models 20 times this size (compressed) seems adequate for now.
* [ML] fix native ML test log spam (#55459)
This adds a dependency to ingest common. This removes the log spam resulting from basic plugins being enabled that require the common ingest processors.
* removing unnecessary changes
* removing unused imports
* removing unnecessary java setting
Removing the deprecated "xpack.monitoring.enabled" setting introduced
log spam and potentially some failures in ML tests. It's possible to use
a different, non-deprecated setting to disable monitoring, so we do that
here.
We believe there's no longer a need to be able to disable basic-license
features completely using the "xpack.*.enabled" settings. If users don't
want to use those features, they simply don't need to use them. Having
such features always available lets us build more complex features that
assume basic-license features are present.
This commit deprecates settings of the form "xpack.*.enabled" for
basic-license features, excluding "security", which is a special case.
It also removes deprecated settings from integration tests and unit
tests where they're not directly relevant; e.g. monitoring and ILM are
no longer disabled in many integration tests.
This fixes the long muted testHRDSplit. Some minor adjustments for modern day elasticsearch changes :).
The cause of the failure is that a new `by` field entering the model with an exceptionally high count does not cause an anomaly. We have since stopped combining the `rare` and `by` in this manner. New entries in a `by` field are not anomalous because we have no history on them yet.
closes https://github.com/elastic/elasticsearch/issues/32966
When a anomaly jobs, datafeeds, and analytics tasks are stopped, they enter an ephemeral state called `STOPPING`.
If the node executing the task fails while this is occurring, they could be stuck in the limbo state of `STOPPING`. It is best to mark the tasks as completed if they get reassigned to a node.
Backport from: #54726
The INCLUDE_DATA_STREAMS indices option controls whether data streams can be resolved in an api for both concrete names and wildcard expressions. If data streams cannot be resolved then a 400 error is returned indicating that data streams cannot be used.
In this pr, the INCLUDE_DATA_STREAMS indices option is enabled in the following APIs: search, msearch, refresh, index (op_type create only) and bulk (index requests with op type create only). In a subsequent later change, we will determine which other APIs need to be able to resolve data streams and enable the INCLUDE_DATA_STREAMS indices option for these APIs.
Whether an api resolve all backing indices of a data stream or the latest index of a data stream (write index) depends on the IndexNameExpressionResolver.Context.isResolveToWriteIndex().
If isResolveToWriteIndex() returns true then data streams resolve to the latest index (for example: index api) and otherwise a data stream resolves to all backing indices of a data stream (for example: search api).
Relates to #53100
* Add ValuesSource Registry and associated logic (#54281)
* Remove ValuesSourceType argument to ValuesSourceAggregationBuilder (#48638)
* ValuesSourceRegistry Prototype (#48758)
* Remove generics from ValuesSource related classes (#49606)
* fix percentile aggregation tests (#50712)
* Basic thread safety for ValuesSourceRegistry (#50340)
* Remove target value type from ValuesSourceAggregationBuilder (#49943)
* Cleanup default values source type (#50992)
* CoreValuesSourceType no longer implements Writable (#51276)
* Remove genereics & hard coded ValuesSource references from Matrix Stats (#51131)
* Put values source types on fields (#51503)
* Remove VST Any (#51539)
* Rewire terms agg to use new VS registry (#51182)
Also adds some basic AggTestCases for untested code
paths (and boilerplate for future tests once the IT are
converted over)
* Wire Cardinality aggregation to work with the ValuesSourceRegistry (#51337)
* Wire Percentiles aggregator into new VS framework (#51639)
This required a bit of a refactor to percentiles itself. Before,
the Builder would switch on the chosen algo to generate an
algo-specific factory. This doesn't work (or at least, would be
difficult) in the new VS framework.
This refactor consolidates both factories together and introduces
a PercentilesConfig object to act as a standardized way to pass
algo-specific parameters through the factory. This object
is then used when deciding which kind of aggregator to create
Note: CoreValuesSourceType.HISTOGRAM still lives in core, and will
be moved in a subsequent PR.
* Remove generics and target value type from MultiVSAB (#51647)
* fix checkstyle after merge (#52008)
* Plumb ValuesSourceRegistry through to QuerySearchContext (#51710)
* Convert RareTerms to new VS registry (#52166)
* Wire up Value Count (#52225)
* Wire up Max & Min aggregations (#52219)
* ValuesSource refactoring: Wire up Sum aggregation (#52571)
* ValuesSource refactoring: Wire up SigTerms aggregation (#52590)
* Soft immutability for VSConfig (#52729)
* Unmute testSupportedFieldTypes, fix Percentiles/Ranks/Terms tests (#52734)
Also fixes Percentiles which was incorrectly specified to only accept
numeric, but in fact also accepts Boolean and Date (because those are
numeric on master - thanks `testSupportedFieldTypes` for catching it!)
* VS refactoring: Wire up stats aggregation (#52891)
* ValuesSource refactoring: Wire up string_stats aggregation (#52875)
* VS refactoring: Wire up median (MAD) aggregation (#52945)
* fix valuesourcetype issue with constant_keyword field (#53041)x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/job/RollupIndexer.java
this commit implements `getValuesSourceType` for
the ConstantKeyword field type.
master was merged into feature/extensible-values-source
introducing a new field type that was not implementing
`getValuesSourceType`.
* ValuesSource refactoring: Wire up Avg aggregation (#52752)
* Wire PercentileRanks aggregator into new VS framework (#51693)
* Add a VSConfig resolver for aggregations not using the registry (#53038)
* Vs refactor wire up ranges and date ranges (#52918)
* Wire up geo_bounds aggregation to ValuesSourceRegistry (#53034)
This commit updates the geo_bounds aggregation to depend
on registering itself in the ValuesSourceRegistry
relates #42949.
* VS refactoring: convert Boxplot to new registry (#53132)
* Wire-up geotile_grid and geohash_grid to ValuesSourceRegistry (#53037)
This commit updates the geo*_grid aggregations to depend
on registering itself in the ValuesSourceRegistry
relates to the values-source refactoring meta issue #42949.
* Wire-up geo_centroid agg to ValuesSourceRegistry (#53040)
This commit updates the geo_centroid aggregation to depend
on registering itself in the ValuesSourceRegistry.
relates to the values-source refactoring meta issue #42949.
* Fix type tests for Missing aggregation (#53501)
* ValuesSource Refactor: move histo VSType into XPack module (#53298)
- Introduces a new API (`getBareAggregatorRegistrar()`) which allows plugins to register aggregations against existing agg definitions defined in Core.
- This moves the histogram VSType over to XPack where it belongs. `getHistogramValues()` still remains as a Core concept
- Moves the histo-specific bits over to xpack (e.g. the actual aggregator logic). This requires extra boilerplate since we need to create a new "Analytics" Percentile/Rank aggregators to deal with the histo field. Doubly-so since percentiles/ranks are extra boiler-plate'y... should be much lighter for other aggs
* Wire up DateHistogram to the ValuesSourceRegistry (#53484)
* Vs refactor parser cleanup (#53198)
Co-authored-by: Zachary Tong <polyfractal@elastic.co>
Co-authored-by: Zachary Tong <zach@elastic.co>
Co-authored-by: Christos Soulios <1561376+csoulios@users.noreply.github.com>
Co-authored-by: Tal Levy <JubBoy333@gmail.com>
* First batch of easy fixes
* Remove List.of from ValuesSourceRegistry
Note that we intend to have a follow up PR dealing with the mutability
of the registry, so I didn't even try to address that here.
* More compiler fixes
* More compiler fixes
* More compiler fixes
* Precommit is happy and so am I
* Add new Core VSTs to tests
* Disabled supported type test on SigTerms until we can backport it's fix
* fix checkstyle
* Fix test failure from semantic merge issue
* Fix some metaData->metadata replacements that got lost
* Fix list of supported types for MinAggregator
* Fix list of supported types for Avg
* remove unused import
Co-authored-by: Zachary Tong <polyfractal@elastic.co>
Co-authored-by: Zachary Tong <zach@elastic.co>
Co-authored-by: Christos Soulios <1561376+csoulios@users.noreply.github.com>
Co-authored-by: Tal Levy <JubBoy333@gmail.com>
Today we pass the `RepositoriesService` to the searchable snapshots plugin
during the initialization of the `RepositoryModule`, forcing the plugin to be a
`RepositoryPlugin` even though it does not implement any repositories.
After discussion we decided it best for now to pass this in via
`Plugin#createComponents` instead, pending some future work in which plugins
can depend on services more dynamically.
Simplify the code by removing the generic type from InferenceConfigUpdate which
meant wildcard types were used in many places. Instead check the class type is
appropriate where used.
When a datafeed transitions from lookback to real-time we request
that state is persisted from the autodetect process in the
background.
This PR adds a test to prove that for a categorization job the
state that is persisted includes the categorization state.
Without the fix from elastic/ml-cpp#1137 this test fails. After
that C++ fix is merged this test should pass.
Backport of #55243
After #54650 we catch `TaskCancelledException` when we wait for
reindexing to complete as it may be thrown. However, when that happens
we do not mark the task as completed. This results in the stop request
never returning and the failures we saw in #55068.
Closes#55068
Backport of #55286
I've noticed that a lot of our tests are using deprecated static methods
from the Hamcrest matchers. While this is not a big deal in any
objective sense, it seems like a small good thing to reduce compilation
warnings and be ready for a new release of the matcher library if we
need to upgrade. I've also switched a few other methods in tests that
have drop-in replacements.
* [ML] adding prediction_field_type to inference config (#55128)
Data frame analytics dynamically determines the classification field type. This field type then dictates the encoded JSON that is written to Elasticsearch.
Inference needs to know about this field type so that it may provide the EXACT SAME predicted values as analytics.
Here is added a new field `prediction_field_type` which indicates the desired type. Options are: `string` (DEFAULT), `number`, `boolean` (where close_to(1.0) == true, false otherwise).
Analytics provides the default `prediction_field_type` when the model is created from the process.
The isAuthAllowed() method for license checking is used by code that
wants to ensure security is both enabled and available. The enabled
state is dynamic and provided by isSecurityEnabled(). But since security
is available with all license types, an check on the license level is
not necessary. Thus, this change replaces isAuthAllowed() with calling
isSecurityEnabled().
We needlessly send documents to be persisted. If there are no stats added, then we should not attempt to persist them.
Also, this PR fixes the race condition that caused issue: https://github.com/elastic/elasticsearch/issues/54786
* [ML] Start gathering and storing inference stats (#53429)
This PR enables stats on inference to be gathered and stored in the `.ml-stats-*` indices.
Each node + model_id will have its own running stats document and these will later be summed together when returning _stats to the user.
`.ml-stats-*` is ILM managed (when possible). So, at any point the underlying index could change. This means that a stats document that is read in and then later updated will actually be a new doc in a new index. This complicates matters as this means that having a running knowledge of seq_no and primary_term is complicated and almost impossible. This is because we don't know the latest index name.
We should also strive for throughput, as this code sits in the middle of an ingest pipeline (or even a query).
This change makes sure that all internal client requests spawned by the
data frame analytics persistent task executor and that use the end user
security credentials, have the parent task id assigned. The objective here
is to permit auditing (as well as tracking for debugging purposes) of all
the end-user requests executed on its behalf by persistent tasks.
Because data frame analytics taks already implements graceful shutdown
of child tasks, this change does not interfere with it by opting out of
the persistent task cancellation of child tasks.
Relates #54943#52314
This change preserves the task id for internal requests for the `StartDatafeedPersistentTask`.
Task ids are a way to express a relationship between related internal requests.
In this particular case, the task ids are used for debugging and (soon) security auditing,
but not for task cancellation, because there is already a graceful-shutdown of child
internal requests (given a task id) in place.
This change reintroduces the system index APIs for Kibana without the
changes made for marking what system indices could be accessed using
these APIs. In essence, this is a partial revert of #53912. The changes
for marking what system indices should be allowed access will be
handled in a separate change.
The APIs introduced here are wrapped versions of the existing REST
endpoints. A new setting is also introduced since the Kibana system
indices' names are allowed to be changed by a user in case multiple
instances of Kibana use the same instance of Elasticsearch.
Relates #52385
Backport of #54858
Guava was removed from Elasticsearch many years ago, but remnants of it
remain due to transitive dependencies. When a dependency pulls guava
into the compile classpath, devs can inadvertently begin using methods
from guava without realizing it. This commit moves guava to a runtime
dependency in the modules that it is needed.
Note that one special case is the html sanitizer in watcher. The third
party dep uses guava in the PolicyFactory class signature. However, only
calling a method on the PolicyFactory actually causes the class to be
loaded, a reference alone does not trigger compilation to look at the
class implementation. There we utilize a MethodHandle for invoking the
relevant method at runtime, where guava will continue to exist.
When a data frame analytics job is stopped, if the reindexing
task was still in progress we cancel it. Cancelling it should
be done from the same context as when we executed the reindexing
task. That means from a thread context with ML origin.
Backport of #54874
It seems the 20 seconds timeout is occasionally not enough.
We still get sporadic failures where the logs reveal the job
wasn't opened within 20 seconds. I'm increasing the wait time
to 30 seconds.
Closes#54448
Backport of #54792
The test results are affected by the off-by-one error that is
fixed by https://github.com/elastic/ml-cpp/pull/1122
This test can be unmuted once that fix is merged and has been
built into ml-cpp snapshots.
Force stopping a failed job used to work but it
now puts the job in `stopping` state and hangs.
In addition, force stopping a `stopping` job is
not handled.
This commit addresses those issues with force
stopping data frame analytics. It inlines the
approach with that followed for anomaly detection
jobs.
Backport of #54650
This adds training_percent parameter to the analytics process for Classification and Regression. This parameter is then used to give more accurate memory estimations.
See native side pr: elastic/ml-cpp#1111
* [ML] add new inference_config field to trained model config (#54421)
A new field called `inference_config` is now added to the trained model config object. This new field allows for default inference settings from analytics or some external model builder.
The inference processor can still override whatever is set as the default in the trained model config.
* fixing for backport
* [ML] prefer secondary authorization header for data[feed|frame] authz (#54121)
Secondary authorization headers are to be used to facilitate Kibana spaces support + ML jobs/datafeeds.
Now on PUT/Update/Preview datafeed, and PUT data frame analytics the secondary authorization is preferred over the primary (if provided).
closes https://github.com/elastic/elasticsearch/issues/53801
* fixing for backport
When one of ML's normalize processes fails to connect to the JVM
quickly enough and another normalize process for the same job
starts shortly afterwards it is possible that their named pipes
can get mixed up.
This change avoids the risk of that by adding an incrementing
counter value into the named pipe names used for normalize
processes.
Backport of #54636
* [ML] add num_matches and preferred_to_categories to category defintion objects (#54214)
This adds two new fields to category definitions.
- `num_matches` indicating how many documents have been seen by this category
- `preferred_to_categories` indicating which other categories this particular category supersedes when messages are categorized.
These fields are only guaranteed to be up to date after a `_flush` or `_close`
native change: https://github.com/elastic/ml-cpp/pull/1062
* adjusting for backport
Refactor SearchHit to have separate document and meta fields.
This is a part of bigger refactoring of issue #24422 to remove
dependency on MapperService to check if a field is metafield.
Relates to PR: #38373
Relates to issue #24422
Co-authored-by: sandmannn <bohdanpukalskyi@gmail.com>
This is a follow up to a previous commit that renamed MetaData to
Metadata in all of the places. In that commit in master, we renamed
META_DATA to METADATA, but lost this on the backport. This commit
addresses that.
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
This PR:
1. Fixes the bug where a cardinality estimate of zero could cause
a 500 status
2. Adds tests for that scenario and a few others
3. Adds sensible estimates for the cases that were previously TODO
Backport of #54462
Backport of #53982
In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams,
the abstraction needs to be made more flexible, because currently it really can be only
an alias or an index.
* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced a `IndexAbstraction.Type` enum to indicate what a `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.
Relates to #53100
Today the machine learning plugin stashes a copy of the environment in
its constructor, and uses the stashed copy to construct its components
even though it is provided with an environment to create these
components. What is more, the environment it creates in its constructor
is not fully initialized, as it does not have the final copy of the
settings, but the environment passed in while creating components
does. This commit removes that stashed copy of the environment.
The NodesStatsRequest class uses a set of strings for its internal
serialization. This commit updates the class's interface so that we
no longer use hard-coded getters and setters, but rather
methods that add strings directly. For example, the old way of
adding "os" metrics to a request would be to call request.os(true).
The new way of doing this is to call request.addMetric("os").
For the time being, the canonical list of metrics is an enum in
NodesStatsRequest. This will eventually be replaced with something
pluggable.
This commit populates the _stats API response with sensible "empty"
`data_counts` and `memory_usage` objects when the job itself
has not started reporting them.
Backport of #54210
When get filters is called without setting the `size`
paramter only up to 10 filters are returned. However,
100 filters should be returned. This commit fixes this
and adds an integ test to guard it.
It seems this was accidentally broken in #39976.
Closes#54206
Backport of #54207
As classification now works for multiple classes, randomly
picking training/test data frame rows is not good enough.
This commit introduces a stratified cross validation splitter
that maintains the proportion of the each class in the dataset
in the sample that is used for training the model.
Backport of #54087
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true. Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.
This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started. The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.
Backport of #53918
This commit instruments data frame analytics
with stats for the data that are being analyzed.
In particular, we count training docs, test docs,
and skipped docs.
In order to account docs with missing values as skipped
docs for analyses that do not support missing values,
this commit changes the extractor so that it only ignores
docs with missing values when it collects the data summary,
which is used to estimate memory usage.
Backport of #53998
Adds multi-class feature importance calculation.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
"ml.inference.feature_importance": {
"type": "nested",
"dynamic": true,
"properties": {
"feature_name": {
"type": "keyword"
},
"importance": {
"type": "double"
}
}
}
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?
If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Feature importance storage format is changing to encompass multi-class.
Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
"feature_name": "feature_0",
"importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
“feature_name”: “feature_0”,
“importance”: 2.0, // sum(abs()) of class importances
“foo”: 1.0,
“bar”: 0.5,
“baz”: -0.5
},
```
This change adjusts the mapping creation for analytics so that the field is mapped as a `nested` type.
Native side change: https://github.com/elastic/ml-cpp/pull/1071
Since a data frame analytics job may have associated docs
in the .ml-stats-* indices, when the job is deleted we
should delete those docs too.
Backport of #53933
While `CustomProcessor` is generic and allows for flexibility, there
are new requirements that make cross validation a concept it's hard
to abstract behind custom processor. In particular, we would like to
add data_counts to the DFA jobs stats. Counting training VS. test
docs would be a useful statistic. We would also want to add a
different cross validation strategy for multiclass classification.
This commit renames custom processors to cross validation splitters
which allows for those enhancements without cryptically doing
things as a side effect of the abstract custom processing.
Backport of #53915
It's simple to deprecate a field used in an ObjectParser just by adding deprecation
markers to the relevant ParseField objects. The warnings themselves don't currently
have any context - they simply say that a deprecated field has been used, but not
where in the input xcontent it appears. This commit adds the parent object parser
name and XContentLocation to these deprecation messages.
Note that the context is automatically stripped from warning messages when they
are asserted on by integration tests and REST tests, because randomization of
xcontent type during these tests means that the XContentLocation is not constant
Adds parsing and indexing of analysis instrumentation stats.
The latest one is also returned from the get-stats API.
Note that we chose to duplicate objects even where they are currently
similar. There are already ideas on how these will diverge in the future
and while the duplication looks ugly at the moment, it is the option
that offers the highest flexibility.
Backport of #53788
* [ML] only retry persistence failures when the failure is intermittent and stop retrying when analytics job is stopping (#53725)
This fixes two issues:
- Results persister would retry actions even if they are not intermittent. An example of an persistent failure is a doc mapping problem.
- Data frame analytics would continue to retry to persist results even after the job is stopped.
closes https://github.com/elastic/elasticsearch/issues/53687
Prepares classification analysis to support more than just
two classes. It introduces a new parameter to the process config
which dictates the `num_classes` to the process. It also
changes the max classes limit to `30` provisionally.
Backport of #53539
the ML portion of the x-pack info API was erroneously counting configuration documents and definition documents. The underlying implementation of our storage separates the two out.
This PR filters the query so that only trained model config documents are counted.
Adds a new parameter for classification that enables choosing whether to assign labels to
maximise accuracy or to maximise the minimum class recall.
Fixes#52427.
Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
This is a partial implementation of an endpoint for anomaly
detector model memory estimation.
It is not complete, lacking docs, HLRC and sensible numbers
for many anomaly detector configurations. These will be
added in a followup PR in time for 7.7 feature freeze.
A skeleton endpoint is useful now because it allows work on
the UI side of the change to commence. The skeleton endpoint
handles the same cases that the old UI code used to handle,
and produces very similar estimates for these cases.
Backport of #53333
A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While this it is good for jobs to wait, they could fail early on _start.
This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.
If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled.
For tasks seeking a new assignment after a node failure, that behavior is unchanged.
closes#53188
Tests have been periodically failing due to a race condition on checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task removed BEFORE the `.ml-state|stats` indices are created/updated.
This change moves towards the task cleaning up itself in its main execution thread. `stop` flips the flag of the task to `isStopping` and now we check `isStopping` at every necessary method. Allowing the task to gracefully stop.
closes#53007
For analytics, we need a consistent way of indicating when a value is missing. Inheriting from anomaly detection, analysis sent `""` when a field is missing. This works fine with numbers, but the underlying analytics process actually treats `""` as a category in categorical values.
Consequently, you end up with this situation in the resulting model
```
{
"frequency_encoding" : {
"field" : "RainToday",
"feature_name" : "RainToday_frequency",
"frequency_map" : {
"" : 0.009844409027270245,
"No" : 0.6472019970785184,
"Yes" : 0.6472019970785184
}
}
}
```
For inference this is a problem, because inference will treat missing values as `null`. And thus not include them on the infer call against the model.
This PR takes advantage of our new `missing_field_value` option and supplies `\0` as the value.
The assumption added in #52631 skips a problematic test
if it fails to create the required conditions for the
scenario it is supposed to be testing. (This happens
very rarely.)
However, before skipping the test it needs to remove the
failed job it has created because the standard test
cleanup code treats failed jobs as fatal errors.
Closes#52608
This commit introduces a module for Kibana that exposes REST APIs that
will be used by Kibana for access to its system indices. These APIs are wrapped
versions of the existing REST endpoints. A new setting is also introduced since
the Kibana system indices' names are allowed to be changed by a user in case
multiple instances of Kibana use the same instance of Elasticsearch.
Additionally, the ThreadContext has been extended to indicate that the use of
system indices may be allowed in a request. This will be built upon in the future
for the protection of system indices.
Backport of #52385
Currently _rollup_search requires manage privilege to access. It should really be
a read only operation. This PR changes the requirement to be read indices privilege.
Resolves: #50245
Adds reporting of memory usage for data frame analytics jobs.
This commit introduces a new index pattern `.ml-stats-*` whose
first concrete index will be `.ml-stats-000001`. This index serves
to store instrumentation information for those jobs.
Backport of #52778 and #52958
* [ML][Inference] Add support for multi-value leaves to the tree model (#52531)
This adds support for multi-value leaves. This is a prerequisite for multi-class boosted tree classification.
This adds a new configurable field called `indices_options`. This allows users to create or update the indices_options used when a datafeed reads from an index.
This is necessary for the following use cases:
- Reading from frozen indices
- Allowing certain indices in multiple index patterns to not exist yet
These index options are available on datafeed creation and update. Users may specify them as URL parameters or within the configuration object.
closes https://github.com/elastic/elasticsearch/issues/48056
This change removes TrainedModelConfig#isAvailableWithLicense method with calls to
XPackLicenseState#isAllowedByLicense.
Please note there are subtle changes to the code logic. But they are the right changes:
* Instead of Platinum license, Enterprise license nows guarantees availability.
* No explicit check when the license requirement is basic. Since basic license is always available, this check is unnecessary.
* Trial license is always allowed.
This PR moves the majority of the Watcher REST tests under
the Watcher x-pack plugin.
Specifically, moves the Watcher tests from:
x-pack/plugin/test
x-pack/qa/smoke-test-watcher
x-pack/qa/smoke-test-watcher-with-security
x-pack/qa/smoke-test-monitoring-with-watcher
to:
x-pack/plugin/watcher/qa/rest (/test and /qa/smoke-test-watcher)
x-pack/plugin/watcher/qa/with-security
x-pack/plugin/watcher/qa/with-monitoring
Additionally, this disables Watcher from the main
x-pack test cluster and consolidates the stop/start logic
for the tests listed.
No changes to the tests (beyond moving them) are included.
3rd party tests and doc tests (which also touch Watcher)
are not included in the changes here.
* Smarter copying of the rest specs and tests (#52114)
This PR addresses the unnecessary copying of the rest specs and allows
for better semantics for which specs and tests are copied. By default
the rest specs will get copied if the project applies
`elasticsearch.standalone-rest-test` or `esplugin` and the project
has rest tests or you configure the custom extension `restResources`.
This PR also removes the need for dozens of places where the x-pack
specs were copied by supporting copying of the x-pack rest specs too.
The plugin/task introduced here can also copy the rest tests to the
local project through a similar configuration.
The new plugin/task allows a user to minimize the surface area of
which rest specs are copied. Per project can be configured to include
only a subset of the specs (or tests). Configuring a project to only
copy the specs when actually needed should help with build cache hit
rates since we can better define what is actually in use.
However, project level optimizations for build cache hit rates are
not included with this PR.
Also, with this PR you can no longer use the includePackaged flag on
integTest task.
The following items are included in this PR:
* new plugin: `elasticsearch.rest-resources`
* new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy
* new extension 'restResources'
```
restResources {
restApi {
includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar
includeXpack 'baz' //will include x-pack specs that start with baz
}
restTests {
includeCore 'foo', 'bar' //will include the core tests that start with foo and bar
includeXpack 'baz' //will include the x-pack tests that start with baz
}
}
```
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
This commit modifies the codebase so that our production code uses a
single instance of the IndexNameExpressionResolver class. This change
is being made in preparation for allowing name expression resolution
to be augmented by a plugin.
In order to remove some instances of IndexNameExpressionResolver, the
single instance is added as a parameter of Plugin#createComponents and
PersistentTaskPlugin#getPersistentTasksExecutor.
Backport of #52596
Add enterprise operation mode to properly map enterprise license.
Aslo refactor XPackLicenstate class to consolidate license status and mode checks.
This class has many sychronised methods to check basically three things:
* Minimum operation mode required
* Whether security is enabled
* Whether current license needs to be active
Depends on the actual feature, either 1, 2 or all of above checks are performed.
These are now consolidated in to 3 helper methods (2 of them are new).
The synchronization is pushed down to the helper methods so actual checking
methods no longer need to worry about it.
resolves: #51081
When `PUT` is called to store a trained model, it is useful to return the newly create model config. But, it is NOT useful to return the inflated definition.
These definitions can be large and returning the inflated definition causes undo work on the server and client side.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This adds `_all` to Calendar searches. This enables users to supply the `_all` string in the `job_ids` array when creating a Calendar. That calendar will now be applied to all jobs (existing and newly created).
Closes#45013
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This changes the tree validation code to ensure no node in the tree has a
feature index that is beyond the bounds of the feature_names array.
Specifically this handles the situation where the C++ emits a tree containing
a single node and an empty feature_names list. This is valid tree used to
centre the data in the ensemble but the validation code would reject this
as feature_names is empty. This meant a broken workflow as you cannot GET
the model and PUT it back
When changing a job state using a mechanism that doesn't
wait for the desired state to be reached within the production
code the test code needs to loop until the cluster state has
been updated.
Closes#52451
Following the change to store cluster state in Lucene indices
(#50907) it can take longer for all the cluster state updates
associated with node failure scenarios to be processed during
internal cluster tests where several nodes all run in the same
JVM.
ML mappings and index templates have so far been created
programmatically. While this had its merits due to static typing,
there is consensus it would be clear to maintain those in json files.
In addition, we are going to adding ILM policies to these indices
and the component for a plugin to register ILM policies is
`IndexTemplateRegistry`. It expects the templates to be in resource
json files.
For the above reasons this commit refactors ML mappings and index
templates into json resource files that are registered via
`MlIndexTemplateRegistry`.
Backport of #51765
This commit removes the need for DeprecatedRoute and ReplacedRoute to
have an instance of a DeprecationLogger. Instead the RestController now
has a DeprecationLogger that will be used for all deprecated and
replaced route messages.
Relates #51950
Backport of #52278
During a bug hunt, I caught a handful of things (unrelated to the bug) that could be potential issues:
1. Needlessly wrapping in exception handling (minor cleanup)
2. Potential of notifying listeners of a failure multiple times + even trying to notify of a success after a failure notification
In #51146 a rudimentary check for poor categorization was added to
7.6.
This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets. categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.
Closes#50749
Changes the misleading error message when attempting to open
a job while the "cluster.persistent_tasks.allocation.enable"
setting is set to "none" to a clearer message that names the
setting.
Closes#51956
Refactors `DataFrameAnalyticsTask` to hold a `StatsHolder` object.
That just has a `ProgressTracker` for now but this is paving the
way to add additional stats like memory usage, analysis stats, etc.
Backport #52134
Employs `ResultsPersisterService` from `DataFrameRowsJoiner` in order
to add retries when a data frame analytics job is persisting the results
to the destination data frame.
Backport of #52048
This change adds support for the following new model_size_stats
fields:
- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status
Backport of #51879
This commit changes how RestHandlers are registered with the
RestController so that a RestHandler no longer needs to register itself
with the RestController. Instead the RestHandler interface has new
methods which when called provide information about the routes
(method and path combinations) that are handled by the handler
including any deprecated and/or replaced combinations.
This change also makes the publication of RestHandlers safe since they
no longer publish a reference to themselves within their constructors.
Closes#51622
Co-authored-by: Jason Tedor <jason@tedor.me>
Backport of #51950
* [ML] Add bwc serialization unit test scaffold (#51889)
Adds new `AbstractBWCSerializationTestCase` which provides easy scaffolding for BWC serialization unit tests.
These are no replacement for true BWC tests (which execute actual old code). These tests do provide some good coverage for the current code when serializing to/from old versions.
* removing unnecessary override for 7.series branch
* adding necessary import
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
If the configs are removed (by some horrific means), we should still allow tasks to be cleaned up easily.
Datafeeds and jobs with missing configs are now visible in their respective _stats calls and can be stopped/closed.
Currently, the same class `FieldCapabilities` is used both to represent the
capabilities for one index, and also the merged capabilities across indices. To
help clarify the logic, this PR proposes to create a separate class
`IndexFieldCapabilities` for the capabilities in one index. The refactor will
also help when adding `source_path` information in #49264, since the merged
source path field will have a different structure from the field for a single index.
Individual changes:
* Add a new class IndexFieldCapabilities.
* Remove extra constructor from FieldCapabilities.
* Combine the add and merge methods in FieldCapabilities.Builder.
The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.
Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.
This change improves the multi-line start pattern by:
1. Never creating a multi-line pattern if the sample contained
only single line records. This improves the import
efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
whether that be the time field or a boolean/numeric field.
This reduces the risk of a field with newlines occurring
earlier, and also means the algorithm doesn't automatically
fail for data without a timestamp.
This setting was introduced with the purpose of reducing the time took by
tests that shut nodes down. Tests like `MlDistributedFailureIT` and
`NetworkDisruptionIT`. However, it is unfortunate to have to set the value
to an explicit value in production. In addition, and most important, the dynamically
choosing the value for this setting makes it impossible to adopt static index template configs
that we register via `IndexTemplateRegistry`, which we need to use in order to start
registering ILM policies for the ML indices.
This commit removes this setting from our templates. I run the tests a few times and could
not see execution time differing significantly.
Backport of #51740
This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources.
Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
* [ML][Inference] Fix weighted mode definition (#51648)
Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense.
Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`. This indicates what the maximum class value to be expected.
Datafeeds being closed while starting could result in and NPE. This was
handled as any other failure, masking out the NPE. However, this
conflicts with the changes in #50886.
Related to #50886 and #51302
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.
This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
Changes the find_file_structure response to include a CSV
ingest processor in the ingest pipeline it suggests.
Previously the Kibana file upload functionality parsed CSV
in the browser, but by parsing CSV in the ingest pipeline
it makes the Kibana file upload functionality more easily
interchangable with Filebeat such that the configurations
it creates can more easily be used to import data with the
same structure repeatedly in production.
The DATE and DATESTAMP Grok patterns match 2 digit years
as well as 4 digit years. The pattern determination in
find_file_structure worked correctly in this case, but
the regex used to create a multi-line start pattern was
assuming a 4 digit year. Also, the quick rule-out
patterns did not always correctly consider 2 digit years,
meaning that detection was inconsistent.
This change fixes both problems, and also extends the
tests for DATE and DATESTAMP to check both 2 and 4 digit
years.
* [ML][Inference] add tags url param to GET (#51330)
Adds a new URL parameter, `tags` to the GET _ml/inference/<model_id> endpoint.
This parameter allows the list of models to be further reduced to those who contain all the provided tags.
As we prepare to introduce a new index for storing additional
information about data frame analytics jobs (e.g. intrumentation),
renaming this class to `DestinationIndex` better captures what it does
and leaves its prior name available for a more suitable use.
Backport of #51353
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Data frame analytics classification currently only supports 2 classes for the
dependent variable. We were checking that the field's cardinality is not higher
than 2 but we should also check it is not less than that as otherwise the process
fails.
Backport of #51232