This commit bumps our JNA dependency from 4.5.1 to 5.5.0, so that we are
now on the latest maintained line, and pick up a large collection of bug
fixes that have accumulated.
The main improvement here is that the total expected
count of training rows in the test is calculated as the
sum of the training fraction times the cardinality of each
class (instead of the training fraction times the total doc count).
Also slightly relaxes the error bound on the uniformity test from 0.12
to 0.13.
Closes #54122
Backport of #58180
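To illustrate why the two expectations can differ, here is a minimal sketch. The class and method names are hypothetical, and truncating each class's contribution to a whole row is an assumption made for the example, not necessarily what the test does.

```java
import java.util.Map;

// Hypothetical helper, not the actual test code.
final class ExpectedTrainingRows {
    private ExpectedTrainingRows() {}

    /** New expectation: sum of (training fraction * cardinality) over the classes. */
    static long perClass(Map<String, Long> classCardinalities, double trainingFraction) {
        long total = 0;
        for (long cardinality : classCardinalities.values()) {
            // Assumption: each class contributes a whole number of rows (truncated).
            total += (long) (trainingFraction * cardinality);
        }
        return total;
    }

    /** Previous expectation: training fraction * total doc count. */
    static long overall(Map<String, Long> classCardinalities, double trainingFraction) {
        long docCount = 0;
        for (long cardinality : classCardinalities.values()) {
            docCount += cardinality;
        }
        return (long) (trainingFraction * docCount);
    }
}
```

With a training fraction of 0.5 and two classes of 3 and 5 rows, `perClass` gives 1 + 2 = 3 while `overall` gives 4, so the per-class sum is the tighter expectation under this assumption.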
This commit adds an optional field, `description`, to all ingest processors
so that users can explain the purpose of the specific processor instance.
Closes #56000.
* Remove usage of deprecated testCompile configuration
* Replace testCompile usage by testImplementation
* Make testImplementation non transitive by default (as we did for testCompile)
* Update CONTRIBUTING about using testImplementation for test dependencies
* Fail on testCompile configuration usage
This type of result will store stats about how well categorization
is performing. When per-partition categorization is in use, separate
documents will be written for every partition so that it is possible
to see if categorization is working well for some partitions but not
others.
This PR is a minimal implementation to allow the C++ side changes to
be made. More Java side changes related to per-partition
categorization will be in followup PRs. However, even in the long
term I do not see a major benefit in introducing dedicated APIs for
querying categorizer stats. Like forecast request stats, the
categorizer stats can be read directly from the job's results alias.
Backport of #57978
Adds support for reading in `model_size_info` objects.
These objects contain numeric values indicating the model definition size and complexity.
Additionally, these objects are not stored or serialized to any other node. They are to be used for calculating and storing model metadata. They are much smaller on heap than the true model definition and should help prevent the analytics process from using too much memory.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Now that annotations are part of the anomaly detection job results,
the annotations index should be refreshed on flushing and closing
the job so that flush and close continue to fulfil their contracts:
immediately after they return, all results the job generated up
to that point are searchable.
ModelLoadingService only caches models if they are referenced by an
ingest pipeline. For models used in search we want to always cache the
models and rely on TTL to evict them. Additionally when an ingest
pipeline is deleted the model it references should not be evicted if
it is used in search.
Search after is a better choice than scroll for the delete expired data
iterators, where processing takes a long time, because no search context
has to be kept alive. Also changes the delete expired data endpoint to
return a 404 if the job is unknown.
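The idea behind search after can be sketched with an in-memory stand-in for the index. Everything here (class name, `searchPage`, the `TreeSet` acting as the index) is hypothetical and only illustrates the cursor pattern; it is not the iterator code in this change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of search-after style paging.
final class SearchAfterSketch {
    private SearchAfterSketch() {}

    /** Stand-in for one search request: up to pageSize keys, sorted, strictly after the cursor. */
    static List<String> searchPage(TreeSet<String> index, String after, int pageSize) {
        Iterable<String> candidates = (after == null) ? index : index.tailSet(after, false);
        List<String> page = new ArrayList<>();
        for (String key : candidates) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    /** Walks every page. The only state kept between requests is the last sort key seen. */
    static List<String> collectAll(TreeSet<String> index, int pageSize) {
        List<String> seen = new ArrayList<>();
        String cursor = null;
        while (true) {
            List<String> page = searchPage(index, cursor, pageSize);
            if (page.isEmpty()) {
                return seen;
            }
            seen.addAll(page);
            cursor = page.get(page.size() - 1);
        }
    }
}
```

Because the cursor is just a value held by the caller, slow processing between pages does not require anything to be kept open on the server side, which is the property the iterators rely on.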
Since we changed the memory estimates for data frame analytics jobs from worst case to a realistic case, the strict less-than assertion in the test no longer holds. I replaced it with a less-than-or-equal assertion.
Backport of #57882
When Joni, the regex engine that powers grok, emits a warning, it
does so by default to System.err. System.err logs are all bucketed
together in the server log at WARN level. When Joni emits a warning,
it can be extremely verbose, logging a message for each execution
against that pattern. For an ingest node, that means a message for
every document that is run through Grok. Fortunately, Joni provides a
callback hook to push these warnings to a custom location.
This commit implements Joni's callback hook to push the Joni warning
to the Elasticsearch server logger (logger.org.elasticsearch.ingest.common.GrokProcessor)
at debug level. Generally these warnings indicate a possible issue with
the regular expression, and upon creation the Grok processor will
do a "test run" of the expression and log the result (if any) at WARN
level. This WARN level log should only occur on pipeline creation, which
is a much lower frequency than every document.
Additionally, the documentation is updated with instructions for how
to set the logger to debug level.
This adds a new plugin-level circuit breaker for the ML plugin.
`model_inference` is the circuit breaker's qualified name.
Right now it simply adds to the breaker when a model is loaded (possibly tripping the breaker) and removes from the breaker when the model is unloaded.
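A minimal sketch of that accounting, assuming a simple byte counter with a fixed limit. The class and method names are hypothetical, and this does not use Elasticsearch's circuit breaker classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the add-on-load / remove-on-unload accounting.
final class ModelMemoryBreakerSketch {
    private final long limitBytes;
    private final Map<String, Long> loadedModels = new HashMap<>();
    private long usedBytes;

    ModelMemoryBreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Accounts for a model being loaded; trips (throws) if the limit would be exceeded. */
    synchronized void onModelLoaded(String modelId, long modelBytes) {
        if (loadedModels.containsKey(modelId)) {
            return; // already accounted for
        }
        if (usedBytes + modelBytes > limitBytes) {
            throw new IllegalStateException(
                "loading [" + modelId + "] would use " + (usedBytes + modelBytes)
                    + " bytes, above the limit of " + limitBytes);
        }
        usedBytes += modelBytes;
        loadedModels.put(modelId, modelBytes);
    }

    /** Releases the bytes that were accounted for the model, if it was loaded. */
    synchronized void onModelUnloaded(String modelId) {
        Long modelBytes = loadedModels.remove(modelId);
        if (modelBytes != null) {
            usedBytes -= modelBytes;
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

A rejected load leaves the counter untouched, so unloading another model afterwards frees enough room for a retry.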
Previously, to determine whether a field is a meta-field, the static
MapperService method isMetadataField was used. This method relied on an
outdated static list of meta-fields.
This PR instead changes this method to an instance method that
is also aware of meta-fields in all registered plugins.
Related #38373, #41656
Closes #24422
Deleting expired data can take a long time leading to timeouts if there
are many jobs. Often the problem is due to a few large jobs which
prevent the regular maintenance of the remaining jobs. This change adds
a job_id parameter to the delete expired data endpoint to help clean up
those problematic jobs.
This PR adds the initial Java side changes to enable
use of the per-partition categorization functionality
added in elastic/ml-cpp#1293.
There will be a followup change to complete the work,
as there cannot be any end-to-end integration tests
until elastic/ml-cpp#1293 is merged, and also
elastic/ml-cpp#1293 does not implement some of the
more peripheral functionality, like stop_on_warn and
per-partition stats documents.
The changes so far cover REST APIs, results object
formats, HLRC and docs.
Backport of #57683
This is a major refactor of the underlying inference logic.
The main change is that we now separate the model configuration from
the inference interfaces.
This has the following benefits:
- we can store extra things with the model that are not
necessary for inference (e.g. tree node split information gain)
- we can optimize inference separately from model serialization and storage
- the user is oblivious to the optimizations (other than seeing the benefits)
A major part of this commit is removing all inference related methods from the
trained model configurations (ensemble, tree, etc.) and moving them to a new class.
This new class satisfies a new interface that is ONLY for inference.
The optimizations applied currently are:
- feature maps are flattened once
- feature extraction only happens once at the highest level
(improves inference + feature importance throughput)
- only what we need for inference + feature importance is stored on heap
When we force delete a DF analytics job, we currently first force
stop it and then we proceed with deleting the job config.
This may result in logging errors if the job config is deleted
before it is retrieved while the job is starting.
Instead of force stopping the job, it would make more sense to
try to stop the job gracefully first. So we now try that out first.
If normal stop fails, then we resort to force stopping the job to
ensure we can go through with the delete.
In addition, this commit introduces `timeout` for the delete action
and makes use of it in the child requests.
Backport of #57680
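The stop-then-delete order described above can be sketched as follows. The `JobControl` interface and method names are hypothetical stand-ins for the real transport actions, and failures are modelled as unchecked exceptions for brevity.

```java
import java.time.Duration;

// Hypothetical illustration of "graceful stop first, force stop as a fallback".
final class ForceDeleteSketch {
    private ForceDeleteSketch() {}

    interface JobControl {
        void stop(boolean force, Duration timeout);

        void deleteConfig(Duration timeout);
    }

    /** Returns true if the force stop fallback was needed before deleting. */
    static boolean stopThenDelete(JobControl job, Duration timeout) {
        boolean forced = false;
        try {
            job.stop(false, timeout); // try a normal stop first
        } catch (RuntimeException e) {
            job.stop(true, timeout); // fall back so the delete can proceed
            forced = true;
        }
        job.deleteConfig(timeout); // the same timeout is passed to the child request
        return forced;
    }
}
```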
In #55592 and #55416, we deprecated the settings for enabling and disabling
basic license features and turned those settings into no-ops. Since doing so,
we've had feedback that this change may not give users enough time to cleanly
switch from non-ILM index management tools to ILM. If two index managers
operate simultaneously, results could be strange and difficult to
reconstruct. We don't know of any cases where SLM will cause a problem, but we
are restoring that setting as well, to be on the safe side.
This PR is not a strict commit reversion. First, we are keeping the new
xpack.watcher.use_ilm_index_management setting, introduced when
xpack.ilm.enabled was made a no-op, so that users can begin migrating to using
it. Second, the SLM setting was modified in the same commit as a group of other
settings, so I have taken just the changes relating to SLM.
* [ML] mark forecasts for force closed/failed jobs as failed (#57143)
Forecasts that are still running should be marked as failed/finished in the following scenarios:
- Job is force closed
- Job is re-assigned to another node.
Forecasts are not "resilient". Their execution does not continue after a node failure. Consequently, forecasts marked as STARTED or SCHEDULED should be flagged as failed. These forecasts can then be deleted.
Additionally, force closing a job kills the native task directly. This means that if a forecast was running, it is not allowed to complete and could still have the status of `STARTED` in the index.
relates to https://github.com/elastic/elasticsearch/issues/56419
* [ML] adds new for_export flag to GET _ml/inference API (#57351)
Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API.
This flag is useful for moving models between clusters.
This adds a max_model_memory setting to forecast requests.
This setting takes a string value formatted as a byte size (e.g. "50mb", "150mb").
The default value is `20mb`.
There is a HARD limit at `500mb` which will throw an error if used.
If the limit is larger than 40% of the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited.
related native change: https://github.com/elastic/ml-cpp/pull/1238
closes: https://github.com/elastic/elasticsearch/issues/56420
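A sketch of how these rules could combine, working in bytes. The class and method names are hypothetical. Two details are assumptions for the example: a request at or above 500mb is rejected, and "strictly lower" is implemented as one byte below the 40% mark.

```java
// Hypothetical illustration of the forecast memory limit rules.
final class ForecastMemoryLimitSketch {
    private ForecastMemoryLimitSketch() {}

    static final long MB = 1024L * 1024L;
    static final long DEFAULT_BYTES = 20 * MB;
    static final long HARD_LIMIT_BYTES = 500 * MB;

    /**
     * @param requestedBytes     the requested max_model_memory in bytes, or null to use the default
     * @param jobModelLimitBytes the anomaly job's configured model memory limit in bytes
     */
    static long resolve(Long requestedBytes, long jobModelLimitBytes) {
        long limit = (requestedBytes == null) ? DEFAULT_BYTES : requestedBytes;
        if (limit >= HARD_LIMIT_BYTES) {
            // Assumption: the hard limit itself is also rejected.
            throw new IllegalArgumentException("max_model_memory must be below 500mb");
        }
        long fortyPercentOfJobLimit = jobModelLimitBytes * 2 / 5;
        if (limit > fortyPercentOfJobLimit) {
            // Assumption: "strictly lower" means one byte under the 40% mark.
            limit = fortyPercentOfJobLimit - 1;
        }
        return limit;
    }
}
```

With a 100mb job limit, the 20mb default is returned unchanged, while a 50mb request resolves to one byte under 40mb.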
Allows geo fields (`geo_point`, `geo_shape`) to have missing values.
Fixes a bug where such missing values would result in an error.
Closes #57299
Backport of #57300
Since #51888 the ML job stats endpoint has returned entries for
jobs that have a persistent task but not job config. Such
orphaned tasks caused monitoring to fail.
This change ignores any such corrupt jobs for monitoring purposes.
Backport of #57235
If a job is NOT open, its forecasts should be deletable regardless of their state.
This also fixes a bug with expanding forecast IDs. We should check for the wildcard `*` and `_all` when expanding the IDs.
closes https://github.com/elastic/elasticsearch/issues/56419
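The ID expansion check can be sketched as below. The class and method names are hypothetical, the expression is assumed to be a comma-separated list, and only the two match-everything tokens named above are handled (no partial wildcards).

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of treating `*` and `_all` as "every forecast".
final class ForecastIdExpanderSketch {
    private ForecastIdExpanderSketch() {}

    static Set<String> expand(String expression, Set<String> existingIds) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String token : expression.split(",")) {
            String id = token.trim();
            if (id.equals("_all") || id.equals("*")) {
                // Either token selects every known forecast ID.
                return new LinkedHashSet<>(existingIds);
            }
            if (!id.isEmpty()) {
                expanded.add(id);
            }
        }
        return expanded;
    }
}
```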
Fix delete_expired_data/nightly maintenance when
many model snapshots need deleting (#57041)
The queries performed by the expired data removers pull back entire
documents when only a few fields are required. For ModelSnapshots in
particular this is a problem as they contain quantiles which may be
100s of KB and the search size is set to 10,000.
This change makes the search more efficient by only requesting the
fields needed to work out which expired data should be deleted.
Field mapping detection is done via grok patterns.
This commit adds well-known text (WKT) formatted geometry detection.
If everything is a `POINT`, then a `geo_point` mapping is preferred.
Otherwise, if all the fields are WKT geometries a `geo_shape` mapping is preferred.
This does **NOT** detect other types of formatted geometries (geohash, comma delimited points, etc.)
closes https://github.com/elastic/elasticsearch/issues/56967
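The preference order can be sketched with plain regular expressions. This is only an illustration: the actual detection uses grok patterns, and the class name, the patterns, and the list of geometry keywords (uppercase, standard WKT types only) are assumptions.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical illustration of the geo_point / geo_shape preference.
final class WktMappingSketch {
    private WktMappingSketch() {}

    // Simplified patterns: a geometry keyword followed by a parenthesised body.
    private static final Pattern WKT_POINT = Pattern.compile("\\s*POINT\\s*\\(.*\\)\\s*");
    private static final Pattern WKT_GEOMETRY = Pattern.compile(
        "\\s*(?:POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON|GEOMETRYCOLLECTION)"
            + "\\s*\\(.*\\)\\s*");

    /** Returns the preferred mapping type, or empty if the values are not all WKT. */
    static Optional<String> guessMapping(List<String> sampleValues) {
        if (sampleValues.isEmpty()) {
            return Optional.empty();
        }
        if (sampleValues.stream().allMatch(v -> WKT_POINT.matcher(v).matches())) {
            return Optional.of("geo_point");
        }
        if (sampleValues.stream().allMatch(v -> WKT_GEOMETRY.matcher(v).matches())) {
            return Optional.of("geo_shape");
        }
        return Optional.empty();
    }
}
```

A comma-delimited point such as `41.12,-71.34` matches neither pattern, which mirrors the note above that other geometry formats are not detected.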
Merging logic is currently split between FieldMapper, with its merge() method, and
MappedFieldType, which checks for merging compatibility. The compatibility checks
are called from a third class, MappingMergeValidator. This makes it difficult to reason
about what is or is not compatible in updates, and even what is in fact updateable - we
have a number of tests that check compatibility on changes in mapping configuration
that are not in fact possible.
This commit refactors the compatibility logic so that it all sits on FieldMapper, and
makes it called at merge time. It adds a new FieldMapperTestCase base class that
FieldMapper tests can extend, and moves the compatibility testing machinery from
FieldTypeTestCase to here.
Relates to #56814
Throttling nightly cleanup as much as we do has been overcautious.
Nightly cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.
Additionally, the API now accepts `requests_per_second` and `timeout`,
so users calling the API directly can set the throttling.
This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
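A rough sketch of the scaling rule follows. The linear formula, the per-node rate parameter, and the use of `Float.POSITIVE_INFINITY` to mean "unthrottled" are all assumptions for illustration; the commit only states that the rate scales with the number of data nodes and that throttling stops above 5 nodes.

```java
// Hypothetical illustration of node-count based throttling.
final class NightlyThrottleSketch {
    private NightlyThrottleSketch() {}

    static final int UNTHROTTLED_ABOVE_DATA_NODES = 5;

    /**
     * @param dataNodes   number of data nodes in the cluster
     * @param perNodeRate hypothetical requests per second allowed per data node
     */
    static float requestsPerSecond(int dataNodes, float perNodeRate) {
        if (dataNodes > UNTHROTTLED_ABOVE_DATA_NODES) {
            return Float.POSITIVE_INFINITY; // sketch convention for "no throttling"
        }
        // Assumption: the rate grows linearly with the number of data nodes.
        return Math.max(1, dataNodes) * perNodeRate;
    }
}
```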
In DF analytics classification, it is possible to use no samples
of a class if its cardinality is too low.
This commit fixes this by ensuring the target sample count can never be zero.
Backport of #56783
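The fix amounts to clamping the per-class target at one. A minimal sketch, with hypothetical names, assuming the target is the training fraction times the class cardinality and that the cardinality is at least 1:

```java
// Hypothetical illustration of a per-class target that can never be zero.
final class TargetSampleCountSketch {
    private TargetSampleCountSketch() {}

    static long targetSampleCount(long classCardinality, double trainingFraction) {
        // Without the clamp, 0.1 * 3 would truncate to 0 and the class would get no samples.
        return (long) Math.max(1.0, trainingFraction * classCardinality);
    }
}
```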
This is a followup to #56632. Tests that had to be changed
to mock the C++ log handler more accurately need to be more
careful about when that stream ends, as ending of that
stream is used to detect crashes in the production system.
Fixes #56796
Adds the conflicting types and an example of an index which specifies
them in order to make it easier for the user to understand the conflict.
Backport of #56700
Prior to this change the named pipes that connect the ML C++
processes to the Elasticsearch JVM were all opened before any
of them were read from or written to.
This created a problem, where if the C++ process logged more
messages between opening the log pipe and opening the last
pipe to be connected than there was space for in the named
pipe's buffer then the C++ process would block. This would
mean it never got as far as opening the last named pipe, so
the JVM would never get as far as reading from the log pipe,
hence a deadlock.
This change alters the connection order so that the JVM
starts reading from the logging pipe immediately after opening
it so that if the C++ process logs messages while opening the
other named pipes they are captured in a timely manner and
there is no danger of a deadlock.
Backport of #56632
Two spots that allow for some optimization:
* We are often creating a composite reference of just a single item in
the transport layer, so this is now special-cased via a static constructor to make sure we never do that
* Also removed the pointless case of an empty composite bytes ref
* `ByteBufferReference` is practically always created from a heap buffer these days, so there
is no point in dealing with all the bounds checks and extra references to sliced buffers from that
and we can just use the underlying array directly
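The static-constructor special casing can be sketched as below. The types here (`Ref`, `ArrayRef`, `CompositeRef`) are hypothetical stand-ins, not the actual bytes reference classes.

```java
// Hypothetical illustration of never building an empty or single-item composite.
final class BytesRefsSketch {
    private BytesRefsSketch() {}

    interface Ref {
        int length();
    }

    static final class ArrayRef implements Ref {
        final byte[] bytes;

        ArrayRef(byte[] bytes) {
            this.bytes = bytes;
        }

        public int length() {
            return bytes.length;
        }
    }

    static final class CompositeRef implements Ref {
        final Ref[] parts;

        // Private so that callers must go through of(...).
        private CompositeRef(Ref[] parts) {
            this.parts = parts;
        }

        public int length() {
            int total = 0;
            for (Ref part : parts) {
                total += part.length();
            }
            return total;
        }
    }

    static final Ref EMPTY = new ArrayRef(new byte[0]);

    /** Only wraps the parts in a composite when there are at least two of them. */
    static Ref of(Ref... parts) {
        switch (parts.length) {
            case 0:
                return EMPTY;
            case 1:
                return parts[0];
            default:
                return new CompositeRef(parts.clone());
        }
    }
}
```

Keeping the composite constructor private means the empty and single-item cases cannot be created by accident elsewhere.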
We have been using a zero timeout in the case that DF analytics
is stopped. This may cause a timeout when we cancel, for example,
the reindex task.
This commit fixes this by using the default timeout instead.
Backport of #56423