Tasks intending to use a particular java home provided by JAVA<N>_HOME
use the getJavaHome method, which verifies the given java home is
available, or will be if the task will run. However, the verification
logic was broken, in addition to unnecessarily delaying retrieving the
java home until runtime. This commit fixes the verification logic to run
at either config time, delaying verification, or at runtime which
immediately checks if java home is available.
closes#49153
This commit wraps the calls to retrieve the current step in a try/catch
so that the exception does not bubble up. Instead, step info is added
containing the exception to the existing step.
Semi-related to #49128
(cherry picked from commit 72530f8a7f40ae1fca3704effb38cf92daf29057)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This API call in most implementations is fairly IO heavy and slow
so it is more natural to be async in the first place.
Concretely though, this change is a prerequisite of #49060 since
determining the repository generation from the cluster state
introduces situations where this call would have to wait for other
operations to finish. Doing so in a blocking manner would break
`SnapshotResiliencyTests` and waste a thread.
Also, this sets up the possibility to in the future make use of async IO
where provided by the underlying Repository implementation.
In a follow-up `SnapshotsService#getRepositoryData` will be made async
as well (did not do it here, since it's another huge change to do so).
Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
Previously, DATEDIFF for minutes and hours was doing a
rounding calculation using all the time fields (secs, msecs/micros/nanos).
Instead it should first truncate the 2 dates to the respective field (mins or hours)
zeroing out all the more detailed time fields and then make the subtraction.
(cherry picked from commit 124cd18e20429e19d52fd8dc383827ea5132d428)
The following edge cases were fixed:
1. A request to force-stop a stopping datafeed is no longer
ignored. Force-stop is an important recovery mechanism
if normal stop doesn't work for some reason, and needs
to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
the cluster during a normal stop then the stop request is
retried (and will likely succeed on this retry by simply
cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
the same datafeed we no longer fail the one that is
processed second. The previous behaviour was wrong as
stopping a stopped datafeed is not an error, so stopping
a datafeed twice simultaneously should not be either.
Backport of #49191
We shouldn't be throwing `RepositoryException` when the repository
wasn't concurrently modified in an unexpected fashion (i.e. on the blob/file level).
When we know that the known repo gen moved higher in terms of the generation
tracked in master memory we should throw the concurrent snapshot exception.
This change makes concurrent snapshot create and delete always throw the same exception,
prevents unnecessary listings when the generation is known to be off and prevents
future test failures in SLM tests that assume the concurrent snapshot exception
is always thrown here.
Without this change, the newly added test randomly fails the `instanceOf` assertion
by running into a `RepositoryException`.
Reindex, update by query and delete by query would silently disregard
RED/unavailable shards, thus not copying, updating or deleting matching
data in those shards. Now use `allow_partial_search_results=false` to
ensure these operations fail if the search crosses an unavailable chard.
Added the option to explicitly specify `allow_partial_search_results=true`
for reindex only (seemed too strange for update/delete by query).
Relates #45739 and #42612
* [ML] ML Model Inference Ingest Processor (#49052)
* [ML][Inference] adds lazy model loader and inference (#47410)
This adds a couple of things:
- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:
* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* fixing version of indexed docs for model inference
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.
This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.
This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.
Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
The `string` type (with option `analyzed`) has been replaced by `text` after `6.0`,
also the `annonated_text` field do not support doc values and should be mentioned.
* Better Exceptions on Concurrent Snapshot Operations
It is somewhat tricky to debug test failures from concurrent operations
without having the exact knowledge of what ran concurrently so I added
it to these exceptions in all spots.
The network disruption was acting on node ids and node names
which made reconnects not work. Moved all usages to node names
to fix this. Since the map of all nodes in the test is indexed
by name this was easier to work with.
Similarly to what has been done for Azure (#48636) and GCS (#48762),
this committ removes the existing Ant fixture that emulates a S3 storage
service in favor of multiple docker-compose based fixtures.
The goals here are multiple: be able to reuse a s3-fixture outside of the
repository-s3 plugin; allow parallel execution of integration tests; removes
the existing AmazonS3Fixture that has evolved in a weird beast in
dedicated, more maintainable fixtures.
The server side logic that emulates S3 mostly comes from the latest
HttpHandler made for S3 blob store repository tests, with additional
features extracted from the (now removed) AmazonS3Fixture:
authentication checks, session token checks and improved response
errors. Chunked upload request support for S3 object has been added
too.
The server side logic of all tests now reside in a single S3HttpHandler class.
Whereas AmazonS3Fixture contained logic for basic tests, session token
tests, EC2 tests or ECS tests, the S3 fixtures are now dedicated to each
kind of test. Fixtures are inheriting from each other, making things easier
to maintain.
improve error handling for script errors, treating it as irrecoverable errors which puts the task
immediately into failed state, also improves the error extraction to properly report the script
error.
fixes#48467
AutoFollowIT relies on assertBusy() calls to wait for a given number of
leader indices to be created but this is prone to failures on CI. Instead,
we should use latches to indicate when auto-follow patterns must be
paused and resumed.
This commit fixes a NPE problem as reported in #49150.
But this problem uncovered that we never added proper handling
of state for data frame analytics tasks.
In this commit we improve the `MlTasks.getDataFrameAnalyticsState`
method to handle null tasks and state tasks properly.
Closes#49150
Backport of #49186
This PR adds build configuration to use the `jdk-download` plugin with
unit tests when no runtime java is configured externally.
It's a first part in a longer chain of changes described in #40531.
Backport of #47573.
Closes#43603. Allow environment variables to be passed to ES in a Docker
container via a file, by setting an environment variable with the `_FILE`
suffix that points to the file with the intended value of the env var.
Backport of #49136.
Prior to 3a3e5f6, there was no default indent configured in
`.editorconfig`. This changed to 4 spaces when we configured an explicit
indent of 2 for Gradle files. However, this change meant that YAML files
then had 4-space indents, which is valid, but the repo's YAML files
typically use 2-space indents.
Remove the default indent again, and instead explicitly set an indent size
for a variety of file types.
This commit upgrades the JDK that is bundled with Windows from 13+33 to
13.0.1+9. Our other platforms have previously been upgraded but Windows
was delayed because the artifacts were not available at the time that we
made the previous upgrade.
Relates #48587
If CCR encounters a rejected execution exception, today we treat this as
fatal. This is not though, as the stuffed queue could drain. Requiring
an administrator to manually restart the follow tasks that faced such an
exception is a burden. This commit addresses this by making CCR
auto-retry on rejected execution exceptions.
We can't guarantee expected request failures if search request is across
many indexes, as if expected shards fail, some indexes may return 200.
closes#47743