Rollup jobs should be stopped + deleted before the indices are removed.
It's possible for an active rollup job to issue a bulk request, the test
ends and the cleanup code deletes all indices. The in-flight bulk
request will then stall + error because the index no-longer exists...
but this process might take longer than the StopRollup timeout.
Which means the test fails, and often fails several other tests since
the job is still active (e.g. other tests cannot create the same-named
job, or fail to stop the job in their cleanup because it's still stalled).
This tends to knock over several tests before the bulk finally times
out and the job shuts down.
Instead, we need to simply stop jobs first. Inflight bulks will resolve
quickly, and we can carry on with deleting indices after the jobs are
confirmed inactive.
stop-job.asciidoc tended to trigger this issue because it executed
an async stop API and then exited, which setup the above situation. In
can and did happen with other tests though. As an extra precaution,
the doc test was modified to substitute in wait_for_completion
to help head off these issues too.
This change aims to fix failures in the session factory load balancing
tests that mock failure scenarios. For these tests, we randomly shut
down ldap servers and bind a client socket to the port they were
listening on. Unfortunately, we would occasionally encounter failures
in these tests where a socket was already in use and/or the port
we expected to connect to was wrong and in fact was to one of the ldap
instances that should have been shut down.
The failures are caused by the behavior of certain operating systems
when it comes to binding ports and wildcard addresses. It is possible
for a separate application to be bound to a wildcard address and still
allow our code to bind to that port on a specific address. So when we
close the server socket and open the client socket, we are still able
to establish a connection since the other application is already
listening on that port on a wildcard address. Another variant is that
the os will allow a wildcard bind of a server socket when there is
already an application listening on that port for a specific address.
In order to do our best to prevent failures in these scenarios, this
change does the following:
1. Binds a client socket to all addresses in an awaitBusy
2. Adds assumption that we could bind all valid addresses
3. In the case that we still establish a connection to an address that
we should not be able to, try to bind and expect a failure of not
being connected
Closes#32190
Fixes an error in median calculation in
MedianAbsoluteDeviationAggregatorTests for odd number of sample points,
which causes some rare test failures.
Fixes#38937
This commit fixes a broken CCR retention lease unfollow test. The
problem with the test is that the random subset of shards that we picked
to disrupt would not necessarily overlap with the actual shards in
use. We could take a non-empty subset of [0, 3] (e.g., { 2 }) when the
only shard IDs in use were [0, 1]. This commit fixes this by taking into
account the number of shards in use in the test.
With this change, we also take measure to ensure that a successful
branch is tested more frequently than would otherwise be the case. On
that branch, we want to sometimes pretend that the retention lease is
already removed. The randomness here was also sometimes selecting a
subset of shards that did not overlap with the shards actually in use
during the test. While this does not break the test, it is confusing and
reduces the amount of coverage of that branch.
Relates #39185
In the ClusterConfiguration class of the build source, there is an Ant waitfor block
that runs to ensure that the seed node's transport ports file is created before
trying to read it. If the wait times out, the file read fails with not much helpful info.
This just adds a timeout property to the waitfor block and throws a descriptive
exception instead.
Backports #37574
In most of the places we avoid creating the `.security` index (or updating the mapping)
for read/search operations. This is more of a nit for the case of the getRole call,
that fixes a possible mapping update during a get role, and removes a dead if branch
about creating the `.security` index.
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This test had been disabled because of test failures, but it only affected the
6.x branch. The fix for 6.x is at #39054. On master/7.x/7.0 we can reenable the
test as-is.
This defaults to "true" (current behavior) and will throw an exception
if there is a property that cannot be recognized. If "false", it will
ignore anything unrecognizable.
(cherry picked from commit 38fbf9792bcf4fe66bb3f17589e5fe6d29748d07)
Today's calculation of the maximum number of shards per attribute is rather
convoluted. This commit clarifies that it returns
ceil(shardCount/numberOfAttributes).
Blob store compression was not enabled for some of the files in
snapshots due to constructor accessing sub-class fields. Fixed to
instead accept compress field as constructor param. Also fixed chunk
size validation to work.
Deprecated repositories.fs.compress setting as well to be able to unify
in a future commit.
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.
This changes replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from mutiple threads.
Backport of: #39092
This ensures we do not attempt to parse non english locale dates
in FIPS mode. The error, originally assumed to affect only Joda,
affects Java time in the same manner and manifests only with the
version of BouncyCastle FIPS certified provider we use in tests.
The upstream issue https://github.com/bcgit/bc-java/issues/405
indicates that the behavior is resolved in later versions of the
BouncyCastle library and should be tested again when the new
versions become FIPS 140 certified
the RethrottleTests assumed that tasks that were
unprepared to rethrottle would bubble up into the
Rethrottle response as an ElasticsearchException
wrapping an IllegalArgumentException. This seems to
have changed to potentially involve further levels of
wrapping.
This change makes the retry logic more resilient to
arbitrary nesting of the underlying IllegalArgumentException
Sometimes we turn objects into strings for logging or debugging using
`toString()`, but the default implementation is often unhelpful. This change
improves on this in two places I ran into recently.
The ML memory tracker does searches against ML results
and config indices. These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.
This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete. Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.
The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping stopping the node.
Fixes#37117
This test had a bug. We attempt to allow only the primary to be
allocated, to force all replicas to recovery from the primary after we
had set the state of the retention leases on the primary. However, in
building the index settings, we were overwriting the settings that
exclude the replicas from being allocated. This means that some of the
replicas would end up assigned and rather than receive retention leases
during recovery, they would be part of the replication group receiving
retention leases as they are manipulated. Since retention lease renewals
are only synced periodically, this means that the replica could be
lagging a little behind in some cases leading to an assertion tripping
in the test. This commit addresses this by ensuring that the replicas
are indeed not allocated until after the retention leases are done being
manipulated on the replica. We did this by not overwriting the exclude
settings.
Closes#39105
These two changes are interlinked.
Before this change unsetting ML upgrade mode would wait for all
datafeeds to be assigned and not waiting for their corresponding
jobs to initialise. However, this could be inappropriate, if
there was a reason other that upgrade mode why one job was unable
to be assigned or slow to start up. Unsetting of upgrade mode
would hang in this case.
This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled. Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.
In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned. There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
The parseMillis method was able to work on formats without timezones by
falling back to UTC. The Date Formatter interface did not support this, as
the calling code was using the `Instant.from` java time API.
This switches over to an internal method which adds UTC as a timezone.
Closes#39067
As the Dockerfile evolved we don't need anymore certain commands like
`unzip`, `which` and `wget` allowing us to slightly shrink the images.
Backport of: #39040
- Notes that you can adjust the `s3.client.*.endpoint` setting to point to a
repository held on an S3-compatible service.
- Notes that the default is `s3.amazonaws.com` and not to auto-detect the
endpoint.
- Reformats docs to width.
Closes#35925