This commit makes `TransformIntegrationTests` into a standard integration test, as
opposed to using `TimeWarp`, which registers the mock component
`ScheduleEngineTriggerMock` to trigger watches.
The simplification may help with flakiness we've observed `TimeWarp, as in #37882.
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.
Closes#38875
Currently remote compression and ping schedule settings are dynamic.
However, we do not listen for changes. This commit adds listeners for
changes to those two settings. Additionally, when those settings change
we now close existing connections and open new ones with the settings
applied.
Fixes#37201.
Sleeps in tests smell funny, and we try to avoid them to the extent
possible. We are using a small one in a CCR test. This commit clarifies
the purpose of that sleep by adding a comment explaining it. We also
removed a hard-coded value from the test, that if we ever modified the
value higher up where it was set, we could end up forgetting to change
the value here. Now we ensure that these would move in lock step if we
ever maintain them later.
We have some CCR tests where we use mock transport send rules to control
the behavior that we desire in these tests. Namely, we want to simulate
an exception being thrown on the leader side, or a variety of other
situations. These send rules were put in place between the data nodes on
each side. However, it might not be the case that these requests are
being sent between data nodes. For example, a request that is handled on
a non-data master node would not be sent from a data node. And it might
not be the case that the request is sent to a data node, as it could be
proxied through a non-data coordinating node. This commit addresses this
by putting these send rules in places between all nodes on each side.
Closes#39011Closes#39201
`ReadOnlyEngine` never recovers operations from translog and never
updates translog information in the index shard's recovery state, even
though the recovery state goes through the `TRANSLOG` stage during
the recovery. It means that recovery information for frozen shards indicates
an unkown number of recovered translog ops in the Recovery APIs
(translog_ops: `-1` and translog_ops_percent: `-1.0%`) and this is confusing.
This commit changes the `recoverFromTranslog()` method in `ReadOnlyEngine`
so that it always recover from an empty translog snapshot, allowing the recovery
state translog information to be correctly updated.
Related to #33888
Initially in #38910, ShardFollowTask was reusing ImmutableFollowParameters'
serialization logic. After merging, bwc tests failed sometimes and
the binary serialization that ShardFollowTask was originally was using
was added back. ImmutableFollowParameters is using optional fields (optional vint)
while ShardFollowTask was not (vint).
Today we always refresh when looking up the primary term in
FollowingEngine. This is not necessary for we can simply
return none for operations before the global checkpoint.
The follower won't always have the same history as the leader for its
soft-deletes retention can be different. However, if some operation
exists on the history of the follower, then the same operation must
exist on the leader. This change relaxes the history check in
ShardFollowTaskReplicationTests.
Closes#39093
This commit introduces the retention leases to ESIndexLevelReplicationTestCase,
then adds some tests verifying that the retention leases replication works
correctly in spite of the presence of the primary failover or out of order
delivery of retention leases sync requests.
Relates #37165
This change aims to fix failures in the session factory load balancing
tests that mock failure scenarios. For these tests, we randomly shut
down ldap servers and bind a client socket to the port they were
listening on. Unfortunately, we would occasionally encounter failures
in these tests where a socket was already in use and/or the port
we expected to connect to was wrong and in fact was to one of the ldap
instances that should have been shut down.
The failures are caused by the behavior of certain operating systems
when it comes to binding ports and wildcard addresses. It is possible
for a separate application to be bound to a wildcard address and still
allow our code to bind to that port on a specific address. So when we
close the server socket and open the client socket, we are still able
to establish a connection since the other application is already
listening on that port on a wildcard address. Another variant is that
the os will allow a wildcard bind of a server socket when there is
already an application listening on that port for a specific address.
In order to do our best to prevent failures in these scenarios, this
change does the following:
1. Binds a client socket to all addresses in an awaitBusy
2. Adds assumption that we could bind all valid addresses
3. In the case that we still establish a connection to an address that
we should not be able to, try to bind and expect a failure of not
being connected
Closes#32190
This commit fixes a broken CCR retention lease unfollow test. The
problem with the test is that the random subset of shards that we picked
to disrupt would not necessarily overlap with the actual shards in
use. We could take a non-empty subset of [0, 3] (e.g., { 2 }) when the
only shard IDs in use were [0, 1]. This commit fixes this by taking into
account the number of shards in use in the test.
With this change, we also take measure to ensure that a successful
branch is tested more frequently than would otherwise be the case. On
that branch, we want to sometimes pretend that the retention lease is
already removed. The randomness here was also sometimes selecting a
subset of shards that did not overlap with the shards actually in use
during the test. While this does not break the test, it is confusing and
reduces the amount of coverage of that branch.
Relates #39185
In most of the places we avoid creating the `.security` index (or updating the mapping)
for read/search operations. This is more of a nit for the case of the getRole call,
that fixes a possible mapping update during a get role, and removes a dead if branch
about creating the `.security` index.
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This defaults to "true" (current behavior) and will throw an exception
if there is a property that cannot be recognized. If "false", it will
ignore anything unrecognizable.
(cherry picked from commit 38fbf9792bcf4fe66bb3f17589e5fe6d29748d07)
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.
This changes replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from mutiple threads.
Backport of: #39092
The ML memory tracker does searches against ML results
and config indices. These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.
This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete. Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.
The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping stopping the node.
Fixes#37117
These two changes are interlinked.
Before this change unsetting ML upgrade mode would wait for all
datafeeds to be assigned and not waiting for their corresponding
jobs to initialise. However, this could be inappropriate, if
there was a reason other that upgrade mode why one job was unable
to be assigned or slow to start up. Unsetting of upgrade mode
would hang in this case.
This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled. Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.
In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned. There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
Prior to this commit, if during fetch leader / follower GCP
a fatal error occurred, then the shard follow task was removed.
This is unexpected, because if such an error occurs during the lifetime of shard follow task then replication is stopped and the fatal error flag is set. This allows the ccr stats api to report the fatal exception that has occurred (instead of the user grepping through the elasticsearch logs).
This issue was found by a rare failure of the `FollowStatsIT#testFollowStatsApiIncludeShardFollowStatsWithRemovedFollowerIndex` test.
Closes#38779
* Disable specific locales for tests in fips mode
The Bouncy Castle FIPS provider that we use for running our tests
in fips mode has an issue with locale sensitive handling of Dates as
described in https://github.com/bcgit/bc-java/issues/405
This causes certificate validation to fail if any given test that
includes some form of certificate validation happens to run in one
of the locales. This manifested earlier in #33081 which was
handled insufficiently in #33299
This change ensures that the problematic 3 locales
* th-TH
* ja-JP-u-ca-japanese-x-lvariant-JP
* th-TH-u-nu-thai-x-lvariant-TH
will not be used when running our tests in a FIPS 140 JVM. It also
reverts #33299
The .ml-annotations index is created asynchronously when
some other ML index exists. This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.
This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:
1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup
This means that by the time the main post-test index
cleanup code runs:
1. The only ML index it has to delete will be the
.ml-annotations index
2. No other ML indices will exist that could trigger
recreation of the .ml-annotations index
Fixes#38952
* Fix#38623 remove xpack namespace REST API
Except for xpack.usage and xpack.info API's, this moves the last remaining API's out of the xpack namespace
* rename xpack api's inside inside the files as well
* updated yaml tests references to xpack namespaces api's
* update callsApi calls in the IT subclasses
* make sure docs testing does not use xpack namespaced api's
* fix leftover xpack namespaced method names in docs/build.gradle
* found another leftover reference
(cherry picked from commit ccb5d934363c37506b76119ac050a254fa80b5e7)
The data frame plugin allows users to create feature indexes by pivoting a source index. In a
nutshell this can be understood as reindex supporting aggregations or similar to the so called entity
centric indexing.
Full history is provided in: feature/data-frame-transforms
* During fetching remote mapping if remote client is missing then
`NoSuchRemoteClusterException` was not handled.
* When adding remote connection, check that it is really connected
before continue-ing to run the tests.
Relates to #38695
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
This commit adds a `ListenerTimeouts` class that will wrap a
`ActionListener` in a listener with a timeout scheduled on the generic
thread pool. If the timeout expires before the listener is completed,
`onFailure` will be called with an `ElasticsearchTimeoutException`.
Timeouts for the get ccr file chunk action are implemented using this
functionality. Additionally, this commit attempts to fix#38027 by also
blocking proxied get ccr file chunk actions. This test being un-muted is
useful to verify the timeout functionality.
Today when processing an operation on a replica engine (or the
following engine), we first add it to Lucene, then add it to translog,
then finally marks its seq_no as completed. If a flush occurs after step1,
but before step-3, the max_seq_no in the commit's user_data will be
smaller than the seq_no of some documents in the Lucene commit.
We verify seq_no_stats is aligned between copies at the end of some
disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due
to a lagged global checkpoint on replicas. The global checkpoint on
replicas is lagged because we sync the global checkpoint 30 seconds (by
default) after the last replication operation. This change reduces the
global checkpoint sync-internal to 1s in the disruption tests.
Closes#38318Closes#36789
The CCR REST tests that rely on these assertions are flaky. They are
flaky since the introduction of recovery from the remote.
The underlying problem is this: these tests are making assertions about
the number of operations read by the shard following task. However, with
recovery from remote, we no longer have guarantees that the assumptions
these tests were relying on hold. Namely, these tests were assuming that
the only way that a document could land in the follower index is via the
shard following task. With recovery from remote, there is another way,
which is via the files that are copied over during the recovery
phase. Most of the time this will not be a problem because with the
small number of documents that we are indexing in these tests, it is
usally not the case that a flush would occur and so there would not be
any documents in the files copied over. However, a flush can occur any
time at which point all of the indexed documents could end up in a safe
commit and copied over during recovery from remote. This commit modifies
these assertions to ones that are not prone to this issue, yet still
validate the health of the follower shard.
Few tests failed intermittently and most of the
times due to invalidated or expired keys that were
deleted were still reported in search results.
This commit removes the test and adds enhancements
to other tests testing different scenario's.
When ExpiredApiKeysRemover is triggered, the tests
did not await its termination thereby sometimes
the results would be wrong for a search operation.
DELETE_INTERVAL setting has been further reduced to
100ms so we can trigger ExpiredApiKeysRemover faster.
Closes#38408