Prior to this commit, if during fetch leader / follower GCP
a fatal error occurred, then the shard follow task was removed.
This is unexpected, because if such an error occurs during the lifetime of shard follow task then replication is stopped and the fatal error flag is set. This allows the ccr stats api to report the fatal exception that has occurred (instead of the user grepping through the elasticsearch logs).
This issue was found by a rare failure of the `FollowStatsIT#testFollowStatsApiIncludeShardFollowStatsWithRemovedFollowerIndex` test.
Closes#38779
* Disable specific locales for tests in fips mode
The Bouncy Castle FIPS provider that we use for running our tests
in fips mode has an issue with locale sensitive handling of Dates as
described in https://github.com/bcgit/bc-java/issues/405
This causes certificate validation to fail if any given test that
includes some form of certificate validation happens to run in one
of the locales. This manifested earlier in #33081 which was
handled insufficiently in #33299
This change ensures that the problematic 3 locales
* th-TH
* ja-JP-u-ca-japanese-x-lvariant-JP
* th-TH-u-nu-thai-x-lvariant-TH
will not be used when running our tests in a FIPS 140 JVM. It also
reverts #33299
The build script file for the `:libs:elasticsearch-ssl-config` and
`:libs:ssl-config-tests` projects was incorrectly named `eclipse.build.gradle`
while the expected name was `eclipse-build.gradle`.
In addition, this also adds a missing snippet in the `build.gradle` conf file,
that fixes the project setup for Eclipse users.
When retention leases fail to sync after an expiration check, we emit a
log message about this. This commit adds the retention leases that
failed to sync.
When the background retention lease sync fires, we check an see if any
retention leases are expired. If any did expire, we execute a full
retention lease sync (write action). Since this is happening on a
background thread, we do not block that thread waiting for success (it
will simply try again when the timer elapses). However, we were
swallowing exceptions that indicate failure. This commit addresses that
by logging the failures. Additionally, we add some trace logging to the
execution of syncing retention leases.
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long-shot but if the thread pool was shutdown before entering this
code, we would silently forget about closing server channels so added
assert.
Strengthened the locking to ensure that once we stop the transport, no
new server channels can be made.
Relates to CI failure issue: #37543
The .ml-annotations index is created asynchronously when
some other ML index exists. This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.
This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:
1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup
This means that by the time the main post-test index
cleanup code runs:
1. The only ML index it has to delete will be the
.ml-annotations index
2. No other ML indices will exist that could trigger
recreation of the .ml-annotations index
Fixes#38952
Many of our index components use ref-counting so that in the event that a shard
is closed while there are still ongoing requests, then the index reader and the
store only effectively get closed when ongoing requests have finished. However
we don't apply the same principle to the request and query caches, which might
get closed while there are still in-flight requests.
This commit adds ref-counting to `IndicesService` so that the caches and other
components it maintains only get closed when all shards are effectively closed.
Closes#37117
* Make pullFixture a task dependency of resolveAllDependencies
With this change we will pull the docker test fixtures ( and thus cache
the images ) for older versions too as the `resolveAllDependencies` is
already being called on the bwc checkouts too.
* Fix#38623 remove xpack namespace REST API
Except for xpack.usage and xpack.info API's, this moves the last remaining API's out of the xpack namespace
* rename xpack api's inside inside the files as well
* updated yaml tests references to xpack namespaces api's
* update callsApi calls in the IT subclasses
* make sure docs testing does not use xpack namespaced api's
* fix leftover xpack namespaced method names in docs/build.gradle
* found another leftover reference
(cherry picked from commit ccb5d934363c37506b76119ac050a254fa80b5e7)
The data frame plugin allows users to create feature indexes by pivoting a source index. In a
nutshell this can be understood as reindex supporting aggregations or similar to the so called entity
centric indexing.
Full history is provided in: feature/data-frame-transforms
* During fetching remote mapping if remote client is missing then
`NoSuchRemoteClusterException` was not handled.
* When adding remote connection, check that it is really connected
before continue-ing to run the tests.
Relates to #38695
Follow index in follow cluster that follows an index in the leader cluster and another
follow index in the leader index that follows that index in the follow cluster.
During the upgrade index following is paused and after the upgrade
index following is resumed and then verified index following works as expected.
Relates to #38037
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
This commit adds a `ListenerTimeouts` class that will wrap a
`ActionListener` in a listener with a timeout scheduled on the generic
thread pool. If the timeout expires before the listener is completed,
`onFailure` will be called with an `ElasticsearchTimeoutException`.
Timeouts for the get ccr file chunk action are implemented using this
functionality. Additionally, this commit attempts to fix#38027 by also
blocking proxied get ccr file chunk actions. This test being un-muted is
useful to verify the timeout functionality.
`SearchShardIterator` inherits its `compareTo` implementation from `PlainShardIterator`. That is good in most of the cases, as such comparisons are based on the shard id which is unique, even when searching against indices with same names across multiple clusters (thanks to the index uuid being different). In case though the same cluster is registered multiple times with different aliases, the shard id is exactly the same, hence remote results will be returned before local ones with same shard id objects. That is because remote iterators are added before local ones, and we use a stable sorting method in `GroupShardIterators` constructor.
This PR enhances `compareTo` for `SearchShardIterator` to tie break on cluster alias and introduces consistent `equals` and `hashcode` methods. This allows to remove a TODO in `SearchResponseMerger` which otherwise has to handle this special case specifically. Also, while at it I added missing tests around equals/hashcode and compareTo and expanded existing ones.
Today when processing an operation on a replica engine (or the
following engine), we first add it to Lucene, then add it to translog,
then finally marks its seq_no as completed. If a flush occurs after step1,
but before step-3, the max_seq_no in the commit's user_data will be
smaller than the seq_no of some documents in the Lucene commit.
We verify seq_no_stats is aligned between copies at the end of some
disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due
to a lagged global checkpoint on replicas. The global checkpoint on
replicas is lagged because we sync the global checkpoint 30 seconds (by
default) after the last replication operation. This change reduces the
global checkpoint sync-internal to 1s in the disruption tests.
Closes#38318Closes#36789
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:
1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.
2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.
> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]
1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.
> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]
We need to relax the assertion in this test to cover this situation.
Closes#31629
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread
* Closes#38489
the get-ml-info API documentation tested that the
response show that ML's `upgrade_mode` was false.
For reasons that may be true due to other tests running in
parallel or not cleaning themselves up, this may not be
guaranteed. Since the actual value here is not of importance,
this commit relaxes the requirement that upgrade_mode be
static.
This commit moves validation logic for ensuring our testclusters
configuration doesn't contain unexpected artifacts into the plugin
itself. This change allows us to remove the custom copy task
implementation altogether.
Additionally, the error message has been improved to display component
ids in addition to the artifacts to make it easier to figure out what
actual dependency is at fault.
The CCR REST tests that rely on these assertions are flaky. They are
flaky since the introduction of recovery from the remote.
The underlying problem is this: these tests are making assertions about
the number of operations read by the shard following task. However, with
recovery from remote, we no longer have guarantees that the assumptions
these tests were relying on hold. Namely, these tests were assuming that
the only way that a document could land in the follower index is via the
shard following task. With recovery from remote, there is another way,
which is via the files that are copied over during the recovery
phase. Most of the time this will not be a problem because with the
small number of documents that we are indexing in these tests, it is
usally not the case that a flush would occur and so there would not be
any documents in the files copied over. However, a flush can occur any
time at which point all of the indexed documents could end up in a safe
commit and copied over during recovery from remote. This commit modifies
these assertions to ones that are not prone to this issue, yet still
validate the health of the follower shard.