`Node#close` is pretty hard to rely on today:
- it might swallow exceptions
- it waits for 10 seconds for threads to terminate but doesn't signal anything
if threads are still not terminated after 10 seconds
This commit makes `IOException`s propagated and splits `Node#close` into
`Node#close` and `Node#awaitClose` so that the decision what to do if a node
takes too long to close can be done on top of `Node#close`.
It also adds synchronization to lifecycle transitions to make them atomic. I
don't think it is a source of problems today, but it makes things easier to
reason about.
* Adds Bulk delete API to blob container
* Implement bulk delete API for S3
* Adjust S3Fixture to accept both path styles for bulk deletes since the S3 SDK uses both during our ITs
* Closes#40250
Today a new replica of a closed index does not have a safe commit
invariant when its engine is opened because we won't initialize the
global checkpoint on a recovering replica until the finalize step. With
this change, we can achieve that property by creating a new translog
with the global checkpoint from the primary at the end of phase 1.
A small refactoring that removes the primaryTerm field from ReplicasProxy and
instead passes it directly in to the methods that need it. Relates #40706.
This is a dependency of #39504
Motivation:
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to async, we enable using `DeterministicTaskQueue` based tests to run indexing operations. This was previously impossible since we were blocking on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update will trigger a new task in the `write` queue instead.
This change significantly enhances the amount of coverage we get from `SnapshotResiliencyTests` (and other potential future tests) when it comes to tracking down concurrency issues with distributed state machines.
The logical change is effectively all in `TransportShardBulkAction`, the rest of the changes is then simply mechanically moving the caller code and tests to being async and passing the `ActionListener` down.
Since the move to async would've added more parameters to the `private static` steps in this logic, I decided to inline and dry up (between delete and update) the logic as much as I could instead of passing the listener + wait-consumer down through all of them.
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)
This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.
(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)
* Fix forking JVM runner
* Don't bump shadow plugin version
Many gradle projects specifically use the -try exclude flag, because
there are many cases where auto-closeable resource ignore is never
referenced in body of corresponding try statement. Suppressing this
warning specifically in each case that it happens using
`@SuppressWarnings("try")` would be very verbose.
This change removes `-try` from any gradle project and adds it to the
build plugin. Also this change removes exclude flags from gradle projects
that is already specified in build plugin (for example -deprecation).
Relates to #40366
Today we are running our internal tests with bootstrap.memory_lock
enabled. This is not out default setting, and not the recommended
value. This commit switches to use the default value, which is to not
enable bootstrap.memory_lock.
The hdfs-fixture is actually executed in plugin/repository-hdfs as a
dependency. The fixture is not needed and actually causes a failure
because we have two copies now and both use the same ports.
Replaces the vagrant based kerberos fixtures with docker based test fixtures plugin.
The configuration is now entirely static on the docker side and no longer driven by Gradle,
also two different services are being configured since there are two different consumers of the fixture that can run in parallel and require different configurations.
We discussed recently that the cluster state API should be considered
"internal" and therefore our usual cast-iron stability guarantees do not hold
for this API.
However, there are a good number of REST tests that try to identify the master
node. Today they call `GET /_cluster/state` API and extract the master node ID
from the response. In fact many of these tests just want an arbitary node ID
(or perhaps a data node ID) so an alternative is to call `GET _nodes` or `GET
_nodes/data:true` and obtain a node ID from the keys of the `nodes` map in the
response.
This change adds the ability for YAML-based REST tests to extract an arbitrary
key from a map so that they can obtain a node ID from the nodes info API
instead of using the master node ID from the cluster state API.
Relates #40047.
This commit adds an InboundHandler to handle inbound message processing.
With this commit, this code is moved out of the TcpTransport.
Additionally, finer grained unit tests are added to ensure that the
inbound processing works as expected
Replicated closed indices can't be indexed into or searched, and therefore don't need a shard with
full indexing and search capabilities allocated. We can save on a lot of heap memory for those
indices by not allocating a mapper service and caching infrastructure (which preallocates a constant
amount per instance). Before this change, a 1GB ES instance could host 250 replicated closed
metricbeat indices (each index with one shard). After this change, the same instance can host 7300
replicated closed metricbeat instances (not that this would be a recommended configuration). Most
of the remaining memory is in the cluster state and the IndexSettings object.
Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with two small adaptions:
- auto-bootstrapping, but requiring initial_master_nodes not to be set.
- not actively pinging other nodes using the Peerfinder
- not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).
Currently there are some components of message serializer and sending
that still occur in TcpTransport. This commit makes it possible to
send a message without the TcpTransport by moving all of the remaining
application logic to the OutboundHandler. Additionally, it adds unit
tests to ensure that this logic works as expected.
Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propogate the events to the external listeners.
Additionally this commit back ports #40237
Remove Tracer from MockTransportService
Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propogate the events to the external listeners.
FilterDirectory.getPendingDeletions does not delegate, fixed
temporarily by overriding in StoreDirectory.
This in turn caused duplicate file name use after a trimUnsafeCommits
had been done, since a new IndexWriter would not consider the pending
deletes in IndexFileDeleter. This should only happen on windows (AFAIK).
Reenabled doing index updates for all tests using
IndexShardTests.indexOnReplicaWithGaps (which could fail due to above
when using mocked WindowsFS).
Added getPendingDeletions delegation to all elasticsearch
FilterDirectory subclasses that were not trivial test-only overrides to
minimize the risk of hitting this issue in another case.
* Mistake was made in #39662
* The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state
If a replica were first reset due to one primary failover and then
promoted (before resync completes), its MSU would not include changes
since global checkpoint, leading to errors during translog replay.
Fixed by re-initializing MSU before restoring local history.
Unlike index operations which can fail at the document level to
analyzing errors, delete operations should never fail at the document
level whether soft-deletes is enabled or not. With this change, we will
always fail the engine if we fail to apply a delete operation to Lucene.
Closes#33256
We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped
using it since #25692 (6.0). This change removes that action and related
code in 7.x and 8.0.
Relates #19287
Relates #25692
When the method ensureGreen in QA tests is timed out, it does not
provide enough info for us to investigate why the testing index is
not green yet. With this change, we will dump the cluster state if
ensureGreen timed out.
Relates #32027
This PR introduces AsyncRecoveryTarget which executes remote calls of
peer recovery asynchronously. In this change, we also add a new
assertion to ensure that method sendBatch, which sends a batch of
history operations in phase2, is never called recursively on the same
thread. This new assertion will also be used in method sendFileChunks.
This commit unmutes NetworkDisruptionIT.
It makes changes necessary for Zen2 - avoids usage of
autoMinMasterNodes and selects cluster size, such that there is no
need to call AddVotingExclusion.
This test also introduces refactors a single method
prepareDistruptedCluster to be used by both test methods.
Unfortunately, NetworkDisruption is broken and the
testNetworkPartitionRemovalRestoresConnections "is fixed" by
introducing assertBusy - #38348.
Relates #36205
Relates #38348
(cherry picked from commit 97707c7f892636e5b75c3df546b067414acb27cd)
This change adds a wrapper for IndexSearcher that makes IndexSearcher#search(List, Weight, Collector) visible by
sub-classes. The wrapper is used by the ContextIndexSearcher to call this protected method on a searcher created by a plugin.
This ensures that an override of the protected method in an IndexSearcherWrapper plugin is called when a search is executed.
Closes#30758
Currently, we maintain a transport name ("mock-nio", "nio", "netty")
that is passed to a `TcpTransportChannel` when a request is received.
The value of this name is to associate with the task when we register a
task with the task manager. However, it is only possible to run ES with
one transport, so having an implementation specific name is unnecessary.
This commit removes the name and replaces it with the generic
"transport".
Today we load the shard history retention leases from disk whenever opening the
engine, and treat a missing file as an empty set of leases. However in some
cases this is inappropriate: we might be restoring from a snapshot (if the
target index already exists then there may be leases on disk) or
force-allocating a stale primary, and in neither case does it make sense to
restore the retention leases from disk.
With this change we write an empty retention leases file during recovery,
except for the following cases:
- During peer recovery the on-disk leases may be accurate and could be needed
if the recovery target is made into a primary.
- During recovery from an existing store, as long as we are not
force-allocating a stale primary.
Relates #37165
Computing the compressed size of the cluster state on every invocation
of cluster:monitor/state action is expensive, and the value of this
field is dubious anyway. Therefore we want to remove computing this
field. As a first step, we stop computing and return this field by
default. To avoid breaking users, we will give them a system property to
use to tide them over until the next major release when we will actually
remove this field. This comes with a deprecation warning too, and the
backport to the appropriate minor will also include a note in the
migration guide. There will be a follow-up to remove this field in the
next major version.
Today, when applying new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.
If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.
This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.
Resolves#29025.
Recently we have had a number of test issues related to blocking
activity occuring on the io thread. This commit adds a log warning for
when handling event takes a >150 milliseconds. This is implemented
for the MockNioTransport which is the transport used in
ESIntegTestCase.
* Back port build changes from #39102
This back-ports how versions are determined and bwc test are set up from
#39102 without enabling the bwc from current version tests so it's
easier/possible to backmerge future buld changes.
It's expected that the tets are lacking many of the required fixes in
this version to enable them.
* Wipe Snapshots Before Indices in RestTests
* If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe
* Fixed by first deleting/aborting snapshots
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.
Closes#39467
Lucene added an optimization to leave the term dictionary on disk
for non-id like fields. This change happened very late in the release
processes such that it's better to have an escape hatch if certain
use-cases are hurt by this optimization. This setting might be
removed in the future if it turns out to be unnecessary.
When a node is joining the cluster we ensure that it can send requests to the
master _at that time_. If it joins the cluster and _then_ loses the ability to
send requests to the master then it should be removed from the cluster. Today
this is not the case: the master can still receive responses to its follower
checks, and receives acknowledgements to cluster state publications, so has no
reason to remove the node.
This commit changes the handling of follower checks so that they fail if they
come from a master that the other node was following but which it now believes
to have failed.
* Optimize Bulk Message Parsing and Message Length Parsing
* findNextMarker took almost 1ms per invocation during the PMC rally track
* Fixed to be about an order of magnitude faster by using Netty's bulk `ByteBuf` search
* It is unnecessary to instantiate an object (the input stream wrapper) and throw it away, just to read the `int` length from the message bytes
* Fixed by adding bulk `int` read to BytesReference
Backport support for replicating closed indices (#39499)
Before this change, closed indexes were simply not replicated. It was therefore
possible to close an index and then decommission a data node without knowing
that this data node contained shards of the closed index, potentially leading to
data loss. Shards of closed indices were not completely taken into account when
balancing the shards within the cluster, or automatically replicated through shard
copies, and they were not easily movable from node A to node B using APIs like
Cluster Reroute without being fully reopened and closed again.
This commit changes the logic executed when closing an index, so that its shards
are not just removed and forgotten but are instead reinitialized and reallocated on
data nodes using an engine implementation which does not allow searching or
indexing, which has a low memory overhead (compared with searchable/indexable
opened shards) and which allows shards to be recovered from peer or promoted
as primaries when needed.
This new closing logic is built on top of the new Close Index API introduced in
6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before
closing them, and closing an index on a 8.0 cluster will reinitialize the index shards
and therefore impact the cluster health.
Some APIs have been adapted to make them work with closed indices:
- Cluster Health API
- Cluster Reroute API
- Cluster Allocation Explain API
- Recovery API
- Cat Indices
- Cat Shards
- Cat Health
- Cat Recovery
This commit contains all the following changes (most recent first):
* c6c42a1 Adapt NoOpEngineTests after #39006
* 3f9993d Wait for shards to be active after closing indices (#38854)
* 5e7a428 Adapt the Cluster Health API to closed indices (#39364)
* 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767)
* 71f5c34 Recover closed indices after a full cluster restart (#39249)
* 4db7fd9 Adapt the Recovery API for closed indices (#38421)
* 4fd1bb2 Adapt more tests suites to closed indices (#39186)
* 0519016 Add replica to primary promotion test for closed indices (#39110)
* b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631)
* c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955)
* 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex()
* e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329)
* cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327)
* b9becdd Adapt testPendingTasks() for replicated closed indices (#38326)
* 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024)
* e53a9be Fix compilation error in IndexShardIT after merge with master
* cae4155 Relax NoOpEngine constraints (#37413)
* 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes
* c63fd69 [RCI] Add NoOpEngine for closed indices (#33903)
Relates to #33888
This adds a `details` parameter to shard locking in `NodeEnvironment`. This is
intended to be used for diagnosing issues such as
```
1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index
1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index
1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms
1> at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?]
1> at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?]
1> at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?]
1> at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?]
```
In the hope that we will be able to determine why the shard is still locked.
Relates to #30290 as well as some other CI failures
there are testing situations where newly created indices
are being wiped before they are fully initialized. This results
in an edge-case in the shard-locking strategy where an index
cannot be deleted.
This should fix that
Currently there are two security tests that specifically target the
netty security transport. This PR moves the client authentication tests
into `AbstractSimpleSecurityTransportTestCase` so that the nio transport
will also be tested.
Additionally the work to build transport configurations is moved out of
the netty transport and tested independently.
This commit is the final piece of the integration of CCR with retention
leases. Namely, we periodically renew retention leases and advance the
retaining sequence number while following.
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.
The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:
- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).
- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica. However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (i.e. also see
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.
Relates #39000Closes#38949
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
* The lambda invoked by the `lockedExecutor` eventually gets JITed (which runs a static initializer that we will suspend in with a very tiny chance).
* Fixed by creating the `Runnable` in the main test thread and using the same instance in all threads
* Closes#35686
Currently remote compression and ping schedule settings are dynamic.
However, we do not listen for changes. This commit adds listeners for
changes to those two settings. Additionally, when those settings change
we now close existing connections and open new ones with the settings
applied.
Fixes#37201.
* Simplify and Fix Synchronization in InternalTestCluster (#39168)
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes#37965
* Closes#37275
* Closes#37345
There may be situations where indices have not yet been closed from a Lucene
perspective, causing the breaker to not immediately be at 0
Relates to #30290
This commit introduces the retention leases to ESIndexLevelReplicationTestCase,
then adds some tests verifying that the retention leases replication works
correctly in spite of the presence of the primary failover or out of order
delivery of retention leases sync requests.
Relates #37165
Rollup jobs should be stopped + deleted before the indices are removed.
It's possible for an active rollup job to issue a bulk request, the test
ends and the cleanup code deletes all indices. The in-flight bulk
request will then stall + error because the index no-longer exists...
but this process might take longer than the StopRollup timeout.
Which means the test fails, and often fails several other tests since
the job is still active (e.g. other tests cannot create the same-named
job, or fail to stop the job in their cleanup because it's still stalled).
This tends to knock over several tests before the bulk finally times
out and the job shuts down.
Instead, we need to simply stop jobs first. Inflight bulks will resolve
quickly, and we can carry on with deleting indices after the jobs are
confirmed inactive.
stop-job.asciidoc tended to trigger this issue because it executed
an async stop API and then exited, which setup the above situation. In
can and did happen with other tests though. As an extra precaution,
the doc test was modified to substitute in wait_for_completion
to help head off these issues too.
* Disable specific locales for tests in fips mode
The Bouncy Castle FIPS provider that we use for running our tests
in fips mode has an issue with locale sensitive handling of Dates as
described in https://github.com/bcgit/bc-java/issues/405
This causes certificate validation to fail if any given test that
includes some form of certificate validation happens to run in one
of the locales. This manifested earlier in #33081 which was
handled insufficiently in #33299
This change ensures that the problematic 3 locales
* th-TH
* ja-JP-u-ca-japanese-x-lvariant-JP
* th-TH-u-nu-thai-x-lvariant-TH
will not be used when running our tests in a FIPS 140 JVM. It also
reverts #33299
Today when processing an operation on a replica engine (or the
following engine), we first add it to Lucene, then add it to translog,
then finally marks its seq_no as completed. If a flush occurs after step1,
but before step-3, the max_seq_no in the commit's user_data will be
smaller than the seq_no of some documents in the Lucene commit.
This test failed on 7.1 when running full cluster restart tests against pre-7.0
clusters (e.g. 6.6 clusters). The fixes the expected type in the
templates after the cluster restart.
Backport of #38903
When tearing down from `ESSingleNodeTestCase` we perform a delete on "*"
indices, it some cases, however, those indices are not fully deleted. Rather
than have a failure occur later down the change (see:
https://github.com/elastic/elasticsearch/issues/30290#issuecomment-463589008 )
the failure should occurr immediately so it can be diagnosed more easily.
Instead of using `WarningsHandler.PERMISSIVE`, we only match warnings
that are due to types removal.
This PR also renames `allowTypeRemovalWarnings` to `allowTypesRemovalWarnings`.
Relates to #37920.
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.
Relates #38633
Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR.
The implementation uses the parallel file writer functionality that is also used by peer recoveries.
When we are preparing to release a major version the rules around
"unreleased" versions and branches get a bit more complex.
This change implements the following rules:
- If the tip version on the previous major is a .0 (e.g. 6.7.0) then
the tip of the minor before that (e.g. 6.6.1) must be unreleased.
(This is because 6.7.0 would be "staged" in preparation for release,
but 6.6.1 would be open for bug fixes on the release 6.6.x line)
(in VersionCollection & VersionUtils)
- The "major.x" branch (if it exists) will always point to the latest
minor in that series. Anything that is not the latest minor, must
therefore be on a the "major.minor" branch
For example, if v7.1.0 exists then the "7.x" branch must be 7.1.0,
and 7.0.0 must be on the "7.0" branch
(in VersionCollection)
This commit adds the 7.1 version constant to the 7.x branch.
Co-authored-by: Andy Bristol <andy.bristol@elastic.co>
Co-authored-by: Tim Brooks <tim@uncontended.net>
Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
Co-authored-by: markharwood <markharwood@gmail.com>
Co-authored-by: Ioannis Kakavas <ioannis@elastic.co>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Alpar Torok <torokalpar@gmail.com>
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Albert Zaharovits <albert.zaharovits@gmail.com>
In #38333 and #38350 we moved away from the `discovery.zen` settings namespace
since these settings have an effect even though Zen Discovery itself is being
phased out. This change aligns the documentation and the names of related
classes and methods with the newly-introduced naming conventions.
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.
The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).
Given all these issues, it seems that this custom timeout mechanism
gives little benefits while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyways.
This commit removes the `maxRetryTimeout` setting and all of its usages.
Currently the snapshot/restore process manually sets the global
checkpoint to the max sequence number from the restored segements. This
does not work for Ccr as this will lead to documents that would be
recovered in the normal followering operation from being recovered.
This commit fixes this issue by setting the initial global checkpoint to
the existing local checkpoint.
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.
Relates #38410
Today we throw a fatal `RuntimeException` if an exception occurs in
`getMasterName()`, and this includes the case where there is currently no
master. However, sometimes we call this method inside an `assertBusy()` in
order to allow for a cluster that is in the process of stabilising and electing
a master. The trouble is that `assertBusy()` only retries on an
`AssertionError` and not on a general `RuntimeException`, so the lack of a
master is immediately fatal.
This commit fixes the issue by asserting there is a master, triggering a retry
if there is not.
Fixes#38331
Renames the following settings to remove the mention of `zen` in their names:
- `discovery.zen.hosts_provider` -> `discovery.seed_providers`
- `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers`
- `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout`
- `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`
X-Pack security supports built-in authentication service
`token-service` that allows access tokens to be used to
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on
OAuth2 spec. The access token is a short-lived token
(defaults to 20m) and refresh token with a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline thereby failing refresh of tokens.
This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens aka API
keys to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.
The API keys:-
- by default do not have an expiration but expiration can be
configured where the API keys need to be expired after a
certain amount of time.
- when generated will keep authentication information of the user that
generated them.
- can be defined with a role describing the privileges for accessing
Elasticsearch and will be limited by the role of the user that
generated them
- can be invalidated via invalidation API
- information can be retrieved via a get API
- that have been expired or invalidated will be retained for 1 week
before being deleted. The expired API keys remover task handles this.
Following are the API key management APIs:-
1. Create API Key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API Key(s) `DELETE /_security/api_key`
The API keys can be used to access Elasticsearch using `Authorization`
header, where the auth scheme is `ApiKey` and the credentials, is the
base64 encoding of API key Id and API key separated by a colon.
Example:-
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```
Closes#34383
Relax test warning message checking to pre-empt PR 38022 landing in 6.7 with new warning messages.
The relaxed test now just assumes any warning message starting with “[types removal]” is tolerated rather than the precise phrasing used in the 6.7 branch.
This commit introduces a background sync for retention leases. The idea
here is that we do a heavyweight sync when adding a new retention lease,
and then periodically we want to background sync any retention lease
renewals to the replicas. As long as the background sync interval is
significantly lower than the extended lifetime of a retention lease, it
is okay if from time to time a replica misses a sync (it will still have
an older version of the lease that is retaining more data as we assume
that renewals do not decrease the retaining sequence number). There are
two follow-ups that will come after this commit. The first is to address
the fact that we have not adapted the should periodically flush logic to
possibly flush the retention leases. We want to do something like flush
if we have not flushed in the last five minutes and there are renewed
retention leases since the last time that we flushed. An additional
follow-up will remove the syncing of retention leases when a retention
lease expires. Today this sync could be invoked in the background by a
merge operation. Rather, we will move the syncing of retention lease
expiration to be done under the background sync. The background sync
will use the heavyweight sync (write action) if a lease has expired, and
will use the lightweight background sync (replication action) otherwise.
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
Today we have DiscoveryDisruptionIT tests for checking that discovery can still
work once the cluster has formed, even if the cluster is misconfigured and only
has a single master-eligible node in its unicast hosts list. In fact with Zen2
we can go one better: we do not need any nodes in the unicast hosts list,
because nodes also use the contents of the last-committed cluster state for
discovery. Additionally, the DiscoveryDisruptionIT tests were failing due to
the overenthusiastic fault-detection timeouts.
This commit replaces these tests with deterministic `CoordinatorTests` that
verify the same behaviour. It also removes some duplication by extracting a
test method called `testFollowerCheckerAfterMasterReelection()`
Closes#37687
Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
This PR removes the temporary change we made to the yml test harness in #37285
to automatically set `include_type_name` to `true` in index creation requests
if it's not already specified. This is possible now that the vast majority of
index creation requests were updated to be typeless in #37611. A few additional
tests also needed updating here.
Additionally, this PR updates the test harness to set `include_type_name` to
`false` in index creation requests when communicating with 6.x nodes. This
mirrors the logic added in #37611 to allow for typeless document write requests
in test set-up code. With this update in place, we can remove many references
to `include_type_name: false` from the yml tests.
Currently, there are a few tests that use autoMinMasterNodes=false and
hence override addExtraClusterBootstrapSettings, mostly this is 10-30
lines of codes that are copy-pasted from class to class.
This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex`
which is suitable for all classes and copy-paste could be removed.
Removing code is always a good thing!
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes#32348
Implements `geotile_grid` aggregation
This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240
This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.
When extending ESIntegTestCase are run on the same jvm, the static field in
NodeAndClusterIdConverter will throw an AlreadySet exceptions.
overriding the configuration method from Node.configureNodeAndClusterIdStateListener in the MockNode will prevent the listener registration from happening
relates #32850
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes#37708.
This commit pushes the primary term into the replication tracker. This
is a precursor to using the primary term to resolving ordering problems
for retention leases. Namely, it can be that out-of-order retention
lease sync requests arrive on a replica. To resolve this, we need a
tuple of (primary term, version). For this to be, the primary term needs
to be accessible in the replication tracker. As the primary term is part
of the replication group anyway, this change conceptually makes sense.
With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests, that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested to each cluster, and the reduce phase in each cluster is non final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.
This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API accepts now a new parameter called ccs_minimize_roundtrips that allows to opt-out of the default behaviour.
Relates to #32125
Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in
tests, but for 7.x nodes this setting is not required as it has no effect.
This commit removes this setting so that nodes are started with more realistic
configurations, and deprecates it.
This drops the option for tests to disable strict deprecation mode in
the low level rest client in favor of configuring expected warnings on
any calls that should expect warnings. This behavior is
paranoid-by-default which is generally the right way to handle
deprecations and tests in general.