Today the default for USE_ZEN2 is false and it is overridden in many places. By
defaulting it to true we can be sure that the only places in which Zen2 does
not work are those in which it is explicitly set to false.
Today GatewayMetaState is capable of atomically storing MetaData to
disk. We've also moved fields that are needed to be persisted in Zen2
from ClusterState to ClusterState.MetaData.CoordinationMetaData.
This commit implements PersistedState interface.
version and currentTerm are persisted as a part of Manifest.
GatewayMetaState now implements both ClusterStateApplier and
PersistedState interfaces. We started with two descendants
Zen1GatewayMetaState and Zen2GatewayMetaState, but it turned
out to be not easy to glue it.
GatewayMetaState now constructs previousClusterState (including
MetaData) and previousManifest inside the constructor so that all
PersistedState methods are usable as soon as GatewayMetaState
instance is constructed. Also, loadMetaData is renamed to
getMetaData, because it just returns
previousClusterState.metaData().
Sadly, we don't have access to localNode (obtained from
TransportService in the constructor, so getLastAcceptedState
should be called, after setLocalNode method is invoked.
Currently, when deciding whether to write IndexMetaData to disk,
we're comparing current IndexMetaData version and received
IndexMetaData version. This is not safe in Zen2 if the term has changed.
So updateClusterState now accepts incremental write
method parameter. When it's set to false, we always write
IndexMetaData to disk.
Things that are not covered by GatewayMetaStateTests are covered
by GatewayMetaStatePersistedStateTests.
This commit also adds an option to use GatewayMetaState instead of
InMemoryPersistedState in TestZenDiscovery. However, by default
InMemoryPersistedState is used and only one test in PersistedStateIT
used GatewayMetaState. In order to use it for other tests, proper
state recovery should be implemented.
Today we have a way to atomically persist global MetaData and
IndexMetaData to disk when new ClusterState is received. All other
ClusterState fields are not persisted.
However, there are other parts of ClusterState that should be
persisted, namely:
version
term
lastCommittedConfiguration
lastAcceptedConfiguration
votingTombstones
version is changed frequently, other fields are not. We decided
to group term, lastCommittedConfiguration,
lastAcceptedConfiguration and votingTombstones into
CoordinationMetaData class and make CoordinationMetaData a field
inside MetaData.
MetaData.toXContent and MetaData.fromXContent should take care of
CoordinationMetaData.
version stays as a top level field in ClusterState and will be
persisted as part of Manifest in a follow-up commit.
Also MetaData.isGlobalStateEquals should be extended to include
coordinationMetaData in comparison.
This commit favors exposing getters, such as getTerm directly in
ClusterState to avoid massive code changes.
An example of CoordinationMetaState.toXContent:
{
"term": 1,
"last_committed_config": [
"TiIuBcbBtpuXyDDVHXeD",
"ZIAoVbkjjLPLUuYLaTkw"
],
"last_accepted_config": [
"OwkXbXZNOZPJqccdFHdz",
"LouzsGYwmQzpeQMrboZe",
"fCKGRZdjLTqzXAqPUtGL",
"pLoxshjpJXwDhbgjfYJy",
"SjINLwFIlIEFZCbjrSFo",
"MDkVncJEVyZLJktopWje"
]
}
- Moves disruption tests to Zen2
- Registers a few missing settings
- Removes .put(TestZenDiscovery.USE_ZEN2.getKey(), true) from tests where Zen2 is now enabled
by default through the parent test class
- Moves QuorumGatewayIT back to Zen1, as it is not stable with Zen2 as it currently relies on
dangling indices due to the lack of proper CS persistence, which triggers secondary failures
This commit adds a rest endpoint for freezing and unfreezing an index.
Among other cleanups mainly fixing an issue accessing package private APIs
from a plugin that got caught by integration tests this change also adds
documentation for frozen indices.
Note: frozen indices are marked as `beta` and available as a basic feature.
Relates to #34352
Zen2 is now feature-complete enough to run most ESIntegTestCase tests. The changes in this PR
are as follows:
- ClusterSettingsIT is adapted to not be Zen1 specific anymore (it was using Zen1 settings).
- Some of the integration tests require persistent storage of the cluster state, which is not fully
implemented yet (see #33958). These tests keep running with Zen1 for now but will be switched
over as soon as that is fully implemented.
- Some very few integration tests are not running yet with Zen2 for other reasons, depending on
some of the other open points in #32006.
Removed extending of AbstractComponent and changed logger usage to
explicit declaration. Abstract classes still have logger
declaration using this.getClass() in order to show implementation class
name in its logs.
See #34488
This changes the test to not use a `CountDownlatch`, instead adding an assertion
for the final logging message and waiting until the `MockAppender` has seen it
before proceeding.
Resolves#23739
Today, the bootstrapping of a Zen2 cluster is driven externally, requiring
something else to wait for discovery to converge and then to inject the initial
configuration. This is hard to use in some situations, such as REST tests.
This change introduces the `ClusterBootstrapService` which brings the bootstrap
retry logic within each node and allows it to be controlled via an (unsafe)
node setting.
This pull request replaces some blocks of code that must be run once
and that are currently based on AtomicBoolean by the convient RunOnce
class added in #35489.
Refactors and simplifies the logic around stopping nodes, making sure that for a full cluster restart
onNodeStopped is only called after the nodes are actually all stopped (and in particular not while
starting up some nodes again). This change also ensures that a closed node client is not being used
anymore (which required a small change to a test).
Relates to #35049
This adds a `wait_for_completion` flag which allows the user to block
the Stop API until the task has actually moved to a stopped state,
instead of returning immediately. If the flag is set, a `timeout` parameter
can be specified to determine how long (at max) to block the API
call. If unspecified, the timeout is 30s.
If the timeout is exceeded before the job moves to STOPPED, a
timeout exception is thrown. Note: this is just signifying that the API
call itself timed out. The job will remain in STOPPING and evenutally
flip over to STOPPED in the background.
If the user asks the API to block, we move over the the generic
threadpool so that we don't hold up a networking thread.
If shutting down half or more of the master-eligible nodes, their votes must
first be explicitly withdrawn to ensure that the cluster doesn't lose its
quorum. This works via _voting tombstones_, stored in the cluster state, which
tell the reconfigurator to remove nodes from the voting configuration.
This change introduces voting tombstones to the cluster state, together with
transport APIs for adding and removing them, and makes use of these APIs in
`InternalTestCluster` to support tests which remove at least half of the
master-eligible nodes at once (e.g. shrinking from two master-eligible nodes to
one).
AbstractComponent was deprecated in #35140 and is looking like it will be
removed at some point by #34888. Today all it does is provide a logger. This
change removes the usages of AbstractComponent that live solely in the zen2
feature branch to avoid some future merge pain, and replaces it where necessary
with some directly-created loggers.
The MockTcpTransport is not friendly in regards to memory usage. It must
allocate multiple byte arrays for every message. This improves the
memory situation by failing fast if the message is improperly formatted.
Additionally, it uses reusable big arrays for at least half of the
allocated byte arrays.
This change adds a logger for the query and fetch phases that prints all requests
before their execution at the trace level. This will help debugging cases where an issue
occurs during the execution since only completed queries are logged by the slow logs.
This is related to #34483. It introduces a namespaced setting for
compression that allows users to configure compression on a per remote
cluster basis. The transport.tcp.compress remains as a fallback
setting. If transport.tcp.compress is set to true, then all requests
and responses are compressed. If it is set to false, only requests to
clusters based on the cluster.remote.cluster_name.transport.compress
setting are compressed. However, after this change regardless of any
local settings, responses will be compressed if the request that is
received was compressed.
Enables diff-based publishing, which is an optimization where only the changing parts of the cluster
state are published to the nodes in the cluster, falling back to full cluster state publishing if the
receiver does not have the previous cluster state.
- Introduces a transport API for bootstrapping a Zen2 cluster
- Introduces a transport API for requesting the set of nodes that a
master-eligible node has discovered and for waiting until this comprises the
expected number of nodes.
- Alters ESIntegTestCase to use these APIs when forming a cluster, rather than
injecting the initial configuration directly.
This change adds a `frozen` engine that allows lazily open a directory reader
on a read-only shard. The engine wraps general purpose searchers in a LazyDirectoryReader
that also allows to release and reset the underlying index readers after any and before
secondary search phases.
Relates to #34352
With this change, `Version` no longer carries information about the qualifier,
we still need a way to show the "display version" that does have both
qualifier and snapshot. This is now stored by the build and red from `META-INF`.
This is related to #29023. Additionally at other points we have
discussed a preference for removing the need to unnecessarily block
threads for opening new node connections. This commit lays the groudwork
for this by opening connections asynchronously at the transport level.
We still block, however, this work will make it possible to eventually
remove all blocking on new connections out of the TransportService
and Transport.
Today we allow the user to set the minimum size of a voting configuration. On
reflection we would rather this was simply '3' where possible, and we can use
the retirement API to control the removal of nodes more explicitly.
This change replaces the old reconfigurator setting with a new one,
`cluster.auto_shrink_voting_configuration`, which determines whether
Elasticsearch should automatically remove nodes from the voting configuration
or not.
When the engine is asked for historical operations, we check if some of the requested operations
are not yet refreshed and if so we refresh before returning the operations. The refresh check is
based on capturing the local checkpoint before each refresh and comparing that value to the one
requested when `newChangesSnapshot` was called. If the requested range is above the captured
local checkpoint we issue a refresh.
This can currently cause unneeded extra refreshes if the method is called concurrently which may cause unwanted degradation in indexing performance. This is especially relevant for CCR where we always ask for a range below the global checkpoint. That range is guaranteed to be below the local
checkpoint of the shard and one refresh is enough to serve multiple changes requests.
This commit fixes this by introducing a dedicated mutex to make sure the test for whether a refresh
is needed actually wait for concurrents for concurrent refreshes that were caused by another
change refresh.
Note that this is not a big change in semantics as refreshes are serialized by lucene anyway. I also
opted not to keep the synchronization to the changes snapshot request only even if in theory we
can apply it to all refreshes, not matter where they come from.
Currently we create a new netty event loop group for client connections
and all server profiles. Each new group creates new threads for io
processing. This means 2 * num of processors new threads for each group.
A single group should be able to handle all io processing (for the
transports). This also brings the netty module inline with what we do
for nio.
Additionally, this PR renames the worker threads to be the same for
netty and nio.
* Introduce property to set version qualifier
- VersionProperties.elasticsearch is now a string which can have qualifier
and snapshot too
- The Version class in the build no longer cares about snapshot and
qualifier.
Stop passing `Settings` to `AbstractComponent`'s ctor. This allows us to
stop passing around `Settings` in a *ton* of places. While this change
touches many files, it touches them all in fairly small, mechanical
ways, doing a few things per file:
1. Drop the `super(settings);` line on everything that extends
`AbstractComponent`.
2. Drop the `settings` argument to the ctor if it is no longer used.
3. If the file doesn't use `logger` then drop `extends
AbstractComponent` from it.
4. Clean up all compilation failure caused by the `settings` removal
and drop any now unused `settings` isntances and method arguments.
I've intentionally *not* removed the `settings` argument from a few
files:
1. TransportAction
2. AbstractLifecycleComponent
3. BaseRestHandler
These files don't *need* `settings` either, but this change is large
enough as is.
Relates to #34488
* NETWORKING: MockTransportService Wait for Close
* Make `MockTransportService` wait `30s` for close listeners to run before failing the assertion
* Closes#34990
Today when ESIntegTestCase starts some nodes it writes out the unicast hosts
files each time a node starts its transport service. This does mean that a
number of nodes can start and perform their first pinging round without any
unicast hosts which, if the timing is unlucky and a lot of nodes are all
started at the same time, can lead to a split brain as in #35052.
Prior to #33554 this was unlikely to happen since the MockUncasedHostsProvider
would always have yielded the existing hosts, so the timing would have to have
been implausibly unlucky. Since #33554, however, it's more likely because the
race occurs between the start of the first round of pinging and the writing of
the unicast hosts file. It is realistic that new nodes will be configured with
the existing nodes from startup, so this change reinstates that behaviour.
Closes#35052.