Zen2 is now feature-complete enough to run most ESIntegTestCase tests. The changes in this PR
are as follows:
- ClusterSettingsIT is adapted to not be Zen1 specific anymore (it was using Zen1 settings).
- Some of the integration tests require persistent storage of the cluster state, which is not fully
implemented yet (see #33958). These tests keep running with Zen1 for now but will be switched
over as soon as that is fully implemented.
- Some very few integration tests are not running yet with Zen2 for other reasons, depending on
some of the other open points in #32006.
Elasticsearch node is responsible for storing cluster metadata.
There are 2 types of metadata: global metadata and index metadata.
`GatewayMetaState` implements `ClusterStateApplier` and receives all
`ClusterStateChanged` events and is responsible for storing modified
metadata to disk.
When new `ClusterStateChanged` event is received, `GatewayMetaState`
checks if global metadata has changed and if it's the case writes new
global metadata to disk. After that `GatewayMetaState` checks if index
metadata has changed or there are new indices assigned to this node and
if it's the case writes new index metadata to disk. Atomicity of global
metadata and index metadata writes is ensured by `MetaDataStateFormat`
class.
Unfortunately, there is no atomicity when more than one metadata changes
(global and index, or metadata for two indices). And atomicity is
important for Zen2 correctness.
This commit adds atomicity by adding a notion of manifest file,
represented by `MetaState` class. `MetaState` contains pointers to
current metadata.
More precisely, it stores global state generation as long and map from
`Index` to index metadata generation as long. Atomicity of writes for
manifest file is ensured by `MetaStateFormat` class.
The algorithm of writing changes to the disk would be the following:
1. Write global metadata state file to disk and remember
it's generation.
2. For each new/changed index write state file to disk and remember
it's generation. For each not-changed index use generation from
previous manifest file. If index is removed or this node is no longer
responsible for this index - forget about the index.
3. Create `MetaState` object using previously remembered generations and
write it to disk.
4. Remove old state files for global metadata, indices metadata and
manifest.
Additonally new implementation relies on enhanced `MetaDataStateFormat`
failure semantics, `applyClusterState` throws IOException, whose
descendant `WriteStateException` could be (and should be in Zen2)
explicitly handled.
Today, the bootstrapping of a Zen2 cluster is driven externally, requiring
something else to wait for discovery to converge and then to inject the initial
configuration. This is hard to use in some situations, such as REST tests.
This change introduces the `ClusterBootstrapService` which brings the bootstrap
retry logic within each node and allows it to be controlled via an (unsafe)
node setting.
Today, the TransportReplicationAction checks the global level blocks and
the index level blocks before routing the operation to the primary, in the
ReroutePhase, and it happens at the very beginning of the transport
replication action execution. For the upcoming rework of the Close Index
API and in order to deal with primary relocation, we'll need to also check
for blocks before executing the operation on the primary (while holding a
permit) but before routing to the new primary.
This pull request change the AsyncPrimaryAction so that it checks for
replication action's blocks before executing the operation locally or before
routing the primary action to the newly primary shard. The check is done
while holding a PrimaryShardReference.
Related to #33888
The way ScoreAccessor implements `compareTo()` is problematic because it doesn't
completely follow the Comparable contract, specificaly symmetry (if x is a
ScoreAccessor and y any Number then x.comparTo(y) works, but y.compareTo(x)
generally does not even compile). Fortunately we don't seem to use the fact that
ScoreAccessor is a Comparable anywhere, so we can simply remove it.
Today the `PeerFinder` probes each address it obtains, identifies the node to
which it just connected, and then returns all such nodes. However, this can
lead to duplicates if a node manages to connect to another node via two
distinct addresses. This causes bootstrapping to fail since
`BootstrapConfiguration#resolve` forbids duplicates.
This change alters the behaviour of the `PeerFinder` to remove duplicates in
this situation.
If shutting down half or more of the master-eligible nodes, their votes must
first be explicitly withdrawn to ensure that the cluster doesn't lose its
quorum. This works via _voting tombstones_, stored in the cluster state, which
tell the reconfigurator to remove nodes from the voting configuration.
This change introduces voting tombstones to the cluster state, together with
transport APIs for adding and removing them, and makes use of these APIs in
`InternalTestCluster` to support tests which remove at least half of the
master-eligible nodes at once (e.g. shrinking from two master-eligible nodes to
one).
AbstractComponent was deprecated in #35140 and is looking like it will be
removed at some point by #34888. Today all it does is provide a logger. This
change removes the usages of AbstractComponent that live solely in the zen2
feature branch to avoid some future merge pain, and replaces it where necessary
with some directly-created loggers.
This change adds a special caching reader that caches all relevant
values for a range query to rewrite correctly in a can_match phase
without actually opening the underlying directory reader. This
allows frozen indices to be filtered with can_match and in-turn
searched with wildcards in a efficient way since it allows us to
exclude shards that won't match based on their date-ranges without
opening their directory readers.
Relates to #34352
Depends on #34357
The ParsedReverseNested implementation should implement the ReverseNested
interface and not the Nested interface. Although this is an empty marker
interface it is confusing and can lead to casting errors. Also adding a test to
check that both ParsedNested and ParsedReverseNested implement the correct
interface.
Closes#35449
Some very old ancient versions of Linux do not have /etc/os-release. For
example, old Red Hat-like OS. This commit adds a fallback for handling
pretty name for these OS.
This is a follow up to #35357. That commit failed to register the new
cluster.remote.cluster_name.transport.compress setting with
`ClusterSettings`. This commit fixes that.
Some OS (e.g., Oracle Linux Server 6.9) have a trailing space at the end
of the PRETTY_NAME line in /etc/os-release. This commit addresses this
by accounting for this trailing space when extracting the pretty name.
Implements serialization compatibility between Zen1 and Zen2 transport action, allowing a Zen1 node to join a fully formed Zen2 cluster and vice-versa.
The MockTcpTransport is not friendly in regards to memory usage. It must
allocate multiple byte arrays for every message. This improves the
memory situation by failing fast if the message is improperly formatted.
Additionally, it uses reusable big arrays for at least half of the
allocated byte arrays.
This change adds a logger for the query and fetch phases that prints all requests
before their execution at the trace level. This will help debugging cases where an issue
occurs during the execution since only completed queries are logged by the slow logs.
This is related to #34483. It introduces a namespaced setting for
compression that allows users to configure compression on a per remote
cluster basis. The transport.tcp.compress remains as a fallback
setting. If transport.tcp.compress is set to true, then all requests
and responses are compressed. If it is set to false, only requests to
clusters based on the cluster.remote.cluster_name.transport.compress
setting are compressed. However, after this change regardless of any
local settings, responses will be compressed if the request that is
received was compressed.
Today our OS information returned in node stats only returns a
high-level name of the OS (e.g., "Linux"). Yet, for some uses this is
too high-level and knowing at a finer level of granularity the
underlying OS can be useful. This commit extracts the pretty name on
Linux from /etc/os-release. This pretty name usually includes the Linux
vendor and the Linux vendor version number (e.g., Fedora 28).
Enables diff-based publishing, which is an optimization where only the changing parts of the cluster
state are published to the nodes in the cluster, falling back to full cluster state publishing if the
receiver does not have the previous cluster state.
- Introduces a transport API for bootstrapping a Zen2 cluster
- Introduces a transport API for requesting the set of nodes that a
master-eligible node has discovered and for waiting until this comprises the
expected number of nodes.
- Alters ESIntegTestCase to use these APIs when forming a cluster, rather than
injecting the initial configuration directly.
Currently we introduced a hard limit of 1024 to the number of fields a query can
be expanded to in #26541. Instead of using a hard limit, we should make this
configurable. This change removes the hard limit check and uses the existing
`max_clause_count` setting instead.
Closes#34778
Currently when aggregating on an unmapped date field (e.g. using a
date_histogram) we don't preserve the aggregations `format` setting but instead
use the default format. This can lead to loosing the aggregations `format` when
aggregating over several indices where some of them contain unmapped date fields
and are encountered first in the reduce phase.
Related to #31760
Today the `composite` aggregation throws an error if a source targets an
unmapped field and `missing_bucket` is set to false. Documents without a
value for a source cannot produce any bucket if `missing_bucket` is not
activated so the error is a shortcut to say that the response will be empty.
However this is not consistent with the `terms` aggregation which accepts
unmapped field by default even if the response is also guaranteed to be empty.
This commit removes this restriction, if a source contains an unmapped field
we now return an empty response (no buckets).
Closes#35317
The current implementation of asyncBlockOperations() can be used to
execute some code once all indexing operations permits have been acquired,
then releases all permits immediately after the code execution. This
immediate release is not suitable for treatments that need to keep all
permits over multiple execution steps.
This commit adds a new asyncBlockOperations() that exposes a Releasable,
making it possible to acquire all permits and only release them all
when needed by closing the Releasable. The existing blockOperations()
method has been modified to delegate permit acquisition/releasing to this new
method.
Relates to #33888
This commit uses the index settings version so that a follower can
replicate index settings changes as needed from the leader.
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
The lookup vars under params (namely _fields and _source) were
inadvertently removed when scoring scripts were converted to using
script contexts. This commit adds them back, along with deprecation
warnings for those that should not be used.
A CCR test failure shows that the approach in #34474 is flawed.
Restoring the LocalCheckpointTracker from an index commit can cause both
FollowingEngine and InternalEngine to incorrectly ignore some deletes.
Here is a small scenario illustrating the problem:
1. Delete doc with seq=1 => engine will add a delete tombstone to Lucene
2. Flush a commit consisting of only the delete tombstone
3. Index doc with seq=0 => engine will add that doc to Lucene but soft-deleted
4. Restart an engine with the commit (step 2); the engine will fill its
LocalCheckpointTracker with the delete tombstone in the commit
5. Replay the local translog in reverse order: index#0 then delete#1
6. When process index#0, an engine will add it into Lucene as a live doc
and advance the local checkpoint to 1 (seq#1 was restored from the
commit - step 4).
7. When process delete#1, an engine will skip it because seq_no=1 is
less than or equal to the local checkpoint.
We should have zero document after recovering from translog, but here we
have one.
Since all operations after the local checkpoint of the safe commit are
retained, we should find them if the look-up considers also soft-deleted
documents. This PR fills the disparity between the version map and the
local checkpoint tracker by taking soft-deleted documents into account
while resolving strategy for engine operations.
Relates #34474
Relates #33656
Today when a percolator query contains a date range then the query
analyzer extracts that range, so that at search time the `percolate` query
can exclude percolator queries efficiently that are never going to match.
The problem is that if 'now' is used it is evaluated at index time.
So the idea is to rewrite date ranges with 'now' to a match all query,
so that the query analyzer can't extract it and the `percolate` query
is then able to evaluate 'now' at query time.
This change adds a `frozen` engine that allows lazily open a directory reader
on a read-only shard. The engine wraps general purpose searchers in a LazyDirectoryReader
that also allows to release and reset the underlying index readers after any and before
secondary search phases.
Relates to #34352
Today we only apply `ingore_throttled` to expansions from wildcards,
date math expressions and aliases. Yet, this is tricky since we might
have resolved certain expressions in pre-filter steps like security.
It's more consistent to apply this logic to all expressions including
concrete indices.
Relates to #34354