Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
* Instead of replacing the `shardSnapshots` field, we mutate it, explicitly removing entries from it in only a single spot
* Decreased the amount of indirection by moving all logic for starting a snapshot's newly discovered shard tasks into `startNewShards` (saves us two maps (keyed by snapshot) and iterations over them)
In Zen 1 there are commit timeout and publish timeout and these
settings could be changed on-the-fly.
In Zen 2, there is only commit timeout and this setting is static.
RareClusterStateIT is actively using these settings and the fact, they
are dynamic.
This commit adds cancelCommitedPublication method to Coordinator to
be used by tests. This method will cancel current committed publication
if there is any.
When there is BlockClusterStateProcessing on the non-master node, the
publication will be accepted and committed, but not yet applied. So we
can use the method above to cancel it.
Also, this commit replaces callback + AtomicReference with ActionFuture,
which makes test code easier to read.
Using index.routing.allocation.require._host does not correctly work because the boolean logic in
filter matching is broken (DiscoveryNodeFilters.match(...) will return false) when
opType ==OpType.AND
When restoring shards of existing indices, the RestoreService also
restores the values of primary terms stored in the snapshot index
metadata. The primary terms are not updated and could potentially
conflict with current index primary terms if the restored primary terms
are lower than the existing ones.
This situation is likely to happen with replicated closed indices
(because primary terms are increased when the index is transitioning
from open to closed state, and the snapshotted primary terms are the
one at the time the index was opened) (see #38024) and maybe also
with CCR.
This commit changes the RestoreService so that it updates the primary
terms using the maximum value between the snapshotted values and
the existing values.
Related to #33888
If custom date formats are used, there may be combinations that the new
performat DateFormatters.from() method has not covered yet. This adds a
few such corner cases and ensures the tests are correctly commented
out.
In #38158 we ensured that global ordinals are not loaded when another execution hint is explicitly set on the source. This change is a follow up that addresses a comment
dd6043c1c0 (r252984782) added after the merge.
This commit adds the second part of `elasticsearch-node` tool -
`detach-cluster` command in addition to `unsafe-bootstrap` command.
Also, this commit changes the semantics of `unsafe-bootstrap`, now
`unsafe-bootstrap` changes clusterUUID.
So the algorithm of running `elasticsearch-node` tool is the following:
1) Stop all nodes in the cluster.
2) Pick master-eligible node with the highest (term, version) pair and
run the `unsafe-bootstrap` command on it. If there are no survived
master-eligible nodes - skip this step.
3) Run `detach-cluster` command on the remaining survived nodes.
Detach cluster makes the following changes to the node metadata:
1) Sets clusterUUID committed to false.
2) Sets currentTerm and term to 0.
3) Removes voting tombstones and sets voting configurations to special
constant MUST_JOIN_ELECTED_MASTER, that prevents initial cluster
bootstrap.
`ElasticsearchNodeCommand` base abstract class is introduced, because
`UnsafeBootstrapMasterCommand` and `DetachClusterCommand` have a lot in
common.
Also, this commit adds "ordinal" parameter to both commands, because it's
impossible to write IT otherwise.
For MUST_JOIN_ELECTED_MASTER case special handling is introduced in
`ClusterFormationFailureHelper`.
Tests for both commands reside in `ElasticsearchNodeCommandIT` (renamed
from `UnsafeBootstrapMasterIT`).
The current CloseWhileRelocatingShardsIT test adds some "send behavior"
rule to a target node's mocked transport service in order to detect when shard
relocating are started. These rules are never cleared and prevent the test to
complete normally after the rebalance is re-enabled again.
This commit changes the test so that rules are cleared and most verifications
are done before the rebalance is reenabled again.
Closes#38090
Folks at the Lucene project do not seem to be interested in classifying corruptions and
distinguishing them from file-system exceptions (see https://issues.apache.org/jira/browse/LUCENE-8525),
so we'll just cop out as well.
Closes#34322
With #37000 we made sure that fnial reduction is automatically disabled
whenever a localClusterAlias is provided with a SearchRequest.
While working on #37838, we found a scenario where we do need to set a
localClusterAlias yet we would like to perform a final reduction in the
remote cluster: when searching on a single remote cluster.
Relates to #32125
This commit adds support for a separate finalReduce flag to
SearchRequest and makes use of it in TransportSearchAction in case we
are searching against a single remote cluster.
This also makes sure that num_reduce_phases is correct when searching
against a single remote cluster: it makes little sense to return
`num_reduce_phases` set to `2`, which looks especially weird in case
the search was performed against a single remote shard. We should
perform one reduction phase only in this case and `num_reduce_phases`
should reflect that.
* line length
This change forbids negative field boost in the `query_string`, `simple_query_string`
and `multi_match` queries.
Negative boosts are not allowed in Lucene 8 (scores must be positive).
The backport of this change to 6x will turn the error into a deprecation warning
in order to raise the awareness of this breaking change in 7.0.
Closes#33309
Currently, there are a few tests that use autoMinMasterNodes=false and
hence override addExtraClusterBootstrapSettings, mostly this is 10-30
lines of codes that are copy-pasted from class to class.
This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex`
which is suitable for all classes and copy-paste could be removed.
Removing code is always a good thing!
The terms aggregator loads the global ordinals to retrieve the cardinality of the field to aggregate on. This information is then used to select the strategy to use for the aggregation (breadth_first or depth_first). However this should be avoided if the execution_hint is explicitly set to map since this mode doesn't really need the global ordinals. Since we still need the cardinality of the field this change picks the maximum cardinality in the segments as an estimation of the total cardinality to select the strategy to use (breadth_first or depth_first). This estimation is only used if the execution hint is set to map, otherwise the global ordinals are still used to retrieve the accurate cardinality.
Closes#37705
Today we use `AbstractDisruptionTestCase` to test the behaviour of things like
master elections in the presence of cluster disruptions. These tests have
rather enthusiastic fault detection settings, detecting a fault if a single
ping fails, with a one-second timeout. Furthermore there are some tests that
assert the identity of the master remains unchanged during some disruption, and
these assertions fail rather often thanks to the overly sensitive fault
detector.
However in a number of these tests the fault detector need not be this
sensitive. This commit moves some such tests into their own test suite and uses
more sensible fault-detection settings to avoid the kind of master instability
that is causing CI failures.
Closes#37699
The self written epoch date formatters were not properly able to format
an Instant to a string due to a misconfiguration.
This fix also removes a until now existing runtime behaviour under java
8 regarding the names of the aggregation buckets, which are now the same
as before and have been under java 11.
The '{' as a first character in log line is causing problems for beats when parsing plaintext logs. This can happen if the submitted document has an additional '\n' at the beginning and we are not reformatting.
Trimming the source part of a SlogLog solves that and keeps the logs readable.
closes#38080
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes#32348
Since #31140 we no longer require acking on the dynamic mapping of index
requests. Thus, a returned mapping from a get mapping request does not
necessarily contain the dynamic updates from the index request. This
commit replaces the dynamic mapping update with a manual put mapping.
Relates #31140Closes#37928
This changes the test to not use a `CountDownlatch`, instead adding an assertion
for the final logging message and waiting until the `MockAppender` has seen it
before proceeding.
Related to df2c06f6f30f7e23a6863a3f72fc3bdb7648885c
Resolves#23739
Implements `geotile_grid` aggregation
This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240
This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.
If a replica does not have a right mapping yet, we will retry the index
request on that replica; then the actual tasks is higher than the
expected tasks. Since #31140 this happens more frequently for we no
longer require acking on the dynamic mapping of index requests.
Relates #31140Closes#37893
When extending ESIntegTestCase are run on the same jvm, the static field in
NodeAndClusterIdConverter will throw an AlreadySet exceptions.
overriding the configuration method from Node.configureNodeAndClusterIdStateListener in the MockNode will prevent the listener registration from happening
relates #32850
If a new retention lease is added while a primary's soft-deletes policy
is locked for peer-recovery, that lease won't be baked into the Lucene
commit.
Relates #37165
Relates #37375
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes#37708.
This commit adapts the version used in StartedShardEntry serialization
after the backport of #37899 and reenables bwc tests.
Related to #37899
Related to #38074
This commit pushes the primary term into the replication tracker. This
is a precursor to using the primary term to resolving ordering problems
for retention leases. Namely, it can be that out-of-order retention
lease sync requests arrive on a replica. To resolve this, we need a
tuple of (primary term, version). For this to be, the primary term needs
to be accessible in the replication tracker. As the primary term is part
of the replication group anyway, this change conceptually makes sense.
With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests, that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested to each cluster, and the reduce phase in each cluster is non final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.
This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API accepts now a new parameter called ccs_minimize_roundtrips that allows to opt-out of the default behaviour.
Relates to #32125
This reduces objects creations in the rounding class (used by aggs) by properly
creating the objects only once. Furthermore a few unneeded ZonedDateTime objects
were created in order to create other objects out of them. This was
changed as well.
Running the benchmarks shows a much faster performance for all of the
java time based Rounding classes.
Currently the put-mapping API assumes that because the type name is `_doc` then
it is dealing with a typeless put-mapping call. Yet we still allow running the
put-mapping API in a typed fashion with `_doc` as a type name. The current logic
triggers surprising errors when doing a typed put-mapping call with `_doc` as a
type name on an index that has a type already.
This is a bit of a corner-case, but is more important on 6.x due to the fact
that using the index API with `_doc` as a type name triggers typed calls to the
put-mapping API with `_doc` as a type name.
Zen2 nodes will bootstrap themselves once they believe there to be no remaining
Zen1 master-eligible nodes in the cluster, as long as minimum_master_nodes is
satisfied.
Today the bootstrap configuration comprises just the ids of the known
master-eligible nodes, and this might be too small to be safe. For instance, if
there are 5 master-eligible nodes (so that minimum_master_nodes is 3) then the
bootstrap configuration could comprise just 3 nodes, of which 2 form a quorum,
and this does not intersect other quorums that might arise, leading to a
split-brain.
This commit fixes this by expanding the bootstrap configuration so that its
quorums satisfy minimum_master_nodes, by adding some of the IDs of the other
master-eligible nodes in the last-published cluster state.
The existing implementation was slow due to exceptions being thrown if
an accessor did not have a time zone. This implementation queries for
having a timezone, local time and local date and also checks for an
instant preventing to throw an exception and thus speeding up the conversion.
This removes the existing method and create a new one named
DateFormatters.from(TemporalAccessor accessor) to resemble the naming of
the java time ones.
Before this change an epoch millis parser using the toZonedDateTime
method took approximately 50x longer.
Relates #37826