mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-03-25 09:28:27 +00:00
Move 'lost cluster state updates' issue to DONE (#36959)
Relates #34714.
parent 561b704129
commit bd41150338
@@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios
 framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
 all new scenarios and will report issues that we find on this page and in our GitHub repository.
 
-[float]
-=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
-
-During a networking partition, cluster state updates (like mapping changes or shard assignments)
-are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access
-to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
-up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster
-is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected
-which has not yet catch up. If that happens, cluster state updates can be lost.
-
-This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master
-election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition
-happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
-that the in flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client this
-will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet.
-
 [float]
 === Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
 
@@ -170,6 +154,33 @@ shard.
 
 == Completed
 
+[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)
+
+During a networking partition, cluster state updates (like mapping changes or
+shard assignments) are committed if a majority of the master-eligible nodes
+received the update correctly. This means that the current master has access to
+enough nodes in the cluster to continue to operate correctly. When the network
+partition heals, the isolated nodes catch up with the current state and receive
+the previously missed changes. However, if a second partition happens while the
+cluster is still recovering from the previous one *and* the old master falls on
+the minority side, it may be that a new master is elected which has not yet
+catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
+committed cluster state updates into account during master election. This
+considerably reduces the chance of this rare problem occurring but does not
+fully mitigate it. If the second partition happens concurrently with a cluster
+state update and blocks the cluster state commit message from reaching a
+majority of nodes, it may be that the in flight update will be lost. If the
+now-isolated master can still acknowledge the cluster state update to the client
+this will amount to the loss of an acknowledged change.
+
+Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
+sub-issues. See particularly {GIT}32171[#32171] and
+https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the
+TLA+ formal model] used to verify these changes.
+
 [float]
 === Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, V6.3.0)
 