Move 'lost cluster state updates' issue to DONE (#36959)

Relates #34714.
This commit is contained in:
David Turner 2018-12-24 09:29:08 +00:00 committed by GitHub
parent 561b704129
commit bd41150338
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios
framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
During a networking partition, cluster state updates (like mapping changes or shard assignments)
are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access
to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected
which has not yet catch up. If that happens, cluster state updates can be lost.
This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master
election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition
happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
that the in flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client this
will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet.
[float]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
@ -170,6 +154,33 @@ shard.
== Completed
[float]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)
During a networking partition, cluster state updates (like mapping changes or
shard assignments) are committed if a majority of the master-eligible nodes
received the update correctly. This means that the current master has access to
enough nodes in the cluster to continue to operate correctly. When the network
partition heals, the isolated nodes catch up with the current state and receive
the previously missed changes. However, if a second partition happens while the
cluster is still recovering from the previous one *and* the old master falls on
the minority side, it may be that a new master is elected which has not yet
catch up. If that happens, cluster state updates can be lost.
This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
committed cluster state updates into account during master election. This
considerably reduces the chance of this rare problem occurring but does not
fully mitigate it. If the second partition happens concurrently with a cluster
state update and blocks the cluster state commit message from reaching a
majority of nodes, it may be that the in flight update will be lost. If the
now-isolated master can still acknowledge the cluster state update to the client
this will amount to the loss of an acknowledged change.
Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
sub-issues. See particularly {GIT}32171[#32171] and
https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the
TLA+ formal model] used to verify these changes.
[float]
=== Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, V6.3.0)