From f4a432e456a0e8e9ab55697c198a3e44d2ebcd52 Mon Sep 17 00:00:00 2001
From: Jason Tedor
Date: Tue, 7 Mar 2017 14:25:23 -0500
Subject: [PATCH] Add note regarding out-of-sync replicas

This commit adds a note to the resiliency status page regarding the fact
that replicas can fall out of sync with the primary shard after primary
promotion occurs due to a failing primary shard.

Relates #23503
---
 docs/resiliency/index.asciidoc | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/docs/resiliency/index.asciidoc b/docs/resiliency/index.asciidoc
index 0ded9530e0e..458e3b89fe4 100644
--- a/docs/resiliency/index.asciidoc
+++ b/docs/resiliency/index.asciidoc
@@ -152,6 +152,21 @@ We have ported the known scenarios in the Jepsen blogs that check loss of acknow
 The new tests are run continuously in our testing farm and are passing. We are also working on running Jepsen independently to verify that no failures are found.

+[float]
+=== Replicas can fall out of sync when a primary shard fails (STATUS: ONGOING)
+
+When a primary shard fails, a replica shard will be promoted to be the
+primary shard. If there is more than one replica shard, it is possible
+for the remaining replicas to be out of sync with the new primary
+shard. This is caused by operations that were in-flight when the primary
+shard failed and may not have been processed on all replica
+shards. Currently, the discrepancies are not repaired on primary
+promotion but instead would be repaired if replica shards are relocated
+(e.g., from hot to cold nodes); this does mean that the length of time
+which replicas can be out of sync with the primary shard is
+unbounded. Sequence numbers {GIT}10708[#10708] will provide a mechanism
+for syncing the remaining replicas with the newly-promoted primary
+shard.

 == Completed