Update resiliency docs (#19303)

Adds clarifications about Jepsen tests and new section on issues with versioning.
Yannick Welsch 2016-07-08 17:30:46 +02:00 committed by GitHub
parent 982e01d463
commit 7dff8fbb1d
1 changed files with 24 additions and 9 deletions

@@ -59,9 +59,9 @@ We are committed to tracking down and fixing all the issues that are posted.
==== Jepsen Tests
The Jepsen platform is specifically designed to test distributed systems. It is not a single test and is regularly adapted
to create new scenarios. We have ported all published scenarios to our testing infrastructure. Of course
as the system evolves, new scenarios can come up that are not yet covered. We are committed to investigating all new scenarios and will
report issues that we find on this page and in our GitHub repository.
to create new scenarios. We have currently ported all published Jepsen scenarios that deal with loss of acknowledged writes to our testing
framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
all new scenarios and will report issues that we find on this page and in our GitHub repository.
[float]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
@@ -102,6 +102,19 @@ space. The following issues have been identified:
Other safeguards are tracked in the meta-issue {GIT}11511[#11511].
[float]
=== The _version field may not uniquely identify document content during a network partition (STATUS: ONGOING)
When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue
indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is
partitioned away. It won't acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed.
The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to
step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from
the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted
writes for the same document (see {GIT}19269[#19269]).
We are currently implementing sequence numbers ({GIT}10708[#10708]), which better track primary changes. Sequence numbers thus provide a basis
for uniquely identifying writes even in the presence of network partitions and will replace `_version` in operations that require this.
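The collision described above can be sketched with a toy model. This is an illustrative simulation, not Elasticsearch code: the `Primary` class, its `index` method, and the `(term, seq_no)` pair are hypothetical names made up for this example. The only behavior taken from the text is that a primary bumps a document's version locally before it learns that it has been partitioned away. The idea that a counter changes whenever a new primary is chosen is an assumption used here to show why tracking primary changes helps.

```python
class Primary:
    """Toy stand-in for a primary shard copy (hypothetical, not an Elasticsearch class)."""

    def __init__(self, term, docs=None, seq_no=0):
        self.term = term      # assumed counter that changes with every new primary
        self.seq_no = seq_no  # assumed per-shard operation counter
        # doc id -> (version, content, (term, seq_no))
        self.docs = dict(docs or {})

    def index(self, doc_id, content):
        # The version is derived only from local state, as in the scenario above.
        version = self.docs.get(doc_id, (0, None, None))[0] + 1
        self.seq_no += 1
        self.docs[doc_id] = (version, content, (self.term, self.seq_no))
        return self.docs[doc_id]


# One write is indexed and replicated before the partition.
old_primary = Primary(term=1)
old_primary.index("doc", "a")

# The partition happens; a new primary is chosen from a copy with the same state.
new_primary = Primary(term=2, docs=old_primary.docs, seq_no=old_primary.seq_no)

# Both sides now accept a write for the same document.
stale = old_primary.index("doc", "written on the isolated primary")
fresh = new_primary.index("doc", "written on the new primary")

same_version = stale[0] == fresh[0]  # both sides computed the same version
same_content = stale[1] == fresh[1]  # but the content differs
same_id = stale[2] == fresh[2]       # the (term, seq_no) pair tells them apart
```

In this model a client reading from either side sees the same version number for different content, while the pair that records which primary performed the write stays distinct.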
[float]
=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)
@ -119,20 +132,22 @@ in the case of each type of failure. The plan is to have a test case that valida
[float]
=== Run Jepsen (STATUS: ONGOING)
We have ported all of the known scenarios in the Jepsen blogs to our testing infrastructure. The new tests are run continuously in our
testing farm and are passing. We are also working on running Jepsen independently to verify that no failures are found.
We have ported the known scenarios in the Jepsen blogs that check loss of acknowledged writes to our testing infrastructure.
The new tests are run continuously in our testing farm and are passing. We are also working on running Jepsen independently to verify
that no failures are found.
== Unreleased
[float]
=== Port Jepsen tests to our testing framework (STATUS: UNRELEASED, V5.0.0)
=== Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: UNRELEASED, V5.0.0)
We have increased our test coverage to include scenarios tested by Jepsen, as described in the Elasticsearch related blogs. We make heavy
use of randomization to expand on the scenarios that can be tested and to introduce new error conditions.
We have increased our test coverage to include scenarios tested by Jepsen that demonstrate loss of acknowledged writes, as described in
the Elasticsearch related blogs. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce
new error conditions.
You can follow the work on the master branch of the
https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class],
where the `testAckedIndexing` test was specifically added to cover known Jepsen related scenarios.
where the `testAckedIndexing` test was specifically added to check that we don't lose acknowledged writes in various failure scenarios.
[float]