We are committed to tracking down and fixing all the issues that are posted.

[float]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

If the node holding a primary shard is disconnected for whatever reason, the
coordinating node retries the request on the same or a new primary shard. In
certain rare conditions, where the node disconnects and immediately
reconnects, it is possible that the original request has already been
successfully applied but has not been reported, resulting in duplicate
requests. This is particularly true when retrying bulk requests, where some
actions may have completed and some may not have.

An optimization which disabled the existence check for documents indexed with
auto-generated IDs could result in the creation of duplicate documents. This
optimization has been removed. {GIT}9468[#9468] (STATUS: DONE, v1.5.0)

Further issues remain with the retry mechanism:

* Unversioned index requests could increment the `_version` twice,
obscuring a `created` status.
* Versioned index requests could return a conflict exception, even
though they were applied correctly.
* Update requests could be applied twice.

See {GIT}9967[#9967]. (STATUS: ONGOING)
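
A minimal sketch, outside Elasticsearch, of why such retries can create duplicates when IDs are auto-generated, and why a client-supplied ID makes the retry idempotent. The `FakeShard` class is purely illustrative.

[source,python]
----
import uuid

class FakeShard:
    """Toy stand-in for a primary shard: a dict of document ID -> document."""
    def __init__(self):
        self.docs = {}

    def index(self, doc, doc_id=None):
        # Auto-generate an ID when the client does not supply one.
        doc_id = doc_id or uuid.uuid4().hex
        created = doc_id not in self.docs
        self.docs[doc_id] = doc
        return doc_id, created

shard = FakeShard()

# Retry after a lost acknowledgement, letting the server pick the ID:
# the second attempt cannot be recognised as a duplicate.
shard.index({"user": "alice"})
shard.index({"user": "alice"})              # duplicate document
print(len(shard.docs))                      # 2

# Retry with a client-supplied ID: the second attempt lands on the same
# document instead of creating a new one, so the retry is idempotent.
shard.index({"user": "bob"}, doc_id="order-17")
_, created = shard.index({"user": "bob"}, doc_id="order-17")
print(len(shard.docs), created)             # 3 False
----
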
[float]
=== Write index metadata on data nodes where shards allocated (STATUS: ONGOING)

Today, index metadata is written only on nodes that are master-eligible, not on
data-only nodes. This is not a problem when running with multiple master nodes,
as recommended, as the loss of all but one master node is still recoverable.
However, users running with a single master node are at risk of losing
their index metadata if the master fails. Instead, this metadata should
also be written on any node where a shard is allocated. {GIT}8823[#8823]

[float]
=== Better file distribution with multiple data paths (STATUS: ONGOING)

Today, a node configured with multiple data paths distributes writes across
all paths by writing one file to each path in turn. This can mean that the
failure of a single disk corrupts many shards at once. Instead, by allocating
an entire shard to a single data path, the extent of the damage can be limited
to just the shards on that disk. {GIT}9498[#9498]
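
A toy sketch of the two placement strategies described above. The path names and free-space figures are invented for illustration and are not Elasticsearch internals.

[source,python]
----
import itertools

DATA_PATHS = ["/data1", "/data2", "/data3"]          # hypothetical data paths

# Current behaviour (simplified): stripe individual files across all paths,
# so losing one disk can touch files belonging to many shards.
_path_cycle = itertools.cycle(DATA_PATHS)
def place_file(shard_id, file_name):
    return next(_path_cycle)

# Proposed behaviour: pin a whole shard to a single path (here: the one with
# the most free space), so a single disk failure only affects the shards
# that live on that disk.
free_bytes = {"/data1": 500, "/data2": 900, "/data3": 700}   # fake probe
_shard_homes = {}
def place_shard(shard_id):
    if shard_id not in _shard_homes:
        _shard_homes[shard_id] = max(free_bytes, key=free_bytes.get)
    return _shard_homes[shard_id]

for name in ["_0.cfs", "_0.si", "_1.cfs"]:
    print("striped:", place_file("index1[0]", name))   # three different paths
print("pinned:", place_shard("index1[0]"), place_shard("index1[0]"))
----
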
[float]
=== OOM resiliency (STATUS: ONGOING)

The family of circuit breakers has greatly reduced the occurrence of OOM
exceptions, but it is still possible to cause a node to run out of heap
space. The following issues have been identified:

* Set a hard limit on `from`/`size` parameters {GIT}9311[#9311]. (STATUS: ONGOING)
* Prevent combinatorial explosion in aggregations from causing OOM {GIT}8081[#8081]. (STATUS: ONGOING)
* Add the byte size of each hit to the request circuit breaker {GIT}9310[#9310]. (STATUS: ONGOING)
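
A minimal sketch of the accounting idea behind the request circuit breaker item in the list above; the limit and class names are made up for the example.

[source,python]
----
class CircuitBreakingException(Exception):
    pass

class RequestCircuitBreaker:
    """Track estimated bytes for an in-flight request and trip over a limit."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def add_estimate(self, n_bytes):
        # Reject the request up front rather than letting the node hit an OOM.
        if self.used + n_bytes > self.limit:
            raise CircuitBreakingException(
                f"would use {self.used + n_bytes} bytes, limit is {self.limit}")
        self.used += n_bytes

breaker = RequestCircuitBreaker(limit_bytes=1024)
try:
    for hit_size in [400, 400, 400]:        # e.g. the byte size of each hit
        breaker.add_estimate(hit_size)
except CircuitBreakingException as err:
    print("request rejected instead of running out of heap:", err)
----
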
[float]
=== Mapping changes should be applied synchronously (STATUS: ONGOING)

When introducing new fields using dynamic mapping, it is possible that the same
field can be added to different shards with different data types. Each shard
will operate with its local data type but, if the shard is relocated, the
data type from the cluster state will be applied to the new shard, which
can result in a corrupt shard. To prevent this, new fields should not
be added to a shard's mapping until confirmed by the master.
{GIT}8688[#8688] (STATUS: ONGOING)
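
A toy illustration of the divergence described above; the per-shard "mapping" is just a dictionary, not the real cluster-state mechanism.

[source,python]
----
def dynamic_type(value):
    """Very rough stand-in for dynamic mapping type detection."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    return "string"

# Two shards see the same new field for the first time with different JSON
# values and each picks its own local type.
shard_mappings = {0: {}, 1: {}}
shard_mappings[0]["price"] = dynamic_type(10)       # long
shard_mappings[1]["price"] = dynamic_type("10.5")   # string

# When the cluster-state mapping (say, "long") is later applied to the shard
# that indexed strings, the stored data no longer matches the declared type.
cluster_mapping = {"price": "long"}
for shard, mapping in shard_mappings.items():
    if mapping["price"] != cluster_mapping["price"]:
        print(f"shard {shard}: local type {mapping['price']!r} conflicts "
              f"with cluster type {cluster_mapping['price']!r}")
----
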
[float]
=== Loss of documents during network partition (STATUS: ONGOING)

A test to replicate this condition was added in {GIT}7493[#7493].

Certain versions of the JVM are known to have bugs which can cause index corruption. {GIT}7580[#7580] prevents Elasticsearch startup if known bad versions are in use.

[float]
=== Lucene checksums phase 3 (STATUS: ONGOING)

Almost all files in Elasticsearch now have checksums which are validated before use. A few changes remain:

* {GIT}7586[#7586] adds checksums for cluster and index state files. (STATUS: DONE, Fixed in v1.5.0)
* {GIT}9183[#9183] supports validating the checksums on all files when starting a node. (STATUS: DONE, Fixed in v2.0.0)
* {JIRA}5894[LUCENE-5894] lays the groundwork for extending more efficient checksum validation to all files during optimized bulk merges. (STATUS: ONGOING, Fixed in Lucene 5.0)
* {GIT}8403[#8403] to add validation of checksums on Lucene `segments_N` files. (STATUS: NOT STARTED)
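
The general "validate before use" idea behind the items above, using a CRC32 footer on an arbitrary file. Lucene's actual footer format is more involved; this is only a sketch.

[source,python]
----
import struct
import zlib

def write_with_checksum(path, payload):
    # Append a 4-byte CRC32 of the payload as a trailing footer.
    with open(path, "wb") as f:
        f.write(payload)
        f.write(struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF))

def read_verified(path):
    with open(path, "rb") as f:
        blob = f.read()
    payload, footer = blob[:-4], blob[-4:]
    expected = struct.unpack(">I", footer)[0]
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise IOError(f"checksum mismatch in {path}: corrupt or truncated file")
    return payload

write_with_checksum("segment.dat", b"some index data")
print(read_verified("segment.dat"))   # passes; flipping any byte would raise
----
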
[float]
=== Add per-segment and per-commit ID to help replication (STATUS: ONGOING)

{JIRA}5895[LUCENE-5895] adds a unique ID for each segment and each commit point. File-based replication (as performed by snapshot/restore) can use this ID to know whether the segment/commit on the source and destination machines are the same. Fixed in Lucene 5.0.

[float]
=== Report shard-level statuses on write operations (STATUS: ONGOING)

Make write calls return the number of total/successful/missing shards in the same way that we do in search, which ensures transparency in the consistency of write operations. {GIT}7994[#7994]. (STATUS: DONE, v2.0.0)
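
An illustrative sketch of the kind of per-shard summary described above, mirroring what search responses already report; the JSON values shown are invented for the example.

[source,python]
----
# Shape of the per-shard summary on a write response (illustrative values).
index_response = {
    "_index": "tweets",
    "_id": "1",
    "_shards": {"total": 3, "successful": 2, "failed": 1},
}

shards = index_response["_shards"]
if shards["successful"] < shards["total"]:
    # The write is acknowledged, but not every shard copy has applied it yet;
    # callers can see that instead of assuming all copies are consistent.
    print(f"only {shards['successful']}/{shards['total']} shard copies "
          f"confirmed the write")
----
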
[float]
=== Simplify and harden shard recovery and allocation (STATUS: ONGOING)

Randomized testing combined with chaotic failures has revealed corner cases
where the recovery and allocation of shards in a concurrent manner can result
in shard corruption. There is an ongoing effort to reduce the complexity of
these operations in order to make them more deterministic. These include:

* Introduce shard level locks to prevent concurrent shard modifications {GIT}8436[#8436]. (STATUS: DONE, Fixed in v1.5.0)
* Delete shard contents under a lock {GIT}9083[#9083]. (STATUS: DONE, Fixed in v1.5.0)
* Delete shard under a lock {GIT}8579[#8579]. (STATUS: DONE, Fixed in v1.5.0)
* Refactor RecoveryTarget state management {GIT}8092[#8092]. (STATUS: DONE, Fixed in v1.5.0)
* Cancelling a recovery may leave temporary files behind {GIT}7893[#7893]. (STATUS: DONE, Fixed in v1.5.0)
* Quick cluster state processing can result in both shard copies being deleted {GIT}9503[#9503]. (STATUS: DONE, Fixed in v1.5.0)
* Rapid creation and deletion of an index can cause reuse of old index metadata {GIT}9489[#9489]. (STATUS: DONE, Fixed in v1.5.0)
* Flush immediately after the last concurrent recovery finishes to clear out the translog before a new recovery starts {GIT}9439[#9439]. (STATUS: DONE, Fixed in v1.5.0)
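
A minimal sketch of the "shard level locks" idea from the list above, using a plain in-process lock per shard ID; this is not the actual implementation.

[source,python]
----
import threading
from collections import defaultdict

class ShardLockRegistry:
    """Hand out one lock per shard ID so that operations which modify a
    shard's files (delete, recover, mutate) cannot interleave."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, shard_id):
        with self._guard:            # protect the registry itself
            return self._locks[shard_id]

registry = ShardLockRegistry()

def delete_shard_contents(shard_id):
    with registry.lock_for(shard_id):    # concurrent modifications must wait
        print(f"deleting contents of {shard_id} under its lock")

def recover_shard(shard_id):
    with registry.lock_for(shard_id):
        print(f"recovering {shard_id} under its lock")

delete_shard_contents("index1[0]")
recover_shard("index1[0]")
----
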
[float]
=== Prevent setting minimum_master_nodes to more than the current node count (STATUS: ONGOING)

Setting `discovery.zen.minimum_master_nodes` to a value higher than the current node count
effectively leaves the cluster without a master and unable to process requests. The only
way to fix this is to add more master-eligible nodes. {GIT}8321[#8321] adds a mechanism
to validate settings before applying them, and {GIT}9051[#9051] extends this validation
support to settings applied during a cluster restore. (STATUS: DONE, Fixed in v1.5.0)
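
A hedged sketch of the kind of up-front validation described above; the function and the quorum arithmetic are illustrative, not the actual settings infrastructure.

[source,python]
----
def validate_minimum_master_nodes(requested, master_eligible_nodes):
    """Reject a minimum_master_nodes value the current cluster cannot satisfy."""
    if requested > master_eligible_nodes:
        raise ValueError(
            f"minimum_master_nodes={requested} but only "
            f"{master_eligible_nodes} master-eligible nodes are in the cluster")
    return requested

# The usual safe value is a strict majority of the master-eligible nodes.
master_eligible = 3
quorum = master_eligible // 2 + 1                               # 2
print(validate_minimum_master_nodes(quorum, master_eligible))   # accepted: 2
try:
    validate_minimum_master_nodes(4, master_eligible)           # rejected up front
except ValueError as err:
    print("rejected:", err)
----
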
[float]
=== Jepsen Test Failures (STATUS: ONGOING)

Commonly used filters are cached in Elasticsearch. That cache is limited in size (10% of the node's memory by default) and is being evicted based on a least recently used policy. The amount of memory used by the cache depends on two primary components - the values it stores and the keys associated with them. Calculating the memory footprint of the values is easy enough but the keys accounting is trickier to achieve as they are, by default, raw Lucene objects. This is largely not a problem as the keys are dominated by the values. However, recent optimizations in Lucene have changed the balance causing the filter cache to grow beyond its size.

While we are working on a longer term solution ({GIT}9176[#9176]), we introduced a minimum weight of 1k for each cache entry. This puts an effective limit on the number of entries in the cache. See {GIT}8304[#8304] (STATUS: DONE, fixed in v1.4.0)
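
A back-of-the-envelope sketch of how a 1k minimum weight bounds the number of cache entries; the heap size is an example, not a recommendation.

[source,python]
----
heap_bytes = 4 * 1024**3                  # example: a 4 GB heap
cache_budget = int(heap_bytes * 0.10)     # default budget: 10% of the heap
MIN_ENTRY_WEIGHT = 1024                   # the 1k floor per cache entry

def entry_weight(value_bytes):
    # Key accounting is hard, so charge at least 1k per entry; very cheap
    # entries can no longer accumulate in unbounded numbers.
    return max(value_bytes, MIN_ENTRY_WEIGHT)

max_entries = cache_budget // MIN_ENTRY_WEIGHT
print(f"cache budget {cache_budget} bytes -> at most {max_entries} entries")
print(entry_weight(16), entry_weight(50_000))   # 1024 50000
----
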
== Completed

[float]
=== Make recovery be more resilient to partial network partitions (STATUS: DONE, v1.5.0)

When a node is experiencing network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped and a new location is found for the relevant shards. However, in the case of a partial network partition, where there are connectivity issues between the source and target nodes of a recovery but not between those nodes and the current master, things may go wrong. While the nodes successfully restore the connection, the ongoing recoveries may have encountered issues. In {GIT}8720[#8720], we added test simulations for these and solved several issues that were flagged by them.

[float]
=== Validate quorum before accepting a write request (STATUS: DONE)

Today, when a node holding a primary shard receives an index request, it checks the local cluster state to see whether a quorum of shards is available before it accepts the request. However, it can take some time before an unresponsive node is removed from the cluster state. We are adding an optional live check, where the primary node tries to contact its replicas to confirm that they are still responding before accepting any changes. See {GIT}6937[#6937].

While the work is going on, we tightened the current checks by bringing them closer to the index code. See {GIT}7873[#7873] (STATUS: DONE, fixed in v1.4.0)
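
A rough sketch of the quorum arithmetic involved: a write needs a majority of the shard copies (primary plus replicas) to be available. The function names are illustrative.

[source,python]
----
def write_quorum(number_of_replicas):
    """Majority of all copies of a shard (1 primary + its replicas)."""
    copies = 1 + number_of_replicas
    return copies // 2 + 1

def accept_write(live_copies, number_of_replicas):
    # The stricter variant described above would also ping the replicas
    # instead of trusting a possibly stale cluster state.
    return live_copies >= write_quorum(number_of_replicas)

print(write_quorum(2))        # 2 of 3 copies are needed
print(accept_write(1, 2))     # False: reject rather than index into one copy
print(accept_write(2, 2))     # True
----
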
[float]
=== Improving Zen Discovery (STATUS: DONE, v1.4.0.Beta1)

Recovery from failure is a complicated process, especially in an asynchronous distributed system like Elasticsearch. With several processes happening in parallel, it is important to ensure that recovery proceeds swiftly and safely. While fixing the {GIT}2488[split-brain issue] we have been hunting down corner cases that were not handled optimally, adding tests to demonstrate the issues, and working on fixes:

* Faster & better detection of master & node failures, including not trying to reconnect upon disconnect, failing on disconnect errors during pings, and verifying cluster names in pings. Previously, Elasticsearch had to wait a bit for the node to complete the process required to join the cluster. Recent changes guarantee that a node has fully joined the cluster before we start the fault detection process. Therefore we can do an immediate check, causing faster detection of errors and validation of cluster state after a minimum master node breach. {GIT}6706[#6706], {GIT}7399[#7399] (STATUS: DONE, v1.4.0.Beta1)
* Broaden Unicast pinging when master fails: When a node loses its current master it will start pinging to find a new one. Previously, when using unicast based pinging, the node would ping a set of predefined nodes asking them whether the master had really disappeared or whether there was a network hiccup. Now, we ping all nodes in the cluster to increase coverage. In the case that all unicast hosts are disconnected from the current master during a network failure, this improvement is essential to allow the cluster to reform once the partition is healed. {GIT}7336[#7336] (STATUS: DONE, v1.4.0.Beta1)
* After joining a cluster, validate that the join was successful and that the master has been set in the local cluster state. {GIT}6969[#6969]. (STATUS: DONE, v1.4.0.Beta1)
* Write additional tests that use the test infrastructure to verify proper behavior during network disconnections and garbage collections. {GIT}7082[#7082] (STATUS: DONE, v1.4.0.Beta1)

[float]
=== Lucene checksums phase 2 (STATUS: DONE, v1.4.0.Beta1)

When Lucene opens a segment for reading, it validates the checksum on the smaller segment files -- those which it reads entirely into memory -- but not the large files like term frequencies and positions, as this would be very expensive. During merges, term vectors and stored fields are validated, as long as the segments being merged come from the same version of Lucene. Checksumming for term vectors and stored fields is important because merging consists of performing optimized byte copies. Term frequencies, term positions, payloads, doc values, and norms are currently not checked during merges, although Lucene provides the option to do so. These files are less prone to silent corruption as they are actively decoded during merge, and so are more likely to throw exceptions if there is any corruption.

The following changes have been made:

* {GIT}7360[#7360] validates checksums on all segment files during merges. (STATUS: DONE, fixed in v1.4.0.Beta1)
* {JIRA}5842[LUCENE-5842] validates the structure of the checksum footer of the postings lists, doc values, stored fields and term vectors when opening a new segment, to ensure that these files have not been truncated. (STATUS: DONE, Fixed in Lucene 4.10 and v1.4.0.Beta1)
* {GIT}8407[#8407] validates Lucene checksums for legacy files. (STATUS: DONE; Fixed in v1.3.6)

[float]
=== Don't allow unsupported codecs (STATUS: DONE, v1.4.0.Beta1)

[float]
=== Index corruption when upgrading Lucene 3.x indices (STATUS: DONE, v1.4.0.Beta1)

Upgrading indices created with Lucene 3.x (Elasticsearch v0.20 and before) to Lucene 4.7 - 4.9 (Elasticsearch v1.1.0 to v1.3.x) could result in index corruption. {JIRA}5907[LUCENE-5907] fixes this issue in Lucene 4.10.

[float]
=== Improve error handling when deleting files (STATUS: DONE, v1.4.0.Beta1)