From 4ea83740a2eb0d83561d89c64d35674b40fe7e92 Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Sun, 14 Feb 2016 14:50:44 +0100 Subject: [PATCH] Updates to resiliency documentation Closes #16658 --- docs/resiliency/index.asciidoc | 89 ++++++++++++++++++++++------------ 1 file changed, 59 insertions(+), 30 deletions(-) diff --git a/docs/resiliency/index.asciidoc b/docs/resiliency/index.asciidoc index 3929bc7c8a8..ede2ea97c87 100644 --- a/docs/resiliency/index.asciidoc +++ b/docs/resiliency/index.asciidoc @@ -55,29 +55,6 @@ If you encounter an issue, https://github.com/elasticsearch/elasticsearch/issues We are committed to tracking down and fixing all the issues that are posted. -[float] -=== Use two phase commit for Cluster State publishing (STATUS: ONGOING, v3.0.0) - -A master node in Elasticsearch continuously https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection[monitors the cluster nodes] -and removes any node from the cluster that doesn't respond to its pings in a timely -fashion. If the master is left with fewer nodes than the `discovery.zen.minimum_master_nodes` -settings, it will step down and a new master election will start. - -When a network partition causes a master node to lose many followers, there is a short window -in time until the node loss is detected and the master steps down. During that window, the -master may erroneously accept and acknowledge cluster state changes. To avoid this, we introduce -a new phase to cluster state publishing where the proposed cluster state is sent to all nodes -but is not yet committed. Only once enough nodes (`discovery.zen.minimum_master_nodes`) actively acknowledge -the change, it is committed and commit messages are sent to the nodes. See {GIT}13062[#13062]. - -[float] -=== Make index creation more user friendly (STATUS: ONGOING) - -Today, Elasticsearch returns as soon as a create-index request has been processed, -but before the shards are allocated. Users should wait for a `green` cluster health -before continuing, but we can make this easier for users by waiting for a quorum -of shards to be allocated before returning. See {GIT}9126[#9126] - [float] === Better request retry mechanism when nodes are disconnected (STATUS: ONGOING) @@ -113,15 +90,37 @@ space. The following issues have been identified: * Set a hard limit on `from`/`size` parameters {GIT}9311[#9311]. (STATUS: DONE, v2.1.0) * Prevent combinatorial explosion in aggregations from causing OOM {GIT}8081[#8081]. (STATUS: ONGOING) * Add the byte size of each hit to the request circuit breaker {GIT}9310[#9310]. (STATUS: ONGOING) +* Limit the size of individual requests and also add a circuit breaker for the total memory used by in-flight request objects {GIT}16011[#16011]. (STATUS: ONGOING) + +Other safeguards are tracked in the meta-issue {GIT}11511[#11511]. [float] === Loss of documents during network partition (STATUS: ONGOING) If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults). -If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in {GIT}2488[split-brain] before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master. {GIT}7572[#7572] +If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in {GIT}2488[split-brain] before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master ({GIT}7572[#7572]). +To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. {GIT}14252[#14252] -A test to replicate this condition was added in {GIT}7493[#7493]. +[float] +=== Safe primary relocations (STATUS: ONGOING) + +When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As +cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the +cluster state and believes to be the active primary and the relocation source has not yet processed the cluster state update and still +believes itself to be the active primary. This means that an index request that gets routed to the new primary does not get replicated to +the old primary (as it has been deactivated from point of view of the new primary). If a subsequent read request gets routed to the old +primary, it cannot see the indexed document. {GIT}15900[#15900] + +In the reverse situation where a cluster state update that completes primary relocation is first applied on the relocation source and then +on the relocation target, each of the nodes believes the other to be the active primary. This leads to the issue of indexing requests +chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. {GIT}12573[#12573] + +[float] +=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING) + +Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node +while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719] [float] === Jepsen Test Failures (STATUS: ONGOING) @@ -134,12 +133,42 @@ We have increased our test coverage to include scenarios tested by Jepsen. We ma This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch, and what happens in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test will document the expected results, the associated test code and an explicit PASS or FAIL status for each simulated case. [float] -=== Do not allow stale shards to automatically be promoted to primary (STATUS: ONGOING) +=== Do not allow stale shards to automatically be promoted to primary (STATUS: ONGOING, v3.0.0) -In some scenarios, after the loss of all valid copies, a stale replica shard can be assigned as a primary. This can lead to -a loss of acknowledged writes if the valid copies are not lost but are rather temporarily isolated. Work is underway -({GIT}14671[#14671]) to prevent the automatic promotion of a stale primary and only allow such promotion to occur when -a system operator manually intervenes. +In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data +to no data at all ({GIT}14671[#14671]). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather +temporarily unavailable. Allocation IDs ({GIT}14739[#14739]) solve this issue by tracking non-stale shard copies in the cluster and using +this tracking information to allocate primary shards. When all shard copies are lost or only stale ones available, Elasticsearch will wait +for one of the good shard copies to reappear. In case where all good copies are lost, a manual override command can be used to allocate a +stale shard copy. + +[float] +=== Make index creation resilient to index closing and full cluster crashes (STATUS: ONGOING, v3.0.0) + +Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that +a primary cannot be assigned if the cluster dies before enough shards have been allocated ({GIT}9126[#9126]). The same happens if an index +is closed before enough shard copies were started, making it impossible to reopen the index ({GIT}15281[#15281]). +Allocation IDs ({GIT}14739[#14739]) solve this issue by tracking allocated shard copies in the cluster. This makes it possible to safely +recover an index in the presence of a single shard copy. Allocation IDs can also distinguish the situation where an index has been created +but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh +shard will be allocated upon reopening the index. + +== Unreleased + +[float] +=== Use two phase commit for Cluster State publishing (STATUS: UNRELEASED, v3.0.0) + +A master node in Elasticsearch continuously https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection[monitors the cluster nodes] +and removes any node from the cluster that doesn't respond to its pings in a timely +fashion. If the master is left with fewer nodes than the `discovery.zen.minimum_master_nodes` +settings, it will step down and a new master election will start. + +When a network partition causes a master node to lose many followers, there is a short window +in time until the node loss is detected and the master steps down. During that window, the +master may erroneously accept and acknowledge cluster state changes. To avoid this, we introduce +a new phase to cluster state publishing where the proposed cluster state is sent to all nodes +but is not yet committed. Only once enough nodes (`discovery.zen.minimum_master_nodes`) actively acknowledge +the change, it is committed and commit messages are sent to the nodes. See {GIT}13062[#13062]. == Completed