diff --git a/docs/reference/modules/discovery/zen.asciidoc b/docs/reference/modules/discovery/zen.asciidoc
index fa5ad6ac282..5e1d4651e04 100644
--- a/docs/reference/modules/discovery/zen.asciidoc
+++ b/docs/reference/modules/discovery/zen.asciidoc
@@ -108,12 +108,18 @@ considered failed. Defaults to `3`.
 The master node is the only node in a cluster that can make changes to the
 cluster state. The master node processes one cluster state update at a time,
 applies the required changes and publishes the updated cluster state to all
-the other nodes in the cluster. Each node receives the publish message,
-updates its own cluster state and replies to the master node, which waits for
-all nodes to respond, up to a timeout, before going ahead processing the next
-updates in the queue. The `discovery.zen.publish_timeout` is set by default
-to 30 seconds and can be changed dynamically through the
-<>
+the other nodes in the cluster. Each node receives the publish message and acknowledges
+it, but does *not* yet apply it. If the master does not receive acknowledgement from
+at least `discovery.zen.minimum_master_nodes` nodes within a certain time (controlled by
+the `discovery.zen.commit_timeout` setting, which defaults to 30 seconds), the cluster state
+change is rejected.
+
+Once enough nodes have responded, the cluster state is committed and a message is
+sent to all the nodes. The nodes then proceed to apply the new cluster state to their
+internal state. The master node waits for all nodes to respond, up to a timeout, before
+going on to process the next updates in the queue. The `discovery.zen.publish_timeout` is
+set by default to 30 seconds and is measured from the moment the publishing started. Both
+timeout settings can be changed dynamically through the <>
 
 [float]
 [[no-master-block]]
diff --git a/docs/resiliency/index.asciidoc b/docs/resiliency/index.asciidoc
index 2a055611935..2783447c51f 100644
--- a/docs/resiliency/index.asciidoc
+++ b/docs/resiliency/index.asciidoc
@@ -55,6 +55,21 @@ If you encounter an issue, https://github.com/elasticsearch/elasticsearch/issues
 
 We are committed to tracking down and fixing all the issues that are posted.
 
+[float]
+=== Use two phase commit for Cluster State publishing (STATUS: ONGOING)
+
+A master node in Elasticsearch continuously https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection[monitors the cluster nodes]
+and removes any node from the cluster that doesn't respond to its pings in a timely
+fashion. If the master is left with fewer nodes than the `discovery.zen.minimum_master_nodes`
+setting, it will step down and a new master election will start.
+
+When a network partition occurs, causing a master to lose many followers, there is a
+short window of time until the master detects it and steps down. During that window, the
+master may erroneously accept and ack cluster state changes. To avoid this, we introduce
+a new phase to cluster state publishing where the proposed cluster state is sent to all nodes
+but is not yet committed. The change is committed, and commit messages are sent to the nodes,
+only once enough nodes (`minimum_master_nodes`) actively acknowledge it. See {GIT}13062[#13062].
+
 [float]
 === Make index creation more user friendly (STATUS: ONGOING)
 
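
Note on the documented flow: the two-phase publishing described in the added text can be pictured with a small model. The sketch below is illustrative only and is not taken from the Elasticsearch code base. The `Node` class, its `reachable` flag, and the `publish` function are hypothetical names. Timeouts are left out, and every node in the list is assumed to count toward `minimum_master_nodes`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical follower node; `reachable` models a network partition."""
    name: str
    reachable: bool = True
    pending: Optional[int] = None   # state version received but not yet applied
    applied: Optional[int] = None   # last committed state version

    def receive_publish(self, version: int) -> bool:
        # Phase 1: acknowledge the proposed state without applying it.
        if not self.reachable:
            return False
        self.pending = version
        return True

    def receive_commit(self, version: int) -> None:
        # Phase 2: apply the state only once the master has committed it.
        if self.reachable and self.pending == version:
            self.applied = version
            self.pending = None


def publish(nodes: List[Node], version: int, minimum_master_nodes: int) -> bool:
    """Return True if the state was committed, False if it was rejected."""
    # Count acknowledgements from the publish phase (timeouts not modelled).
    acks = sum(node.receive_publish(version) for node in nodes)
    if acks < minimum_master_nodes:
        return False  # not enough acknowledgements: reject the change
    for node in nodes:
        node.receive_commit(version)
    return True
```

With three nodes, one of them unreachable, and `minimum_master_nodes` set to 2, `publish` commits and only the two reachable nodes apply the state. With two nodes unreachable, the change is rejected and no node applies it.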