[[modules-discovery-quorums]]
=== Quorum-based decision making

Electing a master node and changing the cluster state are the two fundamental
tasks that master-eligible nodes must work together to perform. It is important
that these activities work robustly even if some nodes have failed.
Elasticsearch achieves this robustness by considering each action to have
succeeded on receipt of responses from a _quorum_, which is a subset of the
master-eligible nodes in the cluster. The advantage of requiring only a subset
of the nodes to respond is that it means some of the nodes can fail without
preventing the cluster from making progress. The quorums are carefully chosen so
the cluster does not have a "split brain" scenario where it's partitioned into
two pieces such that each piece may make decisions that are inconsistent with
those of the other piece.

Elasticsearch allows you to add master-eligible nodes to and remove them from a
running
cluster. In many cases you can do this simply by starting or stopping the nodes
as required. See <<modules-discovery-adding-removing-nodes>>.

As nodes are added or removed, Elasticsearch maintains an optimal level of fault
tolerance by updating the cluster's _voting configuration_, which is the set of
master-eligible nodes whose responses are counted when making decisions such as
electing a new master or committing a new cluster state. A decision is made only
after more than half of the nodes in the voting configuration have responded.
Usually the voting configuration is the same as the set of all the
master-eligible nodes that are currently in the cluster. However, there are some
situations in which they may be different.

To be sure that the cluster remains available you **must not stop half or more
of the nodes in the voting configuration at the same time**. As long as more
than half of the voting nodes are available the cluster can still work normally.
This means that if there are three or four master-eligible nodes, the cluster
can tolerate one of them being unavailable. If there are two or fewer
master-eligible nodes, they must all remain available.

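To make the arithmetic concrete, a quorum is a strict majority of the voting
configuration, that is, more than half of its nodes. The following worked
example shows the quorum size and fault tolerance for small voting
configurations:

[source,txt]
--------------------------------------------------
voting nodes (n)    quorum = floor(n/2) + 1    tolerated failures (n - quorum)
       1                      1                          0
       2                      2                          0
       3                      2                          1
       4                      3                          1
       5                      3                          2
--------------------------------------------------
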
After a node has joined or left the cluster, the elected master must issue a
cluster-state update that adjusts the voting configuration to match, and this
can take a short time to complete. It is important to wait for this adjustment
to complete before removing more nodes from the cluster.

[float]
==== Setting the initial quorum

When a brand-new cluster starts up for the first time, it must elect its first
master node. To do this election, it needs to know the set of master-eligible
nodes whose votes should count. This initial voting configuration is known as
the _bootstrap configuration_ and is set in the
<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>.

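For example, on a brand-new three-node cluster the bootstrap configuration is
usually supplied via the `cluster.initial_master_nodes` setting. The node names
below are hypothetical placeholders; each entry must match the name of a
master-eligible node in your cluster:

[source,yaml]
--------------------------------------------------
# Sketch only: set this on every master-eligible node before first startup,
# listing the initial master-eligible nodes by node name.
cluster.initial_master_nodes:
  - master-a
  - master-b
  - master-c
--------------------------------------------------
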
It is important that the bootstrap configuration identifies exactly which nodes
should vote in the first election. It is not sufficient to configure each node
with an expectation of how many nodes there should be in the cluster. It is also
important to note that the bootstrap configuration must come from outside the
cluster: there is no safe way for the cluster to determine the bootstrap
configuration correctly on its own.

If the bootstrap configuration is not set correctly, when you start a brand-new
cluster there is a risk that you will accidentally form two separate clusters
instead of one. This situation can lead to data loss: you might start using both
clusters before you notice that anything has gone wrong and it is impossible to
merge them together later.

NOTE: To illustrate the problem with configuring each node to expect a certain
cluster size, imagine starting up a three-node cluster in which each node knows
that it is going to be part of a three-node cluster. A majority of three nodes
is two, so normally the first two nodes to discover each other form a cluster
and the third node joins them a short time later. However, imagine that four
nodes were erroneously started instead of three. In this case, there are enough
nodes to form two separate clusters. Of course if each node is started manually
then it's unlikely that too many nodes are started. If you're using an automated
orchestrator, however, it's certainly possible to get into this situation--
particularly if the orchestrator is not resilient to failures such as network
partitions.

The initial quorum is only required the very first time a whole cluster starts
up. New nodes joining an established cluster can safely obtain all the
information they need from the elected master. Nodes that have previously been
part of a cluster will have stored to disk all the information that is required
when they restart.

[float]
==== Master elections

Elasticsearch uses an election process to agree on an elected master node, both
at startup and if the existing elected master fails. Any master-eligible node
can start an election, and normally the first election that takes place will
succeed. Elections usually fail only when two nodes both happen to start their
elections at about the same time, so elections are scheduled randomly on each
node to reduce the probability of this happening. Nodes will retry elections
until a master is elected, backing off on failure, so that eventually an
election will succeed (with arbitrarily high probability). The scheduling of
master elections is controlled by the <<master-election-settings,master
election settings>>.

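As an illustrative sketch, the election schedule can be tuned in
`elasticsearch.yml`. The values shown are the documented defaults; treat the
exact names and defaults as assumptions to verify against the
<<master-election-settings,master election settings>> for your version:

[source,yaml]
--------------------------------------------------
# Sketch only: these are the default values, written out explicitly.
cluster.election.initial_timeout: 100ms # upper bound on the first scheduling delay
cluster.election.back_off_time: 100ms   # added to the upper bound after each failure
cluster.election.max_timeout: 10s       # cap on the random pre-election delay
cluster.election.duration: 500ms        # how long each election is allowed to take
--------------------------------------------------
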
[float]
==== Cluster maintenance, rolling restarts and migrations

Many cluster maintenance tasks involve temporarily shutting down one or more
nodes and then starting them back up again. By default Elasticsearch can remain
available if one of its master-eligible nodes is taken offline, such as during a
<<rolling-upgrades,rolling restart>>. Furthermore, if multiple nodes are stopped
and then started again then it will automatically recover, such as during a
<<restart-upgrade,full cluster restart>>. There is no need to take any further
action with the APIs described here in these cases, because the set of master
nodes is not changing permanently.

[float]
==== Automatic changes to the voting configuration

Nodes may join or leave the cluster, and Elasticsearch reacts by automatically
making corresponding changes to the voting configuration in order to ensure that
the cluster is as resilient as possible.

The default auto-reconfiguration behaviour is expected to give the best results
in most situations. The current voting configuration is stored in the cluster
state so you can inspect its current contents as follows:

[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
--------------------------------------------------
// CONSOLE

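The response lists the node IDs of the nodes in the last-committed voting
configuration. The IDs below are hypothetical placeholders; real IDs are opaque
strings assigned by Elasticsearch:

[source,js]
--------------------------------------------------
{
  "metadata": {
    "cluster_coordination": {
      "last_committed_config": [
        "node-id-1",
        "node-id-2",
        "node-id-3"
      ]
    }
  }
}
--------------------------------------------------
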
NOTE: The current voting configuration is not necessarily the same as the set of
all available master-eligible nodes in the cluster. Altering the voting
configuration involves taking a vote, so it takes some time to adjust the
configuration as nodes join or leave the cluster. Also, there are situations
where the most resilient configuration includes unavailable nodes, or does not
include some available nodes, and in these situations the voting configuration
differs from the set of available master-eligible nodes in the cluster.

Larger voting configurations are usually more resilient, so Elasticsearch
normally prefers to add master-eligible nodes to the voting configuration after
they join the cluster. Similarly, if a node in the voting configuration
leaves the cluster and there is another master-eligible node in the cluster that
is not in the voting configuration then it is preferable to swap these two nodes
over. The size of the voting configuration is thus unchanged but its
resilience increases.

It is not so straightforward to automatically remove nodes from the voting
configuration after they have left the cluster. Different strategies have
different benefits and drawbacks, so the right choice depends on how the cluster
will be used. You can control whether the voting configuration automatically
shrinks by using the following setting:

`cluster.auto_shrink_voting_configuration`::
Defaults to `true`, meaning that the voting configuration will automatically
shrink, shedding departed nodes, as long as it still contains at least 3
nodes. If set to `false`, the voting configuration never automatically
shrinks; departed nodes must be removed manually using the
<<modules-discovery-adding-removing-nodes,voting configuration exclusions API>>.

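For instance, to keep the voting configuration under manual control you might
disable automatic shrinking, as in this sketch:

[source,yaml]
--------------------------------------------------
# Sketch only: with this set to false, departed master-eligible nodes stay in
# the voting configuration until excluded via the voting exclusions API.
cluster.auto_shrink_voting_configuration: false
--------------------------------------------------
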
NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the
recommended and default setting, and there are at least three master-eligible
nodes in the cluster, then Elasticsearch remains capable of processing
cluster-state updates as long as all but one of its master-eligible nodes are
healthy.

There are situations in which Elasticsearch might tolerate the loss of multiple
nodes, but this is not guaranteed under all sequences of failures. If this
setting is `false`, departed nodes must be removed from the voting
configuration manually, using the
<<modules-discovery-adding-removing-nodes,voting exclusions API>>, to achieve
the desired level of resilience.

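As a sketch of that manual step, excluding a departed node whose node name is
`node_name` might look like the following; the name is a placeholder, and the
exact request form is described in <<modules-discovery-adding-removing-nodes>>:

[source,js]
--------------------------------------------------
POST /_cluster/voting_config_exclusions/node_name
--------------------------------------------------
// CONSOLE

Once the node has been permanently removed, the accumulated exclusions can be
cleared with `DELETE /_cluster/voting_config_exclusions`.
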
No matter how it is configured, Elasticsearch will not suffer from a
"split-brain" inconsistency. The `cluster.auto_shrink_voting_configuration`
setting affects only the cluster's availability in the event of the failure of
some of its nodes, and the administrative tasks that must be performed as nodes
join and leave the cluster.

[float]
==== Even numbers of master-eligible nodes

There should normally be an odd number of master-eligible nodes in a cluster.
If there is an even number, Elasticsearch leaves one of them out of the voting
configuration to ensure that it has an odd size. This omission does not decrease
the failure-tolerance of the cluster. In fact, it improves it slightly: if the
cluster suffers from a network partition that divides it into two equally-sized
halves then one of the halves will contain a majority of the voting
configuration and will be able to keep operating. If all of the master-eligible
nodes' votes were counted, neither side would contain a strict majority of the
nodes and so the cluster would not be able to make any progress.

For instance, if there are four master-eligible nodes in the cluster and the
voting configuration contains all of them, any quorum-based decision would
require votes from at least three of them. This situation means that the cluster
can tolerate the loss of only a single master-eligible node. If this cluster
were split into two equal halves, neither half would contain three
master-eligible nodes and the cluster would not be able to make any progress.
If the voting configuration contains only three of the four master-eligible
nodes, however, the cluster is still only fully tolerant to the loss of one
node, but quorum-based decisions require votes from two of the three voting
nodes. In the event of an even split, one half will contain two of the three
voting nodes so that half will remain available.