OpenSearch/docs/reference/modules/discovery/quorums.asciidoc

[[modules-discovery-quorums]]
=== Quorum-based decision making

Electing a master node and changing the cluster state are the two fundamental
tasks that master-eligible nodes must work together to perform. It is important
that these activities work robustly even if some nodes have failed.
Elasticsearch achieves this robustness by considering each action to have
succeeded on receipt of responses from a _quorum_, which is a subset of the
master-eligible nodes in the cluster. The advantage of requiring only a subset
of the nodes to respond is that it means some of the nodes can fail without
preventing the cluster from making progress. The quorums are carefully chosen so
the cluster does not have a "split brain" scenario where it's partitioned into
two pieces such that each piece may make decisions that are inconsistent with
those of the other piece.

Elasticsearch allows you to add and remove master-eligible nodes to a running
cluster. In many cases you can do this simply by starting or stopping the nodes
as required. See <<modules-discovery-adding-removing-nodes>>.

As nodes are added or removed Elasticsearch maintains an optimal level of fault
tolerance by updating the cluster's <<modules-discovery-voting,voting
configuration>>, which is the set of master-eligible nodes whose responses are
counted when making decisions such as electing a new master or committing a new
cluster state. A decision is made only after more than half of the nodes in the
voting configuration have responded.  Usually the voting configuration is the
same as the set of all the master-eligible nodes that are currently in the
cluster. However, there are some situations in which they may be different.

To be sure that the cluster remains available you **must not stop half or more
of the nodes in the voting configuration at the same time**. As long as more
than half of the voting nodes are available the cluster can still work normally.
This means that if there are three or four master-eligible nodes, the cluster
can tolerate one of them being unavailable. If there are two or fewer
master-eligible nodes, they must all remain available.

After a node has joined or left the cluster the elected master must issue a
cluster-state update that adjusts the voting configuration to match, and this
can take a short time to complete. It is important to wait for this adjustment
to complete before removing more nodes from the cluster.

[float]
==== Master elections

Elasticsearch uses an election process to agree on an elected master node, both
at startup and if the existing elected master fails. Any master-eligible node
can start an election, and normally the first election that takes place will
succeed. Elections only usually fail when two nodes both happen to start their
elections at about the same time, so elections are scheduled randomly on each
node to reduce the probability of this happening. Nodes will retry elections
until a master is elected, backing off on failure, so that eventually an
election will succeed (with arbitrarily high probability). The scheduling of
master elections are controlled by the <<master-election-settings,master
election settings>>.

[float]
==== Cluster maintenance, rolling restarts and migrations

Many cluster maintenance tasks involve temporarily shutting down one or more
nodes and then starting them back up again. By default Elasticsearch can remain
available if one of its master-eligible nodes is taken offline, such as during a
<<rolling-upgrades,rolling restart>>. Furthermore, if multiple nodes are stopped
and then started again then it will automatically recover, such as during a
<<restart-upgrade,full cluster restart>>. There is no need to take any further
action with the APIs described here in these cases, because the set of master
nodes is not changing permanently.