[DOCS] Adds overview and API ref for cluster voting configurations (#36954)

This commit is contained in:
Lisa Cawley 2019-01-07 09:11:14 -08:00 committed by GitHub
parent 1780ced82d
commit f307847f29
8 changed files with 263 additions and 149 deletions

View File

@ -104,3 +104,5 @@ include::cluster/tasks.asciidoc[]
include::cluster/nodes-hot-threads.asciidoc[]
include::cluster/allocation-explain.asciidoc[]
include::cluster/voting-exclusions.asciidoc[]

View File

@ -0,0 +1,76 @@
[[voting-config-exclusions]]
== Voting configuration exclusions API
++++
<titleabbrev>Voting Configuration Exclusions</titleabbrev>
++++
Adds master-eligible nodes to the
<<modules-discovery-voting,voting configuration exclusion list>>, or removes all
entries from it.
[float]
=== Request
`POST _cluster/voting_config_exclusions/<node_name>` +
`DELETE _cluster/voting_config_exclusions`
[float]
=== Path parameters
`node_name`::
A <<cluster-nodes,node filter>> that identifies {es} nodes.
[float]
=== Description
By default, if there are more than three master-eligible nodes in the cluster
and you remove fewer than half of the master-eligible nodes in the cluster at
once, the <<modules-discovery-voting,voting configuration>> automatically
shrinks.
If you want to shrink the voting configuration to contain fewer than three nodes
or to remove half or more of the master-eligible nodes in the cluster at once,
you must use this API to remove departed nodes from the voting configuration
manually. The API adds an entry for the node to the voting configuration
exclusions list. The cluster then tries to reconfigure the voting configuration
to remove that node and to prevent it from returning.
If the API fails, you can safely retry it. Only a successful response
guarantees that the node has been removed from the voting configuration and will
not be reinstated.
NOTE: Voting exclusions are required only when you remove at least half of the
master-eligible nodes from a cluster in a short time period. They are not
required when removing master-ineligible nodes or fewer than half of the
master-eligible nodes.
The <<modules-discovery-settings,`cluster.max_voting_config_exclusions`
setting>> limits the size of the voting configuration exclusion list. The
default value is `10`. Because voting configuration exclusions are persistent
and limited in number, you must clear the voting configuration exclusions list
once the exclusions are no longer required.
There is also a
<<modules-discovery-settings,`cluster.auto_shrink_voting_configuration` setting>>,
which is set to true by default. If it is set to false, you must use this API to
maintain the voting configuration.
For more information, see <<modules-discovery-removing-nodes>>.
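The conditions described above can be condensed into a small decision helper.
The following Python sketch is illustrative only: the function name
`needs_voting_exclusions` and its parameters are hypothetical and are not part
of {es}. It restates the documented rules and does not contact a cluster.

```python
def needs_voting_exclusions(master_eligible: int, removing: int,
                            auto_shrink: bool = True) -> bool:
    """Return True if, per the rules above, exclusions must be added manually.

    Hypothetical helper: it only encodes the documented conditions.
    """
    if not 0 < removing < master_eligible:
        raise ValueError("removing must be between 1 and master_eligible - 1")
    if not auto_shrink:
        # With cluster.auto_shrink_voting_configuration set to false, the
        # voting configuration never shrinks on its own.
        return True
    remaining = master_eligible - removing
    if remaining < 3:
        # Shrinking to fewer than three voting nodes needs manual exclusions.
        return True
    # Removing half or more of the master-eligible nodes at once also does.
    return 2 * removing >= master_eligible
```

For example, removing one node from a five-node cluster needs no exclusions,
whereas removing three of six does.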
[float]
=== Examples
Add `nodeId1` to the voting configuration exclusions list:
[source,js]
--------------------------------------------------
POST /_cluster/voting_config_exclusions/nodeId1
--------------------------------------------------
// CONSOLE
// TEST[catch:bad_request]
Remove all exclusions from the list:
[source,js]
--------------------------------------------------
DELETE /_cluster/voting_config_exclusions
--------------------------------------------------
// CONSOLE
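The same two requests could be issued from a script. The Python sketch below
only builds the requests with the standard library and does not send them. The
base URL (a local, unsecured cluster on port 9200) and the helper names are
assumptions made for illustration.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a local test cluster without TLS or authentication.
BASE_URL = "http://localhost:9200"


def add_exclusion_request(node_name: str) -> Request:
    """Build (but do not send) the POST request that excludes a node."""
    path = "/_cluster/voting_config_exclusions/" + quote(node_name, safe="")
    return Request(BASE_URL + path, method="POST")


def clear_exclusions_request() -> Request:
    """Build (but do not send) the DELETE request that clears the list."""
    return Request(BASE_URL + "/_cluster/voting_config_exclusions",
                   method="DELETE")


req = add_exclusion_request("nodeId1")
print(req.get_method(), req.full_url)
```

Passing either request object to `urllib.request.urlopen` would send it. As
noted above, a failed call can safely be retried.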

View File

@ -13,6 +13,16 @@ module. This module is divided into the following sections:
unknown, such as when a node has just started up or when the previous
master has failed.
<<modules-discovery-quorums>>::
This section describes how {es} uses a quorum-based voting mechanism to
make decisions even if some nodes are unavailable.
<<modules-discovery-voting>>::
This section describes the concept of voting configurations, which {es}
automatically updates as nodes leave and join the cluster.
<<modules-discovery-bootstrap-cluster>>::
Bootstrapping a cluster is required when an Elasticsearch cluster starts up
@ -40,11 +50,10 @@ module. This module is divided into the following sections:
Cluster state publishing is the process by which the elected master node
updates the cluster state on all the other nodes in the cluster.
<<modules-discovery-quorums>>::
<<cluster-fault-detection>>::
{es} performs health checks to detect and remove faulty nodes.
This section describes the detailed design behind the master election and
auto-reconfiguration logic.
<<modules-discovery-settings,Settings>>::
There are settings that enable users to influence the discovery, cluster
@ -52,14 +61,16 @@ module. This module is divided into the following sections:
include::discovery/discovery.asciidoc[]
include::discovery/quorums.asciidoc[]
include::discovery/voting.asciidoc[]
include::discovery/bootstrapping.asciidoc[]
include::discovery/adding-removing-nodes.asciidoc[]
include::discovery/publishing.asciidoc[]
include::discovery/quorums.asciidoc[]
include::discovery/fault-detection.asciidoc[]
include::discovery/discovery-settings.asciidoc[]
include::discovery/discovery-settings.asciidoc[]

View File

@ -12,6 +12,7 @@ cluster, and to scale the cluster up and down by adding and removing
master-ineligible nodes only. However there are situations in which it may be
desirable to add or remove some master-eligible nodes to or from a cluster.
[[modules-discovery-adding-nodes]]
==== Adding master-eligible nodes
If you wish to add some nodes to your cluster, simply configure the new nodes
@ -24,6 +25,7 @@ cluster. You can use the `cluster.join.timeout` setting to configure how long a
node waits after sending a request to join a cluster. Its default value is `30s`.
See <<modules-discovery-settings>>.
[[modules-discovery-removing-nodes]]
==== Removing master-eligible nodes
When removing master-eligible nodes, it is important not to remove too many all
@ -50,7 +52,7 @@ will never automatically move a node on the voting exclusions list back into the
voting configuration. Once an excluded node has been successfully
auto-reconfigured out of the voting configuration, it is safe to shut it down
without affecting the cluster's master-level availability. A node can be added
to the voting configuration exclusion list using the following API:
to the voting configuration exclusion list using the <<voting-config-exclusions>> API. For example:
[source,js]
--------------------------------------------------

View File

@ -3,6 +3,15 @@
Discovery and cluster formation are affected by the following settings:
`cluster.auto_shrink_voting_configuration`::
Controls whether the <<modules-discovery-voting,voting configuration>>
sheds departed nodes automatically, as long as it still contains at least 3
nodes. The default value is `true`. If set to `false`, the voting
configuration never shrinks automatically and you must remove departed
nodes manually with the <<voting-config-exclusions,voting configuration
exclusions API>>.
[[master-election-settings]]`cluster.election.back_off_time`::
Sets the amount to increase the upper bound on the wait before an election
@ -152,9 +161,11 @@ APIs are not blocked and can run on any available node.
Provides a list of master-eligible nodes in the cluster. The list contains
either an array of hosts or a comma-delimited string. Each value has the
format `host:port` or `host`, where `port` defaults to the setting `transport.profiles.default.port`. Note that IPv6 hosts must be bracketed.
format `host:port` or `host`, where `port` defaults to the setting
`transport.profiles.default.port`. Note that IPv6 hosts must be bracketed.
The default value is `127.0.0.1, [::1]`. See <<unicast.hosts>>.
`discovery.zen.ping.unicast.hosts.resolve_timeout`::
Sets the amount of time to wait for DNS lookups on each round of discovery. This is specified as a <<time-units, time unit>> and defaults to `5s`.
Sets the amount of time to wait for DNS lookups on each round of discovery.
This is specified as a <<time-units, time unit>> and defaults to `5s`.

View File

@ -2,8 +2,9 @@
=== Cluster fault detection
The elected master periodically checks each of the nodes in the cluster to
ensure that they are still connected and healthy. Each node in the cluster also periodically checks the health of the elected master. These checks
are known respectively as _follower checks_ and _leader checks_.
ensure that they are still connected and healthy. Each node in the cluster also
periodically checks the health of the elected master. These checks are known
respectively as _follower checks_ and _leader checks_.
Elasticsearch allows these checks to occasionally fail or time out without
taking any action. It considers a node to be faulty only after a number of
@ -16,4 +17,4 @@ and retry setting values and attempts to remove the node from the cluster.
Similarly, if a node detects that the elected master has disconnected, this
situation is treated as an immediate failure. The node bypasses the timeout and
retry settings and restarts its discovery phase to try to find or elect a new
master.
master.

View File

@ -18,13 +18,13 @@ cluster. In many cases you can do this simply by starting or stopping the nodes
as required. See <<modules-discovery-adding-removing-nodes>>.
As nodes are added or removed Elasticsearch maintains an optimal level of fault
tolerance by updating the cluster's _voting configuration_, which is the set of
master-eligible nodes whose responses are counted when making decisions such as
electing a new master or committing a new cluster state. A decision is made only
after more than half of the nodes in the voting configuration have responded.
Usually the voting configuration is the same as the set of all the
master-eligible nodes that are currently in the cluster. However, there are some
situations in which they may be different.
tolerance by updating the cluster's <<modules-discovery-voting,voting
configuration>>, which is the set of master-eligible nodes whose responses are
counted when making decisions such as electing a new master or committing a new
cluster state. A decision is made only after more than half of the nodes in the
voting configuration have responded. Usually the voting configuration is the
same as the set of all the master-eligible nodes that are currently in the
cluster. However, there are some situations in which they may be different.
To be sure that the cluster remains available you **must not stop half or more
of the nodes in the voting configuration at the same time**. As long as more
@ -38,46 +38,6 @@ cluster-state update that adjusts the voting configuration to match, and this
can take a short time to complete. It is important to wait for this adjustment
to complete before removing more nodes from the cluster.
[float]
==== Setting the initial quorum
When a brand-new cluster starts up for the first time, it must elect its first
master node. To do this election, it needs to know the set of master-eligible
nodes whose votes should count. This initial voting configuration is known as
the _bootstrap configuration_ and is set in the
<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>.
It is important that the bootstrap configuration identifies exactly which nodes
should vote in the first election. It is not sufficient to configure each node
with an expectation of how many nodes there should be in the cluster. It is also
important to note that the bootstrap configuration must come from outside the
cluster: there is no safe way for the cluster to determine the bootstrap
configuration correctly on its own.
If the bootstrap configuration is not set correctly, when you start a brand-new
cluster there is a risk that you will accidentally form two separate clusters
instead of one. This situation can lead to data loss: you might start using both
clusters before you notice that anything has gone wrong and it is impossible to
merge them together later.
NOTE: To illustrate the problem with configuring each node to expect a certain
cluster size, imagine starting up a three-node cluster in which each node knows
that it is going to be part of a three-node cluster. A majority of three nodes
is two, so normally the first two nodes to discover each other form a cluster
and the third node joins them a short time later. However, imagine that four
nodes were erroneously started instead of three. In this case, there are enough
nodes to form two separate clusters. Of course if each node is started manually
then it's unlikely that too many nodes are started. If you're using an automated
orchestrator, however, it's certainly possible to get into this situation--
particularly if the orchestrator is not resilient to failures such as network
partitions.
The initial quorum is only required the very first time a whole cluster starts
up. New nodes joining an established cluster can safely obtain all the
information they need from the elected master. Nodes that have previously been
part of a cluster will have stored to disk all the information that is required
when they restart.
[float]
==== Master elections
@ -104,92 +64,3 @@ and then started again then it will automatically recover, such as during a
action with the APIs described here in these cases, because the set of master
nodes is not changing permanently.
[float]
==== Automatic changes to the voting configuration
Nodes may join or leave the cluster, and Elasticsearch reacts by automatically
making corresponding changes to the voting configuration in order to ensure that
the cluster is as resilient as possible.
The default auto-reconfiguration
behaviour is expected to give the best results in most situations. The current
voting configuration is stored in the cluster state so you can inspect its
current contents as follows:
[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
--------------------------------------------------
// CONSOLE
NOTE: The current voting configuration is not necessarily the same as the set of
all available master-eligible nodes in the cluster. Altering the voting
configuration involves taking a vote, so it takes some time to adjust the
configuration as nodes join or leave the cluster. Also, there are situations
where the most resilient configuration includes unavailable nodes, or does not
include some available nodes, and in these situations the voting configuration
differs from the set of available master-eligible nodes in the cluster.
Larger voting configurations are usually more resilient, so Elasticsearch
normally prefers to add master-eligible nodes to the voting configuration after
they join the cluster. Similarly, if a node in the voting configuration
leaves the cluster and there is another master-eligible node in the cluster that
is not in the voting configuration then it is preferable to swap these two nodes
over. The size of the voting configuration is thus unchanged but its
resilience increases.
It is not so straightforward to automatically remove nodes from the voting
configuration after they have left the cluster. Different strategies have
different benefits and drawbacks, so the right choice depends on how the cluster
will be used. You can control whether the voting configuration automatically shrinks by using the following setting:
`cluster.auto_shrink_voting_configuration`::
Defaults to `true`, meaning that the voting configuration will automatically
shrink, shedding departed nodes, as long as it still contains at least 3
nodes. If set to `false`, the voting configuration never automatically
shrinks; departed nodes must be removed manually using the
<<modules-discovery-adding-removing-nodes,voting configuration exclusions API>>.
NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the
recommended and default setting, and there are at least three master-eligible
nodes in the cluster, then Elasticsearch remains capable of processing
cluster-state updates as long as all but one of its master-eligible nodes are
healthy.
There are situations in which Elasticsearch might tolerate the loss of multiple
nodes, but this is not guaranteed under all sequences of failures. If this
setting is set to `false` then departed nodes must be removed from the voting
configuration manually, using the
<<modules-discovery-adding-removing-nodes,voting exclusions API>>, to achieve
the desired level of resilience.
No matter how it is configured, Elasticsearch will not suffer from a "split-brain" inconsistency.
The `cluster.auto_shrink_voting_configuration` setting affects only its availability in the
event of the failure of some of its nodes, and the administrative tasks that
must be performed as nodes join and leave the cluster.
[float]
==== Even numbers of master-eligible nodes
There should normally be an odd number of master-eligible nodes in a cluster.
If there is an even number, Elasticsearch leaves one of them out of the voting
configuration to ensure that it has an odd size. This omission does not decrease
the failure-tolerance of the cluster. In fact, it improves it slightly: if the
cluster suffers from a network partition that divides it into two equally-sized
halves then one of the halves will contain a majority of the voting
configuration and will be able to keep operating. If all of the master-eligible
nodes' votes were counted, neither side would contain a strict majority of the
nodes and so the cluster would not be able to make any progress.
For instance if there are four master-eligible nodes in the cluster and the
voting configuration contained all of them, any quorum-based decision would
require votes from at least three of them. This situation means that the cluster
can tolerate the loss of only a single master-eligible node. If this cluster
were split into two equal halves, neither half would contain three
master-eligible nodes and the cluster would not be able to make any progress.
If the voting configuration contains only three of the four master-eligible
nodes, however, the cluster is still only fully tolerant to the loss of one
node, but quorum-based decisions require votes from two of the three voting
nodes. In the event of an even split, one half will contain two of the three
voting nodes so that half will remain available.

View File

@ -0,0 +1,140 @@
[[modules-discovery-voting]]
=== Voting configurations
Each {es} cluster has a _voting configuration_, which is the set of
<<master-node,master-eligible nodes>> whose responses are counted when making
decisions such as electing a new master or committing a new cluster state.
Decisions are made only after a majority (more than half) of the nodes in the
voting configuration respond.
Usually the voting configuration is the same as the set of all the
master-eligible nodes that are currently in the cluster. However, there are some
situations in which they may be different.
IMPORTANT: To ensure the cluster remains available, you **must not stop half or
more of the nodes in the voting configuration at the same time**. As long as more
than half of the voting nodes are available, the cluster can work normally. For
example, if there are three or four master-eligible nodes, the cluster
can tolerate one unavailable node. If there are two or fewer master-eligible
nodes, they must all remain available.
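The arithmetic behind these numbers is the usual majority rule. As a minimal
sketch (the function names are made up for this example), the following Python
snippet prints the majority size and the number of tolerated failures for
voting configurations of one to five nodes:

```python
def majority(voting_nodes: int) -> int:
    """Smallest number of votes that is more than half of voting_nodes."""
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voting nodes can be unavailable while a majority remains."""
    return voting_nodes - majority(voting_nodes)


# Columns: voting nodes, majority, tolerated failures
for n in range(1, 6):
    print(n, majority(n), tolerated_failures(n))
```

Three and four voting nodes both tolerate a single failure, while one or two
voting nodes tolerate none, which matches the guidance above.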
After a node joins or leaves the cluster, {es} reacts by automatically making
corresponding changes to the voting configuration in order to ensure that the
cluster is as resilient as possible. It is important to wait for this adjustment
to complete before you remove more nodes from the cluster. For more information,
see <<modules-discovery-adding-removing-nodes>>.
The current voting configuration is stored in the cluster state so you can
inspect its current contents as follows:
[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
--------------------------------------------------
// CONSOLE
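A script can read the voting configuration out of the response body. The
Python sketch below assumes that the filtered response nests a list of node IDs
under `metadata.cluster_coordination.last_committed_config`, mirroring the
`filter_path` in the request. The sample body and its node IDs are invented for
illustration, so check them against a real response before relying on this.

```python
import json

# Assumed response shape; the node IDs are made up for illustration.
sample_response = """
{
  "metadata": {
    "cluster_coordination": {
      "last_committed_config": ["nodeId1", "nodeId2", "nodeId3"]
    }
  }
}
"""


def voting_node_ids(response_body: str) -> list:
    """Extract the committed voting configuration from a response body."""
    state = json.loads(response_body)
    return state["metadata"]["cluster_coordination"]["last_committed_config"]


ids = voting_node_ids(sample_response)
print(len(ids), "voting nodes:", ids)
```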
NOTE: The current voting configuration is not necessarily the same as the set of
all available master-eligible nodes in the cluster. Altering the voting
configuration involves taking a vote, so it takes some time to adjust the
configuration as nodes join or leave the cluster. Also, there are situations
where the most resilient configuration includes unavailable nodes or does not
include some available nodes. In these situations, the voting configuration
differs from the set of available master-eligible nodes in the cluster.
Larger voting configurations are usually more resilient, so Elasticsearch
normally prefers to add master-eligible nodes to the voting configuration after
they join the cluster. Similarly, if a node in the voting configuration
leaves the cluster and there is another master-eligible node in the cluster that
is not in the voting configuration then it is preferable to swap these two nodes
over. The size of the voting configuration is thus unchanged but its
resilience increases.
It is not so straightforward to automatically remove nodes from the voting
configuration after they have left the cluster. Different strategies have
different benefits and drawbacks, so the right choice depends on how the cluster
will be used. You can control whether the voting configuration automatically
shrinks by using the
<<modules-discovery-settings,`cluster.auto_shrink_voting_configuration` setting>>.
NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true` (which is
the default and recommended value) and there are at least three master-eligible
nodes in the cluster, Elasticsearch remains capable of processing cluster state
updates as long as all but one of its master-eligible nodes are healthy.
There are situations in which Elasticsearch might tolerate the loss of multiple
nodes, but this is not guaranteed under all sequences of failures. If the
`cluster.auto_shrink_voting_configuration` setting is `false`, you must remove
departed nodes from the voting configuration manually. Use the
<<voting-config-exclusions,voting exclusions API>> to achieve the desired level
of resilience.
No matter how it is configured, Elasticsearch will not suffer from a
"split-brain" inconsistency. The `cluster.auto_shrink_voting_configuration`
setting affects only the cluster's availability if some of its nodes fail, and
the administrative tasks that must be performed as nodes join and leave the
cluster.
[float]
==== Even numbers of master-eligible nodes
There should normally be an odd number of master-eligible nodes in a cluster.
If there is an even number, Elasticsearch leaves one of them out of the voting
configuration to ensure that it has an odd size. This omission does not decrease
the failure-tolerance of the cluster. In fact, it improves it slightly: if the
cluster suffers from a network partition that divides it into two equally-sized
halves then one of the halves will contain a majority of the voting
configuration and will be able to keep operating. If all of the votes from
master-eligible nodes were counted, neither side would contain a strict majority
of the nodes and so the cluster would not be able to make any progress.
For instance, if there are four master-eligible nodes in the cluster and the
voting configuration contained all of them, any quorum-based decision would
require votes from at least three of them. This situation means that the cluster
can tolerate the loss of only a single master-eligible node. If this cluster
were split into two equal halves, neither half would contain three
master-eligible nodes and the cluster would not be able to make any progress.
If the voting configuration contains only three of the four master-eligible
nodes, however, the cluster is still only fully tolerant to the loss of one
node, but quorum-based decisions require votes from two of the three voting
nodes. In the event of an even split, one half will contain two of the three
voting nodes so that half will remain available.
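The four-node example can be checked with a few lines of Python. This is a
sketch under simple assumptions: the node names are hypothetical, and a
partition is modelled as a set of nodes that can reach each other.

```python
def has_quorum(voting_config: set, reachable: set) -> bool:
    """True if `reachable` holds more than half of the voting configuration."""
    votes = len(voting_config & reachable)
    return 2 * votes > len(voting_config)


# Hypothetical node names; the cluster is split into two equal halves.
half_a, half_b = {"n1", "n2"}, {"n3", "n4"}

all_four = {"n1", "n2", "n3", "n4"}
three_of_four = {"n1", "n2", "n3"}

# With all four nodes voting, neither half holds a majority.
print(has_quorum(all_four, half_a), has_quorum(all_four, half_b))
# With three nodes voting, the half holding two of them keeps working.
print(has_quorum(three_of_four, half_a), has_quorum(three_of_four, half_b))
```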
[float]
==== Setting the initial voting configuration
When a brand-new cluster starts up for the first time, it must elect its first
master node. To do this election, it needs to know the set of master-eligible
nodes whose votes should count. This initial voting configuration is known as
the _bootstrap configuration_ and is set in the
<<modules-discovery-bootstrap-cluster,cluster bootstrapping process>>.
It is important that the bootstrap configuration identifies exactly which nodes
should vote in the first election. It is not sufficient to configure each node
with an expectation of how many nodes there should be in the cluster. It is also
important to note that the bootstrap configuration must come from outside the
cluster: there is no safe way for the cluster to determine the bootstrap
configuration correctly on its own.
If the bootstrap configuration is not set correctly, when you start a brand-new
cluster there is a risk that you will accidentally form two separate clusters
instead of one. This situation can lead to data loss: you might start using both
clusters before you notice that anything has gone wrong and it is impossible to
merge them together later.
NOTE: To illustrate the problem with configuring each node to expect a certain
cluster size, imagine starting up a three-node cluster in which each node knows
that it is going to be part of a three-node cluster. A majority of three nodes
is two, so normally the first two nodes to discover each other form a cluster
and the third node joins them a short time later. However, imagine that four
nodes were erroneously started instead of three. In this case, there are enough
nodes to form two separate clusters. Of course if each node is started manually
then it's unlikely that too many nodes are started. If you're using an automated
orchestrator, however, it's certainly possible to get into this situation--
particularly if the orchestrator is not resilient to failures such as network
partitions.
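The scenario in this note can be sketched in Python. The model is deliberately
simplified and is not how {es} is implemented: node names are hypothetical, a
"count-based" node knows only the expected cluster size, and a bootstrap
configuration is an explicit set of node names.

```python
from itertools import combinations


def disjoint_count_based_clusters(started: list, expected_size: int) -> list:
    """Pairs of non-overlapping groups that each reach a count-based majority.

    Each node only knows the expected cluster size, so any group of
    `expected_size // 2 + 1` nodes believes it may elect a master.
    """
    needed = expected_size // 2 + 1
    groups = [set(g) for g in combinations(started, needed)]
    return [(a, b) for a, b in combinations(groups, 2) if not a & b]


def can_elect(bootstrap_config: set, group: set) -> bool:
    """A group may elect a master only with a majority of the named nodes."""
    return 2 * len(bootstrap_config & group) > len(bootstrap_config)


started = ["n1", "n2", "n3", "n4"]  # four nodes started by mistake
splits = disjoint_count_based_clusters(started, expected_size=3)
print(len(splits), "ways to form two separate count-based clusters")

# With an explicit bootstrap configuration, at most one side can win.
bootstrap = {"n1", "n2", "n3"}
both_sides_win = any(
    can_elect(bootstrap, a) and can_elect(bootstrap, b) for a, b in splits
)
print("two clusters possible with explicit bootstrap config:", both_sides_win)
```

Because two disjoint groups cannot both contain a majority of the same named
set, an explicit bootstrap configuration rules out the second cluster.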
The initial voting configuration is only required the very first time a whole cluster starts
up. New nodes joining an established cluster can safely obtain all the
information they need from the elected master. Nodes that have previously been
part of a cluster will have stored to disk all the information that is required
when they restart.