[[cluster-fault-detection]]
=== Cluster fault detection

The elected master periodically checks each of the nodes in the cluster to
ensure that they are still connected and healthy. Each node in the cluster also
periodically checks the health of the elected master. These checks are known
respectively as _follower checks_ and _leader checks_.

Elasticsearch allows these checks to occasionally fail or time out without
taking any action. It considers a node to be faulty only after a number of
consecutive checks have failed. You can control fault detection behavior with
<<modules-discovery-settings,`cluster.fault_detection.*` settings>>.

If the elected master detects that a node has disconnected, however, this
situation is treated as an immediate failure. The master bypasses the timeout
and retry setting values and attempts to remove the node from the cluster.
Similarly, if a node detects that the elected master has disconnected, this
situation is treated as an immediate failure. The node bypasses the timeout and
retry settings and restarts its discovery phase to try to find or elect a new
master.
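
The interplay between tolerated failures and immediate failure on disconnect
can be illustrated with a minimal Python sketch. It is not the Elasticsearch
implementation: the `FaultDetector` class, the `CheckResult` values, and the
`retry_count` threshold of 3 are hypothetical names and values chosen only for
illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CheckResult(Enum):
    OK = "ok"
    FAILED = "failed"              # error response or timed-out check
    DISCONNECTED = "disconnected"  # connection to the peer was lost


@dataclass
class FaultDetector:
    """Tracks consecutive check failures for one peer (hypothetical helper)."""

    retry_count: int = 3           # assumed threshold, for illustration only
    consecutive_failures: int = 0

    def record(self, result: CheckResult) -> bool:
        """Record one check result; return True if the peer is now faulty."""
        if result is CheckResult.DISCONNECTED:
            # A disconnect bypasses the timeout and retry logic entirely.
            return True
        if result is CheckResult.OK:
            # Any successful check resets the run of consecutive failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.retry_count


detector = FaultDetector(retry_count=3)
results = [
    CheckResult.FAILED,
    CheckResult.OK,
    CheckResult.FAILED,
    CheckResult.FAILED,
    CheckResult.FAILED,
]
# Only the third consecutive failure marks the peer as faulty.
print([detector.record(r) for r in results])
# → [False, False, False, False, True]
```

In this sketch an isolated failure followed by a success has no lasting effect,
whereas a single `DISCONNECTED` result is reported as faulty at once, whatever
the counter holds.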

[[cluster-fault-detection-filesystem-health]]
Additionally, each node periodically verifies that its data path is healthy by
writing a small file to disk and then deleting it again. If a node discovers
that its data path is unhealthy, it is removed from the cluster until the data
path recovers. You can control this behavior with the
<<modules-discovery-settings,`monitor.fs.health` settings>>.
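
The write-then-delete probe described above might look roughly like the
following Python sketch. The function name `data_path_is_healthy` and the
`.health_probe_` file prefix are assumptions made for this example; they do not
reflect how Elasticsearch names or performs its own check.

```python
import os
import tempfile
from pathlib import Path


def data_path_is_healthy(data_path: Path) -> bool:
    """Return True if a small probe file can be written, synced, and deleted."""
    try:
        # Create a uniquely named probe file inside the data path.
        fd, name = tempfile.mkstemp(prefix=".health_probe_", dir=data_path)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # force the bytes to disk so I/O errors surface here
        finally:
            os.close(fd)
            os.unlink(name)  # always try to remove the probe file again
        return True
    except OSError:
        # Any failure to create, write, sync, or delete counts as unhealthy.
        return False
```

A caller could run this function on a timer and treat a `False` result as the
signal that the data path is unhealthy until a later probe succeeds.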