[[fault-detection-settings]]
=== Cluster fault detection settings

An elected master periodically checks each of the nodes in the cluster in order
to ensure that they are still connected and healthy, and in turn each node in
the cluster periodically checks the health of the elected master. These checks
are known respectively as _follower checks_ and _leader checks_.

Elasticsearch allows these checks to fail or time out occasionally without
taking any action, and considers a node to be faulty only after a number of
consecutive checks have failed. The following settings control the behaviour
of fault detection.

`cluster.fault_detection.follower_check.interval`::

    Sets how long the elected master waits between follower checks to each
    other node in the cluster. Defaults to `1s`.

`cluster.fault_detection.follower_check.timeout`::

    Sets how long the elected master waits for a response to a follower check
    before considering it to have failed. Defaults to `30s`.

`cluster.fault_detection.follower_check.retry_count`::

    Sets how many consecutive follower check failures must occur for a node
    before the elected master considers that node to be faulty and removes it
    from the cluster. Defaults to `3`.

`cluster.fault_detection.leader_check.interval`::

    Sets how long each node waits between checks of the elected master.
    Defaults to `1s`.

`cluster.fault_detection.leader_check.timeout`::

    Sets how long each node waits for a response to a leader check from the
    elected master before considering it to have failed. Defaults to `30s`.

`cluster.fault_detection.leader_check.retry_count`::

    Sets how many consecutive leader check failures must occur before a node
    considers the elected master to be faulty and attempts to find or elect a
    new master. Defaults to `3`.
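
To get a feel for how these settings interact, the following Python sketch
estimates how long it might take for an unresponsive (but still connected) node
to be declared faulty. It is a rough estimate and not a description of the
actual scheduler: it assumes that every failed check consumes the full timeout
and that the next check starts one interval after the previous one fails. The
names `CheckSettings` and `worst_case_detection_seconds` are hypothetical and
exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckSettings:
    """Hypothetical container for the three settings of one check type."""
    interval_s: float = 1.0   # *_check.interval, default as listed above
    timeout_s: float = 30.0   # *_check.timeout, default as listed above
    retry_count: int = 3      # *_check.retry_count, default as listed above


def worst_case_detection_seconds(settings: CheckSettings) -> float:
    """Rough estimate of the time from the first failing check being sent
    until the consecutive-failure threshold is reached.

    Assumption: each failing check waits for the full timeout, and one
    interval elapses between a failure and the next attempt.
    """
    timeouts = settings.retry_count * settings.timeout_s
    waits_between_checks = (settings.retry_count - 1) * settings.interval_s
    return timeouts + waits_between_checks


if __name__ == "__main__":
    print(worst_case_detection_seconds(CheckSettings()))
```

Under these assumptions, lowering the timeout or the retry count shortens
detection time at the cost of tolerating fewer transient failures.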

If the elected master detects that a node has disconnected, it treats this as
an immediate failure, bypassing the timeouts and retries listed above, and
attempts to remove the node from the cluster. Similarly, if a node detects
that the elected master has disconnected, it treats this as an immediate
failure, bypassing the timeouts and retries listed above, and restarts its
discovery phase to try to find or elect a new master.
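
The difference between a failed check and a disconnection can be sketched as
follows. This is a simplified model and not the actual implementation: the
event names (`"ok"`, `"failed"`, `"timeout"`, `"disconnected"`) and the
function `faulty_at` are assumptions made for illustration. Failed or timed-out
checks count towards the retry limit and a successful check resets the count,
whereas a disconnection is treated as a failure straight away.

```python
from typing import Iterable, Optional


def faulty_at(events: Iterable[str], retry_count: int = 3) -> Optional[int]:
    """Return the index of the event at which the peer is treated as faulty,
    or None if it never is.

    Hypothetical event names: "ok", "failed", "timeout", "disconnected".
    """
    consecutive_failures = 0
    for index, event in enumerate(events):
        if event == "disconnected":
            # Immediate failure: timeouts and retries are bypassed.
            return index
        if event == "ok":
            # A successful check resets the consecutive-failure count.
            consecutive_failures = 0
        else:
            # "failed" or "timeout" counts towards the retry limit.
            consecutive_failures += 1
            if consecutive_failures >= retry_count:
                return index
    return None


if __name__ == "__main__":
    print(faulty_at(["timeout", "failed", "ok", "timeout"]))
    print(faulty_at(["timeout", "timeout", "failed"]))
    print(faulty_at(["ok", "disconnected"]))
```

In this model, occasional failures separated by a successful check never
trigger removal, which matches the tolerance for transient failures described
above.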