[[disk-allocator]]
=== Disk-based shard allocation

Elasticsearch considers the available disk space on a node before deciding
whether to allocate new shards to that node or to actively relocate shards away
from that node.
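
The per-node disk usage that feeds these decisions can be inspected with the
cat allocation API. A minimal sketch, assuming the default set of columns is
sufficient:

[source,console]
--------------------------------------------------
GET _cat/allocation?v
--------------------------------------------------

For each data node, the response lists the number of shards along with the
used, available, and total disk space.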

Below are the settings that can be configured in the `elasticsearch.yml` config
file or updated dynamically on a live cluster with the
<<cluster-update-settings,cluster-update-settings>> API:

`cluster.routing.allocation.disk.threshold_enabled`::

    Defaults to `true`.  Set to `false` to disable the disk allocation decider.

`cluster.routing.allocation.disk.watermark.low`::

    Controls the low watermark for disk usage. It defaults to `85%`, meaning
    that Elasticsearch will not allocate shards to nodes that have more than
    85% disk used. It can also be set to an absolute byte value (like `500mb`)
    to prevent Elasticsearch from allocating shards if less than the specified
    amount of space is available. This setting has no effect on the primary
    shards of newly-created indices but will prevent their replicas from being allocated.

`cluster.routing.allocation.disk.watermark.high`::

    Controls the high watermark. It defaults to `90%`, meaning that
    Elasticsearch will attempt to relocate shards away from a node whose disk
    usage is above 90%. It can also be set to an absolute byte value (similarly
    to the low watermark) to relocate shards away from a node if it has less
    than the specified amount of free space. This setting affects the
    allocation of all shards, whether previously allocated or not.

`cluster.routing.allocation.disk.watermark.enable_for_single_data_node`::
    For a single data node, the default is to disregard disk watermarks when
    making an allocation decision. This behavior is deprecated and will
    change in 8.0. Set this to `true` to enable the disk watermarks for a
    single data node cluster, which will become the default in 8.0.

`cluster.routing.allocation.disk.watermark.flood_stage`::
+
--
Controls the flood stage watermark. It defaults to `95%`, meaning that
Elasticsearch enforces a read-only index block
(`index.blocks.read_only_allow_delete`) on every index that has one or more
shards allocated on the node that has at least one disk exceeding the flood
stage. This is a last resort to prevent nodes from running out of disk space.
The index block is automatically released once the disk utilization falls below
the high watermark.

NOTE: You cannot mix percentage values and byte values within these settings.
Either all are set to percentage values, or all are set to byte values. This
allows Elasticsearch to validate that the settings are internally consistent
(that is, the low disk threshold is not more than the high disk threshold, and
the high disk threshold is not more than the flood stage threshold).

An example of resetting the read-only index block on the `twitter` index:

[source,console]
--------------------------------------------------
PUT /twitter/_settings
{
  "index.blocks.read_only_allow_delete": null
}
--------------------------------------------------
// TEST[setup:twitter]
--

`cluster.info.update.interval`::

    How often Elasticsearch should check on disk usage for each node in the
    cluster. Defaults to `30s`.

`cluster.routing.allocation.disk.include_relocations`::

    deprecated:[7.5.0, Future versions will always account for relocations.]
    Defaults to `true`, which means that Elasticsearch will take into account
    shards that are currently being relocated to the target node when computing
    a node's disk usage. Taking relocating shards' sizes into account may,
    however, mean that the disk usage for a node is incorrectly estimated on
    the high side, since the relocation could be 90% complete and a recently
    retrieved disk usage would include the total size of the relocating shard
    as well as the space already used by the running relocation.


NOTE: Percentage values refer to used disk space, while byte values refer to
free disk space. This can be confusing, since it flips the meaning of high and
low. For example, it makes sense to set the low watermark to 10gb and the high
watermark to 5gb, but not the other way around.
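
As a sketch of the percentage form, the following request sets all three
watermarks as percentages of used disk space, so the values increase from the
low watermark to the flood stage. The specific values (`80%`, `85%`, and `90%`)
are illustrative assumptions, not recommendations:

[source,console]
--------------------------------------------------
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
  }
}
--------------------------------------------------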

An example that sets the low watermark to at least 100 gigabytes free, the high
watermark to at least 50 gigabytes free, and the flood stage watermark to 10
gigabytes free, and that refreshes the cluster's disk usage information every
minute:

[source,console]
--------------------------------------------------
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }
}
--------------------------------------------------
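
To return to the defaults afterwards, the same settings can be assigned `null`,
which removes the transient overrides. This is a sketch that assumes the values
were applied as transient settings, as in the request above:

[source,console]
--------------------------------------------------
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.info.update.interval": null
  }
}
--------------------------------------------------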