[[indices-shadow-replicas]]
== Shadow replica indices

experimental[]

If you would like to use a shared filesystem, you can use the shadow replicas
settings to choose where on disk the data for an index should be kept, as well
as how Elasticsearch should replay operations on all the replica shards of an
index.

In order to fully utilize the `index.data_path` and `index.shadow_replicas`
settings, you need to allow Elasticsearch to use the same data directory for
multiple instances by setting `node.add_lock_id_to_custom_path` to false in
elasticsearch.yml:

[source,yaml]
--------------------------------------------------
node.add_lock_id_to_custom_path: false
--------------------------------------------------

You will also need to indicate to the security manager where the custom indices
will be, so that the correct permissions can be applied. You can do this by
setting the `path.shared_data` setting in elasticsearch.yml:

[source,yaml]
--------------------------------------------------
path.shared_data: /opt/data
--------------------------------------------------

This means that Elasticsearch can read and write to files in any subdirectory of
the `path.shared_data` setting.

You can then create an index with a custom data path, where each node will use
this path for the data:

[WARNING]
========================
Because shadow replicas do not index documents on replica shards, it's
possible for the replica's known mapping to be behind the index's known mapping
if the latest cluster state has not yet been processed on the node containing
the replica. Because of this, it is highly recommended to use pre-defined
mappings when using shadow replicas.
========================

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index' -d '
{
    "index" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 4,
        "data_path": "/opt/data/my_index",
        "shadow_replicas": true
    }
}'
--------------------------------------------------

[WARNING]
========================
In the above example, the "/opt/data/my_index" path is a shared filesystem that
must be available on every node in the Elasticsearch cluster. You must also
ensure that the Elasticsearch process has the correct permissions to read from
and write to the directory used in the `index.data_path` setting.
========================

The `data_path` does not have to contain the index name; in this case
"my_index" was used, but it could just as easily have been "/opt/data/".

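If you follow the earlier warning and pre-define your mappings, they can be
supplied in the same create index request. The following is only an
illustrative sketch: the `my_type` mapping and its `created_at` field are
hypothetical and not part of the original example.

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index' -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 4,
            "data_path" : "/opt/data/my_index",
            "shadow_replicas" : true
        }
    },
    "mappings" : {
        "my_type" : {
            "properties" : {
                "created_at" : { "type" : "date" }
            }
        }
    }
}'
--------------------------------------------------
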
An index that has been created with the `index.shadow_replicas` setting set to
"true" will not replicate document operations to any of the replica shards;
instead, it will only continually refresh. Once segments are available on the
filesystem where the shadow replica resides (after an Elasticsearch "flush"), a
regular refresh (governed by the `index.refresh_interval`) can be used to make
the new data searchable.

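For example, to make newly indexed data visible on the shadow replicas without
waiting for the automatic cycle, you could trigger a flush followed by a
refresh yourself. This is a minimal sketch using the standard flush and refresh
APIs:

[source,js]
--------------------------------------------------
# fsync new segments to the shared filesystem
curl -XPOST 'localhost:9200/my_index/_flush'
# make the flushed segments searchable on the shadow replicas
curl -XPOST 'localhost:9200/my_index/_refresh'
--------------------------------------------------
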
NOTE: Since documents are only indexed on the primary shard, realtime GET
requests could fail to return a document if executed on the replica shard;
therefore, GET API requests automatically have the `?preference=_primary` flag
set if there is no preference flag already set.

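The explicit form of such a request would look like the following sketch (the
document type `my_type` and id `1` are hypothetical):

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my_index/my_type/1?preference=_primary'
--------------------------------------------------
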
In order to ensure the data is synchronized quickly enough, you may need to
tune the index's flush threshold to a suitable value. A flush is needed to
fsync segment files to disk so that they are visible to all other replica
nodes. Users should test which flush threshold levels they are comfortable
with, as increased flushing can impact indexing performance.

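For instance, the translog flush threshold can be adjusted through the update
settings API. The setting name and value below are an illustrative assumption,
not a recommendation:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
    "index.translog.flush_threshold_size" : "128mb"
}'
--------------------------------------------------
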
The Elasticsearch cluster will still detect the loss of a primary shard, and
transform the replica into a primary in this situation. This transformation will
take slightly longer, since no `IndexWriter` is maintained for each shadow
replica.

Below is the list of settings that can be changed using the update
settings API (an example request follows the list):

`index.data_path` (string)::
    Path to use for the index's data. Note that by default Elasticsearch will
    append the node ordinal to the path to ensure multiple instances of
    Elasticsearch on the same machine do not share a data directory.

`index.shadow_replicas`::
    Boolean value indicating this index should use shadow replicas. Defaults to
    `false`.

`index.shared_filesystem`::
    Boolean value indicating this index uses a shared filesystem. Defaults to
    `true` if `index.shadow_replicas` is set to true, `false` otherwise.

`index.shared_filesystem.recover_on_any_node`::
    Boolean value indicating whether the primary shards for the index should be
    allowed to recover on any node in the cluster. If a node holding a copy of
    the shard is found, recovery prefers that node. Defaults to `false`.

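For example, `index.shared_filesystem.recover_on_any_node` could be enabled on
the existing index like this (a sketch against the `my_index` example above):

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
    "index.shared_filesystem.recover_on_any_node" : true
}'
--------------------------------------------------
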
=== Node level settings related to shadow replicas

These are non-dynamic settings that need to be configured in `elasticsearch.yml`:

`node.add_lock_id_to_custom_path`::
    Boolean setting indicating whether Elasticsearch should append the node's
    ordinal to the custom data path. For example, if this is enabled and a path
    of "/tmp/foo" is used, the first locally-running node will use "/tmp/foo/0",
    the second will use "/tmp/foo/1", the third "/tmp/foo/2", etc. Defaults to
    `true`.