OpenSearch/docs/resiliency/index.asciidoc

= Elasticsearch Resiliency Status

:JIRA: https://issues.apache.org/jira/browse/LUCENE-
:GIT:  https://github.com/elastic/elasticsearch/issues/

== Overview

The team at Elasticsearch is committed to continuously improving both
Elasticsearch and Apache Lucene to protect your data.  As with any distributed
system, Elasticsearch is complex and has many moving parts, each of which can
encounter edge cases that require proper handling.  Our resiliency project is
an ongoing effort to find and fix these edge cases. If you want to keep up
with all this project on GitHub, see our issues list under the tag
https://github.com/elastic/elasticsearch/issues?q=label%3Aresiliency[resiliency].

While GitHub is great for sharing our work, it can be difficult to get an
overview of the current state of affairs and the previous work that has been
done from an issues list. This page provides an overview of all the
resiliency-related issues that we are aware of, improvements that have already
been made and current in-progress work. We’ve also listed some historical
improvements throughout this page to provide the full context.

If you’re interested in more on how we approach ensuring resiliency in
Elasticsearch, you may be interested in Igor Motov’s recent talk
https://www.elastic.co/videos/improving-elasticsearch-resiliency[Improving Elasticsearch Resiliency].

You may also be interested in our blog post
https://www.elastic.co/blog/resiliency-elasticsearch[Resiliency in Elasticsearch],
which details our thought processes when addressing resiliency in both
Elasticsearch and the work our developers do upstream in Apache Lucene.

== Data Store Recommendations

Some customers use Elasticsearch as a primary datastore, some set-up
comprehensive back-up solutions using features such as our Snapshot and
Restore, while others use Elasticsearch in conjunction with a data storage
system like Hadoop or even flat files. Elasticsearch can be used for so many
different use cases which is why we have created this page to make sure you
are fully informed when you are architecting your system.

== Work in Progress

[discrete]
=== Known Unknowns (STATUS: ONGOING)

We consider this topic to be the most important in our quest for
resiliency. We put a tremendous amount of effort into testing
Elasticsearch to simulate failures and randomize configuration to
produce extreme conditions. In addition, our users are an important
source of information on unexpected edge cases and your bug reports
help us make fixes that ensure that our system continues to be
resilient.

If you encounter an issue, https://github.com/elastic/elasticsearch/issues[please report it]!

We are committed to tracking down and fixing all the issues that are posted.

[discrete]
==== Jepsen Tests

The Jepsen platform is specifically designed to test distributed systems. It is not a single test and is regularly adapted
to create new scenarios. We have currently ported all published Jepsen scenarios that deal with loss of acknowledged writes to our testing
framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
all new scenarios and will report issues that we find on this page and in our GitHub repository.

[discrete]
=== Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

If the node holding a primary shard is disconnected for whatever reason, the
coordinating node retries the request on the same or a new primary shard.  In
certain rare conditions, where the node disconnects and immediately
reconnects, it is possible that the original request has already been
successfully applied but has not been reported, resulting in duplicate
requests. This is particularly true when retrying bulk requests, where some
actions may  have completed and some may not have.

An optimization which disabled the existence check for documents indexed with
auto-generated IDs could result in the creation of duplicate documents. This
optimization has been removed. {GIT}9468[#9468] (STATUS: DONE, v1.5.0)

Further issues remain with the retry mechanism:

* Unversioned index requests could increment the `_version` twice,
  obscuring a `created` status.
* Versioned index requests could return a conflict exception, even
  though they were applied correctly.
* Update requests could be applied twice.

See {GIT}9967[#9967]. (STATUS: ONGOING)

[discrete]
=== OOM resiliency (STATUS: ONGOING)

The family of circuit breakers has greatly reduced the occurrence of OOM
exceptions, but it is still possible to cause a node to run out of heap
space.  The following issues have been identified:

* Set a hard limit on `from`/`size` parameters {GIT}9311[#9311]. (STATUS: DONE, v2.1.0)
* Prevent combinatorial explosion in aggregations from causing OOM {GIT}8081[#8081]. (STATUS: DONE, v5.0.0)
* Add the byte size of each hit to the request circuit breaker {GIT}9310[#9310]. (STATUS: ONGOING)
* Limit the size of individual requests and also add a circuit breaker for the total memory used by in-flight request objects {GIT}16011[#16011]. (STATUS: DONE, v5.0.0)

Other safeguards are tracked in the meta-issue {GIT}11511[#11511].

[discrete]
=== The _version field may not uniquely identify document content during a network partition (STATUS: ONGOING)

When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue
indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is
partitioned away. It won't acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed.
The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to
step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from
the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted
writes for the same document (see {GIT}19269[#19269]).

We are currently implementing Sequence numbers {GIT}10708[#10708] which better track primary changes. Sequence numbers thus provide a basis
for uniquely identifying writes even in the presence of network partitions and will replace `_version` in operations that require this.

[discrete]
=== Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)

Indices stats and indices segments requests reach out to all nodes that have shards of that index. Shards that have relocated from a node
while the stats request arrives will make that part of the request fail and are just ignored in the overall stats result. {GIT}13719[#13719]

[discrete]
=== Documentation of guarantees and handling of failures (STATUS: ONGOING)

This status page is a start, but we can do a better job of explicitly documenting the processes at work in Elasticsearch and what happens
in the case of each type of failure. The plan is to have a test case that validates each behavior under simulated conditions. Every test
 will document the expected results, the associated test code, and an explicit PASS or FAIL status for each simulated case.

[discrete]
=== Run Jepsen (STATUS: ONGOING)

We have ported the known scenarios in the Jepsen blogs that check loss of acknowledged writes to our testing infrastructure.
The new tests are run continuously in our testing farm and are passing. We are also working on running Jepsen independently to verify
that no failures are found.

[discrete]
=== Replicas can fall out of sync when a primary shard fails (STATUS: ONGOING)

When a primary shard fails, a replica shard will be promoted to be the
primary shard. If there is more than one replica shard, it is possible
for the remaining replicas to be out of sync with the new primary
shard. This is caused by operations that were in-flight when the primary
shard failed and may not have been processed on all replica
shards. Currently, the discrepancies are not repaired on primary
promotion but instead would be repaired if replica shards are relocated
(e.g., from hot to cold nodes); this does mean that the length of time
which replicas can be out of sync with the primary shard is
unbounded. Sequence numbers {GIT}10708[#10708] will provide a mechanism
for syncing the remaining replicas with the newly-promoted primary
shard.

== Completed

[discrete]
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)

During a networking partition, cluster state updates (like mapping changes or
shard assignments) are committed if a majority of the master-eligible nodes
received the update correctly. This means that the current master has access to
enough nodes in the cluster to continue to operate correctly. When the network
partition heals, the isolated nodes catch up with the current state and receive
the previously missed changes. However, if a second partition happens while the
cluster is still recovering from the previous one *and* the old master falls on
the minority side, it may be that a new master is elected which has not yet
catch up. If that happens, cluster state updates can be lost.

This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
committed cluster state updates into account during master election. This
considerably reduces the chance of this rare problem occurring but does not
fully mitigate it. If the second partition happens concurrently with a cluster
state update and blocks the cluster state commit message from reaching a
majority of nodes, it may be that the in flight update will be lost. If the
now-isolated master can still acknowledge the cluster state update to the client
this will amount to the loss of an acknowledged change.

Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
sub-issues. See particularly {GIT}32171[#32171] and
https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the
TLA+ formal model] used to verify these changes.

[discrete]
=== Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, V6.3.0)

Certain combinations of delays in performing activities related to the deletion
of a document could result in the operations on that document being interpreted
differently on different shard copies. This could lead to a divergence in the
number of documents held in each copy.

Deleting an unacknowledged document that was concurrently being inserted using
an auto-generated ID was erroneously sensitive to the order in which those
operations were processed on each shard copy. Thanks to the introduction of
sequence numbers ({GIT}10708[#10708]) it is now possible to detect these
out-of-order operations, and this issue was fixed in {GIT}28787[#28787].

Re-creating a document a specific interval after it was deleted could result in
that document's tombstone having being cleaned up on some, but not all, copies
when processing the indexing operation that re-creates it. This resulted in
varying behaviour across the shard copies. The problematic interval was set by
the `index.gc_deletes` setting, which is 60 seconds by default. Again, sequence
numbers ({GIT}10708[#10708]) gives us the machinery to detect these conflicting
activities, and this issue was fixed in {GIT}28790[#28790].

Under certain rare circumstances a replica might erroneously interpret a stale
tombstone for a document as fresh, resulting in a concurrent indexing operation
for that same document behaving differently on this replica than on the
primary. This is fixed in {GIT}29619[#29619]. Triggering this issue required
the following activities all to occur in a short time window, in a specific
order on the primary and a different specific order on the replica:

* a document is deleted twice
* another document is indexed with the same ID as this first document
* another document is indexed with a completely different, auto-generated, ID
* two refreshes

We found the first two of these issues by empirical testing, and then we built
https://github.com/elastic/elasticsearch-formal-models/blob/master/ReplicaEngine/tla/ReplicaEngine.tla[a
formal model of the replica's behaviour] using TLA+. Running the TLC model
checker on this model found all three issues. We then applied the proposed
fixes to the model and validated that the fixed design behaved as expected.

[discrete]
=== Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: DONE, V5.0.0)

We have increased our test coverage to include scenarios tested by Jepsen that demonstrate loss of acknowledged writes, as described in
the Elasticsearch related blogs. We make heavy use of randomization to expand on the scenarios that can be tested and to introduce
new error conditions.
You can follow the work on the master branch of the
https://github.com/elastic/elasticsearch/blob/master/core/src/test/java/org/elasticsearch/discovery/DiscoveryWithServiceDisruptionsIT.java[`DiscoveryWithServiceDisruptionsIT` class],
where the `testAckedIndexing` test was specifically added to check that we don't lose acknowledged writes in various failure scenarios.


[discrete]
=== Loss of documents during network partition (STATUS: DONE, v5.0.0)

If a network partition separates a node from the master, there is some window of time before the node detects it. The length of the window is dependent on the type of the partition. This window is extremely small if a socket is broken. More adversarial partitions, for example, silently dropping requests without breaking the socket can take longer (up to 3x30s using current defaults).

If the node hosts a primary shard at the moment of partition, and ends up being isolated from the cluster (which could have resulted in {GIT}2488[split-brain] before), some documents that are being indexed into the primary may be lost if they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master ({GIT}7572[#7572]).
To prevent this situation, the primary needs to wait for the master to acknowledge replica shard failures before acknowledging the write to the client. {GIT}14252[#14252]

[discrete]
=== Safe primary relocations (STATUS: DONE, v5.0.0)

When primary relocation completes, a cluster state is propagated that deactivates the old primary and marks the new primary as active. As
cluster state changes are not applied synchronously on all nodes, there can be a time interval where the relocation target has processed the
cluster state and believes to be the active primary and the relocation source has not yet processed the cluster state update and still
believes itself to be the active primary. This means that an index request that gets routed to the new primary does not get replicated to
the old primary (as it has been deactivated from point of view of the new primary). If a subsequent read request gets routed to the old
primary, it cannot see the indexed document. {GIT}15900[#15900]

In the reverse situation where a cluster state update that completes primary relocation is first applied on the relocation source and then
on the relocation target, each of the nodes believes the other to be the active primary. This leads to the issue of indexing requests
chasing the primary being quickly sent back and forth between the nodes, potentially making them both go OOM. {GIT}12573[#12573]

[discrete]
=== Do not allow stale shards to automatically be promoted to primary (STATUS: DONE, v5.0.0)

In some scenarios, after the loss of all valid copies, a stale replica shard can be automatically assigned as a primary, preferring old data
to no data at all ({GIT}14671[#14671]). This can lead to a loss of acknowledged writes if the valid copies are not lost but are rather
temporarily unavailable. Allocation IDs ({GIT}14739[#14739]) solve this issue by tracking non-stale shard copies in the cluster and using
this tracking information to allocate primary shards. When all shard copies are lost or only stale ones available, Elasticsearch will wait
for one of the good shard copies to reappear. In case where all good copies are lost, a manual override command can be used to allocate a
stale shard copy.

[discrete]
=== Make index creation resilient to index closing and full cluster crashes (STATUS: DONE, v5.0.0)

Recovering an index requires a quorum (with an exception for 2) of shard copies to be available to allocate a primary. This means that
a primary cannot be assigned if the cluster dies before enough shards have been allocated ({GIT}9126[#9126]). The same happens if an index
is closed before enough shard copies were started, making it impossible to reopen the index ({GIT}15281[#15281]).
Allocation IDs ({GIT}14739[#14739]) solve this issue by tracking allocated shard copies in the cluster. This makes it possible to safely
recover an index in the presence of a single shard copy. Allocation IDs can also distinguish the situation where an index has been created
but none of the shards have been started. If such an index was inadvertently closed before at least one shard could be started, a fresh
shard will be allocated upon reopening the index.


[discrete]
=== Use two phase commit for Cluster State publishing (STATUS: DONE, v5.0.0)

A master node in Elasticsearch continuously https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-fault-detection.html[monitors the cluster nodes]
and removes any node from the cluster that doesn't respond to its pings in a timely
fashion. If the master is left with too few nodes, it will step down and a new master election will start.

When a network partition causes a master node to lose many followers, there is a short window
in time until the node loss is detected and the master steps down. During that window, the
master may erroneously accept and acknowledge cluster state changes. To avoid this, we introduce
a new phase to cluster state publishing where the proposed cluster state is sent to all nodes
but is not yet committed. Only once enough nodes actively acknowledge
the change, it is committed and commit messages are sent to the nodes. See {GIT}13062[#13062].

[discrete]
=== Wait on incoming joins before electing local node as master (STATUS: DONE, v2.0.0)

During master election each node pings in order to discover other nodes and validate the liveness of existing
nodes. Based on this information the node either discovers an existing master or, if enough nodes are found a new master will be elected. Currently, the node that is
elected as master will update the cluster state to indicate the result of the election. Other nodes will submit
a join request to the newly elected master node. Instead of immediately processing the election result, the elected master
node should wait for the incoming joins from other nodes, thus validating that the result of the election is properly applied. As soon as enough
nodes have sent their joins request (based on the `minimum_master_nodes` settings) the cluster state is updated.
{GIT}12161[#12161]

[discrete]
=== Mapping changes should be applied synchronously (STATUS: DONE, v2.0.0)

When introducing new fields using dynamic mapping, it is possible that the same
field can be added to different shards with different data types.  Each shard
will operate with its local data type but, if the shard is relocated, the
data type from the cluster state will be applied to the new shard, which
can result in a corrupt shard.  To prevent this, new fields should not
be added to a shard's mapping until confirmed by the master.
{GIT}8688[#8688] (STATUS: DONE)

[discrete]
=== Add per-segment and per-commit ID to help replication (STATUS: DONE, v2.0.0)

{JIRA}5895[LUCENE-5895] adds a unique ID for each segment and each commit point. File-based replication (as performed by snapshot/restore) can use this ID to know whether the segment/commit on the source and destination machines are the same.  Fixed in Lucene 5.0.

[discrete]
=== Write index metadata on data nodes where shards allocated (STATUS: DONE, v2.0.0)

Today, index metadata is written only on nodes that are master-eligible, not on
data-only nodes.  This is not a problem when running with multiple master nodes,
as recommended, as the loss of all but one master node is still recoverable.
However, users running with a single master node are at risk of losing
their index metadata if the master fails.  Instead, this metadata should
also be written on any node where a shard is allocated. {GIT}8823[#8823], {GIT}9952[#9952]

[discrete]
=== Better file distribution with multiple data paths (STATUS: DONE, v2.0.0)

Today, a node configured with multiple data paths distributes writes across
all paths by writing one file to each path in turn.  This can mean that the
failure of a single disk corrupts many shards at once.  Instead, by allocating
an entire shard to a single data path, the extent of the damage can be limited
to just the shards on that disk. {GIT}9498[#9498]

[discrete]
=== Lucene checksums phase 3 (STATUS: DONE, v2.0.0)

Almost all files in Elasticsearch now have checksums which are validated before use.  A few changes remain:

* {GIT}7586[#7586] adds checksums for cluster and index state files. (STATUS: DONE, Fixed in v1.5.0)
* {GIT}9183[#9183] supports validating the checksums on all files when starting a node. (STATUS: DONE, Fixed in v2.0.0)
* {JIRA}5894[LUCENE-5894] lays the groundwork for extending more efficient checksum validation to all files during optimized bulk merges. (STATUS: DONE, Fixed in v2.0.0)
* {GIT}8403[#8403] to add validation of checksums on Lucene `segments_N` files. (STATUS: DONE, v2.0.0)

[discrete]
=== Report shard-level statuses on write operations (STATUS: DONE, v2.0.0)

Make write calls return the number of total/successful/missing shards in the same way that we do in search, which ensures transparency in the consistency of write operations. {GIT}7994[#7994]. (STATUS: DONE, v2.0.0)

[discrete]
=== Take filter cache key size into account (STATUS: DONE, v2.0.0)

Commonly used filters are cached in Elasticsearch. That cache is limited in size
(10% of node's memory by default) and is being evicted based on a least recently
used policy. The amount of memory used by the cache depends on two primary
components - the values it stores and the keys associated with them. Calculating
the memory footprint of the values is easy enough but the keys accounting is
trickier to achieve as they are, by default, raw Lucene objects. This is largely
not a problem as the keys are dominated by the values. However, recent
optimizations in Lucene have changed the balance causing the filter cache to
grow beyond it's size.

As a temporary solution, we introduced a minimum weight of 1k for each cache entry.
This puts an effective limit on the number of entries in the cache. See {GIT}8304[#8304] (STATUS: DONE, fixed in v1.4.0)

The issue has been completely solved by the move to Lucene's query cache. See {GIT}10897[#10897]

[discrete]
=== Ensure shard state ID is incremental (STATUS: DONE, v1.5.1)

It is possible in very extreme cases during a complicated full cluster restart,
that the current shard state ID can be reset or even go backwards.
Elasticsearch now ensures that the state ID always moves
forwards, and throws an exception when a legacy ID is higher than the
current ID.  See {GIT}10316[#10316] (STATUS: DONE, v1.5.1)

[discrete]
=== Verification of index UUIDs (STATUS: DONE, v1.5.0)

When deleting and recreating indices rapidly, it is possible that cluster state
updates can arrive out of sync and old states can be merged incorrectly.  Instead,
Elasticsearch now checks the index UUID to ensure that cluster state updates
refer to the same index version that is present on the local node.
See {GIT}9541[#9541] and {GIT}10200[#10200] (STATUS: DONE, Fixed in v1.5.0)

[discrete]
=== Disable recovery from known buggy versions (STATUS: DONE, v1.5.0)

Corruptions have been known to occur when doing a rolling restart from older, buggy versions.
Now, shards from versions before v1.4.0 are copied over in full and recovery from versions
before v1.3.2 are disabled entirely. See {GIT}9925[#9925] (STATUS: DONE, Fixed in v1.5.0)


[discrete]
=== Upgrade 3.x segments metadata on engine startup (STATUS: DONE, v1.5.0)

Upgrading the metadata of old 3.x segments on node upgrade can be error prone
and can result in corruption when merges are being run concurrently. Instead,
Elasticsearch will now upgrade the metadata of 3.x segments before the engine
starts.  See {GIT}9899[#9899] (STATUS; DONE, fixed in v1.5.0)

[discrete]
=== Prevent setting minimum_master_nodes to more than the current node count (STATUS: DONE, v1.5.0)

Setting `zen.discovery.minimum_master_nodes` to a value higher than the current node count
effectively leaves the cluster without a master and unable to process requests.  The only
way to fix this is to add more master-eligible nodes.  {GIT}8321[#8321] adds a mechanism
to validate settings before applying them, and {GIT}9051[#9051] extends this validation
support to settings applied during a cluster restore. (STATUS: DONE, Fixed in v1.5.0)

[discrete]
=== Simplify and harden shard recovery and allocation (STATUS: DONE, v1.5.0)

Randomized testing combined with chaotic failures has revealed corner cases
where the recovery and allocation of shards in a concurrent manner can result
in shard corruption.  There is an ongoing effort to reduce the complexity of
these operations in order to make them more deterministic.  These include:

* Introduce shard level locks to prevent concurrent shard modifications {GIT}8436[#8436]. (STATUS: DONE, Fixed in v1.5.0)
* Delete shard contents under a lock {GIT}9083[#9083]. (STATUS: DONE, Fixed in v1.5.0)
* Delete shard under a lock {GIT}8579[#8579]. (STATUS: DONE, Fixed in v1.5.0)
* Refactor RecoveryTarget state management {GIT}8092[#8092]. (STATUS: DONE, Fixed in v1.5.0)
* Cancelling a recovery may leave temporary files behind {GIT}7893[#7893]. (STATUS: DONE, Fixed in v1.5.0)
* Quick cluster state processing can result in both shard copies being deleted {GIT}9503[#9503]. (STATUS: DONE, Fixed in v1.5.0)
* Rapid creation and deletion of an index can cause reuse of old index metadata {GIT}9489[#9489]. (STATUS: DONE, Fixed in v1.5.0)
* Flush immediately after the last concurrent recovery finishes to clear out the translog before a new recovery starts {GIT}9439[#9439]. (STATUS: DONE, Fixed in v1.5.0)

[discrete]
=== Prevent use of known-bad Java versions (STATUS: DONE, v1.5.0)

Certain versions of the JVM are known to have bugs which can cause index corruption.  {GIT}7580[#7580] prevents Elasticsearch startup if known bad versions are in use.

[discrete]
=== Make recovery be more resilient to partial network partitions (STATUS: DONE, v1.5.0)

When a node is experience network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped and a new location is found for the relevant shards. However, in the of case partial network partition, where there are connectivity issues between the source and target nodes of a recovery but not between those nodes and the current master things may go wrong. While the nodes successfully restore the connection, the on going recoveries may have encountered issues. In {GIT}8720[#8720], we added test simulations for these and solved several issues that were flagged by them.

[discrete]
=== Improving Zen Discovery (STATUS: DONE, v1.4.0.Beta1)

Recovery from failure is a complicated process, especially in an asynchronous distributed system like Elasticsearch. With several processes happening in parallel, it is important to ensure that recovery proceeds swiftly and safely. While fixing the {GIT}2488[split-brain issue] we have been hunting down corner cases that were not handled optimally, adding tests to demonstrate the issues, and working on fixes:

* Faster & better detection of master & node failures, including not trying to reconnect upon disconnect, fail on disconnect error on ping, verify cluster names in pings. Previously, Elasticsearch had to wait a bit for the node to complete the process required to join the cluster. Recent changes guarantee that a node has fully joined the cluster before we start the fault detection process. Therefore we can do an immediate check causing faster detection of errors and validation of cluster state after a minimum master node breach. {GIT}6706[#6706], {GIT}7399[#7399] (STATUS: DONE, v1.4.0.Beta1)
* Broaden Unicast pinging when master fails: When a node loses it’s current master it will start pinging to find a new one. Previously, when using unicast based pinging, the node would ping a set of predefined nodes asking them whether the master had really disappeared or whether there was a network hiccup. Now, we ping all nodes in the cluster to increase coverage. In the case that all unicast hosts are disconnected from the current master during a network failure, this improvement is essential to allow the cluster to reform once the partition is healed. {GIT}7336[#7336] (STATUS: DONE, v1.4.0.Beta1)
* After joining a cluster, validate that the join was successful and that the master has been set in the local cluster state. {GIT}6969[#6969]. (STATUS: DONE, v1.4.0.Beta1)
* Write additional tests that use the test infrastructure to verify proper behavior during network disconnections and garbage collections. {GIT}7082[#7082] (STATUS: DONE, v1.4.0.Beta1)

[discrete]
=== Lucene checksums phase 2 (STATUS:DONE, v1.4.0.Beta1)

When Lucene opens a segment for reading, it validates the checksum on the smaller segment files -- those which it reads entirely into memory -- but not the large files like term frequencies and positions, as this would be very expensive. During merges, term vectors and stored fields are validated, as long the segments being merged come from the same version of Lucene. Checksumming for term vectors and stored fields is important because merging consists of performing optimized byte copies. Term frequencies, term positions, payloads, doc values, and norms are currently not checked during merges, although Lucene provides the option to do so.  These files are less prone to silent corruption as they are actively decoded during merge, and so are more likely to throw exceptions if there is any corruption.

The following changes have been made:

* {GIT}7360[#7360] validates checksums on all segment files during merges. (STATUS: DONE, fixed in v1.4.0.Beta1)
* {JIRA}5842[LUCENE-5842] validates the structure of the checksum footer of the postings lists, doc values, stored fields and term vectors when opening a new segment, to ensure that these files have not been truncated. (STATUS: DONE, Fixed in Lucene 4.10 and v1.4.0.Beta1)
* {GIT}8407[#8407] validates Lucene checksums for legacy files. (STATUS: DONE; Fixed in v1.3.6)

[discrete]
=== Don't allow unsupported codecs (STATUS: DONE, v1.4.0.Beta1)

Lucene 4 added a number of alternative codecs for experimentation purposes, and Elasticsearch exposed the ability to change codecs.  Since then, Lucene has settled on the best choice of codec and provides backwards compatibility only for the default codec.  {GIT}7566[#7566] removes the ability to set alternate codecs.

[discrete]
=== Use checksums to identify entire segments (STATUS: DONE, v1.4.0.Beta1)

A hash collision makes it possible for two different files to have the same length and the same checksum. Instead, a segment's identity should rely on checksums from all of the files in a single segment, which greatly reduces the chance of a collision. This change has been merged ({GIT}7351[#7351]).

[discrete]
=== Fix ''Split Brain can occur even with minimum_master_nodes'' (STATUS: DONE, v1.4.0.Beta1)

Even when minimum master nodes is set, split brain can still occur under certain conditions, e.g. disconnection between master eligible nodes, which can lead to data loss. The scenario is described in detail in {GIT}2488[issue 2488]:

* Introduce a new testing infrastructure to simulate different types of node disconnections, including loss of network connection, lost messages, message delays, etc. See {GIT}5631[MockTransportService] support and {GIT}6505[service disruption] for more details. (STATUS: DONE, v1.4.0.Beta1).
* Added tests that simulated the bug described in issue 2488. You can take a look at the https://github.com/elastic/elasticsearch/commit/7bf3ffe73c44f1208d1f7a78b0629eb48836e726[original commit] of a reproduction on master. (STATUS: DONE, v1.2.0)
* The bug described in {GIT}2488[issue 2488] is caused by an issue in our zen discovery gossip protocol. This specific issue has been fixed, and work has been done to make the algorithm more resilient. (STATUS: DONE, v1.4.0.Beta1)

[discrete]
=== Translog Entry Checksum (STATUS: DONE, v1.4.0.Beta1)

Each translog entry in Elasticsearch should have its own checksum, and potentially additional information, so that we can properly detect corrupted translog entries and act accordingly. You can find more detail in issue {GIT}6554[#6554].

To start, we will begin by adding checksums to the translog to detect corrupt entries. Once this work has been completed, we will add translog entry markers so that corrupt entries can be skipped in the translog if/when desired.

[discrete]
=== Request-Level Memory Circuit Breaker (STATUS: DONE, v1.4.0.Beta1)

We are in the process of introducing multiple circuit breakers in Elasticsearch, which can “borrow” space from each other in the event that one runs out of memory. This architecture will allow limits for certain parts of memory, but still allow flexibility in the event that another reserve like field data is not being used. This change includes adding a breaker for the BigArrays internal object used for some aggregations. See issue {GIT}6739[#6739] for more details.

[discrete]
=== Doc Values (STATUS: DONE, v1.4.0.Beta1)

Fielddata is one of the largest consumers of heap memory, and thus one of the primary reasons for running out of memory and causing node instability. Elasticsearch has had the “doc values” option for a while, which allows you to build these structures at index time so that they live on disk instead of in memory. Up until recently, doc values were significantly slower than in-memory fielddata.

By benchmarking and profiling both Lucene and Elasticsearch, we identified the bottlenecks and have made a series of improvements to improve the performance of doc values. They are now almost as fast as the in-memory option.

See {GIT}6967[#6967], {GIT}6908[#6908], {GIT}4548[#4548], {GIT}3829[#3829], {GIT}4518[#4518], {GIT}5669[#5669], {JIRA}5748[LUCENE-5748], {JIRA}5703[LUCENE-5703], {JIRA}5750[LUCENE-5750], {JIRA}5721[LUCENE-5721], {JIRA}5799[LUCENE-5799].

[discrete]
=== Index corruption when upgrading Lucene 3.x indices (STATUS: DONE, v1.4.0.Beta1)

Upgrading indices create with Lucene 3.x (Elasticsearch v0.20 and before) to Lucene 4.7 - 4.9 (Elasticsearch v1.1.0 to v1.3.x), could result in index corruption. {JIRA}5907[LUCENE-5907] fixes this issue in Lucene 4.10.

[discrete]
=== Improve error handling when deleting files (STATUS: DONE, v1.4.0.Beta1)

Lucene uses reference counting to prevent files that are still in use from being deleted.  Lucene testing discovered a bug ({JIRA}5919[LUCENE-5919]) when decrementing the ref count on a batch of files. If deleting some of the files resulted in an exception (e.g. due to interference from a virus scanner), the files that had their ref counts decremented successfully could later have their ref counts deleted again, incorrectly, resulting in files being physically deleted before their time. This is fixed in Lucene 4.10.

[discrete]
=== Using Lucene Checksums to verify shards during snapshot/restore (STATUS:DONE, v1.3.3)

The snapshot process should verify checksums for each file that is being snapshotted to make sure that created snapshot doesn’t contain corrupted files. If a corrupted file is detected, the snapshot should fail with an error. In order to implement this feature we need to have correct and verifiable checksums stored with segment files, which is only possible for files that were written by the officially supported append-only codecs. See {GIT}7159[#7159].

[discrete]
=== Rare compression corruption during shard recovery (STATUS: DONE, v1.3.2)

During recovery, the primary shard is copied over the network to become a new replica shard. In rare cases, it was possible for a hash collision to trigger a bug in the compression library that is used to produce corruption in the replica shard. This bug was exposed by the change to validate checksums during recovery. We tracked down the bug in the in compression library and submitted a patch, which was accepted and merged by the upstream project. See {GIT}7210[#7210].

[discrete]
=== Safer recovery of replica shards (STATUS: DONE, v1.3.0)

If a primary shard fails or is closed while a replica is using it for recovery, we need to ensure that the replica is properly failed as well, and allow recovery to start from the new primary. Also check that an active copy of a shard is available on another node before physically removing an inactive shard from disk. {GIT}6825[#6825], {GIT}6645[#6645], {GIT}6995[#6995].

[discrete]
=== Using Lucene Checksums to verify shards during recovery (STATUS: DONE, v1.3.0)

Elasticsearch can use Lucene checksums to validate files while {GIT}6776[recovering a replica shard from a primary].

This issue exposed a bug in Elasticsearch’s handling of primary shard failure when having more than 2 replicas, causing the second replica to not be properly unassigned if it is in the middle of recovery. It was fixed with the merge of issue {GIT}6808[#6808].

In order to verify the checksumming mechanism, we added functionality to our testing infrastructure that can corrupt an arbitrary index file and at any point, such as while it’s traveling over the wire or residing on disk. The tests utilizing this feature expect full or partial recovery from the failure while neither losing data nor spreading the corruption.

[discrete]
=== Detect File Corruption (STATUS: DONE, v1.3.0)

When a corrupted index can be detected during merging or refresh, Elasticsearch will fail the shard if a checksum failure is detected. You can read the full details in pull request {GIT}6776[#6776].

[discrete]
=== Network disconnect events could be lost, causing a zombie node to stay in the cluster state (STATUS: DONE, v1.3.0)

Previously, there was a very short window in which we could lose a node disconnect event. To prevent this from occurring, we added extra handling of connection errors to our nodes & master fault detection pinging to make sure the node disconnect event is detected. See issue {GIT}6686[#6686].

[discrete]
=== Other fixes to Lucene to address resiliency (STATUS: DONE, v1.3.0)

* NativeLock is released if Lock is closed after failing on obtain {JIRA}5738[LUCENE-5738].
* NRT Reader close can wipe an index it doesn’t own. {JIRA}5574[LUCENE-5574]
* FSDirectory’s fsync() is lenient, now throws exceptions when errors occur {JIRA}5570[LUCENE-5570]
* fsync() directory when committing {JIRA}5588[LUCENE-5588]

[discrete]
=== Backwards Compatibility Testings (STATUS: DONE, v1.3.0)

Since founding Elasticsearch Inc, we grew our test base from ~1k tests to about 4k in just about over a year. We invested massively into our testing infrastructure, running our tests continuously on different operating systems, bare metal hardware and cloud environments, all while randomizing JVMs and their settings.

Yet, backwards compatibility testing was a very manual thing until we released a pretty {GIT}6393[insane bug] with Elasticsearch 1.2. We tried to fix places where the absolute value of a number was negative (a documented behavior of Math.abs(int) in Java) and missed that the fix for this also changed the result of our routing function. No matter how much randomization we applied to the tests, we didn’t catch this particular failure. We always had backwards compatibility tests on our list of things to do, but didn’t have them in place back then.

We recently tweaked our testing infrastructure to be able to run tests against a hybrid cluster composed of a released version of Elasticsearch and our current stable branch. This test pattern allowed us to mimic typical upgrade scenarios like rolling upgrades, index backwards compatibility and recovering from old to new nodes.

Now, even the simplest test that relies on routing fails against 1.2.0, which is exactly we were aiming for. The test would not have caught the aforementioned {GIT}6393[routing bug] before releasing 1.2.0, but it immediately saved us from {GIT}6660[another problem] in the stable branch.

The work on our testing infrastructure is more than just issue prevention, it allows us to develop and test upgrade paths, introduce new features and evolve indexing over time. It isn’t enough to introduce more resilient implementations, we also have to ensure that users take advantage of them when they upgrade.

You can read more about backwards compatibility tests in issue {GIT}6497[#6497].

[discrete]
=== Full Translog Writes on all Platforms (STATUS: DONE, v1.2.2 and v1.3.0)

We have recently received bug reports of transaction log corruption that can occur when indexing very large documents (in the area of 300 KB). Although some Linux users reported this behavior, it appears the problem occurs more frequently when running Windows. We traced the source of the problem to the fact that when serializing documents to the transaction log, the Operating System can actually write only part of the document before returning from the write call. We can now detect this situation and make sure that the entire document is properly written. You can read the full details in pull request {GIT}6576[#6576].

[discrete]
=== Lucene Checksums (STATUS: DONE, v1.2.0)

Before Apache Lucene version 4.8, checksums were not computed on generated index files. The result was that it was difficult to identify when or if a Lucene index got corrupted, whether by hardware failure, JVM bug or for an entirely different reason.

For an idea of the checksum efforts in progress in Apache Lucene, see issues {JIRA}2446[LUCENE-2446], {JIRA}5580[LUCENE-5580] and {JIRA}5602[LUCENE-5602]. The gist is that Lucene 4.8+ now computes full checksums on all index files and it verifies them when opening metadata or other smaller files as well as other files during merges.

[discrete]
=== Detect errors faster by locally failing a shard upon an indexing error (STATUS: DONE, v1.2.0)

Previously, Elasticsearch notified the master of the shard failure and waited for the master to close the local copy of the shard, thus assigning it to other nodes. This architecture caused delays in failure detection, potentially causing unneeded failures of other incoming requests. In rare cases, such as concurrency racing conditions or certain network partitions configurations, we could lose these failure notifications. We solved this issue by locally failing shards upon indexing errors. See issue {GIT}5847[#5847].

[discrete]
=== Snapshot/Restore API (STATUS: DONE, v1.0.0)

In Elasticsearch version 1.0, we significantly improved the backup process by introducing the Snapshot/Restore API. While it was always possible to make backups of Elasticsearch, the Snapshot/Restore API made the backup process much easier.

The backup process is incremental, making it very efficient since only files changed since the last backup are copied. Even with this efficiency introduced, each snapshot contains a full picture of the cluster at the moment when backup started. The restore API allows speedy recovery of a full cluster as well as selected indices.

Since that first release in version 1.0, the API has continued to evolve. In version 1.1.0, we added a new snapshot status API that allows users to monitor the snapshot process. In 1.3.0 we added the ability to {GIT}6457[restore indices without their aliases] and in 1.4 we are planning to add the ability to {GIT}6368[restore partial snapshots].

The Snapshot/Restore API supports a number of different repository types for storing backups. Currently, it’s possible to make backups to a shared file system, Amazon S3, HDFS, and Azure storage. We are continuing to work on adding other types of storage systems, as well as improving the robustness of the snapshot/restore process.

[discrete]
=== Circuit Breaker: Fielddata (STATUS: DONE, v1.0.0)

Currently, the circuit breaker protects against loading too much field data by estimating how much memory the field data will take to load, then aborting the request if the memory requirements are too high. This feature was added in Elasticsearch version 1.0.0.

[discrete]
=== Use of Paginated Data Structures to Ease Garbage Collection (STATUS: DONE, v1.0.0 & v1.2.0)

Elasticsearch has moved from an object-based cache to a page-based cache recycler as described in issue {GIT}4557[#4557]. This change makes garbage collection easier by limiting fragmentation, since all pages have the same size and are recycled. It also allows managing the size of the cache not based on the number of objects it contains, but on the memory that it uses.

These pages are used for two main purposes: implementing higher level data structures such as hash tables that are used internally by aggregations to e.g. map terms to counts, as well as reusing memory in the translog/transport layer as detailed in issue {GIT}5691[#5691].

[discrete]
=== Dedicated Master Nodes Resiliency (STATUS: DONE, v1.0.0)

In order to run a more resilient cluster, we recommend running dedicated master nodes to ensure master nodes are not affected by resources consumed by data nodes. We also have made master nodes more resilient to heavy resource usage, mainly associated with large clusters / cluster states.

These changes include:

* Improve the balancing algorithm to execute faster across large clusters / many indices. (See issue {GIT}4458[#4458] and {GIT}4459[#4459])
* Improve cluster state publishing to not create an additional network buffer per node. (More in https://github.com/elastic/elasticsearch/commit/a9e259d438c3cb1d3bef757db2d2a91cf85be609[this commit].)
* Improve master handling of large scale mapping updates from data nodes by batching them into a single cluster event. (See issue {GIT}4373[#4373].)
* Add an ack mechanism where next phase cluster updates are processed only when nodes acknowledged they received the previous cluster state. (See issues {GIT}3736[#3736], {GIT}3786[#3786], {GIT}4114[#4114], {GIT}4169[#4169], {GIT}4228[#4228] and {GIT}4421[#4421], which also include enhancements to the ack mechanism implementation.)

[discrete]
=== Multi Data Paths May Falsely Report Corrupt Index (STATUS: DONE, v1.0.0)

When using multiple data paths, an index could be falsely reported as corrupted. This has been fixed with pull request {GIT}4674[#4674].

[discrete]
=== Randomized Testing (STATUS: DONE, v1.0.0)

In order to best validate for resiliency in Elasticsearch, we rewrote the Elasticsearch test infrastructure to introduce the concept of https://github.com/randomizedtesting/randomizedtesting[randomized testing]. Randomized testing allows us to easily enhance the Elasticsearch testing infrastructure with predictably irrational conditions, making the resulting code base more resilient.

Each of our integration tests runs against a cluster with a random number of nodes, and indices have a random number of shards and replicas. Merge settings change for every run, indexing is done in serial or async fashion or even wrapped in a bulk operation and thread pool sizes vary to ensure that we don’t produce a deadlock no matter what happens. The list of places we use this randomization infrastructure is long, and growing every day, and has saved us headaches several times before we shipped a particular feature.

At Elasticsearch, we live the philosophy that we can miss a bug once, but never a second time. We make our tests more evil as you go, introducing randomness in all the areas where we discovered bugs. We figure if our tests don’t fail, we are not trying hard enough! If you are interested in how we have evolved our test infrastructure over time check out https://github.com/elastic/elasticsearch/issues?q=label%3Atest[issues labeled with ``test'' on GitHub].

[discrete]
=== Lucene Loses Data On File Descriptors Failure (STATUS: DONE, v0.90.0)

When a process runs out of file descriptors, Lucene can causes an index to be completely deleted. This issue was fixed in Lucene ({JIRA}4870[version 4.2.1]) and fixed in an early version of Elasticsearch. See issue {GIT}2812[#2812].