After a node joins the clusters, it starts pinging the master to verify it's health. Before, the cluster join request was processed async and we had to give some time to complete. With #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. Similar change can be made to the Node fault detection from the master side.
Closes#6706
* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule if remaining delay is smaller than 0 don't double execute transport logic
This commit adds the notion of ServiceDisruptionScheme allowing for introducing disruptions in our test cluster. This
abstraction as used in a couple of wrappers around the functionality offered by MockTransportService to simulate various
network partions. There is also one implementation for causing a node to be slow in processing cluster state updates.
This new mechnaism is integrated into existing tests DiscoveryWithNetworkFailuresTests.
A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions.
Closes#6505
If the master FD flags master as gone while there are still pending cluster states, the processing of those cluster states we re-instate that node a master again.
Closes#6526
The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure.
Also added tests to verify the settings' behaviour
We have an optimization which compares routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens where the current master updates it's state, doesn't manage to broadcast it to other nodes due to the disconnect and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of it's state can equal but the content is different.
Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements)
Closes#6466
When a node steps down from being a master (because, for example, min_master_node is breached), it may still have
cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed as the node
no longer has authority to do so. Other cluster states updates, like electing the current node as master, should be
executed even if the current node is no longer master.
This commit make sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks
that should run on non masters are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`
Closes#6230
Only the previous master node has been removed, so only shards allocated to that node will get failed.
This would have happened anyhow on later on when AllocationService#reroute is invoked (for example when a cluster setting changes or another cluster event),
but by cleaning the routing table pro-actively, the stale routing table is fixed sooner and therefor the shards
that are not accessible anyhow (because the node these shards were on has left the cluster) will get re-assigned sooner.
Switches TranslogStreams to check a header in the file to determine the
translog format, delegating to the version-specific stream.
Version 1 of the translog format writes a header using Lucene's
CodecUtil at the beginning of the file and appends a checksum for each
translog operation written.
Also refactors much of the translog operations, such as merging
.hasNext() and .next() in FsChannelSnapshot
Relates to #6554
These caches have no advantage compared to the default node cache. Additionally,
the soft cache makes use of soft references which make fielddata loading quite
unpredictable in addition to pushing more pressure on the garbage collector.
The `none` cache is still there because of tests. There is no other good
reason to use it.
LongFieldDataBenchmark has been removed because the refactoring exposed a
compilation error in this class, which seems to not having been working for a
long time. In addition it's not as much useful now that we are progressively
moving more fields to doc values.
Close#7443
This also fixes has_parent filters with a nested empty bool filter
(see test SimpleChildQuerySearchTests#test6722, the test should actually expect
either 0 results when searching for has_parent "test" or one result when
search for has_parent "foo")
closes#7240closes#7347
Complex explanations were always read as Explanations. Depending
on if the response was streamed or not the explanation was
therefore generated by a ComplexExplanation or by a regular
Explanation.
closes#7257
Added the ability to register template filters that are being applied when a new index is created. The default filter that checks whether the template pattern matches the index name always runs first, additional filters can also be registered so that templates can be filtered out based on custom logic.
Took the chance to add the handy source(Object... source) method to PutIndexTemplateRequest and corresponding builder
Closes#7459Closes#7454
This change forces the GetRequest when a script is being loaded from an index
to use preference("_local") and threaded(false) to prevent the script service from
forking for GetRequests.
The comparison and read code in the BlobStoreIndexShardRepository
used the physicalName and Name in reverse order. This caused
SnapshotBackwardsCompatibilityTest to fail.
This reverts commit 636af40da1