Similar to the batching of "shards-started" actions, this commit implements batching of snapshot status updates. This is useful when backing up many indices as the cluster state does not need to be republished as many times.
Closes#10295
Today, when a primary shard is not allocated we go to all the nodes to find where it is allocated (listing its started state). When we allocate a replica shard, we head to all the nodes and list its store to allocate the replica on a node that holds the closest matching index files to the primary.
Those two operations today execute synchronously within the GatewayAllocator, which means they execute in a sync manner within the cluster update thread. For large clusters, or environments with very slow disk, those operations will stall the cluster update thread, making it seem like its stuck.
Worse, if the FS is really slow, we timeout after 30s the operation (to not stall the cluster update thread completely). This means that we will have another run for the primary shard if we didn't find one, or we won't find the best node to place a shard since it might have timed out (listing stores need to list all files and read the checksum at the end of each file).
On top of that, this sync operation happen one shard at a time, so its effectively compounding the problem in a serial manner the more shards we have and the slower FS is...
This change moves to perform both listing the shard started states and the shard stores to an async manner. During the allocation by the GatewayAllocator, if data needs to be fetched from a node, it is done in an async fashion, with the response triggering a reroute to make sure the results will be taken into account. Also, if there are on going operations happening, the relevant shard data will not be taken into account until all the ongoing listing operations are done executing.
The execution of listing shard states and stores has been moved to their own respective thread pools (scaling, so will go down to 0 when not needed anymore, unbounded queue, since we don't want to timeout, just let it execute based on how fast the local FS is). This is needed sine we are going to blast nodes with a lot of requests and we need to make sure there is no thread explosion.
This change also improves the handling of shard failures coming from a specific node. Today, those nodes were ignored from allocation only for the single reroute round. Now, since fetching is async, we need to keep those failures around at least until a single successful fetch without the node is done, to make sure not to repeat allocating to the failed node all the time.
Note, if before the indication of slow allocation was high pending tasks since the allocator was waiting for responses, not the pending tasks will be much smaller. In order to still indicate that the cluster is in the middle of fetching shard data, 2 attributes were added to the cluster health API, indicating the number of ongoing fetches of both started shards and shard store.
closes#9502closes#11101
The default for reuse address was null on Windows and casting to a boolean would
result in a NPE. This makes the default on Windows false and changes the return
type to a boolean and removes the need to check for nulls.
This also fixes a potential race condition when the number of docs
is compared across shards with the same seal ID since the assertion
was taking the number of docs form the live index reader which might
not be equivalent to the committed num docs.
CompressedString relied on the assumption that two CompressedString instanes
are equal if there compressed representation are equal. Unfortunately this is
not always true because the compressed representation also depends on when
flush() was called on the output stream or on the size of the hash table that
has been used at compression time.
When index is introduced into the cluster via cluster upgrade, restore or as a dangled index the MetaDataIndexUpgradeService checks if this index can be upgraded to the current version. If upgrade is not possible, the newly upgraded cluster startup and restore process are aborted, the dangled index is imported as a closed index that cannot be open.
Closes#10215
This commit also reenables CorruptedFileTest#testReplicaCorruption which had
a missing `GatewayAllocator.INDEX_RECOVERY_INITIAL_SHARDS: "one"` setting.
Closes#11226
When we use unicast discovery in tests, we test for available ports by binding to ports and if
the bind was successful, we use that port. This has a timing issue on certain operating systems
the socket can still be in a TIME_WAIT causing subsequent binds to fail before a certain timeout.
Setting reuse address on the Java socket will instruct the underlying operating system to allow
the socket to be bound to immediately, usually by specifying the SO_REUSEADDR socket option.
This commit makes FilteredQuery a forbidden API and also removes some more usage
of the Filter API. There are some remaining code using filters for parent/child
queries but I'm not touching this as they are already being refactored in #6511.
Add a test that verifies that even though all replicas are corrupted on all available nodes, and listing of shard stores faield, it still get allocated and properly recovered from the primary shard
Some requests in the SyncedFlushService were sill blocking on network
calls which made calling this service error prone if done on a network
thread. This commit makes this service fully async based on ActionListener.