OpenSearch/server
Boaz Leskes 13917162ad
ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316)
At the end of recovery, we mark the recovering shard as "in sync" on the primary. From this point on,
the primary treats any replication failure on that shard as critical and reaches out to the master to fail the
shard. Before marking the shard as in sync, we wait for its local checkpoint to be above the global
checkpoint (in order to maintain the global checkpoint invariant).

If the master decides to cancel the allocation of the recovering shard while we wait, the method can
currently hang and never return. It also ignores the interrupts that are triggered by the cancelled
recovery due to the primary closing.
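
The intended behaviour of such a wait can be sketched as follows. This is a minimal illustration, not the actual `ReplicationTracker` code: the class `InSyncWaiter` and all of its method names are hypothetical, and treating "caught up" as `local >= global` is an assumption. The wait loop re-checks on every wake-up whether the allocation is still tracked, and it lets `InterruptedException` propagate instead of swallowing it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the tracker; not the real ReplicationTracker. */
final class InSyncWaiter {
    // Local checkpoints of the shard copies currently tracked, keyed by allocation id.
    private final Map<String, Long> localCheckpoints = new HashMap<>();
    private final long globalCheckpoint;

    InSyncWaiter(long globalCheckpoint) {
        this.globalCheckpoint = globalCheckpoint;
    }

    synchronized void startTracking(String allocationId, long localCheckpoint) {
        localCheckpoints.put(allocationId, localCheckpoint);
    }

    synchronized void updateLocalCheckpoint(String allocationId, long localCheckpoint) {
        if (localCheckpoints.containsKey(allocationId)) {
            localCheckpoints.put(allocationId, localCheckpoint);
            notifyAll(); // wake waiters so they re-check their condition
        }
    }

    /** Models the master cancelling the allocation of the recovering shard. */
    synchronized void cancelAllocation(String allocationId) {
        localCheckpoints.remove(allocationId);
        notifyAll(); // without this wake-up, a waiter would sleep forever
    }

    /**
     * Returns true once the copy has caught up (assumed here: local >= global),
     * false if the allocation was cancelled in the meantime.
     * An interrupt is not swallowed: it surfaces as InterruptedException.
     */
    synchronized boolean waitUntilInSync(String allocationId) throws InterruptedException {
        while (true) {
            Long local = localCheckpoints.get(allocationId);
            if (local == null) {
                return false; // allocation cancelled: stop waiting
            }
            if (local >= globalCheckpoint) {
                return true;
            }
            wait(); // releases the monitor until notifyAll() or an interrupt
        }
    }
}
```

In this sketch the caller can tell the two outcomes apart and, in either case, gets control back so that anything it holds can be released.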

This matters because the method is called while holding a primary permit. Since the method
never returns, the permit is never released. The unreleased permit then blocks any primary
relocation *and*, while the primary is trying to relocate, all indexing is blocked for 30m as it
waits to acquire the missing permit.
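
The effect of a leaked permit can be illustrated with a plain `java.util.concurrent.Semaphore`. This is only a sketch under assumptions: the class `PermitDemo`, the constant `TOTAL_PERMITS`, and the method `tryBlockAll` are hypothetical names, and the real permit mechanism may differ. The point is that an operation needing all permits cannot proceed while a single one is still held.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a leaked permit; not the real permit implementation. */
final class PermitDemo {
    static final int TOTAL_PERMITS = 4; // assumed value, for illustration only

    /**
     * Models an operation (such as a relocation) that needs every permit.
     * Returns false if the permits could not all be acquired within the timeout.
     */
    static boolean tryBlockAll(Semaphore permits, long timeoutMillis) throws InterruptedException {
        if (permits.tryAcquire(TOTAL_PERMITS, timeoutMillis, TimeUnit.MILLISECONDS)) {
            permits.release(TOTAL_PERMITS); // give them back once done
            return true;
        }
        return false; // at least one permit is still held elsewhere
    }
}
```

If a caller acquires one permit and then hangs, `tryBlockAll` keeps timing out until that permit is released.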
2018-05-02 19:40:29 +02:00
cli Add useful message when no input from terminal (#29369) 2018-04-10 21:50:39 -04:00
licenses Upgrade to lucene 7.3.0 (#29387) 2018-04-05 10:34:44 +01:00
src ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316) 2018-05-02 19:40:29 +02:00
build.gradle Build: Split distributions into oss and default 2018-04-20 15:33:57 -07:00