10dcfa3304
Once a primary is marked as relocated, we can not safely move it back to started (as we have no way of waiting on inflight operations that are performed on the target primary). If the master cancels the relocation in that state, we fail the primary. Sadly, there is a racing condition between the `updateRoutingEntry` method (which is called when the relocation is cancelled by the master) and the `relocated` method. That racing condition can leave the shard as marked "relocated" but have the routing entry not reflect the target relocation. This in turns causes NPEs in TransportReplicationAction: ``` java.util.Objects requireNonNull Objects.java 203 org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest <init> TransportReplicationAction.java 982 ``` Sadly, once we end up in this state, we will never recover. This commit fixes that race condition by making sure `updateRoutingEntry` acquires the mutex when checking for the relocated status. While at it, I also tightened up the code and added lots of assertions/hard checks. |
||
---|---|---|
.. | ||
src | ||
build.gradle |