Sync global checkpoint on pending in-sync shards (#43526)

At the end of a peer recovery, the primary wants to mark the replica as in-sync. For that, the
persisted local checkpoint of the replica needs to have caught up with the global checkpoint on the
primary. If translog durability is set to ASYNC, the primary's view of the replica's persisted local
checkpoint can lag and might need to be fetched explicitly through a global
checkpoint sync action. Unfortunately, that action is only triggered after 30 seconds and, even
worse, only runs based on what the in-sync shard copies report (see
IndexShard.maybeSyncGlobalCheckpoint). As the replica has not been marked as in-sync yet, it is
not taken into consideration, and the primary might have its global checkpoint equal to the max seq
no, so it concludes that nothing needs to be done. This commit also triggers the sync when a shard
is pending in-sync.
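
The gap described above can be illustrated with a minimal sketch. It models only the async-durability part of the sync condition; the class and method names are hypothetical and not part of the Elasticsearch code base, and the second half of the real condition (lagging persisted global checkpoints on in-sync copies) is left out on purpose.

```java
// Hypothetical, simplified model of the sync decision; not the actual IndexShard code.
final class GlobalCheckpointSyncSketch {

    // Async-durability condition as it was before this commit.
    static boolean syncNeededBefore(boolean asyncDurability, long globalCheckpoint, long maxSeqNo) {
        return asyncDurability && globalCheckpoint < maxSeqNo;
    }

    // Async-durability condition after this commit: a shard that is pending
    // in-sync also triggers a global checkpoint sync.
    static boolean syncNeededAfter(boolean asyncDurability, long globalCheckpoint, long maxSeqNo,
                                   boolean pendingInSync) {
        return asyncDurability && (globalCheckpoint < maxSeqNo || pendingInSync);
    }

    public static void main(String[] args) {
        // Assumed scenario: the primary's global checkpoint already equals the max seq no,
        // while a recovering replica is still waiting to be marked in-sync.
        long maxSeqNo = 41;
        long globalCheckpoint = 41;
        boolean pendingInSync = true;

        System.out.println("before: " + syncNeededBefore(true, globalCheckpoint, maxSeqNo));
        System.out.println("after:  " + syncNeededAfter(true, globalCheckpoint, maxSeqNo, pendingInSync));
    }
}
```

In this scenario the old condition evaluates to false, so no sync is scheduled, while the new condition evaluates to true because of the pending shard.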

Closes #43486
Yannick Welsch 2019-06-24 18:35:20 +02:00
parent 97cd417829
commit d45f12799c
3 changed files with 6 additions and 4 deletions

@@ -1079,7 +1079,7 @@ public class ReplicationTracker extends AbstractIndexShardComponent implements L
}
/**
- * Whether the are shards blocking global checkpoint advancement. Used by tests.
+ * Whether the are shards blocking global checkpoint advancement.
*/
public synchronized boolean pendingInSync() {
assert primaryMode;

@@ -2134,9 +2134,11 @@ public class IndexShard extends AbstractIndexShardComponent implements IndicesCl
final long globalCheckpoint = replicationTracker.getGlobalCheckpoint();
// async durability means that the local checkpoint might lag (as it is only advanced on fsync)
// periodically ask for the newest local checkpoint by syncing the global checkpoint, so that ultimately the global
- // checkpoint can be synced
+ // checkpoint can be synced. Also take into account that a shard might be pending sync, which means that it isn't
+ // in the in-sync set just yet but might be blocked on waiting for its persisted local checkpoint to catch up to
+ // the global checkpoint.
final boolean syncNeeded =
- (asyncDurability && stats.getGlobalCheckpoint() < stats.getMaxSeqNo())
+ (asyncDurability && (stats.getGlobalCheckpoint() < stats.getMaxSeqNo() || replicationTracker.pendingInSync()))
// check if the persisted global checkpoint
|| StreamSupport
.stream(globalCheckpoints.values().spliterator(), false)

@@ -1330,7 +1330,7 @@ public final class InternalTestCluster extends TestCluster {
}
}
}
- });
+ }, 60, TimeUnit.SECONDS);
}
private void assertOpenTranslogReferences() throws Exception {