OpenSearch/x-pack/plugin
Yannick Welsch 7f8e1454ab Advance checkpoints only after persisting ops (#43205)
Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is
that the local checkpoint is advanced as soon as an operation is processed (but not yet fsynced). This
leaves room for the history below the global checkpoint to still change in case of a crash. As we rely
on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard
copies / follower clusters going out of sync.

This commit required changing some core classes in the system:

- The LocalCheckpointTracker now tracks not only whether an operation has been processed, but
also whether that operation has been persisted to disk.
- TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once
they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this.
- ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all
shard copies when in primary mode. The computed global checkpoint (which represents the
minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored
in the checkpoint entry for the local shard copy, has been moved to an extra field.
- The periodic global checkpoint sync now also takes async durability into account, where the local
checkpoints on shards only advance when the translog is asynchronously fsynced. This means that
the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is
not sufficient anymore.
- The new index closing API does not work when combined with async durability. The shard
verification step now requires an additional pre-flight step to fsync the translog, so that the main
verify shard step has the most up-to-date global checkpoint at its disposal.
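
The split between processed and persisted checkpoints described above can be illustrated with a small sketch. This is not the code from the commit: the class PersistedCheckpointSketch, its nested Tracker, the method names, and the -1 sentinel are all hypothetical, chosen only to show the idea that a checkpoint is the highest sequence number below which everything is contiguous, and that the global checkpoint is the minimum of the persisted local checkpoints of the in-sync copies.

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only: class and method names are hypothetical and do not
// mirror the real LocalCheckpointTracker or ReplicationTracker APIs.
final class PersistedCheckpointSketch {

    // Assumed sentinel meaning "no operation has reached this state yet".
    static final long NO_OPS_PERFORMED = -1L;

    /** Tracks a processed checkpoint and a separate persisted (fsynced) checkpoint. */
    static final class Tracker {
        // Sequence numbers seen out of order, waiting for the gap below them to fill.
        private final TreeSet<Long> pendingProcessed = new TreeSet<>();
        private final TreeSet<Long> pendingPersisted = new TreeSet<>();
        private long processedCheckpoint = NO_OPS_PERFORMED;
        private long persistedCheckpoint = NO_OPS_PERFORMED;

        /** Called when an operation has been applied but not necessarily fsynced. */
        synchronized void markProcessed(long seqNo) {
            if (seqNo > processedCheckpoint) {
                pendingProcessed.add(seqNo);
                processedCheckpoint = advance(pendingProcessed, processedCheckpoint);
            }
        }

        /** Called once the operation is known to be on disk (after an fsync). */
        synchronized void markPersisted(long seqNo) {
            if (seqNo > persistedCheckpoint) {
                pendingPersisted.add(seqNo);
                persistedCheckpoint = advance(pendingPersisted, persistedCheckpoint);
            }
        }

        // Moves the checkpoint forward while the next sequence number is present.
        private static long advance(TreeSet<Long> pending, long checkpoint) {
            while (pending.remove(checkpoint + 1)) {
                checkpoint++;
            }
            return checkpoint;
        }

        synchronized long processedCheckpoint() {
            return processedCheckpoint;
        }

        synchronized long persistedCheckpoint() {
            return persistedCheckpoint;
        }
    }

    /** Global checkpoint as the minimum persisted local checkpoint of the in-sync copies. */
    static long globalCheckpoint(Collection<Long> persistedLocalCheckpoints) {
        return persistedLocalCheckpoints.stream()
            .mapToLong(Long::longValue)
            .min()
            .orElse(NO_OPS_PERFORMED);
    }

    public static void main(String[] args) {
        Tracker primary = new Tracker();
        for (long seqNo = 0; seqNo <= 2; seqNo++) {
            primary.markProcessed(seqNo);
        }
        // Only the first two operations have been fsynced so far.
        primary.markPersisted(0);
        primary.markPersisted(1);
        System.out.println("processed checkpoint: " + primary.processedCheckpoint());
        System.out.println("persisted checkpoint: " + primary.persistedCheckpoint());
        // A second, hypothetical in-sync copy has persisted up to sequence number 2.
        System.out.println("global checkpoint: "
            + globalCheckpoint(List.of(primary.persistedCheckpoint(), 2L)));
    }
}
```

In this sketch the persisted checkpoint lags the processed checkpoint until the corresponding fsync is reported, which is the property the commit relies on so that history below the global checkpoint cannot change after a crash.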
2019-06-20 11:12:38 +02:00
ccr Advance checkpoints only after persisting ops (#43205) 2019-06-20 11:12:38 +02:00
core Advance checkpoints only after persisting ops (#43205) 2019-06-20 11:12:38 +02:00
data-frame [ML][Data Frame] make response.count be total count of hits (#43241) (#43389) 2019-06-19 16:19:06 -05:00
deprecation Fix hang in test for "too many fields" dep. check (#42909) 2019-06-06 08:28:32 -06:00
graph Testclusters: graph (#43033) 2019-06-13 09:50:59 +03:00
ilm [7.x] Narrow period of Shrink action in which ILM prevents stopping (#43254) (#43393) 2019-06-19 16:37:41 -06:00
logstash Remove description from xpack feature sets (#43065) 2019-06-11 09:22:58 -07:00
ml Remove stale test logging annotations (#43403) 2019-06-19 22:58:22 -04:00
monitoring Return 0 for negative "free" and "total" memory reported by the OS (#42725) 2019-06-19 10:35:48 -06:00
rollup Remove description from xpack feature sets (#43065) 2019-06-11 09:22:58 -07:00
security Remove stale test logging annotations (#43403) 2019-06-19 22:58:22 -04:00
sql Fix NPE in case of subsequent scrolled requests for a CSV/TSV formatted response (#43365) 2019-06-20 11:26:11 +03:00
src/test [ML][Data Frame] make response.count be total count of hits (#43241) (#43389) 2019-06-19 16:19:06 -05:00
vectors Move dense_vector and sparse_vector to module (#43280) (#43333) 2019-06-18 11:56:04 -04:00
watcher Remove stale test logging annotations (#43403) 2019-06-19 22:58:22 -04:00
build.gradle Remove trace logging from ML datafeeds in tests 2019-06-18 22:24:36 -04:00