mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-03-09 14:34:43 +00:00
Snapshot deletes should first check the cluster state for an in-progress snapshot and try to abort it before checking the repository contents. This allows for atomically checking and aborting a snapshot in the same cluster state update, removing all possible races where a snapshot that is in-progress could not be found if it finishes between checking the repository contents and the cluster state. Also removes confusing races, where checking the cluster state off of the cluster state thread finds an in-progress snapshot that is then not found in the cluster state update to abort it. Finally, the logic to use the repository generation of the in-progress snapshot + 1 was error prone because it would always fail the delete when the repository had a pending generation different from its safe generation when a snapshot started (leading to the snapshot finalizing at a higher generation). These issues (particularly that last point) can easily be reproduced by running `SLMSnapshotBlockingIntegTests` in a loop with current `master` (see #54766). The snapshot resiliency test for concurrent snapshot creation and deletion was made to more aggressively start the delete operation so that the above races would become visible. Previously, the fact that deletes would never coincide with initializing snapshots resulted in a number of the above races not reproducing. This PR is the most consistent I could get snapshot deletes without changes to the state machine. The fact that aborted deletes will not put the delete operation in the cluster state before waiting for the snapshot to abort still allows for some possible (though practically very unlikely) races. These will be fixed by a state-machine change in upcoming work in #54705 (which will have a much simpler and clearer diff after this change). Closes #54766