Ensure cluster is stable in ShrinkIndexIT.testShrinkThenSplitWithFailedNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails because the resize operation is not acknowledged (see #44736). This resize operation creates a new index "splitagain", which results in a cluster state update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() to create the resized index).

This cluster state update is expected to be acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()), but this is not always the case: the data node that was just stopped in the test before executing the resize operation might still be considered a "faulty" node (and not yet removed from the cluster nodes) by the FollowersChecker. The cluster state is then acked on all nodes but one, which results in a non-acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in the test. The goal is to ensure that the data node has been correctly removed from the cluster and that all nodes are fully connected to each other before moving forward with the resize operation.

Closes #44736
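
The failure mode described above can be illustrated with a small standalone sketch. It is not Elasticsearch code: the class AckSketch, the method isAcknowledged and the node names are hypothetical, and the sketch only assumes that an update counts as acknowledged when every node still listed in the cluster state has acked it.

```java
import java.util.Set;

public class AckSketch {
    // Hypothetical model of the acknowledgement rule: the update is acknowledged
    // only if every node still listed in the cluster state has acked it.
    public static boolean isAcknowledged(Set<String> clusterNodes, Set<String> ackedNodes) {
        return ackedNodes.containsAll(clusterNodes);
    }

    public static void main(String[] args) {
        // Nodes that are actually alive and able to ack the update.
        Set<String> acked = Set.of("master", "data-1");

        // Without waiting: the stopped node is still listed, so one ack is missing.
        Set<String> staleNodes = Set.of("master", "data-1", "stopped-node");
        System.out.println(isAcknowledged(staleNodes, acked)); // prints "false"

        // After waiting for a stable cluster: the stopped node has been removed.
        Set<String> stableNodes = Set.of("master", "data-1");
        System.out.println(isAcknowledged(stableNodes, acked)); // prints "true"
    }
}
```

In this model, waiting for the expected node count before issuing the update plays the role of ensureStableCluster(nodeCount - 1) in the test.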
2025-03-25 01:19:02 +00:00 · 2019-07-26 10:12:59 +02:00 · 2019-07-26 10:12:59 +02:00 · 8848fcfb22
commit 8848fcfb22
parent 6ea2b5dec0
1 changed file with 2 additions and 0 deletions
--- a/server/src/test/java/org/elasticsearch/action/admin/indices/create/ShrinkIndexIT.java
+++ b/server/src/test/java/org/elasticsearch/action/admin/indices/create/ShrinkIndexIT.java
@@ -580,7 +580,9 @@ public class ShrinkIndexIT extends ESIntegTestCase {
            .build()).setResizeType(ResizeType.SHRINK).get());
        ensureGreen();

+        final int nodeCount = cluster().size();
        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(shrinkNode));
+        ensureStableCluster(nodeCount - 1);

        // demonstrate that the index.routing.allocation.initial_recovery setting from the shrink doesn't carry over into the split index,
        // because this would cause the shrink to fail as the initial_recovery node is no longer present.