ClusterHealth shouldn't fail with "unexpected failure" if master steps down while waiting for events

To wait for events of a certain priority to pass, TransportClusterHealthAction submits a cluster state update task. If the current master steps down while this task is in the queue, the task fails, causing ClusterHealth to report an unexpected error. We often use this request to ensure cluster stability in tests after a disruption. However, depending on the nature of the failure, two master election rounds may be needed (if we're unfortunate), and the issue above causes the get health request to fail after the first one. Instead, we should wait for a new master to be elected (or for the local node to be re-elected) and retry.

Closes #11493
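
A rough, self-contained sketch of the retry idea described above, not the actual Elasticsearch code. `HealthTask`, `ToyClusterService` and `RetryingHealthSketch` are hypothetical stand-ins invented for illustration; only the `onNoLongerMaster` / `doExecute` naming mirrors the change itself. The real `doExecute` waits for a master to be available, which this toy model replaces with a plain re-submission.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a cluster state update task with a step-down callback.
interface HealthTask {
    void execute();          // called when the node is still master as the task runs
    void onNoLongerMaster(); // called when the node stepped down before the task ran
}

// Hypothetical toy "cluster service": a task queue plus a flag for mastership.
class ToyClusterService {
    private final Queue<HealthTask> queue = new ArrayDeque<>();
    boolean master = true;

    void submit(HealthTask task) {
        queue.add(task);
    }

    // Runs the next queued task; returns false if the queue was empty.
    boolean runNext() {
        HealthTask task = queue.poll();
        if (task == null) {
            return false;
        }
        if (master) {
            task.execute();
        } else {
            task.onNoLongerMaster();
        }
        return true;
    }
}

class RetryingHealthSketch {
    final List<String> log = new ArrayList<>();
    private final ToyClusterService service;

    RetryingHealthSketch(ToyClusterService service) {
        this.service = service;
    }

    // Submits the "wait for events" task; on step-down it retries instead of failing.
    void doExecute() {
        service.submit(new HealthTask() {
            @Override
            public void execute() {
                log.add("health");
            }

            @Override
            public void onNoLongerMaster() {
                log.add("retry");
                doExecute(); // re-submit rather than report an unexpected failure
            }
        });
    }

    public static void main(String[] args) {
        ToyClusterService service = new ToyClusterService();
        RetryingHealthSketch sketch = new RetryingHealthSketch(service);
        sketch.doExecute();
        service.master = false; // master steps down while the task is queued
        service.runNext();
        service.master = true;  // node is re-elected
        service.runNext();
        System.out.println(sketch.log);
    }
}
```

In this toy model the task queued before the step-down is answered with `onNoLongerMaster`, which queues a fresh attempt; that attempt succeeds once the node is master again.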
2025-03-09 14:34:43 +00:00 · 2015-06-04 13:07:16 +02:00 · 2015-06-04 13:07:16 +02:00 · 129d8ec29a
commit 129d8ec29a
parent 6ab3141021
1 changed file with 6 additions and 0 deletions
--- a/src/main/java/org/elasticsearch/action/admin/cluster/health/TransportClusterHealthAction.java
+++ b/src/main/java/org/elasticsearch/action/admin/cluster/health/TransportClusterHealthAction.java
@@ -84,6 +84,12 @@ public class TransportClusterHealthAction extends TransportMasterNodeReadAction<
                    executeHealth(request, listener);
                }

+                @Override
+                public void onNoLongerMaster(String source) {
+                    logger.trace("stopped being master while waiting for events with priority [{}]. retrying.", request.waitForEvents());
+                    doExecute(request, listener);
+                }
+
                @Override
                public void onFailure(String source, Throwable t) {
                    logger.error("unexpected failure during [{}]", t, source);