[ML] Reduce persistent tasks periodic reassignment interval in ... (#36845)

... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in
`MlDistributedFailureIT. testLoseDedicatedMasterNode`.
The test launches a cluster comprised by a dedicated master node
and a data and ML node. It creates a job and datafeed and starts them.
It then shuts down and restarts the master node. Finally, the test asserts
that the two tasks have been reassigned within 10s.

The intermittent failure is due to the assertions that the tasks have been
reassigned failing. Investigating the failure revealed that the `assertBusy`
that performs that assertion times out. Furthermore, it appears that the
job task is not reassigned because the memory tracking info is stale.

Memory tracking info is refreshed asynchronously when a job is attempted
to be reassigned. Tasks are attempted to be reassigned either due to a relevant
cluster state change or periodically. The periodic interval is controlled by a cluster
setting called `cluster.persistent_tasks.allocation.recheck_interval` and defaults to 30s.

What seems to be happening in this test is that if all cluster state changes after the
master node is restarted come through before the async memory info refresh completes,
then the job might take up to 30s until it is attempted to reassigned. Thus the `assertBusy`
times out.

This commit changes the test to reduce the periodic check that reassigns persistent
tasks to `200ms`. If the above theory is correct, this should eradicate those failures.

Closes #36760
This commit is contained in:
Dimitris Athanasiou 2018-12-20 14:53:36 +02:00 committed by GitHub
parent 0f2f00a20a
commit 08bcd83757
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 15 additions and 2 deletions

View File

@ -73,7 +73,8 @@ public class PersistentTasksClusterService implements ClusterStateListener, Clos
this::setRecheckInterval);
}
void setRecheckInterval(TimeValue recheckInterval) {
// visible for testing only
public void setRecheckInterval(TimeValue recheckInterval) {
periodicRechecker.setInterval(recheckInterval);
}

View File

@ -15,12 +15,14 @@ import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentHelper;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.persistent.PersistentTasksClusterService;
import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;
import org.elasticsearch.test.junit.annotations.TestLogging;
@ -72,7 +74,6 @@ public class MlDistributedFailureIT extends BaseMlIntegTestCase {
});
}
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/32905")
public void testLoseDedicatedMasterNode() throws Exception {
internalCluster().ensureAtMostNumDataNodes(0);
logger.info("Starting dedicated master node...");
@ -290,6 +291,17 @@ public class MlDistributedFailureIT extends BaseMlIntegTestCase {
client().admin().indices().prepareSyncedFlush().get();
disrupt.run();
PersistentTasksClusterService persistentTasksClusterService =
internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
// Speed up rechecks to a rate that is quicker than what settings would allow.
// The tests would work eventually without doing this, but the assertBusy() below
// would need to wait 30 seconds, which would make the suite run very slowly.
// The 200ms refresh puts a greater burden on the master node to recheck
// persistent tasks, but it will cope in these tests as it's not doing anything
// else.
persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(200));
assertBusy(() -> {
ClusterState clusterState = client().admin().cluster().prepareState().get().getState();
PersistentTasksCustomMetaData tasks = clusterState.metaData().custom(PersistentTasksCustomMetaData.TYPE);