- Initial targetWorkerCount must be subject to pool size limits
- Use consistent workerSetupData for the entire autoscaling run
- Don't call terminate() when we have nothing to terminate
- Terminate obsolete workers even faster
- Log and return immediately when workerSetupData is null
- Allow provisioning more nodes while other nodes are still provisioning
- Add tests for bumping up the minimum version
This should let us grow and shrink the worker pool in chunks when necessary
(like when a bunch of them go offline, or when there is a worker version
change).
Each one now one operates on at most a collection of segments that comprise
a single partition. The main purpose of this change is to prevent audit log
payload sizes from getting out of control.
- Workers are not added to zkWorkers until caches have been initialized.
- Worker status we haven't heard about will be added to runningTasks or
completeTasks as appropriate.
- TaskRunnerWorkItem now only needs a taskId, not the entire Task. This makes
it possible to create them from TaskStatus objects, if that's all we have.
- Also remove some dead code.
The TaskQueue directly manages the TaskRunner. The main management loop runs
periodically and checks that the runner is doing reasonable things. If not, it
attempts to adjust the runner. The management loop also runs on-demand when a
task is added to keep task assignment relatively low latency. The TaskConsumer
is no longer necessary and so it no longer exists.
Task interval locks are handled differently. Instead of some tasks acquiring
locks at runtime and some tasks having implicit fixed lock intervals, all tasks
ask for locks explicitly. This occurs either in "isReady" (which runs on the
overlord) or in "run" (which runs on the peon).
Other changes:
- The TaskQueue is attached to the leader lifecycle, instead of global
- The TaskLockbox is able to sync itself from storage and is no longer
bootstrapped by the TaskQueue.
- RemoteTaskRunner does not clean up zk paths until asked to. This will
prevent deletion of statuses that have not yet been committed.
- Added retries on DbTaskStorage operations.
- Removed SpawnTasksAction (no more subtasks)
- Removed obsolete EventReceiverFirehose configs
- Removed obsolete OldOverlordResource
- Removed TaskStorageQueryAdapter methods related to subtasks