The initial implementation of two phase commit based cluster state publishing (#13062) relied on a single in memory "pending" cluster state that is only processed by ZenDiscovery once committed by the master. While this is fine on it's own, it resulted in an issue with acknowledged APIs, such as the open index API, in the extreme case where a node falls behind and receives a commit message after a new cluster state has been published. Specifically:
1) Master receives and acked-API call and publishes cluster state CS1
2) Master waits for a min-master nodes to receives CS1 and commits it.
3) All nodes that have responded to CS1 are sent a commit message, however, node N didn't respond yet
4) Master waits for publish timeout (defaults to 30s) for all nodes to process the commit. Node N fails to do so.
5) Master publishes a cluster state CS2. Node N responds to cluster state CS1's publishing but receives cluster state CS2 before the commit for CS1 arrives.
6) The commit message for cluster CS1 is processed on node N, but fails because CS2 is pending. This caused the acked API in step 1 to return (but CS2 , is not yet processed).
In this case, the action indicated by CS1 is not yet executed on node N and therefore the acked API calls return pre-maturely. Note that once CS2 is processed but the change in CS1 takes effect (cluster state operations are safe to batch and we do so all the time).
An example failure can be found on: http://build-us-00.elastic.co/job/es_feature_two_phase_pub/314/
This commit extracts the already existing pending cluster state queue (processNewClusterStates) from ZenDiscovery into it's own class, which serves as a temporary container for in-flight cluster states. Once committed the cluster states are transferred to ZenDiscovery as they used to before. This allows "lagging" cluster states to still be successfully committed and processed (and likely to be ignored as a newer cluster state has already been processed).
As a side effect, all batching logic is now extracted from ZenDiscovery and is unit tested.
Previously multiple clusters in the same JVM reused the same port ranges, leading to potential big gaps in port selection, which in turns causes unicast based discovery to fail, missing to find another node in the default 5 port range.
Also the previous logic had http use a range that is assigned to another JVMs.
We default the value to be 20x the value of a ping timeout, however we only use the legacy ping timeout settings value for the calculation.
Closes#13162
This commit removes and now forbids all uses of
com.google.common.collect.Lists across the codebase. This is the first
of many steps in the eventual removal of Guava as a dependency.
lessThanOrEqualTo is more appropriate when comparing _ttl than lessThan
because in rare cases, when tests run very fast, the ttl you fetch will
still equal the one you sent.
Some listeners may need to do work before a shard's path is
accessed (such as creating the directory in a plugin), so the listener
should be called before anything happens (as its name implies).
detect_noop is pretty cheap and noop updates compartively expensive so this
feels like a sensible default.
Also had to do some testing and documentation around how _ttl works with
detect_noop.
Closes#11282