15167 Commits

Author SHA1 Message Date
Boaz Leskes 80b59e0d66 Discovery: Add a dedicated queue for incoming ClusterStates
The initial implementation of two-phase-commit-based cluster state publishing (#13062) relied on a single in-memory "pending" cluster state that is only processed by ZenDiscovery once committed by the master. While this is fine on its own, it resulted in an issue with acknowledged APIs, such as the open index API, in the extreme case where a node falls behind and receives a commit message after a new cluster state has been published. Specifically:

1) Master receives an acked-API call and publishes cluster state CS1.
2) Master waits for min-master nodes to receive CS1 and commits it.
3) All nodes that have responded to CS1 are sent a commit message; however, node N hasn't responded yet.
4) Master waits for the publish timeout (defaults to 30s) for all nodes to process the commit. Node N fails to do so.
5) Master publishes a cluster state CS2. Node N responds to cluster state CS1's publishing but receives cluster state CS2 before the commit for CS1 arrives.
6) The commit message for cluster state CS1 is processed on node N, but fails because CS2 is pending. This causes the acked API in step 1 to return (but CS2 is not yet processed).

In this case, the action indicated by CS1 is not yet executed on node N and therefore the acked API call returns prematurely. Note that once CS2 is processed, the change in CS1 takes effect (cluster state operations are safe to batch and we do so all the time).

An example failure can be found on: http://build-us-00.elastic.co/job/es_feature_two_phase_pub/314/

This commit extracts the already existing pending cluster state queue (processNewClusterStates) from ZenDiscovery into its own class, which serves as a temporary container for in-flight cluster states. Once committed, the cluster states are transferred to ZenDiscovery as before. This allows "lagging" cluster states to still be successfully committed and processed (and likely ignored, as a newer cluster state has already been processed).

As a side effect, all batching logic is now extracted from ZenDiscovery and is unit tested.
2015-09-11 09:23:41 +02:00
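The queue described in commit 80b59e0d66 can be pictured with a minimal sketch. This is not the Elasticsearch class: `PendingState`, `PendingStatesQueueSketch`, and all method names are hypothetical, and a cluster state is reduced to an id and a version. It only illustrates how holding several in-flight states lets a late commit for CS1 succeed even though CS2 has already arrived.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a published cluster state: just an id and a version.
final class PendingState {
    final String id;
    final long version;
    boolean committed;

    PendingState(String id, long version) {
        this.id = id;
        this.version = version;
    }
}

// Hypothetical container for in-flight cluster states (illustration only).
final class PendingStatesQueueSketch {
    private final List<PendingState> pending = new ArrayList<>();

    // Phase 1: a published state is stored but is not yet eligible for processing.
    synchronized void addPending(String id, long version) {
        pending.add(new PendingState(id, version));
    }

    // Phase 2: the commit message arrives. Returns false if the state is unknown.
    synchronized boolean markCommitted(String id) {
        for (PendingState s : pending) {
            if (s.id.equals(id)) {
                s.committed = true;
                return true;
            }
        }
        return false;
    }

    // Returns the committed state with the highest version, or null if none is committed.
    synchronized PendingState nextCommitted() {
        PendingState best = null;
        for (PendingState s : pending) {
            if (s.committed && (best == null || s.version > best.version)) {
                best = s;
            }
        }
        return best;
    }

    // Once a state has been applied, it and every older state can be dropped (batching).
    synchronized void markProcessed(PendingState processed) {
        pending.removeIf(s -> s.version <= processed.version);
    }

    synchronized int size() {
        return pending.size();
    }
}
```

With a single pending slot, the arrival of CS2 would have displaced CS1 and the late commit for CS1 would have failed. Here both states coexist until one of them is processed.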
Boaz Leskes 218979da1b remove committedOrFailed and use committedOrFailedLatch for state 2015-08-28 12:31:46 +02:00
Boaz Leskes 10e8c410ea more feedback 2015-08-28 12:31:46 +02:00
Boaz Leskes 0668e0d623 more feedback 2015-08-28 12:31:46 +02:00
Boaz Leskes c9ee8dbd16 tighten up FailedToCommitClusterStateException semantics and other feedback 2015-08-28 12:31:45 +02:00
Boaz Leskes 98ed133dd7 reduce log chatter 2015-08-28 12:31:45 +02:00
Boaz Leskes d9f6e302b5 doc feedback 2015-08-28 12:31:45 +02:00
Boaz Leskes f70ed876d6 added docs 2015-08-28 12:31:45 +02:00
Boaz Leskes c7c65b626f commit timeout default should never be larger than publishing timeout 2015-08-28 12:31:45 +02:00
Boaz Leskes 6208248215 fix defaults in DiscoverySettings 2015-08-28 12:31:44 +02:00
Boaz Leskes 91dee8b311 reject older cluster state from the same master 2015-08-28 12:31:44 +02:00
Boaz Leskes a56d67d8d7 force mock transport in testCanNotPublishWithoutMinMastNodes 2015-08-28 12:31:44 +02:00
Boaz Leskes e3e0aa5049 Improved concurrency controls in SendingController to make sure that a CS is never committed after publishing is marked as timed out 2015-08-28 12:31:44 +02:00
Boaz Leskes 234a3794e5 improved timeout handling 2015-08-28 12:31:44 +02:00
Boaz Leskes 4d31681057 added constructor to FailedToCommitException 2015-08-28 12:31:43 +02:00
Boaz Leskes 7d3a36b20f fix ZenDiscoveryUnitTest.testShouldIgnoreNewClusterState 2015-08-28 12:31:43 +02:00
Boaz Leskes 7390bcf833 add FailedToCommitException to registration 2015-08-28 12:31:43 +02:00
Boaz Leskes b702843fe9 beefed up testing... 2015-08-28 12:31:43 +02:00
Boaz Leskes 81e07e81e0 simplified PublishClusterStateActionTests infra 2015-08-28 12:31:42 +02:00
Boaz Leskes 3815a41626 initial copy over from POC 2015-08-28 12:31:42 +02:00
Boaz Leskes 35f9ee7a62 Tests: better isolation of cluster ports
Previously, multiple clusters in the same JVM reused the same port ranges, leading to potentially big gaps in port selection, which in turn caused unicast-based discovery to fail because it could not find another node within the default 5-port range.

Also, the previous logic had HTTP use a range that is assigned to other JVMs.
2015-08-28 11:39:30 +02:00
Michael McCandless 07b5d22d91 disable new test on windows 2015-08-28 05:06:35 -04:00
Michael McCandless fb703845dd Merge pull request #13158 from mikemccand/new_path_for_shard_test
Add unit test for ShardPath.selectNewPathForShard
2015-08-28 04:15:15 -04:00
Michael McCandless b646ed9cd8 try to work on Windows too 2015-08-28 04:13:21 -04:00
Michael McCandless 8dbc1fbdbd use ShardPath.getRootStatePath; allow forbidden API 2015-08-28 03:59:02 -04:00
Boaz Leskes db5e225a25 Discovery: fix `discovery.zen.join_timeout` default value logic
We default the value to 20x the ping timeout; however, we only used the legacy ping timeout setting's value for the calculation.

Closes #13162
2015-08-28 09:47:15 +02:00
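The default described in commit db5e225a25 can be illustrated with a small sketch. Only `discovery.zen.join_timeout` and the 20x factor come from the commit message. The two ping timeout keys, the 3-second fallback, and the class and method names are assumptions made for illustration, and settings are modelled as a plain map of millisecond values rather than the real settings API.

```java
import java.util.Map;

// Hypothetical helper showing one way to derive the join timeout default.
final class JoinTimeoutDefaults {
    static final String JOIN_TIMEOUT = "discovery.zen.join_timeout";
    // Assumed key names for the current and legacy ping timeout settings.
    static final String PING_TIMEOUT = "discovery.zen.ping_timeout";
    static final String LEGACY_PING_TIMEOUT = "discovery.zen.ping.timeout";
    // Assumed fallback when neither ping timeout key is set.
    static final long DEFAULT_PING_TIMEOUT_MS = 3000;

    // Prefer the current key, then the legacy key, then the built-in fallback.
    static long pingTimeoutMs(Map<String, Long> settings) {
        Long current = settings.get(PING_TIMEOUT);
        if (current != null) {
            return current;
        }
        Long legacy = settings.get(LEGACY_PING_TIMEOUT);
        return legacy != null ? legacy : DEFAULT_PING_TIMEOUT_MS;
    }

    // An explicit join timeout wins; otherwise default to 20x the effective ping timeout.
    static long joinTimeoutMs(Map<String, Long> settings) {
        Long explicit = settings.get(JOIN_TIMEOUT);
        return explicit != null ? explicit : 20 * pingTimeoutMs(settings);
    }
}
```

The point of the sketch is the lookup order: if only the legacy key were consulted, a ping timeout configured under the current key would not influence the join timeout default.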
javanna 9b2e77903d Internal: make ValidationException methods final and fix javadocs 2015-08-28 09:41:47 +02:00
javanna 37ec221df5 Internal: remove unused MapperQueryParser constructor 2015-08-28 09:38:29 +02:00
Jason Tedor 90bc784194 Work around for JDK-8039214 on JDK 9 2015-08-27 23:29:22 -04:00
Jason Tedor 3e88cc0bd0 Merge pull request #13170 from jasontedor/fix/lists-be-gone
Remove and forbid use of com.google.common.collect.Lists
2015-08-27 22:19:44 -04:00
Jason Tedor 3067cacb66 Remove and forbid use of com.google.common.collect.Lists
This commit removes and now forbids all uses of
com.google.common.collect.Lists across the codebase. This is the first
of many steps in the eventual removal of Guava as a dependency.
2015-08-27 22:14:33 -04:00
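As an illustration of the kind of change commit 3067cacb66 implies, the sketch below shows JDK-only replacements for two common `Lists.newArrayList` calls from Guava. The class and method names are hypothetical, and the actual replacements made in the commit may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical examples of replacing Guava's Lists helpers with plain JDK code.
final class ListsMigrationSketch {
    // Before (Guava): List<String> names = Lists.newArrayList();
    static List<String> emptyMutableList() {
        return new ArrayList<>();
    }

    // Before (Guava): List<String> names = Lists.newArrayList("a", "b");
    // Arrays.asList alone is fixed-size, so copy it to keep the list growable.
    static List<String> mutableListOf(String... items) {
        return new ArrayList<>(Arrays.asList(items));
    }
}
```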
Igor Motov 2b87d7d919 Add `readonly` option for repositories
Closes #7831
Closes #11753
2015-08-27 18:21:29 -04:00
Simon Willnauer f64a875e03 use provided version in smoke test file paths 2015-08-27 23:20:01 +02:00
Nik Everett 144a641a5d Merge pull request #13165 from nik9000/fix_ttl_test
Use proper comparison operator in ttl test
2015-08-27 16:49:27 -04:00
Nik Everett 19a79c99f9 [test] Use proper comparison operator
lessThanOrEqualTo is more appropriate when comparing _ttl than lessThan
because in rare cases, when tests run very fast, the ttl you fetch will
still equal the one you sent.
2015-08-27 16:43:10 -04:00
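The reasoning in commit 19a79c99f9 can be shown with a tiny sketch. The class and method names are hypothetical and the remaining-TTL arithmetic is a simplification; the only point taken from the commit is that a strict comparison fails when no measurable time has elapsed between sending and fetching.

```java
// Hypothetical model of the two assertions a _ttl test could make.
final class TtlAssertionSketch {
    // Simplified remaining TTL: the value sent minus the time elapsed before the fetch.
    static long remainingTtlMs(long sentTtlMs, long elapsedMs) {
        return Math.max(0, sentTtlMs - elapsedMs);
    }

    // Equivalent of a lessThan check: breaks when the test runs within the same millisecond.
    static boolean strictCheck(long fetchedTtlMs, long sentTtlMs) {
        return fetchedTtlMs < sentTtlMs;
    }

    // Equivalent of a lessThanOrEqualTo check: tolerates a fetched value equal to the sent one.
    static boolean inclusiveCheck(long fetchedTtlMs, long sentTtlMs) {
        return fetchedTtlMs <= sentTtlMs;
    }
}
```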
Britta Weber e6eeadd171 [test] make sure that the scripts in testScoreAccessWithinScript never compute log(0) 2015-08-27 22:02:51 +02:00
Ryan Ernst 48ea97cace Merge pull request #13133 from rjernst/fix/bwc_creation
Fix generation scripts for bwc indexes, and add 2.0 beta1 index
2015-08-27 10:21:32 -07:00
Ryan Ernst 38b8f20cc5 Make 0.x and 1.x indexes still work with get-bwc-version 2015-08-27 10:19:59 -07:00
Ryan Ernst 448d3498b1 Merge branch 'master' into fix/bwc_creation 2015-08-27 10:16:45 -07:00
Michael McCandless e2e1b7f76a reference original issue 2015-08-27 13:06:00 -04:00
Michael McCandless 30a3e431ec polish 2015-08-27 13:01:36 -04:00
Michael McCandless 11f09f0a68 add basic unit test 2015-08-27 12:33:04 -04:00
Michael McCandless 4d38856f70 simplify API for ShardPath.selectNewPathForShard to enable unit testing: don't pass IndexShard 2015-08-27 12:32:21 -04:00
Lee Hinman 9f03f8cf44 Call `beforeIndexShardCreated` listener earlier in `createShard`
Some listeners may need to do work before a shard's path is
accessed (such as creating the directory in a plugin), so the listener
should be called before anything happens (as its name implies).
2015-08-27 10:05:27 -06:00
Nik Everett 38fdacdbf7 Merge pull request #11306 from nik9000/default_detect_noop
Default detect_noop to true
2015-08-27 11:22:13 -04:00
Michael McCandless 8f2ae59316 add asserts to make sure mocking 'took' 2015-08-27 11:19:55 -04:00
Michael McCandless 7a8a608d50 initial mock filesystem setup for test case 2015-08-27 10:55:04 -04:00
Nik Everett 9eb684da51 Default detect_noop to true
detect_noop is pretty cheap and noop updates are comparatively expensive, so this
feels like a sensible default.

Also had to do some testing and documentation around how _ttl works with
detect_noop.

Closes #11282
2015-08-27 10:34:18 -04:00
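To make the idea behind `detect_noop` in commit 9eb684da51 concrete, here is a minimal sketch of noop detection on a flat document. It is an assumption-laden illustration, not the Elasticsearch implementation: `DetectNoopSketch` and `applyUpdate` are hypothetical names, and nested objects and `_ttl` handling are ignored.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical noop detection: merge a partial document and report whether anything changed.
final class DetectNoopSketch {
    // Applies the changes to source in place and returns true only if a value actually differed.
    static boolean applyUpdate(Map<String, Object> source, Map<String, Object> changes) {
        boolean modified = false;
        for (Map.Entry<String, Object> change : changes.entrySet()) {
            Object previous = source.put(change.getKey(), change.getValue());
            if (!Objects.equals(previous, change.getValue())) {
                modified = true;
            }
        }
        return modified;
    }
}
```

A caller could skip the comparatively expensive write when `applyUpdate` returns false, which is the trade-off the commit message describes.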
Simon Willnauer 9a1b5cf966 [TEST] comparing paths seems to be hard on Windows 2015-08-27 13:20:22 +02:00
Dan Tuffery d8298e1d3a Update query_dsl.asciidoc
Fixed typo.
2015-08-27 12:47:15 +02:00