OpenSearch

Commit Graph

Author	SHA1	Message	Date
Boaz Leskes	ff8b7409f7	[Discovery] add a debug log if a node responds to a publish request after publishing timed out.	2014-08-27 15:47:41 +02:00
Martijn van Groningen	5932371f21	[TEST] Adapt testNoMasterActions since metadata isn't cleared if there is a no master block	2014-08-27 15:47:41 +02:00
Martijn van Groningen	c8919e4bf5	[TEST] Changed action names.	2014-08-27 15:47:41 +02:00
Martijn van Groningen	702890e461	[TEST] Remove the forceful `network.mode` setting in DiscoveryWithServiceDisruptions#testMasterNodeGCs now local transport use worker threads.	2014-08-27 15:47:41 +02:00
Boaz Leskes	26d90882e5	[Transport] Introduced worker threads to prevent alien threads from entering a node. Requests are handled by the worker thread pool of the target node instead of the generic thread pool of the source node. Also this change is required in order to make GC disruption work with local transport. Previously the handling of a request was performed on a node that was being GC disrupted, resulting in some actions being performed while GC was being simulated.	2014-08-27 15:47:40 +02:00
Martijn van Groningen	966a55d21c	Typo: s/Recieved/Received	2014-08-27 15:47:40 +02:00
Martijn van Groningen	47326adb67	[TEST] Make sure all shards are allocated before killing a random data node.	2014-08-27 15:47:40 +02:00
Martijn van Groningen	403ebc9e07	[Discovery] Added cluster version and master node to the nodes fault detecting ping request The cluster state version allows resolving the case where an old master node becomes unresponsive and later wakes up and pings all the nodes in the cluster, allowing the newly elected master to decide whether it should step down or ask the old master to rejoin.	2014-08-27 15:47:40 +02:00
Boaz Leskes	50f852ffeb	[TEST] Added LongGCDisruption and a test simulating GC on master nodes Also rename DiscoveryWithNetworkFailuresTests to DiscoveryWithServiceDisruptions which better suits what we do.	2014-08-27 15:47:40 +02:00
Martijn van Groningen	4b8456e954	[Discovery] Master fault detection and nodes fault detection should take cluster name into account. Both master fault detection and nodes fault detection request should also send the cluster name, so that on the receiving side the handling of these requests can be failed with an error. This error can be caught on the sending side and for master fault detection the node can fail the master locally and for nodes fault detection the node can be failed. Note this validation will most likely never fail in a production cluster, but it can during automated tests where cluster / nodes are created and destroyed very frequently.	2014-08-27 15:47:39 +02:00
Martijn van Groningen	364374dd03	[TEST] Added test that verifies that no shard relocations happen during / after a master re-election.	2014-08-27 15:47:39 +02:00
Martijn van Groningen	130e680cfb	[Discovery] Made the handling of the join request batch oriented. In large clusters when a new elected master is chosen, there are many join requests to handle. By batching them up, the cluster state doesn't get published for each individual join request; instead many are handled at the same time, which results in a single new cluster state being published. Closes #6984	2014-08-27 15:47:39 +02:00
Shay Banon	0244ddb0cd	retry logic to unwrap exception to check for illegal state it probably comes wrapped in a remote exception, which we should unwrap in order to detect it..., also, simplified a bit the retry logic	2014-08-27 15:47:39 +02:00
Boaz Leskes	cccd060a0c	[Discovery] verify we have a master after a successful join request After master election, nodes send join requests to the elected master. Master is then responsible for publishing a new cluster state which sets the master on the local node's cluster state. If something goes wrong with the cluster state publishing, this process will not successfully complete. We should check it after the join request returns and if it failed, retry pinging. Closes #6969	2014-08-27 15:47:38 +02:00
Boaz Leskes	ffcf1077d8	[Discovery] join master after first election Currently, pinging results are only used if the local node is elected master or if they detect another already active master. This has the effect that master election requires two pinging rounds - one for the elected master to take its role and another for the other nodes to detect it and join the cluster. We can be smarter and use the election of the first round on other nodes as well. Those nodes can try to join the elected master immediately. There is a catch though - the elected master node may still be processing the election and may reject the join request if not ready yet. To compensate a retry mechanism is introduced to try again (up to 3 times by default) if this happens. Closes #6943	2014-08-27 15:47:38 +02:00
Boaz Leskes	a40984887b	[Tests] Fixed some issues with SlowClusterStateProcessing Reduced expected time to heal to 0 (we interrupt and wait on stop disruption). It was also wrongly indicated in seconds. We didn't properly wait between slow cluster state tasks	2014-08-27 15:47:38 +02:00
Martijn van Groningen	c2142c0f6d	Discovery: Don't include local node to pingMasters list. We might end up electing ourselves without any form of verification.	2014-08-27 15:47:38 +02:00
Martijn van Groningen	5e38e9eb4f	Discovery: Only add local node to possibleMasterNodes if it is a master node.	2014-08-27 15:47:37 +02:00
Martijn van Groningen	67685cb026	Discovery: If not enough possible masters are found, but there are masters to ping (ping responses did include master node) then these nodes should be resolved. After the findMaster() call we try to connect to the node and if it isn't the master we start looking for a new master via pinging again. Closes #6904	2014-08-27 15:47:37 +02:00
Boaz Leskes	f029a24d53	[Store] migrate non-allocated shard deletion to use ClusterStateNonMasterUpdateTask	2014-08-27 15:47:37 +02:00
Boaz Leskes	bebaf9799c	[Tests] stability improvements added explicit cleaning of temp unicast ping results reduce gateway local.list_timeout to 10s. testVerifyApiBlocksDuringPartition: verify master node has stepped down before restoring partition	2014-08-27 15:47:30 +02:00
Boaz Leskes	ea2783787c	[Tests] Introduced ClusterDiscoveryConfiguration Closes #6890	2014-08-27 15:47:23 +02:00
Boaz Leskes	ccabb4aa20	Remove unneeded reference to DiscoveryService which potentially causes circular references	2014-08-27 15:47:23 +02:00
Boaz Leskes	7fa3d7081b	[logging] don't log an error if scheduled reroute is rejected because local node is no longer master Since it runs in a background thread after a node is added, or submits a cluster state update when a node leaves, it may be that by the time it is executed the local node is no longer master.	2014-08-27 15:47:23 +02:00
Boaz Leskes	e0543b3426	[Internal] Migrate new initial state cluster update task to a ClusterStateNonMasterUpdateTask	2014-08-27 15:47:23 +02:00
Boaz Leskes	c12d0901f6	[Tests] Increase timeout when waiting for partitions to heal the current 30s addition is tricky because we use 30s as timeout in many places...	2014-08-27 15:47:22 +02:00
Boaz Leskes	7b6e194923	[Tests] Don't log about restoring a partition if the partition is not active.	2014-08-27 15:47:22 +02:00
Boaz Leskes	522d4afe0c	[Tests] Use local gateway This is important for proper primary allocation decisions	2014-08-27 15:47:22 +02:00
Boaz Leskes	3586e38c40	[Discovery] Start master fault detection after pingInterval This is to allow the master election to complete on the chosen master. Relates to #6706	2014-08-27 15:47:22 +02:00
Boaz Leskes	5302a53145	[Discovery] immediately start Master|Node fault detection pinging After a node joins the cluster, it starts pinging the master to verify its health. Before, the cluster join request was processed async and we had to give some time to complete. With #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. Similar change can be made to the Node fault detection from the master side. Closes #6706	2014-08-27 15:47:22 +02:00
Boaz Leskes	48c7da1fd4	[Test] testVerifyApiBlocksDuringPartition - wait for stable cluster after partition	2014-08-27 15:47:21 +02:00
Martijn van Groningen	d99ca806cb	[TEST] Properly clear the disruption schemes after test completed.	2014-08-27 15:47:21 +02:00
Boaz Leskes	e897dccb52	[Tests] improved automatic disruption healing after tests	2014-08-27 15:47:21 +02:00
Boaz Leskes	5e5f8a9daf	Added java docs to all tests in DiscoveryWithNetworkFailuresTests Moved testVerifyApiBlocksDuringPartition to test blocks rather than rely on specific API rejections. Did some cleaning while at it.	2014-08-27 15:47:21 +02:00
Martijn van Groningen	77dae631e1	[TEST] Make sure get request is always local	2014-08-27 15:47:20 +02:00
Martijn van Groningen	52f69c64f7	[TEST] Verify no master block during partition for read and write apis	2014-08-27 15:47:20 +02:00
Martijn van Groningen	98084c02ce	[TEST] Added test to verify if 'discovery.zen.rejoin_on_master_gone' is updatable at runtime.	2014-08-27 15:47:20 +02:00
Boaz Leskes	c3e84eb639	Fixed compilation issue caused by the lack of a thread pool name	2014-08-27 15:47:20 +02:00
Boaz Leskes	1af82fd96a	[Tests] Disabling testAckedIndexing The test is currently unstable and needs some more work	2014-08-27 15:47:20 +02:00
Boaz Leskes	a7a61a0392	[Test] ensureStableCluster failed to pass viaNode parameter correctly Also improved timeouts & logs	2014-08-27 15:47:19 +02:00
Martijn van Groningen	f7b962a417	[TEST] Renamed afterDistribution timeout to expectedTimeToHeal Accumulate expected shard failures to log later	2014-08-27 15:47:19 +02:00
Martijn van Groningen	785d0e55ab	[TEST] Reduced failures in DiscoveryWithNetworkFailuresTests#testAckedIndexing test: * waiting time should be long enough depending on the type of the disruption scheme * MockTransportService#addUnresponsiveRule if remaining delay is smaller than 0 don't double execute transport logic	2014-08-27 15:47:19 +02:00
Martijn van Groningen	8aed9ee46f	[TEST] Check if worker is null to prevent NPE on double stopping	2014-08-27 15:47:19 +02:00
Boaz Leskes	28489cee45	[Tests] Added ServiceDisruptionScheme(s) and testAckedIndexing This commit adds the notion of ServiceDisruptionScheme allowing for introducing disruptions in our test cluster. This abstraction is used in a couple of wrappers around the functionality offered by MockTransportService to simulate various network partitions. There is also one implementation for causing a node to be slow in processing cluster state updates. This new mechanism is integrated into existing tests DiscoveryWithNetworkFailuresTests. A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions. Closes #6505	2014-08-27 15:47:14 +02:00
Boaz Leskes	5d13571dbe	[Discovery] when master is gone, flush all pending cluster states If the master FD flags master as gone while there are still pending cluster states, the processing of those cluster states will re-instate that node as master again. Closes #6526	2014-08-27 15:47:13 +02:00
Boaz Leskes	8b85d97ea6	[Discovery] Improved logging when a join request is not executed because local node is no longer master	2014-08-27 15:47:09 +02:00
Boaz Leskes	7db9e98ee7	[Discovery] Change (Master|Nodes)FaultDetection's connect_on_network_disconnect default to false The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure. Also added tests to verify the settings' behaviour	2014-08-27 15:47:05 +02:00
Boaz Leskes	e39ac7eef4	[Test] testIsolateMasterAndVerifyClusterStateConsensus didn't wait on initializing shards before comparing cluster states	2014-08-27 15:46:51 +02:00
Martijn van Groningen	f3d90cdb17	[TEST] Remove 'index.routing.allocation.total_shards_per_node' setting in data consistency test	2014-08-27 15:46:51 +02:00
Boaz Leskes	58f8774fa2	[Discovery] do not use versions to optimize cluster state copying for a first update from a new master We have an optimization which compares routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens where the current master updates its state, doesn't manage to broadcast it to other nodes due to the disconnect and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of its state can be equal but the content is different. Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements) Closes #6466	2014-08-27 15:46:50 +02:00

9303 Commits