OpenSearch

Commit Graph

Author	SHA1	Message	Date
Like	6f64267626	Make setting index.translog.sync_interval be dynamic (#37382 ) Currently, we cannot update index setting index.translog.sync_interval if index is open, because it's not dynamic which can be updated for closed index only. Closes #32763	2019-03-20 17:12:45 +01:00
Henning Andersen	4c2a8638ca	Cascading primary failure lead to MSU too low (#40249 ) If a replica were first reset due to one primary failover and then promoted (before resync completes), its MSU would not include changes since global checkpoint, leading to errors during translog replay. Fixed by re-initializing MSU before restoring local history.	2019-03-20 14:00:43 +01:00
Jason Tedor	f88e4181ca	Enable reading auto-follow patterns from x-content (#40130 ) This named writable was never registered, so it means that we could not read auto-follow patterns that were registered in the cluster state. This causes them to be lost on restarts, a bad bug. This commit addresses this by registering this named writable, and we add a basic CCR restart test to ensure that CCR keeps functioning properly when the follower is restarted.	2019-03-18 21:48:44 -04:00
Nhat Nguyen	38e9522218	Remove wait for cluster state step in peer recovery (#40004 ) We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped using it since #25692 (6.0). This change removes that action and related code in 7.x and 8.0. Relates #19287 Relates #25692	2019-03-18 15:17:21 -04:00
Jason Tedor	5be12e0999	Safe publication of AutoFollowCoordinator (#40153 ) We were leaking a reference to an AutoFollowCoordinator during construction, violating safe publication according to the JLS specification. This commit addresses this by waiting to register AutoFollowCoordinator with the ClusterApplierService after the AutoFollowCoordinator is fully constructed. We also remove ourselves as a listener when stopping.	2019-03-18 10:13:41 -04:00
Jason Tedor	b8ad337234	Stop auto-followers on shutdown (#40124 ) When shutting down a node, auto-followers will keep trying to run. This is happening even as transport services and other components are being closed. In some cases, this can lead to a stack overflow as we rapidly try to check the license state of the remote cluster, can not because the transport service is shutdown, and then immeidately retry again. This can happen faster than the shutdown, and we die with stack overflow. This commit adds a stop command to auto-followers so that this retry loop occurs at most once on shutdown.	2019-03-18 07:25:31 -04:00
Jason Tedor	0824eceacf	Add log message for auto-follower timeout When an auto-follower coordinator times out waiting for the remote cluster state, we do not log any indication of this. While this is expected behavior in quiet deployments, it is still useful to see this information for tracing the behavior of the auto-follow coordinator. This commit adds a trace log message indicating that the timeout.	2019-03-16 10:46:20 -04:00
Jason Tedor	86d1d03c37	Remove cluster state size (#40109 ) This commit removes the cluster state size field from the cluster state response, and drops the backwards compatibility layer added in 6.7.0 to continue to support this field. As calculation of this field was expensive and had dubious value, we have elected to remove this field.	2019-03-15 17:16:25 -04:00
David Kyle	4eb3683d65	Mute CcrRetentionLeaseIT tests (#40090 )	2019-03-15 15:05:47 +00:00
Jake Landis	b0b0f66669	Remove types from internal monitoring templates and bump to api 7 (#39888 ) (#39926 ) This commit removes the "doc" type from monitoring internal indexes. The template still carries the "_doc" type since that is needed for the internal representation. This change impacts the following templates: monitoring-alerts.json monitoring-beats.json monitoring-es.json monitoring-kibana.json monitoring-logstash.json As part of the required changes, the system_api_version has been bumped from "6" to "7" and support for version "2" has been dropped. A new empty pipeline is now introduced for the version "7", and the formerly empty "6" pipeline will now remove the type and re-direct the request to the "7" index. Additionally, to due to a difference in the internal representation (which requires the inclusion of "_doc" type) and external representation (which requires the exclusion of any type) a helper method is introduced to help convert internal to external representation, and used by the monitoring HTTP template exporter. Relates #38637	2019-03-11 13:17:27 -05:00
Martijn van Groningen	8925a2c6c2	Further tweak AutoFollowIT#testAutoFollowManyIndices: * reduce the number of leader indices to be auto followed * also check the number of follower indices being created * also check the whether leader indices are marked as auto followed Relates to #36761	2019-03-11 10:01:56 +01:00
Daniel Mitterdorfer	1bc31aca03	Mute CcrRetentionLeaseIT#testRetentionLeaseRenewalIsCancelledWhenFollowingIsPaused (#39897 ) Relates #39509	2019-03-11 08:47:51 +01:00
Jason Tedor	6675bafc49	Simplify CcrRetentionLeaseIT#testForgetFollower This test was more complicated than necessary, where we were capturing requests to prevent removal of retention leases, so that our forget follower request could remove the retention leases instead. Instead, a pause is enough to ensure that the retention leases are not re-added after we remove them by the forget follower request. This commit simplifies this test, and should remove some spurious failures. Relates #39850	2019-03-08 12:33:17 -05:00
Martijn van Groningen	8666aa1ed2	unmuted and tweaked test Relates to #36761	2019-03-08 12:43:23 +01:00
Jason Tedor	0250d554b6	Introduce forget follower API (#39718 ) This commit introduces the forget follower API. This API is needed in cases that unfollowing a following index fails to remove the shard history retention leases on the leader index. This can happen explicitly through user action, or implicitly through an index managed by ILM. When this occurs, history will be retained longer than necessary. While the retention lease will eventually expire, it can be expensive to allow history to persist for that long, and also prevent ILM from performing actions like shrink on the leader index. As such, we introduce an API to allow for manual removal of the shard history retention leases in this case.	2019-03-07 11:08:45 -05:00
Nhat Nguyen	83688ce2d4	Unmute testFollowIndexAndCloseNode Resolved in #39584	2019-03-06 22:39:13 -05:00
David Turner	77dd711847	Tidy up GroupedActionListener (#39633 ) Today the `GroupedActionListener` accepts a `defaults` parameter but all callers pass an empty list. Also it is permitted to pass an empty group but this is trappy because the delegated listener is never be called in that case. This commit removes the `defaults` parameter and forbids an empty group.	2019-03-06 09:25:10 +00:00
Jason Tedor	75a0d4f470	Rename retention lease setting (#39719 ) This commit renames the retention lease setting index.soft_deletes.retention.lease so that it is under the namespace index.soft_deletes.retention_lease. As such, we rename the setting to index.soft_deletes.retention_lease.period.	2019-03-05 22:04:45 -05:00
Nhat Nguyen	af4918ebff	Simplify AutoFollowCoordinator with GroupedListener (#39603 ) This change simplifies AutoFollowCoordinator by replacing a combination of AtomicArray and CountDown with GroupedActionListener.	2019-03-04 13:50:27 -05:00
Yannick Welsch	0f65390c29	Do not mutate engine during planning step (#39571 ) This cleans up the Engine implementation by separating the sequence number generation from the planning step in the engine, to avoid for the planning step to have any side effects. This makes it easier to see that every sequence number is properly accounted for.	2019-03-04 10:11:39 +01:00
Tanguy Leroux	e005eeb0b3	Backport support for replicating closed indices to 7.x (#39506 )(#39499 ) Backport support for replicating closed indices (#39499) Before this change, closed indexes were simply not replicated. It was therefore possible to close an index and then decommission a data node without knowing that this data node contained shards of the closed index, potentially leading to data loss. Shards of closed indices were not completely taken into account when balancing the shards within the cluster, or automatically replicated through shard copies, and they were not easily movable from node A to node B using APIs like Cluster Reroute without being fully reopened and closed again. This commit changes the logic executed when closing an index, so that its shards are not just removed and forgotten but are instead reinitialized and reallocated on data nodes using an engine implementation which does not allow searching or indexing, which has a low memory overhead (compared with searchable/indexable opened shards) and which allows shards to be recovered from peer or promoted as primaries when needed. This new closing logic is built on top of the new Close Index API introduced in 6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before closing them, and closing an index on a 8.0 cluster will reinitialize the index shards and therefore impact the cluster health. Some APIs have been adapted to make them work with closed indices: - Cluster Health API - Cluster Reroute API - Cluster Allocation Explain API - Recovery API - Cat Indices - Cat Shards - Cat Health - Cat Recovery This commit contains all the following changes (most recent first): * c6c42a1 Adapt NoOpEngineTests after #39006 * 3f9993d Wait for shards to be active after closing indices (#38854) * 5e7a428 Adapt the Cluster Health API to closed indices (#39364) * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767) * 71f5c34 Recover closed indices after a full cluster restart (#39249) * 4db7fd9 Adapt the Recovery API for closed indices (#38421) * 4fd1bb2 Adapt more tests suites to closed indices (#39186) * 0519016 Add replica to primary promotion test for closed indices (#39110) * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631) * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955) * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex() * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329) * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327) * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326) * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024) * e53a9be Fix compilation error in IndexShardIT after merge with master * cae4155 Relax NoOpEngine constraints (#37413) * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903) Relates to #33888	2019-03-01 14:48:26 +01:00
Martijn van Groningen	24e478c58e	Fix test, more than one node may be connected. Relates to #37681	2019-02-26 10:40:09 +01:00
Martijn van Groningen	b159cc51c0	Ensure remote connection established and clean remote connection prior to leader cluster restart Relates to #37681	2019-02-26 09:06:30 +01:00
Nhat Nguyen	e9dda75834	Enable soft-deletes by default for 7.0+ indices (#38929 ) Today when users upgrade to 7.0, existing indices will automatically switch to soft-deletes without an opt-out option. With this change, we only enable soft-deletes by default for new indices. Relates #36141	2019-02-25 17:54:29 -05:00
Jason Tedor	a6c0166d68	Renew retention leases while following (#39335 ) This commit is the final piece of the integration of CCR with retention leases. Namely, we periodically renew retention leases and advance the retaining sequence number while following.	2019-02-25 17:14:19 -05:00
Nhat Nguyen	0f29b89655	Unmute FollowerFailOverIT#testFailOverOnFollower Relates #38633	2019-02-25 14:44:44 -05:00
Nhat Nguyen	48219112e3	Do not wait for advancement of checkpoint in recovery (#39006 ) With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely do these thanks to Yannick's finding. The replication group to be used is currently sampled after indexing into the primary (see `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases: - There are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable). - There are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to `endingSeqNo` are available in the snapshot (i.e. also see comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. Relates #39000 Closes #38949 Co-authored-by: Yannick Welsch <yannick@welsch.lu>	2019-02-25 12:10:14 -05:00
Martijn van Groningen	6f69ef165b	Protect against the leader index being removed (#39351 ) when dealing with TimeoutException The `IndexFollowingIT#testDeleteLeaderIndex()`` test failed, because a NPE was captured as fatal error instead of an IndexNotFoundException. Closes #39308	2019-02-25 13:40:10 +01:00
Martijn van Groningen	9bf0538878	Wait for index following is active for auto followed index (#39175 ) before executing pause follow api: https://github.com/elastic/elasticsearch/issues/39126#issuecomment-465512002 Closes #39126	2019-02-25 10:44:20 +01:00
Jason Tedor	6e06f82106	Fix failing CCR retention lease test Finally! This commit should fix the issues with the CCR retention lease that has been plaguing build failures. The issue here is that we are trying to prevent the clear session requests from being executed until after we have been able to validate that retention leases are being renewed. However, we were only blocking the clear session requests but not blocking them when they are proxied through another node. This commit addresses that. Relates #39268	2019-02-22 20:43:39 -05:00
Jason Tedor	2d4c98a991	Change sort order of shard stats in CCR test This commit changes the sort order of shard stats that are collected in CCR retention lease integration tests. This change is done so that primaries appear first in sort order.	2019-02-22 18:17:28 -05:00
Jason Tedor	e569cf8324	Address failing CCR retention lease test This test fails rarely but it is flaky in its current form. The problem here is that we lack a guarantee on the retention leases having been synced to all shard copies. We need to sleep long enough to ensure that that occurs, and then we can sample the retention leases, possibly sleep again (we usually will not have too since the first sleep will have been long enough to allow a sync and a renewal to happen, if one was going to happen), and the sample the retention leases for comparison. Closes #39331	2019-02-22 18:15:10 -05:00
Jason Tedor	e4e96b8181	Fix shard logged in background lease renewal The shard logged here is the leader shard but it should be the follower shard since this background retention lease renewal is happening on the follower side. This commit fixes that.	2019-02-22 17:32:51 -05:00
Jason Tedor	feb25c71a0	Simplify mocking in CCR retention lease tests This commit simplifies the use of transport mocking in the CCR retention lease integration tests. Instead of adding a send rule between nodes, we add a default send rule. This greatly simplifies the code here, and speeds the test up a little bit too.	2019-02-22 17:24:12 -05:00
Tim Brooks	931953a3ee	Ensure index commit released when testing timeouts (#39273 ) This fixes #39245. Currently it is possible in this test that the clear session call times-out. This means that the index commit will not be released and there will be an assertion triggered in the test teardown. This commit ensures that we wipe the leader index in the test to avoid this assertion. It is okay if the clear session call times-out in normal usage. This scenario is unavoidable due to potential network issues. We have a local timeout on the leader to clean it up when this scenario happens.	2019-02-22 11:14:42 -07:00
Tim Brooks	44df76251f	Rebuild remote connections on profile changes (#39146 ) Currently remote compression and ping schedule settings are dynamic. However, we do not listen for changes. This commit adds listeners for changes to those two settings. Additionally, when those settings change we now close existing connections and open new ones with the settings applied. Fixes #37201.	2019-02-21 14:00:39 -07:00
Benjamin Trent	34d06471c3	[CI] Mute CcrRetentionLeaseIT.testRetentionLeaseIsRenewedDuringRecovery (#39270 )	2019-02-21 14:17:03 -06:00
Benjamin Trent	8072543428	Muting AutoFollowIT.testAutoFollowManyIndices (#39265 )	2019-02-21 13:43:09 -06:00
Jason Tedor	b9f8be6968	Clarify the use of sleep in CCR test Sleeps in tests smell funny, and we try to avoid them to the extent possible. We are using a small one in a CCR test. This commit clarifies the purpose of that sleep by adding a comment explaining it. We also removed a hard-coded value from the test, that if we ever modified the value higher up where it was set, we could end up forgetting to change the value here. Now we ensure that these would move in lock step if we ever maintain them later.	2019-02-21 14:05:48 -05:00
Jason Tedor	719c38a36d	Fix CCR tests that manipulate transport requests We have some CCR tests where we use mock transport send rules to control the behavior that we desire in these tests. Namely, we want to simulate an exception being thrown on the leader side, or a variety of other situations. These send rules were put in place between the data nodes on each side. However, it might not be the case that these requests are being sent between data nodes. For example, a request that is handled on a non-data master node would not be sent from a data node. And it might not be the case that the request is sent to a data node, as it could be proxied through a non-data coordinating node. This commit addresses this by putting these send rules in places between all nodes on each side. Closes #39011 Closes #39201	2019-02-21 12:26:09 -05:00
Martijn van Groningen	f40139c403	Change ShardFollowTask to reuse common serialization logic (#39094 ) Initially in #38910, ShardFollowTask was reusing ImmutableFollowParameters' serialization logic. After merging, bwc tests failed sometimes and the binary serialization that ShardFollowTask was originally was using was added back. ImmutableFollowParameters is using optional fields (optional vint) while ShardFollowTask was not (vint).	2019-02-21 09:32:33 +01:00
Nhat Nguyen	a96df5d209	Reduce refresh when lookup term in FollowingEngine (#39184 ) Today we always refresh when looking up the primary term in FollowingEngine. This is not necessary for we can simply return none for operations before the global checkpoint.	2019-02-20 19:21:00 -05:00
Nhat Nguyen	cdec11c4eb	Relax history check in ShardFollowTaskReplicationTests (#39162 ) The follower won't always have the same history as the leader for its soft-deletes retention can be different. However, if some operation exists on the history of the follower, then the same operation must exist on the leader. This change relaxes the history check in ShardFollowTaskReplicationTests. Closes #39093	2019-02-20 19:21:00 -05:00
Mark Vieira	24ac9da276	Mute CCR retention test that is consistently failing locally and in CI	2019-02-20 11:57:46 -08:00
Jason Tedor	90b1b36f50	Add cleanup logic to CCR retention lease test This commit adds some logic to remove the mock transport rules at the end of a CCR retention lease test.	2019-02-20 13:20:07 -05:00
Jason Tedor	cfd7c77b64	Fix broken CCR retention lease unfollow test This commit fixes a broken CCR retention lease unfollow test. The problem with the test is that the random subset of shards that we picked to disrupt would not necessarily overlap with the actual shards in use. We could take a non-empty subset of [0, 3] (e.g., { 2 }) when the only shard IDs in use were [0, 1]. This commit fixes this by taking into account the number of shards in use in the test. With this change, we also take measure to ensure that a successful branch is tested more frequently than would otherwise be the case. On that branch, we want to sometimes pretend that the retention lease is already removed. The randomness here was also sometimes selecting a subset of shards that did not overlap with the shards actually in use during the test. While this does not break the test, it is confusing and reduces the amount of coverage of that branch. Relates #39185	2019-02-20 12:09:28 -05:00
Jason Tedor	48984f647d	Mute failing CCR retention lease unfollow test This commit mutes a CCR retention lease unfollow test that is failing randomly, but frequently.	2019-02-20 09:47:17 -05:00
Jason Tedor	09ea3ccd16	Remove retention leases when unfollowing (#39088 ) This commit attempts to remove the retention leases on the leader shards when unfollowing an index. This is best effort, since the leader might not be available.	2019-02-20 07:06:49 -05:00
Tal Levy	b5dbd1a027	AwaitsFix XPackUsageIT#testXPackCcrUsage. relates to #39126.	2019-02-19 13:28:46 -08:00
Martijn van Groningen	c8d59f6f0f	Fix shard follow task startup error handling (#39053 ) Prior to this commit, if during fetch leader / follower GCP a fatal error occurred, then the shard follow task was removed. This is unexpected, because if such an error occurs during the lifetime of shard follow task then replication is stopped and the fatal error flag is set. This allows the ccr stats api to report the fatal exception that has occurred (instead of the user grepping through the elasticsearch logs). This issue was found by a rare failure of the `FollowStatsIT#testFollowStatsApiIncludeShardFollowStatsWithRemovedFollowerIndex` test. Closes #38779	2019-02-19 08:54:02 +01:00
Martijn van Groningen	ce412908ed	also check ccr stats api return empty response in ensureNoCcrTasks() If this fails then it returns more detailed information, for example fatal error.	2019-02-18 16:15:22 +01:00
Nhat Nguyen	2947ccf5c3	Add remote recovery to ShardFollowTaskReplicationTests (#39007 ) We simulate remote recovery in ShardFollowTaskReplicationTests by bootstrapping the follower with the safe commit of the leader. Relates #35975	2019-02-18 09:57:56 -05:00
Martijn van Groningen	4fd1f8048d	Mute test #38949	2019-02-18 15:24:07 +01:00
Martijn van Groningen	9aa542fb1b	Mute test Relates to #38779	2019-02-18 12:02:52 +01:00
Martijn van Groningen	ed08bc3537	Fix LocalIndexFollowingIT#testRemoveRemoteConnection() test (#38709 ) * During fetching remote mapping if remote client is missing then `NoSuchRemoteClusterException` was not handled. * When adding remote connection, check that it is really connected before continue-ing to run the tests. Relates to #38695	2019-02-18 09:41:44 +01:00
Nhat Nguyen	204480d818	Mute testRetentionLeaseIsRenewedDuringRecovery Tracked at #39011	2019-02-17 15:34:51 -05:00
Jason Tedor	a5ce1e0bec	Integrate retention leases to recovery from remote (#38829 ) This commit is the first step in integrating shard history retention leases with CCR. In this commit we integrate shard history retention leases with recovery from remote. Before we start transferring files, we take out a retention lease on the primary. Then during the file copy phase, we repeatedly renew the retention lease. Finally, when recovery from remote is complete, we disable the background renewing of the retention lease.	2019-02-16 15:37:52 -05:00
Tim Brooks	b1c1daa63f	Add get file chunk timeouts with listener timeouts (#38758 ) This commit adds a `ListenerTimeouts` class that will wrap a `ActionListener` in a listener with a timeout scheduled on the generic thread pool. If the timeout expires before the listener is completed, `onFailure` will be called with an `ElasticsearchTimeoutException`. Timeouts for the get ccr file chunk action are implemented using this functionality. Additionally, this commit attempts to fix #38027 by also blocking proxied get ccr file chunk actions. This test being un-muted is useful to verify the timeout functionality.	2019-02-16 10:56:03 -07:00
Jason Tedor	d80325f288	Mark fail over on follower test as awaits fix This test is failing since the introduction of recovery from remote. This commit marks this test as awaits fix.	2019-02-16 12:28:16 -05:00
Nhat Nguyen	7e20a92888	Advance max_seq_no before add operation to Lucene (#38879 ) Today when processing an operation on a replica engine (or the following engine), we first add it to Lucene, then add it to translog, then finally marks its seq_no as completed. If a flush occurs after step1, but before step-3, the max_seq_no in the commit's user_data will be smaller than the seq_no of some documents in the Lucene commit.	2019-02-15 21:04:28 -05:00
Nhat Nguyen	20755e666c	Reduce global checkpoint sync interval in disruption tests (#38931 ) We verify seq_no_stats is aligned between copies at the end of some disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due to a lagged global checkpoint on replicas. The global checkpoint on replicas is lagged because we sync the global checkpoint 30 seconds (by default) after the last replication operation. This change reduces the global checkpoint sync-internal to 1s in the disruption tests. Closes #38318 Closes #36789	2019-02-15 21:04:20 -05:00
Jason Tedor	58551198d5	Address some CCR REST test case flakiness (#38975 ) The CCR REST tests that rely on these assertions are flaky. They are flaky since the introduction of recovery from the remote. The underlying problem is this: these tests are making assertions about the number of operations read by the shard following task. However, with recovery from remote, we no longer have guarantees that the assumptions these tests were relying on hold. Namely, these tests were assuming that the only way that a document could land in the follower index is via the shard following task. With recovery from remote, there is another way, which is via the files that are copied over during the recovery phase. Most of the time this will not be a problem because with the small number of documents that we are indexing in these tests, it is usally not the case that a flush would occur and so there would not be any documents in the files copied over. However, a flush can occur any time at which point all of the indexed documents could end up in a safe commit and copied over during recovery from remote. This commit modifies these assertions to ones that are not prone to this issue, yet still validate the health of the follower shard.	2019-02-15 16:01:02 -05:00
Martijn van Groningen	03b67b3ee1	Introduced class reuses follow parameter code between ShardFollowTasks (#38910 ) and AutoFollowPattern classes. The ImmutableFollowParameters is like the already existing FollowParameters, but all of its fields are final.	2019-02-15 18:26:15 +01:00
iverase	b19b778cbb	[CI] Muting method testFollowIndex in IndexFollowingIT Relates to #38949	2019-02-15 16:07:45 +01:00
Yannick Welsch	d55e52223f	Smarter CCR concurrent file chunk fetching (#38841 ) The previous logic for concurrent file chunk fetching did not allow for multiple chunks from the same file to be fetched in parallel. The parallelism only allowed to fetch chunks from different files in parallel. This required complex logic on the follower to be aware from which file it was already fetching information, in order to ensure that chunks for the same file would be fetched in sequential order. During benchmarking, this exhibited throughput issues when recovery came towards the end, where it would only be sequentially fetching chunks for the same largest segment file, with throughput considerably going down in a high-latency network as there was no parallelism anymore. The new logic here follows the peer recovery model more closely, and sends multiple requests for the same file in parallel, and then reorders the results as necessary. Benchmarks show that this leads to better overall throughput and the implementation is also simpler.	2019-02-15 07:51:58 +01:00
Martijn van Groningen	96e7d71948	Handle the fact that `ShardStats` instance may have no commit or seqno stats (#38782 ) The should fix the following NPE: ``` [2019-02-11T23:27:48,452][WARN ][o.e.p.PersistentTasksNodeService] [node_s_0] task kD8YzUhHTK6uKNBNQI-1ZQ-0 failed with an exception 1> java.lang.NullPointerException: null 1> at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor.lambda$fetchFollowerShardInfo$7(ShardFollowTasksExecutor.java:305) ~[main/:?] 1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1189) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1169) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:417) [elasticsearch-8.0.0-SNAP SHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:391) [elasticsearch-8.0.0-SNAP SHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT] 1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202] 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202] 1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202] ``` Relates to #38779	2019-02-14 13:05:21 +01:00
Martijn van Groningen	88489a3f3a	Backport rolling upgrade multi cluster module (#38859 ) * Add rolling upgrade multi cluster test module (#38277) This test starts 2 clusters, each with 3 nodes. First the leader cluster is started and tests are run against it and then the follower cluster is started and tests execute against this two cluster. Then the follower cluster is upgraded, one node at a time. After that the leader cluster is upgraded, one node at a time. Every time a node is upgraded tests are ran while both clusters are online. (and either leader cluster has mixed node versions or the follower cluster) This commit only tests CCR index following, but could be used for CCS tests as well. In particular for CCR, unidirectional index following is tested during a rolling upgrade. During the test several indices are created and followed in the leader cluster before or while the follower cluster is being upgraded. This tests also verifies that attempting to follow an index in the upgraded cluster from the not upgraded cluster fails. After both clusters are upgraded following the index that previously failed should succeed. Relates to #37231 and #38037 * Filter out upgraded version index settings when starting index following (#38838) The `index.version.upgraded` and `index.version.upgraded_string` are likely to be different between leader and follower index. In the event that a follower index gets restored on a upgraded node while the leader index is still on non-upgraded nodes. Closes #38835	2019-02-14 08:12:14 +01:00
Tim Brooks	ec08581319	Improve CcrRepositoryIT mappings tests (#38817 ) Currently we index documents concurrently to attempt to ensure that we update mappings during the restore process. However, this does not actually test that the mapping will be correct and is dangerous as it can lead to a misalignment between the max sequence number and the local checkpoint. If these are not aligned, peer recovery cannot be completed without initiating following which this test does not do. That causes teardown assertions to fail. This commit removes the concurrent indexing and flushes after the documents are indexed. Additionally it modifies the mapping specific test to ensure that there is a mapping update when the restore session is initiated. This mapping update is picked up at the end of the restore by the follower.	2019-02-13 13:47:10 -07:00
Nhat Nguyen	a3f39741be	Adjust log and unmute testFailOverOnFollower (#38762 ) There were two documents (seq=2 and seq=103) missing on the follower in one of the failures of `testFailOverOnFollower`. I spent several hours on that failure but could not figure out the reason. I adjust log and unmute this test so we can collect more information. Relates #38633	2019-02-12 11:42:25 -05:00
Martijn van Groningen	40d5beaf41	muted test Relates to #38779	2019-02-12 16:54:54 +01:00
Martijn van Groningen	6290d59ffa	Use clear cluster names in order to make debugging easier. Relates to #37681	2019-02-12 10:19:39 +01:00
Yannick Welsch	bafc709326	Fix CCR concurrent file chunk fetching bug (#38736 ) Fixes a bug with concurrent file chunk fetching during recovery from remote where the wrong offset was used.	2019-02-11 19:15:57 +01:00
Tanguy Leroux	dc212de822	Specialize pre-closing checks for engine implementations (#38702 ) (#38722 ) The Close Index API has been refactored in 6.7.0 and it now performs pre-closing sanity checks on shards before an index is closed: the maximum sequence number must be equals to the global checkpoint. While this is a strong requirement for regular shards, we identified the need to relax this check in the case of CCR following shards. The following shards are not in charge of managing the max sequence number or global checkpoint, which are pulled from a leader shard. They also fetch and process batches of operations from the leader in an unordered way, potentially leaving gaps in the history of ops. If the following shard lags a lot it's possible that the global checkpoint and max seq number never get in sync, preventing the following shard to be closed and a new PUT Follow action to be issued on this shard (which is our recommended way to resume/restart a CCR following). This commit allows each Engine implementation to define the specific verification it must perform before closing the index. In order to allow following/frozen/closed shards to be closed whatever the max seq number or global checkpoint are, the FollowingEngine and ReadOnlyEngine do not perform any check before the index is closed. Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>	2019-02-11 17:34:17 +01:00
Martijn van Groningen	92201ef563	Catch AlreadyClosedException and use other IndexShard instance (#38630 ) Closes #38617	2019-02-11 15:36:48 +01:00
Martijn van Groningen	a29bf2585e	Added unit test for FollowParameters class (#38500 ) (#38690 ) A unit test that tests FollowParameters directly was missing.	2019-02-11 10:53:04 +01:00
Martijn van Groningen	4625807505	Reuse FollowParameters' parse fields. (#38508 )	2019-02-11 08:46:36 +01:00
Martijn van Groningen	e213ad3e88	Mute test. Relates to #38695	2019-02-11 08:32:42 +01:00
Tim Brooks	023e3c207a	Concurrent file chunk fetching for CCR restore (#38656 ) Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR. The implementation uses the parallel file writer functionality that is also used by peer recoveries.	2019-02-09 21:19:57 -07:00
Nhat Nguyen	c202900915	Retry on wait_for_metada_version timeout (#38521 ) Closes #37807 Backport of #38521	2019-02-09 19:51:58 -05:00
Christoph Büscher	d03b386f6a	Mute FollowerFailOverIT testFailOverOnFollower (#38634 ) Relates to #38633	2019-02-08 17:20:30 +01:00
David Turner	5a3c452480	Align docs etc with new discovery setting names (#38492 ) In #38333 and #38350 we moved away from the `discovery.zen` settings namespace since these settings have an effect even though Zen Discovery itself is being phased out. This change aligns the documentation and the names of related classes and methods with the newly-introduced naming conventions.	2019-02-06 11:34:38 +00:00
Tim Brooks	fb0ec26fd4	Set update mappings mater node timeout to 30 min (#38439 ) This is related to #35975. We do not want a slow master to fail a recovery from remote process due to a slow put mappings call. This commit increases the master node timeout on this call to 30 mins.	2019-02-05 16:22:11 -06:00
Przemyslaw Gomulka	afcdbd2bc0	XPack: core/ccr/Security-cli migration to java-time (#38415 ) part of the migrating joda time work. refactoring x-pack plugins usages of joda to java-time refers #27330	2019-02-05 22:09:32 +01:00
Tim Brooks	4a15e2b29e	Make Ccr recovery file chunk size configurable (#38370 ) This commit adds a byte setting `ccr.indices.recovery.chunk_size`. This setting configs the size of file chunk requested while recovering from remote.	2019-02-05 13:34:00 -06:00
Tim Brooks	c2a8fe1f91	Prevent CCR recovery from missing documents (#38237 ) Currently the snapshot/restore process manually sets the global checkpoint to the max sequence number from the restored segements. This does not work for Ccr as this will lead to documents that would be recovered in the normal followering operation from being recovered. This commit fixes this issue by setting the initial global checkpoint to the existing local checkpoint.	2019-02-05 13:32:41 -06:00
David Turner	f2dd5dd6eb	Remove DiscoveryPlugin#getDiscoveryTypes (#38414 ) With this change we no longer support pluggable discovery implementations. No known implementations of `DiscoveryPlugin` actually override this method, so in practice this should have no effect on the wider world. However, we were using this rather extensively in tests to provide the `test-zen` discovery type. We no longer need a separate discovery type for tests as we no longer need to customise its behaviour. Relates #38410	2019-02-05 17:42:24 +00:00
Armin Braun	887fa2c97a	Mute testReadRequestsReturnLatestMappingVersion (#38438 ) * Relates #37807	2019-02-05 17:10:12 +01:00
Martijn van Groningen	0beb3c93d1	Clean up duplicate follow config parameter code (#37688 ) Introduced FollowParameters class that put follow, resume follow, put auto follow pattern requests and follow info response classes reuse. The FollowParameters class had the fields, getters etc. for the common parameters that all these APIs have. Also binary and xcontent serialization / parsing is handled by this class. The follow, resume follow, put auto follow pattern request classes originally used optional non primitive fields, so FollowParameters has that too and the follow info api can handle that now too. Also the followerIndex field can in production only be specified via the url path. If it is also specified via the request body then it must have the same value as is specified in the url path. This option only existed to xcontent testing. However the AbstractSerializingTestCase base class now also supports createXContextTestInstance() to provide a different test instance when testing xcontent, so allowing followerIndex to be specified via the request body is no longer needed. By moving the followerIndex field from Body to ResumeFollowAction.Request class and not allowing the followerIndex field to be specified via the request body the Body class is redundant and can be removed. The ResumeFollowAction.Request class can then directly use the FollowParameters class. For consistency I also removed the ability to specified followerIndex in the put follow api and the name in put auto follow pattern api via the request body.	2019-02-05 17:05:19 +01:00
David Turner	2d114a02ff	Rename static Zen1 settings (#38333 ) Renames the following settings to remove the mention of `zen` in their names: - `discovery.zen.hosts_provider` -> `discovery.seed_providers` - `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers` - `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout` - `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`	2019-02-05 08:46:52 +00:00
Yogesh Gaikwad	fe36861ada	Add support for API keys to access Elasticsearch (#38291 ) X-Pack security supports built-in authentication service `token-service` that allows access tokens to be used to access Elasticsearch without using Basic authentication. The tokens are generated by `token-service` based on OAuth2 spec. The access token is a short-lived token (defaults to 20m) and refresh token with a lifetime of 24 hours, making them unsuitable for long-lived or recurring tasks where the system might go offline thereby failing refresh of tokens. This commit introduces a built-in authentication service `api-key-service` that adds support for long-lived tokens aka API keys to access Elasticsearch. The `api-key-service` is consulted after `token-service` in the authentication chain. By default, if TLS is enabled then `api-key-service` is also enabled. The service can be disabled using the configuration setting. The API keys:- - by default do not have an expiration but expiration can be configured where the API keys need to be expired after a certain amount of time. - when generated will keep authentication information of the user that generated them. - can be defined with a role describing the privileges for accessing Elasticsearch and will be limited by the role of the user that generated them - can be invalidated via invalidation API - information can be retrieved via a get API - that have been expired or invalidated will be retained for 1 week before being deleted. The expired API keys remover task handles this. Following are the API key management APIs:- 1. Create API Key - `PUT/POST /_security/api_key` 2. Get API key(s) - `GET /_security/api_key` 3. Invalidate API Key(s) `DELETE /_security/api_key` The API keys can be used to access Elasticsearch using `Authorization` header, where the auth scheme is `ApiKey` and the credentials, is the base64 encoding of API key Id and API key separated by a colon. Example:- ``` curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health ``` Closes #34383	2019-02-05 14:21:57 +11:00
Nhat Nguyen	cecfa5bd6d	Tighten mapping syncing in ccr remote restore (#38071 ) There are two issues regarding the way that we sync mapping from leader to follower when a ccr restore is completed: 1. The returned mapping from a cluster service might not be up to date as the mapping of the restored index commit. 2. We should not compare the mapping version of the follower and the leader. They are not related to one another. Moreover, I think we should only ensure that once the restore is done, the mapping on the follower should be at least the mapping of the copied index commit. We don't have to sync the mapping which is updated after we have opened a session. Relates #36879 Closes #37887	2019-02-04 17:53:41 -05:00
Tim Brooks	5a33816c86	Add test for `PutFollowAction` on a closed index (#38236 ) This is related to #35975. Currently when an index falls behind a leader it encounters a fatal exception. This commit adds a test for that scenario. Additionally, it tests that the user can stop following, close the follower index, and put follow again. After the indexing is re-bootstrapped, it will recover the documents it lost in normal following operations.	2019-02-04 16:37:42 -06:00
Nhat Nguyen	fb1e350c81	Mute testFollowIndexAndCloseNode (#38360 ) Tracked at #33337	2019-02-04 15:04:46 -05:00
Jason Tedor	f181e17038	Introduce retention leases versioning (#37951 ) Because concurrent sync requests from a primary to its replicas could be in flight, it can be the case that an older retention leases collection arrives and is processed on the replica after a newer retention leases collection has arrived and been processed. Without a defense, in this case the replica would overwrite the newer retention leases with the older retention leases. This commit addresses this issue by introducing a versioning scheme to retention leases. This versioning scheme is used to resolve out-of-order processing on the replica. We persist this version into Lucene and restore it on recovery. The encoding of retention leases is starting to get a little ugly. We can consider addressing this in a follow-up.	2019-02-01 17:19:19 -05:00
Nhat Nguyen	3ecdfe1060	Enable trace log in FollowerFailOverIT (#38148 ) This suite still fails one per week sometimes with a worrying assertion. Sadly we are still unable to find the actual source. Expected: <SeqNoStats{maxSeqNo=229, localCheckpoint=86, globalCheckpoint=86}> but: was <SeqNoStats{maxSeqNo=229, localCheckpoint=-1, globalCheckpoint=86}> This change enables trace log in the suite so we will have a better picture if this fails again. Relates #3333	2019-02-01 15:44:39 -05:00
Julie Tibshirani	c2e9d13ebd	Default include_type_name to false in the yml test harness. (#38058 ) This PR removes the temporary change we made to the yml test harness in #37285 to automatically set `include_type_name` to `true` in index creation requests if it's not already specified. This is possible now that the vast majority of index creation requests were updated to be typeless in #37611. A few additional tests also needed updating here. Additionally, this PR updates the test harness to set `include_type_name` to `false` in index creation requests when communicating with 6.x nodes. This mirrors the logic added in #37611 to allow for typeless document write requests in test set-up code. With this update in place, we can remove many references to `include_type_name: false` from the yml tests.	2019-02-01 11:44:13 -08:00
Nhat Nguyen	f64b20383e	Replace awaitBusy with assertBusy in atLeastDocsIndexed (#38190 ) Unlike assertBusy, awaitBusy does not retry if the code-block throws an AssertionError. A refresh in atLeastDocsIndexed can fail because we call this method while we are closing some node in FollowerFailOverIT.	2019-02-01 13:31:17 -05:00
Tim Brooks	291c4e7a0c	Fix file reading in ccr restore service (#38117 ) Currently we use the raw byte array length when calling the IndexInput read call to determine how many bytes we want to read. However, due to how BigArrays works, the array length might be longer than the reference length. This commit fixes the issue and uses the BytesRef length when calling read. Additionally, it expands the index follow test to index many more documents. These documents should potentially lead to large enough segment files to trigger scenarios where this fix matters.	2019-01-31 18:02:24 -07:00
Henning Andersen	68ed72b923	Handle scheduler exceptions (#38014 ) Scheduler.schedule(...) would previously assume that caller handles exception by calling get() on the returned ScheduledFuture. schedule() now returns a ScheduledCancellable that no longer gives access to the exception. Instead, any exception thrown out of a scheduled Runnable is logged as a warning. This is a continuation of #28667, #36137 and also fixes #37708.	2019-01-31 17:51:45 +01:00
Alpar Torok	b7de8e1d1e	Mute failing test Tracking #38100	2019-01-31 17:01:16 +02:00

1 2 3 4 5 ...

508 Commits