OpenSearch

Commit Graph

Author	SHA1	Message	Date
Jason Tedor	10bbb082a4	Only run retention lease actions on active primary (#40386 ) In some cases, a request to perform a retention lease action can arrive on a primary shard before it is active. In this case, the primary shard would not yet be in primary mode, tripping an assertion in the replication tracker. Instead, we should not attempt to perform such actions on an initializing shard. This commit addresses this by not returning the primary shard in the single shard iterator if the primary shard is not yet active.	2019-03-23 09:39:39 -04:00
Nhat Nguyen	0e12065b54	Relax max_seq_no_of_updates assertion in follow tests If there's a failover on the follower, then its max_seq_no_of_updates is bootstrapped from its max_seq_no which might be higher than the max_seq_no_of_updates of the leader. We need to relax this check. Relates #40249	2019-03-21 19:41:55 -04:00
Jason Tedor	1e6941b138	Reduce retention lease sync intervals (#40302 ) This commit adjusts the frequency with which CCR renews retention leases and with which primaries sync retention leases to replicas. This helps Lucene reclaim soft-deleted documents more aggressively, which we have found in some use-cases can help improve performance, and either way will help keep disk space under more control.	2019-03-21 07:37:44 -04:00
Like	6f64267626	Make setting index.translog.sync_interval be dynamic (#37382 ) Currently, the index setting index.translog.sync_interval cannot be updated while the index is open, because it is not dynamic and can only be updated on a closed index. Closes #32763	2019-03-20 17:12:45 +01:00
Henning Andersen	4c2a8638ca	Cascading primary failure lead to MSU too low (#40249 ) If a replica were first reset due to one primary failover and then promoted (before resync completes), its MSU would not include changes since global checkpoint, leading to errors during translog replay. Fixed by re-initializing MSU before restoring local history.	2019-03-20 14:00:43 +01:00
Jason Tedor	f88e4181ca	Enable reading auto-follow patterns from x-content (#40130 ) This named writable was never registered, so it means that we could not read auto-follow patterns that were registered in the cluster state. This causes them to be lost on restarts, a bad bug. This commit addresses this by registering this named writable, and we add a basic CCR restart test to ensure that CCR keeps functioning properly when the follower is restarted.	2019-03-18 21:48:44 -04:00
Nhat Nguyen	38e9522218	Remove wait for cluster state step in peer recovery (#40004 ) We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped using it since #25692 (6.0). This change removes that action and related code in 7.x and 8.0. Relates #19287 Relates #25692	2019-03-18 15:17:21 -04:00
Jason Tedor	5be12e0999	Safe publication of AutoFollowCoordinator (#40153 ) We were leaking a reference to an AutoFollowCoordinator during construction, violating safe publication according to the JLS specification. This commit addresses this by waiting to register AutoFollowCoordinator with the ClusterApplierService after the AutoFollowCoordinator is fully constructed. We also remove ourselves as a listener when stopping.	2019-03-18 10:13:41 -04:00
Jason Tedor	b8ad337234	Stop auto-followers on shutdown (#40124 ) When shutting down a node, auto-followers will keep trying to run. This is happening even as transport services and other components are being closed. In some cases, this can lead to a stack overflow as we rapidly try to check the license state of the remote cluster, cannot because the transport service is shut down, and then immediately retry again. This can happen faster than the shutdown, and we die with stack overflow. This commit adds a stop command to auto-followers so that this retry loop occurs at most once on shutdown.	2019-03-18 07:25:31 -04:00
Jason Tedor	0824eceacf	Add log message for auto-follower timeout When an auto-follower coordinator times out waiting for the remote cluster state, we do not log any indication of this. While this is expected behavior in quiet deployments, it is still useful to see this information for tracing the behavior of the auto-follow coordinator. This commit adds a trace log message indicating the timeout.	2019-03-16 10:46:20 -04:00
Jason Tedor	86d1d03c37	Remove cluster state size (#40109 ) This commit removes the cluster state size field from the cluster state response, and drops the backwards compatibility layer added in 6.7.0 to continue to support this field. As calculation of this field was expensive and had dubious value, we have elected to remove this field.	2019-03-15 17:16:25 -04:00
David Kyle	4eb3683d65	Mute CcrRetentionLeaseIT tests (#40090 )	2019-03-15 15:05:47 +00:00
Jake Landis	b0b0f66669	Remove types from internal monitoring templates and bump to api 7 (#39888 ) (#39926 ) This commit removes the "doc" type from monitoring internal indexes. The template still carries the "_doc" type since that is needed for the internal representation. This change impacts the following templates: monitoring-alerts.json monitoring-beats.json monitoring-es.json monitoring-kibana.json monitoring-logstash.json As part of the required changes, the system_api_version has been bumped from "6" to "7" and support for version "2" has been dropped. A new empty pipeline is now introduced for the version "7", and the formerly empty "6" pipeline will now remove the type and re-direct the request to the "7" index. Additionally, due to a difference in the internal representation (which requires the inclusion of "_doc" type) and external representation (which requires the exclusion of any type) a helper method is introduced to help convert internal to external representation, and used by the monitoring HTTP template exporter. Relates #38637	2019-03-11 13:17:27 -05:00
Martijn van Groningen	8925a2c6c2	Further tweak AutoFollowIT#testAutoFollowManyIndices: * reduce the number of leader indices to be auto followed * also check the number of follower indices being created * also check whether leader indices are marked as auto followed Relates to #36761	2019-03-11 10:01:56 +01:00
Daniel Mitterdorfer	1bc31aca03	Mute CcrRetentionLeaseIT#testRetentionLeaseRenewalIsCancelledWhenFollowingIsPaused (#39897 ) Relates #39509	2019-03-11 08:47:51 +01:00
Jason Tedor	6675bafc49	Simplify CcrRetentionLeaseIT#testForgetFollower This test was more complicated than necessary, where we were capturing requests to prevent removal of retention leases, so that our forget follower request could remove the retention leases instead. Instead, a pause is enough to ensure that the retention leases are not re-added after we remove them by the forget follower request. This commit simplifies this test, and should remove some spurious failures. Relates #39850	2019-03-08 12:33:17 -05:00
Martijn van Groningen	8666aa1ed2	unmuted and tweaked test Relates to #36761	2019-03-08 12:43:23 +01:00
Jason Tedor	0250d554b6	Introduce forget follower API (#39718 ) This commit introduces the forget follower API. This API is needed in cases that unfollowing a following index fails to remove the shard history retention leases on the leader index. This can happen explicitly through user action, or implicitly through an index managed by ILM. When this occurs, history will be retained longer than necessary. While the retention lease will eventually expire, it can be expensive to allow history to persist for that long, and also prevent ILM from performing actions like shrink on the leader index. As such, we introduce an API to allow for manual removal of the shard history retention leases in this case.	2019-03-07 11:08:45 -05:00
Nhat Nguyen	83688ce2d4	Unmute testFollowIndexAndCloseNode Resolved in #39584	2019-03-06 22:39:13 -05:00
David Turner	77dd711847	Tidy up GroupedActionListener (#39633 ) Today the `GroupedActionListener` accepts a `defaults` parameter but all callers pass an empty list. Also it is permitted to pass an empty group but this is trappy because the delegated listener is never called in that case. This commit removes the `defaults` parameter and forbids an empty group.	2019-03-06 09:25:10 +00:00
Jason Tedor	75a0d4f470	Rename retention lease setting (#39719 ) This commit renames the retention lease setting index.soft_deletes.retention.lease so that it is under the namespace index.soft_deletes.retention_lease. As such, we rename the setting to index.soft_deletes.retention_lease.period.	2019-03-05 22:04:45 -05:00
Nhat Nguyen	af4918ebff	Simplify AutoFollowCoordinator with GroupedListener (#39603 ) This change simplifies AutoFollowCoordinator by replacing a combination of AtomicArray and CountDown with GroupedActionListener.	2019-03-04 13:50:27 -05:00
Yannick Welsch	0f65390c29	Do not mutate engine during planning step (#39571 ) This cleans up the Engine implementation by separating the sequence number generation from the planning step in the engine, to avoid for the planning step to have any side effects. This makes it easier to see that every sequence number is properly accounted for.	2019-03-04 10:11:39 +01:00
Tanguy Leroux	e005eeb0b3	Backport support for replicating closed indices to 7.x (#39506 )(#39499 ) Backport support for replicating closed indices (#39499) Before this change, closed indexes were simply not replicated. It was therefore possible to close an index and then decommission a data node without knowing that this data node contained shards of the closed index, potentially leading to data loss. Shards of closed indices were not completely taken into account when balancing the shards within the cluster, or automatically replicated through shard copies, and they were not easily movable from node A to node B using APIs like Cluster Reroute without being fully reopened and closed again. This commit changes the logic executed when closing an index, so that its shards are not just removed and forgotten but are instead reinitialized and reallocated on data nodes using an engine implementation which does not allow searching or indexing, which has a low memory overhead (compared with searchable/indexable opened shards) and which allows shards to be recovered from peer or promoted as primaries when needed. This new closing logic is built on top of the new Close Index API introduced in 6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before closing them, and closing an index on an 8.0 cluster will reinitialize the index shards and therefore impact the cluster health.
Some APIs have been adapted to make them work with closed indices: - Cluster Health API - Cluster Reroute API - Cluster Allocation Explain API - Recovery API - Cat Indices - Cat Shards - Cat Health - Cat Recovery This commit contains all the following changes (most recent first): * c6c42a1 Adapt NoOpEngineTests after #39006 * 3f9993d Wait for shards to be active after closing indices (#38854) * 5e7a428 Adapt the Cluster Health API to closed indices (#39364) * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767) * 71f5c34 Recover closed indices after a full cluster restart (#39249) * 4db7fd9 Adapt the Recovery API for closed indices (#38421) * 4fd1bb2 Adapt more tests suites to closed indices (#39186) * 0519016 Add replica to primary promotion test for closed indices (#39110) * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631) * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955) * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex() * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329) * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327) * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326) * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024) * e53a9be Fix compilation error in IndexShardIT after merge with master * cae4155 Relax NoOpEngine constraints (#37413) * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903) Relates to #33888	2019-03-01 14:48:26 +01:00
Martijn van Groningen	24e478c58e	Fix test, more than one node may be connected. Relates to #37681	2019-02-26 10:40:09 +01:00
Martijn van Groningen	b159cc51c0	Ensure remote connection established and clean remote connection prior to leader cluster restart Relates to #37681	2019-02-26 09:06:30 +01:00
Nhat Nguyen	e9dda75834	Enable soft-deletes by default for 7.0+ indices (#38929 ) Today when users upgrade to 7.0, existing indices will automatically switch to soft-deletes without an opt-out option. With this change, we only enable soft-deletes by default for new indices. Relates #36141	2019-02-25 17:54:29 -05:00
Jason Tedor	a6c0166d68	Renew retention leases while following (#39335 ) This commit is the final piece of the integration of CCR with retention leases. Namely, we periodically renew retention leases and advance the retaining sequence number while following.	2019-02-25 17:14:19 -05:00
Nhat Nguyen	0f29b89655	Unmute FollowerFailOverIT#testFailOverOnFollower Relates #38633	2019-02-25 14:44:44 -05:00
Nhat Nguyen	48219112e3	Do not wait for advancement of checkpoint in recovery (#39006 ) With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely do these thanks to Yannick's finding. The replication group to be used is currently sampled after indexing into the primary (see `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases: - There are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable). - There are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to `endingSeqNo` are available in the snapshot (i.e. also see comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. Relates #39000 Closes #38949 Co-authored-by: Yannick Welsch <yannick@welsch.lu>	2019-02-25 12:10:14 -05:00
Martijn van Groningen	6f69ef165b	Protect against the leader index being removed (#39351 ) when dealing with TimeoutException The `IndexFollowingIT#testDeleteLeaderIndex()` test failed, because an NPE was captured as fatal error instead of an IndexNotFoundException. Closes #39308	2019-02-25 13:40:10 +01:00
Martijn van Groningen	9bf0538878	Wait for index following is active for auto followed index (#39175 ) before executing pause follow api: https://github.com/elastic/elasticsearch/issues/39126#issuecomment-465512002 Closes #39126	2019-02-25 10:44:20 +01:00
Jason Tedor	6e06f82106	Fix failing CCR retention lease test Finally! This commit should fix the issues with the CCR retention lease that has been plaguing build failures. The issue here is that we are trying to prevent the clear session requests from being executed until after we have been able to validate that retention leases are being renewed. However, we were only blocking the clear session requests but not blocking them when they are proxied through another node. This commit addresses that. Relates #39268	2019-02-22 20:43:39 -05:00
Jason Tedor	2d4c98a991	Change sort order of shard stats in CCR test This commit changes the sort order of shard stats that are collected in CCR retention lease integration tests. This change is done so that primaries appear first in sort order.	2019-02-22 18:17:28 -05:00
Jason Tedor	e569cf8324	Address failing CCR retention lease test This test fails rarely but it is flaky in its current form. The problem here is that we lack a guarantee on the retention leases having been synced to all shard copies. We need to sleep long enough to ensure that that occurs, and then we can sample the retention leases, possibly sleep again (we usually will not have to since the first sleep will have been long enough to allow a sync and a renewal to happen, if one was going to happen), and then sample the retention leases for comparison. Closes #39331	2019-02-22 18:15:10 -05:00
Jason Tedor	e4e96b8181	Fix shard logged in background lease renewal The shard logged here is the leader shard but it should be the follower shard since this background retention lease renewal is happening on the follower side. This commit fixes that.	2019-02-22 17:32:51 -05:00
Jason Tedor	feb25c71a0	Simplify mocking in CCR retention lease tests This commit simplifies the use of transport mocking in the CCR retention lease integration tests. Instead of adding a send rule between nodes, we add a default send rule. This greatly simplifies the code here, and speeds the test up a little bit too.	2019-02-22 17:24:12 -05:00
Tim Brooks	931953a3ee	Ensure index commit released when testing timeouts (#39273 ) This fixes #39245. Currently it is possible in this test that the clear session call times-out. This means that the index commit will not be released and there will be an assertion triggered in the test teardown. This commit ensures that we wipe the leader index in the test to avoid this assertion. It is okay if the clear session call times-out in normal usage. This scenario is unavoidable due to potential network issues. We have a local timeout on the leader to clean it up when this scenario happens.	2019-02-22 11:14:42 -07:00
Tim Brooks	44df76251f	Rebuild remote connections on profile changes (#39146 ) Currently remote compression and ping schedule settings are dynamic. However, we do not listen for changes. This commit adds listeners for changes to those two settings. Additionally, when those settings change we now close existing connections and open new ones with the settings applied. Fixes #37201.	2019-02-21 14:00:39 -07:00
Benjamin Trent	34d06471c3	[CI] Mute CcrRetentionLeaseIT.testRetentionLeaseIsRenewedDuringRecovery (#39270 )	2019-02-21 14:17:03 -06:00
Benjamin Trent	8072543428	Muting AutoFollowIT.testAutoFollowManyIndices (#39265 )	2019-02-21 13:43:09 -06:00
Jason Tedor	b9f8be6968	Clarify the use of sleep in CCR test Sleeps in tests smell funny, and we try to avoid them to the extent possible. We are using a small one in a CCR test. This commit clarifies the purpose of that sleep by adding a comment explaining it. We also removed a hard-coded value from the test, that if we ever modified the value higher up where it was set, we could end up forgetting to change the value here. Now we ensure that these would move in lock step if we ever maintain them later.	2019-02-21 14:05:48 -05:00
Jason Tedor	719c38a36d	Fix CCR tests that manipulate transport requests We have some CCR tests where we use mock transport send rules to control the behavior that we desire in these tests. Namely, we want to simulate an exception being thrown on the leader side, or a variety of other situations. These send rules were put in place between the data nodes on each side. However, it might not be the case that these requests are being sent between data nodes. For example, a request that is handled on a non-data master node would not be sent from a data node. And it might not be the case that the request is sent to a data node, as it could be proxied through a non-data coordinating node. This commit addresses this by putting these send rules in places between all nodes on each side. Closes #39011 Closes #39201	2019-02-21 12:26:09 -05:00
Martijn van Groningen	f40139c403	Change ShardFollowTask to reuse common serialization logic (#39094 ) Initially in #38910, ShardFollowTask was reusing ImmutableFollowParameters' serialization logic. After merging, bwc tests failed sometimes and the binary serialization that ShardFollowTask was originally using was added back. ImmutableFollowParameters is using optional fields (optional vint) while ShardFollowTask was not (vint).	2019-02-21 09:32:33 +01:00
Nhat Nguyen	a96df5d209	Reduce refresh when lookup term in FollowingEngine (#39184 ) Today we always refresh when looking up the primary term in FollowingEngine. This is not necessary for we can simply return none for operations before the global checkpoint.	2019-02-20 19:21:00 -05:00
Nhat Nguyen	cdec11c4eb	Relax history check in ShardFollowTaskReplicationTests (#39162 ) The follower won't always have the same history as the leader for its soft-deletes retention can be different. However, if some operation exists on the history of the follower, then the same operation must exist on the leader. This change relaxes the history check in ShardFollowTaskReplicationTests. Closes #39093	2019-02-20 19:21:00 -05:00
Mark Vieira	24ac9da276	Mute CCR retention test that is consistently failing locally and in CI	2019-02-20 11:57:46 -08:00
Jason Tedor	90b1b36f50	Add cleanup logic to CCR retention lease test This commit adds some logic to remove the mock transport rules at the end of a CCR retention lease test.	2019-02-20 13:20:07 -05:00
Jason Tedor	cfd7c77b64	Fix broken CCR retention lease unfollow test This commit fixes a broken CCR retention lease unfollow test. The problem with the test is that the random subset of shards that we picked to disrupt would not necessarily overlap with the actual shards in use. We could take a non-empty subset of [0, 3] (e.g., { 2 }) when the only shard IDs in use were [0, 1]. This commit fixes this by taking into account the number of shards in use in the test. With this change, we also take measures to ensure that a successful branch is tested more frequently than would otherwise be the case. On that branch, we want to sometimes pretend that the retention lease is already removed. The randomness here was also sometimes selecting a subset of shards that did not overlap with the shards actually in use during the test. While this does not break the test, it is confusing and reduces the amount of coverage of that branch. Relates #39185	2019-02-20 12:09:28 -05:00
Jason Tedor	48984f647d	Mute failing CCR retention lease unfollow test This commit mutes a CCR retention lease unfollow test that is failing randomly, but frequently.	2019-02-20 09:47:17 -05:00


461 Commits