OpenSearch

Commit Graph

Author	SHA1	Message	Date
Yannick Welsch	7f8e1454ab	Advance checkpoints only after persisting ops (#43205 ) Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is that the local checkpoint is adapted as soon as an operation is processed (but not fsynced yet). This leaves room for the history below the global checkpoint to still change in case of a crash. As we rely on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard copies / follower clusters going out of sync. This commit required changing some core classes in the system: - The LocalCheckpointTracker keeps track now not only of the information whether an operation has been processed, but also whether that operation has been persisted to disk. - TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this. - ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all shard copies when in primary mode. The computed global checkpoint (which represents the minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored in the checkpoint entry for the local shard copy, has been moved to an extra field. - The periodic global checkpoint sync now also takes async durability into account, where the local checkpoints on shards only advance when the translog is asynchronously fsynced. This means that the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is not sufficient anymore. - The new index closing API does not work when combined with async durability. The shard verification step now requires an additional pre-flight step to fsync the translog, so that the main verify shard step has the most up-to-date global checkpoint at its disposal.	2019-06-20 11:12:38 +02:00
Jason Tedor	1f1a035def	Remove stale test logging annotations (#43403 ) This commit removes some very old test logging annotations that appeared to be added to investigate test failures that are long since closed. If these are needed, they can be added back on a case-by-case basis with a comment associating them to a test failure.	2019-06-19 22:58:22 -04:00
Martijn van Groningen	a4c45b5d70	Replace Streamable w/ Writeable in SingleShardRequest and subclasses (#43222 ) (#43364 ) Backport of: https://github.com/elastic/elasticsearch/pull/43222 This commit replaces usages of Streamable with Writeable for the SingleShardRequest / TransportSingleShardAction classes and subclasses of these classes. Note that where possible response fields were made final and default constructors were removed. Relates to #34389	2019-06-19 16:15:09 +02:00
Nhat Nguyen	0c5086d2f3	Rebuild version map when opening internal engine (#43202 ) With this change, we will rebuild the live version map and local checkpoint using documents (including soft-deleted) from the safe commit when opening an internal engine. This allows us to safely prune away _id of all soft-deleted documents as the version map is always in-sync with the Lucene index. Relates #40741 Supersedes #42979	2019-06-17 18:08:09 -04:00
Alpar Torok	4ba94a5051	Testclusters: convert ccr tests (#42313 )	2019-06-13 19:19:36 +03:00
Nhat Nguyen	5692be2161	Fix timing issue in CcrRetentionLeaseIT (#43054 ) In these tests, we sleep for a small multiple of the renew interval, then check that the retention leases are not changed. If a renewal request takes longer than that interval because of GC or slow CI, then the retention leases are not the same as before sleep. With this change, we relax to assert that we eventually stop the renewable process. Closes #39509	2019-06-11 18:03:16 -04:00
Nhat Nguyen	5d3849215b	CCR should not replicate private/internal settings (#43067 ) With this change, CCR will not replicate internal or private settings to follower indices. Closes #41268	2019-06-11 06:59:09 -04:00
Nhat Nguyen	53eb630700	Fix NPE in CcrRetentionLeaseIT (#43059 ) The retention leases stats is null if the processing shard copy is being closed. In this case, we should check against null then retry to avoid failing a test. Closes #41237	2019-06-10 17:58:37 -04:00
Nhat Nguyen	4191df6e1d	Unmute IndexFollowingIT#testFollowIndex Fixed in #41987	2019-06-10 17:58:37 -04:00
Jason Tedor	63bad28005	Do not allow modify aliases on followers (#43017 ) Now that aliases are replicated by a follower from its leader, this commit prevents directly modifying aliases on follower indices.	2019-06-09 22:53:54 -04:00
Jason Tedor	915d2f2daa	Refactor put mapping request validation for reuse (#43005 ) This commit refactors put mapping request validation for reuse. The concrete case that we are after here is the ability to apply effectively the same framework to indices aliases requests. This commit refactors the put mapping request validation framework to allow for that.	2019-06-09 10:19:04 -04:00
Gordon Brown	6eb4600e93	Add custom metadata to snapshots (#41281 ) Adds a metadata field to snapshots which can be used to store arbitrary key-value information. This may be useful for attaching a description of why a snapshot was taken, tagging snapshots to make categorization easier, or identifying the source of automatically-created snapshots.	2019-06-05 17:30:31 -06:00
Jason Tedor	117df87b2b	Replicate aliases in cross-cluster replication (#42875 ) This commit adds functionality so that aliases that are manipulated on leader indices are replicated by the shard follow tasks to the follower indices. Note that we ignore write indices. This is due to the fact that follower indices do not receive direct writes so the concept is not useful. Relates #41815	2019-06-04 20:36:24 -04:00
Mark Vieira	e44b8b1e2e	[Backport] Remove dependency substitutions 7.x (#42866 ) * Remove unnecessary usage of Gradle dependency substitution rules (#42773) (cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)	2019-06-04 13:50:23 -07:00
Mark Vieira	c1816354ed	[Backport] Improve build configuration time (#42674 )	2019-05-30 10:29:42 -07:00
David Turner	746a2f41fd	Remove PRE_60_NODE_CHECKPOINT (#42531 ) This commit removes the obsolete `PRE_60_NODE_CHECKPOINT` constant for dealing with 5.x nodes' lack of sequence number support. Backport of #42527	2019-05-28 12:25:53 +01:00
Nhat Nguyen	2077f9ffbc	Reset mock transport service in CcrRetentionLeaseIT (#42600 ) testRetentionLeaseIsAddedIfItDisappearsWhileFollowing does not reset the mock transport service after the test. Surviving transport interceptors from that test can sneakily remove retention leases and make other tests fail. Closes #39331 Closes #39509 Closes #41428 Closes #41679 Closes #41737 Closes #41756	2019-05-27 21:51:25 -04:00
Nhat Nguyen	85e60850af	Add debug log for retention leases (#42557 ) We need more information to understand why CcrRetentionLeaseIT is failing. This commit adds some debug log to retention leases and enables them in CcrRetentionLeaseIT.	2019-05-26 16:04:47 -04:00
Nhat Nguyen	d6e2f4a43e	Enable recoveries trace log in CcrRetentionLeaseIT Tracked #41679	2019-05-24 22:16:14 -04:00
Luca Cavanna	29c9bb9181	Clean up ShardId usage of Streamable (#41843 ) ShardId already implements Writeable so there is no need for it to implement Streamable too. Also the readShardId static method can be easily replaced with direct usages of the constructor that takes a StreamInput as argument.	2019-05-22 18:47:54 +02:00
Yannick Welsch	5d8605c790	Fix testAutoFollowManyIndices On a slow CI worker, the test was failing an assertion. Closes #41234	2019-05-22 17:33:34 +02:00
Simon Willnauer	a79cd77e5c	Remove IndexShard dependency from Repository (#42213 ) * Remove IndexShard dependency from Repository In order to simplify repository testing especially for BlobStoreRepository it's important to remove the dependency on IndexShard and reduce it to Store and MapperService (in the snapshot case). This significantly reduces the dependency footprint for Repository and allows unittesting without starting nodes or instantiating entire shard instances. This change deprecates the old method signatures and adds a unittest for FileRepository to show the advantage of this change. In addition, the unittesting surfaced a bug where the internal file names that are private to the repository were used in the recovery stats instead of the target file names which makes it impossible to relate to the actual Lucene files in the recovery stats. * don't delegate deprecated methods * apply comments * test	2019-05-22 14:27:11 +02:00
Yannick Welsch	770d8e9e39	Remove usage of max_local_storage_nodes in test infrastructure (#41652 ) Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a follow-up PR to deprecate this setting in 7.x and to remove it in 8.0. This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly done by passing the data path from the stopped node to the new node that is started.	2019-05-22 11:04:55 +02:00
Tal Levy	5640197632	Refactor TransportSingleShardAction to serialize Writeable responses (#41985 ) (#42040 ) Previously, TransportSingleShardAction required constructing a new empty response object. This response object's Streamable readFrom was used. As part of the migration to Writeable, the interface here was updated to leverage Writeable.Reader. relates to #34389.	2019-05-09 22:08:31 -07:00
Jason Tedor	8bea3c3a58	Enable trace logging in CCR retention lease tests These tests are failing somewhat mysteriously, indicating that when we renew retention leases during a restore that our retention leases that we added before starting the restore suddenly do not exist. To make sense of this, this commit enables trace logging.	2019-05-07 22:44:55 -04:00
Ryan Ernst	6fd8924c5a	Switch run task to use real distro (#41590 ) The run task is supposed to run elasticsearch with the given plugin or module. However, for modules, this is most realistic if using the full distribution. This commit changes the run setup to use the default or oss as appropriate.	2019-05-06 12:34:07 -07:00
Hicham Mallah	4a88da70c5	Add index name to cluster block exception (#41489 ) Updates the error message to reveal the index name that is causing it. Closes #40870	2019-05-04 19:11:59 -04:00
Nhat Nguyen	c7924014fa	Verify consistency of version and source in disruption tests (#41614 ) (#41661 ) With this change, we will verify the consistency of version and source (besides id, seq_no, and term) of live documents between shard copies at the end of disruption tests.	2019-05-03 18:47:14 -04:00
Nhat Nguyen	887f3f2c83	Simplify initialization of max_seq_no of updates (#41161 ) Today we choose to initialize max_seq_no_of_updates on primaries only so we can deal with a situation where a primary is on an old node (before 6.5) which does not have MSU while replicas are on new nodes (6.5+). However, this strategy is quite complex and can lead to bugs (for example #40249) since we have to assign a correct value (not too low) to MSU in all possible situations (before recovering from translog, restoring history on promotion, and handing off relocation). Fortunately, we don't have to deal with this BWC in 7.0+ since all nodes in the cluster should have MSU. This change simplifies the initialization of MSU by always assigning it a correct value in the constructor of Engine regardless of whether it's a replica or primary. Relates #33842	2019-04-30 15:14:52 -04:00
David Kyle	f737b05ad1	Mute CcrRetentionLeaseIT.testForgetFollower https://github.com/elastic/elasticsearch/issues/39850	2019-04-30 09:55:16 +01:00
Armin Braun	aad33121d8	Async Snapshot Repository Deletes (#40144 ) (#41571 ) Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete. * Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread pool * I did not take the biggest potential speedup step here and parallelize the shard file deletion because that's probably better handled by moving to bulk deletes where possible (and can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable. * See https://github.com/elastic/elasticsearch/pull/39656#issuecomment-470492106 * Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible since a delete isn't a single task anymore). * By adding a `ThreadPool` reference to the repository this also lays the groundwork to parallelizing shard snapshot uploads to improve the situation reported in #39657	2019-04-26 15:36:09 +02:00
Christoph Büscher	52495843cc	[Docs] Fix common word repetitions (#39703 )	2019-04-25 20:47:47 +02:00
Armin Braun	40aef2b8aa	Introduce Delegating ActionListener Wrappers (#40129 ) (#41527 ) * Introduce Delegating ActionListener Wrappers * Dry up use cases of ActionListener that simply pass through the response or exception to another listener	2019-04-25 16:05:04 +02:00
Jason Tedor	21bf2fe3c4	Reduce security permissions in CCR plugin (#41391 ) It looks like these permissions were copy/pasted from another plugin yet almost none of these permissions are needed for the CCR plugin. This commit removes all these unneeded permissions from the CCR plugin.	2019-04-20 08:21:59 -04:00
Adrien Grand	86e56590a7	Revert "Disable CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes." This reverts commit `343039e200`.	2019-04-18 11:31:00 +02:00
Adrien Grand	343039e200	Disable CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes. Relates #39331.	2019-04-18 11:29:11 +02:00
Armin Braun	233df6b73b	Make Transport Shard Bulk Action Async (#39793 ) (#41112 ) This is a dependency of #39504 Motivation: By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to async, we enable using `DeterministicTaskQueue` based tests to run indexing operations. This was previously impossible since we were blocking on the `write` thread until the `update` thread finished the mapping update. With this change, the mapping update will trigger a new task in the `write` queue instead. This change significantly enhances the amount of coverage we get from `SnapshotResiliencyTests` (and other potential future tests) when it comes to tracking down concurrency issues with distributed state machines. The logical change is effectively all in `TransportShardBulkAction`, the rest of the changes is then simply mechanically moving the caller code and tests to being async and passing the `ActionListener` down. Since the move to async would've added more parameters to the `private static` steps in this logic, I decided to inline and dry up (between delete and update) the logic as much as I could instead of passing the listener + wait-consumer down through all of them.	2019-04-11 16:01:52 +02:00
Jason Tedor	bb6f060f74	Add log message to forget follower test This commit adds a log message to help debug failures in a forget follower test.	2019-04-09 23:33:29 -04:00
Julie Tibshirani	21c5d7e95f	Mute CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes. Tracked in #39331.	2019-04-09 16:08:44 -07:00
Julie Tibshirani	cbae617898	Mute IndexFollowingIT#testFollowIndex as we await a fix. Tracked in #41037.	2019-04-09 14:56:37 -07:00
Mark Vieira	1287c7d91f	[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978 ) (#40993 ) * Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions. (cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6) * Fix forking JVM runner * Don't bump shadow plugin version	2019-04-09 11:52:50 -07:00
David Turner	2ff19bc1b7	Use Writeable for TransportReplAction derivatives (#40905 ) Relates #34389, backport of #40894.	2019-04-05 19:10:10 +01:00
Martijn van Groningen	809a5f13a4	Make -try xlint warning disabled by default. (#40833 ) Many gradle projects specifically use the -try exclude flag, because there are many cases where auto-closeable resource ignore is never referenced in body of corresponding try statement. Suppressing this warning specifically in each case that it happens using `@SuppressWarnings("try")` would be very verbose. This change removes `-try` from any gradle project and adds it to the build plugin. Also this change removes exclude flags from gradle projects that are already specified in build plugin (for example -deprecation). Relates to #40366	2019-04-05 08:02:26 +02:00
David Turner	1d2bc85586	Inline TransportReplAction#registerRequestHandlers (#40762 ) It is important that resync actions are not rejected on the primary even if its `write` threadpool is overloaded. Today we do this by exposing `registerRequestHandlers` to subclasses and overriding it in `TransportResyncReplicationAction`. This isn't ideal because it obscures the difference between this action and other replication actions, and also might allow subclasses to try and use some state before they are properly initialised. This change replaces this override with a constructor parameter to solve these issues. Relates #40706	2019-04-03 12:12:26 +01:00
Christoph Büscher	a13be65b01	Fixing typo in test error message (#40611 )	2019-03-28 22:12:24 +01:00
Tim Brooks	760cfffe4b	Move TransportMessageListener to TransportService (#40474 ) Currently the TransportMessageListener is applied and used in the Transport class. However, local requests and responses never make it to this class. This PR moves the listener add/remove methods to the TransportService. After this change the Transport can only have one listener set with it. This one listener is the TransportService, which will then propagate the events to the external listeners. Additionally this commit back ports #40237 Remove Tracer from MockTransportService Currently the TransportMessageListener is applied and used in the Transport class. However, local requests and responses never make it to this class. This PR moves the listener add/remove methods to the TransportService. After this change the Transport can only have one listener set with it. This one listener is the TransportService, which will then propagate the events to the external listeners.	2019-03-27 09:24:20 -06:00
alex101101	fb8ad0cf30	Add a soft limit to the field name length (#40309 ) Adds an optional limit to the length of field names, throws an IllegalArgumentException if the limit is breached. Closes #33651	2019-03-26 17:58:32 +01:00
Jason Tedor	10bbb082a4	Only run retention lease actions on active primary (#40386 ) In some cases, a request to perform a retention lease action can arrive on a primary shard before it is active. In this case, the primary shard would not yet be in primary mode, tripping an assertion in the replication tracker. Instead, we should not attempt to perform such actions on an initializing shard. This commit addresses this by not returning the primary shard in the single shard iterator if the primary shard is not yet active.	2019-03-23 09:39:39 -04:00
Nhat Nguyen	0e12065b54	Relax max_seq_no_of_updates assertion in follow tests If there's a failover on the follower, then its max_seq_no_of_updates is bootstrapped from its max_seq_no which might be higher than the max_seq_no_of_updates of the leader. We need to relax this check. Relates #40249	2019-03-21 19:41:55 -04:00
Jason Tedor	1e6941b138	Reduce retention lease sync intervals (#40302 ) This commit adjusts the frequency with which CCR renews retention leases and with which primaries sync retention leases to replicas. This helps Lucene reclaim soft-deleted documents more aggressively, which we have found in some use-cases can help improve performance, and either way will help keep disk space under more control.	2019-03-21 07:37:44 -04:00

508 Commits