Commit Graph

491 Commits

Author SHA1 Message Date
Nhat Nguyen 85e60850af Add debug log for retention leases (#42557)
We need more information to understand why CcrRetentionLeaseIT is
failing. This commit adds some debug log to retention leases and enables
them in CcrRetentionLeaseIT.
2019-05-26 16:04:47 -04:00
Nhat Nguyen d6e2f4a43e Enable recoveries trace log in CcrRetentionLeaseIT
Tracked #41679
2019-05-24 22:16:14 -04:00
Luca Cavanna 29c9bb9181 Clean up ShardId usage of Streamable (#41843)
ShardId already implements Writeable so there is no need for it to implement Streamable too. Also the readShardId static method can be
easily replaced with direct usages of the constructor that takes a
StreamInput as argument.
2019-05-22 18:47:54 +02:00
Yannick Welsch 5d8605c790 Fix testAutoFollowManyIndices
On a slow CI worker, the test was failing an assertion.

Closes #41234
2019-05-22 17:33:34 +02:00
Simon Willnauer a79cd77e5c Remove IndexShard dependency from Repository (#42213)
* Remove IndexShard dependency from Repository

In order to simplify repository testing especially for BlobStoreRepository
it's important to remove the dependency on IndexShard and reduce it to
Store and MapperService (in the snapshot case). This significantly reduces
the dependcy footprint for Repository and allows unittesting without starting
nodes or instantiate entire shard instances. This change deprecates the old
method signatures and adds a unittest for FileRepository to show the advantage
of this change.
In addition, the unittesting surfaced a bug where the internal file names that
are private to the repository were used in the recovery stats instead of the
target file names which makes it impossible to relate to the actual lucene files
in the recovery stats.

* don't delegate deprecated methods

* apply comments

* test
2019-05-22 14:27:11 +02:00
Yannick Welsch 770d8e9e39 Remove usage of max_local_storage_nodes in test infrastructure (#41652)
Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a
follow-up PR to deprecate this setting in 7.x and to remove it in 8.0.

This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically
reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly
done by passing the data path from the stopped node to the new node that is started.
2019-05-22 11:04:55 +02:00
Tal Levy 5640197632
Refactor TransportSingleShardAction to serialize Writeable responses (#41985) (#42040)
Previously, TransportSingleShardAction required constructing a new
empty response object. This response object's Streamable readFrom
was used. As part of the migration to Writeable, the interface here
was updated to leverage Writeable.Reader.

relates to #34389.
2019-05-09 22:08:31 -07:00
Jason Tedor 8bea3c3a58
Enable trace logging in CCR retention lease tests
These tests are failing somewhat mysteriously, indicating that when we
renew retention leaess during a restore that our retention leases that
we added before starting the restore suddenly do not exist. To make
sense of this, this commit enables trace logging.
2019-05-07 22:44:55 -04:00
Ryan Ernst 6fd8924c5a Switch run task to use real distro (#41590)
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
2019-05-06 12:34:07 -07:00
Hicham Mallah 4a88da70c5 Add index name to cluster block exception (#41489)
Updates the error message to reveal the index name that is causing it.

Closes #40870
2019-05-04 19:11:59 -04:00
Nhat Nguyen c7924014fa
Verify consistency of version and source in disruption tests (#41614) (#41661)
With this change, we will verify the consistency of version and source
(besides id, seq_no, and term) of live documents between shard copies
at the end of disruption tests.
2019-05-03 18:47:14 -04:00
Nhat Nguyen 887f3f2c83 Simplify initialization of max_seq_no of updates (#41161)
Today we choose to initialize max_seq_no_of_updates on primaries only so
we can deal with a situation where a primary is on an old node (before
6.5) which does not have MUS while replicas on new nodes (6.5+).
However, this strategy is quite complex and can lead to bugs (for
example #40249) since we have to assign a correct value (not too low) to
MSU in all possible situations (before recovering from translog,
restoring history on promotion, and handing off relocation).

Fortunately, we don't have to deal with this BWC in 7.0+ since all nodes
in the cluster should have MSU. This change simplifies the
initialization of MSU by always assigning it a correct value in the
constructor of Engine regardless of whether it's a replica or primary.

Relates #33842
2019-04-30 15:14:52 -04:00
David Kyle f737b05ad1 Mute CcrRetentionLeaseIT.testForgetFollower
https://github.com/elastic/elasticsearch/issues/39850
2019-04-30 09:55:16 +01:00
Armin Braun aad33121d8
Async Snapshot Repository Deletes (#40144) (#41571)
Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete.

* Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread poll
   * I did not take the biggest potential speedup step here and parallelize the shard file deletion because that's probably better handled by moving to bulk deletes where possible (and can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable.
* See https://github.com/elastic/elasticsearch/pull/39656#issuecomment-470492106
* Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible since a delete isn't a single task anymore).
* By adding a `ThreadPool` reference to the repository this also lays the groundwork to parallelizing shard snapshot uploads to improve the situation reported in #39657
2019-04-26 15:36:09 +02:00
Christoph Büscher 52495843cc [Docs] Fix common word repetitions (#39703) 2019-04-25 20:47:47 +02:00
Armin Braun 40aef2b8aa
Introduce Delegating ActionListener Wrappers (#40129) (#41527)
* Introduce Delegating ActionListener Wrappers
* Dry up use cases of ActionListener that simply pass through the response or exception to another listener
2019-04-25 16:05:04 +02:00
Jason Tedor 21bf2fe3c4
Reduce security permissions in CCR plugin (#41391)
It looks like these permissions were copy/pasted from another plugin yet
almost none of these permissions are needed for the CCR plugin. This
commit removes all these unneeded permissions from the CCR plugin.
2019-04-20 08:21:59 -04:00
Adrien Grand 86e56590a7 Revert "Disable CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes."
This reverts commit 343039e200.
2019-04-18 11:31:00 +02:00
Adrien Grand 343039e200 Disable CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes.
Relates #39331.
2019-04-18 11:29:11 +02:00
Armin Braun 233df6b73b
Make Transport Shard Bulk Action Async (#39793) (#41112)
This is a dependency of #39504

Motivation:
By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to async, we enable using `DeterministicTaskQueue` based tests to run indexing operations. This was previously impossible since we were blocking on the `write` thread until the `update` thread finished the mapping update.
With this change, the mapping update will trigger a new task in the `write` queue instead.
This change significantly enhances the amount of coverage we get from `SnapshotResiliencyTests` (and other potential future tests) when it comes to tracking down concurrency issues with distributed state machines.

The logical change is effectively all in `TransportShardBulkAction`, the rest of the changes is then simply mechanically moving the caller code and tests to being async and passing the `ActionListener` down.

Since the move to async would've added more parameters to the `private static` steps in this logic, I decided to inline and dry up (between delete and update) the logic as much as I could instead of passing the listener + wait-consumer down through all of them.
2019-04-11 16:01:52 +02:00
Jason Tedor bb6f060f74
Add log message to forget follower test
This commit adds a log message to help debug failures in a forget
follower test.
2019-04-09 23:33:29 -04:00
Julie Tibshirani 21c5d7e95f Mute CcrRetentionLeaseIT#testRetentionLeasesAreNotBeingRenewedAfterRecoveryCompletes.
Tracked in #39331.
2019-04-09 16:08:44 -07:00
Julie Tibshirani cbae617898 Mute IndexFollowingIT#testFollowIndex as we await a fix.
Tracked in #41037.
2019-04-09 14:56:37 -07:00
Mark Vieira 1287c7d91f
[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) (#40993)
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)

This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.

(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)

* Fix forking JVM runner

* Don't bump shadow plugin version
2019-04-09 11:52:50 -07:00
David Turner 2ff19bc1b7
Use Writeable for TransportReplAction derivatives (#40905)
Relates #34389, backport of #40894.
2019-04-05 19:10:10 +01:00
Martijn van Groningen 809a5f13a4
Make -try xlint warning disabled by default. (#40833)
Many gradle projects specifically use the -try exclude flag, because
there are many cases where auto-closeable resource ignore is never
referenced in body of corresponding try statement. Suppressing this
warning specifically in each case that it happens using
`@SuppressWarnings("try")` would be very verbose.

This change removes `-try` from any gradle project and adds it to the
build plugin. Also this change removes exclude flags from gradle projects
that is already specified in build plugin (for example -deprecation).

Relates to #40366
2019-04-05 08:02:26 +02:00
David Turner 1d2bc85586 Inline TransportReplAction#registerRequestHandlers (#40762)
It is important that resync actions are not rejected on the primary even if its
`write` threadpool is overloaded. Today we do this by exposing
`registerRequestHandlers` to subclasses and overriding it in
`TransportResyncReplicationAction`. This isn't ideal because it obscures the
difference between this action and other replication actions, and also might
allow subclasses to try and use some state before they are properly
initialised. This change replaces this override with a constructor parameter to
solve these issues.

Relates #40706
2019-04-03 12:12:26 +01:00
Christoph Büscher a13be65b01 Fixing typo in test error message (#40611) 2019-03-28 22:12:24 +01:00
Tim Brooks 760cfffe4b
Move TransportMessageListener to TransportService (#40474)
Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propogate the events to the external listeners.

Additionally this commit back ports #40237

Remove Tracer from MockTransportService

Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propogate the events to the external listeners.
2019-03-27 09:24:20 -06:00
alex101101 fb8ad0cf30 Add a soft limit to the field name length (#40309)
Adds an optional limit to the length of field names, throws an IllegalArgumentException if the limit is breached. 
Closes #33651
2019-03-26 17:58:32 +01:00
Jason Tedor 10bbb082a4
Only run retention lease actions on active primary (#40386)
In some cases, a request to perform a retention lease action can arrive
on a primary shard before it is active. In this case, the primary shard
would not yet be in primary mode, tripping an assertion in the
replication tracker. Instead, we should not attempt to perform such
actions on an initializing shard. This commit addresses this by not
returning the primary shard in the single shard iterator if the primary
shard is not yet active.
2019-03-23 09:39:39 -04:00
Nhat Nguyen 0e12065b54 Relax max_seq_no_of_updates assertion in follow tests
If there's a failover on the follower, then its max_seq_no_of_updates is
bootstrapped from its max_seq_no which might be higher than the
max_seq_no_of_updates of the leader. We need to relax this check.

Relates #40249
2019-03-21 19:41:55 -04:00
Jason Tedor 1e6941b138
Reduce retention lease sync intervals (#40302)
This commit adjusts the frequency with which CCR renews retention leases
and with which primaries sync retention leases to replicas. This helps
Lucene reclaim soft-deleted documents more aggressively, which we have
found in some use-cases can help improve performance, and either way
will help keep disk space under more control.
2019-03-21 07:37:44 -04:00
Like 6f64267626 Make setting index.translog.sync_interval be dynamic (#37382)
Currently, we cannot update index setting index.translog.sync_interval if index is open, because it's
not dynamic which can be updated for closed index only.

Closes #32763
2019-03-20 17:12:45 +01:00
Henning Andersen 4c2a8638ca Cascading primary failure lead to MSU too low (#40249)
If a replica were first reset due to one primary failover and then
promoted (before resync completes), its MSU would not include changes
since global checkpoint, leading to errors during translog replay.

Fixed by re-initializing MSU before restoring local history.
2019-03-20 14:00:43 +01:00
Jason Tedor f88e4181ca
Enable reading auto-follow patterns from x-content (#40130)
This named writable was never registered, so it means that we could not
read auto-follow patterns that were registered in the cluster
state. This causes them to be lost on restarts, a bad bug. This commit
addresses this by registering this named writable, and we add a basic
CCR restart test to ensure that CCR keeps functioning properly when the
follower is restarted.
2019-03-18 21:48:44 -04:00
Nhat Nguyen 38e9522218 Remove wait for cluster state step in peer recovery (#40004)
We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped
using it since #25692 (6.0). This change removes that action and related
code in 7.x and 8.0.

Relates #19287
Relates #25692
2019-03-18 15:17:21 -04:00
Jason Tedor 5be12e0999
Safe publication of AutoFollowCoordinator (#40153)
We were leaking a reference to an AutoFollowCoordinator during
construction, violating safe publication according to the JLS
specification. This commit addresses this by waiting to register
AutoFollowCoordinator with the ClusterApplierService after the
AutoFollowCoordinator is fully constructed. We also remove ourselves as
a listener when stopping.
2019-03-18 10:13:41 -04:00
Jason Tedor b8ad337234
Stop auto-followers on shutdown (#40124)
When shutting down a node, auto-followers will keep trying to run. This
is happening even as transport services and other components are being
closed. In some cases, this can lead to a stack overflow as we rapidly
try to check the license state of the remote cluster, can not because
the transport service is shutdown, and then immeidately retry
again. This can happen faster than the shutdown, and we die with stack
overflow. This commit adds a stop command to auto-followers so that this
retry loop occurs at most once on shutdown.
2019-03-18 07:25:31 -04:00
Jason Tedor 0824eceacf
Add log message for auto-follower timeout
When an auto-follower coordinator times out waiting for the remote
cluster state, we do not log any indication of this. While this is
expected behavior in quiet deployments, it is still useful to see this
information for tracing the behavior of the auto-follow
coordinator. This commit adds a trace log message indicating that the
timeout.
2019-03-16 10:46:20 -04:00
Jason Tedor 86d1d03c37
Remove cluster state size (#40109)
This commit removes the cluster state size field from the cluster state
response, and drops the backwards compatibility layer added in 6.7.0 to
continue to support this field. As calculation of this field was
expensive and had dubious value, we have elected to remove this field.
2019-03-15 17:16:25 -04:00
David Kyle 4eb3683d65 Mute CcrRetentionLeaseIT tests (#40090) 2019-03-15 15:05:47 +00:00
Jake Landis b0b0f66669
Remove types from internal monitoring templates and bump to api 7 (#39888) (#39926)
This commit removes the "doc" type from monitoring internal indexes.
The template still carries the "_doc" type since that is needed for
the internal representation.

This change impacts the following templates:
monitoring-alerts.json
monitoring-beats.json
monitoring-es.json
monitoring-kibana.json
monitoring-logstash.json

As part of the required changes, the system_api_version has been
bumped from "6" to "7" and support for version "2" has been dropped.

A new empty pipeline is now introduced for the version "7", and
the formerly empty "6" pipeline will now remove the type and re-direct
the request to the "7" index.

Additionally, to due to a difference in the internal representation
(which requires the inclusion of "_doc" type) and external representation
(which requires the exclusion of any type) a helper method is introduced
to help convert internal to external representation, and used by the
monitoring HTTP template exporter.

Relates #38637
2019-03-11 13:17:27 -05:00
Martijn van Groningen 8925a2c6c2
Further tweak AutoFollowIT#testAutoFollowManyIndices:
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check the whether leader indices are marked as auto followed

Relates to #36761
2019-03-11 10:01:56 +01:00
Daniel Mitterdorfer 1bc31aca03
Mute CcrRetentionLeaseIT#testRetentionLeaseRenewalIsCancelledWhenFollowingIsPaused (#39897)
Relates #39509
2019-03-11 08:47:51 +01:00
Jason Tedor 6675bafc49
Simplify CcrRetentionLeaseIT#testForgetFollower
This test was more complicated than necessary, where we were capturing
requests to prevent removal of retention leases, so that our forget
follower request could remove the retention leases instead. Instead, a
pause is enough to ensure that the retention leases are not re-added
after we remove them by the forget follower request. This commit
simplifies this test, and should remove some spurious failures.

Relates #39850
2019-03-08 12:33:17 -05:00
Martijn van Groningen 8666aa1ed2
unmuted and tweaked test
Relates to #36761
2019-03-08 12:43:23 +01:00
Jason Tedor 0250d554b6
Introduce forget follower API (#39718)
This commit introduces the forget follower API. This API is needed in cases that
unfollowing a following index fails to remove the shard history retention leases
on the leader index. This can happen explicitly through user action, or
implicitly through an index managed by ILM. When this occurs, history will be
retained longer than necessary. While the retention lease will eventually
expire, it can be expensive to allow history to persist for that long, and also
prevent ILM from performing actions like shrink on the leader index. As such, we
introduce an API to allow for manual removal of the shard history retention
leases in this case.
2019-03-07 11:08:45 -05:00
Nhat Nguyen 83688ce2d4 Unmute testFollowIndexAndCloseNode
Resolved in #39584
2019-03-06 22:39:13 -05:00
David Turner 77dd711847 Tidy up GroupedActionListener (#39633)
Today the `GroupedActionListener` accepts a `defaults` parameter but all
callers pass an empty list. Also it is permitted to pass an empty group but
this is trappy because the delegated listener is never be called in that case.
This commit removes the `defaults` parameter and forbids an empty group.
2019-03-06 09:25:10 +00:00