215 Commits

Author SHA1 Message Date
Nhat Nguyen ff49e79d40 CCR: Rename follow-task parameters and stats (#34836)
* CCR: Rename follow parameters and stats

This commit renames the follow-task parameters and its stats.
Below are the changes:

## Params
- remote_cluster (unchanged)
- leader_index (unchanged)
- max_batch_operation_count -> max_read_request_operation_count
- max_batch_size -> max_read_request_size
- max_write_request_operation_count (new)
- max_write_request_size (new)
- max_concurrent_read_batches -> max_outstanding_read_requests
- max_concurrent_write_batches -> max_outstanding_write_requests
- max_write_buffer_size (unchanged)
- max_write_buffer_count (unchanged)
- max_retry_delay (unchanged)
- poll_timeout -> read_poll_timeout

## Stats
- remote_cluster (unchanged)
- leader_index (unchanged)
- follower_index (unchanged)
- shard_id (unchanged)
- leader_global_checkpoint (unchanged)
- leader_max_seq_no (unchanged)
- follower_global_checkpoint (unchanged)
- follower_max_seq_no (unchanged)
- last_requested_seq_no (unchanged)
- number_of_concurrent_reads -> outstanding_read_requests
- number_of_concurrent_writes -> outstanding_write_requests
- buffer_size_in_bytes -> write_buffer_size_in_bytes (new)
- number_of_queued_writes -> write_buffer_operation_count
- mapping_version -> follower_mapping_version
- total_fetch_time_millis -> total_read_time_millis
- total_fetch_remote_time_millis -> total_read_remote_exec_time_millis
- number_of_successful_fetches -> successful_read_requests
- number_of_failed_fetches -> failed_read_requests
- operation_received -> operations_read
- total_transferred_bytes -> bytes_read
- total_index_time_millis -> total_write_time_millis [?]
- number_of_successful_bulk_operations -> successful_write_requests
- number_of_failed_bulk_operations -> failed_write_requests
- number_of_operations_indexed -> operations_written
- fetch_exception -> read_exceptions
- time_since_last_fetch_millis -> time_since_last_read_millis

* add test for max_write_request_(operation_count|size)
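
As a rough illustration of the parameter renames listed above, a client-side migration of an old-style follow request body might look like the sketch below. The helper name `migrate_follow_params` and the sample values are hypothetical, and only renames that the list spells out unambiguously are included.

```python
# Old -> new follow-task parameter names, taken from the list above.
PARAM_RENAMES = {
    "max_batch_size": "max_read_request_size",
    "max_concurrent_read_batches": "max_outstanding_read_requests",
    "max_concurrent_write_batches": "max_outstanding_write_requests",
    "poll_timeout": "read_poll_timeout",
}


def migrate_follow_params(old_params):
    """Return a copy of old_params with renamed keys (hypothetical helper)."""
    # Keys that were not renamed (e.g. remote_cluster) pass through untouched.
    return {PARAM_RENAMES.get(key, key): value for key, value in old_params.items()}


# Sample request body; the values are made up for illustration.
old_body = {
    "remote_cluster": "leader",
    "leader_index": "logs",
    "max_batch_size": "4mb",
    "poll_timeout": "30s",
}
new_body = migrate_follow_params(old_body)
```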
2018-10-25 10:36:15 +02:00
Martijn van Groningen 6fe0e62b7a
[CCR] Added write buffer size limit (#34797)
This limit is based on the size in bytes of the operations in the write buffer. If this limit is exceeded then no more read operations will be coordinated until the size in bytes of the write buffer has dropped below the configured write buffer size limit.

Renamed existing `max_write_buffer_size` to `max_write_buffer_count` to indicate that the limit is count-based.

Closes #34705
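
The back-pressure rule described above can be sketched with a toy buffer model. This is not the actual implementation: the class `WriteBuffer`, its methods, and the exact boundary comparison (`<=`) are assumptions made for illustration.

```python
class WriteBuffer:
    """Toy model of a follower write buffer with count and byte limits."""

    def __init__(self, max_count, max_size_in_bytes):
        self.max_count = max_count
        self.max_size_in_bytes = max_size_in_bytes
        self.operations = []  # each operation is modeled as a bytes payload
        self.size_in_bytes = 0

    def add(self, operation):
        """Buffer one operation that was read from the leader."""
        self.operations.append(operation)
        self.size_in_bytes += len(operation)

    def drain(self, n):
        """Remove up to n operations, as a write request would."""
        drained, self.operations = self.operations[:n], self.operations[n:]
        self.size_in_bytes -= sum(len(op) for op in drained)
        return drained

    def can_coordinate_reads(self):
        """Reads pause while either the count or the byte limit is exceeded."""
        return (len(self.operations) <= self.max_count
                and self.size_in_bytes <= self.max_size_in_bytes)
```

With a byte limit of 8, buffering two 5-byte operations pauses reads, and draining one of them lets reads resume.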
2018-10-24 23:48:49 +02:00
Andrey Atapin 5f588180f9 Improve IndexNotFoundException's default error message (#34649)
This commit adds the index name to the error message when an index is not found.
2018-10-24 12:53:31 -07:00
Nhat Nguyen d73768f812
CCR: Do not follow if leader does not have soft-deletes (#34767)
We should abort a follow request, and not create a follower index, if
the leader does not have soft-deletes. Moreover, we should not
auto-follow an index if it does not have soft-deletes.
2018-10-24 11:19:39 -04:00
Alpar Torok 795d57b4f9
Auto configure all test tasks (#34666)
With this change, we apply the common test config automatically to all
newly created tasks instead of opting in specifically.

For plugin authors using the plugin externally this means that the
configuration will be applied to their RandomizedTestingTasks as well.

The purpose of the task is to simplify setup and make it easier to
change projects that use the `test` task but actually run integration
tests to use a task called `integTest` for clarity, but also because
we may want to configure and run them differently.
E.g., using different levels of concurrency.
2018-10-24 16:05:50 +03:00
Martijn van Groningen 76240e6bbe
[CCR] Renamed leader_cluster to remote_cluster (#34776)
and also some occurrences of clusterAlias to remoteCluster.

Closes #34682
2018-10-24 13:39:36 +02:00
Boaz Leskes be907516ad
Change ShardFollowTask defaults (#34793)
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

These are not necessarily our final values, but they are good defaults for the purposes of initial testing.
2018-10-24 13:32:48 +02:00
Martijn van Groningen 18007a29b2
[CCR] Made leader cluster required in shard follow task.
Left over from #34580
2018-10-24 08:38:25 +02:00
Martijn van Groningen abf8cb6706
[CCR] Cleanup pause follow action (#34183)
* Change the `TransportPauseFollowAction` to extend from `TransportMasterNodeAction`
  instead of `HandledAction`, this removes a sync cluster state api call.
* Introduced `ResponseHandler` that removes duplicated code in `TransportPauseFollowAction` and
  `TransportResumeFollowAction`.
* Changed `PauseFollowAction.Request` to not use `readFrom()`.
2018-10-24 08:12:39 +02:00
Martijn van Groningen 0efba0675e
[CCR] Add qa test library (#34611)
* Introduced test qa lib that all CCR qa modules depend on to avoid
test code duplication.
2018-10-23 23:24:32 +02:00
Nhat Nguyen e242fd2e42
CCR: Add TransportService closed to retryable errors (#34722)
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because the FollowTask received a TransportService closed exception,
which is currently considered a fatal error. This behavior is not
desirable since a closing node can throw that exception, and we
should retry in that case.

This change adds TransportService closed error to the list of retryable
errors.

Closes #34694
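
A minimal sketch of the retryable-versus-fatal distinction follows. The exception classes, the contents of `RETRYABLE_ERRORS`, and the helper `run_with_retries` are all hypothetical stand-ins, not the real list used by the follow task.

```python
class TransportServiceClosedError(Exception):
    """Stand-in for the 'TransportService closed' failure (hypothetical class)."""


class FatalFollowError(Exception):
    """Stand-in for some non-retryable failure (hypothetical class)."""


# Assumed list of retryable errors; the commit adds the first entry to such a list.
RETRYABLE_ERRORS = (TransportServiceClosedError, ConnectionError, TimeoutError)


def should_retry(error):
    """Return True if the error is considered transient."""
    return isinstance(error, RETRYABLE_ERRORS)


def run_with_retries(action, max_attempts=3):
    """Call action until it succeeds, hits a fatal error, or runs out of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            if not should_retry(error) or attempt == max_attempts:
                raise
```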
2018-10-23 14:23:29 -04:00
Martijn van Groningen ed817fb265
[CCR] Move leader_index and leader_cluster parameters from resume follow to put follow api (#34638)
As part of this change the leader index name and leader cluster name are
stored in the CCR metadata in the follow index. The resume follow api
will read that when a resume follow request is executed.
2018-10-23 19:37:45 +02:00
Nhat Nguyen 5923ea536e
CCR: Requires soft-deletes on the follower (#34725)
Since #34412 and #34474, a follower must have soft-deletes enabled 
to work correctly. This change requires soft-deletes on the follower.

Relates #34412
Relates #34474
2018-10-23 11:51:17 -04:00
Martijn van Groningen e6d87cc09f
[CCR] Add total fetch time leader stat (#34577)
Add total fetch time leader stat, which
keeps track of how much time was spent on fetches
from the leader cluster's perspective.
2018-10-23 16:41:06 +02:00
Martijn van Groningen 36baf3823d
[CCR] Auto follow pattern APIs adjustments (#34518)
* Changed the resource id of auto follow patterns to be a user defined name
instead of being the leader cluster alias name.
* Fail when an unfollowed leader index matches with two or more auto follow patterns.
2018-10-23 15:48:51 +02:00
Jason Tedor 52fc502b7e
Fix the casing in the names of some CCR classes
We should be consistent here. We were already using the casing "Ccr" and
this is the preferred casing for Java class names. This commit adjusts
the names of some classes that were using the casing "CCR" to be "Ccr".
2018-10-22 11:25:00 -04:00
Jason Tedor 7af19b8f81
Migrate wait for pending tasks helper to server (#34675)
In some of our X-Pack REST tests we have to wait for pending tasks to
complete. We now need this functionality in ESRestTestCase for
the docs tests where we run against X-Pack features. This commit moves
the helper method that we have in X-Pack to ESRestTestCase, and removes
duplicate logic from waiting for rollup tasks to complete.
2018-10-22 11:14:02 -04:00
Martijn van Groningen 92e34732f5
[CCR] Remove ccr related metadata between tests for single node tests too
2018-10-22 09:15:22 +02:00
Martijn van Groningen b6750cf6c2
[CCR] Muted tests
Relates to #34696
2018-10-22 08:47:31 +02:00
Martijn van Groningen f51301a1a6
[CCR] Moved integration test
2018-10-22 08:44:41 +02:00
Martijn van Groningen b816837d39
[CCR] Always remove persistent tasks metadata between tests and
better handle assertion errors between tests.
2018-10-22 08:15:43 +02:00
Nhat Nguyen d90b6730c7
CCR: Following primary should process NoOps once (#34408)
This is a follow-up for #34288.

Relates #34412
2018-10-19 21:10:13 -04:00
Nhat Nguyen 630d5514a5 CCR/TEST: Adjust testFailOverOnFollower
CI passed but the result is outdated after PR #34366 was merged.
2018-10-19 15:06:44 -04:00
Nhat Nguyen bd92a28cfc
CCR: Replicate existing ops with old term on follower (#34412)
Since #34288, we might hit deadlock if the FollowTask has more fetchers
than writers. This can happen in the following scenario:

Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has
two fetchers and one writer.

1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0,
num_ops:1} and {from_seq_no: 1, num_ops:1} to read seq#0 and seq#1
respectively.

2. The second request, which fetches seq#1, completes first, and then it
triggers a write request containing only seq#1.

3. The primary of a follower fails after it has replicated seq#1 to
replicas.

4. Since the old primary did not respond, the FollowTask issues another
write request containing seq#1 (resend the previous write request).

5. The new primary has seq#1 already; thus it won't replicate seq#1 to
replicas but will wait for the global checkpoint to advance to at least
seq#1.

The problem is that the FollowTask has only one writer and that writer
is waiting for seq#0, which won't be delivered until the writer completes.

This PR proposes to replicate existing operations with the old primary
term (instead of the current term) on the follower. In particular, when
the following primary detects that it has already processed an operation,
it will look up the term of an existing operation with the same seq_no
in the Lucene index, then rewrite that operation with the old term
before replicating it to the following replicas. This approach is
wait-free but requires soft-deletes on the follower.

Relates #34288
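
The proposed behavior can be modeled in a few lines. This is only a sketch: the class `FollowerPrimary` and its dict-based "index" are hypothetical stand-ins for the following engine and the Lucene lookup.

```python
class FollowerPrimary:
    """Toy model of a following primary that reuses the term of existing operations."""

    def __init__(self, term):
        self.term = term
        self.index = {}  # seq_no -> term; stands in for the Lucene index

    def apply(self, seq_no):
        """Return the (seq_no, term) pair to replicate to the following replicas."""
        if seq_no in self.index:
            # Already processed: look up the old term and replicate with it,
            # instead of waiting for the global checkpoint to advance.
            return seq_no, self.index[seq_no]
        self.index[seq_no] = self.term
        return seq_no, self.term


# The old primary (term 1) processes seq#1 and then fails.
old_primary = FollowerPrimary(term=1)
old_primary.apply(1)

# The new primary (term 2) already has seq#1 and receives the resent request.
new_primary = FollowerPrimary(term=2)
new_primary.index = dict(old_primary.index)
resent = new_primary.apply(1)  # replicated with the old term
fresh = new_primary.apply(0)   # a new operation gets the current term
```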
2018-10-19 13:56:00 -04:00
Nhat Nguyen 90ca5b1fde
Fill LocalCheckpointTracker with Lucene commit (#34474)
Today we rely on the LocalCheckpointTracker to ensure no duplicates when
enabling optimization using max_seq_no_of_updates. The problem is that
the LocalCheckpointTracker is not fully reloaded when opening an engine
with an out-of-order index commit. Suppose the starting commit has seq#0
and seq#2; then the current LocalCheckpointTracker would return "false"
when asked if seq#2 was processed before, although seq#2 is in the commit.

This change scans the existing sequence numbers in the starting commit,
then marks these as completed in the LocalCheckpointTracker to ensure
a consistent state between LocalCheckpointTracker and Lucene commit.
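
The seq#0 / seq#2 example can be reproduced with a simplified tracker. The class `CheckpointTracker` and the helper `open_tracker` are hypothetical and much simpler than the real LocalCheckpointTracker.

```python
class CheckpointTracker:
    """Simplified stand-in for LocalCheckpointTracker."""

    def __init__(self, checkpoint=-1):
        self.checkpoint = checkpoint  # highest seq_no with no gaps below it
        self.pending = set()          # completed seq_nos above the checkpoint

    def mark_completed(self, seq_no):
        if seq_no <= self.checkpoint:
            return
        self.pending.add(seq_no)
        # Advance the checkpoint over any contiguous run of completed seq_nos.
        while self.checkpoint + 1 in self.pending:
            self.checkpoint += 1
            self.pending.remove(self.checkpoint)

    def contains(self, seq_no):
        """Return True if seq_no was processed before."""
        return seq_no <= self.checkpoint or seq_no in self.pending


def open_tracker(commit_seq_nos, local_checkpoint):
    """Build a tracker and fill it with the seq_nos found in the starting commit."""
    tracker = CheckpointTracker(local_checkpoint)
    for seq_no in commit_seq_nos:
        tracker.mark_completed(seq_no)
    return tracker


# Starting commit with seq#0 and seq#2; the local checkpoint is 0.
unfilled = CheckpointTracker(checkpoint=0)
filled = open_tracker([0, 2], local_checkpoint=0)
```

Here `unfilled.contains(2)` is false while `filled.contains(2)` is true, which is the inconsistency the change removes.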
2018-10-19 12:38:06 -04:00
Martijn van Groningen 56d4f69718
Renamed remaining leader_cluster_alias / cluster_alias to leader_cluster
2018-10-19 07:59:56 +02:00
Martijn van Groningen 44b461aff2
[CCR] Make leader cluster a required argument. (#34580)
This change makes it no longer possible to follow / auto follow without
specifying a leader cluster. If a local index needs to be followed
then `cluster.remote.*.seeds` should point to nodes in the local cluster.

Closes #34258
2018-10-19 07:41:46 +02:00
Martijn van Groningen 0d62f6102c
[CCR] Split cluster alias from leader index field into its own field in follow APIs (#34366)
2018-10-18 12:11:48 +02:00
Jason Tedor 3e067123a1
Remove dead methods from ChainIT
This commit removes some unused methods from ChainIT.
2018-10-16 10:45:33 -04:00
Martijn van Groningen a1ec91395c
Changed CCR internal integration tests to use a leader and follower cluster instead of a single cluster (#34344)
The `AutoFollowTests` needs to restart the clusters between tests, because
it uses auto follow stats in assertions. Auto follow stats are only reset
by stopping the elected master node.

Extracted the `testGetOperationsBasedOnGlobalSequenceId()` test to its own test, because it just tests the shard changes api.

* Renamed AutoFollowTests to AutoFollowIT, because it is an integration test.
Renamed ShardChangesIT to IndexFollowingIT, because shard changes is the name
of an internal api and isn't a good name for an integration test.

* move creation of NodeConfigurationSource to a separate method

* Fixes issues after merge, moved assertSeqNos() and assertSameDocIdsOnShards() methods from ESIntegTestCase to InternalTestCluster, so that ccr tests can use these methods too.
2018-10-16 14:45:46 +02:00
Jason Tedor e0b6721df4
Add dedicated test for chain replication (#34497)
This commit adds a dedicated test that chain replication leader ->
middle -> follow is successful.
2018-10-16 06:21:28 -04:00
Martijn van Groningen f7df8718b9
[CCR] Don't fail shard follow tasks in case of a non-retryable error (#34404)
2018-10-16 07:44:15 +02:00
Martijn van Groningen 51eca14288
[TEST] Make sure there are shards started so that `ESIntegTestCase#assertSameDocIdsOnShards()` does not fail with shard not found.
2018-10-15 10:24:28 +02:00
Martijn van Groningen 74dc2da873
Change shard changes api's threadpool from get to search (#34421)
2018-10-15 08:09:00 +01:00
Nhat Nguyen 429c29e833 CCR/TEST: AwaitsFix testFailOverOnFollower
Tracked at #34412
2018-10-13 21:05:33 -04:00
Nhat Nguyen 7bc11a8099 Unmute testFollowIndexAndCloseNode
This issue was resolved by #34288.

Closes #33337
Relates #34288
2018-10-10 15:48:22 -04:00
Nhat Nguyen 33791ac27c
CCR: Following primary should process operations once (#34288)
Today we rewrite the operations from the leader with the term of the
following primary because the follower should own its history. The
problem is that a newly promoted primary may re-assign its term to
operations which were replicated to replicas before by the previous
primary. If this happens, some operations with the same seq_no may be
assigned different terms. This is not good for the future optimistic
locking using a combination of seqno and term.

This change ensures that the primary of a follower only processes an
operation if that operation was not processed before. The skipped
operations are guaranteed to be delivered to replicas via either
primary-replica resync or peer-recovery. However, the primary must not
acknowledge until the global checkpoint is at least the highest seqno of
all skipped ops (i.e., they all have been processed on every replica).

Relates #31751
Relates #31113
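
The "process once, then wait for the global checkpoint" rule might be sketched as follows. Both function names are hypothetical, and sequence numbers are modeled as plain integers.

```python
def process_on_following_primary(incoming_seq_nos, processed_seq_nos):
    """Apply only unseen operations and report what to wait for.

    Returns (applied, wait_for), where wait_for is the highest skipped
    seq_no, or None when nothing was skipped.
    """
    applied, skipped = [], []
    for seq_no in incoming_seq_nos:
        if seq_no in processed_seq_nos:
            skipped.append(seq_no)  # processed before; do not assign a new term
        else:
            processed_seq_nos.add(seq_no)
            applied.append(seq_no)
    wait_for = max(skipped) if skipped else None
    return applied, wait_for


def can_acknowledge(global_checkpoint, wait_for):
    """The primary acknowledges only once all skipped ops are on every replica."""
    return wait_for is None or global_checkpoint >= wait_for


processed = {0, 1, 3}
applied, wait_for = process_on_following_primary([1, 2, 3, 4], processed)
```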
2018-10-10 15:39:57 -04:00
Martijn van Groningen 268e134121
renamed test class
2018-10-08 15:05:50 +02:00
Martijn van Groningen c6c83d19f7
[CCR] Clear fetch exceptions if an empty but successful shard changes response returns (#34256)
Also fixed ShardFollowNodeTaskTests to not return ops when responseSize
is empty. Otherwise ops are returned when no ops are expected to be returned.

Co-authored-by: Jason Tedor <jason@tedor.me>
2018-10-06 07:53:37 -04:00
Martijn van Groningen 899e48395b
[CCR] Change unfollow API's privilege scheme. (#34175)
Unfollow should be allowed / disallowed on a per index level instead of
cluster level.

Also renamed `create_follow_index` index privilege to
`manage_follow_index` privilege and include unfollow and close APIs.
2018-10-06 07:38:28 -04:00
Jason Tedor 7d57bdb3a0
Follow stats structure (#34301)
This commit modifies the follow stats API response structure to more
clearly highlight the meaning of the higher level fields. In particular,
previously the response had a top-level key for each index. Instead, we
nest the indices under an "indices" field which is now an array. The
values in this array are objects containing two fields: "index" which is
the name of the follower index, and "shards" which is an array where
each value in the array is the follower stats for that shard. That is,
we have gone from:

{
  "bar": [
    {
      "shard_id": 0...
    }...
  ]...
}

to

{
  "indices": [
    {
      "index": "bar",
      "shards": [
        {
          "shard_id": 0...
        }...
      ]
    }...
  ]
}
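
For clients that still hold responses in the old shape, the restructuring above can be expressed as a small conversion. The function name is hypothetical and the per-shard objects are reduced to `shard_id` for brevity.

```python
def to_nested_follow_stats(old_response):
    """Convert {index: [shard stats...]} into the nested "indices" array form."""
    return {
        "indices": [
            {"index": index, "shards": shards}
            for index, shards in old_response.items()
        ]
    }


old_response = {"bar": [{"shard_id": 0}, {"shard_id": 1}]}
new_response = to_nested_follow_stats(old_response)
```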
2018-10-05 06:38:20 -04:00
Jason Tedor 7478167d60
Rename CCR stats implementation (#34300)
In the CCR docs we want to refer to the endpoint that returns following
stats as the follow stats API. This commit renames the internal
implementation of this endpoint to reflect this usage.
2018-10-05 06:25:24 -04:00
Nhat Nguyen d7893fd1e4 TEST: Mute testFollowIndexAndCloseNode
Tracked at #33337
2018-10-02 17:20:31 -04:00
Martijn van Groningen 7f5c2f1050
[CCR] Validate follower index historyUUIDs (#34078)
The follower index shard history UUID will be fetched from the indices stats api when the shard follow task starts and will be provided with the bulk shard operation requests. The bulk shard operations api will fail if the provided history uuid is unequal to the actual history uuid.

No longer record the leader history uuid in shard follow task params, but rather use the leader history UUIDs directly from the follower index's custom metadata. The resume follow api will continue to fail if leader index shard history UUIDs are missing.

Closes #33956
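
The history UUID check on bulk shard operations can be pictured as the guard below. The exception class, the function name, and its return value are assumptions made for the sketch, not the real API.

```python
class HistoryUUIDMismatch(Exception):
    """Raised when a request carries a stale history UUID (hypothetical class)."""


def apply_bulk_shard_operations(operations, provided_history_uuid, actual_history_uuid):
    """Reject the request unless the provided UUID equals the shard's actual one."""
    if provided_history_uuid != actual_history_uuid:
        raise HistoryUUIDMismatch(
            f"provided history uuid [{provided_history_uuid}] "
            f"does not match actual history uuid [{actual_history_uuid}]"
        )
    # In this sketch, "applying" the operations just reports how many were accepted.
    return len(operations)
```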
2018-10-02 18:01:06 +02:00
Martijn van Groningen d12a64eac2
[CCR] Only use primary shards and get expected count from leader index (#34186)
Closes #34173
2018-10-01 20:13:16 +02:00
Nhat Nguyen a02debadfe TEST: Unmute testFollowIndexAndCloseNode
Since #34099, the FollowingEngine will skip an operation which was
already processed before. With that change, it should be okay to unmute
testFollowIndexAndCloseNode.
2018-10-01 11:59:33 -04:00
Jason Tedor 80f7c1dcc9
Fix compilation in unfollow action tests
This arose when two commits were pushed at roughly the same time, both
of which compiled successfully against master, but not when taken
together. This commit fixes a reference in one of the commits that was
changed in the other commit.
2018-09-30 14:30:08 -04:00
Jason Tedor 1893765055
Change CCR stats endpoint to be index-centric (#34169)
This commit modifies the CCR stats endpoint for indices to be
/{index}/_ccr/stats. This makes this endpoint consistent with other
index-centric endpoints like indices stats.
2018-09-30 14:29:32 -04:00
Jason Tedor e2bd2028d8
Allow specifying shard changes batch sizes in bytes (#34168)
This commit changes the shard changes requests so that the batch size,
previously a raw byte value, can be specified using byte units (e.g., 4mb).
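
To show what a byte-unit value such as 4mb stands for, here is a minimal parser sketch. It assumes units are powers of 1024 and accepts only whole numbers; the function name and the accepted grammar are illustrative and not the server's own parsing code.

```python
import re

# Assumed unit table: each unit is a power of 1024.
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
    "tb": 1024 ** 4,
    "pb": 1024 ** 5,
}
_PATTERN = re.compile(r"^(\d+)(b|kb|mb|gb|tb|pb)$")


def parse_byte_size(text):
    """Convert a string such as '4mb' into a number of bytes."""
    match = _PATTERN.match(text.strip().lower())
    if match is None:
        raise ValueError(f"cannot parse byte size [{text}]")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```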
2018-09-30 14:22:22 -04:00
Martijn van Groningen 7c91c7a638
fixed test compile error
2018-09-30 19:31:30 +02:00