445 Commits

Author SHA1 Message Date
Martijn van Groningen
b816837d39
[CCR] Always remove persistent tasks metadata between tests and
better handle assertion errors between tests.
2018-10-22 08:15:43 +02:00
Nhat Nguyen
d90b6730c7
CCR: Following primary should process NoOps once (#34408)
This is a follow-up for #34288.

Relates #34412
2018-10-19 21:10:13 -04:00
Nhat Nguyen
630d5514a5 CCR/TEST: Adjust testFailOverOnFollower
CI passed but the result is outdated after PR #34366 was merged.
2018-10-19 15:06:44 -04:00
Nhat Nguyen
bd92a28cfc
CCR: Replicate existing ops with old term on follower (#34412)
Since #34288, we might hit deadlock if the FollowTask has more fetchers
than writers. This can happen in the following scenario:

Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has
two fetchers and one writer.

1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0,
num_ops:1} and {from_seq_no: 1, num_ops:1} to read seq#0 and seq#1
respectively.

2. The second request which fetches seq#1 completes before, and then it
triggers a write request containing only seq#1.

3. The primary of a follower fails after it has replicated seq#1 to
replicas.

4. Since the old primary did not respond, the FollowTask issues another
write request containing seq#1 (resend the previous write request).

5. The new primary has seq#1 already; thus it won't replicate seq#1 to
replicas but will wait for the global checkpoint to advance at least
seq#1.

The problem is that the FollowTask has only one writer and that writer
is waiting for seq#0 which won't be delivered until the writer completed.

This PR proposes to replicate existing operations with the old primary
term (instead of the current term) on the follower. In particular, when
the following primary detects that it has processed an process already,
it will look up the term of an existing operation with the same seq_no
in the Lucene index, then rewrite that operation with the old term
before replicating it to the following replicas. This approach is
wait-free but requires soft-deletes on the follower.

Relates #34288
2018-10-19 13:56:00 -04:00
Nhat Nguyen
90ca5b1fde
Fill LocalCheckpointTracker with Lucene commit (#34474)
Today we rely on the LocalCheckpointTracker to ensure no duplicate when
enabling optimization using max_seq_no_of_updates. The problem is that
the LocalCheckpointTracker is not fully reloaded when opening an engine
with an out-of-order index commit. Suppose the starting commit has seq#0
and seq#2, then the current LocalCheckpointTracker would return "false"
when asking if seq#2 was processed before although seq#2 in the commit.

This change scans the existing sequence numbers in the starting commit,
then marks these as completed in the LocalCheckpointTracker to ensure
the consistent state between LocalCheckpointTracker and Lucene commit.
2018-10-19 12:38:06 -04:00
Martijn van Groningen
56d4f69718
Renamed remaining leader_cluster_alias / cluster_alias to leader_cluster 2018-10-19 07:59:56 +02:00
Martijn van Groningen
44b461aff2
[CCR] Make leader cluster a required argument. (#34580)
This change makes it no longer possible to follow / auto follow without
specifying a leader cluster. If a local index needs to be followed
then `cluster.remote.*.seeds` should point to nodes in the local cluster.

Closes #34258
2018-10-19 07:41:46 +02:00
Martijn van Groningen
0d62f6102c
[CCR] Split cluster alias from leader index field into its own field in follow APIs (#34366) 2018-10-18 12:11:48 +02:00
Jason Tedor
3e067123a1
Remove dead methods from ChainIT
This commit removes some unused methods from ChainIT.
2018-10-16 10:45:33 -04:00
Martijn van Groningen
a1ec91395c
Changed CCR internal integration tests to use a leader and follower cluster instead of a single cluster (#34344)
The `AutoFollowTests` needs to restart the clusters between each tests, because
it is using auto follow stats in assertions. Auto follow stats are only reset
by stopping the elected master node.

Extracted the `testGetOperationsBasedOnGlobalSequenceId()` test to its own test, because it just tests the shard changes api.

* Renamed AutoFollowTests to AutoFollowIT, because it is an integration test.
Renamed ShardChangesIT to IndexFollowingIT, because shard changes it the name
of an internal api and isn't a good name for an integration test.

* move creation of NodeConfigurationSource to a seperate method

* Fixes issues after merge, moved assertSeqNos() and assertSameDocIdsOnShards() methods from ESIntegTestCase to InternalTestCluster, so that ccr tests can use these methods too.
2018-10-16 14:45:46 +02:00
Jason Tedor
e0b6721df4
Add dedicated test for chain replication (#34497)
This commit adds a dedicated test that chain replication leader ->
middle -> follow is successful.
2018-10-16 06:21:28 -04:00
Martijn van Groningen
f7df8718b9
[CCR] Don't fail shard follow tasks in case of a non-retryable error (#34404) 2018-10-16 07:44:15 +02:00
Martijn van Groningen
51eca14288
[TEST] Make sure there are shards started so that ESIntegTestCase#assertSameDocIdsOnShards() does not fail with shard not found. 2018-10-15 10:24:28 +02:00
Martijn van Groningen
74dc2da873
Change shard changes api's threadpool from get to search (#34421) 2018-10-15 08:09:00 +01:00
Nhat Nguyen
429c29e833 CCR/TEST: AwaitsFix testFailOverOnFollower
Tracked at #34412
2018-10-13 21:05:33 -04:00
Nhat Nguyen
7bc11a8099 Unmute testFollowIndexAndCloseNode
This issue was resolved by #34288.

Closes #33337
Relates #34288
2018-10-10 15:48:22 -04:00
Nhat Nguyen
33791ac27c
CCR: Following primary should process operations once (#34288)
Today we rewrite the operations from the leader with the term of the
following primary because the follower should own its history. The
problem is that a newly promoted primary may re-assign its term to
operations which were replicated to replicas before by the previous
primary. If this happens, some operations with the same seq_no may be
assigned different terms. This is not good for the future optimistic
locking using a combination of seqno and term.

This change ensures that the primary of a follower only processes an
operation if that operation was not processed before. The skipped
operations are guaranteed to be delivered to replicas via either
primary-replica resync or peer-recovery. However, the primary must not
acknowledge until the global checkpoint is at least the highest seqno of
all skipped ops (i.e., they all have been processed on every replica).

Relates #31751
Relates #31113
2018-10-10 15:39:57 -04:00
Martijn van Groningen
268e134121
renamed test class 2018-10-08 15:05:50 +02:00
Martijn van Groningen
c6c83d19f7
[CCR] Clear fetch exceptions if an empty but successful shard changes response returns (#34256)
Also fixed ShardFollowNodeTaskTests to not return ops when responseSize
is empty. Otherwise ops are returned when no ops are expected to be returned.

Co-authored-by: Jason Tedor <jason@tedor.me>
2018-10-06 07:53:37 -04:00
Martijn van Groningen
899e48395b
[CCR] Change unfollow API's privilege scheme. (#34175)
Unfollow should be allowed / disallowed on a per index level instead of
cluster level.

Also renamed `create_follow_index` index privilege to
`manage_follow_index` privilege and include unfollow and close APIs.
2018-10-06 07:38:28 -04:00
Jason Tedor
7d57bdb3a0
Follow stats structure (#34301)
This commit modifies the follow stats API response structure to more
clearly highlight meaning of the higher level fields. In particular,
previously the response had a top-level key for each index. Instead, we
nest the indices under an "indices" field which is now an array. The
values in this array are objects containing two fields: "index" which is
the name of the follower index, and "shards" which is an array where
each value in the array is the follower stats for that shard. That is,
we have gone from:

{
  "bar": [
    {
      "shard_id": 0...
    }...
  ]...
}

to

{
  "indices": [
    {
      "index": "bar",
      "shards": [
        {
          "shard_id": 0...
        }...
      ]
   }...
}
2018-10-05 06:38:20 -04:00
Jason Tedor
7478167d60
Rename CCR stats implementation (#34300)
In the CCR docs we want to refer to the endpoint that returns following
stats as the follow stats API. This commit renames the internal
implementation of this endpoint to reflect this usage.
2018-10-05 06:25:24 -04:00
Nhat Nguyen
d7893fd1e4 TEST: Mute testFollowIndexAndCloseNode
Tracked at #33337
2018-10-02 17:20:31 -04:00
Martijn van Groningen
7f5c2f1050
[CCR] Validate follower index historyUUIDs (#34078)
The follower index shard history UUID will be fetched from the indices stats api when the shard follow task starts and will be provided with the bulk shard operation requests. The bulk shard operations api will fail if the provided history uuid is unequal to the actual history uuid.

No longer record the leader history uuid in shard follow task params, but rather use the leader history UUIDs directly from follower index's custom metadata. The resume follow api will remain to fail if leader index shard history UUIDs are missing.

Closes #33956
2018-10-02 18:01:06 +02:00
Martijn van Groningen
d12a64eac2
[CCR] Only use primary shards and get expected count from leader index (#34186)
Closes #34173
2018-10-01 20:13:16 +02:00
Nhat Nguyen
a02debadfe TEST: Unmute testFollowIndexAndCloseNode
Since #34099, the FollowingEngine will skip an operation which was
already processed before. With that change, it should be okay to unmute
testFollowIndexAndCloseNode.
2018-10-01 11:59:33 -04:00
Jason Tedor
80f7c1dcc9
Fix compilation in unfollow action tests
This arose when two commits were pushed at roughly the same time, both
of which compiled successfully against master, but not when taken
together. This commit fixes a reference in one of the commits that was
changed in the other commit.
2018-09-30 14:30:08 -04:00
Jason Tedor
1893765055
Change CCR stats endpoint to be index-centric (#34169)
This commit modifies the CCR stats endpoint for indices to be
/{index}/_ccr/stats. This makes this endpoint consistent with other
index-centric endpoints like indices stats.
2018-09-30 14:29:32 -04:00
Jason Tedor
e2bd2028d8
Allow specifying shard changes batch sizes in bytes (#34168)
This commit changes the shard changes requests from using a raw byte
value to being able to be specified using bytes units (e.g., 4mb).
2018-09-30 14:22:22 -04:00
Martijn van Groningen
7c91c7a638
fixed test compile error 2018-09-30 19:31:30 +02:00
Martijn van Groningen
b1a27b2e6b
[CCR] Add unfollow API (#34132)
The unfollow API changes a follower index into a regular index, so that it will accept write requests from clients.

For the unfollow api to work the index follow needs to be stopped and the index needs to be closed.

Closes #33931
2018-09-30 19:19:34 +02:00
Nhat Nguyen
ad61398879
CCR: Optimize indexing ops using seq_no on followers (#34099)
This change introduces the indexing optimization using sequence numbers
in the FollowingEngine. This optimization uses the max_seq_no_updates
which is tracked on the primary of the leader and replicated to replicas
and followers.

Relates #33656
2018-09-28 20:42:26 -04:00
Martijn van Groningen
a984f8afb3
[CCR] Validate index privileges prior to following an index (#33758)
Prior to following an index in the follow API, check whether current
user has sufficient privileges in the leader cluster to read and
monitor the leader index.

Also check this in the create and follow API prior to creating the
follow index.

Also introduced READ_CCR cluster privilege that include the minimal
cluster level actions that are required for ccr in the leader cluster.
So a user can follow indices in a cluster, but not use the ccr admin APIs.

Closes #33553

Co-authored-by: Jason Tedor <jason@tedor.me>
2018-09-28 17:51:23 +02:00
Martijn van Groningen
3d7e3b2ab1
[TEST] changed naming of test methods to not refer to old api names. 2018-09-28 17:43:53 +02:00
Martijn van Groningen
eb00348b57
[CCR] Adjust list retryable errors (#33985)
The following changes were made:
* Added ElasticsearchSecurityException. For in the case the current user has insufficient privileges while an index is being followed. Prior to following ccr checks whether the current user has sufficient privileges and if not the follow api fails with an error.
* Added Index block exception. If the leader index gets closed, this exception is returned.
* Added ClusterBlockException service unavailable. In case for example the leader cluster is without elected master.
* Removed IndexNotFoundException. If the leader / follower index has been deleted, ccr will need to stop the shard follow tasks with an error.

Closes #33954
2018-09-28 13:33:09 +02:00
Martijn van Groningen
506c1c2d47
Retry errors when fetching follower global checkpoint. (#34019)
Closes #34016
2018-09-28 10:34:08 +02:00
Martijn van Groningen
9129948f60
Rename CCR APIs (#34027)
* Renamed CCR APIs

Renamed:
* `/{index}/_ccr/create_and_follow` to `/{index}/_ccr/follow`
* `/{index}/_ccr/unfollow` to `/{index}/_ccr/pause_follow`
* `/{index}/_ccr/follow` to `/{index}/_ccr/resume_follow`

Relates to #33931
2018-09-28 08:02:20 +02:00
Martijn van Groningen
17b3b97899
Fixed CCR stats api serialization issues and (#33983)
always use `IndicesOptions.strictExpand()` for indices options.

The follow index may be closed and we still want to get stats from
shard follow task and the whether the provided index name matches with
follow index name is checked when locating the task itself in the ccr
stats transport action.
2018-09-28 07:45:32 +02:00
Nhat Nguyen
48c169e065
CCR: replicates max seq_no of updates to follower (#34051)
This commit replicates the max_seq_no_of_updates on the leading index
to the primaries of the following index via ShardFollowNodeTask. The
max_seq_of_updates is then transmitted to the replicas of the follower
via replication requests (that's BulkShardOperationsRequest).

Relates #33656
2018-09-26 08:00:10 -04:00
Martijn van Groningen
eae5487477
[CCR] set minimum version to 6.5.0 2018-09-26 09:31:36 +02:00
Martijn van Groningen
96b3417985
[CCR] Don't auto follow follow indices in the same cluster. (#33944) 2018-09-26 07:34:51 +02:00
Nhat Nguyen
5166dd0a4c
Replicate max seq_no of updates to replicas (#33967)
We start tracking max seq_no_of_updates on the primary in #33842. This
commit replicates that value from a primary to its replicas in replication 
requests or the translog phase of peer-recovery.

With this change, we guarantee that the value of max seq_no_of_updates
on a replica when any index/delete operation is performed at least the
max_seq_no_of_updates on the primary when that operation was executed.

Relates #33656
2018-09-25 08:07:57 -04:00
Martijn van Groningen
793b2a94b4
[CCR] Expose auto follow stats to monitoring (#33886) 2018-09-25 07:19:46 +02:00
Nhat Nguyen
6ec36b1273
CCR: Make AutoFollowMetadata immutable (#33977)
We should make AutoFollowMetadata immutable to avoid being inconsistent
when one thread modifies it while other reads it.
2018-09-24 17:47:10 -04:00
Martijn van Groningen
2795ef561f
[CCR] Add get auto follow pattern api (#33849)
Relates to #33007
2018-09-24 20:26:13 +02:00
Nhat Nguyen
ddd5ce5740
TEST: Avoid invalid ranges in ShardChangesActionTests (#33976)
If numWrites is between 2 and 9, we will issue an invalid range because
the from_seq_no is negative. This commit makes sure that numWrites is at
least 10, and adds an explicit test to verify invalid request ranges.
2018-09-23 22:28:41 -04:00
Nhat Nguyen
7944a0cb25
Track max seq_no of updates or deletes on primary (#33842)
This PR is the first step to use seq_no to optimize indexing operations.
The idea is to track the max seq_no of either update or delete ops on a
primary, and transfer this information to replicas, and replicas use it
to optimize indexing plan for index operations (with assigned seq_no).

The max_seq_no_of_updates on primary is initialized once when a primary
finishes its local recovery or peer recovery in relocation or being
promoted. After that, the max_seq_no_of_updates is only advanced internally
inside an engine when processing update or delete operations.

Relates #33656
2018-09-22 08:02:57 -04:00
Martijn van Groningen
e1e5f40727
[CCR] Move headers from auto follow pattern to auto follow metadata (#33846)
This ensures that we will not serialize the headers as part of the
auto follow pattern in the to be added get auto follow api.
2018-09-21 18:08:29 +02:00
Martijn van Groningen
384ce58535
removed unused fields 2018-09-20 08:56:23 +02:00
Martijn van Groningen
44c7c4b166
[CCR] Add auto follow stats api (#33801)
GET /_ccr/auto_follow/stats

Returns:

```
{
   "number_of_successful_follow_indices": ...
   "number_of_failed_follow_indices": ...
   "number_of_failed_remote_cluster_state_requests": ...
   "recent_auto_follow_errors": [
      ...
   ]
}
```

Relates to #33007
2018-09-20 07:16:20 +02:00