Commit Graph

611 Commits

Author SHA1 Message Date
Martijn van Groningen 8e1ef0cff9
Rewrite shard follow node task logic (#31581)
The current shard follow mechanism is complex and does not give us easy ways the have visibility into the system (e.g. why we are falling behind).
The main reason why it is complex is because the current design is highly asynchronous. Also in the current model it is hard to apply backpressure
other than reducing the concurrent reads from the leader shard.

This PR has the following changes:
* Rewrote the shard follow task to coordinate the shard follow mechanism between a leader and follow shard in a single threaded manner.
  This allows for better unit testing and makes it easier to add stats.
* All write operations read from the shard changes api should be added to a buffer instead of directly sending it to the bulk shard operations api.
  This allows to apply backpressure. In this PR there is a limit that controls how many write ops are allowed in the buffer after which no new reads
  will be performed until the number of ops is below that limit.
* The shard changes api includes the current global checkpoint on the leader shard copy. This allows reading to be a more self sufficient process;
  instead of relying on a background thread to fetch the leader shard's global checkpoint.
* Reading write operations from the leader shard (via shard changes api) is a separate step then writing the write operations (via bulk shards operations api).
  Whereas before a read would immediately result into a write.
* The bulk shard operations api returns the local checkpoint on the follow primary shard, to keep the shard follow task up to date with what has been written.
* Moved the shard follow logic that was previously in ShardFollowTasksExecutor to ShardFollowNodeTask.
* Moved over the changes from #31242 to make shard follow mechanism resilient from node and shard failures.

Relates to #30086
2018-07-10 16:00:55 +02:00
Martijn van Groningen ac654cbc10
Follow engine should not fill gaps upon promotion and recovery (#31751)
Closes #31318
2018-07-03 13:15:06 +02:00
Martijn van Groningen 8ecfcc3b80
muted tests that will be replaced by the shard follow task refactoring:
https://github.com/elastic/elasticsearch/pull/31581
2018-06-29 11:47:46 +02:00
Nhat Nguyen 1185ddbcc6 Replaces testClassesDir with testClassesDirs in ccr build
Relates #30389
2018-06-28 11:24:41 -04:00
Nhat Nguyen 2c56df631d Adjusts transport actions in CCR
This commit adjusts the ccr’s actions accordingly to the recent changes
in the upstream.
2018-06-23 18:10:15 -04:00
Nhat Nguyen 34f127be3c CCR: Remove index name resolver from CCR actions
Relates #31002
2018-06-20 13:20:24 -04:00
Nhat Nguyen c74cd30ac6 Remove request type parameter from CCR actions
Relates #31405
2018-06-19 10:49:05 -04:00
Martijn van Groningen 50ce990305
added missing serialization tests 2018-06-19 10:22:58 +02:00
Martijn van Groningen 73c9dd976b
Remove action request builders. 2018-06-15 12:32:08 +02:00
Tanguy Leroux 18938aab39 Adapt ShardFollowTasksExecutor after #31031 2018-06-15 11:46:08 +02:00
Martijn van Groningen cc824ebb5e
[CCR] Added more validation to follow index api. (#31068) 2018-06-15 07:39:53 +02:00
Nhat Nguyen 1ccb34ac77 Remove unused imports 2018-06-14 11:44:20 -04:00
Jason Tedor 64b4cdeda6
Merge remote-tracking branch 'elastic/master' into ccr
* elastic/master: (53 commits)
  Painless: Restructure/Clean Up of Spec Documentation (#31013)
  Update ignore_unmapped serialization after backport
  Add back dropped substitution on merge
  high level REST api: cancel task (#30745)
  Enable engine factory to be pluggable (#31183)
  Remove vestiges of animal sniffer (#31178)
  Rename elasticsearch-nio to nio (#31186)
  Rename elasticsearch-core to core (#31185)
  Move cli sub-project out of server to libs (#31184)
  [DOCS] Fixes broken link in auditing settings
  QA: Better seed nodes for rolling restart
  [DOCS] Moves ML content to stack-docs
  [DOCS] Clarifies recommendation for audit index output type (#31146)
  Add nio-transport as option for http smoke tests (#31162)
  QA: Set better node names on rolling restart tests
  Add support for ignore_unmapped to geo sort (#31153)
  Share common parser in some AcknowledgedResponses (#31169)
  Fix random failure on SearchQueryIT#testTermExpansionExceptionOnSpanFailure
  Remove reference to multiple fields with one name (#31127)
  Remove BlobContainer.move() method (#31100)
  ...
2018-06-07 23:33:42 -04:00
Simon Willnauer 5c6711b8a4
Use a `_recovery_source` if source is omitted or modified (#31106)
Today if a user omits the `_source` entirely or modifies the source
on indexing we have no chance to re-create the document after it has
been added. This is an issue for CCR and recovery based on soft deletes
which we are going to make the default. This change adds an additional
recovery source if the source is disabled or modified that is only kept
around until the document leaves the retention policy window.

This change adds a merge policy that efficiently removes this extra source
on merge for all document that are live and not in the retention policy window
anymore.
2018-06-07 07:39:28 +02:00
Jason Tedor 20a2f646e2
Fix off-by-one error in chunks coordinator (#31147)
This commit fixes an off-by-error in the chunks coordinator where the
batches would be of size one more than the batch size.
2018-06-06 19:53:49 -04:00
Jason Tedor bf1152fcc6
Use follower primary term when applying operations (#31113)
The primary shard copy on the following has authority of the replication
operations that occur on the following side in cross-cluster
replication. Yet today we are using the primary term directly from the
operations on the leader side. Instead we should be replacing the
primary term on the following side with the primary term of the primary
on the following side. This commit does this by copying the translog
operations with the corrected primary term. This ensures that we use
this primary term while applying the operations on the primary, and when
replicating them across to the replica (where the replica request was
carrying the primary term of the primary shard copy on the follower).
2018-06-06 11:03:57 -04:00
Jason Tedor d230548401
Remove use of deprecated methods to perform request (#31117)
The old perform request methods on the REST client have been deprecated
in favor using request-flavored methods. This commit addresses the use
of these deprecated methods in the CCR test suite.
2018-06-06 05:09:55 -04:00
Nhat Nguyen 6ee6404e94 Adapt changes in PersistentTaskParams
Relates #31045
2018-06-04 14:48:04 -04:00
Nhat Nguyen 87abb49145 Adapt changes in AcknowledgeResponse
Relates #30983
2018-06-04 14:47:22 -04:00
Nhat Nguyen 9564b60194 Adjust CCR Actions after RequestBuilder is removed
CCR side of #30966
2018-06-01 23:09:59 -04:00
Nhat Nguyen 2a9a2002e6 CCR: Tighten requesting range check on leader
This commit clarifies the origin of the global checkpoint that the
following shard uses and replaces illegal_state_exc E by an assertion.

Relates #30980
2018-05-31 20:00:33 -04:00
Nhat Nguyen fa54be2dcd
CCR: Do not minimization requesting range on leader (#30980)
Today before reading operations on the leading shard, we minimization
the requesting range with the global checkpoint. However, this might
make the request invalid if the following shard generates a requesting
range based on the global-checkpoint from a primary shard and sends 
that request to a replica whose global checkpoint is lagged.

Another issue is that we are mutating the request when applying
minimization. If the request becomes invalid on a replica, we will
reroute the mutated request instead of the original one to the primary.

This commit removes the minimization and replaces it by a range check
with the local checkpoint.
2018-05-31 15:14:32 -04:00
Martijn van Groningen 7e8cf768cf
changed persistent task name to be of similar structure as the others 2018-05-31 15:16:13 +02:00
Martijn van Groningen a82f2e31b4
[CCR] Also copy routing_num_shards from leader to follow index. (#30894)
Bug was introduced when create and follow api was added in #30602
2018-05-31 08:03:53 +02:00
Nhat Nguyen f25ee254cc Mute ShardChangesIT#testFollowIndex 2018-05-30 14:29:58 -04:00
Martijn van Groningen adca32eae7
no need to resolve index name as only concrete index names are used 2018-05-30 12:42:35 +02:00
Martijn van Groningen 4a20dca5fe
Required changes after merging in master. 2018-05-30 10:26:49 +02:00
Martijn van Groningen 51caefe46c
[CCR] Sync mappings between leader and follow index (#30115)
The shard changes api returns the minimum IndexMetadata version the leader
index needs to have. If the leader side is behind on IndexMetadata version
then follow shard task waits with processing write operations until the
mapping has been fetched from leader index and applied in follower index
in the background.

The cluster state api is used to fetch the leader mapping and put mapping api
to apply the mapping in the follower index. This works because put mapping
api accepts fields that are already defined.

Relates to #30086
2018-05-28 07:37:27 +02:00
Martijn van Groningen e477147143
[CCR] Add create and follow api (#30602)
Also renamed FollowExisting* internal names to just Follow* and fixed tests
2018-05-26 15:05:40 +02:00
Martijn van Groningen 7942e4082a
build: enhance check task instead of overwriting it.
(test task didn't run when check task ran)
2018-05-16 10:54:15 +02:00
Martijn van Groningen 596ec1848e
[CCR] Add validation checks that were left out of #30120 (#30463) 2018-05-16 09:46:03 +02:00
Martijn van Groningen 23204e3d09
[CCR] Fixed follow and unfollow api url path according to design.
The TODOs in the rest actions was incorrect. The problem was that
these rest actions used `follow_index` as first named variable in the path
under which the rest actions were registered. Other candidate rest actions that
also have a named variable as first element in the path (but with a different
name) get resolved as rest parameters too and passed down to the rest
action that actually ends up getting executed.

In the case of the follow index api, a `index` parameter got passed down
to `RestFollowExistingAction`, but that param was never used. This caused the
follow index api call to fail, because of unused http parameters.

This change doesn't fixes that problem, but works around it by using
`index` as named variable for the follow index (instead of `follow_index`).

Relates to #30102
2018-05-16 09:07:50 +02:00
Martijn van Groningen 64b97313d5
[CCR] Make cross cluster replication work with security (#30239)
If security is enabled today with ccr then the follow index api will
fail with the fact that system user does not have privileges to use
the shard changes api. The reason that system user is used is because
the persistent tasks that keep the shards in sync runs in the background
and the user that invokes the follow index api only start those background
processes.

I think it is better that the system user isn't used by the persistent
tasks that keep shards in sync, but rather runs as the same user that
invoked the follow index api and use the permissions that that user has.
This is what this PR does, and this is done by keeping track of
security headers inside  the persistent task (similar to how rollup does this).

This PR also adds a cluster ccr priviledge that allows a user to follow
or unfollow an index. Finally if a user that wants to follow an index,
it needs to have read and monitor privileges on the leader index and
monitor and write privileges on the follow index.
2018-05-16 07:48:32 +02:00
Martijn van Groningen bb6586dc5f [CCR] Read changes from Lucene instead of translog (#30120)
This commit adds an API to read translog snapshot from Lucene,
then cut-over from the existing translog to the new API in CCR.

Relates #30086
Relates #29530
2018-05-09 17:35:27 -04:00
Martijn van Groningen ad499fc178
[CCR] added rest specs and simple rest test for follow and unfollow apis (#30123)
[CCR] added rest specs and simple rest test for follow and unfollow apis, also

Added an acknowledge field in follow and unfollow api responses. Currently these api return an empty response and fixed bug in unfollow api that didn't cleanup node tasks properly.
2018-05-07 14:18:28 +02:00
Nhat Nguyen 6e0d0feca0 Enable MockHttpTransport in ShardChangsIT
CCR side of #29601
2018-05-04 13:44:18 -04:00
Nhat Nguyen 8fefa8a661 Update InternalEngine tests on ccr side for #30121
Relates #30121
2018-05-04 10:57:54 -04:00
Nhat Nguyen d621fc7a00
Add tombstone document into Lucene for Noop (#30226)
This commit adds a tombstone document into Lucene for every No-op. 
With this change, Lucene index is expected to have a complete history 
of operations like Translog. In fact, this guarantee is subjected to the
soft-deletes retention merge policy.

Relates #29530
2018-05-02 09:08:29 -04:00
Nhat Nguyen eb4281edef CCR side #30244
Relates #30244
2018-05-01 21:08:24 -04:00
Martijn van Groningen 8a2df6c3b9
[CCR] Only normalize -1 seqno in shard changes request. (#30238)
Prior to this change a -1 seqno would be normalized earlier, which
caused a leader shard containing a single operation to be ignored.

Closes #30227
2018-05-01 08:40:23 +02:00
Martijn van Groningen e6b88fa5a0
removed duplicated license 2018-04-25 12:18:02 +02:00
Martijn van Groningen 5a67a0f78f
Applying changes required for ccr after moving ccr code to elasticsearch 2018-04-25 08:03:29 +02:00
Martijn van Groningen 9b9d0f9057 Enabled licence header check and fixed unchecked casts. (#4408) 2018-04-20 11:15:52 +02:00
Martijn van Groningen cfd7847628 fixed issues after merging in master 2018-04-20 07:59:13 +02:00
Nhat Nguyen f97aec7b8b Sibling of enforce access to translog via engine
Since elastic/elasticsearch#29542, we no longer expose translog instance
but only provide creating translog snapshot method. This commit adapts
that change in CCR branch.

Relates elastic/elasticsearch#29542
2018-04-18 11:54:00 -04:00
Martijn van Groningen 56ca59a513 Add the ability to the follow index to follow an index in a remote cluster.
The follow index api completely reuses CCS infrastructure that was exposed via:
https://github.com/elastic/elasticsearch/pull/29495

This means that the leader index parameter support the same ccs index
to indicate that an index resides in a different cluster.

I also added a qa module that smoke tests the cross cluster nature of ccr.
The idea is that this test just verifies that ccr can read data from a
remote leader index and that is it, no crazy randomization or indirectly
testing other features.
2018-04-17 07:36:40 +02:00
Martijn van Groningen c0d42e9cd1 Fixed test 2018-04-16 10:48:46 +02:00
Martijn van Groningen a94b38b88e Fixed compile errors and test failures after merging master into ccr. 2018-04-13 16:35:09 +02:00
Martijn van Groningen d77f756f5c ccr: use indices stats api to fetch global checkpoint of the follower shards and
keep track of shard follow stats inside shard follow stats' node task instead of persistent task status.

By maintaining the shard follow stats inside its node task the stats update is quicker as
no cluster state update is required. The stats are now transient; meaning if the task
is going to run a different node then the stats are gone too. Currently only the processed
global checkpoint is being tracked and this is being restored when a shard follow node task
starts via the indices stats api (the reason of the first change of this change). Other stats
that we may add in the future (like fetch_time, see: https://gist.github.com/s1monw/dba13daf8493bf48431b72365e110717)
it is ok if we start from zero in case a shard follow task moves to another node.
2018-04-05 14:52:20 +02:00
Martijn van Groningen d976fa44e7 Removed LocalCheckpointTracker usage. 2018-03-29 07:41:23 +02:00
Martijn van Groningen a22a7d079d ccr: Added maximum translog limit that a single shard changes response can return.
This limit is based on the number of estimate bytes in each translog
operation that fall between the minimum and maximum request sequence number.

If this limit is met then the shard follow task executor will make sure
that a subsequent shard changes request will be performed to fetch the
remaining translog operations.

This limit is needed in order to protect against returning too many
translog operations in a single shard changes response.

Relates to #2436
2018-03-28 15:49:57 +02:00
Martijn van Groningen 282740610b Fixed test after merging in master branch. 2018-03-28 09:54:41 +02:00
Nhat Nguyen 51111a8106 CCR: Stop FollowExistingIndexAction after report failure (#4111)
We check for the existence of both leader and follower index, then properly 
report to the caller. However, we do not return after reporting failure. This
causes the caller receive exception twice: IllegalArgumentException then
NullPointerException. This commit makes sure to stop the action after reporting
failure.
2018-03-26 13:56:47 -04:00
Martijn van Groningen 9e4c68c389 Fixed compile and test errors after merging in master 2018-03-16 17:47:10 +01:00
Martijn van Groningen 10cfa21a68 required changes after merge master branch into ccr branch. 2018-02-22 15:03:33 +01:00
Martijn van Groningen 1a9a7ffe97 removed hack 2018-02-07 17:54:28 +01:00
Martijn van Groningen c442d14f1d Several changes that were required after merging master into the ccr branch. 2018-02-05 13:25:58 +01:00
Martijn van Groningen 4e818254ad re-enabled java integration tests 2018-01-25 14:18:34 +01:00
Martijn van Groningen 05d3d2e49c fix packages after merge 2018-01-24 09:28:42 +01:00
Jason Tedor 9b6bb2c635 Enable run task for CCR
This commit enables the run task for ccr by specifying that the ccr
project not be evaluated until after core is evaluated. This is
important since ccr is alphabetically before core and thus Gradle
evaluates it first.

Relates #3665
2018-01-22 15:07:20 -05:00
Martijn van Groningen 83a82d83d0 Moved ccr source code to its own gradle module after xpack split. 2018-01-22 11:09:04 +01:00