Commit Graph

585 Commits

Author SHA1 Message Date
Martijn van Groningen 481f8a9a07
[CCR] Make auto follow patterns work with security (#33501)
Relates to #33007
2018-09-17 07:29:00 +02:00
Jason Tedor 770ad53978
Introduce long polling for changes (#33683)
Rather than scheduling pings to the leader index when we are caught up
to the leader, this commit introduces long polling for changes. We will
fire off a request to the leader which if we are already caught up will
enter a poll on the leader side to listen for global checkpoint
changes. These polls will timeout after a default of one minute, but can
also be specified when creating the following task. We use these time
outs as a way to keep statistics up to date, to not exaggerate time
since last fetches, and to avoid pipes being broken.
2018-09-16 10:35:23 -04:00
Jason Tedor 069605bd91
Do not count shard changes tasks against REST tests (#33738)
When executing CCR REST tests it is going to be expected after global
checkpoint polling goes in that shard changes tasks can still be pending
at the end of the test. One way to deal with this is to set a low
timeout on these polls, but then that means we are not executing our
REST tests with our default production settings and instead would be
using an unrealistic low timeout. Alternatively, since we expect these
tasks to be there, we can not count them against the test. That is what
this commit does.
2018-09-16 07:32:12 -04:00
Jason Tedor 73417bf09a
Move CCR REST tests to a sub-project of ccr
This commit moves these REST tests (possibly temporarily) to a
sub-project of ccr. We do this (again, possibly temporarily) to keep
them within the ccr sub-project yet there are changes within 6.x that
prevent these from being in the top-level project (the cluster formation
tasks are trying to install x-pack-ccr into the
integ-test-zip). Therefore, we isolate these for now until we can
understand why there are differences between 6.x and master.
2018-09-15 10:18:59 -04:00
Jason Tedor aa56892f2f
Move CCR REST tests to ccr sub-project (#33731)
This commit moves the CCR REST tests to the ccr sub-project as another
step towards running :x-pack:plugin:ccr:check giving us full coverage on
CCR.
2018-09-15 09:18:15 -04:00
Jason Tedor f037edb8e3
Move CCR monitoring tests to ccr sub-project (#33730)
This commit moves the CCR monitoring tests from the monitoring
sub-project to the ccr sub-project.
2018-09-15 09:16:33 -04:00
Martijn van Groningen 82a6ae1dae
[CCR] Move ccr tests in core module back to ccr module (#33711)
When developing ccr it is not ideal if tests are in multiple modules.
Even the classes these tests test are in the core module, it is easier
if these tests are in ccr module in order to avoid running the test task
in core module. This results in running many non ccr tests.

This way when developing ccr we can run locally:
./gradlew x-pack:plugin:core:precommit x-pack:plugin:ccr:check

before pushing to PR branches and be confident that the PR build passes,
without running x-pack:plugin:core:check task.
2018-09-14 17:18:00 +02:00
Jason Tedor 2282150f34
Expose retries for CCR fetch failures (#33694)
This commit exposes the number of times that a fetch has been tried to
the CCR stats endpoint, and to CCR monitoring.
2018-09-14 08:52:46 -04:00
Martijn van Groningen 222f42274e
[CCR] Check whether the rejected execution exception has the shutdown flag set (#33703)
and if so debug log it and otherwise rethrow.

This should fix a couple of test failures where during test teardown tests
failed due to uncaught exceptions being detected.
2018-09-14 13:28:11 +02:00
Martijn van Groningen 4bcad95fe7
[TEST] wait for no initializing shards 2018-09-14 09:59:24 +02:00
Martijn van Groningen 53ba253aa4
[CCR] Add validation for max_retry_delay (#33648) 2018-09-13 20:52:00 +02:00
Martijn van Groningen a69ae6b89f
[CCR] Add metadata to keep track of the index uuid of the leader index in the follow index (#33367)
The follow index api checks if the recorded uuid in the follow index matches
with uuid of the leader index and fails otherwise. This validation will
prevent a follow index from following an incompatible leader index.

The create_and_follow api will automatically add this custom index metadata
when it creates the follow index.

Closes #31505
2018-09-13 11:36:52 +02:00
Jason Tedor eb715d5290
Add follower index to CCR monitoring and status (#33645)
This commit adds the follower index to CCR shard follow task status, and
to monitoring.
2018-09-12 17:35:06 -04:00
Martijn van Groningen b5d8495789
[CCR] Add auto follow pattern APIs to transport client. (#33629) 2018-09-12 21:50:22 +02:00
Martijn van Groningen 5fa81310cc
[CCR] Added history uuid validation (#33546)
For correctness we need to verify whether the history uuid of the leader
index shards never changes while that index is being followed.

* The history UUIDs are recorded as custom index metadata in the follow index.
* The follow api validates whether the current history UUIDs of the leader
  index shards are the same as the recorded history UUIDs.
  If not the follow api fails.
* While a follow index is following a leader index; shard follow tasks
  on each shard changes api call verify whether their current history uuid
  is the same as the recorded history uuid.

Relates to #30086 

Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
2018-09-12 19:42:00 +02:00
Martijn van Groningen 901d8035d9
[CCR] Update es monitoring mapping and (#33635)
* [CCR] Update es monitoring mapping and
change qa tests to query based on leader index.


Co-authored-by: Jason Tedor <jason@tedor.me>
2018-09-12 19:36:17 +02:00
Tanguy Leroux bcac7f5e55 Fix checkstyle violation in ShardFollowNodeTask 2018-09-12 16:03:52 +02:00
Jason Tedor 23f12e42c1
Expose CCR stats to monitoring (#33617)
This commit exposes the CCR stats endpoint to monitoring collection.

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
2018-09-12 09:13:07 -04:00
Martijn van Groningen 96c49e5ed0
[CCR] Improve shard follow task's retryable error handling (#33371)
Improve failure handling of retryable errors by retrying remote calls in
a exponential backoff like manner. The delay between a retry would not be
longer than the configured max retry delay. Also retryable errors will be
retried indefinitely.

Relates to #30086
2018-09-12 12:49:51 +02:00
Jason Tedor 20476b9e06
Disable CCR REST endpoints if CCR disabled (#33619)
This commit avoids enabling the CCR REST endpoints if CCR is disabled.
2018-09-12 01:54:34 -04:00
Jason Tedor eca37e6e0a
Expose CCR to the transport client (#33608)
This commit exposes CCR to the transport client.
2018-09-11 16:37:52 -04:00
Martijn van Groningen 74d41857c6
mute test on windows
Relates #33570
2018-09-10 16:49:17 +02:00
Martijn van Groningen 8eebca32d2
[CCR] Delay auto follow license check (#33557)
* [CCR] Delay auto follow license check
so that we're sure that there are auto follow patterns configured

Otherwise we log a warning in case someone is running with basic or gold
license and has not used the ccr feature.
2018-09-10 13:23:02 +02:00
Martijn van Groningen c4adcee3ea
[CCR] Add create_follow_index privilege (#33559)
This is a new index privilege that the user needs to have in the follow cluster.
This privilege is required in addition to the `manage_ccr` cluster privilege in
order to execute the create and follow api.

Closes #33555
2018-09-10 13:08:20 +02:00
Jason Tedor d1b99877fa
Remove underscore from auto-follow API (#33550)
This commit removes the leading underscore from _auto_follow in the
auto-follow API endpoints.
2018-09-09 14:42:49 -04:00
Nhat Nguyen 902d20cbbe CCR: Use single global checkpoint to normalize range (#33545)
We may use different global checkpoints to validate/normalize the range
of a change request if the global checkpoint is advanced between these
calls. If this is the case, then we generate an invalid request range.
2018-09-09 13:18:30 -04:00
Jason Tedor 6eca627409
Reverse logic for CCR license checks (#33549)
This commit reverses the logic for CCR license checks in a few
actions. This is done so that the successful case, which tends to be a
larger block of code, does not require indentation.
2018-09-09 10:22:22 -04:00
Jason Tedor edc492419b
Add latch countdown on failure in CCR license tests (#33548)
We have some listeners in the CCR license tests that invoke Assert#fail
if the onSuccess method for the listener is unexpectedly invoked. This
can leave the main test thread hanging until the test suite times out
rather than failing quickly. This commit adds some latch countdowns so
that we fail quickly if these cases are hit.
2018-09-09 09:52:40 -04:00
Jason Tedor c67b0ba33e
Create temporary directory if needed in CCR test
In the multi-cluster-with-non-compliant-license tests, we try to write
out a java.policy to a temporary directory. However, if this temporary
directory does not already exist then writing the java.policy file will
fail. This commit ensures that the temporary directory exists before we
attempt to write the java.policy file.
2018-09-09 07:16:56 -04:00
Jason Tedor 5a38c930fc
Add license checks for auto-follow implementation (#33496)
This commit adds license checks for the auto-follow implementation. We
check the license on put auto-follow patterns, and then for every
coordination round we check that the local and remote clusters are
licensed for CCR. In the case of non-compliance, we skip coordination
yet continue to schedule follow-ups.
2018-09-09 07:06:55 -04:00
Simon Willnauer c12d232215
Pass Directory instead of DirectoryService to Store (#33466)
Instead of passing DirectoryService which causes yet another dependency
on Store we can just pass in a Directory since we will just call
`DirectoryService#newDirectory()` on it anyway.
2018-09-07 14:00:24 +02:00
Nhat Nguyen 8afe09a749
Pass TranslogRecoveryRunner to engine from outside (#33449)
This commit allows us to use different TranslogRecoveryRunner when
recovering an engine from its local translog. This change is a
prerequisite for the commit-based rollback PR.

Relates #32867
2018-09-06 11:59:16 -04:00
Martijn van Groningen ef207edbf0
test: do not schedule when test has stopped 2018-09-06 14:14:24 +02:00
Martijn van Groningen cdd82bb203
test: fetch `SeqNoStats` inside try-catch block
Relates to #33457
2018-09-06 11:49:08 +02:00
Martijn van Groningen a721d09c81
[CCR] Added auto follow patterns feature (#33118)
Auto Following Patterns is a cross cluster replication feature that
keeps track whether in the leader cluster indices are being created with
names that match with a specific pattern and if so automatically let
the follower cluster follow these newly created indices.

This change adds an `AutoFollowCoordinator` component that is only active
on the elected master node. Periodically this component checks the
 the cluster state of remote clusters if there new leader indices that
match with configured auto follow patterns that have been defined in
`AutoFollowMetadata` custom metadata.

This change also adds two new APIs to manage auto follow patterns. A put
auto follow pattern api:

```
PUT /_ccr/_autofollow/{{remote_cluster}}
{
   "leader_index_pattern": ["logs-*", ...],
   "follow_index_pattern": "{{leader_index}}-copy",
   "max_concurrent_read_batches": 2
   ... // other optional parameters
}
```

and delete auto follow pattern api:

```
DELETE /_ccr/_autofollow/{{remote_cluster_alias}}
```

The auto follow patterns are directly tied to the remote cluster aliases
configured in the follow cluster.

Relates to #33007


Co-authored-by: Jason Tedor jason@tedor.me
2018-09-06 08:01:58 +02:00
Jason Tedor d71ced1b00
Generalize search.remote settings to cluster.remote (#33413)
With features like CCR building on the CCS infrastructure, the settings
prefix search.remote makes less sense as the namespace for these remote
cluster settings than does a more general namespace like
cluster.remote. This commit replaces these settings with cluster.remote
with a fallback to the deprecated settings search.remote.
2018-09-05 20:43:44 -04:00
Nhat Nguyen 16b53b5ab5 Mute testValidateFollowingIndexSettings
Tracked at #33379
2018-09-04 09:03:26 -04:00
Alpar Torok 7f7e8fd733
Disable assemble task instead of removing it (#33348) 2018-09-04 07:32:14 +03:00
Nhat Nguyen 3a1dad1050 Mute testFollowIndexAndCloseNode
Tracked at #33337
2018-09-02 19:17:51 -04:00
Nhat Nguyen c6b011f8ea
TEST: Increase timeout testFollowIndexAndCloseNode (#33333)
This test fails several times due to timeout when asserting the number
of docs on the following and leading indices. This change reduces
the number of docs to index and increases the timeout.
2018-09-02 09:28:47 -04:00
Martijn van Groningen 66b164c2a6
[CCR] Removed custom follow and unfollow api's reponse classes with AcknowledgedResponse (#33260)
These response classes did not add any value and in that case just AcknowledgedResponse should be used.

I also changed the formatting of methods to take one line per parameter in
FollowIndexAction.java and UnfollowIndexAction.java files to make
reviewing diffs in the future easier.
2018-08-31 21:16:06 +07:00
Nhat Nguyen d3f32273eb Merge branch 'master' into ccr 2018-08-30 23:22:58 -04:00
Martijn van Groningen 41c7fc8d37
[CCR] Introduce leader index name & last fetch time stats to stats api response (#33155) 2018-08-29 10:54:58 +07:00
Nhat Nguyen e2b931e80b
Use Lucene history in primary-replica resync (#33178)
This commit makes primary-replica resyncer use Lucene as the source of
history operation instead of translog if soft-deletes is enabled. With
this change, we no longer expose translog snapshot directly in IndexShard.

Relates #29530
2018-08-28 10:44:15 -04:00
Jason Tedor 5954354e62
Fix ShardFollowNodeTask.Status equals and hash code (#33189)
These were broken when fetch exceptions were introduced to the status
object but equals and hash code were not updated then. This commit
addresses that.
2018-08-28 08:53:45 -04:00
Jason Tedor cd91992c89
Only fetch mapping updates when necessary (#33182)
Today we fetch the mapping from the leader and apply it as a mapping
update whenever the index metadata version on the leader changes. Yet,
the index metadata can change for many reasons other than a mapping
update (e.g., settings updates, adding an alias, or a replica being
promoted to a primary among many other reasons). This commit builds on
the addition of a mapping version to the index metadata to only fetch
mapping updates when the mapping version increases. This reduces the
number of these fetches and application of mappings on the follower to
the bare minimum.
2018-08-28 06:06:22 -04:00
Jason Tedor 0e5d42ca38
Merge branch 'master' into ccr
* master:
  Adjust BWC version on mapping version
  Token API supports the client_credentials grant (#33106)
  Build: forked compiler max memory matches jvmArgs (#33138)
  Introduce mapping version to index metadata (#33147)
  SQL: Enable aggregations to create a separate bucket for missing values (#32832)
  Fix grammar in contributing docs
  SECURITY: Fix Compile Error in ReservedRealmTests (#33166)
  APM server monitoring (#32515)
  Support only string `format` in date, root object & date range (#28117)
  [Rollup] Move toBuilders() methods out of rollup config objects (#32585)
  Fix forbiddenapis on java 11  (#33116)
  Apply publishing to genreate pom (#33094)
  Have circuit breaker succeed on unknown mem usage
  Do not lose default mapper on metadata updates (#33153)
  Fix a mappings update test (#33146)
  Reload Secure Settings REST specs & docs (#32990)
  Refactor CachingUsernamePassword realm (#32646)
2018-08-27 13:49:59 -04:00
Martijn van Groningen 47e9e72df2
reduce maximum number of writes to speed up test 2018-08-27 12:14:46 +07:00
Jason Tedor ef9607ea0c
Track fetch exceptions for shard follow tasks (#33047)
This commit adds tracking and reporting for fetch exceptions. We track
fetch exceptions per fetch, keeping track of up to the maximum number of
concurrent fetches. With each failing fetch, we associate the from
sequence number with the exception that caused the fetch. We report
these in the CCR stats endpoint, and add some testing for this tracking.
2018-08-24 14:21:23 -04:00
Jason Tedor 7fa8a728c4
Make CCR QA tests build again (#33113)
Welp, I broke this. I merged a change to auto-discover the CCR QA tests
by making :x-pack:plugin:ccr:check auto-discover the check tasks in the
qa sub-project. Yet, the check tasks for these sub-projects did not
depend on the necessary test tasks (as we were previously doing this
directly from the ccr build file. This commit fixes this!
2018-08-24 09:48:54 -04:00
Martijn van Groningen b0f22d67c4
fixed not returning response instance 2018-08-24 16:56:29 +07:00
Martijn van Groningen 575f33941c
Required changes after merging in master branch. 2018-08-24 12:51:26 +07:00
Jason Tedor 9623cf6cde
Find CCR QA sub-projects automatically (#33027)
Today we are by-hand maintaining a list of CCR QA sub-projects that the
check task depends on. This commit simplifies this by finding these
sub-projects automatically and adding their check task as dependencies
of the CCR check task.
2018-08-21 12:51:55 -04:00
Jason Tedor b08d02e3b7
Implement CCR licensing (#33002)
This commit implements licensing for CCR. CCR will require a platinum
license, and administrative endpoints will be disabled when a license is
non-compliant.
2018-08-20 23:33:18 -04:00
Nhat Nguyen 919888eba7 TEST: Enable debug log testValidateFollowingIndexSettings 2018-08-06 14:55:56 -04:00
Nhat Nguyen c394eb9ae9 CCR: Expose the operation primary term
Relates #32442
2018-08-06 10:55:37 -04:00
Jason Tedor 3b739b9fd5
Avoid NPE on shard changes action (#32630)
If a leader index is deleted while there is an active follower, the
follower will send shard changes requests bound for the leader
index. Today this will result in a null pointer exception because there
will not be an index routing table for the index. A null pointer
exception looks like a bug to a user so this commit addresses this by
throwing an index not found exception instead.
2018-08-06 08:01:47 -04:00
Jason Tedor 32c2759bb9
Remove extra blank line in CcrStatsAction.java
This commit removes an extra blank line that was accidentally committed
to CcrStatsAction.java.
2018-08-03 09:55:04 -04:00
Jason Tedor d640c9ddf9
Introduce CCR stats endpoint (#32350)
This commit introduces the CCR stats endpoint which provides shard-level
stats on the status of CCR follower tasks.
2018-08-03 09:09:45 -04:00
Jason Tedor 2387616c80
Remove _xpack from CCR APIs (#32563)
For a new feature like CCR we will go without this extra layer of
indirection. This commit replaces all /_xpack/ccr/_(\S+) endpoints by
/_ccr/$1 endpoints.
2018-08-02 20:21:43 -04:00
Nhat Nguyen 8cfbb64d6e
ShardFollowNodeTask should fetch operation once (#32455)
Today ShardFollowNodeTask might fetch some operations more than once.
This happens because we ask the leading for up to max_batch_count
operations (instead of the left-over size) for the left-over request.
The leading then can freely respond up to the max_batch_count, and at
the same time, if one of the previous requests completed, we might issue
another read request whose range overlaps with the response of the
left-over request.

Closes #32453
2018-07-30 20:53:09 -04:00
Nhat Nguyen aa3b6e098c
Reject follow request if following setting not enabled on follower (#32448)
Today we do not check if the `following_index` setting of the follower
is enabled or not when processing a follow-request. If that setting is
disabled, the follower will use the default engine, not the following
engine. This change checks and rejects such invalid follow requests.

Relates #30086
2018-07-29 21:57:45 -04:00
Nhat Nguyen 8474f8a01c
Validate source of an index in LuceneChangesSnapshot (#32288)
Today it's possible to encounter an Index operation in Lucene whose
_source is disabled, and _recovery_source was pruned by the MergePolicy.
If it's the case, we create a Translog#Index without source and let the
caller validate it later. However, this approach is challenging for the
caller.

Deletes and No-Ops don't allow invoking "source()" method. The caller
has to make sure to call "source()" only on index operations. The
current implementation in CCR does not follow this and fail to replica
deletes or no-ops. Moreover, it's easier to reason if a Translog#Index
always has the source.
2018-07-27 08:16:52 -04:00
Nhat Nguyen cd8b80da58 Use shadow plugin in ccr/qa 2018-07-25 00:16:33 -04:00
Nhat Nguyen a5d8f0b55a CCR: use shadow plugin
Relates #32240
2018-07-24 22:48:11 -04:00
Nhat Nguyen 88190299df
CCR: Fix incorrect read request completion condition (#32266)
Today we consider a read request is exhausted if from_seqno is equal to
or greater than the max_required_seqno. However, if we stop when
from_seqno equals to the max_required_seqno, we will miss an operation
whose seqno is max_required_seqno because we have not seen that 
operation yet.
2018-07-22 22:14:27 -04:00
Martijn van Groningen b6b596e471
[CCR] Add random shard follow task test (#32188)
Added shard follow task unit tests that tests whether the shard follow task is able to process randomly generated shard changes api responses.
2018-07-21 12:38:05 +02:00
Nhat Nguyen 8e15504443 TEST: Fix range issue in ShardChangesActionTests
We modified the way we calculate to_seqno in #32121 but did not adjust
this test accordingly. If min_seqno equals to max_seqno, the size should be
one instead of zero.

Relates #32121
2018-07-20 17:20:41 -04:00
Nhat Nguyen fe574f89f8 CCR: Translog op on primary should have versionType
Normally translog operations will not be replayed on the primary.
Following engine is an exception where we replay translog on both
primary and replica as a non-primary strategy.  Even though we won't use
the version_type in the following engine, we still need to pass a valid
value for the primary operation in order not to trip assertions in an
engine.

This commit passes version_type EXTERNAL for translog operation if its
origin is primary.

Relates #31945
2018-07-20 08:39:38 -04:00
Martijn van Groningen a6b7497fdc
[CCR] Add more unit tests for shard follow task (#32121)
The added tests are based on specific scenarios as described in the test plan.
Before this change the ShardFollowNodeTaskTests contained more random like tests,
but these have been removed and in a followup pr better random tests will
be added in a new test class as is described in the test plan.
2018-07-20 14:12:05 +02:00
Nhat Nguyen d0f3ed5abd Merge branch 'master' into ccr
* master:
  Painless: Simplify Naming in Lookup Package (#32177)
  Handle missing values in painless (#32207)
  add support for write index resolution when creating/updating documents (#31520)
  ECS Task IAM profile credentials ignored in repository-s3 plugin (#31864)
  Remove indication of future multi-homing support (#32187)
  Rest test - allow for snapshots to take 0 milliseconds
  Make x-pack-core generate a pom file
  Rest HL client: Add put watch action (#32026)
  Build: Remove pom generation for plugin zip files (#32180)
  Fix comments causing errors with Java 11
  Fix rollup on date fields that don't support epoch_millis (#31890)
  Detect and prevent configuration that triggers a Gradle bug (#31912)
  [test] port linux package packaging tests (#31943)
  Revert "Introduce a Hashing Processor (#31087)" (#32178)
  Remove empty @return from JavaDoc
  Adjust SSLDriver behavior for JDK11 changes (#32145)
  [test] use randomized runner in packaging tests (#32109)
  Add support for field aliases. (#32172)
  Painless: Fix caching bug and clean up addPainlessClass. (#32142)
  Call setReferences() on custom referring tokenfilters in _analyze (#32157)
  Fix BwC Tests looking for UUID Pre 6.4 (#32158)
  Improve docs for search preferences (#32159)
  use before instead of onOrBefore
  Add more contexts to painless execute api (#30511)
  Add EC2 credential test for repository-s3 (#31918)
  A replica can be promoted and started in one cluster state update (#32042)
  Fix Java 11 javadoc compile problem
  Fix CP for namingConventions when gradle home has spaces (#31914)
  Fix `range` queries on `_type` field for singe type indices (#31756)
  [DOCS] Update TLS on Docker for 6.3 (#32114)
  ESIndexLevelReplicationTestCase doesn't support replicated failures but it's good to know what they are
  Remove versionType from translog (#31945)
  Switch distribution to new style Requests (#30595)
  Build: Skip jar tests if jar disabled
  Painless: Add PainlessClassBuilder (#32141)
  Build: Make additional test deps of check (#32015)
  Disable C2 from using AVX-512 on JDK 10 (#32138)
  Build: Move shadow customizations into common code (#32014)
  Painless: Fix Bug with Duplicate PainlessClasses (#32110)
  Remove empty @param from Javadoc
  Re-disable packaging tests on suse boxes
  Docs: Fix missing example script quote (#32010)
  [ML] Wait for aliases in multi-node tests (#32086)
  [ML] Move analyzer dependencies out of categorization config (#32123)
  Ensure to release translog snapshot in primary-replica resync (#32045)
  Handle TokenizerFactory  TODOs (#32063)
  Relax TermVectors API to work with textual fields other than TextFieldType (#31915)
  Updates the build to gradle 4.9 (#32087)
  Mute :qa:mixed-cluster indices.stats/10_index/Index - all’
  Check that client methods match API defined in the REST spec (#31825)
  Enable testing in FIPS140 JVM (#31666)
  Fix put mappings java API documentation (#31955)
  Add exclusion option to `keep_types` token filter (#32012)
  [Test] Modify assert statement for ssl handshake (#32072)
2018-07-19 23:03:01 -04:00
Martijn van Groningen d88c76e02b
[CCR] Initial replication group based tests (#32024)
Tests shard follow task in the context of a leader and follower ReplicationGroup,
in order to test how the shard follow logic reacts to certain shard related
failure scenarios.

More tests will need to be added, but this indicates what changes need to be made
to have these tests.

Relates to #30102
2018-07-17 17:39:49 +02:00
Martijn van Groningen 006c79a80d
[CCR] Improve retry mechanism when making remote calls from shard follow task (#31930)
Closes #31816
2018-07-17 10:25:51 +02:00
Martijn van Groningen 815faf34fc
[CCR] Move api parameters from url to request body. (#31949)
Relates to #30102
2018-07-11 10:16:43 +02:00
Martijn van Groningen 8e1ef0cff9
Rewrite shard follow node task logic (#31581)
The current shard follow mechanism is complex and does not give us easy ways the have visibility into the system (e.g. why we are falling behind).
The main reason why it is complex is because the current design is highly asynchronous. Also in the current model it is hard to apply backpressure
other than reducing the concurrent reads from the leader shard.

This PR has the following changes:
* Rewrote the shard follow task to coordinate the shard follow mechanism between a leader and follow shard in a single threaded manner.
  This allows for better unit testing and makes it easier to add stats.
* All write operations read from the shard changes api should be added to a buffer instead of directly sending it to the bulk shard operations api.
  This allows to apply backpressure. In this PR there is a limit that controls how many write ops are allowed in the buffer after which no new reads
  will be performed until the number of ops is below that limit.
* The shard changes api includes the current global checkpoint on the leader shard copy. This allows reading to be a more self sufficient process;
  instead of relying on a background thread to fetch the leader shard's global checkpoint.
* Reading write operations from the leader shard (via shard changes api) is a separate step then writing the write operations (via bulk shards operations api).
  Whereas before a read would immediately result into a write.
* The bulk shard operations api returns the local checkpoint on the follow primary shard, to keep the shard follow task up to date with what has been written.
* Moved the shard follow logic that was previously in ShardFollowTasksExecutor to ShardFollowNodeTask.
* Moved over the changes from #31242 to make shard follow mechanism resilient from node and shard failures.

Relates to #30086
2018-07-10 16:00:55 +02:00
Martijn van Groningen ac654cbc10
Follow engine should not fill gaps upon promotion and recovery (#31751)
Closes #31318
2018-07-03 13:15:06 +02:00
Martijn van Groningen 8ecfcc3b80
muted tests that will be replaced by the shard follow task refactoring:
https://github.com/elastic/elasticsearch/pull/31581
2018-06-29 11:47:46 +02:00
Nhat Nguyen 1185ddbcc6 Replaces testClassesDir with testClassesDirs in ccr build
Relates #30389
2018-06-28 11:24:41 -04:00
Nhat Nguyen 2c56df631d Adjusts transport actions in CCR
This commit adjusts the ccr’s actions accordingly to the recent changes
in the upstream.
2018-06-23 18:10:15 -04:00
Nhat Nguyen 34f127be3c CCR: Remove index name resolver from CCR actions
Relates #31002
2018-06-20 13:20:24 -04:00
Nhat Nguyen c74cd30ac6 Remove request type parameter from CCR actions
Relates #31405
2018-06-19 10:49:05 -04:00
Martijn van Groningen 50ce990305
added missing serialization tests 2018-06-19 10:22:58 +02:00
Martijn van Groningen 73c9dd976b
Remove action request builders. 2018-06-15 12:32:08 +02:00
Tanguy Leroux 18938aab39 Adapt ShardFollowTasksExecutor after #31031 2018-06-15 11:46:08 +02:00
Martijn van Groningen cc824ebb5e
[CCR] Added more validation to follow index api. (#31068) 2018-06-15 07:39:53 +02:00
Nhat Nguyen 1ccb34ac77 Remove unused imports 2018-06-14 11:44:20 -04:00
Jason Tedor 64b4cdeda6
Merge remote-tracking branch 'elastic/master' into ccr
* elastic/master: (53 commits)
  Painless: Restructure/Clean Up of Spec Documentation (#31013)
  Update ignore_unmapped serialization after backport
  Add back dropped substitution on merge
  high level REST api: cancel task (#30745)
  Enable engine factory to be pluggable (#31183)
  Remove vestiges of animal sniffer (#31178)
  Rename elasticsearch-nio to nio (#31186)
  Rename elasticsearch-core to core (#31185)
  Move cli sub-project out of server to libs (#31184)
  [DOCS] Fixes broken link in auditing settings
  QA: Better seed nodes for rolling restart
  [DOCS] Moves ML content to stack-docs
  [DOCS] Clarifies recommendation for audit index output type (#31146)
  Add nio-transport as option for http smoke tests (#31162)
  QA: Set better node names on rolling restart tests
  Add support for ignore_unmapped to geo sort (#31153)
  Share common parser in some AcknowledgedResponses (#31169)
  Fix random failure on SearchQueryIT#testTermExpansionExceptionOnSpanFailure
  Remove reference to multiple fields with one name (#31127)
  Remove BlobContainer.move() method (#31100)
  ...
2018-06-07 23:33:42 -04:00
Simon Willnauer 5c6711b8a4
Use a `_recovery_source` if source is omitted or modified (#31106)
Today if a user omits the `_source` entirely or modifies the source
on indexing we have no chance to re-create the document after it has
been added. This is an issue for CCR and recovery based on soft deletes
which we are going to make the default. This change adds an additional
recovery source if the source is disabled or modified that is only kept
around until the document leaves the retention policy window.

This change adds a merge policy that efficiently removes this extra source
on merge for all document that are live and not in the retention policy window
anymore.
2018-06-07 07:39:28 +02:00
Jason Tedor 20a2f646e2
Fix off-by-one error in chunks coordinator (#31147)
This commit fixes an off-by-error in the chunks coordinator where the
batches would be of size one more than the batch size.
2018-06-06 19:53:49 -04:00
Jason Tedor bf1152fcc6
Use follower primary term when applying operations (#31113)
The primary shard copy on the following has authority of the replication
operations that occur on the following side in cross-cluster
replication. Yet today we are using the primary term directly from the
operations on the leader side. Instead we should be replacing the
primary term on the following side with the primary term of the primary
on the following side. This commit does this by copying the translog
operations with the corrected primary term. This ensures that we use
this primary term while applying the operations on the primary, and when
replicating them across to the replica (where the replica request was
carrying the primary term of the primary shard copy on the follower).
2018-06-06 11:03:57 -04:00
Jason Tedor d230548401
Remove use of deprecated methods to perform request (#31117)
The old perform request methods on the REST client have been deprecated
in favor using request-flavored methods. This commit addresses the use
of these deprecated methods in the CCR test suite.
2018-06-06 05:09:55 -04:00
Nhat Nguyen 6ee6404e94 Adapt changes in PersistentTaskParams
Relates #31045
2018-06-04 14:48:04 -04:00
Nhat Nguyen 87abb49145 Adapt changes in AcknowledgeResponse
Relates #30983
2018-06-04 14:47:22 -04:00
Nhat Nguyen 9564b60194 Adjust CCR Actions after RequestBuilder is removed
CCR side of #30966
2018-06-01 23:09:59 -04:00
Nhat Nguyen 2a9a2002e6 CCR: Tighten requesting range check on leader
This commit clarifies the origin of the global checkpoint that the
following shard uses and replaces illegal_state_exc E by an assertion.

Relates #30980
2018-05-31 20:00:33 -04:00
Nhat Nguyen fa54be2dcd
CCR: Do not minimization requesting range on leader (#30980)
Today before reading operations on the leading shard, we minimization
the requesting range with the global checkpoint. However, this might
make the request invalid if the following shard generates a requesting
range based on the global-checkpoint from a primary shard and sends 
that request to a replica whose global checkpoint is lagged.

Another issue is that we are mutating the request when applying
minimization. If the request becomes invalid on a replica, we will
reroute the mutated request instead of the original one to the primary.

This commit removes the minimization and replaces it by a range check
with the local checkpoint.
2018-05-31 15:14:32 -04:00
Martijn van Groningen 7e8cf768cf
changed persistent task name to be of similar structure as the others 2018-05-31 15:16:13 +02:00
Martijn van Groningen a82f2e31b4
[CCR] Also copy routing_num_shards from leader to follow index. (#30894)
Bug was introduced when create and follow api was added in #30602
2018-05-31 08:03:53 +02:00
Nhat Nguyen f25ee254cc Mute ShardChangesIT#testFollowIndex 2018-05-30 14:29:58 -04:00
Martijn van Groningen adca32eae7
no need to resolve index name as only concrete index names are used 2018-05-30 12:42:35 +02:00
Martijn van Groningen 4a20dca5fe
Required changes after merging in master. 2018-05-30 10:26:49 +02:00
Martijn van Groningen 51caefe46c
[CCR] Sync mappings between leader and follow index (#30115)
The shard changes api returns the minimum IndexMetadata version the leader
index needs to have. If the leader side is behind on IndexMetadata version
then follow shard task waits with processing write operations until the
mapping has been fetched from leader index and applied in follower index
in the background.

The cluster state api is used to fetch the leader mapping and put mapping api
to apply the mapping in the follower index. This works because put mapping
api accepts fields that are already defined.

Relates to #30086
2018-05-28 07:37:27 +02:00
Martijn van Groningen e477147143
[CCR] Add create and follow api (#30602)
Also renamed FollowExisting* internal names to just Follow* and fixed tests
2018-05-26 15:05:40 +02:00
Martijn van Groningen 7942e4082a
build: enhance check task instead of overwriting it.
(test task didn't run when check task ran)
2018-05-16 10:54:15 +02:00
Martijn van Groningen 596ec1848e
[CCR] Add validation checks that were left out of #30120 (#30463) 2018-05-16 09:46:03 +02:00
Martijn van Groningen 23204e3d09
[CCR] Fixed follow and unfollow api url path according to design.
The TODOs in the rest actions was incorrect. The problem was that
these rest actions used `follow_index` as first named variable in the path
under which the rest actions were registered. Other candidate rest actions that
also have a named variable as first element in the path (but with a different
name) get resolved as rest parameters too and passed down to the rest
action that actually ends up getting executed.

In the case of the follow index api, a `index` parameter got passed down
to `RestFollowExistingAction`, but that param was never used. This caused the
follow index api call to fail, because of unused http parameters.

This change doesn't fixes that problem, but works around it by using
`index` as named variable for the follow index (instead of `follow_index`).

Relates to #30102
2018-05-16 09:07:50 +02:00
Martijn van Groningen 64b97313d5
[CCR] Make cross cluster replication work with security (#30239)
If security is enabled today with ccr then the follow index api will
fail with the fact that system user does not have privileges to use
the shard changes api. The reason that system user is used is because
the persistent tasks that keep the shards in sync runs in the background
and the user that invokes the follow index api only start those background
processes.

I think it is better that the system user isn't used by the persistent
tasks that keep shards in sync, but rather runs as the same user that
invoked the follow index api and use the permissions that that user has.
This is what this PR does, and this is done by keeping track of
security headers inside  the persistent task (similar to how rollup does this).

This PR also adds a cluster ccr priviledge that allows a user to follow
or unfollow an index. Finally if a user that wants to follow an index,
it needs to have read and monitor privileges on the leader index and
monitor and write privileges on the follow index.
2018-05-16 07:48:32 +02:00
Martijn van Groningen bb6586dc5f [CCR] Read changes from Lucene instead of translog (#30120)
This commit adds an API to read translog snapshot from Lucene,
then cut-over from the existing translog to the new API in CCR.

Relates #30086
Relates #29530
2018-05-09 17:35:27 -04:00
Martijn van Groningen ad499fc178
[CCR] added rest specs and simple rest test for follow and unfollow apis (#30123)
[CCR] added rest specs and simple rest test for follow and unfollow apis, also

Added an acknowledge field in follow and unfollow api responses. Currently these api return an empty response and fixed bug in unfollow api that didn't cleanup node tasks properly.
2018-05-07 14:18:28 +02:00
Nhat Nguyen 6e0d0feca0 Enable MockHttpTransport in ShardChangsIT
CCR side of #29601
2018-05-04 13:44:18 -04:00
Nhat Nguyen 8fefa8a661 Update InternalEngine tests on ccr side for #30121
Relates #30121
2018-05-04 10:57:54 -04:00
Nhat Nguyen d621fc7a00
Add tombstone document into Lucene for Noop (#30226)
This commit adds a tombstone document into Lucene for every No-op. 
With this change, Lucene index is expected to have a complete history 
of operations like Translog. In fact, this guarantee is subjected to the
soft-deletes retention merge policy.

Relates #29530
2018-05-02 09:08:29 -04:00
Nhat Nguyen eb4281edef CCR side #30244
Relates #30244
2018-05-01 21:08:24 -04:00
Martijn van Groningen 8a2df6c3b9
[CCR] Only normalize -1 seqno in shard changes request. (#30238)
Prior to this change a -1 seqno would be normalized earlier, which
caused a leader shard containing a single operation to be ignored.

Closes #30227
2018-05-01 08:40:23 +02:00
Martijn van Groningen e6b88fa5a0
removed duplicated license 2018-04-25 12:18:02 +02:00
Martijn van Groningen 5a67a0f78f
Applying changes required for ccr after moving ccr code to elasticsearch 2018-04-25 08:03:29 +02:00
Martijn van Groningen 9b9d0f9057 Enabled licence header check and fixed unchecked casts. (#4408) 2018-04-20 11:15:52 +02:00
Martijn van Groningen cfd7847628 fixed issues after merging in master 2018-04-20 07:59:13 +02:00
Nhat Nguyen f97aec7b8b Sibling of enforce access to translog via engine
Since elastic/elasticsearch#29542, we no longer expose translog instance
but only provide creating translog snapshot method. This commit adapts
that change in CCR branch.

Relates elastic/elasticsearch#29542
2018-04-18 11:54:00 -04:00
Martijn van Groningen 56ca59a513 Add the ability to the follow index to follow an index in a remote cluster.
The follow index api completely reuses CCS infrastructure that was exposed via:
https://github.com/elastic/elasticsearch/pull/29495

This means that the leader index parameter support the same ccs index
to indicate that an index resides in a different cluster.

I also added a qa module that smoke tests the cross cluster nature of ccr.
The idea is that this test just verifies that ccr can read data from a
remote leader index and that is it, no crazy randomization or indirectly
testing other features.
2018-04-17 07:36:40 +02:00
Martijn van Groningen c0d42e9cd1 Fixed test 2018-04-16 10:48:46 +02:00
Martijn van Groningen a94b38b88e Fixed compile errors and test failures after merging master into ccr. 2018-04-13 16:35:09 +02:00
Martijn van Groningen d77f756f5c ccr: use indices stats api to fetch global checkpoint of the follower shards and
keep track of shard follow stats inside shard follow stats' node task instead of persistent task status.

By maintaining the shard follow stats inside its node task the stats update is quicker as
no cluster state update is required. The stats are now transient; meaning if the task
is going to run a different node then the stats are gone too. Currently only the processed
global checkpoint is being tracked and this is being restored when a shard follow node task
starts via the indices stats api (the reason of the first change of this change). Other stats
that we may add in the future (like fetch_time, see: https://gist.github.com/s1monw/dba13daf8493bf48431b72365e110717)
it is ok if we start from zero in case a shard follow task moves to another node.
2018-04-05 14:52:20 +02:00
Martijn van Groningen d976fa44e7 Removed LocalCheckpointTracker usage. 2018-03-29 07:41:23 +02:00
Martijn van Groningen a22a7d079d ccr: Added maximum translog limit that a single shard changes response can return.
This limit is based on the number of estimate bytes in each translog
operation that fall between the minimum and maximum request sequence number.

If this limit is met then the shard follow task executor will make sure
that a subsequent shard changes request will be performed to fetch the
remaining translog operations.

This limit is needed in order to protect against returning too many
translog operations in a single shard changes response.

Relates to #2436
2018-03-28 15:49:57 +02:00
Martijn van Groningen 282740610b Fixed test after merging in master branch. 2018-03-28 09:54:41 +02:00
Nhat Nguyen 51111a8106 CCR: Stop FollowExistingIndexAction after report failure (#4111)
We check for the existence of both leader and follower index, then properly 
report to the caller. However, we do not return after reporting failure. This
causes the caller receive exception twice: IllegalArgumentException then
NullPointerException. This commit makes sure to stop the action after reporting
failure.
2018-03-26 13:56:47 -04:00
Martijn van Groningen 9e4c68c389 Fixed compile and test errors after merging in master 2018-03-16 17:47:10 +01:00
Martijn van Groningen 10cfa21a68 required changes after merge master branch into ccr branch. 2018-02-22 15:03:33 +01:00
Martijn van Groningen 1a9a7ffe97 removed hack 2018-02-07 17:54:28 +01:00
Martijn van Groningen c442d14f1d Several changes that were required after merging master into the ccr branch. 2018-02-05 13:25:58 +01:00
Martijn van Groningen 4e818254ad re-enabled java integration tests 2018-01-25 14:18:34 +01:00
Martijn van Groningen 05d3d2e49c fix packages after merge 2018-01-24 09:28:42 +01:00
Jason Tedor 9b6bb2c635 Enable run task for CCR
This commit enables the run task for ccr by specifying that the ccr
project not be evaluated until after core is evaluated. This is
important since ccr is alphabetically before core and thus Gradle
evaluates it first.

Relates #3665
2018-01-22 15:07:20 -05:00
Martijn van Groningen 83a82d83d0 Moved ccr source code to its own gradle module after xpack split. 2018-01-22 11:09:04 +01:00