Commit Graph

335 Commits

Author SHA1 Message Date
Armin Braun c5b3ac5578
SNAPSHOTS: Allow Parallel Restore Operations (#36397)
* Enable parallel restore operations
* Add uuid to restore in progress entries to uniquely identify them
* Adjust restore in progress entries to be a map in cluster state
* Added tests for:
   * Parallel restore from two different snapshots
   * Parallel restore from a single snapshot to different indices to test uuid identifiers are correctly used by `RestoreService` and routing allocator
   * Parallel restore with waiting for completion to test transport actions correctly use uuid identifiers
2018-12-14 11:39:23 +01:00
Nhat Nguyen a4b32f1143
Remove concurrency in testFailLeaderReplicaShard (#36607)
testFailLeaderReplicaShard periodically fails because we concurrently
index to the leader group and close one of its replicas. If a
replication request hits a closing shard, we will fail that shard;
however, failing a shard is supported by the test framework - this makes
the test fail.
2018-12-13 19:02:13 -05:00
Boaz Leskes f6b5d7e013
Add sequence numbers based optimistic concurrency control support to Engine (#36467)
This commit add support to engine operations for resolving and verifying the sequence number and
primary term of the last modification to a document before performing an operation. This is
infrastructure to move our (optimistic concurrency control)[http://en.wikipedia.org/wiki/Optimistic_concurrency_control] API to use sequence numbers instead of internal versioning.

Relates #36148 
Relates #10708
2018-12-13 08:08:40 +01:00
Martijn van Groningen 883940ad92
[CCR] Change AutofollowCoordinator to use wait_for_metadata_version (#36264)
Changed AutofollowCoordinator makes use of the wait_for_metadata_version
feature in cluster state API and removed hard coded poll interval.

Originates from #35895
Relates to #33007
2018-12-12 12:47:24 +01:00
Martijn van Groningen 4a825e2e86
[CCR] Clean followed leader index UUIDs in auto follow metadata (#36408)
The auto follow coordinator keeps track of the UUIDs of indices that it has followed. The index UUID strings need to be cleaned up in the case that these indices are removed in the remote cluster.

Relates to #33007
2018-12-12 09:55:37 +01:00
Nhat Nguyen 51800de2a8
Enable soft-deletes by default on 7.0.0 or later (#36141)
This change enables soft-deletes by default on ES 7.0.0 or later.

Relates #33222

Co-authored-by: Jason Tedor <jason@tedor.me>
2018-12-11 18:58:49 -05:00
Nhat Nguyen f23701406b CCR/TEST: Enable soft-deletes in ShardChangesActionTests
Relates #36446
2018-12-11 15:00:09 -05:00
Andrey Ershov 8b821706cc
Switch more tests to zen2 (#36367)
1. CCR tests work without any changes
2. `testDanglingIndices` require changes the source code (added TODO).
3. `testIndexDeletionWhenNodeRejoins` because it's using just two
nodes, adding the node to exclusions is needed on restart.
4. `testCorruptTranslogTruncationOfReplica` starts dedicated master
one, because otherwise, the cluster does not form, if nodes are stopped
and one node is started back.
5. `testResolvePath` needs TEST cluster, because all nodes are stopped
at the end of the test and it's not possible to perform checks needed
by SUITE cluster.
6. `SnapshotDisruptionIT`. Without changes, the test fails because Zen2
retries snapshot creation as soon as network partition heals. This
results into the race between creating snapshot and test cleanup logic
(deleting index). Zen1 on the
other hand, also schedules retry, but it takes some time after network
partition heals, so cleanup logic executes latter and test passes. The
check that snapshot is eventually created is added to
the end of the test.
2018-12-11 17:12:17 +01:00
Martijn van Groningen 633ab24017
[CCR] Restructured QA modules (#36404)
Renamed the follow qa modules:
`multi-cluster-downgraded-to-basic-license` to `downgraded-to-basic-license`
`multi-cluster-with-non-compliant-license` to `non-compliant-license`
`multi-cluster-with-security` to `security`

Moved the `chain` module into the `multi-cluster` module and
changed the `multi-cluster` to start 3 clusters.

Followup from #36031
2018-12-09 19:34:48 +01:00
Nhat Nguyen 95bafb0593 TEST: Always enable soft-deletes in ShardChangesTests 2018-12-09 02:57:13 -05:00
Tim Brooks 8a53f2b464
Implement basic `CcrRepository` restore (#36287)
This is related to #35975. It implements a basic restore functionality
for the CcrRepository. When the restore process is kicked off, it
configures the new index as expected for a follower index. This means
that the index has a different uuid, the version is not incremented, and
the Ccr metadata is installed.

When the restore shard method is called, an empty shard is initialized.
2018-12-07 15:27:04 -07:00
Nhat Nguyen f2df0a5be4
Remove LocalCheckpointTracker#resetCheckpoint (#34667)
In #34474, we added a new assertion to ensure that the
LocalCheckpointTracker is always consistent with Lucene index. However,
we reset LocalCheckpoinTracker in testDedupByPrimaryTerm cause this
assertion to be violated.

This commit removes resetCheckpoint from LocalCheckpointTracker and
rewrites testDedupByPrimaryTerm without resetting the local checkpoint.

Relates #34474
2018-12-07 12:22:20 -05:00
Ryan Ernst 37b3fc383f
Build: Use explicit deps on test tasks for check (#36325)
This commit moves back to use explicit dependsOn for test tasks on
check. Not all tasks extending RandomizedTestingTask should be run by
check directly.
2018-12-06 14:13:49 -08:00
Yannick Welsch a0ae1cc987 Merge remote-tracking branch 'elastic/master' into zen2 2018-12-05 23:13:12 +01:00
Jim Ferenczi 18866c4c0b
Make hits.total an object in the search response (#35849)
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).

Relates #33028
2018-12-05 19:49:06 +01:00
Yannick Welsch cc11953724 Merge remote-tracking branch 'elastic/master' into zen2 2018-12-05 16:55:45 +01:00
Tim Brooks 068c856e88
Rename internal repository actions to be internal (#36244)
This is a follow-up to #36086. It renames the internal repository
actions to be prefixed by "internal". This allows the system user to
execute the actions.

Additionally, this PR stops casting Client to NodeClient. The client we
have is a NodeClient so executing the actions will be local.
2018-12-05 08:11:47 -07:00
Yannick Welsch b20497560c Merge remote-tracking branch 'elastic/master' into zen2 2018-12-05 14:06:38 +01:00
Martijn van Groningen a264cb6ddb
Refactor AutoFollowCoordinator to track leader indices per remote cluster (#36031)
and replaced poll interval setting with a hardcoded poll interval.
The hard coded interval will be removed in a follow up change to make
use of cluster state API's wait_for_metatdata_version.

Before the auto following was bootstrapped from thread pool scheduler,
but now auto followers for new remote clusters are bootstrapped when
a new cluster state is published.

Originates from #35895
Relates to #33007
2018-12-05 13:39:14 +01:00
Alpar Torok 60e45cd81d
Testing conventions task part 2 (#36107)
Closes #35435

- make it easier to add additional testing tasks with the proper configuration and add some where they were missing.
- mute or fix failing tests
- add a check as part of testing conventions to find classes not included in any testing task.
2018-12-05 14:20:01 +02:00
Martijn van Groningen 11935cd480
Replace Streamable w/ Writeable in BaseTasksResponse and subclasses (#36176)
This commit replaces usages of Streamable with Writeable for the
BaseTasksResponse / TransportTasksAction classes and subclasses of
these classes.

Note that where possible response fields were made final.

Relates to #34389
2018-12-05 13:14:10 +01:00
Yannick Welsch 42457b5960 Merge remote-tracking branch 'elastic/master' into zen2 2018-12-05 11:39:38 +01:00
Martijn van Groningen b9707c29a1
[CCR] Change get autofollow patterns API response format (#36203)
The current response format is:

```
{
    "pattern1": {
        ...
    },
    "pattern2": {
        ...
    }
}
```

The new format is:

```
{
    "patterns": [
        {
            "name": "pattern1",
            "pattern": {
                ...
            }
        },
        {
            "name": "pattern2",
            "pattern": {
                ...
            }
        }
    ]
}
```

This format is more structured and more friendly for parsing and generating specs.
This is a breaking change, but it is better to do this now while ccr
is still a beta feature than later.

Follow up from #36049
2018-12-05 08:41:27 +01:00
Tim Brooks 8bde608979
Register CcrRepository based on settings update (#36086)
This commit adds an empty CcrRepository snapshot/restore repository.
When a new cluster is registered in the remote cluster settings, a new
CcrRepository is registered for that cluster.

This is implemented using a new concept of "internal repositories".
RepositoryPlugin now allows implementations to return factories for
"internal repositories". The "internal repositories" are different from
normal repositories in that they cannot be registered through the
external repository api. Additionally, "internal repositories" are local
to a node and are not stored in the cluster state.

The repository will be unregistered if the remote cluster is removed.
2018-12-04 14:36:50 -07:00
Yannick Welsch 70c361ea5a Merge remote-tracking branch 'elastic/master' into zen2 2018-12-04 21:26:11 +01:00
Adrien Grand 0df08dd458
Set Lucene version upon index creation. (#36038)
It is important that all shards of a given index have the same
`indexCreatedVersionMajor` to Lucene, or eg. merging those shards is going to
be considered illegal. At the moment, we use the latest Lucene version when
creating a shard, which could cause shards to have different created versions
eg. in case of forced allocation. This commit makes sure to reuse the
appropriate Lucene version in order to avoid such issues.

Closes #33826
2018-12-04 17:53:20 +01:00
Martijn van Groningen 6e1ff31222
[CCR] AutoFollowCoordinator should tolerate that auto follow patterns may be removed (#35945)
AutoFollowCoordinator should take into account that after auto following
an index and while updating that a leader index has been followed, that
the auto follow pattern may have been removed via delete auto follow patterns
api.

Also fixed a bug that when a remote cluster connection has been removed,
the auto follow coordinator does not die when it tries get a remote client for 
that cluster.

Closes #35480
2018-12-04 15:55:15 +01:00
Yannick Welsch 80ee7943c9 Merge remote-tracking branch 'elastic/master' into zen2 2018-12-04 09:37:09 +01:00
Martijn van Groningen 43773a32a4
Replace Streamable w/ Writeable in BaseTasksRequest and subclasses (#35854)
* Replace Streamable w/ Writeable in BaseTasksRequest and subclasses

This commit replaces usages of Streamable with Writeable for the
BaseTasksRequest / TransportTasksAction classes and subclasses of
these classes.

Relates to #34389
2018-12-03 08:04:29 +01:00
Martijn van Groningen 32f7fbd9f0
[TEST] Set 'index.unassigned.node_left.delayed_timeout' to 0 in ccr tests
Some tests kill nodes and otherwise it would take 60s by default
for replicas to get allocated and that is longer than we wait
for getting in a green state in tests.

Relates to #35403
2018-11-30 11:03:36 +01:00
David Turner 7f257187af
[Zen2] Update default for USE_ZEN2 to true (#35998)
Today the default for USE_ZEN2 is false and it is overridden in many places. By
defaulting it to true we can be sure that the only places in which Zen2 does
not work are those in which it is explicitly set to false.
2018-11-29 12:18:35 +00:00
Martijn van Groningen 1390f366d4
[CCR] Only auto follow indices when all primary shards have started (#35814)
This change adds an extra check that verifies that all primary shards
have been started of an index that is about to be auto followed.

If not all primary shards have been started for an index
then the next auto follow run will try to follow to auto follow
this index again.

Closes #35480
2018-11-29 09:46:09 +01:00
Jason Tedor a3186e4a32
Deprecate X-Pack centric license endpoints (#35959)
This commit is part of our plan to deprecate and ultimately remove the
use of _xpack in the REST APIs.
2018-11-28 08:24:35 -05:00
Jason Tedor 2887680acb
Avoid NPE in follower stats when no tasks metadata (#35802)
When there is no persistent tasks metadata we could hit a null pointer
exception when executing a follower stats request. This is because we
inspect the persistent tasks metadata. Yet, if no tasks have been
registered, this is null (as opposed to empty). We need to avoid
de-referencing the persistent tasks metadata in this case. That is what
this commit does, and we add a test for this situation.
2018-11-21 19:16:28 -05:00
Tim Brooks a989b675b5
Remove NPE from IndexFollowingIT (#35717)
Currently there is a common NPE in the IndexFollowingIT that does not
indicate the test failing. This is when a cluster state listener is
called and certain index metadata is not yet available.

This commit checks that the metadata is not null before performing the
logic that depends on the metadata.
2018-11-19 20:38:49 -07:00
Arthur Gavlyukovskiy 022726011c Remove use of AbstractComponent in server (#35444)
Removed extending of AbstractComponent and changed logger usage to
explicit declaration. Abstract classes still have logger
declaration using this.getClass() in order to show implementation class
name in its logs.

See #34488
2018-11-16 16:10:32 -05:00
Martijn van Groningen 0487181d0f
[TEST] Force flush to ensure multiple segments.
Relates to #35333
2018-11-13 14:58:17 +01:00
Jason Tedor 3859d21661
Fix the names of CCR stats endpoints in usage API (#35438)
This commit fixes the names of the CCR stats endpoints reported in the
usage API.
2018-11-12 10:27:12 -05:00
Martijn van Groningen ef10461caf
[TEST] Instead of ignoring the ccr downgrade to basic license qa test
avoid the assertions that check the log files, because that does not work on Windows.
The rest of the test is still useful and should work on Windows CI.

Currently on Windows CI this qa module fails because there is just one test and
that test si ignored if OS is Windows.
2018-11-12 10:17:33 +01:00
Martijn van Groningen ae2af20ae5
[CCR] Validate remote cluster license as part of put auto follow pattern api call (#35364)
Validate remote cluster license as part of put auto follow pattern api call
in addition of validation that when auto follow coordinator starts auto
following indices in the leader cluster.

Also added qa module that tests what happens to ccr after downgrading to basic license.
Existing active follow indices should remain to follow,
but the auto follow feature should not pickup new leader indices.
2018-11-09 17:43:43 +01:00
Martijn van Groningen 807ce10f73
[TEST] Increased timeout for verifying ccr monitoring. 2018-11-09 15:40:15 +01:00
Martijn van Groningen fba811fa3a
[TEST] increased the number of index and delete ops to make it less likely that all ops exist as soft delete docs. 2018-11-09 15:31:51 +01:00
Martijn van Groningen 83152b3835
[CCR] Get all auto follow patterns and no auto follow metadata (#35381)
Return empty response when querying all auto follow patterns,
but there is no auto follow metadata.
2018-11-09 14:24:27 +01:00
Martijn van Groningen 07a69a528b
[CCR] Rename leaderClient variables and parameters to remoteClient (#35368) 2018-11-08 16:26:14 +01:00
Martijn van Groningen 8a85251da0
[CCR] Auto follow Coordinator fetch cluster state in system context (#35120)
Auto follow Coordinator should fetch the leader cluster state using system context.
2018-11-08 10:48:27 +01:00
Martijn van Groningen 2f2090f562 [CCR] Adjust list of dynamic index settings that should be replicated (#35195)
Adjust list of dynamic index settings that should be replicated
and added a test that verifies whether builtin dynamic index settings
are classified as replicated or non replicated (whitelisted).
2018-11-07 21:59:58 -05:00
Jason Tedor 4f4fc3b8f8
Replicate index settings to followers (#35089)
This commit uses the index settings version so that a follower can
replicate index settings changes as needed from the leader.

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
2018-11-07 21:20:51 -05:00
Martijn van Groningen 314b9ca44c
[CCR] Enforce auto follow pattern name restrictions (#35197)
An auto follow pattern:
* cannot start with `_`
* cannot contain a `,`
* can be encoded in UTF-8
* the length of UTF-8 encoded bytes is no longer than 255 bytes
2018-11-07 20:16:26 +01:00
Martijn van Groningen e685cfe8f9
[CCR] Fail with a better error if leader index is red (#35298)
as part of fetching history uuids from leader index.
2018-11-07 13:23:30 +01:00
Martijn van Groningen 2395e16d84
[CCR] Change resume follow api to be a master node action (#35249)
In order to start shard follow tasks, the resume follow api already
needs execute N requests to the elected master node.

The pause follow API is also a master node action, which would make
how both APIs execute more consistent.
2018-11-07 07:38:44 +01:00
Martijn van Groningen a937d7f5f3
[CCR] Forgot missing return statement,
Error was thrown if leader index had no soft deletes enabled, but it then continued creating the follower index.

The test caught this bug, but very rarely due to timing issue.

Build failure instance:

```
1> [2018-11-05T20:29:38,597][INFO ][o.e.x.c.LocalIndexFollowingIT] [testDoNotCreateFollowerIfLeaderDoesNotHaveSoftDeletes] before test
  1> [2018-11-05T20:29:38,599][INFO ][o.e.c.s.ClusterSettings  ] [node_s_0] updating [cluster.remote.local.seeds] from [[]] to [["127.0.0.1:9300"]]
  1> [2018-11-05T20:29:38,599][INFO ][o.e.c.s.ClusterSettings  ] [node_s_0] updating [cluster.remote.local.seeds] from [[]] to [["127.0.0.1:9300"]]
  1> [2018-11-05T20:29:38,609][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_s_0] [leader-index] creating index, cause [api], templates [random-soft-deletes-templat
e, one_shard_index_template], shards [2]/[0], mappings []
  1> [2018-11-05T20:29:38,628][INFO ][o.e.c.r.a.AllocationService] [node_s_0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[leader-
index][0]] ...]).
  1> [2018-11-05T20:29:38,660][INFO ][o.e.x.c.a.TransportPutFollowAction] [node_s_0] [follower-index] creating index, cause [ccr_create_and_follow], shards [2]/[0]
  1> [2018-11-05T20:29:38,675][INFO ][o.e.c.s.ClusterSettings  ] [node_s_0] updating [cluster.remote.local.seeds] from [["127.0.0.1:9300"]] to [[]]
  1> [2018-11-05T20:29:38,676][INFO ][o.e.c.s.ClusterSettings  ] [node_s_0] updating [cluster.remote.local.seeds] from [["127.0.0.1:9300"]] to [[]]
  1> [2018-11-05T20:29:38,678][INFO ][o.e.x.c.LocalIndexFollowingIT] [testDoNotCreateFollowerIfLeaderDoesNotHaveSoftDeletes] after test
  1> [2018-11-05T20:29:38,678][INFO ][o.e.x.c.LocalIndexFollowingIT] [testDoNotCreateFollowerIfLeaderDoesNotHaveSoftDeletes] [LocalIndexFollowingIT#testDoNotCreateFoll
owerIfLeaderDoesNotHaveSoftDeletes]: cleaning up after test
  1> [2018-11-05T20:29:38,678][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s_0] [follower-index/TlWlXp0JSVasju2Kr_hksQ] deleting index
  1> [2018-11-05T20:29:38,678][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s_0] [leader-index/FQ6EwIWcRAKD8qvOg2eS8g] deleting index
FAILURE 0.23s J0 | LocalIndexFollowingIT.testDoNotCreateFollowerIfLeaderDoesNotHaveSoftDeletes <<< FAILURES!
   > Throwable #1: java.lang.AssertionError:
   > Expected: <false>
   >      but: was <true>
   >    at __randomizedtesting.SeedInfo.seed([7A3C89DA3BCA17DD:65C26CBF6FEF0B39]:0)
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.xpack.ccr.LocalIndexFollowingIT.testDoNotCreateFollowerIfLeaderDoesNotHaveSoftDeletes(LocalIndexFollowingIT.java:83)
   >    at java.lang.Thread.run(Thread.java:748)
```

Build failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+intake/46/console
2018-11-06 16:05:35 +01:00
Martijn van Groningen 46c238d792
[CCR] Improve error when operations are missing (#35179)
Improve error when operations are missing
2018-11-06 08:42:47 +01:00
Martijn van Groningen cac67f8bcc
[CCR] Add extra validation to unfollow api (#35245)
Validate whether the follow index actually exists and
whether the follow index actually has custom ccr metadata.
2018-11-06 08:00:34 +01:00
Nik Everett f72ef9b5fd
Build: Pull "skip assemble on qa" to common build (#35214)
Pull all of the logic that we use to skip the `assemble` and
`dependenciesInfo` tasks on `qa` projects into one spot in our root
build file.
2018-11-05 16:16:00 -05:00
Alexander Reelsen 409050e8de
Refactor: Remove settings from transport action CTOR (#35208)
As settings are not used in the transport action constructor, this
removes the passing of the settings in all the transport actions.
2018-11-05 13:08:18 +01:00
Martijn van Groningen ddda2d419c
[CCR] Change max_read_request_size default (#35247)
This changes the max_read_request_size default from unlimited to 32MB.
2018-11-05 12:51:42 +01:00
Nhat Nguyen 54e1231ebd
CCR/TEST: Limit indexing docs in FollowerFailOverIT (#35228)
The suite FollowerFailOverIT is failing because some documents are not
replicated to the follower. Maybe the FollowTask is not working as
expected or the background indexers eat all resources while the follower
cluster is trying to reform after a failover; then CI is not fast enough
to replicate all the indexed docs within 60 seconds (sometimes I see 80k
docs on the leader).

This commit limits the number of documents to be indexed into the leader
index by the background threads so that we can eliminate the latter
case. This change also replaces a docCount assertion with a docIds
assertion so we can have more information if these tests fail again.

Relates #33337
2018-11-03 10:00:54 -04:00
David Kyle 43cff56aec Mute FollowerFailOverIT.testFollowIndexAndCloseNode
Issue #33337
2018-11-02 15:13:09 +00:00
Nhat Nguyen 4875d6fb0b
CCR: Add NodeClosedException to retryable list (#35191)
This change adds NodeClosedException to the retry-able exception list.
2018-11-02 07:01:46 -04:00
Martijn van Groningen 19d6cf1b9e
[CCR] Change response classes to not use from Streamable. (#35085)
Only the response classes of get auto follow pattern, the follow and stats APIs
were moved away from Streamable.  The other APIs use `AcknowledgedResponse`
or `BaseTasksResponse` as response class and
moving that class away from Streamable is a bigger change.
2018-11-02 08:02:17 +01:00
Nhat Nguyen dbc7c9259c CCR/TEST: Add debug log to testFailOverOnFollower 2018-11-01 22:28:06 -04:00
Nik Everett e28509fbfe
Core: Less settings to AbstractComponent (#35140)
Stop passing `Settings` to `AbstractComponent`'s ctor. This allows us to
stop passing around `Settings` in a *ton* of places. While this change
touches many files, it touches them all in fairly small, mechanical
ways, doing a few things per file:
1. Drop the `super(settings);` line on everything that extends
`AbstractComponent`.
2. Drop the `settings` argument to the ctor if it is no longer used.
3. If the file doesn't use `logger` then drop `extends
AbstractComponent` from it.
4. Clean up all compilation failure caused by the `settings` removal
and drop any now unused `settings` isntances and method arguments.

I've intentionally *not* removed the `settings` argument from a few
files:
1. TransportAction
2. AbstractLifecycleComponent
3. BaseRestHandler

These files don't *need* `settings` either, but this change is large
enough as is.

Relates to #34488
2018-10-31 21:23:20 -04:00
Martijn van Groningen 6da2fb7d5b
Change CCR API request classes to use Writeable serialization instead of Streamable (#34911)
Only the follow stats request couldn't be changed to use Writeable serialization,
because that requires changes in `TransportTasksAction` and `BaseTasksRequest` base classes.
2018-10-30 11:03:30 +01:00
Jason Tedor cc9894d78a
Fix name of CCR stats transport action
This class name should include an indication that it is for CCR. This
commit adds that to the name of this class.
2018-10-29 09:50:13 -04:00
Jason Tedor 26f5c509af
Fix CCR API specification (#34963)
This commit fixes two issues with the CCR API specification:
 - remove the CCR stats endpoint, it is not currently implemented
 - fix the documentation links
2018-10-29 09:37:13 -04:00
Martijn van Groningen b2daaf15d1
[CCR] move tests that modify test cluster from the main test class to
a dedicated class.
2018-10-29 13:45:59 +01:00
Martijn van Groningen 1801518527
[CCR] Refactor stats APIs (#34912)
* Changed the auto follow stats to also include follow stats.
* Renamed the auto follow stats api to stats api and changed its url path
  from `/_ccr/auto_follow/stats` `/_ccr/stats`.
* Removed `/_ccr/stats` url path for the follow stats api, which makes
  the index parameter a required parameter.
* Fixed docs.
2018-10-29 07:45:27 +01:00
Martijn van Groningen bad5972f62
[CCR] Fix request serialization bug (#34917)
and some parameters that were not set in tests.
2018-10-29 07:38:55 +01:00
Jason Tedor 43f6ba1c63
Fix put/resume follow request parsing (#34913)
This commit adds some fields that were missing from put follow, and
fixes a bug in resume follow.
2018-10-26 11:09:55 -04:00
Martijn van Groningen 306f1d78f8
[CCR] Retry when no index shard stats can be found (#34852)
Index shard stats for the follower shard are fetched, when a shard follow task is started.
This is needed in order to bootstap the shard follow task with the follower global checkpoint.

Sometimes index shard stats are not available (e.g. during a restart) and
we fail now, while it is very likely that these stats will be available some time later.
2018-10-26 15:14:24 +02:00
Nhat Nguyen ff49e79d40 CCR: Rename follow-task parameters and stats (#34836)
* CCR: Rename follow parameters and stats

This commit renames the follow-task parameters and its stats.
Below are the changes:

## Params
- remote_cluster (unchanged)
- leader_index (unchanged)
- max_read_request_operation_count -> max_read_request_operation_count
- max_batch_size -> max_read_request_size
- max_write_request_operation_count (new)
- max_write_request_size (new)
- max_concurrent_read_batches -> max_outstanding_read_requests
- max_concurrent_write_batches -> max_outstanding_write_requests
- max_write_buffer_size (unchanged)
- max_write_buffer_count (unchanged)
- max_retry_delay (unchanged)
- poll_timeout -> read_poll_timeout

## Stats
- remote_cluster (unchanged)
- leader_index (unchanged)
- follower_index (unchanged)
- shard_id (unchanged)
- leader_global_checkpoint (unchanged)
- leader_max_seq_no (unchanged)
- follower_global_checkpoint (unchanged)
- follower_max_seq_no (unchanged)
- last_requested_seq_no (unchanged)
- number_of_concurrent_reads -> outstanding_read_requests
- number_of_concurrent_writes -> outstanding_write_requests
- buffer_size_in_bytes -> write_buffer_size_in_bytes (new)
- number_of_queued_writes -> write_buffer_operation_count
- mapping_version -> follower_mapping_version
- total_fetch_time_millis -> total_read_time_millis
- total_fetch_remote_time_millis -> total_read_remote_exec_time_millis
- number_of_successful_fetches -> successful_read_requests
- number_of_failed_fetches -> failed_read_requests
- operation_received -> operations_read
- total_transferred_bytes -> bytes_read
- total_index_time_millis -> total_write_time_millis [?]
- number_of_successful_bulk_operations -> successful_write_requests
- number_of_failed_bulk_operations -> failed_write_requests
- number_of_operations_indexed -> operations_written
- fetch_exception -> read_exceptions
- time_since_last_read_millis -> time_since_last_read_millis

* add test for max_write_request_(operation_count|size)
2018-10-25 10:36:15 +02:00
Martijn van Groningen 6fe0e62b7a
[CCR] Added write buffer size limit (#34797)
This limit is based on the size in bytes of the operations in the write buffer. If this limit is exceeded then no more read operations will be coordinated until the size in bytes of the write buffer has dropped below the configured write buffer size limit.

Renamed existing `max_write_buffer_size` to ``max_write_buffer_count` to indicate that limit is count based.

Closes #34705
2018-10-24 23:48:49 +02:00
Andrey Atapin 5f588180f9 Improve IndexNotFoundException's default error message (#34649)
This commit adds the index name to the error message when an index is not found.
2018-10-24 12:53:31 -07:00
Nhat Nguyen d73768f812
CCR: Do not follow if leader does not have soft-deletes (#34767)
We should not create a follower index and abort a follow request if the
leader does not have soft-deletes. Moreover, we also should not
auto-follow an index if it does not have soft-deletes.
2018-10-24 11:19:39 -04:00
Alpar Torok 795d57b4f9
Auto configure all test tasks (#34666)
With this change, we apply the common test config automatically to all
newly created tasks instead of opting in specifically.

For plugin authors using the plugin externally this means that the
configuration will be applied to their RandomizedTestingTasks as well.

The purpose of the task is to simplify setup and make it easier to
change projects that use the `test` task but actually run integration
tests to use a task called `integTest` for clarity, but also because
we may want to configure and run them differently.
E.x. using different levels of concurrency.
2018-10-24 16:05:50 +03:00
Martijn van Groningen 76240e6bbe
[CCR] Renamed leader_cluster to remote_cluster (#34776)
and also some occurrences of clusterAlias to remoteCluster.

Closes #34682
2018-10-24 13:39:36 +02:00
Boaz Leskes be907516ad
Change ShardFollowTask defaults (#34793)
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
2018-10-24 13:32:48 +02:00
Martijn van Groningen 18007a29b2
[CCR] Made leader cluster required in shard follow task.
Left over from #34580
2018-10-24 08:38:25 +02:00
Martijn van Groningen abf8cb6706
[CCR] Cleanup pause follow action (#34183)
* Change the `TransportPauseFollowAction` to extend from `TransportMasterNodeAction`
  instead of `HandledAction`, this removes a sync cluster state api call.
* Introduced `ResponseHandler` that removes duplicated code in `TransportPauseFollowAction` and
  `TransportResumeFollowAction`.
* Changed `PauseFollowAction.Request` to not use `readFrom()`.
2018-10-24 08:12:39 +02:00
Martijn van Groningen 0efba0675e
[CCR] Add qa test library (#34611)
* Introduced test qa lib that all CCR qa modules depend on to avoid
test code duplication.
2018-10-23 23:24:32 +02:00
Nhat Nguyen e242fd2e42
CCR: Add TransportService closed to retryable errors (#34722)
Both testFollowIndexAndCloseNode and testFailOverOnFollower failed
because they responded to the FollowTask a TransportService closed
exception which is currently considered as a fatal error. This behavior
is not desirable since a closing node can throw that exception, and we
should retry in that case.

This change adds TransportService closed error to the list of retryable
errors.

Closes #34694
2018-10-23 14:23:29 -04:00
Martijn van Groningen ed817fb265
[CCR] Move leader_index and leader_cluster parameters from resume follow to put follow api (#34638)
As part of this change the leader index name and leader cluster name are
stored in the CCR metadata in the follow index. The resume follow api
will read that when a resume follow request is executed.
2018-10-23 19:37:45 +02:00
Nhat Nguyen 5923ea536e
CCR: Requires soft-deletes on the follower (#34725)
Since #34412 and #34474, a follower must have soft-deletes enabled 
to work correctly. This change requires soft-deletes on the follower.

Relates #34412
Relates #34474
2018-10-23 11:51:17 -04:00
Martijn van Groningen e6d87cc09f
[CCR] Add total fetch time leader stat (#34577)
Add total fetch time leader stat, that
keeps track how much time was spent on fetches
from the leader cluster perspective.
2018-10-23 16:41:06 +02:00
Martijn van Groningen 36baf3823d
[CCR] Auto follow pattern APIs adjustments (#34518)
* Changed the resource id of auto follow patterns to be a user defined name
instead of being the leader cluster alias name.
* Fail when an unfollowed leader index matches with two or more auto follow patterns.
2018-10-23 15:48:51 +02:00
Jason Tedor 52fc502b7e
Fix the casing in the names of some CCR classes
We should be consistent here. We were already using the casing "Ccr" and
this is the preferred casing for Java class names. This commit adjusts
the names of some classes that were using the casing "CCR" to be "Ccr".
2018-10-22 11:25:00 -04:00
Jason Tedor 7af19b8f81
Migrate wait for pending tasks helper to server (#34675)
In some of our X-Pack REST tests we have to wait for pending tasks to
complete. We are now needing this functionality in ESRestTestCase for
the docs tests where we run against X-Pack features. This commit moves
the helper method that we have in X-Pack to ESRestTestCase, and removes
duplicate logic from waiting for rollup tasks to complete.
2018-10-22 11:14:02 -04:00
Martijn van Groningen 92e34732f5
[CCR] Remove ccr related metadata between tests for single node tests too 2018-10-22 09:15:22 +02:00
Martijn van Groningen b6750cf6c2
[CCR] Muted tests
Relates to #34696
2018-10-22 08:47:31 +02:00
Martijn van Groningen f51301a1a6
[CCR] Moved integration test 2018-10-22 08:44:41 +02:00
Martijn van Groningen b816837d39
[CCR] Always remove persistent tasks metadata between tests and
better handle assertion errors between tests.
2018-10-22 08:15:43 +02:00
Nhat Nguyen d90b6730c7
CCR: Following primary should process NoOps once (#34408)
This is a follow-up for #34288.

Relates #34412
2018-10-19 21:10:13 -04:00
Nhat Nguyen 630d5514a5 CCR/TEST: Adjust testFailOverOnFollower
CI passed but the result is outdated after PR #34366 was merged.
2018-10-19 15:06:44 -04:00
Nhat Nguyen bd92a28cfc
CCR: Replicate existing ops with old term on follower (#34412)
Since #34288, we might hit deadlock if the FollowTask has more fetchers
than writers. This can happen in the following scenario:

Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has
two fetchers and one writer.

1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0,
num_ops:1} and {from_seq_no: 1, num_ops:1} to read seq#0 and seq#1
respectively.

2. The second request which fetches seq#1 completes before, and then it
triggers a write request containing only seq#1.

3. The primary of a follower fails after it has replicated seq#1 to
replicas.

4. Since the old primary did not respond, the FollowTask issues another
write request containing seq#1 (resend the previous write request).

5. The new primary has seq#1 already; thus it won't replicate seq#1 to
replicas but will wait for the global checkpoint to advance at least
seq#1.

The problem is that the FollowTask has only one writer and that writer
is waiting for seq#0 which won't be delivered until the writer completed.

This PR proposes to replicate existing operations with the old primary
term (instead of the current term) on the follower. In particular, when
the following primary detects that it has processed an process already,
it will look up the term of an existing operation with the same seq_no
in the Lucene index, then rewrite that operation with the old term
before replicating it to the following replicas. This approach is
wait-free but requires soft-deletes on the follower.

Relates #34288
2018-10-19 13:56:00 -04:00
Nhat Nguyen 90ca5b1fde
Fill LocalCheckpointTracker with Lucene commit (#34474)
Today we rely on the LocalCheckpointTracker to ensure no duplicate when
enabling optimization using max_seq_no_of_updates. The problem is that
the LocalCheckpointTracker is not fully reloaded when opening an engine
with an out-of-order index commit. Suppose the starting commit has seq#0
and seq#2, then the current LocalCheckpointTracker would return "false"
when asking if seq#2 was processed before although seq#2 in the commit.

This change scans the existing sequence numbers in the starting commit,
then marks these as completed in the LocalCheckpointTracker to ensure
the consistent state between LocalCheckpointTracker and Lucene commit.
2018-10-19 12:38:06 -04:00
Martijn van Groningen 56d4f69718
Renamed remaining leader_cluster_alias / cluster_alias to leader_cluster 2018-10-19 07:59:56 +02:00
Martijn van Groningen 44b461aff2
[CCR] Make leader cluster a required argument. (#34580)
This change makes it no longer possible to follow / auto follow without
specifying a leader cluster. If a local index needs to be followed
then `cluster.remote.*.seeds` should point to nodes in the local cluster.

Closes #34258
2018-10-19 07:41:46 +02:00
Martijn van Groningen 0d62f6102c
[CCR] Split cluster alias from leader index field into its own field in follow APIs (#34366) 2018-10-18 12:11:48 +02:00
Jason Tedor 3e067123a1
Remove dead methods from ChainIT
This commit removes some unused methods from ChainIT.
2018-10-16 10:45:33 -04:00
Martijn van Groningen a1ec91395c
Changed CCR internal integration tests to use a leader and follower cluster instead of a single cluster (#34344)
The `AutoFollowTests` needs to restart the clusters between each tests, because
it is using auto follow stats in assertions. Auto follow stats are only reset
by stopping the elected master node.

Extracted the `testGetOperationsBasedOnGlobalSequenceId()` test to its own test, because it just tests the shard changes api.

* Renamed AutoFollowTests to AutoFollowIT, because it is an integration test.
Renamed ShardChangesIT to IndexFollowingIT, because shard changes it the name
of an internal api and isn't a good name for an integration test.

* move creation of NodeConfigurationSource to a seperate method

* Fixes issues after merge, moved assertSeqNos() and assertSameDocIdsOnShards() methods from ESIntegTestCase to InternalTestCluster, so that ccr tests can use these methods too.
2018-10-16 14:45:46 +02:00