The primary shard copy on the following has authority of the replication
operations that occur on the following side in cross-cluster
replication. Yet today we are using the primary term directly from the
operations on the leader side. Instead we should be replacing the
primary term on the following side with the primary term of the primary
on the following side. This commit does this by copying the translog
operations with the corrected primary term. This ensures that we use
this primary term while applying the operations on the primary, and when
replicating them across to the replica (where the replica request was
carrying the primary term of the primary shard copy on the follower).
The old perform request methods on the REST client have been deprecated
in favor using request-flavored methods. This commit addresses the use
of these deprecated methods in the CCR test suite.
This commit clarifies the origin of the global checkpoint that the
following shard uses and replaces illegal_state_exc E by an assertion.
Relates #30980
Today before reading operations on the leading shard, we minimization
the requesting range with the global checkpoint. However, this might
make the request invalid if the following shard generates a requesting
range based on the global-checkpoint from a primary shard and sends
that request to a replica whose global checkpoint is lagged.
Another issue is that we are mutating the request when applying
minimization. If the request becomes invalid on a replica, we will
reroute the mutated request instead of the original one to the primary.
This commit removes the minimization and replaces it by a range check
with the local checkpoint.
The shard changes api returns the minimum IndexMetadata version the leader
index needs to have. If the leader side is behind on IndexMetadata version
then follow shard task waits with processing write operations until the
mapping has been fetched from leader index and applied in follower index
in the background.
The cluster state api is used to fetch the leader mapping and put mapping api
to apply the mapping in the follower index. This works because put mapping
api accepts fields that are already defined.
Relates to #30086
The TODOs in the rest actions was incorrect. The problem was that
these rest actions used `follow_index` as first named variable in the path
under which the rest actions were registered. Other candidate rest actions that
also have a named variable as first element in the path (but with a different
name) get resolved as rest parameters too and passed down to the rest
action that actually ends up getting executed.
In the case of the follow index api, a `index` parameter got passed down
to `RestFollowExistingAction`, but that param was never used. This caused the
follow index api call to fail, because of unused http parameters.
This change doesn't fixes that problem, but works around it by using
`index` as named variable for the follow index (instead of `follow_index`).
Relates to #30102
If security is enabled today with ccr then the follow index api will
fail with the fact that system user does not have privileges to use
the shard changes api. The reason that system user is used is because
the persistent tasks that keep the shards in sync runs in the background
and the user that invokes the follow index api only start those background
processes.
I think it is better that the system user isn't used by the persistent
tasks that keep shards in sync, but rather runs as the same user that
invoked the follow index api and use the permissions that that user has.
This is what this PR does, and this is done by keeping track of
security headers inside the persistent task (similar to how rollup does this).
This PR also adds a cluster ccr priviledge that allows a user to follow
or unfollow an index. Finally if a user that wants to follow an index,
it needs to have read and monitor privileges on the leader index and
monitor and write privileges on the follow index.
This commit adds an API to read translog snapshot from Lucene,
then cut-over from the existing translog to the new API in CCR.
Relates #30086
Relates #29530
[CCR] added rest specs and simple rest test for follow and unfollow apis, also
Added an acknowledge field in follow and unfollow api responses. Currently these api return an empty response and fixed bug in unfollow api that didn't cleanup node tasks properly.
This commit adds a tombstone document into Lucene for every No-op.
With this change, Lucene index is expected to have a complete history
of operations like Translog. In fact, this guarantee is subjected to the
soft-deletes retention merge policy.
Relates #29530
The follow index api completely reuses CCS infrastructure that was exposed via:
https://github.com/elastic/elasticsearch/pull/29495
This means that the leader index parameter support the same ccs index
to indicate that an index resides in a different cluster.
I also added a qa module that smoke tests the cross cluster nature of ccr.
The idea is that this test just verifies that ccr can read data from a
remote leader index and that is it, no crazy randomization or indirectly
testing other features.
keep track of shard follow stats inside shard follow stats' node task instead of persistent task status.
By maintaining the shard follow stats inside its node task the stats update is quicker as
no cluster state update is required. The stats are now transient; meaning if the task
is going to run a different node then the stats are gone too. Currently only the processed
global checkpoint is being tracked and this is being restored when a shard follow node task
starts via the indices stats api (the reason of the first change of this change). Other stats
that we may add in the future (like fetch_time, see: https://gist.github.com/s1monw/dba13daf8493bf48431b72365e110717)
it is ok if we start from zero in case a shard follow task moves to another node.
This limit is based on the number of estimate bytes in each translog
operation that fall between the minimum and maximum request sequence number.
If this limit is met then the shard follow task executor will make sure
that a subsequent shard changes request will be performed to fetch the
remaining translog operations.
This limit is needed in order to protect against returning too many
translog operations in a single shard changes response.
Relates to #2436
We check for the existence of both leader and follower index, then properly
report to the caller. However, we do not return after reporting failure. This
causes the caller receive exception twice: IllegalArgumentException then
NullPointerException. This commit makes sure to stop the action after reporting
failure.
This commit enables the run task for ccr by specifying that the ccr
project not be evaluated until after core is evaluated. This is
important since ccr is alphabetically before core and thus Gradle
evaluates it first.
Relates #3665