Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM
test cases, and optionally wait for all snapshots to be deleted.
Waiting for all snapshots to be deleted, rather than failing if any are
in progress, is necessary for tests which use SLM policies because SLM
policies may be in the process of executing when the test ends.
* Restrict which tasks can use testclusters
This PR fixes a problem between the interaction of test-clusters and
build cache.
Before this any task could have used a cluster without tracking it as
input.
With this change a new interface is introduced to track the tasks that
can use clusters and we do consider the cluster as input for all of
them.
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)
As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.
* Fix one more import
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:
- `only_managed`, which causes the response to only contain indices
which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
ILM error state
"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
This commit renames the ILM package from indexlifecycle to ilm. We have
all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. This
commit makes that change.
This commit renames the SLM package from snapshotlifecycle to slm. We
have all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. For SLM,
let's use slm for all of these including the package name from the
beginning.
* Expose index age in ILM explain output
This adds the index's age to the ILM explain output, for example:
```
{
"indices" : {
"ilm-000001" : {
"index" : "ilm-000001",
"managed" : true,
"policy" : "full-lifecycle",
"lifecycle_date" : "2019-07-16T19:48:22.294Z",
"lifecycle_date_millis" : 1563306502294,
"age" : "1.34m",
"phase" : "hot",
"phase_time" : "2019-07-16T19:48:22.487Z",
... etc ...
}
}
}
```
This age can be used to tell when ILM will transition the index to the
next phase, based on that phase's `min_age`.
Resolves#38988
* Expose age in getters and in HLRC
Today we have an annotation for controlling logging levels in
tests. This annotation serves two purposes, one is to control the
logging level used in tests, when such control is needed to impact and
assert the behavior of loggers in tests. The other use is when a test is
failing and additional logging is needed. This commit separates these
two concerns into separate annotations.
The primary motivation for this is that we have a history of leaving
behind the annotation for the purpose of investigating test failures
long after the test failure is resolved. The accumulation of these stale
logging annotations has led to excessive disk consumption. Having
recently cleaned this up, we would like to avoid falling into this state
again. To do this, we are adding a link to the test failure under
investigation to the annotation when used for the purpose of
investigating test failures. We will add tooling to inspect these
annotations, in the same way that we have tooling on awaits fix
annotations. This will enable us to report on the use of these
annotations, and report when stale uses of the annotation exist.
* Add Snapshot Lifecycle Management (#43934)
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
* Fix compilation for backport
* Fix get snapshot output parsing in test
* [DOCS] Add redirects for removed autogen anchors (#44380)
* Switch <tt>...</tt> in javadocs for {@code ...}
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit adds functionality so that aliases that are manipulated on
leader indices are replicated by the shard follow tasks to the follower
indices. Note that we ignore write indices. This is due to the fact that
follower indices do not receive direct writes so the concept is not
useful.
Relates #41815
When the same alias points to multiple indices we can write to only one index
with `is_write_index` value `true`. The special handling in case of the put
mapping request(to resolve authorized indices) has a check on indices size
for a concrete index. If multiple indices existed then it marked the request
as unauthorized.
The check has been modified to consider write index flag and only when the
requested index matches with the one with write index alias, the alias is considered
for authorization.
Closes#40831
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)
This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.
(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)
* Fix forking JVM runner
* Don't bump shadow plugin version
Prior to this PR, there is a bug in ILM which does not allow ILM to stop
if one or more indices have an index.lifecycle.name which refers to
a policy that does not exist - the operation_mode will be stuck as
STOPPING until either the policy is created or the nonexistent
policy is removed from those indices.
This change allows ILM to stop in this case and makes the logging more
clear as to why ILM is not stopping.
It is possible that the Unfollow API may fail to release shard history
retention leases when unfollowing, so this needs to be handled by the
ILM Unfollow action. There's nothing much that can be done automatically
about it from the follower side, so this change makes the ILM unfollow
action simply ignore those failures.
This test should no longer pass when the functionality it is intended to
test is broken, as it now indexes a number of documents and verifies
that the index is staying on the same step until after indexing and
replication of those documents is finished. This prevents the test from
passing if the leader index progresses in its lifecycle during that time.
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.
The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).
Given all these issues, it seems that this custom timeout mechanism
gives little benefits while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyways.
This commit removes the `maxRetryTimeout` setting and all of its usages.
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).
This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:
```
PUT index?include_type_name=false
{
"mappings": {
"_doc": {
"properties": { ... }
}
}
}
```
There is a similar issue for both 'put templates' and 'put mappings' requests
as well.
This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
This test should not pass until CCR finishes integrating shard history
retention leases. It currently sometimes passes (which is a bug in the
test), but cannot pass reliably until the linked issue is resolved.
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.
This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes
the logic through the snapshot/restore process. A
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
We inject an Unfollow action before Shrink because the Shrink action
cannot be safely used on a following index, as it may not be fully
caught up with the leader index before the "original" following index is
deleted and replaced with a non-following Shrunken index. The Unfollow
action will verify that 1) the index is marked as "complete", and 2) all
operations up to this point have been replicated from the leader to the
follower before explicitly disconnecting the follower from the leader.
Injecting an Unfollow action before the Rollover action is done mainly
as a convenience: This allow users to use the same lifecycle policy on
both the leader and follower cluster without having to explictly modify
the policy to unfollow the index, while doing what we expect users to
want in most cases.
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.
This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.
This also adds integration tests for these scenarios (thanks to @talevy in #37552).
Resolves#37541
The integ tests currently use the raw zip project name as the
distribution type. This commit simplifies this specification to be
"default" or "oss". Whether zip or tar is used should be an internal
implementation detail of the integ test setup, which can (in the future)
be platform specific.
This change adds the unfollow action for CCR follower indices.
This is needed for the shrink action in case an index is a follower index.
This will give the follower index the opportunity to fully catch up with
the leader index, pause index following and unfollow the leader index.
After this the shrink action can safely perform the ilm shrink.
The unfollow action needs to be added to the hot phase and acts as
barrier for going to the next phase (warm or delete phases), so that
follower indices are being unfollowed properly before indices are expected
to go in read-only mode. This allows the force merge action to execute
its steps safely.
The unfollow action has three steps:
* `wait-for-indexing-complete` step: waits for the index in question
to get the `index.lifecycle.indexing_complete` setting be set to `true`
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks
for the index being handled to report that the leader shard global checkpoint
is equal to the follower shard global checkpoint.
* `pause-follower-index` step: Pauses index following, necessary to unfollow
* `close-follower-index` step: Closes the index, necessary to unfollow
* `unfollow-follower-index` step: Actually unfollows the index using
the CCR Unfollow API
* `open-follower-index` step: Reopens the index now that it is a normal index
* `wait-for-yellow` step: Waits for primary shards to be allocated after
reopening the index to ensure the index is ready for the next step
In the case of the last two steps, if the index in being handled is
a regular index then the steps acts as a no-op.
Relates to #34648
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
This commit adds a set_priority action to the hot, warm, and cold
phases for an ILM policy. This action sets the `index.priority`
on the managed index to allow different priorities between the
hot, warm, and cold recoveries.
This commit also includes the HLRC and documentation changes.
closes#36905
the testFullPolicy and testMoveToRolloverStep tests
are very important tests, but they sometimes timeout
beyond the default 10sec wait for shrink to occur.
This commit increases one of the assertBusys to
20 seconds
Leaving `index.lifecycle.indexing_complete` in place when removing the
lifecycle policy from an index can cause confusion, as if a new policy
is associated with the policy, rollover will be silently skipped.
Removing that setting when removing the policy from an index makes
associating a new policy with the index more involved, but allows ILM to
fail loudly, rather than silently skipping operations which the user may
assume are being performed.
* Adjust order of checks in WaitForRolloverReadyStep
This allows ILM to error out properly for indices that have a valid
alias, but are not the write index, while still handling
`indexing_complete` on old-style aliases and rollover (that is, those
which only point to a single index at a time with no explicit write
index)
* WaitForRolloverReadyStepTests#mutateInstance sometimes did not mutate the instance
correctly
* 40_explain_lifecycle#"Test new phase still has phase_time" is not really a necessary
integration test. In addition to this, it is flaky due to the asynchronous nature of
ILM metadata population
* add read_ilm cluster privilege
Although managing ILM policies is best done using the
"manage" cluster privilege, it is useful to have read-only
views.
* adds `read_ilm` cluster privilege for viewing policies and status
* adds Explain API to the `view_index_metadata` index privilege
* add manage_ilm privileges
The error message used when attempting to delete a lifecycle policy that
is in use previously only included one index which was using the policy.
It now includes all indices using that policy.
Adds a setting that indicates that an index is done indexing, set by ILM
when the Rollover action completes. This indicates that the Rollover
action should be skipped in any future invocations, as long as the index
is no longer the write index for its alias.
This enables 1) an index with a policy that involves the Rollover action
to have the policy removed and switched to another one without use of
the move-to-step API, and 2) integrations with Beats and CCR.
RolloverStep previously had a name of "attempt_rollover", which was
inconsistent with all other step names due it its use of an underscore
instead of a dash.
RolloverAction will now periodically check the rollover conditions using
the Rollover API with the dry_run option as an AsyncWaitStep, then run
the rollover itself by calling the Rollover API with no conditions,
which will always roll over, as an AsyncActionStep. This will resolve
race condition issues in policies using RolloverAction.