We marked the `init` ILM step as retryable but our test used `waitUntil`
without an assert so we didn’t catch the fact that we were not actually
able to retry this step as our ILM state didn’t contain any information
about the policy execution (as we were in the process of initialising
it).
This commit manually sets the current step to `init` when we’re moving
the ilm policy into the ERROR step (this enables us to successfully
move to the error step and later retry the step)
* ShrunkenIndexCheckStep: Use correct logger
(cherry picked from commit f78d4b3d91345a2a8fc0f48b90dd66c9959bd7ff)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
Modifies SLM's and ILM's history indices to be hidden indices for added
protection against accidental querying and deletion, and improves
IndexTemplateRegistry to handle upgrading index templates.
Also modifies the REST test cleanup to delete hidden indices.
* Allow forcemerge in the hot phase for ILM policies
This commit changes the `forcemerge` action to also be allowed in the `hot` phase for policies. The
forcemerge will occur after a rollover, and allows users to take advantage of higher disk speeds for
performing the force merge (on a separate node type, for example).
On caveat with this is that a `forcemerge` in the `hot` phase *MUST* be accompanied by a `rollover`
action. ILM validates policies to ensure this is the case.
Resolves#43165
* Use anyMatch instead of findAny in validation
* Make randomTimeseriesLifecyclePolicy single-pass
This commit changes how RestHandlers are registered with the
RestController so that a RestHandler no longer needs to register itself
with the RestController. Instead the RestHandler interface has new
methods which when called provide information about the routes
(method and path combinations) that are handled by the handler
including any deprecated and/or replaced combinations.
This change also makes the publication of RestHandlers safe since they
no longer publish a reference to themselves within their constructors.
Closes#51622
Co-authored-by: Jason Tedor <jason@tedor.me>
Backport of #51950
* Adding best_compression (#49974)
This commit adds a `codec` parameter to the ILM `forcemerge` action. When setting the codec to `best_compression` ILM will close the index, then update the codec setting, re-open the index, and finally perform a force merge.
* Fix ForceMergeAction toSteps construction (#51825)
There was a duplicate force merge step and the test continued to fail. This commit clarifies the
`toStep` method and changes the `assertBestCompression` method for better readability.
Resolves#51822
* Update version constants
Co-authored-by: Sivagurunathan Velayutham <sivadeva.93@gmail.com>
We suspect the flakiness could’ve come from the fact that the rollover
step used to create the new index and roll the write alias to the new
index in separate cluster state updates. So the assertion that the
rolled index exists could’ve passed in the test but, before the
alias was rolled over to the new index, the subsequent write we execute
in the test (namely
`indexDocs("test_user", "x-pack-test-password", "foo_alias", 1)`)
would’ve sent the new document to the source index (ie. foo-logs-000001)
This would see the source index containing 3 documents and the rolled
index (foo-logs-000002) 0 documents.
However, we fixed this and the rollover step executes the “create index
and roll alias” in one single cluster update, so this situation should
not occur anymore.
(cherry picked from commit 834261c4fe7dd93f437eeec43c00d01ff2279f86)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Fix SnapshotLifecycleRestIT.testFullPolicySnapshot
This previously was missing some key information in the output of the failure. This captures that
information and adds logging at each step so we can determine the cause *if* it fails again.
Resolves#50358
Currently when an ILM policy finishes its execution, the index moves into the `TerminalPolicyStep`,
denoted by a completed/completed/completed phase/action/step lifecycle execution state.
This commit changes the behavior so that the index lifecycle execution state halts at the last
configured phase's `PhaseCompleteStep`, so for instance, if an index were configured with a policy
containing a `hot` and `cold` phase, the index would stop at the `cold/complete/complete`
`PhaseCompleteStep`. This allows an ILM user to update the policy to add any later phases and have
indices configured to use that policy pick up execution at the newly added "later" phase. For
example, if a `delete` phase were added to the policy specified about, the index would then move
from `cold/complete/complete` into the `delete` phase.
Relates to #48431
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.
This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
* Fix TimeSeriesLifecycleActionsIT.testShrinkAction
Shrinking a 6 shard index to 3 shards can be quite time consuming and
assertBusy probes the conditions at exponentially growing intervals.
This separates the one assertion that was used for all the conditions
into multiple assertBusy statements and increases the timeout for waiting
for the shrink to complete.
* Allow more time for shrink to complete
This commit allows more time for the shrink operation to complete in
testRetryFailedShrinkAction (separating the assertBusy calls too) and
testMoveToRolloverStep.
* Shrink to no more than 2 shards in tests
(cherry picked from commit 5fe780148fa3536915d61475b087896a5b9ace82)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Check all snapshots in SnapshotLifecycleRestIT.testFullPolicy
Rather than check the first returned snapshot for a snapshot starting with `snap-` in
SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting
with `snap-`.
In the event that there are no snapshots (the failure case), this also exposes the full results map
so we can diagnose why a failure occurred.
Relates to #50358
* Use a more imperative style for checking
* Separate aliases used for tests in TimeSeriesLifecycleActionsIT
This is related to #51375 and hopes to help illuminate why some of those tests are failing. This
commit switches the aliases used in the test to use a random alias name every time (since there were
some complaints in the tests about aliases having more than one write index). With this we hope to
determine the actual cause of the failure in the test.
This also adds additional information to the exception returned when calling move-to-step with the
incorrect current step.
* Fix rest test
* Use ESSingleNodeTestCase instead of ESIntegTestCase (#51345)
(cherry picked from commit abcf1c41faf05a0b0196fb06e57c3de8c3d67688)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This change exposes master timeout to ILM steps through global dynamic setting.
All currently implemented steps make use of this setting as well.
Closes#44136
This makes the UpdateSettingsStep retryable. This step updates settings needed
during the execution of ILM actions (mark indexes as read-only, change
allocation configurations, mark indexing complete, etc)
As the index updates are idempotent in nature (PUT requests and are applied only
if the values have changed) and the settings values are seldom user-configurable
(aside from the allocate action) the testing for this change goes along the
lines of artificially simulating a setting update failure on a particular value
update, which is followed by a successful step execution (a retry) in an
environment outside of ILM (the step executions are triggered manually).
(cherry picked from commit 8391b0aba469f39532bfc2796b76148167dc0289)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
After we rollover the index we wait for the configured number of shards for the
rolled index to become active (based on the index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case,
for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not
being able to spring to life due to cluster nodes reconfiguration and others
and, the RolloverStep might not complete successfully due to this inherent
transient situation, albeit the rolled index having been created.
(cherry picked from commit 457a92fb4c68c55976cc3c3e2f00a053dd2eac70)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
These policies store statistics, but since stats updating is asynchronous, it's
possible for the update from one test to bleed into a separate one. This change
switches the tests to use separate policy ids so that their stats are tracked
independently. It also relaxes the checking constraint in one of the tests.
Hopefully this:
Resolves#48531Resolves#48017
It's possible that the index could return no settings and thus throw a
`NullPointerException`.
I wasn't able to reproduce the original issue, but this should guard
against in the future.
Resolves#50646
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.
This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.
Resolves#50358 (hopefully!)
* ILM action to wait for SLM policy execution (#50454)
This change add new ILM action to wait for SLM policy execution to ensure that index has snapshot before deletion.
Closes#45067
* Fix flaky TimeSeriesLifecycleActionsIT#testWaitForSnapshot test
This change adds some randomness and cleanup step to TimeSeriesLifecycleActionsIT#testWaitForSnapshot and testWaitForSnapshotSlmExecutedBefore tests in attempt to make them stable.
Reletes to #50781
* Formatting changes
* Longer timeout
* Fix Map.of in Java8
* Unused import removed
* Refresh cached phase policy definition if possible on new policy
There are some cases when updating a policy does not change the
structure in a significant way. In these cases, we can reread the
policy definition for any indices using the updated policy.
This commit adds this refreshing to the `TransportPutLifecycleAction`
to allow this. It allows us to do things like change the configuration
values for a particular step, even when on that step (for example,
changing the rollover criteria while on the `check-rollover-ready` step).
There are more cases where the phase definition can be reread that just
the ones checked here (for example, removing an action that has already
been passed), and those will be added in subsequent work.
Relates to #48431
* Fix SLM check for restore in progress (#50868)
* Fix SLM check for restore in progress
This commit fixes the check in SLM where the `RestoreInProgress`
metadata was checked for existence. Rather than check existence we
should instead check the `isEmpty` method. Prior to this, a successful
restore for a repository that used SLM retention would prevent SLM
retention from running in subsequent invocations, due to SLM thinking
that a restore was still running.
* Fix 7.x-isms
This commits makes the "init" ILM step retryable. It also adds a test
where an index is created with a non-parsable index name and then fails.
Related to #48183
This makes the "update-rollover-lifecycle-date" step, which is part of the
rollover action, retryable. It also adds an integration test to check the
step is retried and it eventually succeeds.
(cherry picked from commit 5bf068522deb2b6cd2563bcf80f34fdbf459c9f2)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Add aditional logging for ILM history store tests (#50624)
These tests use the same index name, making it hard to read logs when
diagnosing the failures. Additionally more information about the current
state of the index could be retrieved when failing.
This changes these two things in the hope of capturing more data about
why this fails on some CI nodes but not others.
Relates to #50353
This adds support for retrying AsyncActionSteps by triggering the async
step after ILM was moved back on the failed step (the async step we'll
be attempting to run after the cluster state reflects ILM being moved
back on the failed step).
This also marks the RolloverStep as retryable and adds an integration
test where the RolloverStep is failing to execute as the rolled over
index already exists to test that the async action RolloverStep is
retried until the rolled over index is deleted.
(cherry picked from commit 8bee5f4cb58a1242cc2ef4bc0317dae6c8be49d3)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Add ILM histore store index (#50287)
* Add ILM histore store index
This commit adds an ILM history store that tracks the lifecycle
execution state as an index progresses through its ILM policy. ILM
history documents store output similar to what the ILM explain API
returns.
An example document with ALL fields (not all documents will have all
fields) would look like:
```json
{
"@timestamp": 1203012389,
"policy": "my-ilm-policy",
"index": "index-2019.1.1-000023",
"index_age":123120,
"success": true,
"state": {
"phase": "warm",
"action": "allocate",
"step": "ERROR",
"failed_step": "update-settings",
"is_auto-retryable_error": true,
"creation_date": 12389012039,
"phase_time": 12908389120,
"action_time": 1283901209,
"step_time": 123904107140,
"phase_definition": "{\"policy\":\"ilm-history-ilm-policy\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"rollover\":{\"max_size\":\"50gb\",\"max_age\":\"30d\"}}},\"version\":1,\"modified_date_in_millis\":1576517253463}",
"step_info": "{... etc step info here as json ...}"
},
"error_details": "java.lang.RuntimeException: etc\n\tcaused by:etc etc etc full stacktrace"
}
```
These documents go into the `ilm-history-1-00000N` index to provide an
audit trail of the operations ILM has performed.
This history storage is enabled by default but can be disabled by setting
`index.lifecycle.history_index_enabled` to `false.`
Resolves#49180
* Make ILMHistoryStore.putAsync truly async (#50403)
This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.
This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.
Resolves#50353
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.
Relates #50056.
Partial backport of #46431 to 7.x.
This commit refactors the `IndexLifecycleRunner` to split out and
consolidate the number of methods that change state from within ILM. It
adds a new class `IndexLifecycleTransition` that contains a number of
static methods used to modify ILM's state. These methods all return new
cluster states rather than making changes themselves (they can be
thought of as helpers for modifying ILM state).
Rather than having multiple ways to move an index to a particular step
(like `moveClusterStateToStep`, `moveClusterStateToNextStep`,
`moveClusterStateToPreviouslyFailedStep`, etc (there are others)) this
now consolidates those into three with (hopefully) useful names:
- `moveClusterStateToStep`
- `moveClusterStateToErrorStep`
- `moveClusterStateToPreviouslyFailedStep`
In the move, I was also able to consolidate duplicate or redundant
arguments to these functions. Prior to this commit there were many calls
that provided duplicate information (both `IndexMetaData` and
`LifecycleExecutionState` for example) where the duplicate argument
could be derived from a previous argument with no problems.
With this split, `IndexLifecycleRunner` now contains the methods used to
actually run steps as well as the methods that kick off cluster state
updates for state transitions. `IndexLifecycleTransition` contains only
the helpers for constructing new states from given scenarios.
This also adds Javadocs to all methods in both `IndexLifecycleRunner`
and `IndexLifecycleTransition` (this accounts for almost all of the
increase in code lines for this commit). It also makes all methods be as
restrictive in visibility, to limit the scope of where they are used.
This refactoring is part of work towards capturing actions and
transitions that ILM makes, by consolidating and simplifying the places
we make state changes, it will make adding operation auditing easier.
This test must check for state `SUCCESS` as well. `SUCESS` in
`SnapshotsInProgress` means "all data nodes finished snapshotting sucessfully but master must still finalize the snapshot in the repo".
`SUCESS` does not mean that the snapshot is actually fully finished in this object.
You can easily reporduce the scenario in #49303 that has an in-progress snapshot in `SUCCESS` state
by waiting 20s before running the busy assert loop on the snapshot status so that all steps but the blocked
finalization can finish.
Closes#49303
* SLM set the operation mode to RUNNING on first run
Set the SLM operation mode to RUNNING when setting the first SLM lifecycle
policy. Historically, SLM was not decoupled from ILM but now they are
independent components. Setting the SLM operation mode to what the ILM running
mode was when we set the first SLM lifecycle policy was a remain from those
times.
* SLM update package info
* SLM suppress unusued warning
* SLM use logger for the correct class
* SLM Add integration test for operation mode
* Use ESSingleNodeTestCase instead of ESIntegTestCase
(cherry picked from commit 4ad3d93f89d03bf9a25685a990d1a439f33ce0e6)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit changes the ThreadContext to just use a regular ThreadLocal
over the lucene CloseableThreadLocal. The CloseableThreadLocal solves
issues with ThreadLocals that are no longer needed during runtime but
in the case of the ThreadContext, we need it for the runtime of the
node and it is typically not closed until the node closes, so we miss
out on the benefits that this class provides.
Additionally by removing the close logic, we simplify code in other
places that deal with exceptions and tracking to see if it happens when
the node is closing.
Closes#42577
This commit wraps the calls to retrieve the current step in a try/catch
so that the exception does not bubble up. Instead, step info is added
containing the exception to the existing step.
Semi-related to #49128
(cherry picked from commit 72530f8a7f40ae1fca3704effb38cf92daf29057)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.
This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.
This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.
Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
When triggered either by becoming master, a new cluster state, or a
periodic schedule, an ILM policy execution through
`maybeRunAsyncAction`, `runPolicyAfterStateChange`, or
`runPeriodicStep` throwing an exception will cause the loop the
terminate. This means that any indices that would have been processed
after the index where the exception was thrown will not be processed by
ILM.
For most execution this is not a problem because the actual running of
steps is protected by a try/catch that moves the index to the ERROR step
in the event of a problem. If an exception occurs prior to step
execution (for example, in fetching and parsing the current
policy/step) however, it causes the loop termination previously
mentioned.
This commit wraps the invocation of the methods specified above in a
try/catch block that provides better logging and does not bubble the
exception up.
The rollover action is now a retryable step (see #48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.
Fixes#49073
(cherry picked from commit 3ae90898121b43032ec8f3b50514d93a86e14d0f)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
# Conflicts:
# x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
When using the move-to-step API, we should reread the phase JSON from
the latest version of the ILM policy. This allows a user to move to the
same step while re-reading the policy's latest version. For example,
when changing rollover criteria.
While manually messing around with some other things I discovered that
we only reread the policy when using the retry API, not the move-to-step
API. This commit changes the move-to-step API to always read the latest
version of the policy.
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
Fixes#48122
Previously this step moved to the forcemerge step, however, if the
machine running the test was fast enough, it would execute the
forcemerge and move to the next step (`segment-count`) so the comparison
would fail. This commit changes the step to be a step that will never go
anywhere else, the terminal step.
Resolves#48761
* ILM Test asserts on the same ilm/_explain output
With the introduction of retryable steps subsequent ilm/_explain calls
can see the state of an ilm cycle move out of the error step. This test
made several assertions assuming that the cycle remains in the error
step so this commit changes the test to make one _explain call and have
all the asserts work on the same ilm state (so subsequent assumptions to
the cycle being in the error step are valid).
* Drop unused field in test.
(cherry picked from commit 44c74bb487151c886a08b27f32b13f7a72056997)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This test used an index without an alias to simulate a failure in the
`check-rollover-ready` step. However, with #48256 that step
automatically retries, meaning that the index may not always be in
the ERROR step.
This commit changes the test to use a shrink action with an invalid
number of shards so that it stays in the ERROR step.
Resolves#48767
This adds the infrastructure to be able to retry the execution of retryable
steps and makes the `check-rollover-ready` retryable as an initial step to
make the rollover action more resilient to transient errors.
(cherry picked from commit 454020ac8acb147eae97acb4ccd6fb470d1e5f48)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Un-AwaitsFix and enhance logging for testPolicyCRUD
This removes the `AwaitsFix` and increases the test logging for
`SnapshotLifecycleServiceTests.testPolicyCRUD` in an effort to track
down the cause of #44997.
* Remove unused import
The logger was erroneously using the `SnapshotLifecycleMetadata` class
for its initialization, making it hard to target packages for logging
levels since `SnapshotLifecycleMetadata` is in a different package.
This adds a guard for the SLM lifecycle and retention service that
prevents new jobs from being scheduled once the service has been
stopped. Previous if the node were shut down the service would be
stopped, but a cluster state or local master election would cause a job
to attempt to be scheduled. This could lead to an uncaught
`RejectedExecutionException`.
Resolves#47749
The format returned by the API is not always parsable with
`Instant.parse()`, so this commit adjusts to parsing those dates as
`ISO_ZONED_DATE_TIME` instead, which appears to always parse the
returned value correctly.
The failures in these tests have been remarkably difficult to track
down, in part because they will not reproduce locally. This commit
unmutes the flaky tests and increases logging, as well as introducing
some additional logging, to attempt to pin down the failures.
The cron schedule "1 2 3 4 5 ?" will run every May 4 at 03:02:01, which
may result in unnecessary test failures once a year. This commit
switches out uses of that schedule in tests for one which will never
execute (because it specifies a day which doesn't exist, Feb. 31). Also
factors the schedule out to a constant to make the intent clearer.
Due to a bug, GETing a snapshot can cause a RespositoryException to be
thrown. This error is transient and should be retried, rather than
causing the test to fail. This commit converts those
RepositoryExceptions into AssertionErrors so that they will be retried
in code wrapped in assertBusy.
When checking for the existence of a document in the ILM/CCR integration
tests, `assertDocumentExists` makes an HTTP request and checks the
response code. However, if the repsonse code is not successful, the call
will throw a `ResponseException`. `assertDocumentExists` is often called
inside an `assertBusy`, and wrapping the `ResponseException` in an
`AssertionError` will allow the `assertBusy` to retry.
In particular, this fixes an issue with `testCCRUnfollowDuringSnapshot`
where the index in question may still be closed when the document is
requested.
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.
This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__
Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
This test could fail for two reasons, both should be fixed by this PR:
1) It hit a timeout for an `assertBusy`. This commit increases the
timeout for that `assertBusy`.
2) The snapshot that was supposed to be blocked could, in fact, be
successful. This is because a previous snapshot had been successfully
been taken, and no new data had been added between the two snapshots.
This means that no new segment files needed to be written for the new
snapshot, so the block on data files was never triggered. This commit
changes two things: First, it indexes some new data before taking the
second snapshot (the one that needs to be blocked), and second,
checks to ensure that the block is actually hit before continuing
with the test.
The after snapshot action is interfering with SLM deleting snapshots
here it seems, causing concurrent delete exceptions.
Since these tests are now test-scoped there is no reason to run
snapshot deletes after each test so we can remove them to avoid this issue.
Closes#47937
We were not closing repositories on Node shutdown.
In production, this has little effect but in tests
shutting down a node using `MockRepository` and is
currently stuck in a simulated blocked-IO situation
will only unblock when the node's threadpool is
interrupted. This might in some edge cases (many
snapshot threads and some CI slowness) result
in the execution taking longer than 5s to release
all the shard stores and thus we fail the assertion
about unreleased shard stores in the internal test cluster.
Regardless of tests, I think we should close repositories
and release resources associated with them when closing
a node and not just when removing a repository from the CS
with running nodes as this behavior is really unexpected.
Fixes#47689
* Add SLM support to xpack usage and info APIs
This is a backport of #48096
This adds the missing xpack usage and info information into the
`/_xpack` and `/_xpack/usage` APIs. The output now looks like:
```
GET /_xpack/usage
{
...
"slm" : {
"available" : true,
"enabled" : true,
"policy_count" : 1,
"policy_stats" : {
"retention_runs" : 0,
...
}
}
```
and
```
GET /_xpack
{
...
"features" : {
...
"slm" : {
"available" : true,
"enabled" : true
},
...
}
}
```
Relates to #43663
* Fix missing license
Currently, partial snapshots will eventually build up unless they are
manually deleted. Partial snapshots may be useful if there is not a more
recent successful snapshot, but should eventually be deleted if they are
no longer useful.
With this change, partial snapshots are deleted using the following
strategy: PARTIAL snapshots will be kept until the configured
expire_after period has passed, if present, and then be deleted. If
there is no configured expire_after in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Partial snapshots are not counted towards either min_count or
max_count.
One of the tests in this suit stops a master node,
plus we're doing other node starts in this suit.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834Closes#47834
* Separate SLM stop/start/status API from ILM
This separates a start/stop/status API for SLM from being tied to ILM's
operation mode. These APIs look like:
```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```
This allows administrators to have fine-grained control over preventing
periodic snapshots and deletions while performing cluster maintenance.
Relates to #43663
* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test
Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.
With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
Previously when retrieving an SLM policy it would always return a 200
with `{}` in the body, even if the policy did not exist. This changes
that behavior to throw an error (similar to our other APIs) if a
policy doesn't exist.
This also adds a basic CRUD yml test for the behavior.
Resolves#47664
An index with an ILM policy that has a rollover action in one of the
phases was rolled over when the ILM conditions dictated regardless if
it was already rolled over (eg. manually after modifying an index
template in order to force the creation of a new index that uses the new
mappings).
This changes this behaviour and has ILM check if the index it's about to
roll has not been rolled over in the meantime.
(cherry picked from commit 37d6106feeb9f9369519117c88a9e7e30f3ac797)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves#46508
Relates to #43663
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
Drop the usage of `SimpleDateFormat` and use the `DateFormatter` instead
(cherry picked from commit 7cf509a7a11ecf6c40c44c18e8f03b8e81fcd1c2)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
* ILM: parse origination date from index name (#46755)
Introduce the `index.lifecycle.parse_origination_date` setting that
indicates if the origination date should be parsed from the index name.
If set to true an index which doesn't match the expected format (namely
`indexName-{dateFormat}-optional_digits` will fail before being created.
The origination date will be parsed when initialising a lifecycle for an
index and it will be set as the `index.lifecycle.origination_date` for
that index.
A user set value for `index.lifecycle.origination_date` will always
override a possible parsable date from the index name.
(cherry picked from commit c363d27f0210733dad0c307d54fa224a92ddb569)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Drop usage of Map.of to be java 8 compliant
* Wait for snapshot completion in SLM snapshot invocation
This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.
For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.
Relates to #38461 and #43663
Using arrays of objects with embedded IDs is preferred for new APIs over
using entity IDs as JSON keys. This commit changes the SLM stats API to
use the preferred format.
* [ILM] Add date setting to calculate index age
Add the `index.lifecycle.origination_date` to allow users to configure a
custom date that'll be used to calculate the index age for the phase
transmissions (as opposed to the default index creation date).
This could be useful for users to create an index with an "older"
origination date when indexing old data.
Relates to #42449.
* [ILM] Don't override creation date on policy init
The initial approach we took was to override the lifecycle creation date
if the `index.lifecycle.origination_date` setting was set. This had the
disadvantage of the user not being able to update the `origination_date`
anymore once set.
This commit changes the way we makes use of the
`index.lifecycle.origination_date` setting by checking its value when
we calculate the index age (ie. at "read time") and, in case it's not
set, default to the index creation date.
* Make origination date setting index scope dynamic
* Document orignation date setting in ilm settings
(cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.
Resolves#46508
* Add retention to Snapshot Lifecycle Management (#46407)
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.
An example policy would look like:
```
PUT /_slm/policy/snapshot-every-day
{
"schedule": "0 30 2 * * ?",
"name": "<production-snap-{now/d}>",
"repository": "my-s3-repository",
"config": {
"indices": ["foo-*", "important"]
},
// Newly configured retention options
"retention": {
// Snapshots should be deleted after 14 days
"expire_after": "14d",
// Keep a maximum of thirty snapshots
"max_count": 30,
// Keep a minimum of the four most recent snapshots
"min_count": 4
}
}
```
SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
Included in this work is a new SLM stats API endpoint available through
``` json
GET /_slm/stats
```
That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.
* Add base framework for snapshot retention (#43605)
* Add base framework for snapshot retention
This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.
Relates to #38461
* Remove extraneous 'public'
* Use a local var instead of reading class var repeatedly
* Add SnapshotRetentionConfiguration for retention configuration (#43777)
* Add SnapshotRetentionConfiguration for retention configuration
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.
Relates to #43663
* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)
* Implement SnapshotRetentionTask's snapshot filtering and deletion
This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.
Relates to #43663
* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
* Allow empty/missing SLM retention configuration (#45018)
Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.
Relates to #43663
* Add min_count and max_count as SLM retention predicates (#44926)
This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.
These options are optional and one, two, or all three can be specified
in an SLM policy.
Relates to #43663
* Time-bound deletion of snapshots in retention delete function (#45065)
* Time-bound deletion of snapshots in retention delete function
With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.
Relates to #43663
* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
* Add SLM metrics gathering and endpoint (#45362)
* Add SLM metrics gathering and endpoint
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:
```
GET /_slm/stats
{
"retention_runs" : 13,
"retention_failed" : 0,
"retention_timed_out" : 0,
"retention_deletion_time" : "1.4s",
"retention_deletion_time_millis" : 1404,
"policy_metrics" : {
"daily-snapshots2" : {
"snapshots_taken" : 7,
"snapshots_failed" : 0,
"snapshots_deleted" : 6,
"snapshot_deletion_failures" : 0
},
"daily-snapshots" : {
"snapshots_taken" : 12,
"snapshots_failed" : 0,
"snapshots_deleted" : 12,
"snapshot_deletion_failures" : 6
}
},
"total_snapshots_taken" : 19,
"total_snapshots_failed" : 0,
"total_snapshots_deleted" : 18,
"total_snapshot_deletion_failures" : 6
}
```
This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.
Relates to #43663
* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
* Record history of SLM retention actions (#45513)
This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.
* Retry SLM retention after currently running snapshot completes (#45802)
* Retry SLM retention after currently running snapshot completes
This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.
Relates to #43663
* Increase timeout waiting for snapshot completion
* Apply patch
From 2374316f0d.patch
* Rename test variables
* [TEST] Be less strict for stats checking
* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)
This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.
Relates to #43663
* Check all actions preventing snapshot delete during retention (#45992)
* Check all actions preventing snapshot delete during retention run
Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:
- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running
This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.
Relates to #43663
* Add unit test for okayToDeleteSnapshots
* Fix bug where SLM retention task would be scheduled on every node
* Enhance test logging
* Ignore if snapshot is already deleted
* Missing import
* Fix SnapshotRetentionServiceTests
* Expose SLM policy stats in get SLM policy API (#45989)
This also adds support for the SLM stats endpoint to the high level rest client.
Retrieving a policy now looks like:
```json
{
"daily-snapshots" : {
"version": 1,
"modified_date": "2019-04-23T01:30:00.000Z",
"modified_date_millis": 1556048137314,
"policy" : {
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["data-*", "important"],
"ignore_unavailable": false,
"include_global_state": false
},
"retention": {}
},
"stats": {
"snapshots_taken": 0,
"snapshots_failed": 0,
"snapshots_deleted": 0,
"snapshot_deletion_failures": 0
},
"next_execution": "2019-04-24T01:30:00.000Z",
"next_execution_millis": 1556048160000
}
}
```
Relates to #43663
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.
Relates to #43663Resolves#46205
* Add error logging when exceptions are thrown
* Update serialization versions
* Fix type inference
* Use non-Cancellable HLRC return value
* Fix Client mocking in test
* Fix SLMSnapshotBlockingIntegTests for 7.x branch
* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval
* Add serialization guards for SnapshotLifecyclePolicy
This commit adds the `rollover_alias` setting required for ILM to work
correctly to the SLM history index template and adds assertions to the
SLM integration tests to ensure that it works correctly.
Following our own guidelines, SLM should use rollover instead of purely
time-based indices to keep shard counts low. This commit implements lazy
index creation for SLM's history indices, indexing via an alias, and
rollover in the built-in ILM policy.
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes#45594
Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM
test cases, and optionally wait for all snapshots to be deleted.
Waiting for all snapshots to be deleted, rather than failing if any are
in progress, is necessary for tests which use SLM policies because SLM
policies may be in the process of executing when the test ends.
* Restrict which tasks can use testclusters
This PR fixes a problem between the interaction of test-clusters and
build cache.
Before this any task could have used a cluster without tracking it as
input.
With this change a new interface is introduced to track the tasks that
can use clusters and we do consider the cluster as input for all of
them.
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)
As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.
* Fix one more import
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:
- `only_managed`, which causes the response to only contain indices
which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
ILM error state
"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
This commit renames the ILM package from indexlifecycle to ilm. We have
all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. This
commit makes that change.
This commit renames the SLM package from snapshotlifecycle to slm. We
have all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. For SLM,
let's use slm for all of these including the package name from the
beginning.
* Allow empty configuration for SLM policies
When putting or updating a snapshot lifecycle policy it was not possible
to elide the `config` map. This commit makes the configuration optional,
the same way that it is when taking a snapshot.
Relates to #38461
* Add Objects.requireNonNull for required parts of the policy
* Expose index age in ILM explain output
This adds the index's age to the ILM explain output, for example:
```
{
"indices" : {
"ilm-000001" : {
"index" : "ilm-000001",
"managed" : true,
"policy" : "full-lifecycle",
"lifecycle_date" : "2019-07-16T19:48:22.294Z",
"lifecycle_date_millis" : 1563306502294,
"age" : "1.34m",
"phase" : "hot",
"phase_time" : "2019-07-16T19:48:22.487Z",
... etc ...
}
}
}
```
This age can be used to tell when ILM will transition the index to the
next phase, based on that phase's `min_age`.
Resolves#38988
* Expose age in getters and in HLRC
This commit converts all remaining ActionType response classes to
writeable in xpack core. It also converts a few from server which were
used by xpack core.
relates #34389
Today we have an annotation for controlling logging levels in
tests. This annotation serves two purposes, one is to control the
logging level used in tests, when such control is needed to impact and
assert the behavior of loggers in tests. The other use is when a test is
failing and additional logging is needed. This commit separates these
two concerns into separate annotations.
The primary motivation for this is that we have a history of leaving
behind the annotation for the purpose of investigating test failures
long after the test failure is resolved. The accumulation of these stale
logging annotations has led to excessive disk consumption. Having
recently cleaned this up, we would like to avoid falling into this state
again. To do this, we are adding a link to the test failure under
investigation to the annotation when used for the purpose of
investigating test failures. We will add tooling to inspect these
annotations, in the same way that we have tooling on awaits fix
annotations. This will enable us to report on the use of these
annotations, and report when stale uses of the annotation exist.
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.
relates #34389
* Add Snapshot Lifecycle Management (#43934)
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
* Fix compilation for backport
* Fix get snapshot output parsing in test
* [DOCS] Add redirects for removed autogen anchors (#44380)
* Switch <tt>...</tt> in javadocs for {@code ...}
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.
relates #34389
Action is a class that encapsulates meta information about an action
that allows it to be called remotely, specifically the action name and
response type. With recent refactoring, the action class can now be
constructed as a static constant, instead of needing to create a
subclass. This makes the old pattern of creating a singleton INSTANCE
both misnamed and lacking a common placement.
This commit renames Action to ActionType, thus allowing the old INSTANCE
naming pattern to be TYPE on the transport action itself. ActionType
also conveys that this class is also not the action itself, although
this change does not rename any concrete classes as those will be
removed organically as they are converted to TYPE constants.
relates #34389
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
* Narrow period of Shrink action in which ILM prevents stopping
Prior to this change, we would prevent stopping of ILM if the index was
anywhere in the shrink action. This commit changes
`IndexLifecycleService` to allow stopping when in any of the innocuous
steps during shrink. This changes ILM only to prevent stopping if
absolutely necessary.
Resolves#43253
* Rename variable for ignore actions -> ignore steps
* Fix comment
* Factor test out to test *all* stoppable steps
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
This commit adds functionality so that aliases that are manipulated on
leader indices are replicated by the shard follow tasks to the follower
indices. Note that we ignore write indices. This is due to the fact that
follower indices do not receive direct writes so the concept is not
useful.
Relates #41815
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
When the same alias points to multiple indices we can write to only one index
with `is_write_index` value `true`. The special handling in case of the put
mapping request(to resolve authorized indices) has a check on indices size
for a concrete index. If multiple indices existed then it marked the request
as unauthorized.
The check has been modified to consider write index flag and only when the
requested index matches with the one with write index alias, the alias is considered
for authorization.
Closes#40831
Usage of the ILM Move to Step API can result in some very odd
situations, and for diagnosing problems arising from these situations it
would be nice to have a record of when this API was called with what
parameters.
Also, adds a dedicated logger for TransportMoveToStepAction,
rather than using the (deprecated) inherited one.
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)
This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.
(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)
* Fix forking JVM runner
* Don't bump shadow plugin version
Prior to this PR, there is a bug in ILM which does not allow ILM to stop
if one or more indices have an index.lifecycle.name which refers to
a policy that does not exist - the operation_mode will be stuck as
STOPPING until either the policy is created or the nonexistent
policy is removed from those indices.
This change allows ILM to stop in this case and makes the logging more
clear as to why ILM is not stopping.
Previously we only set the latch countdown with `nextStep.setLatch` after the
cluster state change has already been counted down. However, it's possible
execution could have already started, causing the latch to be missed when the
`MockAsyncActionStep` is being executed.
This moves the latch setting to be before the call to
`runPolicyAfterStateChange`, which means it is always available when the
`MockAsyncActionStep` is executed.
I was able to reproduce the failure every 30-40 runs before this change. With
this change, running 2000+ times the test passes.
Resolves#40018
When trace logging is enabled we log the computed steps for a policy. This
commit makes sure that the steps that are logged are in the same order they will
be run when the policy executes. This makes it much easier to reason about the
policy if the move-to-step API is ever required in the future.
It is possible that the Unfollow API may fail to release shard history
retention leases when unfollowing, so this needs to be handled by the
ILM Unfollow action. There's nothing much that can be done automatically
about it from the follower side, so this change makes the ILM unfollow
action simply ignore those failures.
This test should no longer pass when the functionality it is intended to
test is broken, as it now indexes a number of documents and verifies
that the index is staying on the same step until after indexing and
replication of those documents is finished. This prevents the test from
passing if the leader index progresses in its lifecycle during that time.
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.
Closes#38875
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.
The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).
Given all these issues, it seems that this custom timeout mechanism
gives little benefits while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyways.
This commit removes the `maxRetryTimeout` setting and all of its usages.
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).
This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:
```
PUT index?include_type_name=false
{
"mappings": {
"_doc": {
"properties": { ... }
}
}
}
```
There is a similar issue for both 'put templates' and 'put mappings' requests
as well.
This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
This test should not pass until CCR finishes integrating shard history
retention leases. It currently sometimes passes (which is a bug in the
test), but cannot pass reliably until the linked issue is resolved.
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
There was a bug where creating a new policy would start
the ILM service, even if it was stopped. This change ensures
that there is no change to the existing operation mode
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.
This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes
the logic through the snapshot/restore process. A
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
We inject an Unfollow action before Shrink because the Shrink action
cannot be safely used on a following index, as it may not be fully
caught up with the leader index before the "original" following index is
deleted and replaced with a non-following Shrunken index. The Unfollow
action will verify that 1) the index is marked as "complete", and 2) all
operations up to this point have been replicated from the leader to the
follower before explicitly disconnecting the follower from the leader.
Injecting an Unfollow action before the Rollover action is done mainly
as a convenience: This allow users to use the same lifecycle policy on
both the leader and follower cluster without having to explictly modify
the policy to unfollow the index, while doing what we expect users to
want in most cases.