Commit Graph

174 Commits

Author SHA1 Message Date
Andrei Dan c0406f78b7
ILM add cluster update timeout on step retry (#54878) (#55022)
This commits adds a timeout when moving ILM back on to a failed step. In
case the master is struggling with processing the cluster update requests
these ones will expire (as we'll send them again anyway on the next ILM
loop run)

ILM more descriptive source messages for cluster updates

Use the configured ILM step master timeout setting

(cherry picked from commit ff6c5ed16616eadfcddd9c95317d370f0d126583)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-04-11 10:13:31 +01:00
Andrei Dan b8df265b42
[7.x] ILM use Priority.IMMEDIATE for stop ILM cluster update (#54909) (#55018)
* ILM use Priority.IMMEDIATE for stop ILM cluster update (#54909)

This changes the priority of the cluster state update that stops ILM
altogether to `IMMEDIATE`. We've chosen to change this as it can be useful to
temporarily stop ILM if a cluster is overwhelmed, but a `NORMAL`
priority can see the "stop ILM update" not make it up the tasks queue.

On the same note, we're keeping the `start ILM` cluster update priority
to `NORMAL` on purpose such that we only start `ILM` if the cluster can
handle it.

(cherry picked from commit d67df3a7cd2a8619c2c9efac4dde3ba83271f2fa)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-04-11 10:12:35 +01:00
Tanguy Leroux 4d36917e52
Merge feature/searchable-snapshots branch into 7.x (#54803) (#54825)
This is a backport of #54803 for 7.x.

This pull request cherry picks the squashed commit from #54803 with the additional commits:

    6f50c92 which adjusts master code to 7.x
    a114549 to mute a failing ILM test (#54818)
    48cbca1 and 50186b2 that cleans up and fixes the previous test
    aae12bb that adds a missing feature flag (#54861)
    6f330e3 that adds missing serialization bits (#54864)
    bf72c02 that adjust the version in YAML tests
    a51955f that adds some plumbing for the transport client used in integration tests

Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
2020-04-07 13:28:53 +02:00
Armin Braun 1039cae2cc
Fix Repository Consistency TODOs from SLM Tests (#54767) (#54860)
These TODOs don't apply any longer with the repository generation
now being tracked consistently so we can remove the workarounds.
2020-04-07 09:27:50 +02:00
Jason Tedor 63e5f2b765
Rename META_DATA to METADATA
This is a follow up to a previous commit that renamed MetaData to
Metadata in all of the places. In that commit in master, we renamed
META_DATA to METADATA, but lost this on the backport. This commit
addresses that.
2020-03-31 17:30:51 -04:00
Jason Tedor 5fcda57b37
Rename MetaData to Metadata in all of the places (#54519)
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
2020-03-31 17:24:38 -04:00
Martijn van Groningen 4b4fbc160d
Refactor AliasOrIndex abstraction. (#54394)
Backport of #53982

In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams,
the abstraction needs to be made more flexible, because currently it really can be only
an alias or an index.

* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced a `IndexAbstraction.Type` enum to indicate what a `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.

Relates to #53100
2020-03-30 10:12:16 +02:00
Gordon Brown 0d30b48613
Disallow negative TimeValues (#53913)
This commit causes negative TimeValues, other than -1 which is sometimes used as
a sentinel value, to be rejected during parsing.

Also introduces a hack to allow ILM to load policies which were written to the
cluster state with a negative min_age, treating those values as 0, which should
match the behavior of prior versions.
2020-03-26 13:30:35 -06:00
Gordon Brown 880cc3ca7e
Hide I/SLM history aliases (#53564)
This commit adjusts the aliases used for the ILM and SLM history indices
to be hidden aliases.

Also tweaks the configuration of the `IndexTemplateRegistry`s used by
these history system to only upgrade the template from the master node,
as documents are indexed from the master node, so the template version
should only be upgraded from the master node.
2020-03-16 13:07:26 -06:00
Jay Modi af36665b08
Deprecate the logstash enabled setting (#53487)
The setting, `xpack.logstash.enabled`, exists to enable or disable the
logstash extensions found within x-pack. In practice, this setting had
no effect on the functionality of the extension. Given this, the
setting is now deprecated in preparation for removal.

Backport of #53367
2020-03-12 10:18:39 -06:00
Przemko Robakowski 847ac9c7d7
Fix null config in SnapshotLifecyclePolicy.toRequest (#53328) (#53355)
This avoids NPE when executing SLM policy when no config was provided.

Related to #44465

Closes #53171

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-03-10 20:44:30 +01:00
Przemko Robakowski f075d70cf8
[7.x] Avoid race condition in ILMHistorySotre (#53039) (#53094)
* Avoid race condition in ILMHistorySotre (#53039)

* Avoid race condition in ILMHistorySotre

This change modifies ILMHistoryStore to always apply correct settings and mappings,
even if template is deleted and not yet recreated. This ensures that ILM history index
is correctly managed by ILM and also fixes flaky history tests that were prone to
triggenring this race.

This commit also refactors and simplifies ILM history tests.

Closes #50353 and #52853

* Review comment

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

* fixed tests

* backport #53306

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-03-09 22:24:15 +01:00
Jay Modi f3f6ff97ee
Single instance of the IndexNameExpressionResolver (#52604)
This commit modifies the codebase so that our production code uses a
single instance of the IndexNameExpressionResolver class. This change
is being made in preparation for allowing name expression resolution
to be augmented by a plugin.

In order to remove some instances of IndexNameExpressionResolver, the
single instance is added as a parameter of Plugin#createComponents and
PersistentTaskPlugin#getPersistentTasksExecutor.

Backport of #52596
2020-02-21 07:50:02 -07:00
Armin Braun 4bb780bc37
Refactor Inflexible Snapshot Repository BwC (#52365) (#52557)
* Refactor Inflexible Snapshot Repository BwC (#52365)

Transport the version to use for  a snapshot instead of whether to use shard generations in the snapshots in progress entry. This allows making upcoming repository metadata changes in a flexible manner in an analogous way to how we handle serialization BwC elsewhere.
Also, exposing the version at the repository API level will make it easier to do BwC relevant changes in derived repositories like source only or encrypted.
2020-02-21 09:14:34 +01:00
Lee Hinman 22cf1140eb
[7.x] Add additional logging to SLM retention task (#52343) (#52535)
This commit adds more logging to the actions that the SLM retention task does. It will help in the
event that we need to diagnose any additional issues or problems while running retention.
2020-02-19 13:15:01 -07:00
Andrei Dan bd3a70db4e
ILM fix the init step to actually be retryable (#52076) (#52375)
We marked the `init` ILM step as retryable but our test used `waitUntil`
without an assert so we didn’t catch the fact that we were not actually
able to retry this step as our ILM state didn’t contain any information
about the policy execution (as we were in the process of initialising
it).

This commit manually sets the current step to `init` when we’re moving
the ilm policy into the ERROR step (this enables us to successfully
move to the error step and later retry the step)

* ShrunkenIndexCheckStep: Use correct logger

(cherry picked from commit f78d4b3d91345a2a8fc0f48b90dd66c9959bd7ff)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-02-15 18:42:05 +00:00
Gordon Brown d48ce12920
Convert ILM and SLM histories into hidden indices (#51456)
Modifies SLM's and ILM's history indices to be hidden indices for added
protection against accidental querying and deletion, and improves
IndexTemplateRegistry to handle upgrading index templates.

Also modifies the REST test cleanup to delete hidden indices.
2020-02-11 14:18:55 -07:00
Jay Modi 3edadfefd0 RestHandlers declare handled routes (#52123)
This commit changes how RestHandlers are registered with the
RestController so that a RestHandler no longer needs to register itself
with the RestController. Instead the RestHandler interface has new
methods which when called provide information about the routes
(method and path combinations) that are handled by the handler
including any deprecated and/or replaced combinations.

This change also makes the publication of RestHandlers safe since they
no longer publish a reference to themselves within their constructors.

Closes #51622

Co-authored-by: Jason Tedor <jason@tedor.me>

Backport of #51950
2020-02-09 22:48:32 -07:00
Lee Hinman 0be61a3662
[7.x] Adding best_compression (#49974) (763480ee) (#51819)
* Adding best_compression (#49974)

This commit adds a `codec` parameter to the ILM `forcemerge` action. When setting the codec to `best_compression` ILM will close the index, then update the codec setting, re-open the index, and finally perform a force merge.

* Fix ForceMergeAction toSteps construction (#51825)

There was a duplicate force merge step and the test continued to fail. This commit clarifies the
`toStep` method and changes the `assertBestCompression` method for better readability.

Resolves #51822

* Update version constants

Co-authored-by: Sivagurunathan Velayutham <sivadeva.93@gmail.com>
2020-02-04 14:15:43 -07:00
Andrei Dan 20f47b14b0
Fix SnapshotLifecycleServiceTests.testPolicyCRUD (#51653) (#51755)
(cherry picked from commit 8f9a87fa576a8a1c6ea3efb29bf1296d50d89ace)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-01-31 18:17:38 +00:00
Lee Hinman deefc85d60
[7.x] Stop policy on last PhaseCompleteStep instead of Termina… (#51758)
Currently when an ILM policy finishes its execution, the index moves into the `TerminalPolicyStep`,
denoted by a completed/completed/completed phase/action/step lifecycle execution state.

This commit changes the behavior so that the index lifecycle execution state halts at the last
configured phase's `PhaseCompleteStep`, so for instance, if an index were configured with a policy
containing a `hot` and `cold` phase, the index would stop at the `cold/complete/complete`
`PhaseCompleteStep`. This allows an ILM user to update the policy to add any later phases and have
indices configured to use that policy pick up execution at the newly added "later" phase. For
example, if a `delete` phase were added to the policy specified about, the index would then move
from `cold/complete/complete` into the `delete` phase.

Relates to #48431
2020-01-31 10:36:41 -07:00
Gordon Brown 10c8179351
Use exclusions list instead of fake system indices (#51586)
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
2020-01-30 16:31:27 -07:00
Gordon Brown 89c2834b24
Deprecate creation of dot-prefixed index names except for hidden and system indices (#49959)
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.

This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
2020-01-28 10:01:16 -07:00
Lee Hinman bdb8b6aa0d
[7.x] Separate aliases used for tests in TimeSeriesLifecycleAc… (#51432)
* Separate aliases used for tests in TimeSeriesLifecycleActionsIT

This is related to #51375 and hopes to help illuminate why some of those tests are failing. This
commit switches the aliases used in the test to use a random alias name every time (since there were
some complaints in the tests about aliases having more than one write index). With this we hope to
determine the actual cause of the failure in the test.

This also adds additional information to the exception returned when calling move-to-step with the
incorrect current step.

* Fix rest test
2020-01-24 11:05:19 -07:00
Andrei Dan 2f7c240184
[7.x] Use ESSingleNodeTestCase instead of ESIntegTestCase (#51345) (#51346)
* Use ESSingleNodeTestCase instead of ESIntegTestCase (#51345)

(cherry picked from commit abcf1c41faf05a0b0196fb06e57c3de8c3d67688)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-01-24 10:53:37 +00:00
Przemko Robakowski 84664e8d60
Expose master timeout for ILM actions (#51130) (#51348)
This change exposes master timeout to ILM steps through global dynamic setting.
All currently implemented steps make use of this setting as well.

Closes #44136
2020-01-23 15:28:13 +01:00
Andrei Dan 421aa14972
ILM: Make UpdateSettingsStep retryable (#51235) (#51298)
This makes the UpdateSettingsStep retryable. This step updates settings needed
during the execution of ILM actions (mark indexes as read-only, change
allocation configurations, mark indexing complete, etc)

As the index updates are idempotent in nature (PUT requests and are applied only
if the values have changed) and the settings values are seldom user-configurable
(aside from the allocate action) the testing for this change goes along the
lines of artificially simulating a setting update failure on a particular value
update, which is followed by a successful step execution (a retry) in an
environment outside of ILM (the step executions are triggered manually).

(cherry picked from commit 8391b0aba469f39532bfc2796b76148167dc0289)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-01-22 11:02:26 +00:00
Andrei Dan 123266714b
ILM wait for active shards on rolled index in a separate step (#50718) (#51296)
After we rollover the index we wait for the configured number of shards for the
rolled index to become active (based on the index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case,
for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not
being able to spring to life due to cluster nodes reconfiguration and others
and, the RolloverStep might not complete successfully due to this inherent
transient situation, albeit the rolled index having been created.

(cherry picked from commit 457a92fb4c68c55976cc3c3e2f00a053dd2eac70)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-01-22 11:01:52 +00:00
Przemko Robakowski a18736b46d
[7.x] ILM action to wait for SLM policy execution (#50454) (#50943)
* ILM action to wait for SLM policy execution (#50454)

This change add new ILM action to wait for SLM policy execution to ensure that index has snapshot before deletion.

Closes #45067

* Fix flaky TimeSeriesLifecycleActionsIT#testWaitForSnapshot test

This change adds some randomness and cleanup step to TimeSeriesLifecycleActionsIT#testWaitForSnapshot and testWaitForSnapshotSlmExecutedBefore tests in attempt to make them stable.

Reletes to #50781

* Formatting changes

* Longer timeout

* Fix Map.of in Java8

* Unused import removed
2020-01-14 01:34:33 +01:00
Lee Hinman 91689e793d
[7.x] Refresh cached phase policy definition if possible on ne… (#50941)
* Refresh cached phase policy definition if possible on new policy

There are some cases when updating a policy does not change the
structure in a significant way. In these cases, we can reread the
policy definition for any indices using the updated policy.

This commit adds this refreshing to the `TransportPutLifecycleAction`
to allow this. It allows us to do things like change the configuration
values for a particular step, even when on that step (for example,
changing the rollover criteria while on the `check-rollover-ready` step).

There are more cases where the phase definition can be reread that just
the ones checked here (for example, removing an action that has already
been passed), and those will be added in subsequent work.

Relates to #48431
2020-01-13 14:31:41 -07:00
Lee Hinman 63472d30c7
[7.x] Fix SLM check for restore in progress (#50868) (#50876)
* Fix SLM check for restore in progress (#50868)

* Fix SLM check for restore in progress

This commit fixes the check in SLM where the `RestoreInProgress`
metadata was checked for existence. Rather than check existence we
should instead check the `isEmpty` method. Prior to this, a successful
restore for a repository that used SLM retention would prevent SLM
retention from running in subsequent invocations, due to SLM thinking
that a restore was still running.

* Fix 7.x-isms
2020-01-10 14:27:55 -07:00
Nhat Nguyen 90e66a7b97 Mute testPolicyCRUD
Tracked at #44997
2020-01-08 13:25:40 -05:00
Andrei Dan 3c971f2911
ILM retryable async action steps (#50522) (#50591)
This adds support for retrying AsyncActionSteps by triggering the async
step after ILM was moved back on the failed step (the async step we'll
be attempting to run after the cluster state reflects ILM being moved
back on the failed step).

This also marks the RolloverStep as retryable and adds an integration
test where the RolloverStep is failing to execute as the rolled over
index already exists to test that the async action RolloverStep is
retried until the rolled over index is deleted.

(cherry picked from commit 8bee5f4cb58a1242cc2ef4bc0317dae6c8be49d3)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-01-03 16:19:58 +02:00
Lee Hinman c3c9ccf61f
[7.x] Add ILM histore store index (#50287) (#50345)
* Add ILM histore store index (#50287)

* Add ILM histore store index

This commit adds an ILM history store that tracks the lifecycle
execution state as an index progresses through its ILM policy. ILM
history documents store output similar to what the ILM explain API
returns.

An example document with ALL fields (not all documents will have all
fields) would look like:

```json
{
  "@timestamp": 1203012389,
  "policy": "my-ilm-policy",
  "index": "index-2019.1.1-000023",
  "index_age":123120,
  "success": true,
  "state": {
    "phase": "warm",
    "action": "allocate",
    "step": "ERROR",
    "failed_step": "update-settings",
    "is_auto-retryable_error": true,
    "creation_date": 12389012039,
    "phase_time": 12908389120,
    "action_time": 1283901209,
    "step_time": 123904107140,
    "phase_definition": "{\"policy\":\"ilm-history-ilm-policy\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"rollover\":{\"max_size\":\"50gb\",\"max_age\":\"30d\"}}},\"version\":1,\"modified_date_in_millis\":1576517253463}",
    "step_info": "{... etc step info here as json ...}"
  },
  "error_details": "java.lang.RuntimeException: etc\n\tcaused by:etc etc etc full stacktrace"
}
```

These documents go into the `ilm-history-1-00000N` index to provide an
audit trail of the operations ILM has performed.

This history storage is enabled by default but can be disabled by setting
`index.lifecycle.history_index_enabled` to `false.`

Resolves #49180

* Make ILMHistoryStore.putAsync truly async (#50403)

This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.

This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.

Resolves #50353
2019-12-20 12:33:36 -07:00
David Turner 285eacd267
Use more specific loggers in subclasses of TMNA (#50076)
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.

Relates #50056.
Partial backport of #46431 to 7.x.
2019-12-11 15:07:47 +00:00
Lee Hinman 8205cdd423
[7.x] Refactor IndexLifecycleRunner to split state modificatio… (#49936)
This commit refactors the `IndexLifecycleRunner` to split out and
consolidate the number of methods that change state from within ILM. It
adds a new class `IndexLifecycleTransition` that contains a number of
static methods used to modify ILM's state. These methods all return new
cluster states rather than making changes themselves (they can be
thought of as helpers for modifying ILM state).

Rather than having multiple ways to move an index to a particular step
(like `moveClusterStateToStep`, `moveClusterStateToNextStep`,
`moveClusterStateToPreviouslyFailedStep`, etc (there are others)) this
now consolidates those into three with (hopefully) useful names:

- `moveClusterStateToStep`
- `moveClusterStateToErrorStep`
- `moveClusterStateToPreviouslyFailedStep`

In the move, I was also able to consolidate duplicate or redundant
arguments to these functions. Prior to this commit there were many calls
that provided duplicate information (both `IndexMetaData` and
`LifecycleExecutionState` for example) where the duplicate argument
could be derived from a previous argument with no problems.

With this split, `IndexLifecycleRunner` now contains the methods used to
actually run steps as well as the methods that kick off cluster state
updates for state transitions. `IndexLifecycleTransition` contains only
the helpers for constructing new states from given scenarios.

This also adds Javadocs to all methods in both `IndexLifecycleRunner`
and `IndexLifecycleTransition` (this accounts for almost all of the
increase in code lines for this commit). It also makes all methods be as
restrictive in visibility, to limit the scope of where they are used.

This refactoring is part of work towards capturing actions and
transitions that ILM makes, by consolidating and simplifying the places
we make state changes, it will make adding operation auditing easier.
2019-12-06 12:55:16 -07:00
Armin Braun af0f97d50a
Fix SLMSnapshotBlockingIntegTests.testSnapshotInProgress (#49533) (#49542)
This test must check for state `SUCCESS` as well. `SUCESS` in
`SnapshotsInProgress` means "all data nodes finished snapshotting sucessfully but master must still finalize the snapshot in the repo".
`SUCESS` does not mean that the snapshot is actually fully finished in this object.

You can easily reporduce the scenario in #49303 that has an in-progress snapshot in `SUCCESS` state
by waiting 20s before running the busy assert loop on the snapshot status so that all steps but the blocked
finalization can finish.

Closes #49303
2019-11-25 13:31:45 +01:00
Andrei Dan 010c3de47e
Slm set operation mode to RUNNING on first run (#49236) (#49425)
* SLM set the operation mode to RUNNING on first run

Set the SLM operation mode to RUNNING when setting the first SLM lifecycle
policy. Historically, SLM was not decoupled from ILM but now they are
independent components. Setting the SLM operation mode to what the ILM running
mode was when we set the first SLM lifecycle policy was a remain from those
times.

* SLM update package info

* SLM suppress unusued warning

* SLM use logger for the correct class

* SLM Add integration test for operation mode

* Use ESSingleNodeTestCase instead of ESIntegTestCase

(cherry picked from commit 4ad3d93f89d03bf9a25685a990d1a439f33ce0e6)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-11-21 11:41:32 +00:00
Jay Modi eed4cd25eb
ThreadPool and ThreadContext are not closeable (#43249) (#49273)
This commit changes the ThreadContext to just use a regular ThreadLocal
over the lucene CloseableThreadLocal. The CloseableThreadLocal solves
issues with ThreadLocals that are no longer needed during runtime but
in the case of the ThreadContext, we need it for the runtime of the
node and it is typically not closed until the node closes, so we miss
out on the benefits that this class provides.

Additionally by removing the close logic, we simplify code in other
places that deal with exceptions and tracking to see if it happens when
the node is closing.

Closes #42577
2019-11-19 13:15:16 -07:00
Andrei Dan 19780e20ba
Handle failure to retrieve ILM policy step better (#49193) (#49316)
This commit wraps the calls to retrieve the current step in a try/catch
so that the exception does not bubble up. Instead, step info is added
containing the exception to the existing step.

Semi-related to #49128

(cherry picked from commit 72530f8a7f40ae1fca3704effb38cf92daf29057)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-11-19 17:14:46 +00:00
Armin Braun 25cc8e3663
Fix RepoCleanup not Removed on Master-Failover (#49217) (#49239)
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.

This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.

This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.

Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
2019-11-18 16:44:09 +01:00
Lee Hinman 680436dd0d
[7.x] Don't halt policy execution on policy trigger exception… (#49171)
When triggered either by becoming master, a new cluster state, or a
periodic schedule, an ILM policy execution through
`maybeRunAsyncAction`, `runPolicyAfterStateChange`, or
`runPeriodicStep` throwing an exception will cause the loop the
terminate. This means that any indices that would have been processed
after the index where the exception was thrown will not be processed by
ILM.

For most execution this is not a problem because the actual running of
steps is protected by a try/catch that moves the index to the ERROR step
in the event of a problem. If an exception occurs prior to step
execution (for example, in fetching and parsing the current
policy/step) however, it causes the loop termination previously
mentioned.

This commit wraps the invocation of the methods specified above in a
try/catch block that provides better logging and does not bubble the
exception up.
2019-11-15 09:22:37 -07:00
Lee Hinman 5eb37c29fe
[7.x] Re-read policy phase JSON when using ILM's move-to-step… (#49011)
When using the move-to-step API, we should reread the phase JSON from
the latest version of the ILM policy. This allows a user to move to the
same step while re-reading the policy's latest version. For example,
when changing rollover criteria.

While manually messing around with some other things I discovered that
we only reread the policy when using the retry API, not the move-to-step
API. This commit changes the move-to-step API to always read the latest
version of the policy.
2019-11-12 19:41:06 -07:00
Armin Braun 3c20541823
Cleanup Concurrent RepositoryData Loading (#48329) (#48834)
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.

Fixes #48122
2019-11-02 20:42:29 +01:00
Andrei Dan ffe5d5417f
ILM Make the `check-rollover-ready` step retryable (#48256) (#48740)
This adds the infrastructure to be able to retry the execution of retryable
steps and makes the `check-rollover-ready` retryable as an initial step to
make the rollover action more resilient to transient errors.

(cherry picked from commit 454020ac8acb147eae97acb4ccd6fb470d1e5f48)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-10-31 11:28:55 +00:00
Lee Hinman 2d5291cf3b Un-AwaitsFix and enhance logging for testPolicyCRUD (#48719)
* Un-AwaitsFix and enhance logging for testPolicyCRUD

This removes the `AwaitsFix` and increases the test logging for
`SnapshotLifecycleServiceTests.testPolicyCRUD` in an effort to track
down the cause of #44997.

* Remove unused import
2019-10-30 17:02:57 -06:00
Lee Hinman ed2bb73de2 Fix SnapshotLifecycleService logger (#48711)
The logger was erroneously using the `SnapshotLifecycleMetadata` class
for its initialization, making it hard to target packages for logging
levels since `SnapshotLifecycleMetadata` is in a different package.
2019-10-30 13:13:50 -06:00
Lee Hinman 72a601c47f
[7.x] Don't schedule SLM jobs when services have been stopped… (#48692)
This adds a guard for the SLM lifecycle and retention service that
prevents new jobs from being scheduled once the service has been
stopped. Previous if the node were shut down the service would be
stopped, but a cluster state or local master election would cause a job
to attempt to be scheduled. This could lead to an uncaught
`RejectedExecutionException`.

Resolves #47749
2019-10-30 09:46:35 -06:00
Gordon Brown 25724c5c46
Adjust date parsing in ILM integration tests (#48648)
The format returned by the API is not always parsable with
`Instant.parse()`, so this commit adjusts to parsing those dates as
`ISO_ZONED_DATE_TIME` instead, which appears to always parse the
returned value correctly.
2019-10-29 15:44:04 -07:00
Gordon Brown cf235796c0
Use more reliable "never run" cron pattern in tests (#48608)
The cron schedule "1 2 3 4 5 ?" will run every May 4 at 03:02:01, which
may result in unnecessary test failures once a year. This commit
switches out uses of that schedule in tests for one which will never
execute (because it specifies a day which doesn't exist, Feb. 31). Also
factors the schedule out to a constant to make the intent clearer.
2019-10-29 09:33:14 -07:00