Commit Graph

174 Commits

Author SHA1 Message Date
Gordon Brown 5021410165
Retry on RepositoryException in SLM tests (#48548)
Due to a bug, GETing a snapshot can cause a RespositoryException to be
thrown. This error is transient and should be retried, rather than
causing the test to fail. This commit converts those
RepositoryExceptions into AssertionErrors so that they will be retried
in code wrapped in assertBusy.
2019-10-28 09:24:38 -07:00
Armin Braun 7215201406
Track Shard-Snapshot Index Generation at Repository Root (#48371)
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).

Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.

This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__

Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).

This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
2019-10-23 10:58:26 +01:00
Gordon Brown a2217f4a91
Fix testRetentionWhileSnapshotInProgress (#48219)
This test could fail for two reasons, both should be fixed by this PR:

1) It hit a timeout for an `assertBusy`. This commit increases the
timeout for that `assertBusy`.

2) The snapshot that was supposed to be blocked could, in fact, be
successful. This is because a previous snapshot had been successfully
been taken, and no new data had been added between the two snapshots.
This means that no new segment files needed to be written for the new
snapshot, so the block on data files was never triggered. This commit
changes two things: First, it indexes some new data before taking the
second snapshot (the one that needs to be blocked), and second,
checks to ensure that the block is actually hit before continuing
with the test.
2019-10-18 14:25:18 -06:00
Armin Braun 9bf8e1e060
Fix SLMSnapshotBlockingIntegTest (#47941) (#47963)
The after snapshot action is interfering with SLM deleting snapshots
here it seems, causing concurrent delete exceptions.
Since these tests are now test-scoped there is no reason to run
snapshot deletes after each test so we can remove them to avoid this issue.

Closes #47937
2019-10-17 08:55:56 +02:00
Armin Braun 0ca7cc1848
Safely Close Repositories on Node Shutdown (#48020) (#48107)
We were not closing repositories on Node shutdown.
In production, this has little effect but in tests
shutting down a node using `MockRepository` and is
currently stuck in a simulated blocked-IO situation
will only unblock when the node's threadpool is
interrupted. This might in some edge cases (many
snapshot threads and some CI slowness) result
in the execution taking longer than 5s to release
all the shard stores and thus we fail the assertion
about unreleased shard stores in the internal test cluster.

Regardless of tests, I think we should close repositories
and release resources associated with them when closing
a node and not just when removing a repository from the CS
with running nodes as this behavior is really unexpected.

Fixes #47689
2019-10-17 07:55:05 +02:00
Lee Hinman 5af66d79ef
Add SLM support to xpack usage and info APIs (#48149)
* Add SLM support to xpack usage and info APIs

This is a backport of #48096

This adds the missing xpack usage and info information into the
`/_xpack` and `/_xpack/usage` APIs. The output now looks like:

```
GET /_xpack/usage
{
  ...
  "slm" : {
    "available" : true,
    "enabled" : true,
    "policy_count" : 1,
    "policy_stats" : {
      "retention_runs" : 0,
      ...
    }
  }
```

and

```
GET /_xpack
{
  ...
  "features" : {
    ...
    "slm" : {
      "available" : true,
      "enabled" : true
    },
    ...
  }
}
```

Relates to #43663

* Fix missing license
2019-10-16 21:06:27 -06:00
Gordon Brown 699d4d4c6f
Manage retention of partial snapshots in SLM (#47833)
Currently, partial snapshots will eventually build up unless they are
manually deleted. Partial snapshots may be useful if there is not a more
recent successful snapshot, but should eventually be deleted if they are
no longer useful.

With this change, partial snapshots are deleted using the following
strategy: PARTIAL snapshots will be kept until the configured
expire_after period has passed, if present, and then be deleted. If
there is no configured expire_after in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Partial snapshots are not counted towards either min_count or
max_count.
2019-10-14 10:19:57 -06:00
Nick Knize 68eaa21d77 Mute testBasicFailureRetention (#47940) 2019-10-11 14:03:46 -05:00
Armin Braun 48823b1112
Fix SLMSnapshotBlockingIntegTests (#47841) (#47863)
One of the tests in this suit stops a master node,
plus we're doing other node starts in this suit.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834

Closes #47834
2019-10-10 18:41:57 +02:00
Mark Vieira 0360a18f61
Mute test SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress
Signed-off-by: Mark Vieira <portugee@gmail.com>
(cherry picked from commit a8a7477c396554926f260d210364f009d85ae5f2)
2019-10-09 15:38:24 -07:00
Jim Ferenczi d96977202d Disable SLMSnapshotBlockingIntegTests#testSnapshotInProgress (#47775)
This test fails constantly in master and prs.

Relates #47689
2019-10-09 17:49:13 +02:00
Lee Hinman fb7abe9fa4 Separate SLM stop/start/status API from ILM (#47710)
* Separate SLM stop/start/status API from ILM

This separates a start/stop/status API for SLM from being tied to ILM's
operation mode. These APIs look like:

```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```

This allows administrators to have fine-grained control over preventing
periodic snapshots and deletions while performing cluster maintenance.

Relates to #43663

* Allow going from RUNNING to STOPPED

* Align with the OperationMode rules

* Fix slmStopping method

* Make OperationModeUpdateTask constructor private

* Wipe snapshots better in test
2019-10-08 17:21:38 -06:00
Gordon Brown a492864a9d
Manage retention of failed snapshots in SLM (#47617)
Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.

With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
2019-10-08 17:07:08 -06:00
Lee Hinman 91988c7c26 Throw error retrieving non-existent SLM policy (#47679)
Previously when retrieving an SLM policy it would always return a 200
with `{}` in the body, even if the policy did not exist. This changes
that behavior to throw an error (similar to our other APIs) if a
policy doesn't exist.

This also adds a basic CRUD yml test for the behavior.

Resolves #47664
2019-10-07 19:54:04 -06:00
Lee Hinman 2e3eb4b24e
Add API to execute SLM retention on-demand (#47405) (#47463)
* Add API to execute SLM retention on-demand (#47405)

This is a backport of #47405

This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.

This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.

This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.

Resolves #46508
Relates to #43663
2019-10-02 12:29:04 -06:00
Andrei Dan 4c909438dd
Fix OriginationDate parsing tests. (#47170) (#47200)
Drop the usage of `SimpleDateFormat` and use the `DateFormatter` instead

(cherry picked from commit 7cf509a7a11ecf6c40c44c18e8f03b8e81fcd1c2)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-09-27 13:16:45 +01:00
Gordon Brown 7ac647c365
Add support for POST requests to SLM Execute API (#47061)
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
2019-09-25 16:15:10 -06:00
Andrei Dan 27520cac3b
ILM: parse origination date from index name (#46755) (#47124)
* ILM: parse origination date from index name (#46755)

Introduce the `index.lifecycle.parse_origination_date` setting that
indicates if the origination date should be parsed from the index name.
If set to true an index which doesn't match the expected format (namely
`indexName-{dateFormat}-optional_digits` will fail before being created.
The origination date will be parsed when initialising a lifecycle for an
index and it will be set as the `index.lifecycle.origination_date` for
that index.

A user set value for `index.lifecycle.origination_date` will always
override a possible parsable date from the index name.

(cherry picked from commit c363d27f0210733dad0c307d54fa224a92ddb569)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

* Drop usage of Map.of to be java 8 compliant
2019-09-25 21:44:16 +01:00
Lee Hinman a267df30fa Wait for snapshot completion in SLM snapshot invocation (#47051)
* Wait for snapshot completion in SLM snapshot invocation

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.

Relates to #38461 and #43663
2019-09-25 14:25:22 -06:00
Lee Hinman 5ca37db60c Mute SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress
Relates to #46508
2019-09-23 17:08:09 -06:00
Lee Hinman b85468d6ea
Add node setting for disabling SLM (#46794) (#46796)
This adds the `xpack.slm.enabled` setting to allow disabling of SLM
functionality as well as its HTTP API endpoints.

Relates to #38461
2019-09-17 17:39:41 -06:00
Andrei Dan c57cca98b2
[ILM] Add date setting to calculate index age (#46561) (#46697)
* [ILM] Add date setting to calculate index age

Add the `index.lifecycle.origination_date` to allow users to configure a
custom date that'll be used to calculate the index age for the phase
transmissions (as opposed to the default index creation date).

This could be useful for users to create an index with an "older"
origination date when indexing old data.

Relates to #42449.

* [ILM] Don't override creation date on policy init

The initial approach we took was to override the lifecycle creation date
if the `index.lifecycle.origination_date` setting was set. This had the
disadvantage of the user not being able to update the `origination_date`
anymore once set.

This commit changes the way we makes use of the
`index.lifecycle.origination_date` setting by checking its value when
we calculate the index age (ie. at "read time") and, in case it's not
set, default to the index creation date.

* Make origination date setting index scope dynamic

* Document orignation date setting in ilm settings

(cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-09-16 08:50:28 +01:00
Lee Hinman 52d7b03b49 Wait for no snapshots in state in testRetentionWhileSnapshotIn… (#46573)
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.

Resolves #46508
2019-09-11 09:47:01 -06:00
Lee Hinman cdc3a260af
Add retention to Snapshot Lifecycle Management (backport of #4… (#46506)
* Add retention to Snapshot Lifecycle Management (#46407)

This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to #38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to #43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to #43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (#45018)

Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.

Relates to #43663

* Add min_count and max_count as SLM retention predicates (#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to #43663

* Time-bound deletion of snapshots in retention delete function (#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to #43663

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to #43663

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.

Relates to #43663

* Increase timeout waiting for snapshot completion

* Apply patch

From 2374316f0d.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to #43663

* Check all actions preventing snapshot delete during retention (#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to #43663

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to #43663

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown

* Update serialization versions

* Fix type inference

* Use non-Cancellable HLRC return value

* Fix Client mocking in test

* Fix SLMSnapshotBlockingIntegTests for 7.x branch

* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval

* Add serialization guards for SnapshotLifecyclePolicy
2019-09-10 09:08:09 -06:00
Lee Hinman 3d4b8e01c7
Validate SLM policy ids strictly (#45998) (#46145)
This uses strict validation for SLM policy ids, similar to what we use
for index names.

Resolves #45997
2019-09-03 09:20:02 -06:00
Gordon Brown 47b1e2b3d0
[7.x] Use rollover for SLM's history indices (#45686)
Following our own guidelines, SLM should use rollover instead of purely
time-based indices to keep shard counts low. This commit implements lazy
index creation for SLM's history indices, indexing via an alias, and
rollover in the built-in ILM policy.
2019-08-21 13:42:11 -06:00
Armin Braun a01bd6c5a3
Stop Executing SLM Policy Transport Action on Snapshot Pool (#45727) (#45748)
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes #45594
2019-08-20 19:15:36 +02:00
Gordon Brown 3f5dab99c3
Properly set origin for SLM history store client (#45515)
The origin was not set properly for the SnapshotHistoryStore client,
resulting in errors when SLM was used when security was enabled.
2019-08-13 18:23:20 -06:00
Armin Braun a9e1402189
Remove Settings from BaseRestRequest Constructor (#45418) (#45429)
* Resolving the todo, cleaning up the unused `settings` parameter
* Cleaning up some other minor dead code in affected classes
2019-08-12 05:14:45 +02:00
Lee Hinman c7ec0b8431 Include in-progress snapshot for a policy with get SLM policy… (#45245)
This commit adds the "in_progress" key to the SLM get policy API,
returning a policy that looks like:

```json
{
  "daily-snapshots" : {
    "version" : 1,
    "modified_date" : "2019-08-05T18:41:48.778Z",
    "modified_date_millis" : 1565030508778,
    "policy" : {
      "name" : "<production-snap-{now/d}>",
      "schedule" : "0 30 1 * * ?",
      "repository" : "repo",
      "config" : {
        "indices" : [
          "foo-*",
          "important"
        ],
        "ignore_unavailable" : true,
        "include_global_state" : false
      },
      "retention" : {
        "expire_after" : "10m"
      }
    },
    "last_success" : {
      "snapshot_name" : "production-snap-2019.08.05-oxctmnobqye3luim4uejhg",
      "time_string" : "2019-08-05T18:42:23.257Z",
      "time" : 1565030543257
    },
    "next_execution" : "2019-08-06T01:30:00.000Z",
    "next_execution_millis" : 1565055000000,
    "in_progress" : {
      "name" : "production-snap-2019.08.05-oxctmnobqye3luim4uejhg",
      "uuid" : "t8Idqt6JQxiZrzp0Vt7z6g",
      "state" : "STARTED",
      "start_time" : "2019-08-05T18:42:22.998Z",
      "start_time_millis" : 1565030542998
    }
  }
}
```

These are only visible while the snapshot is being taken (or failed),
since it reads from the cluster state rather than from the repository
itself.
2019-08-07 08:29:49 -06:00
David Kyle e18e9fa8c5 Mute SnapshotLifecycleServiceTests#testPolicyCRUD
Relates to https://github.com/elastic/elasticsearch/issues/44997
2019-07-30 10:36:27 +01:00
Lee Hinman 598c4e72f9
[7.x] Rename indexlifecycle to ilm and snapshotlifecycle to sl… (#44977)
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)

As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.

* Fix one more import
2019-07-29 15:51:14 -06:00
Gordon Brown d4b2d21339
Add option to filter ILM explain response (#44777)
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:

- `only_managed`, which causes the response to only contain indices
  which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
  ILM error state

"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
2019-07-26 11:57:38 -04:00
Jason Tedor e2c8f8dfa3
Rename ILM package to ilm (#44725)
This commit renames the ILM package from indexlifecycle to ilm. We have
all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. This
commit makes that change.
2019-07-23 16:46:38 +09:00
Jason Tedor 5878bde8dc
Rename SLM package to slm (#44608)
This commit renames the SLM package from snapshotlifecycle to slm. We
have all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. For SLM,
let's use slm for all of these including the package name from the
beginning.
2019-07-23 07:35:06 +09:00
Lee Hinman 3001f7941f Allow empty configuration for SLM policies (#44465)
* Allow empty configuration for SLM policies

When putting or updating a snapshot lifecycle policy it was not possible
to elide the `config` map. This commit makes the configuration optional,
the same way that it is when taking a snapshot.

Relates to #38461

* Add Objects.requireNonNull for required parts of the policy
2019-07-18 16:20:31 -06:00
Ryan Ernst 2a2686e6e7
Convert remaining ActionTypes to writeable in xpack core (#44467) (#44525)
This commit converts all remaining ActionType response classes to
writeable in xpack core. It also converts a few from server which were
used by xpack core.

relates #34389
2019-07-17 18:01:45 -07:00
Ryan Ernst 0755a13c9f
Convert AcknowledgedRequest to Writeable.Reader (#44412) (#44454)
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.

relates #34389
2019-07-17 11:17:36 -07:00
Lee Hinman fb0461ac76
[7.x] Add Snapshot Lifecycle Management (#44382)
* Add Snapshot Lifecycle Management (#43934)

* Add SnapshotLifecycleService and related CRUD APIs

This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.

This also includes the get, put, and delete APIs for these policies

Relates to #38461

* Make scheduledJobIds return an immutable set

* Use Object.equals for SnapshotLifecyclePolicy

* Remove unneeded TODO

* Implement ToXContentFragment on SnapshotLifecyclePolicyItem

* Copy contents of the scheduledJobIds

* Handle snapshot lifecycle policy updates and deletions (#40062)

(Note this is a PR against the `snapshot-lifecycle-management` feature branch)

This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.

Relates to #38461

* Take a snapshot for the policy when the SLM policy is triggered (#40383)

(This is a PR for the `snapshot-lifecycle-management` branch)

This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.

This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.

Relates to #38461

* Record most recent snapshot policy success/failure (#40619)

Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.

This is the first step toward writing history in a more permanent
fashion.

* Validate snapshot lifecycle policies (#40654)

(This is a PR against the `snapshot-lifecycle-management` branch)

With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.

Part of #38461

* Hook SLM into ILM's start and stop APIs (#40871)

(This pull request is for the `snapshot-lifecycle-management` branch)

This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.

Relates to #38461

* Add tests for SnapshotLifecyclePolicyItem (#40912)

Adds serialization tests for SnapshotLifecyclePolicyItem.

* Fix improper import in build.gradle after master merge

* Add human readable version of modified date for snapshot lifecycle policy (#41035)

* Add human readable version of modified date for snapshot lifecycle policy

This small change changes it from:

```
...
"modified_date": 1554843903242,
...
```

To

```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```

Including the `"modified_date"` field when the `?human` field is used.

Relates to #38461

* Fix test

* Add API to execute SLM policy on demand (#41038)

This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.

```json
PUT /_ilm/snapshot/<policy>/_execute
```

And it returns the response with the generated snapshot name:

```json
{
  "snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```

Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.

Relates to #38461

* Add next_execution to SLM policy metadata (#41221)

* Add next_execution to SLM policy metadata

This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:

```json
GET /_ilm/snapshot?human
{
  "production" : {
    "version" : 1,
    "modified_date" : "2019-04-15T21:16:21.865Z",
    "modified_date_millis" : 1555362981865,
    "policy" : {
      "name" : "<production-snap-{now/d}>",
      "schedule" : "*/30 * * * * ?",
      "repository" : "repo",
      "config" : {
        "indices" : [
          "foo-*",
          "important"
        ],
        "ignore_unavailable" : true,
        "include_global_state" : false
      }
    },
    "next_execution" : "2019-04-15T21:16:30.000Z",
    "next_execution_millis" : 1555362990000
  },
  "other" : {
    "version" : 1,
    "modified_date" : "2019-04-15T21:12:19.959Z",
    "modified_date_millis" : 1555362739959,
    "policy" : {
      "name" : "<other-snap-{now/d}>",
      "schedule" : "0 30 2 * * ?",
      "repository" : "repo",
      "config" : {
        "indices" : [
          "other"
        ],
        "ignore_unavailable" : false,
        "include_global_state" : true
      }
    },
    "next_execution" : "2019-04-16T02:30:00.000Z",
    "next_execution_millis" : 1555381800000
  }
}
```

Relates to #38461

* Fix and enhance tests

* Figured out how to Cron

* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)

This commit changes the endpoint for snapshot lifecycle management from:

```
GET /_ilm/snapshot/<policy>
```

to:

```
GET /_slm/policy/<policy>
```

It mimics the ILM path only using `slm` instead of `ilm`.

Relates to #38461

* Add initial documentation for SLM (#41510)

* Add initial documentation for SLM

This adds the initial documentation for snapshot lifecycle management.

It also includes the REST spec API json files since they're sort of
documentation.

Relates to #38461

* Add `manage_slm` and `read_slm` roles (#41607)

* Add `manage_slm` and `read_slm` roles

This adds two more built in roles -

`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.

`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.

Relates to #38461

* Add execute to the test

* Fix ilm -> slm typo in test

* Record SLM history into an index (#41707)

It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.

This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.

Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s.  Watcher would cause the same problem, but it
is already disabled (for the same reason).

* High Level Rest Client support for SLM (#41767)

* High Level Rest Client support for SLM

This commit add HLRC support for SLM.

Relates to #38461

* Fill out documentation tests with tags

* Add more callouts and asciidoc for HLRC

* Update javadoc links to real locations

* Add security test testing SLM cluster privileges (#42678)

* Add security test testing SLM cluster privileges

This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.

Relates to #38461

* Don't redefine vars

*  Add Getting Started Guide for SLM  (#42878)

This commit adds a basic Getting Started Guide for SLM.

* Include SLM policy name in Snapshot metadata (#43132)

Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.

* Fix compilation after master merge

* [TEST] Move exception wrapping for devious exception throwing

Fixes an issue where an exception was created from one line and thrown in another.

* Fix SLM for the change to AcknowledgedResponse

* Add Snapshot Lifecycle Management Package Docs (#43535)

* Fix compilation for transport actions now that task is required

* Add a note mentioning the privileges needed for SLM (#43708)

* Add a note mentioning the privileges needed for SLM

This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.

Relates to #38461

* Mention that you can create snapshots for indices you can't read

* Fix REST tests for new number of cluster privileges

* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)

* Fix SnapshotHistoryStoreTests after merge

* Remove overridden newResponse functions that have been removed

* Fix compilation for backport

* Fix get snapshot output parsing in test

* [DOCS] Add redirects for removed autogen anchors (#44380)

* Switch <tt>...</tt> in javadocs for {@code ...}
2019-07-16 07:37:13 -06:00
Ryan Ernst 59658daef9
Separate streamable based master node actions (#44313)
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.

relates #34389
2019-07-15 09:20:20 -07:00
Ryan Ernst 3a2c698ce0
Rename Action to ActionType (#43778)
Action is a class that encapsulates meta information about an action
that allows it to be called remotely, specifically the action name and
response type. With recent refactoring, the action class can now be
constructed as a static constant, instead of needing to create a
subclass. This makes the old pattern of creating a singleton INSTANCE
both misnamed and lacking a common placement.

This commit renames Action to ActionType, thus allowing the old INSTANCE
naming pattern to be TYPE on the transport action itself. ActionType
also conveys that this class is also not the action itself, although
this change does not rename any concrete classes as those will be
removed organically as they are converted to TYPE constants.

relates #34389
2019-06-30 22:00:17 -07:00
Martijn van Groningen 101cf384ba
Replace Streamable w/ Writable in AcknowledgedResponse and subclasses (backport 7.x) (#43525)
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.

Note that where possible response fields were made final and default
constructors were removed.

This is a large PR, but the change is mostly mechanical.

Relates to #34389
Backport of #43414
2019-06-24 13:47:37 +02:00
Lee Hinman c2bf628a6d
[7.x] Narrow period of Shrink action in which ILM prevents stopping (#43254) (#43393)
* Narrow period of Shrink action in which ILM prevents stopping

Prior to this change, we would prevent stopping of ILM if the index was
anywhere in the shrink action. This commit changes
`IndexLifecycleService` to allow stopping when in any of the innocuous
steps during shrink. This changes ILM only to prevent stopping if
absolutely necessary.

Resolves #43253

* Rename variable for ignore actions -> ignore steps

* Fix comment

* Factor test out to test *all* stoppable steps
2019-06-19 16:37:41 -06:00
Ryan Ernst 172cd4dbfa Remove description from xpack feature sets (#43065)
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
2019-06-11 09:22:58 -07:00
Gordon Brown 7e59794ced
Log every use of ILM Move to Step API (#41171)
Usage of the ILM Move to Step API can result in some very odd
situations, and for diagnosing problems arising from these situations it
would be nice to have a record of when this API was called with what
parameters.

Also, adds a dedicated logger for TransportMoveToStepAction, 
rather than using the (deprecated) inherited one.
2019-04-15 16:20:37 -06:00
Gordon Brown 5347dec55e
Allow ILM to stop if indices have nonexistent policies (#40820)
Prior to this PR, there is a bug in ILM which does not allow ILM to stop
if one or more indices have an index.lifecycle.name which refers to
a policy that does not exist - the operation_mode will be stuck as
STOPPING until either the policy is created or the nonexistent
policy is removed from those indices.

This change allows ILM to stop in this case and makes the logging more
clear as to why ILM is not stopping.
2019-04-04 11:46:21 -06:00
Lee Hinman 2fd01cc0b7 Fix testRunStateChangePolicyWithAsyncActionNextStep race condition (#40707)
Previously we only set the latch countdown with `nextStep.setLatch` after the
cluster state change has already been counted down. However, it's possible
execution could have already started, causing the latch to be missed when the
`MockAsyncActionStep` is being executed.

This moves the latch setting to be before the call to
`runPolicyAfterStateChange`, which means it is always available when the
`MockAsyncActionStep` is executed.

I was able to reproduce the failure every 30-40 runs before this change. With
this change, running 2000+ times the test passes.

Resolves #40018
2019-04-02 10:56:44 -06:00
Gordon Brown db7f00098e
Correct ILM metadata minimum compatibility version (#40569)
The ILM metadata minimum compatibility version was not set correctly,
which can cause issues in mixed-version clusters.
2019-03-28 10:53:44 -06:00
Lee Hinman 8ec456b5df Maintain step order for ILM trace logging (#39522)
When trace logging is enabled we log the computed steps for a policy. This
commit makes sure that the steps that are logged are in the same order they will
be run when the policy executes. This makes it much easier to reason about the
policy if the move-to-step API is ever required in the future.
2019-03-07 11:37:58 -07:00
Jay Modi 697911c31d
Fixed missed stopping of SchedulerEngine (#39193)
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.

Closes #38875
2019-02-21 14:31:33 -07:00
Tal Levy bae656dcea
Preserve ILM operation mode when creating new lifecycles (#38134)
There was a bug where creating a new policy would start
the ILM service, even if it was stopped. This change ensures
that there is no change to the existing operation mode
2019-02-01 13:16:34 -08:00
Tal Levy 7c738fd241
Skip Shrink when numberOfShards not changed (#37953)
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.

This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
2019-01-30 15:09:17 -08:00
Lee Hinman 427bc7f940
Use ILM for Watcher history deletion (#37443)
* Use ILM for Watcher history deletion

This commit adds an index lifecycle policy for the `.watch-history-*` indices.
This policy is automatically used for all new watch history indices.

This does not yet remove the automatic cleanup that the monitoring plugin does
for the .watch-history indices, and it does not touch the
`xpack.watcher.history.cleaner_service.enabled` setting.

Relates to #32041
2019-01-23 10:18:08 -07:00
Lee Hinman 647e225698
Retry ILM steps that fail due to SnapshotInProgressException (#37624)
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541
2019-01-23 09:46:31 -07:00
Martijn van Groningen a3030c51e2 [ILM] Add unfollow action (#36970)
This change adds the unfollow action for CCR follower indices.

This is needed for the shrink action in case an index is a follower index.
This will give the follower index the opportunity to fully catch up with
the leader index, pause index following and unfollow the leader index.
After this the shrink action can safely perform the ilm shrink.

The unfollow action needs to be added to the hot phase and acts as
barrier for going to the next phase (warm or delete phases), so that
follower indices are being unfollowed properly before indices are expected
to go in read-only mode. This allows the force merge action to execute
its steps safely.

The unfollow action has three steps:
* `wait-for-indexing-complete` step: waits for the index in question
  to get the `index.lifecycle.indexing_complete` setting be set to `true`
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks
  for the index being handled to report that the leader shard global checkpoint
  is equal to the follower shard global checkpoint.
* `pause-follower-index` step: Pauses index following, necessary to unfollow
* `close-follower-index` step: Closes the index, necessary to unfollow
* `unfollow-follower-index` step: Actually unfollows the index using 
  the CCR Unfollow API
* `open-follower-index` step: Reopens the index now that it is a normal index
* `wait-for-yellow` step: Waits for primary shards to be allocated after
  reopening the index to ensure the index is ready for the next step

In the case of the last two steps, if the index in being handled is
a regular index then the steps acts as a no-op.

Relates to #34648

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
2019-01-18 13:05:03 -07:00
Jake Landis 587034dfa7
Add set_priority action to ILM (#37397)
This commit adds a set_priority action to the hot, warm, and cold
phases for an ILM policy. This action sets the `index.priority`
on the managed index to allow different priorities between the
hot, warm, and cold recoveries.

This commit also includes the HLRC and documentation changes.

closes #36905
2019-01-17 09:55:36 -06:00
Alexander Reelsen b2e8437424
Tests: Add ElasticsearchAssertions.awaitLatch method (#36777)
* Tests: Add ElasticsearchAssertions.awaitLatch method

Some tests are using assertTrue(latch.await(...)) in their code. This
leads to an assertion error without any error message. This adds a
method which has a nicer error message and can be used in tests.

* fix forbidden apis

* fix spaces
2019-01-10 09:25:36 +01:00
Tal Levy eaeccd8401
[ILM] Add Freeze Action (#36910)
This commit adds a new ILM Action for
freezing indices in the cold phase.

Closes #34630.
2019-01-03 15:00:40 -08:00
Gordon Brown d39956c65c
Remove `indexing_complete` when removing policy (#36620)
Leaving `index.lifecycle.indexing_complete` in place when removing the
lifecycle policy from an index can cause confusion, as if a new policy
is associated with the policy, rollover will be silently skipped.
Removing that setting when removing the policy from an index makes
associating a new policy with the index more involved, but allows ILM to
fail loudly, rather than silently skipping operations which the user may
assume are being performed.

* Adjust order of checks in WaitForRolloverReadyStep

This allows ILM to error out properly for indices that have a valid
alias, but are not the write index, while still handling
`indexing_complete` on old-style aliases and rollover (that is, those
which only point to a single index at a time with no explicit write
index)
2018-12-19 12:11:30 -07:00
Gordon Brown 6a824322fc
Improve error message for deleting in-use policy (#36457)
The error message used when attempting to delete a lifecycle policy that
is in use previously only included one index which was using the policy.
It now includes all indices using that policy.
2018-12-12 14:57:48 -07:00
Gordon Brown 6481f2e380
Add setting to bypass Rollover action (#36235)
Adds a setting that indicates that an index is done indexing, set by ILM
when the Rollover action completes. This indicates that the Rollover
action should be skipped in any future invocations, as long as the index
is no longer the write index for its alias.

This enables 1) an index with a policy that involves the Rollover action
to have the policy removed and switched to another one without use of
the move-to-step API, and 2) integrations with Beats and CCR.
2018-12-11 08:53:05 -07:00
Tal Levy ed7afd1a9e
[ILM] TEST: fix long overflow in TimeValueScheduleTests (#36384)
Closes #35948.
2018-12-10 09:28:17 -08:00
Lee Hinman 8ea999e489
Include stack trace with ILM error in explain output (#35512)
This changes the stacktrace to be included with the ILM explain error when the
index is an on ERROR step.

Before:

```json
{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date_millis" : 1542131670601,
      "phase" : "warm",
      "phase_time_millis" : 1542131676335,
      "action" : "shrink",
      "action_time_millis" : 1542131676335,
      "step" : "ERROR",
      "step_time_millis" : 1542131676451,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1542131669839
      }
    }
  }
}
```

After

```
{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date_millis" : 1542131670601,
      "phase" : "warm",
      "phase_time_millis" : 1542131676335,
      "action" : "shrink",
      "action_time_millis" : 1542131676335,
      "step" : "ERROR",
      "step_time_millis" : 1542131676451,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]",
        "stack_trace" : "java.lang.IllegalArgumentException: the number of target shards [13] must be less that the number of source shards [2]\n\tat org.elasticsearch.cluster.metadata.IndexMetaData.selectShrinkShards(IndexMetaData.java:1509)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction.prepareCreateIndexRequest(TransportResizeAction.java:146)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction$1.onResponse(TransportResizeAction.java:104)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction$1.onResponse(TransportResizeAction.java:101)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:60)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1117)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1198)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1178)\n\tat org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:417)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:391)\n\tat org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309)\n\tat org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)\n\tat org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:714)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:726)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\n"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1542131669839
      }
    }
  }
}
```

Resolves #35498
2018-11-14 14:40:05 -07:00
Tal Levy 16cbbab7b7
[ILM] fix retry so it picks up latest policy and executes async action (#35406)
Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Closes #35397.
2018-11-12 11:32:59 -08:00
Gordon Brown 67f9e8fa23
Enforce limitations on ILM policy names (#35104)
Enforces restrictions on ILM policy names to ensure we don't accept
policy names the system can't handle, or may reserve for future use.
2018-11-09 10:11:26 -07:00
Alpar Torok 8a85b2eada
Remove build qualifier from server's Version (#35172)
With this change, `Version` no longer carries information about the qualifier,
we still need a way to show the "display version" that does have both
qualifier and snapshot. This is now stored  by the build and red from `META-INF`.
2018-11-07 14:01:05 +02:00
Tal Levy a85b4f42ca
[ILM] change remove-policy-from-index http method from DELETE to POST (#35268)
The remove-ilm-from-index API was using the DELETE http method
to signify that something is being removed. Although, metadata
about ILM for the index is being deleted, no entity/resource
is being deleted during this operation. POST is more in line with
what this API is actually doing, it is modifying the metadata for
an index. As part of this change, `remove` is also appended to the path 
to be more explicit about its actions.
2018-11-06 07:46:25 -08:00
Alexander Reelsen 409050e8de
Refactor: Remove settings from transport action CTOR (#35208)
As settings are not used in the transport action constructor, this
removes the passing of the settings in all the transport actions.
2018-11-05 13:08:18 +01:00
Gordon Brown b3da3eae08
[ILM] Fix race condition in test (#35143)
Previously, testRunStateChangePolicyWithNextStep asserted that the
ClusterState before and after running the steps were equal. The test
only passed due to a race condition: The latch would be triggered by the
step execution, but the cluster state update thread would continue
running before committing the change to the cluster state. This allowed
the test to read the old cluster state and pass the equality check about
99.99% of the time.

The test now waits for the new cluster state to be committed before
checking that it is _not_ equal to the old cluster state.
2018-11-02 11:09:48 -06:00
Tal Levy 6b312a500d uninherit from AbstractComponent in IndexLifecycleService 2018-11-01 10:22:55 -07:00
Tal Levy 5f4b23f8c1
cleanup ILM qa structure (#35110)
This commit does a few things

- moves ILM-specifc rest yaml tests into plugin/ilm/qa, and creates special
  :plugin:ilm:qa:rest module to test them
- removes the with-security tests of the yaml tests since they are covered in
  the rest tests now
- moves ChangePolicyforIndexIT into the qa/multi-node project since that test is
  not currently running in main ilm since integTest is disabled
2018-10-31 11:49:29 -07:00
Tal Levy a294a7c6b5 fix IndexLifecycleService setting member
the settings variable was previously created by
the AbstractComponent class inherited by IndexLifecycleService.
this is no more.
2018-10-31 11:17:16 -07:00
Tal Levy 5141084048
rename CRUD api REST path prefix _ilm to _ilm/policy (#35056)
This PR renames the CRUD APIS for ILM

GET _ilm/<policy>, _ilm -> _ilm/policy/<policy>, _ilm/policy
PUT _ilm/<policy> -> _ilm/policy/<policy>
DELETE _ilm/<policy> -> _ilm/policy/<policy>

closes #34929.
2018-10-30 16:19:05 -07:00
Gordon Brown 6ecb8ff344
Move to Error step if ClusterState* steps throw (#35069)
Previously, if ClusterStateActionSteps or ClusterStateWaitSteps threw an
exception executing, the exception would only be caught and logged by
the generic ClusterStateUpdateTask machinery and the index would become
stuck on that step.

Now, exceptions thrown in these steps will be caught and the index will
be moved to the Error step.
2018-10-30 13:33:32 -06:00
Gordon Brown f6ac0e4bbc
[ILM] Fix Move To Step API causing ILM to hang (#34618)
The Move To Step API now checks to see if the target step is an
AsyncActionStep, and if so, runs it.

Previously, AsyncActionSteps would only be run when they are entered by
executing the previous step, so if an AsyncActionStep was entered via
the Move To Step API, ILM would never touch that index again.
2018-10-29 11:18:12 -06:00
Tal Levy f6ce935444
fix `GET _ilm` response with uninitialized ILM metadata (#34881)
ILM would return a resource-not-found exception when requesting policies
while the IndexLifecycleMetaData is not initialized. The behavior here
should not be as extreme since it is not the user's fault.

This commit changes the behavior so that it succeeds and returns no policies
when no policy names are explicitely specified, otherwise keep the same behavior
of throwing an exception
2018-10-25 16:00:44 -07:00
Tal Levy 41eaa586e8
remove index.lifecycle.skip setting (#34823)
With the introduction of _ilm/stop and _ilm/start APIs, the
use cases where one would only target a select group
of indices to start/stop has been reduced. Since there is no
strong use-case for skipping specific indices, it is best to
remove this functionality and only adding if later desired, with the
hopes of keeping things more simple.
2018-10-25 07:27:04 -07:00
Tal Levy 21b9b024c7
fix PolicyStatsTests mutateInstance (#34835)
through randomization, there is a chance that the mutateInstance
for PolicyStatsTests does not actually mutate the original object.
This PR aims to fix this
2018-10-25 07:24:54 -07:00
Colin Goodheart-Smithe e7fddb5c93
Adds usage data for ILM (#33377)
* Adds usage data for ILM

* Adds tests for IndexLifecycleFeatureSetUsage and friends

* Adds tests for IndexLifecycleFeatureSet

* Fixes merge errors

* Add number of indices managed to usage stats

Also adds more tests

* Addresses Review comments
2018-10-24 18:28:46 +01:00
Colin Goodheart-Smithe c7fe87e43f
Removes Set Policy API in favour of setting index.lifecycle.name directly (#34304)
* Removes Set Policy API in favour of setting index.lifecycle.name
directly

* Reinstates matcher that will still be used

* Cleans up code after rebase

* Adds test to check changing policy with ndex settings works

* Fixes TimeseriesLifecycleActionsIT after API removal

* Fixes docs tests

* Fixes case on close where lifecycle service was never created
2018-10-24 16:14:59 +01:00
Lee Hinman c5a264e77f
Ensure phase_time is set when in the "new" phase (#34280)
Since there's no transition into the "new" phase it wasn't set until the "hot"
phase, so now we initialize it when initializing the policy context.

Resolves #34277
2018-10-23 15:20:41 -06:00
Gordon Brown 9cb0bb8b9f
Rework ILM build to separate integration tests (#34617)
Having integration tests separated from the unit tests in the qa
directory works much more smoothly with our testing infrastructure,
matches what other plugins do, and tests in a more "real" deployment
scenario by having all plugins installed.
2018-10-18 13:33:33 -06:00
Tal Levy fdb850735a fix setting version on deleting unmaanged indices with wildcard 2018-10-16 23:39:48 -07:00
Tal Levy 3a555da34d update version on ILM setting updates 2018-10-16 15:43:10 -07:00
Jack Conradson 80474e138f
HLRC: Add remove index lifecycle policy (#34204)
This change adds the command RemoveIndexLifecyclePolicy to the HLRC. This uses the 
new TimeRequest as a base class for RemoveIndexLifecyclePolicyRequest on the client side.
2018-10-16 08:12:06 -07:00
Lee Hinman 9ad2a7fa77
Fix expected next step being incorrect when executing async action (#34313)
This fixes an issue where an incorrect expected next step is used when checking
to execute `AsyncActionStep`s after a cluster state step.

It fixes this scenario:

- `ExecuteStepsUpdateTask` executes a `ClusterStateWaitStep` or
  `ClusterStateActionStep` successfully
- The next step is also a `ClusterStateWaitStep`, so it loops
- The `ClusterStateWaitStep` has a next stepkey (which gets set to the
  `nextStepKey` in the code)
- The `ClusterStateWaitStep` fails the condition, meaning that it will have to
  wait longer
- The `nextStepKey` is now incorrect though, because we did not advance the
  index's step, and it's not `null` (which is another safe value if there is no
  step after the `ClusterStateWaitStep`)

This fixes the problem by resetting the nextStepKey to null if the condition is
not met, since we are not going to advance the step metadata in this
case (thereby skipping the `maybeRunAsyncAction` invocation).

This commit also tightens up and enhances much of the ILM logging. A lot of
logging was missing the index name (making it hard to debug in the presence of
multiple indices) and a lot was using the wrong logging level (DEBUG is now
actually readable without being a wall of text).

Resolves #34297
2018-10-08 11:25:18 -06:00
Gordon Brown 13d89295c8
Provide useful error when a policy doesn't exist (#34206)
When an index is configured to use a lifecycle policy that does not
exist, this will now be noted in the step_info for that policy.
2018-10-04 08:21:55 -06:00
Tal Levy f10735aa9a ILM integration test with full policy (#33402)
- this adds an integration test that runs through a policy
with all the actions defined.
- adds a test specific to a policy having just a rollover action
- bumps the node count to 4
2018-10-03 12:20:43 -06:00
Lee Hinman 388f754a8e
Change step execution flow to be deliberate about type (#34126)
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to #29823
2018-10-02 20:02:50 -06:00
Lee Hinman 2d9cb21490 Merge remote-tracking branch 'origin/master' into index-lifecycle 2018-10-01 14:10:09 -06:00
Lee Hinman a49d59802a
Use more descriptive task names for ILM cluster state updates (#34161)
Rather than using "ILM" for everything, we should use more descriptive names so
debugging from logs is easier to do.

Resolves #34118
2018-10-01 13:45:26 -06:00
Gordon Brown c0bfc07f53
Only make indexes read-only on Shrink and ForceMerge actions (#33907)
ILM now only forces indices to become read only in the case of Shrink
and Force Merge actions, as these are most useful in cases where the
index is no longer being written to.
2018-09-25 10:16:01 -06:00
Gordon Brown 90de436e55
Use custom index metadata for ILM state (#33783)
Using index settings for ILM state is fragile and exposes too much
information that doesn't need to be exposed. Using custom index metadata
is more resilient and allows more controlled access to internal
information.

As part of these changes, moves away from using defaults for ILM-related
values, in favor of using null values to clearly indicate that the value is not
present.
2018-09-19 14:50:48 -06:00
Lee Hinman 27dd25857b
Rebuild step on PolicyStepsRegistry.getStep (#33780)
This moves away from caching a list of steps for a current phase, instead
rebuilding the necessary step from the phase JSON stored in the index's
metadata.

Relates to #29823
2018-09-18 17:07:57 -06:00
Lee Hinman 11a55d2307 [TEST] Handle an IndexLifecycleService that has not started up 2018-09-18 14:02:09 -06:00
Tal Levy 94a66c556d
add phase execution info to ILM Explain API (#33488)
adds a section for phase execution to the Explain API.

This contains

- phase definition
- policy name
- policy version
- modified date
2018-09-17 17:00:00 -07:00
Lee Hinman 1f048d3d3f
Remove unneeded listener on MoveToNextStepUpdateTask (#33725)
There was a listener that re-runs the policy with the new state when the cluster
state is processed by the `MoveToNextStepUpdateTask`. This removes this listener
as we will execute the policy through the `IndexLifecyleService` cluster state
listener.
2018-09-14 14:38:23 -06:00
Lee Hinman b7649fce0c
Rename "after" to "minimum_age" in lifecycle definition (#33530)
This renames the "after" field to better reflect what the meaning is.

Supercedes #32624
2018-09-08 21:40:55 -06:00
Lee Hinman 8fa8dea138
Encapsulate Client as class variable for PolicyStepsRegistry (#33529)
Rather than pass in the client on the `update` step, this makes it passed in to
the constructor so it's not required on every update.
2018-09-07 16:32:25 -06:00
Colin Goodheart-Smithe f83641346f
Adds checks to ensure index metadata exists when we try to use it (#33455)
* Adds checks to ensure index metadata exists when we try to use it

* Fixes failing test
2018-09-07 13:06:51 +01:00