174 Commits

Author SHA1 Message Date
Jake Landis 6e9ccda2c5
ilm test - allow more time for policy completion (#43844) 2019-07-02 22:05:18 -05:00
Jake Landis 0a79f4ca70
Extend timeout for TimeSeriesLifecycleActionsIT> testFullPolicy (#43891) 2019-07-02 22:05:04 -05:00
Ryan Ernst 3a2c698ce0
Rename Action to ActionType (#43778)
Action is a class that encapsulates meta information about an action
that allows it to be called remotely, specifically the action name and
response type. With recent refactoring, the action class can now be
constructed as a static constant, instead of needing to create a
subclass. This makes the old pattern of creating a singleton INSTANCE
both misnamed and lacking a common placement.

This commit renames Action to ActionType, thus allowing the old INSTANCE
naming pattern to be TYPE on the transport action itself. ActionType
also conveys that this class is not the action itself, although
this change does not rename any concrete classes as those will be
removed organically as they are converted to TYPE constants.

relates #34389
2019-06-30 22:00:17 -07:00
Martijn van Groningen 101cf384ba
Replace Streamable w/ Writeable in AcknowledgedResponse and subclasses (backport 7.x) (#43525)
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.

Note that where possible response fields were made final and default
constructors were removed.

This is a large PR, but the change is mostly mechanical.

Relates to #34389
Backport of #43414
2019-06-24 13:47:37 +02:00
Lee Hinman c2bf628a6d
[7.x] Narrow period of Shrink action in which ILM prevents stopping (#43254) (#43393)
* Narrow period of Shrink action in which ILM prevents stopping

Prior to this change, we would prevent stopping of ILM if the index was
anywhere in the shrink action. This commit changes
`IndexLifecycleService` to allow stopping when in any of the innocuous
steps during shrink. This changes ILM only to prevent stopping if
absolutely necessary.

Resolves #43253

* Rename variable for ignore actions -> ignore steps

* Fix comment

* Factor test out to test *all* stoppable steps
2019-06-19 16:37:41 -06:00
Alpar Torok 167e51335d Convert ILM tests to use testclusters (#43076)
Also improve the error message when bin scripts are not found
2019-06-13 12:24:48 +03:00
Ryan Ernst 172cd4dbfa Remove description from xpack feature sets (#43065)
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions do not describe anything meaningful). This commit removes the
description field from feature sets.
2019-06-11 09:22:58 -07:00
Jason Tedor 117df87b2b
Replicate aliases in cross-cluster replication (#42875)
This commit adds functionality so that aliases that are manipulated on
leader indices are replicated by the shard follow tasks to the follower
indices. Note that we ignore write indices. This is due to the fact that
follower indices do not receive direct writes so the concept is not
useful.

Relates #41815
2019-06-04 20:36:24 -04:00
Mark Vieira e44b8b1e2e
[Backport] Remove dependency substitutions 7.x (#42866)
* Remove unnecessary usage of Gradle dependency substitution rules (#42773)

(cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)
2019-06-04 13:50:23 -07:00
Ryan Ernst 6fd8924c5a Switch run task to use real distro (#41590)
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
2019-05-06 12:34:07 -07:00
Daniel Mitterdorfer 8580053818
Mute PermissionsIT#testWhen[...]ByILMPolicy (#41859)
Relates #41440
Relates #41858
2019-05-06 16:15:37 +02:00
Christoph Büscher 52495843cc [Docs] Fix common word repetitions (#39703) 2019-04-25 20:47:47 +02:00
Yogesh Gaikwad 0d1178fca6
put mapping authorization for alias with write-index and multiple read indices (#40834) (#41287)
When the same alias points to multiple indices we can write to only one index
with `is_write_index` value `true`. The special handling in case of the put
mapping request (to resolve authorized indices) has a check on indices size
for a concrete index. If multiple indices existed then it marked the request
as unauthorized.

The check now considers the write index flag: the alias is considered for
authorization only when the requested index matches the index that the alias
marks as its write index.

Closes #40831
2019-04-17 14:25:33 +10:00
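The revised check from #40834 can be pictured roughly as follows. This is a minimal Python sketch, not the actual security code: `resolve_put_mapping_target` and the dictionary layout are hypothetical, chosen only to illustrate preferring the write index when an alias spans several indices.

```python
def resolve_put_mapping_target(alias_indices):
    """Return the single index a put-mapping request on an alias may target.

    alias_indices maps index name -> is_write_index flag (hypothetical layout).
    Returns None when no unambiguous target exists.
    """
    if len(alias_indices) == 1:
        # A single backing index is always the target.
        return next(iter(alias_indices))
    write_indices = [name for name, is_write in alias_indices.items() if is_write]
    # With several indices, only an explicit write index is unambiguous.
    return write_indices[0] if len(write_indices) == 1 else None
```

Under these assumptions, an alias over `logs-1` (read only) and `logs-2` (write index) resolves to `logs-2`, where a plain size check would have rejected the request.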
Gordon Brown ec8709e831
Check allocation rules are cleared after ILM Shrink (#41170)
Adds some checks to make sure that the allocation rules that ILM adds
before a shrink are cleared after the shrink is complete
2019-04-16 09:25:51 -06:00
Gordon Brown 7e59794ced
Log every use of ILM Move to Step API (#41171)
Usage of the ILM Move to Step API can result in some very odd
situations, and for diagnosing problems arising from these situations it
would be nice to have a record of when this API was called with what
parameters.

Also, adds a dedicated logger for TransportMoveToStepAction, 
rather than using the (deprecated) inherited one.
2019-04-15 16:20:37 -06:00
Mark Vieira 1287c7d91f
[Backport] Replace usages of RandomizedTestingTask with built-in Gradle Test (#40978) (#40993)
* Replace usages of RandomizedTestingTask with built-in Gradle Test (#40978)

This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.

(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)

* Fix forking JVM runner

* Don't bump shadow plugin version
2019-04-09 11:52:50 -07:00
Gordon Brown 5347dec55e
Allow ILM to stop if indices have nonexistent policies (#40820)
Prior to this PR, there was a bug in ILM which did not allow ILM to stop
if one or more indices have an index.lifecycle.name which refers to
a policy that does not exist - the operation_mode will be stuck as
STOPPING until either the policy is created or the nonexistent
policy is removed from those indices.

This change allows ILM to stop in this case and makes the logging more
clear as to why ILM is not stopping.
2019-04-04 11:46:21 -06:00
Lee Hinman 2fd01cc0b7 Fix testRunStateChangePolicyWithAsyncActionNextStep race condition (#40707)
Previously we only set the latch countdown with `nextStep.setLatch` after the
cluster state change has already been counted down. However, it's possible
execution could have already started, causing the latch to be missed when the
`MockAsyncActionStep` is being executed.

This moves the latch setting to be before the call to
`runPolicyAfterStateChange`, which means it is always available when the
`MockAsyncActionStep` is executed.

I was able to reproduce the failure every 30-40 runs before this change. With
this change, the test passes across 2000+ runs.

Resolves #40018
2019-04-02 10:56:44 -06:00
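The ordering problem fixed in #40707 can be shown with a small model. This is a sketch in Python rather than the test's Java: `MockAsyncStep`, `run_racy`, and `run_fixed` are hypothetical names, `threading.Event` stands in for a latch, and the racy variant simply hard-codes the unlucky interleaving in which execution finishes before the latch is attached.

```python
import threading


class MockAsyncStep:
    """Stand-in for a step that signals a latch when it executes."""

    def __init__(self):
        self.latch = None

    def execute(self):
        # If no latch has been attached yet, the signal is silently lost.
        if self.latch is not None:
            self.latch.set()


def run_racy(step):
    step.execute()                  # execution already happened...
    step.latch = threading.Event()  # ...before the latch was attached
    return step.latch.wait(timeout=0.05)


def run_fixed(step):
    step.latch = threading.Event()  # attach the latch first
    step.execute()                  # execution always sees the latch
    return step.latch.wait(timeout=0.05)
```

In this model `run_racy` times out while `run_fixed` observes the signal, which mirrors why attaching the latch before triggering the policy run removes the intermittent failure.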
Gordon Brown db7f00098e
Correct ILM metadata minimum compatibility version (#40569)
The ILM metadata minimum compatibility version was not set correctly,
which can cause issues in mixed-version clusters.
2019-03-28 10:53:44 -06:00
Lee Hinman 8ec456b5df Maintain step order for ILM trace logging (#39522)
When trace logging is enabled we log the computed steps for a policy. This
commit makes sure that the steps that are logged are in the same order they will
be run when the policy executes. This makes it much easier to reason about the
policy if the move-to-step API is ever required in the future.
2019-03-07 11:37:58 -07:00
Gordon Brown f4c5abe4d4
Handle failure to release retention leases in ILM (#39281) (#39417)
It is possible that the Unfollow API may fail to release shard history
retention leases when unfollowing, so this needs to be handled by the
ILM Unfollow action. There's nothing much that can be done automatically
about it from the follower side, so this change makes the ILM unfollow
action simply ignore those failures.
2019-02-26 16:58:30 -07:00
Gordon Brown 2ad1e6aedc
Fix testCannotShrinkLeaderIndex (#38529)
This test should no longer pass when the functionality it is intended to
test is broken, as it now indexes a number of documents and verifies
that the index is staying on the same step until after indexing and
replication of those documents is finished. This prevents the test from
passing if the leader index progresses in its lifecycle during that time.
2019-02-22 08:03:36 -07:00
Jay Modi 697911c31d
Fixed missed stopping of SchedulerEngine (#39193)
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.

Closes #38875
2019-02-21 14:31:33 -07:00
Jason Tedor 09ea3ccd16
Remove retention leases when unfollowing (#39088)
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
2019-02-20 07:06:49 -05:00
Jason Tedor 2d8f6b6501
Introduce retention lease state file (#39004)
This commit moves retention leases from being persisted in the Lucene
commit point to being persisted in a dedicated state file.
2019-02-18 16:53:46 -05:00
Jason Tedor a5ce1e0bec
Integrate retention leases to recovery from remote (#38829)
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
2019-02-16 15:37:52 -05:00
Luca Cavanna a7046e001c
Remove support for maxRetryTimeout from low-level REST client (#38085)
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.

The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).

Given all these issues, it seems that this custom timeout mechanism
offers little benefit while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyway.

This commit removes the `maxRetryTimeout` setting and all of its usages.
2019-02-06 08:43:47 +01:00
Julie Tibshirani 3ce7d2c9b6
Make sure to reject mappings with type _doc when include_type_name is false. (#38270)
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).

This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:

```
PUT index?include_type_name=false
{
  "mappings": {
    "_doc": {
      "properties": { ... }
    }
  }
}
```

There is a similar issue for both 'put templates' and 'put mappings' requests
as well.

This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
2019-02-05 10:52:32 -08:00
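The kind of check described in #38270 might look like the following. This is an illustrative Python sketch only: `create_index_mappings` is a hypothetical helper, and the real change lives in the Java request parsing, which this does not reproduce.

```python
def create_index_mappings(body, include_type_name=False):
    """Return the mappings from an index creation body, rejecting typed ones.

    Hypothetical validation: with include_type_name=false the mappings must
    not be nested under the `_doc` type key.
    """
    mappings = body.get("mappings", {})
    if not include_type_name and "_doc" in mappings:
        raise ValueError(
            "mappings must not be nested under the type key [_doc] "
            "when include_type_name is false"
        )
    return mappings
```

With this sketch, a body shaped like the request shown in the commit message raises an error, while the same body is accepted when `include_type_name=True`.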
Gordon Brown b866417650
Mute testCannotShrinkLeaderIndex (#38374)
This test should not pass until CCR finishes integrating shard history
retention leases. It currently sometimes passes (which is a bug in the
test), but cannot pass reliably until the linked issue is resolved.
2019-02-04 16:06:19 -07:00
Gordon Brown 7a1e89c7ed
Ensure ILM policies run safely on leader indices (#38140)
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
2019-02-01 20:46:12 -07:00
Tal Levy bae656dcea
Preserve ILM operation mode when creating new lifecycles (#38134)
There was a bug where creating a new policy would start
the ILM service, even if it was stopped. This change ensures
that there is no change to the existing operation mode
2019-02-01 13:16:34 -08:00
Tal Levy 7c738fd241
Skip Shrink when numberOfShards not changed (#37953)
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.

This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
2019-01-30 15:09:17 -08:00
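The branching idea from #37953 can be sketched as below. This is a simplified Python model, not the Java `BranchingStep`: the constructor arguments, the step names `"shrink"` and `"after-shrink"`, and the settings dictionary are all assumptions made for illustration.

```python
class BranchingStep:
    """Pick the next step based on a predicate over the index settings."""

    def __init__(self, next_on_false, next_on_true, predicate):
        self.next_on_false = next_on_false
        self.next_on_true = next_on_true
        self.predicate = predicate

    def next_step(self, index_settings):
        if self.predicate(index_settings):
            return self.next_on_true
        return self.next_on_false


def shrink_branch(target_shards):
    # Step names here are illustrative, not the real step keys.
    return BranchingStep(
        next_on_false="shrink",
        next_on_true="after-shrink",
        predicate=lambda settings: settings["number_of_shards"] == target_shards,
    )
```

An index that already has the target shard count branches past the shrink, and any other index proceeds into it.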
Tim Brooks 00ace369af
Use `CcrRepository` to init follower index (#35719)
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes 
the logic through the snapshot/restore process. A 
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
2019-01-29 11:47:29 -07:00
Gordon Brown 49bd8715ff
Inject Unfollow before Rollover and Shrink (#37625)
We inject an Unfollow action before Shrink because the Shrink action
cannot be safely used on a following index, as it may not be fully
caught up with the leader index before the "original" following index is
deleted and replaced with a non-following Shrunken index. The Unfollow
action will verify that 1) the index is marked as "complete", and 2) all
operations up to this point have been replicated from the leader to the
follower before explicitly disconnecting the follower from the leader.

Injecting an Unfollow action before the Rollover action is done mainly
as a convenience: This allows users to use the same lifecycle policy on
both the leader and follower cluster without having to explicitly modify
the policy to unfollow the index, while doing what we expect users to
want in most cases.
2019-01-28 14:09:12 -07:00
Lee Hinman 427bc7f940
Use ILM for Watcher history deletion (#37443)
* Use ILM for Watcher history deletion

This commit adds an index lifecycle policy for the `.watch-history-*` indices.
This policy is automatically used for all new watch history indices.

This does not yet remove the automatic cleanup that the monitoring plugin does
for the .watch-history indices, and it does not touch the
`xpack.watcher.history.cleaner_service.enabled` setting.

Relates to #32041
2019-01-23 10:18:08 -07:00
Lee Hinman 647e225698
Retry ILM steps that fail due to SnapshotInProgressException (#37624)
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.

This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.

This also adds integration tests for these scenarios (thanks to @talevy in #37552).

Resolves #37541
2019-01-23 09:46:31 -07:00
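The retry behaviour described in #37624 can be modelled in a few lines. This is a hedged Python sketch: `SnapshotInProgressError`, `FakeCluster`, and `RetryDuringSnapshotStep` are hypothetical stand-ins for the Java exception, the cluster state observer, and the abstract step, and the callback list replaces the real listener registration.

```python
class SnapshotInProgressError(Exception):
    """Raised when an action cannot run because a snapshot is running."""


class FakeCluster:
    def __init__(self):
        self.snapshot_running = True
        self.deleted = False
        self.listeners = []

    def on_snapshot_complete(self, callback):
        self.listeners.append(callback)

    def finish_snapshot(self):
        self.snapshot_running = False
        pending, self.listeners = self.listeners, []
        for callback in pending:
            callback()


def delete_index(cluster):
    if cluster.snapshot_running:
        raise SnapshotInProgressError("snapshot is running")
    cluster.deleted = True


class RetryDuringSnapshotStep:
    def __init__(self, action):
        self.action = action

    def perform(self, cluster):
        try:
            self.action(cluster)
            return "done"
        except SnapshotInProgressError:
            # Instead of moving to ERROR, re-run once the snapshot completes.
            cluster.on_snapshot_complete(lambda: self.perform(cluster))
            return "waiting"
```

In this model the first attempt returns `"waiting"` and the delete only takes effect after `finish_snapshot()` fires the registered callback.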
Ryan Ernst 9a34b20233
Simplify integ test distribution types (#37618)
The integ tests currently use the raw zip project name as the
distribution type. This commit simplifies this specification to be
"default" or "oss". Whether zip or tar is used should be an internal
implementation detail of the integ test setup, which can (in the future)
be platform specific.
2019-01-21 12:37:17 -08:00
Martijn van Groningen a3030c51e2 [ILM] Add unfollow action (#36970)
This change adds the unfollow action for CCR follower indices.

This is needed for the shrink action in case an index is a follower index.
This will give the follower index the opportunity to fully catch up with
the leader index, pause index following and unfollow the leader index.
After this the shrink action can safely perform the ilm shrink.

The unfollow action needs to be added to the hot phase and acts as
barrier for going to the next phase (warm or delete phases), so that
follower indices are being unfollowed properly before indices are expected
to go in read-only mode. This allows the force merge action to execute
its steps safely.

The unfollow action has seven steps:
* `wait-for-indexing-complete` step: waits for the index in question
  to have the `index.lifecycle.indexing_complete` setting set to `true`
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks
  for the index being handled to report that the leader shard global checkpoint
  is equal to the follower shard global checkpoint.
* `pause-follower-index` step: Pauses index following, necessary to unfollow
* `close-follower-index` step: Closes the index, necessary to unfollow
* `unfollow-follower-index` step: Actually unfollows the index using 
  the CCR Unfollow API
* `open-follower-index` step: Reopens the index now that it is a normal index
* `wait-for-yellow` step: Waits for primary shards to be allocated after
  reopening the index to ensure the index is ready for the next step

In the case of the last two steps, if the index being handled is
a regular index then the steps act as no-ops.

Relates to #34648

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
2019-01-18 13:05:03 -07:00
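The step sequence listed in #36970 can be written down as ordered data. This Python sketch only encodes the order given in the commit message; `next_unfollow_step` is a hypothetical helper and does not reflect how ILM actually links steps together.

```python
# Step names in the order given by the commit message.
UNFOLLOW_STEPS = [
    "wait-for-indexing-complete",
    "wait-for-follow-shard-tasks",
    "pause-follower-index",
    "close-follower-index",
    "unfollow-follower-index",
    "open-follower-index",
    "wait-for-yellow",
]


def next_unfollow_step(current):
    """Return the step after `current`, or None once the action is finished."""
    position = UNFOLLOW_STEPS.index(current)
    if position + 1 < len(UNFOLLOW_STEPS):
        return UNFOLLOW_STEPS[position + 1]
    return None
```

Reading the list top to bottom shows why the index is paused and closed before the Unfollow API is called, and reopened only afterwards.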
Jake Landis 587034dfa7
Add set_priority action to ILM (#37397)
This commit adds a set_priority action to the hot, warm, and cold
phases for an ILM policy. This action sets the `index.priority`
on the managed index to allow different priorities between the
hot, warm, and cold recoveries.

This commit also includes the HLRC and documentation changes.

closes #36905
2019-01-17 09:55:36 -06:00
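A policy using the action from #37397 might be assembled as follows. This is a sketch in Python that only builds the request body: `priority_policy` is a hypothetical helper and the priority values 100, 50, and 0 are illustrative defaults, not values taken from the commit.

```python
import json


def priority_policy(hot=100, warm=50, cold=0):
    """Build an ILM policy body that sets a different priority per phase."""
    phases = {
        name: {"actions": {"set_priority": {"priority": value}}}
        for name, value in (("hot", hot), ("warm", warm), ("cold", cold))
    }
    return {"policy": {"phases": phases}}


# The resulting JSON would be sent as the body of a put-policy request.
print(json.dumps(priority_policy(), indent=2))
```

Higher values for the hot phase would let recently written indices recover before warm and cold ones.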
Alexander Reelsen b2e8437424
Tests: Add ElasticsearchAssertions.awaitLatch method (#36777)
* Tests: Add ElasticsearchAssertions.awaitLatch method

Some tests are using assertTrue(latch.await(...)) in their code. This
leads to an assertion error without any error message. This adds a
method which has a nicer error message and can be used in tests.

* fix forbidden apis

* fix spaces
2019-01-10 09:25:36 +01:00
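The idea behind the helper added in #36777 carries over to other languages. The sketch below is a Python analogue, not the Java `ElasticsearchAssertions.awaitLatch` method: `await_latch` is a hypothetical name, `threading.Event` stands in for a countdown latch, and the message wording is invented.

```python
import threading


def await_latch(latch, timeout_seconds=10.0):
    """Wait for the latch and fail with a descriptive message on timeout."""
    if not latch.wait(timeout=timeout_seconds):
        raise AssertionError(
            f"expected latch to be released within {timeout_seconds}s, but it was not"
        )
```

Compared with a bare assertion on the wait result, a failure here states what was being waited for and for how long.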
Tal Levy eaeccd8401
[ILM] Add Freeze Action (#36910)
This commit adds a new ILM Action for
freezing indices in the cold phase.

Closes #34630.
2019-01-03 15:00:40 -08:00
Tal Levy f6c1e3f14f
[ILM][TEST] increase assertBusy timeout (#36864)
The testFullPolicy and testMoveToRolloverStep tests
are very important tests, but they sometimes time out
beyond the default 10-second wait for shrink to occur.
This commit increases one of the assertBusy timeouts to
20 seconds.
2018-12-20 08:55:02 -08:00
Gordon Brown d39956c65c
Remove `indexing_complete` when removing policy (#36620)
Leaving `index.lifecycle.indexing_complete` in place when removing the
lifecycle policy from an index can cause confusion: if a new policy
is associated with the index, rollover will be silently skipped.
Removing that setting when removing the policy from an index makes
associating a new policy with the index more involved, but allows ILM to
fail loudly, rather than silently skipping operations which the user may
assume are being performed.

* Adjust order of checks in WaitForRolloverReadyStep

This allows ILM to error out properly for indices that have a valid
alias, but are not the write index, while still handling
`indexing_complete` on old-style aliases and rollover (that is, those
which only point to a single index at a time with no explicit write
index)
2018-12-19 12:11:30 -07:00
Alpar Torok e9ef5bdce8
Converting randomized testing to create a separate unitTest task instead of replacing the builtin test task (#36311)
- Create a separate unitTest task instead of replacing Gradle's built-in test task
- Convert all configuration to use the new task
- The built-in task is now disabled
2018-12-19 08:25:20 +02:00
Tal Levy 06dfd4aadc
[TEST] fix flaky ILM tests (#36612)
* WaitForRolloverReadyStepTests#mutateInstance sometimes did not mutate the instance
  correctly
* 40_explain_lifecycle#"Test new phase still has phase_time" is not really a necessary
  integration test. In addition to this, it is flaky due to the asynchronous nature of
  ILM metadata population
2018-12-14 11:36:18 -08:00
Tal Levy e3cf642299
Add ILM-specific security privileges (#36493)
* add read_ilm cluster privilege

Although managing ILM policies is best done using the
"manage" cluster privilege, it is useful to have read-only
views.

* adds `read_ilm` cluster privilege for viewing policies and status
* adds Explain API to the `view_index_metadata` index privilege

* add manage_ilm privileges
2018-12-13 08:11:33 -08:00
Gordon Brown 6a824322fc
Improve error message for deleting in-use policy (#36457)
The error message used when attempting to delete a lifecycle policy that
is in use previously only included one index which was using the policy.
It now includes all indices using that policy.
2018-12-12 14:57:48 -07:00
Gordon Brown 6481f2e380
Add setting to bypass Rollover action (#36235)
Adds a setting that indicates that an index is done indexing, set by ILM
when the Rollover action completes. This indicates that the Rollover
action should be skipped in any future invocations, as long as the index
is no longer the write index for its alias.

This enables 1) an index with a policy that involves the Rollover action
to have the policy removed and switched to another one without use of
the move-to-step API, and 2) integrations with Beats and CCR.
2018-12-11 08:53:05 -07:00
Tal Levy ed7afd1a9e
[ILM] TEST: fix long overflow in TimeValueScheduleTests (#36384)
Closes #35948.
2018-12-10 09:28:17 -08:00
Alpar Torok 8659af68e0
Auto skip license headers on no source (#35640)
* Unmute BuildExamplePluginsIT

* Skip licenseHeaders when there are no sources
2018-11-20 13:02:33 +02:00
Gordon Brown cce9648f9d
Align RolloverStep's name with other step names (#35655)
RolloverStep previously had a name of "attempt_rollover", which was
inconsistent with all other step names due to its use of an underscore
instead of a dash.
2018-11-16 17:42:48 -07:00
Gordon Brown 3883e9bf4c
Split RolloverStep into Wait and Action steps (#35524)
RolloverAction will now periodically check the rollover conditions using
the Rollover API with the dry_run option as an AsyncWaitStep, then run
the rollover itself by calling the Rollover API with no conditions,
which will always roll over, as an AsyncActionStep. This will resolve
race condition issues in policies using RolloverAction.
2018-11-15 17:11:31 -07:00
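The split described in #35524 can be modelled with a fake API. This is a minimal Python sketch under stated assumptions: `FakeRolloverApi` is an invented stand-in for the Rollover API, `max_docs` is used as the only condition, and the two functions only mirror the roles of the wait step and the action step.

```python
class FakeRolloverApi:
    """Tiny stand-in for a rollover endpoint with a dry-run option."""

    def __init__(self, doc_count):
        self.doc_count = doc_count
        self.generation = 1

    def rollover(self, conditions=None, dry_run=False):
        # With no conditions the rollover always applies.
        met = conditions is None or self.doc_count >= conditions["max_docs"]
        if met and not dry_run:
            self.generation += 1
            self.doc_count = 0
        return met


def wait_for_rollover_ready(api, conditions):
    # Wait step: evaluate conditions without changing anything.
    return api.rollover(conditions=conditions, dry_run=True)


def perform_rollover(api):
    # Action step: roll over unconditionally once the wait step has passed.
    return api.rollover()
```

Because the wait step never mutates state, it can be polled repeatedly, and the action step runs exactly once after the conditions have been met.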
Lee Hinman 8ea999e489
Include stack trace with ILM error in explain output (#35512)
This change includes the stack trace in the ILM explain output when the
index is on the ERROR step.

Before:

```json
{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date_millis" : 1542131670601,
      "phase" : "warm",
      "phase_time_millis" : 1542131676335,
      "action" : "shrink",
      "action_time_millis" : 1542131676335,
      "step" : "ERROR",
      "step_time_millis" : 1542131676451,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1542131669839
      }
    }
  }
}
```

After:

```
{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date_millis" : 1542131670601,
      "phase" : "warm",
      "phase_time_millis" : 1542131676335,
      "action" : "shrink",
      "action_time_millis" : 1542131676335,
      "step" : "ERROR",
      "step_time_millis" : 1542131676451,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]",
        "stack_trace" : "java.lang.IllegalArgumentException: the number of target shards [13] must be less that the number of source shards [2]\n\tat org.elasticsearch.cluster.metadata.IndexMetaData.selectShrinkShards(IndexMetaData.java:1509)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction.prepareCreateIndexRequest(TransportResizeAction.java:146)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction$1.onResponse(TransportResizeAction.java:104)\n\tat org.elasticsearch.action.admin.indices.shrink.TransportResizeAction$1.onResponse(TransportResizeAction.java:101)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:60)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1117)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1198)\n\tat org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1178)\n\tat org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54)\n\tat 
org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:417)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:391)\n\tat org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309)\n\tat org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)\n\tat org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:714)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:726)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\n"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1542131669839
      }
    }
  }
}
```

Resolves #35498
2018-11-14 14:40:05 -07:00
Tal Levy 16cbbab7b7
[ILM] fix retry so it picks up latest policy and executes async action (#35406)
Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Closes #35397.
2018-11-12 11:32:59 -08:00
Gordon Brown 67f9e8fa23
Enforce limitations on ILM policy names (#35104)
Enforces restrictions on ILM policy names to ensure we don't accept
policy names the system can't handle, or may reserve for future use.
2018-11-09 10:11:26 -07:00
Alpar Torok 8a85b2eada
Remove build qualifier from server's Version (#35172)
With this change, `Version` no longer carries information about the qualifier.
We still need a way to show the "display version" that does have both
qualifier and snapshot. This is now stored by the build and read from `META-INF`.
2018-11-07 14:01:05 +02:00
Tal Levy a85b4f42ca
[ILM] change remove-policy-from-index http method from DELETE to POST (#35268)
The remove-ilm-from-index API was using the DELETE http method
to signify that something is being removed. Although metadata
about ILM for the index is being deleted, no entity/resource
is being deleted during this operation. POST is more in line with
what this API is actually doing: it is modifying the metadata for
an index. As part of this change, `remove` is also appended to the path
to be more explicit about its actions.
2018-11-06 07:46:25 -08:00
Tal Levy 2bf843e768 [TEST] Mute ChangePolicyForIndexIT#testChangePolicyForIndex 2018-11-06 06:09:49 -08:00
Nik Everett f72ef9b5fd
Build: Pull "skip assemble on qa" to common build (#35214)
Pull all of the logic that we use to skip the `assemble` and
`dependenciesInfo` tasks on `qa` projects into one spot in our root
build file.
2018-11-05 16:16:00 -05:00
Gordon Brown 0fbb8a16bc
Skip Rollover step if next index already exists (#35168)
If the Rollover step would fail due to the next index in sequence
already existing, just skip to the next step instead of going to the
Error step.

This prevents spurious `ResourceAlreadyExistsException`s created by
simultaneous RolloverStep executions from causing ILM to error out
unnecessarily.
2018-11-05 09:20:43 -07:00
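The error handling from #35168 amounts to treating one specific failure as success. The Python sketch below illustrates that idea only: `ResourceAlreadyExistsError`, `run_rollover_step`, and the returned step labels are hypothetical, not ILM's real types or step keys.

```python
class ResourceAlreadyExistsError(Exception):
    """Raised when the next index in the sequence already exists."""


def run_rollover_step(rollover):
    """Run a rollover callable and decide where the lifecycle goes next."""
    try:
        rollover()
    except ResourceAlreadyExistsError:
        # Another execution already created the next index: carry on.
        return "next-step"
    except Exception:
        # Any other failure is still reported as an error.
        return "ERROR"
    return "next-step"
```

Two simultaneous executions then both end on the next step, instead of one of them pushing the index into the error state.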
Lee Hinman 3473217563
Remove Joda usage from ILM (#35220)
This commit removes the Joda time usage from ILM and the HLRC components of ILM.
It also fixes an issue where using the `?human=true` flag could have caused the
parser not to work. These millisecond fields now follow the standard we use
elsewhere in the code, with additional fields added iff the `human` flag is
specified.

This is a breaking change for ILM, but since ILM has not yet been released, no
compatibility shim is needed.
2018-11-05 08:17:15 -07:00
Alexander Reelsen 409050e8de
Refactor: Remove settings from transport action CTOR (#35208)
As settings are not used in the transport action constructor, this
removes the passing of the settings in all the transport actions.
2018-11-05 13:08:18 +01:00
Gordon Brown b3da3eae08
[ILM] Fix race condition in test (#35143)
Previously, testRunStateChangePolicyWithNextStep asserted that the
ClusterState before and after running the steps were equal. The test
only passed due to a race condition: The latch would be triggered by the
step execution, but the cluster state update thread would continue
running before committing the change to the cluster state. This allowed
the test to read the old cluster state and pass the equality check about
99.99% of the time.

The test now waits for the new cluster state to be committed before
checking that it is _not_ equal to the old cluster state.
2018-11-02 11:09:48 -06:00
Jason Tedor 1e241190eb
Disable assemble task from ILM qa projects
This commit disables the assemble tasks from all ILM qa projects. These
projects do not have an assemble task to execute.
2018-11-02 11:16:34 -04:00
Tal Levy 6b312a500d uninherit from AbstractComponent in IndexLifecycleService 2018-11-01 10:22:55 -07:00
Tal Levy f8e23f6400
update ILM integ test cluster poll interval to 1s (#35113) 2018-10-31 17:09:35 -07:00
Tal Levy 5f4b23f8c1
cleanup ILM qa structure (#35110)
This commit does a few things

- moves ILM-specific rest yaml tests into plugin/ilm/qa, and creates special
  :plugin:ilm:qa:rest module to test them
- removes the with-security tests of the yaml tests since they are covered in
  the rest tests now
- moves ChangePolicyForIndexIT into the qa/multi-node project since that test is
  not currently running in main ilm since integTest is disabled
2018-10-31 11:49:29 -07:00
Tal Levy a294a7c6b5 fix IndexLifecycleService setting member
the settings variable was previously created by
the AbstractComponent class inherited by IndexLifecycleService.
this is no more.
2018-10-31 11:17:16 -07:00
Tal Levy 5141084048
rename CRUD api REST path prefix _ilm to _ilm/policy (#35056)
This PR renames the CRUD APIs for ILM:

GET _ilm/<policy>, _ilm -> _ilm/policy/<policy>, _ilm/policy
PUT _ilm/<policy> -> _ilm/policy/<policy>
DELETE _ilm/<policy> -> _ilm/policy/<policy>

closes #34929.
2018-10-30 16:19:05 -07:00
Gordon Brown 6ecb8ff344
Move to Error step if ClusterState* steps throw (#35069)
Previously, if ClusterStateActionSteps or ClusterStateWaitSteps threw an
exception executing, the exception would only be caught and logged by
the generic ClusterStateUpdateTask machinery and the index would become
stuck on that step.

Now, exceptions thrown in these steps will be caught and the index will
be moved to the Error step.
2018-10-30 13:33:32 -06:00
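The behavior described in #35069 can be illustrated with a minimal sketch. It is not the actual ILM implementation: the class `ErrorStepSketch`, the method `runStep`, and the plain string step names are all hypothetical stand-ins used only to show the idea of catching an exception and routing the index to an error step instead of leaving it stuck.

```java
// Hypothetical sketch, not Elasticsearch code: a step that throws no longer
// leaves the index stuck; the caller moves the index to an error step instead.
final class ErrorStepSketch {
    // Illustrative name for the error step; the real step key is not shown here.
    static final String ERROR_STEP = "ERROR";

    /**
     * Runs the body of a step and returns the name of the step the index
     * should move to: the regular next step on success, the error step if
     * the body throws.
     */
    static String runStep(String nextStep, Runnable body) {
        try {
            body.run();
            return nextStep;
        } catch (RuntimeException e) {
            // Previously the exception would only have been logged by generic
            // machinery; here it is turned into an explicit transition.
            return ERROR_STEP;
        }
    }
}
```

In this sketch, a body that completes normally yields the regular next step, while a body that throws yields `ERROR_STEP`.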
Gordon Brown f6ac0e4bbc
[ILM] Fix Move To Step API causing ILM to hang (#34618)
The Move To Step API now checks to see if the target step is an
AsyncActionStep, and if so, runs it.

Previously, AsyncActionSteps would only be run when they are entered by
executing the previous step, so if an AsyncActionStep was entered via
the Move To Step API, ILM would never touch that index again.
2018-10-29 11:18:12 -06:00
Tal Levy f6ce935444
fix `GET _ilm` response with uninitialized ILM metadata (#34881)
ILM would return a resource-not-found exception when requesting policies
while the IndexLifecycleMetaData is not initialized. The behavior here
should not be as extreme since it is not the user's fault.

This commit changes the behavior so that it succeeds and returns no policies
when no policy names are explicitly specified; otherwise it keeps the same
behavior of throwing an exception.
2018-10-25 16:00:44 -07:00
Tal Levy 41eaa586e8
remove index.lifecycle.skip setting (#34823)
With the introduction of the _ilm/stop and _ilm/start APIs, the
use cases where one would only target a select group
of indices to start/stop have been reduced. Since there is no
strong use case for skipping specific indices, it is best to
remove this functionality and only add it back if later desired, in the
hope of keeping things simpler.
2018-10-25 07:27:04 -07:00
Tal Levy 21b9b024c7
fix PolicyStatsTests mutateInstance (#34835)
through randomization, there is a chance that the mutateInstance
for PolicyStatsTests does not actually mutate the original object.
This PR aims to fix this
2018-10-25 07:24:54 -07:00
Colin Goodheart-Smithe 0b26f8b14c
Fixes NPE in multi node qa test 2018-10-25 10:45:30 +01:00
Colin Goodheart-Smithe e7fddb5c93
Adds usage data for ILM (#33377)
* Adds usage data for ILM

* Adds tests for IndexLifecycleFeatureSetUsage and friends

* Adds tests for IndexLifecycleFeatureSet

* Fixes merge errors

* Add number of indices managed to usage stats

Also adds more tests

* Addresses Review comments
2018-10-24 18:28:46 +01:00
Colin Goodheart-Smithe c7fe87e43f
Removes Set Policy API in favour of setting index.lifecycle.name directly (#34304)
* Removes Set Policy API in favour of setting index.lifecycle.name
directly

* Reinstates matcher that will still be used

* Cleans up code after rebase

* Adds test to check changing policy with index settings works

* Fixes TimeseriesLifecycleActionsIT after API removal

* Fixes docs tests

* Fixes case on close where lifecycle service was never created
2018-10-24 16:14:59 +01:00
Lee Hinman c5a264e77f
Ensure phase_time is set when in the "new" phase (#34280)
Since there's no transition into the "new" phase it wasn't set until the "hot"
phase, so now we initialize it when initializing the policy context.

Resolves #34277
2018-10-23 15:20:41 -06:00
Gordon Brown 9cb0bb8b9f
Rework ILM build to separate integration tests (#34617)
Having integration tests separated from the unit tests in the qa
directory works much more smoothly with our testing infrastructure,
matches what other plugins do, and tests in a more "real" deployment
scenario by having all plugins installed.
2018-10-18 13:33:33 -06:00
Tal Levy fdb850735a fix setting version on deleting unmanaged indices with wildcard 2018-10-16 23:39:48 -07:00
Tal Levy 3a555da34d update version on ILM setting updates 2018-10-16 15:43:10 -07:00
Jack Conradson 80474e138f
HLRC: Add remove index lifecycle policy (#34204)
This change adds the command RemoveIndexLifecyclePolicy to the HLRC. This uses the 
new TimeRequest as a base class for RemoveIndexLifecyclePolicyRequest on the client side.
2018-10-16 08:12:06 -07:00
Lee Hinman 9ad2a7fa77
Fix expected next step being incorrect when executing async action (#34313)
This fixes an issue where an incorrect expected next step is used when checking
to execute `AsyncActionStep`s after a cluster state step.

It fixes this scenario:

- `ExecuteStepsUpdateTask` executes a `ClusterStateWaitStep` or
  `ClusterStateActionStep` successfully
- The next step is also a `ClusterStateWaitStep`, so it loops
- The `ClusterStateWaitStep` has a next stepkey (which gets set to the
  `nextStepKey` in the code)
- The `ClusterStateWaitStep` fails the condition, meaning that it will have to
  wait longer
- The `nextStepKey` is now incorrect though, because we did not advance the
  index's step, and it's not `null` (which is another safe value if there is no
  step after the `ClusterStateWaitStep`)

This fixes the problem by resetting the nextStepKey to null if the condition is
not met, since we are not going to advance the step metadata in this
case (thereby skipping the `maybeRunAsyncAction` invocation).

This commit also tightens up and enhances much of the ILM logging. A lot of
logging was missing the index name (making it hard to debug in the presence of
multiple indices) and a lot was using the wrong logging level (DEBUG is now
actually readable without being a wall of text).

Resolves #34297
2018-10-08 11:25:18 -06:00
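The scenario fixed in #34313 can be sketched with a simplified loop. This is only an illustration under stated assumptions: `NextStepKeySketch`, its nested `Step` class, and the use of plain strings as step keys are hypothetical and do not mirror the real `ExecuteStepsUpdateTask`. Steps present in the map stand in for cluster state steps; a key that is absent from the map stands in for a step of another type, such as an `AsyncActionStep`.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Elasticsearch code.
final class NextStepKeySketch {
    // Simplified stand-in for a cluster state step. An action step can be
    // modeled as a step whose condition is always met.
    static final class Step {
        final String nextKey;
        final BooleanSupplier conditionMet;

        Step(String nextKey, BooleanSupplier conditionMet) {
            this.nextKey = nextKey;
            this.conditionMet = conditionMet;
        }
    }

    /**
     * Executes consecutive cluster state steps starting at startKey.
     * Returns the key of the step the index advanced into (the candidate
     * for an async action check), or null if the index did not advance.
     */
    static String execute(Map<String, Step> clusterStateSteps, String startKey) {
        String nextStepKey = null;
        Step current = clusterStateSteps.get(startKey);
        while (current != null) {
            nextStepKey = current.nextKey;
            if (!current.conditionMet.getAsBoolean()) {
                // The fix: the index stays on this step, so the key must be
                // reset rather than left pointing at a step never entered.
                nextStepKey = null;
                break;
            }
            current = nextStepKey == null ? null : clusterStateSteps.get(nextStepKey);
        }
        return nextStepKey;
    }
}
```

Without the reset inside the `if`, the method would return the waiting step's next key even though the index never left the waiting step, which is the incorrect expected next step the commit message describes.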
Gordon Brown 13d89295c8
Provide useful error when a policy doesn't exist (#34206)
When an index is configured to use a lifecycle policy that does not
exist, this will now be noted in the step_info for that policy.
2018-10-04 08:21:55 -06:00
Tal Levy f10735aa9a ILM integration test with full policy (#33402)
- this adds an integration test that runs through a policy
with all the actions defined.
- adds a test specific to a policy having just a rollover action
- bumps the node count to 4
2018-10-03 12:20:43 -06:00
Lee Hinman 388f754a8e
Change step execution flow to be deliberate about type (#34126)
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run periodically, i.e., every 10 minutes by default.
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. In addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to #29823
2018-10-02 20:02:50 -06:00
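The execution flow described in #34126 amounts to a mapping from step type to the events that may run it. The sketch below is a hypothetical model only: `StepScheduling`, `StepType`, `Trigger`, and both methods are illustrative names and are not taken from the ILM code.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not Elasticsearch code: which event may run which
// kind of step, following the commit message above.
final class StepScheduling {
    enum StepType { ASYNC_WAIT, CLUSTER_STATE_ACTION, CLUSTER_STATE_WAIT, ASYNC_ACTION }

    enum Trigger { PERIODIC_POLL, CLUSTER_STATE_CHANGE, STEP_TRANSITION, MASTER_ELECTED }

    static Set<Trigger> triggersFor(StepType type) {
        switch (type) {
            case ASYNC_WAIT:
                // Run periodically (every 10 minutes by default, per the message).
                return EnumSet.of(Trigger.PERIODIC_POLL);
            case CLUSTER_STATE_ACTION:
            case CLUSTER_STATE_WAIT:
                // Run every time the cluster state changes.
                return EnumSet.of(Trigger.CLUSTER_STATE_CHANGE);
            case ASYNC_ACTION:
                // Run once when transitioned into, and again if a newly
                // elected master finds it set as the current step.
                return EnumSet.of(Trigger.STEP_TRANSITION, Trigger.MASTER_ELECTED);
            default:
                throw new AssertionError("unknown step type: " + type);
        }
    }

    static boolean shouldRun(StepType type, Trigger trigger) {
        return triggersFor(type).contains(trigger);
    }
}
```

Keeping `ASYNC_ACTION` out of the periodic and cluster-state triggers is what prevents a non-idempotent step from being started more than once concurrently in this model.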
Lee Hinman 2d9cb21490 Merge remote-tracking branch 'origin/master' into index-lifecycle 2018-10-01 14:10:09 -06:00
Lee Hinman a49d59802a
Use more descriptive task names for ILM cluster state updates (#34161)
Rather than using "ILM" for everything, we should use more descriptive names so
debugging from logs is easier to do.

Resolves #34118
2018-10-01 13:45:26 -06:00
Gordon Brown c0bfc07f53
Only make indexes read-only on Shrink and ForceMerge actions (#33907)
ILM now only forces indices to become read only in the case of Shrink
and Force Merge actions, as these are most useful in cases where the
index is no longer being written to.
2018-09-25 10:16:01 -06:00
Gordon Brown 90de436e55
Use custom index metadata for ILM state (#33783)
Using index settings for ILM state is fragile and exposes too much
information that doesn't need to be exposed. Using custom index metadata
is more resilient and allows more controlled access to internal
information.

As part of these changes, moves away from using defaults for ILM-related
values, in favor of using null values to clearly indicate that the value is not
present.
2018-09-19 14:50:48 -06:00
Lee Hinman 27dd25857b
Rebuild step on PolicyStepsRegistry.getStep (#33780)
This moves away from caching a list of steps for a current phase, instead
rebuilding the necessary step from the phase JSON stored in the index's
metadata.

Relates to #29823
2018-09-18 17:07:57 -06:00
Lee Hinman 11a55d2307 [TEST] Handle an IndexLifecycleService that has not started up 2018-09-18 14:02:09 -06:00
Tal Levy 94a66c556d
add phase execution info to ILM Explain API (#33488)
adds a section for phase execution to the Explain API.

This contains

- phase definition
- policy name
- policy version
- modified date
2018-09-17 17:00:00 -07:00
Lee Hinman 1f048d3d3f
Remove unneeded listener on MoveToNextStepUpdateTask (#33725)
There was a listener that re-runs the policy with the new state when the cluster
state is processed by the `MoveToNextStepUpdateTask`. This removes this listener
as we will execute the policy through the `IndexLifecycleService` cluster state
listener.
2018-09-14 14:38:23 -06:00
Lee Hinman b7649fce0c
Rename "after" to "minimum_age" in lifecycle definition (#33530)
This renames the "after" field to better reflect what the meaning is.

Supersedes #32624
2018-09-08 21:40:55 -06:00
Lee Hinman 8fa8dea138
Encapsulate Client as class variable for PolicyStepsRegistry (#33529)
Rather than pass in the client on the `update` step, this makes it passed in to
the constructor so it's not required on every update.
2018-09-07 16:32:25 -06:00
Colin Goodheart-Smithe f83641346f
Adds checks to ensure index metadata exists when we try to use it (#33455)
* Adds checks to ensure index metadata exists when we try to use it

* Fixes failing test
2018-09-07 13:06:51 +01:00
Tal Levy 21bb4720a2
add notion of version and modified_date to LifecyclePolicyMetadata (#33450)
It is useful to keep track of which version of a policy is currently
being executed by a specific index. For management purposes, it would
also be useful to know at which time the latest version was inserted
so that an audit trail is left for reconciling changes happening in ILM.
2018-09-06 13:32:24 -07:00
Lee Hinman b335487ca6 Fix qa build.gradle so gradle assemble works correctly
There is a new way to disable assembling from certain subdirectories
2018-09-06 11:22:27 -06:00
Lee Hinman 96d515e3f5
Replace PhaseAfterStep with PhaseCompleteStep (#33398)
This removes `PhaseAfterStep` in favor of a new `PhaseCompleteStep`. This step
is only a marker that the `LifecyclePolicyRunner` needs to halt until the time
indicated for entering the next phase.

This also fixes a bug where phase times were encapsulated into the policy
instead of dynamically adjusting to policy changes.

Supersedes #33140, which it replaces
Relates to #29823
2018-09-05 16:37:45 -06:00