OpenSearch

Commit Graph

Author	SHA1	Message	Date
Andrei Dan	81388051d8	Reenable testWhenUserLimitedByOnlyAliasOfIndexCanWriteToIndexWhichWasRolledoverByILMPolicy (#51768 ) (#51801 ) We suspect the flakiness could’ve come from the fact that the rollover step used to create the new index and roll the write alias to the new index in separate cluster state updates. So the assertion that the rolled index exists could’ve passed in the test but, before the alias was rolled over to the new index, the subsequent write we execute in the test (namely `indexDocs("test_user", "x-pack-test-password", "foo_alias", 1)`) would’ve sent the new document to the source index (ie. foo-logs-000001) This would see the source index containing 3 documents and the rolled index (foo-logs-000002) 0 documents. However, we fixed this and the rollover step executes the “create index and roll alias” in one single cluster update, so this situation should not occur anymore. (cherry picked from commit 834261c4fe7dd93f437eeec43c00d01ff2279f86) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-02-03 11:54:00 +00:00
Lee Hinman	4594a210bf	[7.x] Fix SnapshotLifecycleRestIT.testFullPolicySnapshot (#517… (#51778 ) * Fix SnapshotLifecycleRestIT.testFullPolicySnapshot This previously was missing some key information in the output of the failure. This captures that information and adds logging at each step so we can determine the cause if it fails again. Resolves #50358	2020-01-31 15:38:28 -07:00
Andrei Dan	20f47b14b0	Fix SnapshotLifecycleServiceTests.testPolicyCRUD (#51653 ) (#51755 ) (cherry picked from commit 8f9a87fa576a8a1c6ea3efb29bf1296d50d89ace) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-31 18:17:38 +00:00
Lee Hinman	deefc85d60	[7.x] Stop policy on last PhaseCompleteStep instead of Termina… (#51758 ) Currently when an ILM policy finishes its execution, the index moves into the `TerminalPolicyStep`, denoted by a completed/completed/completed phase/action/step lifecycle execution state. This commit changes the behavior so that the index lifecycle execution state halts at the last configured phase's `PhaseCompleteStep`, so for instance, if an index were configured with a policy containing a `hot` and `cold` phase, the index would stop at the `cold/complete/complete` `PhaseCompleteStep`. This allows an ILM user to update the policy to add any later phases and have indices configured to use that policy pick up execution at the newly added "later" phase. For example, if a `delete` phase were added to the policy specified above, the index would then move from `cold/complete/complete` into the `delete` phase. Relates to #48431	2020-01-31 10:36:41 -07:00
Gordon Brown	10c8179351	Use exclusions list instead of fake system indices (#51586 ) This commit switches the strategy for managing dot-prefixed indices that should be hidden indices from using "fake" system indices to an explicit exclusions list that must be updated when those indices are converted to hidden indices.	2020-01-30 16:31:27 -07:00
David Roberts	e0e35b7feb	[TEST] Mute TimeSeriesLifecycleActionsIT.testWaitForSnapshotSlmExecutedBefore Due to https://github.com/elastic/elasticsearch/issues/50781	2020-01-29 13:08:55 +01:00
Gordon Brown	89c2834b24	Deprecate creation of dot-prefixed index names except for hidden and system indices (#49959 ) This commit deprecates the creation of dot-prefixed index names (e.g. .watches) unless they are either 1) a hidden index, or 2) registered by a plugin that extends SystemIndexPlugin. This is the first step towards more thorough protections for system indices. This commit also modifies several plugins which use dot-prefixed indices to register indices they own as system indices, and adds a plugin to register .tasks as a system index.	2020-01-28 10:01:16 -07:00
Andrei Dan	977cce002e	Preserve slm-history-ilm-policy between test runs (#51442 ) (#51468 ) (cherry picked from commit 4e95c8a94fa700d44ac31ef17547512748ab1885) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-27 10:40:40 +00:00
Andrei Dan	d872db278a	Fix TimeSeriesLifecycleActionsIT.testShrinkAction (#51431 ) (#51467 ) * Fix TimeSeriesLifecycleActionsIT.testShrinkAction Shrinking a 6 shard index to 3 shards can be quite time consuming and assertBusy probes the conditions at exponentially growing intervals. This separates the one assertion that was used for all the conditions into multiple assertBusy statements and increases the timeout for waiting for the shrink to complete. * Allow more time for shrink to complete This commit allows more time for the shrink operation to complete in testRetryFailedShrinkAction (separating the assertBusy calls too) and testMoveToRolloverStep. * Shrink to no more than 2 shards in tests (cherry picked from commit 5fe780148fa3536915d61475b087896a5b9ace82) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-27 10:40:29 +00:00
Lee Hinman	8560847dd9	[7.x] Check all snapshots in SnapshotLifecycleRestIT.testFullP… (#51448 ) * Check all snapshots in SnapshotLifecycleRestIT.testFullPolicy Rather than check the first returned snapshot for a snapshot starting with `snap-` in SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting with `snap-`. In the event that there are no snapshots (the failure case), this also exposes the full results map so we can diagnose why a failure occurred. Relates to #50358 * Use a more imperative style for checking	2020-01-24 14:30:42 -07:00
Lee Hinman	bdb8b6aa0d	[7.x] Separate aliases used for tests in TimeSeriesLifecycleAc… (#51432 ) * Separate aliases used for tests in TimeSeriesLifecycleActionsIT This is related to #51375 and hopes to help illuminate why some of those tests are failing. This commit switches the aliases used in the test to use a random alias name every time (since there were some complaints in the tests about aliases having more than one write index). With this we hope to determine the actual cause of the failure in the test. This also adds additional information to the exception returned when calling move-to-step with the incorrect current step. * Fix rest test	2020-01-24 11:05:19 -07:00
Andrei Dan	2f7c240184	[7.x] Use ESSingleNodeTestCase instead of ESIntegTestCase (#51345 ) (#51346 ) * Use ESSingleNodeTestCase instead of ESIntegTestCase (#51345) (cherry picked from commit abcf1c41faf05a0b0196fb06e57c3de8c3d67688) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-24 10:53:37 +00:00
Nhat Nguyen	072203cba8	Clean soft-deletes setting in ccr tests (#51113 ) (#51372 ) We no longer need to explicitly enable soft-deletes in CCR tests. Relates #50775 Backport of #51113	2020-01-23 16:31:47 -05:00
Przemko Robakowski	84664e8d60	Expose master timeout for ILM actions (#51130 ) (#51348 ) This change exposes master timeout to ILM steps through global dynamic setting. All currently implemented steps make use of this setting as well. Closes #44136	2020-01-23 15:28:13 +01:00
Andrei Dan	421aa14972	ILM: Make UpdateSettingsStep retryable (#51235 ) (#51298 ) This makes the UpdateSettingsStep retryable. This step updates settings needed during the execution of ILM actions (mark indexes as read-only, change allocation configurations, mark indexing complete, etc) As the index updates are idempotent in nature (PUT requests and are applied only if the values have changed) and the settings values are seldom user-configurable (aside from the allocate action) the testing for this change goes along the lines of artificially simulating a setting update failure on a particular value update, which is followed by a successful step execution (a retry) in an environment outside of ILM (the step executions are triggered manually). (cherry picked from commit 8391b0aba469f39532bfc2796b76148167dc0289) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-22 11:02:26 +00:00
Andrei Dan	123266714b	ILM wait for active shards on rolled index in a separate step (#50718 ) (#51296 ) After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active). This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and, the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created. (cherry picked from commit 457a92fb4c68c55976cc3c3e2f00a053dd2eac70) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-22 11:01:52 +00:00
Tim Vernum	a0ca82422c	Mute TimeSeriesLifecycleActionsIT.waitForSnapshot (#51208 ) This test was recently un-muted, but is still failing Relates: #50781 Backport of: #51203	2020-01-20 20:19:29 +11:00
Lee Hinman	731c96b507	[7.x] Use separate policies for tests in SnapshotLifecycleRest… (#51181 ) These policies store statistics, but since stats updating is asynchronous, it's possible for the update from one test to bleed into a separate one. This change switches the tests to use separate policy ids so that their stats are tracked independently. It also relaxes the checking constraint in one of the tests. Hopefully this: Resolves #48531 Resolves #48017	2020-01-17 13:26:40 -07:00
Lee Hinman	e395cf3419	Guard against null settings in CCRIndexLifecycleIT (#51008 ) (#51054 ) It's possible that the index could return no settings and thus throw a `NullPointerException`. I wasn't able to reproduce the original issue, but this should guard against it in the future. Resolves #50646 Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>	2020-01-15 11:21:18 -07:00
Lee Hinman	ad60f0015e	Address failures in SnapshotLifecycleRestIT.testFullPolicySnapshot (#51013 ) This test failed a couple of different ways, related to timing, as well as concurrent snapshots, and also naming. This commit splits the giant `assertBusy` into separate parts so that we don't perform ~5 different requests and tests in the same loop. It also gives each test a unique repository so that no other test can accidentally re-use snapshots. Resolves #50358 (hopefully!)	2020-01-15 09:47:41 -07:00
David Kyle	69a3626ee1	Mute SnapshotLifecycleRestIT testFullPolicySnapshot Relates to #50358	2020-01-14 13:46:37 +01:00
Przemko Robakowski	a18736b46d	[7.x] ILM action to wait for SLM policy execution (#50454 ) (#50943 ) * ILM action to wait for SLM policy execution (#50454) This change adds a new ILM action to wait for SLM policy execution to ensure that an index has a snapshot before deletion. Closes #45067 * Fix flaky TimeSeriesLifecycleActionsIT#testWaitForSnapshot test This change adds some randomness and a cleanup step to the TimeSeriesLifecycleActionsIT#testWaitForSnapshot and testWaitForSnapshotSlmExecutedBefore tests in an attempt to make them stable. Relates to #50781 * Formatting changes * Longer timeout * Fix Map.of in Java8 * Unused import removed	2020-01-14 01:34:33 +01:00
Lee Hinman	91689e793d	[7.x] Refresh cached phase policy definition if possible on ne… (#50941 ) * Refresh cached phase policy definition if possible on new policy There are some cases when updating a policy does not change the structure in a significant way. In these cases, we can reread the policy definition for any indices using the updated policy. This commit adds this refreshing to the `TransportPutLifecycleAction` to allow this. It allows us to do things like change the configuration values for a particular step, even when on that step (for example, changing the rollover criteria while on the `check-rollover-ready` step). There are more cases where the phase definition can be reread than just the ones checked here (for example, removing an action that has already been passed), and those will be added in subsequent work. Relates to #48431	2020-01-13 14:31:41 -07:00
Lee Hinman	63472d30c7	[7.x] Fix SLM check for restore in progress (#50868 ) (#50876 ) * Fix SLM check for restore in progress (#50868) * Fix SLM check for restore in progress This commit fixes the check in SLM where the `RestoreInProgress` metadata was checked for existence. Rather than check existence we should instead check the `isEmpty` method. Prior to this, a successful restore for a repository that used SLM retention would prevent SLM retention from running in subsequent invocations, due to SLM thinking that a restore was still running. * Fix 7.x-isms	2020-01-10 14:27:55 -07:00
Lee Hinman	8dc6e98819	[7.x] Make InitializePolicyContextStep retryable (#50685 ) (#50760 ) This commit makes the "init" ILM step retryable. It also adds a test where an index is created with a non-parsable index name and then fails. Related to #48183	2020-01-08 13:13:57 -07:00
Nhat Nguyen	90e66a7b97	Mute testPolicyCRUD Tracked at #44997	2020-01-08 13:25:40 -05:00
Lee Hinman	615532b4f8	Mute TimeSeriesLifecycleActionsIT.testHistoryIsWritten* (#50755 ) Related to #50353	2020-01-08 10:35:44 -07:00
Andrei Dan	3915d4c055	Make the UpdateRolloverLifecycleDateStep retryable (#50702 ) (#50730 ) This makes the "update-rollover-lifecycle-date" step, which is part of the rollover action, retryable. It also adds an integration test to check the step is retried and it eventually succeeds. (cherry picked from commit 5bf068522deb2b6cd2563bcf80f34fdbf459c9f2) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-08 11:45:26 +01:00
Lee Hinman	552edd862e	[7.x] Add additional logging for ILM history store tests (#5062… (#50678 ) * Add additional logging for ILM history store tests (#50624) These tests use the same index name, making it hard to read logs when diagnosing the failures. Additionally more information about the current state of the index could be retrieved when failing. This changes these two things in the hope of capturing more data about why this fails on some CI nodes but not others. Relates to #50353	2020-01-06 15:24:24 -07:00
Christoph Büscher	6c8868e955	Mute TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithSuccess Also muting TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure. Tracked in #50353	2020-01-03 18:32:03 +01:00
Andrei Dan	3c971f2911	ILM retryable async action steps (#50522 ) (#50591 ) This adds support for retrying AsyncActionSteps by triggering the async step after ILM was moved back on the failed step (the async step we'll be attempting to run after the cluster state reflects ILM being moved back on the failed step). This also marks the RolloverStep as retryable and adds an integration test where the RolloverStep is failing to execute as the rolled over index already exists to test that the async action RolloverStep is retried until the rolled over index is deleted. (cherry picked from commit 8bee5f4cb58a1242cc2ef4bc0317dae6c8be49d3) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-01-03 16:19:58 +02:00
Lee Hinman	c3c9ccf61f	[7.x] Add ILM history store index (#50287 ) (#50345 ) * Add ILM history store index (#50287) * Add ILM history store index This commit adds an ILM history store that tracks the lifecycle execution state as an index progresses through its ILM policy. ILM history documents store output similar to what the ILM explain API returns. An example document with ALL fields (not all documents will have all fields) would look like: ```json { "@timestamp": 1203012389, "policy": "my-ilm-policy", "index": "index-2019.1.1-000023", "index_age":123120, "success": true, "state": { "phase": "warm", "action": "allocate", "step": "ERROR", "failed_step": "update-settings", "is_auto-retryable_error": true, "creation_date": 12389012039, "phase_time": 12908389120, "action_time": 1283901209, "step_time": 123904107140, "phase_definition": "{\"policy\":\"ilm-history-ilm-policy\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"rollover\":{\"max_size\":\"50gb\",\"max_age\":\"30d\"}}},\"version\":1,\"modified_date_in_millis\":1576517253463}", "step_info": "{... etc step info here as json ...}" }, "error_details": "java.lang.RuntimeException: etc\n\tcaused by:etc etc etc full stacktrace" } ``` These documents go into the `ilm-history-1-00000N` index to provide an audit trail of the operations ILM has performed. This history storage is enabled by default but can be disabled by setting `index.lifecycle.history_index_enabled` to `false`. Resolves #49180 * Make ILMHistoryStore.putAsync truly async (#50403) This changes the `putAsync` method in `ILMHistoryStore` to never block. Previously, due to the way that the `BulkProcessor` works, it was possible for `BulkProcessor#add` to block executing a bulk request. This was bad as we may be adding things to the history store in cluster state update threads. This also moves the index creation to be done prior to the bulk request execution, rather than being checked every time an operation was added to the queue. This lessens the chance of the index being created, then deleted (by some external force), and then recreated via a bulk indexing request. Resolves #50353	2019-12-20 12:33:36 -07:00
David Turner	285eacd267	Use more specific loggers in subclasses of TMNA (#50076 ) Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers instead of the one for the base class. Relates #50056. Partial backport of #46431 to 7.x.	2019-12-11 15:07:47 +00:00
Lee Hinman	8205cdd423	[7.x] Refactor IndexLifecycleRunner to split state modificatio… (#49936 ) This commit refactors the `IndexLifecycleRunner` to split out and consolidate the number of methods that change state from within ILM. It adds a new class `IndexLifecycleTransition` that contains a number of static methods used to modify ILM's state. These methods all return new cluster states rather than making changes themselves (they can be thought of as helpers for modifying ILM state). Rather than having multiple ways to move an index to a particular step (like `moveClusterStateToStep`, `moveClusterStateToNextStep`, `moveClusterStateToPreviouslyFailedStep`, etc (there are others)) this now consolidates those into three with (hopefully) useful names: - `moveClusterStateToStep` - `moveClusterStateToErrorStep` - `moveClusterStateToPreviouslyFailedStep` In the move, I was also able to consolidate duplicate or redundant arguments to these functions. Prior to this commit there were many calls that provided duplicate information (both `IndexMetaData` and `LifecycleExecutionState` for example) where the duplicate argument could be derived from a previous argument with no problems. With this split, `IndexLifecycleRunner` now contains the methods used to actually run steps as well as the methods that kick off cluster state updates for state transitions. `IndexLifecycleTransition` contains only the helpers for constructing new states from given scenarios. This also adds Javadocs to all methods in both `IndexLifecycleRunner` and `IndexLifecycleTransition` (this accounts for almost all of the increase in code lines for this commit). It also makes all methods be as restrictive in visibility, to limit the scope of where they are used. This refactoring is part of work towards capturing actions and transitions that ILM makes, by consolidating and simplifying the places we make state changes, it will make adding operation auditing easier.	2019-12-06 12:55:16 -07:00
Armin Braun	af0f97d50a	Fix SLMSnapshotBlockingIntegTests.testSnapshotInProgress (#49533 ) (#49542 ) This test must check for state `SUCCESS` as well. `SUCCESS` in `SnapshotsInProgress` means "all data nodes finished snapshotting successfully but master must still finalize the snapshot in the repo". `SUCCESS` does not mean that the snapshot is actually fully finished in this object. You can easily reproduce the scenario in #49303 that has an in-progress snapshot in `SUCCESS` state by waiting 20s before running the busy assert loop on the snapshot status so that all steps but the blocked finalization can finish. Closes #49303	2019-11-25 13:31:45 +01:00
Andrei Dan	010c3de47e	Slm set operation mode to RUNNING on first run (#49236 ) (#49425 ) * SLM set the operation mode to RUNNING on first run Set the SLM operation mode to RUNNING when setting the first SLM lifecycle policy. Historically, SLM was not decoupled from ILM but now they are independent components. Setting the SLM operation mode to what the ILM running mode was when we set the first SLM lifecycle policy was a remnant of those times. * SLM update package info * SLM suppress unused warning * SLM use logger for the correct class * SLM Add integration test for operation mode * Use ESSingleNodeTestCase instead of ESIntegTestCase (cherry picked from commit 4ad3d93f89d03bf9a25685a990d1a439f33ce0e6) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-11-21 11:41:32 +00:00
Jay Modi	eed4cd25eb	ThreadPool and ThreadContext are not closeable (#43249 ) (#49273 ) This commit changes the ThreadContext to just use a regular ThreadLocal over the lucene CloseableThreadLocal. The CloseableThreadLocal solves issues with ThreadLocals that are no longer needed during runtime but in the case of the ThreadContext, we need it for the runtime of the node and it is typically not closed until the node closes, so we miss out on the benefits that this class provides. Additionally by removing the close logic, we simplify code in other places that deal with exceptions and tracking to see if it happens when the node is closing. Closes #42577	2019-11-19 13:15:16 -07:00
Andrei Dan	19780e20ba	Handle failure to retrieve ILM policy step better (#49193 ) (#49316 ) This commit wraps the calls to retrieve the current step in a try/catch so that the exception does not bubble up. Instead, step info is added containing the exception to the existing step. Semi-related to #49128 (cherry picked from commit 72530f8a7f40ae1fca3704effb38cf92daf29057) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-11-19 17:14:46 +00:00
Armin Braun	25cc8e3663	Fix RepoCleanup not Removed on Master-Failover (#49217 ) (#49239 ) The logic for `cleanupInProgress()` was backwards everywhere (method itself and all but one user). Also, we weren't checking it when removing a repository. This led to a bug (in the one spot that didn't use the method backwards) that prevented the cleanup cluster state entry from ever being removed from the cluster state if master failed over during the cleanup process. This change corrects the backwards logic, adds a test that makes sure the cleanup is always removed and adds a check that prevents repository removal during cleanup to the repositories service. Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.	2019-11-18 16:44:09 +01:00
Lee Hinman	680436dd0d	[7.x] Don't halt policy execution on policy trigger exception… (#49171 ) When triggered either by becoming master, a new cluster state, or a periodic schedule, an ILM policy execution through `maybeRunAsyncAction`, `runPolicyAfterStateChange`, or `runPeriodicStep` throwing an exception will cause the loop to terminate. This means that any indices that would have been processed after the index where the exception was thrown will not be processed by ILM. For most executions this is not a problem because the actual running of steps is protected by a try/catch that moves the index to the ERROR step in the event of a problem. If an exception occurs prior to step execution (for example, in fetching and parsing the current policy/step) however, it causes the loop termination previously mentioned. This commit wraps the invocation of the methods specified above in a try/catch block that provides better logging and does not bubble the exception up.	2019-11-15 09:22:37 -07:00
Andrei Dan	085d08cfd1	ILM Remove obsolete testRolloverAlreadyExists (#49104 ) (#49144 ) The rollover action is now a retryable step (see #48256) so ILM will keep retrying until it succeeds as opposed to stopping and moving the execution in the ERROR step. Fixes #49073 (cherry picked from commit 3ae90898121b43032ec8f3b50514d93a86e14d0f) Signed-off-by: Andrei Dan <andrei.dan@elastic.co> # Conflicts: # x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java	2019-11-15 12:06:22 +00:00
Rory Hunter	c46a0e8708	Apply 2-space indent to all gradle scripts (#49071 ) Backport of #48849. Update `.editorconfig` to make the Java settings the default for all files, and then apply a 2-space indent to all `*.gradle` files. Then reformat all the files.	2019-11-14 11:01:23 +00:00
Lee Hinman	5eb37c29fe	[7.x] Re-read policy phase JSON when using ILM's move-to-step… (#49011 ) When using the move-to-step API, we should reread the phase JSON from the latest version of the ILM policy. This allows a user to move to the same step while re-reading the policy's latest version. For example, when changing rollover criteria. While manually messing around with some other things I discovered that we only reread the policy when using the retry API, not the move-to-step API. This commit changes the move-to-step API to always read the latest version of the policy.	2019-11-12 19:41:06 -07:00
Armin Braun	3c20541823	Cleanup Concurrent RepositoryData Loading (#48329 ) (#48834 ) The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This led to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this led to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely but possible event of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122	2019-11-02 20:42:29 +01:00
Lee Hinman	6c290ecaf7	Fix ilm/20_move_to_step basic moving to step (#48821 ) Previously this step moved to the forcemerge step, however, if the machine running the test was fast enough, it would execute the forcemerge and move to the next step (`segment-count`) so the comparison would fail. This commit changes the step to be a step that will never go anywhere else, the terminal step. Resolves #48761	2019-11-01 13:58:24 -06:00
Andrei Dan	98a9227588	Fix TimeSeriesLifecycleActionsIT.testRolloverAlreadyExists (#48747 ) (#48795 ) * ILM Test asserts on the same ilm/_explain output With the introduction of retryable steps subsequent ilm/_explain calls can see the state of an ilm cycle move out of the error step. This test made several assertions assuming that the cycle remains in the error step so this commit changes the test to make one _explain call and have all the asserts work on the same ilm state (so subsequent assumptions to the cycle being in the error step are valid). * Drop unused field in test. (cherry picked from commit 44c74bb487151c886a08b27f32b13f7a72056997) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-11-01 12:34:33 +00:00
Lee Hinman	d0ead688c3	[7.x] Fix TimeSeriesLifecycleActionsIT.testExplainFilters (#48… (#48776 ) This test used an index without an alias to simulate a failure in the `check-rollover-ready` step. However, with #48256 that step automatically retries, meaning that the index may not always be in the ERROR step. This commit changes the test to use a shrink action with an invalid number of shards so that it stays in the ERROR step. Resolves #48767	2019-10-31 15:25:12 -06:00
Andrei Dan	ffe5d5417f	ILM Make the `check-rollover-ready` step retryable (#48256 ) (#48740 ) This adds the infrastructure to be able to retry the execution of retryable steps and makes the `check-rollover-ready` retryable as an initial step to make the rollover action more resilient to transient errors. (cherry picked from commit 454020ac8acb147eae97acb4ccd6fb470d1e5f48) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-10-31 11:28:55 +00:00
Lee Hinman	2d5291cf3b	Un-AwaitsFix and enhance logging for testPolicyCRUD (#48719 ) * Un-AwaitsFix and enhance logging for testPolicyCRUD This removes the `AwaitsFix` and increases the test logging for `SnapshotLifecycleServiceTests.testPolicyCRUD` in an effort to track down the cause of #44997. * Remove unused import	2019-10-30 17:02:57 -06:00
Lee Hinman	ed2bb73de2	Fix SnapshotLifecycleService logger (#48711 ) The logger was erroneously using the `SnapshotLifecycleMetadata` class for its initialization, making it hard to target packages for logging levels since `SnapshotLifecycleMetadata` is in a different package.	2019-10-30 13:13:50 -06:00
Lee Hinman	72a601c47f	[7.x] Don't schedule SLM jobs when services have been stopped… (#48692 ) This adds a guard for the SLM lifecycle and retention service that prevents new jobs from being scheduled once the service has been stopped. Previously, if the node were shut down the service would be stopped, but a cluster state or local master election would cause a job to attempt to be scheduled. This could lead to an uncaught `RejectedExecutionException`. Resolves #47749	2019-10-30 09:46:35 -06:00
Gordon Brown	25724c5c46	Adjust date parsing in ILM integration tests (#48648 ) The format returned by the API is not always parsable with `Instant.parse()`, so this commit adjusts to parsing those dates as `ISO_ZONED_DATE_TIME` instead, which appears to always parse the returned value correctly.	2019-10-29 15:44:04 -07:00
Gordon Brown	50d7424e7d	Unmute and increase logging on flaky SLM tests (#48612 ) The failures in these tests have been remarkably difficult to track down, in part because they will not reproduce locally. This commit unmutes the flaky tests and increases logging, as well as introducing some additional logging, to attempt to pin down the failures.	2019-10-29 13:39:19 -07:00
Gordon Brown	cf235796c0	Use more reliable "never run" cron pattern in tests (#48608 ) The cron schedule "1 2 3 4 5 ?" will run every May 4 at 03:02:01, which may result in unnecessary test failures once a year. This commit switches out uses of that schedule in tests for one which will never execute (because it specifies a day which doesn't exist, Feb. 31). Also factors the schedule out to a constant to make the intent clearer.	2019-10-29 09:33:14 -07:00
Gordon Brown	5021410165	Retry on RepositoryException in SLM tests (#48548 ) Due to a bug, GETing a snapshot can cause a RepositoryException to be thrown. This error is transient and should be retried, rather than causing the test to fail. This commit converts those RepositoryExceptions into AssertionErrors so that they will be retried in code wrapped in assertBusy.	2019-10-28 09:24:38 -07:00
Gordon Brown	c353ad71fe	Wrap ResponseException in AssertionError in ILM/CCR tests (#48489 ) When checking for the existence of a document in the ILM/CCR integration tests, `assertDocumentExists` makes an HTTP request and checks the response code. However, if the response code is not successful, the call will throw a `ResponseException`. `assertDocumentExists` is often called inside an `assertBusy`, and wrapping the `ResponseException` in an `AssertionError` will allow the `assertBusy` to retry. In particular, this fixes an issue with `testCCRUnfollowDuringSnapshot` where the index in question may still be closed when the document is requested.	2019-10-28 07:37:52 -07:00
Dimitrios Liappis	fc1b4ad23c	Mute testCCRUnfollowDuringSnapshot (#48464 ) tracked in #48461 backport of #48462	2019-10-24 15:52:56 +03:00
Dimitrios Liappis	4d0fb6e551	Mute testBasicTimeBasedRetenion (#48458 ) tracked in #48017 backport of #48456	2019-10-24 14:53:12 +03:00
Armin Braun	7215201406	Track Shard-Snapshot Index Generation at Repository Root (#48371 ) This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question). This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing). Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs. This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob. __No deletes are executed on the data nodes at all for any operation with this change.__ Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details). This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. 
Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository. This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes. Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur. Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless). With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused. Co-authored-by: Yannick Welsch <yannick@welsch.lu> Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>	2019-10-23 10:58:26 +01:00
Gordon Brown	a2217f4a91	Fix testRetentionWhileSnapshotInProgress (#48219 ) This test could fail for two reasons, both should be fixed by this PR: 1) It hit a timeout for an `assertBusy`. This commit increases the timeout for that `assertBusy`. 2) The snapshot that was supposed to be blocked could, in fact, be successful. This is because a previous snapshot had been successfully taken, and no new data had been added between the two snapshots. This means that no new segment files needed to be written for the new snapshot, so the block on data files was never triggered. This commit changes two things: First, it indexes some new data before taking the second snapshot (the one that needs to be blocked), and second, checks to ensure that the block is actually hit before continuing with the test.	2019-10-18 14:25:18 -06:00
Armin Braun	9bf8e1e060	Fix SLMSnapshotBlockingIntegTest (#47941 ) (#47963 ) The after snapshot action is interfering with SLM deleting snapshots here it seems, causing concurrent delete exceptions. Since these tests are now test-scoped there is no reason to run snapshot deletes after each test so we can remove them to avoid this issue. Closes #47937	2019-10-17 08:55:56 +02:00
Armin Braun	0ca7cc1848	Safely Close Repositories on Node Shutdown (#48020 ) (#48107 ) We were not closing repositories on Node shutdown. In production, this has little effect but in tests shutting down a node that uses `MockRepository` and is currently stuck in a simulated blocked-IO situation will only unblock when the node's threadpool is interrupted. This might in some edge cases (many snapshot threads and some CI slowness) result in the execution taking longer than 5s to release all the shard stores and thus we fail the assertion about unreleased shard stores in the internal test cluster. Regardless of tests, I think we should close repositories and release resources associated with them when closing a node and not just when removing a repository from the CS with running nodes as this behavior is really unexpected. Fixes #47689	2019-10-17 07:55:05 +02:00
Lee Hinman	5af66d79ef	Add SLM support to xpack usage and info APIs (#48149 ) * Add SLM support to xpack usage and info APIs This is a backport of #48096 This adds the missing xpack usage and info information into the `/_xpack` and `/_xpack/usage` APIs. The output now looks like: ``` GET /_xpack/usage { ... "slm" : { "available" : true, "enabled" : true, "policy_count" : 1, "policy_stats" : { "retention_runs" : 0, ... } } ``` and ``` GET /_xpack { ... "features" : { ... "slm" : { "available" : true, "enabled" : true }, ... } } ``` Relates to #43663 * Fix missing license	2019-10-16 21:06:27 -06:00
Gordon Brown	699d4d4c6f	Manage retention of partial snapshots in SLM (#47833 ) Currently, partial snapshots will eventually build up unless they are manually deleted. Partial snapshots may be useful if there is not a more recent successful snapshot, but should eventually be deleted if they are no longer useful. With this change, partial snapshots are deleted using the following strategy: PARTIAL snapshots will be kept until the configured expire_after period has passed, if present, and then be deleted. If there is no configured expire_after in the retention policy, then they will be deleted if there is at least one more recent successful snapshot from this policy (as they may otherwise be useful for troubleshooting purposes). Partial snapshots are not counted towards either min_count or max_count.	2019-10-14 10:19:57 -06:00
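The deletion strategy for partial snapshots described in the commit above can be written as a small decision function. This is an illustrative sketch, not the SLM implementation: the function name and parameters are hypothetical, and treating "period has passed" as a strict greater-than comparison is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_delete_partial(
    snapshot_start: datetime,
    now: datetime,
    expire_after: Optional[timedelta],
    newest_success_start: Optional[datetime],
) -> bool:
    """Decide whether a PARTIAL snapshot is eligible for deletion."""
    if expire_after is not None:
        # With expire_after configured, age alone decides.
        # Assumption: "passed" means strictly older than the period.
        return now - snapshot_start > expire_after
    # Without expire_after, delete only once a more recent successful
    # snapshot from the same policy exists.
    return (
        newest_success_start is not None
        and newest_success_start > snapshot_start
    )
```

Partial snapshots do not count towards `min_count` or `max_count`, so neither setting appears as an input here.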
Nick Knize	68eaa21d77	Mute testBasicFailureRetention (#47940 )	2019-10-11 14:03:46 -05:00
Armin Braun	48823b1112	Fix SLMSnapshotBlockingIntegTests (#47841 ) (#47863 ) One of the tests in this suite stops a master node, plus we're doing other node starts in this suite. => the internal test cluster should be TEST and not `SUITE` scoped to avoid random failures like the one in #47834 Closes #47834	2019-10-10 18:41:57 +02:00
Mark Vieira	0360a18f61	Mute test SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress Signed-off-by: Mark Vieira <portugee@gmail.com> (cherry picked from commit a8a7477c396554926f260d210364f009d85ae5f2)	2019-10-09 15:38:24 -07:00
Gordon Brown	9b3790d4f2	Mute "Test All Indexes Lifecycle Explain" (#47317 )	2019-10-09 21:32:58 +04:00
Jim Ferenczi	d96977202d	Disable SLMSnapshotBlockingIntegTests#testSnapshotInProgress (#47775 ) This test fails constantly in master and prs. Relates #47689	2019-10-09 17:49:13 +02:00
Lee Hinman	fb7abe9fa4	Separate SLM stop/start/status API from ILM (#47710 ) * Separate SLM stop/start/status API from ILM This separates a start/stop/status API for SLM from being tied to ILM's operation mode. These APIs look like: ``` POST /_slm/stop POST /_slm/start GET /_slm/status ``` This allows administrators to have fine-grained control over preventing periodic snapshots and deletions while performing cluster maintenance. Relates to #43663 * Allow going from RUNNING to STOPPED * Align with the OperationMode rules * Fix slmStopping method * Make OperationModeUpdateTask constructor private * Wipe snapshots better in test	2019-10-08 17:21:38 -06:00
Gordon Brown	a492864a9d	Manage retention of failed snapshots in SLM (#47617 ) Failed snapshots will eventually build up unless they are deleted. While failures may not take up much space, they add noise to the list of snapshots and it's desirable to remove them when they are no longer useful. With this change, failed snapshots are deleted using the following strategy: `FAILED` snapshots will be kept until the configured `expire_after` period has passed, if present, and then be deleted. If there is no configured `expire_after` in the retention policy, then they will be deleted if there is at least one more recent successful snapshot from this policy (as they may otherwise be useful for troubleshooting purposes). Failed snapshots are not counted towards either `min_count` or `max_count`.	2019-10-08 17:07:08 -06:00
Lee Hinman	91988c7c26	Throw error retrieving non-existent SLM policy (#47679 ) Previously when retrieving an SLM policy it would always return a 200 with `{}` in the body, even if the policy did not exist. This changes that behavior to throw an error (similar to our other APIs) if a policy doesn't exist. This also adds a basic CRUD yml test for the behavior. Resolves #47664	2019-10-07 19:54:04 -06:00
Lee Hinman	906be45209	Add a test for SLM retention with security enabled (#47608 ) This enhances the existing SLM test using users/roles/etc to also test that SLM retention works when security is enabled. Relates to #43663	2019-10-07 19:52:09 -06:00
Andrei Dan	4506b37ed5	ILM: Skip rolling indexes that are already rolled (#47324 ) (#47592 ) An index with an ILM policy that has a rollover action in one of the phases was rolled over when the ILM conditions dictated regardless if it was already rolled over (eg. manually after modifying an index template in order to force the creation of a new index that uses the new mappings). This changes this behaviour and has ILM check if the index it's about to roll has not been rolled over in the meantime. (cherry picked from commit 37d6106feeb9f9369519117c88a9e7e30f3ac797) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-10-07 07:47:47 +01:00
Lee Hinman	2e3eb4b24e	Add API to execute SLM retention on-demand (#47405 ) (#47463 ) * Add API to execute SLM retention on-demand (#47405) This is a backport of #47405 This commit adds the `/_slm/_execute_retention` API endpoint. This endpoint kicks off SLM retention and then returns immediately. This in particular allows us to run retention without scheduling it (for entirely manual invocation) or perform a one-off cleanup. This commit also includes HLRC for the new API, and fixes an issue in SLMSnapshotBlockingIntegTests where retention invoked prior to the test completing could resurrect an index the internal test cluster cleanup had already deleted. Resolves #46508 Relates to #43663	2019-10-02 12:29:04 -06:00
Rory Hunter	53a4d2176f	Convert most awaitBusy calls to assertBusy (#45794 ) (#47112 ) Backport of #45794 to 7.x. Convert most `awaitBusy` calls to `assertBusy`, and use asserts where possible. Follows on from #28548 by @liketic. There were a small number of places where it didn't make sense to me to call `assertBusy`, so I kept the existing calls but renamed the method to `waitUntil`. This was partly to better reflect its usage, and partly so that anyone trying to add a new call to awaitBusy wouldn't be able to find it. I also didn't change the usage in `TransportStopRollupAction` as the comments state that the local awaitBusy method is a temporary copy-and-paste. Other changes: * Rework `waitForDocs` to scale its timeout. Instead of calling `assertBusy` in a loop, work out a reasonable overall timeout and await just once. * Some tests failed after switching to `assertBusy` and had to be fixed. * Correct the expect templates in AbstractUpgradeTestCase. The ES Security team confirmed that they don't use templates any more, so remove this from the expected templates. Also rewrite how the setup code checks for templates, in order to give more information. * Remove an expected ML template from XPackRestTestConstants The ML team advised that the ML tests shouldn't be waiting for any `.ml-notifications` templates, since such checks should happen in the production code instead. Also rework the template checking code in `XPackRestTestHelper` to give more helpful failure messages. * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4	2019-09-29 12:21:46 +01:00
Andrei Dan	4c909438dd	Fix OriginationDate parsing tests. (#47170 ) (#47200 ) Drop the usage of `SimpleDateFormat` and use the `DateFormatter` instead (cherry picked from commit 7cf509a7a11ecf6c40c44c18e8f03b8e81fcd1c2) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-09-27 13:16:45 +01:00
Gordon Brown	7ac647c365	Add support for POST requests to SLM Execute API (#47061 ) This commit adds support for POST requests to the SLM `_execute` API, because POST is a more appropriate HTTP verb for this action as it is not idempotent. The docs are also changed to favor POST over PUT, although PUT is not removed or officially deprecated.	2019-09-25 16:15:10 -06:00
Andrei Dan	27520cac3b	ILM: parse origination date from index name (#46755 ) (#47124 ) * ILM: parse origination date from index name (#46755) Introduce the `index.lifecycle.parse_origination_date` setting that indicates if the origination date should be parsed from the index name. If set to true an index which doesn't match the expected format (namely `indexName-{dateFormat}-optional_digits`) will fail before being created. The origination date will be parsed when initialising a lifecycle for an index and it will be set as the `index.lifecycle.origination_date` for that index. A user set value for `index.lifecycle.origination_date` will always override a possible parsable date from the index name. (cherry picked from commit c363d27f0210733dad0c307d54fa224a92ddb569) Signed-off-by: Andrei Dan <andrei.dan@elastic.co> * Drop usage of Map.of to be java 8 compliant	2019-09-25 21:44:16 +01:00
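A rough sketch of the name parsing described in the commit above, under stated assumptions: the commit message leaves `{dateFormat}` unspecified, so the `yyyy.MM.dd` pattern used here is an assumption, and `parse_origination_date` and `effective_origination_date` are hypothetical names rather than Elasticsearch functions.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed shape: <indexName>-<yyyy.MM.dd>[-<digits>]
_NAME_PATTERN = re.compile(r".*-(\d{4}\.\d{2}\.\d{2})(?:-\d+)?")


def parse_origination_date(index_name: str) -> date:
    """Extract the date from the index name, or raise ValueError."""
    match = _NAME_PATTERN.fullmatch(index_name)
    if match is None:
        raise ValueError(f"index name [{index_name}] has no parsable date")
    return datetime.strptime(match.group(1), "%Y.%m.%d").date()


def effective_origination_date(
    index_name: str, user_set: Optional[date] = None
) -> date:
    """A user-set origination date always wins over the parsed one."""
    if user_set is not None:
        return user_set
    return parse_origination_date(index_name)
```

Raising on a non-matching name mirrors the behaviour described in the message, where an index that does not match the expected format fails before being created.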
Lee Hinman	a267df30fa	Wait for snapshot completion in SLM snapshot invocation (#47051 ) * Wait for snapshot completion in SLM snapshot invocation This changes the snapshots internally invoked by SLM to wait for completion. This allows us to capture more snapshotting failure scenarios. For example, previously a snapshot would be created and then registered as a "success", however, the snapshot may have been aborted, or it may have had a subset of its shards fail. These cases are now handled by inspecting the response to the `CreateSnapshotRequest` and ensuring that there are no failures. If any failures are present, the history store now stores the action as a failure instead of a success. Relates to #38461 and #43663	2019-09-25 14:25:22 -06:00
Gordon Brown	a46eef9634	Change SLM stats format (#46991 ) Using arrays of objects with embedded IDs is preferred for new APIs over using entity IDs as JSON keys. This commit changes the SLM stats API to use the preferred format.	2019-09-25 11:32:08 -06:00
Lee Hinman	5ca37db60c	Mute SLMSnapshotBlockingIntegTests.testRetentionWhileSnapshotInProgress Relates to #46508	2019-09-23 17:08:09 -06:00
Lee Hinman	b85468d6ea	Add node setting for disabling SLM (#46794 ) (#46796 ) This adds the `xpack.slm.enabled` setting to allow disabling of SLM functionality as well as its HTTP API endpoints. Relates to #38461	2019-09-17 17:39:41 -06:00
Andrei Dan	c57cca98b2	[ILM] Add date setting to calculate index age (#46561 ) (#46697 ) * [ILM] Add date setting to calculate index age Add the `index.lifecycle.origination_date` to allow users to configure a custom date that'll be used to calculate the index age for the phase transitions (as opposed to the default index creation date). This could be useful for users to create an index with an "older" origination date when indexing old data. Relates to #42449. * [ILM] Don't override creation date on policy init The initial approach we took was to override the lifecycle creation date if the `index.lifecycle.origination_date` setting was set. This had the disadvantage of the user not being able to update the `origination_date` anymore once set. This commit changes the way we make use of the `index.lifecycle.origination_date` setting by checking its value when we calculate the index age (ie. at "read time") and, in case it's not set, default to the index creation date. * Make origination date setting index scope dynamic * Document origination date setting in ilm settings (cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-09-16 08:50:28 +01:00
Lee Hinman	52d7b03b49	Wait for no snapshots in state in testRetentionWhileSnapshotIn… (#46573 ) This commit adds a wait/check for all running snapshots to be cleared before taking another snapshot. The previous snapshot was successful but had not yet been cleared from the cluster state, so the second snapshot failed due to a `ConcurrentSnapshotException`. Resolves #46508	2019-09-11 09:47:01 -06:00
Lee Hinman	cdc3a260af	Add retention to Snapshot Lifecycle Management (backport of #4… (#46506 ) * Add retention to Snapshot Lifecycle Management (#46407) This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria. An example policy would look like: ``` PUT /_slm/policy/snapshot-every-day { "schedule": "0 30 2 * * ?", "name": "<production-snap-{now/d}>", "repository": "my-s3-repository", "config": { "indices": ["foo-", "important"] }, // Newly configured retention options "retention": { // Snapshots should be deleted after 14 days "expire_after": "14d", // Keep a maximum of thirty snapshots "max_count": 30, // Keep a minimum of the four most recent snapshots "min_count": 4 } } ``` SLM Retention is run on a schedule configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour. Included in this work is a new SLM stats API endpoint available through ``` json GET /_slm/stats ``` That returns statistics about snapshots taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API. Add base framework for snapshot retention (#43605) * Add base framework for snapshot retention This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask` to start as the basis for SLM's retention implementation. 
Relates to #38461 * Remove extraneous 'public' * Use a local var instead of reading class var repeatedly * Add SnapshotRetentionConfiguration for retention configuration (#43777) * Add SnapshotRetentionConfiguration for retention configuration This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged. Relates to #43663 * Fix REST tests * Fix more documentation * Use Objects.equals to avoid NPE * Put `randomSnapshotLifecyclePolicy` in only one place * Occasionally return retention with no configuration * Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764) * Implement SnapshotRetentionTask's snapshot filtering and deletion This commit implements the snapshot filtering and deletion for `SnapshotRetentionTask`. Currently only the expire-after age is used for determining whether a snapshot is eligible for deletion. Relates to #43663 * Fix deletes running on the wrong thread * Handle missing or null policy in snap metadata differently * Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>> * Use the `OriginSettingClient` to work with security, enhance logging * Prevent NPE in test by mocking Client * Allow empty/missing SLM retention configuration (#45018) Semi-related to #44465, this allows the `"retention"` configuration map to be missing. Relates to #43663 * Add min_count and max_count as SLM retention predicates (#44926) This adds the configuration options for `min_count` and `max_count` as well as the logic for determining whether a snapshot meets this criteria to SLM's retention feature. 
These options are optional and one, two, or all three can be specified in an SLM policy. Relates to #43663 * Time-bound deletion of snapshots in retention delete function (#45065) * Time-bound deletion of snapshots in retention delete function With a cluster that has a large number of snapshots, it's possible that snapshot deletion can take a very long time (especially since deletes currently have to happen in a serial fashion). To prevent snapshot deletion from taking forever in a cluster and blocking other operations, this commit adds a setting to allow configuring a maximum time to spend deleting snapshots during retention. This dynamic setting defaults to 1 hour and is best-effort, meaning that it doesn't hard stop a deletion at an hour mark, but ensures that once the time has passed, all subsequent deletions are deferred until the next retention cycle. Relates to #43663 * Wow snapshots suuuure can take a long time. * Use a LongSupplier instead of actually sleeping * Remove TestLogging annotation * Remove rate limiting * Add SLM metrics gathering and endpoint (#45362) * Add SLM metrics gathering and endpoint This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The stats stored include the number of snapshots taken, failed, deleted, the number of retention runs, as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount of time spent deleting snapshots from SLM retention. 
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the SLM get-policy API) that looks like: ``` GET /_slm/stats { "retention_runs" : 13, "retention_failed" : 0, "retention_timed_out" : 0, "retention_deletion_time" : "1.4s", "retention_deletion_time_millis" : 1404, "policy_metrics" : { "daily-snapshots2" : { "snapshots_taken" : 7, "snapshots_failed" : 0, "snapshots_deleted" : 6, "snapshot_deletion_failures" : 0 }, "daily-snapshots" : { "snapshots_taken" : 12, "snapshots_failed" : 0, "snapshots_deleted" : 12, "snapshot_deletion_failures" : 6 } }, "total_snapshots_taken" : 19, "total_snapshots_failed" : 0, "total_snapshots_deleted" : 18, "total_snapshot_deletion_failures" : 6 } ``` This does not yet include HLRC for this, as this commit is quite large on its own. That will be added in a subsequent commit. Relates to #43663 * Version qualify serialization * Initialize counters outside constructor * Use computeIfAbsent instead of being too verbose * Move part of XContent generation into subclass * Fix REST action for master merge * Unused import * Record history of SLM retention actions (#45513) This commit records the deletion of snapshots by the retention component of SLM into the SLM history index for the purposes of reviewing operations taken by SLM and alerting. * Retry SLM retention after currently running snapshot completes (#45802) * Retry SLM retention after currently running snapshot completes This commit adds a ClusterStateObserver to wait until the currently running snapshot is complete before proceeding with snapshot deletion. SLM retention waits for the maximum allowed deletion time for the snapshot to complete, however, the waiting time is not factored into the limit on actual deletions. 
Relates to #43663 * Increase timeout waiting for snapshot completion * Apply patch From `2374316f0d`.patch * Rename test variables * [TEST] Be less strict for stats checking * Skip SLM retention if ILM is STOPPING or STOPPED (#45869) This adds a check to ensure we take no action during SLM retention if ILM is currently stopped or in the process of stopping. Relates to #43663 * Check all actions preventing snapshot delete during retention (#45992) * Check all actions preventing snapshot delete during retention run Previously we only checked to see if a snapshot was currently running, but it turns out that more things can block snapshot deletion. This changes the check to be a check for: - a snapshot currently running - a deletion already in progress - a repo cleanup in progress - a restore currently running This was found by CI where a third party delete in a test caused SLM retention deletion to throw an exception. Relates to #43663 * Add unit test for okayToDeleteSnapshots * Fix bug where SLM retention task would be scheduled on every node * Enhance test logging * Ignore if snapshot is already deleted * Missing import * Fix SnapshotRetentionServiceTests * Expose SLM policy stats in get SLM policy API (#45989) This also adds support for the SLM stats endpoint to the high level rest client. 
Retrieving a policy now looks like: ```json { "daily-snapshots" : { "version": 1, "modified_date": "2019-04-23T01:30:00.000Z", "modified_date_millis": 1556048137314, "policy" : { "schedule": "0 30 1 * * ?", "name": "<daily-snap-{now/d}>", "repository": "my_repository", "config": { "indices": ["data-", "important"], "ignore_unavailable": false, "include_global_state": false }, "retention": {} }, "stats": { "snapshots_taken": 0, "snapshots_failed": 0, "snapshots_deleted": 0, "snapshot_deletion_failures": 0 }, "next_execution": "2019-04-24T01:30:00.000Z", "next_execution_millis": 1556048160000 } } ``` Relates to #43663 Rewrite SnapshotLifecycleIT as an ESIntegTestCase (#46356) * Rewrite SnapshotLifecycleIT as an ESIntegTestCase This commit splits `SnapshotLifecycleIT` into two different tests. `SnapshotLifecycleRestIT` which includes the tests that do not require slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an integration test using `MockRepository` to simulate a snapshot being in progress. Relates to #43663 Resolves #46205 * Add error logging when exceptions are thrown * Update serialization versions * Fix type inference * Use non-Cancellable HLRC return value * Fix Client mocking in test * Fix SLMSnapshotBlockingIntegTests for 7.x branch * Update SnapshotRetentionTask for non-multi-repo snapshot retrieval * Add serialization guards for SnapshotLifecyclePolicy	2019-09-10 09:08:09 -06:00
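One plausible reading of how the three retention options in the commit above (`expire_after`, `min_count`, `max_count`) combine can be sketched as follows. The commit message does not spell out their precedence, so the ordering here is an assumption: the `min_count` most recent snapshots are always kept, and any other snapshot is deleted if it is older than `expire_after` or falls beyond the newest `max_count`. The function name is hypothetical and this is not the `SnapshotRetentionTask` logic.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def snapshots_to_delete(
    start_times: List[datetime],
    now: datetime,
    expire_after: Optional[timedelta] = None,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[datetime]:
    """Return the start times of snapshots eligible for deletion."""
    newest_first = sorted(start_times, reverse=True)
    doomed = []
    for position, started in enumerate(newest_first):
        if min_count is not None and position < min_count:
            continue  # assumed: the newest min_count are always kept
        too_many = max_count is not None and position >= max_count
        expired = expire_after is not None and now - started > expire_after
        if too_many or expired:
            doomed.append(started)
    return doomed
```

With the example policy values (`expire_after` of 14 days, `max_count` 30, `min_count` 4) and five snapshots, only a fifth snapshot older than 14 days would be selected, because the four newest are protected by `min_count`.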
Lee Hinman	3d4b8e01c7	Validate SLM policy ids strictly (#45998 ) (#46145 ) This uses strict validation for SLM policy ids, similar to what we use for index names. Resolves #45997	2019-09-03 09:20:02 -06:00
Gordon Brown	47bbd9d9a9	[7.x] Fix rollover alias in SLM history index template (#46001 ) This commit adds the `rollover_alias` setting required for ILM to work correctly to the SLM history index template and adds assertions to the SLM integration tests to ensure that it works correctly.	2019-08-28 14:50:22 -07:00
Gordon Brown	47b1e2b3d0	[7.x] Use rollover for SLM's history indices (#45686 ) Following our own guidelines, SLM should use rollover instead of purely time-based indices to keep shard counts low. This commit implements lazy index creation for SLM's history indices, indexing via an alias, and rollover in the built-in ILM policy.	2019-08-21 13:42:11 -06:00
Armin Braun	a01bd6c5a3	Stop Executing SLM Policy Transport Action on Snapshot Pool (#45727 ) (#45748 ) * Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot * Fixes #45594	2019-08-20 19:15:36 +02:00
Gordon Brown	ecb3ebd796	Clean SLM and ongoing snapshots in test framework (#45564 ) Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM test cases, and optionally wait for all snapshots to be deleted. Waiting for all snapshots to be deleted, rather than failing if any are in progress, is necessary for tests which use SLM policies because SLM policies may be in the process of executing when the test ends.	2019-08-16 14:17:34 -06:00
Gordon Brown	3f5dab99c3	Properly set origin for SLM history store client (#45515 ) The origin was not set properly for the SnapshotHistoryStore client, resulting in errors when SLM was used when security was enabled.	2019-08-13 18:23:20 -06:00
Armin Braun	a9e1402189	Remove Settings from BaseRestRequest Constructor (#45418 ) (#45429 ) * Resolving the todo, cleaning up the unused `settings` parameter * Cleaning up some other minor dead code in affected classes	2019-08-12 05:14:45 +02:00
Alpar Torok	634a070430	Restrict which tasks can use testclusters (#45198 ) * Restrict which tasks can use testclusters This PR fixes a problem between the interaction of test-clusters and build cache. Before this any task could have used a cluster without tracking it as input. With this change a new interface is introduced to track the tasks that can use clusters and we do consider the cluster as input for all of them.	2019-08-09 13:38:01 +03:00
Lee Hinman	c7ec0b8431	Include in-progress snapshot for a policy with get SLM policy… (#45245 ) This commit adds the "in_progress" key to the SLM get policy API, returning a policy that looks like: ```json { "daily-snapshots" : { "version" : 1, "modified_date" : "2019-08-05T18:41:48.778Z", "modified_date_millis" : 1565030508778, "policy" : { "name" : "<production-snap-{now/d}>", "schedule" : "0 30 1 * * ?", "repository" : "repo", "config" : { "indices" : [ "foo-*", "important" ], "ignore_unavailable" : true, "include_global_state" : false }, "retention" : { "expire_after" : "10m" } }, "last_success" : { "snapshot_name" : "production-snap-2019.08.05-oxctmnobqye3luim4uejhg", "time_string" : "2019-08-05T18:42:23.257Z", "time" : 1565030543257 }, "next_execution" : "2019-08-06T01:30:00.000Z", "next_execution_millis" : 1565055000000, "in_progress" : { "name" : "production-snap-2019.08.05-oxctmnobqye3luim4uejhg", "uuid" : "t8Idqt6JQxiZrzp0Vt7z6g", "state" : "STARTED", "start_time" : "2019-08-05T18:42:22.998Z", "start_time_millis" : 1565030542998 } } } ``` These are only visible while the snapshot is being taken (or failed), since it reads from the cluster state rather than from the repository itself.	2019-08-07 08:29:49 -06:00
Mark Vieira	c13285a382	Remove unnecessary plugin application and project configuration (#45100 )	2019-08-01 14:18:24 -07:00
David Kyle	e18e9fa8c5	Mute SnapshotLifecycleServiceTests#testPolicyCRUD Relates to https://github.com/elastic/elasticsearch/issues/44997	2019-07-30 10:36:27 +01:00
Lee Hinman	598c4e72f9	[7.x] Rename indexlifecycle to ilm and snapshotlifecycle to sl… (#44977 ) * Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917) As a followup to #44725 and #44608, which renamed the packages within the x-pack project, this renames the packages within the core x-pack project. It also renames 'snapshotlifecycle' within the HLRC to slm. * Fix one more import	2019-07-29 15:51:14 -06:00
Gordon Brown	d4b2d21339	Add option to filter ILM explain response (#44777 ) In order to make it easier to interpret the output of the ILM Explain API, this commit adds two request parameters to that API: - `only_managed`, which causes the response to only contain indices which have `index.lifecycle.name` set - `only_errors`, which causes the response to contain only indices in an ILM error state "Error state" is defined as either being in the `ERROR` step or having `index.lifecycle.name` set to a policy that does not exist.	2019-07-26 11:57:38 -04:00
Jason Tedor	e2c8f8dfa3	Rename ILM package to ilm (#44725 ) This commit renames the ILM package from indexlifecycle to ilm. We have all come to know index lifecycle management as ILM, the APIs and settings use ilm, and it would be nice if the package did too. This commit makes that change.	2019-07-23 16:46:38 +09:00
Jason Tedor	5878bde8dc	Rename SLM package to slm (#44608 ) This commit renames the SLM package from snapshotlifecycle to slm. We have all come to know index lifecycle management as ILM, the APIs and settings use ilm, and it would be nice if the package did too. For SLM, let's use slm for all of these including the package name from the beginning.	2019-07-23 07:35:06 +09:00
Lee Hinman	3001f7941f	Allow empty configuration for SLM policies (#44465 ) * Allow empty configuration for SLM policies When putting or updating a snapshot lifecycle policy it was not possible to elide the `config` map. This commit makes the configuration optional, the same way that it is when taking a snapshot. Relates to #38461 * Add Objects.requireNonNull for required parts of the policy	2019-07-18 16:20:31 -06:00
Lee Hinman	fe2ef66e45	Expose index age in ILM explain output (#44457 ) * Expose index age in ILM explain output This adds the index's age to the ILM explain output, for example: ``` { "indices" : { "ilm-000001" : { "index" : "ilm-000001", "managed" : true, "policy" : "full-lifecycle", "lifecycle_date" : "2019-07-16T19:48:22.294Z", "lifecycle_date_millis" : 1563306502294, "age" : "1.34m", "phase" : "hot", "phase_time" : "2019-07-16T19:48:22.487Z", ... etc ... } } } ``` This age can be used to tell when ILM will transition the index to the next phase, based on that phase's `min_age`. Resolves #38988 * Expose age in getters and in HLRC	2019-07-18 15:33:45 -06:00
Ryan Ernst	2a2686e6e7	Convert remaining ActionTypes to writeable in xpack core (#44467 ) (#44525 ) This commit converts all remaining ActionType response classes to writeable in xpack core. It also converts a few from server which were used by xpack core. relates #34389	2019-07-17 18:01:45 -07:00
Jason Tedor	39c5f98de7	Introduce test issue logging (#44477 ) Today we have an annotation for controlling logging levels in tests. This annotation serves two purposes, one is to control the logging level used in tests, when such control is needed to impact and assert the behavior of loggers in tests. The other use is when a test is failing and additional logging is needed. This commit separates these two concerns into separate annotations. The primary motivation for this is that we have a history of leaving behind the annotation for the purpose of investigating test failures long after the test failure is resolved. The accumulation of these stale logging annotations has led to excessive disk consumption. Having recently cleaned this up, we would like to avoid falling into this state again. To do this, we are adding a link to the test failure under investigation to the annotation when used for the purpose of investigating test failures. We will add tooling to inspect these annotations, in the same way that we have tooling on awaits fix annotations. This will enable us to report on the use of these annotations, and report when stale uses of the annotation exist.	2019-07-18 05:33:33 +09:00
Ryan Ernst	0755a13c9f	Convert AcknowledgedRequest to Writeable.Reader (#44412 ) (#44454 ) This commit adds constructors to AcknowledgedRequest subclasses to implement Writeable.Reader, and ensures all future subclasses implement the same. relates #34389	2019-07-17 11:17:36 -07:00
Yannick Welsch	d98b3e4760	Move frozen indices to x-pack module (#44490 ) Backport of #44408 and #44286.	2019-07-17 16:53:10 +02:00
Lee Hinman	fb0461ac76	[7.x] Add Snapshot Lifecycle Management (#44382 ) * Add Snapshot Lifecycle Management (#43934) * Add SnapshotLifecycleService and related CRUD APIs This commit adds `SnapshotLifecycleService` as a new service under the ilm plugin. This service handles snapshot lifecycle policies by scheduling based on the policies defined schedule. This also includes the get, put, and delete APIs for these policies Relates to #38461 * Make scheduledJobIds return an immutable set * Use Object.equals for SnapshotLifecyclePolicy * Remove unneeded TODO * Implement ToXContentFragment on SnapshotLifecyclePolicyItem * Copy contents of the scheduledJobIds * Handle snapshot lifecycle policy updates and deletions (#40062) (Note this is a PR against the `snapshot-lifecycle-management` feature branch) This adds logic to `SnapshotLifecycleService` to handle updates and deletes for snapshot policies. Policies with incremented versions have the old policy cancelled and the new one scheduled. Deleted policies have their schedules cancelled when they are no longer present in the cluster state metadata. Relates to #38461 * Take a snapshot for the policy when the SLM policy is triggered (#40383) (This is a PR for the `snapshot-lifecycle-management` branch) This commit fills in `SnapshotLifecycleTask` to actually perform the snapshotting when the policy is triggered. Currently there is no handling of the results (other than logging) as that will be added in subsequent work. This also adds unit tests and an integration test that schedules a policy and ensures that a snapshot is correctly taken. Relates to #38461 * Record most recent snapshot policy success/failure (#40619) Keeping a record of the results of the successes and failures will aid troubleshooting of policies and make users more confident that their snapshots are being taken as expected. This is the first step toward writing history in a more permanent fashion. 
* Validate snapshot lifecycle policies (#40654) (This is a PR against the `snapshot-lifecycle-management` branch) With the commit, we now validate the content of snapshot lifecycle policies when the policy is being created or updated. This checks for the validity of the id, name, schedule, and repository. Additionally, cluster state is checked to ensure that the repository exists prior to the lifecycle being added to the cluster state. Part of #38461 * Hook SLM into ILM's start and stop APIs (#40871) (This pull request is for the `snapshot-lifecycle-management` branch) This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are cancelled. Relates to #38461 * Add tests for SnapshotLifecyclePolicyItem (#40912) Adds serialization tests for SnapshotLifecyclePolicyItem. * Fix improper import in build.gradle after master merge * Add human readable version of modified date for snapshot lifecycle policy (#41035) * Add human readable version of modified date for snapshot lifecycle policy This small change changes it from: ``` ... "modified_date": 1554843903242, ... ``` To ``` ... "modified_date" : "2019-04-09T21:05:03.242Z", "modified_date_millis" : 1554843903242, ... ``` Including the `"modified_date"` field when the `?human` field is used. Relates to #38461 * Fix test * Add API to execute SLM policy on demand (#41038) This commit adds the ability to perform a snapshot on demand for a policy. This can be useful to take a snapshot immediately prior to performing some sort of maintenance. ```json PUT /_ilm/snapshot/<policy>/_execute ``` And it returns the response with the generated snapshot name: ```json { "snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug" } ``` Note that this does not allow waiting for the snapshot, and the snapshot could still fail. It does record this information into the cluster state similar to a regularly trigged SLM job. 
Relates to #38461 * Add next_execution to SLM policy metadata (#41221) * Add next_execution to SLM policy metadata This adds the next time a snapshot lifecycle policy will be executed when retriving a policy's metadata, for example: ```json GET /_ilm/snapshot?human { "production" : { "version" : 1, "modified_date" : "2019-04-15T21:16:21.865Z", "modified_date_millis" : 1555362981865, "policy" : { "name" : "<production-snap-{now/d}>", "schedule" : "/30 * * * ?", "repository" : "repo", "config" : { "indices" : [ "foo-", "important" ], "ignore_unavailable" : true, "include_global_state" : false } }, "next_execution" : "2019-04-15T21:16:30.000Z", "next_execution_millis" : 1555362990000 }, "other" : { "version" : 1, "modified_date" : "2019-04-15T21:12:19.959Z", "modified_date_millis" : 1555362739959, "policy" : { "name" : "<other-snap-{now/d}>", "schedule" : "0 30 2 * ?", "repository" : "repo", "config" : { "indices" : [ "other" ], "ignore_unavailable" : false, "include_global_state" : true } }, "next_execution" : "2019-04-16T02:30:00.000Z", "next_execution_millis" : 1555381800000 } } ``` Relates to #38461 * Fix and enhance tests * Figured out how to Cron * Change SLM endpoint from /_ilm/* to /_slm/* (#41320) This commit changes the endpoint for snapshot lifecycle management from: ``` GET /_ilm/snapshot/<policy> ``` to: ``` GET /_slm/policy/<policy> ``` It mimics the ILM path only using `slm` instead of `ilm`. Relates to #38461 * Add initial documentation for SLM (#41510) * Add initial documentation for SLM This adds the initial documentation for snapshot lifecycle management. It also includes the REST spec API json files since they're sort of documentation. Relates to #38461 * Add `manage_slm` and `read_slm` roles (#41607) * Add `manage_slm` and `read_slm` roles This adds two more built in roles - `manage_slm` which has permission to perform any of the SLM actions, as well as stopping, starting, and retrieving the operation status of ILM. 
`read_slm` which has permission to retrieve snapshot lifecycle policies as well as retrieving the operation status of ILM. Relates to #38461 * Add execute to the test * Fix ilm -> slm typo in test * Record SLM history into an index (#41707) It is useful to have a record of the actions that Snapshot Lifecycle Management takes, especially for the purposes of alerting when a snapshot fails or has not been taken successfully for a certain amount of time. This adds the infrastructure to record SLM actions into an index that can be queried at leisure, along with a lifecycle policy so that this history does not grow without bound. Additionally, SLM automatically setting up an index + lifecycle policy leads to `index_lifecycle` custom metadata in the cluster state, which some of the ML tests don't know how to deal with due to setting up custom `NamedXContentRegistry`s. Watcher would cause the same problem, but it is already disabled (for the same reason). * High Level Rest Client support for SLM (#41767) * High Level Rest Client support for SLM This commit add HLRC support for SLM. Relates to #38461 * Fill out documentation tests with tags * Add more callouts and asciidoc for HLRC * Update javadoc links to real locations * Add security test testing SLM cluster privileges (#42678) * Add security test testing SLM cluster privileges This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm` cluster privileges. Relates to #38461 * Don't redefine vars * Add Getting Started Guide for SLM (#42878) This commit adds a basic Getting Started Guide for SLM. * Include SLM policy name in Snapshot metadata (#43132) Keep track of which SLM policy in the metadata field of the Snapshots taken by SLM. This allows users to more easily understand where the snapshot came from, and will enable future SLM features such as retention policies. 
* Fix compilation after master merge * [TEST] Move exception wrapping for devious exception throwing Fixes an issue where an exception was created from one line and thrown in another. * Fix SLM for the change to AcknowledgedResponse * Add Snapshot Lifecycle Management Package Docs (#43535) * Fix compilation for transport actions now that task is required * Add a note mentioning the privileges needed for SLM (#43708) * Add a note mentioning the privileges needed for SLM This adds a note to the top of the "getting started with SLM" documentation mentioning that there are two built-in privileges to assist with creating roles for SLM users and administrators. Relates to #38461 * Mention that you can create snapshots for indices you can't read * Fix REST tests for new number of cluster privileges * Mute testThatNonExistingTemplatesAreAddedImmediately (#43951) * Fix SnapshotHistoryStoreTests after merge * Remove overridden newResponse functions that have been removed * Fix compilation for backport * Fix get snapshot output parsing in test * [DOCS] Add redirects for removed autogen anchors (#44380) * Switch <tt>...</tt> in javadocs for {@code ...}	2019-07-16 07:37:13 -06:00
Ryan Ernst	7e06888bae	Convert testclusters to use distro download plugin (#44253 ) (#44362 ) Test clusters currently has its own set of logic for dealing with finding different versions of Elasticsearch, downloading them, and extracting them. This commit converts testclusters to use the DistributionDownloadPlugin.	2019-07-15 17:53:05 -07:00
Ryan Ernst	59658daef9	Separate streamable based master node actions (#44313 ) This commit creates new base classes for master node actions whose response types still implement Streamable. This simplifies both finding remaining classes to convert, as well as creating new master node actions that use Writeable for their responses. relates #34389	2019-07-15 09:20:20 -07:00
Jake Landis	6e9ccda2c5	ilm test - allow more time for policy completion (#43844 )	2019-07-02 22:05:18 -05:00
Jake Landis	0a79f4ca70	Extend timeout for TimeSeriesLifecycleActionsIT> testFullPolicy (#43891 )	2019-07-02 22:05:04 -05:00
Ryan Ernst	3a2c698ce0	Rename Action to ActionType (#43778 ) Action is a class that encapsulates meta information about an action that allows it to be called remotely, specifically the action name and response type. With recent refactoring, the action class can now be constructed as a static constant, instead of needing to create a subclass. This makes the old pattern of creating a singleton INSTANCE both misnamed and lacking a common placement. This commit renames Action to ActionType, thus allowing the old INSTANCE naming pattern to be TYPE on the transport action itself. ActionType also conveys that this class is also not the action itself, although this change does not rename any concrete classes as those will be removed organically as they are converted to TYPE constants. relates #34389	2019-06-30 22:00:17 -07:00
Martijn van Groningen	101cf384ba	Replace Streamable w/ Writeable in AcknowledgedResponse and subclasses (backport 7.x) (#43525 ) This commit replaces usages of Streamable with Writeable for the AcknowledgedResponse and its subclasses, plus associated actions. Note that where possible response fields were made final and default constructors were removed. This is a large PR, but the change is mostly mechanical. Relates to #34389 Backport of #43414	2019-06-24 13:47:37 +02:00
Lee Hinman	c2bf628a6d	[7.x] Narrow period of Shrink action in which ILM prevents stopping (#43254 ) (#43393 ) * Narrow period of Shrink action in which ILM prevents stopping Prior to this change, we would prevent stopping of ILM if the index was anywhere in the shrink action. This commit changes `IndexLifecycleService` to allow stopping when in any of the innocuous steps during shrink. This changes ILM only to prevent stopping if absolutely necessary. Resolves #43253 * Rename variable for ignore actions -> ignore steps * Fix comment * Factor test out to test all stoppable steps	2019-06-19 16:37:41 -06:00
Alpar Torok	167e51335d	Convert ILM tests to use testclusters (#43076 ) Also improve the error message when bin scripts are not found	2019-06-13 12:24:48 +03:00
Ryan Ernst	172cd4dbfa	Remove description from xpack feature sets (#43065 ) The description field of xpack featuresets is optionally part of the xpack info api, when using the verbose flag. However, this information is unnecessary, as it is better left for documentation (and the existing descriptions describe anything meaningful). This commit removes the description field from feature sets.	2019-06-11 09:22:58 -07:00
Jason Tedor	117df87b2b	Replicate aliases in cross-cluster replication (#42875 ) This commit adds functionality so that aliases that are manipulated on leader indices are replicated by the shard follow tasks to the follower indices. Note that we ignore write indices. This is due to the fact that follower indices do not receive direct writes so the concept is not useful. Relates #41815	2019-06-04 20:36:24 -04:00
Mark Vieira	e44b8b1e2e	[Backport] Remove dependency substitutions 7.x (#42866 ) * Remove unnecessary usage of Gradle dependency substitution rules (#42773) (cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)	2019-06-04 13:50:23 -07:00
Ryan Ernst	6fd8924c5a	Switch run task to use real distro (#41590 ) The run task is supposed to run elasticsearch with the given plugin or module. However, for modules, this is most realistic if using the full distribution. This commit changes the run setup to use the default or oss as appropriate.	2019-05-06 12:34:07 -07:00
Daniel Mitterdorfer	8580053818	Mute PermissionsIT#testWhen[...]ByILMPolicy (#41859 ) Relates #41440 Relates #41858	2019-05-06 16:15:37 +02:00
Christoph Büscher	52495843cc	[Docs] Fix common word repetitions (#39703 )	2019-04-25 20:47:47 +02:00
Yogesh Gaikwad	0d1178fca6	put mapping authorization for alias with write-index and multiple read indices (#40834 ) (#41287 ) When the same alias points to multiple indices we can write to only one index with `is_write_index` value `true`. The special handling in case of the put mapping request(to resolve authorized indices) has a check on indices size for a concrete index. If multiple indices existed then it marked the request as unauthorized. The check has been modified to consider write index flag and only when the requested index matches with the one with write index alias, the alias is considered for authorization. Closes #40831	2019-04-17 14:25:33 +10:00
Gordon Brown	ec8709e831	Check allocation rules are cleared after ILM Shrink (#41170 ) Adds some checks to make sure that the allocation rules that ILM adds before a shrink are cleared after the shrink is complete	2019-04-16 09:25:51 -06:00
Gordon Brown	7e59794ced	Log every use of ILM Move to Step API (#41171 ) Usage of the ILM Move to Step API can result in some very odd situations, and for diagnosing problems arising from these situations it would be nice to have a record of when this API was called with what parameters. Also, adds a dedicated logger for TransportMoveToStepAction, rather than using the (deprecated) inherited one.	2019-04-15 16:20:37 -06:00
Mark Vieira	1287c7d91f	[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978 ) (#40993 ) * Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions. (cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6) * Fix forking JVM runner * Don't bump shadow plugin version	2019-04-09 11:52:50 -07:00
Gordon Brown	5347dec55e	Allow ILM to stop if indices have nonexistent policies (#40820 ) Prior to this PR, there is a bug in ILM which does not allow ILM to stop if one or more indices have an index.lifecycle.name which refers to a policy that does not exist - the operation_mode will be stuck as STOPPING until either the policy is created or the nonexistent policy is removed from those indices. This change allows ILM to stop in this case and makes the logging more clear as to why ILM is not stopping.	2019-04-04 11:46:21 -06:00
Lee Hinman	2fd01cc0b7	Fix testRunStateChangePolicyWithAsyncActionNextStep race condition (#40707 ) Previously we only set the latch countdown with `nextStep.setLatch` after the cluster state change has already been counted down. However, it's possible execution could have already started, causing the latch to be missed when the `MockAsyncActionStep` is being executed. This moves the latch setting to be before the call to `runPolicyAfterStateChange`, which means it is always available when the `MockAsyncActionStep` is executed. I was able to reproduce the failure every 30-40 runs before this change. With this change, running 2000+ times the test passes. Resolves #40018	2019-04-02 10:56:44 -06:00
Gordon Brown	db7f00098e	Correct ILM metadata minimum compatibility version (#40569 ) The ILM metadata minimum compatibility version was not set correctly, which can cause issues in mixed-version clusters.	2019-03-28 10:53:44 -06:00
Lee Hinman	8ec456b5df	Maintain step order for ILM trace logging (#39522 ) When trace logging is enabled we log the computed steps for a policy. This commit makes sure that the steps that are logged are in the same order they will be run when the policy executes. This makes it much easier to reason about the policy if the move-to-step API is ever required in the future.	2019-03-07 11:37:58 -07:00
Gordon Brown	f4c5abe4d4	Handle failure to release retention leases in ILM (#39281 ) (#39417 ) It is possible that the Unfollow API may fail to release shard history retention leases when unfollowing, so this needs to be handled by the ILM Unfollow action. There's nothing much that can be done automatically about it from the follower side, so this change makes the ILM unfollow action simply ignore those failures.	2019-02-26 16:58:30 -07:00
Gordon Brown	2ad1e6aedc	Fix testCannotShrinkLeaderIndex (#38529 ) This test should no longer pass when the functionality it is intended to test is broken, as it now indexes a number of documents and verifies that the index is staying on the same step until after indexing and replication of those documents is finished. This prevents the test from passing if the leader index progresses in its lifecycle during that time.	2019-02-22 08:03:36 -07:00
Jay Modi	697911c31d	Fixed missed stopping of SchedulerEngine (#39193 ) The SchedulerEngine is used in several places in our code and not all of these usages properly stopped the SchedulerEngine, which could lead to test failures due to leaked threads from the SchedulerEngine. This change adds stopping to these usages in order to avoid the thread leaks that cause CI failures and noise. Closes #38875	2019-02-21 14:31:33 -07:00
Jason Tedor	09ea3ccd16	Remove retention leases when unfollowing (#39088 ) This commit attempts to remove the retention leases on the leader shards when unfollowing an index. This is best effort, since the leader might not be available.	2019-02-20 07:06:49 -05:00
Jason Tedor	2d8f6b6501	Introduce retention lease state file (#39004 ) This commit moves retention leases from being persisted in the Lucene commit point to being persisted in a dedicated state file.	2019-02-18 16:53:46 -05:00
Jason Tedor	a5ce1e0bec	Integrate retention leases to recovery from remote (#38829 ) This commit is the first step in integrating shard history retention leases with CCR. In this commit we integrate shard history retention leases with recovery from remote. Before we start transferring files, we take out a retention lease on the primary. Then during the file copy phase, we repeatedly renew the retention lease. Finally, when recovery from remote is complete, we disable the background renewing of the retention lease.	2019-02-16 15:37:52 -05:00
Luca Cavanna	a7046e001c	Remove support for maxRetryTimeout from low-level REST client (#38085 ) We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. Such setting was initially added in the attempts to not have requests go through retries if the request already took longer than the provided timeout. The implementation was problematic though as such timeout would also expire in the first request attempt (see #31834), would leave the request executing after expiration causing memory leaks (see #33342), and would not take into account the http client internal queuing (see #25951). Given all these issues, it seems that this custom timeout mechanism gives little benefits while causing a lot of harm. We should rather rely on connect and socket timeout exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is the case even with a single retry anyways. This commit removes the `maxRetryTimeout` setting and all of its usages.	2019-02-06 08:43:47 +01:00
Julie Tibshirani	3ce7d2c9b6	Make sure to reject mappings with type _doc when include_type_name is false. (#38270 ) `CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when deserializing index creation requests, accidentally accepts mappings that are nested twice under the type key (as described in the bug report #38266). This in turn causes us to be too lenient in parsing typeless mappings. In particular, we accept the following index creation request, even though it should not contain the type key `_doc`: ``` PUT index?include_type_name=false { "mappings": { "_doc": { "properties": { ... } } } } ``` There is a similar issue for both 'put templates' and 'put mappings' requests as well. This PR makes the minimal changes to detect and reject these typed mappings in requests. It does not address #38266 generally, or attempt a larger refactor around types in these server-side requests, as I think this should be done at a later time.	2019-02-05 10:52:32 -08:00
Gordon Brown	b866417650	Mute testCannotShrinkLeaderIndex (#38374 ) This test should not pass until CCR finishes integrating shard history retention leases. It currently sometimes passes (which is a bug in the test), but cannot pass reliably until the linked issue is resolved.	2019-02-04 16:06:19 -07:00
Gordon Brown	7a1e89c7ed	Ensure ILM policies run safely on leader indices (#38140 ) Adds a Step to the Shrink and Delete actions which prevents those actions from running on a leader index - all follower indices must first unfollow the leader index before these actions can run. This prevents the loss of history before follower indices are ready, which might otherwise result in the loss of data.	2019-02-01 20:46:12 -07:00
Tal Levy	bae656dcea	Preserve ILM operation mode when creating new lifecycles (#38134 ) There was a bug where creating a new policy would start the ILM service, even if it was stopped. This change ensures that there is no change to the existing operation mode	2019-02-01 13:16:34 -08:00
Tal Levy	7c738fd241	Skip Shrink when numberOfShards not changed (#37953 ) Previously, ShrinkAction would fail if it was executed on an index that had the same number of shards as the target shrunken number. This PR introduced a new BranchingStep that is used inside of ShrinkAction to branch which step to move to next, depending on the shard values. So no shrink will occur if the shard count is unchanged.	2019-01-30 15:09:17 -08:00
Tim Brooks	00ace369af	Use `CcrRepository` to init follower index (#35719 ) This commit modifies the put follow index action to use a CcrRepository when creating a follower index. It routes the logic through the snapshot/restore process. A wait_for_active_shards parameter can be used to configure how long to wait before returning the response.	2019-01-29 11:47:29 -07:00
Gordon Brown	49bd8715ff	Inject Unfollow before Rollover and Shrink (#37625 ) We inject an Unfollow action before Shrink because the Shrink action cannot be safely used on a following index, as it may not be fully caught up with the leader index before the "original" following index is deleted and replaced with a non-following Shrunken index. The Unfollow action will verify that 1) the index is marked as "complete", and 2) all operations up to this point have been replicated from the leader to the follower before explicitly disconnecting the follower from the leader. Injecting an Unfollow action before the Rollover action is done mainly as a convenience: This allows users to use the same lifecycle policy on both the leader and follower cluster without having to explicitly modify the policy to unfollow the index, while doing what we expect users to want in most cases.	2019-01-28 14:09:12 -07:00
Lee Hinman	427bc7f940	Use ILM for Watcher history deletion (#37443 ) * Use ILM for Watcher history deletion This commit adds an index lifecycle policy for the `.watch-history-*` indices. This policy is automatically used for all new watch history indices. This does not yet remove the automatic cleanup that the monitoring plugin does for the .watch-history indices, and it does not touch the `xpack.watcher.history.cleaner_service.enabled` setting. Relates to #32041	2019-01-23 10:18:08 -07:00
Lee Hinman	647e225698	Retry ILM steps that fail due to SnapshotInProgressException (#37624 ) Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed. This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring. This also adds integration tests for these scenarios (thanks to @talevy in #37552). Resolves #37541	2019-01-23 09:46:31 -07:00
Ryan Ernst	9a34b20233	Simplify integ test distribution types (#37618 ) The integ tests currently use the raw zip project name as the distribution type. This commit simplifies this specification to be "default" or "oss". Whether zip or tar is used should be an internal implementation detail of the integ test setup, which can (in the future) be platform specific.	2019-01-21 12:37:17 -08:00
Martijn van Groningen	a3030c51e2	[ILM] Add unfollow action (#36970 ) This change adds the unfollow action for CCR follower indices. This is needed for the shrink action in case an index is a follower index. This will give the follower index the opportunity to fully catch up with the leader index, pause index following and unfollow the leader index. After this the shrink action can safely perform the ilm shrink. The unfollow action needs to be added to the hot phase and acts as a barrier for going to the next phase (warm or delete phases), so that follower indices are being unfollowed properly before indices are expected to go in read-only mode. This allows the force merge action to execute its steps safely. The unfollow action has the following steps: * `wait-for-indexing-complete` step: waits for the index in question to have the `index.lifecycle.indexing_complete` setting set to `true` * `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks for the index being handled to report that the leader shard global checkpoint is equal to the follower shard global checkpoint. * `pause-follower-index` step: Pauses index following, necessary to unfollow * `close-follower-index` step: Closes the index, necessary to unfollow * `unfollow-follower-index` step: Actually unfollows the index using the CCR Unfollow API * `open-follower-index` step: Reopens the index now that it is a normal index * `wait-for-yellow` step: Waits for primary shards to be allocated after reopening the index to ensure the index is ready for the next step In the case of the last two steps, if the index being handled is a regular index then the steps act as no-ops. Relates to #34648 Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com> Co-authored-by: Gordon Brown <gordon.brown@elastic.co>	2019-01-18 13:05:03 -07:00
Jake Landis	587034dfa7	Add set_priority action to ILM (#37397 ) This commit adds a set_priority action to the hot, warm, and cold phases for an ILM policy. This action sets the `index.priority` on the managed index to allow different priorities between the hot, warm, and cold recoveries. This commit also includes the HLRC and documentation changes. closes #36905	2019-01-17 09:55:36 -06:00
Alexander Reelsen	b2e8437424	Tests: Add ElasticsearchAssertions.awaitLatch method (#36777 ) * Tests: Add ElasticsearchAssertions.awaitLatch method Some tests are using assertTrue(latch.await(...)) in their code. This leads to an assertion error without any error message. This adds a method which has a nicer error message and can be used in tests. * fix forbidden apis * fix spaces	2019-01-10 09:25:36 +01:00
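
The `awaitLatch` commit above (b2e8437424) replaces bare `assertTrue(latch.await(...))` calls with a helper that fails with a readable message. A minimal sketch of that idea follows; it is not the actual Elasticsearch implementation. The class name `LatchAssertions`, the parameter list, and the wording of the failure message are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; the commit adds its method to ElasticsearchAssertions.
public final class LatchAssertions {

    private LatchAssertions() {}

    /**
     * Waits for the latch to reach zero and throws an AssertionError that
     * describes the timeout and the remaining count if it does not.
     * The message text is an assumption, not the one used in the commit.
     */
    public static void awaitLatch(CountDownLatch latch, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (latch.await(timeout, unit) == false) {
            throw new AssertionError(
                "expected latch to be counted down after " + timeout + " " + unit
                    + ", but " + latch.getCount() + " count(s) remained");
        }
    }
}
```

A test would then call `LatchAssertions.awaitLatch(latch, 10, TimeUnit.SECONDS)` in place of `assertTrue(latch.await(10, TimeUnit.SECONDS))`, so a timeout reports how long the test waited and how many counts were outstanding.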

334 Commits