* Narrow period of Shrink action in which ILM prevents stopping
Prior to this change, we would prevent stopping of ILM if the index was
anywhere in the shrink action. This commit changes
`IndexLifecycleService` to allow stopping when in any of the innocuous
steps during shrink. This changes ILM only to prevent stopping if
absolutely necessary.
Resolves#43253
* Rename variable for ignore actions -> ignore steps
* Fix comment
* Factor test out to test *all* stoppable steps
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
This commit adds functionality so that aliases that are manipulated on
leader indices are replicated by the shard follow tasks to the follower
indices. Note that we ignore write indices. This is due to the fact that
follower indices do not receive direct writes so the concept is not
useful.
Relates #41815
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
When the same alias points to multiple indices we can write to only one index
with `is_write_index` value `true`. The special handling in case of the put
mapping request(to resolve authorized indices) has a check on indices size
for a concrete index. If multiple indices existed then it marked the request
as unauthorized.
The check has been modified to consider write index flag and only when the
requested index matches with the one with write index alias, the alias is considered
for authorization.
Closes#40831
Usage of the ILM Move to Step API can result in some very odd
situations, and for diagnosing problems arising from these situations it
would be nice to have a record of when this API was called with what
parameters.
Also, adds a dedicated logger for TransportMoveToStepAction,
rather than using the (deprecated) inherited one.
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)
This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.
(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)
* Fix forking JVM runner
* Don't bump shadow plugin version
Prior to this PR, there is a bug in ILM which does not allow ILM to stop
if one or more indices have an index.lifecycle.name which refers to
a policy that does not exist - the operation_mode will be stuck as
STOPPING until either the policy is created or the nonexistent
policy is removed from those indices.
This change allows ILM to stop in this case and makes the logging more
clear as to why ILM is not stopping.
Previously we only set the latch countdown with `nextStep.setLatch` after the
cluster state change has already been counted down. However, it's possible
execution could have already started, causing the latch to be missed when the
`MockAsyncActionStep` is being executed.
This moves the latch setting to be before the call to
`runPolicyAfterStateChange`, which means it is always available when the
`MockAsyncActionStep` is executed.
I was able to reproduce the failure every 30-40 runs before this change. With
this change, running 2000+ times the test passes.
Resolves#40018
When trace logging is enabled we log the computed steps for a policy. This
commit makes sure that the steps that are logged are in the same order they will
be run when the policy executes. This makes it much easier to reason about the
policy if the move-to-step API is ever required in the future.
It is possible that the Unfollow API may fail to release shard history
retention leases when unfollowing, so this needs to be handled by the
ILM Unfollow action. There's nothing much that can be done automatically
about it from the follower side, so this change makes the ILM unfollow
action simply ignore those failures.
This test should no longer pass when the functionality it is intended to
test is broken, as it now indexes a number of documents and verifies
that the index is staying on the same step until after indexing and
replication of those documents is finished. This prevents the test from
passing if the leader index progresses in its lifecycle during that time.
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.
Closes#38875
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.
The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).
Given all these issues, it seems that this custom timeout mechanism
gives little benefits while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyways.
This commit removes the `maxRetryTimeout` setting and all of its usages.
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).
This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:
```
PUT index?include_type_name=false
{
"mappings": {
"_doc": {
"properties": { ... }
}
}
}
```
There is a similar issue for both 'put templates' and 'put mappings' requests
as well.
This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
This test should not pass until CCR finishes integrating shard history
retention leases. It currently sometimes passes (which is a bug in the
test), but cannot pass reliably until the linked issue is resolved.
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
There was a bug where creating a new policy would start
the ILM service, even if it was stopped. This change ensures
that there is no change to the existing operation mode
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.
This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes
the logic through the snapshot/restore process. A
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
We inject an Unfollow action before Shrink because the Shrink action
cannot be safely used on a following index, as it may not be fully
caught up with the leader index before the "original" following index is
deleted and replaced with a non-following Shrunken index. The Unfollow
action will verify that 1) the index is marked as "complete", and 2) all
operations up to this point have been replicated from the leader to the
follower before explicitly disconnecting the follower from the leader.
Injecting an Unfollow action before the Rollover action is done mainly
as a convenience: This allow users to use the same lifecycle policy on
both the leader and follower cluster without having to explictly modify
the policy to unfollow the index, while doing what we expect users to
want in most cases.
* Use ILM for Watcher history deletion
This commit adds an index lifecycle policy for the `.watch-history-*` indices.
This policy is automatically used for all new watch history indices.
This does not yet remove the automatic cleanup that the monitoring plugin does
for the .watch-history indices, and it does not touch the
`xpack.watcher.history.cleaner_service.enabled` setting.
Relates to #32041
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.
This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.
This also adds integration tests for these scenarios (thanks to @talevy in #37552).
Resolves#37541
The integ tests currently use the raw zip project name as the
distribution type. This commit simplifies this specification to be
"default" or "oss". Whether zip or tar is used should be an internal
implementation detail of the integ test setup, which can (in the future)
be platform specific.
This change adds the unfollow action for CCR follower indices.
This is needed for the shrink action in case an index is a follower index.
This will give the follower index the opportunity to fully catch up with
the leader index, pause index following and unfollow the leader index.
After this the shrink action can safely perform the ilm shrink.
The unfollow action needs to be added to the hot phase and acts as
barrier for going to the next phase (warm or delete phases), so that
follower indices are being unfollowed properly before indices are expected
to go in read-only mode. This allows the force merge action to execute
its steps safely.
The unfollow action has three steps:
* `wait-for-indexing-complete` step: waits for the index in question
to get the `index.lifecycle.indexing_complete` setting be set to `true`
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks
for the index being handled to report that the leader shard global checkpoint
is equal to the follower shard global checkpoint.
* `pause-follower-index` step: Pauses index following, necessary to unfollow
* `close-follower-index` step: Closes the index, necessary to unfollow
* `unfollow-follower-index` step: Actually unfollows the index using
the CCR Unfollow API
* `open-follower-index` step: Reopens the index now that it is a normal index
* `wait-for-yellow` step: Waits for primary shards to be allocated after
reopening the index to ensure the index is ready for the next step
In the case of the last two steps, if the index in being handled is
a regular index then the steps acts as a no-op.
Relates to #34648
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
This commit adds a set_priority action to the hot, warm, and cold
phases for an ILM policy. This action sets the `index.priority`
on the managed index to allow different priorities between the
hot, warm, and cold recoveries.
This commit also includes the HLRC and documentation changes.
closes#36905
* Tests: Add ElasticsearchAssertions.awaitLatch method
Some tests are using assertTrue(latch.await(...)) in their code. This
leads to an assertion error without any error message. This adds a
method which has a nicer error message and can be used in tests.
* fix forbidden apis
* fix spaces
the testFullPolicy and testMoveToRolloverStep tests
are very important tests, but they sometimes timeout
beyond the default 10sec wait for shrink to occur.
This commit increases one of the assertBusys to
20 seconds
Leaving `index.lifecycle.indexing_complete` in place when removing the
lifecycle policy from an index can cause confusion, as if a new policy
is associated with the policy, rollover will be silently skipped.
Removing that setting when removing the policy from an index makes
associating a new policy with the index more involved, but allows ILM to
fail loudly, rather than silently skipping operations which the user may
assume are being performed.
* Adjust order of checks in WaitForRolloverReadyStep
This allows ILM to error out properly for indices that have a valid
alias, but are not the write index, while still handling
`indexing_complete` on old-style aliases and rollover (that is, those
which only point to a single index at a time with no explicit write
index)
* WaitForRolloverReadyStepTests#mutateInstance sometimes did not mutate the instance
correctly
* 40_explain_lifecycle#"Test new phase still has phase_time" is not really a necessary
integration test. In addition to this, it is flaky due to the asynchronous nature of
ILM metadata population
* add read_ilm cluster privilege
Although managing ILM policies is best done using the
"manage" cluster privilege, it is useful to have read-only
views.
* adds `read_ilm` cluster privilege for viewing policies and status
* adds Explain API to the `view_index_metadata` index privilege
* add manage_ilm privileges
The error message used when attempting to delete a lifecycle policy that
is in use previously only included one index which was using the policy.
It now includes all indices using that policy.
Adds a setting that indicates that an index is done indexing, set by ILM
when the Rollover action completes. This indicates that the Rollover
action should be skipped in any future invocations, as long as the index
is no longer the write index for its alias.
This enables 1) an index with a policy that involves the Rollover action
to have the policy removed and switched to another one without use of
the move-to-step API, and 2) integrations with Beats and CCR.
RolloverStep previously had a name of "attempt_rollover", which was
inconsistent with all other step names due it its use of an underscore
instead of a dash.
RolloverAction will now periodically check the rollover conditions using
the Rollover API with the dry_run option as an AsyncWaitStep, then run
the rollover itself by calling the Rollover API with no conditions,
which will always roll over, as an AsyncActionStep. This will resolve
race condition issues in policies using RolloverAction.
Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.
1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.
Changes now
1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.
Closes#35397.