We implicitly only supported the prime256v1 ( aka secp256r1 )
curve for the EC keys we read as PEM files to be used in any
SSL Context. We would not fail when trying to read a key
pair using a different curve but we would silently assume
that it was using `secp256r1` which would lead to strange
TLS handshake issues if the curve was actually another one.
This commit fixes that behavior in that it
supports parsing EC keys that use any of the named curves
defined in rfc5915 and rfc5480 making no assumptions about
whether the security provider in use supports them (JDK8 and
higher support all the curves defined in rfc5480).
A Java8 compatible version of Map.ofEntries() was added in #54183,
but it really needs a compat version of Map.entry as well in order to
facilitate easy backports from master.
This commit causes negative TimeValues, other than -1 which is sometimes used as
a sentinel value, to be rejected during parsing.
Also introduces a hack to allow ILM to load policies which were written to the
cluster state with a negative min_age, treating those values as 0, which should
match the behavior of prior versions.
This commit removes the configuration time vs execution time distinction
with regards to certain BuildParms properties. Because of the cost of
determining Java versions for configuration JDK locations we deferred
this until execution time. This had two main downsides. First, we had
to implement all this build logic in tasks, which required a bunch of
additional plumbing and complexity. Second, because some information
wasn't known during configuration time, we had to nest any build logic
that depended on this in awkward callbacks.
We now defer to the JavaInstallationRegistry recently added in Gradle.
This utility uses a much more efficient method for probing Java
installations vs our jrunscript implementation. This, combined with some
optimizations to avoid probing the current JVM as well as deferring
some evaluation via Providers when probing installations for BWC builds
we can maintain effectively the same configuration time performance
while removing a bunch of complexity and runtime cost (snapshotting
inputs for the GenerateGlobalBuildInfoTask was very expensive). The end
result should be a much more responsive build execution in almost all
scenarios.
(cherry picked from commit ecdbd37f2e0f0447ed574b306adb64c19adc3ce1)
It's simple to deprecate a field used in an ObjectParser just by adding deprecation
markers to the relevant ParseField objects. The warnings themselves don't currently
have any context - they simply say that a deprecated field has been used, but not
where in the input xcontent it appears. This commit adds the parent object parser
name and XContentLocation to these deprecation messages.
Note that the context is automatically stripped from warning messages when they
are asserted on by integration tests and REST tests, because randomization of
xcontent type during these tests means that the XContentLocation is not constant
This commit makes a number of improvements when importing the
Elasticsearch project into IntelliJ IDEA. Specifically:
- Contributing documentation has been updated to reflect that the
'idea' task should no long be used and Gradle project import is
instead the officially supported way of setting up the project.
- Attempts to run the 'idea' task will result in a failure with a
message directing folks to our CONTRIBUTING.md document.
- The project JDK is explicit set rather that using whatever JAVA_HOME
is.
- Gradle build operation delegation is disabled, and test execution is
configured to 'choose per test'.
- Gradle is configured to inherit the project JDK.
- Some code style conventions are automatically configured.
- File encoding is explicitly set to UTF-8.
- Parallel module compilation is enabled and deprecated feature
warnings are disabled.
- A remote debug run configuration using listen mode is created.
- JUnit runner is configured with required system properties.
- License headers are configured such that Apache 2 is the default
notice added to all source files with exception of source in /x-pack
which will use the Elastic license.
Backport to 7x
Enable geo_shape query to work on geo_point fields for shapes: circle, polygon, multipolygon, rectangle see: #48928
Co-Authored-By: @iverase
Sometimes we want to deprecate and remove a ParseField entirely, without replacement;
for example, the various places where we specify a _type field in 7x. Currently we can
tell users only that a particular field name should not be used, and that another name should
be used in its place. This commit adds the ability to say that a field should not be used at
all.
Re-applies the change from #53523 along with test fixes.
closes#53626closes#53624closes#53622closes#53625
Co-authored-by: Nik Everett <nik9000@gmail.com>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
Co-authored-by: Jake Landis <jake.landis@elastic.co>
ObjectParser allows you to declare a set of required fields, such that at least one
of the set must appear in an xcontent object for it to be valid. This commit adds
the similar concept of a set of exclusive fields, such that at most one of the set
must be present. It also enables required fields on ConstructingObjectParser, and
re-implements PercolateQueryBuilder.fromXContent() to use object parsing as
an example of how this works.
This adds a builder and parsed results for the `string_stats`
aggregation directly to the high level rest client. Without this the
HLRC can't access the `string_stats` API without the elastic licensed
`analytics` module.
While I'm in there this adds a few of our usual unit tests and
modernizes the parsing.
ConstructingObjectParser can be used to specify required fields,
but it is still difficult to configure "sets" of fields where only
one of the set is required (requiring hand-rolled logic in each
ConstructingObjectParser, or adding special validation methods
to objects that are called after building the object).
This commit adds a new method on ObjectParser which allows
the parsers to register required sets. E.g. ["foo", "bar"] can be
registered, which means "foo", "bar" or both must be configured
by the user otherwise an exception is thrown.
This pattern crops up in many places in our parsers; a good example are
the aggregation "field" and "script" fields. One or both must be
configured on all aggregations, omitting both should result in an exception.
This was previously handled far downstream resulting in an aggregation
exception, when it should be a parse exception.
While we use `== false` as a more visible form of boolean negation
(instead of `!`), the true case is implied and the true value does not
need to explicitly checked. This commit converts cases that have slipped
into the code checking for `== true`.
* Reload secure settings with password (#43197)
If a password is not set, we assume an empty string to be
compatible with previous behavior.
Only allow the reload to be broadcast to other nodes if TLS is
enabled for the transport layer.
* Add passphrase support to elasticsearch-keystore (#38498)
This change adds support for keystore passphrases to all subcommands
of the elasticsearch-keystore cli tool and adds a subcommand for
changing the passphrase of an existing keystore.
The work to read the passphrase in Elasticsearch when
loading, which will be addressed in a different PR.
Subcommands of elasticsearch-keystore can handle (open and create)
passphrase protected keystores
When reading a keystore, a user is only prompted for a passphrase
only if the keystore is passphrase protected.
When creating a keystore, a user is allowed (default behavior) to create one with an
empty passphrase
Passphrase can be set to be empty when changing/setting it for an
existing keystore
Relates to: #32691
Supersedes: #37472
* Restore behavior for force parameter (#44847)
Turns out that the behavior of `-f` for the add and add-file sub
commands where it would also forcibly create the keystore if it
didn't exist, was by design - although undocumented.
This change restores that behavior auto-creating a keystore that
is not password protected if the force flag is used. The force
OptionSpec is moved to the BaseKeyStoreCommand as we will presumably
want to maintain the same behavior in any other command that takes
a force option.
* Handle pwd protected keystores in all CLI tools (#45289)
This change ensures that `elasticsearch-setup-passwords` and
`elasticsearch-saml-metadata` can handle a password protected
elasticsearch.keystore.
For setup passwords the user would be prompted to add the
elasticsearch keystore password upon running the tool. There is no
option to pass the password as a parameter as we assume the user is
present in order to enter the desired passwords for the built-in
users.
For saml-metadata, we prompt for the keystore password at all times
even though we'd only need to read something from the keystore when
there is a signing or encryption configuration.
* Modify docs for setup passwords and saml metadata cli (#45797)
Adds a sentence in the documentation of `elasticsearch-setup-passwords`
and `elasticsearch-saml-metadata` to describe that users would be
prompted for the keystore's password when running these CLI tools,
when the keystore is password protected.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* Elasticsearch keystore passphrase for startup scripts (#44775)
This commit allows a user to provide a keystore password on Elasticsearch
startup, but only prompts when the keystore exists and is encrypted.
The entrypoint in Java code is standard input. When the Bootstrap class is
checking for secure keystore settings, it checks whether or not the keystore
is encrypted. If so, we read one line from standard input and use this as the
password. For simplicity's sake, we allow a maximum passphrase length of 128
characters. (This is an arbitrary limit and could be increased or eliminated.
It is also enforced in the keystore tools, so that a user can't create a
password that's too long to enter at startup.)
In order to provide a password on standard input, we have to account for four
different ways of starting Elasticsearch: the bash startup script, the Windows
batch startup script, systemd startup, and docker startup. We use wrapper
scripts to reduce systemd and docker to the bash case: in both cases, a
wrapper script can read a passphrase from the filesystem and pass it to the
bash script.
In order to simplify testing the need for a passphrase, I have added a
has-passwd command to the keystore tool. This command can run silently, and
exit with status 0 when the keystore has a password. It exits with status 1 if
the keystore doesn't exist or exists and is unencrypted.
A good deal of the code-change in this commit has to do with refactoring
packaging tests to cleanly use the same tests for both the "archive" and the
"package" cases. This required not only moving tests around, but also adding
some convenience methods for an abstraction layer over distribution-specific
commands.
* Adjust docs for password protected keystore (#45054)
This commit adds relevant parts in the elasticsearch-keystore
sub-commands reference docs and in the reload secure settings API
doc.
* Fix failing Keystore Passphrase test for feature branch (#50154)
One problem with the passphrase-from-file tests, as written, is that
they would leave a SystemD environment variable set when they failed,
and this setting would cause elasticsearch startup to fail for other
tests as well. By using a try-finally, I hope that these tests will fail
more gracefully.
It appears that our Fedora and Ubuntu environments may be configured to
store journald information under /var rather than under /run, so that it
will persist between boots. Our destructive tests that read from the
journal need to account for this in order to avoid trying to limit the
output we check in tests.
* Run keystore management tests on docker distros (#50610)
* Add Docker handling to PackagingTestCase
Keystore tests need to be able to run in the Docker case. We can do this
by using a DockerShell instead of a plain Shell when Docker is running.
* Improve ES startup check for docker
Previously we were checking truncated output for the packaged JDK as
an indication that Elasticsearch had started. With new preliminary
password checks, we might get a false positive from ES keystore
commands, so we have to check specifically that the Elasticsearch
class from the Bootstrap package is what's running.
* Test password-protected keystore with Docker (#50803)
This commit adds two tests for the case where we mount a
password-protected keystore into a Docker container and provide a
password via a Docker environment variable.
We also fix a logging bug where we were logging the identifier for an
array of strings rather than the contents of that array.
* Add documentation for keystore startup prompting (#50821)
When a keystore is password-protected, Elasticsearch will prompt at
startup. This commit adds documentation for this prompt for the archive,
systemd, and Docker cases.
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
* Warn when unable to upgrade keystore on debian (#51011)
For Red Hat RPM upgrades, we warn if we can't upgrade the keystore. This
commit brings the same logic to the code for Debian packages. See the
posttrans file for gets executed for RPMs.
* Restore handling of string input
Adds tests that were mistakenly removed. One of these tests proved
we were not handling the the stdin (-x) option correctly when no
input was added. This commit restores the original approach of
reading stdin one char at a time until there is no more (-1, \r, \n)
instead of using readline() that might return null
* Apply spotless reformatting
* Use '--since' flag to get recent journal messages
When we get Elasticsearch logs from journald, we want to fetch only log
messages from the last run. There are two reasons for this. First, if
there are many logs, we might get a string that's too large for our
utility methods. Second, when we're looking for a specific message or
error, we almost certainly want to look only at messages from the last
execution.
Previously, we've been trying to do this by clearing out the physical
files under the journald process. But there seems to be some contention
over these directories: if journald writes a log file in between when
our deletion command deletes the file and when it deletes the log
directory, the deletion will fail.
It seems to me that we might be able to use journald's "--since" flag to
retrieve only log messages from the last run, and that this might be
less likely to fail due to race conditions in file deletion.
Unfortunately, it looks as if the "--since" flag has a granularity of
one-second. I've added a two-second sleep to make sure that there's a
sufficient gap between the test that will read from journald and the
test before it.
* Use new journald wrapper pattern
* Update version added in secure settings request
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
This change changes the way to run our test suites in
JVMs configured in FIPS 140 approved mode. It does so by:
- Configuring any given runtime Java in FIPS mode with the bundled
policy and security properties files, setting the system
properties java.security.properties and java.security.policy
with the == operator that overrides the default JVM properties
and policy.
- When runtime java is 11 and higher, using BouncyCastle FIPS
Cryptographic provider and BCJSSE in FIPS mode. These are
used as testRuntime dependencies for unit
tests and internal clusters, and copied (relevant jars)
explicitly to the lib directory for testclusters used in REST tests
- When runtime java is 8, using BouncyCastle FIPS
Cryptographic provider and SunJSSE in FIPS mode.
Running the tests in FIPS 140 approved mode doesn't require an
additional configuration either in CI workers or locally and is
controlled by specifying -Dtests.fips.enabled=true
When you declare an ObjectParser with top level named objects like we do
with `significant_terms` we didn't support "did you mean". This fixes
that.
Relates #50938
Check it out:
```
$ curl -u elastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_update/foo?pretty -d'{
"dac": {}
}'
{
"error" : {
"root_cause" : [
{
"type" : "x_content_parse_exception",
"reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
}
],
"type" : "x_content_parse_exception",
"reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
},
"status" : 400
}
```
The tricky thing about implementing this is that x-content doesn't
depend on Lucene. So this works by creating an extension point for the
error message using SPI. Elasticsearch's server module provides the
"spell checking" implementation.
s
In order to ensure that logstash and Elasticsearch are able to understand
the same patterns, this commit adapts to changes in logstash, adds a few
patterns and changes a few.
We *very* commonly have object with ctors like:
```
public Foo(String name)
```
And then declare a bunch of setters on the object. Every aggregation
works like this, for example. This change teaches `ObjectParser` how to
build these aggregations all on its own, without any help. This'll make
it much cleaner to parse aggs, and, probably, a bunch of other things.
It'll let us remove lots of wrapping. I've used this new power for the
`avg` aggregation just to prove that it works outside of a unit test.
* Adds JavaDoc to `AbstractWireTestCase` and
`AbstractWireSerializingTestCase` so it is more obvious you should prefer
the latter if you have a choice
* Moves the `instanceReader` method out of `AbstractWireTestCase` becaue
it is no longer used.
* Marks a bunch of methods final so it is more obvious which classes are
for what.
* Cleans up the side effects of the above.
This adds support for "collapsed" named object to `ObjectParser`. In
particular, this supports the sort of xcontent that we use to specify
significance heuristics. See #25519 and this example:
```
GET /_search
{
"query" : {
"terms" : {"force" : [ "British Transport Police" ]}
},
"aggregations" : {
"significant_crime_types" : {
"significant_terms" : {
"field" : "crime_type",
"mutual_information" : { <<------- This is the name
"include_negatives": true
}
}
}
}
}
```
I believe there are a couple of things that work this way.
I've held off on moving the actual parsing of the significant heuristics
to this code to keep the review more compact. The moving is pretty
mechanical stuff in the aggs framework.
Backport of #49612.
The current Docker entrypoint script picks up environment variables and
translates them into -E command line arguments. However, since any tool
executes via `docker exec` doesn't run the entrypoint, it results in
a poorer user experience.
Therefore, refactor the env var handling so that the -E options are
generated in `elasticsearch-env`. These have to be appended to any
existing command arguments, since some CLI tools have subcommands and
-E arguments must come after the subcommand.
Also extract the support for `_FILE` env vars into a separate script, so
that it can be called from more than once place (the behaviour is
idempotent).
Finally, add noop -E handling to CronEvalTool for parity, and support
`-E` in MultiCommand before subcommands.
The elasticsearch-node tools allow manipulating the on-disk cluster state. The tool is currently
unaware of plugins and will therefore drop custom metadata from the cluster state once the
state is written out again (as it skips over the custom metadata that it can't read). This commit
preserves unknown customs when editing on-disk metadata through the elasticsearch-node
command-line tools.
IntelliJ IDEA moved their JUnit runner to a different package. While this does not break running
tests in IDEA, it leads to an ugly exception being thrown at the end of the tests:
Exception in thread "main" java.lang.SecurityException: java.lang.System#exit(0) calls are not
allowed
at org.elasticsearch.secure_sm.SecureSM$2.run(SecureSM.java:248)
at org.elasticsearch.secure_sm.SecureSM$2.run(SecureSM.java:215)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:310)
at org.elasticsearch.secure_sm.SecureSM.innerCheckExit(SecureSM.java:215)
at org.elasticsearch.secure_sm.SecureSM.checkExit(SecureSM.java:206)
at java.base/java.lang.Runtime.exit(Runtime.java:111)
at java.base/java.lang.System.exit(System.java:1781)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:59)
This commit adds support for newer IDEA versions in SecureSM.
- Improves HTTP client hostname verification failure messages
- Adds "DiagnosticTrustManager" which logs certificate information
when trust cannot be established (hostname failure, CA path failure,
etc)
These diagnostic messages are designed so that many common TLS
problems can be diagnosed based solely (or primarily) on the
elasticsearch logs.
These diagnostics can be disabled by setting
xpack.security.ssl.diagnose.trust: false
Backport of: #48911
Backport of #47208.
Closes#46900. When running ES with `--quiet`, if ES then exits abnormally, a
user has to go hunting in the logs for the error. Instead, never close
System.err, and print more information to it if ES encounters a fatal error
e.g. config validation, or some fatal runtime exception. This is useful when
running under e.g. systemd, since the error will go into the journal.
Note that stderr is still closed in daemon (`-d`) mode.
this resolves incorrectly licensed code in #49009.
ESSloppyMath is made as a wrapper around FastMath.java which is
not meant to be modified with code beyond the original source
This commit introduces a new class called ESSloppyMath
that is meant to reflect the purpose of Lucene's SloppyMath,
but add additional unimplemented faster alternatives to math functions.
The two that are used by geotile-grid a lot are sinh/atan.
In a quick elasticsearch rally benchmark for geotile-grid on Switzerland
data points, this shows a (1.22x) 22% speed-up over using Math's functions.
closes#41166.
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
This commit introduces a consistent, and type-safe manner for handling
global build parameters through out our build logic. Primarily this
replaces the existing usages of extra properties with static accessors.
It also introduces and explicit API for initialization and mutation of
any such parameters, as well as better error handling for uninitialized
or eager access of parameter values.
Closes#42042
Backport of #48448. Make a number of changes so that code in the libs and
modules directories are more resilient to automatic formatting. This covers:
* Remove string concatenation where JSON fits on a single line
* Move some comments around to they aren't auto-formatted to a strange
place
There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.
Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)
Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.
This change should also fix#43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
This prevents the following warning from being printed to console:
`regular expression has redundant nested repeat operator + /%\{(?<name>(?<pattern>[A-z0-9]+)(?::(?<subname>[[:alnum:]@\[\]_:.-]+))?)(?:=(?<definition>(?:(?:[^{}]+|\.+)+)+))?\}/`
The current grok expression is not failing, but just this warning is being printed.
The warning started being printed after upgrading joni (#47374).
Closes#47861
Backport of #47374
Changed the Grok class to use searchInterruptible(...) instead of search(...)
otherwise we can't interrupt long running matching via the thread watch
dog.
Joni now also provides another way to interrupt long running matches.
By invoking the interrupt() method on the Matcher. We need then to refactor
the watch thread dog to keep track of Matchers instead of Threads, but
it is a better way of doing this, since interrupting would be more direct
(not every 30k iterations) and efficient (checking a volatile field).
This work needs to be done in a follow up.
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
The change #47238 fixed a first issue (#47076) but introduced
another one that can be reproduced using:
org.elasticsearch.common.CharArraysTests > testConstantTimeEquals FAILED
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at __randomizedtesting.SeedInfo.seed([DFCA64FE2C786BE3:ED987E883715C63B]:0)
at java.lang.String.substring(String.java:1963)
at org.elasticsearch.common.CharArraysTests.testConstantTimeEquals(CharArraysTests.java:74)
REPRODUCE WITH: ./gradlew ':libs:elasticsearch-core:test' --tests
"org.elasticsearch.common.CharArraysTests.testConstantTimeEquals"
-Dtests.seed=DFCA64FE2C786BE3 -Dtests.security.manager=true -Dtests.locale=fr-CA
-Dtests.timezone=Pacific/Johnston -Dcompiler.java=12 -Druntime.java=8
that happens when the first randomized string has a length of 0.
The test of constantTimeEquals could get unlucky and randomly produce
the same two strings. This commit tweaks the test to ensure the two
string are unique, and the loop inside constantTimeEquals is actually
executed (which requires the strings be of the same length).
fixes#47076
Due to recent changes in the nio transport, a failure to bind the server
channel has started to be logged at an error level. This exception leads
to an automatic retry on a different port, so it should only be logged
at a trace level.
* Add retention to Snapshot Lifecycle Management (#46407)
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.
An example policy would look like:
```
PUT /_slm/policy/snapshot-every-day
{
"schedule": "0 30 2 * * ?",
"name": "<production-snap-{now/d}>",
"repository": "my-s3-repository",
"config": {
"indices": ["foo-*", "important"]
},
// Newly configured retention options
"retention": {
// Snapshots should be deleted after 14 days
"expire_after": "14d",
// Keep a maximum of thirty snapshots
"max_count": 30,
// Keep a minimum of the four most recent snapshots
"min_count": 4
}
}
```
SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
Included in this work is a new SLM stats API endpoint available through
``` json
GET /_slm/stats
```
That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.
* Add base framework for snapshot retention (#43605)
* Add base framework for snapshot retention
This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.
Relates to #38461
* Remove extraneous 'public'
* Use a local var instead of reading class var repeatedly
* Add SnapshotRetentionConfiguration for retention configuration (#43777)
* Add SnapshotRetentionConfiguration for retention configuration
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.
Relates to #43663
* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)
* Implement SnapshotRetentionTask's snapshot filtering and deletion
This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.
Relates to #43663
* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
* Allow empty/missing SLM retention configuration (#45018)
Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.
Relates to #43663
* Add min_count and max_count as SLM retention predicates (#44926)
This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.
These options are optional and one, two, or all three can be specified
in an SLM policy.
Relates to #43663
* Time-bound deletion of snapshots in retention delete function (#45065)
* Time-bound deletion of snapshots in retention delete function
With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.
Relates to #43663
* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
* Add SLM metrics gathering and endpoint (#45362)
* Add SLM metrics gathering and endpoint
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:
```
GET /_slm/stats
{
"retention_runs" : 13,
"retention_failed" : 0,
"retention_timed_out" : 0,
"retention_deletion_time" : "1.4s",
"retention_deletion_time_millis" : 1404,
"policy_metrics" : {
"daily-snapshots2" : {
"snapshots_taken" : 7,
"snapshots_failed" : 0,
"snapshots_deleted" : 6,
"snapshot_deletion_failures" : 0
},
"daily-snapshots" : {
"snapshots_taken" : 12,
"snapshots_failed" : 0,
"snapshots_deleted" : 12,
"snapshot_deletion_failures" : 6
}
},
"total_snapshots_taken" : 19,
"total_snapshots_failed" : 0,
"total_snapshots_deleted" : 18,
"total_snapshot_deletion_failures" : 6
}
```
This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.
Relates to #43663
* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
* Record history of SLM retention actions (#45513)
This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.
* Retry SLM retention after currently running snapshot completes (#45802)
* Retry SLM retention after currently running snapshot completes
This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.
Relates to #43663
* Increase timeout waiting for snapshot completion
* Apply patch
From 2374316f0d.patch
* Rename test variables
* [TEST] Be less strict for stats checking
* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)
This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.
Relates to #43663
* Check all actions preventing snapshot delete during retention (#45992)
* Check all actions preventing snapshot delete during retention run
Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:
- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running
This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.
Relates to #43663
* Add unit test for okayToDeleteSnapshots
* Fix bug where SLM retention task would be scheduled on every node
* Enhance test logging
* Ignore if snapshot is already deleted
* Missing import
* Fix SnapshotRetentionServiceTests
* Expose SLM policy stats in get SLM policy API (#45989)
This also adds support for the SLM stats endpoint to the high level rest client.
Retrieving a policy now looks like:
```json
{
"daily-snapshots" : {
"version": 1,
"modified_date": "2019-04-23T01:30:00.000Z",
"modified_date_millis": 1556048137314,
"policy" : {
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["data-*", "important"],
"ignore_unavailable": false,
"include_global_state": false
},
"retention": {}
},
"stats": {
"snapshots_taken": 0,
"snapshots_failed": 0,
"snapshots_deleted": 0,
"snapshot_deletion_failures": 0
},
"next_execution": "2019-04-24T01:30:00.000Z",
"next_execution_millis": 1556048160000
}
}
```
Relates to #43663
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.
Relates to #43663Resolves#46205
* Add error logging when exceptions are thrown
* Update serialization versions
* Fix type inference
* Use non-Cancellable HLRC return value
* Fix Client mocking in test
* Fix SLMSnapshotBlockingIntegTests for 7.x branch
* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval
* Add serialization guards for SnapshotLifecyclePolicy