When checking for the existence of a document in the ILM/CCR integration
tests, `assertDocumentExists` makes an HTTP request and checks the
response code. However, if the repsonse code is not successful, the call
will throw a `ResponseException`. `assertDocumentExists` is often called
inside an `assertBusy`, and wrapping the `ResponseException` in an
`AssertionError` will allow the `assertBusy` to retry.
In particular, this fixes an issue with `testCCRUnfollowDuringSnapshot`
where the index in question may still be closed when the document is
requested.
Backport of #48452.
The SAML tests have large XML documents within which various parameters
are replaced. At present, if these test are auto-formatted, the XML
documents get strung out over many, many lines, and are basically
illegible.
Fix this by using named placeholders for variables, and indent the
multiline XML documents.
The tests in `SamlSpMetadataBuilderTests` deserve a special mention,
because they include a number of certificates in Base64. I extracted
these into variables, for additional legibility.
This commit ensures that the creation of a DocumentSubsetReader does not
eagerly resolve the role query and the number of docs that match.
We want to delay this expensive operation in order to ensure that we really
need this information when we build it. For this reason the role query and the
number of docs are now resolved on demand. This commit also depends on
https://issues.apache.org/jira/browse/LUCENE-9003 that will also compute the global
number of docs lazily.
* Extract remote "sniffing" to connection strategy (#47253)
Currently the connection strategy used by the remote cluster service is
implemented as a multi-step sniffing process in the
RemoteClusterConnection. We intend to introduce a new connection strategy
that will operate in a different manner. This commit extracts the
sniffing logic to a dedicated strategy class. Additionally, it implements
dedicated tests for this class.
Additionally, in previous commits we moved away from a world where the
remote cluster connection was mutable. Instead, when setting updates are
made, the connection is torn down and rebuilt. We still had methods and
tests hanging around for the mutable behavior. This commit removes those.
* Introduce simple remote connection strategy (#47480)
This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
This commit changes the REST API spec slm.get_lifecycle's policy_id url part to be of type "list", in line with other REST API specs that accept a comma-separated list of values.
Closes#47765
BytesReference is currently an abstract class which is extended by
various implementations. This makes it very difficult to use the
delegation pattern. The implication of this is that our releasable
BytesReference is a PagedBytesReference type and cannot be used as a
generic releasable bytes reference that delegates to any reference type.
This commit makes BytesReference an interface and introduces an
AbstractBytesReference for common functionality.
The AbstractHlrcWriteableXContentTestCase was replaced by a better test
case a while ago, and this is the last two instances using it. They have
been converted and the test is now deleted.
Ref #39745
There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.
Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)
Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.
This change should also fix#43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
We have not seen much adoption of this experimental field type, and don't see a
clear use case as it's currently designed. This PR deprecates the field type in
7.x. It will be removed from 8.0 in a follow-up PR.
7.5+ for SLM requires [stats] object to exist in the cluster state.
When doing an in-place upgrade from 7.4 to 7.5+ [stats] does not exist
in cluster state, result in an exception on startup [1].
This commit moves the [stats] to be an optional object in the parser
and if not found will default to an empty stats object.
[1] Caused by: java.lang.IllegalArgumentException: Required [stats]
Reverting the change introducing IsoLocal.ROOT and introducing IsoCalendarDataProvider that defaults start of the week to Monday and requires minimum 4 days in first week of a year. This extension is using java SPI mechanism and defaults for Locale.ROOT only.
It require jvm property java.locale.providers to be set with SPI,COMPAT
closes#41670
backport #48209
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.
This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__
Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
FIPS 140 bootstrap checks should not be bootstrap checks as they
are always enforced. This commit moves the validation logic within
the security plugin.
The FIPS140SecureSettingsBootstrapCheck was not applicable as the
keystore was being loaded on init, before the Bootstrap checks
were checked, so an elasticsearch keystore of version < 3 would
cause the node to fail in a FIPS 140 JVM before the bootstrap check
kicked in, and as such hasn't been migrated.
Resolves: #34772
The enrich stats api picked the wrong task to be displayed
in the executing stats section.
In case `wait_for_completion` was set to `false` then no task
was being displayed and if that param was set to `true` then
the wrong task was being displayed (transport action task instead
of enrich policy executor task).
Testing executing policies in enrich stats api is tricky.
I have verified locally that this commit fixes the bug.
This PR adds an origin for the Enrich feature, and modifies the background
maintenance task to use the origin when executing client operations.
Without this fix, the maintenance task fails to execute when security is
enabled.
This class is only used by the blob store repository
and CCR and the abstractions didn't really make sense
with CCR ignoring the concrete `restoreFiles` method
completely and having a method used only by the blobstore
overriden as unsupported.
=> Moved to a more fitting set of abstractions
=> Dried up the stream wrapping in `BlobStoreRepository` a little
now that the `restoreFile` method could be simplified
Relates #48110 as it makes changing the API of `FileRestoreContext`
to what is needed for async restores simpler
The test testPauseAndResumeWithMultipleAutoFollowPatterns
failed multiple times, mostly because it creates too many leader
indices and the following cluster cannot cope with cluster state
updates generated by following indices creation and pause/
resume auto-followers changes.
This commit simplifies the test by creating at most 20 leader
indices and by waiting for any new leader index to be picked
up by the auto-follower before created another leader index.
It also pause and resume less auto-followers as previously.
closes#47917
All internal searches (triggered by APIs) across the .security index
must be performed while "under the security origin". Otherwise,
the search is performed in the context of the caller which most
likely does not have privileges to search .security (hopefully).
This commit fixes this in the case of two methods in the
TokenService and corrects an overly done such context switch
in the ApiKeyService.
In addition, this makes all tests from the client/rest-high-level
module execute as an all mighty administrator,
but not a literal superuser.
Closes#47151
This test could fail for two reasons, both should be fixed by this PR:
1) It hit a timeout for an `assertBusy`. This commit increases the
timeout for that `assertBusy`.
2) The snapshot that was supposed to be blocked could, in fact, be
successful. This is because a previous snapshot had been successfully
been taken, and no new data had been added between the two snapshots.
This means that no new segment files needed to be written for the new
snapshot, so the block on data files was never triggered. This commit
changes two things: First, it indexes some new data before taking the
second snapshot (the one that needs to be blocked), and second,
checks to ensure that the block is actually hit before continuing
with the test.
The logic for handling empty segment files has been
unnecessary ever since #24021 which removes the support
for these files in 6.x -> we can safely remove the
support for restoring these from 7.x+ to simplify the code.
There is no reason to still resolve the
fallback `IndexId` here. It only applies to
`2.x` repos and those we can't read anymore
anyway because they use an `/index` instead of
an `/index-N` blob at the repo root for which
at least 7.x+ does not contain the logic to find
it.
If a job stops right after reindexing is finished but before
we refresh the destination index, we don't refresh at all.
If the job is started again right after, it jumps into the analyzing state.
However, the data is still not searchable.
This is why we were seeing test failures that we start the process
expecting X rows (where X is lower than the expected number of docs)
and we end up getting X+.
We fix this by moving the refresh of the dest index right before
we start the process so it always ensures the data is searchable.
Closes#47612
Backport of #48090