* Introduce binary_format request parameter (binary.format for JDBC) to disable binary
communication between clients (jdbc/odbc) and server.
* for CLI - "binary" command line parameter (or -b) is introduced. Default value is "true".
* binary communication (cbor) is enabled by default
* disabling request parameter introduced for debugging purposes only
(cherry picked from commit f96a5ca61cb9fad9ed59357320af20e669348ce7)
* Add ingest info to Cluster Stats (#48485)
This commit enhances the ClusterStatsNodes response to include global
processor usage stats on a per-processor basis.
example output:
```
...
"processor_stats": {
"gsub": {
"count": 0,
"failed": 0
"current": 0
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0
"current": 0,
"time_in_millis": 0
}
}
...
```
The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the current per-node usage statistics that currently exist in node stats.
Closes#46146.
* fix BWC of ingest stats
The introduction of processor types into IngestStats had a bug.
It was set to `null` and set as the key to the map. This would
throw a NPE. This commit resolves this by setting all the processor
types from previous versions that are not serializing it out to
`_NOT_AVAILABLE`.
This test used an index without an alias to simulate a failure in the
`check-rollover-ready` step. However, with #48256 that step
automatically retries, meaning that the index may not always be in
the ERROR step.
This commit changes the test to use a shrink action with an invalid
number of shards so that it stays in the ERROR step.
Resolves#48767
Previous behavior while copying HTTP headers to the ThreadContext,
would allow multiple HTTP headers with the same name, handling only
the first occurrence and disregarding the rest of the values. This
can be confusing when dealing with multiple Headers as it is not
obvious which value is read and which ones are silently dropped.
According to RFC-7230, a client must not send multiple header fields
with the same field name in a HTTP message, unless the entire field
value for this header is defined as a comma separated list or this
specific header is a well-known exception.
This commits changes the behavior in order to be more compliant to
the aforementioned RFC by requiring the classes that implement
ActionPlugin to declare if a header can be multi-valued or not when
registering this header to be copied over to the ThreadContext in
ActionPlugin#getRestHeaders.
If the header is allowed to be multivalued, then all such headers
are read from the HTTP request and their values get concatenated in
a comma-separated string.
If the header is not allowed to be multivalued, and the HTTP
request contains multiple such Headers with different values, the
request is rejected with a 400 status.
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.
Fixes: #41159Fixes: #40001Fixes: #40240Fixes: #33361Fixes: #46316Fixes: #36074Fixes: #34543Fixes: #37044Fixes: #42041
(cherry picked from commit 3c38ea555984fcd2c6bf9e39d0f47a01b09e7c48)
This adds the infrastructure to be able to retry the execution of retryable
steps and makes the `check-rollover-ready` retryable as an initial step to
make the rollover action more resilient to transient errors.
(cherry picked from commit 454020ac8acb147eae97acb4ccd6fb470d1e5f48)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
At present the ML C++ artifact is always downloaded from
S3. This change adds an option to configure the location.
(The intention is to use a file:/// URL to pick up the
artifact built in a Docker container in ml-cpp PR builds
so that C++ changes that will break Java integration tests
can be detected before the ml-cpp PRs are merged.)
Relates elastic/ml-cpp#766
Moves common field extraction logic to its own package so that it can
be used both for anomaly detection and data frame analytics.
In preparation for refactoring extraction fields to be simpler and to
support multi-fields properly.
Backport of #48709
When we load a JSON Web Key (JWKSet) from the specified
file using JWKSet.load it internally uses IOUtils.readFileToString
but the opened FileInputStream is never closed after usage.
https://bitbucket.org/connect2id/nimbus-jose-jwt/issues/342
This commit reads the file and parses the JWKSet from the string.
This also fixes an issue wherein if the underlying file changed,
for every change event it would add another file watcher. The
change is to only add the file watcher at the start.
Closes#44942
* Un-AwaitsFix and enhance logging for testPolicyCRUD
This removes the `AwaitsFix` and increases the test logging for
`SnapshotLifecycleServiceTests.testPolicyCRUD` in an effort to track
down the cause of #44997.
* Remove unused import
This PR performs the following changes:
* Split `ScoreScriptUtilsTests` into `DenseVectorFunctionTests` and
`SparseVectorFunctionTests`. This will make it easier to delete all sparse
vector function tests once we remove support on 8.x.
* As much as possible, break up the large test methods into individual tests
for each vector function (`cosineSimilarity`, `l2norm`, etc.).
The logger was erroneously using the `SnapshotLifecycleMetadata` class
for its initialization, making it hard to target packages for logging
levels since `SnapshotLifecycleMetadata` is in a different package.
* [ML][Inference] separating definition and config object storage (#48651)
This separates out the `definition` object from being stored within the configuration object in the index.
This allows us to gather the config object without decompressing a potentially large definition.
Additionally, `input` is moved to the TrainedModelConfig object and out of the definition. This is so the trained input fields are accessible outside the potentially large model definition.
This adds a guard for the SLM lifecycle and retention service that
prevents new jobs from being scheduled once the service has been
stopped. Previous if the node were shut down the service would be
stopped, but a cluster state or local master election would cause a job
to attempt to be scheduled. This could lead to an uncaught
`RejectedExecutionException`.
Resolves#47749
The 1MB IO-buffer size per transport thread is causing trouble in
some tests, albeit at a low rate. Reducing the number of transport
threads was not enough to fully fix this situation.
Allowing to configure the size of the buffer and reducing it by
more than an order of magnitude should fix these tests.
Closes#46803
Previously the functions accepted a doc values reference, whereas they now
accept the name of the vector field. Here's an example of how a vector function
was called before and after the change.
```
Before: cosineSimilarity(params.query_vector, doc['field'])
After: cosineSimilarity(params.query_vector, 'field')
```
This seems more intuitive, since we don't allow direct access to vector doc
values and the the meaning of `doc['field']` is unclear.
The PR makes the following changes (broken into distinct commits):
* Add new function signatures of the form `function(params.query_vector,
'field')` and deprecates the old ones. Because Painless doesn't allow two
methods with the same name and number of arguments, we allow a generic `Object`
to be passed in to the function and decide on the behavior through an
`instanceof` check.
* Refactor the class bindings so that the document field is passed to the
constructor instead of the instance method. This allows us to avoid retrieving
the vector doc values on every function invocation, which gives a tiny speed-up
in benchmarks.
Note that this PR adds new signatures for the sparse vector functions too, even
though sparse vectors are deprecated. It seemed simplest to understand (for both
us and users) to keep everything symmetric between dense and sparse vectors.
The format returned by the API is not always parsable with
`Instant.parse()`, so this commit adjusts to parsing those dates as
`ISO_ZONED_DATE_TIME` instead, which appears to always parse the
returned value correctly.
The failures in these tests have been remarkably difficult to track
down, in part because they will not reproduce locally. This commit
unmutes the flaky tests and increases logging, as well as introducing
some additional logging, to attempt to pin down the failures.
The open and close follower steps didn't check if the index is open,
closed respectively, before executing the open/close request.
This changes the steps to check the index state and only perform the
open/close operation if the index is not already open/closed.
The cron schedule "1 2 3 4 5 ?" will run every May 4 at 03:02:01, which
may result in unnecessary test failures once a year. This commit
switches out uses of that schedule in tests for one which will never
execute (because it specifies a day which doesn't exist, Feb. 31). Also
factors the schedule out to a constant to make the intent clearer.
I think the problem is that the master is trying to relocate the "upgraded_scroll" shard back to
the node on which it was previously allocated, but to which it can't be allocated now due to the
shard lock being held because of an in-progress scroll. As the master keeps on retrying and
retrying (and indefinitely tries so because max_retries does not apply to relocations, it blocks
any other lower-prioritized task from completing, which leads to the rolling upgrade tests failing
(see #48395).
Closes#48395
This commit simplifies and standardizes our usage of the Gradle Shadow
plugin to conform more to plugin conventions. The custom "bundle" plugin
has been removed as it's not necessary and performs the same function
as the Shadow plugin's default behavior with existing configurations.
Additionally, this removes unnecessary creation of a "nodeps" artifact,
which is unnecessary because by default project dependencies will in
fact use the non-shadowed JAR unless explicitly depending on the
"shadow" configuration.
Finally, we've cleaned up the logic used for unit testing, so we are
now correctly testing against the shadow JAR when the plugin is applied.
This better represents a real-world scenario for consumers and provides
better test coverage for incorrectly declared dependencies.
(cherry picked from commit 3698131109c7e78bdd3a3340707e1c7b4740d310)
Due to a bug, GETing a snapshot can cause a RespositoryException to be
thrown. This error is transient and should be retried, rather than
causing the test to fail. This commit converts those
RepositoryExceptions into AssertionErrors so that they will be retried
in code wrapped in assertBusy.
When checking for the existence of a document in the ILM/CCR integration
tests, `assertDocumentExists` makes an HTTP request and checks the
response code. However, if the repsonse code is not successful, the call
will throw a `ResponseException`. `assertDocumentExists` is often called
inside an `assertBusy`, and wrapping the `ResponseException` in an
`AssertionError` will allow the `assertBusy` to retry.
In particular, this fixes an issue with `testCCRUnfollowDuringSnapshot`
where the index in question may still be closed when the document is
requested.
Backport of #48452.
The SAML tests have large XML documents within which various parameters
are replaced. At present, if these test are auto-formatted, the XML
documents get strung out over many, many lines, and are basically
illegible.
Fix this by using named placeholders for variables, and indent the
multiline XML documents.
The tests in `SamlSpMetadataBuilderTests` deserve a special mention,
because they include a number of certificates in Base64. I extracted
these into variables, for additional legibility.
This commit ensures that the creation of a DocumentSubsetReader does not
eagerly resolve the role query and the number of docs that match.
We want to delay this expensive operation in order to ensure that we really
need this information when we build it. For this reason the role query and the
number of docs are now resolved on demand. This commit also depends on
https://issues.apache.org/jira/browse/LUCENE-9003 that will also compute the global
number of docs lazily.
* Extract remote "sniffing" to connection strategy (#47253)
Currently the connection strategy used by the remote cluster service is
implemented as a multi-step sniffing process in the
RemoteClusterConnection. We intend to introduce a new connection strategy
that will operate in a different manner. This commit extracts the
sniffing logic to a dedicated strategy class. Additionally, it implements
dedicated tests for this class.
Additionally, in previous commits we moved away from a world where the
remote cluster connection was mutable. Instead, when setting updates are
made, the connection is torn down and rebuilt. We still had methods and
tests hanging around for the mutable behavior. This commit removes those.
* Introduce simple remote connection strategy (#47480)
This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
This commit changes the REST API spec slm.get_lifecycle's policy_id url part to be of type "list", in line with other REST API specs that accept a comma-separated list of values.
Closes#47765
BytesReference is currently an abstract class which is extended by
various implementations. This makes it very difficult to use the
delegation pattern. The implication of this is that our releasable
BytesReference is a PagedBytesReference type and cannot be used as a
generic releasable bytes reference that delegates to any reference type.
This commit makes BytesReference an interface and introduces an
AbstractBytesReference for common functionality.
The AbstractHlrcWriteableXContentTestCase was replaced by a better test
case a while ago, and this is the last two instances using it. They have
been converted and the test is now deleted.
Ref #39745
There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.
Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)
Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.
This change should also fix#43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
We have not seen much adoption of this experimental field type, and don't see a
clear use case as it's currently designed. This PR deprecates the field type in
7.x. It will be removed from 8.0 in a follow-up PR.
7.5+ for SLM requires [stats] object to exist in the cluster state.
When doing an in-place upgrade from 7.4 to 7.5+ [stats] does not exist
in cluster state, result in an exception on startup [1].
This commit moves the [stats] to be an optional object in the parser
and if not found will default to an empty stats object.
[1] Caused by: java.lang.IllegalArgumentException: Required [stats]
Reverting the change introducing IsoLocal.ROOT and introducing IsoCalendarDataProvider that defaults start of the week to Monday and requires minimum 4 days in first week of a year. This extension is using java SPI mechanism and defaults for Locale.ROOT only.
It require jvm property java.locale.providers to be set with SPI,COMPAT
closes#41670
backport #48209
- Section about the case where the `principal` user property can't
be mapped.
- Section about when the IdP SAML metadata do not contain a
SingleSignOnService that supports HTTP-Redirect binding.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
Co-Authored-By: Tim Vernum <tim@adjective.org>
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.
This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__
Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
FIPS 140 bootstrap checks should not be bootstrap checks as they
are always enforced. This commit moves the validation logic within
the security plugin.
The FIPS140SecureSettingsBootstrapCheck was not applicable as the
keystore was being loaded on init, before the Bootstrap checks
were checked, so an elasticsearch keystore of version < 3 would
cause the node to fail in a FIPS 140 JVM before the bootstrap check
kicked in, and as such hasn't been migrated.
Resolves: #34772
The enrich stats api picked the wrong task to be displayed
in the executing stats section.
In case `wait_for_completion` was set to `false` then no task
was being displayed and if that param was set to `true` then
the wrong task was being displayed (transport action task instead
of enrich policy executor task).
Testing executing policies in enrich stats api is tricky.
I have verified locally that this commit fixes the bug.
This PR adds an origin for the Enrich feature, and modifies the background
maintenance task to use the origin when executing client operations.
Without this fix, the maintenance task fails to execute when security is
enabled.
This class is only used by the blob store repository
and CCR and the abstractions didn't really make sense
with CCR ignoring the concrete `restoreFiles` method
completely and having a method used only by the blobstore
overriden as unsupported.
=> Moved to a more fitting set of abstractions
=> Dried up the stream wrapping in `BlobStoreRepository` a little
now that the `restoreFile` method could be simplified
Relates #48110 as it makes changing the API of `FileRestoreContext`
to what is needed for async restores simpler
The test testPauseAndResumeWithMultipleAutoFollowPatterns
failed multiple times, mostly because it creates too many leader
indices and the following cluster cannot cope with cluster state
updates generated by following indices creation and pause/
resume auto-followers changes.
This commit simplifies the test by creating at most 20 leader
indices and by waiting for any new leader index to be picked
up by the auto-follower before created another leader index.
It also pause and resume less auto-followers as previously.
closes#47917
All internal searches (triggered by APIs) across the .security index
must be performed while "under the security origin". Otherwise,
the search is performed in the context of the caller which most
likely does not have privileges to search .security (hopefully).
This commit fixes this in the case of two methods in the
TokenService and corrects an overly done such context switch
in the ApiKeyService.
In addition, this makes all tests from the client/rest-high-level
module execute as an all mighty administrator,
but not a literal superuser.
Closes#47151
This test could fail for two reasons, both should be fixed by this PR:
1) It hit a timeout for an `assertBusy`. This commit increases the
timeout for that `assertBusy`.
2) The snapshot that was supposed to be blocked could, in fact, be
successful. This is because a previous snapshot had been successfully
been taken, and no new data had been added between the two snapshots.
This means that no new segment files needed to be written for the new
snapshot, so the block on data files was never triggered. This commit
changes two things: First, it indexes some new data before taking the
second snapshot (the one that needs to be blocked), and second,
checks to ensure that the block is actually hit before continuing
with the test.
The logic for handling empty segment files has been
unnecessary ever since #24021 which removes the support
for these files in 6.x -> we can safely remove the
support for restoring these from 7.x+ to simplify the code.
There is no reason to still resolve the
fallback `IndexId` here. It only applies to
`2.x` repos and those we can't read anymore
anyway because they use an `/index` instead of
an `/index-N` blob at the repo root for which
at least 7.x+ does not contain the logic to find
it.
If a job stops right after reindexing is finished but before
we refresh the destination index, we don't refresh at all.
If the job is started again right after, it jumps into the analyzing state.
However, the data is still not searchable.
This is why we were seeing test failures that we start the process
expecting X rows (where X is lower than the expected number of docs)
and we end up getting X+.
We fix this by moving the refresh of the dest index right before
we start the process so it always ensures the data is searchable.
Closes#47612
Backport of #48090
The after snapshot action is interfering with SLM deleting snapshots
here it seems, causing concurrent delete exceptions.
Since these tests are now test-scoped there is no reason to run
snapshot deletes after each test so we can remove them to avoid this issue.
Closes#47937
We were not closing repositories on Node shutdown.
In production, this has little effect but in tests
shutting down a node using `MockRepository` and is
currently stuck in a simulated blocked-IO situation
will only unblock when the node's threadpool is
interrupted. This might in some edge cases (many
snapshot threads and some CI slowness) result
in the execution taking longer than 5s to release
all the shard stores and thus we fail the assertion
about unreleased shard stores in the internal test cluster.
Regardless of tests, I think we should close repositories
and release resources associated with them when closing
a node and not just when removing a repository from the CS
with running nodes as this behavior is really unexpected.
Fixes#47689
* Add SLM support to xpack usage and info APIs
This is a backport of #48096
This adds the missing xpack usage and info information into the
`/_xpack` and `/_xpack/usage` APIs. The output now looks like:
```
GET /_xpack/usage
{
...
"slm" : {
"available" : true,
"enabled" : true,
"policy_count" : 1,
"policy_stats" : {
"retention_runs" : 0,
...
}
}
```
and
```
GET /_xpack
{
...
"features" : {
...
"slm" : {
"available" : true,
"enabled" : true
},
...
}
}
```
Relates to #43663
* Fix missing license
This adds parsing an inference model as a possible
result of the analytics process. When we do parse such a model
we persist a `TrainedModelConfig` into the inference index
that contains additional metadata derived from the running job.
Previously when a numeric literal was enclosed in parentheses and then
negated, the negation was lost and the number was considered positive, e.g.:
`-(5)` was considered as `5` instead of `-5`
`- ( (1.28) )` was considered as `1.28` instead of `-1.28`
Fixes: #48009
(cherry picked from commit 4dee4bf3b34081062ba2e28ab8524a066812a180)
Audit messages are stored with millisecond timestamps. If two
messages have the same millisecond timestamp then asserting on
their order is impossible given the information available.
This PR changes the assertion on audit messages in the native
data frame analytics tests to assert that the expected audit
messages exist in any order.
Fixes#48035
which is backport merge and adds a new ingest processor, named enrich processor,
that allows document being ingested to be enriched with data from other indices.
Besides a new enrich processor, this PR adds several APIs to manage an enrich policy.
An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.
Related to #32789
max_empty_searches = -1 in a datafeed update implies
max_empty_searches will be unset on the datafeed when
the update is applied. The isNoop() method needs to
take this -1 to null equivalence into account.
Previously, the safety check for the 2nd argument of the DateAddProcessor was
restricting it to Integer which was wrong since we allow all non-rational
numbers, so it's changed to a Number check as it's done in other cases.
Enhanced some tests regarding the check for an integer (non-rational
argument).
(cherry picked from commit 0516b6eaf5eb98fa5bd087c3fece80139a6b118e)
This change adds:
- A new option, allow_lazy_open, to anomaly detection jobs
- A new option, allow_lazy_start, to data frame analytics jobs
Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)
Additionally, the ML nightly maintenance tasks now
creates audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.
A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.
This PR adds the ability to run the enrich policy execution task in the background,
returning a task id instead of waiting for the completed operation.
Prior to this change the `target_field` would always be a json array
field in the document being ingested. This to take into account that
multiple enrich documents could be inserted into the `target_field`.
However the default `max_matches` is `1`. Meaning that by default
only a single enrich document would be added to `target_field` json
array field.
This commit changes this; if `max_matches` is set to `1` then the single
document would be added as a json object to the `target_field` and
if it is configured to a higher value then the enrich documents will be
added as a json array (even if a single enrich document happens to be
enriched).
Currently, partial snapshots will eventually build up unless they are
manually deleted. Partial snapshots may be useful if there is not a more
recent successful snapshot, but should eventually be deleted if they are
no longer useful.
With this change, partial snapshots are deleted using the following
strategy: PARTIAL snapshots will be kept until the configured
expire_after period has passed, if present, and then be deleted. If
there is no configured expire_after in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Partial snapshots are not counted towards either min_count or
max_count.
Adds a new datafeed config option, max_empty_searches,
that tells a datafeed that has never found any data to stop
itself and close its associated job after a certain number
of real-time searches have returned no data.
Backport of #47922
Make clear in the docs that the role mapping APIs is the preferred
way to manage role mappings and that the role mappings that are
defined in files cannot be viewed or managed with the APIs
Usually syslog timestamps have two spaces before a single
digit day-of-month. However, in some non-syslog cases
where syslog-like timestamps are used there is only one
space. The grok pattern supports this, so the timestamp
parser should too. This change makes the
find_file_structure endpoint do this.
Also fixes another problem that the same test case
exposed in the find_file_structure endpoint, which was
that the exclude_lines_pattern for delimited files was
always created on the assumption the delimiter was a
comma. Now it is based on the actual delimiter.
This commit adds two APIs that allow to pause and resume
CCR auto-follower patterns:
// pause auto-follower
POST /_ccr/auto_follow/my_pattern/pause
// resume auto-follower
POST /_ccr/auto_follow/my_pattern/resume
The ability to pause and resume auto-follow patterns can be
useful in some situations, including the rolling upgrades of
cluster using a bi-directional cross-cluster replication scheme
(see #46665).
This commit adds a new active flag to the AutoFollowPattern
and adapts the AutoCoordinator and AutoFollower classes so
that it stops to fetch remote's cluster state when all auto-follow
patterns associate to the remote cluster are paused.
When an auto-follower is paused, remote indices that match the
pattern are just ignored: they are not added to the pattern's
followed indices uids list that is maintained in the local cluster
state. This way, when the auto-follow pattern is resumed the
indices created in the remote cluster in the meantime will be
picked up again and added as new following indices. Indices
created and then deleted in the remote cluster will be ignored
as they won't be seen at all by the auto-follower pattern at
resume time.
Backport of #47510 for 7.x
Previously, Nullability was set to UNKNOWN instead of TRUE which
resulted on QueryFolder not correctly folding to NULL if any of the args
was null.
Remove the overriding nullable() also for DatePart/DateTrunc to allow
delegation the parent class.
(cherry picked from commit 05a7108e133b5ae7bec2257db5ae2d30ad926ee2)
Joda was using ResolverStyle.STRICT when parsing. This means that date will be validated to be a correct year, year-of-month, day-of-month
However, we also want to make it works with Year-Of-Era as Joda used to, hence custom temporalquery.localdate in DateFormatters.from
Within DateFormatters we use the correct uuuu year instead of yyyy year of era
worth noting: if yyyy(without an era) is used in code, the parsing result will be a TemporalAccessor which will fail to be converted into LocalDate. We mostly use DateFormatters.from so this takes care of this. If possible the uuuu format should be used.
* [ML][Analytics] fix bug where regression deleted early does not delete state (#47885)
* [ML][Analytics] fix bug where regression deleted early does not delete state
* Fixing ml with security test failure
* fixing for older java
Changes the execution logic to create a new task using the execute request,
and attaches the new task to the policy runner to be updated. Also, a new
response is now returned from the execute api, which contains either the task
id of the execution, or the completed status of the run. The fields are mutually
exclusive to make it easier to discern what type of response it is.
This change adds documentation for the SAML APIs in Elasticsearch
and adds simple instructions on how these APIs can be used to
authenticate a user with SAML by a custom web application other
than Kibana.
Resolves: #40352
rename internal indexes of transform plugin
- rename audit index and create an alias for accessing it, BWC: add an alias for old indexes to
keep them working, kibana UI will switch to use the read alias
- rename config index and provide BWC to read from old and new ones
Batch transforms automatically stop after all data has processed, therefore tests can not reliable test the state. This change rewrites tests to remove the unreliable tests or use continuous transforms instead as they do not auto-stop.
fixes#47441
One of the tests in this suit stops a master node,
plus we're doing other node starts in this suit.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834Closes#47834
The random timestamps were landing too close to the current time,
so an unlucky rollup interval would round such that the doc wasn't
included in the search range (and thus not "rolled up") which
would then fail the test.
The fix is to make sure the timestamp of all docs is sufficiently behind
'now' that the possible rounding intervals will always include them.
Backport of #38753 to 7.x where the test was still muted.
Refactor DateTrunc and DatePart to use separate Pipe classes which
allows the removal of the BinaryDateOperation enum.
(cherry picked from commit a6075e7718dff94a90dbc0795dd924dcb7641092)
Currently if the document being ingested contains another field value
than a string then the processor fails with an error.
This commit changes the match processor to handle number values
and array values correctly.
If a json array is detected then the `terms` query is used instead
of the `term` query.
Especially in the snapshot code there's a lot
of logic chaining `ActionRunnables` in tricky
ways now and the code is getting hard to follow.
This change introduces two convinience methods that
make it clear that a wrapped listener is invoked with
certainty in some trickier spots and shortens the code a bit.
Backport of (#47721) for 7.x.
Similarly to #47582, Auto-follow patterns creates following
indices as long as the remote index matches the pattern and
the remote primary shards are all started. But since 7.2 closed
indices are also replicated, and it does not play well with CCR
auto-follow patterns as they create following indices for closed
leader indices too.
This commit changes the getLeaderIndicesToFollow() so that closed
indices are excluded from auto-follow patterns.
If a cluster sending monitoring data is unhealthy and triggers an
alert, then stops sending data the following exception [1] can occur.
This exception stops the current Watch and the behavior is actually
correct in part due to the exception. Simply fixing the exception
introduces some incorrect behavior. Now that the Watch does not
error in the this case, it will result in an incorrectly "resolved"
alert. The fix here is two parts a) fix the exception b) fix the
following incorrect behavior.
a) fixing the exception is as easy as checking the size of the
array before accessing it.
b) fixing the following incorrect behavior is a bit more intrusive
- Note - the UI depends on the success/met state for each condition
to determine an "OK" or "FIRING"
In this scenario, where an unhealthy cluster triggers an alert and
then goes silent, it should keep "FIRING" until it hears back that
the cluster is green. To keep the Watch "FIRING" either the index
action or the email action needs to fire. Since the Watch is neither
a "new" alert or a "resolved" alert, we do not want to keep sending
an email (that would be non-passive too). Without completely changing
the logic of how an alert is resolved allowing the index action to
take place would result in the alert being resolved. Since we can
not keep "FIRING" either the email or index action (since we don't
want to resolve the alert nor re-write the logic for alert resolution),
we will introduce a 3rd action. A logging action that WILL fire when
the cluster is unhealthy. Specifically will fire when there is an
unresolved alert and it can not find the cluster state.
This logging action is logged at debug, so it should be noticed much.
This logging action serves as an 'anchor' for the UI to keep the state
in an a "FIRING" status until the alert is resolved.
This presents a possible scenario where a cluster starts firing,
then goes completely silent forever, the Watch will be "FIRING"
forever. This is an edge case that already exists in some scenarios
and requires manual intervention to remove that Watch.
This changes changes to use a template-like method to populate the
version_created for the default monitoring watches. The version is
set to 7.5 since that is where this is first introduced.
Fixes#43184
The currently logic shard selecting logic selects a random shard copy
instead of selecting the local shard copy and if local copy is not
available then selecting a random shard copy. The latter is desired
behaviour for enrich.
By reusing `OperationRouting#searchShards(...)` we get the desired
behaviour and reuse the same logic that the search api is using.
This commit adds documentation for new index privilege
create_doc which only allows indexing of new documents
but no updates to existing documents via Index or Bulk APIs.
Relates: #45806
* Separate SLM stop/start/status API from ILM
This separates a start/stop/status API for SLM from being tied to ILM's
operation mode. These APIs look like:
```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```
This allows administrators to have fine-grained control over preventing
periodic snapshots and deletions while performing cluster maintenance.
Relates to #43663
* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test
Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.
With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
Adds a check when running an Enrich policy to make sure that an Enrich index
is force merged down to one segment, and if it was not fully merged, attempts
the merge again, up to a configurable number of times.
When exceptions could be returned from another node, the exception
might be wrapped in a `RemoteTransportException`. In places where
we handled specific exceptions using `instanceof` we ought to unwrap
the cause first.
This commit attempts to fix this issue after searching code in the ML
plugin.
Backport of #47676
* Convert RunTask to use testclusers, remove ClusterFormationTasks
This PR adds a new RunTask and a way for it to start a
testclusters cluster out of band and block on it to replace
the old RunTask that used ClusterFormationTasks.
With this we can now remove ClusterFormationTasks.
Previously when retrieving an SLM policy it would always return a 200
with `{}` in the body, even if the policy did not exist. This changes
that behavior to throw an error (similar to our other APIs) if a
policy doesn't exist.
This also adds a basic CRUD yml test for the behavior.
Resolves#47664
this commit introduces a geo-match enrich processor that looks up a specific
`geo_point` field in the enrich-index for all entries that have a geo_shape match field
that meets some specific relation criteria with the input field.
For example, the enrich index may contain documents with zipcodes and their respective
geo_shape. Ingesting documents with a geo_point field can be enriched with which zipcode
they associate according to which shape they are contained within.
this commit also refactors some of the MatchProcessor by moving a lot of the shared code to
AbstractEnrichProcessor.
Closes#42639.
If a thread pool rejection exception happens, an alternative code
path is chosen to write history and delete the trigger. If an exception
happens during deletion of the trigger an exception may be thrown and not
caught.
This commit catches the exception and provides a meaning error message.
fixes#47008
When deactivating a watch, there is a chance that it is fully deactivated
and reporting as not running but the history is not fully written yet.
There is not a tight coupling between the associated watcher history
index and the deactivation. This test assumes that once a watch is
deactivated that all history is fully written in a very short time period.
If the Watch is deactivated, but the history is slow to write it can result
in a failing test.
This change removes an assertion that assumes that the deactivation of a watch
ensured the all of the watch history was written. There is still a minor race
condition with respect to the remaining history assertions. However, if the
history is slow to be written, it will allow the test to still passing.
fixes#47503
Adds the following parameters to `outlier_detection`:
- `compute_feature_influence` (boolean): whether to compute or not
feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
to the feature values
Backport of #47600
Previously, we supported only the format `{fn <FUNCTION_NAME>()}`
but other DBs like MSSQL, DB2, MariaDB/MySQL alos allow whitespaces
between `{` and `fn`. Furhermore, also some applications - like PowerBI -
generate escape sequences with spaces: `select { fn name(params) } etc.`
Add support for white spaces between `{` and the escape pattern definition
like `fn`, `ts`, `d`, `guid` etc.
Closes: #47401
(cherry picked from commit 08a22d0b393f4a76c52dabc5e7b9cafcc19c30ca)
Use case:
User with `create_doc` index privilege will be allowed to only index new documents
either via Index API or Bulk API.
There are two cases that we need to think:
- **User indexing a new document without specifying an Id.**
For this ES auto generates an Id and now ES version 7.5.0 onwards defaults to `op_type` `create` we just need to authorize on the `op_type`.
- **User indexing a new document with an Id.**
This is problematic as we do not know whether a document with Id exists or not.
If the `op_type` is `create` then we can assume the user is trying to add a document, if it exists it is going to throw an error from the index engine.
Given these both cases, we can safely authorize based on the `op_type` value. If the value is `create` then the user with `create_doc` privilege is authorized to index new documents.
In the `AuthorizationService` when authorizing a bulk request, we check the implied action.
This code changes that to append the `:op_type/index` or `:op_type/create`
to indicate the implied index action.
This commit adds support to retrieve all API keys if the authenticated
user is authorized to do so.
This removes the restriction of specifying one of the
parameters (like id, name, username and/or realm name)
when the `owner` is set to `false`.
Closes#46887
Backport of (#47582)
Today when following a new leader index, we fetch the remote cluster state,
check the remote cluster license, check the user privileges, retrieve the
index shard stats before initiating a CCR restore session.
But if the leader index to follow is closed, we're executing a bunch of
operations that would inevitability fail at some point (on retrieving the
index shard stats, because this type of request forbid closed indices
when resolving indices). We could fail a Put Follow request at the first
step by checking the leader index state directly from the remote cluster
state.
This also helps the Resume Follow API to fail a bit earlier.
* Use versions specific distribution folders so we don't need to clean up (#46539)
* Retry deleting distro dir on windows
When retarting the cluster we clean up old distribution files that might
still be in use by the OS.
Windows closes resources of ded processes async, so we do a couple of
retries to get arround it.
Closes#46014
* Avoid having to delete the distro folder.
* Remove the use of ClusterFormationTasks form RestTestTask (#47022)
This PR removes a use-case of the ClusterFormationTasks and converts a
project that flew under the radar so far.
There's probably more clean-up possible here, but for now the goal is
to be able to remove that code after `RunTask` is also updated.
* Migrate some 7.x only projects
An index with an ILM policy that has a rollover action in one of the
phases was rolled over when the ILM conditions dictated regardless if
it was already rolled over (eg. manually after modifying an index
template in order to force the creation of a new index that uses the new
mappings).
This changes this behaviour and has ILM check if the index it's about to
roll has not been rolled over in the meantime.
(cherry picked from commit 37d6106feeb9f9369519117c88a9e7e30f3ac797)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit lifts the validation of the monitoring hosts setting into
the setting itself, rather than when the setting is used. This prevents
a scenario where an invalid value for the setting is accepted, but then
later fails while applying a cluster state with the invalid setting.
This adds a default for the `slm.retention_schedule` setting, setting it
to `0 30 1 * * ?` which is 1:30am every day.
Having retention unset meant that it would never be invoked and clean up
snapshots. We determined it would be better to have a default than never
to be run. When coming to a decision, we weighed the option of an
absolute time (such as 1:30am) versus a periodic invocation (like every
12 hours). In the end we decided on the absolute time because it has
better predictability and consistency than a periodic invocation, which
would rely on when the master node were elected or restarted.
Relates to #43663
Adds a pipeline that removes ids and routing from documents before indexing
them into enrich indices. Enrich documents may come from multiple indices,
and thus have id collisions on them. This pipeline ensures that documents
with colliding id fields do not clobber one another during the reindex operation
while executing an enrich policy.
* Bwc testclusters all (#46265)
Convert all bwc projects to testclusters
* Fix bwc versions config
* WIP fix rolling upgrade
* Fix bwc tests on old versions
* Fix rolling upgrade
When an ML job runs the memory required can be
broken down into:
1. Memory required to load the executable code
2. Instrumented model memory
3. Other memory used by the job's main process or
ancilliary processes that is not instrumented
Previously we added a simple fixed overhead to
account for 1 and 3. This was 100MB for anomaly
detection jobs (large because of the completely
uninstrumented categorization function and
normalize process), and 20MB for data frame
analytics jobs.
However, this was an oversimplification because
the executable code only needs to be loaded once
per machine. Also the 100MB overhead for anomaly
detection jobs was probably too high in most cases
because categorization and normalization don't use
_that_ much memory.
This PR therefore changes the calculation of memory
requirements as follows:
1. A per-node overhead of 30MB for _only_ the first
job of any type to be run on a given node - this
is to account for loading the executable code
2. The established model memory (if applicable) or
model memory limit of the job
3. A per-job overhead of 10MB for anomaly detection
jobs and 5MB for data frame analytics jobs, to
account for the uninstrumented memory usage
This change will enable more jobs to be run on the
same node. It will be particularly beneficial when
there are a large number of small jobs. It will
have less of an effect when there are a small number
of large jobs.
When API key is invalidated we do two things first it tries to trigger `ExpiredApiKeysRemover` task
and second, we do index the invalidation for the API key. The index invalidation may happen
before the `ExpiredApiKeysRemover` task is run and in that case, the API key
invalidated will also get deleted. If the `ExpiredApiKeysRemover` runs before the
API key invalidation is indexed then the API key is not deleted and will be
deleted in the future run.
This behavior was not captured in the tests related to `ExpiredApiKeysRemover`
causing intermittent failures.
This commit fixes those tests by checking if the API key invalidated is reported
back when we get API keys after invalidation and perform the checks based on that.
Closes#41747
The changes introduced in #47179 made it so that we could try to
build an SSLContext with verification mode set to None, which is
not allowed in FIPS 140 JVMs. This commit address that
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves#46508
Relates to #43663
* Fix AllocationRoutedStepTests.testConditionMetOnlyOneCopyAllocated
These tests were using randomly generated includes/excludes/requires for
routing, however, it was possible to generate mutually exclusive
allocation settings (about 1 out of 50,000 times for my runs).
This splits the test into three different tests, and removes the
randomization (it doesn't add anything to the testing here) to fix the
issue.
Resolves#47142
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can move to cleanup the tests in a follow up but this work fits
better to be done with or after we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
While it seemed like the PUT data frame analytics action did not
have to be a master node action as the config is stored in an index
rather than the cluster state, there are other subtle nuances which
make it worthwhile to convert it. In particular, it helps maintain
order of execution for put actions which are anyhow user driven and
are expected to have low volume.
This commit converts `TransportPutDataFrameAnalyticsAction` from
a handled transport action to a master node action.
Note this means that the action might fail in a mixed cluster
but as the API is still experimental and not widely used there will
be few moments more suitable to make this change than now.
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
XPackPlugin holds data in statics and can only be initialized once. This
caused tests to fail primarily when running with a low max-workers.
Replaced usages with the LocalStateCompositeXPackPlugin, which handles
this properly for testing.
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.
This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)
Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)
Relates #47103
This commit restores the model state if available in data
frame analytics jobs.
In addition, this changes the start API so that a stopped job
can be restarted. As we now store the progress in the state index
when the task is stopped, we can use it to determine what state
the job was in when it got stopped.
Note that in order to be able to distinguish between a job
that runs for the first time and another that is restarting,
we ensure reindexing progress is reported to be at least 1
for a running task.
Due to a regression bug the metadata Active Directory realm
setting is ignored (it works correctly for the LDAP realm type).
This commit redresses it.
Closes#45848
DATE_PART(<datetime unit>, <date/datetime>) is a function that allows
the user to extract the specified unit from a date/datetime field
similar to the EXTRACT (<datetime unit> FROM <date/datetime>) but
with different names and aliases for the units and it also provides more
options like `DATE_PART('tzoffset', datetimeField)`.
Implemented following the SQL server's spec: https://docs.microsoft.com/en-us/sql/t-sql/functions/datepart-transact-sql?view=sql-server-2017
with the difference that the <datetime unit> argument is either a
literal single quoted string or gets a value from a table field, whereas
in SQL server keywords are used (unquoted identifiers) and it's not
possible to use a value coming for a table column.
Closes: #46372
(cherry picked from commit ead743d3579eb753fd314d4a58fae205e465d72e)
* [ML][Inference] adding .ml-inference* index and storage (#47267)
* [ML][Inference] adding .ml-inference* index and storage
* Addressing PR comments
* Allowing null definition, adding validation tests for model config
* fixing line length
* adjusting for backport
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can move to cleanup the tests in a follow up but this work fits
better to be done with or after we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
- Build paths with PathUtils#get instead of hard-coding a string with
forward slashes.
- Do not try to match the whole message that includes paths. The
file separator is `\\` in windows but when we throw an Elasticsearch
Exception, the message is formatted with LoggerMessageFormat#format
which replaces `\\` with `\` in Path names. That means that in Windows
the Exception message will contain paths with single backslashes while
the expected string that comes from Path#toString on filename and
env.configFile will contain double backslashes. There is no point in
attempting to match the whole message string for the purpose of this test.
Resolves: #45598
Add examples of failures for both sql and csv integeration
tests and instructions on how to mute them.
(cherry picked from commit 591bba46516d770f5fc95a4c536dd7448b74dd49)
As a result of #45689 snapshot finalization started to
take significantly longer than before. This may be a
little unfortunate since it increases the likelihood
of failing to finalize after having written out all
the segment blobs.
This change parallelizes all the metadata writes that
can safely run in parallel in the finalization step to
speed the finalization step up again. Also, this will
generally speed up the snapshot process overall in case
of large number of indices.
This is also a nice to have for #46250 since we add yet
another step (deleting of old index- blobs in the shards
to the finalization.
When an integration test fails before the assertion of the results it's
missing information, like the file name and the line in the file where
the test resides.
(cherry picked from commit 683dc7213311d13c81e06829e08f3f9f80ebf73a)
With this change the test setup for ML config upgrade
tests only waits for v6.6+ ML index templates to be
installed if the old cluster is running version 6.6.0
or higher.
Previously it was always waiting, but timing out without
failing the test if the templates were not installed
within 10 seconds, effectively just adding a pointless
10 second sleep to BWC tests against versions earlier
than 6.6.0. This problem was exposed by #47112.
Fixes#47286
Previously, if a column (field, scalar, alias) appeared more than once in the
SELECT list, the value was returned only once (1st appearance) in each row.
Fixes: #41811
(cherry picked from commit 097ea36581a751605fc4f2088319d954ce35b5d1)
Currently the policy config is placed directly in the json object
of the toplevel `policies` array field. For example:
```
{
"policies": [
{
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
]
}
```
This change adds a `config` field in each policy json object:
```
{
"policies": [
{
"config": {
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
}
]
}
```
This allows us in the future to add other information about policies
in the get policy api response.
The UI will consume this API to build an overview of all policies.
The UI may in the future include additional information about a policy
and the plan is to include that in the get policy api, so that this
information can be gathered in a single api call.
An example of the information that is likely to be added is:
* Last policy execution time
* The status of a policy (executing, executed, unexecuted)
* Information about the last failure if exists
A refactoring in 6.6 meant that the ML daily
maintenance actions have not been run at all
since then. This change installs the local
master listener that schedules the ML daily
maintenance, and also defends against some
subtle race conditions that could occur in the
future if a node flipped very quickly between
master and non-master.
Fixes#47003
These settings were using get raw to fallback to whether or not SSL is
enabled. Yet, we have a formal mechanism for falling back to a
setting. This commit cuts over to that formal mechanism.
When we added support for wildcard application names, we started to build
the prefix query along with the term query but we used 'filter' clause
instead of 'should', so this would not fetch the correct application
privilege descriptor thereby failing the _has_privilege checks.
This commit changes the clause to use should and with minimum_should_match
as 1.
This commit adds the documentation to point the user that when one
creates API keys with no role descriptor specified then that API
key will have a point in time snapshot of user permissions.
Closes#46876
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
We disable MSU optimization if the local checkpoint is smaller than
max_seq_no_of_updates. Hence, we need to relax the MSU assertion in
FollowingEngine for that scenario. Suppose the leader has three
operations: index-0, delete-1, and index-2 for the same doc Id. MSU on
the leader is 1 as index-2 is an append. If the follower applies index-0
then index-2, then the assertion is violated.
Closes#47137
To be on the safe side in terms of use cases also add the alias
DATETRUNC to the DATE_TRUNC function.
Follows: #46473
(cherry picked from commit 9ac223cb1fc66486f86e218fa785a32b61e9bacc)
Drop the usage of `SimpleDateFormat` and use the `DateFormatter` instead
(cherry picked from commit 7cf509a7a11ecf6c40c44c18e8f03b8e81fcd1c2)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
In some cases, the fetch size affects the way the groups are returned
causing the last page to go beyond the limit. Add dedicated check to
prevent extra data from being returned.
Fix#47002
(cherry picked from commit f4c29646f097bbd29855300342823ef4cef61c05)
Enables support for Cartesian geometries shape type. We still need to
decide how to handle the distance function since it is currently using
the haversine distance formula and returns results in meters, which
doesn't make any sense for Cartesian geometries.
Closes#46412
Relates to #43644
When the ML native multi-node tests use _cat/indices/_all
and the request goes to a non-master node, _all is
translated to a list of concrete indices by the authz layer
on the coordinating node before the request is forwarded
to the master node. Then it is possible for the master
node to return an index_not_found_exception if one of
the concrete indices that was expanded on the
coordinating node has been deleted in the meantime.
(#47159 has been opened to track the underlying problem.)
It has been observed that the index that gets deleted when
the problem affects the ML native multi-node tests is
always the ML notifications index. The tests that fail are
only interested in the presence or absense of ML results
indices. Therefore the workaround is to only _cat indices
that match the ML results index pattern.
Fixes#45652
This change also slightly modifies the stats response,
so that is can easier consumer by monitoring and other
users. (coordinators stats are now in a list instead of
a map and has an additional field for the node id)
Relates to #32789
In the current implementation, the validation of the role query
occurs at runtime when the query is being executed.
This commit adds validation for the role query when creating a role
but not for the template query as we do not have the runtime
information required for evaluating the template query (eg. authenticated user's
information). This is similar to the scripts that we
store but do not evaluate or parse if they are valid queries or not.
For validation, the query is evaluated (if not a template), parsed to build the
QueryBuilder and verify if the query type is allowed.
Closes#34252
This change merges the `ShardSearchTransportRequest` and `ShardSearchLocalRequest`
into a single `ShardSearchRequest` that can be used to create a SearchContext.
Relates #46523
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
* ILM: parse origination date from index name (#46755)
Introduce the `index.lifecycle.parse_origination_date` setting that
indicates if the origination date should be parsed from the index name.
If set to true an index which doesn't match the expected format (namely
`indexName-{dateFormat}-optional_digits` will fail before being created.
The origination date will be parsed when initialising a lifecycle for an
index and it will be set as the `index.lifecycle.origination_date` for
that index.
A user set value for `index.lifecycle.origination_date` will always
override a possible parsable date from the index name.
(cherry picked from commit c363d27f0210733dad0c307d54fa224a92ddb569)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Drop usage of Map.of to be java 8 compliant
* Wait for snapshot completion in SLM snapshot invocation
This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.
For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.
Relates to #38461 and #43663
Using arrays of objects with embedded IDs is preferred for new APIs over
using entity IDs as JSON keys. This commit changes the SLM stats API to
use the preferred format.
* [ML][Inference] Feature pre-processing objects and functions (#46777)
To support inference on pre-trained machine learning models, some basic feature encoding will be necessary. I am using a named object serialization approach so new encodings/pre-processing steps could be added in the future.
This PR lays down the ground work for 3 basic encodings:
* HotOne
* Target Mean
* Frequency
More feature encodings or pre-processings could be added in the future:
* Handling missing columns
* Standardization
* Label encoding
* etc....
* fixing compilation for namedxcontent tests
This commit clarifies and points out that the Role management UI and
the Role management API cannot be used to manage roles that are
defined in roles.yml and that file based role management is
intended to have a small administrative scope and not handle all
possible RBAC use cases.
This change allows for the caller of the `saml/prepare` API to pass
a `relay_state` parameter that will then be part of the redirect
URL in the response as the `RelayState` query parameter.
The SAML IdP is required to reflect back the value of that relay
state when sending a SAML Response. The caller of the APIs can
then, when receiving the SAML Response, read and consume the value
as it see fit.
This change adds a check to the migration tool that warns about the deprecated
`enabled` setting for the `_field_names` field on 7.x indices and issues a
warning for templates containing this setting, which has been removed
with 8.0.
Relates to #42854, #46681
When we rewrite alias requests, after filtering down to only those that
the user is authorized to see, it can be that there are no aliases
remaining in the request. However, core Elasticsearch interprets this as
_all so the user would see more than they are authorized for. To address
this, we previously rewrote all such requests to have aliases `"*"`,
`"-*"`, which would be interpreted when aliases are resolved as
nome. Yet, this is only needed for get aliases requests and we were
applying it to all alias requests, including remove index requests. If
such a request was sent to a coordinating node that is not the master
node, the request would be rewritten to include `"*"` and `"-*"`, and
then the master would authorize the user for these. If the user had
limited permissions, the request would fail, even if they were
authorized on the index that the remove index action was over. This
commit addresses this by rewriting for get aliases and remove
aliases request types but not for the remove index.
Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>
Co-authored-by: Tim Vernum <tim@adjective.org>
This change works around JDK-8213202, which is a bug related to TLSv1.3
session resumption before JDK 11.0.3 that occurs when there are
multiple concurrent sessions being established. Nodes connecting to
each other will trigger this bug when client authentication is
disabled, which is the case for SSLClientAuthTests.
Backport of #46680
Previously, queries on the _index field were not able to specify index aliases.
This was a regression in functionality compared to the 'indices' query that was
deprecated and removed in 6.0.
Now queries on _index can specify an alias, which is resolved to the concrete
index names when we check whether an index matches. To match a remote shard
target, the pattern needs to be of the form 'cluster:index' to match the
fully-qualified index name. Index aliases can be specified in the following query
types: term, terms, prefix, and wildcard.
This commit replaces the SearchContext used in AbstractQueryTestCase with
a QueryShardContext in order to reduce the visibility of search contexts.
Relates #46523
Add initial PIVOT support for transforming a regular table into a
statistics table around an arbitrary pivoting column:
SELECT * FROM
(SELECT languages, country, salary, FROM mp)
PIVOT (AVG(salary) FOR countries IN ('NL', 'DE', 'ES', 'RO', 'US'))
In the current implementation PIVOT allows only one aggregation however
this restriction is likely to be lifted in the future.
Also not all aggregations are working, in particular MatrixStats are not yet supported.
(cherry picked from commit d91263746a222915c570d4a662ec48c1d6b4f583)
This PR adds some restrictions around testfixtures to make sure the same service ( as defiend in docker-compose.yml ) is not shared between multiple projects.
Sharing would break running with --parallel.
Projects can still share fixtures as long as each has it;s own service within.
This is still useful to share some of the setup and configuration code of the fixture.
Project now also have to specify a service name when calling useCluster to refer to a specific service.
If this is not the case all services will be claimed and the fixture can't be shared.
For this reason fixtures have to explicitly specify if they are using themselves ( fixture and tests in the same project ).
This commit changes the GET REST api so it will accept an optional comma
separated list of enrich policy ids. This change also modifies the
behavior of the GET API in that it will not error if it is passed a bad
enrich id anymore, but will instead just return an empty list.
This commit adds the ability to require an ingest pipeline on an
index. Today we can have a default pipeline, but that could be
overridden by a request pipeline parameter. This commit introduces a new
index setting index.required_pipeline that acts similarly to
index.default_pipeline, except that it can not be overridden by a
request pipeline parameter. Additionally, a default pipeline and a
request pipeline can not both be set. The required pipeline can be set
to _none to ensure that no pipeline ever runs for index requests on that
index.
When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat
as well) + coordinating nodes that do not have the ingest processor functionality, this can lead
to a NullPointerException.
The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when
the request contains auto-generated IDs. The response serialization is lenient for our
REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for
communication between nodes) assumes for this field to be non-null, which means that it can't
be serialized between nodes. Bulk requests with ingest functionality are processed on the
coordinating node if the node has the ingest capability, and only otherwise sent to a different
node. This means that, in order to reproduce this, one needs two nodes, with the coordinating
node not having the ingest functionality.
Closes#46678
The fact that this test randomly uses a relatively large number
of nodes and hence Netty worker threads created a problem with
running out of direct memory on CI.
Tests run with 512M heap (and hence 512M direct memory) by default.
On a CI worker with 16 cores, this means Netty will by default set
up 32 transport workers. If we get unlucky and a lot of them
actually do work (and thus instantiate a `CopyBytesSocketChannel`
which costs 1M per thread for the thread-local IO buffer) we
would run out of memory.
This specific failure was only seen with `NativeRealmIntegTests` so I
only added the constraint on the Netty worker count here.
We can add it to other tests (or `SecurityIntegTestCase`) if need be
but for now it doesn't seem necessary so I opted for least impact.
Closes#46803
This commit reuses the same state processor that is used for autodetect
to parse state output from data frame analytics jobs. We then index the
state document into the state index.
Backport of #46804
* [ML][Transforms] remove `force` flag from _start (#46414)
* [ML][Transforms] remove `force` flag from _start
* fixing expected error message
* adjusting bwc version
It is possible for a running analytics job that its config is removed
from the '.ml-config' index (perhaps the user deleted the entire index,
etc.). In that case the task remains without a matching config. I have
raised #46781 to discuss how to deal with this issue.
This commit focuses on `MlMemoryTracker` and changes it so that when
we get the configs for the running tasks we leniently ignore missing ones.
This at least means memory tracking will keep working for other jobs
if one or more are missing.
In addition, this commit makes the cleanup code for native analytics
tests more robust by explicitly stopping all jobs and force-stopping
if an error occurs. This helps so that a single failing test does
not cause other tests fail due to pending tasks.
Backport of #46789
Since the `resolveAllDependencies` task resolves all the congfigurations
it can find, this was not caught by our testing, but it's required to be
configuraed specifically.
We should probably cut-over to the new configurations at some point to
avoid problems like this.
Closeselastic/infra#14580
* Give kibana user reserved role privileges on .apm-* to create APM agent configuration index.
* fixed test to include checking all .apm-* permissions
* changed pattern from ".apm-*" to the more specific ".apm-agent-configuration"
When encountering only indices with empty mapping, the IndexResolver
throws an exception as it expects to find at least one entry.
This commit fixes this case so that an empty mapping is returned.
Fix#46757
(cherry picked from commit 5f4f5807acb93b5fab36718c092c328977a396b6)
* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
* Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes#41581
Handle queries with implicit GROUP BY where the aggregation is not in
the projection/SELECT but inside the filter/HAVING such as:
SELECT 1 FROM x HAVING COUNT(*) > 0
The engine now properly identifies the case and handles it accordingly.
Fix#37051
(cherry picked from commit fa53ca05d8219c27079b50b4a5b7aeb220c7cde2)
Improve the defensive behavior of ResultSet when dealing with incorrect
API usage. In particular handle the case of dealing with no row
available (either because the cursor is before the first entry or after
the last).
Fix#46750
(cherry picked from commit 58fa38e4606625962e879265d35eacb0960c6cdb)
* [ILM] Add date setting to calculate index age
Add the `index.lifecycle.origination_date` to allow users to configure a
custom date that'll be used to calculate the index age for the phase
transmissions (as opposed to the default index creation date).
This could be useful for users to create an index with an "older"
origination date when indexing old data.
Relates to #42449.
* [ILM] Don't override creation date on policy init
The initial approach we took was to override the lifecycle creation date
if the `index.lifecycle.origination_date` setting was set. This had the
disadvantage of the user not being able to update the `origination_date`
anymore once set.
This commit changes the way we makes use of the
`index.lifecycle.origination_date` setting by checking its value when
we calculate the index age (ie. at "read time") and, in case it's not
set, default to the index creation date.
* Make origination date setting index scope dynamic
* Document orignation date setting in ilm settings
(cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
When the stop API is called while the task is running there is
a chance the task gets marked completed twice. This may cause
undesired side effects, like indexing the progress document a second
time after the stop API has returned (the cause for #46705).
This commit adds a check that the task has not been completed before
proceeding to mark it so. In addition, when we update the task's state
we could get some warnings that the task was missing if the stop API
has been called in the meantime. We now check the errors are
`ResourceNotFoundException` and ignore them if so.
Closes#46705
Backports #46721
We renew the CCR retention lease at a fixed interval, therefore it's
possible to have more than one in-flight renewal requests at the same
time. If requests arrive out of order, then the assertion is violated.
Closes#46416Closes#46013
This is fixing a bug where if an analytics job is started before any
anomaly detection job is opened, we create an index after the state
write alias.
Instead, we should create the state index and alias before starting
an analytics job and this commit makes sure this is the case.
Backport of #46602
This makes the AllocatedPersistentTask#init() method protected so that
implementing classes can perform their initialization logic there,
instead of the constructor. Rollup's task is adjusted to use this
init method.
It also slightly refactors the methods to se a static logger in the
AllocatedTask instead of passing it in via an argument. This is
simpler, logged messages come from the task instead of the
service, and is easier for tests
DATE_TRUNC(<truncate field>, <date/datetime>) is a function that allows
the user to truncate a timestamp to the specified field by zeroing out
the rest of the fields. The function is implemented according to the
spec from PostgreSQL: https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNCCloses: #46319
(cherry picked from commit b37e96712db1aace09f17b574eb02ff6b942a297)
The sql project uses a common set of security tests, which are run in
subprojects. Currently these are shared through a shared directory, but
this is not setup correctly to ensure it is built before tests run. This
commit changes the test classes to be an artifact of the sql/qa/security
project and makes the test runner use the built artifact (a directory of
classes) for tests.
closes#45866
Since the `IndicesSegmentsRequest` scatters to all shards for the index,
it's possible that some of the shards may fail. This adds failure
handling and logging (since this is a best-effort step in the first
place) for this case.
Many scalar functions try to find out the common type between their
arguments in order to set it as their return time, e.g.:
for `float + double` the common type which is set as the return type
of the + operation is `double`.
Previously, for data types TEXT and KEYWORD (string data types) there
was no common data type found and null was returned causing NPEs when
the function was trying to resolve the return data type.
Fixes: #46551
(cherry picked from commit 291017d69dfc810707c3c7c692f5a50af431b790)
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.
Resolves#46508
After starting the analytics job and checking its state
the state can be any of "started", "reindexing" or
"analyzing" depending on how quickly the work is done.
When upgrading data nodes to a newer version before
master nodes there was a risk that a transform running
on an upgraded data node would index a document into
the new transforms internal index before its index
template was created. This would cause the index to
be created with entirely dynamic mappings.
This change introduces a check before indexing any
internal transforms document to ensure that the required
index template exists and create it if it doesn't.
Backport of #46553
This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them.
The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`.
Relates #46523
rename data frame transform plugin to transform:
- rename plugin data-frame to transform
- change all package names from o.e.*.dataframe.* to o.e.*.transform.*
- necessary changes to fix loading/testing
* More Efficient Ordering of Shard Upload Execution (#42791)
* Change the upload order of of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657
* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)
The enrich api returns enrich coordinator stats and
information about currently executing enrich policies.
The coordinator stats include per ingest node:
* The current number of search requests in the queue.
* The total number of outstanding remote requests that
have been executed since node startup. Each remote
request is likely to include multiple search requests.
This depends on how much search requests are in the
queue at the time when the remote request is performed.
* The number of current outstanding remote requests.
* The total number of search requests that `enrich`
processors have executed since node startup.
The current execution policies stats include:
* The name of policy that is executing
* A full blow task info object that is executing the policy.
Relates to #32789
This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext.
It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the
query shard context.
Relates #46523
Investigating the test failure reported in #45518 it appears that
the datafeed task was not found during a tast state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.
Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.
This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.
Closes#45518
Backport of #46495
We depend on file realms being unique in a number of places. Pre
7.0 this was enforced by the fact that the multiple realm types
with different name would mean identical configuration keys and
cause configuration parsing errors. Since we intoduced affix
settings for realms this is not the case any more as the realm type
is part of the configuration key.
This change adds a check when building realms which will explicitly
fail if multiple realms are defined with the same name.
Backport of #46253
This changes API-Key authentication to always fallback to the realm
chain if the API key is not valid. The previous behaviour was
inconsistent and would terminate on some failures, but continue to the
realm chain for others.
Backport of: #46538
This class has been using a logger configured for a different class for
quite a while. While the circumstance in which it logs is rare, it
should still use the correct logger.
* Add retention to Snapshot Lifecycle Management (#46407)
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.
An example policy would look like:
```
PUT /_slm/policy/snapshot-every-day
{
"schedule": "0 30 2 * * ?",
"name": "<production-snap-{now/d}>",
"repository": "my-s3-repository",
"config": {
"indices": ["foo-*", "important"]
},
// Newly configured retention options
"retention": {
// Snapshots should be deleted after 14 days
"expire_after": "14d",
// Keep a maximum of thirty snapshots
"max_count": 30,
// Keep a minimum of the four most recent snapshots
"min_count": 4
}
}
```
SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
Included in this work is a new SLM stats API endpoint available through
``` json
GET /_slm/stats
```
That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.
* Add base framework for snapshot retention (#43605)
* Add base framework for snapshot retention
This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.
Relates to #38461
* Remove extraneous 'public'
* Use a local var instead of reading class var repeatedly
* Add SnapshotRetentionConfiguration for retention configuration (#43777)
* Add SnapshotRetentionConfiguration for retention configuration
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.
Relates to #43663
* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)
* Implement SnapshotRetentionTask's snapshot filtering and deletion
This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.
Relates to #43663
* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
* Allow empty/missing SLM retention configuration (#45018)
Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.
Relates to #43663
* Add min_count and max_count as SLM retention predicates (#44926)
This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.
These options are optional and one, two, or all three can be specified
in an SLM policy.
Relates to #43663
* Time-bound deletion of snapshots in retention delete function (#45065)
* Time-bound deletion of snapshots in retention delete function
With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.
Relates to #43663
* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
* Add SLM metrics gathering and endpoint (#45362)
* Add SLM metrics gathering and endpoint
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:
```
GET /_slm/stats
{
"retention_runs" : 13,
"retention_failed" : 0,
"retention_timed_out" : 0,
"retention_deletion_time" : "1.4s",
"retention_deletion_time_millis" : 1404,
"policy_metrics" : {
"daily-snapshots2" : {
"snapshots_taken" : 7,
"snapshots_failed" : 0,
"snapshots_deleted" : 6,
"snapshot_deletion_failures" : 0
},
"daily-snapshots" : {
"snapshots_taken" : 12,
"snapshots_failed" : 0,
"snapshots_deleted" : 12,
"snapshot_deletion_failures" : 6
}
},
"total_snapshots_taken" : 19,
"total_snapshots_failed" : 0,
"total_snapshots_deleted" : 18,
"total_snapshot_deletion_failures" : 6
}
```
This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.
Relates to #43663
* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
* Record history of SLM retention actions (#45513)
This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.
* Retry SLM retention after currently running snapshot completes (#45802)
* Retry SLM retention after currently running snapshot completes
This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.
Relates to #43663
* Increase timeout waiting for snapshot completion
* Apply patch
From 2374316f0d.patch
* Rename test variables
* [TEST] Be less strict for stats checking
* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)
This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.
Relates to #43663
* Check all actions preventing snapshot delete during retention (#45992)
* Check all actions preventing snapshot delete during retention run
Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:
- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running
This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.
Relates to #43663
* Add unit test for okayToDeleteSnapshots
* Fix bug where SLM retention task would be scheduled on every node
* Enhance test logging
* Ignore if snapshot is already deleted
* Missing import
* Fix SnapshotRetentionServiceTests
* Expose SLM policy stats in get SLM policy API (#45989)
This also adds support for the SLM stats endpoint to the high level rest client.
Retrieving a policy now looks like:
```json
{
"daily-snapshots" : {
"version": 1,
"modified_date": "2019-04-23T01:30:00.000Z",
"modified_date_millis": 1556048137314,
"policy" : {
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["data-*", "important"],
"ignore_unavailable": false,
"include_global_state": false
},
"retention": {}
},
"stats": {
"snapshots_taken": 0,
"snapshots_failed": 0,
"snapshots_deleted": 0,
"snapshot_deletion_failures": 0
},
"next_execution": "2019-04-24T01:30:00.000Z",
"next_execution_millis": 1556048160000
}
}
```
Relates to #43663
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.
Relates to #43663Resolves#46205
* Add error logging when exceptions are thrown
* Update serialization versions
* Fix type inference
* Use non-Cancellable HLRC return value
* Fix Client mocking in test
* Fix SLMSnapshotBlockingIntegTests for 7.x branch
* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval
* Add serialization guards for SnapshotLifecyclePolicy
The previous transport action was a read action, which under the right
set of circumstances can execute on a coordinating node. This commit
ensures that cannot happen.
This commit changes the SSLContext for the email server we use in
the tests so that it loads its key material from an in memory
keystore (that is in turn built from a pair of PEM encoded private key
and certificate) instead of a PKCS#12 one. This is done so that when
we run our tests in FIPS 140-2 JVMs, the keystore is of a type that the
Security Provider actually supports.
This also mutes testCanSendMessageToSmtpServerByDisablingVerification
as we can't run tests with verification set to `none` in FIPS 140
JVMs.
* Fix issue with painless scripting not being correctly generated when
datetime functions are used for GROUPing of an INTERVAL operation.
(cherry picked from commit cb92828e8ec9d9d241bd6189e5835fd99f8b9a44)
ML users who upgrade from versions prior to 7.4 to 7.4 or later
will have ML results indices that do not have mappings for the
total_search_time_ms field. Therefore, when searching these
indices we must tolerate this field not having a mapping.
Fixes#46437
This refactors `DataFrameAnalyticsTask` into its own class.
The task has quite a lot of functionality now and I believe it would
make code more readable to have it live as its own class rather than
an inner class of the start action class.
Backport of #46402
* [ML][Transforms] fixing rolling upgrade continuous transform test (#45823)
* [ML][Transforms] fixing rolling upgrade continuous transform test
* adjusting wait assert logic
* adjusting wait conditions
* [ML][Transforms] allow executor to call start on started task (#46347)
* making sure we only upgrade from 7.4.0 in test
* [ML] waiting for ml indices before waiting task assignment testFullClusterRestart
* waiting for a stable cluster after fullrestart
* removing unused imports
This commit initializes DocumentSubsetBitsetCache even if DLS
is disabled. Previously it would throw null pointer when querying
usage stats if we explicitly disabled DLS as there would be no instance of DocumentSubsetBitsetCache to query. It is okay to initialize
DocumentSubsetBitsetCache which will be empty as the license enforcement
would prevent usage of DLS feature and it will not fail when accessing usage stats.
Closes#45147
This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors and while computing the vector function. (#46103)
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
Besides a rename, this changes allows to processor to attach multiple
enrich docs to the document being ingested.
Also in order to control the maximum number of enrich docs to be
included in the document being ingested, the `max_matches` setting
is added to the enrich processor.
Relates #32789
As per #45852 comment we no longer need to log stack-traces in
SecurityTransportExceptionHandler and SecurityHttpExceptionHandler even
if trace logging is enabled.
(cherry picked from commit c99224a32d26db985053b7b36e2049036e438f97)
The test seems to have been failing due to a race condition between
stopping the task and refreshing the destination index. In particular,
we were going forward with refreshing the destination index even
though the task stopped in the meantime. This was fixed in
request.
Closes#43960
Backport of #46271
Previously, when the condition (1st argument) of the IIF function could
be evaluated (folded) to false, the `IfConditional` was eliminated which
caused `IndexOutOfBoundsException` to be thrown when `info()` and
`resolveType()` methods where called.
Fixes: #46268
(cherry picked from commit 9a885a3ac47bc8f52c07770d1d8d670ce0af1e59)
* [ML][Transforms] fixing stop on changes check bug
* Adding new method finishAndCheckState to cover race conditions in early terminations
* changing stopping conditions in `onStart`
* allow indexer to finish when exiting early
Fixes a problem where operations_behind would be one less than
expected per shard in a new index matched by the data frame
transform source pattern.
For example, if a data frame transform had a source of foo*
and a new index foo-new was created with 2 shards and 7 documents
indexed in it then operations_behind would be 5 prior to this
change.
The problem was that an empty index has a global checkpoint
number of -1 and the sequence number of the first document that
is indexed into an index is 0, not 1. This doesn't matter for
indices included in both the last and next checkpoints, as the
off-by-one errors cancelled, but for a new index it affected
the observed result.
Fix test issue to stabilise scoring through use of DFS search mode.
Randomised index-then-delete docs introduced by the test framework likely caused an imbalance in IDF scores across shards. Also made number of shards used in test a random number for added test coverage.
Closes#46174
Previously, if the DataType of all the WHEN conditions of a CASE
statement is NULL, then it was set to NULL even if the ELSE clause
has a non-NULL data type, e.g.:
```
CASE WHEN a = 1 THEN NULL
WHEN a = 5 THEN NULL
ELSE 'foo'
```
Fixes: #46032
(cherry picked from commit 8c1012efbbd3a300afd0dfb9b18250f15ea753f9)
Though we allow CCS within datafeeds, users could prevent nodes from accessing remote clusters. This can cause mysterious errors and difficult to troubleshoot.
This commit adds a check to verify that `cluster.remote.connect` is enabled on the current node when a datafeed is configured with a remote index pattern.
* [ML] Regression dependent variable must be numeric
This adds a validation that the dependent variable of a regression
analysis must be numeric.
* Address review comments and fix some problems
In addition to addressing the review comments, this
commit fixes a few issues I found during testing.
In particular:
- if there were mappings for required fields but they were
not included we were not reporting the error
- if explicitly included fields had unsupported types we were
not reporting the error
Unfortunately, I couldn't get those fixed without refactoring
the code in `ExtractedFieldsDetector`.
Today we might carry on a big merge uncommitted and therefore
occupy a significant amount of diskspace for quite a long time
if for instance indexing load goes down and we are not quickly
reaching the translog size threshold. This change will cause a
flush if we hit a significant merge (512MB by default) which
frees diskspace sooner.
If a CCR lease is disappeared while we are renewing it, then we will
issue asyncAddRetentionLease to add that lease. And if
asyncAddRetentionLease takes longer than retentionLeaseRenewInterval,
then we can issue another asyncAddRetentionLease request. One of
asyncAddRetentionLease requests will fail with
RetentionLeaseAlreadyExistsException, hence trip the assertion.
Closes#45192
This commit adds the `rollover_alias` setting required for ILM to work
correctly to the SLM history index template and adds assertions to the
SLM integration tests to ensure that it works correctly.
Currently, when using script_score functions like cosineSimilarity, the query
vector is treated as an array of doubles. Since the stored document vectors use
floats, it seems like the least surprising behavior for the query vectors to
also be float arrays.
In addition to improving consistency, this change may help with some
optimizations we have been considering around vector dot product.
This commit adds support for `boolean` fields in data frame
analytics (and currently both outlier detection and regression).
The analytics process expects `boolean` fields to be encoded as
integers with 0 or 1 value.
Prior to this commit the foreach action execution had a hard coded
limit to 100 iterations. This commit allows the max number of
iterations to be a configuration ('max_iterations') on the foreach
action. The default remains 100.
Adds a parameter `training_percent` to regression. The default
value is `100`. When the parameter is set to a value less than `100`,
from the rows that can be used for training (ie. those that have a
value for the dependent variable) we randomly choose whether to actually
use for training. This enables splitting the data into a training set and
the rest, usually called testing, validation or holdout set, which allows
for validating the model on data that have not been used for training.
Technically, the analytics process considers as training the data that
have a value for the dependent variable. Thus, when we decide a training
row is not going to be used for training, we simply clear the row's
dependent variable.
The existing privilege model for API keys with privileges like
`manage_api_key`, `manage_security` etc. are too permissive and
we would want finer-grained control over the cluster privileges
for API keys. Previously APIs created would also need these
privileges to get its own information.
This commit adds support for `manage_own_api_key` cluster privilege
which only allows api key cluster actions on API keys owned by the
currently authenticated user. Also adds support for retrieval of
the API key self-information when authenticating via API key
without the need for the additional API key privileges.
To support this privilege, we are introducing additional
authentication context along with the request context such that
it can be used to authorize cluster actions based on the current
user authentication.
The API key get and invalidate APIs introduce an `owner` flag
that can be set to true if the API key request (Get or Invalidate)
is for the API keys owned by the currently authenticated user only.
In that case, `realm` and `username` cannot be set as they are
assumed to be the currently authenticated ones.
The changes cover HLRC changes, documentation for the API changes.
Closes#40031
This commit introduces PKI realm delegation. This feature
supports the PKI authentication feature in Kibana.
In essence, this creates a new API endpoint which Kibana must
call to authenticate clients that use certificates in their TLS
connection to Kibana. The API call passes to Elasticsearch the client's
certificate chain. The response contains an access token to be further
used to authenticate as the client. The client's certificates are validated
by the PKI realms that have been explicitly configured to permit
certificates from the proxy (Kibana). The user calling the delegation
API must have the delegate_pki privilege.
Closes#34396
This check was introduced in #41392 but had the unwanted side-effect
that the keystore settings in such blocks would note be added in the
node's keystore. Given that we have a mid-term plan for FIPS testing
that would made such checks unnecessary, and that the conditional
in these two cases is not really that important, this change removes
this conditional logic so that full-cluster-restart and rolling
upgrade tests will run with PEM files for key/certificate material
no matter if we're in a FIPS JVM or not.
Resolves: #45475
This adds a pipeline aggregation that calculates the cumulative
cardinality of a field. It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.
This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).
This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.
The native process requires that there be a non-zero number of rows to analyze. If the flag --rows 0 is passed to the executable, it throws and does not start.
When building the configuration for the process we should not start the native process if there are no rows.
Adding some logging to indicate what is occurring.
* Watcher add email warning if CSV attachment contains formulas (#44460)
This commit introduces a Warning message to the emails generated by
Watcher's reporting action. This change complements Kibana's CSV
formula notifications (see elastic/kibana#37930).
This is implemented by reading a header (kbn-csv-contains-formulas)
provided by Kibana to notify to attach the Warning to the email.
The wording of the warning is borrowed from Kibana's UI and may
be overridden by a dynamic setting
xpack.notification.reporting.warning.kbn-csv-contains-formulas.text.
This warning is enabled by default, but may be disabled via a
dynamic setting xpack.notification.reporting.warning.enabled.
As of #43939 Watcher tests now correctly block until all Watch executions
kicked off by that test are finished. Prior we allowed tests to finish with
outstanding watch executions. It was known that this would increase the
time needed to finish a test. However, running the tests on CI can be slow
and on at least 1 occasion it took 60s to actually finish.
This PR simply increases the max allowable timeout for Watcher tests
to clean up after themselves.
Today if non-TLS record is received on TLS port generic exception will
be logged with the stack-trace.
SSLExceptionHelper.isNotSslRecordException method does not work because
it's assuming that NonSslRecordException would be top-level.
This commit addresses the issue and the log would be more concise.
(cherry picked from commit 6b83527bf0c23d4d5b97fab7f290c43432945d4f)
This commit allows the Transport Actions for the SSO realms to
indicate the realm that should be used to authenticate the
constructed AuthenticationToken. This is useful in the case that
many authentication realms of the same type have been configured
and where the caller of the API(Kibana or a custom web app) already
know which realm should be used so there is no need to iterate all
the realms of the same type.
The realm parameter is added in the relevant REST APIs as optional
so as not to introduce any breaking change.
When a policy is deleted, the enrich indices that are backing the policy
alias should also be deleted. This commit does that work and cleans up
the transport action a bit so that the lock release is easier to see, as
well as to ensure that any action carried out, regardless of exception,
unlocks the policy.
Previously, the stats API reports a progress percentage
for DF analytics tasks that are running and are in the
`reindexing` or `analyzing` state.
This means that when the task is `stopped` there is no progress
reported. Thus, one cannot distinguish between a task that never
run to one that completed.
In addition, there are blind spots in the progress reporting.
In particular, we do not account for when data is loaded into the
process. We also do not account for when results are written.
This commit addresses the above issues. It changes progress
to being a list of objects, each one describing the phase
and its progress as a percentage. We currently have 4 phases:
reindexing, loading_data, analyzing, writing_results.
When the task stops, progress is persisted as a document in the
state index. The stats API now reports progress from in-memory
if the task is running, or returns the persisted document
(if there is one).
This commit changes the enrich processor factory to read the required
configuration from the current enrich index (from meta mapping field)
in order to create the processor.
Before this change the required config was read from the enrich policy
in the cluster state. Enrich policies are going to be stored in an
index (instead of the cluster state). In a processor factory there isn't
a way to load something from an index, so with this change we read
the required config / info from the enrich index (which is derived
from the enrich policy), which then allows us to move enrich policies
to an index.
With this change it is required to execute a policy before creating a
pipeline. Otherwise there is no enrich index and then there is no way
to validate that a policy exist or retrieve its type and match field.
Relates to #32789
A policy type controls how the enrich index is created and
the query executed against the match field. Currently there
is a single policy type (`exact_match`). In the near future
more policy types will be added and different policy may have
different configuration options.
For this reason type should be a json object instead of a string field:
```
{
"exact_match": {
...
}
}
```
instead of:
```
{
"type": "exact_match",
...
}
```
This will make streaming parsing of enrich policies easier as in the
new format, the parsing code can know ahead what configuration fields
to expect. In the latter format that is not possible if the type field
appears not as the first field.
Relates to #32789
The security indices were being created without specifying the
refresh interval, which means it would inherit a value from any
templates that exists.
However, certain security functionality depends on being able to
wait_for refresh, and causes errors (e.g. in Kibana) if that time
exceeds 30s.
This commit changes the security indices configuration to always be
created with a 1s refresh interval. This prevents any templates from
inadvertantly interfering with the proper functioning of security.
It is possible for an administrator to explicitly change the refresh
interval after the indices have been created.
Backport of: #45434
This change adds a new SSL context
xpack.notification.email.ssl.*
that supports the standard SSL configuration settings (truststore,
verification_mode, etc). This SSL context is used when configuring
outbound SMTP properties for watcher email notifications.
Backport of: #45272
Since #45136, we use soft-deletes instead of translog in peer recovery.
There's no need to retain extra translog to increase a chance of
operation-based recoveries. This commit ignores the translog retention
policy if soft-deletes is enabled so we can discard translog more
quickly.
Backport of #45473
Relates #45136
* [ML] Adding data frame analytics stats to _usage API (#45820)
* [ML] Adding data frame analytics stats to _usage API
* making the size of analytics stats 10k
* adjusting backport
Adds index versioning for the internal data frame transform index. Allows for new indices to be created and referenced, `GET` requests now query over the index pattern and takes the latest doc (based on INDEX name).
When Watcher is stopped and there are still outstanding watches running
Watcher will report it self as stopped. In normal cases, this is not problematic.
However, for integration tests Watcher is started and stopped between
each test to help ensure a clean slate for each test. The tests are blocking
only on the stopped state and make an implicit assumption that all watches are
finished if the Watcher is stopped. This is an incorrect assumption since
Stopped really means, "I will not accept any more watches". This can lead to
un-predictable behavior in the tests such as message : "Watch is already queued
in thread pool" and state: "not_executed_already_queued".
This can also change the .watcher-history if watches linger between tests.
This commit changes the semantics of a manual stopping watcher to now mean:
"I will not accept any more watches AND all running watches are complete".
There is now an intermediary step "Stopping" and callback to allow transition
to a "Stopped" state when all Watches have completed.
Additionally since this impacts how long the tests will block waiting for a
"Stopped" state, the timeout has been increased.
Related: #42409
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
Enrich processor configuration changes:
* Renamed `enrich_key` option to `field` option.
* Replaced `set_from` and `targets` options with `target_field`.
The `target_field` option behaves different to how `set_from` and
`targets` worked. The `target_field` is the field that will contain
the looked up document.
Relates to #32789
After the PR #45676 onFailure is now called before the indexer state has transitioned out of indexing.
To fix these tests, I added a new check to make sure that we don't mark it as failed until AFTER doSaveState is called with a STARTED indexer.
Following our own guidelines, SLM should use rollover instead of purely
time-based indices to keep shard counts low. This commit implements lazy
index creation for SLM's history indices, indexing via an alias, and
rollover in the built-in ILM policy.
Most of our CLI tools use the Terminal class, which previously did not provide methods for writing to standard output. When all output goes to standard out, there are two basic problems. First, errors and warnings are "swallowed" in pipelines, making it hard for a user to know when something's gone wrong. Second, errors and warnings are intermingled with legitimate output, making it difficult to pass the results of interactive scripts to other tools.
This commit adds a second set of print commands to Terminal for printing to standard error, with errorPrint corresponding to print and errorPrintln corresponding to println. This leaves it to developers to decide which output should go where. It also adjusts existing commands to send errors and warnings to stderr.
Usage is printed to standard output when it's correctly requested (e.g., bin/elasticsearch-keystore --help) but goes to standard error when a command is invoked incorrectly (e.g. bin/elasticsearch-keystore list-with-a-typo | sort).
Regression analysis support missing fields. Even more, it is expected
that the dependent variable has missing fields to the part of the
data frame that is not for training.
This commit allows to declare that an analysis supports missing values.
For such analysis, rows with missing values are not skipped. Instead,
they are written as normal with empty strings used for the missing values.
This also contains a fix to the integration test.
Closes#45425
* [ML] better handle empty results when evaluating regression
* adding new failure test to ml_security black list
* fixing equality check for regression results
* Executing SLM policies on the snapshot thread will block until a snapshot finishes if the pool is completely busy executing that snapshot
* Fixes#45594
The get and list APIs are a single API in this commit. Whether
requesting one named policy or all policies, a list of policies is
returened. The list API code has all been removed and the GET api is
what remains, which contains much of the list response code.
The setting index.soft_deletes.retention.operations is no longer needed
nor recommended in CCR. We, therefore, should hint users about the
retention leases period setting instead when operations are no longer
available for replicating.
We cannot know how long the analysis will take to complete thus we should not have
a timeout. Note that if the process crashes, the result processor will pick the
exception due to the stream closing.
Closes#45723
* [ML][Data frame] fixing failure state transitions and race condition (#45627)
There is a small window for a race condition while we are flagging a task as failed.
Here are the steps where the race condition occurs:
1. A failure occurs
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
a. `finishAndSetState()` which sets the IndexerState to STARTED
b. `doSaveState(...)` which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.
The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions. This is so if state is indeed marked failed, the "second trigger" stops ASAP.
Additionally, I move the task state checks INTO the `start` and `stop` methods, which will now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should gives us some guarantees that one will not switch states out from underneath another.
I also flag the task as `failed` BEFORE we successfully write it to cluster state, this is to allow us to make the task fail more quickly. But, this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn’t always have the latest cluster state.
closes#45609
Relates to #45562
* [ML][Data Frame] moves failure state transition for MT safety (#45676)
* [ML][Data Frame] moves failure state transition for MT safety
* removing unused imports
* Search enhancement: pinned queries (#44345)
Search enhancement: - new query type allows selected documents to be promoted above any "organic” search results.
This is the first feature in a new module `search-business-rules` which will house licensed (non OSS) logic for rewriting queries according to business rules.
The PinnedQueryBuilder class offers a new `pinned` query in the DSL that takes an array of promoted IDs and an “organic” query and ensures the documents with the promoted IDs rank higher than the organic matches.
Closes#44074
Encapsulate the serialization/deserialization of SQL client classes.
Make configuration specific parameters (such as ZoneId) generic just
like the version and remove the need for consumer classes to manage them
individually.
This is not only consistent but also provides significant savings in the
cursor.
Fix#40216
(cherry picked from commit 5c844798045d7baa0d932289d2e3d1607ba6a9a4)
Improve the initialization and state passing of TextFormatter in CLI
and TEXT mode by leveraging the Page listener hook. Additionally
simplify the code inside RestSqlQueryAction.
(cherry picked from commit a56db2fa119cf9e8748723e19f1fc9f6a8afe5fc)
Improve encapsulation of pagination of rowsets by breaking the cycle
between cursor and associated rowset implementation, all logic now
residing inside each cursor implementation.
(cherry picked from commit be8fe0a0ce562fe732fae12a0b236b5731e4638c)
Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM
test cases, and optionally wait for all snapshots to be deleted.
Waiting for all snapshots to be deleted, rather than failing if any are
in progress, is necessary for tests which use SLM policies because SLM
policies may be in the process of executing when the test ends.
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes are moved to the
org.elasticsearch.geomtery package.
Backport of #45332Closes#45048
* Update the REST API specification
This patch updates the REST API spefication in JSON files to better encode deprecated entities,
to improve specification of URL paths, and to open up the schema for future extensions.
Notably, it changes the `paths` from a list of strings to a list of objects, where each
particular object encodes all the information for this particular path: the `parts` and the `methods`.
Among the benefits of this approach is eg. encoding the difference between using the `PUT` and `POST`
methods in the Index API, to either use a specific document ID, or let Elasticsearch generate one.
Also `documentation` becomes an object that supports an `url` and also a `description` which is a
new field.
* Adapt YAML runner to new REST API specification format
The logic for choosing the path to use when running tests has been
simplified, as a consequence of the path parts being listed under each
path in the spec. The special case for create and index has been removed.
Also the parsing code has been hardened so that errors are thrown earlier
when the structure of the spec differs from what expected, and their
error messages should be more helpful.
This commit adds a lock to the delete policy, in the same way that the
locking is done for policy execution. It also creates a test to exercise
the delete transport action, and modifies an existing test to provide a
common set of functions for saving and deleting policies.
The delete policy had a subtle bug in that it would still delete the
policy if pipelines were accessing it, after giving the client back an
error. This commit fixes that and ensures it does not happen by adding
verification in the test.
* Introduce Spatial Plugin (#44389)
Introduce a skeleton Spatial plugin that holds new licensed features coming to
Geo/Spatial land!
* [GEO] Refactor DeprecatedParameters in AbstractGeometryFieldMapper (#44923)
Refactor DeprecatedParameters specific to legacy geo_shape out of
AbstractGeometryFieldMapper.TypeParser#parse.
* [SPATIAL] New ShapeFieldMapper for indexing cartesian geometries (#44980)
Add a new ShapeFieldMapper to the xpack spatial module for
indexing arbitrary cartesian geometries using a new field type called shape.
The indexing approach leverages lucene's new XYShape field type which is
backed by BKD in the same manner as LatLonShape but without the WGS84
latitude longitude restrictions. The new field mapper builds on and
extends the refactoring effort in AbstractGeometryFieldMapper and accepts
shapes in either GeoJSON or WKT format (both of which support non geospatial
geometries).
Tests are provided in the ShapeFieldMapperTest class in the same manner
as GeoShapeFieldMapperTests and LegacyGeoShapeFieldMapperTests.
Documentation for how to use the new field type and what parameters are
accepted is included. The QueryBuilder for searching indexed shapes is
provided in a separate commit.
* [SPATIAL] New ShapeQueryBuilder for querying indexed cartesian geometry (#45108)
Add a new ShapeQueryBuilder to the xpack spatial module for
querying arbitrary Cartesian geometries indexed using the new shape field
type.
The query builder extends AbstractGeometryQueryBuilder and leverages the
ShapeQueryProcessor added in the previous field mapper commit.
Tests are provided in ShapeQueryTests in the same manner as
GeoShapeQueryTests and docs are updated to explain how the query works.
If a pipeline that refrences the policy exists, we should not allow the
policy to be deleted. The user will need to remove the processor from
the pipeline before deleting the policy. This commit adds a check to
ensure that the policy cannot be deleted if it is referenced by any
pipeline in the system.
* Add format parameter to the range queries built for CURRENT_* functions
used in comparison conditions
* Use range queries for date fields equality/non-equality as well.
(cherry picked from commit c1e81e90f937ee5a002524d632bfce74d76962f9)
The policy name is used to generate the enrich index name.
For this reason, a policy name should be validated in the same way
as index names.
Relates to #32789
* Reenable Integ Tests in native-multi-node-tests
* The tests broken here were likely fixed by #45463 => let's reenable them and see if things run fine again
* Relates #45405, #45455
The http client could end up creating URLs, that did not resemble the
original one, when encoding. This fixes a couple of corner cases, where
too much or too few slashes were added to an URI.
Closes#44970
The current implementations make it difficult for
adding new privileges (example: a cluster privilege which is
more than cluster action-based and not exposed to the security
administrator). On the high level, we would like our cluster privilege
either:
- a named cluster privilege
This corresponds to `cluster` field from the role descriptor
- or a configurable cluster privilege
This corresponds to the `global` field from the role-descriptor and
allows a security administrator to configure them.
Some of the responsibilities like the merging of action based cluster privileges
are now pushed at cluster permission level. How to implement the predicate
(using Automaton) is being now enforced by cluster permission.
`ClusterPermission` helps in enforcing the cluster level access either by
performing checks against cluster action and optionally against a request.
It is a collection of one or more permission checks where if any of the checks
allow access then the permission allows access to a cluster action.
Implementations of cluster privilege must be able to provide information
regarding the predicates to the cluster permission so that can be enforced.
This is enforced by making implementations of cluster privilege aware of
cluster permission builder and provide a way to specify how the permission is
to be built for a given privilege.
This commit renames `ConditionalClusterPrivilege` to `ConfigurableClusterPrivilege`.
`ConfigurableClusterPrivilege` is a renderable cluster privilege exposed
as a `global` field in role descriptor.
Other than this there is a requirement where we would want to know if a cluster
permission is implied by another cluster-permission (`has-privileges`).
This is helpful in addressing queries related to privileges for a user.
This is not just simply checking of cluster permissions since we do not
have access to runtime information (like request object).
This refactoring does not try to address those scenarios.
Relates #44048
The vagrant based tests currently reside in a single project, creating
dozens of tasks to manage starting and stopping the vagrant VM along
with running java and bats tests within each image. This all-in-one
pattern makes parallelizing packaging tests difficult.
This commit rewrites the vagrant testing infrastructure to be
independent of the actual test runners, thus allowing each platform to
be handled in a separate subproject. Additionally, the java and bats
tests are changed to be run through a "destructive" gradle task, which
is run inside the VM. The combination of these will allow
parallelization both locally (through running several VMs at once) as
well as running the destructive tasks in CI machines dedicated to each
platform (thus removing the need for vagrant in CI).
This commit adds a first draft of a regression analysis
to data frame analytics. There is high probability that
the exact syntax might change.
This commit adds the new analysis type and its parameters as
well as appropriate validation. It also modifies the extractor
and the fields detector to be able to handle categorical fields
as regression analysis supports them.
In the case that source and target are the same in `enrich_values` then
a string array can be specified.
For example instead of this:
```
PUT /_ingest/pipeline/my-pipeline
{
"processors": [
{
"enrich" : {
"policy_name": "my-policy",
"enrich_values": [
{
"source": "first_name",
"target": "first_name"
},
{
"source": "last_name",
"target": "last_name"
},
{
"source": "address",
"target": "address"
},
{
"source": "city",
"target": "city"
},
{
"source": "state",
"target": "state"
},
{
"source": "zip",
"target": "zip"
}
]
}
}
]
}
```
This more compact format can be specified:
```
PUT /_ingest/pipeline/my-pipeline
{
"processors": [
{
"enrich" : {
"policy_name": "my-policy",
"targets": [
"first_name",
"last_name",
"address",
"city",
"state",
"zip"
]
}
}
]
}
```
And the `enrich_values` key has been renamed to `set_from`.
Relates to #32789
* Restrict which tasks can use testclusters
This PR fixes a problem between the interaction of test-clusters and
build cache.
Before this any task could have used a cluster without tracking it as
input.
With this change a new interface is introduced to track the tasks that
can use clusters and we do consider the cluster as input for all of
them.
Currently the msearch api is used to execute buffered search requests;
however the msearch api doesn't deal with search requests in an intelligent way.
It basically executes each search separately in a concurrent manner.
This api reuses the msearch request and response classes and executes
the searches as one request in the node holding the enrich index shard.
Things like engine.searcher and query shard context are only created once.
Also there are less layers than executing a regular msearch request. This
results in an interesting speedup.
Without this change, in a single node cluster, enriching documents
with a bulk size of 5000 items, the ingest time in each bulk response
varied from 174ms to 822ms. With this change the ingest time in each
bulk response varied from 54ms to 109ms.
I think we should add a change like this based on this improvement in ingest time.
However I do wonder if instead of doing this change, we should improve
the msearch api to execute more efficiently. That would be more complicated
then this change, because in this change the custom api can only search
enrich index shards and these are special because they always have a single
primary shard. If msearch api is to be improved then that should work for
any search request to any indices. Making the same optimization for
indices with more than 1 primary shard requires much more work.
The current change is isolated in the enrich plugin and LOC / complexity
is small. So this good enough for now.
* Name each inner_hits section of nested queries differently and extract and combine the multiple values it generates into a single list.
This also introduces a limitation (its origin it's with Elasticsearch
though) on the sorting capabilities when the sorting is based on the
nested fields filtered: only one of the conditions applied to nested
documents will be used in the nested sorting.
(cherry picked from commit cfc5cf68f6e83b07bb9006986d0903d6be418ec6)
This commit replaces task_state and indexer_state in the
data frame _stats output with a single top level state
that combines the two. It is defined as:
- failed if what's currently reported as task_state is failed
- stopped if there is no persistent task
- Otherwise what's currently reported as indexer_state
Backport of #45276
When using the implicit flow in OpenID Connect, the
op.token_endpoint_url should not be mandatory as there is no need
to contact the token endpoint of the OP.
* [ML][Data Frame] Add update transform api endpoint (#45154)
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts. The exception being `description`.
This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
This adds support for `geo_bounds` aggregation inside the `pivot.aggregations` configuration.
The two points returned from the `geo_bounds` aggregation are transformed into `geo_shape` whose types are dynamic given the point's similarity.
* `point` if the two points are identical
* `linestring` if the two points share either a latitude or longitude
* `polygon` if the two points are completely different
The automatically deduced mapping for the resulting field is a `geo_shape`.
We currently use the unboundid ldap SDK, which is triply licensed under
GPL-2.0, LGPL-2.1, and the "UnboundID LDAP SDK Free Use License". We currently
identify the license as the latter, but LGPL-2.1 is the one we should be using
per our policy.
The PutJob API accidentally used an "expert" API of CreateIndexRequest.
That API is semi-lenient to syntax; a type could be omitted and the
request would work as expected. But if a type was omitted it would
not merge with templates correctly, leading to index creation that
only has the template and not the requested mappings in the request.
This commit refactors the PutJob API to:
- Include the type name
- Use a less "expert" API in an attempt to future proof against errors
- Uses an XContentBuilder instead of string replacing, removes json template
This commit applies a normalization process to environment paths, both
in how they are stored internally, also their settings values. This
normalization is done via two means:
- we make the paths absolute
- we remove redundant name elements from the path (what Java calls
"normalization")
This change ensures that when we compare and refer to these paths within
the system, we are using a common ground. For example, prior to the
change if the data path was relative, we would not compare it correctly
to paths from disk usage. This is because the paths in disk usage were
being made absolute.
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall-back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
introduces an abstraction for how checkpointing and synchronization works, covering
- retrieval of checkpoints
- check for updates
- retrieving stats information
Currently in the transport-nio work we connect and bind channels on the
a thread before the channel is registered with a selector. Additionally,
it is at this point that we set all the socket options. This commit
moves these operations onto the event-loop after the channel has been
registered with a selector. It attempts to set the socket options for a
non-server channel at registration time. If that fails, it will attempt
to set the options after the channel is connected. This should fix
#41071.
In the FIPS JVM the JVM default locale seems to leak into places
where it should be overridden. This change skips assertions
in TimestampFormatFinderTests.testGuessIsDayFirstFromLocale
that may be impacted.
Fixes#45140
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.
Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
Reloading of synonym_graph filter doesn't work currently because the search time
AnalysisMode doesn't get propagated to the TokenFilterFactory emitted by the
graph filters getChainAwareTokenFilterFactory() method. This change fixes that.
Closes#45127
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.
This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.
Fixes#45047
This change improves the exception messages that are thrown when the
system cannot read TLS resources such as keystores, truststores,
certificates, keys or certificate-chains (CAs).
This change specifically handles:
- Files that do not exist
- Files that cannot be read due to file-system permissions
- Files that cannot be read due to the ES security-manager
Backport of: #44787
There are no realms that can be configured exclusively with secure
settings. Every realm that supports secure settings also requires one
or more non-secure settings.
However, sometimes a node will be configured with entries in the
keystore for which there is nothing in elasticsearch.yml - this may be
because the realm we removed from the yml, but not deleted from the
keystore, or it could be because there was a typo in the realm name
which has accidentially orphaned the keystore entry.
In these cases the realm building would fail, but the error would not
always be clear or point to the root cause (orphaned keystore
entries). RealmSettings would act as though the realm existed, but
then fail because an incorrect combination of settings was provided.
This change causes realm building to fail early, with an explicit
message about incorrect keystore entries.
Backport of: #44471
When we create API key we check if the API key with the name
already exists. It searches with scroll enabled and this causes
the request to fail when creating large number of API keys in
parallel as it hits the number of open scroll limit (default 500).
We do not need the search context to be created so this commit
removes the scroll parameter from the search request for duplicate
API key.
If one tries to start a DF analytics job that has already run,
the result will be that the task will fail after reindexing the
dest index from the source index. The results of the prior run
will be gone and the task state is not properly set to failed
with the failure reason.
This commit improves the behavior in this scenario. First, we
set the task state to `failed` in a set of failures that were
missed. Second, a validation is added that if the destination
index exists, it must be empty.
We keep adding the current primary term to operations for which we do not assign a sequence
number. This does not make sense anymore as all operations which we care about have
sequence numbers now. The goal of this commit is to clean things up in InternalEngine and
reduce the complexity.
* We shouldn't be recreating wrapped REST handlers over and over for every request. We only use this hook in x-pack and the wrapper there does not have any per request state.
This is inefficient and could lead to some very unexpected memory behavior
=> I made the logic create the wrapper on handler registration and adjusted the x-pack wrapper implementation to correctly forward the circuit breaker and content stream flags
The Get Users API also returns users form the restricted realm or built-in users,
as we call them in our docs. One can also change the passwords of built-in
users with the Change Password API
A mismatched configuration between the IdP and SP will often result in
SAML authentication attempts failing because the audience condition is
not met (because the IdP and SP disagree about the correct form of the
SP's Entity ID).
Previously the error message in this case did not provide sufficient
information to resolve the issue because the IdP's expected audience
would be truncated if it exceeeded 32 characters. Since the error did
not provide both IDs in full, it was not possible to determine the
correct fix (in detail) based on the error alone.
This change expands the message that is included in the thrown
exception, and also adds additional logging of every failed audience
condition, with diagnostics of the match failure.
Backport of: #44334
* Create S3 Third Party Test Task that Covers the S3 CLI Tool
* Adjust snapshot cli test tool tests to work with real S3
* Build adjustment
* Clean up repo path before testing
* Dedup the logic for asserting path contents by using the correct utility method here that somehow became unused
Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses
`onNode()` so has no effect if the node is not in the current list of nodes.
It also discards the `Runnable` it creates without having run it, so has no
effect anyway.
This commit makes these tests much stricter about properly closing the nodes
started during `Coordinator` tests, by tracking the persisted states that are
opened, and adds an assertion to catch the trappy requirement that the closing
node still belongs to the cluster.
The existing equals check was broken, and would always be false.
The correct behaviour is to return "Collections.emptyList()" whenever
the the active(licensed)-realms equals the configured-realms.
Backport of: #44399
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)
As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.
* Fix one more import
In case closing the process throws an exception we should be catching
it no matter its type. The process may have terminated because of a
fatal error in which case closing the process will throw a server
error, not an `IOException`. If this happens we fail to mark the
persistent task as failed and the task gets in limbo.
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.
This also adds an integration test that runs outlier detection on data
with missing fields.
TaskListener accepts today Throwable in its onFailure method. Though
looking at where it is called (TransportAction), it can never be
notified of a Throwable.
This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:
- `only_managed`, which causes the response to only contain indices
which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
ILM error state
"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
Today the processors setting is permitted to be set to more than the
number of processors available to the JVM. The processors setting
directly sizes the number of threads in the various thread pools, with
most of these sizes being a linear function in the number of
processors. It doesn't make any sense to set processors very high as the
overhead from context switching amongst all the threads will overwhelm,
and changing the setting does not control how many physical CPU
resources there are on which to schedule the additional threads. We have
to draw a line somewhere and this commit deprecates setting processors
to more than the number of available processors. This is the right place
to draw the line given the linear growth as a function of processors in
most of the thread pools, and that some are capped at the number of
available processors already.
Since 7.3, it's possible to explicitly configure the SAML realm to
be used in Kibana's configuration. This in turn, eliminates the need
of properly setting `xpack.security.public.*` settings in Kibana
and largely simplifies relevant documentation.
This also changes `xpack.security.authProviders` to
`xpack.security.authc.providers` as the former was deprecated in
favor of the latter in 7.3 in Kibana
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.
Backport of #44768
The problem is that RemoteClusterConnection closes the connection manager asynchronously, which races with the threadpool being shutdown at the end of the test.
Closes#44339Closes#44610
Deleting a follower index does not delete its ShardFollowTasks, potentially
leaving many persistent tasks in the cluster that cannot be allocated on
nodes and unnecessary fill the logs. This commit adds a cluster state listener
(ShardFollowTaskCleaner) that completes (with a failure) any persistent task
that refers to a non existent follower index.
I think that this bug has been introduced by #34404: before this change the
task would have been completed as failed and removed from the cluster state.
Backport of #44702 and #44801 on 7.x
* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.
(cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)
Use InspectionHelper classes to decide if the aggregations should return null (in case there is no value) or the value itself.
(cherry picked from commit dafd7b039b0da072750e8f57e7572d24f7aad44a)
* Only emit deprecation warning if there was actual change of a datafeed's job_id.
* Add @Deprecated annotation to DatafeedUpdate.Builder#setJobId method
Adds a global soft limit on the number of concurrently executing enrich policies.
Since an enrich policy is run on the generic thread pool, this is meant to limit
policy runs separately from the generic thread pool capacity.
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.
This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.
Backport of #44350
This commit renames the ILM package from indexlifecycle to ilm. We have
all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. This
commit makes that change.
This commit renames the SLM package from snapshotlifecycle to slm. We
have all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. For SLM,
let's use slm for all of these including the package name from the
beginning.
This PR adds a background maintenance task that is scheduled on the master node only.
The deletion of an index is based on if it is not linked to a policy or if the enrich alias is not
currently pointing at it. Synchronization has been added to make sure that no policy
executions are running at the time of cleanup, and if any executions do occur, the marking
process delays cleanup until next run.
We often start testing with early access versions of new Java
versions and this have caused minor issues in our tests
(i.e. #43141) because the version string that the JVM reports
cannot be parsed as it ends with the string -ea.
This commit changes how we parse and compare Java versions to
allow correct parsing and comparison of the output of java.version
system property that might include an additional alphanumeric
part after the version numbers
(see [JEP 223[(https://openjdk.java.net/jeps/223)). In short it
handles a version number part, like before, but additionally a
PRE part that matches ([a-zA-Z0-9]+).
It also changes a number of tests that would attempt to parse
java.specification.version in order to get the full version
of Java. java.specification.version only contains the major
version and is thus inappropriate when trying to compare against
a version that might contain a minor, patch or an early access
part. We know parse java.version that can be consistently
parsed.
Resolves#43141
Since #44344 we use IndicesOptions.LENIENT_EXPAND_OPEN
when deciding which indices to include in checkpoint
calculation. This change uses the same option when
deciding which indices to search for data and which
indices to get mappings from, otherwise there is a
potential mismatch between the checkpoint details and
what is searched elsewhere.
* Mute failing test
tracked in #44552
* mute EvilSecurityTests
tracking in #44558
* Fix line endings in ESJsonLayoutTests
* Mute failing ForecastIT test on windows
Tracking in #44609
* mute BasicRenormalizationIT.testDefaultRenormalization
tracked in #44613
* fix mute testDefaultRenormalization
* Increase busyWait timeout windows is slow
* Mute failure unconfigured node name
* mute x-pack internal cluster test windows
tracking #44610
* Mute JvmErgonomicsTests on windows
Tracking #44669
* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot
Tracking #44671
* Mute NodeTests on Windows
Tracking #44256
This adds a new dynamic cluster setting `xpack.data_frame.num_transform_failure_retries`.
This setting indicates how many times non-critical failures should be retried before a data frame transform is marked as failed and should stop executing. At the time of this commit; Min: 0, Max: 100, Default: 10
NamedAnalyzer should return the same AnalysisMode than any custom analyzer it
wraps, otherwise AnalysisMode.ALL. This used to be only CustomAnalyzer in the
past, but with the introduction of the ReloadableCustomAnalyzer this needs to be
added as an option where the analysis mode gets propagated.
Closes#44625
This commit converts several utility classes that implement Streamable
to have StreamInput constructors. It also adds a default version of
readFrom to Streamable so that overriding to throw UOE is not necessary.
relates #34389
This commit converts several more classes from streamable to writeable
in server, mostly within the o.e.index and o.e.persistent packages.
relates #34389
* Allow empty configuration for SLM policies
When putting or updating a snapshot lifecycle policy it was not possible
to elide the `config` map. This commit makes the configuration optional,
the same way that it is when taking a snapshot.
Relates to #38461
* Add Objects.requireNonNull for required parts of the policy
* Expose index age in ILM explain output
This adds the index's age to the ILM explain output, for example:
```
{
"indices" : {
"ilm-000001" : {
"index" : "ilm-000001",
"managed" : true,
"policy" : "full-lifecycle",
"lifecycle_date" : "2019-07-16T19:48:22.294Z",
"lifecycle_date_millis" : 1563306502294,
"age" : "1.34m",
"phase" : "hot",
"phase_time" : "2019-07-16T19:48:22.487Z",
... etc ...
}
}
}
```
This age can be used to tell when ILM will transition the index to the
next phase, based on that phase's `min_age`.
Resolves#38988
* Expose age in getters and in HLRC
If the setting on the follower and the leader are identical after
filtering out private and internal settings, then we should not call
update setting (on the follower) as there's nothing to change. Moreover,
this makes the ShardFollowTask abort as it considers
ActionRequestValidationException (caused by an empty update setting
request) as a fatal error.
Closes#44521
A tool to work with snapshots.
Co-authored by @original-brownbear.
This commit adds snapshot tool and the single command cleanup, that
cleans up orphaned files for S3.
Snapshot tool lives in x-pack/snapshot-tool.
(cherry picked from commit fc4aed44dd975d83229561090f957a95cc76b287)
Removes the warning suppression -Xlint:-deprecation,-rawtypes,-serial,-try,-unchecked.
Many warnings were unchecked warnings in the test code often because of the use of mocks.
These are suppressed with @SuppressWarning
many classes still use the Streamable constructors of HandledTransportAction,
this commit moves more of those classes to the new Writeable constructors.
relates #34389.
this commit removes usage of the deprecated
constructor with a single argument and no Writeable.Reader.
The purpose of this is to reduce the boilerplate necessary for
properly implementing a new action, as well as reducing the
chances of using the incorrect super constructor while classes
are being migrated to Writeable
relates #34389.
This commit converts all remaining ActionType response classes to
writeable in xpack core. It also converts a few from server which were
used by xpack core.
relates #34389
Today we have an annotation for controlling logging levels in
tests. This annotation serves two purposes, one is to control the
logging level used in tests, when such control is needed to impact and
assert the behavior of loggers in tests. The other use is when a test is
failing and additional logging is needed. This commit separates these
two concerns into separate annotations.
The primary motivation for this is that we have a history of leaving
behind the annotation for the purpose of investigating test failures
long after the test failure is resolved. The accumulation of these stale
logging annotations has led to excessive disk consumption. Having
recently cleaned this up, we would like to avoid falling into this state
again. To do this, we are adding a link to the test failure under
investigation to the annotation when used for the purpose of
investigating test failures. We will add tooling to inspect these
annotations, in the same way that we have tooling on awaits fix
annotations. This will enable us to report on the use of these
annotations, and report when stale uses of the annotation exist.
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.
relates #34389
This commit converts the request and response classes for broadcast
actions to implement ctors for Writeable.Reader and forces all future
implementations to implement the same.
relates #34389
Registering a channel with a selector is a required operation for the
channel to be handled properly. Currently, we mix the registeration with
other setup operations (ip filtering, SSL initiation, etc). However, a
fail to register is fatal. This PR modifies how registeration occurs to
immediately close the channel if it fails.
There are still two clear loopholes for how a user can interact with a
channel even if registration fails. 1. through the exception handler.
2. through the channel accepted callback. These can perhaps be improved
in the future. For now, this PR prevents writes from proceeding if the
channel is not registered.
This commit avoids a situation where we might stack overflow in the
auto-follower coordinator. In the face of repeated failures to get the
remote cluster state, we would previously be called back on the same
thread and then recurse to try again. If this failure persists, the
repeated callbacks on the same thread would lead to a stack
overflow. The repeated failures can occur, for example, if the connect
queue is full when we attempt to make a connection to the remote
cluster. This commit avoids this by truncating the call stack if we are
called back on the same thread as the initial request was made on.
This commit avoids an NPE when checking for privileges to follow
indices. The problem here is that in some cases we might not be able to
read the authentication info from the thread context. In that case, a
null user would be returned and we were not guarding against this.
* Migrate ML Actions to use writeable ActionType (#44302)
This commit converts all the StreamableResponseActionType
actions in the ML core module to be ActionType and leverage
the Writeable infrastructure.
This improves the error message when encrypting of sensitive watcher
data is configured, but no system file was specified in the keystore.
This error message is displayed on startup.
This also closes the input stream of the secure file properly.
Closes#43619
* [ML][Data Frame] treat bulk index failures as an indexing failure
* removing redundant public modifier
* changing to an ElasticsearchException
* fixing redundant public modifier
The failure is correctly getting propagated, this commit adds support to
explicitly look for .watch-history failures using the same logging strategy
as triggered watch failures.
* Add Snapshot Lifecycle Management (#43934)
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
* Fix compilation for backport
* Fix get snapshot output parsing in test
* [DOCS] Add redirects for removed autogen anchors (#44380)
* Switch <tt>...</tt> in javadocs for {@code ...}
make checkpointing more robust:
- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error
fixes#43992
This commit converts all the StreamableResponseActionType security
classes in xpack core to ActionType, implementing Writeable for their
response classes.
relates #34389
When getting authentication info from the thread context, it might be
that we encounter an I/O exception. Today we swallow this exception and
return a null authentication info to the caller. Yet, this could be
hiding bugs or errors. This commits adjusts this behavior so that we no
longer swallow the exception.
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.
relates #34389
* HLRC: Fix '+' Not Correctly Encoded in GET Req.
* Encode `+` correctly as `%2B` in URL paths
* Keep encoding `+` as space in URL parameters
* Closes#33077
This commit modifies bwc behavior in FindFileStructureAction to check
against a concrete version instead of Version.CURRENT. Checking against
Version.CURRENT does not work since it is changing, in addition to it
having different meanings on each branch.
relates #42501
Rewrites how continuous data frame transforms calculates and handles buckets that require an update. Instead of storing the whole set in memory, it pages through the updates using a 2nd cursor. This lowers memory consumption and prevents problems with limits at query time (max_terms_count). The list of updates can be re-retrieved in a failure case (#43662)
This commit moves the Supplier variant of HandledTransportAction to have
a different ordering than the Writeable.Reader variant. The Supplier
version is used for the legacy Streamable, and currently having the
location of the Writeable.Reader vs Supplier in the same place forces
using casts of Writeable.Reader to select the correct super constructor.
This change in ordering allows easier migration to Writeable.Reader.
relates #34389
This change makes the process of verifying the signature of
official plugins FIPS 140 compliant by defaulting to use the
BouncyCastle FIPS provider and adding a dependency to bcpg-fips
that implement parts of openPGP in a FIPS compliant manner.
In already FIPS 140 enabled environments that use the
BouncyCastle FIPS provider, the bcfips dependency is redundant
but doesn't cause an issue as it will be added only in the classpath
of the cli-tools
This is a backport of #44224
Fixes a bug in the PKI authentication. This manifests when there
are multiple PKI realms configured in the chain, with different
principal parse patterns. There are a few configuration scenarios
where one PKI realm might parse the principal from the Subject
DN (according to the `username_pattern` realm setting) but
another one might do the truststore validation (according to
the truststore.* realm settings).
This is caused by the two passes through the realm chain, first to
build the authentication token and secondly to authenticate it, and
that the X509AuthenticationToken sets the principal during
construction.
Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so
that cases like #44134 can be more easily debugged. The likely reason for that one is that a test,
which was repeated again and again while always spawning a fresh Gradle worker (due to Gradle
daemon) kept increasing Gradle worker IDs, causing an overflow at some point.
Now that ML job configs are stored in an index rather than
cluster state, availability of the .ml-config index is very
important to the operation of ML. When a cluster starts up
the ML persistent tasks will be considered for node
assignment very early on. It is best in this case if
assignment is deferred until after the .ml-config index is
available.
The introduction of data frame analytics jobs has made this
problem worse, because anomaly detection jobs already waited
for the primary shards of the .ml-state, .ml-anomalies-shared
and .ml-meta indices to be available before doing node
assignment, and by coincidence this would probably lead to
the primary shards of .ml-config also being searchable. But
data frame analytics jobs had no other index checks prior to
this change.
This fixes problem 2 of #44156
* The incompatible snapshots logic was created to track 1.x snapshots that
became incompatible with 2.x
* It serves no purpose at this point
* It adds an additional GET request to every loading of
RepositoryData (from loading the incompatible snapshots blob)
By default, we don't check ranges while indexing geo_shapes. As a
result, it is possible to index geoshapes that contain contain
coordinates outside of -90 +90 and -180 +180 ranges. Such geoshapes
will currently break SQL and ML retrieval mechanism. This commit removes
these restriction from the validator is used in SQL and ML retrieval.
The base classes for transport requests and responses currently
implement Streamable and Writeable. The writeTo method on these base
classes is implemented with an empty implementation. Not only does this
complicate subclasses to think they need to call super.writeTo, but it
also can lead to not implementing writeTo when it should have been
implemented, or extendiong one of these classes when not necessary,
since there is nothing to actually implement.
This commit removes the empty writeTo from these base classes, and fixes
subclasses to not call super and in some cases implement an empty
writeTo themselves.
relates #34389
This commit adds permissions validation on the indices provided in the
enrich policy. These indices should be validated at store time so as not
to have cryptic error messages in the event the user does not have
permissions to access said indices.
AggregatorFactory was generic over itself, but it doesn't appear we
use this functionality anywhere (e.g. to allow the super class
to declare arguments/return types generically for subclasses to
override). Most places use a wildcard constraint, and even when a
concrete type is specified it wasn't used.
But since AggFactories are widely used, this led to
the generic touching many pieces of code and making type signatures
fairly complex
When the ML memory tracker is refreshed and a refresh is
already in progress the idea is that the second and
subsequent refresh requests receive the same response as
the currently in progress refresh.
There was a bug that if a refresh failed then the ML
memory tracker's view of whether a refresh was in progress
was not reset, leading to every subsequent request being
registered to receive a response that would never come.
This change makes the ML memory tracker pass on failures
as well as successes to all interested parties and reset
the list of interested parties so that further refresh
attempts are possible after either a success or failure.
This fixes problem 1 of #44156
This commit documents the backup and restore of a cluster's
security configuration.
It is not possible to only backup (or only restore) security
configuration, independent to the rest of the cluster's conf,
so this describes how a full configuration backup&restore
will include security as well. Moreover, it explains how part
of the security conf data resides on the special .security
index and how to backup that using regular data snapshot API.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
Co-Authored-By: Tim Vernum <tim@adjective.org>
* Add test for SQL not being available error message in JDBC.
* Add a new qa sub-project that explicitly disables SQL XPack module in Gradle.
(cherry picked from commit 8a1ac8a3a88a325ec9b99963e0fa288c18ee0ee5)
* The created array didn't have the correct initial size while attempting to resolve multiple indices
(cherry picked from commit 341006e9913e831408f5bbc7f8ad8c453a7f630e)
Custom timestamp overrides provided to the find_file_structure
endpoint produced an invalid Grok pattern if the fractional
seconds separator was a dot rather than a comma or colon.
This commit fixes that problem and adds tests for this sort
of timestamp override.
Fixes#44110
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
Data frame task responses had logic to return a HTTP 500 status code if there was
any node or task failures even if other tasks in the same request reported correctly.
This is different to how other task responses are handled where a 200 is always
returned leaving the client should check for failures. Returning a 500 also breaks
the high level rest client so always return a 200
Closes#44011
This PR enables the indexing optimization using sequence numbers on
replicas. With this optimization, indexing on replicas should be faster
and use less memory as it can forgo the version lookup when possible.
This change also deactivates the append-only optimization on replicas.
Relates #34099