Work in progress in the c++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the mem
estimate when there is no source query is a about 2Mb. So I
am relaxing the test to assert memory estimate is less than 1Mb
instead of 500Kb.
Backport of #49924
Step on the road to #49060.
This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.
It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer.
With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in #49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In #49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
Makes sure that CCR also properly works with _source disabled.
Changes one exception in LuceneChangesSnapshot as the case of missing _recovery_source
because of a missing lease was not properly properly bubbled up to CCR (testIndexFallBehind
was failing).
To recap, Attributes form the properties of a derived table.
Each LogicalPlan has Attributes as output since each one can be part of
a query and as such its result are sent to its consumer.
This change essentially removes the name id comparison so any changes
applied to existing expressions should work as long as the said
expressions are semantically equivalent.
This change enforces the hashCode and equals which has the side-effect
of using hashCode as identifiers for each expression.
By removing any property from an Attribute, the various components need
to look the original source for comparison which, while annoying, should
prevent a reference from getting out of sync with its source due to
optimizations.
Essentially going forward there are only 3 types of NamedExpressions:
Alias - user define (implicit or explicit) name
FieldAttribute - field backed by Elasticsearch
ReferenceAttribute - a reference to another source acting as an
Attribute. Typically the Attribute of an Alias.
* Remove the usage of NamedExpression as basis for all Expressions.
Instead, restrict their use only for named context, such as projections
by using Aliasing instead.
* Remove different types of Attributes and allow only FieldAttribute,
UnresolvedAttribute and ReferenceAttribute. To avoid issues with
rewrites, resolve the references inside the QueryContainer so the
information always stays on the source.
* Side-effect, simplify the rules as the state for InnerAggs doesn't
have to be contained anymore.
* Improve ResolveMissingRef rule to handle references to named
non-singular expression tree against the same expression used up the
tree.
#49693 backport to 7.x
(cherry picked from commit 5d095e2173bcbf120f534a6f2a584185a7879b57)
In order to cache script results in the query shard cache, we need to
check if scripts are deterministic. This change adds a default method
to the script factories, `isResultDeterministic() -> false` which is
used by the `QueryShardContext`.
Script results were never cached and that does not change here. Future
changes will implement this method based on whether the results of the
scripts are deterministic or not and therefore cacheable.
Refs: #49466
**Backport**
This commit refactors the `IndexLifecycleRunner` to split out and
consolidate the number of methods that change state from within ILM. It
adds a new class `IndexLifecycleTransition` that contains a number of
static methods used to modify ILM's state. These methods all return new
cluster states rather than making changes themselves (they can be
thought of as helpers for modifying ILM state).
Rather than having multiple ways to move an index to a particular step
(like `moveClusterStateToStep`, `moveClusterStateToNextStep`,
`moveClusterStateToPreviouslyFailedStep`, etc (there are others)) this
now consolidates those into three with (hopefully) useful names:
- `moveClusterStateToStep`
- `moveClusterStateToErrorStep`
- `moveClusterStateToPreviouslyFailedStep`
In the move, I was also able to consolidate duplicate or redundant
arguments to these functions. Prior to this commit there were many calls
that provided duplicate information (both `IndexMetaData` and
`LifecycleExecutionState` for example) where the duplicate argument
could be derived from a previous argument with no problems.
With this split, `IndexLifecycleRunner` now contains the methods used to
actually run steps as well as the methods that kick off cluster state
updates for state transitions. `IndexLifecycleTransition` contains only
the helpers for constructing new states from given scenarios.
This also adds Javadocs to all methods in both `IndexLifecycleRunner`
and `IndexLifecycleTransition` (this accounts for almost all of the
increase in code lines for this commit). It also makes all methods be as
restrictive in visibility, to limit the scope of where they are used.
This refactoring is part of work towards capturing actions and
transitions that ILM makes, by consolidating and simplifying the places
we make state changes, it will make adding operation auditing easier.
Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe). Usage of the final reduction
is growing however. Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc
Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.
This commit decouples pipeline aggs from the final reduction by
introducing a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController).
This will only reduce pipeline aggs on the final reduce after
the non-pipeline agg tree has been fully reduced.
By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
* Copying the request is not necessary here. We can simply release it once the response has been generated and a lot of `Unpooled` allocations that way
* Relates #32228
* I think the issue that preventet that PR that PR from being merged was solved by #39634 that moved the bulk index marker search to ByteBuf bulk access so the composite buffer shouldn't require many additional bounds checks (I'd argue the bounds checks we add, we save when copying the composite buffer)
* I couldn't neccessarily reproduce much of a speedup from this change, but I could reproduce a very measureable reduction in GC time with e.g. Rally's PMC (4g heap node and bulk requests of size 5k saw a reduction in young GC time by ~10% for me)
This commit fixes a number of issues with data replication:
- Local and global checkpoints are not updated after the new operations have been fsynced, but
might capture a state before the fsync. The reason why this probably went undetected for so
long is that AsyncIOProcessor is synchronous if you index one item at a time, and hence working
as intended unless you have a high enough level of concurrent indexing. As we rely in other
places on the assumption that we have an up-to-date local checkpoint in case of synchronous
translog durability, there's a risk for the local and global checkpoints not to be up-to-date after
replication completes, and that this won't be corrected by the periodic global checkpoint sync.
- AsyncIOProcessor also has another "bad" side effect here: if you index one bulk at a time, the
bulk is always first fsynced on the primary before being sent to the replica. Further, if one thread
is tasked by AsyncIOProcessor to drain the processing queue and fsync, other threads can
easily pile more bulk requests on top of that thread. Things are not very fair here, and the thread
might continue doing a lot more fsyncs before returning (as the other threads pile more and
more on top), which blocks it from returning as a replication request (e.g. if this thread is on the
primary, it blocks the replication requests to the replicas from going out, and delaying
checkpoint advancement).
This commit fixes all these issues, and also simplifies the code that coordinates all the after
write actions.
Some extended testing with MS-SQL server and H2 (which agree on
results) revealed bugs in the implementation of WEEK related extraction
and diff functions.
Non-iso WEEK seems to be broken since #48209 because
of the replacement of Calendar and the change in the ISO rules.
ISO_WEEK failed for some edge cases around the January 1st.
DATE_DIFF was previously based on non-iso WEEK extraction which seems
not to be the case.
Fixes: #49376
(cherry picked from commit 54fe7f57289c46bb0905b1418f51a00e8c581560)
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.
In contrast to prior PR (#49647), this PR also fixes (see b3697cc) a situation where the previous
index expression logic had an interesting side effect. For bulk requests (which had resolveIndex
= false), the reroute phase was waiting for the index to appear in case where it was not present,
and for all other replication requests (resolveIndex = true) it would right away throw an
IndexNotFoundException while resolving the name and exit. With #49647, every replication
request was now waiting for the index to appear, which was problematic when the given index
had just been deleted (e.g. deleting a follower index while it's still receiving requests from the
leader, where these requests would now wait up to a minute for the index to appear). This PR
now adds b3697cc on top of that prior PR to make sure to reestablish some of the prior behavior
where the reroute phase waits for the bulk request for the index to appear. That logic was in
place to ensure that when an index was created and not all nodes had learned about it yet, that
the bulk would not fail somewhere in the reroute phase. This is now only restricted to the
situation where the current node has an older cluster state than the one that coordinated the
bulk request (which checks that the index is present). This also means that when an index is
deleted, we will no longer unnecessarily wait up to the timeout for the index o appear, and
instead fail the request.
Closes#20279
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.
Closes#49531
Backport of #49690
* Make BlobStoreRepository Aware of ClusterState (#49639)
This is a preliminary to #49060.
It does not introduce any substantial behavior change to how the blob store repository
operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency
by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation
(create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple.
This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the
repository operates in #49060
- Improves HTTP client hostname verification failure messages
- Adds "DiagnosticTrustManager" which logs certificate information
when trust cannot be established (hostname failure, CA path failure,
etc)
These diagnostic messages are designed so that many common TLS
problems can be diagnosed based solely (or primarily) on the
elasticsearch logs.
These diagnostics can be disabled by setting
xpack.security.ssl.diagnose.trust: false
Backport of: #48911
Add a mirror of the maven repository of the shibboleth project
and upgrade opensaml and related dependencies to the latest
version available version
Resolves: #44947
This commit adds a function in NodeClient that allows to track the progress
of a search request locally. Progress is tracked through a SearchProgressListener
that exposes query and fetch responses as well as partial and final reduces.
This new method can be used by modules/plugins inside a node in order to track the
progress of a local search request.
Relates #49091
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.
Closes#20279
This change adds a dynamic cluster setting named `indices.id_field_data.enabled`.
When set to `false` any attempt to load the fielddata for the `_id` field will fail
with an exception. The default value in this change is set to `false` in order to prevent
fielddata usage on this field for future versions but it will be set to `true` when backporting
to 7x. When the setting is set to true (manually or by default in 7x) the loading will also issue
a deprecation warning since we want to disallow fielddata entirely when https://github.com/elastic/elasticsearch/issues/26472
is implemented.
Closes#43599
Authentication has grown more complex with the addition of new realm
types and authentication methods. When user authentication does not
behave as expected it can be difficult to determine where and why it
failed.
This commit adds DEBUG and TRACE logging at key points in the
authentication flow so that it is possible to gain addition insight
into the operation of the system.
Backport of: #49575
The AuthenticationService has a feature to "smart order" the realm
chain so that whicherver realm was the last one to successfully
authenticate a given user will be tried first when that user tries to
authenticate again.
There was a bug where the building of this realm order would
incorrectly drop the first realm from the default chain unless that
realm was the "last successful" realm.
In most cases this didn't cause problems because the first realm is
the reserved realm and so it is unusual for a user that authenticated
against a different realm to later need to authenticate against the
resevered realm.
This commit fixes that bug and adds relevant asserts and tests.
Backport of: #49473
All the implementations of `EsBlobStoreTestCase` use the exact same
bootstrap code that is also used by their implementation of
`EsBlobStoreContainerTestCase`.
This means all tests might as well live under `EsBlobStoreContainerTestCase`
saving a lot of code duplication. Also, there was no HDFS implementation for
`EsBlobStoreTestCase` which is now automatically resolved by moving the tests over
since there is a HDFS implementation for the container tests.
Grouping By YEAR() is translated to a histogram aggregation, but
previously if there was a scalar function invloved (e.g.:
`YEAR(date + INTERVAL 2 YEARS)`), there was no proper script created
and the histogram was applied on a field with name: `date + INTERVAL 2 YEARS`
which doesn't make sense, and resulted in null result.
Check the underlying field of YEAR() and if it's a function call
`asScript()` to properly get the painless script on which the histogram
is applied.
Fixes: #49386
(cherry picked from commit 93c37abc943d00d3a14ba08435d118a6d48874c7)
Add TRUNC as alias to already implemented TRUNCATE
numeric function which is the flavour supported by
Oracle and PostgreSQL.
Relates to: #41195
(cherry picked from commit f2aa7f0779bc5cce40cc0c1f5e5cf1a5bb7d84f0)
We depend on the number of data frame rows in order to report progress
for the writing of results, the last phase of a job run. However, results
include other objects than just the data frame rows (e.g, progress, inference model, etc.).
The problem this commit fixes is that if we receive the last data frame row results
we'll report that progress is complete even though we still have more results to process
potentially. If the job gets stopped for any reason at this point, we will not be able
to restart the job properly as we'll think that the job was completed.
This commit addresses this by limiting the max progress we can report for the
writing_results phase before the results processor completes to 98.
At the end, when the process is done we set the progress to 100.
The commit also improves failure capturing and reporting in the results processor.
Backport of #49551
Previously, CaseProcessor was pre-calculating (called `process()`)
on all the building elements of a CASE/IIF expression, not only the
conditions involved but also the results, as well as the final else result.
In case one of those results had an erroneous calculation
(e.g.: division by zero) this was executed and resulted in
an Exception to be thrown, even if this result was not used because of
the condition guarding it. e.g.:
```
SELECT CASE myField1 = 0 THEN NULL ELSE myField2 / myField1 END
FROM test;
```
Fixes: #49388
(cherry picked from commit dbd169afc98686cae1bc72024fad0ca32b272efd)
This commit back ports three commits related to enabling the simple
connection strategy.
Allow simple connection strategy to be configured (#49066)
Currently the simple connection strategy only exists in the code. It
cannot be configured. This commit moves in the direction of allowing it
to be configured. It introduces settings for the addresses and socket
count. Additionally it introduces new settings for the sniff strategy
so that the more generic number of connections and seed node settings
can be deprecated.
The simple settings are not yet registered as the registration is
dependent on follow-up work to validate the settings.
Ensure at least 1 seed configured in remote test (#49389)
This fixes#49384. Currently when we select a random subset of seed
nodes from a list, it is possible for 0 seeds to be selected. This test
depends on at least 1 seed being selected.
Add the simple strategy to cluster settings (#49414)
This is related to #49067. This commit adds the simple connection
strategy settings and strategy mode setting to the cluster settings
registry. With these changes, the simple connection mode can be used.
Additionally, it adds validation to ensure that settings cannot be
misconfigured.
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
Before this change excluding an unsupported field resulted in
an error message that explained the excluded field could not be
detected as if it doesn't exist. This error message is confusing.
This commit commit changes this so that there is no error in this
scenario. When excluding a field that does exist but has been
automatically been excluded from the analysis there is no harm
(unlike excluding a missing field which could be a typo).
Backport of #49535
This test must check for state `SUCCESS` as well. `SUCESS` in
`SnapshotsInProgress` means "all data nodes finished snapshotting sucessfully but master must still finalize the snapshot in the repo".
`SUCESS` does not mean that the snapshot is actually fully finished in this object.
You can easily reporduce the scenario in #49303 that has an in-progress snapshot in `SUCCESS` state
by waiting 20s before running the busy assert loop on the snapshot status so that all steps but the blocked
finalization can finish.
Closes#49303
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.
The API consolidates information that is useful before
creating a data frame analytics job.
It includes:
- memory estimation
- field selection explanation
Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.
Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
Backport of #49455
This commit enhances the required pipeline functionality by changing it
so that default/request pipelines can also be executed, but the required
pipeline is always executed last. This gives users the flexibility to
execute their own indexing pipelines, but also ensure that any required
pipelines are also executed. Since such pipelines are executed last, we
change the name of required pipelines to final pipelines.
Add extra checks to prevent ConstantFolding rule to try to fold
the CASE/IIF functions early before the SimplifyCase rule gets applied.
Fixes: #49387
(cherry picked from commit f35c9725350e35985d8dd3001870084e1784a5ca)
This change automatically pre-sort search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to ensure that shards that contain recent data are executed first so that intermediate merge have more chance to contain contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least
recent documents from time-based indices.
Relates #49091
This commit adds a deprecation warning when starting
a node where either of the server contexts
(xpack.security.transport.ssl and xpack.security.http.ssl)
meet either of these conditions:
1. The server lacks a certificate/key pair (i.e. neither
ssl.keystore.path not ssl.certificate are configured)
2. The server has some ssl configuration, but ssl.enabled is not
specified. This new validation does not care whether ssl.enabled is
true or false (though other validation might), it simply makes it
an error to configure server SSL without being explicit about
whether to enable that configuration.
Backport of: #45892
Backport of #48277
Otherwise integration tests may fail if the monitoring interval is low:
```
[2019-10-21T09:57:25,527][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [integTest-0] fatal error in thread [elasticsearch[integTest-0][generic][T#4]], exiting
java.lang.AssertionError: initial cluster state not set yet
at org.elasticsearch.cluster.service.ClusterApplierService.state(ClusterApplierService.java:208) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterService.state(ClusterService.java:125) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
at org.elasticsearch.xpack.monitoring.MonitoringService$MonitoringExecution$1.doRun(MonitoringService.java:231) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
```
I ran into this when lowering the monitoring interval when investigating
enrich monitoring test: #48258
fix force stopping transform if indexer state hasn't been written and/or is set to STOPPED. In certain situations the transform could not be stopped, which means the task could not be removed. Introduces improved abstraction in order to better test state handling in future.
* SLM set the operation mode to RUNNING on first run
Set the SLM operation mode to RUNNING when setting the first SLM lifecycle
policy. Historically, SLM was not decoupled from ILM but now they are
independent components. Setting the SLM operation mode to what the ILM running
mode was when we set the first SLM lifecycle policy was a remain from those
times.
* SLM update package info
* SLM suppress unusued warning
* SLM use logger for the correct class
* SLM Add integration test for operation mode
* Use ESSingleNodeTestCase instead of ESIntegTestCase
(cherry picked from commit 4ad3d93f89d03bf9a25685a990d1a439f33ce0e6)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
If a datafeed is stopped normally and force stopped at the same
time then it is possible that the force stop removes the
persistent task while the normal stop is performing actions.
Currently this causes the normal stop to error, but since
stopping a stopped datafeed is not an error this doesn't make
sense. Instead the force stop should just take precedence.
This is a followup to #49191 and should really have been
included in the changes in that PR.
This commit moves the async calls required to retrieve the components
that make up `ExtractedFieldsExtractor` out of `DataFrameDataExtractorFactory`
and into a dedicated `ExtractorFieldsExtractorFactory` class.
A few more refactorings are performed:
- The detector no longer needs the results field. Instead, it knows
whether to use it or not based on whether the task is restarting.
- We pass more accurately whether the task is restarting or not.
- The validation of whether fields that have a cardinality limit
are valid is now performed in the detector after retrieving the
respective cardinalities.
Backport of #49315
This is a pure code rearrangement refactor. Logic for what specific ValuesSource instance to use for a given type (e.g. script or field) moved out of ValuesSourceConfig and into CoreValuesSourceType (previously just ValueSourceType; we extract an interface for future extensibility). ValueSourceConfig still selects which case to use, and then the ValuesSourceType instance knows how to construct the ValuesSource for that case.
This commit changes the ThreadContext to just use a regular ThreadLocal
over the lucene CloseableThreadLocal. The CloseableThreadLocal solves
issues with ThreadLocals that are no longer needed during runtime but
in the case of the ThreadContext, we need it for the runtime of the
node and it is typically not closed until the node closes, so we miss
out on the benefits that this class provides.
Additionally by removing the close logic, we simplify code in other
places that deal with exceptions and tracking to see if it happens when
the node is closing.
Closes#42577
This commit wraps the calls to retrieve the current step in a try/catch
so that the exception does not bubble up. Instead, step info is added
containing the exception to the existing step.
Semi-related to #49128
(cherry picked from commit 72530f8a7f40ae1fca3704effb38cf92daf29057)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This API call in most implementations is fairly IO heavy and slow
so it is more natural to be async in the first place.
Concretely though, this change is a prerequisite of #49060 since
determining the repository generation from the cluster state
introduces situations where this call would have to wait for other
operations to finish. Doing so in a blocking manner would break
`SnapshotResiliencyTests` and waste a thread.
Also, this sets up the possibility to in the future make use of async IO
where provided by the underlying Repository implementation.
In a follow-up `SnapshotsService#getRepositoryData` will be made async
as well (did not do it here, since it's another huge change to do so).
Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
Previously, DATEDIFF for minutes and hours was doing a
rounding calculation using all the time fields (secs, msecs/micros/nanos).
Instead it should first truncate the 2 dates to the respective field (mins or hours)
zeroing out all the more detailed time fields and then make the subtraction.
(cherry picked from commit 124cd18e20429e19d52fd8dc383827ea5132d428)
The following edge cases were fixed:
1. A request to force-stop a stopping datafeed is no longer
ignored. Force-stop is an important recovery mechanism
if normal stop doesn't work for some reason, and needs
to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
the cluster during a normal stop then the stop request is
retried (and will likely succeed on this retry by simply
cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
the same datafeed we no longer fail the one that is
processed second. The previous behaviour was wrong as
stopping a stopped datafeed is not an error, so stopping
a datafeed twice simultaneously should not be either.
Backport of #49191
* [ML] ML Model Inference Ingest Processor (#49052)
* [ML][Inference] adds lazy model loader and inference (#47410)
This adds a couple of things:
- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:
* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* fixing version of indexed docs for model inference
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.
This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.
This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.
Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
improve error handling for script errors, treating it as irrecoverable errors which puts the task
immediately into failed state, also improves the error extraction to properly report the script
error.
fixes#48467
AutoFollowIT relies on assertBusy() calls to wait for a given number of
leader indices to be created but this is prone to failures on CI. Instead,
we should use latches to indicate when auto-follow patterns must be
paused and resumed.
This commit fixes a NPE problem as reported in #49150.
But this problem uncovered that we never added proper handling
of state for data frame analytics tasks.
In this commit we improve the `MlTasks.getDataFrameAnalyticsState`
method to handle null tasks and state tasks properly.
Closes#49150
Backport of #49186
If CCR encounters a rejected execution exception, today we treat this as
fatal. This is not though, as the stuffed queue could drain. Requiring
an administrator to manually restart the follow tasks that faced such an
exception is a burden. This commit addresses this by making CCR
auto-retry on rejected execution exceptions.
We can't guarantee expected request failures if search request is across
many indexes, as if expected shards fail, some indexes may return 200.
closes#47743
The JdbcHttpClientRequestTests and HttpClientRequestTests classes both
hold a static reference to a mock web server that internally uses the
JDKs built-in HttpServer, which resides in a sun package that the
RamUsageEstimator does not have access to. This causes builds that use
a runtime of Java 8 to fail since the StaticFieldsInvariantRule is run
when Java 8 is used.
Relates #41526
Relates #49105
When triggered either by becoming master, a new cluster state, or a
periodic schedule, an ILM policy execution through
`maybeRunAsyncAction`, `runPolicyAfterStateChange`, or
`runPeriodicStep` throwing an exception will cause the loop the
terminate. This means that any indices that would have been processed
after the index where the exception was thrown will not be processed by
ILM.
For most execution this is not a problem because the actual running of
steps is protected by a try/catch that moves the index to the ERROR step
in the event of a problem. If an exception occurs prior to step
execution (for example, in fetching and parsing the current
policy/step) however, it causes the loop termination previously
mentioned.
This commit wraps the invocation of the methods specified above in a
try/catch block that provides better logging and does not bubble the
exception up.
Backport of #47468 to 7.x
This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:
min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms
This aggregation has been implemented as an analytics plugin.
The rollover action is now a retryable step (see #48256)
so ILM will keep retrying until it succeeds as opposed to stopping and
moving the execution in the ERROR step.
Fixes#49073
(cherry picked from commit 3ae90898121b43032ec8f3b50514d93a86e14d0f)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
# Conflicts:
# x-pack/plugin/ilm/qa/multi-node/src/test/java/org/elasticsearch/xpack/ilm/TimeSeriesLifecycleActionsIT.java
Ensures that methods that are called from different threads ( i.e.
from the callbacks of org.apache.http.concurrent.FutureCallback )
catch `Exception` instead of only the expected checked exceptions.
This resolves a bug where OpenIdConnectAuthenticator#mergeObjects
would throw an IllegalStateException that was never caught causing
the thread to hang and the listener to never be called. This would
in turn cause Kibana requests to authenticate with OpenID Connect
to timeout and fail without even logging anything relevant.
This also guards against unexpected Exceptions that might be thrown
by invoked library methods while performing the necessary operations
in these callbacks.
The cluster state is obtained twice in the EnrichPolicyRunner when updating
the final alias. There is a possibility for the state to be slightly different
between those two calls. This PR just has the function get the cluster state
once and reuse it for the life of the function call.
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
Temporarily "mute" the testReplaceChildren for Pivot since it leads to
failing tests for some seeds, since the new child doesn't respond to a
valid data type.
Relates to #48900
(cherry picked from commit 6200a2207b9a4264d2f3fc976577323c7e084317)
We can have a race here where `scheduleNextRun` executes concurrently to `stop`
and so we run into a `RejectedExecutionException` that we don't catch and thus it
fails tests.
=> Fixed by ignoring these so long as they coincide with a scheduler shutdown
When a node shuts down, `TransportService` moves to stopped state and
then closes connections. If a request is done in between, an exception
was thrown that was not retried in replication actions. Now throw a
wrapped `NodeClosedException` exception instead, which is correctly
handled in replication action. Fixed other usages too.
Relates #42612
This commit fixes an off-by-one bug in the AutoFollowIT test that causes
failures because the leaderIndices counter is incremented during the evaluation
of the leaderIndices.incrementAndGet() < 20 condition but the 20th index is
not created, making the final assertion not verified.
It also gives a bit more time for cluster state updates to be processed on the
follower cluster.
Closes#48982
Ensures that we always use the primary term established by the primary to index docs on the
replica. Makes the logic around replication less brittle by always using the operation primary
term on the replica that is coming from the primary.
This PR makes the following two fixes around updating flattened fields:
* Make sure that the new value for ignore_above is immediately taken into
affect. Previously we recorded the new value but did not use it when parsing
documents.
* Allow depth_limit to be updated dynamically. It seems plausible that a user
might want to tweak this setting as they encounter more data.
When using the move-to-step API, we should reread the phase JSON from
the latest version of the ILM policy. This allows a user to move to the
same step while re-reading the policy's latest version. For example,
when changing rollover criteria.
While manually messing around with some other things I discovered that
we only reread the policy when using the retry API, not the move-to-step
API. This commit changes the move-to-step API to always read the latest
version of the policy.
Backport of #48908
The enrich project doesn't have much history as all the other gradle projects,
so it makes sense to enable spotless for this gradle project.
The `HttpExportBulk` exporter is using a lot more memory than it needs to
by allocating buffers for serialization and IO:
* Remove copying of all bytes when flushing, instead use the stream wrapper
* Remove copying step turning the BAOS into a `byte[]`
* This also avoids the allocation of a single huge `byte[]` and instead makes use of the internal paging logic of the `BytesStreamOutput`
* Don't allocate a new BAOS for every document, just keep appending to a single BAOS
Currently the BulkProcessor class uses a single scheduler to schedule
flushes and retries. Functionally these are very different concerns but
can result in a dead lock. Specifically, the single shared scheduler
can kick off a flush task, which only finishes it's task when the bulk
that is being flushed finishes. If (for what ever reason), any items in
that bulk fails it will (by default) schedule a retry. However, that retry
will never run it's task, since the flush task is consuming the 1 and
only thread available from the shared scheduler.
Since the BulkProcessor is mostly client based code, the client can
provide their own scheduler. As-is the scheduler would require
at minimum 2 worker threads to avoid the potential deadlock. Since the
number of threads is a configuration option in the scheduler, the code
can not enforce this 2 worker rule until runtime. For this reason this
commit splits the single task scheduler into 2 schedulers. This eliminates
the potential for the flush task to block the retry task and removes this
deadlock scenario.
This commit also deprecates the Java APIs that presume a single scheduler,
and updates any internal code to no longer use those APIs.
Fixes#47599
Note - #41451 fixed the general case where a bulk fails and is retried
that can result in a deadlock. This fix should address that case as well as
the case when a bulk failure *from the flush* needs to be retried.
* [ML] Add new geo_results.(actual_point|typical_point) fields for `lat_long` results (#47050)
[ML] Add new geo_results.(actual_point|typical_point) fields for `lat_long` results (#47050)
Related PR: https://github.com/elastic/ml-cpp/pull/809
* adjusting bwc version
CCR follower stats can return information for persistent tasks that are in the process of being cleaned up. This is problematic for tests where CCR follower indices have been deleted, but their persistent follower task is only cleaned up asynchronously afterwards. If one of the following tests then accesses the follower stats, it might still get the stats for that follower task.
In addition, some tests were not cleaning up their auto-follow patterns, leaving orphaned patterns behind. Other tests cleaned up their auto-follow patterns. As always the same name was used, it just depended on the test execution order whether this led to a failure or not. This commit fixes the offensive tests, and will also automatically remove auto-follow-patterns at the end of tests, like we do for many other features.
Closes #48700
We might have some outstanding renew retention lease requests after a
shard has unfollowed. If testRetentionLeaseIsAddedIfItDisappearsWhileFollowing
intercepts a renew request from other tests then we will never unlatch
and the test will time out.
Closes#45192
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
Fixes#48122
testRetentionLeaseIsAddedIfItDisappearsWhileFollowing is still failing
although we already have several fixes. I think other tests interfere
and cause this test to fail. We can use the test scope to isolate them.
However, I prefer to add debug logs so we can find the source.
Relates #45192
Previously this step moved to the forcemerge step, however, if the
machine running the test was fast enough, it would execute the
forcemerge and move to the next step (`segment-count`) so the comparison
would fail. This commit changes the step to be a step that will never go
anywhere else, the terminal step.
Resolves#48761
decouple TransformTask and ClientTransformIndexer. Interaction between the 2 classes are
now moved into a context class which holds shared information.
relates #45369
This commit introduces a consistent, and type-safe manner for handling
global build parameters through out our build logic. Primarily this
replaces the existing usages of extra properties with static accessors.
It also introduces and explicit API for initialization and mutation of
any such parameters, as well as better error handling for uninitialized
or eager access of parameter values.
Closes#42042
In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.
Now that #48770 is merged, this commit is adding logic to
only select one variant of multi-fields.
Closes#48756
Backport of #48799
* ILM Test asserts on the same ilm/_explain output
With the introduction of retryable steps subsequent ilm/_explain calls
can see the state of an ilm cycle move out of the error step. This test
made several assertions assuming that the cycle remains in the error
step so this commit changes the test to make one _explain call and have
all the asserts work on the same ilm state (so subsequent assumptions to
the cycle being in the error step are valid).
* Drop unused field in test.
(cherry picked from commit 44c74bb487151c886a08b27f32b13f7a72056997)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.
This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.
Relates #48756
Backport of #48770
* Introduce binary_format request parameter (binary.format for JDBC) to disable binary
communication between clients (jdbc/odbc) and server.
* for CLI - "binary" command line parameter (or -b) is introduced. Default value is "true".
* binary communication (cbor) is enabled by default
* disabling request parameter introduced for debugging purposes only
(cherry picked from commit f96a5ca61cb9fad9ed59357320af20e669348ce7)
* Add ingest info to Cluster Stats (#48485)
This commit enhances the ClusterStatsNodes response to include global
processor usage stats on a per-processor basis.
example output:
```
...
"processor_stats": {
"gsub": {
"count": 0,
"failed": 0
"current": 0
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0
"current": 0,
"time_in_millis": 0
}
}
...
```
The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the current per-node usage statistics that currently exist in node stats.
Closes#46146.
* fix BWC of ingest stats
The introduction of processor types into IngestStats had a bug.
It was set to `null` and set as the key to the map. This would
throw a NPE. This commit resolves this by setting all the processor
types from previous versions that are not serializing it out to
`_NOT_AVAILABLE`.
This test used an index without an alias to simulate a failure in the
`check-rollover-ready` step. However, with #48256 that step
automatically retries, meaning that the index may not always be in
the ERROR step.
This commit changes the test to use a shrink action with an invalid
number of shards so that it stays in the ERROR step.
Resolves#48767
Previous behavior while copying HTTP headers to the ThreadContext,
would allow multiple HTTP headers with the same name, handling only
the first occurrence and disregarding the rest of the values. This
can be confusing when dealing with multiple Headers as it is not
obvious which value is read and which ones are silently dropped.
According to RFC-7230, a client must not send multiple header fields
with the same field name in a HTTP message, unless the entire field
value for this header is defined as a comma separated list or this
specific header is a well-known exception.
This commits changes the behavior in order to be more compliant to
the aforementioned RFC by requiring the classes that implement
ActionPlugin to declare if a header can be multi-valued or not when
registering this header to be copied over to the ThreadContext in
ActionPlugin#getRestHeaders.
If the header is allowed to be multivalued, then all such headers
are read from the HTTP request and their values get concatenated in
a comma-separated string.
If the header is not allowed to be multivalued, and the HTTP
request contains multiple such Headers with different values, the
request is rejected with a 400 status.
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.
Fixes: #41159Fixes: #40001Fixes: #40240Fixes: #33361Fixes: #46316Fixes: #36074Fixes: #34543Fixes: #37044Fixes: #42041
(cherry picked from commit 3c38ea555984fcd2c6bf9e39d0f47a01b09e7c48)
This adds the infrastructure to be able to retry the execution of retryable
steps and makes the `check-rollover-ready` retryable as an initial step to
make the rollover action more resilient to transient errors.
(cherry picked from commit 454020ac8acb147eae97acb4ccd6fb470d1e5f48)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
At present the ML C++ artifact is always downloaded from
S3. This change adds an option to configure the location.
(The intention is to use a file:/// URL to pick up the
artifact built in a Docker container in ml-cpp PR builds
so that C++ changes that will break Java integration tests
can be detected before the ml-cpp PRs are merged.)
Relates elastic/ml-cpp#766
Moves common field extraction logic to its own package so that it can
be used both for anomaly detection and data frame analytics.
In preparation for refactoring extraction fields to be simpler and to
support multi-fields properly.
Backport of #48709
When we load a JSON Web Key (JWKSet) from the specified
file using JWKSet.load it internally uses IOUtils.readFileToString
but the opened FileInputStream is never closed after usage.
https://bitbucket.org/connect2id/nimbus-jose-jwt/issues/342
This commit reads the file and parses the JWKSet from the string.
This also fixes an issue wherein if the underlying file changed,
for every change event it would add another file watcher. The
change is to only add the file watcher at the start.
Closes#44942
* Un-AwaitsFix and enhance logging for testPolicyCRUD
This removes the `AwaitsFix` and increases the test logging for
`SnapshotLifecycleServiceTests.testPolicyCRUD` in an effort to track
down the cause of #44997.
* Remove unused import
This PR performs the following changes:
* Split `ScoreScriptUtilsTests` into `DenseVectorFunctionTests` and
`SparseVectorFunctionTests`. This will make it easier to delete all sparse
vector function tests once we remove support on 8.x.
* As much as possible, break up the large test methods into individual tests
for each vector function (`cosineSimilarity`, `l2norm`, etc.).
The logger was erroneously using the `SnapshotLifecycleMetadata` class
for its initialization, making it hard to target packages for logging
levels since `SnapshotLifecycleMetadata` is in a different package.
* [ML][Inference] separating definition and config object storage (#48651)
This separates out the `definition` object from being stored within the configuration object in the index.
This allows us to gather the config object without decompressing a potentially large definition.
Additionally, `input` is moved to the TrainedModelConfig object and out of the definition. This is so the trained input fields are accessible outside the potentially large model definition.
This adds a guard for the SLM lifecycle and retention service that
prevents new jobs from being scheduled once the service has been
stopped. Previous if the node were shut down the service would be
stopped, but a cluster state or local master election would cause a job
to attempt to be scheduled. This could lead to an uncaught
`RejectedExecutionException`.
Resolves#47749
The 1MB IO-buffer size per transport thread is causing trouble in
some tests, albeit at a low rate. Reducing the number of transport
threads was not enough to fully fix this situation.
Allowing to configure the size of the buffer and reducing it by
more than an order of magnitude should fix these tests.
Closes#46803
Previously the functions accepted a doc values reference, whereas they now
accept the name of the vector field. Here's an example of how a vector function
was called before and after the change.
```
Before: cosineSimilarity(params.query_vector, doc['field'])
After: cosineSimilarity(params.query_vector, 'field')
```
This seems more intuitive, since we don't allow direct access to vector doc
values and the the meaning of `doc['field']` is unclear.
The PR makes the following changes (broken into distinct commits):
* Add new function signatures of the form `function(params.query_vector,
'field')` and deprecates the old ones. Because Painless doesn't allow two
methods with the same name and number of arguments, we allow a generic `Object`
to be passed in to the function and decide on the behavior through an
`instanceof` check.
* Refactor the class bindings so that the document field is passed to the
constructor instead of the instance method. This allows us to avoid retrieving
the vector doc values on every function invocation, which gives a tiny speed-up
in benchmarks.
Note that this PR adds new signatures for the sparse vector functions too, even
though sparse vectors are deprecated. It seemed simplest to understand (for both
us and users) to keep everything symmetric between dense and sparse vectors.
The format returned by the API is not always parsable with
`Instant.parse()`, so this commit adjusts to parsing those dates as
`ISO_ZONED_DATE_TIME` instead, which appears to always parse the
returned value correctly.
The failures in these tests have been remarkably difficult to track
down, in part because they will not reproduce locally. This commit
unmutes the flaky tests and increases logging, as well as introducing
some additional logging, to attempt to pin down the failures.
The open and close follower steps didn't check if the index is open,
closed respectively, before executing the open/close request.
This changes the steps to check the index state and only perform the
open/close operation if the index is not already open/closed.
The cron schedule "1 2 3 4 5 ?" will run every May 4 at 03:02:01, which
may result in unnecessary test failures once a year. This commit
switches out uses of that schedule in tests for one which will never
execute (because it specifies a day which doesn't exist, Feb. 31). Also
factors the schedule out to a constant to make the intent clearer.
This commit simplifies and standardizes our usage of the Gradle Shadow
plugin to conform more to plugin conventions. The custom "bundle" plugin
has been removed as it's not necessary and performs the same function
as the Shadow plugin's default behavior with existing configurations.
Additionally, this removes unnecessary creation of a "nodeps" artifact,
which is unnecessary because by default project dependencies will in
fact use the non-shadowed JAR unless explicitly depending on the
"shadow" configuration.
Finally, we've cleaned up the logic used for unit testing, so we are
now correctly testing against the shadow JAR when the plugin is applied.
This better represents a real-world scenario for consumers and provides
better test coverage for incorrectly declared dependencies.
(cherry picked from commit 3698131109c7e78bdd3a3340707e1c7b4740d310)
Due to a bug, GETing a snapshot can cause a RespositoryException to be
thrown. This error is transient and should be retried, rather than
causing the test to fail. This commit converts those
RepositoryExceptions into AssertionErrors so that they will be retried
in code wrapped in assertBusy.
When checking for the existence of a document in the ILM/CCR integration
tests, `assertDocumentExists` makes an HTTP request and checks the
response code. However, if the repsonse code is not successful, the call
will throw a `ResponseException`. `assertDocumentExists` is often called
inside an `assertBusy`, and wrapping the `ResponseException` in an
`AssertionError` will allow the `assertBusy` to retry.
In particular, this fixes an issue with `testCCRUnfollowDuringSnapshot`
where the index in question may still be closed when the document is
requested.
Backport of #48452.
The SAML tests have large XML documents within which various parameters
are replaced. At present, if these test are auto-formatted, the XML
documents get strung out over many, many lines, and are basically
illegible.
Fix this by using named placeholders for variables, and indent the
multiline XML documents.
The tests in `SamlSpMetadataBuilderTests` deserve a special mention,
because they include a number of certificates in Base64. I extracted
these into variables, for additional legibility.
This commit ensures that the creation of a DocumentSubsetReader does not
eagerly resolve the role query and the number of docs that match.
We want to delay this expensive operation in order to ensure that we really
need this information when we build it. For this reason the role query and the
number of docs are now resolved on demand. This commit also depends on
https://issues.apache.org/jira/browse/LUCENE-9003 that will also compute the global
number of docs lazily.
* Extract remote "sniffing" to connection strategy (#47253)
Currently the connection strategy used by the remote cluster service is
implemented as a multi-step sniffing process in the
RemoteClusterConnection. We intend to introduce a new connection strategy
that will operate in a different manner. This commit extracts the
sniffing logic to a dedicated strategy class. Additionally, it implements
dedicated tests for this class.
Additionally, in previous commits we moved away from a world where the
remote cluster connection was mutable. Instead, when setting updates are
made, the connection is torn down and rebuilt. We still had methods and
tests hanging around for the mutable behavior. This commit removes those.
* Introduce simple remote connection strategy (#47480)
This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
This commit changes the REST API spec slm.get_lifecycle's policy_id url part to be of type "list", in line with other REST API specs that accept a comma-separated list of values.
Closes#47765
BytesReference is currently an abstract class which is extended by
various implementations. This makes it very difficult to use the
delegation pattern. The implication of this is that our releasable
BytesReference is a PagedBytesReference type and cannot be used as a
generic releasable bytes reference that delegates to any reference type.
This commit makes BytesReference an interface and introduces an
AbstractBytesReference for common functionality.
The AbstractHlrcWriteableXContentTestCase was replaced by a better test
case a while ago, and this is the last two instances using it. They have
been converted and the test is now deleted.
Ref #39745
There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.
Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)
Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.
This change should also fix#43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
We have not seen much adoption of this experimental field type, and don't see a
clear use case as it's currently designed. This PR deprecates the field type in
7.x. It will be removed from 8.0 in a follow-up PR.
7.5+ for SLM requires [stats] object to exist in the cluster state.
When doing an in-place upgrade from 7.4 to 7.5+ [stats] does not exist
in cluster state, result in an exception on startup [1].
This commit moves the [stats] to be an optional object in the parser
and if not found will default to an empty stats object.
[1] Caused by: java.lang.IllegalArgumentException: Required [stats]
Reverting the change introducing IsoLocal.ROOT and introducing IsoCalendarDataProvider that defaults start of the week to Monday and requires minimum 4 days in first week of a year. This extension is using java SPI mechanism and defaults for Locale.ROOT only.
It require jvm property java.locale.providers to be set with SPI,COMPAT
closes#41670
backport #48209
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.
This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__
Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
FIPS 140 bootstrap checks should not be bootstrap checks as they
are always enforced. This commit moves the validation logic within
the security plugin.
The FIPS140SecureSettingsBootstrapCheck was not applicable as the
keystore was being loaded on init, before the Bootstrap checks
were checked, so an elasticsearch keystore of version < 3 would
cause the node to fail in a FIPS 140 JVM before the bootstrap check
kicked in, and as such hasn't been migrated.
Resolves: #34772
The enrich stats api picked the wrong task to be displayed
in the executing stats section.
In case `wait_for_completion` was set to `false` then no task
was being displayed and if that param was set to `true` then
the wrong task was being displayed (transport action task instead
of enrich policy executor task).
Testing executing policies in enrich stats api is tricky.
I have verified locally that this commit fixes the bug.
This PR adds an origin for the Enrich feature, and modifies the background
maintenance task to use the origin when executing client operations.
Without this fix, the maintenance task fails to execute when security is
enabled.
This class is only used by the blob store repository
and CCR and the abstractions didn't really make sense
with CCR ignoring the concrete `restoreFiles` method
completely and having a method used only by the blobstore
overriden as unsupported.
=> Moved to a more fitting set of abstractions
=> Dried up the stream wrapping in `BlobStoreRepository` a little
now that the `restoreFile` method could be simplified
Relates #48110 as it makes changing the API of `FileRestoreContext`
to what is needed for async restores simpler
The test testPauseAndResumeWithMultipleAutoFollowPatterns
failed multiple times, mostly because it creates too many leader
indices and the following cluster cannot cope with cluster state
updates generated by following indices creation and pause/
resume auto-followers changes.
This commit simplifies the test by creating at most 20 leader
indices and by waiting for any new leader index to be picked
up by the auto-follower before created another leader index.
It also pause and resume less auto-followers as previously.
closes#47917
All internal searches (triggered by APIs) across the .security index
must be performed while "under the security origin". Otherwise,
the search is performed in the context of the caller which most
likely does not have privileges to search .security (hopefully).
This commit fixes this in the case of two methods in the
TokenService and corrects an overly done such context switch
in the ApiKeyService.
In addition, this makes all tests from the client/rest-high-level
module execute as an all mighty administrator,
but not a literal superuser.
Closes#47151
This test could fail for two reasons, both should be fixed by this PR:
1) It hit a timeout for an `assertBusy`. This commit increases the
timeout for that `assertBusy`.
2) The snapshot that was supposed to be blocked could, in fact, be
successful. This is because a previous snapshot had been successfully
been taken, and no new data had been added between the two snapshots.
This means that no new segment files needed to be written for the new
snapshot, so the block on data files was never triggered. This commit
changes two things: First, it indexes some new data before taking the
second snapshot (the one that needs to be blocked), and second,
checks to ensure that the block is actually hit before continuing
with the test.
The logic for handling empty segment files has been
unnecessary ever since #24021 which removes the support
for these files in 6.x -> we can safely remove the
support for restoring these from 7.x+ to simplify the code.
There is no reason to still resolve the
fallback `IndexId` here. It only applies to
`2.x` repos and those we can't read anymore
anyway because they use an `/index` instead of
an `/index-N` blob at the repo root for which
at least 7.x+ does not contain the logic to find
it.
If a job stops right after reindexing is finished but before
we refresh the destination index, we don't refresh at all.
If the job is started again right after, it jumps into the analyzing state.
However, the data is still not searchable.
This is why we were seeing test failures that we start the process
expecting X rows (where X is lower than the expected number of docs)
and we end up getting X+.
We fix this by moving the refresh of the dest index right before
we start the process so it always ensures the data is searchable.
Closes#47612
Backport of #48090
The after snapshot action is interfering with SLM deleting snapshots
here it seems, causing concurrent delete exceptions.
Since these tests are now test-scoped there is no reason to run
snapshot deletes after each test so we can remove them to avoid this issue.
Closes#47937
We were not closing repositories on Node shutdown.
In production, this has little effect but in tests
shutting down a node using `MockRepository` and is
currently stuck in a simulated blocked-IO situation
will only unblock when the node's threadpool is
interrupted. This might in some edge cases (many
snapshot threads and some CI slowness) result
in the execution taking longer than 5s to release
all the shard stores and thus we fail the assertion
about unreleased shard stores in the internal test cluster.
Regardless of tests, I think we should close repositories
and release resources associated with them when closing
a node and not just when removing a repository from the CS
with running nodes as this behavior is really unexpected.
Fixes#47689
* Add SLM support to xpack usage and info APIs
This is a backport of #48096
This adds the missing xpack usage and info information into the
`/_xpack` and `/_xpack/usage` APIs. The output now looks like:
```
GET /_xpack/usage
{
...
"slm" : {
"available" : true,
"enabled" : true,
"policy_count" : 1,
"policy_stats" : {
"retention_runs" : 0,
...
}
}
```
and
```
GET /_xpack
{
...
"features" : {
...
"slm" : {
"available" : true,
"enabled" : true
},
...
}
}
```
Relates to #43663
* Fix missing license
This adds parsing an inference model as a possible
result of the analytics process. When we do parse such a model
we persist a `TrainedModelConfig` into the inference index
that contains additional metadata derived from the running job.
Previously when a numeric literal was enclosed in parentheses and then
negated, the negation was lost and the number was considered positive, e.g.:
`-(5)` was considered as `5` instead of `-5`
`- ( (1.28) )` was considered as `1.28` instead of `-1.28`
Fixes: #48009
(cherry picked from commit 4dee4bf3b34081062ba2e28ab8524a066812a180)
Audit messages are stored with millisecond timestamps. If two
messages have the same millisecond timestamp then asserting on
their order is impossible given the information available.
This PR changes the assertion on audit messages in the native
data frame analytics tests to assert that the expected audit
messages exist in any order.
Fixes#48035
which is backport merge and adds a new ingest processor, named enrich processor,
that allows document being ingested to be enriched with data from other indices.
Besides a new enrich processor, this PR adds several APIs to manage an enrich policy.
An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.
Related to #32789
max_empty_searches = -1 in a datafeed update implies
max_empty_searches will be unset on the datafeed when
the update is applied. The isNoop() method needs to
take this -1 to null equivalence into account.
Previously, the safety check for the 2nd argument of the DateAddProcessor was
restricting it to Integer which was wrong since we allow all non-rational
numbers, so it's changed to a Number check as it's done in other cases.
Enhanced some tests regarding the check for an integer (non-rational
argument).
(cherry picked from commit 0516b6eaf5eb98fa5bd087c3fece80139a6b118e)
This change adds:
- A new option, allow_lazy_open, to anomaly detection jobs
- A new option, allow_lazy_start, to data frame analytics jobs
Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)
Additionally, the ML nightly maintenance tasks now
creates audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.
A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.
This PR adds the ability to run the enrich policy execution task in the background,
returning a task id instead of waiting for the completed operation.
Prior to this change the `target_field` would always be a json array
field in the document being ingested. This to take into account that
multiple enrich documents could be inserted into the `target_field`.
However the default `max_matches` is `1`. Meaning that by default
only a single enrich document would be added to `target_field` json
array field.
This commit changes this; if `max_matches` is set to `1` then the single
document would be added as a json object to the `target_field` and
if it is configured to a higher value then the enrich documents will be
added as a json array (even if a single enrich document happens to be
enriched).
Currently, partial snapshots will eventually build up unless they are
manually deleted. Partial snapshots may be useful if there is not a more
recent successful snapshot, but should eventually be deleted if they are
no longer useful.
With this change, partial snapshots are deleted using the following
strategy: PARTIAL snapshots will be kept until the configured
expire_after period has passed, if present, and then be deleted. If
there is no configured expire_after in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Partial snapshots are not counted towards either min_count or
max_count.
Adds a new datafeed config option, max_empty_searches,
that tells a datafeed that has never found any data to stop
itself and close its associated job after a certain number
of real-time searches have returned no data.
Backport of #47922
Usually syslog timestamps have two spaces before a single
digit day-of-month. However, in some non-syslog cases
where syslog-like timestamps are used there is only one
space. The grok pattern supports this, so the timestamp
parser should too. This change makes the
find_file_structure endpoint do this.
Also fixes another problem that the same test case
exposed in the find_file_structure endpoint, which was
that the exclude_lines_pattern for delimited files was
always created on the assumption the delimiter was a
comma. Now it is based on the actual delimiter.
This commit adds two APIs that allow to pause and resume
CCR auto-follower patterns:
// pause auto-follower
POST /_ccr/auto_follow/my_pattern/pause
// resume auto-follower
POST /_ccr/auto_follow/my_pattern/resume
The ability to pause and resume auto-follow patterns can be
useful in some situations, including the rolling upgrades of
cluster using a bi-directional cross-cluster replication scheme
(see #46665).
This commit adds a new active flag to the AutoFollowPattern
and adapts the AutoCoordinator and AutoFollower classes so
that it stops to fetch remote's cluster state when all auto-follow
patterns associate to the remote cluster are paused.
When an auto-follower is paused, remote indices that match the
pattern are just ignored: they are not added to the pattern's
followed indices uids list that is maintained in the local cluster
state. This way, when the auto-follow pattern is resumed the
indices created in the remote cluster in the meantime will be
picked up again and added as new following indices. Indices
created and then deleted in the remote cluster will be ignored
as they won't be seen at all by the auto-follower pattern at
resume time.
Backport of #47510 for 7.x
Previously, Nullability was set to UNKNOWN instead of TRUE which
resulted on QueryFolder not correctly folding to NULL if any of the args
was null.
Remove the overriding nullable() also for DatePart/DateTrunc to allow
delegation the parent class.
(cherry picked from commit 05a7108e133b5ae7bec2257db5ae2d30ad926ee2)
Joda was using ResolverStyle.STRICT when parsing. This means that date will be validated to be a correct year, year-of-month, day-of-month
However, we also want to make it works with Year-Of-Era as Joda used to, hence custom temporalquery.localdate in DateFormatters.from
Within DateFormatters we use the correct uuuu year instead of yyyy year of era
worth noting: if yyyy(without an era) is used in code, the parsing result will be a TemporalAccessor which will fail to be converted into LocalDate. We mostly use DateFormatters.from so this takes care of this. If possible the uuuu format should be used.
* [ML][Analytics] fix bug where regression deleted early does not delete state (#47885)
* [ML][Analytics] fix bug where regression deleted early does not delete state
* Fixing ml with security test failure
* fixing for older java
Changes the execution logic to create a new task using the execute request,
and attaches the new task to the policy runner to be updated. Also, a new
response is now returned from the execute api, which contains either the task
id of the execution, or the completed status of the run. The fields are mutually
exclusive to make it easier to discern what type of response it is.
rename internal indexes of transform plugin
- rename audit index and create an alias for accessing it, BWC: add an alias for old indexes to
keep them working, kibana UI will switch to use the read alias
- rename config index and provide BWC to read from old and new ones
Batch transforms automatically stop after all data has processed, therefore tests can not reliable test the state. This change rewrites tests to remove the unreliable tests or use continuous transforms instead as they do not auto-stop.
fixes#47441
One of the tests in this suit stops a master node,
plus we're doing other node starts in this suit.
=> the internal test cluster should be TEST and not `SUITE`
scoped to avoid random failures like the one in #47834Closes#47834
The random timestamps were landing too close to the current time,
so an unlucky rollup interval would round such that the doc wasn't
included in the search range (and thus not "rolled up") which
would then fail the test.
The fix is to make sure the timestamp of all docs is sufficiently behind
'now' that the possible rounding intervals will always include them.
Backport of #38753 to 7.x where the test was still muted.
Refactor DateTrunc and DatePart to use separate Pipe classes which
allows the removal of the BinaryDateOperation enum.
(cherry picked from commit a6075e7718dff94a90dbc0795dd924dcb7641092)
Currently if the document being ingested contains another field value
than a string then the processor fails with an error.
This commit changes the match processor to handle number values
and array values correctly.
If a json array is detected then the `terms` query is used instead
of the `term` query.
Especially in the snapshot code there's a lot
of logic chaining `ActionRunnables` in tricky
ways now and the code is getting hard to follow.
This change introduces two convinience methods that
make it clear that a wrapped listener is invoked with
certainty in some trickier spots and shortens the code a bit.
Backport of (#47721) for 7.x.
Similarly to #47582, Auto-follow patterns creates following
indices as long as the remote index matches the pattern and
the remote primary shards are all started. But since 7.2 closed
indices are also replicated, and it does not play well with CCR
auto-follow patterns as they create following indices for closed
leader indices too.
This commit changes the getLeaderIndicesToFollow() so that closed
indices are excluded from auto-follow patterns.
If a cluster sending monitoring data is unhealthy and triggers an
alert, then stops sending data the following exception [1] can occur.
This exception stops the current Watch and the behavior is actually
correct in part due to the exception. Simply fixing the exception
introduces some incorrect behavior. Now that the Watch does not
error in the this case, it will result in an incorrectly "resolved"
alert. The fix here is two parts a) fix the exception b) fix the
following incorrect behavior.
a) fixing the exception is as easy as checking the size of the
array before accessing it.
b) fixing the following incorrect behavior is a bit more intrusive
- Note - the UI depends on the success/met state for each condition
to determine an "OK" or "FIRING"
In this scenario, where an unhealthy cluster triggers an alert and
then goes silent, it should keep "FIRING" until it hears back that
the cluster is green. To keep the Watch "FIRING" either the index
action or the email action needs to fire. Since the Watch is neither
a "new" alert or a "resolved" alert, we do not want to keep sending
an email (that would be non-passive too). Without completely changing
the logic of how an alert is resolved allowing the index action to
take place would result in the alert being resolved. Since we can
not keep "FIRING" either the email or index action (since we don't
want to resolve the alert nor re-write the logic for alert resolution),
we will introduce a 3rd action. A logging action that WILL fire when
the cluster is unhealthy. Specifically will fire when there is an
unresolved alert and it can not find the cluster state.
This logging action is logged at debug, so it should be noticed much.
This logging action serves as an 'anchor' for the UI to keep the state
in an a "FIRING" status until the alert is resolved.
This presents a possible scenario where a cluster starts firing,
then goes completely silent forever, the Watch will be "FIRING"
forever. This is an edge case that already exists in some scenarios
and requires manual intervention to remove that Watch.
This changes changes to use a template-like method to populate the
version_created for the default monitoring watches. The version is
set to 7.5 since that is where this is first introduced.
Fixes#43184
The currently logic shard selecting logic selects a random shard copy
instead of selecting the local shard copy and if local copy is not
available then selecting a random shard copy. The latter is desired
behaviour for enrich.
By reusing `OperationRouting#searchShards(...)` we get the desired
behaviour and reuse the same logic that the search api is using.
* Separate SLM stop/start/status API from ILM
This separates a start/stop/status API for SLM from being tied to ILM's
operation mode. These APIs look like:
```
POST /_slm/stop
POST /_slm/start
GET /_slm/status
```
This allows administrators to have fine-grained control over preventing
periodic snapshots and deletions while performing cluster maintenance.
Relates to #43663
* Allow going from RUNNING to STOPPED
* Align with the OperationMode rules
* Fix slmStopping method
* Make OperationModeUpdateTask constructor private
* Wipe snapshots better in test
Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.
With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.
Adds a check when running an Enrich policy to make sure that an Enrich index
is force merged down to one segment, and if it was not fully merged, attempts
the merge again, up to a configurable number of times.
When exceptions could be returned from another node, the exception
might be wrapped in a `RemoteTransportException`. In places where
we handled specific exceptions using `instanceof` we ought to unwrap
the cause first.
This commit attempts to fix this issue after searching code in the ML
plugin.
Backport of #47676
* Convert RunTask to use testclusers, remove ClusterFormationTasks
This PR adds a new RunTask and a way for it to start a
testclusters cluster out of band and block on it to replace
the old RunTask that used ClusterFormationTasks.
With this we can now remove ClusterFormationTasks.
Previously when retrieving an SLM policy it would always return a 200
with `{}` in the body, even if the policy did not exist. This changes
that behavior to throw an error (similar to our other APIs) if a
policy doesn't exist.
This also adds a basic CRUD yml test for the behavior.
Resolves#47664
this commit introduces a geo-match enrich processor that looks up a specific
`geo_point` field in the enrich-index for all entries that have a geo_shape match field
that meets some specific relation criteria with the input field.
For example, the enrich index may contain documents with zipcodes and their respective
geo_shape. Ingesting documents with a geo_point field can be enriched with which zipcode
they associate according to which shape they are contained within.
this commit also refactors some of the MatchProcessor by moving a lot of the shared code to
AbstractEnrichProcessor.
Closes#42639.
If a thread pool rejection exception happens, an alternative code
path is chosen to write history and delete the trigger. If an exception
happens during deletion of the trigger an exception may be thrown and not
caught.
This commit catches the exception and provides a meaning error message.
fixes#47008
When deactivating a watch, there is a chance that it is fully deactivated
and reporting as not running but the history is not fully written yet.
There is not a tight coupling between the associated watcher history
index and the deactivation. This test assumes that once a watch is
deactivated that all history is fully written in a very short time period.
If the Watch is deactivated, but the history is slow to write it can result
in a failing test.
This change removes an assertion that assumes that the deactivation of a watch
ensured the all of the watch history was written. There is still a minor race
condition with respect to the remaining history assertions. However, if the
history is slow to be written, it will allow the test to still passing.
fixes#47503
Adds the following parameters to `outlier_detection`:
- `compute_feature_influence` (boolean): whether to compute or not
feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
to the feature values
Backport of #47600
Previously, we supported only the format `{fn <FUNCTION_NAME>()}`
but other DBs like MSSQL, DB2, MariaDB/MySQL alos allow whitespaces
between `{` and `fn`. Furhermore, also some applications - like PowerBI -
generate escape sequences with spaces: `select { fn name(params) } etc.`
Add support for white spaces between `{` and the escape pattern definition
like `fn`, `ts`, `d`, `guid` etc.
Closes: #47401
(cherry picked from commit 08a22d0b393f4a76c52dabc5e7b9cafcc19c30ca)
Use case:
User with `create_doc` index privilege will be allowed to only index new documents
either via Index API or Bulk API.
There are two cases that we need to think:
- **User indexing a new document without specifying an Id.**
For this ES auto generates an Id and now ES version 7.5.0 onwards defaults to `op_type` `create` we just need to authorize on the `op_type`.
- **User indexing a new document with an Id.**
This is problematic as we do not know whether a document with Id exists or not.
If the `op_type` is `create` then we can assume the user is trying to add a document, if it exists it is going to throw an error from the index engine.
Given these both cases, we can safely authorize based on the `op_type` value. If the value is `create` then the user with `create_doc` privilege is authorized to index new documents.
In the `AuthorizationService` when authorizing a bulk request, we check the implied action.
This code changes that to append the `:op_type/index` or `:op_type/create`
to indicate the implied index action.
This commit adds support to retrieve all API keys if the authenticated
user is authorized to do so.
This removes the restriction of specifying one of the
parameters (like id, name, username and/or realm name)
when the `owner` is set to `false`.
Closes#46887
Backport of (#47582)
Today when following a new leader index, we fetch the remote cluster state,
check the remote cluster license, check the user privileges, retrieve the
index shard stats before initiating a CCR restore session.
But if the leader index to follow is closed, we're executing a bunch of
operations that would inevitability fail at some point (on retrieving the
index shard stats, because this type of request forbid closed indices
when resolving indices). We could fail a Put Follow request at the first
step by checking the leader index state directly from the remote cluster
state.
This also helps the Resume Follow API to fail a bit earlier.
An index with an ILM policy that has a rollover action in one of the
phases was rolled over when the ILM conditions dictated regardless if
it was already rolled over (eg. manually after modifying an index
template in order to force the creation of a new index that uses the new
mappings).
This changes this behaviour and has ILM check if the index it's about to
roll has not been rolled over in the meantime.
(cherry picked from commit 37d6106feeb9f9369519117c88a9e7e30f3ac797)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit lifts the validation of the monitoring hosts setting into
the setting itself, rather than when the setting is used. This prevents
a scenario where an invalid value for the setting is accepted, but then
later fails while applying a cluster state with the invalid setting.
This adds a default for the `slm.retention_schedule` setting, setting it
to `0 30 1 * * ?` which is 1:30am every day.
Having retention unset meant that it would never be invoked and clean up
snapshots. We determined it would be better to have a default than never
to be run. When coming to a decision, we weighed the option of an
absolute time (such as 1:30am) versus a periodic invocation (like every
12 hours). In the end we decided on the absolute time because it has
better predictability and consistency than a periodic invocation, which
would rely on when the master node were elected or restarted.
Relates to #43663
Adds a pipeline that removes ids and routing from documents before indexing
them into enrich indices. Enrich documents may come from multiple indices,
and thus have id collisions on them. This pipeline ensures that documents
with colliding id fields do not clobber one another during the reindex operation
while executing an enrich policy.
When an ML job runs the memory required can be
broken down into:
1. Memory required to load the executable code
2. Instrumented model memory
3. Other memory used by the job's main process or
ancilliary processes that is not instrumented
Previously we added a simple fixed overhead to
account for 1 and 3. This was 100MB for anomaly
detection jobs (large because of the completely
uninstrumented categorization function and
normalize process), and 20MB for data frame
analytics jobs.
However, this was an oversimplification because
the executable code only needs to be loaded once
per machine. Also the 100MB overhead for anomaly
detection jobs was probably too high in most cases
because categorization and normalization don't use
_that_ much memory.
This PR therefore changes the calculation of memory
requirements as follows:
1. A per-node overhead of 30MB for _only_ the first
job of any type to be run on a given node - this
is to account for loading the executable code
2. The established model memory (if applicable) or
model memory limit of the job
3. A per-job overhead of 10MB for anomaly detection
jobs and 5MB for data frame analytics jobs, to
account for the uninstrumented memory usage
This change will enable more jobs to be run on the
same node. It will be particularly beneficial when
there are a large number of small jobs. It will
have less of an effect when there are a small number
of large jobs.
When API key is invalidated we do two things first it tries to trigger `ExpiredApiKeysRemover` task
and second, we do index the invalidation for the API key. The index invalidation may happen
before the `ExpiredApiKeysRemover` task is run and in that case, the API key
invalidated will also get deleted. If the `ExpiredApiKeysRemover` runs before the
API key invalidation is indexed then the API key is not deleted and will be
deleted in the future run.
This behavior was not captured in the tests related to `ExpiredApiKeysRemover`
causing intermittent failures.
This commit fixes those tests by checking if the API key invalidated is reported
back when we get API keys after invalidation and perform the checks based on that.
Closes#41747
The changes introduced in #47179 made it so that we could try to
build an SSLContext with verification mode set to None, which is
not allowed in FIPS 140 JVMs. This commit address that
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves#46508
Relates to #43663
* Fix AllocationRoutedStepTests.testConditionMetOnlyOneCopyAllocated
These tests were using randomly generated includes/excludes/requires for
routing, however, it was possible to generate mutually exclusive
allocation settings (about 1 out of 50,000 times for my runs).
This splits the test into three different tests, and removes the
randomization (it doesn't add anything to the testing here) to fix the
issue.
Resolves#47142
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can move to cleanup the tests in a follow up but this work fits
better to be done with or after we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
While it seemed like the PUT data frame analytics action did not
have to be a master node action as the config is stored in an index
rather than the cluster state, there are other subtle nuances which
make it worthwhile to convert it. In particular, it helps maintain
order of execution for put actions which are anyhow user driven and
are expected to have low volume.
This commit converts `TransportPutDataFrameAnalyticsAction` from
a handled transport action to a master node action.
Note this means that the action might fail in a mixed cluster
but as the API is still experimental and not widely used there will
be few moments more suitable to make this change than now.
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
XPackPlugin holds data in statics and can only be initialized once. This
caused tests to fail primarily when running with a low max-workers.
Replaced usages with the LocalStateCompositeXPackPlugin, which handles
this properly for testing.
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large amount of documents.
This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)
Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)
Relates #47103
This commit restores the model state if available in data
frame analytics jobs.
In addition, this changes the start API so that a stopped job
can be restarted. As we now store the progress in the state index
when the task is stopped, we can use it to determine what state
the job was in when it got stopped.
Note that in order to be able to distinguish between a job
that runs for the first time and another that is restarting,
we ensure reindexing progress is reported to be at least 1
for a running task.
Due to a regression bug the metadata Active Directory realm
setting is ignored (it works correctly for the LDAP realm type).
This commit redresses it.
Closes#45848
DATE_PART(<datetime unit>, <date/datetime>) is a function that allows
the user to extract the specified unit from a date/datetime field
similar to the EXTRACT (<datetime unit> FROM <date/datetime>) but
with different names and aliases for the units and it also provides more
options like `DATE_PART('tzoffset', datetimeField)`.
Implemented following the SQL server's spec: https://docs.microsoft.com/en-us/sql/t-sql/functions/datepart-transact-sql?view=sql-server-2017
with the difference that the <datetime unit> argument is either a
literal single quoted string or gets a value from a table field, whereas
in SQL server keywords are used (unquoted identifiers) and it's not
possible to use a value coming for a table column.
Closes: #46372
(cherry picked from commit ead743d3579eb753fd314d4a58fae205e465d72e)
* [ML][Inference] adding .ml-inference* index and storage (#47267)
* [ML][Inference] adding .ml-inference* index and storage
* Addressing PR comments
* Allowing null definition, adding validation tests for model config
* fixing line length
* adjusting for backport
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can move to cleanup the tests in a follow up but this work fits
better to be done with or after we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
- Build paths with PathUtils#get instead of hard-coding a string with
forward slashes.
- Do not try to match the whole message that includes paths. The
file separator is `\\` in windows but when we throw an Elasticsearch
Exception, the message is formatted with LoggerMessageFormat#format
which replaces `\\` with `\` in Path names. That means that in Windows
the Exception message will contain paths with single backslashes while
the expected string that comes from Path#toString on filename and
env.configFile will contain double backslashes. There is no point in
attempting to match the whole message string for the purpose of this test.
Resolves: #45598
Add examples of failures for both sql and csv integeration
tests and instructions on how to mute them.
(cherry picked from commit 591bba46516d770f5fc95a4c536dd7448b74dd49)
As a result of #45689 snapshot finalization started to
take significantly longer than before. This may be a
little unfortunate since it increases the likelihood
of failing to finalize after having written out all
the segment blobs.
This change parallelizes all the metadata writes that
can safely run in parallel in the finalization step to
speed the finalization step up again. Also, this will
generally speed up the snapshot process overall in case
of large number of indices.
This is also a nice to have for #46250 since we add yet
another step (deleting of old index- blobs in the shards
to the finalization.
When an integration test fails before the assertion of the results it's
missing information, like the file name and the line in the file where
the test resides.
(cherry picked from commit 683dc7213311d13c81e06829e08f3f9f80ebf73a)
Previously, if a column (field, scalar, alias) appeared more than once in the
SELECT list, the value was returned only once (1st appearance) in each row.
Fixes: #41811
(cherry picked from commit 097ea36581a751605fc4f2088319d954ce35b5d1)
Currently the policy config is placed directly in the json object
of the toplevel `policies` array field. For example:
```
{
"policies": [
{
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
]
}
```
This change adds a `config` field in each policy json object:
```
{
"policies": [
{
"config": {
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
}
]
}
```
This allows us in the future to add other information about policies
in the get policy api response.
The UI will consume this API to build an overview of all policies.
The UI may in the future include additional information about a policy
and the plan is to include that in the get policy api, so that this
information can be gathered in a single api call.
An example of the information that is likely to be added is:
* Last policy execution time
* The status of a policy (executing, executed, unexecuted)
* Information about the last failure if exists
A refactoring in 6.6 meant that the ML daily
maintenance actions have not been run at all
since then. This change installs the local
master listener that schedules the ML daily
maintenance, and also defends against some
subtle race conditions that could occur in the
future if a node flipped very quickly between
master and non-master.
Fixes#47003
These settings were using get raw to fallback to whether or not SSL is
enabled. Yet, we have a formal mechanism for falling back to a
setting. This commit cuts over to that formal mechanism.
When we added support for wildcard application names, we started to build
the prefix query along with the term query but we used 'filter' clause
instead of 'should', so this would not fetch the correct application
privilege descriptor thereby failing the _has_privilege checks.
This commit changes the clause to use should and with minimum_should_match
as 1.
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
We disable MSU optimization if the local checkpoint is smaller than
max_seq_no_of_updates. Hence, we need to relax the MSU assertion in
FollowingEngine for that scenario. Suppose the leader has three
operations: index-0, delete-1, and index-2 for the same doc Id. MSU on
the leader is 1 as index-2 is an append. If the follower applies index-0
then index-2, then the assertion is violated.
Closes#47137
To be on the safe side in terms of use cases also add the alias
DATETRUNC to the DATE_TRUNC function.
Follows: #46473
(cherry picked from commit 9ac223cb1fc66486f86e218fa785a32b61e9bacc)
Drop the usage of `SimpleDateFormat` and use the `DateFormatter` instead
(cherry picked from commit 7cf509a7a11ecf6c40c44c18e8f03b8e81fcd1c2)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
In some cases, the fetch size affects the way the groups are returned
causing the last page to go beyond the limit. Add dedicated check to
prevent extra data from being returned.
Fix#47002
(cherry picked from commit f4c29646f097bbd29855300342823ef4cef61c05)
Enables support for Cartesian geometries shape type. We still need to
decide how to handle the distance function since it is currently using
the haversine distance formula and returns results in meters, which
doesn't make any sense for Cartesian geometries.
Closes#46412
Relates to #43644
When the ML native multi-node tests use _cat/indices/_all
and the request goes to a non-master node, _all is
translated to a list of concrete indices by the authz layer
on the coordinating node before the request is forwarded
to the master node. Then it is possible for the master
node to return an index_not_found_exception if one of
the concrete indices that was expanded on the
coordinating node has been deleted in the meantime.
(#47159 has been opened to track the underlying problem.)
It has been observed that the index that gets deleted when
the problem affects the ML native multi-node tests is
always the ML notifications index. The tests that fail are
only interested in the presence or absense of ML results
indices. Therefore the workaround is to only _cat indices
that match the ML results index pattern.
Fixes#45652
This change also slightly modifies the stats response,
so that is can easier consumer by monitoring and other
users. (coordinators stats are now in a list instead of
a map and has an additional field for the node id)
Relates to #32789
In the current implementation, the validation of the role query
occurs at runtime when the query is being executed.
This commit adds validation for the role query when creating a role
but not for the template query as we do not have the runtime
information required for evaluating the template query (eg. authenticated user's
information). This is similar to the scripts that we
store but do not evaluate or parse if they are valid queries or not.
For validation, the query is evaluated (if not a template), parsed to build the
QueryBuilder and verify if the query type is allowed.
Closes#34252
This change merges the `ShardSearchTransportRequest` and `ShardSearchLocalRequest`
into a single `ShardSearchRequest` that can be used to create a SearchContext.
Relates #46523
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
* ILM: parse origination date from index name (#46755)
Introduce the `index.lifecycle.parse_origination_date` setting that
indicates if the origination date should be parsed from the index name.
If set to true an index which doesn't match the expected format (namely
`indexName-{dateFormat}-optional_digits` will fail before being created.
The origination date will be parsed when initialising a lifecycle for an
index and it will be set as the `index.lifecycle.origination_date` for
that index.
A user set value for `index.lifecycle.origination_date` will always
override a possible parsable date from the index name.
(cherry picked from commit c363d27f0210733dad0c307d54fa224a92ddb569)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Drop usage of Map.of to be java 8 compliant
* Wait for snapshot completion in SLM snapshot invocation
This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.
For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.
Relates to #38461 and #43663
Using arrays of objects with embedded IDs is preferred for new APIs over
using entity IDs as JSON keys. This commit changes the SLM stats API to
use the preferred format.
* [ML][Inference] Feature pre-processing objects and functions (#46777)
To support inference on pre-trained machine learning models, some basic feature encoding will be necessary. I am using a named object serialization approach so new encodings/pre-processing steps could be added in the future.
This PR lays down the ground work for 3 basic encodings:
* HotOne
* Target Mean
* Frequency
More feature encodings or pre-processings could be added in the future:
* Handling missing columns
* Standardization
* Label encoding
* etc....
* fixing compilation for namedxcontent tests
This change allows for the caller of the `saml/prepare` API to pass
a `relay_state` parameter that will then be part of the redirect
URL in the response as the `RelayState` query parameter.
The SAML IdP is required to reflect back the value of that relay
state when sending a SAML Response. The caller of the APIs can
then, when receiving the SAML Response, read and consume the value
as it see fit.
This change adds a check to the migration tool that warns about the deprecated
`enabled` setting for the `_field_names` field on 7.x indices and issues a
warning for templates containing this setting, which has been removed
with 8.0.
Relates to #42854, #46681
When we rewrite alias requests, after filtering down to only those that
the user is authorized to see, it can be that there are no aliases
remaining in the request. However, core Elasticsearch interprets this as
_all so the user would see more than they are authorized for. To address
this, we previously rewrote all such requests to have aliases `"*"`,
`"-*"`, which would be interpreted when aliases are resolved as
nome. Yet, this is only needed for get aliases requests and we were
applying it to all alias requests, including remove index requests. If
such a request was sent to a coordinating node that is not the master
node, the request would be rewritten to include `"*"` and `"-*"`, and
then the master would authorize the user for these. If the user had
limited permissions, the request would fail, even if they were
authorized on the index that the remove index action was over. This
commit addresses this by rewriting for get aliases and remove
aliases request types but not for the remove index.
Co-authored-by: Albert Zaharovits <albert.zaharovits@elastic.co>
Co-authored-by: Tim Vernum <tim@adjective.org>
This change works around JDK-8213202, which is a bug related to TLSv1.3
session resumption before JDK 11.0.3 that occurs when there are
multiple concurrent sessions being established. Nodes connecting to
each other will trigger this bug when client authentication is
disabled, which is the case for SSLClientAuthTests.
Backport of #46680
Previously, queries on the _index field were not able to specify index aliases.
This was a regression in functionality compared to the 'indices' query that was
deprecated and removed in 6.0.
Now queries on _index can specify an alias, which is resolved to the concrete
index names when we check whether an index matches. To match a remote shard
target, the pattern needs to be of the form 'cluster:index' to match the
fully-qualified index name. Index aliases can be specified in the following query
types: term, terms, prefix, and wildcard.
This commit replaces the SearchContext used in AbstractQueryTestCase with
a QueryShardContext in order to reduce the visibility of search contexts.
Relates #46523
Add initial PIVOT support for transforming a regular table into a
statistics table around an arbitrary pivoting column:
SELECT * FROM
(SELECT languages, country, salary, FROM mp)
PIVOT (AVG(salary) FOR countries IN ('NL', 'DE', 'ES', 'RO', 'US'))
In the current implementation PIVOT allows only one aggregation however
this restriction is likely to be lifted in the future.
Also not all aggregations are working, in particular MatrixStats are not yet supported.
(cherry picked from commit d91263746a222915c570d4a662ec48c1d6b4f583)
This commit changes the GET REST api so it will accept an optional comma
separated list of enrich policy ids. This change also modifies the
behavior of the GET API in that it will not error if it is passed a bad
enrich id anymore, but will instead just return an empty list.
This commit adds the ability to require an ingest pipeline on an
index. Today we can have a default pipeline, but that could be
overridden by a request pipeline parameter. This commit introduces a new
index setting index.required_pipeline that acts similarly to
index.default_pipeline, except that it can not be overridden by a
request pipeline parameter. Additionally, a default pipeline and a
request pipeline can not both be set. The required pipeline can be set
to _none to ensure that no pipeline ever runs for index requests on that
index.
When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat
as well) + coordinating nodes that do not have the ingest processor functionality, this can lead
to a NullPointerException.
The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when
the request contains auto-generated IDs. The response serialization is lenient for our
REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for
communication between nodes) assumes for this field to be non-null, which means that it can't
be serialized between nodes. Bulk requests with ingest functionality are processed on the
coordinating node if the node has the ingest capability, and only otherwise sent to a different
node. This means that, in order to reproduce this, one needs two nodes, with the coordinating
node not having the ingest functionality.
Closes#46678
The fact that this test randomly uses a relatively large number
of nodes and hence Netty worker threads created a problem with
running out of direct memory on CI.
Tests run with 512M heap (and hence 512M direct memory) by default.
On a CI worker with 16 cores, this means Netty will by default set
up 32 transport workers. If we get unlucky and a lot of them
actually do work (and thus instantiate a `CopyBytesSocketChannel`
which costs 1M per thread for the thread-local IO buffer) we
would run out of memory.
This specific failure was only seen with `NativeRealmIntegTests` so I
only added the constraint on the Netty worker count here.
We can add it to other tests (or `SecurityIntegTestCase`) if need be
but for now it doesn't seem necessary so I opted for least impact.
Closes#46803
This commit reuses the same state processor that is used for autodetect
to parse state output from data frame analytics jobs. We then index the
state document into the state index.
Backport of #46804
* [ML][Transforms] remove `force` flag from _start (#46414)
* [ML][Transforms] remove `force` flag from _start
* fixing expected error message
* adjusting bwc version
It is possible for a running analytics job that its config is removed
from the '.ml-config' index (perhaps the user deleted the entire index,
etc.). In that case the task remains without a matching config. I have
raised #46781 to discuss how to deal with this issue.
This commit focuses on `MlMemoryTracker` and changes it so that when
we get the configs for the running tasks we leniently ignore missing ones.
This at least means memory tracking will keep working for other jobs
if one or more are missing.
In addition, this commit makes the cleanup code for native analytics
tests more robust by explicitly stopping all jobs and force-stopping
if an error occurs. This helps so that a single failing test does
not cause other tests fail due to pending tasks.
Backport of #46789
Since the `resolveAllDependencies` task resolves all the congfigurations
it can find, this was not caught by our testing, but it's required to be
configuraed specifically.
We should probably cut-over to the new configurations at some point to
avoid problems like this.
Closeselastic/infra#14580
* Give kibana user reserved role privileges on .apm-* to create APM agent configuration index.
* fixed test to include checking all .apm-* permissions
* changed pattern from ".apm-*" to the more specific ".apm-agent-configuration"
When encountering only indices with empty mapping, the IndexResolver
throws an exception as it expects to find at least one entry.
This commit fixes this case so that an empty mapping is returned.
Fix#46757
(cherry picked from commit 5f4f5807acb93b5fab36718c092c328977a396b6)
* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
* Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes#41581