This fixes a bug introduced by #61782. In that PR I thought I could
simplify the persistence of progress by using the progress straight
from the stats holder in the task instead of calling the get
stats action. However, I overlooked that it is then possible to
have stale progress for the reindexing task as that is only updated
when the get stats API is called.
In this commit this is fixed by updating reindexing task progress
before persisting the job progress. This seems to be much more
lightweight than calling the get stats request.
Closes#61852
Backport of #61868
For 1/2 the plugins in x-pack, the integTest
task is now a no-op and all of the tests are now executed via a test,
yamlRestTest, javaRestTest, or internalClusterTest.
This includes the following projects:
security, spatial, stack, transform, vecotrs, voting-only-node, and watcher.
A few of the more specialized qa projects within these plugins
have not been changed with this PR due to additional complexity which should
be addressed separately.
related: #60630
related: #56841
related: #59939
related: #55896
For 1/2 the plugins in x-pack, the integTest
task is now a no-op and all of the tests are now executed via a test,
yamlRestTest, javaRestTest, or internalClusterTest.
This includes the following projects:
async-search, autoscaling, ccr, enrich, eql, frozen-indicies,
data-streams, graph, ilm, mapper-constant-keyword, mapper-flattened, ml
A few of the more specialized qa projects within these plugins
have not been changed with this PR due to additional complexity which should
be addressed separately.
A follow up PR will address the remaining x-pack plugins (this PR is big enough as-is).
related: #61802
related: #56841
related: #59939
related: #55896
While starting the data frame analytics process it is possible
to get an exception before the process crash handler is in place.
In addition, right after starting the process, we check the process
is alive to ensure we capture a failed process. However, those exceptions
are unhandled.
This commit catches any exception thrown while starting the process
and sets the task to failed with the root cause error message.
I have also taken the chance to remove some unused parameters
in `NativeAnalyticsProcessFactory`.
Relates #61704
Backport of #61838
Allow filtering through a pipe, across events and sequences.
Filter pipes are pushed down to base queries.
For now filtering after limit (head/tail) is forbidden as the
semantics are still up for debate.
Fix#59763
(cherry picked from commit 80569a388b76cecb5f55037fe989c8b6f140761b)
The ML mappings upgrade test had become useless as it was
checking a field that has been the same since 6.5. This
commit switches to a field that was changed in 7.9.
Additionally, the test only used to check the results index
mappings. This commit also adds checking for the config
index.
Backport of #61340
During a rolling upgrade it is possible that a worker node will be upgraded before
the master in which case the DFA templates will not have been installed.
Before a DFA task starts check that the latest template is installed and install it if necessary.
When an error occurs and we set the task to failed via
the `DataFrameAnalyticsTask.setFailed` method we do not
persist progress. If the job is later restarted, this means
we do not correctly restore from where we can but instead
we start the job from scratch and have to redo the reindexing
phase.
This commit solves this bug by persisting the progress before
setting the task to failed.
Backport of #61782
@ywangd made an awesome analysis on why this test is failing, over
at https://github.com/elastic/elasticsearch/issues/55816#issuecomment-620913282
This change makes it so that we use the same client to perform a
refresh of a token, as we use to subsequently attempt to authenticate
with the refreshed token. This ensures the tests are failing and is
a good approximation of how we expect the same client doing the
refresh, to also perform the subsequent authentication in real life
uses.
The errors we were seeing from users have disappeared after #55114
so we deem our behavior safe.
System indices can be snapshotted and are therefore potential candidates
to be mounted as searchable snapshot indices. As of today nothing
prevents a snapshot to be mounted under an index name starting with .
and this can lead to conflicting situations because searchable snapshot
indices are read-only and Elasticsearch expects some system indices
to be writable; because searchable snapshot indices will soon use an
internal system index (#60522) to speed up recoveries and we should
prevent the system index to be itself a searchable snapshot index
(leading to some deadlock situation for recovery).
This commit introduces a changes to prevent snapshots to be mounted
as a system index.
BlobStoreCacheService implements ClusterStateListener in order to
maintain a ready flag that can be used to know when the snapshot
blob cache should be queries or not.
Now the getAsync() method correctly handles the various exceptions
that can be thrown when the .snapshot-blob-cache index is not
available(in isExpectedCacheGetException()) and logs as DEBUG
we can safely remove the ready flag.
This is a minor refactor where the job node load logic (node availability, etc.) is refactored into its own class.
This will allow future things (i.e. autoscaling decisions) to use the same node load detection class.
backport of #61521
This commit addresses two issues:
- per class feature importance is now written out for binary classification (logistic regression)
- The `class_name` in per class feature importance now matches what is written in the `top_classes` array.
backport of https://github.com/elastic/elasticsearch/pull/61597
- don't do encoding of asynchExecutionId if it is already provided in
the encoded form
- create a new instance of AsyncExecutionId after checks for
correctness are done
If the master node of the follower cluster is busy, then the
auto-follower will fail to initialize the following process. This also
occurs when an auto-follow pattern matches multiple indices. We should
set the timeout of put-follow requests issued by the auto-follower to
unbounded to avoid this problem.
Closes#56891
Backport of #61474.
Part of #46106. Simplify the implementation of deprecation logging by
relying of log4j more completely, and implementing additional behaviour
through custom appenders and filters.
This commit adds the functionality to allocate newly created indices on nodes in the "hot" tier by
default when they are created.
This does not break existing behavior, as nodes with the `data` role are considered to be part of
the hot tier. Users that separate their deployments by using the `data_hot` (and `data_warm`,
`data_cold`, `data_frozen`) roles will have their data allocated on the hot tier nodes now by
default.
This change is a little more complicated than changing the default value for
`index.routing.allocation.include._tier` from null to "data_hot". Instead, this adds the ability to
have a plugin inject a setting into the builder for a newly created index. This has the benefit of
allowing this setting to be visible as part of the settings when retrieving the index, for example:
```
// Create an index
PUT /eggplant
// Get an index
GET /eggplant?flat_settings
```
Returns the default settings now of:
```json
{
"eggplant" : {
"aliases" : { },
"mappings" : { },
"settings" : {
"index.creation_date" : "1597855465598",
"index.number_of_replicas" : "1",
"index.number_of_shards" : "1",
"index.provided_name" : "eggplant",
"index.routing.allocation.include._tier" : "data_hot",
"index.uuid" : "6ySG78s9RWGystRipoBFCA",
"index.version.created" : "8000099"
}
}
}
```
After the initial setting of this setting, it can be treated like any other index level setting.
This new setting is *not* set on a new index if any of the following is true:
- The index is created with an `index.routing.allocation.include.<anything>` setting
- The index is created with an `index.routing.allocation.exclude.<anything>` setting
- The index is created with an `index.routing.allocation.require.<anything>` setting
- The index is created with a null `index.routing.allocation.include._tier` value
- The index was created from an existing source metadata (shrink, clone, split, etc)
Relates to #60848
The check introduced by #60640 for scroll searches, in which we log
if the index access control before the query and fetch phases differs
from when the scroll context is created, is too strict, leading to spurious
warning log messages.
The check verifies instance equality but this assumes that the fetch
phase is executed in the same thread context as the scroll context
validation. However, this is not true if the scroll search is executed
cross-cluster, and even for local scroll searches it is an unfounded assumption.
The check is hence reduced to a null check for the index access.
The fact that the access control is suitable given the indices that
are actually accessed (by the scroll) will be done in a follow-up,
after we better regulate the creation of index access controls in general.
Runtime fields need to have a SearchLookup available, when building their fielddata implementations, so that they can look up other fields, runtime or not.
To achieve that, we add a Supplier<SearchLookup> argument to the existing MappedFieldType#fielddataBuilder method.
As we introduce the ability to look up other fields while building fielddata for mapped fields, we implicitly add the ability for a field to require other fields. This requires some protection mechanism that detects dependency cycles to prevent stack overflow errors.
With this commit we also introduce detection for cycles, as well as a limit on the depth of the references for a runtime field. Note that we also plan on introducing cycles detection at compile time, so the runtime cycles detection is a last resort to prevent stack overflow errors but we hope that we can reject runtime fields from being registered in the mappings when they create a cycle in their definition.
Note that this commit does not introduce any production implementation of runtime fields, but is rather a pre-requisite to merge the runtime fields feature branch.
This is a breaking change for MapperPlugins that plug in a mapper, as the signature of MappedFieldType#fielddataBuilder changes from taking a single argument (the index name), to also accept a Supplier<SearchLookup>.
Relates to #59332
Co-authored-by: Nik Everett <nik9000@gmail.com>
Replaces the superclass of the test for `HistogramFieldMapperTests` with
one that doesn't extend `ESSingleNodeTestCase` so we don't depend on the
entire world to test the field mapper.
Continues #61301.
Today we sometimes notify a listener of completion while holding
`SparseFileTracker#mutex`. This commit move all such calls out from
under the mutex and adds assertions that the mutex is not held in the
listener.
Closes#61520
Inference processors asynchronously usage write stats to the .ml-stats index after they used.
In tests the write can leak into the next test causing failures depending on which test follows.
This change waits for the usage stats docs to be written at the end of the test
If a TLS-protected connection closes unexpectedly then today we often
emit a `WARN` log, typically one of the following:
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: Received close_notify during handshake
We typically only report unexpectedly-closed connections at `DEBUG`
level, but these two messages don't follow that rule and generate a lot
of noise as a result. This commit adjusts the logging to report these
two exceptions at `DEBUG` level only.
Today we use `long` to represent the number of parts of a blob. There's
no need for this extra range, it forces us to do some casting elsewhere,
and indeed when snapshotting we iterate over the parts using an `int`
which would be an infinite loop in case of overflow anyway:
for (int i = 0; i < fileInfo.numberOfParts(); i++) {
This commit changes the representation of the number of parts of a blob
to an `int`.
We convert longs to ints using `Math.toIntExact` in places where we're
sure there will be no overflow, but this doesn't explain the intent of
these conversions very well. This commit introduces a dedicated method
for these conversions, and adds an assertion that we never overflow.
If a searchable snapshot shard fails (e.g. its node leaves the cluster)
we want to be able to start it up again on a different node as quickly
as possible to avoid unnecessarily blocking or failing searches. It
isn't feasible to fully restore such shards in an acceptably short time.
In particular we would like to be able to deal with the `can_match`
phase of a search ASAP so that we can skip unnecessary waiting on shards
that may still be warming up but which are not required for the search.
This commit solves this problem by introducing a system index that holds
much of the data required to start a shard. Today(*) this means it holds
the contents of every file with size <8kB, and the first 4kB of every
other file in the shard. This system index acts as a second-level cache,
behind the first-level node-local disk cache but in front of the blob
store itself. Reading chunks from the index is slower than reading them
directly from disk, but faster than reading them from the blob store,
and is also replicated and accessible to all nodes in the cluster.
(*) the exact heuristics for what we should put into the system index
are still under investigation and may change in future.
This second-level cache is populated when we attempt to read a chunk
which is missing from both levels of cache and must therefore be read
from the blob store.
We also introduce `SearchableSnapshotsBlobStoreCacheIntegTests` which
verify that we do not hit the blob store more than necessary when
starting up a shard that we've seen before, whether due to a node
restart or because a snapshot was mounted multiple times.
Backport of #60522
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
If a search failure occurs during data frame extraction we catch
the error and retry once. However, we retry another search that is
identical to the first one. This means we will re-fetch any docs
that were already processed. This may result either to training
a model using duplicate data or in the case of outlier detection to
an error message that the process received more records than it
expected.
This commit fixes this issue by tracking the latest doc's sort key
and then using that in a range query in case we restart the search
due to a failure.
Backport of #61544
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Backports the following commits to 7.x:
[ML] write warning if configured memory limit is too low for analytics job (#61505)
Having `_start` fail when the configured memory limit is too low can be frustrating.
We should instead warn the user that their job might not run properly if their configured limit is too low.
It might be that our estimate is too high, and their configured limit works just fine.
DeprecationLogger's constructor should not create two loggers. It was
taking parent logger instance, changing its name with a .deprecation
prefix and creating a new logger.
Most of the time parent logger was not needed. It was causing Log4j to
unnecessarily cache the unused parent logger instance.
depends on #61515
backports #58435
Refactor the tests to not require a mock HTTP Server. This has been
the cause of flakiness and removing it doesn't affect the logical
coverage of this suite. The "fake UI" is now simulated by an
http client that makes the necessary requests to Elasticsearch APIs.
Backport to add case insensitive support for regex queries.
Forks a copy of Lucene’s RegexpQuery and RegExp from Lucene master.
This can be removed when 8.7 Lucene is released.
Closes#59235
Splitting DeprecationLogger into two. HeaderWarningLogger - responsible for adding a response warning headers and ThrottlingLogger - responsible for limiting the duplicated log entries for the same key (previously deprecateAndMaybeLog).
Introducing A ThrottlingAndHeaderWarningLogger which is a base for other common logging usages where both response warning header and logging throttling was needed.
relates #55699
relates #52369
backports #55941
The building block of the eql response is currently the SearchHit. This
is a problem since it is tied to an actual search, and thus has scoring,
highlighting, shard information and a lot of other things that are not
relevant for EQL.
This becomes a problem when doing sequence queries since the response is
not generated from one search query and thus there are no SearchHits to
speak of.
Emulating one is not just conceptually incorrect but also problematic
since most of the data is missed or made-up.
As such this PR introduces a simple class, Event, that maps nicely to
the terminology while hiding the ES internals (the use of SearchHit or
GetResult/GetResponse depending on the API used).
Fix#59764Fix#59779
Co-authored-by: Igor Motov <igor@motovs.org>
(cherry picked from commit 997376fbe6ef2894038968842f5e0635731ede65)
* Faster `equals` for `BytesArray` which is nice since with this change we use it for the search cache
* Lighter `StreamInput` for `BytesArray` that should save memory and some indirection relative to the one on the abstract bytes reference
* Lighter `writeTo` implementation
* Build a `BytesArray` instead of a PagedBytesReference whenever possible to save indirection and memory
This is mostly motivated by the performance issues we are seeing around the GET mappings
REST API which (in case of a large number of indices) will create decompressing streams in a hot loop
which takes a significant amount of time for the system calls involved in instantiating deflaters
and inflaters.
Also, this fixes a leaked deflater when deserializing cached repository data.
This commit removes the log info message "Created ML annotations index and aliases".
The message comes in addition to elasticsearch's index creation logging and it does
not add to it. In addition, since #61107 that message may be logged multiple times.
Backport of #61461