* [ML] mark forecasts for force closed/failed jobs as failed (#57143)
forecasts that are still running should be marked as failed/finished in the following scenarios:
- Job is force closed
- Job is re-assigned to another node.
Forecasts are not "resilient". Their execution does not continue after a node failure. Consequently, forecasts marked as STARTED or SCHEDULED should be flagged as failed. These forecasts can then be deleted.
Additionally, force closing a job kills the native task directly. This means that if a forecast was running, it is not allowed to complete and could still have the status of `STARTED` in the index.
relates to https://github.com/elastic/elasticsearch/issues/56419
* [ML] adds new for_export flag to GET _ml/inference API (#57351)
Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API.
This flag is useful for moving models between clusters.
* Add new circuitbreaker plugin and refactor CircuitBreakerService (#55695)
This commit lays the ground work for plugins supplying their own circuit breakers.
It adds a new interface: `CircuitBreakerPlugin`.
This interface provides methods for providing custom child CircuitBreaker objects. There are also facilities for allowing dynamic settings for the custom breakers.
With the refactor, circuit breakers are no longer replaced on setting changes. Instead, the two mutable settings themselves are `volatile`. Plugins that want to use their custom circuit breaker should keep a reference of their constructed breaker.
This adds a max_model_memory setting to forecast requests.
This setting can take a string value that is formatted according to byte sizes (i.e. "50mb", "150mb").
The default value is `20mb`.
There is a HARD limit at `500mb` which will throw an error if used.
If the limit is larger than 40% the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited.
related native change: https://github.com/elastic/ml-cpp/pull/1238
closes: https://github.com/elastic/elasticsearch/issues/56420
Implement TIME_PARSE(<time_str>, <pattern_str>) function
which allows to parse a time string according to the specified
pattern into a time object. The patterns allowed are those of
java.time.format.DateTimeFormatter.
Closes#54963
Co-authored-by: Andrei Stefan <astefan@users.noreply.github.com>
Co-authored-by: Patrick Jiang(白泽) <patrickjiang0530@gmail.com>
(cherry picked from commit 1fe1188d449cad7d0782a202372edc52a4014135)
Backport of #56878 to 7.x branch.
With this change the following APIs will be able to resolve data streams:
get index, get mappings and ilm explain APIs.
Relates to #53100
Allows geo fields (`geo_point`, `geo_shape`) to have missing values.
Fixes a bug where such missing values would result in an error.
Closes#57299
Backport of #57300
Backporting #56888 to 7.x branch.
Limit the creation of data streams only for namespaces that have a composable template with a data stream definition.
This way we ensure that mappings/settings have been specified and will be used at data stream creation and data stream rollover.
Also remove `timestamp_field` parameter from create data stream request and
let the create data stream api resolve the timestamp field
from the data stream definition snippet inside a composable template.
Relates to #53100
Previously, `CASE` and `IIF` when translated to painless scripts
(used in GROUP BY, HAVING, WHERE) a custom `caseFunction`
registered in the `InternalSqlScriptUtils` was used. This function
received and array of arbitrary length:
```[condition1, result1, condition2, result2, ... elseResult]```
Painless doesn't know of the context and therefore is evaluating
all conditions and results before invoking the `caseFunction` on them.
As a consequence, erroneous result expressions (i.e. division by 0)
where always evaluated despite of the guarding condition.
Replace the `caseFunction` with painless `<cond> ? <res1> : <res2>`
expressions to properly guard the result expressions and only evaluate
the one for which its guarding condition evaluates to true (or of course
the elseResult).
As a bonus, this approach includes performance benefits since we avoid
unnecessary evaluations of both conditions and result expressions.
Fixes: #49672
(cherry picked from commit 9584b345d89f797bfb658212b928b9812804f02f)
The ssl.trust setting for Watcher provides a list of hostnames that
should be automatically trusted for SSL hostname verification. It was
accidentally broken when we added the full ssl.* settings for email
notifications (see #45272)
This commit corrects this, so the setting is once again respected,
as long as none of the other ssl settings are configured for email
notifications.
Resolves: #52153
Backport of: #56090
This commit removes the compiler.java setting from the build. It was
originally added when Gradle was far behind support for the latest jdk,
but is no longer applicable as we don't have any need to update the
supported compile version before gradle supports the newer version. Note
that the runtime version changing support still exists here, this only
ensures we use the same jdk to compile as we use to run gradle.
Since #51888 the ML job stats endpoint has returned entries for
jobs that have a persistent task but not job config. Such
orphaned tasks caused monitoring to fail.
This change ignores any such corrupt jobs for monitoring purposes.
Backport of #57235
This PR removes the blocking call to insert ingest documents into a queue in the
coordinator. It replaces it with an offer call which will throw a rejection exception
in the event that the queue is full. This prevents deadlocks of the write threads
when the queue fills to capacity and there are more than one enrich processors
in a pipeline.
The searcher was randomly wrapping its reader as slow, parallel, or filtered.
This was causing casting issues in the normalizer tests. By removing the
wrapping, the problem goes away.
Closes#57164
If a job is NOT opened, forecasts should be able to be deleted, no matter their state.
This also fixes a bug with expanding forecast IDs. We should check for wildcard `*` and `_all` when expanding the ids
closes https://github.com/elastic/elasticsearch/issues/56419
Change the error message wording for comparisons against fields in
filtering (s/variables/fields).
(cherry picked from commit d9a1cb50940d0a98fd75b9c0123ca6e1d862f65d)
* Update the JLine dependency to 3.14.1
Update the JLine dependency from 3.10.0 to 3.14.1.
(cherry picked from commit c2d9b74046fa5ddb54604da3afa7887cc38548a1)
Backport of #55548
Adds equivalence for keyword field to the wildcard field. Regex, fuzzy, wildcard and prefix queries are all supported.
All queries use an approximation query backed by an automaton-based verification queries.
Closes#54275
Fix delete_expired_data/nightly maintenance when
many model snapshots need deleting (#57041)
The queries performed by the expired data removers pull back entire
documents when only a few fields are required. For ModelSnapshots in
particular this is a problem as they contain quantiles which may be
100s of KB and the search size is set to 10,000.
This change makes the search more efficient by only requesting the
fields needed to work out which expired data should be deleted.
In #51089 where SamlAuthenticatorTests were refactored, we missed
to update one test case which meant that a single key would be
used both for signing and encryption in the same run. As explained
in #51089, and due to FIPS 140 requirements, BouncyCastle FIPS
provider will block RSA keys that have been used for signing from
being used for encryption and vice versa
This commit changes testNoAttributesReturnedWhenTheyCannotBeDecrypted
to always use the specific keys we have added for encryption.
This change ensures that we stop the maintenance service on all nodes
when a data node is restarted. This ensures that we don't send
update_by_query requests on the node that is restarted.
This commit also raises the log level to trace for some packages
in order to investigate the failures to acquire a shard lock
after a restart.
Relates #56765
- Use opensaml to sign and encrypt responses/assertions/attributes
instead of doing this manually
- Use opensaml to build response and assertion objects instead of
parsing xml strings
- Always use different keys for signing and encryption. Due to FIPS
140 requirements, BouncyCastle FIPS provider will block
RSA keys that have been used for signing from being used for
encryption and vice versa. This change adds new encryption specific
keys to be used throughout the tests.
Move the JDBC functionality integration tests from `:sql:qa` to a separate
module `:sql:qa:jdbc`. This way the tests are isolated from the rest of the
integration tests and they only depend to the `:sql:jdbc` module, thus
removing the danger of accidentally pulling in some dependency that may
hide bugs.
Moreover this is a preparation for #56722, so that we can run those tests
between different JDBC and ES node versions and ensure forward
compatibility.
Move the rest of existing tests inside a new `:sql:qa:server` project, so that
the `:sql:qa` becomes the parent project for both and one can run all the integration
tests by using this parent project.
(cherry picked from commit c09f4a04484b8a43934fe58fbc41bd90b7dbcc76)
Our FIPS 140 testing depends on setting the appropriate java policy
in order to configure the JVM in FIPS mode. Some tests (
discovery-ec2 and ccr qa ) also needed to set a custom policy file
to grant a specific permission, which overwrote the FIPS related
policy and tests would fail. This change ensures that when a
custom policy needs to be set in these tests, the permissions that
are necessary for FIPS are also set.
Resolves: #51685, #52034
Field mapping detection is done via grok patterns.
This commit adds well-known text (WKT) formatted geometry detection.
If everything is a `POINT`, then a `geo_point` mapping is preferred.
Otherwise, if all the fields are WKT geometries a `geo_shape` mapping is preferred.
This does **NOT** detect other types of formatted geometries (geohash, comma delimited points, etc.)
closes https://github.com/elastic/elasticsearch/issues/56967
* Fix temp dir locked errors
The tests involving a temporary directory (containing the JDBC JAR) fail
on Windows because they can't be deleted, due to still being in use.
This commit forces a premature closing of the JAR file, which mitigates
the failure by giving the JVM more time to collect any open FDs.
(Calling the System.gc() in the tests is another working alternative
fix.)
The stream-based JAR access is taken care by disabling the cache usage
(cherry picked from commit 04f97333a015404a68e8f19223f33aadeb396687)
The original implementation utilized `bbox` as the index mapping type. This would not work as it would have to be `envelope`. But, given that `envelope` and `polygon` are tessellated in the same way, we choose to use `polygon` as the geo_shape type. This is for easier support other places in the stack (a la kibana maps)
Merging logic is currently split between FieldMapper, with its merge() method, and
MappedFieldType, which checks for merging compatibility. The compatibility checks
are called from a third class, MappingMergeValidator. This makes it difficult to reason
about what is or is not compatible in updates, and even what is in fact updateable - we
have a number of tests that check compatibility on changes in mapping configuration
that are not in fact possible.
This commit refactors the compatibility logic so that it all sits on FieldMapper, and
makes it called at merge time. It adds a new FieldMapperTestCase base class that
FieldMapper tests can extend, and moves the compatibility testing machinery from
FieldTypeTestCase to here.
Relates to #56814
Previously `COUNT(DISTINCT <literal>)` was returning the same result
as `COUNT(<literal>)` which is not correct as it should always return 1
if there is at least one matching row (bucket if there is a GROUP BY),
or 0 otherwise.
(cherry picked from commit 7f7d7562d43034907f432d39d0d66f490d78f4a8)
Throttling nightly cleanup as much as we do has been over cautious.
Night cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.
Additionally, the API now has `requests_per_second` and `timeout` set.
So users calling the API directly can set the throttling.
This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
The version number componenent can't equal or exceed the revision
multiplier.
This fixes a the VersionTests unit test.
(cherry picked from commit 7d2331a2818ae20024c5c3617cd4433f90e9c098)
* Adds support for MIN, MAX, AVG, SUM aggregates acting on literals.
SELECT SUM(1) FROM index
and
SELECT SUM(1), AVG(2)
work both on indices and as local execution.
(cherry picked from commit efb72907c0391612c4a2b6256e327060b4167912)
WatcherIndexTemplateRegistry as of https://github.com/elastic/elasticsearch/pull/52962
requires all nodes to be on 7.7.0 before it allows the version 11 index template to be
installed.
While in a mixed cluster, nothing prevents Watcher from running on the new
host before the all of the nodes are on 7.7.0. This will result in the
.watcher-history-11* index without the proper mappings. Without the proper
mapping a single document (for a large watch) can exceed the default 1000 field
limit and cause error to show in the logs.
This commit ensures the same logic for writing to the index is applied as for
installing the template. In a mixed cluster, the `10` index template will continue
to be written. Only once all of nodes are on 7.7.0+ will the `11` index template
be installed and used.
closes#56732
In DF analytics classification, it is possible to use no samples
of a class if its cardinality is too low.
This commit fixes this by ensuring the target sample count can never be zero.
Backport of #56783
* JDBC: fix access to the Manifest for non-entry JAR
The JDBC driver will attempt to read its version from the Manifest file
embedded into its JAR. The URL pointing to the JAR can be provided in a
few ways.
So far, accessing the Manfiest was attempted by getting a URLConnection
out of the URL and then getting an input stream out of this connection.
For file JAR URLs, this only works however if the URL points to the
driver as a JAR file entry (i.e. <sub-url>!/jdbc-driver.jar!/). If
that's not the case, the JarURLConnection will throw an IOException.
This commit fixes that: in case the URL points to a JAR entry
(jar:file:<path>/jdbc-driver.jar!/), the manifest is read directly with
JarURLConnection#getManifest().
(cherry picked from commit 2175b7b01cf5fcf3ab2bb21404a9bd454a8df3f0)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
In some cases the Enrich processor factory may be called before it is
ready to create processors. While these calls are usually made in error,
the response from the Enrich processor is an NPE which is almost always
an unhelpful error when debugging an issue.
This change aims to fix our setup in CI so that we can run 7.x in
FIPS 140 mode. The major issue that we have in 7.x and did not
have in master is that we can't use the diagnostic trust manager
in FIPS mode in Java 8 with SunJSSE in FIPS approved mode as it
explicitly disallows the wrapping of X509TrustManager.
Previous attempts like #56427 and #52211 focused on disabling the
setting in all of our tests when creating a Settings object or
on setting fips_mode.enabled accordingly (which implicitly disables
the diagnostic trust manager). The attempts weren't future proof
though as nothing would forbid someone to add new tests without
setting the necessary setting and forcing this would be very
inconvenient for any other case ( see
#56427 (comment) for the full argumentation).
This change introduces a runtime check in SSLService that overrides
the configuration value of xpack.security.ssl.diagnose.trust and
disables the diagnostic trust manager when we are running in Java 8
and the SunJSSE provider is set in FIPS mode.
* [Transform] add support for terms agg in transforms (#56696)
This adds support for `terms` and `rare_terms` aggs in transforms.
The default behavior is that the results are collapsed in the following manner:
`<AGG_NAME>.<BUCKET_NAME>.<SUBAGGS...>...`
Or if no sub aggs exist
`<AGG_NAME>.<BUCKET_NAME>.<_doc_count>`
The mapping is also defined as `flattened` by default. This is to avoid field explosion while still providing (limited) search and aggregation capabilities.
This is a followup to #56632. Tests that had to be changed
to mock the C++ log handler more accurately need to be more
careful about when that stream ends, as ending of that
stream is used to detect crashes in the production system.
Fixes#56796
Mapper.Builder currently has some complex generics on it to allow fluent builder
construction. However, the second parameter, a return type from the build() method,
is unnecessary, as we can use covariant return types. This commit removes this second
generic parameter.
This is another part of the breakup of the massive BuildPlugin. This PR
moves the code for configuring publications to a separate plugin. Most
of the time these publications are jar files, but this also supports the
zip publication we have for integ tests.
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.
The aggregations supports the following normalizations
- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization
To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.
For example:
```
{
"normalize": {
"buckets_path": <>,
"normalizer": "percent"
}
}
```
Optimize away events queries and joins/sequence that cannot match any
results without having to query the backend.
(cherry picked from commit 69c8ef8cfefd8fc6dcb6d1a566bfcd537068e3e4)
Adds the conflicting types and an example of an index which specifies
them in order to make it easier for the user to understand the conflict.
Backport of #56700
This change ensures that the maintenance service that is responsible for deleting the expired response is stopped between each test. This is needed since we check that no search context are in-flight after each test method.
Fixes#55988
If an email action is used in a foreach loop, message ids could have
been duplicated, which then get rejected by the mail server.
This commit introduces an additional static counter in the email action
in order to ensure that every message id is unique.
Prior to this change the named pipes that connect the ML C++
processes to the Elasticsearch JVM were all opened before any
of them were read from or written to.
This created a problem, where if the C++ process logged more
messages between opening the log pipe and opening the last
pipe to be connected than there was space for in the named
pipe's buffer then the C++ process would block. This would
mean it never got as far as opening the last named pipe, so
the JVM would never get as far as reading from the log pipe,
hence a deadlock.
This change alters the connection order so that the JVM
starts reading from the logging pipe immediately after opening
it so that if the C++ process logs messages while opening the
other named pipes they are captured in a timely manner and
there is no danger of a deadlock.
Backport of #56632
This merges the code for the `significant_terms` agg into the package
for the code for the `terms` agg. They are *super* entangled already,
this mostly just admits that to ourselves.
Precondition for the terms work in #56487
Initial support for EQL sequences
The current algorithm is focused on correctness and does not contain
any optimization which is left for the future.
The current implementation uses a state machine approach which moves
ascending and runs each query one after the other working on computing
sequences as the data comes in.
For each result, the key and its timestamp are being extracted which are
then used for matching/building a sequence.
(cherry picked from commit 4f3e18c894a1841d333022361ad9d1fdf1477dc3)
- Add support for scalar functions on the field of SQL's LIKE/RLIKE
- Add support for scalar functions on the field of EQL's match/matchLite
Closes: #55058
(cherry picked from commit 51c14e2dbb7fb29004a23369c449d425b3ac8fe2)
When decoding async execution ids, exceptions thrown from the decode method itself were not caught, leading to cryptic errors like "Input byte array has incorrect ending byte at 68" being returned. With this commit we return "invalid id: [abcdef]".
Added tests coverage for a couple of these scenarios and also added tests for equals/hashcode methods.
The docs pattern url was using `*` which means zero or many instead
of `?` which means zero or one. The pattern url returned in error
messages was not in sync with the one in the docs.
Fixes: #56476
(cherry picked from commit 1a5945c3962cdda21482f4b0b3e0ca508534c2c4)
Today you can convert a searchable snapshot index back into a regular index by
restoring the underlying snapshot, but this is somewhat wasteful if the shards
are already in cache since it copies the whole index from the repository again.
Instead, we can make use of the locally-cached data by using the clone API to
copy the contents of the cache into the layout expected by a regular shard.
This commit marks the searchable snapshot's private index settings as
`NotCopyableOnResize` so that they are removed by resize operations such as
cloning.
Cloning a regular index typically hard-links the underlying files rather than
copying them, but this is tricky to support in the case of a searchable
snapshot so this commit takes the simpler approach of always copying the
underlying files.
This setting was not returned in the SamlRealmSettings#getSettings
so it was not possible for users to set this in the realm config
in our configuration.
Watcher adds watches to the trigger service on the postIndex action
for the .watches index. This has the (intentional) side effect of also
adding the watches to the stats. The tests rely on these stats for their
assertions. The tests also start and stop Watcher between each test for
a clean slate.
When Watcher executes it updates the .watches index and upon this update
it will go through the postIndex method and end up added that watch to the
trigger service (and stats). Functionally this is not a problem, if Watcher
is stopping or stopped since Watcher is also paused and will not execute
the watch. However, with specific timing and expectations of a clean slate
can cause issues the test assertions against the stats.
This commit ensures that the postIndex action only adds to the trigger service
if the Watcher state is not stopping or stopped. When started back up it will
re-read index .watches.
This commit also un-mutes the tests related to #53177 and #56534
This commit allows the JSON schema's documentation.url property to have a null value.
This can useful for cases where a feature is under development, and does not have
documentation published yet.
This commit also adds a documentation.url for two ml resources.
Two spots that allow for some optimization:
* We are often creating a composite reference of just a single item in
the transport layer => special cased via static constructor to make sure we never do that
* Also removed the pointless case of an empty composite bytes ref
* `ByteBufferReference` is practically always created from a heap buffer these days so there
is no point of dealing with all the bounds checks and extra references to sliced buffers from that
and we can just use the underlying array directly
Without the flag we run into the situation where a broken repository (broken by some old 6.x
version of ES that is missing some snap-${uuid}.dat blobs fails to run the SLM retention task
since it always errors out).
Use `ORDER BY` to ensure order of the rows since more
than are returned in the testDate().
Follows: #56492
(cherry picked from commit 0053a1cb515b4db160d7b0bed5cf3f13c1050687)
Backport: #55377
This commit adds the ability to auto create data streams using index templates v2.
Index templates (v2) now have a data_steam field that includes a timestamp field,
if provided and index name matches with that template then a data stream
(plus first backing index) is auto created.
Relates to #53100
* QL: case sensitive support in EQL (#56404)
* adds a generic startsWith function to QL
* modifies the existent EQL startsWith function to be case sensitive
aware
* improves the existent EQL startsWith function to use a prefix query
when the function is used in a case sensitive context. Same improvement
is used in SQL's newly added STARTS_WITH function.
* adds case sensitivity to EQL configuration through a case_sensitive
parameter in the eql request, as established in #54411.
The case_sensitive parameter can be specified when running queries
(default is case insensitive)
(cherry picked from commit ee5a09ea840167566e34c28c8225dc38bc6a7ae8)
fix count in get and get stats if explicit ids are given and ids might be
duplicated when configuration are stored in different index (versions).
fixes#56196
The Date/Time related query params of a JDBC prepared statement
serialized using java.util.Date. The rules for serializing
`java.util.Date` objects though reside in
`XContentElasticsearchExtension` which is not available in the
jdbc jar as this class is in `server` module. Therefore, a
custom extension of the `XContentBuilderExtension` iface has been
added to the jdbc module/jar.
Moreover the sql's `qa` project had as dependency the `sql-action`
module which depends on `server` so the `XContentBuilderExtension`
was available for the integ tests hiding the real problem.
Previously, when a user was setting a `java.sql.Time` to the prepStmt,
the DataType used was `DATETIME` instead of `TIME` and therefore
prevented from filtering with a `TIME` casted field:
```
SELECT * FROM test WHERE date::TIME = ?
```
Fixes: #56084
(cherry picked from commit f8d8e971bd2c85fa4aea44b5b3ba0cdcc950a4ed)
Backport of: #56569
A data stream test, which tests data stream resolvability in xpack apis failed in release builds.
A invocation of a searchable snapshot api failed, because the corresponding feature flag
wasn't enabled for xpack rest tests.
Closes#56531
Similar to what the moving function aggregation does, except merging windows of percentiles
sketches together instead of cumulatively merging final metrics
When no timezone is specified the session timezone is used without
conversion, fix the docs test accordingly.
Follows: #56158
(cherry picked from commit 4b79b19ea5c3d17e05cb8130f3c754ac9bfd2382)
This commit refactors the following:
* GeoPointFieldMapper and PointFieldMapper to
AbstractPointGeometryFieldMapper derived from AbstractGeometryFieldMapper.
* .setupFieldType moved up to AbstractGeometryFieldMapper
* lucene indexing moved up to AbstractGeometryFieldMapper.parse
* new addStoredFields, addDocValuesFields abstract methods for implementing
stored field and doc values field indexing in the concrete field mappers
This refactor is the next phase for setting up a framework for extending
spatial field mapper functionality in x-pack.
Currently Elasticsearch creates independent event loop groups for each
transport (http and internal) transport type. This is unnecessary and
can lead to contention when different threads access shared resources
(ex: allocators). This commit moves to a model where, by default, the
event loops are shared between the transports. The previous behavior can
be attained by specifically setting the http worker count.
This commit removes the `prefer_v2_templates` flag and setting. This was a brief setting that
allowed specifying whether V1 or V2 template should be used when an index is created. It has been
removed in favor of V2 templates always having priority.
Relates to #53101Resolves#56528
This is not a breaking change because this flag was never in a released version.
We are ensuring order in the two tests changed by waiting on latches.
The problem is, that 3s is a pretty short wait and on CI can randomly be exceeded
by pure chance. If that happened we wouldn't have visibility on it since we didn't
assert that the waits actually worked.
=> Fixed by asserting that the waits work and upping the timeout to our standard 10s
Also, moved to a per-test threadpool to make it simpler to identify which test failed,
should an unexpected task run on a closed client's pool afterall.
Async search integration tests are subject to random failures when:
* The test index has more than one replica.
* The request cache is used.
* Some shards are empty.
* The maintenance service starts a garbage collection when node is closing.
They are also slow because the test index is created/populated on each
test method.
This change refactors these integration tests in order to:
* Create the index once for the entire test suite.
* Fix the usage of the request cache and replicas.
* Ensures that all shards have at least one document.
* Increase the delay of the maintenance service garbage collection.
Closes#55895Closes#55988
Change TransportBroadcastByNodeAction and TransportBroadcastReplicationAction
to be able to resolve data streams by default. Implementations can change this ability.
This change allows to following APIs to resolve data streams: flush,
refresh (already supported data streams), force merge, clear indices cache,
indices stats (already supported data streams), segments, upgrade stats,
upgrade, validate query, searchable snapshots stats, clear searchable snapshots cache and
reload analyzers APIs.
Relates to #53100
Right now all implementations of the `terms` agg allocate a new
`Aggregator` per bucket. This uses a bunch of memory. Exactly how much
isn't clear but each `Aggregator` ends up making its own objects to read
doc values which have non-trivial buffers. And it forces all of it
sub-aggregations to do the same. We allocate a new `Aggregator` per
bucket for two reasons:
1. We didn't have an appropriate data structure to track the
sub-ordinals of each parent bucket.
2. You can only make a single call to `runDeferredCollections(long...)`
per `Aggregator` which was the only way to delay collection of
sub-aggregations.
This change switches the method that builds aggregation results from
building them one at a time to building all of the results for the
entire aggregator at the same time.
It also adds a fairly simplistic data structure to track the sub-ordinals
for `long`-keyed buckets.
It uses both of those to power numeric `terms` aggregations and removes
the per-bucket allocation of their `Aggregator`. This fairly
substantially reduces memory consumption of numeric `terms` aggregations
that are not the "top level", especially when those aggregations contain
many sub-aggregations. It also is a pretty big speed up, especially when
the aggregation is under a non-selective aggregation like
the `date_histogram`.
I picked numeric `terms` aggregations because those have the simplest
implementation. At least, I could kind of fit it in my head. And I
haven't fully understood the "bytes"-based terms aggregations, but I
imagine I'll be able to make similar optimizations to them in follow up
changes.
Even with changes from #48854 we're still seeing significant (as in tens and hundreds of MB)
buffer usage for bulk exports in some cases which destabilizes master nodes.
Since we need to know the serialized length of the bulk body we can't do the serialization
in a streaming manner. (also it's not easily doable with the HTTP client API we're using anyway).
=> let's at least serialize on heap in compressed form and decompress as we're streaming to the
HTTP connection. For small requests this adds negligible overhead but for large requests this reduces
the size of the payload field by about an order of magnitude (empirically determined) which is a massive reduction in size when considering O(100MB) bulk requests.
We have been using a zero timeout in the case that DF analytics
is stopped. This may cause a timeout when we cancel, for example,
the reindex task.
This commit fixes this by using the default timeout instead.
Backport of #56423
While investigating possible optimizations to speed up searchable
snapshots shard restores, we noticed that Elasticsearch builds the
list of shard files on local disk in order to compare it with the list of
files contained in the snapshot to restore. This list of files is
materialized with a MetadataSnapshot object whose construction
involves to read the footer checksum of every files of the shard
using Store.checksumFromLuceneFile() method.
Further investigation shows that a MetadataSnapshot object is
also created for other types of operations like building the list of
files to recover in a peer recovery (and primary shard relocation)
or in order to assign a shard to a node. These operations use the
Store.getMetadata(IndexCommit) method to build the list of files
and checksums.
In the case of searchable snapshots building the MetadataSnapshot
object can potentially trigger cache misses, which in turn can
cause the download and the writing in cache of the last range of
the file in order to check the 16 bytes footer. This in turn can
cause more evictions.
Since searchable snapshots already contains the footer information
of every file in BlobStoreIndexShardSnapshot it can directly read the
checksum from it and avoid to use the cache at all to create a
MetadataSnapshot for the operations mentioned above.
This commit adds a shortcut to the
SearchableSnapshotDirectory.openInput() method - similarly to what
already exists for segment infos - so that it creates a specific
IndexInput for checksum reading operation.
It is possible that the config document for a data frame
analytics job is deleted from the config index. If that is
the case the user is unable to stop a running job because
we attempt to retrieve the config and that will throw.
This commit changes that. When the request is forced,
we do not expand the requested ids based on the existing
configs but from the list of running tasks instead.
Backport of #56360
Due to multi-threading it is possible that phase progress
updates written from the c++ process arrive reordered.
We can address this by ensuring that progress may only increase.
Closes#56282
Backport of #56339
* Add xpack setting deprecations to deprecation API
The deprecated settings showed up in the deprecation log file by
default, but I did not add them to the deprecation API. This commit
fixes that. Now if you use one of the deprecated basic feature
enablement settings, calling _monitoring/deprecations will inform you of
that fact.
* Remove incorrectly backported settings documents
It seems that I backported these docs to the wrong place in #56061,
in #55980, and in #56167. I hope they're in the right place now.
Co-authored-by: debadair <debadair@elastic.co>
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.
Some background:
There are two ways to round in ES:
* Round to the nearest time unit (Day/Hour/Week/Month/etc)
* Round to the nearest time *interval* (3 days/2 weeks/etc)
I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really *is*
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster and rounding to individual hour or minutes.
This optimization relies on being able to *usually* figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization *should* no
longer be needed.
This optimization doesn't come into play for `composite` or
`auto_date_histogram` aggs because neither have been migrated to the new
`DATE` `ValuesSourceType` which is where that range lookup happens. When
they are they will be able to pick up the optimization without much work.
I expect this to be substantial for `auto_date_histogram` but less so for
`composite` because it deals with fewer values.
Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead gets higher when there are more
transitions in logarithmic fashion. When there are two thousand years
worth of transitions my algorithm ends up being 250% slower than rounding
without a time zone, but java time is 47000% slower at that point,
allocating memory as fast as it possibly can.
This test sometimes fails when prewarming is enabled because
it's possible that some files are cached in background while the
test tries to clear the cache. This commit disables prewarming
for this test.
* Simplify equals/not-equals TRUE/FALSE expressions, by returning them
as is (TRUE variant) or negating them (FALSE variant)
(cherry picked from commit 17858afbe6da5fa0b3ecfc537cabb337e4baaffe)
Another Jackson release is available. There are some CVEs addressed,
none of which impact us, but since we can now bump Jackson easily, let
us move along with the train to avoid the false positives from security
scanners.
`FieldMapper#parseCreateField` accepts the parse context, plus a list of fields
as an output parameter. These fields are immediately added to the document
through `ParseContext#doc()`.
This commit simplifies the signature by removing the list of fields, and having
the mappers add the fields directly to `ParseContext#doc()`. I think this is
nicer for implementors, because previously fields could be added either through
the list, or the context (through `add`, `addWithKey`, etc.)
Async search allows users to retrieve partial results for a running search. For partial results, the number of successful shards does not include the skipped shards, while the response returned to users should.
Also, we recently had a bug where async search would miss tracking shard failures, which would have been caught if we had assertions in place that verified that whenever we get the last response, the number of failures included in it is the same as the failures that were tracked through the listener notifications.
A FilterBlobContainer class was introduced in #55952 and it delegates
its behavior to a given BlobContainer while allowing to override
only necessary methods.
This commit replaces the existing BlobContainerWrapper class from
the test framework with the new FilterBlobContainer from core.
If an exception occurs while flushing a bulk the cause of the exception
can be lost. This commit ensures that cause of the exception is carried
forward and gets logged.
* SQL: Add BigDecimal support to JDBC (#56015)
* Introduce BigDecimal support to JDBC -- fetching
This commit adds support for the getBigDecimal() methods.
* Allow BigDecimal params in double range
A prepared statement will now accept a BigDecimal parameter as a proxy
for a double, if the conversion is lossless.
(cherry picked from commit e9a873ad7f387682e3472110b1d7c0514bd347c9)
* Fix compilation error
Dimond notation with anonymous inner classes not avail in Java8.
The incomatible client version test is changed to:
- iterate on all versions prior to the allowed one_s;
- format the exception message just as the server does it.
The defect stemed from the fact that the clients will not send a
version's qualifier, but just major.minor.revision, so the raised
error/exception_message won't contain it, while the test expected it.
(cherry picked from commit 4a81c8f7a1f4573e3be95f346d9fb18772b297ee)
* [ML] lay ground work for handling >1 result indices (#55892)
This commit removes all but one reference to `getInitialResultsIndexName`.
This is to support more than one result index for a single job.
* Introduce a query builder for the rest tests
The new BaseRestSqlTestCase.RequestObjectBuilder class is a helper class
to build REST request objects for the tests. Consequently, "manual" string
concatenation to form JSON is done away with.
The class mimics SqlQueryRequestBuilder API.
(cherry picked from commit c8363f04c029542c233a758e9286d33c51d9c0c4)
this commit adds aggregation support for the geo_shape field
type on geo*_grid aggregations.
it introduces a Tiler for both tiles and hashes that enables a new type of
ValuesSource to replace the GeoPoint's CellIdSource. This makes it possible
for the existing Aggregator to be re-used, so no new implementations of
the grid aggregators are added.
Transforms should propagate up the search execution exception if one is returned when it does the test query.
this allows transforms to return a `4xx` when the aggs are malformed but parseable.
closes https://github.com/elastic/elasticsearch/issues/55994
* Relax version lock between ES/SQL and clients
Allow older-than-server clients to connect, if these are past or on a
certain min release.
(cherry picked from commit 108f907297542ce649aa7304060aaf0a504eb699)
The following settings are now no-ops:
* xpack.flattened.enabled
* xpack.logstash.enabled
* xpack.rollup.enabled
* xpack.slm.enabled
* xpack.sql.enabled
* xpack.transform.enabled
* xpack.vectors.enabled
Since these settings no longer need to be checked, we can remove settings
parameters from a number of constructors and methods, and do so in this
commit.
We also update documentation to remove references to these settings.
This commit changes searchable snapshots so that it now respects the
repository's max_restore_bytes_per_sec setting when it downloads blobs.
Backport of #55952 for 7.x
This PR implements the following changes to make ML model snapshot
retention more flexible in advance of adding a UI for the feature in
an upcoming release.
- The default for `model_snapshot_retention_days` for new jobs is now
10 instead of 1
- There is a new job setting, `daily_model_snapshot_retention_after_days`,
that defaults to 1 for new jobs and `model_snapshot_retention_days`
for pre-7.8 jobs
- For days that are older than `model_snapshot_retention_days`, all
model snapshots are deleted as before
- For days that are in between `daily_model_snapshot_retention_after_days`
and `model_snapshot_retention_days` all but the first model snapshot
for that day are deleted
- The `retain` setting of model snapshots is still respected to allow
selected model snapshots to be retained indefinitely
Backport of #56125
This commit strengthens the assertion about which threads may access a blob
store to exclude the cluster applier thread, since we no longer need to do so.
Relates #50999
As of elastic/ml-cpp#1179, the analytics process reports phases
depending on the analysis type. This commit adjusts the phases
of current analyses from `analyzing` to the following:
- outlier_detection: [`computing_outlier`]
- regression/classification: [`feature_selection`, `coarse_parameter_search`, `fine_tuning_parameters`, `final_training`]
Backport of #56107
Previously, when the timezone was missing from the datetime string
and the pattern, UTC was used, instead of the session defined timezone.
Moreover, if a timezone was included in the datetime string and the
pattern then this timezone was used. To have a consistent behaviour
the resulting datetime will always be converted to the session defined
timezone, e.g.:
```
SELECT DATETIME_PARSE('2020-05-04 10:20:30.123 +02:00', 'HH:mm:ss dd/MM/uuuu VV') AS datetime;
```
with `time_zone` set to `-03:00` will result in
```
2020-05-04T05:20:40.123-03:00
```
Follows: #54960
(cherry picked from commit 8810ed03a209cc8fe1bad309a81e85b56a39da27)
Today the cache prewarming introduced in #55322 works by
enqueuing altogether the files parts to warm in the
searchable_snapshots thread pool. In order to make this fairer
among concurrent warmings, this commit starts workers that
concurrently polls file parts to warm from a queue, warms the
part and then immediately schedule another warming
execution. This should leave more room for concurrent
shard warming to sneak in and be executed.
Relates #55322
Previously, the timezone parameter was not passed to the RangeQuery
and as a results queries that use the ES date math notation (now,
now-1d, now/d, now/h, now+2h, etc.) were using the UTC timezone and
not the one passed through the "timezone"/"time_zone" JDBC/REST params.
As a consequence, the date math defined dates were always considered in
UTC and possibly led to incorrect results for queries like:
```
SELECT * FROM t WHERE date BETWEEN now-1d/d AND now/d
```
Fixes: #56049
(cherry picked from commit 300f010c0b18ed0f10a41d5e1606466ba0a3088f)
In #55763 I thought I could remove the flag that marks
reindexing was finished on a data frame analytics task.
However, that exposed a race condition. It is possible that
between updating reindexing progress to 100 because we
have called `DataFrameAnalyticsManager.startAnalytics()` and
a call to the _stats API which updates reindexing progress via the
method `DataFrameAnalyticsTask.updateReindexTaskProgress()` we
end up overwriting the 100 with a lower progress value.
This commit fixes this issue by bringing back the help of
a `isReindexingFinished` flag as it was prior to #55763.
Closes#56128
Backport of #56135
AuthN realms are ordered as a chain so that the credentials of a given
user are verified in succession. Upon the first successful verification,
the user is authenticated. Realms do however have the option to cut short
this iterative process, when the credentials don't verify and the user
cannot exist in any other realm. This mechanism is currently used by
the Reserved and the Kerberos realm.
This commit improves the early termination operation by allowing
realms to gracefully terminate authentication, as if the chain has been
tried out completely. Previously, early termination resulted in an
authentication error which varies the response body compared
to the failed authentication outcome where no realm could verify the
credentials successfully.
Reserved users are hence denied authentication in exactly the same
way as other users are when no realm can validate their credentials.
Backport of #56034.
Move includeDataStream flag from an IndicesOptions to IndexNameExpressionResolver.Context
as a dedicated field that callers to IndexNameExpressionResolver can set.
Also alter indices stats api to support data streams.
The rollover api uses this api and otherwise rolling over data stream does no longer work.
Relates to #53100
* Delay warning about missing x-pack (#54265)
Currently, when monitoring is enabled in a freshly-installed cluster,
the non-master nodes log a warning message indicating that master may
not have x-pack installed. The message is often printed even when the
master does have x-pack installed but takes some time to setup the local
exporter for monitoring. This commit adds the local exporter setting
`wait_master.timeout` which defaults to 30 seconds. The setting
configures the time that the non-master nodes should wait for master to
setup monitoring. After the time elapses, they log a message to the user
about possible missing x-pack installation on master.
The logging of this warning was moved from `resolveBulk()` to
`openBulk()` since `resolveBulk()` is called only on cluster updates and
the message might not be logged until a new cluster update occurs.
Closes#40898
If there are ill-formed pipelines, or other pipelines are not ready to be parsed, `InferenceProcessor.Factory::accept(ClusterState)` logs warnings. This can be confusing and cause log spam.
It might lead folks to think there an issue with the inference processor. Also, they would see logs for the inference processor even though they might not be using the inference processor. Leading to more confusion.
Additionally, pipelines might not be parseable in this method as some processors require the new cluster state metadata before construction (e.g. `enrich` requires cluster metadata to be set before creating the processor).
closes https://github.com/elastic/elasticsearch/issues/55985
Backport of #55858 to 7.x branch.
Currently the TransportBulkAction detects whether an index is missing and
then decides whether it should be auto created. The coordination of the
index creation also happens in the TransportBulkAction on the coordinating node.
This change adds a new transport action that the TransportBulkAction delegates to
if missing indices need to be created. The reasons for this change:
* Auto creation of data streams can't occur on the coordinating node.
Based on the index template (v2) either a regular index or a data stream should be created.
However if the coordinating node is slow in processing cluster state updates then it may be
unaware of the existence of certain index templates, which then can load to the
TransportBulkAction creating an index instead of a data stream. Therefor the coordination of
creating an index or data stream should occur on the master node. See #55377
* From a security perspective it is useful to know whether index creation originates from the
create index api or from auto creating a new index via the bulk or index api. For example
a user would be allowed to auto create an index, but not to use the create index api. The
auto create action will allow security to distinguish these two different patterns of
index creation.
This change adds the following new transport actions:
AutoCreateAction, the TransportBulkAction redirects to this action and this action will actually create the index (instead of the TransportCreateIndexAction). Later via #55377, can improve the AutoCreateAction to also determine whether an index or data stream should be created.
The create_index index privilege is also modified, so that if this permission is granted then a user is also allowed to auto create indices. This change does not yet add an auto_create index privilege. A future change can introduce this new index privilege or modify an existing index / write index privilege.
Relates to #53100
It's possible for a constant_keyword to have a 'null' value before any documents
are seen that contain a value for the field. In this case, no documents have a
value for the field, and 'exists' queries should return no documents.
Adds the step of stopping all data frame analytics before
deleting them to the cleanup of the corresponding HLRC tests.
Closes#56097
Backport of #56101
* Reject queries that act on nested fields or fields with nested field types in their hierarchy (#55721)
(cherry picked from commit 2a024461cd9da821112953d4c6e565ea622c678b)
Backports #55933 to 7.x
Implements value_count and avg aggregations over Histogram fields as discussed in #53285
- value_count returns the sum of all counts array of the histograms
- avg computes a weighted average of the values array of the histogram by multiplying each value with its associated element in the counts array
Using optimistic locking, add the ability to run a repository state
update task with a consistent view of the current repository data.
Allows for a follow-up to remove the snapshot INIT state.
* Allow Deleting Multiple Snapshots at Once (#55474)
Adds deleting multiple snapshots in one go without significantly changing the mechanics of snapshot deletes otherwise.
This change does not yet allow mixing snapshot delete and abort. Abort is still only allowed for a single snapshot delete by exact name.
* Make xpack.monitoring.enabled setting a no-op
This commit turns xpack.monitoring.enabled into a no-op. Mostly, this involved
removing the setting from the setup for integration tests. Monitoring may
introduce some complexity for test setup and teardown, so we should keep an eye
out for turbulence and failures
* Docs for making deprecated setting a no-op
This commit converts the remaining isXXXAllowed methods to instead of
use isAllowed with a Feature value. There are a couple other methods
that are static, as well as some licensed features that check the
license directly, but those will be dealt with in other followups.
* Emit deprecation warning if multiple v1 templates match with a new index (#55558)
* Emit deprecation warning if multiple v1 templates match with a new index
* DEPRECATION_LOGGER rename
This commit includes a number of minor improvements around `DelayableWriteable`: javadocs were expanded and reworded, `get` was renamed to `expand` and `DelayableWriteable` no longer implements `Supplier`. Also a couple of methods are now private instead of package private.
This commit correctly sets the maxLinesPerRow in the CsvPreference for delimited files given the file structure finder settings.
Previously, it was silently ignored.
This refactors native integ tests to assert progress without
expecting explicit phases for analyses. We can test those with
yaml tests in a single place.
Backport of #55925
* Make xpack.ilm.enabled setting a no-op
* Add watcher setting to not use ILM
* Update documentation for no-op setting
* Remove NO_ILM ml index templates
* Remove unneeded setting from test setup
* Inline variable definitions for ML templates
* Use identical parameter names in templates
* New ILM/watcher setting falls back to old setting
* Add fallback unit test for watcher/ilm setting
Fixes test by exposing the method ModelLoadingService::addModelLoadedListener()
so that the test class can be notified when a model is loaded which happens in
a background thread
implement throttling in async-indexer used by rollup and transform. The added
docs_per_second parameter is used to calculate a delay before the next
search request is send. With re-throttle its possible to change the parameter
at runtime. When stopping a running job, its ensured that despite throttling
the indexer stops in reasonable time. This change contains the groundwork, but
does not expose the new functionality.
relates #54862
backport: #55011
We were creating PemKeyConfig objects using different private
keys but always using testnode.crt certificate that uses the
RSA public key. The PemKeyConfig was built but we would
then later fail to handle SSL connections during the TLS
handshake eitherway.
This became obvious in FIPS tests where the consistency
checks that FIPS 140 mandates kick in and failed early
becausethe private key was of different type than the
public key
Anonymous roles resolution and user role deduplication are now performed during authentication instead of authorization. The change ensures:
* If anonymous access is enabled, user will be able to see the anonymous roles added in the roles field in the /_security/_authenticate response.
* Any duplication in user roles are removed and will not show in the above authenticate response.
* In any other case, the response is unchanged.
It also introduces a behaviour change: the anonymous role resolution is now authentication node specific, previously it was authorization node specific. Details can be found at #47195 (comment)
Backports #55826 to 7.x
Modified AggregatorTestCase.searchAndReduce() method so that it returns an empty aggregation result when no documents have been inserted.
Also refactored several aggregation tests so they do not re-implement method AggregatorTestCase.testCase()
Fixes#55824
On second thought, this check does not seem to be adding value.
We can test that the phases are as we expect them for each analysis
by adding yaml tests. Those would fail if we introduce new phases
from c++ accidentally or without coordination. This would achieve
the same thing. At the same time we would not have to comment out
this code each time a new phase is introduced. Instead we can just
temporarily mute those yaml tests. Note I will add those tests
right after the imminent new phases are added to the c++ side.
Backport of #55926
While it is good to not be lenient when attempting to guess the file format, it is frustrating to users when they KNOW it is CSV but there are a few ill-formatted rows in the file (via some entry error, etc.).
This commit allows for up to 10% of sample rows to be considered "bad". These rows are effectively ignored while guessing the format.
This percentage of "allows bad rows" is only applied when the user has specified delimited formatting options. As the structure finder needs some guidance on what a "bad row" actually means.
related to https://github.com/elastic/elasticsearch/issues/38890
Implements Sum aggregation over Histogram fields by summing the value of each bucket multiplied by their count as requested in #53285
Backports #55681 to 7.x
We were previously checking at least one supported field existed
when the _explain API was called. However, in the case of analyses
with required fields (e.g. regression) we were not accounting that
the dependent variable is not a feature and thus if the source index
only contains the dependent variable field there are no features to
train a model on.
This commit adds a validation that at least one feature is available
for analysis. Note that we also move that validation away from
`ExtractedFieldsDetector` and the _explain API and straight into
the _start API. The reason for doing this is to allow the user to use
the _explain API in order to understand why they would be seeing an
error like this one.
For example, the user might be using an index that has fields but
they are of unsupported types. If they start the job and get
an error that there are no features, they will wonder why that is.
Calling the _explain API will show them that all their fields are
unsupported. If the _explain API was failing instead, there would
be no way for the user to understand why all those fields are
ignored.
Closes#55593
Backport of #55876
This change adds a new setting, daily_model_snapshot_retention_after_days,
to the anomaly detection job config.
Initially this has no effect, the effect will be added in a followup PR.
This PR gets the complexities of making changes that interact with BWC
over well before feature freeze.
Backport of #55878
This replaces a reference to the result of partially reducing
aggregations that async search keeps with a reference to the serialized
form of the result of the partial reduction which we need to keep
anyway.
This has no practical impact on users since frozen indices are the only
throttled indices today. However this has an impact on upcoming features
that would use search throttling.
Filtering out throttled indices made sense a couple years ago, but as
we're now improving support for slow requests with `_async_search` and
exploring ways to reduce storage costs, this feature has most likely
become a trap, that we'd like to not have with upcoming features that
would use search throttling.
Relates #54058
Today when prewarming a searchable snapshot we use the `SparseFileTracker` to
lock each (part of a) snapshotted blob, blocking any other readers from
accessing this data until the whole part is available.
This commit changes this strategy: instead we optimistically start to download
the blob without any locking, and then lock much smaller ranges after each
individual `read()` call. This may mean that some bytes are downloaded twice,
but reduces the time that other readers may need to wait before the data they
need is available.
As a best-effort optimisation we try to request the smallest possible single
range of missing bytes in the part by first checking how many of the initial
and terminal bytes of the part are already present in cache. In particular if
the part is already fully cached before prewarming then this check means we
skip the part entirely.
Currently there is a clear mechanism to stub sending a request through
the transport. However, this is limited to testing exceptions on the
sender side. This commit reworks our transport related testing
infrastructure to allow stubbing request handling on the receiving side.
Adding to #55659, we missed another way we could set the task to
failed due to task cancellation. CI revealed that we might also
get a `SearchPhaseExecutionException` whose cause is a
`TaskCancelledException`. That exception is not wrapped so
unwrapping it will not return the underlying `TaskCancelledException`.
Thus to be complete in catching this, we also need to check the
error's cause.
Closes#55068
Backport of #55797
This is a continuation from #55580.
Now that we're parsing phase progresses from the analytics process
we change `ProgressTracker` to allow for custom phases between
the `loading_data` and `writing_results` phases. Each `DataFrameAnalysis`
may declare its own phases.
This commit sets things in place for the analytics process to start
reporting different phases per analysis type. However, this is
still preserving existing behaviour as all analyses currently
declare a single `analyzing` phase.
Backport of #55763
This change adds validation when running the users tool so that
if Elasticsearch is expected to run in a JVM that is configured to
be in FIPS 140 mode and the password hashing algorithm is not
compliant, we would throw an error.
Users tool uses the configuration from the node and this validation
would also happen upon node startup but users might be added in the
file realm before the node is started and we would have the
opportunity to notify the user of this misconfiguration.
The changes in #55544 make this much less probable to happen in 8
since the default algorithm will be compliant but this change can
act as a fallback in anycase and makes for a better user experience.
Our handling for concurrent refresh of access tokens suffered from
a race condition where:
1. Thread A has just finished with updating the existing token
document, but hasn't stored the new tokens in a new document
yet
2. Thread B attempts to refresh the same token and since the
original token document is marked as refreshed, it decrypts and
gets the new access token and refresh token and returns that to
the caller of the API.
3. The caller attempts to use the newly refreshed access token
immediately and gets an authentication error since thread A still
hasn't finished writing the document.
This commit changes the behavior so that Thread B, would first try
to do a Get request for the token document where it expects that
the access token it decrypted is stored(with exponential backoff )
and will not respond until it can verify that it reads it in the
tokens index. That ensures that we only ever return tokens in a
response if they are already valid and can be used immediately
It also adjusts TokenAuthIntegTests
to test authenticating with the tokens each thread receives,
which would fail without the fix.
Resolves: #54289
The failed_category_count statistic records the number of times
categorization wanted to create a new category but couldn't
because the job had reached its model_memory_limit.
Backport of #55716
This commit refactors all spatial Field Mappers to a common
AbstractGeometryFieldMapper that implements shared parameter functionality
(e.g., ignore_malformed, ignore_z_value) and provides a common framework for
overriding type parsing, and building in xpack. Common shape functionality is
implemented in a new AbstractShapeGeometryFieldMapper that is reused and
overridden in GeoShapeFieldMapper, GeoShapeFieldMapperWithDocValues,
LegacyGeoShapeFieldMapper, and ShapeFieldMapper. This abstraction provides a
reusable foundation for adding new xpack features; such as coordinate reference
system support.
This commit removes some code duplication in the CCR non-compliance
tests by refactoring an assertion method so that it can be used in both
tests that are present there.
Since #55580 we've introduced a new format for parsing progress
from the data frame analytics process. As the process is now
writing out progress in this new way, we can remove the parsing
of the old format.
Backport of #55711
If we choose to read from two random positions that are 1024 bytes apart then
this counts as a contiguous read for stats purposes, failing this test. This
commit ensures that we always perform a non-contiguous read.
`SearchableSnapshotDirectoryStatsTests#testCachedBytesReadsAndWrites` asserts
that each write takes one clock tick, but we now permit concurrent reads and
writes so each write might take longer. This commit relaxes the assertion to
match.
Closes#55707
Also unmutes the integ test that stops and restarts
an outlier detection job with the hope of learning more
of the failure in #55068.
Backport of #55545
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This change turns the AsyncSearchMaintenanceService into an
AbstractLifecycleComponent and ensures that the service is
stopped when a node is closing.
Closes#55646
improve tests related to stopping using a client that answers and can be
synchronized with the test thread in order to test special situations
relates #55011
This commit fixes reproducible test failures with the
SSLReloadDuringStartupIntegTests on the 7.x branch. The failures only
occur on 7.x due to the existence of the transport client and its usage
in our test infrastructure. This change removes the randomized usage of
transport clients when retrieving a client from a node in the internal
cluster. Transport clients do not support the reloading of files for
TLS configuration changes but if we build one from the nodes settings
and attempt to use it after the files have been changed, the client
will not know about the changes and the TLS connection will fail.
Closes#55524
License state is currently made up of boolean methods that check whether
a particular feature is allowed by the current license state. Each new
feature must copy/past boiler plate code. While that has gotten easier
with utilities like isAllowedByLicense, this is still more cumbersome
than should be necessary. This commit adds a general purpose isAllowed
method which takes a new Feature enum, where each value of the enum
defines the minimum license mode and whether the license must be active
to be allowed. Only security features are converted in this PR, in order
to keep the commit size relatively small. The rest of the features will
be converted in a followup.
This adds a validation to VSParserHelper to ensure that a field or
script or both are specified by the user. This is technically
required today already, but throws an exception much deeper
in the agg framework and has a very unintuitive error for the user
(as well as eating more resources instead of failing early)
The (de)serialization code of the async search response
cannot handle exceptions that extend ElasticsearchException (e.g. ScriptException).
This commit fixes this bug by serializing the error with the more generic
StreamInput#writeException.
EQL will require very similar functionality to async search. This PR refactors
AsyncSearchIndexService to make it reusable for EQL.
Supersedes #55119
Relates to #49638
This commit refactors geo_shape doc values, fielddata, and utility classes from
the single mapper package in x-pack spatial plugin to a package structure that
is consistent with the server module.
Previously audit messages were indexed when datafeeds that were
assigned to a node were stopped, but not datafeeds that were
unassigned at the time they were stopped.
This change adds auditing for the unassigned case.
Backport of #55656
While we were catching `TaskCancelledException` while we wait for
reindexing to complete, we missed the fact that this exception
may be wrapped in a multi-node cluster. This is the reason
we may still fail the task when stop is called while reindexing.
Some times we're lucky and the exception is thrown by the same
node that runs the job. Then the exception is not wrapped and
things work fine. But when that is not the case the exception is
wrapped, we fail to catch it, and set the task to failed.
The fix is to simply unwrap the exception when we check it it
is `TaskCancelledException`.
Closes#55068
Backport of #55659
The CacheFile.fileLock() method is used to acquire a lock
on a cache file so that the file can't be deleted (or its file
handle closed) during the execution of a read or a write
operation.
Today this lock is obtained by first acquiring the eviction
lock (the write lock of the readwrite lock), then by checking
if the cache file is evicted and the file channel still open,
and finally by obtaining the file lock (the read lock of the
readwrite lock). Acquiring the read lock while the eviction
lock is held ensures that the cache file eviction cannot
start in the meanwhile. But eviction starts (and terminations)
also acquire the eviction lock; and this lock cannot be
obtained while a read lock is held (the write lock of a
readwrite lock is exclusive).
If we were acquiring a read lock and checking the eviction
flag and file channel existence while holding the read lock
we know that no eviction can start or finish until the
read lock is released.
Backport of #55115.
Replace calls to deprecate(String,Object...) with deprecateAndMaybeLog(...),
with an appropriate key, so that all messages can potentially be deduplicated.
The Kibana CSV export feature uses a non-standard timestamp format.
This change adds it to the formats the find_file_structure endpoint
recognizes out-of-the-box, to make round-tripping data from Kibana
back to Kibana via CSV files easier.
Fixes#55586
A JSON schema was recently introduced for the REST API specification. #54252
This PR introduces a 3rd party validation tool to ensure that the
REST specification conforms to the schema.
The task is applied to the 3 projects that contain REST API specifications.
The plugin wires this task into the precommit commit task, and should be
considered as part of the public API for the build tools for any plugin
developer to contribute their plugin's specification.
An ignore parameter has been introduced for the task to allow specific
file to be ignored from the validation. The ignored files in this PR
will soon get issues logged and a link so they can be fixed.
Closes#54314
Out of the box "access granted" audit events are not logged
for system users. The list of system users was stale and included
only the _system and _xpack users. This commit expands this list
with _xpack_security and _async_search, effectively reducing the
auditing noise by not logging the audit events of these system
users out of the box.
Closes#37924
The testPerformAction test has been failing periodically due to
how Hamcrest's containsStringIgnoringCase does not lowercase using
the same Locale set in the test infrastructure.
This commit falls back to explicitly lowercasing using the root
locale
This commit adds a new GeoShapeBoundsAggregator to the spatial plugin and registers it with the GeoShapeValuesSourceType. This enables geo_bounds aggregations on geo_shape fields
After #53562, the `geo_shape` field mapper is registered within
a module. This opens the door for introducing a new `geo_shape`
field mapper into the Spatial Plugin that has doc-values support.
This is very much an extension of server's GeoShapeFieldMapper,
but with the addition of the doc values implementation.
Data frame analytics process currently reports progress as
an integer `progress_percent`. We parse that and report it
from the _stats API as the progress of the `analyzing` phase.
However, we want to allow the DFA process to report progress
for more than one phase. This commit prepares for this by
parsing `phase_progress` from the process, an object that
contains the `phase` name plus the `progress_percent` for that
phase.
Backport of #55580
Instead of doing our own checks against REST status, shard counts, and shard failures, this commit changes all our extractor search requests to set `.setAllowPartialSearchResults(false)`.
- Scrolls are automatically cleared when a search failure occurs with `.setAllowPartialSearchResults(false)` set.
- Code error handling is simplified
closes https://github.com/elastic/elasticsearch/issues/40793
Issue #55521 suggested that audit messages were not written when
closing an unassigned job. This is not the case, but we didn't
have a test to prove it.
Backport of #55571
The ML info endpoint returns the max_model_memory_limit setting
if one is configured. However, it is still possible to create
a job that cannot run anywhere in the current cluster because
no node in the cluster has enough memory to accommodate it.
This change adds an extra piece of information,
limits.effective_max_model_memory_limit, to the ML info
response that returns the biggest model memory limit that could
be run in the current cluster assuming no other jobs were
running.
The idea is that the ML UI will be able to warn users who try to
create jobs with higher model memory limits that their jobs will
not be able to start unless they add a bigger ML node to their
cluster.
Backport of #55529
Adds a "node" field to the response from the following endpoints:
1. Open anomaly detection job
2. Start datafeed
3. Start data frame analytics job
If the job or datafeed is assigned to a node immediately then
this field will return the ID of that node.
In the case where a job or datafeed is opened or started lazily
the node field will contain an empty string. Clients that want
to test whether a job or datafeed was opened or started lazily
can therefore check for this.
Backport of #55473
PEMUtils would incorrectly fill the encryption password with zeros
(the '\0' character) after decrypting a PKCS#8 key.
Since PEMUtils did not take ownership of this password it should not
zero it out because it does not know whether the caller will use that
password array again. This is actually what PEMKeyConfig does - it
uses the key encryption password as the password for the ephemeral
keystore that it creates in order to build a KeyManager.
Backport of: #55457
In the test after the first load event is is not known which models are cached as
loading a later one will evict an earlier one and the order is not known.
The models could have been loaded 1 or 2 times not exactly twice
This change folds the removal of the in-progress snapshot entry
into setting the safe repository generation. Outside of removing
an unnecessary cluster state update, this also has the advantage
of removing a somewhat inconsistent cluster state where the safe
repository generation points at `RepositoryData` that contains a
finished snapshot while it is still in-progress in the cluster
state, making it easier to reason about the state machine of
upcoming concurrent snapshot operations.
Today a read-only engine requires a complete history of operations, in the
sense that its local checkpoint must equal its maximum sequence number. This is
a valid check for read-only engines that were obtained by closing an index
since closing an index waits for all in-flight operations to complete. However
a snapshot may not have this property if it was taken while indexing was
ongoing, but that's ok.
This commit weakens the check for a complete history to exclude the case of a
searchable snapshot.
Relates #50999
This change ensures that we return the latest expiration time
when retrieving the response from the index.
This commit also fixes a bug that stops the garbage collection of saved responses if the async search index is deleted.
If more than 100 shard-follow tasks are trying to connect to the remote
cluster, then some of them will abort with "connect listener queue is
full". This is because we retry on ESRejectedExecutionException, but not
on RejectedExecutionException.