The selective muting implemented for autoscaling in #67159
is extended to the ML tests that also fail when machine
memory is reported as 0.
Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.
Relates #66885
Backport of #67422
When tests in AbstractIndicesCleanerTestCase run at 1AM they can clash with scheduled clean up in CleanerService which leads to rare, non-reproducible failures.
This change fixes it by setting retention period in CleanerService to max possible time, so nothing is ever deleted by it.
Closes#64386
Relates to #67133.
Seem to #65037.
The main changes of this PR are:
Modify the construction method of DataTierAllocationDecider, add a param settings like FilterAllocationDecider.
Create DataTierAllocationDecider in the main method of DataTierMigrationRoutedStep and SetSingleNodeAllocateStep, and the DataTierAllocationDecider is constructed using the cluster settings in the cluster metadata, so the cluster level _tier filters can be seen when executing the steps.
Add some tests for the change.
Co-authored-by: bellengao <gbl_long@163.com>
- Use `argumentFormatting` starting with 1 as 0 is problematic for Java 16
- Actually enable the tests since the listener that checks the results
was not called and nothing was asserted.
- Disable one test that fails which is tracked with: #67102Fixes: #66778
(cherry picked from commit a4199af58b8c5ab4757a45248672e0233978b208)
Previously we treated attribute filtering for _tier-prefixed attributes a pass-through, meaning
that they were essentially always treated as matching in DiscoveryNodeFilters.match, however, for
exclude settings, this meant that the node was considered to match the node if a _tier* filter was
specified.
This commit prunes these attributes from the DiscoveryNodeFilters when considering the filters for
FilterAllocationDecider so that they are only considered in DataTierAllocationDecider.
Resolves#66679
Return null for any field that is present in the _ignored section of the
response, not only numerics and IPs.
(cherry picked from commit 106719f6af1eb2f06396161143a5f0185c5d1b54)
For async searches (EQL included) the client's request headers were
erroneously stored in the .tasks index. This might expose the requesting
client's HTTP Authorization header. This PR fixes that by employing the
usual approach to store only the security-internal headers, which carry
the authentication result, instead of the original Authorization header,
which is commonly utilized to redo authentication for scheduled tasks.
Rework trimToLast to take into account an ordinal for last trimming so
instead of keeping the last entry in a stage, it keeps the last entry
before the given ordinal.
This takes care of the case where a dense stage that requires several
passes does not discard valid data from a previous sparse stage that go
beyond the current stage point.
(cherry picked from commit 4f55749072b39f89822bdd52c67998f7bed890a9)
(cherry picked from commit 6b61dfead88a144c6e85e384d47a24f0c1480c6b)
(cherry picked from commit cece81b5dee88b18e3e7ea189fc342ef53ea19f2)
When a value is promoted in the LRU cache, its weight is removed and added.
The LocalModel object was recalculating the model size for ever weight check, which caused a polynomial runtime.
This commit changes the model size to only be calculated in the LocalModel ctor.
* SQL: Verify filter's condition type (#66268)
* Verify filter's condition type
This adds a check in the verifier to check if filter's condition is of a
boolean type and fail the request otherwise.
(cherry picked from commit 3aec1a3d99a3f4650ec8be014a97106320f0874a)
This commit ensures that the after key is parsed with the doc value formatter.
This is needed for unsigned longs that uses shifted longs internally.
Closes#65685
With the upcoming validation for type compatibility of the sequence
keys, several tests are failing because some fields that contain IP
data were previously mapped as keyword. Fixed the mapping and created a
new snaphost of the correctness data in the gcs bucket.
Relates to: #66183
(cherry picked from commit 7f638f661c5a5c57a4ea7d3d3e2ccf5c81ae92d1)
Today, CCR only checks the historyUUID of the leader shard when it has
operations to replicate. If the follower shard is already in-sync with
the leader shard, then CCR won't detect if the historyUUID of the leader
shard has been changed. While this is not an issue, it can annoy users
in the following situation:
The follower index is in-sync with the leader index
Users restore the leader index from snapshots
CCR won't detect the issue and report ok in its stats API
CCR suddenly stops working when users start indexing to the leader index
This commit makes sure that we always check historyUUID in every
read-request so we can detect and report the issue as soon as possible.
Backport of #65841
In #61906 we added the possibility for the master node to fetch
the size of a shard snapshot before allocating the shard to a
data node with enough disk space to host it. When merging
this change we agreed that any failure during size fetching
should not prevent the shard to be allocated.
Sadly it does not work as expected: the service only triggers
reroutes when fetching the size succeed but never when it
fails. It means that a shard might stay unassigned until
another cluster state update triggers a new allocation
(as in #64372). More sadly, the test I wrote was wrong as
it explicitly triggered a reroute.
This commit changes the InternalSnapshotsInfoService
so that it also triggers a reroute when fetching the snapshot
shard size failed, ensuring that the allocation can move
forward by using an UNAVAILABLE_EXPECTED_SHARD_SIZE
shard size. This unknown shard size is kept around in the
snapshot info service until no corresponding unassigned
shards need the information.
Backport of #65436
In case the local agg sorter queue gets full and no limit has been provided,
the local sorter will now erroneously call the failure callback for every
single row in the original rowset that's left over the local queue limit
(instead for just the first one). The failure response is dispatched in any
case, so this is relatively harmless. The sorter continues iterating on the
original response fetching subsequent pages. In case of correct Elasticsearch
behaviour, this is also harmless, it'll just trigger a number of internal
exceptions. However, in case of a pagination defect in Elasticsearch (like
GH#65685, where the same search_after is returned), this will result in an
effective spin loop, potentially rendering eventually the node unresponsive.
This PR simply breaks both the inner loop iterating over the current unsorted
rowset, as well as the outer one, iterating over the left pages.
It also fixes an outdated documentation limitation.
(cherry picked from commit 638402c387faf79bba38fcc95f371a73146efc0b)
This change ensures that we create the async search index with the right mappings and settings when updating or deleting a document. Users can delete the async search index at any time so we have to re-create it internally if necessary before applying any new operation.
In certain situations, such as when configured in FIPS 140 mode,
the Java security provider in use might throw a subclass of
java.lang.Error. We currently do not catch these and as a result
the JVM exits, shutting down elasticsearch.
This commit attempts to address this by catching subclasses of Error
that might be thrown for instance when a PBKDF2 implementation
is used from a Security Provider in FIPS 140 mode, with the password
input being less than 14 bytes (112 bits).
- In our PBKDF2 family of hashers, we catch the Error and
throw an ElasticsearchException while creating or verifying the
hash. We throw on verification instead of simply returning false
on purpose so that the message bubbles up and the cause becomes
obvious (otherwise it would be indistinguishable from a wrong
password).
- In KeyStoreWrapper, we catch the Error in order to wrap and re-throw
a GeneralSecurityException with a helpful message. This can happen when
using any of the keystore CLI commands, when the node starts or when we
attempt to reload secure settings.
- In the `elasticsearch-users` tool, we catch the ElasticsearchException that
the Hasher class re-throws and throw an appropriate UserException.
Tests are missing because it's not trivial to set CI in fips approved mode
right now, and thus any tests would need to be muted. There is a parallel
effort in #64024 to enable that and tests will be added in a followup.
KeyStoreAwareCommand attempted to deduce whether an error occurred
because of a wrong password by checking the cause of the
SecurityException that KeyStoreWrapper.decrypt() throws. Checking
for AEADBadTagException was wrong becase that exception could be
(and usually is) wrapped in an IOException. Furthermore, since we
are doing the check already in KeyStoreWrapper, we can just return
the message of the SecurityException to the user directly, as we do
in other places.
Adjusted the README file to mention both the option to preserve the test
data when simple reproducing/executing the tests, but also when starting
the server node manually and issuing the query(ies) against it.
Follows: #65400
(cherry picked from commit e3a1910d28d8b0ed20997754c74fa4d4d52cda15)
* Fix the `CombineBinaryComparisons` optimizer rule, so that semantic
equality taken into account during the optimization of `NotEquals`
Examples that previously removed the `NotEquals` expressions (leading
to incorrect results):
```
double >= 10 AND integer != 9
--> double >= 10
keyword != '2021' AND datetime >= '2020-01-01T00:00:00'
--> datetime >= '2020-01-01T00:00:00'
```
With the fix, expressions like the above will not be touched.
`NotEquals` will only be eliminated from the `AND` expression if the
left side of the `NotEquals` `semanticEquals()` to the left side
of the other expressions within the conjunction (comparisons against
the same field/expression).
* Unit tests and integration tests
Close#65322
(cherry-picked from 8b2b7fa)
Slack no longer recommends the legacy "integrations" setup (https://api.slack.com/legacy/custom-integrations/incoming-webhooks). Updated documentation to reference https://api.slack.com/messaging/webhooks instead.
Removed screenshots from our documentation related to Slack setup. We should avoid these screenshots (and simply point to Slack documentation) for Slack may change the instructions/their UI in the future.
Also added a short note on the use case of notifying multiple Slack channels.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
In #65332, the serialization of the WatcherSearchTemplateRequest class
changed to use IndicesOptions built in XContent facilities. This had
the side effect of fixing the handling of `all` for `expand_wildcards`
to include hidden indices. However, the tests in WatcherUtilsTests were
missed. This change updates those tests.
Backport of #65379
Watcher has a search template that stores indices options to be used as
part of a search during watch execution, but this was not updated to be
aware of hidden indices and the `hidden` expand_wildcards option. This
change makes use of the `IndicesOptions#toXContent` method in Watcher,
which already handles the new value. Additionally, the XContent parsing
is moved to the IndicesOptions class so that we will be less likely to
miss updating this in the future.
Closes#65148
Backport of #65332
The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ general relative slowness of searches) but there is no hard guarantees this will work
reliably as the pre-fetch threads which will update the recovery state might still be slow
to do so randomly, causing the assertions to trip.
closes#65302
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.
Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
In aa1ea96b8698aa12bed1c4e8d704882a2a639791 I made all
`testReduceRandom` tests for aggs mimick production more precisely.
More precisely, they pick the correct "lead" result when performing
partial reduction. This is great, but, sadly, some tests assumed that we
always reduced against the "first" aggregator. This fixes those tests.
Closes#65163
This commit updates the IndexAbstractionResolver so that hidden indices
are properly resolved when date math is in use and when we are checking
if the index is visible.
Closes#65157
Backport of #65236
If a shard follow-task hits a non-retryable error and stops, then we
should also stop the retention-leases renewal process associated with
that follow-task.
This change fixes the initialization of the async results service
for the EQL get async action. The boolean that differentiates EQL
from normal _async_search request is set incorrectly, which results
in errors (404) when extending the keep alive of a running EQL search.
Fixes#65108