Relates to #67133.
Seem to #65037.
The main changes of this PR are:
Modify the construction method of DataTierAllocationDecider, add a param settings like FilterAllocationDecider.
Create DataTierAllocationDecider in the main method of DataTierMigrationRoutedStep and SetSingleNodeAllocateStep, and the DataTierAllocationDecider is constructed using the cluster settings in the cluster metadata, so the cluster level _tier filters can be seen when executing the steps.
Add some tests for the change.
Co-authored-by: bellengao <gbl_long@163.com>
Previously we treated attribute filtering for _tier-prefixed attributes a pass-through, meaning
that they were essentially always treated as matching in DiscoveryNodeFilters.match, however, for
exclude settings, this meant that the node was considered to match the node if a _tier* filter was
specified.
This commit prunes these attributes from the DiscoveryNodeFilters when considering the filters for
FilterAllocationDecider so that they are only considered in DataTierAllocationDecider.
Resolves#66679
For async searches (EQL included) the client's request headers were
erroneously stored in the .tasks index. This might expose the requesting
client's HTTP Authorization header. This PR fixes that by employing the
usual approach to store only the security-internal headers, which carry
the authentication result, instead of the original Authorization header,
which is commonly utilized to redo authentication for scheduled tasks.
This change ensures that we create the async search index with the right mappings and settings when updating or deleting a document. Users can delete the async search index at any time so we have to re-create it internally if necessary before applying any new operation.
In certain situations, such as when configured in FIPS 140 mode,
the Java security provider in use might throw a subclass of
java.lang.Error. We currently do not catch these and as a result
the JVM exits, shutting down elasticsearch.
This commit attempts to address this by catching subclasses of Error
that might be thrown for instance when a PBKDF2 implementation
is used from a Security Provider in FIPS 140 mode, with the password
input being less than 14 bytes (112 bits).
- In our PBKDF2 family of hashers, we catch the Error and
throw an ElasticsearchException while creating or verifying the
hash. We throw on verification instead of simply returning false
on purpose so that the message bubbles up and the cause becomes
obvious (otherwise it would be indistinguishable from a wrong
password).
- In KeyStoreWrapper, we catch the Error in order to wrap and re-throw
a GeneralSecurityException with a helpful message. This can happen when
using any of the keystore CLI commands, when the node starts or when we
attempt to reload secure settings.
- In the `elasticsearch-users` tool, we catch the ElasticsearchException that
the Hasher class re-throws and throw an appropriate UserException.
Tests are missing because it's not trivial to set CI in fips approved mode
right now, and thus any tests would need to be muted. There is a parallel
effort in #64024 to enable that and tests will be added in a followup.
This commit fixes two problems:
- When extracting a doc value, we allow boolean scalars to be used as input
- The output order of processed feature names is deterministic. Previous custom one hot fields used to be non-deterministic and thus could cause weird bugs.
Random the compression factor starting with 1 (to elimitinate nearly 0 values)
which will only use one centroid (and yield 0 for MAD as the aproximate median
is the same as the single centroid mean value)
(cherry picked from commit 940e0f1fde0f40f99af117dd03ab0891c9eedae6)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
The SchedulerEngine used by SLM uses a custom runnable that will
schedule itself for its next execution if there is one to run. For the
majority of jobs, this scheduling could be many hours or days away. Due
to the scheduling so far in advance, there is a chance that time drifts
on the machine or even that time varies core to core so there is no
guarantee that the job actually runs on or after the scheduled time.
This can cause some jobs to reschedule themselves for the same
scheduled time even if they ran only a millisecond prior to the
scheduled time, which causes unexpected actions to be taken such as
what appears as duplicated snapshots.
This change resolves this by checking the triggered time against the
scheduled time and using the appropriate value to ensure that we do
not have unexpected job runs.
Relates #63754
Backport of #64501
This commit internalizes whether or not a role represents the ability to
contain data. In the future, this will let us remove the compatibility
role notion.
This commit adjusts the defaults for the tiered data roles so that they
are enabled by default, or if the node has the legacy data role. This
ensures that the default experience is that the tiered data roles are
enabled.
To fully specifiy the behavior for the tiered data roles then:
- starting a new node with the defaults: enabled
- starting a new node with node.roles configured: enabled if and only
if the tiered data roles are explicitly configured, independently
of the node having the data role
- starting a new node with node.data enabled: enabled unless the
tiered data roles are explicitly disabled
- starting a new node with node.data disabled: disabled unless the
tiered data roles are explicitly enabled
XPack usage starts out on management threads, but depending on the
implementation of the usage plugin, they could end up running on
transport threads instead. Fixed to always reschedule on a management
thread.
With this change, we will always return the same point in time in a
search response as its input until we implement the retry mechanism
for the point in times.
Adds support for the unsigned_long type to data frame analytics.
This type is handled in the same way as the long type. Values
sent to the ML native processes are converted to floats and
hence will lose accuracy when outside the range where a float
can uniquely represent long values.
Backport of #64066
When calculating feature importance, the leaf values directly correlate the value of the importance.
Consequently, positive leaf values -> positive feature importance
negative leaf values -> negative feature importance.
It follows that for binary classification, this is done such that the importance relates to the leaf values, which relate directly to the "probability of class 1".
So, the feature importance calculated is always for the importance as it relates to class 1.
The inverse is the importance as it relates to class 0.
* Async search should retry updates on version conflict
The _async_search APIs can throw version conflict exception when the internal response
is updated concurrently. That can happen if the final response is written while the user
extends the expiration time. That scenario should be rare but it happened in Kibana for
several users so this change ensures that updates are retried at least 5 times. That
should resolve the transient errors for Kibana. This change also preserves the version
conflict exception in case the retry didn't work instead of returning a confusing 404.
This commit also ensures that we don't delete the response if the search was cancelled
internally and not deleted explicitly by the user.
Closes#63213
The `remote_monitoring_agent` reserved role is extended to grant more privileges
over the metricbeat-* index pattern.
In addition to the index and create_index index privileges that it granted already,
it now also grants the view_index_metadata privilege.
Closes#63203
This commit ensures that jobs within the SchedulerEngine do not
continue to run after they are cancelled. There was no synchronization
between the cancel method of an ActiveSchedule and the run method, so
an actively running schedule would go ahead and reschedule itself even
if the cancel method had been called.
This commit adds synchronization between cancelling and the scheduling
of the next run to ensure that the job is cancelled. In real life
scenarios this could manifest as a job running multiple times for
SLM. This could happen if a job had been triggered and was cancelled
prior to completing its run such as if the node was no longer the
master node or if SLM was stopping/stopped.
Closes#63754
Backport of #63762
Adds validation that the dest pipeline exists when a transform
is updated. Refactors the pipeline check into the `SourceDestValidator`.
Fixes#59587
Backport of #63494
As a result of this, we can remove a chunk of code from TypeParsers as well. Tests
for search/index mode analyzers have moved into their own file. This commit also
rationalises the serialization checks for parameters into a single SerializerCheck
interface that takes the values includeDefaults, isConfigured and the value
itself.
Relates to #62988
This PR updates the `logstash_admin` role to include the recently-added Logstash Pipeline Management APIs, as well as access to the `.logstash*` index pattern.
Co-authored-by: William Brafford <williamrandolphbrafford@gmail.com>
This PR adds deprecation warnings when accessing System Indices via the REST layer. At this time, these warnings are only enabled for Snapshot builds by default, to allow projects external to Elasticsearch additional time to adjust their access patterns.
Deprecation warnings will be triggered by all REST requests which access registered System Indices, except for purpose-specific APIs which access System Indices as an implementation detail a few specific APIs which will continue to allow access to system indices by default:
- `GET _cluster/health`
- `GET {index}/_recovery`
- `GET _cluster/allocation/explain`
- `GET _cluster/state`
- `POST _cluster/reroute`
- `GET {index}/_stats`
- `GET {index}/_segments`
- `GET {index}/_shard_stores`
- `GET _cat/[indices,aliases,health,recovery,shards,segments]`
Deprecation warnings for accessing system indices take the form:
```
this request accesses system indices: [.some_system_index], but in a future major version, direct access to system indices will be prevented by default
```
Determines the shard size of shards before allocating shards that are
recovering from snapshots. It ensures during shard allocation that the
target node that is selected as recovery target will have enough free
disk space for the recovery event. This applies to regular restores,
CCR bootstrap from remote, as well as mounting searchable snapshots.
The InternalSnapshotInfoService is responsible for fetching snapshot
shard sizes from repositories. It provides a getShardSize() method
to other components of the system that can be used to retrieve the
latest known shard size. If the latest snapshot shard size retrieval
failed, the getShardSize() returns
ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE. While
we'd like a better way to handle such failures, returning this value
allows to keep the existing behavior for now.
Note that this PR does not address an issues (we already have today)
where a replica is being allocated without knowing how much disk
space is being used by the primary.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Add a new ids field to the API of invalidating API keys so that it supports bulk
invalidation with a list of IDs.
Note the existing id field is kept as is and it is an error if both id and ids are specified.
Getting the API key document form the security index is the most time consuing part
of the API Key authentication flow (>60% if index is local and >90% if index is remote).
This traffic is now avoided by caching added with this PR.
Additionally, we add a cache invalidator registry so that clearing of different caches will
be managed in a single place (requires follow-up PRs).
We only ever use this with `XContentParser` no need to make it inline
worse by forcing the lambda and hence dynamic callsite here.
=> Extraced the exception formatting code path that is likely very cold
to a separate method and removed the lambda usage in hot loops by simplifying
the signature here.
this adds the new field `feature_importance_baseline` and allows it to be optionally be included in the model's metadata.
Related to: https://github.com/elastic/ml-cpp/pull/1522
* [ML] optimize delete expired snapshots (#63134)
When deleting expired snapshots, we do an individual delete action per snapshot per job.
We should instead gather the expired snapshots and delete them in a single call.
This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup
closes https://github.com/elastic/elasticsearch/issues/62875
We would only check for a null value and not for an empty string so
that meant that we were not actually enforcing this mandatory
setting. This commits ensures we check for both and fail
accordingly if necessary, on startup