All internal searches (triggered by APIs) across the .security index
must be performed while "under the security origin". Otherwise,
the search is performed in the context of the caller which most
likely does not have privileges to search .security (hopefully).
This commit fixes this in the case of two methods in the
TokenService and corrects an overly done such context switch
in the ApiKeyService.
In addition, this makes all tests from the client/rest-high-level
module execute as an all mighty administrator,
but not a literal superuser.
Closes#47151
which is backport merge and adds a new ingest processor, named enrich processor,
that allows document being ingested to be enriched with data from other indices.
Besides a new enrich processor, this PR adds several APIs to manage an enrich policy.
An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.
Related to #32789
This change adds:
- A new option, allow_lazy_open, to anomaly detection jobs
- A new option, allow_lazy_start, to data frame analytics jobs
Both work in the same way: they allow a job to be
opened/started even if no ML node exists that can
accommodate the job immediately. In this situation
the job waits in the opening/starting state until ML
node capacity is available. (The starting state for data
frame analytics jobs is new in this change.)
Additionally, the ML nightly maintenance tasks now
creates audit warnings for ML jobs that are unassigned.
This means that jobs that cannot be assigned to an ML
node for a very long time will show a yellow warning
triangle in the UI.
A final change is that it is now possible to close a job
that is not assigned to a node without using force.
This is because previously jobs that were open but
not assigned to a node were an aberration, whereas
after this change they'll be relatively common.
This PR adds the ability to run the enrich policy execution task in the background,
returning a task id instead of waiting for the completed operation.
Adds a new datafeed config option, max_empty_searches,
that tells a datafeed that has never found any data to stop
itself and close its associated job after a certain number
of real-time searches have returned no data.
Backport of #47922
This commit adds two APIs that allow to pause and resume
CCR auto-follower patterns:
// pause auto-follower
POST /_ccr/auto_follow/my_pattern/pause
// resume auto-follower
POST /_ccr/auto_follow/my_pattern/resume
The ability to pause and resume auto-follow patterns can be
useful in some situations, including the rolling upgrades of
cluster using a bi-directional cross-cluster replication scheme
(see #46665).
This commit adds a new active flag to the AutoFollowPattern
and adapts the AutoCoordinator and AutoFollower classes so
that it stops to fetch remote's cluster state when all auto-follow
patterns associate to the remote cluster are paused.
When an auto-follower is paused, remote indices that match the
pattern are just ignored: they are not added to the pattern's
followed indices uids list that is maintained in the local cluster
state. This way, when the auto-follow pattern is resumed the
indices created in the remote cluster in the meantime will be
picked up again and added as new following indices. Indices
created and then deleted in the remote cluster will be ignored
as they won't be seen at all by the auto-follower pattern at
resume time.
Backport of #47510 for 7.x
* [ML][Analytics] fix bug where regression deleted early does not delete state (#47885)
* [ML][Analytics] fix bug where regression deleted early does not delete state
* Fixing ml with security test failure
* fixing for older java
Batch transforms automatically stop after all data has processed, therefore tests can not reliable test the state. This change rewrites tests to remove the unreliable tests or use continuous transforms instead as they do not auto-stop.
fixes#47441
Adds the following parameters to `outlier_detection`:
- `compute_feature_influence` (boolean): whether to compute or not
feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
to the feature values
Backport of #47600
Use case:
User with `create_doc` index privilege will be allowed to only index new documents
either via Index API or Bulk API.
There are two cases that we need to think:
- **User indexing a new document without specifying an Id.**
For this ES auto generates an Id and now ES version 7.5.0 onwards defaults to `op_type` `create` we just need to authorize on the `op_type`.
- **User indexing a new document with an Id.**
This is problematic as we do not know whether a document with Id exists or not.
If the `op_type` is `create` then we can assume the user is trying to add a document, if it exists it is going to throw an error from the index engine.
Given these both cases, we can safely authorize based on the `op_type` value. If the value is `create` then the user with `create_doc` privilege is authorized to index new documents.
In the `AuthorizationService` when authorizing a bulk request, we check the implied action.
This code changes that to append the `:op_type/index` or `:op_type/create`
to indicate the implied index action.
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves#46508
Relates to #43663
When we added support for wildcard application names, we started to build
the prefix query along with the term query but we used 'filter' clause
instead of 'should', so this would not fetch the correct application
privilege descriptor thereby failing the _has_privilege checks.
This commit changes the clause to use should and with minimum_should_match
as 1.
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
In the current implementation, the validation of the role query
occurs at runtime when the query is being executed.
This commit adds validation for the role query when creating a role
but not for the template query as we do not have the runtime
information required for evaluating the template query (eg. authenticated user's
information). This is similar to the scripts that we
store but do not evaluate or parse if they are valid queries or not.
For validation, the query is evaluated (if not a template), parsed to build the
QueryBuilder and verify if the query type is allowed.
Closes#34252
* [ML][Transforms] remove `force` flag from _start (#46414)
* [ML][Transforms] remove `force` flag from _start
* fixing expected error message
* adjusting bwc version
The enrich api returns enrich coordinator stats and
information about currently executing enrich policies.
The coordinator stats include per ingest node:
* The current number of search requests in the queue.
* The total number of outstanding remote requests that
have been executed since node startup. Each remote
request is likely to include multiple search requests.
This depends on how much search requests are in the
queue at the time when the remote request is performed.
* The number of current outstanding remote requests.
* The total number of search requests that `enrich`
processors have executed since node startup.
The current execution policies stats include:
* The name of policy that is executing
* A full blow task info object that is executing the policy.
Relates to #32789
* Add retention to Snapshot Lifecycle Management (#46407)
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.
An example policy would look like:
```
PUT /_slm/policy/snapshot-every-day
{
"schedule": "0 30 2 * * ?",
"name": "<production-snap-{now/d}>",
"repository": "my-s3-repository",
"config": {
"indices": ["foo-*", "important"]
},
// Newly configured retention options
"retention": {
// Snapshots should be deleted after 14 days
"expire_after": "14d",
// Keep a maximum of thirty snapshots
"max_count": 30,
// Keep a minimum of the four most recent snapshots
"min_count": 4
}
}
```
SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
Included in this work is a new SLM stats API endpoint available through
``` json
GET /_slm/stats
```
That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.
* Add base framework for snapshot retention (#43605)
* Add base framework for snapshot retention
This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.
Relates to #38461
* Remove extraneous 'public'
* Use a local var instead of reading class var repeatedly
* Add SnapshotRetentionConfiguration for retention configuration (#43777)
* Add SnapshotRetentionConfiguration for retention configuration
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.
Relates to #43663
* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)
* Implement SnapshotRetentionTask's snapshot filtering and deletion
This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.
Relates to #43663
* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
* Allow empty/missing SLM retention configuration (#45018)
Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.
Relates to #43663
* Add min_count and max_count as SLM retention predicates (#44926)
This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.
These options are optional and one, two, or all three can be specified
in an SLM policy.
Relates to #43663
* Time-bound deletion of snapshots in retention delete function (#45065)
* Time-bound deletion of snapshots in retention delete function
With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.
Relates to #43663
* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
* Add SLM metrics gathering and endpoint (#45362)
* Add SLM metrics gathering and endpoint
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:
```
GET /_slm/stats
{
"retention_runs" : 13,
"retention_failed" : 0,
"retention_timed_out" : 0,
"retention_deletion_time" : "1.4s",
"retention_deletion_time_millis" : 1404,
"policy_metrics" : {
"daily-snapshots2" : {
"snapshots_taken" : 7,
"snapshots_failed" : 0,
"snapshots_deleted" : 6,
"snapshot_deletion_failures" : 0
},
"daily-snapshots" : {
"snapshots_taken" : 12,
"snapshots_failed" : 0,
"snapshots_deleted" : 12,
"snapshot_deletion_failures" : 6
}
},
"total_snapshots_taken" : 19,
"total_snapshots_failed" : 0,
"total_snapshots_deleted" : 18,
"total_snapshot_deletion_failures" : 6
}
```
This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.
Relates to #43663
* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
* Record history of SLM retention actions (#45513)
This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.
* Retry SLM retention after currently running snapshot completes (#45802)
* Retry SLM retention after currently running snapshot completes
This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.
Relates to #43663
* Increase timeout waiting for snapshot completion
* Apply patch
From 2374316f0d.patch
* Rename test variables
* [TEST] Be less strict for stats checking
* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)
This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.
Relates to #43663
* Check all actions preventing snapshot delete during retention (#45992)
* Check all actions preventing snapshot delete during retention run
Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:
- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running
This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.
Relates to #43663
* Add unit test for okayToDeleteSnapshots
* Fix bug where SLM retention task would be scheduled on every node
* Enhance test logging
* Ignore if snapshot is already deleted
* Missing import
* Fix SnapshotRetentionServiceTests
* Expose SLM policy stats in get SLM policy API (#45989)
This also adds support for the SLM stats endpoint to the high level rest client.
Retrieving a policy now looks like:
```json
{
"daily-snapshots" : {
"version": 1,
"modified_date": "2019-04-23T01:30:00.000Z",
"modified_date_millis": 1556048137314,
"policy" : {
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["data-*", "important"],
"ignore_unavailable": false,
"include_global_state": false
},
"retention": {}
},
"stats": {
"snapshots_taken": 0,
"snapshots_failed": 0,
"snapshots_deleted": 0,
"snapshot_deletion_failures": 0
},
"next_execution": "2019-04-24T01:30:00.000Z",
"next_execution_millis": 1556048160000
}
}
```
Relates to #43663
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.
Relates to #43663Resolves#46205
* Add error logging when exceptions are thrown
* Update serialization versions
* Fix type inference
* Use non-Cancellable HLRC return value
* Fix Client mocking in test
* Fix SLMSnapshotBlockingIntegTests for 7.x branch
* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval
* Add serialization guards for SnapshotLifecyclePolicy
ML users who upgrade from versions prior to 7.4 to 7.4 or later
will have ML results indices that do not have mappings for the
total_search_time_ms field. Therefore, when searching these
indices we must tolerate this field not having a mapping.
Fixes#46437
This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors and while computing the vector function. (#46103)
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
Currently, when using script_score functions like cosineSimilarity, the query
vector is treated as an array of doubles. Since the stored document vectors use
floats, it seems like the least surprising behavior for the query vectors to
also be float arrays.
In addition to improving consistency, this change may help with some
optimizations we have been considering around vector dot product.