This commit adds two APIs that allow to pause and resume
CCR auto-follower patterns:
// pause auto-follower
POST /_ccr/auto_follow/my_pattern/pause
// resume auto-follower
POST /_ccr/auto_follow/my_pattern/resume
The ability to pause and resume auto-follow patterns can be
useful in some situations, including the rolling upgrades of
cluster using a bi-directional cross-cluster replication scheme
(see #46665).
This commit adds a new active flag to the AutoFollowPattern
and adapts the AutoCoordinator and AutoFollower classes so
that it stops to fetch remote's cluster state when all auto-follow
patterns associate to the remote cluster are paused.
When an auto-follower is paused, remote indices that match the
pattern are just ignored: they are not added to the pattern's
followed indices uids list that is maintained in the local cluster
state. This way, when the auto-follow pattern is resumed the
indices created in the remote cluster in the meantime will be
picked up again and added as new following indices. Indices
created and then deleted in the remote cluster will be ignored
as they won't be seen at all by the auto-follower pattern at
resume time.
Backport of #47510 for 7.x
Currently there are two issues with serializing BulkByScrollResponse.
First, when deserializing from XContent, indexing exceptions and search
exceptions are switched. Additionally, search exceptions do no retain
the appropriate RestStatus code, so you must evaluate the status code
from the exception. However, the exception class is not always correctly
retained when serialized.
This commit adds tests in the failure case. Additionally, fixes the
swapping of failure types and adds the rest status code to the search
failure.
Adds the following parameters to `outlier_detection`:
- `compute_feature_influence` (boolean): whether to compute or not
feature influence scores
- `outlier_fraction` (double): the proportion of the data set assumed
to be outlying prior to running outlier detection
- `standardization_enabled` (boolean): whether to apply standardization
to the feature values
Backport of #47600
Use case:
User with `create_doc` index privilege will be allowed to only index new documents
either via Index API or Bulk API.
There are two cases that we need to think:
- **User indexing a new document without specifying an Id.**
For this ES auto generates an Id and now ES version 7.5.0 onwards defaults to `op_type` `create` we just need to authorize on the `op_type`.
- **User indexing a new document with an Id.**
This is problematic as we do not know whether a document with Id exists or not.
If the `op_type` is `create` then we can assume the user is trying to add a document, if it exists it is going to throw an error from the index engine.
Given these both cases, we can safely authorize based on the `op_type` value. If the value is `create` then the user with `create_doc` privilege is authorized to index new documents.
In the `AuthorizationService` when authorizing a bulk request, we check the implied action.
This code changes that to append the `:op_type/index` or `:op_type/create`
to indicate the implied index action.
This commit adds support to retrieve all API keys if the authenticated
user is authorized to do so.
This removes the restriction of specifying one of the
parameters (like id, name, username and/or realm name)
when the `owner` is set to `false`.
Closes#46887
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves#46508
Relates to #43663
* [ML][Inference] adding .ml-inference* index and storage (#47267)
* [ML][Inference] adding .ml-inference* index and storage
* Addressing PR comments
* Allowing null definition, adding validation tests for model config
* fixing line length
* adjusting for backport
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
In the current implementation, the validation of the role query
occurs at runtime when the query is being executed.
This commit adds validation for the role query when creating a role
but not for the template query as we do not have the runtime
information required for evaluating the template query (eg. authenticated user's
information). This is similar to the scripts that we
store but do not evaluate or parse if they are valid queries or not.
For validation, the query is evaluated (if not a template), parsed to build the
QueryBuilder and verify if the query type is allowed.
Closes#34252
This commit adds support for POST requests to the SLM `_execute` API,
because POST is a more appropriate HTTP verb for this action as it is
not idempotent. The docs are also changed to favor POST over PUT,
although PUT is not removed or officially deprecated.
Using arrays of objects with embedded IDs is preferred for new APIs over
using entity IDs as JSON keys. This commit changes the SLM stats API to
use the preferred format.
* [ML][Inference] Feature pre-processing objects and functions (#46777)
To support inference on pre-trained machine learning models, some basic feature encoding will be necessary. I am using a named object serialization approach so new encodings/pre-processing steps could be added in the future.
This PR lays down the ground work for 3 basic encodings:
* HotOne
* Target Mean
* Frequency
More feature encodings or pre-processings could be added in the future:
* Handling missing columns
* Standardization
* Label encoding
* etc....
* fixing compilation for namedxcontent tests
The HLRC has a method for reindex, that allows to trigger an async reindex by running RestHighLevelClient.submitReindexTask and RestHighLevelClient.reindex. The delete by query however only has an RestHighLevelClient.deleteByQuery method (and its async counterpart), but no RestHighLevelClient.submitDeleteByQueryTask. So add RestHighLevelClient.submitDeleteByQueryTask
Closes#46395
Currently the CountRequest accepts a search source builder, while the
RestCountAction only accepts a top level query object. This can lead
to confusion if another element (e.g. aggregations) is specified,
because that will be ignored on the server side in RestCountAction.
By deprecating the current setter & constructor that accept a
SearchSourceBuilder and adding replacement that accepts a QueryBuilder
it is clear what the count api can handle from HLRC side.
Follow up from #46829
* addSnapshotLifecyclePolicy drop version assertion
This drops the assertion on the policy version (which was pinned to 1L)
as we want to execute both put policy apis (sync and async) for
documentation purposes. This will sometimes (depending on the async
call) yield a version of 2L. Waiting for the async call to always
complete could be an option but the test is already rather slow and it's
a bit of an overkill as we're already verifying the policy was created.
(cherry picked from commit af4864c39129bcdbf98d00223f445346a62075e4)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
Prior to this commit terminate_after was sent as request body parameter
(via SearchSourceBuilder), which is not possible in the count api.
Closes#46446
This commits makes all the async methods in the high level client return the `Cancellable` object that the low level client now exposes.
Relates to #45379Closes#44802
The low-level REST client exposes a `performRequestAsync` method that
allows to send async requests, but today it does not expose the ability
to cancel such requests. That is something that the underlying apache
async http client supports, and it makes sense for us to expose.
This commit adds a return value to the `performRequestAsync` method,
which is backwards compatible. A `Cancellable` object gets returned,
which exposes a `cancel` public method. When calling `cancel`, the
on-going request associated with the returned `Cancellable` instance
will be cancelled by calling its `abort` method. This works throughout
multiple retries, though some special care was needed for the case where
`cancel` is called between different attempts (when one attempt has
failed and the consecutive one has not been sent yet).
Note that cancelling a request on the client side does not automatically
translate to cancelling the server side execution of it. That needs to be
specifically implemented, which is on the work for the search API (see #43332).
Relates to #44802
rename data frame transform plugin to transform:
- rename plugin data-frame to transform
- change all package names from o.e.*.dataframe.* to o.e.*.transform.*
- necessary changes to fix loading/testing
* Add retention to Snapshot Lifecycle Management (#46407)
This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.
An example policy would look like:
```
PUT /_slm/policy/snapshot-every-day
{
"schedule": "0 30 2 * * ?",
"name": "<production-snap-{now/d}>",
"repository": "my-s3-repository",
"config": {
"indices": ["foo-*", "important"]
},
// Newly configured retention options
"retention": {
// Snapshots should be deleted after 14 days
"expire_after": "14d",
// Keep a maximum of thirty snapshots
"max_count": 30,
// Keep a minimum of the four most recent snapshots
"min_count": 4
}
}
```
SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.
Included in this work is a new SLM stats API endpoint available through
``` json
GET /_slm/stats
```
That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.
* Add base framework for snapshot retention (#43605)
* Add base framework for snapshot retention
This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.
Relates to #38461
* Remove extraneous 'public'
* Use a local var instead of reading class var repeatedly
* Add SnapshotRetentionConfiguration for retention configuration (#43777)
* Add SnapshotRetentionConfiguration for retention configuration
This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.
Relates to #43663
* Fix REST tests
* Fix more documentation
* Use Objects.equals to avoid NPE
* Put `randomSnapshotLifecyclePolicy` in only one place
* Occasionally return retention with no configuration
* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)
* Implement SnapshotRetentionTask's snapshot filtering and deletion
This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.
Relates to #43663
* Fix deletes running on the wrong thread
* Handle missing or null policy in snap metadata differently
* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>
* Use the `OriginSettingClient` to work with security, enhance logging
* Prevent NPE in test by mocking Client
* Allow empty/missing SLM retention configuration (#45018)
Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.
Relates to #43663
* Add min_count and max_count as SLM retention predicates (#44926)
This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.
These options are optional and one, two, or all three can be specified
in an SLM policy.
Relates to #43663
* Time-bound deletion of snapshots in retention delete function (#45065)
* Time-bound deletion of snapshots in retention delete function
With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.
Relates to #43663
* Wow snapshots suuuure can take a long time.
* Use a LongSupplier instead of actually sleeping
* Remove TestLogging annotation
* Remove rate limiting
* Add SLM metrics gathering and endpoint (#45362)
* Add SLM metrics gathering and endpoint
This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:
```
GET /_slm/stats
{
"retention_runs" : 13,
"retention_failed" : 0,
"retention_timed_out" : 0,
"retention_deletion_time" : "1.4s",
"retention_deletion_time_millis" : 1404,
"policy_metrics" : {
"daily-snapshots2" : {
"snapshots_taken" : 7,
"snapshots_failed" : 0,
"snapshots_deleted" : 6,
"snapshot_deletion_failures" : 0
},
"daily-snapshots" : {
"snapshots_taken" : 12,
"snapshots_failed" : 0,
"snapshots_deleted" : 12,
"snapshot_deletion_failures" : 6
}
},
"total_snapshots_taken" : 19,
"total_snapshots_failed" : 0,
"total_snapshots_deleted" : 18,
"total_snapshot_deletion_failures" : 6
}
```
This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.
Relates to #43663
* Version qualify serialization
* Initialize counters outside constructor
* Use computeIfAbsent instead of being too verbose
* Move part of XContent generation into subclass
* Fix REST action for master merge
* Unused import
* Record history of SLM retention actions (#45513)
This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.
* Retry SLM retention after currently running snapshot completes (#45802)
* Retry SLM retention after currently running snapshot completes
This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.
Relates to #43663
* Increase timeout waiting for snapshot completion
* Apply patch
From 2374316f0d.patch
* Rename test variables
* [TEST] Be less strict for stats checking
* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)
This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.
Relates to #43663
* Check all actions preventing snapshot delete during retention (#45992)
* Check all actions preventing snapshot delete during retention run
Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:
- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running
This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.
Relates to #43663
* Add unit test for okayToDeleteSnapshots
* Fix bug where SLM retention task would be scheduled on every node
* Enhance test logging
* Ignore if snapshot is already deleted
* Missing import
* Fix SnapshotRetentionServiceTests
* Expose SLM policy stats in get SLM policy API (#45989)
This also adds support for the SLM stats endpoint to the high level rest client.
Retrieving a policy now looks like:
```json
{
"daily-snapshots" : {
"version": 1,
"modified_date": "2019-04-23T01:30:00.000Z",
"modified_date_millis": 1556048137314,
"policy" : {
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["data-*", "important"],
"ignore_unavailable": false,
"include_global_state": false
},
"retention": {}
},
"stats": {
"snapshots_taken": 0,
"snapshots_failed": 0,
"snapshots_deleted": 0,
"snapshot_deletion_failures": 0
},
"next_execution": "2019-04-24T01:30:00.000Z",
"next_execution_millis": 1556048160000
}
}
```
Relates to #43663
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)
* Rewrite SnapshotLifecycleIT as as ESIntegTestCase
This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.
Relates to #43663Resolves#46205
* Add error logging when exceptions are thrown
* Update serialization versions
* Fix type inference
* Use non-Cancellable HLRC return value
* Fix Client mocking in test
* Fix SLMSnapshotBlockingIntegTests for 7.x branch
* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval
* Add serialization guards for SnapshotLifecyclePolicy
Since 7.3, the request converter for multiSearchTemplate would silently
not set the two request parameters `typed_keys` and
`max_concurrent_searches`.
Closes#46488
Add XContentType as parameter to the
AbstractResponseTestCase#createServerTestInstance method.
In the case a server side response class serializes xcontent as
bytes then the test needs to know what xcontent type was randomily selected.
This change is needed in #45970
The existing privilege model for API keys with privileges like
`manage_api_key`, `manage_security` etc. are too permissive and
we would want finer-grained control over the cluster privileges
for API keys. Previously APIs created would also need these
privileges to get its own information.
This commit adds support for `manage_own_api_key` cluster privilege
which only allows api key cluster actions on API keys owned by the
currently authenticated user. Also adds support for retrieval of
the API key self-information when authenticating via API key
without the need for the additional API key privileges.
To support this privilege, we are introducing additional
authentication context along with the request context such that
it can be used to authorize cluster actions based on the current
user authentication.
The API key get and invalidate APIs introduce an `owner` flag
that can be set to true if the API key request (Get or Invalidate)
is for the API keys owned by the currently authenticated user only.
In that case, `realm` and `username` cannot be set as they are
assumed to be the currently authenticated ones.
The changes cover HLRC changes, documentation for the API changes.
Closes#40031
This commit introduces PKI realm delegation. This feature
supports the PKI authentication feature in Kibana.
In essence, this creates a new API endpoint which Kibana must
call to authenticate clients that use certificates in their TLS
connection to Kibana. The API call passes to Elasticsearch the client's
certificate chain. The response contains an access token to be further
used to authenticate as the client. The client's certificates are validated
by the PKI realms that have been explicitly configured to permit
certificates from the proxy (Kibana). The user calling the delegation
API must have the delegate_pki privilege.
Closes#34396
Previously, the stats API reports a progress percentage
for DF analytics tasks that are running and are in the
`reindexing` or `analyzing` state.
This means that when the task is `stopped` there is no progress
reported. Thus, one cannot distinguish between a task that never
run to one that completed.
In addition, there are blind spots in the progress reporting.
In particular, we do not account for when data is loaded into the
process. We also do not account for when results are written.
This commit addresses the above issues. It changes progress
to being a list of objects, each one describing the phase
and its progress as a percentage. We currently have 4 phases:
reindexing, loading_data, analyzing, writing_results.
When the task stops, progress is persisted as a document in the
state index. The stats API now reports progress from in-memory
if the task is running, or returns the persisted document
(if there is one).