This commit adds a first draft of a regression analysis
to data frame analytics. There is high probability that
the exact syntax might change.
This commit adds the new analysis type and its parameters as
well as appropriate validation. It also modifies the extractor
and the fields detector to be able to handle categorical fields
as regression analysis supports them.
* Restrict which tasks can use testclusters
This PR fixes a problem between the interaction of test-clusters and
build cache.
Before this any task could have used a cluster without tracking it as
input.
With this change a new interface is introduced to track the tasks that
can use clusters and we do consider the cluster as input for all of
them.
* Name each inner_hits section of nested queries differently and extract and combine the multiple values it generates into a single list.
This also introduces a limitation (its origin it's with Elasticsearch
though) on the sorting capabilities when the sorting is based on the
nested fields filtered: only one of the conditions applied to nested
documents will be used in the nested sorting.
(cherry picked from commit cfc5cf68f6e83b07bb9006986d0903d6be418ec6)
This commit replaces task_state and indexer_state in the
data frame _stats output with a single top level state
that combines the two. It is defined as:
- failed if what's currently reported as task_state is failed
- stopped if there is no persistent task
- Otherwise what's currently reported as indexer_state
Backport of #45276
When using the implicit flow in OpenID Connect, the
op.token_endpoint_url should not be mandatory as there is no need
to contact the token endpoint of the OP.
* [ML][Data Frame] Add update transform api endpoint (#45154)
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts. The exception being `description`.
This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
This adds support for `geo_bounds` aggregation inside the `pivot.aggregations` configuration.
The two points returned from the `geo_bounds` aggregation are transformed into `geo_shape` whose types are dynamic given the point's similarity.
* `point` if the two points are identical
* `linestring` if the two points share either a latitude or longitude
* `polygon` if the two points are completely different
The automatically deduced mapping for the resulting field is a `geo_shape`.
We currently use the unboundid ldap SDK, which is triply licensed under
GPL-2.0, LGPL-2.1, and the "UnboundID LDAP SDK Free Use License". We currently
identify the license as the latter, but LGPL-2.1 is the one we should be using
per our policy.
The PutJob API accidentally used an "expert" API of CreateIndexRequest.
That API is semi-lenient to syntax; a type could be omitted and the
request would work as expected. But if a type was omitted it would
not merge with templates correctly, leading to index creation that
only has the template and not the requested mappings in the request.
This commit refactors the PutJob API to:
- Include the type name
- Use a less "expert" API in an attempt to future proof against errors
- Uses an XContentBuilder instead of string replacing, removes json template
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall-back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
introduces an abstraction for how checkpointing and synchronization works, covering
- retrieval of checkpoints
- check for updates
- retrieving stats information
Currently in the transport-nio work we connect and bind channels on the
a thread before the channel is registered with a selector. Additionally,
it is at this point that we set all the socket options. This commit
moves these operations onto the event-loop after the channel has been
registered with a selector. It attempts to set the socket options for a
non-server channel at registration time. If that fails, it will attempt
to set the options after the channel is connected. This should fix
#41071.
In the FIPS JVM the JVM default locale seems to leak into places
where it should be overridden. This change skips assertions
in TimestampFormatFinderTests.testGuessIsDayFirstFromLocale
that may be impacted.
Fixes#45140
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.
Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
Reloading of synonym_graph filter doesn't work currently because the search time
AnalysisMode doesn't get propagated to the TokenFilterFactory emitted by the
graph filters getChainAwareTokenFilterFactory() method. This change fixes that.
Closes#45127
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.
This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.
Fixes#45047
This change improves the exception messages that are thrown when the
system cannot read TLS resources such as keystores, truststores,
certificates, keys or certificate-chains (CAs).
This change specifically handles:
- Files that do not exist
- Files that cannot be read due to file-system permissions
- Files that cannot be read due to the ES security-manager
Backport of: #44787
There are no realms that can be configured exclusively with secure
settings. Every realm that supports secure settings also requires one
or more non-secure settings.
However, sometimes a node will be configured with entries in the
keystore for which there is nothing in elasticsearch.yml - this may be
because the realm we removed from the yml, but not deleted from the
keystore, or it could be because there was a typo in the realm name
which has accidentially orphaned the keystore entry.
In these cases the realm building would fail, but the error would not
always be clear or point to the root cause (orphaned keystore
entries). RealmSettings would act as though the realm existed, but
then fail because an incorrect combination of settings was provided.
This change causes realm building to fail early, with an explicit
message about incorrect keystore entries.
Backport of: #44471
When we create API key we check if the API key with the name
already exists. It searches with scroll enabled and this causes
the request to fail when creating large number of API keys in
parallel as it hits the number of open scroll limit (default 500).
We do not need the search context to be created so this commit
removes the scroll parameter from the search request for duplicate
API key.
If one tries to start a DF analytics job that has already run,
the result will be that the task will fail after reindexing the
dest index from the source index. The results of the prior run
will be gone and the task state is not properly set to failed
with the failure reason.
This commit improves the behavior in this scenario. First, we
set the task state to `failed` in a set of failures that were
missed. Second, a validation is added that if the destination
index exists, it must be empty.
We keep adding the current primary term to operations for which we do not assign a sequence
number. This does not make sense anymore as all operations which we care about have
sequence numbers now. The goal of this commit is to clean things up in InternalEngine and
reduce the complexity.
* We shouldn't be recreating wrapped REST handlers over and over for every request. We only use this hook in x-pack and the wrapper there does not have any per request state.
This is inefficient and could lead to some very unexpected memory behavior
=> I made the logic create the wrapper on handler registration and adjusted the x-pack wrapper implementation to correctly forward the circuit breaker and content stream flags
A mismatched configuration between the IdP and SP will often result in
SAML authentication attempts failing because the audience condition is
not met (because the IdP and SP disagree about the correct form of the
SP's Entity ID).
Previously the error message in this case did not provide sufficient
information to resolve the issue because the IdP's expected audience
would be truncated if it exceeeded 32 characters. Since the error did
not provide both IDs in full, it was not possible to determine the
correct fix (in detail) based on the error alone.
This change expands the message that is included in the thrown
exception, and also adds additional logging of every failed audience
condition, with diagnostics of the match failure.
Backport of: #44334
Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses
`onNode()` so has no effect if the node is not in the current list of nodes.
It also discards the `Runnable` it creates without having run it, so has no
effect anyway.
This commit makes these tests much stricter about properly closing the nodes
started during `Coordinator` tests, by tracking the persisted states that are
opened, and adds an assertion to catch the trappy requirement that the closing
node still belongs to the cluster.
The existing equals check was broken, and would always be false.
The correct behaviour is to return "Collections.emptyList()" whenever
the the active(licensed)-realms equals the configured-realms.
Backport of: #44399
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)
As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.
* Fix one more import
In case closing the process throws an exception we should be catching
it no matter its type. The process may have terminated because of a
fatal error in which case closing the process will throw a server
error, not an `IOException`. If this happens we fail to mark the
persistent task as failed and the task gets in limbo.
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.
This also adds an integration test that runs outlier detection on data
with missing fields.
TaskListener accepts today Throwable in its onFailure method. Though
looking at where it is called (TransportAction), it can never be
notified of a Throwable.
This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:
- `only_managed`, which causes the response to only contain indices
which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
ILM error state
"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
Today the processors setting is permitted to be set to more than the
number of processors available to the JVM. The processors setting
directly sizes the number of threads in the various thread pools, with
most of these sizes being a linear function in the number of
processors. It doesn't make any sense to set processors very high as the
overhead from context switching amongst all the threads will overwhelm,
and changing the setting does not control how many physical CPU
resources there are on which to schedule the additional threads. We have
to draw a line somewhere and this commit deprecates setting processors
to more than the number of available processors. This is the right place
to draw the line given the linear growth as a function of processors in
most of the thread pools, and that some are capped at the number of
available processors already.
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.
Backport of #44768
The problem is that RemoteClusterConnection closes the connection manager asynchronously, which races with the threadpool being shutdown at the end of the test.
Closes#44339Closes#44610
Deleting a follower index does not delete its ShardFollowTasks, potentially
leaving many persistent tasks in the cluster that cannot be allocated on
nodes and unnecessary fill the logs. This commit adds a cluster state listener
(ShardFollowTaskCleaner) that completes (with a failure) any persistent task
that refers to a non existent follower index.
I think that this bug has been introduced by #34404: before this change the
task would have been completed as failed and removed from the cluster state.
Backport of #44702 and #44801 on 7.x
* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.
(cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)
Use InspectionHelper classes to decide if the aggregations should return null (in case there is no value) or the value itself.
(cherry picked from commit dafd7b039b0da072750e8f57e7572d24f7aad44a)
* Only emit deprecation warning if there was actual change of a datafeed's job_id.
* Add @Deprecated annotation to DatafeedUpdate.Builder#setJobId method
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.
This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.
Backport of #44350
This commit renames the ILM package from indexlifecycle to ilm. We have
all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. This
commit makes that change.
This commit renames the SLM package from snapshotlifecycle to slm. We
have all come to know index lifecycle management as ILM, the APIs and
settings use ilm, and it would be nice of the package did too. For SLM,
let's use slm for all of these including the package name from the
beginning.
We often start testing with early access versions of new Java
versions and this have caused minor issues in our tests
(i.e. #43141) because the version string that the JVM reports
cannot be parsed as it ends with the string -ea.
This commit changes how we parse and compare Java versions to
allow correct parsing and comparison of the output of java.version
system property that might include an additional alphanumeric
part after the version numbers
(see [JEP 223[(https://openjdk.java.net/jeps/223)). In short it
handles a version number part, like before, but additionally a
PRE part that matches ([a-zA-Z0-9]+).
It also changes a number of tests that would attempt to parse
java.specification.version in order to get the full version
of Java. java.specification.version only contains the major
version and is thus inappropriate when trying to compare against
a version that might contain a minor, patch or an early access
part. We know parse java.version that can be consistently
parsed.
Resolves#43141
Since #44344 we use IndicesOptions.LENIENT_EXPAND_OPEN
when deciding which indices to include in checkpoint
calculation. This change uses the same option when
deciding which indices to search for data and which
indices to get mappings from, otherwise there is a
potential mismatch between the checkpoint details and
what is searched elsewhere.
* Mute failing test
tracked in #44552
* mute EvilSecurityTests
tracking in #44558
* Fix line endings in ESJsonLayoutTests
* Mute failing ForecastIT test on windows
Tracking in #44609
* mute BasicRenormalizationIT.testDefaultRenormalization
tracked in #44613
* fix mute testDefaultRenormalization
* Increase busyWait timeout windows is slow
* Mute failure unconfigured node name
* mute x-pack internal cluster test windows
tracking #44610
* Mute JvmErgonomicsTests on windows
Tracking #44669
* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot
Tracking #44671
* Mute NodeTests on Windows
Tracking #44256
This adds a new dynamic cluster setting `xpack.data_frame.num_transform_failure_retries`.
This setting indicates how many times non-critical failures should be retried before a data frame transform is marked as failed and should stop executing. At the time of this commit; Min: 0, Max: 100, Default: 10
NamedAnalyzer should return the same AnalysisMode than any custom analyzer it
wraps, otherwise AnalysisMode.ALL. This used to be only CustomAnalyzer in the
past, but with the introduction of the ReloadableCustomAnalyzer this needs to be
added as an option where the analysis mode gets propagated.
Closes#44625
This commit converts several utility classes that implement Streamable
to have StreamInput constructors. It also adds a default version of
readFrom to Streamable so that overriding to throw UOE is not necessary.
relates #34389
This commit converts several more classes from streamable to writeable
in server, mostly within the o.e.index and o.e.persistent packages.
relates #34389
* Allow empty configuration for SLM policies
When putting or updating a snapshot lifecycle policy it was not possible
to elide the `config` map. This commit makes the configuration optional,
the same way that it is when taking a snapshot.
Relates to #38461
* Add Objects.requireNonNull for required parts of the policy
* Expose index age in ILM explain output
This adds the index's age to the ILM explain output, for example:
```
{
"indices" : {
"ilm-000001" : {
"index" : "ilm-000001",
"managed" : true,
"policy" : "full-lifecycle",
"lifecycle_date" : "2019-07-16T19:48:22.294Z",
"lifecycle_date_millis" : 1563306502294,
"age" : "1.34m",
"phase" : "hot",
"phase_time" : "2019-07-16T19:48:22.487Z",
... etc ...
}
}
}
```
This age can be used to tell when ILM will transition the index to the
next phase, based on that phase's `min_age`.
Resolves#38988
* Expose age in getters and in HLRC
If the setting on the follower and the leader are identical after
filtering out private and internal settings, then we should not call
update setting (on the follower) as there's nothing to change. Moreover,
this makes the ShardFollowTask abort as it considers
ActionRequestValidationException (caused by an empty update setting
request) as a fatal error.
Closes#44521
Removes the warning suppression -Xlint:-deprecation,-rawtypes,-serial,-try,-unchecked.
Many warnings were unchecked warnings in the test code often because of the use of mocks.
These are suppressed with @SuppressWarning
many classes still use the Streamable constructors of HandledTransportAction,
this commit moves more of those classes to the new Writeable constructors.
relates #34389.
this commit removes usage of the deprecated
constructor with a single argument and no Writeable.Reader.
The purpose of this is to reduce the boilerplate necessary for
properly implementing a new action, as well as reducing the
chances of using the incorrect super constructor while classes
are being migrated to Writeable
relates #34389.
This commit converts all remaining ActionType response classes to
writeable in xpack core. It also converts a few from server which were
used by xpack core.
relates #34389
Today we have an annotation for controlling logging levels in
tests. This annotation serves two purposes, one is to control the
logging level used in tests, when such control is needed to impact and
assert the behavior of loggers in tests. The other use is when a test is
failing and additional logging is needed. This commit separates these
two concerns into separate annotations.
The primary motivation for this is that we have a history of leaving
behind the annotation for the purpose of investigating test failures
long after the test failure is resolved. The accumulation of these stale
logging annotations has led to excessive disk consumption. Having
recently cleaned this up, we would like to avoid falling into this state
again. To do this, we are adding a link to the test failure under
investigation to the annotation when used for the purpose of
investigating test failures. We will add tooling to inspect these
annotations, in the same way that we have tooling on awaits fix
annotations. This will enable us to report on the use of these
annotations, and report when stale uses of the annotation exist.
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.
relates #34389
This commit converts the request and response classes for broadcast
actions to implement ctors for Writeable.Reader and forces all future
implementations to implement the same.
relates #34389
Registering a channel with a selector is a required operation for the
channel to be handled properly. Currently, we mix the registeration with
other setup operations (ip filtering, SSL initiation, etc). However, a
fail to register is fatal. This PR modifies how registeration occurs to
immediately close the channel if it fails.
There are still two clear loopholes for how a user can interact with a
channel even if registration fails. 1. through the exception handler.
2. through the channel accepted callback. These can perhaps be improved
in the future. For now, this PR prevents writes from proceeding if the
channel is not registered.
This commit avoids a situation where we might stack overflow in the
auto-follower coordinator. In the face of repeated failures to get the
remote cluster state, we would previously be called back on the same
thread and then recurse to try again. If this failure persists, the
repeated callbacks on the same thread would lead to a stack
overflow. The repeated failures can occur, for example, if the connect
queue is full when we attempt to make a connection to the remote
cluster. This commit avoids this by truncating the call stack if we are
called back on the same thread as the initial request was made on.
This commit avoids an NPE when checking for privileges to follow
indices. The problem here is that in some cases we might not be able to
read the authentication info from the thread context. In that case, a
null user would be returned and we were not guarding against this.
* Migrate ML Actions to use writeable ActionType (#44302)
This commit converts all the StreamableResponseActionType
actions in the ML core module to be ActionType and leverage
the Writeable infrastructure.
This improves the error message when encrypting of sensitive watcher
data is configured, but no system file was specified in the keystore.
This error message is displayed on startup.
This also closes the input stream of the secure file properly.
Closes#43619
* [ML][Data Frame] treat bulk index failures as an indexing failure
* removing redundant public modifier
* changing to an ElasticsearchException
* fixing redundant public modifier
The failure is correctly getting propagated, this commit adds support to
explicitly look for .watch-history failures using the same logging strategy
as triggered watch failures.
* Add Snapshot Lifecycle Management (#43934)
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
* Fix compilation for backport
* Fix get snapshot output parsing in test
* [DOCS] Add redirects for removed autogen anchors (#44380)
* Switch <tt>...</tt> in javadocs for {@code ...}
make checkpointing more robust:
- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error
fixes#43992
This commit converts all the StreamableResponseActionType security
classes in xpack core to ActionType, implementing Writeable for their
response classes.
relates #34389
When getting authentication info from the thread context, it might be
that we encounter an I/O exception. Today we swallow this exception and
return a null authentication info to the caller. Yet, this could be
hiding bugs or errors. This commits adjusts this behavior so that we no
longer swallow the exception.
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.
relates #34389
* HLRC: Fix '+' Not Correctly Encoded in GET Req.
* Encode `+` correctly as `%2B` in URL paths
* Keep encoding `+` as space in URL parameters
* Closes#33077
This commit modifies bwc behavior in FindFileStructureAction to check
against a concrete version instead of Version.CURRENT. Checking against
Version.CURRENT does not work since it is changing, in addition to it
having different meanings on each branch.
relates #42501
Rewrites how continuous data frame transforms calculates and handles buckets that require an update. Instead of storing the whole set in memory, it pages through the updates using a 2nd cursor. This lowers memory consumption and prevents problems with limits at query time (max_terms_count). The list of updates can be re-retrieved in a failure case (#43662)
This commit moves the Supplier variant of HandledTransportAction to have
a different ordering than the Writeable.Reader variant. The Supplier
version is used for the legacy Streamable, and currently having the
location of the Writeable.Reader vs Supplier in the same place forces
using casts of Writeable.Reader to select the correct super constructor.
This change in ordering allows easier migration to Writeable.Reader.
relates #34389
Fixes a bug in the PKI authentication. This manifests when there
are multiple PKI realms configured in the chain, with different
principal parse patterns. There are a few configuration scenarios
where one PKI realm might parse the principal from the Subject
DN (according to the `username_pattern` realm setting) but
another one might do the truststore validation (according to
the truststore.* realm settings).
This is caused by the two passes through the realm chain, first to
build the authentication token and secondly to authenticate it, and
that the X509AuthenticationToken sets the principal during
construction.
Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so
that cases like #44134 can be more easily debugged. The likely reason for that one is that a test,
which was repeated again and again while always spawning a fresh Gradle worker (due to Gradle
daemon) kept increasing Gradle worker IDs, causing an overflow at some point.
Now that ML job configs are stored in an index rather than
cluster state, availability of the .ml-config index is very
important to the operation of ML. When a cluster starts up
the ML persistent tasks will be considered for node
assignment very early on. It is best in this case if
assignment is deferred until after the .ml-config index is
available.
The introduction of data frame analytics jobs has made this
problem worse, because anomaly detection jobs already waited
for the primary shards of the .ml-state, .ml-anomalies-shared
and .ml-meta indices to be available before doing node
assignment, and by coincidence this would probably lead to
the primary shards of .ml-config also being searchable. But
data frame analytics jobs had no other index checks prior to
this change.
This fixes problem 2 of #44156
* The incompatible snapshots logic was created to track 1.x snapshots that
became incompatible with 2.x
* It serves no purpose at this point
* It adds an additional GET request to every loading of
RepositoryData (from loading the incompatible snapshots blob)
By default, we don't check ranges while indexing geo_shapes. As a
result, it is possible to index geoshapes that contain contain
coordinates outside of -90 +90 and -180 +180 ranges. Such geoshapes
will currently break SQL and ML retrieval mechanism. This commit removes
these restriction from the validator is used in SQL and ML retrieval.
The base classes for transport requests and responses currently
implement Streamable and Writeable. The writeTo method on these base
classes is implemented with an empty implementation. Not only does this
complicate subclasses to think they need to call super.writeTo, but it
also can lead to not implementing writeTo when it should have been
implemented, or extendiong one of these classes when not necessary,
since there is nothing to actually implement.
This commit removes the empty writeTo from these base classes, and fixes
subclasses to not call super and in some cases implement an empty
writeTo themselves.
relates #34389
AggregatorFactory was generic over itself, but it doesn't appear we
use this functionality anywhere (e.g. to allow the super class
to declare arguments/return types generically for subclasses to
override). Most places use a wildcard constraint, and even when a
concrete type is specified it wasn't used.
But since AggFactories are widely used, this led to
the generic touching many pieces of code and making type signatures
fairly complex
When the ML memory tracker is refreshed and a refresh is
already in progress the idea is that the second and
subsequent refresh requests receive the same response as
the currently in progress refresh.
There was a bug that if a refresh failed then the ML
memory tracker's view of whether a refresh was in progress
was not reset, leading to every subsequent request being
registered to receive a response that would never come.
This change makes the ML memory tracker pass on failures
as well as successes to all interested parties and reset
the list of interested parties so that further refresh
attempts are possible after either a success or failure.
This fixes problem 1 of #44156
* Add test for SQL not being available error message in JDBC.
* Add a new qa sub-project that explicitly disables SQL XPack module in Gradle.
(cherry picked from commit 8a1ac8a3a88a325ec9b99963e0fa288c18ee0ee5)
* The created array didn't have the correct initial size while attempting to resolve multiple indices
(cherry picked from commit 341006e9913e831408f5bbc7f8ad8c453a7f630e)
Custom timestamp overrides provided to the find_file_structure
endpoint produced an invalid Grok pattern if the fractional
seconds separator was a dot rather than a comma or colon.
This commit fixes that problem and adds tests for this sort
of timestamp override.
Fixes#44110
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
Data frame task responses had logic to return a HTTP 500 status code if there was
any node or task failures even if other tasks in the same request reported correctly.
This is different to how other task responses are handled where a 200 is always
returned leaving the client should check for failures. Returning a 500 also breaks
the high level rest client so always return a 200
Closes#44011
This PR enables the indexing optimization using sequence numbers on
replicas. With this optimization, indexing on replicas should be faster
and use less memory as it can forgo the version lookup when possible.
This change also deactivates the append-only optimization on replicas.
Relates #34099
Currently when a document misses a vector value, vector function
returns 0 as a score for this document. We think this is incorrect
behaviour.
With this change, an error will be thrown if vector functions are
used with docs that are missing vector doc values.
Also VectorScriptDocValues is modified to allow size() function,
which can be used to check if a document has a value for the
vector field.
This commit converts the ConnectionManager's openConnection and connectToNode methods to
async-style. This will allow us to not block threads anymore when opening connections. This PR also
adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove
some hacks in the test infrastructure that had to account for the previous synchronous nature of the
connection APIs.
This commit ensures that cluster state publications to self also go through the transport layer. This
allows voting-only nodes to intercept the publication to self.
Fixes an issue discovered by a test failure where a voting-only node, which was the only
bootstrapped node, would not step down as master after state transfer because publishing to self
would succeed.
Closes#43631
The count should match the number of all df-analytics that
matched the id in the request. However, we set the count
to the number of df-analytics returned which was bound to the
`size` parameter. This commit fixes this by setting the count
to the count of the `get` response.
This commit changes the way we manage refreshes in the index engines.
Instead of relying on a SearcherManager, this change uses a ReaderManager that
creates ElasticsearchDirectoryReader when needed. Searchers are now created on-demand
(when acquireSearcher is called) from the current ElasticsearchDirectoryReader.
It also slightly changes the Engine.Searcher to extend IndexSearcher in order
to simplify the usage in the consumer.
A bug was introduced in 6.6.0 when we added support for
rollup indices. Rollup caps does NOT support looking at
remote indices, consequently, since we always look up rollup
caps, the datafeed fails with an error if its config
includes a concrete remote index. (When all remote indices
in a datafeed config are wildcards the problem did not
occur.)
The rollups feature does not support remote indices, so if
there is any remote index in a datafeed config (wildcarded
or not), we can skip the rollup cap checks. This PR
implements that change.
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.
As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
* [ML][Data Frame] Adding bwc tests for pivot transform (#43506)
* [ML][Data Frame] Adding bwc tests for pivot transform
* adding continuous transforms
* adding continuous dataframes to bwc
* adding continuous data frame tests
* Adding rolling upgrade tests for continuous df
* Fixing test
* Adjusting indices used in BWC, and handling NPE for seq_no_stats
* updating and muting specific bwc test
* Adjusting bwc tests for backport
This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.
The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
This commit deprecates the `transport.profiles.*.xpack.security.type`
setting. This setting is used to configure a profile that would only
allow client actions. With the upcoming removal of the transport client
the setting should also be deprecated so that it may be removed in
a future version.
This adds the ability to execute an action for each element that occurs
in an array, for example you could sent a dedicated slack action for
each search hit returned from a search.
There is also a limit for the number of actions executed, which is
hardcoded to 100 right now, to prevent having watches run forever.
The watch history logs each action result and the total number of actions
the were executed.
Relates #34546
All valid licenses permit security, and the only license state where
we don't support security is when there is a missing license.
However, for safety we should attach the system (or xpack/security)
user to internally originated actions even if the license is missing
(or, more strictly, doesn't support security).
This allows all nodes to communicate and send internal actions (shard
state, handshake/pings, etc) even if a license is transitioning
between a broken state and a valid state.
Relates: #42215
Backport of: #43468
Document level security was depending on the shared
"BitsetFilterCache" which (by design) never expires its entries.
However, when using DLS queries - particularly templated ones - the
number (and memory usage) of generated bitsets can be significant.
This change introduces a new cache specifically for BitSets used in
DLS queries, that has memory usage constraints and access time expiry.
The whole cache is automatically cleared if the role cache is cleared.
Individual bitsets are cleared when the corresponding lucene index
reader is closed.
The cache defaults to 50MB, and entries expire if unused for 7 days.
Backport of: #43669
If an item in the bulk request fails, that could be for a variety of
reasons - it may be that the underlying behaviour of security has
changed, or it may just be a transient failure during testing.
Simply asserting a `true`/`false` value produces failure messages that
are difficult to diagnose and debug. Using hamcert (`assertThat`) will
make it easier to understand the causes of failures in this test.
Backport of: #43725
Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions.
* [ML][Data Frame] add node attr to GET _stats (#43842)
* [ML][Data Frame] add node attr to GET _stats
* addressing testing issues with node.attributes
* adjusting for backport
* fix org.elasticsearch.xpack.watcher.test.integration.RejectedExecutionTests (#41777)
This commit un-mutes org.elasticsearch.xpack.watcher.test.integration.RejectedExecutionTests
which was failing intermittently due to a logic bug. It is not possible to use the real
Watcher scheduler (which is needed for this test) and reliabliby count the .triggered-watches
since current count of documents in the .triggered-watches index is based on the timing of the
scheduler and the ability to delete based on the Watcher and Write thread pools.
This commit simply removes the .triggered-watch check and relies soley on the .watcher-history
index as an indication that operations that can occur when the Watcher threadpool is rejecting.
closes#41734
* fix unlikely bug that can prevent Watcher from restarting (#42030)
The bug fixed here is unlikely to happen. It requires ES to be started with
ILM disabled, Watcher enabled, and Watcher explicitly stopped and restarted.
Due to template validation Watcher does not fully start and can result in a
partially started state. This is an unlikely scenerio outside of the testing
framework.
Note - this bug was introduced while the test that would have caught it was
muted. The test remains muted since the underlying cuase of the random failures
has not been identified. When this test is un-muted it will now work.
Currently the repsonse of the "_reload_search_analyzer" endpoint contains the
index names and nodeIds of indices were analyzers reloading was triggered. This
change add the names of the search-time analyzers that were reloaded.
Closes#43804
This adds a new cluster privilege for manage_api_key. Users with this
privilege are able to create new API keys (as a child of their own
user identity) and may also get and invalidate any/all API keys
(including those owned by other users).
Backport of: #43728
* [ML][Data Frame] using transform creation version for node assignment (#43764)
* [ML][Data Frame] using transform creation version for node assignment
* removing unused imports
* Addressing PR comment
* adjusing for backport
As defined in https://tools.ietf.org/html/rfc6749#section-2.3.1
both client id and client secret need to be encoded with the
application/x-www-form-urlencoded encoding algorithm when used as
credentials for HTTP Basic Authentication in requests to the OP.
Resolves#43709
This is an odd backport of #41774
UserRoleMapper.UserData is constructed by each realm and it is used to
"match" role mapping expressions that eventually supply the role names
of the principal.
This PR filters out `null` collection values (lists and maps), for the groups
and metadata, which get to take part in the role mapping, in preparation
for using Java 9 collection APIs. It filters them as soon as possible, during
the construction.
This commit merges the `object-fields` feature branch. The new 'flattened
object' field type allows an entire JSON object to be indexed into a field, and
provides limited search functionality over the field's contents.
Action is a class that encapsulates meta information about an action
that allows it to be called remotely, specifically the action name and
response type. With recent refactoring, the action class can now be
constructed as a static constant, instead of needing to create a
subclass. This makes the old pattern of creating a singleton INSTANCE
both misnamed and lacking a common placement.
This commit renames Action to ActionType, thus allowing the old INSTANCE
naming pattern to be TYPE on the transport action itself. ActionType
also conveys that this class is also not the action itself, although
this change does not rename any concrete classes as those will be
removed organically as they are converted to TYPE constants.
relates #34389
Renames `_id_copy` to `ml__id_copy` as field names starting with
underscore are deprecated. The new field name `ml__id_copy` was
chosen as an obscure enough field that users won't have in their data.
Otherwise, this field is only intented to be used by df-analytics.
Introduces a new `ConsistentSecureSettingsValidatorService` service that exposes
a single public method, namely `allSecureSettingsConsistent`. The method returns
`true` if the local node's secure settings (inside the keystore) are equal to the
master's, and `false` otherwise. Technically, the local node has to have exactly
the same secure settings - setting names should not be missing or in surplus -
for all `SecureSetting` instances that are flagged with the newly introduced
`Property.Consistent`. It is worth highlighting that the `allSecureSettingsConsistent`
is not a consensus view across the cluster, but rather the local node's perspective
in relation to the master.
If a job is opened and then closed and does nothing in
between then it should not persist any results or state
documents. This change adapts the no-op job test to
assert no results in addition to no state, and to log
any documents that cause this assertion to fail.
Relates elastic/ml-cpp#512
Relates #43680
The Action base class currently works for both Streamable and Writeable
response types. This commit intorduces StreamableResponseAction, for
which only the legacy Action implementions which provide newResponse()
will extend. This eliminates the need for overriding newResponse() with
an UnsupportedOperationException.
relates #34389
Since #41817 was merged the ml-cpp zip file for any
given version has been cached indefinitely by Gradle.
This is problematic, particularly in the case of the
master branch where the version 8.0.0-SNAPSHOT will
be in use for more than a year.
This change tells Gradle that the ml-cpp zip file is
a "changing" dependency, and to check whether it has
changed every two hours. Two hours is a compromise
between checking on every build and annoying developers
with slow internet connections and checking rarely
causing bug fixes in the ml-cpp code to take a long
time to propagate through to elasticsearch PRs that
rely on them.
This change removes the ability to wrap an IndexSearcher in plugins. The IndexSearcherWrapper is replaced by an IndexReaderWrapper and allows to wrap the DirectoryReader only. This simplifies the creation of the context IndexSearcher that is used on a per request basis. This change also moves the optimization that was implemented in the security index searcher wrapper to the ContextIndexSearcher that now checks the live docs to determine how the search should be executed. If the underlying live docs is a sparse bit set the searcher will compute the intersection
betweeen the query and the live docs instead of checking the live docs on every document that match the query.
This commit adds support for multiple source indices.
In order to deal with multiple indices having different mappings,
it attempts a best-effort approach to merge the mappings assuming
there are no conflicts. In case conflicts exists an error will be
returned.
To allow users creating custom mappings for special use cases,
the destination index is now allowed to exist before the analytics
job runs. In addition, settings are no longer copied except for
the `index.number_of_shards` and `index.number_of_replicas`.
Currently changing resources (like dictionaries, synonym files etc...) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.
This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
markes as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.
Relates to #29051
TransportNodesAction provides a mechanism to easily broadcast a request
to many nodes, and collect the respones into a high level response. Each
node has its own request type, with a base class of BaseNodeRequest.
This base request requires passing the nodeId to which the request will
be sent. However, that nodeId is not used anywhere. It is private to the
base class, yet serialized to each node, where the node could just as
easily find the nodeId of the node it is on locally.
This commit removes passing the nodeId through to the node request
creation, and guards its serialization so that we can remove the base
request class altogether in the future.
GatewayIndexStateIT#testRecoverBrokenIndexMetadata replies on the
flushing on shutdown. This behaviour, however, can be randomly disabled
in MockInternalEngine.
Closes#43034
* Deduplicate org.elasticsearch.xpack.core.dataframe.utils.TimeUtils and org.elasticsearch.xpack.core.ml.utils.time.TimeUtils into a common class: org.elasticsearch.xpack.core.common.time.TimeUtils.
* Add unit tests for parseTimeField and parseTimeFieldToInstant methods
Adds support for the situation where only voting-only nodes are bootstrapped. In that case, they will
still try to become elected and bring full master nodes into the cluster.
Search requests executed through the SecurityIndexSearcherWrapper throw
an UnsupportedOperationException if they match a sparse role query.
When low level cancellation is activated (which is the default since #42857),
the context index searcher creates a weight that doesn't handle #scorer.
This change fixes this bug and adds a test to ensure that we check this case.
* Use atomic boolean to guard wakeups
* Don't trigger wakeups from the select loops thread itself for registering and closing channels
* Don't needlessly queue writes
Co-authored-by: Tim Brooks <tim@uncontended.net>
* [ML][Data Frame] Add support for allow_no_match for endpoints (#43490)
* [ML][Data Frame] Add support for allow_no_match parameter in endpoints
Adds support for:
* Get Transforms
* Get Transforms stats
* stop transforms
* Update DataFrameTransformDocumentationIT.java
This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
A voting-only master-eligible node is a node that can participate in master elections but will not act
as a master in the cluster. In particular, a voting-only node can help elect another master-eligible
node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at
least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two
can still elect a master amongst them-selves. This only requires one of the two remaining nodes to
have the capability to act as master, but both need to have voting powers. This means that one of
the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated
master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a
voting-only non-dedicated master node can play the role of the third master-eligible node, which
allows running an HA cluster with only two dedicated master nodes.
Closes#14340
Co-authored-by: David Turner <david.turner@elastic.co>
This commit changes the `role_descriptors` field from required
to optional when creating API key. The default behavior in .NET ES
client is to omit properties with `null` value requiring additional
workarounds. The behavior for the API does not change.
Field names (`id`, `name`) in the invalidate api keys API documentation have been
corrected where they were wrong.
Closes#42053
After two recent changes (#38824 and #33888), the _cat/indices API
no longer report information for active recovering indices and
non-replicated closed indices. It also misreport replicated closed
indices that are potentially not authorized for the user.
This commit changes how the cat action works by first using the
Get Settings API in order to resolve authorized indices. It then uses
the Cluster State, Cluster Health and Indices Stats APIs to retrieve
information about the indices.
Closes#39933
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The error message if the native controller failed to run
(for example due to running Elasticsearch on an unsupported
platform) was not easy to understand. This change removes
pointless detail from the message and adds some hints about
likely causes.
Fixes#42341
Currently nio implements ip filtering at the channel context level. This
is kind of a hack as the application logic should be implemented at the
handler level. This commit moves the ip filtering into a channel
handler. This requires adding an indicator to the channel handler to
show when a channel should be closed.
This commit ensures that ILM's Shrink action will take node versions into
account when choosing which node to allocate to when shrinking an
index. Prior to this change, ILM could pick a node with a lower version
than some shards are already allocated to, which causes the new
allocation to fail as shards can't be relocated onto a node with a lower
version than they are already on.
As part of this, when making the decision about which node to allocate
to prior to Shrink, all shards in the index are considered, rather than
choosing a random shard to consider.
Further, the unit tests for the logic that chooses a node to allocate
shards to pre-shrink has been improved to validate the behavior in more
realistic and varied initial conditions.
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
* [ML][Data Frame] Add version and create_time to transform config (#43384)
* [ML][Data Frame] Add version and create_time to transform config
* s/transform_version/version s/Date/Instant
* fixing getter/setter for version
* adjusting for backport
After the network disruption a partition is created,
one side of which can form a cluster the other can't.
Ensure requests are sent to a node on the correct side
of the cluster
This replaces the use of char[] in the password length validation
code, with the use of SecureString
Although the use of char[] is not in itself problematic, using a
SecureString encourages callers to think about the lifetime of the
password object and to clear it after use.
Backport of: #42884
Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is
that the local checkpoint is adapted as soon as an operation is processed (but not fsynced yet). This
leaves room for the history below the global checkpoint to still change in case of a crash. As we rely
on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard
copies / follower clusters going out of sync.
This commit required changing some core classes in the system:
- The LocalCheckpointTracker keeps track now not only of the information whether an operation has
been processed, but also whether that operation has been persisted to disk.
- TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once
they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this.
- ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all
shard copies when in primary mode. The computed global checkpoint (which represents the
minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored
in the checkpoint entry for the local shard copy, has been moved to an extra field.
- The periodic global checkpoint sync now also takes async durability into account, where the local
checkpoints on shards only advance when the translog is asynchronously fsynced. This means that
the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is
not sufficient anymore.
- The new index closing API does not work when combined with async durability. The shard
verification step is now requires an additional pre-flight step to fsync the translog, so that the main
verify shard step has the most up-to-date global checkpoint at disposition.