This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
A voting-only master-eligible node is a node that can participate in master elections but will not act
as a master in the cluster. In particular, a voting-only node can help elect another master-eligible
node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at
least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two
can still elect a master amongst them-selves. This only requires one of the two remaining nodes to
have the capability to act as master, but both need to have voting powers. This means that one of
the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated
master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a
voting-only non-dedicated master node can play the role of the third master-eligible node, which
allows running an HA cluster with only two dedicated master nodes.
Closes#14340
Co-authored-by: David Turner <david.turner@elastic.co>
This commit changes the `role_descriptors` field from required
to optional when creating API key. The default behavior in .NET ES
client is to omit properties with `null` value requiring additional
workarounds. The behavior for the API does not change.
Field names (`id`, `name`) in the invalidate api keys API documentation have been
corrected where they were wrong.
Closes#42053
Now that the fix krb5-kdc fixture (entropy problem in docker container)
is in and the converting `kerberos-tests` to testclusters is done,
enabling the kerberos-tests
Closes#40678
After two recent changes (#38824 and #33888), the _cat/indices API
no longer report information for active recovering indices and
non-replicated closed indices. It also misreport replicated closed
indices that are potentially not authorized for the user.
This commit changes how the cat action works by first using the
Get Settings API in order to resolve authorized indices. It then uses
the Cluster State, Cluster Health and Indices Stats APIs to retrieve
information about the indices.
Closes#39933
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The error message if the native controller failed to run
(for example due to running Elasticsearch on an unsupported
platform) was not easy to understand. This change removes
pointless detail from the message and adds some hints about
likely causes.
Fixes#42341
Currently nio implements ip filtering at the channel context level. This
is kind of a hack as the application logic should be implemented at the
handler level. This commit moves the ip filtering into a channel
handler. This requires adding an indicator to the channel handler to
show when a channel should be closed.
This commit ensures that ILM's Shrink action will take node versions into
account when choosing which node to allocate to when shrinking an
index. Prior to this change, ILM could pick a node with a lower version
than some shards are already allocated to, which causes the new
allocation to fail as shards can't be relocated onto a node with a lower
version than they are already on.
As part of this, when making the decision about which node to allocate
to prior to Shrink, all shards in the index are considered, rather than
choosing a random shard to consider.
Further, the unit tests for the logic that chooses a node to allocate
shards to pre-shrink has been improved to validate the behavior in more
realistic and varied initial conditions.
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
* [ML][Data Frame] Add version and create_time to transform config (#43384)
* [ML][Data Frame] Add version and create_time to transform config
* s/transform_version/version s/Date/Instant
* fixing getter/setter for version
* adjusting for backport
After the network disruption a partition is created,
one side of which can form a cluster the other can't.
Ensure requests are sent to a node on the correct side
of the cluster
This replaces the use of char[] in the password length validation
code, with the use of SecureString
Although the use of char[] is not in itself problematic, using a
SecureString encourages callers to think about the lifetime of the
password object and to clear it after use.
Backport of: #42884
Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is
that the local checkpoint is adapted as soon as an operation is processed (but not fsynced yet). This
leaves room for the history below the global checkpoint to still change in case of a crash. As we rely
on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard
copies / follower clusters going out of sync.
This commit required changing some core classes in the system:
- The LocalCheckpointTracker keeps track now not only of the information whether an operation has
been processed, but also whether that operation has been persisted to disk.
- TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once
they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this.
- ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all
shard copies when in primary mode. The computed global checkpoint (which represents the
minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored
in the checkpoint entry for the local shard copy, has been moved to an extra field.
- The periodic global checkpoint sync now also takes async durability into account, where the local
checkpoints on shards only advance when the translog is asynchronously fsynced. This means that
the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is
not sufficient anymore.
- The new index closing API does not work when combined with async durability. The shard
verification step is now requires an additional pre-flight step to fsync the translog, so that the main
verify shard step has the most up-to-date global checkpoint at disposition.
This commit removes some very old test logging annotations that appeared
to be added to investigate test failures that are long since closed. If
these are needed, they can be added back on a case-by-case basis with a
comment associating them to a test failure.
* Narrow period of Shrink action in which ILM prevents stopping
Prior to this change, we would prevent stopping of ILM if the index was
anywhere in the shrink action. This commit changes
`IndexLifecycleService` to allow stopping when in any of the innocuous
steps during shrink. This changes ILM only to prevent stopping if
absolutely necessary.
Resolves#43253
* Rename variable for ignore actions -> ignore steps
* Fix comment
* Factor test out to test *all* stoppable steps
* [ML][Data Frame] make response.count be total count of hits
* addressing line length check
* changing response count for filters
* adjusting serialization, variable name, and total count logic
* making count mandatory for creation
* [ML][Data Frame] adds new pipeline field to dest config (#43124)
* [ML][Data Frame] adds new pipeline field to dest config
* Adding pipeline support to _preview
* removing unused import
* moving towards extracting _source from pipeline simulation
* fixing permission requirement, adding _index entry to doc
* adjusting for java 8 compatibility
* adjusting bwc serialization version to 7.3.0
This PR is a backport a of #43214 from v8.0.0
A number of the aggregation base classes have an abstract doEquals() and doHashCode() (e.g. InternalAggregation.java, AbstractPipelineAggregationBuilder.java).
Theoretically this is so the sub-classes can add to the equals/hashCode and don't need to worry about calling super.equals(). In practice, it's mostly just confusing/inconsistent. And if there are more than two levels, we end up with situations like InternalMappedSignificantTerms which has to call super.doEquals() which defeats the point of having these overridable methods.
This PR removes the do versions and just use equals/hashCode ensuring the super when necessary.
* Return 0 for negative "free" and "total" memory reported by the OS
We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.
In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.
Resolves#42157
* Fix test passing in invalid memory value
* Fix another test passing in invalid memory value
* Also change mem check in MachineLearning.machineMemoryFromStats
* Add background documentation for why we prevent negative return values
* Clarify comment a bit more
The randomization in this test would occasionally generate duplicate
node attribute keys, causing spurious test failures. This commit adjusts
the randomization to not generate duplicate keys and cleans up the data
structure used to hold the generated keys.
Backport of: https://github.com/elastic/elasticsearch/pull/43222
This commit replaces usages of Streamable with Writeable for the
SingleShardRequest / TransportSingleShardAction classes and subclasses of
these classes.
Note that where possible response fields were made final and default
constructors were removed.
Relates to #34389
Kibana wants to create access_token/refresh_token pair using Token
management APIs in exchange for kerberos tickets. `client_credentials`
grant_type requires every user to have `cluster:admin/xpack/security/token/create`
cluster privilege.
This commit introduces `_kerberos` grant_type for generating `access_token`
and `refresh_token` in exchange for a valid base64 encoded kerberos ticket.
In addition, `kibana_user` role now has cluster privilege to create tokens.
This allows Kibana to create access_token/refresh_token pair in exchange for
kerberos tickets.
Note:
The lifetime from the kerberos ticket is not used in ES and so even after it expires
the access_token/refresh_token pair will be valid. Care must be taken to invalidate
such tokens using token management APIs if required.
Closes#41943
This commit removes some trace logging for the token service in the
rolling upgrade tests. If there is an active investigation here, it
would be best to annotate this line with a comment in the source
indicating such. From my digging, it does not appear there is an active
investigation that relies on this logging, so we remove it.
This trace logging looks like it was copy/pasted from another test,
where the logging in that test was only added to investigate a test
failure. This commit removes the trace logging.
This was added to investigate a test failure over two years ago, yet
left behind. Since the test failure has been addressed since then, this
commit removes the trace logging.
* TestClusters: Convert the security plugin
This PR moves security tests to use TestClusters.
The TLS test required support in testclusters itself, so the correct
wait condition is configgured based on the cluster settings.
* PR review
The ML failover tests sometimes need to wait for jobs to be
assigned to new nodes following a node failure. They wait
10 seconds for this to happen. However, if the node that
failed was the master node and a new master was elected then
this 10 seconds might not be long enough as a refresh of the
memory stats will delay job assignment. Once the memory
refresh completes the persistent task will be assigned when
the next cluster state update occurs or after the periodic
recheck interval, which defaults to 30 seconds. Rather than
increase the length of the wait for assignment to 31 seconds,
this change decreases the periodic recheck interval to 1
second.
Fixes#43289
With this change, we will rebuild the live version map and local
checkpoint using documents (including soft-deleted) from the safe commit
when opening an internal engine. This allows us to safely prune away _id
of all soft-deleted documents as the version map is always in-sync with
the Lucene index.
Relates #40741
Supersedes #42979
* [ML][Data Frame] only complete task after state persistence
There is a race condition where the task could be completed, but there
is still a pending document write. This change moves
the task cancellation into the actionlistener of the state persistence.
intermediate commit
intermediate commit
* removing unused import
* removing unused const
* refreshing internal index after waiting for task to complete
* adjusting test data generation
* introduce state to the REST API specification
* change state over to stability
* CCR is no GA updated to stable
* SQL is now GA so marked as stable
* Introduce `internal` as state for API's, marks stable in terms of lifetime but unstable in terms of guarantees on its output format since it exposes internal representations
* make setting a wrong stability value, or not setting it at all an error that causes the YAML test suite to fail
* update spec files to be explicit about their stability state
* Document the fact that stability needs to be defined
Otherwise the YAML test runner will fail (with a nice exception message)
* address check style violations
* update rest spec unit tests to include stability
* found one more test spec file not declaring stability, made sure stability appears after documentation everywhere
* cluster.state is stable, mark response in some way to denote its a key value format that can be changed during minors
* mark data frame API's as beta
* remove internal and private as states for an API
* removed the wrong enum values in the Stability Enum in the previous commit
(cherry picked from commit 61c34bbd92f8f7e5f22fa411c6b682b0ebd8a99d)
It's possible for force merges kicked off by ILM to silently stop (due
to a node relocating for example). In which case, the segment count may
not reach what the user configured. In the subsequent `SegmentCountStep`
waiting for the expected segment count may wait indefinitely. Because of
this, this commit makes force merges "best effort" and then changes the
`SegmentCountStep` to simply report (at INFO level) if the merge was not
successful.
Relates to #42824Resolves#43245
We were stopping a node in the cluster at a time when
the replica shards of the .ml-state index might not
have been created. This change moves the wait for
green status to a point where the .ml-state index
exists.
Fixes#40546Fixes#41742
Forward port of #43111