* [ML][Data frame] fixing failure state transitions and race condition (#45627)
There is a small window for a race condition while we are flagging a task as failed.
Here are the steps where the race condition occurs:
1. A failure occurs
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
a. `finishAndSetState()` which sets the IndexerState to STARTED
b. `doSaveState(...)` which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.
The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger has had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking action, so that if the state is indeed marked failed, the "second trigger" stops ASAP.
Additionally, I moved the task state checks INTO the `start` and `stop` methods, which now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should give us some guarantees that one will not switch states out from underneath another.
I also flag the task as `failed` BEFORE we successfully write it to cluster state, to allow the task to fail more quickly. But this does add the behavior where the task is "failed" while the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop` as it is not a master node action and doesn't always have the latest cluster state.
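To illustrate, here is a minimal sketch of the locking scheme (names and bodies are illustrative, not the actual task code):

```java
// Hypothetical sketch of the synchronized, state-checked transitions
// described above; not the actual data frame task implementation.
class GuardedTransformTask {
    enum TaskState { STOPPED, STARTED, FAILED }

    private TaskState state = TaskState.STOPPED;

    synchronized void trigger() {
        // If another thread already marked the task failed, the
        // "second trigger" stops as soon as possible.
        if (state == TaskState.FAILED) {
            return;
        }
        // ... kick off the next indexer run ...
    }

    synchronized void start(boolean force) {
        if (state == TaskState.FAILED && force == false) {
            throw new IllegalStateException("cannot start a failed task without force");
        }
        state = TaskState.STARTED;
    }

    synchronized void stop(boolean force) {
        if (state == TaskState.FAILED && force == false) {
            throw new IllegalStateException("cannot stop a failed task without force");
        }
        state = TaskState.STOPPED;
    }

    synchronized void markAsFailed(String reason) {
        // Flag the task failed BEFORE persisting to cluster state so the
        // other synchronized methods observe the failure immediately.
        state = TaskState.FAILED;
        // ... then attempt to write the failed state to cluster state ...
    }
}
```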
Closes #45609
Relates to #45562
* [ML][Data Frame] moves failure state transition for MT safety (#45676)
* [ML][Data Frame] moves failure state transition for MT safety
* removing unused imports
* Search enhancement: pinned queries (#44345)
Search enhancement: a new query type allows selected documents to be promoted above any "organic" search results.
This is the first feature in a new module `search-business-rules` which will house licensed (non OSS) logic for rewriting queries according to business rules.
The PinnedQueryBuilder class offers a new `pinned` query in the DSL that takes an array of promoted IDs and an "organic" query and ensures the documents with the promoted IDs rank higher than the organic matches.
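For illustration, a sketch of building the query from Java; the package and the constructor shape (organic query first, then promoted IDs) are assumed from the description above:

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.xpack.searchbusinessrules.PinnedQueryBuilder;

public class PinnedQueryExample {
    public static void main(String[] args) {
        // Documents "1" and "3" are pinned above the organic matches.
        PinnedQueryBuilder pinned = new PinnedQueryBuilder(
            QueryBuilders.matchQuery("description", "dog food"),
            "1", "3");
        System.out.println(pinned);
    }
}
```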
Closes #44074
Encapsulate the serialization/deserialization of SQL client classes.
Make configuration specific parameters (such as ZoneId) generic just
like the version and remove the need for consumer classes to manage them
individually.
This is not only consistent but also provides significant savings in the
cursor size.
Fix #40216
(cherry picked from commit 5c844798045d7baa0d932289d2e3d1607ba6a9a4)
Improve the initialization and state passing of TextFormatter in CLI
and TEXT mode by leveraging the Page listener hook. Additionally
simplify the code inside RestSqlQueryAction.
(cherry picked from commit a56db2fa119cf9e8748723e19f1fc9f6a8afe5fc)
Improve encapsulation of pagination of rowsets by breaking the cycle
between cursor and associated rowset implementation, all logic now
residing inside each cursor implementation.
(cherry picked from commit be8fe0a0ce562fe732fae12a0b236b5731e4638c)
Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM
test cases, and optionally wait for all snapshots to be deleted.
Waiting for all snapshots to be deleted, rather than failing if any are
in progress, is necessary for tests which use SLM policies because SLM
policies may be in the process of executing when the test ends.
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes to the
org.elasticsearch.geometry package.
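For example (assuming the refactored `Point` constructor takes the x/longitude coordinate first):

```java
import org.elasticsearch.geometry.Point;

public class PointOrderExample {
    public static void main(String[] args) {
        // lon (x) now comes before lat (y):
        Point amsterdam = new Point(4.90, 52.37);
        System.out.println(amsterdam);
    }
}
```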
Backport of #45332
Closes #45048
* Update the REST API specification
This patch updates the REST API specification in JSON files to better encode deprecated entities,
to improve specification of URL paths, and to open up the schema for future extensions.
Notably, it changes the `paths` from a list of strings to a list of objects, where each
particular object encodes all the information for this particular path: the `parts` and the `methods`.
Among the benefits of this approach is, for example, encoding the difference between using the `PUT` and `POST`
methods in the Index API, to either use a specific document ID, or let Elasticsearch generate one.
Also `documentation` becomes an object that supports an `url` and also a `description` which is a
new field.
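A hypothetical fragment showing the new shape of an API definition (the exact entries are illustrative):

```json
{
  "index": {
    "documentation": {
      "url": "https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-index_.html",
      "description": "Creates or updates a document in an index."
    },
    "url": {
      "paths": [
        {
          "path": "/{index}/_doc/{id}",
          "methods": ["PUT", "POST"],
          "parts": {
            "index": { "type": "string" },
            "id": { "type": "string" }
          }
        },
        {
          "path": "/{index}/_doc",
          "methods": ["POST"],
          "parts": {
            "index": { "type": "string" }
          }
        }
      ]
    }
  }
}
```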
* Adapt YAML runner to new REST API specification format
The logic for choosing the path to use when running tests has been
simplified, as a consequence of the path parts being listed under each
path in the spec. The special case for create and index has been removed.
Also the parsing code has been hardened so that errors are thrown earlier
when the structure of the spec differs from what is expected, and their
error messages should be more helpful.
* Introduce Spatial Plugin (#44389)
Introduce a skeleton Spatial plugin that holds new licensed features coming to
Geo/Spatial land!
* [GEO] Refactor DeprecatedParameters in AbstractGeometryFieldMapper (#44923)
Refactor DeprecatedParameters specific to legacy geo_shape out of
AbstractGeometryFieldMapper.TypeParser#parse.
* [SPATIAL] New ShapeFieldMapper for indexing cartesian geometries (#44980)
Add a new ShapeFieldMapper to the xpack spatial module for
indexing arbitrary cartesian geometries using a new field type called shape.
The indexing approach leverages lucene's new XYShape field type which is
backed by BKD in the same manner as LatLonShape but without the WGS84
latitude longitude restrictions. The new field mapper builds on and
extends the refactoring effort in AbstractGeometryFieldMapper and accepts
shapes in either GeoJSON or WKT format (both of which support non-geospatial
geometries).
Tests are provided in the ShapeFieldMapperTest class in the same manner
as GeoShapeFieldMapperTests and LegacyGeoShapeFieldMapperTests.
Documentation for how to use the new field type and what parameters are
accepted is included. The QueryBuilder for searching indexed shapes is
provided in a separate commit.
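As a sketch of the underlying indexing approach (assuming Lucene's `XYShape`/`XYPolygon` API):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.XYShape;
import org.apache.lucene.geo.XYPolygon;

public class XYShapeExample {
    public static void main(String[] args) {
        // A closed cartesian triangle; no WGS84 lat/lon bounds apply here.
        XYPolygon triangle = new XYPolygon(
            new float[] { 0f, 10f, 5f, 0f },
            new float[] { 0f, 0f, 8f, 0f });

        Document doc = new Document();
        for (Field field : XYShape.createIndexableFields("geometry", triangle)) {
            doc.add(field);
        }
    }
}
```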
* [SPATIAL] New ShapeQueryBuilder for querying indexed cartesian geometry (#45108)
Add a new ShapeQueryBuilder to the xpack spatial module for
querying arbitrary Cartesian geometries indexed using the new shape field
type.
The query builder extends AbstractGeometryQueryBuilder and leverages the
ShapeQueryProcessor added in the previous field mapper commit.
Tests are provided in ShapeQueryTests in the same manner as
GeoShapeQueryTests and docs are updated to explain how the query works.
* Add format parameter to the range queries built for CURRENT_* functions
used in comparison conditions
* Use range queries for date fields equality/non-equality as well.
(cherry picked from commit c1e81e90f937ee5a002524d632bfce74d76962f9)
* Reenable Integ Tests in native-multi-node-tests
* The tests broken here were likely fixed by #45463 => let's reenable them and see if things run fine again
* Relates #45405, #45455
The http client could end up creating URLs that did not resemble the
original one when encoding. This fixes a couple of corner cases where
too many or too few slashes were added to a URI.
Closes #44970
The current implementations make it difficult to add new
privileges (for example, a cluster privilege which is more than
cluster action-based and not exposed to the security
administrator). At a high level, we would like each of our cluster
privileges to be either:
- a named cluster privilege
This corresponds to `cluster` field from the role descriptor
- or a configurable cluster privilege
This corresponds to the `global` field from the role-descriptor and
allows a security administrator to configure them.
Some of the responsibilities, like the merging of action-based cluster
privileges, are now pushed down to the cluster permission level. How the
predicate is implemented (using an Automaton) is now enforced by the cluster permission.
`ClusterPermission` helps in enforcing the cluster level access either by
performing checks against cluster action and optionally against a request.
It is a collection of one or more permission checks where if any of the checks
allow access then the permission allows access to a cluster action.
Implementations of cluster privilege must be able to provide information
regarding the predicates to the cluster permission so that it can be enforced.
This is achieved by making implementations of cluster privilege aware of the
cluster permission builder and providing a way to specify how the permission is
to be built for a given privilege.
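A rough sketch of the collection-of-checks idea (types and names here are hypothetical, not the production API):

```java
import java.util.List;

import org.elasticsearch.transport.TransportRequest;

// A cluster permission is a collection of checks; access is allowed if
// any single check allows the cluster action (and, optionally, the request).
interface PermissionCheck {
    boolean check(String action, TransportRequest request);
}

class SketchClusterPermission {
    private final List<PermissionCheck> checks;

    SketchClusterPermission(List<PermissionCheck> checks) {
        this.checks = checks;
    }

    boolean check(String action, TransportRequest request) {
        return checks.stream().anyMatch(c -> c.check(action, request));
    }
}
```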
This commit renames `ConditionalClusterPrivilege` to `ConfigurableClusterPrivilege`.
`ConfigurableClusterPrivilege` is a renderable cluster privilege exposed
as a `global` field in role descriptor.
Other than this, there is a requirement to know whether a cluster
permission is implied by another cluster permission (`has-privileges`).
This is helpful in addressing queries related to privileges for a user.
This is not simply a check of cluster permissions, since we do not
have access to runtime information (like the request object).
This refactoring does not try to address those scenarios.
Relates #44048
The vagrant based tests currently reside in a single project, creating
dozens of tasks to manage starting and stopping the vagrant VM along
with running java and bats tests within each image. This all-in-one
pattern makes parallelizing packaging tests difficult.
This commit rewrites the vagrant testing infrastructure to be
independent of the actual test runners, thus allowing each platform to
be handled in a separate subproject. Additionally, the java and bats
tests are changed to be run through a "destructive" gradle task, which
is run inside the VM. The combination of these will allow
parallelization both locally (through running several VMs at once) as
well as running the destructive tasks in CI machines dedicated to each
platform (thus removing the need for vagrant in CI).
This commit adds a first draft of a regression analysis
to data frame analytics. There is a high probability that
the exact syntax might change.
This commit adds the new analysis type and its parameters as
well as appropriate validation. It also modifies the extractor
and the fields detector to be able to handle categorical fields
as regression analysis supports them.
* Restrict which tasks can use testclusters
This PR fixes a problem in the interaction between test-clusters and
the build cache.
Before this, any task could use a cluster without tracking it as an
input.
With this change, a new interface is introduced to track the tasks that
can use clusters, and the cluster is now considered an input for all of
them.
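The new interface might look roughly like this (a sketch; the actual plugin code may differ):

```java
import java.util.Collection;

import org.elasticsearch.gradle.testclusters.ElasticsearchCluster;
import org.gradle.api.Task;

// Tasks that use test clusters implement this, so the clusters they use
// can be registered as task inputs and participate in the build cache.
public interface TestClustersAware extends Task {
    Collection<ElasticsearchCluster> getClusters();

    default void useCluster(ElasticsearchCluster cluster) {
        getClusters().add(cluster);
    }
}
```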
* Name each inner_hits section of nested queries differently and extract and combine the multiple values it generates into a single list.
This also introduces a limitation (originating in Elasticsearch itself)
on the sorting capabilities when the sorting is based on filtered nested
fields: only one of the conditions applied to nested documents will be
used in the nested sorting.
(cherry picked from commit cfc5cf68f6e83b07bb9006986d0903d6be418ec6)
This commit replaces task_state and indexer_state in the
data frame _stats output with a single top level state
that combines the two. It is defined as:
- failed if what's currently reported as task_state is failed
- stopped if there is no persistent task
- Otherwise what's currently reported as indexer_state
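A minimal sketch of the derivation (names illustrative):

```java
public class DataFrameStatsState {
    // Combine the former task_state and indexer_state into a single state.
    static String combinedState(String taskState, boolean hasPersistentTask, String indexerState) {
        if ("failed".equals(taskState)) {
            return "failed";
        }
        if (hasPersistentTask == false) {
            return "stopped";
        }
        return indexerState;
    }
}
```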
Backport of #45276
When using the implicit flow in OpenID Connect, the
op.token_endpoint_url should not be mandatory as there is no need
to contact the token endpoint of the OP.
* [ML][Data Frame] Add update transform api endpoint (#45154)
This adds the ability to `_update` stored data frame transforms. All mutable fields are applied when the next checkpoint starts; the exception is `description`.
This PR contains all that is necessary for this addition:
* HLRC
* Docs
* Server side
This adds support for `geo_bounds` aggregation inside the `pivot.aggregations` configuration.
The two points returned from the `geo_bounds` aggregation are transformed into a `geo_shape` whose type depends on how the two points relate:
* `point` if the two points are identical
* `linestring` if the two points share either a latitude or longitude
* `polygon` if the two points are completely different
The automatically deduced mapping for the resulting field is a `geo_shape`.
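A sketch of the type selection (illustrative, not the exact production code):

```java
public class BoundsToShape {
    // The two arguments are the corners returned by geo_bounds.
    static String shapeType(double topLat, double topLon, double bottomLat, double bottomLon) {
        if (topLat == bottomLat && topLon == bottomLon) {
            return "point";       // identical points
        }
        if (topLat == bottomLat || topLon == bottomLon) {
            return "linestring";  // shared latitude or longitude
        }
        return "polygon";         // completely different points
    }
}
```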
We currently use the unboundid ldap SDK, which is triply licensed under
GPL-2.0, LGPL-2.1, and the "UnboundID LDAP SDK Free Use License". We currently
identify the license as the latter, but LGPL-2.1 is the one we should be using
per our policy.
The PutJob API accidentally used an "expert" API of CreateIndexRequest.
That API is semi-lenient to syntax; a type could be omitted and the
request would work as expected. But if a type was omitted it would
not merge with templates correctly, leading to index creation that
only has the template and not the requested mappings in the request.
This commit refactors the PutJob API to:
- Include the type name
- Use a less "expert" API in an attempt to future proof against errors
- Use an XContentBuilder instead of string replacement, removing the JSON template
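In the spirit of the change, a hedged sketch (assuming the `mapping(type, builder)` overload on `CreateIndexRequest`):

```java
import java.io.IOException;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class PutJobIndexSketch {
    static CreateIndexRequest buildRequest() throws IOException {
        // Build the mapping programmatically instead of splicing strings
        // into a JSON template, and name the type explicitly.
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .startObject("properties")
                    .startObject("timestamp")
                        .field("type", "date")
                    .endObject()
                .endObject()
            .endObject();
        return new CreateIndexRequest("job-index").mapping("_doc", mapping);
    }
}
```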
This commit applies a normalization process to environment paths, both
in how they are stored internally and in their settings values. This
normalization is done via two means:
- we make the paths absolute
- we remove redundant name elements from the path (what Java calls
"normalization")
This change ensures that when we compare and refer to these paths within
the system, we are using a common ground. For example, prior to the
change if the data path was relative, we would not compare it correctly
to paths from disk usage. This is because the paths in disk usage were
being made absolute.
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
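For reference, per-socket configuration on JDK 11 looks roughly like this (values illustrative):

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

import jdk.net.ExtendedSocketOptions;

public class KeepAliveExample {
    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // JDK 11 per-socket keepalive tuning (Linux/macOS), per JDK-8194298:
            channel.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 300);    // seconds idle before probes
            channel.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 60); // seconds between probes
            channel.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 8);     // probes before drop
        }
    }
}
```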
Introduces an abstraction for how checkpointing and synchronization work, covering
- retrieval of checkpoints
- check for updates
- retrieving stats information
Currently in the transport-nio work we connect and bind channels on a
thread before the channel is registered with a selector. Additionally,
it is at this point that we set all the socket options. This commit
moves these operations onto the event-loop after the channel has been
registered with a selector. It attempts to set the socket options for a
non-server channel at registration time. If that fails, it will attempt
to set the options after the channel is connected. This should fix
#41071.
In the FIPS JVM the JVM default locale seems to leak into places
where it should be overridden. This change skips assertions
in TimestampFormatFinderTests.testGuessIsDayFirstFromLocale
that may be impacted.
Fixes #45140
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.
Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
Reloading of the synonym_graph filter doesn't work currently because the search-time
AnalysisMode doesn't get propagated to the TokenFilterFactory emitted by the
graph filter's getChainAwareTokenFilterFactory() method. This change fixes that.
Closes #45127
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.
This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.
Fixes#45047
This change improves the exception messages that are thrown when the
system cannot read TLS resources such as keystores, truststores,
certificates, keys or certificate-chains (CAs).
This change specifically handles:
- Files that do not exist
- Files that cannot be read due to file-system permissions
- Files that cannot be read due to the ES security-manager
Backport of: #44787
There are no realms that can be configured exclusively with secure
settings. Every realm that supports secure settings also requires one
or more non-secure settings.
However, sometimes a node will be configured with entries in the
keystore for which there is nothing in elasticsearch.yml - this may be
because the realm was removed from the yml but not deleted from the
keystore, or because a typo in the realm name has accidentally
orphaned the keystore entry.
In these cases the realm building would fail, but the error would not
always be clear or point to the root cause (orphaned keystore
entries). RealmSettings would act as though the realm existed, but
then fail because an incorrect combination of settings was provided.
This change causes realm building to fail early, with an explicit
message about incorrect keystore entries.
Backport of: #44471
When we create an API key, we check whether an API key with the same
name already exists. That check searches with scroll enabled, which
causes requests to fail when creating a large number of API keys in
parallel, as it hits the open scroll limit (default 500).
We do not need a search context to be created, so this commit
removes the scroll parameter from the search request for the duplicate
API key check.
If one tries to start a DF analytics job that has already run,
the result will be that the task will fail after reindexing the
dest index from the source index. The results of the prior run
will be gone and the task state is not properly set to failed
with the failure reason.
This commit improves the behavior in this scenario. First, we now
set the task state to `failed` in a number of failure cases that were
previously missed. Second, a validation is added that if the destination
index exists, it must be empty.
We keep adding the current primary term to operations for which we do not assign a sequence
number. This does not make sense anymore as all operations which we care about have
sequence numbers now. The goal of this commit is to clean things up in InternalEngine and
reduce the complexity.
* We shouldn't be recreating wrapped REST handlers over and over for every request. We only use this hook in x-pack and the wrapper there does not have any per-request state.
This is inefficient and could lead to some very unexpected memory behavior
=> I made the logic create the wrapper on handler registration and adjusted the x-pack wrapper implementation to correctly forward the circuit breaker and content stream flags
The Get Users API also returns users from the restricted realm, or built-in users,
as we call them in our docs. One can also change the passwords of built-in
users with the Change Password API.
A mismatched configuration between the IdP and SP will often result in
SAML authentication attempts failing because the audience condition is
not met (because the IdP and SP disagree about the correct form of the
SP's Entity ID).
Previously the error message in this case did not provide sufficient
information to resolve the issue because the IdP's expected audience
would be truncated if it exceeded 32 characters. Since the error did
not provide both IDs in full, it was not possible to determine the
correct fix (in detail) based on the error alone.
This change expands the message that is included in the thrown
exception, and also adds additional logging of every failed audience
condition, with diagnostics of the match failure.
Backport of: #44334
* Create S3 Third Party Test Task that Covers the S3 CLI Tool
* Adjust snapshot cli test tool tests to work with real S3
* Build adjustment
* Clean up repo path before testing
* Dedup the logic for asserting path contents by using the correct utility method that had somehow become unused
Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses
`onNode()` so has no effect if the node is not in the current list of nodes.
It also discards the `Runnable` it creates without having run it, so has no
effect anyway.
This commit makes these tests much stricter about properly closing the nodes
started during `Coordinator` tests, by tracking the persisted states that are
opened, and adds an assertion to catch the trappy requirement that the closing
node still belongs to the cluster.
The existing equals check was broken, and would always be false.
The correct behaviour is to return "Collections.emptyList()" whenever
the active (licensed) realms equal the configured realms.
Backport of: #44399
* Rename indexlifecycle to ilm and snapshotlifecycle to slm (#44917)
As a followup to #44725 and #44608, which renamed the packages within
the x-pack project, this renames the packages within the core x-pack
project. It also renames 'snapshotlifecycle' within the HLRC to slm.
* Fix one more import
In case closing the process throws an exception, we should catch
it no matter its type. The process may have terminated because of a
fatal error in which case closing the process will throw a server
error, not an `IOException`. If this happens we fail to mark the
persistent task as failed and the task gets in limbo.
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.
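A minimal sketch of such a query using standard `QueryBuilders` (field names illustrative):

```java
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class AnalyzedFieldsFilter {
    // Only pick documents that have values for every analyzed field, so rows
    // with missing values are never extracted in the first place.
    static BoolQueryBuilder allFieldsExist(Iterable<String> analyzedFields) {
        BoolQueryBuilder query = QueryBuilders.boolQuery();
        for (String field : analyzedFields) {
            query.filter(QueryBuilders.existsQuery(field));
        }
        return query;
    }
}
```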
This also adds an integration test that runs outlier detection on data
with missing fields.
TaskListener today accepts a Throwable in its onFailure method. Though,
looking at where it is called (TransportAction), it can never be
notified of a Throwable.
This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
In order to make it easier to interpret the output of the ILM Explain
API, this commit adds two request parameters to that API:
- `only_managed`, which causes the response to only contain indices
which have `index.lifecycle.name` set
- `only_errors`, which causes the response to contain only indices in an
ILM error state
"Error state" is defined as either being in the `ERROR` step or having
`index.lifecycle.name` set to a policy that does not exist.
Today the processors setting is permitted to be set to more than the
number of processors available to the JVM. The processors setting
directly sizes the number of threads in the various thread pools, with
most of these sizes being a linear function in the number of
processors. It doesn't make any sense to set processors very high as the
overhead from context switching amongst all the threads will overwhelm,
and changing the setting does not control how many physical CPU
resources there are on which to schedule the additional threads. We have
to draw a line somewhere and this commit deprecates setting processors
to more than the number of available processors. This is the right place
to draw the line given the linear growth as a function of processors in
most of the thread pools, and that some are capped at the number of
available processors already.
Since 7.3, it's possible to explicitly configure the SAML realm to
be used in Kibana's configuration. This in turn, eliminates the need
of properly setting `xpack.security.public.*` settings in Kibana
and largely simplifies relevant documentation.
This also changes `xpack.security.authProviders` to
`xpack.security.authc.providers` as the former was deprecated in
favor of the latter in Kibana 7.3.
This is a followup to #44350. The indexer stats used to
be persisted standalone, but now are only persisted as
part of a state-and-stats document. During the review
of #44350 it was decided that we'll stick with this
design, so there will never be a need for an indexer
stats object to store its transform ID as it is stored
on the enclosing document. This PR removes the indexer
stats document ID.
Backport of #44768
The problem is that RemoteClusterConnection closes the connection manager asynchronously, which races with the threadpool being shut down at the end of the test.
Closes #44339
Closes #44610
Deleting a follower index does not delete its ShardFollowTasks, potentially
leaving many persistent tasks in the cluster that cannot be allocated on
nodes and that unnecessarily fill the logs. This commit adds a cluster state listener
(ShardFollowTaskCleaner) that completes (with a failure) any persistent task
that refers to a non-existent follower index.
I think that this bug has been introduced by #34404: before this change the
task would have been completed as failed and removed from the cluster state.
Backport of #44702 and #44801 on 7.x
* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling
number parsing just as Elasticsearch does when indexing
a document.
* This also introduces a minor limitation: field aliases that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. The field_caps API doesn't shed any light on whether a field is
an alias or not, and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated as being aliases.
(cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)