The assertion `assertOpsOnPrimary` does not store seq_no and primary
term of successful deletes to the `lastOpSeqNo` and `lastOpTerm`. This
leads to failures of the subsequent CAS deletes or indexes with seq_no
and term. Moreover, this assertion trips a translog assertion because it
bumps the primary term of some operations but not the primary term of
the engine.
Relates #36467
Closes #37684
Now that warning headers no longer contain a timestamp of when the
warning was generated, we no longer need to extract the warning value
from the warning to determine whether or not the warning value is
duplicated. Instead, we can compare strings directly.
Further, when de-duplicating warning headers, we are constantly rebuilding
sets. Instead of doing that, we can carry the set around with us and
rebuild it only if we find a new warning value.
This commit applies both of these optimizations.
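The two optimizations can be sketched roughly as follows. This is only an illustration: `WarningHeadersSketch` and its methods are hypothetical names, not the Elasticsearch classes, and for brevity the sketch updates the carried set in place when a new value arrives instead of rebuilding it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for illustration; names are not taken from the Elasticsearch code base.
final class WarningHeadersSketch {
    private final List<String> warnings = new ArrayList<>();
    // Carried along with the list, so it is not rebuilt from scratch on every call.
    private final Set<String> seen = new HashSet<>();

    /** Adds the warning unless an identical string was already recorded. Returns true if it was added. */
    boolean add(String warningValue) {
        // Direct string comparison: there is no timestamp to strip before comparing.
        if (seen.add(warningValue) == false) {
            return false;
        }
        warnings.add(warningValue);
        return true;
    }

    /** Returns a copy of the de-duplicated warnings in insertion order. */
    List<String> warnings() {
        return new ArrayList<>(warnings);
    }
}
```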
Exceptions thrown by the cluster applier service's settings and cluster appliers are bubbled up, and
block the state from being applied instead of silently being ignored. In combination with the cluster
state publishing lag detector, this will throw a node out of the cluster that can't properly apply
cluster state updates.
As acking can fail for any reason (unrelated node being too slow, node disconnecting), it should not
be required for acking to succeed in order for index requests with dynamic mapping updates to
successfully complete.
Relates to #30672 and closes #30844
* Remove Custom Listeners from SnapshotsService
Motivations:
* Shorten the code some more
* Use ActionListener#wrap to get easy to reason about behavior in failure scenarios
* Remove duplication in the logic of handling snapshot completion listeners (listeners removing themselves and comparing snapshots to their targets)
* Also here, move all listener handling into `SnapshotsService` and remove custom listener class by putting listeners in a map
Today we support a smooth rolling upgrade from Zen1 to Zen2 by automatically
bootstrapping the cluster once all the Zen1 nodes have left, as long as the
`minimum_master_nodes` count is satisfied. However this means that Zen2 nodes
also require the `minimum_master_nodes` setting for this one specific and
transient situation.
Since nodes only perform this automatic bootstrapping if they previously
belonged to a Zen1 cluster, they can keep track of the `minimum_master_nodes`
setting from the previous master instead of requiring it to be set on the Zen2
node.
- Add deprecation warning to RestGetFieldMappingAction
- Add two new Java HLRC classes, GetFieldMappingsRequest and
GetFieldMappingsResponse. These classes use new typeless forms
of a request and response, and differ in that from the server
versions.
Relates to #35190
Currently we have the ability to listen for setting changes to two group
affix settings. However, it is possible that we might have the need to
listen to more than two. This commit adds a method that allows a consumer
to listen to a list of affix settings for changes.
In some cases we only have a string collection instead of a string list
that we want to serialize out. We have a convenience method for writing
a list of strings, but no such method for writing a collection of
strings. Yet, a list of strings is a collection of strings, so we can
simply liberalize StreamOutput#writeStringList to be more generous in
the collections that it accepts and write out collections of strings
too. On the other side, we do not have a convenience method for reading
a list of strings. This commit addresses both of these issues.
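As a rough sketch of the idea, assuming plain `java.io` data streams instead of Elasticsearch's `StreamOutput`/`StreamInput` (the class name, the `roundTrip` helper and the wire format here are illustrative assumptions only), a writer that accepts any collection pairs naturally with a reader that returns a list:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustration only: uses java.io streams, not Elasticsearch's StreamOutput/StreamInput.
final class StringCollectionIoSketch {
    /** Accepts any Collection<String>, so lists, sets and other collections all work. */
    static void writeStringCollection(DataOutputStream out, Collection<String> values) throws IOException {
        out.writeInt(values.size());
        for (String value : values) {
            out.writeUTF(value);
        }
    }

    /** Reads back what writeStringCollection wrote, always as a list. */
    static List<String> readStringList(DataInputStream in) throws IOException {
        int size = in.readInt();
        List<String> values = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            values.add(in.readUTF());
        }
        return values;
    }

    /** Writes the collection to an in-memory buffer and reads it back, for demonstration. */
    static List<String> roundTrip(Collection<String> values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeStringCollection(out, values);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readStringList(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```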
* Add PersistentTasksClusterService::unassignPersistentTask method
* adding cancellation test
* Adding integration test for unallocating tasks from a node
* Addressing review comments
* Addressing minor PR comments
Due to https://issues.apache.org/jira/browse/LUCENE-8634 this test
may fail if a really tiny polygon is generated. This commit checks for
tiny polygons and skips the final check, which is expected to fail
until the lucene bug is fixed and new version of lucene is released.
* This test completely overrode the `reroute` method and hence did nothing but test the override itself
* Removed the test since it tests nothing and simplified `reroute` accordingly
This commit moves the collectSearchShards method out of RemoteClusterService into TransportSearchAction, which currently calls it. RemoteClusterService used to be used only for cross-cluster search but is now also used in cross-cluster replication, where different APIs are called through the RemoteClusterAwareClient. There is no reason for the collectSearchShards and fetchShards methods to be in RemoteClusterService and RemoteClusterConnection respectively. The search shards API can be called through the RemoteClusterAwareClient too; the only missing bit is a way to handle failures based on the skip_unavailable setting for each cluster (currently only supported in RemoteClusterConnection#fetchShards), which is achieved by adding an isSkipUnavailable(String clusterAlias) method to RemoteClusterService.
This change is useful for #32125 as we will very soon need to also call the search API against remote clusters, which will be done through RemoteClusterAwareClient. In that case we will also need to support skip_unavailable when calling the search API so we need some way to handle the skip_unavailable setting like we currently do for the search_shards call.
Relates to #32125
Today we have several implementations of executing SearchOperationListener
in SearchService. While most of them seem to be safe, at least one, the one that
executes scroll searches, can cause illegal execution of SearchOperationListener
that can then in turn trigger assertions in ShardSearchStats. This change
adds a SearchOperationListenerExecutor that uses try-with-resources blocks to ensure
listeners are called in a safe way.
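A minimal sketch of the pattern, assuming a simplified, hypothetical `Listener` interface (the real SearchOperationListener has more callbacks, and the names below are not taken from the commit): the executor calls the "pre" hook when it is created and exactly one of the "post" or "failed" hooks when the try block exits.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerExecutorSketch {
    // Hypothetical, simplified listener used only for this illustration.
    interface Listener {
        void onPre();
        void onPost();
        void onFailed();
    }

    /** Calls onPre on construction and exactly one of onPost/onFailed on close. */
    static final class Executor implements AutoCloseable {
        private final Listener listener;
        private boolean succeeded = false;

        Executor(Listener listener) {
            this.listener = listener;
            listener.onPre();
        }

        /** Marks the operation as successful; meant to be the last call inside the try block. */
        void success() {
            succeeded = true;
        }

        @Override
        public void close() {
            if (succeeded) {
                listener.onPost();
            } else {
                listener.onFailed();
            }
        }
    }

    /** Runs the operation and records which callbacks fired, for demonstration. */
    static List<String> run(Runnable operation) {
        List<String> calls = new ArrayList<>();
        Listener recording = new Listener() {
            @Override public void onPre() { calls.add("pre"); }
            @Override public void onPost() { calls.add("post"); }
            @Override public void onFailed() { calls.add("failed"); }
        };
        try (Executor executor = new Executor(recording)) {
            operation.run();
            executor.success();
        } catch (RuntimeException e) {
            // The resource is closed before this catch block runs.
            calls.add("caught");
        }
        return calls;
    }
}
```

Because the failure callback is driven by `close()`, an exception thrown anywhere inside the try block cannot skip it or trigger it twice.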
Relates to #37185
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.
The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.
Relates #27330
Users may require the sequence number and primary term to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalue_fields` API but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it.
This commit adds a dedicated sub fetch phase that returns both numbers and is connected to a new `seq_no_primary_term` parameter.
1. testSimpleOnlyMasterNodeElection - requires cluster bootstrap when
the first master node is started.
2. testElectOnlyBetweenMasterNodes - requires cluster bootstrap when
the first master node is started and requires adding voting exclusion
before shutting down the first master node.
3. testAliasFilterValidation - requires cluster bootstrap when the
first master node is started.
It's not safe to continue writing state using MetaDataStateFormat
after a dirty WriteStateException has occurred if it's not recovered by
a successful subsequent state write.
We've encountered a failure of the test testFailRandomlyAndReadAnyState.
The test breaks in the following way. There are 3 state paths, and this is
what happens next:
* Successful write at the beginning of the test yields 0 0 0 state
  files in the directories.
* 1st write in the loop is unsuccessful, but not dirty - 0 0 0.
* 2nd write in the loop is unsuccessful and dirty (failure during
  fsync); before removing the new files we have 1 1 1. But now during
  deletion, the first deletion fails and we get 1 0 0.
* 3rd write in the loop is unsuccessful, but not dirty, so we want to
  keep the old generation, which happens to be the 1st generation, so now we
  have 1 x x in the state folders. Now we assert that we load either the 0 or 1
  state from the state folders and select only the 2nd and 3rd folders to
  emulate disk failures. This results in an NPE because there is nothing in
  these folders.
Fortunately, this won’t be a problem in real life, because if there is
a dirty exception, we shut down the node and make sure we perform a
successful write on the node startup.
This adds a set of helper classes to determine if an agg "has a value".
This is needed because InternalAggs represent "empty" in different
manners according to convention. Some use `NaN`, `+/- Inf`, `0.0`, etc.
A user can pass the Internal agg type to one of these helper methods
and it will report if the agg contains a value or not, which allows the
user to differentiate "empty" from a real `NaN`.
These helpers are best-effort in some cases. For example, several
pipeline aggs share a single return class but use different conventions
to mark "empty", so the helper uses the loosest definition that applies
to all the aggs that use the class.
Sums in particular are unreliable. The InternalSum simply returns 0.0
if the agg is empty (which is correct, no values == sum of zero). But this
also means the helper cannot differentiate between "empty" and `+1 + -1`.
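To illustrate the conventions described above, here is a hedged sketch that works on plain numbers, not the actual InternalAggregation classes. The method names and the exact "empty" conventions (positive infinity for an empty min, a zero count for an empty avg) are assumptions made for this example.

```java
// Illustration only: operates on raw values, not on Elasticsearch's InternalAggregation types.
final class AggValueSketch {
    /** Assumes an empty min agg reports +Infinity, so an infinite result is treated as "empty". */
    static boolean minHasValue(double min) {
        return Double.isInfinite(min) == false;
    }

    /** Uses the count, so a genuine NaN average over real documents still counts as a value. */
    static boolean avgHasValue(long count) {
        return count > 0;
    }

    /** Best effort only: 0.0 may mean "no values" or values that cancel out, such as +1 and -1. */
    static boolean sumHasValue(double sum) {
        return sum != 0.0;
    }
}
```

With these definitions, `sumHasValue(1.0 + -1.0)` reports no value even though two values were summed, which is the ambiguity noted for sums.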
This commit removes the warn-date from warning headers. Previously we
were stamping every warning header with when the request
occurred. However, this has a severe performance penalty when
deprecation logging is called frequently, as obtaining the current time
and formatting it properly is expensive. A previous change moved to
using the startup time as the time to stamp on every warning header, but
this was only to prove that the timestamping was expensive. Since the
warn-date is optional, we elect to remove it from the warning
header. Prior to this commit, we worked in Kibana to make the warn-date
treated as optional there so that we can follow-up in Elasticsearch and
remove the warn-date. This commit does that.
Prefer publishing to master-eligible nodes first, so that cluster state updates are committed more
quickly, and master-eligible nodes are also turned into followers more quickly after a leader election.
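The ordering itself can be sketched with a stable sort. The `Node` class below is a hypothetical minimal model, not Elasticsearch's DiscoveryNode, and the sketch says nothing about how the publication is actually sent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PublishOrderSketch {
    // Hypothetical minimal node model for illustration.
    static final class Node {
        final String name;
        final boolean masterEligible;

        Node(String name, boolean masterEligible) {
            this.name = name;
            this.masterEligible = masterEligible;
        }
    }

    /** Returns node names with master-eligible nodes first, otherwise preserving the input order. */
    static List<String> publishOrder(List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        // Boolean.FALSE sorts before Boolean.TRUE, so compare on "not master-eligible".
        // List.sort is stable, so the relative order within each group is kept.
        sorted.sort(Comparator.comparing((Node n) -> n.masterEligible == false));
        List<String> names = new ArrayList<>();
        for (Node n : sorted) {
            names.add(n.name);
        }
        return names;
    }
}
```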
PersistentTasksClusterService decides if a task should be reassigned by
checking there is a node in the cluster with the same Id. If a node is
restarted PersistentTasksClusterService may not observe the change and
decide the task still has a valid assignment because the node's
ephemeral Id is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
* Fail start of non-data node if node has data
Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.
Issue #27073
When publications were cancelled because a node turned to follower or candidate, it would still
show as timed out, which can be confusing in the logs. This change fixes the improper call of
onTimeout by generalizing it to a cancel method.
The method calls "enabled" in addition to what the super.index() does, but this
seems to be done explicitly now in the TypeParsers `parse` method. The removed
method has been deprecated since at least 6.0. Also making some of the Builder's
methods and ctors private since they are only used internally in this class.
Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the
node names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.
Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
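The quorum argument can be checked with a small sketch, assuming a quorum is a strict majority of the configuration's IDs. The placeholder ID strings used in the note below are hypothetical; the only property that matters is that no real node ever has such an ID, so a placeholder can never cast a vote.

```java
import java.util.HashSet;
import java.util.Set;

final class PlaceholderQuorumSketch {
    /** A vote set is a quorum if it contains a strict majority of the configuration's IDs. */
    static boolean isQuorum(Set<String> configuration, Set<String> votes) {
        Set<String> counted = new HashSet<>(votes);
        counted.retainAll(configuration);
        return counted.size() * 2 > configuration.size();
    }
}
```

For example, with a true set of {A, B, C}, a node that has only discovered A and B would bootstrap with {A, B, placeholder-1}. Votes from A and B form a quorum of both configurations, whereas a vote from A alone is a quorum of neither. A configuration such as {A, placeholder-1, placeholder-2} can never reach a quorum from real votes, which is the loss of robustness the commit accepts until auto-reconfiguration replaces the placeholders.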
Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.
This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.
In order to be able to parse epoch seconds and epoch milliseconds, our own
java time fields had been introduced. These fields are however not
compatible with the way that java time allows one to configure default
fields (when a part of a timestamp cannot be read then a default value
is added), which is used for the formatters that are rounding up to the
next value.
This commit allows java date formatters to configure their round up parsing
by setting default values via a consumer. By default all formats set
JavaDateFormatter.ROUND_UP_BASE_FIELDS for rounding up. The epoch
parsers, however, both need to set different fields. The merged date
formatters do not set any fields, they just append all the round up formatters.
Also, the formatter now properly copies the locale and the timezone, and
fractional parsing has been set to nanoseconds with proper width.
Since version 6.7.0 the Close Index API guarantees that all translog
operations have been correctly flushed before the index is closed. If
the index is reopened as a Frozen index (which uses a ReadOnlyEngine)
we can verify that the maximum sequence number from the last Lucene
commit is indeed equal to the last known global checkpoint and refuse
to open the read only engine if it's not the case. In this PR the check is
only done for indices created on or after 6.7.0 as they are guaranteed
to be closed using the new Close Index API.
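The check can be pictured as follows. This is a sketch under stated assumptions, not the ReadOnlyEngine code: the method name, the exception type and the message are made up for illustration, and the boolean `checkEnforced` stands in for "the index was created on or after 6.7.0".

```java
final class ReadOnlyEngineCheckSketch {
    /**
     * Throws if the max sequence number of the last commit differs from the global checkpoint.
     * The check is skipped when checkEnforced is false (hypothetical stand-in for older indices).
     */
    static void ensureMaxSeqNoEqualsGlobalCheckpoint(long maxSeqNo, long globalCheckpoint, boolean checkEnforced) {
        if (checkEnforced && maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException("max seq_no [" + maxSeqNo
                + "] of the last commit differs from the global checkpoint [" + globalCheckpoint + "]");
        }
    }
}
```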
Related #33888
This commit introduces a NetworkMessage class. This class has two
subclasses - InboundMessage and OutboundMessage. These messages can
be serialized and deserialized independent of the transport. This allows
more granular testing. Additionally, the serialization mechanism is now
a simple Supplier. This builds the framework to eventually move the
serialization of transport messages to the network thread. This is the
one serialization component that is not currently performed on the
network thread (transport deserialization and http serialization and
deserialization are all on the network thread).
Currently we create dedicated network threads for both the http and
transport implementations. Since these threads should never
perform blocking operations, these threads could be shared. This commit
modifies the nio-transport to have 0 http workers by default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commit deletes dedicated acceptor threads. We have never had these
for the netty transport and they can be added back if a need is
determined in the future.
* The repo id was determined incorrectly when the delete picked up on an in progress snapshot
* NOTE: This solution is still a best-effort fix and there's a slight chance of running into concurrency issues here
when multiple create and delete requests for the same snapshot name are happening concurrently, but these require a sequence
of multiple cluster state updates between the changed method reading the genId and submitting its cluster state update task
* The added test reproduced the issue reliably in about 50% of runs
* Closes #37581
This will be used in cross-cluster search when reduction will be
performed locally on each cluster. The CCS coordinating node will send
one search request per remote cluster involved and will get one search
response back from each one of them. Such responses contain all the info
to be able to perform an additional reduction and return results back
to the user.
Relates to #32125
This commit optimizes some of the performance issues from using
deprecation logging:
- we optimize encoding the deprecation value
- we optimize formatting the deprecation string
- we optimize away getting the current time (by using cached startup
time)
To make further refactoring of GeoGrid aggregations
easier (related: #30320), this splits these inner
class dependencies out into their own files, which makes it
easier to map the relationships between classes.
From #29453 and #37285, the `include_type_name` parameter was already present and defaulted to false. This PR makes the following updates:
- Add deprecation warnings to `RestPutMappingAction`, plus tests in `RestPutMappingActionTests`.
- Add a typeless 'put mappings' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I opted to create a new `PutMappingRequest` object that differs from the existing server one.
This adds deprecation to _type in the script contexts for ingest and update.
This adds a DeprecationMap that wraps the ctx Map containing _type for these
specific contexts.
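A rough sketch of such a wrapper, assuming warnings are simply collected in a list (the real DeprecationMap presumably reports through the deprecation logger; the class name, constructor and `warnings` field here are hypothetical):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a map view that records a warning whenever a deprecated key is read.
final class DeprecationMapSketch extends AbstractMap<String, Object> {
    private final Map<String, Object> delegate;
    // Maps a deprecated key (for example "_type") to the warning message to emit.
    private final Map<String, String> deprecations;
    // Stand-in for the deprecation logger.
    final List<String> warnings = new ArrayList<>();

    DeprecationMapSketch(Map<String, Object> delegate, Map<String, String> deprecations) {
        this.delegate = delegate;
        this.deprecations = deprecations;
    }

    @Override
    public Object get(Object key) {
        String message = deprecations.get(key);
        if (message != null) {
            warnings.add(message);
        }
        return delegate.get(key);
    }

    @Override
    public Object put(String key, Object value) {
        return delegate.put(key, value);
    }

    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        return delegate.entrySet();
    }
}
```

Only `get` is intercepted in this sketch, so iterating over the entries does not record a warning.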
Some tests (e.g. testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since
being switched to Zen2 because the bootstrap setting was left around when nodes got restarted
with data folders wiped.
The test in question here was starting one node (which autobootstrapped to that single node), then
another node. The first node was then shut down (after excluding it from the voting configuration),
its data folder wiped, and restarted. After restart, the node had an empty data folder yet
initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of
its own, and not rejoin the existing cluster with the other node.
Currently when adding a response header, we do some de-duplication, and
maybe drop the header on the floor if we have reached capacity. Yet, we
still update the thread local tracking the response headers. This is
really expensive because under the hood there is a shared reference that
we synchronize on. In the case of a request processed across many shards
in a tight loop, this contention can be detrimental to performance. We
can avoid updating the thread local in these cases though, when the
response header is duplicate of one that we have already seen, or when
it's dropped on the floor. This commit addresses these performance
issues by avoiding the unnecessary set.
As of today the Close Index API does its best to close indices,
but closing an index with ongoing recoveries might or might not
be acknowledged depending on the values of the max seq number
and global checkpoint at the time the
TransportVerifyShardBeforeClose action is executed.
These tests failed because they always expect that the index is
correctly closed on the first try, which is not always the case.
Instead we need to retry the closing until it succeeds.
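The retry idea, reduced to a bounded loop. This is only a sketch with hypothetical names; the tests themselves may well use a different retry helper.

```java
import java.util.function.BooleanSupplier;

final class RetryCloseSketch {
    /**
     * Calls attempt until it returns true or maxAttempts is reached.
     * Returns the number of attempts that were needed, or -1 if none succeeded.
     */
    static int retryUntilAcknowledged(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }
}
```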
Closes #37571
* Fixes `testTwoNodeFirstNodeCleared` by manipulating voting config exclusions.
* Removes `testRecoveryDifferentNodeOrderStartup` since state recovery is now
handled entirely on the elected master, so the order in which the data nodes
start is irrelevant.
This test was actually passing, for the wrong reason: it asserts a
`MasterNotDiscoveredException` is thrown, expecting this to be due to a failure
to perform state recovery, but in fact it's thrown because the node is not
correctly bootstrapped.
This change adds a deprecation warning to the indices.get_mapping API in case the
"include_type_name" parameter is set to "true" and changes the parsing code in
GetMappingsResponse to parse the type-less response instead of the one
containing types. As a consequence the HLRC client doesn't need to force
"include_type_name=true" any more and the GetMappingsResponseTests can be
adapted to the new format as well. Also removing some "include_type_name"
parameters in yaml test and docs where not necessary.
Delete and close index actions threw IllegalArgumentExceptions
when attempting to run against an index that has a snapshot
in progress.
This change introduces a dedicated SnapshotInProgressException
for these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.
relates to #37541.
This is a continuation of #28667 and aims to convert all executors to propagate errors to the
uncaught exception handler. Notable missing ones were the direct executor and the scheduler. This
commit also makes it the responsibility of the executor, not the runnable, to ensure this property. A big
part of this commit also consists of vastly improving the test coverage in this area.
This change adds a way to customize how phrase prefix queries should be created
on field types. The match phrase prefix query is exposed in field types in order
to allow optimizations based on the options set on the field.
For instance the text field uses the configured prefix field (if available) to
build a span near that mixes the original field and the prefix field on the last
position.
This change also contains a small refactoring of the match/multi_match query that
simplifies the interactions between the builders.
Closes #31921
This test failed because the refresh at the end of the test is not guaranteed to run before the
indexing is completed, and therefore there's no guarantee that the refresh will free all operations.
This triggers an assertion failure in the test clean-up, which asserts that there are no more pending
operations.
Some systems default to a nofile ulimit of 65535. To reduce the pain of
deploying Elasticsearch to such systems, this commit lowers the required
limit from 65536 to 65535.
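In code, the lowered threshold amounts to something like the sketch below. The class and method names are hypothetical, and treating -1 as "limit unknown, skip the check" is an assumption of this example.

```java
final class FileDescriptorCheckSketch {
    // Lowered from 65536 to 65535 by this commit.
    static final long REQUIRED_MAX_FILE_DESCRIPTORS = 65535;

    /** Returns an error message if the limit is too low, or null if the check passes. */
    static String check(long maxFileDescriptors) {
        // Assumption for this sketch: -1 means the limit could not be determined.
        if (maxFileDescriptors != -1 && maxFileDescriptors < REQUIRED_MAX_FILE_DESCRIPTORS) {
            return "max file descriptors [" + maxFileDescriptors + "] is too low, increase to at least ["
                + REQUIRED_MAX_FILE_DESCRIPTORS + "]";
        }
        return null;
    }
}
```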
We flush quite often in testAddNewReplicas to create the safe index
commit with gaps in sequence numbers. This test has been failing recently
because CI is too slow to complete 5 small flushes in 10 seconds.
This commit increases the timeout for this test and also ensures that the
background indexing is always terminated. The latter is to eliminate unrelated
failures if this test fails again.
Closes #37183
All tests except testRestorePersistentSettings (renamed to
testExceptionWhenRestoringPersistentSettings) worked fine.
testExceptionWhenRestoringPersistentSettings was rewritten to use a custom
setting, because the "minimum master node" setting is no longer available
in Zen2. It turns out there is no good replacement for the "minimum master
node" setting for this test, which is why the custom setting is
introduced.
Unfortunately, there is a bug (#37485) and currently
RestoreService does not perform setting validation. That's why the
test is annotated with @AwaitsFix; the idea is to merge this commit and
then fix the issue and enable the test. (The test passes with a simple
fix that adds a single line to RestoreService.)
Currently all proxied actions are denied for the `SystemPrivilege`.
Unfortunately, there are use cases (CCR) where we would like to proxy
actions to a remote node that are normally performed by the
system context. This commit allows the system context to perform
proxy actions if they are actions that the system context is normally
allowed to execute.
Currently it takes a type, but this isn't really needed now that indices can
have at most one type. The only downside is that we might return a different
error when trying to index into a type that doesn't exist yet.
This commit removes some leniency from REST handling where we move to
reject all requests that have a body where the body is not used during
the course of handling the request. For example,
    DELETE /index
    {
      "query" : {
        "term" : {
          "field" : "value"
        }
      }
    }
is now rejected.
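One way to picture the check, as a sketch with hypothetical `Request` and `Handler` types (not the actual REST classes): the request tracks whether its body was read while the handler prepared its action, and a body that was present but never read causes the request to be rejected before the action runs.

```java
final class UnusedBodySketch {
    // Hypothetical request model that tracks whether the handler read the body.
    static final class Request {
        private final String body;
        private boolean contentConsumed = false;

        Request(String body) {
            this.body = body;
        }

        boolean hasContent() {
            return body != null && body.isEmpty() == false;
        }

        String content() {
            contentConsumed = true;
            return body;
        }

        boolean isContentConsumed() {
            return contentConsumed;
        }
    }

    // Hypothetical handler: parses the request up front and returns the action to execute.
    interface Handler {
        Runnable prepare(Request request);
    }

    /** Prepares the action, rejects the request if it carried a body that was never read, then runs it. */
    static void dispatch(Request request, Handler handler) {
        Runnable action = handler.prepare(request);
        if (request.hasContent() && request.isContentConsumed() == false) {
            throw new IllegalArgumentException("request does not support having a body");
        }
        action.run();
    }
}
```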
* The internal create request is absolutely redundant, the only difference to the transport request is that we resolved the snapshot
name when moving from the transport to the internal version
* Removed it and passed the transport request into the snapshot service instead
* Nicer way of resolving the snapshot name in the callback
The AbstractLifecycleComponent used to extend AbstractComponent, so it had to pass settings to the constructor of its super class.
It no longer extends AbstractComponent, so there is no need for this constructor.
There is also no need for AbstractLifecycleComponent subclasses to have Settings in their constructors if they were only passing it over to the super constructor.
This is part 1, which will be backported to 6.x with a migration guide/deprecation log.
Part 2 will have this constructor removed in 7.
relates #35560
relates #34488
This commit adds one more underlying implementation of MockPersistedState.
Previously only InMemoryPersistentState was used, now GatewayMetaState
is used rarely.
When adding GatewayMetaState support the main question was: do we want to
emulate exceptions as we do today in MockPersistedState before
delegating to GatewayMetaState or do we want these exceptions to
propagate from the lower level, i.e. file system exceptions?
On the one hand, lower level exception propagation is already tested in
GatewayMetaStateTests, so this won't improve the coverage.
On the other hand, the benefit of low-level exceptions is to see how all these
components work in conjunction. Finally, we abandoned the idea of low-level
exceptions because we don't have a way to deal with IOError today in
CoordinatorTests, but hacking GatewayMetaState not to throw
IOError seems unnatural.
So MockPersistedState rarely throws an exception before delegating to
GatewayMetaState, which is not supposed to throw the exception.
This commit required two changes:
* Move GatewayMetaStateUT to upper-level from
  GatewayMetaStatePersistedStateTests, because otherwise it's not easy
  to construct a GatewayMetaState instance in CoordinatorTests.
* Move the addition of STATE_NOT_RECOVERED_BLOCK from the GatewayMetaState
  constructor to GatewayMetaState.applyClusterUpdaters, because the
  CoordinatorTests class assumes that there is no such block and most of
  its tests fail otherwise.
The completion suggester ignores the original weight of the suggestion when duplicates are removed. This change fixes this bug and keeps the best weighted suggestion among the duplicates. It also removes the custom implementation of the top docs suggest collector now that https://issues.apache.org/jira/browse/LUCENE-8529 is committed in Lucene.
Closes #35836
This commit prepares the required infra to make sending a translog snapshot
of the recovery source non-blocking. I'll make a follow-up to make the send
snapshot method non-blocking.
Relates #37291
There were 5 tests in MinimumMasterNodesIT. 2 of them were removed, 3 of
them were changed and renamed.
1) testSimpleMinimumMasterNodes -> testTwoNodesNoMasterBlock. The
flow of this test is left intact but in order to make it work on
Zen2, additional work for the cluster bootstrapping and voting
exclusions is needed.
2) testDynamicUpdateMinimumMasterNodes -> removed, there is nothing
that corresponds to the dynamic change of the minimum master nodes
setting.
3) testCanNotBringClusterDown -> removed, it also plays with changing
minimum master nodes dynamically.
4) testMultipleNodesShutdownNonMasterNodes ->
testThreeNodesNoMasterBlock. Previously this test was checking that
there would be no master block, if min_master_nodes=3 and 4 nodes are
started, then 2 nodes are brought down. Zen2 dynamically adapts
to the number of nodes in the cluster, so it's possible that there
will still be a master in a 2-node cluster. For Zen2, we start up 3
nodes and shut down 2 of them (w/o voting exclusions), which results
in no master block.
5) testCanNotPublishWithoutMinMastNodes ->
testCanNotCommitStateThreeNodes. Test flow is not changed. But
previously there was no check that nodes in the bigger part of
network partition will elect the master, before healing the network
partition. For Zen2 it does not work, because persistent setting
addition is accepted on the old master and if it's elected new master
again, this setting will appear in the cluster state.
Also, I have a feeling that we need to rename this class, but could not
come up with a good name.
Adds the node's current term and the term and version of the last-accepted
cluster state to the message reported by the `ClusterFormationFailureHelper`,
since these values may be of importance when tracking down a cluster formation
failure.
The test that remote clusters used by ML datafeeds have
a license that allows ML was not accounting for the
possibility that the remote cluster name could be
wildcarded. This change fixes that omission.
Fixes #36228
The test testSendSnapshotSendsOps is currently using a mock instance of
RecoveryTargetHandler which will be hard to modify when we make the
RecoveryTargetHandler non-blocking. This commit prepares for the
incoming changes by replacing the mock instance with a stub.