It's not safe to continue writing state using MetaDataStateFormat
after dirty WriteStateException occurred if it's not recovered by
successful subsequent state write.
We've encountered test failure of testFailRandomlyAndReadAnyState.
The test breaks in the following way. There are 3 state paths. And what
happens next
Successful write at the beginning of the test yields 0 0 0 state
files in the directories.
1st write in the loop is unsuccessful, but not dirty - 0 0 0.
2nd write in the loop is not successful and dirty (failure during
fsync), however before removing new files we have 1 1 1. But now during
deletion, the first deletion fails and we get - 1 0 0.
3rd write in the loop is unsuccessful, but not dirty - so we want to
keep old generation, which happens to be the 1st generation, so now we
have 1 x x in state folders. Now we assert that we either load 0 or 1
state from the state folders and select only 2rd and 3th folder to
emulate disk failures - this results in NPE because there is nothing in
these folders.
Fortunately, this won’t be a problem in real life, because if there is
a dirty exception, we shut down the node and make sure we perform a
successful write on the node startup.
This adds a set of helper classes to determine if an agg "has a value".
This is needed because InternalAggs represent "empty" in different
manners according to convention. Some use `NaN`, `+/- Inf`, `0.0`, etc.
A user can pass the Internal agg type to one of these helper methods
and it will report if the agg contains a value or not, which allows the
user to differentiate "empty" from a real `NaN`.
These helpers are best-effort in some cases. For example, several
pipeline aggs share a single return class but use different conventions
to mark "empty", so the helper uses the loosest definition that applies
to all the aggs that use the class.
Sums in particular are unreliable. The InternalSum simply returns 0.0
if the agg is empty (which is correct, no values == sum of zero). But this
also means the helper cannot differentiate from "empty" and `+1 + -1`.
This commit removes the warn-date from warning headers. Previously we
were stamping every warning header with when the request
occurred. However, this has a severe performance penalty when
deprecation logging is called frequently, as obtaining the current time
and formatting it properly is expensive. A previous change moved to
using the startup time as the time to stamp on every warning header, but
this was only to prove that the timestamping was expensive. Since the
warn-date is optional, we elect to remove it from the warning
header. Prior to this commit, we worked in Kibana to make the warn-date
treated as optional there so that we can follow-up in Elasticsearch and
remove the warn-date. This commit does that.
It looks like the output of FileUserPasswdStore.parseFile shouldn't be wrapped
into another map since its output can be null. Doing this wrapping after the null
check (which potentially raises an exception) instead.
Use PEM files for the key/cert for TLS on the http layer of the
node instead of a JKS keystore so that the tests can also run
in a FIPS 140 JVM .
Resolves: #37682
* Add separate CLI Mode
* Use the correct Mode for cursor close requests
* Renamed CliFormatter and have different formatting behavior for CLI and "text" format.
Prefer publishing to master-eligible nodes first, so that cluster state updates are committed more
quickly, and master-eligible nodes also turned more quickly into followers after a leader election.
PersistentTasksClusterService decides if a task should be reassigned by
checking there is a node in the cluster with the same Id. If a node is
restarted PersistentTasksClusterService may not observe the change and
decide the task still has a valid assignment because the node's
ephemeral Id is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
* Fail start of non-data node if node has data
Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.
Issue #27073
When publications were cancelled because a node turned to follower or candidate, it would still
show as time out, which can be confusing in the logs. This change adapts the improper call of
onTimeout by generalizing it to a cancel method.
The method calls "enabled" in addition to what the super.index() does, but this
seems to be done explicitely now in the TypeParsers `parse` method. The removed
method has been deprecated since at least 6.0. Also making some of the Builders
methods and ctos private since they are only used internally in this class.
With this commit we add a note to the API conventions documentation that
all date math expressions are resolved independently of any locale. This
behavior might be puzzling to users that try to specify a different
calendar than a Gregorian calendar.
Closes#37330
Relates #37663
Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the
node names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.
Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.
This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.
In order to be able to parse epoch seconds and epoch milli seconds own
java time fields had been introduced. These fields are however not
compatible with the way that java time allows one to configure default
fields (when a part of a timestamp cannot be read then a default value
is added), which is used for the formatters that are rounding up to the
next value.
This commit allows java date formatters to configure its round up parsing
by setting default values via a consumer. By default all formats are setting
JavaDateFormatter.ROUND_UP_BASE_FIELDS for rounding up. The epoch
however parsers both need to set different fields. The merged date
formatters do not set any fields, they just append all the round up formatters.
Also the formatter now properly copies the locale and the timezone,
fractional parsing has been set to nano seconds with proper width.
Since version 6.7.0 the Close Index API guarantees that all translog
operations have been correctly flushed before the index is closed. If
the index is reopened as a Frozen index (which uses a ReadOnlyEngine)
we can verify that the maximum sequence number from the last Lucene
commit is indeed equal to the last known global checkpoint and refuses
to open the read only engine if it's not the case. In this PR the check is
only done for indices created on or after 6.7.0 as they are guaranteed
to be closed using the new Close Index API.
Related #33888
Due to missing stubbing for `NativePrivilegeStore#getPrivileges`
the test `testNegativeLookupsAreCached` failed
when the superuser role name was present in the role names.
This commit adds missing stubbing.
Closes: #37657
This commit introduces a NetworkMessage class. This class has two
subclasses - InboundMessage and OutboundMessage. These messages can
be serialized and deserialized independent of the transport. This allows
more granular testing. Additionally, the serialization mechanism is now
a simple Supplier. This builds the framework to eventually move the
serialization of transport messages to the network thread. This is the
one serialization component that is not currently performed on the
network thread (transport deserialization and http serialization and
deserialization are all on the network thread).
Currently we create dedicated network threads for both the http and
transport implementations. Since these these threads should never
perform blocking operations, these threads could be shared. This commit
modifies the nio-transport to have 0 http workers be default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commmit deletes dedicated acceptor threads. We have never had these
for the netty transport and they can be added back if a need is
determined in the future.
The integ tests currently use the raw zip project name as the
distribution type. This commit simplifies this specification to be
"default" or "oss". Whether zip or tar is used should be an internal
implementation detail of the integ test setup, which can (in the future)
be platform specific.
* The repo id was determined wrong when the delete picked up on an in progress snapshot
* NOTE: This solution is still a best-effort fix and there's a slight chance of running into concurrency issues here
when multiple create and delete requests for the same snapshot name are happening concurrently, but these require a sequence
of multiple cluster state updates between the changed method reading the genId and submitting its cluster state update task
* Added test reproduced the issue reliably in about 50% of runs
* Closes#37581
This will be used in cross-cluster search when reduction will be
performed locally on each cluster. The CCS coordinating node will send
one search request per remote cluster involved and will get one search
response back from each one of them. Such responses contain all the info
to be able to perform an additional reduction and return results back
to the user.
Relates to #32125