This commit enables real BWC testing against a 5.1 snapshot. All
REST tests plus rolling upgrade test now run against a mixed version
cross major version cluster.
When a cluster update task executes, there can be log messages after the
update task has finished processing and the new cluster state becomes
visible. The visibility of the cluster state allows the test thread in
UpdateSettingsIT#testUpdateAutoThrottleSettings and
UpdateSettingsiT#testUpdateMergeMaxThreadCount to proceed. The test
thread will remove and stop a mock appender setup at the beginning of
the test. The log messages in the cluster state update task that occur
after processing has finished can race with the removal of the
appender. Log4j will grab a reference to the appenders when processing
these log messages, and this races with the removal and stopping of the
appenders. If Log4j grabs a reference to the appenders before the mock
appender has been removed, and the test thread subsequently removes and
stops the appender before Log4j has appended the log message, Log4j will
get angry that we are appending to a stopped appender, causing the test
to fail. This commit addresses this race by waiting for the cluster
state update task to have finished processing before freeing the test
thread to make its assertions and finally remove and stop the
appender. Yes, this is a hack.
Relates #21518
When a node closes, we shutdown logging as the last statement. This
statement must be last lest any subsequent attempts to log will blow up
by running into security permissions. Yet, in the case of a tribe node
this isn't enough. The first internal tribe node to close will shutdown
logging, and subsequent node closes will blow up with the aforementioned
problem. This commit migrate the Log4j shutdown to occur as part of the
shutdown hook that closes the node, after all nodes have
closed. Consequently, we can remove a hack in the test infrastructure to
prevent Log4j shutdowns when internal test nodes close and instead just
register a single shutdown hook that runs when the test JVM exits.
Relates #21519
Under some rare conditions search cancellation response might not fully clean scroll context. For now this commit adds the cleaning operation to the test, and we will address the root cause in https://github.com/elastic/elasticsearch/issues/21511
so that we are more likely to trigger I/O exceptions on writing the control files during the
finalize phase of snapshotting (with the aim of triggering an I/O failure when writing pending-index-*).
This commit sets the mock appender in UpdateSettingsIT to not ignore
exceptions. This means that when an exception is hit, we will see an
actual stack trace that could be useful in debugging a non-reproducible
test failure.
Relates #21461
[TEST] Makes the snapshot throttling test go much faster. Before,
the snapshot throttling test would throttle at a rate of 0.5 kb per
second, even though it would snapshot/restore about 25 kb of data.
This commit increases the throttling rate to 10kb per second, so
we still test the throttling mechanism while speeding up the test from
taking 30 plus seconds down to 2 seconds or less.
Each node checks on every cluster state update if there are shards that it can possibly delete from its disk. It decides this by doing a file-system lookup for each shard id that is fully allocated in the cluster. With lots of shards, this amounts to lots of Files.exists() checks, considerably slowing down cluster state updates. This commit adds a caching layer so that the Files.exists() checks can be skipped if not needed.
Currently the task cancellation command returns as soon as the top-level parent child is marked as cancelled. This create race conditions in tests where child tasks on other nodes may continue to run for some time after the main task is cancelled. This commit fixes this situation making task cancellation command to wait until it got propagated to all nodes that have child tasks.
Closes#21126
ShardActiveResponseHandler doesn't need to hold to an entire cluster state since it only needs to know the cluster state version. It seems that on overloaded systems where nodes are unresponsive holding onto a lot of different cluster states can make the situation worse.
Closes#21394
Ensures pending index-* blobs are deleted when snapshotting. The
index-* blobs are generational files that maintain the snapshots
in the repository. To write these atomically, we first write a
`pending-index-*` blob, then move it to `index-*`, which also deletes
`pending-index-*` in case its not a file-system level move (e.g.
S3 repositories) . For example, to write the 5th generation of the
index blob for the repository, we would first write the bytes to
`pending-index-5` and then move `pending-index-5` to `index-5`. It is
possible that we fail after writing `pending-index-5`, but before
moving it to `index-5` or deleting `pending-index-5`. In this case,
we will have a dangling `pending-index-5` blob laying around. Since
snapshot #5 would have failed, the next snapshot assumes a generation
number of 5, so it tries to write to `index-5`, which first tries to
write to `pending-index-5` before moving the blob to `index-5`. Since
`pending-index-5` is leftover from the previous failure, the snapshot
fails as it cannot overwrite this blob.
This commit solves the problem by first, adding a UUID to the
`pending-index-*` blobs, and secondly, strengthen the logic around
failure to write the `index-*` generational blob to ensure pending
files are deleted on cleanup.
Closes#21462
This change was reverted after it caused random test failures. This was
due to a copy/paste error in the original PR which caused the mock
version of ClusterInfoService to be used whenever the mock *ZenPing* was
used, and the real ClusterInfoService to be used when MockZenPing was
not used.
Currently, pending operations can complete after tests with disruption scheme
completes. This commit waits for the pending operation counter to complete
after the tests are run
* Allows for an array of index template patterns to be provided to an
index template, and rename the field from 'template' to 'index_pattern'.
Closes#20690
When logging a mock exception, Log4j attempts to render the stack
trace. On a mock exception, this will be null and Log4j will hit a
NullPointerException. This NullPointerException will get recorded in the
status logger buffer that we use to ensure that we do not having any
misuses of Log4j in production code. This commit replaces the use of a
mock exception with an actual exception to avoid angering the Log4j
assertions in ESTestCase.
Today when a message is not fully read on a response, we log (among
other details) the handler name. Unfortunately, if the handler is a
wrapper, all that we see is
o.e.t.TransportService$ContextRestoreResponseHandler@7446ba18
completely losing the offending handler. This commit adds an override
for TransportService$ContextRestoreResponseHandler#toString so that the
underlying offender can be discovered.
Relates #21478
Today when handling responses from nodes in TransportNodesAction, if a
node timeouts or some other failure occurs and the action is not
accumulating exceptions, we log a confusing message:
org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction]
ignoring unexpected response [null] of type [null], expected
[ClusterStatsNodeResponse] or [FailedNodeException]
Moreover, the original exception is completely lost. Since this log
message is confusing and unhelpful, we can drop it. Instead, we hold
onto the exception and log it at the warn level before dropping it from
the response.
Relates #21476
This commit updates JodaTime to version 2.9.5 that contain a fix for a bug when parsing time zones (see https://github.com/JodaOrg/joda-time/pull/332, https://github.com/JodaOrg/joda-time/issues/386 and https://github.com/JodaOrg/joda-time/issues/373).
It also remove the joda-convert dependency that seems to be unused.
closes#20911
Here is the changelog for 2.9.5:
```
Changes in 2.9.5
----------------
- Add Norwegian period translations [#378]
- Add Duration.dividedBy(long,RoundingMode) [#69, #379]
- DateTimeZone data updated to version 2016i
- Fixed bug where clock read twice when comparing two nulls in DateTimeComparator [#404]
- Fixed minor issues with historic time-zone data [#373]
- Fix bug in time-zone binary search [#332, #386]
The fix in v2.9.2 caused problems when the time-zone being parsed
was not the last element in the input string. New approach uses a
different approach to the problem.
- Update tests for JDK 9 [#394]
- Close buffered reader correctly in zone info compiler [#396]
- Handle locale correctly zone info compiler [#397]
```
The method used to be called `isSourceEmpty`, and was renamed to `hasSource`, but the return value never changed. Updated tests and users accordingly.
Closes#21419
Currently, `executeIndexRequestOnPrimary` and `executeDeleteRequestOnPrimary`
methods used to prepare and execute write operations, modifies the provided
request, updating the version and versionType. This commit makes it the
callers responsibility to update request version and versionType and avoids
mutating the provided request in the execute methods.
This commit introduces a new execution mode for the
`simple_query_string` query, which is intended down the road to be a
replacement for the current _all field.
It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:
The _all field is disabled
No default_field has been set in the index settings
No fields are specified in the request
Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.
When executing in all field mode, the `simple_query_string` query will
look at all the fields in the mapping that are not metafields and can be
searched, and automatically expand the list of fields that are going to
be queried.
Relates to #20925, which is the `query_string` version of this work.
This is basically the same behavior, but for the `simple_query_string`
query.
Relates to #19784
This commit clarifies some log messages for the bootstrap checks. The
message is that the user has limits set that are below the minimums that
Elasticsearch requires.
Relates #21423
This commit ensures that we always restore the thread's original context after execution of
a context preserving runnable. We always wrap runnables in a wrapper that restores the context
at the time it was submitted to the execute method. The ContextPreservingAbstractRunnable
would restore the calling context in the doRun method and then in a try with resources
block would restore the thread's original context. However, the onFailure and onAfter methods
of a AbstractRunnable could modify the thread context and this modified thread context would
continue on as the thread's context after it was returned to the pool and potentially used
for a different purpose.
* Plugins: Convert custom discovery to pull based plugin
This change primarily moves registering custom Discovery implementations
to the pull based DiscoveryPlugin interface. It also keeps the cloud
based discovery plugins re-registering ZenDiscovery under their own name
in order to maintain backwards compatibility. However,
discovery.zen.hosts_provider is changed here to no longer fallback to
discovery.type. Instead, each plugin which previously relied on the
value of discovery.type now sets the hosts_provider to itself if
discovery.type is set to itself, along with a deprecation warning.
This commit cleans up the formatting of some Javadocs in
BootstrapCheck.java, and corrects some docs that had become stale as the
bootstrap checks evolved.
The ClusterState class currently has a mutable volatile field "status" that is only used by the ClusterStateObserver to differentiate between a cluster state that is being applied or one that has already been applied. This commit removes the field from cluster state, making it a truly immutable class. This information is stored instead by ClusterService, which is the only place that should update this field (PublishClusterStateAction was also updating it, but that information was never used anywhere). A new class is introduced (ClusterServiceState) which emcompasses the current cluster state as well as the current status, which is only used by the ClusterStateObserver mechanism.
Applied (almost) the same rules we use to validate index names
to new alias names. The only rule not applies it "must be lowercase".
We have tests that don't follow that rule and I except there are lots
of examples of camelCase alias names in the wild. We can add that
validation but I'm not sure it is worth it.
Closes#20748
Adds an alias that starts with `#` to the BWC index and validates
that you can read from it and remove it. Starting with `#` isn't
allowed after 5.1.0/6.0.0 so we don't create the alias or check it
after those versions.