The `ShardFailedClusterStateTaskExecutor` fails some shards, which performs a
reroute, but then sometimes schedules a followup reroute. It's not clear from
the code why this followup is necessary, so this commit adds a short comment
describing why it's necessary.
This commit documents the backup and restore of a cluster's
security configuration.
It is not possible to only backup (or only restore) security
configuration, independent to the rest of the cluster's conf,
so this describes how a full configuration backup&restore
will include security as well. Moreover, it explains how part
of the security conf data resides on the special .security
index and how to backup that using regular data snapshot API.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
Co-Authored-By: Tim Vernum <tim@adjective.org>
* Add test for SQL not being available error message in JDBC.
* Add a new qa sub-project that explicitly disables SQL XPack module in Gradle.
(cherry picked from commit 8a1ac8a3a88a325ec9b99963e0fa288c18ee0ee5)
* The created array didn't have the correct initial size while attempting to resolve multiple indices
(cherry picked from commit 341006e9913e831408f5bbc7f8ad8c453a7f630e)
Custom timestamp overrides provided to the find_file_structure
endpoint produced an invalid Grok pattern if the fractional
seconds separator was a dot rather than a comma or colon.
This commit fixes that problem and adds tests for this sort
of timestamp override.
Fixes#44110
This commit replaces usages of Streamable with Writeable for the
MultiSearchRequest class.
I ran into this when developing a custom action that reuses
MultiSearchRequest in the enrich branch.
Relates to #34389
Previously a data frame transform would check whether the
source index was changed every 10 seconds. Sometimes it
may be desirable for the check to be done less frequently.
This commit increases the default to 60 seconds but also
allows the frequency to be overridden by a setting in the
data frame transform config.
Today the `ClusterInfoService` requires the `DiskThresholdMonitor` at
construction time so that it can notify it when nodes report changes in their
disk usage, but this is awkward to construct: the `DiskThresholdMonitor`
requires a `RerouteService` which requires an `AllocationService` which comees
from the `ClusterModule` which requires the `ClusterInfoService`.
Today we break the cycle with a `LazilyInitializedRerouteService` which is
itself a little ugly. This commit replaces this with a more traditional
subject/observer relationship between the `ClusterInfoService` and the
`DiskThresholdMonitor`.
Today `testOnlyBlocksOnConnectionsToNewNodes` fails (extremely rarely) if the
last attempt to connect to `node0` is delayed for so long that the test runs
`nodeConnectionsBlocks.clear()` before the connection attempt obtains the
expected connection block. We can turn this into a reliable failure with this
delay:
```diff
diff --git a/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java b/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java
index f48413824d3..9a1d0336bcd 100644
--- a/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java
@@ -300,6 +300,13 @@ public class NodeConnectionsService extends AbstractLifecycleComponent {
private final Runnable connectActivity = () -> threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
@Override
protected void doRun() {
+
+ try {
+ Thread.sleep(500);
+ } catch (InterruptedException e) {
+ throw new AssertionError("unexpected", e);
+ }
+
assert Thread.holdsLock(mutex) == false : "mutex unexpectedly held";
transportService.connectToNode(discoveryNode);
consecutiveFailureCount.set(0);
```
This commit reverts the extra logging introduced in #43979 and fixes this
failure by waiting for the connection attempt to hit the barrier before
removing it.
Fixes#40170
Since #42636 we no longer treat connections specially when simulating a
blackholed connection. This means that at the end of the safety phase we may
have just started a connection attempt which will time out, but the default
timeout is 30 seconds, much longer than the 2 seconds we normally allow for
post-safety-phase discovery. This commit adds time for such a connection
attempt to time out.
It also fixes some spurious logging of `this` that now refers to an object with
an unhelpful `toString()` implementation introduced in #42636.
Fixes#44073
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll, instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun search during retries.
Relates #43187 and #42612
Today the `index.routing.allocation.initial_recovery._id` setting can only be
set on indices that are the result of a shrink, but the filtered allocation
decider also applies this filter to shards with a recovery source of
`EMPTY_STORE`. The only way to have this setting set while the recovery source
is `EMPTY_STORE` is to force-allocate an empty primary, but such a forced
allocation ignores this allocation decider.
This commit simplifies the allocation decider so that the `initial_recovery`
setting only applies to shards with a recovery source of `LOCAL_SHARDS`.
* Fix DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode
* See comment in the test: The problem is that when the snapshot delete works out partially on master failover and the retry fails on `SnapshotMissingException` no repository cleanup is run => we still failed even with repo cleanup logic in the delete path now
* Fixed the test by rerunning a create snapshot and delete loop to clean up the repo before verifying file counts
* Closes#39852
The rest client does not communicate over the transport protocol.
However, in the move to make all apis supported in the HLRC, some
response classes were copied with extending ActionResponse, which is
meant strictly for the transport protocol. This commit removes uses of
that base class from HLRC.
In today's test suite indices mostly use the default value of `12h` for the
`index.soft_deletes.retention_lease.period` setting, which in the context of
the test suite essentially means "never expires". In fact, the tests should all
behave correctly even if the lease period is much shorter; tests that rely on
leases not expiring should configure their indices appropriately.
This commit randomises the lease expiry time for those indices created during
tests which do not set a specific value for this setting.
* Missed this one in #42189 and it randomly runs into a situation where the broken mock repo is broken such that we can't get to a consistent end state via a delete
* Closes#43498
Today we prevent any index that is actively snapshotted or restored to be closed.
This verification is done during the execution of the first phase of index closing
(ie before blocking the indices).
We should also do this verification again in the last phase of index closing
(ie after the shard sanity checks and right before actually changing the index
state and the routing table) because a snapshot/restore could sneak in while
the shards are verified-before-close.
Provides a hook for aggregations to introspect the `ValuesSourceType` for a user supplied Missing value on an unmapped field, when the type would otherwise be `ANY`. Mapped field behavior is unchanged, and still applies the `ValuesSourceType` of the field. This PR just provides the hook for doing this, no existing aggregations have their behavior changed.
* In the current codebase it is hardly obvious what code operates on a shard and is run by a datanode what code operates on the global metadata and is run on master
* Fixed by adjusting the method names accordingly
* The nested context classes don't add much if any value, they simply spread out the parameters that go into a shard snapshot create or delete all over the place since their
constructors can be inlined in all spots
* Fixed by flattening the nested classes into BlobStoreRepository
* Also:
* Inlined the other single use inner classes
* Remove some obvious dead code
* Move assert methods that were only used in a single test class to the child they belong to
* Inline some redundant methods
Data frame task responses had logic to return a HTTP 500 status code if there was
any node or task failures even if other tasks in the same request reported correctly.
This is different to how other task responses are handled where a 200 is always
returned leaving the client should check for failures. Returning a 500 also breaks
the high level rest client so always return a 200
Closes#44011
If an index is the result of a shrink then it will have a value set for
`index.routing.allocation.initial_recovery._id`. If this index is subsequently
split then this value will be copied over, forcing the initial allocation of
the split shards to occur on the node on which the shrink took place. Moreover
if this node no longer exists then the split will fail. This commit suppresses
the copying of this setting when splitting an index.
Fixes#43955
* Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes
* Runs after ever delete operation
* Relates #13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)
Joda allowed for date_optional_time and strict_date_optional_time a decimal point to be . dot or , comma
For our java.time implementation we should also extend this for strict_date_optional_time-nanos
the approach to fix this is the same as in iso8601 parser
closes#43730
* Provide an Option to Use Path-Style-Access with S3 Repo
* As discussed, added the option to use path style access back again and
deprecated it.
* Defaulted to `false`
* Added warning to docs
* Closes#41816