This commit introduces a background sync for retention leases. The idea
here is that we do a heavyweight sync when adding a new retention lease,
and then periodically we want to background sync any retention lease
renewals to the replicas. As long as the background sync interval is
significantly lower than the extended lifetime of a retention lease, it
is okay if from time to time a replica misses a sync (it will still have
an older version of the lease that is retaining more data as we assume
that renewals do not decrease the retaining sequence number). There are
two follow-ups that will come after this commit. The first is to address
the fact that we have not adapted the should periodically flush logic to
possibly flush the retention leases. We want to do something like flush
if we have not flushed in the last five minutes and there are renewed
retention leases since the last time that we flushed. An additional
follow-up will remove the syncing of retention leases when a retention
lease expires. Today this sync could be invoked in the background by a
merge operation. Rather, we will move the syncing of retention lease
expiration to be done under the background sync. The background sync
will use the heavyweight sync (write action) if a lease has expired, and
will use the lightweight background sync (replication action) otherwise.
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
Today we have DiscoveryDisruptionIT tests for checking that discovery can still
work once the cluster has formed, even if the cluster is misconfigured and only
has a single master-eligible node in its unicast hosts list. In fact with Zen2
we can go one better: we do not need any nodes in the unicast hosts list,
because nodes also use the contents of the last-committed cluster state for
discovery. Additionally, the DiscoveryDisruptionIT tests were failing due to
the overenthusiastic fault-detection timeouts.
This commit replaces these tests with deterministic `CoordinatorTests` that
verify the same behaviour. It also removes some duplication by extracting a
test method called `testFollowerCheckerAfterMasterReelection()`
Closes#37687
Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
This PR removes the temporary change we made to the yml test harness in #37285
to automatically set `include_type_name` to `true` in index creation requests
if it's not already specified. This is possible now that the vast majority of
index creation requests were updated to be typeless in #37611. A few additional
tests also needed updating here.
Additionally, this PR updates the test harness to set `include_type_name` to
`false` in index creation requests when communicating with 6.x nodes. This
mirrors the logic added in #37611 to allow for typeless document write requests
in test set-up code. With this update in place, we can remove many references
to `include_type_name: false` from the yml tests.
Currently, there are a few tests that use autoMinMasterNodes=false and
hence override addExtraClusterBootstrapSettings, mostly this is 10-30
lines of codes that are copy-pasted from class to class.
This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex`
which is suitable for all classes and copy-paste could be removed.
Removing code is always a good thing!
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes#32348
Implements `geotile_grid` aggregation
This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240
This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.
When extending ESIntegTestCase are run on the same jvm, the static field in
NodeAndClusterIdConverter will throw an AlreadySet exceptions.
overriding the configuration method from Node.configureNodeAndClusterIdStateListener in the MockNode will prevent the listener registration from happening
relates #32850
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes#37708.
This commit pushes the primary term into the replication tracker. This
is a precursor to using the primary term to resolving ordering problems
for retention leases. Namely, it can be that out-of-order retention
lease sync requests arrive on a replica. To resolve this, we need a
tuple of (primary term, version). For this to be, the primary term needs
to be accessible in the replication tracker. As the primary term is part
of the replication group anyway, this change conceptually makes sense.
With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests, that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested to each cluster, and the reduce phase in each cluster is non final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.
This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API accepts now a new parameter called ccs_minimize_roundtrips that allows to opt-out of the default behaviour.
Relates to #32125
Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in
tests, but for 7.x nodes this setting is not required as it has no effect.
This commit removes this setting so that nodes are started with more realistic
configurations, and deprecates it.
This drops the option for tests to disable strict deprecation mode in
the low level rest client in favor of configuring expected warnings on
any calls that should expect warnings. This behavior is
paranoid-by-default which is generally the right way to handle
deprecations and tests in general.
This PR attempts to remove all typed calls from our YAML REST tests. The PR adds include_type_name: false to create index requests that use a mapping and also to put mapping requests. It also removes _type from index requests where they haven't already been removed. The PR ignores tests named *_with_types.yml since this are specifically testing typed API behaviour.
The change also includes changing the test harness to add the type _doc to index, update, get and bulk requests that do not specify the document type when the test is running against a mixed 7.x/6.x cluster.
This is related to #35975. It adds a action timeout setting that allows
timeouts to be applied to the individual transport actions that are
used during a ccr recovery.
In order to support JSON log format, a custom pattern layout was used and its configuration is enclosed in ESJsonLayout. Users are free to use their own patterns, but if smooth Beats integration is needed, they should use ESJsonLayout. EvilLoggerTests are left intact to make sure user's custom log patterns work fine.
To populate additional fields node.id and cluster.uuid which are not available at start time,
a cluster state update will have to be received and the values passed to log4j pattern converter.
A ClusterStateObserver.Listener is used to receive only one ClusteStateUpdate. Once update is received the nodeId and clusterUUid are set in a static field in a NodeAndClusterIdConverter.
Following fields are expected in JSON log lines: type, tiemstamp, level, component, cluster.name, node.name, node.id, cluster.uuid, message, stacktrace
see ESJsonLayout.java for more details and field descriptions
Docker log4j2 configuration is now almost the same as the one use for ES binary.
The only difference is that docker is using console appenders, whereas ES is using file appenders.
relates: #32850
This commit introduces retention lease syncing from the primary to its
replicas when a new retention lease is added. A follow-up commit will
add a background sync of the retention leases as well so that renewed
retention leases are synced to replicas.
* Remove empty statements
There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. Removing most of these instances and marking blocks
that look empty by intention as such.
* Change test, slightly more verbose but less confusing
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.
Closes#33028
From #29453 and #37285, the include_type_name parameter was already present and defaulted to false. This PR makes the following updates:
* Add deprecation warnings to RestCreateIndexAction, plus tests in RestCreateIndexActionTests.
* Add a typeless 'create index' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I created new CreateIndexRequest and CreateIndexResponse objects that differ from the existing server ones.
elasticsearch-node tool helps to restore cluster if half or more of
master eligible nodes are lost. Of course, all bets are off, regarding
data consistency.
There are two parts of the tool: unsafe-bootstrap to be used when there
is still at least one master-eligible node alive and detach-cluster,
when there are no master-eligible nodes left.
This commit implements the first part.
Docs for the tool will be added separately as a part of #37812.
This change split out all the specific GeoHash
classes for the geohash_grid aggregation into
abstract GeoGrid classes that can be re-used for
specific hashing types, like `geohash`
* Testing conventions now checks for tests in main
This is the last outstanding feature of the old NamingConventionsTask,
so time to remove it.
* PR review
The assertion `assertOpsOnPrimary` does not store seq_no and primary
term of successful deletes to the `lastOpSeqNo` and `lastOpTerm`. This
leads to failures of the subsequence CAS deletes or indexes with seq_no
and term. Moreover, this assertion trips a translog assertion because it
bumps the primary term of some operations but not the primary term of
the engine.
Relates #36467Closes#37684
Exceptions thrown by the cluster applier service's settings and cluster appliers are bubbled up, and
block the state from being applied instead of silently being ignored. In combination with the cluster
state publishing lag detector, this will throw a node out of the cluster that can't properly apply
cluster state updates.
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.
The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.
Relates #27330
Users may require the sequence number and primary terms to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalues_fields` API but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it.
This commit adds a dedicated sub fetch phase to return both numbers that is connected to a new `seq_no_primary_term` parameter.
* Fail start of non-data node if node has data
Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.
Issue #27073
This commit introduces a NetworkMessage class. This class has two
subclasses - InboundMessage and OutboundMessage. These messages can
be serialized and deserialized independent of the transport. This allows
more granular testing. Additionally, the serialization mechanism is now
a simple Supplier. This builds the framework to eventually move the
serialization of transport messages to the network thread. This is the
one serialization component that is not currently performed on the
network thread (transport deserialization and http serialization and
deserialization are all on the network thread).
Currently we create dedicated network threads for both the http and
transport implementations. Since these these threads should never
perform blocking operations, these threads could be shared. This commit
modifies the nio-transport to have 0 http workers be default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commmit deletes dedicated acceptor threads. We have never had these
for the netty transport and they can be added back if a need is
determined in the future.
From #29453 and #37285, the `include_type_name` parameter was already present and defaulted to false. This PR makes the following updates:
- Add deprecation warnings to `RestPutMappingAction`, plus tests in `RestPutMappingActionTests`.
- Add a typeless 'put mappings' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I opted to create a new `PutMappingRequest` object that differs from the existing server one.
Some tests (e.g. testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since
being switched to Zen2 because the bootstrap setting was left around when nodes got restarted
with data folders wiped.
The test in question here was starting one node (which autobootstrapped to that single node), then
another node. The first node was then shut down (after excluding it from the voting configuration),
its data folder wiped, and restarted. After restart, the node had an empty data folder yet
initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of
its own, and not rejoin the existing cluster with the other node.
This commit removes some leniency from REST handling where we move to
reject all requests that have a body where the body is not used during
the course of handling the request. For example,
DELETE /index
{
"query" : {
"term" : {
"field" : "value"
}
}
}
is now rejected.
The AbstracLifecycleComponent used to extend AbstractComponent, so it had to pass settings to the constractor of its supper class.
It no longer extends the AbstractComponent so there is no need for this constructor
There is also no need for AbstracLifecycleComponent subclasses to have Settings in their constructors if they were only passing it over to super constructor.
This is part 1. which will be backported to 6.x with a migration guide/deprecation log.
part 2 will have this constructor removed in 7
relates #35560
relates #34488
This commit fixes the IndexShardTestCase.recoverReplica(IndexShard, IndexShard, boolean)
method where the startReplica parameter was not correctly propagated and the value
true always used instead.
This commit reformats some classes in the index universe with the
purpose of breaking some long method definitions and invocations into a
line per parameter. This has the advantage that for an upcoming change
to these definitions and invocations, the diff for that change will be a
single line per definition or invocation. That makes these sorts of
changes easier to read.
* Default include_type_name to false for get and put mappings.
* Default include_type_name to false for get field mappings.
* Add a constant for the default include_type_name value.
* Default include_type_name to false for get and put index templates.
* Default include_type_name to false for create index.
* Update create index calls in REST documentation to use include_type_name=true.
* Some minor clean-ups around the get index API.
* In REST tests, use include_type_name=true by default for index creation.
* Make sure to use 'expression == false'.
* Clarify the different IndexTemplateMetaData toXContent methods.
* Fix FullClusterRestartIT#testSnapshotRestore.
* Fix the ml_anomalies_default_mappings test.
* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.
We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.
This commit also refactors GetMappingsResponse to follow the same appraoch
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.
* Fix more REST tests.
* Improve some wording in the create index documentation.
* Add a note about types removal in the create index docs.
* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.
* Make sure to mention include_type_name in the REST docs for affected APIs.
* Make sure to use 'expression == false' in FullClusterRestartIT.
* Mention include_type_name in the REST templates docs.
If the checking node no longer holds the shard copy, the assertion
assertSameDocIdsOnShards might fail. This is too harsh since the
assertion is to ensure the consistency between active copies.
Today file-chunks are sent sequentially one by one in peer-recovery. This is a
correct choice since the implementation is straightforward and recovery is
network bound in most of the time. However, if the connection is encrypted, we
might not be able to saturate the network pipe because encrypting/decrypting
are cpu bound rather than network-bound.
With this commit, a source node can send multiple (default to 2) file-chunks
without waiting for the acknowledgments from the target.
Below are the benchmark results for PMC and NYC_taxis.
- PMC (20.2 GB)
| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| -------- | -------- | -------- | -------- |
| Plain | 184s | 137s | 106s | 105s | 106s |
| TLS | 346s | 294s | 176s | 153s | 117s |
| Compress | 1556s | 1407s | 1193s | 1183s | 1211s |
- NYC_Taxis (38.6GB)
| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| ---------| ---------| ---------| -------- |
| Plain | 321s | 249s | 191s | * | * |
| TLS | 618s | 539s | 323s | 290s | 213s |
| Compress | 2622s | 2421s | 2018s | 2029s | n/a |
Relates #33844
This is related to #35975. It implements a file based restore in the
CcrRepository. The restore transfers files from the leader cluster
to the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.