When a primary shard is recovered from its store, we trim the last
commit (when it's unsafe). If that primary crashes before the recovery
completes, we will lose the committed retention leases because they are
baked in the last commit. With this change, we copy the retention leases
from the last commit to the safe commit when trimming unsafe commits.
Relates #37165
In this case, we were incrementing the policy too much. This means on
every iteration we actually keep increasing the minimum retained
sequence number, even with leases in place. It was a bug from when the
soft deletes policy had retention leases incorporated into it. This
commit fixes this bug by ensuring we only increment in the proper
places, and adds careful tests for the various situations.
Forward port of https://github.com/elastic/elasticsearch/pull/38757
This change reverts the initial 7.0 commits and replaces them
with the 6.7 variant that still allows for the ecs flag.
This commit differs from the 6.7 variants in that ecs flag will
now default to true.
6.7: `ecs` : default `false`
7.x: `ecs` : default `true`
8.0: no option, but behaves as `true`
* Revert "Ingest node - user agent, move device to an object (#38115)"
This reverts commit 5b008a34aa.
* Revert "Add ECS schema for user-agent ingest processor (#37727) (#37984)"
This reverts commit cac6b8e06f.
* cherry-pick 5dfe1935345da3799931fd4a3ebe0b6aa9c17f57
Add ECS schema for user-agent ingest processor (#37727)
* cherry-pick ec8ddc890a34853ee8db6af66f608b0ad0cd1099
Ingest node - user agent, move device to an object (#38115) (#38121)
* cherry-pick f63cbdb9b426ba24ee4d987ca767ca05a22f2fbb (with manual merge fixes)
Dep. check for ECS changes to User Agent processor (#38362)
* make true the default for the ecs option, and update 7.0 references and tests
A hardcoded '\n' in a string does not work on Windows, which uses a
different line separator. System.lineSeparator() should be used instead
so that it works on all platforms.
closes #38705
backport #38771
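As a rough illustration of the idea (the class and method names below are
hypothetical and not taken from the change itself), building multi-line text
with `System.lineSeparator()` rather than a literal '\n' might look like this:

```java
final class LineSeparators {
    // Joins the given lines with the platform line separator
    // ("\r\n" on Windows, "\n" on most other platforms).
    static String joinLines(String... lines) {
        return String.join(System.lineSeparator(), lines);
    }

    public static void main(String[] args) {
        String text = joinLines("first", "second");
        // A test that compares against "first\nsecond" would only pass
        // on platforms where the separator happens to be "\n".
        System.out.println(text.equals("first\nsecond"));
    }
}
```

Tests that compare against such output need to build the expected string the
same way, otherwise they only pass on platforms where the separator is '\n'.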
- Disables the request cache on the test, to prevent cached
values from potentially interfering with test results
- Changes the test to execute a single query, in hopes of making
failures more reproducible
Backport of #38583
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. This change adjusts
the logging and unmutes the test so we can collect more information.
Relates #38633
We need to use the current primary term instead of 1L for the initial
retention leases; otherwise, the primary term of the committed
retention leases won't match the current primary term if the
retention leases never get updated.
This commit introduces actions for some common retention lease
operations that clients need to be able to perform remotely. These
actions include add/renew/remove.
Fix tests to use a clock with millisecond precision in watcher code.
Make sure the date comparison in string format uses the same formatters.
Some of the code was modified in #38514, possibly because of merge conflicts.
closes #38581
Backport #38738
A recent test failure triggered an edge case scenario where failures may be coming back with the same shard id, yet from different clusters.
This commit adapts the failures comparator to take the cluster alias into account when merging failures as part of CCS requests execution.
Also the corresponding test has been split in two: with and without
search shard target set to the failure.
Closes #38672
When a retention lease already exists on an add retention lease
invocation, or a retention lease is not found on a renew retention lease
invocation today we throw an illegal argument exception. This puts a
burden on the caller to catch that specific exception and parse the
message. This commit relieves the burden from the caller by adding
dedicated exception types for these situations.
This commit introduces the ability to remove retention leases. Explicit
removal will be needed to manage retention leases used to increase the
likelihood of operation-based recoveries syncing, and for consumers such
as ILM.
geo_shape indexes created before 6.6 use geohash string encoding as default tree parameter and quadtree encoding for 6.6 and later. This commit fixes bwc to use geohash encoding in LegacyGeoshapeFieldMapper for indexes created before 6.6.
The assertion that the stats2 map is empty in
IndicesQueryCache.close has been observed to
fail very occasionally in internal cluster tests.
The likely cause is a cross-thread visibility
problem for a count variable. This change
makes that count volatile.
Relates #37117
Backport of #38714
The Close Index API has been refactored in 6.7.0 and it now performs
pre-closing sanity checks on shards before an index is closed: the maximum
sequence number must be equal to the global checkpoint. While this is a
strong requirement for regular shards, we identified the need to relax this
check in the case of CCR following shards.
The following shards are not in charge of managing the max sequence
number or global checkpoint, which are pulled from a leader shard. They
also fetch and process batches of operations from the leader in an unordered
way, potentially leaving gaps in the history of ops. If the following shard lags
a lot it's possible that the global checkpoint and max seq number never get
in sync, preventing the following shard from being closed and a new PUT Follow
action from being issued on this shard (which is our recommended way to
resume/restart a CCR following).
This commit allows each Engine implementation to define the specific
verification it must perform before closing the index. In order to allow
following/frozen/closed shards to be closed whatever the max seq number
or global checkpoint are, the FollowingEngine and ReadOnlyEngine do
not perform any check before the index is closed.
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Added a constructor accepting `StreamInput` as argument, which allowed
making most of the instance members final as well as removing the default
constructor.
Removed a test only constructor in favour of invoking the existing
constructor that takes a `SearchRequest` as first argument.
Also removed profile members and related methods as they were all unused.
The existing formatter being used was not on par with the joda formatter
as it was missing the ability to parse a comma as a separator between
seconds and milliseconds.
While a real iso8601 would be much more complex, this might be
sufficient for some more use-cases.
The ingest date formatter now also uses the iso8601 formatter by
default.
Closes #38345
The benchmarks showed a sharp decrease in aggregation performance for
the UTC case.
This commit uses the same calculation as joda time, which requires no
conversion into any java time object. Also, the check for a fixed offset
has been put into the ctor to reduce the need for runtime calculations.
The same goes for the amount of the used unit in milliseconds.
Closes #37826
Whenever phase failure is raised in AbstractSearchAsyncAction, we go and
release search contexts of shards that successfully returned their
results, prior to notifying the listener of the failure. In case we are
executing a CCS request, it's important to look-up the connection to
send the release context request to.
This commit makes sure that the lookup takes the cluster alias into
account. We used to use `null` at all times instead which is not correct
and was not caught as any exception is caught without re-throwing it.
The test was relying on toString in ZonedDateTime which is different to
what is formatted by strict_date_time when milliseconds are 0.
The method is just delegating to dateFormatter, so that scenario should
be covered there.
closes #38359
Backport #38610
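The mismatch can be reproduced with plain java.time. This is only a sketch:
the fixed-millisecond pattern below is an assumed approximation of
strict_date_time, and the class name is hypothetical.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ZeroMillisFormatting {
    // Assumption: a pattern that always prints three fraction digits,
    // standing in for strict_date_time.
    static final DateTimeFormatter FIXED_MILLIS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSXXX", Locale.ROOT);

    // toString() drops the fraction entirely when it is zero.
    static String viaToString(ZonedDateTime time) {
        return time.toString();
    }

    // The formatter always prints the milliseconds, even when they are zero.
    static String viaFormatter(ZonedDateTime time) {
        return FIXED_MILLIS.format(time);
    }

    public static void main(String[] args) {
        ZonedDateTime time = ZonedDateTime.of(2019, 1, 1, 12, 30, 45, 0, ZoneOffset.UTC);
        System.out.println(viaToString(time));  // 2019-01-01T12:30:45Z
        System.out.println(viaFormatter(time)); // 2019-01-01T12:30:45.000Z
    }
}
```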
Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR.
The implementation uses the parallel file writer functionality that is also used by peer recoveries.
This commit adds the 7.1 version constant to the 7.x branch.
Co-authored-by: Andy Bristol <andy.bristol@elastic.co>
Co-authored-by: Tim Brooks <tim@uncontended.net>
Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
Co-authored-by: markharwood <markharwood@gmail.com>
Co-authored-by: Ioannis Kakavas <ioannis@elastic.co>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Alpar Torok <torokalpar@gmail.com>
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Albert Zaharovits <albert.zaharovits@gmail.com>
This commit changes the `TransportVerifyShardBeforeCloseAction` so that it
always forces the flush of the shard. It seems that #37961 is not sufficient to
ensure that the translog and the Lucene commit share the exact same max
seq no and global checkpoint information in case one or more noop
operations have been made.
The `BulkWithUpdatesIT.testThatMissingIndexDoesNotAbortFullBulkRequest`
and `FrozenIndexTests.testFreezeEmptyIndexWithTranslogOps` test this trivial
situation and they both fail in roughly 1 in 10 executions.
Relates to #33888
In #38333 and #38350 we moved away from the `discovery.zen` settings namespace
since these settings have an effect even though Zen Discovery itself is being
phased out. This change aligns the documentation and the names of related
classes and methods with the newly-introduced naming conventions.
This commit integrates retention leases with recovery. With this change,
we copy the current retention leases on primary to the replica during
phase two of recovery. At this point, the replica will have been added
to the replication group and so is already receiving retention lease
sync requests from the primary. This means that if any retention lease
syncs are triggered on the primary after we sample the retention leases
here during phase two, that sync request will also arrive on the replica
ensuring that the replica is from this point on up to date with the
retention leases on the primary. We have to copy these during phase two
since we will be applying indexing operations, potentially triggering
merges, and therefore must ensure the correct retention leases are in
place beforehand.
Instead of logging warnings we now rethrow exceptions thrown inside
scheduled/submitted tasks. This will still log them as warnings in
production but has the added benefit that if they are thrown during
unit/integration test runs, the test will be flagged as an error.
This is a continuation of #38014
Fixed NPE that caused CCR tests (IndexFollowingIT and likely others)
to fail.
schedule could bubble a rejected exception up to the uncaught exception
handler when not using the SAME executor if the thread pool is terminated.
Now the rejected exception is silently ignored if the executor is shut down.
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a optimistic concurrency control) operations using internal document versioning. Sadly that approach is flawed and can sometimes do the wrong thing. Here's the relevant excerpt from the resiliency status page:
> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document
We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem.
This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.
Relates to #1078
This commit lifts the control of when retention leases are expired to
index shard. In this case, we move expiration to an explicit action
rather than a side-effect of calling
ReplicationTracker#getRetentionLeases. This explicit action is invoked
on a timer. If any retention leases expire, then we hard sync the
retention leases to the replicas. Otherwise, we proceed with a
background sync.
Currently the snapshot/restore process manually sets the global
checkpoint to the max sequence number from the restored segments. This
does not work for CCR, as it prevents documents that would be recovered
in the normal following operation from being recovered.
This commit fixes this issue by setting the initial global checkpoint to
the existing local checkpoint.
Previously, date formats of `YYYY.MM.dd` would hit an issue
where the year would jump towards the end of the calendar year.
This was an issue that had since been resolved in tests by using
`yyyy` to be the more accurate representation of the year.
Closes #37037.
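For reference, in java.time the pattern letter `Y` is the week-based year
while `y` is the year of era, which is what makes the year appear to jump
around New Year. A small sketch (the class name is hypothetical, and
Locale.US is an assumption chosen to make the week definition explicit):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class WeekYearPitfall {
    // Formats a date with the given pattern using US week rules.
    static String format(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        LocalDate lastDayOf2018 = LocalDate.of(2018, 12, 31);
        // 2018-12-31 already belongs to the first week of week-based year 2019.
        System.out.println(format("YYYY.MM.dd", lastDayOf2018)); // 2019.12.31
        System.out.println(format("yyyy.MM.dd", lastDayOf2018)); // 2018.12.31
    }
}
```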
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).
This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:
```
PUT index?include_type_name=false
{
  "mappings": {
    "_doc": {
      "properties": { ... }
    }
  }
}
```
There is a similar issue for both 'put templates' and 'put mappings' requests
as well.
This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.
Relates #38410
Today we throw a fatal `RuntimeException` if an exception occurs in
`getMasterName()`, and this includes the case where there is currently no
master. However, sometimes we call this method inside an `assertBusy()` in
order to allow for a cluster that is in the process of stabilising and electing
a master. The trouble is that `assertBusy()` only retries on an
`AssertionError` and not on a general `RuntimeException`, so the lack of a
master is immediately fatal.
This commit fixes the issue by asserting there is a master, triggering a retry
if there is not.
Fixes #38331
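The retry behaviour described above can be sketched with a simplified
stand-in for `assertBusy()`. The helper below is hypothetical: the real test
utility also waits between attempts and has a different signature. It only
shows why an `AssertionError` leads to a retry while a `RuntimeException`
ends the wait immediately.

```java
final class BusyAssert {
    // Simplified stand-in: re-runs the check until it passes or attempts run out.
    // Only AssertionError triggers a retry; any other exception propagates at once.
    static void assertBusy(Runnable check, int maxAttempts) {
        if (maxAttempts <= 0) {
            throw new IllegalArgumentException("maxAttempts must be positive");
        }
        AssertionError lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                lastFailure = e; // not ready yet, try again
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice with AssertionError ("no master yet"), then passes.
        assertBusy(() -> {
            if (++calls[0] < 3) {
                throw new AssertionError("no master yet");
            }
        }, 5);
        System.out.println("attempts: " + calls[0]);
    }
}
```

Asserting that a master exists therefore turns "no master yet" into a
retryable condition instead of an immediate failure.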
* The problem in #38226 is that in some corner cases multiple calls to `endSnapshot` were made concurrently, leading to non-deterministic behavior (`beginSnapshot` was triggering a repository finalization while one that was triggered by a `deleteSnapshot` was already in progress)
* Fixed by:
    * Making all `endSnapshot` calls originate from the cluster state being in a "completed" state (apart from on short-circuit on initializing an empty snapshot). This forced putting the failure string into `SnapshotsInProgress.Entry`.
    * Adding deduplication logic to `endSnapshot`
* Also:
    * Streamlined the init behavior to work the same way (keep state on the `SnapshotsService` to decide which snapshot entries are stale)
* closes #38226
We already support unknown objects in the list of pipelines, this changes the
`PipelineConfiguration` to support fields other than just `id` and `config`.
Relates to #36938
Renames the following settings to remove the mention of `zen` in their names:
- `discovery.zen.hosts_provider` -> `discovery.seed_providers`
- `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers`
- `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout`
- `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`
X-Pack security supports a built-in authentication service,
`token-service`, that allows access tokens to be used to
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on the
OAuth2 spec. The access token is short-lived
(defaults to 20m) and the refresh token has a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline and thereby fail to refresh the tokens.
This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens aka API
keys to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.
The API keys:-
- by default do not have an expiration but expiration can be
configured where the API keys need to be expired after a
certain amount of time.
- when generated will keep authentication information of the user that
generated them.
- can be defined with a role describing the privileges for accessing
Elasticsearch and will be limited by the role of the user that
generated them
- can be invalidated via invalidation API
- information can be retrieved via a get API
- that have been expired or invalidated will be retained for 1 week
before being deleted. The expired API keys remover task handles this.
Following are the API key management APIs:-
1. Create API Key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API Key(s) - `DELETE /_security/api_key`
The API keys can be used to access Elasticsearch using the `Authorization`
header, where the auth scheme is `ApiKey` and the credentials are the
base64 encoding of the API key ID and the API key joined by a colon.
Example:-
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```
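A client could build that header value roughly as follows. This is a minimal
sketch: the class and method names are hypothetical, and the ID and key are
the placeholder values that the curl example above encodes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ApiKeyHeader {
    // Builds the Authorization header value: "ApiKey " + base64(id + ":" + key).
    static String authorizationValue(String apiKeyId, String apiKey) {
        String credentials = apiKeyId + ":" + apiKey;
        String encoded = Base64.getEncoder()
            .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "ApiKey " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials matching the curl example.
        System.out.println(authorizationValue("api-key-id", "api-key"));
        // ApiKey YXBpLWtleS1pZDphcGkta2V5
    }
}
```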
Closes #34383
The HLRC client currently uses `org.elasticsearch.action.admin.indices.get.GetIndexRequest`
and `org.elasticsearch.action.admin.indices.get.GetIndexResponse` in its get index calls. Both request and
response are designed for the typed APIs, including some return types e.g. for `getMappings()` which in
the maps it returns still use a level including the type name.
In order to change this without breaking existing users of the HLRC API, this PR introduces two new request
and response objects in the `org.elasticsearch.client.indices` client package. These are used by the
IndicesClient#get and IndicesClient#exists calls now by default and support the type-less API. The old request
and response objects are still kept for use in similarly named, but deprecated methods.
The newly introduced client side classes are simplified versions of the server side request/response classes since
they don't need to support wire serialization, and only the response needs fromXContent parsing (but no
xContent-serialization, since this is the responsibility of the server-side class).
Also changing the return type of `GetIndexResponse#getMapping` to
`Map<String, MappingMetaData> getMappings()`, while it previously was returning another map
keyed by the type-name. Similar getters return simple Maps instead of the ImmutableOpenMaps that the
server side response objects return.
As mapping types are being removed throughout Elasticsearch, the use of
`_type` in pipeline simulation requests is deprecated. Additionally, the
default `_type` used if one is not supplied has been changed to `_doc` for
consistency with the rest of Elasticsearch.
The message `... took [31s] above the warn threshold of 30s` suggests
incorrectly that the task took 61 seconds. This commit adds the clarifying
words `which is`.
This commit introduces a background sync for retention leases. The idea
here is that we do a heavyweight sync when adding a new retention lease,
and then periodically we want to background sync any retention lease
renewals to the replicas. As long as the background sync interval is
significantly lower than the extended lifetime of a retention lease, it
is okay if from time to time a replica misses a sync (it will still have
an older version of the lease that is retaining more data as we assume
that renewals do not decrease the retaining sequence number). There are
two follow-ups that will come after this commit. The first is to address
the fact that we have not adapted the should periodically flush logic to
possibly flush the retention leases. We want to do something like flush
if we have not flushed in the last five minutes and there are renewed
retention leases since the last time that we flushed. An additional
follow-up will remove the syncing of retention leases when a retention
lease expires. Today this sync could be invoked in the background by a
merge operation. Rather, we will move the syncing of retention lease
expiration to be done under the background sync. The background sync
will use the heavyweight sync (write action) if a lease has expired, and
will use the lightweight background sync (replication action) otherwise.
Reduces the leader and follower check timeout to 3 * 10 = 30s instead of 3 * 30 = 90s, with 30s still
being a very long time for a node to be completely unresponsive.
This adds a dedicated field mapper that supports nanosecond resolution -
at the price of a reduced date range.
When using the date field mapper, the time is stored as milliseconds since the epoch
in a long in lucene. This field mapper stores the time in nanoseconds
since the epoch - which means its range is much smaller, ranging roughly from
1970 to 2262.
Note that aggregations will still be in milliseconds.
However, docvalue fields will have full nanosecond resolution.
Relates #27330
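The upper bound of that range follows from the size of a signed 64-bit long.
A quick check with java.time (the class name is hypothetical, and this is not
the mapper's own code):

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class NanosRange {
    // The latest instant representable as nanoseconds since the epoch in a long.
    static Instant maxNanosInstant() {
        return Instant.ofEpochSecond(0L, Long.MAX_VALUE);
    }

    // Calendar year (UTC) of that latest instant.
    static int maxYear() {
        return maxNanosInstant().atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        System.out.println(maxNanosInstant());
        System.out.println(maxYear()); // 2262
    }
}
```

Long.MAX_VALUE nanoseconds is a little over 292 years, hence the range of
roughly 1970 to 2262 for non-negative values.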
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
* This was a merge mistake on my end I think, obviously we only need to loop over the shards once not twice here to find those that we missed in INIT state
* Fix Incorrect Transport Response Handler Type
* The response type here is not empty and was always wrong but this only became visible now that 0a604e3b24 was introduced
* As a result of 0a604e3b24 we started actually handling the response
of this request and logging/handling exceptions. Before that we simply dropped the class cast exception here quietly using the empty response handler
* fix busy assert not handling `Exception`
* Closes #38226
* Closes #38256
Today we have DiscoveryDisruptionIT tests for checking that discovery can still
work once the cluster has formed, even if the cluster is misconfigured and only
has a single master-eligible node in its unicast hosts list. In fact with Zen2
we can go one better: we do not need any nodes in the unicast hosts list,
because nodes also use the contents of the last-committed cluster state for
discovery. Additionally, the DiscoveryDisruptionIT tests were failing due to
the overenthusiastic fault-detection timeouts.
This commit replaces these tests with deterministic `CoordinatorTests` that
verify the same behaviour. It also removes some duplication by extracting a
test method called `testFollowerCheckerAfterMasterReelection()`.
Closes #37687
We should increase primary term before renewing leases; otherwise, the
term of the latest RetentionLeases will be lower than the current term.
Relates #37951
If the innerLength is 0, the version won't be increased; then there will
be two RetentionLeases with the same term and version, but their leases
are different.
Relates #37951
Closes #38245
Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
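The out-of-order defence can be pictured with a small sketch. Everything
below is hypothetical (class names, fields, and the choice to order by
primary term first and version second are assumptions, not the actual
replication tracker code): the replica keeps an incoming collection only if
it is newer than the one it already holds.

```java
final class VersionedLeases {
    final long primaryTerm;
    final long version;
    final String leases; // stand-in for the actual lease collection

    VersionedLeases(long primaryTerm, long version, String leases) {
        this.primaryTerm = primaryTerm;
        this.version = version;
        this.leases = leases;
    }

    // Newer means a higher term, or the same term with a higher version.
    boolean supersedes(VersionedLeases other) {
        return primaryTerm > other.primaryTerm
            || (primaryTerm == other.primaryTerm && version > other.version);
    }
}

final class ReplicaLeaseHolder {
    private VersionedLeases current = new VersionedLeases(0L, 0L, "");

    // Applies the incoming collection only if it is newer than the current one.
    synchronized boolean apply(VersionedLeases incoming) {
        if (incoming.supersedes(current)) {
            current = incoming;
            return true;
        }
        return false; // stale sync request, ignore it
    }

    synchronized VersionedLeases current() {
        return current;
    }
}
```

With this ordering, a sync carrying version 1 that arrives after version 2
has been processed is simply dropped.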
* Instead of replacing the `shardSnapshots` field, we mutate it, explicitly removing entries from it in only a single spot
* Decreased the amount of indirection by moving all logic for starting a snapshot's newly discovered shard tasks into `startNewShards` (saves us two maps (keyed by snapshot) and iterations over them)
In Zen 1 there are a commit timeout and a publish timeout, and these
settings can be changed on the fly.
In Zen 2, there is only a commit timeout and this setting is static.
RareClusterStateIT actively uses these settings and the fact that they
are dynamic.
This commit adds cancelCommitedPublication method to Coordinator to
be used by tests. This method will cancel current committed publication
if there is any.
When there is BlockClusterStateProcessing on the non-master node, the
publication will be accepted and committed, but not yet applied. So we
can use the method above to cancel it.
Also, this commit replaces callback + AtomicReference with ActionFuture,
which makes test code easier to read.
Using index.routing.allocation.require._host does not work correctly because the boolean logic in
filter matching is broken (DiscoveryNodeFilters.match(...) will return false) when
opType == OpType.AND
When restoring shards of existing indices, the RestoreService also
restores the values of primary terms stored in the snapshot index
metadata. The primary terms are not updated and could potentially
conflict with current index primary terms if the restored primary terms
are lower than the existing ones.
This situation is likely to happen with replicated closed indices
(because primary terms are increased when the index is transitioning
from open to closed state, and the snapshotted primary terms are the
ones at the time the index was opened) (see #38024) and maybe also
with CCR.
This commit changes the RestoreService so that it updates the primary
terms using the maximum value between the snapshotted values and
the existing values.
Related to #33888
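In essence the update is a per-shard maximum. A minimal sketch, assuming
primary terms are held as one long per shard (the class and method names are
hypothetical and this is not the RestoreService code):

```java
final class PrimaryTerms {
    // For each shard, keeps the larger of the snapshotted and the existing term,
    // so a restore can never move a primary term backwards.
    static long[] mergeOnRestore(long[] snapshotted, long[] existing) {
        if (snapshotted.length != existing.length) {
            throw new IllegalArgumentException("shard count mismatch");
        }
        long[] merged = new long[existing.length];
        for (int shard = 0; shard < existing.length; shard++) {
            merged[shard] = Math.max(snapshotted[shard], existing[shard]);
        }
        return merged;
    }

    public static void main(String[] args) {
        long[] merged = mergeOnRestore(new long[] {1L, 4L}, new long[] {3L, 2L});
        System.out.println(java.util.Arrays.toString(merged)); // [3, 4]
    }
}
```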
If custom date formats are used, there may be combinations that the new
performant DateFormatters.from() method has not covered yet. This adds a
few such corner cases and ensures the tests are correctly commented
out.
In #38158 we ensured that global ordinals are not loaded when another execution hint is explicitly set on the source. This change is a follow up that addresses a comment
dd6043c1c0 (r252984782) added after the merge.
This commit adds the second part of `elasticsearch-node` tool -
`detach-cluster` command in addition to `unsafe-bootstrap` command.
Also, this commit changes the semantics of `unsafe-bootstrap`, now
`unsafe-bootstrap` changes clusterUUID.
So the algorithm for running the `elasticsearch-node` tool is the following:
1) Stop all nodes in the cluster.
2) Pick the master-eligible node with the highest (term, version) pair and
run the `unsafe-bootstrap` command on it. If there are no surviving
master-eligible nodes, skip this step.
3) Run the `detach-cluster` command on the remaining surviving nodes.
Detach cluster makes the following changes to the node metadata:
1) Sets clusterUUID committed to false.
2) Sets currentTerm and term to 0.
3) Removes voting tombstones and sets voting configurations to special
constant MUST_JOIN_ELECTED_MASTER, that prevents initial cluster
bootstrap.
`ElasticsearchNodeCommand` base abstract class is introduced, because
`UnsafeBootstrapMasterCommand` and `DetachClusterCommand` have a lot in
common.
Also, this commit adds "ordinal" parameter to both commands, because it's
impossible to write IT otherwise.
For MUST_JOIN_ELECTED_MASTER case special handling is introduced in
`ClusterFormationFailureHelper`.
Tests for both commands reside in `ElasticsearchNodeCommandIT` (renamed
from `UnsafeBootstrapMasterIT`).
The current CloseWhileRelocatingShardsIT test adds some "send behavior"
rule to a target node's mocked transport service in order to detect when shard
relocations have started. These rules are never cleared and prevent the test from
completing normally after the rebalance is re-enabled again.
This commit changes the test so that rules are cleared and most verifications
are done before the rebalance is reenabled again.
Closes #38090
Folks at the Lucene project do not seem to be interested in classifying corruptions and
distinguishing them from file-system exceptions (see https://issues.apache.org/jira/browse/LUCENE-8525),
so we'll just cop out as well.
Closes #34322
With #37000 we made sure that final reduction is automatically disabled
whenever a localClusterAlias is provided with a SearchRequest.
While working on #37838, we found a scenario where we do need to set a
localClusterAlias yet we would like to perform a final reduction in the
remote cluster: when searching on a single remote cluster.
Relates to #32125
This commit adds support for a separate finalReduce flag to
SearchRequest and makes use of it in TransportSearchAction in case we
are searching against a single remote cluster.
This also makes sure that num_reduce_phases is correct when searching
against a single remote cluster: it makes little sense to return
`num_reduce_phases` set to `2`, which looks especially weird in case
the search was performed against a single remote shard. We should
perform one reduction phase only in this case and `num_reduce_phases`
should reflect that.
* line length
This change forbids negative field boost in the `query_string`, `simple_query_string`
and `multi_match` queries.
Negative boosts are not allowed in Lucene 8 (scores must be positive).
The backport of this change to 6x will turn the error into a deprecation warning
in order to raise the awareness of this breaking change in 7.0.
Closes #33309
Currently, there are a few tests that use autoMinMasterNodes=false and
hence override addExtraClusterBootstrapSettings, mostly this is 10-30
lines of code that are copy-pasted from class to class.
This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex`
which is suitable for all classes and copy-paste could be removed.
Removing code is always a good thing!
The terms aggregator loads the global ordinals to retrieve the cardinality of the field to aggregate on. This information is then used to select the strategy to use for the aggregation (breadth_first or depth_first). However this should be avoided if the execution_hint is explicitly set to map since this mode doesn't really need the global ordinals. Since we still need the cardinality of the field this change picks the maximum cardinality in the segments as an estimation of the total cardinality to select the strategy to use (breadth_first or depth_first). This estimation is only used if the execution hint is set to map, otherwise the global ordinals are still used to retrieve the accurate cardinality.
Closes #37705
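As a rough illustration of the estimation described above, here is a minimal sketch. The class and method names are hypothetical, and the rule that compares the estimate against the requested size is an assumption made for the example, not the aggregator's actual code.

```java
import java.util.Arrays;

// Hypothetical sketch: names and the selection rule are illustrative only.
final class CardinalityEstimate {
    /** The largest per-segment cardinality, used as an estimate of the total cardinality. */
    static long maxSegmentCardinality(long[] perSegmentCardinalities) {
        return Arrays.stream(perSegmentCardinalities).max().orElse(0L);
    }

    /** Assumed rule: collect breadth-first when the estimate exceeds the requested size. */
    static String chooseStrategy(long estimatedCardinality, int requestedSize) {
        return estimatedCardinality > requestedSize ? "breadth_first" : "depth_first";
    }
}
```

The maximum over segments never exceeds the true cardinality, so it is a lower bound that can be computed without loading global ordinals.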
Today we use `AbstractDisruptionTestCase` to test the behaviour of things like
master elections in the presence of cluster disruptions. These tests have
rather enthusiastic fault detection settings, detecting a fault if a single
ping fails, with a one-second timeout. Furthermore there are some tests that
assert the identity of the master remains unchanged during some disruption, and
these assertions fail rather often thanks to the overly sensitive fault
detector.
However in a number of these tests the fault detector need not be this
sensitive. This commit moves some such tests into their own test suite and uses
more sensible fault-detection settings to avoid the kind of master instability
that is causing CI failures.
Closes #37699
The self-written epoch date formatters were not able to properly format
an Instant to a string due to a misconfiguration.
This fix also removes a runtime behaviour that existed until now under Java
8 regarding the names of the aggregation buckets, which are now the same
as before and as they have been under Java 11.
A '{' as the first character in a log line is causing problems for Beats when parsing plaintext logs. This can happen if the submitted document has an additional '\n' at the beginning and we are not reformatting.
Trimming the source part of a SlowLog solves that and keeps the logs readable.
closes #38080
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes #32348
Since #31140 we no longer require acking on the dynamic mapping of index
requests. Thus, a returned mapping from a get mapping request does not
necessarily contain the dynamic updates from the index request. This
commit replaces the dynamic mapping update with a manual put mapping.
Relates #31140
Closes #37928
This changes the test to not use a `CountDownLatch`, instead adding an assertion
for the final logging message and waiting until the `MockAppender` has seen it
before proceeding.
Related to df2c06f6f30f7e23a6863a3f72fc3bdb7648885c
Resolves #23739
Implements `geotile_grid` aggregation
This patch refactors the previous implementation https://github.com/elastic/elasticsearch/pull/30240
This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.
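To make the tile alignment concrete, the sketch below maps a point to Web Mercator tile coordinates using the widely used "slippy map" formula. It is an assumption that this matches the hashing in the patch; the class and method names are hypothetical.

```java
// Hypothetical sketch of Web Mercator tile addressing (standard slippy-map formula).
final class WebMercatorTile {
    /** Returns {x, y} tile coordinates for a point at the given zoom level. */
    static int[] tile(double latDeg, double lonDeg, int zoom) {
        int tilesPerAxis = 1 << zoom; // 2^zoom tiles along each axis
        double latRad = Math.toRadians(latDeg);
        int x = (int) Math.floor((lonDeg + 180.0) / 360.0 * tilesPerAxis);
        int y = (int) Math.floor(
            (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * tilesPerAxis);
        // Clamp so that points on the eastern or southern edge stay inside the grid.
        x = Math.max(0, Math.min(tilesPerAxis - 1, x));
        y = Math.max(0, Math.min(tilesPerAxis - 1, y));
        return new int[] {x, y};
    }
}
```

Each zoom level splits every tile into four, so a tile at zoom z always nests inside exactly one tile at zoom z-1, which is what gives consistent buckets across zoom levels.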
If a replica does not have the right mapping yet, we will retry the index
request on that replica; the actual number of tasks is then higher than the
expected number. Since #31140 this happens more frequently because we no
longer require acking on the dynamic mapping of index requests.
Relates #31140
Closes #37893
When tests extending ESIntegTestCase are run on the same JVM, the static field in
NodeAndClusterIdConverter will throw an AlreadySet exception.
Overriding the configuration method Node.configureNodeAndClusterIdStateListener in MockNode prevents the listener registration from happening.
relates #32850
If a new retention lease is added while a primary's soft-deletes policy
is locked for peer-recovery, that lease won't be baked into the Lucene
commit.
Relates #37165
Relates #37375
Scheduler.schedule(...) previously assumed that the caller handles
exceptions by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes #37708.
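The logging behaviour can be pictured as a wrapper around the scheduled Runnable. This is a sketch only: `LoggingRunnable` is a hypothetical name and a plain list stands in for the logger.

```java
import java.util.List;

// Hypothetical sketch: a wrapper that records a warning instead of propagating the failure.
final class LoggingRunnable implements Runnable {
    private final Runnable delegate;
    private final List<String> warnings; // stand-in for a real logger

    LoggingRunnable(Runnable delegate, List<String> warnings) {
        this.delegate = delegate;
        this.warnings = warnings;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Exception e) {
            // The caller never sees the exception; it is only reported as a warning.
            warnings.add("scheduled task failed: " + e);
        }
    }
}
```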
This commit adapts the version used in StartedShardEntry serialization
after the backport of #37899 and reenables bwc tests.
Related to #37899
Related to #38074
This commit pushes the primary term into the replication tracker. This
is a precursor to using the primary term to resolve ordering problems
for retention leases. Namely, it can be that out-of-order retention
lease sync requests arrive on a replica. To resolve this, we need a
tuple of (primary term, version). For this to be, the primary term needs
to be accessible in the replication tracker. As the primary term is part
of the replication group anyway, this change conceptually makes sense.
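A minimal sketch of how a (primary term, version) tuple could order sync requests: the pair is compared lexicographically, so any state from a newer primary term wins regardless of version. The class and method names are hypothetical.

```java
// Hypothetical sketch: lexicographic ordering on (primary term, version).
final class LeaseVersion implements Comparable<LeaseVersion> {
    final long primaryTerm;
    final long version;

    LeaseVersion(long primaryTerm, long version) {
        this.primaryTerm = primaryTerm;
        this.version = version;
    }

    @Override
    public int compareTo(LeaseVersion other) {
        int byTerm = Long.compare(primaryTerm, other.primaryTerm);
        return byTerm != 0 ? byTerm : Long.compare(version, other.version);
    }

    /** True if a sync request carrying this tuple should replace state carrying {@code current}. */
    boolean supersedes(LeaseVersion current) {
        return compareTo(current) > 0;
    }
}
```

A replica using such a check would drop a request that arrives late, because its tuple no longer supersedes what the replica already holds.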
With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. `from + size` results are requested from each cluster, and the reduce phase in each cluster is non-final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.
This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API now accepts a new parameter called ccs_minimize_roundtrips that allows opting out of the default behaviour.
Relates to #32125
This reduces object creation in the rounding class (used by aggs) by properly
creating the objects only once. Furthermore, a few unneeded ZonedDateTime objects
were created in order to create other objects out of them. This was
changed as well.
Running the benchmarks shows much faster performance for all of the
java time based Rounding classes.
Currently the put-mapping API assumes that because the type name is `_doc` then
it is dealing with a typeless put-mapping call. Yet we still allow running the
put-mapping API in a typed fashion with `_doc` as a type name. The current logic
triggers surprising errors when doing a typed put-mapping call with `_doc` as a
type name on an index that has a type already.
This is a bit of a corner-case, but is more important on 6.x due to the fact
that using the index API with `_doc` as a type name triggers typed calls to the
put-mapping API with `_doc` as a type name.
Zen2 nodes will bootstrap themselves once they believe there to be no remaining
Zen1 master-eligible nodes in the cluster, as long as minimum_master_nodes is
satisfied.
Today the bootstrap configuration comprises just the ids of the known
master-eligible nodes, and this might be too small to be safe. For instance, if
there are 5 master-eligible nodes (so that minimum_master_nodes is 3) then the
bootstrap configuration could comprise just 3 nodes, of which 2 form a quorum,
and this does not intersect other quorums that might arise, leading to a
split-brain.
This commit fixes this by expanding the bootstrap configuration so that its
quorums satisfy minimum_master_nodes, by adding some of the IDs of the other
master-eligible nodes in the last-published cluster state.
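The arithmetic in the example above can be checked with a small sketch. It assumes a quorum is a strict majority of the configuration; the names are hypothetical, and the commit does not state that it picks exactly the smallest such size.

```java
// Hypothetical sketch: how large a configuration must be for its majorities
// to contain at least minimum_master_nodes nodes.
final class BootstrapQuorum {
    /** Size of a strict majority of a configuration with {@code configSize} nodes. */
    static int majority(int configSize) {
        return configSize / 2 + 1;
    }

    /** Smallest configuration size whose majority is at least {@code minimumMasterNodes}. */
    static int smallestSafeConfigSize(int minimumMasterNodes) {
        int size = 1;
        while (majority(size) < minimumMasterNodes) {
            size++;
        }
        return size;
    }
}
```

With minimum_master_nodes set to 3, a configuration of 3 nodes has a majority of only 2, whereas a configuration of 4 nodes has a majority of 3.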
The existing implementation was slow due to exceptions being thrown if
an accessor did not have a time zone. This implementation queries for
a time zone, local time and local date, and also checks for an
instant, which avoids throwing an exception and thus speeds up the conversion.
This removes the existing method and creates a new one named
DateFormatters.from(TemporalAccessor accessor) to resemble the naming of
the java time ones.
Before this change an epoch millis parser using the toZonedDateTime
method took approximately 50x longer.
Relates #37826
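The query-based approach might look roughly like the sketch below, which uses only standard java.time calls. It is not the actual DateFormatters code: the class name is hypothetical, and defaulting to UTC when no zone is present is an assumption.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

// Hypothetical sketch: probe the accessor with queries, which return null
// when a component is missing, instead of catching exceptions.
final class TemporalConversionSketch {
    static ZonedDateTime from(TemporalAccessor accessor) {
        ZoneId zone = accessor.query(TemporalQueries.zone());
        if (zone == null) {
            zone = ZoneOffset.UTC; // assumed default
        }
        LocalDate date = accessor.query(TemporalQueries.localDate());
        LocalTime time = accessor.query(TemporalQueries.localTime());
        if (date != null && time != null) {
            return ZonedDateTime.of(date, time, zone);
        }
        if (accessor.isSupported(ChronoField.INSTANT_SECONDS)) {
            return Instant.from(accessor).atZone(zone);
        }
        if (date != null) {
            return date.atStartOfDay(zone);
        }
        throw new IllegalArgumentException("cannot convert [" + accessor + "] to a ZonedDateTime");
    }
}
```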
This commit allows implementors of the `HandledTransportAction` to
specify what thread the action should be executed on. The motivation for
this commit is that certain CCR requests should be performed on the
generic threadpool.
The subparser in verify repository allows for unknown fields. This
commit sets the value to true for the parser and modifies the test such
that it accurately tests it.
Relates #36938
Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in
tests, but for 7.x nodes this setting is not required as it has no effect.
This commit removes this setting so that nodes are started with more realistic
configurations, and deprecates it.
* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates #37686
This commit encapsulates the primary terms fields in index shard. This
is a precursor to pushing the operation primary term down to the
replication tracker.
This test failed a few times over the last several months. It seems that
we triggered a flush, but CI was too slow to finish it in several
seconds. I added the flush stats and commit stats and unmuted this test.
We should have a good clue if this test fails again.
Relates #37896
Removes some guice index level extension points marked as @Deprecated since at
least 6.0. They served as a signpost for plugin authors upgrading from 2.x but
this shouldn't be relevant in 7.0 anymore.
In order to retain BWC this changes the java date formatters to be able to
parse nanoseconds resolution, even if only milliseconds are supported.
This used to work on joda time as well so that a user could store a date
like `2018-10-03T14:42:44.613469+0000` and then just lose the precision
on anything lower than millisecond level.
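For illustration, the same effect can be reproduced with plain java.time: parse up to nanosecond resolution, then truncate to milliseconds. The formatter below is an assumption built for this one example format, not the Elasticsearch formatter, and the class name is hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.time.temporal.ChronoUnit;
import java.util.Locale;

// Hypothetical sketch: accept a fraction of up to nine digits, keep only milliseconds.
final class NanosToMillis {
    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
        .append(DateTimeFormatter.ISO_LOCAL_DATE)
        .appendLiteral('T')
        .appendValue(ChronoField.HOUR_OF_DAY, 2)
        .appendLiteral(':')
        .appendValue(ChronoField.MINUTE_OF_HOUR, 2)
        .appendLiteral(':')
        .appendValue(ChronoField.SECOND_OF_MINUTE, 2)
        .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true) // 0 to 9 fractional digits
        .appendOffset("+HHMM", "Z")
        .toFormatter(Locale.ROOT);

    /** Parses the text and drops everything below millisecond precision. */
    static OffsetDateTime parseToMillis(String text) {
        return OffsetDateTime.parse(text, FORMATTER).truncatedTo(ChronoUnit.MILLIS);
    }
}
```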
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.
This deprecates the special `use_field_mapping` docvalue format which was added
in #29639 only to ease the transition to 7.x and it is not necessary anymore in
7.0.
Currently if you mix typed templates and typeless index creation or typeless
templates and typed index creation then you will end up with an error because
Elasticsearch tries to create an index that has multiple types: `_doc` and
the explicit type name that you used.
This commit proposes to give precedence to the index creation call so that
the type from the template will be ignored if the index creation call is
typeless while the template is typed, and the type from the index creation
call will be used if there is a typeless template.
This is consistent with the fact that index creation already "wins" if a field
is defined differently in the index creation call and in a template: the
definition from the index creation call is used in such cases.
Closes #37773
This change fixes the copy of the fetch source option into the
expand search request that is used to retrieve the documents of each
collapsed group.
Closes #23829
This commit restores a noop version of the AllFieldMapper that is instantiated only
for indices created in 6x. We need this metadata field mapper to be present in this version
in order to allow the upgrade of indices that explicitly disable _all (enabled: false).
The mapping of these indices contains a reference to the _all field that we cannot remove
in 7 so we'll need to keep this metadata mapper in 7x. Since indices created in 6x will not
be compatible with 8, we'll remove this noop mapper in the next major version.
Closes #37429
Added deprecation warnings for use of include_type_name in put/get index templates.
HLRC changes:
GetIndexTemplateRequest has a new client-side class which is a copy of server's GetIndexTemplateResponse but modified to be typeless.
PutIndexTemplateRequest has a new client-side counterpart which doesn't use types in the mappings
Relates to #35190
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes
the logic through the snapshot/restore process. A
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
This commit adds the code in the HTTP layer that will parse exclusion wildcard
expressions.
The existing code issues 404s for wildcards as well as explicit indices.
But, in general, in an expression with exclude wildcards (-...*) following other
include wildcards, there is no way to tell if the include wildcard produced no
results or they were subsequently excluded.
Therefore, the proposed change is breaking the behavior of 404s for
wildcards. Specifically, no 404s will be returned for wildcards, even
if they are not followed by exclude wildcards or the exclude wildcards
could not possibly exclude what has previously been included.
Only explicitly requested aliases will be called out as missing.
The delete and update by query APIs both offer protection against overriding concurrent user changes to the documents they touch. They currently use internal versioning. This PR changes that to rely on sequence numbers and primary terms.
Relates #37639
Relates #36148
Relates #10708
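The idea behind the protection is a compare-and-set on the pair (sequence number, primary term). The toy model below shows that check in isolation; it is a hypothetical in-memory sketch, and incrementing a per-document counter is a simplification that does not reflect how sequence numbers are actually assigned.

```java
// Hypothetical toy model of optimistic concurrency control on (seqNo, primaryTerm).
final class VersionedDoc {
    private String source;
    private long seqNo;
    private final long primaryTerm;

    VersionedDoc(String source, long seqNo, long primaryTerm) {
        this.source = source;
        this.seqNo = seqNo;
        this.primaryTerm = primaryTerm;
    }

    /** Applies the update only if the caller saw the latest (seqNo, primaryTerm). */
    synchronized boolean updateIf(long expectedSeqNo, long expectedPrimaryTerm, String newSource) {
        if (seqNo != expectedSeqNo || primaryTerm != expectedPrimaryTerm) {
            return false; // a concurrent change happened; the caller must re-read and retry
        }
        source = newSource;
        seqNo++; // simplification: the real sequence number is not a per-document counter
        return true;
    }

    synchronized String source() {
        return source;
    }

    synchronized long seqNo() {
        return seqNo;
    }
}
```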
This commit adds join validation around cluster UUIDs, preventing a node from joining a cluster if it was
previously part of another cluster. The commit introduces a new flag to the cluster state,
clusterUUIDCommitted, which denotes whether the node has locked into a cluster with the given
uuid. When a cluster is committed, this flag will turn to true, and subsequent cluster state updates
will keep the information about committal. Note that coordinating-only nodes are still free to switch
clusters at will (after restart), as they don't carry any persistent state.
The clusterAlias member is only used in the copy constructor, to be able
to reconstruct the fully qualified index. It is also possible to remove
the instance member and add a private constructor that accepts the already built Index object which contains the cluster alias.
The update request has lesser-known support for a one-off update of a known document version. This PR adds a seq#-based alternative to power these operations.
Relates #36148
Relates #10708
This commit changes the StartedShardEntry so that it also contains the
primary term of the shard to start. This way the master node can also
check that the primary term from the start request is equal to the current
shard's primary term in the cluster state, and it can ignore any shard
started request that concerns a previous instance of the shard that
would have been allocated to the same node.
Such situations are likely to happen with frozen (or restored) indices and
the replication of closed indices, because with replicated closed indices
the shards will be initialized again after the index is closed and can
potentially be reinitialized again if the index is reopened as a frozen
index. In such cases the lifecycle of the shards would be something like:
* shard is STARTED
* index is closed
* shard is INITIALIZING (index state is CLOSED, primary term is X)
* index is reopened
* shards are INITIALIZING again (index state is OPENED, potentially frozen,
primary term is X+1)
Adding the primary term to the shard started request allows the master node
to discard StartedShardEntry requests it receives if the
request concerns the shard with primary term X, because the shard has been
moved/reinitialized in the meantime under the primary term X+1.
Relates to #33888
When serializing allowPartialSearchResults to the shards through ShardSearchTransportRequest, we use an optional boolean field, though
the corresponding instance member is declared `boolean` which can never
be null. We also have an assert to verify that the incoming search
request provides a non-null value for the flag, and a comment explaining
that null should be considered a bug.
This commit makes the allowPartialSearchResults method in
ShardSearchRequest return a `boolean` rather than a `Boolean` and
changes the serialization from optional to non-optional, in a backwards-compatible manner.
This commit changes the TransportVerifyShardBeforeCloseAction so that it issues a
forced flush, forcing the translog and the Lucene commit to contain the same max seq
number and global checkpoint in the case the Translog contains operations that were
not written in the IndexWriter (like a Delete that touches a non existing doc). This way
the assertion added in #37426 won't trip.
Related to #33888
Abdicates to another master-eligible node once the active master is reconfigured out of the voting
configuration, for example through the use of voting configuration exclusions.
Follow-up to #37712
In order to support JSON log format, a custom pattern layout was used and its configuration is enclosed in ESJsonLayout. Users are free to use their own patterns, but if smooth Beats integration is needed, they should use ESJsonLayout. EvilLoggerTests are left intact to make sure user's custom log patterns work fine.
To populate additional fields node.id and cluster.uuid which are not available at start time,
a cluster state update will have to be received and the values passed to log4j pattern converter.
A ClusterStateObserver.Listener is used to receive only one ClusterStateUpdate. Once the update is received, the nodeId and clusterUUID are set in a static field in a NodeAndClusterIdConverter.
The following fields are expected in JSON log lines: type, timestamp, level, component, cluster.name, node.name, node.id, cluster.uuid, message, stacktrace
see ESJsonLayout.java for more details and field descriptions
The Docker log4j2 configuration is now almost the same as the one used for the ES binary.
The only difference is that Docker is using console appenders, whereas ES is using file appenders.
relates: #32850
* The test failure reported in the issue looks like a mere timeout. Logging suggests that the snapshot completes/aborts correctly but the busy
loop polling the snapshot state times out too early.
* Closes #37888
This commit introduces a sync of retention leases when a retention lease
expires. As expiration of retention leases is lazy, their expiration is
managed only when getting the current retention leases from the
replication tracker. At this point, we callback to our full retention
lease sync to sync and flush these on all shard copies. With this
change, replicas do not locally manage expiration of retention leases;
instead, that is done only on the primary.
The test was not using the TRACK_TOTAL_HITS_ACCURATE and thus
encountered a different issue tracked in #37907. In the meantime
we can adapt the test to not fail anymore.
Closes #37897
* Changed `LuceneSnapshot` to throw an `OperationsMissingException` if the requested ops are missing.
* Changed the shard changes api to handle the `OperationsMissingException` and wrap the exception into `ResourceNotFound` exception and include metadata to indicate the requested range can no longer be retrieved.
* Changed `ShardFollowNodeTask` to handle this `ResourceNotFound` exception with the included metadata header.
Relates to #35975
This change sets track_total_hits to true on a test that requires
to check the total hits of a query that can return more than 10,000 docs.
Closes #37895
Replace `threadPool().schedule()` / catch
`EsRejectedExecutionException` pattern with direct calls to
`ThreadPool#scheduleUnlessShuttingDown()`.
Closes #36318
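The pattern being replaced can be sketched with the JDK's own executor types. This is an analogy rather than the `ThreadPool` implementation: the helper below is hypothetical and uses `RejectedExecutionException` in place of `EsRejectedExecutionException`.

```java
import java.util.Optional;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: centralise the "ignore rejection during shutdown" handling
// so that call sites no longer need their own try/catch.
final class SafeScheduling {
    static Optional<ScheduledFuture<?>> scheduleUnlessShuttingDown(
            ScheduledExecutorService executor, Runnable task, long delayMillis) {
        try {
            ScheduledFuture<?> future = executor.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
            return Optional.<ScheduledFuture<?>>of(future);
        } catch (RejectedExecutionException e) {
            if (executor.isShutdown()) {
                return Optional.empty(); // expected while shutting down; nothing to do
            }
            throw e; // rejected for another reason: still an error
        }
    }
}
```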
From previous PRs, we've already added support for include_type_name to
the get mapping API. We had also taken an approach to the HLRC where the
server-side `GetMappingResponse#fromXContent` could only handle typeless
input.
This PR updates the HLRC for 'get mapping' to be in line with our new approach:
* Add a typeless 'get mappings' method to the Java HLRC, that accepts new
client-side request and response objects. This new response only handles
typeless mapping definitions.
* Switch the old version of `GetMappingResponse` back to expecting typed
mappings, and deprecate the corresponding method on the HLRC.
Finally, the PR also does some small, related clean-up around 'get field mappings'.
There is a method invocation here spanning multiple lines. This commit
breaks it up into a line per parameter as this is friendlier to future
changes and diffs.
When adding a retention lease, we make a reference copy of the retention
leases under lock and then make a copy of that collection outside of the
lock. However, since we merely copied a reference to the retention
leases, after leaving a lock the underlying collection could change on
us. Rather, we want to copy these under lock. This commit adds a
dedicated method for doing this, asserts that we hold a lock when we use
this method, and changes adding a retention lease to use this method.
This commit was intended to be included with #37398 but was pushed to
the wrong branch.
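The difference between copying a reference and copying the collection under lock can be sketched as follows. The class is hypothetical and uses a plain list of strings in place of retention leases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: take the snapshot while the lock is still held.
final class LeaseTracker {
    private final List<String> leases = new ArrayList<>();

    /** Copies the leases; callers must hold the lock on this tracker. */
    private List<String> copyLeases() {
        assert Thread.holdsLock(this) : "leases must be copied under lock";
        return new ArrayList<>(leases);
    }

    /** Adds a lease and returns a snapshot that later additions cannot change. */
    List<String> addLease(String id) {
        final List<String> snapshot;
        synchronized (this) {
            leases.add(id);
            snapshot = copyLeases(); // copied inside the lock, not after leaving it
        }
        return snapshot;
    }
}
```

Had `addLease` returned the `leases` reference itself, a later `addLease` call would have been visible through the earlier return value.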
This commit introduces retention lease syncing from the primary to its
replicas when a new retention lease is added. A follow-up commit will
add a background sync of the retention leases as well so that renewed
retention leases are synced to replicas.
Previously, a hardcoded precision value of 4 was
used by these tests resulting in no approximation
errors. Now that the precision is between 1-12,
precision values of 1 and 2 result in potential
bucketing errors.
This commit adjusts the range to be 4-12.
Fixes #37892.
If the in_sync_allocations of index-1 or index-2 is changed, the
metadata version will be increased. This leads to the failure in
the metadata version checks. We need to relax them.
Closes #37820
* Refactored GeoHashGrid unit tests
This change allows other grid aggregations to reuse the same tests.
The change mostly just moves code to the base classes, trying to
keep changes to a bare minimum.
* rename createInternalGeoHashGridBucket to createInternalGeoGridBucket
* indentation
Replaces intermediate geo objects built by ShapeBuilders with
objects from the libs/geo hierarchy. This should allow us to build
all geo functionality around a single hierarchy.
Follow up for #35320
Due to floating point error, it was possible for variances to become negative which should never happen. This bugfix sets variance to zero if it becomes negative as a result of fp error.
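A minimal sketch of the clamp, assuming the variance is computed from a running sum and sum of squares (an assumption; the names are hypothetical). With that formula the subtraction can yield a tiny negative number when the true variance is close to zero.

```java
// Hypothetical sketch: clamp a variance that floating point error pushed below zero.
final class VarianceSketch {
    static double variance(double[] values) {
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (double v : values) {
            sum += v;
            sumOfSquares += v * v;
        }
        double mean = sum / values.length;
        double raw = sumOfSquares / values.length - mean * mean;
        return Math.max(0.0, raw); // a negative result can only be rounding error
    }
}
```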
This test indexes an unlimited number of documents; this commit
reduces this number to 25K and also tracks the exact number of hits
when counting the docs.
* Remove empty statements
There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. Removing most of these instances and marking blocks
that look empty by intention as such.
* Change test, slightly more verbose but less confusing
Nodes started with node.data=false and node.master=false can no longer
start if they have index metadata. This avoids resurrecting old indexes
into the cluster and ensures metadata is cleaned out before
re-purposing a node that was previously master or data node.
Issue #27073
This change adds support for handling `nested` fields in the `composite`
aggregation. A `nested` aggregation can be used as parent of a `composite`
aggregation in order to target `nested` fields in the `sources`.
Closes #28611
The ingest date processor is currently only able to parse joda formats.
However it is not using the existing elasticsearch classes but accesses
joda directly. This means that our existing BWC layer does not notify
the user about deprecated formats. This commit switches to using the
existing Elasticsearch Joda methods to acquire a date format, which
includes the BWC check and the ability to parse java 8 dates.
The date parsing in ingest has also another extra feature, that the
fallback year, when a date format without a year is used, is the current
year, and not 1970 like usual. This is currently not properly supported
in the DateFormatter class. As this is the only case for this feature
and java time can take care of this using the toZonedDateTime() method,
a workaround just for the joda time parser has been created, that can be
removed soon again from 7.0.
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I chose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.
Closes #33028
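The resulting response semantics can be summarised in a few lines. This sketch only models the reported value and relation, not how hits are counted; the class name is hypothetical.

```java
// Hypothetical sketch of the reported total for a given tracking threshold.
final class TotalHitsSketch {
    final long value;
    final String relation;

    private TotalHitsSketch(long value, String relation) {
        this.value = value;
        this.relation = relation;
    }

    /** Exact count up to the threshold, otherwise the threshold as a lower bound. */
    static TotalHitsSketch track(long matchingDocs, long threshold) {
        if (matchingDocs > threshold) {
            return new TotalHitsSketch(threshold, "gte");
        }
        return new TotalHitsSketch(matchingDocs, "eq");
    }
}
```

A query matching 25,000 documents under the default threshold reports a value of 10,000 with relation `gte`, while a query matching 42 documents reports 42 with relation `eq`.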
* Correct deprec log in RestGetFieldMappingAction
Correct a class used for deprecation logging in
RestGetFieldMappingAction
* Correct deprec log in RestCreateIndexAction
Correct a class used for deprecation logging in
RestCreateIndexAction
* Stop threads before logging the list of exceptions
* For the broken case of concurrent iteration in the finally block and the threads not having shut down,
use `CopyOnWriteArrayList` to have concurrency safe iteration
* Closes #37810
TransportAction and BaseRestHandler no longer extend AbstractComponent. AbstractComponent no longer has usages, so it was deleted.
Closes #34488
From #29453 and #37285, the include_type_name parameter was already present and defaulted to false. This PR makes the following updates:
* Add deprecation warnings to RestCreateIndexAction, plus tests in RestCreateIndexActionTests.
* Add a typeless 'create index' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I created new CreateIndexRequest and CreateIndexResponse objects that differ from the existing server ones.
In these tests, we initialize the retained_seq_no with NO_OPS_PERFORMED,
thus we should verify that the min of the retained_seq_no is at least
NO_OPS_PERFORMED not 0.
Closes #35994
The elasticsearch-node tool helps to restore a cluster if half or more of
the master-eligible nodes are lost. Of course, all bets are off regarding
data consistency.
There are two parts of the tool: unsafe-bootstrap to be used when there
is still at least one master-eligible node alive and detach-cluster,
when there are no master-eligible nodes left.
This commit implements the first part.
Docs for the tool will be added separately as a part of #37812.
This change splits out all the specific GeoHash
classes for the geohash_grid aggregation into
abstract GeoGrid classes that can be re-used for
specific hashing types, like `geohash`
Today, the mapping on the follower is managed and replicated from its
leader index by the ShardFollowTask. Thus, we should prevent users
from modifying the mapping on the follower indices.
Relates #30086
This completes the BWC serialisation changes required for a 6.7 master to
inform other nodes of the node-level value of the `minimum_master_nodes`
setting.
Relates #37701, #37811
The assertion `assertOpsOnPrimary` does not store the seq_no and primary
term of successful deletes in `lastOpSeqNo` and `lastOpTerm`. This
leads to failures of the subsequent CAS deletes or indexes with seq_no
and term. Moreover, this assertion trips a translog assertion because it
bumps the primary term of some operations but not the primary term of
the engine.
Relates #36467
Closes #37684
Now that warning headers no longer contain a timestamp of when the
warning was generated, we no longer need to extract the warning value
from the warning to determine whether or not the warning value is
duplicated. Instead, we can compare strings directly.
Further, when de-duplicating warning headers, we are constantly rebuilding
sets. Instead of doing that, we can carry the set around with us and
rebuild it if we find a new warning value.
This commit applies both of these optimizations.
Exceptions thrown by the cluster applier service's settings and cluster appliers are bubbled up, and
block the state from being applied instead of silently being ignored. In combination with the cluster
state publishing lag detector, this will throw a node out of the cluster that can't properly apply
cluster state updates.
As acking can fail for any reason (unrelated node being too slow, node disconnecting), it should not
be required for acking to succeed in order for index requests with dynamic mapping updates to
successfully complete.
Relates to #30672 and Closes #30844
* Remove Custom Listeners from SnapshotsService
Motivations:
* Shorten the code some more
* Use ActionListener#wrap to get easy to reason about behavior in failure scenarios
* Remove duplication in the logic of handling snapshot completion listeners (listeners removing themselves and comparing snapshots to their targets)
* Also here, move all listener handling into `SnapshotsService` and remove custom listener class by putting listeners in a map
Today we support a smooth rolling upgrade from Zen1 to Zen2 by automatically
bootstrapping the cluster once all the Zen1 nodes have left, as long as the
`minimum_master_nodes` count is satisfied. However this means that Zen2 nodes
also require the `minimum_master_nodes` setting for this one specific and
transient situation.
Since nodes only perform this automatic bootstrapping if they previously
belonged to a Zen1 cluster, they can keep track of the `minimum_master_nodes`
setting from the previous master instead of requiring it to be set on the Zen2
node.
- Add deprecation warning to RestGetFieldMappingAction
- Add two new java HLRC classes GetFieldMappingsRequest and
GetFieldMappingsResponse. These classes use new typeless forms
of a request and response, and differ in that from the server
versions.
Relates to #35190
Currently we have the ability to listen for setting changes to two group
affix settings. However, it is possible that we might have the need to
listen to more than two. This commit adds a method that allows a consumer
to listen to a list of affix settings for changes.
In some cases we only have a string collection instead of a string list
that we want to serialize out. We have a convenience method for writing
a list of strings, but no such method for writing a collection of
strings. Yet, a list of strings is a collection of strings, so we can
simply liberalize StreamOutput#writeStringList to be more generous in
the collections that it accepts and write out collections of strings
too. On the other side, we do not have a convenience method for reading
a list of strings. This commit addresses both of these issues.
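The symmetry between the two convenience methods can be sketched with the JDK's data streams. This is an analogy only: the helper names are hypothetical, and the wire format (an int count followed by UTF strings) is an assumption, not the StreamOutput/StreamInput encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: write any collection of strings, read it back as a list.
final class StringCollectionIO {
    static void writeStringCollection(DataOutput out, Collection<String> strings) throws IOException {
        out.writeInt(strings.size());
        for (String s : strings) {
            out.writeUTF(s);
        }
    }

    static List<String> readStringList(DataInput in) throws IOException {
        int size = in.readInt();
        List<String> result = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            result.add(in.readUTF());
        }
        return result;
    }

    /** Convenience for the example: serialize and immediately deserialize. */
    static List<String> roundTrip(Collection<String> strings) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeStringCollection(new DataOutputStream(bytes), strings);
            return readStringList(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the writer accepts `Collection<String>`, a set or a list can be passed without first copying it into a list.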
* Add PersistentTasksClusterService::unassignPersistentTask method
* adding cancellation test
* Adding integration test for unallocating tasks from a node
* Addressing review comments
* addressing minor PR comments
Due to https://issues.apache.org/jira/browse/LUCENE-8634 this test
may fail if a really tiny polygon is generated. This commit checks for
tiny polygons and skips the final check, which is expected to fail
until the lucene bug is fixed and new version of lucene is released.
* This test completely overrode the `reroute` method and hence did nothing but test the override itself
* Removed the test since it tests nothing and simplified `reroute` accordingly
This commit moves the collectSearchShards method out of RemoteClusterService into TransportSearchAction, which currently calls it. RemoteClusterService used to be used only for cross-cluster search but is now also used in cross-cluster replication, where different APIs are called through the RemoteClusterAwareClient. There is no reason for the collectSearchShards and fetchShards methods to be in RemoteClusterService and RemoteClusterConnection respectively. The search shards API can be called through the RemoteClusterAwareClient too; the only missing bit is a way to handle failures based on the skip_unavailable setting for each cluster (currently only supported in RemoteClusterConnection#fetchShards), which is achieved by adding an isSkipUnavailable(String clusterAlias) method to RemoteClusterService.
This change is useful for #32125 as we will very soon need to also call the search API against remote clusters, which will be done through RemoteClusterAwareClient. In that case we will also need to support skip_unavailable when calling the search API so we need some way to handle the skip_unavailable setting like we currently do for the search_shards call.
Relates to #32125
Today we have several implementations of executing SearchOperationListener
in SearchService. While most of them seem to be safe, at least one, the one that
executes scroll searches, can cause illegal execution of SearchOperationListener
that can then in turn trigger assertions in ShardSearchStats. This change
adds a SearchOperationListenerExecutor that uses try-with blocks to ensure
listeners are called in a safe way.
Relates to #37185
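The try-with pattern described above could look roughly like the following. This is a sketch under assumptions: the `Listener` interface and the `Executor` class are simplified, hypothetical stand-ins, not the actual SearchOperationListener or SearchOperationListenerExecutor code. The point is that the pre hook runs in the constructor and exactly one of the success or failure hooks runs when the block exits:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified listener and executor for illustration only.
final class ListenerExecutorSketch {

    interface Listener {
        void onPre();
        void onSuccess();
        void onFailure();
    }

    static final class Executor implements AutoCloseable {
        private final Listener listener;
        private boolean succeeded;

        Executor(Listener listener) {
            this.listener = listener;
            listener.onPre(); // always runs before the guarded work starts
        }

        // Called as the last statement of the guarded block.
        void success() {
            succeeded = true;
        }

        @Override
        public void close() {
            // Runs on every exit path, so exactly one callback fires.
            if (succeeded) {
                listener.onSuccess();
            } else {
                listener.onFailure();
            }
        }
    }

    // Runs a fake search phase and records which callbacks fired, in order.
    static List<String> run(boolean fail) {
        final List<String> events = new ArrayList<>();
        Listener recording = new Listener() {
            @Override public void onPre() { events.add("pre"); }
            @Override public void onSuccess() { events.add("success"); }
            @Override public void onFailure() { events.add("failure"); }
        };
        try (Executor executor = new Executor(recording)) {
            if (fail) {
                throw new IllegalStateException("simulated search failure");
            }
            executor.success();
        } catch (IllegalStateException e) {
            events.add("caught"); // the resource is closed before this handler runs
        }
        return events;
    }
}
```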
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.
The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.
Relates #27330
Users may require the sequence number and primary term to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalue_fields` API but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it.
This commit adds a dedicated sub fetch phase to return both numbers, controlled by a new `seq_no_primary_term` parameter.
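To illustrate why a client wants both numbers, here is a small in-memory sketch of optimistic concurrency control keyed on sequence number and primary term. Everything in it (the class, the map-backed store, the method names) is hypothetical and only models the compare-then-write idea, not the Elasticsearch implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a single shard; for illustration only.
final class SeqNoTermSketch {

    static final class Doc {
        final String source;
        final long seqNo;
        final long primaryTerm;

        Doc(String source, long seqNo, long primaryTerm) {
            this.source = source;
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();
    private long nextSeqNo = 0;
    private long primaryTerm = 1;

    // Unconditional write: assigns the next sequence number under the current term.
    Doc index(String id, String source) {
        Doc doc = new Doc(source, nextSeqNo++, primaryTerm);
        docs.put(id, doc);
        return doc;
    }

    // Models a primary failover, after which new writes carry a higher term.
    void promoteNewPrimary() {
        primaryTerm++;
    }

    // Conditional write: succeeds only if the stored document still carries
    // the sequence number and primary term that the caller last read.
    boolean indexIfUnchanged(String id, String source, long ifSeqNo, long ifPrimaryTerm) {
        Doc current = docs.get(id);
        if (current == null || current.seqNo != ifSeqNo || current.primaryTerm != ifPrimaryTerm) {
            return false;
        }
        index(id, source);
        return true;
    }
}
```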
1. testSimpleOnlyMasterNodeElection - requires cluster bootstrap when
the first master node is started.
2. testElectOnlyBetweenMasterNodes - requires cluster bootstrap when
the first master node is started and requires adding voting exclusion
before shutting down the first master node.
3. testAliasFilterValidation - requires cluster bootstrap when the
first master node is started.
It's not safe to continue writing state using MetaDataStateFormat
after a dirty WriteStateException has occurred if it's not recovered by
a successful subsequent state write.
We've encountered a test failure in testFailRandomlyAndReadAnyState.
The test breaks in the following way. There are 3 state paths, and this
is what happens:
Successful write at the beginning of the test yields 0 0 0 state
files in the directories.
1st write in the loop is unsuccessful, but not dirty - 0 0 0.
2nd write in the loop is not successful and dirty (failure during
fsync), however before removing new files we have 1 1 1. But now during
deletion, the first deletion fails and we get - 1 0 0.
3rd write in the loop is unsuccessful, but not dirty - so we want to
keep old generation, which happens to be the 1st generation, so now we
have 1 x x in state folders. Now we assert that we either load 0 or 1
state from the state folders and select only the 2nd and 3rd folders to
emulate disk failures - this results in NPE because there is nothing in
these folders.
Fortunately, this won’t be a problem in real life, because if there is
a dirty exception, we shut down the node and make sure we perform a
successful write on the node startup.
This adds a set of helper classes to determine if an agg "has a value".
This is needed because InternalAggs represent "empty" in different
manners according to convention. Some use `NaN`, `+/- Inf`, `0.0`, etc.
A user can pass the Internal agg type to one of these helper methods
and it will report if the agg contains a value or not, which allows the
user to differentiate "empty" from a real `NaN`.
These helpers are best-effort in some cases. For example, several
pipeline aggs share a single return class but use different conventions
to mark "empty", so the helper uses the loosest definition that applies
to all the aggs that use the class.
Sums in particular are unreliable. The InternalSum simply returns 0.0
if the agg is empty (which is correct, no values == sum of zero). But this
also means the helper cannot differentiate between "empty" and `+1 + -1`.
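The ambiguity can be made concrete with a few hypothetical helpers. Apart from the sum returning 0.0 when empty, the "empty" conventions shown here (positive infinity for an empty min, negative infinity for an empty max, a zero count for an empty average) are assumptions for illustration, and the method names are not the ones added by this commit:

```java
// Hypothetical helpers; the real ones take the Internal agg objects.
final class HasValueSketch {

    // Assumed convention: an empty min reports +Infinity.
    static boolean minHasValue(double min) {
        return min != Double.POSITIVE_INFINITY;
    }

    // Assumed convention: an empty max reports -Infinity.
    static boolean maxHasValue(double max) {
        return max != Double.NEGATIVE_INFINITY;
    }

    // With a document count available, "empty" is unambiguous.
    static boolean avgHasValue(long count) {
        return count != 0;
    }

    // Best effort only: a sum of 0.0 is treated as empty, which is wrong
    // for inputs such as +1 and -1 that genuinely sum to zero.
    static boolean sumHasValue(double sum) {
        return sum != 0.0;
    }
}
```

Here `sumHasValue(1.0 + -1.0)` reports no value even though two values were summed, which is the unreliable case called out above.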
This commit removes the warn-date from warning headers. Previously we
were stamping every warning header with when the request
occurred. However, this has a severe performance penalty when
deprecation logging is called frequently, as obtaining the current time
and formatting it properly is expensive. A previous change moved to
using the startup time as the time to stamp on every warning header, but
this was only to prove that the timestamping was expensive. Since the
warn-date is optional, we elect to remove it from the warning
header. Prior to this commit, we worked in Kibana to make the warn-date
treated as optional there so that we can follow-up in Elasticsearch and
remove the warn-date. This commit does that.
Prefer publishing to master-eligible nodes first, so that cluster state updates are committed more
quickly, and master-eligible nodes are also turned more quickly into followers after a leader election.
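One simple way to express this preference is a stable sort that moves master-eligible nodes to the front while keeping the relative order otherwise. The `Node` class and `publishOrder` method below are hypothetical and only sketch the ordering, not the publication code itself:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical node model for illustration only.
final class PublishOrderSketch {

    static final class Node {
        final String name;
        final boolean masterEligible;

        Node(String name, boolean masterEligible) {
            this.name = name;
            this.masterEligible = masterEligible;
        }
    }

    // Returns node names with master-eligible nodes first. List.sort is
    // stable, so nodes keep their original order within each group.
    static List<String> publishOrder(List<Node> nodes) {
        List<Node> sorted = new ArrayList<>(nodes);
        // false sorts before true, so negate the flag to put masters first.
        sorted.sort(Comparator.comparing((Node n) -> !n.masterEligible));
        List<String> names = new ArrayList<>();
        for (Node node : sorted) {
            names.add(node.name);
        }
        return names;
    }
}
```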
PersistentTasksClusterService decides if a task should be reassigned by
checking there is a node in the cluster with the same Id. If a node is
restarted PersistentTasksClusterService may not observe the change and
decide the task still has a valid assignment because the node's
ephemeral Id is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
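The gap described above can be seen in a tiny sketch: a restarted node keeps its persistent Id but gets a new ephemeral Id, so a check on the persistent Id alone still passes. The method below is a hypothetical illustration of comparing both Ids; it is not necessarily how this change detects the restart:

```java
import java.util.Map;

// Hypothetical check for illustration only.
final class AssignmentSketch {

    // currentNodes maps each node's persistent Id to its current ephemeral Id.
    // The assignment is only considered valid if the node is present and has
    // not been restarted since the task was assigned.
    static boolean assignmentStillValid(String nodeId, String ephemeralId, Map<String, String> currentNodes) {
        return ephemeralId.equals(currentNodes.get(nodeId));
    }
}
```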
* Fail start of non-data node if node has data
Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.
Issue #27073
When publications were cancelled because a node turned to follower or candidate, it would still
show as timed out, which can be confusing in the logs. This change adapts the improper call of
onTimeout by generalizing it to a cancel method.
The method calls "enabled" in addition to what super.index() does, but this
seems to be done explicitly now in the TypeParsers `parse` method. The removed
method has been deprecated since at least 6.0. Also making some of the Builder's
methods and ctors private since they are only used internally in this class.
Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the
node names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.
Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node. Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
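A worked example of the placeholder argument, as a sketch. The placeholder naming scheme and both methods are hypothetical; the only assumption carried over from the text is that a quorum is a strict majority of the configuration. With three expected nodes of which two were discovered, the configuration is {n1, n2, placeholder}, so a quorum needs both real nodes, and those two also form a majority of the true set {n1, n2, n3}:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a placeholder-laden voting configuration.
final class PlaceholderBootstrapSketch {

    // Pads the discovered node Ids with placeholder Ids until the
    // configuration has the expected size. Placeholders can never vote.
    static Set<String> initialConfiguration(List<String> discoveredIds, int expectedSize) {
        Set<String> config = new LinkedHashSet<>(discoveredIds);
        for (int i = 0; config.size() < expectedSize; i++) {
            config.add("{placeholder}-" + i); // assumed naming, illustration only
        }
        return config;
    }

    // A set of votes is a quorum if it covers a strict majority of the configuration.
    static boolean isQuorum(Set<String> votes, Set<String> config) {
        long matching = votes.stream().filter(config::contains).count();
        return matching * 2 > config.size();
    }
}
```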
Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.
This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.
In order to be able to parse epoch seconds and epoch milliseconds, custom
java time fields had been introduced. These fields are however not
compatible with the way that java time allows one to configure default
fields (when a part of a timestamp cannot be read then a default value
is added), which is used for the formatters that are rounding up to the
next value.
This commit allows java date formatters to configure their round up parsing
by setting default values via a consumer. By default all formats set
JavaDateFormatter.ROUND_UP_BASE_FIELDS for rounding up. The epoch
parsers, however, both need to set different fields. The merged date
formatters do not set any fields, they just append all the round up formatters.
Also the formatter now properly copies the locale and the timezone, and
fractional parsing has been set to nanoseconds with proper width.
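The consumer-based configuration can be sketched with plain `java.time`, using the standard `DateTimeFormatterBuilder.parseDefaulting` call to supply values for fields that were not parsed. The class, the `roundUp` helper and the `END_OF_DAY` consumer are hypothetical names; they are not JavaDateFormatter or ROUND_UP_BASE_FIELDS:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical sketch of round-up parsing configured through a consumer.
final class RoundUpSketch {

    // Default values that round a date without a time up to the end of the day.
    static final Consumer<DateTimeFormatterBuilder> END_OF_DAY = builder -> {
        builder.parseDefaulting(ChronoField.HOUR_OF_DAY, 23);
        builder.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 59);
        builder.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 59);
    };

    // Builds a formatter for the pattern and lets the consumer decide which
    // defaults apply when parts of the timestamp are missing.
    static DateTimeFormatter roundUp(String pattern, Consumer<DateTimeFormatterBuilder> defaults) {
        DateTimeFormatterBuilder builder = new DateTimeFormatterBuilder().appendPattern(pattern);
        defaults.accept(builder);
        return builder.toFormatter(Locale.ROOT);
    }

    // The time part is optional; defaults are only used when it is absent.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, roundUp("uuuu-MM-dd['T'HH:mm:ss]", END_OF_DAY));
    }
}
```

A parser that needs different defaults, as the epoch parsers do, would pass a different consumer to the same helper.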
Since version 6.7.0 the Close Index API guarantees that all translog
operations have been correctly flushed before the index is closed. If
the index is reopened as a Frozen index (which uses a ReadOnlyEngine)
we can verify that the maximum sequence number from the last Lucene
commit is indeed equal to the last known global checkpoint and refuse
to open the read only engine if it's not the case. In this PR the check is
only done for indices created on or after 6.7.0 as they are guaranteed
to be closed using the new Close Index API.
Related #33888
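The check above reduces to a small predicate. This is a sketch with a hypothetical method name and a boolean standing in for the index-created-version comparison:

```java
// Hypothetical predicate for illustration only.
final class FrozenOpenCheckSketch {

    // Indices created before 6.7.0 may have been closed without the flush
    // guarantee, so the comparison is skipped for them.
    static boolean canOpenReadOnly(long maxSeqNoInLastCommit, long globalCheckpoint, boolean createdOnOrAfter670) {
        if (!createdOnOrAfter670) {
            return true;
        }
        return maxSeqNoInLastCommit == globalCheckpoint;
    }
}
```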
This commit introduces a NetworkMessage class. This class has two
subclasses - InboundMessage and OutboundMessage. These messages can
be serialized and deserialized independent of the transport. This allows
more granular testing. Additionally, the serialization mechanism is now
a simple Supplier. This builds the framework to eventually move the
serialization of transport messages to the network thread. This is the
one serialization component that is not currently performed on the
network thread (transport deserialization and http serialization and
deserialization are all on the network thread).
Currently we create dedicated network threads for both the http and
transport implementations. Since these threads should never
perform blocking operations, these threads could be shared. This commit
modifies the nio-transport to have 0 http workers by default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commit deletes dedicated acceptor threads. We have never had these
for the netty transport and they can be added back if a need is
determined in the future.
* The repo id was determined incorrectly when the delete picked up on an in-progress snapshot
* NOTE: This solution is still a best-effort fix and there's a slight chance of running into concurrency issues here
when multiple create and delete requests for the same snapshot name are happening concurrently, but these require a sequence
of multiple cluster state updates between the changed method reading the genId and submitting its cluster state update task
* Added a test that reproduced the issue reliably in about 50% of runs
* Closes #37581
This will be used in cross-cluster search when reduction will be
performed locally on each cluster. The CCS coordinating node will send
one search request per remote cluster involved and will get one search
response back from each one of them. Such responses contain all the info
to be able to perform an additional reduction and return results back
to the user.
Relates to #32125
This commit optimizes some of the performance issues from using
deprecation logging:
- we optimize encoding the deprecation value
- we optimize formatting the deprecation string
- we optimize away getting the current time (by using cached startup
time)
To make further refactoring of GeoGrid aggregations
easier (related: #30320), this commit splits these inner
class dependencies out into their own files, which makes it
easier to map the relationships between classes.
From #29453 and #37285, the `include_type_name` parameter was already present and defaulted to false. This PR makes the following updates:
- Add deprecation warnings to `RestPutMappingAction`, plus tests in `RestPutMappingActionTests`.
- Add a typeless 'put mappings' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I opted to create a new `PutMappingRequest` object that differs from the existing server one.
This adds deprecation to _type in the script contexts for ingest and update.
This adds a DeprecationMap that wraps the ctx Map containing _type for these
specific contexts.
Some tests (e.g. testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since
being switched to Zen2 because the bootstrap setting was left around when nodes got restarted
with data folders wiped.
The test in question here was starting one node (which autobootstrapped to that single node), then
another node. The first node was then shut down (after excluding it from the voting configuration),
its data folder wiped, and restarted. After restart, the node had an empty data folder yet
initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of
its own, and not rejoin the existing cluster with the other node.
Currently when adding a response header, we do some de-duplication, and
maybe drop the header on the floor if we have reached capacity. Yet, we
still update the thread local tracking the response headers. This is
really expensive because under the hood there is a shared reference that
we synchronize on. In the case of a request processed across many shards
in a tight loop, this contention can be detrimental to performance. We
can avoid updating the thread local in these cases though, when the
response header is a duplicate of one that we have already seen, or when
it's dropped on the floor. This commit addresses these performance
issues by avoiding the unnecessary set.
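The optimization amounts to returning early before touching the thread local. In this sketch the class, the per-header capacity constant and the counter are all hypothetical; the counter only exists to show how many times the expensive set happens:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of response header tracking; for illustration only.
final class ResponseHeaderSketch {

    static final int MAX_VALUES_PER_HEADER = 2; // assumed capacity, illustration only

    private final ThreadLocal<Map<String, Set<String>>> headers =
        ThreadLocal.withInitial(() -> Collections.<String, Set<String>>emptyMap());

    int threadLocalSets = 0; // counts how often the thread local is updated

    void addResponseHeader(String key, String value) {
        Map<String, Set<String>> current = headers.get();
        Set<String> existing = current.getOrDefault(key, Collections.<String>emptySet());
        // Duplicate or over capacity: nothing changes, so skip the set entirely.
        if (existing.contains(value) || existing.size() >= MAX_VALUES_PER_HEADER) {
            return;
        }
        Map<String, Set<String>> updated = new HashMap<>(current);
        Set<String> values = new LinkedHashSet<>(existing);
        values.add(value);
        updated.put(key, values);
        headers.set(updated);
        threadLocalSets++;
    }

    Set<String> values(String key) {
        return headers.get().getOrDefault(key, Collections.<String>emptySet());
    }
}
```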
As of today the Close Index API does its best to close indices,
but closing an index with ongoing recoveries might or might not
be acknowledged depending on the values of the max seq number
and global checkpoint at the time the
TransportVerifyShardBeforeClose action is executed.
These tests failed because they always expect that the index is
correctly closed on the first try, which is not always the case.
Instead we need to retry the closing until it succeeds.
Closes #37571
* Fixes `testTwoNodeFirstNodeCleared` by manipulating voting config exclusions.
* Removes `testRecoveryDifferentNodeOrderStartup` since state recovery is now
handled entirely on the elected master, so the order in which the data nodes
start is irrelevant.
This test was actually passing, for the wrong reason: it asserts a
`MasterNotDiscoveredException` is thrown, expecting this to be due to a failure
to perform state recovery, but in fact it's thrown because the node is not
correctly bootstrapped.
This change adds deprecation warning to the indices.get_mapping API in case the
"include_type_name" parameter is set to "true" and changes the parsing code in
GetMappingsResponse to parse the type-less response instead of the one
containing types. As a consequence the HLRC client doesn't need to force
"include_type_name=true" any more and the GetMappingsResponseTests can be
adapted to the new format as well. Also removing some "include_type_name"
parameters in yaml test and docs where not necessary.
Delete and close index actions threw IllegalArgumentExceptions
when attempting to run against an index that has a snapshot
in progress.
This change introduces a dedicated SnapshotInProgressException
for these scenarios. This is done to explicitly signal to clients that
this is the reason the action failed, and it is a retryable error.
relates to #37541.
This is a continuation of #28667 and aims to convert all executors to propagate errors to the
uncaught exception handler. Notable missing ones were the direct executor and the scheduler. This
commit also makes it the responsibility of the executor, not the runnable, to ensure this property. A big
part of this commit also consists of vastly improving the test coverage in this area.
This change adds a way to customize how phrase prefix queries should be created
on field types. The match phrase prefix query is exposed in field types in order
to allow optimizations based on the options set on the field.
For instance the text field uses the configured prefix field (if available) to
build a span near that mixes the original field and the prefix field on the last
position.
This change also contains a small refactoring of the match/multi_match query that
simplifies the interactions between the builders.
Closes #31921
This test failed because the refresh at the end of the test is not guaranteed to run before the
indexing is completed, and therefore there's no guarantee that the refresh will free all operations.
This triggers an assertion failure in the test clean-up, which asserts that there are no more pending
operations.
Some systems default to a nofile ulimit of 65535. To reduce the pain of
deploying Elasticsearch to such systems, this commit lowers the required
limit from 65536 to 65535.
We flush quite often in testAddNewReplicas to create the safe index
commit with gaps in sequence numbers. This test has been failing recently
because CI is too slow to complete 5 small flushes in 10 seconds.
This commit increases the timeout for this test and also ensures that the
background indexing is always terminated. The latter is to eliminate unrelated
failures if this test fails again.
Closes #37183
All tests except testRestorePersistentSettings (renamed to
testExceptionWhenRestoringPersistentSettings) worked fine.
testExceptionWhenRestoringPersistentSettings was rewritten to use a custom
setting, because the "minimum master node" setting is no longer available
in Zen2. It turns out there is no good replacement for the "minimum master
node" setting for this test, which is why the custom setting is
introduced.
Unfortunately, there is a bug (#37485) and currently
RestoreService does not perform setting validation. That's why the
test is annotated with @AwaitsFix, the idea is to merge this commit and
then fix the issue and enable the test. (The test passes with a simple
fix, that adds a single line to RestoreService).
Currently all proxied actions are denied for the `SystemPrivilege`.
Unfortunately, there are use cases (CCR) where we would like to proxy
actions to a remote node that are normally performed by the
system context. This commit allows the system context to perform
proxy actions if they are actions that the system context is normally
allowed to execute.
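The rule can be sketched as unwrapping the proxy action name before applying the usual system check. The prefix constant and the action names used with this sketch are assumptions for illustration, and the class is hypothetical rather than the real `SystemPrivilege`:

```java
import java.util.function.Predicate;

// Hypothetical sketch of the proxy-aware permission check.
final class ProxyPrivilegeSketch {

    // Assumed prefix marking a proxied action; illustration only.
    static final String PROXY_PREFIX = "internal:transport/proxy/";

    // A proxied action is allowed exactly when the action it wraps would be
    // allowed for the system context; other actions are checked as before.
    static boolean allowed(String action, Predicate<String> systemAllowed) {
        String underlying = action.startsWith(PROXY_PREFIX)
            ? action.substring(PROXY_PREFIX.length())
            : action;
        return systemAllowed.test(underlying);
    }
}
```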
Currently it takes a type, but this isn't really needed now that indices can
have at most one type. The only downside is that we might return a different
error when trying to index into a type that doesn't exist yet.
This commit removes some leniency from REST handling where we move to
reject all requests that have a body where the body is not used during
the course of handling the request. For example,
DELETE /index
{
  "query" : {
    "term" : {
      "field" : "value"
    }
  }
}
is now rejected.
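One way such a check can work is to track whether a handler ever read the body while preparing the request, and to reject the request before executing it if a body was sent but never read. The classes and the response strings below are hypothetical and only model that idea:

```java
import java.util.function.Supplier;

// Hypothetical model of REST dispatch with an unused-body check.
final class RestBodySketch {

    static final class Request {
        private final String body;
        private boolean contentConsumed;

        Request(String body) {
            this.body = body;
        }

        boolean hasContent() {
            return body != null && !body.isEmpty();
        }

        // Reading the body marks it as used by the handler.
        String content() {
            contentConsumed = true;
            return body;
        }
    }

    interface Handler {
        // Prepares the work; handlers that need the body call request.content() here.
        Supplier<String> prepare(Request request);
    }

    static String dispatch(Request request, Handler handler) {
        Supplier<String> action = handler.prepare(request);
        // Reject before executing if a body was sent but the handler ignored it.
        if (request.hasContent() && !request.contentConsumed) {
            return "400: request does not support having a body";
        }
        return action.get();
    }
}
```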
* The internal create request is absolutely redundant, the only difference to the transport request is that we resolved the snapshot
name when moving from the transport to the internal version
* Removed it and passed the transport request into the snapshot service instead
* Nicer way of resolving the snapshot name in the callback
The AbstractLifecycleComponent used to extend AbstractComponent, so it had to pass settings to the constructor of its superclass.
It no longer extends AbstractComponent, so there is no need for this constructor.
There is also no need for AbstractLifecycleComponent subclasses to have Settings in their constructors if they were only passing it over to the super constructor.
This is part 1, which will be backported to 6.x with a migration guide/deprecation log.
Part 2 will have this constructor removed in 7.
relates #35560
relates #34488
This commit adds one more underlying implementation of MockPersistedState.
Previously only InMemoryPersistentState was used, now GatewayMetaState
is rarely used as well.
When adding GatewayMetaState support the main question was: do we want to
emulate exceptions as we do today in MockPersistedState before
delegating to GatewayMetaState or do we want these exceptions to
propagate from the lower level, i.e. file system exceptions?
On the one hand, lower level exception propagation is already tested in
GatewayMetaStateTests, so this won't improve the coverage.
On the other hand, the benefit of low-level exceptions is to see how all these
components work in conjunction. Finally, we abandoned the idea of low-level
exceptions because we don't have a way to deal with IOError today in
CoordinatorTests, but hacking GatewayMetaState not to throw
IOError seems unnatural.
So MockPersistedState rarely throws an exception before delegating to
GatewayMetaState, which is not supposed to throw the exception.
This commit required two changes:
Move GatewayMetaStateUT to upper-level from
GatewayMetaStatePersistedStateTests, because otherwise, it's not easy
to construct GatewayMetaState instance in CoordinatorTests.
Move addition of STATE_NOT_RECOVERED_BLOCK from GatewayMetaState
constructor to GatewayMetaState.applyClusterUpdaters, because
CoordinatorTests class assumes that there is no such block and most of
them fail.
The completion suggester ignores the original weight of the suggestion when duplicates are removed. This change fixes this bug and keeps the best weighted suggestion among the duplicates. It also removes the custom implementation of the top docs suggest collector now that https://issues.apache.org/jira/browse/LUCENE-8529 is committed in Lucene.
Closes #35836
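The intended behaviour of the fix, keeping the best weighted suggestion among duplicates, can be sketched with a map merge. The class and method names are hypothetical and the suggestions are modelled as simple text and weight pairs, not as the Lucene collector types:

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of duplicate removal that keeps the highest weight.
final class DedupSuggestSketch {

    // Convenience factory for a (text, weight) pair.
    static Map.Entry<String, Double> suggestion(String text, double weight) {
        return new AbstractMap.SimpleImmutableEntry<>(text, Double.valueOf(weight));
    }

    // Collapses suggestions with the same text, keeping the largest weight
    // and the order in which each text was first seen.
    static Map<String, Double> bestPerText(List<Map.Entry<String, Double>> suggestions) {
        Map<String, Double> best = new LinkedHashMap<>();
        for (Map.Entry<String, Double> s : suggestions) {
            best.merge(s.getKey(), s.getValue(), Double::max);
        }
        return best;
    }
}
```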
This commit prepares the required infra to make sending a translog snapshot
from the recovery source non-blocking. I'll make a follow-up to make the send
snapshot method non-blocking.
Relates #37291