Previously, if a version conflict occurred and a previous primary
response was present, the original primary response would be used both
for sending to replica and back to client. This was made in the past as an
attempt to fix issues with conflicts after relocations where a bulk request
would experience a closed shard half way through and thus have to retry
on the new primary. It could then fail on its own update.
With sequence numbers, this leads to an issue, since if a primary is
demoted (network partitions), it will send along the original response
in the request. In case of a conflict on the new primary, the old
response is sent to the replica. That data could be stale, leading to
inconsistency between primary and replica.
Relocations now do an explicit hand-off from old to new primary and
ensures that no operations are active while doing this. Above is thus no
longer necessary. This change removes the special handling of conflicts
and ignores primary responses when executing shard bulk requests on the
primary.
The previous logic for concurrent file chunk fetching did not allow for multiple chunks from the same
file to be fetched in parallel. The parallelism only allowed to fetch chunks from different files in
parallel. This required complex logic on the follower to be aware from which file it was already
fetching information, in order to ensure that chunks for the same file would be fetched in sequential
order. During benchmarking, this exhibited throughput issues when recovery came towards the end,
where it would only be sequentially fetching chunks for the same largest segment file, with
throughput considerably going down in a high-latency network as there was no parallelism anymore.
The new logic here follows the peer recovery model more closely, and sends multiple requests for
the same file in parallel, and then reorders the results as necessary. Benchmarks show that this
leads to better overall throughput and the implementation is also simpler.
Backport of #38818 to `7.x`. Original description:
The HTTP exporter code in the Monitoring plugin makes `GET _template` requests to check for existence of templates. These requests don't need to pass the `include_type_name` query parameter so this PR removes it from the request. This should remove the following deprecation log entries on the Monitoring cluster in 7.0.0 onwards:
```
[types removal] Specifying include_type_name in get index template requests is deprecated.
```
This change makes the writing of new usage data conditional based on
the version that is being written to. A test has also been added to
ensure serialization works as expected to an older version.
Relates #38687, #38917
Backport of #38903
When tearing down from `ESSingleNodeTestCase` we perform a delete on "*"
indices, it some cases, however, those indices are not fully deleted. Rather
than have a failure occur later down the change (see:
https://github.com/elastic/elasticsearch/issues/30290#issuecomment-463589008 )
the failure should occurr immediately so it can be diagnosed more easily.
Prior to this commit, when an indexing operation resulted in an
`Engine.Result.Type.MAPPING_UPDATE_REQUIRED`, TransportShardBulkAction
immediately retries the indexing operation to see if it succeeds. In the event
that it succeeds the context does not wait until the mapping update has
propagated through the cluster state before finishing the indexing.
In some of our tests we rely on mappings being available as soon as they've been
introduced in a document that indexed correctly. By removing the immediate retry
we always wait for this to be the case.
Resolves#38428
Supercedes #38579
Relates to #38711
One of the test methods wasn't run because it was private. Making this method
public and fixing some issues around mocking the threadpool that otherwise would
lead to an NPE.
This change updates the authentication service to use a consistent view
of the realms based on the license state at the start of
authentication. Without this, the license can change during
authentication of a request and it will result in a failure if the
realm that extracted the token is no longer in the realm list. This
manifests in some tests as an authentication failure that should never
really happen; one example would be the test framework's transport
client user should always have a succesful authentication but in the
LicensingTests this can fail and will show up as a
NoNodeAvailableException.
Additionally, the licensing tests have been updated to ensure that
there is consistency when changing the license. The license is changed
by modifying the internal xpack license state on each node, which has
no protection against be changed by some pending cluster action. The
methods to disable and enable now ensure we have a green cluster and
that the cluster is consistent before returning.
Closes#30301
This changes the output of the `_cat/indices` API with `Security` enabled.
It is possible to only display the index name (and possibly the index
health, depending on the request options) but not its stats (doc count, merges,
size, etc). This is the case for closed indices which have index metadata in the
cluster state but no associated shards, hence no shard stats.
However, when `Security` is enabled, and the request contains wildcards,
**open** indices without stats are a common occurrence. This is because the
index names in the response table are picked up directly from the cluster state
which is not filtered by `Security`'s _indexNameExpressionResolver_, unlike the
stats data which is populated by the indices stats API which does go through the
index name resolver.
This is a bug, because it is circumventing `Security`'s function to hide
unauthorized indices.
This has been fixed by displaying the index names as they are resolved by the indices
stats API. The outputs of these two APIs is now very similar: same index names,
similar data but different format.
Closes#37190
Right now there is no way to determine whether the
token service or API key service is enabled or not.
This commit adds support for the enabled status of
token and API key service to the security feature set
usage API `/_xpack/usage`.
Closes#38535
The should fix the following NPE:
```
[2019-02-11T23:27:48,452][WARN ][o.e.p.PersistentTasksNodeService] [node_s_0] task kD8YzUhHTK6uKNBNQI-1ZQ-0 failed with an exception
1> java.lang.NullPointerException: null
1> at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor.lambda$fetchFollowerShardInfo$7(ShardFollowTasksExecutor.java:305) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1189) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1169) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:417) [elasticsearch-8.0.0-SNAP
SHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:391) [elasticsearch-8.0.0-SNAP
SHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
```
Relates to #38779
#37767 changed the expected exception for "no such cluster" error from
`IllegalStateException` to a dedicated `NoSuchRemoteClusterException`.
An assertion in `testCollectNodes` needs to be updated accordingly.
* Add rolling upgrade multi cluster test module (#38277)
This test starts 2 clusters, each with 3 nodes.
First the leader cluster is started and tests are run against it and
then the follower cluster is started and tests execute against this two cluster.
Then the follower cluster is upgraded, one node at a time.
After that the leader cluster is upgraded, one node at a time.
Every time a node is upgraded tests are ran while both clusters are online.
(and either leader cluster has mixed node versions or the follower cluster)
This commit only tests CCR index following, but could be used for CCS tests as well.
In particular for CCR, unidirectional index following is tested during a rolling upgrade.
During the test several indices are created and followed in the leader cluster before or
while the follower cluster is being upgraded.
This tests also verifies that attempting to follow an index in the upgraded cluster
from the not upgraded cluster fails. After both clusters are upgraded following the
index that previously failed should succeed.
Relates to #37231 and #38037
* Filter out upgraded version index settings when starting index following (#38838)
The `index.version.upgraded` and `index.version.upgraded_string` are likely
to be different between leader and follower index. In the event that
a follower index gets restored on a upgraded node while the leader index
is still on non-upgraded nodes.
Closes#38835
When shutting down Watcher, the `bulkProcessor` is null if watcher has been
disabled in the configuration. This protects the flush and close calls with a
check for watcher enabled to avoid a NullPointerException
Resolves#38798
When a primary shard is recovered from its store, we trim the last
commit (when it's unsafe). If that primary crashes before the recovery
completes, we will lose the committed retention leases because they are
baked in the last commit. With this change, we copy the retention leases
from the last commit to the safe commit when trimming unsafe commits.
Relates #37165
Currently we index documents concurrently to attempt to ensure that we
update mappings during the restore process. However, this does not
actually test that the mapping will be correct and is dangerous as it
can lead to a misalignment between the max sequence number and the local
checkpoint. If these are not aligned, peer recovery cannot be completed
without initiating following which this test does not do. That causes
teardown assertions to fail.
This commit removes the concurrent indexing and flushes after the
documents are indexed. Additionally it modifies the mapping specific
test to ensure that there is a mapping update when the restore session
is initiated. This mapping update is picked up at the end of the restore
by the follower.
Instead of using `WarningsHandler.PERMISSIVE`, we only match warnings
that are due to types removal.
This PR also renames `allowTypeRemovalWarnings` to `allowTypesRemovalWarnings`.
Relates to #37920.
In this case, we were incrementing the policy too much. This means on
every iteration we actually keep increasing the minimum retained
sequence number, even with leases in place. It was a bug from when the
soft deletes policy had retention leases incorporated into it. This
commit fixes this bug by ensuring we only increment in the proper
places, and adds careful tests for the various situations.
Forward port of https://github.com/elastic/elasticsearch/pull/38757
This change reverts the initial 7.0 commits and replaces them
with the 6.7 variant that still allows for the ecs flag.
This commit differs from the 6.7 variants in that ecs flag will
now default to true.
6.7: `ecs` : default `false`
7.x: `ecs` : default `true`
8.0: no option, but behaves as `true`
* Revert "Ingest node - user agent, move device to an object (#38115)"
This reverts commit 5b008a34aa.
* Revert "Add ECS schema for user-agent ingest processor (#37727) (#37984)"
This reverts commit cac6b8e06f.
* cherry-pick 5dfe1935345da3799931fd4a3ebe0b6aa9c17f57
Add ECS schema for user-agent ingest processor (#37727)
* cherry-pick ec8ddc890a34853ee8db6af66f608b0ad0cd1099
Ingest node - user agent, move device to an object (#38115) (#38121)
* cherry-pick f63cbdb9b426ba24ee4d987ca767ca05a22f2fbb (with manual merge fixes)
Dep. check for ECS changes to User Agent processor (#38362)
* make true the default for the ecs option, and update 7.0 references and tests
The hardcoded '\n' in string will not work in Windows where there is a
different line separator. A System.lineSeparator should be used to make
it work on all platforms
closes#38705
backport #38771
Currently init scripts fail when `/proc/sys/vm/max_map_count` is not present
with `-bash: [: too many arguments`.
Fix conditional logic to avoid trying to set the `max_map_count` sysctl if not
present.
Backport of: #35933
Relates: #27236
Change the formatting for Watcher.status.lastCheck and lastMetCondition
to be the same as Watcher.status.state.timestamp. These should all have
only millisecond precision
closes#38619
backport #38626
- Disables the request cache on the test, to prevent cached
values from potentially interfering with test results
- Changes the test to execute a single query, in hopes of making
failures more reproducible
Backport of #38583
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.
Relates #38633
We need to use the current primary term instead of 1L for the initial
retention leases; otherwise, the primary term of the committed
retention leases won't match the current primary term if the
retention leases never gets updated.
This change removes the pinning of TLSv1.2 in the
SSLConfigurationReloaderTests that had been added to workaround an
issue with the MockWebServer and Apache HttpClient when using TLSv1.3.
The way HttpClient closes the socket causes issues with the TLSv1.3
SSLEngine implementation that causes the MockWebServer to loop
endlessly trying to send the close message back to the client. This
change wraps the created http connection in a way that allows us to
override the closing behavior of HttpClient.
An upstream request with HttpClient has been opened at
https://issues.apache.org/jira/browse/HTTPCORE-571 to see if the method
of closing can be special cased for SSLSocket instances.
This is caused by a JDK bug, JDK-8214418 which is fixed by
https://hg.openjdk.java.net/jdk/jdk12/rev/5022a4915fe9.
Relates #38646
`<expression>::<dataType>` is a simplified altenative syntax to
`CAST(<expression> AS <dataType> which exists in PostgreSQL and
provides an improved user experience and possibly more compact
SQL queries.
Fixes: #38717
This commit introduces actions for some common retention lease
operations that clients need to be able to perform remotely. These
actions include add/renew/remove.