The assertBusy() that waits the default 10 seconds for a
datafeed to complete very occasionally times out on slow
machines. This commit increases the timeout to 60 seconds.
It will almost never actually take this long, but it's
better to have a timeout that will prevent time being
wasted looking at spurious test failures.
This test should no longer pass when the functionality it is intended to
test is broken, as it now indexes a number of documents and verifies
that the index is staying on the same step until after indexing and
replication of those documents is finished. This prevents the test from
passing if the leader index progresses in its lifecycle during that time.
This test failed once in a very long time with the assertion
that there is no document for the `non_existing_job` in the
state index. I could not see how that is possible and I cannot
reproduce. With this commit the failure message will reveal
some examples of the left behind docs which might shed a light
about what could go wrong.
With this commit we remove all usages of the deprecated method
`ExceptionsHelper#detailedMessage` in tests. We do not address
production code here but rather in dedicated follow-up PRs to keep the
individual changes manageable.
Relates #19069
This commit makes `TransformIntegrationTests` into a standard integration test, as
opposed to using `TimeWarp`, which registers the mock component
`ScheduleEngineTriggerMock` to trigger watches.
The simplification may help with flakiness we've observed `TimeWarp, as in #37882.
The SchedulerEngine is used in several places in our code and not all
of these usages properly stopped the SchedulerEngine, which could lead
to test failures due to leaked threads from the SchedulerEngine. This
change adds stopping to these usages in order to avoid the thread leaks
that cause CI failures and noise.
Closes#38875
Currently remote compression and ping schedule settings are dynamic.
However, we do not listen for changes. This commit adds listeners for
changes to those two settings. Additionally, when those settings change
we now close existing connections and open new ones with the settings
applied.
Fixes#37201.
Sleeps in tests smell funny, and we try to avoid them to the extent
possible. We are using a small one in a CCR test. This commit clarifies
the purpose of that sleep by adding a comment explaining it. We also
removed a hard-coded value from the test, that if we ever modified the
value higher up where it was set, we could end up forgetting to change
the value here. Now we ensure that these would move in lock step if we
ever maintain them later.
We have some CCR tests where we use mock transport send rules to control
the behavior that we desire in these tests. Namely, we want to simulate
an exception being thrown on the leader side, or a variety of other
situations. These send rules were put in place between the data nodes on
each side. However, it might not be the case that these requests are
being sent between data nodes. For example, a request that is handled on
a non-data master node would not be sent from a data node. And it might
not be the case that the request is sent to a data node, as it could be
proxied through a non-data coordinating node. This commit addresses this
by putting these send rules in places between all nodes on each side.
Closes#39011Closes#39201
`ReadOnlyEngine` never recovers operations from translog and never
updates translog information in the index shard's recovery state, even
though the recovery state goes through the `TRANSLOG` stage during
the recovery. It means that recovery information for frozen shards indicates
an unkown number of recovered translog ops in the Recovery APIs
(translog_ops: `-1` and translog_ops_percent: `-1.0%`) and this is confusing.
This commit changes the `recoverFromTranslog()` method in `ReadOnlyEngine`
so that it always recover from an empty translog snapshot, allowing the recovery
state translog information to be correctly updated.
Related to #33888
Initially in #38910, ShardFollowTask was reusing ImmutableFollowParameters'
serialization logic. After merging, bwc tests failed sometimes and
the binary serialization that ShardFollowTask was originally was using
was added back. ImmutableFollowParameters is using optional fields (optional vint)
while ShardFollowTask was not (vint).
Today we always refresh when looking up the primary term in
FollowingEngine. This is not necessary for we can simply
return none for operations before the global checkpoint.
The follower won't always have the same history as the leader for its
soft-deletes retention can be different. However, if some operation
exists on the history of the follower, then the same operation must
exist on the leader. This change relaxes the history check in
ShardFollowTaskReplicationTests.
Closes#39093
This commit introduces the retention leases to ESIndexLevelReplicationTestCase,
then adds some tests verifying that the retention leases replication works
correctly in spite of the presence of the primary failover or out of order
delivery of retention leases sync requests.
Relates #37165
This change aims to fix failures in the session factory load balancing
tests that mock failure scenarios. For these tests, we randomly shut
down ldap servers and bind a client socket to the port they were
listening on. Unfortunately, we would occasionally encounter failures
in these tests where a socket was already in use and/or the port
we expected to connect to was wrong and in fact was to one of the ldap
instances that should have been shut down.
The failures are caused by the behavior of certain operating systems
when it comes to binding ports and wildcard addresses. It is possible
for a separate application to be bound to a wildcard address and still
allow our code to bind to that port on a specific address. So when we
close the server socket and open the client socket, we are still able
to establish a connection since the other application is already
listening on that port on a wildcard address. Another variant is that
the os will allow a wildcard bind of a server socket when there is
already an application listening on that port for a specific address.
In order to do our best to prevent failures in these scenarios, this
change does the following:
1. Binds a client socket to all addresses in an awaitBusy
2. Adds assumption that we could bind all valid addresses
3. In the case that we still establish a connection to an address that
we should not be able to, try to bind and expect a failure of not
being connected
Closes#32190
This commit fixes a broken CCR retention lease unfollow test. The
problem with the test is that the random subset of shards that we picked
to disrupt would not necessarily overlap with the actual shards in
use. We could take a non-empty subset of [0, 3] (e.g., { 2 }) when the
only shard IDs in use were [0, 1]. This commit fixes this by taking into
account the number of shards in use in the test.
With this change, we also take measure to ensure that a successful
branch is tested more frequently than would otherwise be the case. On
that branch, we want to sometimes pretend that the retention lease is
already removed. The randomness here was also sometimes selecting a
subset of shards that did not overlap with the shards actually in use
during the test. While this does not break the test, it is confusing and
reduces the amount of coverage of that branch.
Relates #39185
In most of the places we avoid creating the `.security` index (or updating the mapping)
for read/search operations. This is more of a nit for the case of the getRole call,
that fixes a possible mapping update during a get role, and removes a dead if branch
about creating the `.security` index.
This commit attempts to remove the retention leases on the leader shards
when unfollowing an index. This is best effort, since the leader might
not be available.
This defaults to "true" (current behavior) and will throw an exception
if there is a property that cannot be recognized. If "false", it will
ignore anything unrecognizable.
(cherry picked from commit 38fbf9792bcf4fe66bb3f17589e5fe6d29748d07)
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.
This changes replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from mutiple threads.
Backport of: #39092
The ML memory tracker does searches against ML results
and config indices. These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.
This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete. Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.
The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping stopping the node.
Fixes#37117
These two changes are interlinked.
Before this change unsetting ML upgrade mode would wait for all
datafeeds to be assigned and not waiting for their corresponding
jobs to initialise. However, this could be inappropriate, if
there was a reason other that upgrade mode why one job was unable
to be assigned or slow to start up. Unsetting of upgrade mode
would hang in this case.
This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled. Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.
In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned. There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
Prior to this commit, if during fetch leader / follower GCP
a fatal error occurred, then the shard follow task was removed.
This is unexpected, because if such an error occurs during the lifetime of shard follow task then replication is stopped and the fatal error flag is set. This allows the ccr stats api to report the fatal exception that has occurred (instead of the user grepping through the elasticsearch logs).
This issue was found by a rare failure of the `FollowStatsIT#testFollowStatsApiIncludeShardFollowStatsWithRemovedFollowerIndex` test.
Closes#38779
* Disable specific locales for tests in fips mode
The Bouncy Castle FIPS provider that we use for running our tests
in fips mode has an issue with locale sensitive handling of Dates as
described in https://github.com/bcgit/bc-java/issues/405
This causes certificate validation to fail if any given test that
includes some form of certificate validation happens to run in one
of the locales. This manifested earlier in #33081 which was
handled insufficiently in #33299
This change ensures that the problematic 3 locales
* th-TH
* ja-JP-u-ca-japanese-x-lvariant-JP
* th-TH-u-nu-thai-x-lvariant-TH
will not be used when running our tests in a FIPS 140 JVM. It also
reverts #33299
The .ml-annotations index is created asynchronously when
some other ML index exists. This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.
This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:
1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup
This means that by the time the main post-test index
cleanup code runs:
1. The only ML index it has to delete will be the
.ml-annotations index
2. No other ML indices will exist that could trigger
recreation of the .ml-annotations index
Fixes#38952
* Fix#38623 remove xpack namespace REST API
Except for xpack.usage and xpack.info API's, this moves the last remaining API's out of the xpack namespace
* rename xpack api's inside inside the files as well
* updated yaml tests references to xpack namespaces api's
* update callsApi calls in the IT subclasses
* make sure docs testing does not use xpack namespaced api's
* fix leftover xpack namespaced method names in docs/build.gradle
* found another leftover reference
(cherry picked from commit ccb5d934363c37506b76119ac050a254fa80b5e7)
The data frame plugin allows users to create feature indexes by pivoting a source index. In a
nutshell this can be understood as reindex supporting aggregations or similar to the so called entity
centric indexing.
Full history is provided in: feature/data-frame-transforms
* During fetching remote mapping if remote client is missing then
`NoSuchRemoteClusterException` was not handled.
* When adding remote connection, check that it is really connected
before continue-ing to run the tests.
Relates to #38695
Follow index in follow cluster that follows an index in the leader cluster and another
follow index in the leader index that follows that index in the follow cluster.
During the upgrade index following is paused and after the upgrade
index following is resumed and then verified index following works as expected.
Relates to #38037
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
This commit adds a `ListenerTimeouts` class that will wrap a
`ActionListener` in a listener with a timeout scheduled on the generic
thread pool. If the timeout expires before the listener is completed,
`onFailure` will be called with an `ElasticsearchTimeoutException`.
Timeouts for the get ccr file chunk action are implemented using this
functionality. Additionally, this commit attempts to fix#38027 by also
blocking proxied get ccr file chunk actions. This test being un-muted is
useful to verify the timeout functionality.
Today when processing an operation on a replica engine (or the
following engine), we first add it to Lucene, then add it to translog,
then finally marks its seq_no as completed. If a flush occurs after step1,
but before step-3, the max_seq_no in the commit's user_data will be
smaller than the seq_no of some documents in the Lucene commit.
We verify seq_no_stats is aligned between copies at the end of some
disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due
to a lagged global checkpoint on replicas. The global checkpoint on
replicas is lagged because we sync the global checkpoint 30 seconds (by
default) after the last replication operation. This change reduces the
global checkpoint sync-internal to 1s in the disruption tests.
Closes#38318Closes#36789
The CCR REST tests that rely on these assertions are flaky. They are
flaky since the introduction of recovery from the remote.
The underlying problem is this: these tests are making assertions about
the number of operations read by the shard following task. However, with
recovery from remote, we no longer have guarantees that the assumptions
these tests were relying on hold. Namely, these tests were assuming that
the only way that a document could land in the follower index is via the
shard following task. With recovery from remote, there is another way,
which is via the files that are copied over during the recovery
phase. Most of the time this will not be a problem because with the
small number of documents that we are indexing in these tests, it is
usally not the case that a flush would occur and so there would not be
any documents in the files copied over. However, a flush can occur any
time at which point all of the indexed documents could end up in a safe
commit and copied over during recovery from remote. This commit modifies
these assertions to ones that are not prone to this issue, yet still
validate the health of the follower shard.
Few tests failed intermittently and most of the
times due to invalidated or expired keys that were
deleted were still reported in search results.
This commit removes the test and adds enhancements
to other tests testing different scenario's.
When ExpiredApiKeysRemover is triggered, the tests
did not await its termination thereby sometimes
the results would be wrong for a search operation.
DELETE_INTERVAL setting has been further reduced to
100ms so we can trigger ExpiredApiKeysRemover faster.
Closes#38408
The rest of `CCRIT` is now no longer relevant, because the remaining
test tests the same of the index following test in the rolling upgrade
multi cluster module.
Added `tests.upgrade_from_version` version to test. It is not needed
in this branch, but is in 6.7 branch.
Closes#37231
The previous logic for concurrent file chunk fetching did not allow for multiple chunks from the same
file to be fetched in parallel. The parallelism only allowed to fetch chunks from different files in
parallel. This required complex logic on the follower to be aware from which file it was already
fetching information, in order to ensure that chunks for the same file would be fetched in sequential
order. During benchmarking, this exhibited throughput issues when recovery came towards the end,
where it would only be sequentially fetching chunks for the same largest segment file, with
throughput considerably going down in a high-latency network as there was no parallelism anymore.
The new logic here follows the peer recovery model more closely, and sends multiple requests for
the same file in parallel, and then reorders the results as necessary. Benchmarks show that this
leads to better overall throughput and the implementation is also simpler.
Backport of #38818 to `7.x`. Original description:
The HTTP exporter code in the Monitoring plugin makes `GET _template` requests to check for existence of templates. These requests don't need to pass the `include_type_name` query parameter so this PR removes it from the request. This should remove the following deprecation log entries on the Monitoring cluster in 7.0.0 onwards:
```
[types removal] Specifying include_type_name in get index template requests is deprecated.
```
This change makes the writing of new usage data conditional based on
the version that is being written to. A test has also been added to
ensure serialization works as expected to an older version.
Relates #38687, #38917
This change updates the authentication service to use a consistent view
of the realms based on the license state at the start of
authentication. Without this, the license can change during
authentication of a request and it will result in a failure if the
realm that extracted the token is no longer in the realm list. This
manifests in some tests as an authentication failure that should never
really happen; one example would be the test framework's transport
client user should always have a succesful authentication but in the
LicensingTests this can fail and will show up as a
NoNodeAvailableException.
Additionally, the licensing tests have been updated to ensure that
there is consistency when changing the license. The license is changed
by modifying the internal xpack license state on each node, which has
no protection against be changed by some pending cluster action. The
methods to disable and enable now ensure we have a green cluster and
that the cluster is consistent before returning.
Closes#30301
This changes the output of the `_cat/indices` API with `Security` enabled.
It is possible to only display the index name (and possibly the index
health, depending on the request options) but not its stats (doc count, merges,
size, etc). This is the case for closed indices which have index metadata in the
cluster state but no associated shards, hence no shard stats.
However, when `Security` is enabled, and the request contains wildcards,
**open** indices without stats are a common occurrence. This is because the
index names in the response table are picked up directly from the cluster state
which is not filtered by `Security`'s _indexNameExpressionResolver_, unlike the
stats data which is populated by the indices stats API which does go through the
index name resolver.
This is a bug, because it is circumventing `Security`'s function to hide
unauthorized indices.
This has been fixed by displaying the index names as they are resolved by the indices
stats API. The outputs of these two APIs is now very similar: same index names,
similar data but different format.
Closes#37190
Right now there is no way to determine whether the
token service or API key service is enabled or not.
This commit adds support for the enabled status of
token and API key service to the security feature set
usage API `/_xpack/usage`.
Closes#38535
The should fix the following NPE:
```
[2019-02-11T23:27:48,452][WARN ][o.e.p.PersistentTasksNodeService] [node_s_0] task kD8YzUhHTK6uKNBNQI-1ZQ-0 failed with an exception
1> java.lang.NullPointerException: null
1> at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor.lambda$fetchFollowerShardInfo$7(ShardFollowTasksExecutor.java:305) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1189) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1169) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:417) [elasticsearch-8.0.0-SNAP
SHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:391) [elasticsearch-8.0.0-SNAP
SHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
```
Relates to #38779
* Add rolling upgrade multi cluster test module (#38277)
This test starts 2 clusters, each with 3 nodes.
First the leader cluster is started and tests are run against it and
then the follower cluster is started and tests execute against this two cluster.
Then the follower cluster is upgraded, one node at a time.
After that the leader cluster is upgraded, one node at a time.
Every time a node is upgraded tests are ran while both clusters are online.
(and either leader cluster has mixed node versions or the follower cluster)
This commit only tests CCR index following, but could be used for CCS tests as well.
In particular for CCR, unidirectional index following is tested during a rolling upgrade.
During the test several indices are created and followed in the leader cluster before or
while the follower cluster is being upgraded.
This tests also verifies that attempting to follow an index in the upgraded cluster
from the not upgraded cluster fails. After both clusters are upgraded following the
index that previously failed should succeed.
Relates to #37231 and #38037
* Filter out upgraded version index settings when starting index following (#38838)
The `index.version.upgraded` and `index.version.upgraded_string` are likely
to be different between leader and follower index. In the event that
a follower index gets restored on a upgraded node while the leader index
is still on non-upgraded nodes.
Closes#38835
When shutting down Watcher, the `bulkProcessor` is null if watcher has been
disabled in the configuration. This protects the flush and close calls with a
check for watcher enabled to avoid a NullPointerException
Resolves#38798
Currently we index documents concurrently to attempt to ensure that we
update mappings during the restore process. However, this does not
actually test that the mapping will be correct and is dangerous as it
can lead to a misalignment between the max sequence number and the local
checkpoint. If these are not aligned, peer recovery cannot be completed
without initiating following which this test does not do. That causes
teardown assertions to fail.
This commit removes the concurrent indexing and flushes after the
documents are indexed. Additionally it modifies the mapping specific
test to ensure that there is a mapping update when the restore session
is initiated. This mapping update is picked up at the end of the restore
by the follower.
Instead of using `WarningsHandler.PERMISSIVE`, we only match warnings
that are due to types removal.
This PR also renames `allowTypeRemovalWarnings` to `allowTypesRemovalWarnings`.
Relates to #37920.
Forward port of https://github.com/elastic/elasticsearch/pull/38757
This change reverts the initial 7.0 commits and replaces them
with the 6.7 variant that still allows for the ecs flag.
This commit differs from the 6.7 variants in that ecs flag will
now default to true.
6.7: `ecs` : default `false`
7.x: `ecs` : default `true`
8.0: no option, but behaves as `true`
* Revert "Ingest node - user agent, move device to an object (#38115)"
This reverts commit 5b008a34aa.
* Revert "Add ECS schema for user-agent ingest processor (#37727) (#37984)"
This reverts commit cac6b8e06f.
* cherry-pick 5dfe1935345da3799931fd4a3ebe0b6aa9c17f57
Add ECS schema for user-agent ingest processor (#37727)
* cherry-pick ec8ddc890a34853ee8db6af66f608b0ad0cd1099
Ingest node - user agent, move device to an object (#38115) (#38121)
* cherry-pick f63cbdb9b426ba24ee4d987ca767ca05a22f2fbb (with manual merge fixes)
Dep. check for ECS changes to User Agent processor (#38362)
* make true the default for the ecs option, and update 7.0 references and tests
Change the formatting for Watcher.status.lastCheck and lastMetCondition
to be the same as Watcher.status.state.timestamp. These should all have
only millisecond precision
closes#38619
backport #38626
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.
Relates #38633
This change removes the pinning of TLSv1.2 in the
SSLConfigurationReloaderTests that had been added to workaround an
issue with the MockWebServer and Apache HttpClient when using TLSv1.3.
The way HttpClient closes the socket causes issues with the TLSv1.3
SSLEngine implementation that causes the MockWebServer to loop
endlessly trying to send the close message back to the client. This
change wraps the created http connection in a way that allows us to
override the closing behavior of HttpClient.
An upstream request with HttpClient has been opened at
https://issues.apache.org/jira/browse/HTTPCORE-571 to see if the method
of closing can be special cased for SSLSocket instances.
This is caused by a JDK bug, JDK-8214418 which is fixed by
https://hg.openjdk.java.net/jdk/jdk12/rev/5022a4915fe9.
Relates #38646
`<expression>::<dataType>` is a simplified altenative syntax to
`CAST(<expression> AS <dataType> which exists in PostgreSQL and
provides an improved user experience and possibly more compact
SQL queries.
Fixes: #38717
fix tests to use clock in milliseconds precision in watcher code
make sure the date comparison in string format is using same formatters
some of the code was modified in #38514 possibly because of merge conflicts
closes#38581
Backport #38738
The java time formatter used in the exporter adds a plus sign to the
year, if a year with more than five digits is used. This changes the
creation of those timestamp to only have a date up to 9999.
Closes#38378
The Close Index API has been refactored in 6.7.0 and it now performs
pre-closing sanity checks on shards before an index is closed: the maximum
sequence number must be equals to the global checkpoint. While this is a
strong requirement for regular shards, we identified the need to relax this
check in the case of CCR following shards.
The following shards are not in charge of managing the max sequence
number or global checkpoint, which are pulled from a leader shard. They
also fetch and process batches of operations from the leader in an unordered
way, potentially leaving gaps in the history of ops. If the following shard lags
a lot it's possible that the global checkpoint and max seq number never get
in sync, preventing the following shard to be closed and a new PUT Follow
action to be issued on this shard (which is our recommended way to
resume/restart a CCR following).
This commit allows each Engine implementation to define the specific
verification it must perform before closing the index. In order to allow
following/frozen/closed shards to be closed whatever the max seq number
or global checkpoint are, the FollowingEngine and ReadOnlyEngine do
not perform any check before the index is closed.
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Added a constructor accepting `StreamInput` as argument, which allowed to
make most of the instance members final as well as remove the default
constructor.
Removed a test only constructor in favour of invoking the existing
constructor that takes a `SearchRequest` as first argument.
Also removed profile members and related methods as they were all unused.
* Rename integTest to bwcTestSample for bwc test projects
This change renames the `integTest` task to `bwcTestSample` for projects
testing bwc to make it possible to run all the bwc tests that check
would run without running on bwc tests.
This change makes it possible to add a new PR check on backports to make
sure these don't break BWC tests in master.
* Rename task as per PR
The test was relying on toString in ZonedDateTime which is different to
what is formatted by strict_date_time when milliseconds are 0
The method is just delegating to dateFormatter, so that scenario should
be covered there.
closes#38359
Backport #38610
* Enhance parsing of StatusCode in SAML Responses
<Status> elements in a failed response might contain two nested
<StatusCode> elements. We currently only parse the first one in
order to create a message that we attach to the Exception we return
and log. However this is generic and only gives out informarion
about whether the SAML IDP believes it's an error with the
request or if it couldn't handle the request for other reasons. The
encapsulated StatusCode has a more interesting error message that
potentially gives out the actual error as in Invalid nameid policy,
authentication failure etc.
This change ensures that we print that information also, and removes
Message and Details fields from the message when these are not
part of the Status element (which quite often is the case)
When the millisecond part of a timestamp is 0 the toString
representation in java-time is omitting the millisecond part (joda was
not). The Search response is returning timestamps formatted with
WatcherDateTimeUtils, therefore comparisons of strings should be done
with the same formatter
relates #27330
BackPort #38505
Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR.
The implementation uses the parallel file writer functionality that is also used by peer recoveries.
Improve verifier to disallow grouping over grouping functions (e.g.
HISTOGRAM over HISTOGRAM).
Close#38308
(cherry picked from commit 4e9b1cfd4df38c652bba36b4b4b538ce7c714b6e)
Constant numbers (of any form: integers, decimals, negatives,
scientific) and strings shouldn't increase the depth counters
as they don't contribute to the increment of the stack depth.
Fixes: #38571
* ML: update set_upgrade_mode, add logging
* Attempt to fix datafeed isolation
Also renamed a few methods/variables for clarity and added
some comments
This commit adds the 7.1 version constant to the 7.x branch.
Co-authored-by: Andy Bristol <andy.bristol@elastic.co>
Co-authored-by: Tim Brooks <tim@uncontended.net>
Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
Co-authored-by: markharwood <markharwood@gmail.com>
Co-authored-by: Ioannis Kakavas <ioannis@elastic.co>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Alpar Torok <torokalpar@gmail.com>
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Albert Zaharovits <albert.zaharovits@gmail.com>
- Add resolution to the exact keyword field (if exists) for text fields.
- Add proper verification and error message if underlying keyword
doesn'texist.
- Move check for field attribute in the comparison list to the
`resolveType()` method of `IN`.
Fixes: #38424
In #38333 and #38350 we moved away from the `discovery.zen` settings namespace
since these settings have an effect even though Zen Discovery itself is being
phased out. This change aligns the documentation and the names of related
classes and methods with the newly-introduced naming conventions.
Aliases defined in SELECT (Project or Aggregate) are now resolved in the
following WHERE clause. The Analyzer has been enhanced to identify this
rule and replace the field accordingly.
Close#29983
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. Such setting was initially added
in the attempts to not have requests go through retries if the request
already took longer than the provided timeout.
The implementation was problematic though as such timeout would also
expire in the first request attempt (see #31834), would leave the
request executing after expiration causing memory leaks (see #33342),
and would not take into account the http client internal queuing (see #25951).
Given all these issues, it seems that this custom timeout mechanism
gives little benefits while causing a lot of harm. We should rather rely
on connect and socket timeout exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyways.
This commit removes the `maxRetryTimeout` setting and all of its usages.
This commit adds an authentication cache for API keys that caches the
hash of an API key with a faster hash. This will enable better
performance when API keys are used for bulk or heavy searching.
I have not been able to reproduce the failing
test scenario locally for #38408 and there are other similar
tests which are running fine in the same test class.
I am re-enabling the test with additional logs so
that we can debug further on what's happening.
I will keep the issue open for now and look out for the builds
to see if there are any related failures.
This is related to #35975. We do not want a slow master to fail a
recovery from remote process due to a slow put mappings call. This
commit increases the master node timeout on this call to 30 mins.
`waitForRollUpJob` is an assertBusy that waits for the rollup job
to appear in the tasks list, and waits for it to be a certain state.
However, there was a null check around the state assertion, which meant
if the job _was_ null, the assertion would be skipped, and the
assertBusy would pass withouot an exception. This could then lead to
downstream assertions to fail because the job was not actually ready,
or in the wrong state.
This changes the test to assert the job is not null, so the assertBusy
operates as intended.
Since introduction of data types that don't have a corresponding type
in ES the `esType` is error-prone when used for `unmappedType()` calls.
Moreover since the renaming of `DATE` to `DATETIME` and the introduction
of an actual date-only `DATE` the `esType` would return `datetime` which
is not a valid type for ES mapping.
Fixes: #38051
For some users, the built in authorization mechanism does not fit their
needs and no feature that we offer would allow them to control the
authorization process to meet their needs. In order to support this,
a concept of an AuthorizationEngine is being introduced, which can be
provided using the security extension mechanism.
An AuthorizationEngine is responsible for making the authorization
decisions about a request. The engine is responsible for knowing how to
authorize and can be backed by whatever mechanism a user wants. The
default mechanism is one backed by roles to provide the authorization
decisions. The AuthorizationEngine will be called by the
AuthorizationService, which handles more of the internal workings that
apply in general to authorization within Elasticsearch.
In order to support external authorization services that would back an
authorization engine, the entire authorization process has become
asynchronous, which also includes all calls to the AuthorizationEngine.
The use of roles also leaked out of the AuthorizationService in our
existing code that is not specifically related to roles so this also
needed to be addressed. RequestInterceptor instances sometimes used a
role to ensure a user was not attempting to escalate their privileges.
Addressing this leakage of roles meant that the RequestInterceptor
execution needed to move within the AuthorizationService and that
AuthorizationEngines needed to support detection of whether a user has
more privileges on a name than another. The second area where roles
leaked to the user is in the handling of a few privilege APIs that
could be used to retrieve the user's privileges or ask if a user has
privileges to perform an action. To remove the leakage of roles from
these actions, the AuthorizationService and AuthorizationEngine gained
methods that enabled an AuthorizationEngine to return the response for
these APIs.
Ultimately this feature is the work included in:
#37785#37495#37328#36245#38137#38219Closes#32435
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a optimistic concurrency control) operations using internal document versioning. Sadly that approach is flawed and can sometime do the wrong thing. Here's the relevant excerpt from the resiliency status page:
> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document
We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem.
This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.
Relates to #1078
Currently the snapshot/restore process manually sets the global
checkpoint to the max sequence number from the restored segements. This
does not work for Ccr as this will lead to documents that would be
recovered in the normal followering operation from being recovered.
This commit fixes this issue by setting the initial global checkpoint to
the existing local checkpoint.
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).
This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:
```
PUT index?include_type_name=false
{
"mappings": {
"_doc": {
"properties": { ... }
}
}
}
```
There is a similar issue for both 'put templates' and 'put mappings' requests
as well.
This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
The backport of #38022 introduced types-deprecation warning for get/put template requests
that cause problems on tests master in mixed cluster scenarios. While these warnings are
caught and ignored in regular Rest tests, the get template requests in XPackRestTestHelper
were missed.
Closes#38412
Tests can override assertToXContentEquivalence() in case their xcontent
cannot be directly compared (e.g. due to insertion order in maps
affecting the xcontent ordering). But the `testHlrcFromXContent` test
hardcoded the equivalence test to `true` instead of consulting
`assertToXContentEquivalence()`
Fixes#36034
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.
Relates #38410
the clock resolution changed from jdk8->jdk10, hence the test is passing
in jdk8 but failing in jdk10. The Watcher's objects are serialised and
deserialised with milliseconds precision, making test to fail in jdk 10
and higher
closes#38400
The test is now expected to be always passing no matter what the random
locale is. This is fixed with using jdk ZoneId.systemDefault() in both
the test and CronEvalTool
closes#35687
If a job cannot be assigned to a node because an index it
requires is unavailable and there are lazy ML nodes then
index unavailable should be reported as the assignment
explanation rather than waiting for a lazy ML node.
Introduced FollowParameters class that put follow, resume follow,
put auto follow pattern requests and follow info response classes reuse.
The FollowParameters class had the fields, getters etc. for the common parameters
that all these APIs have. Also binary and xcontent serialization /
parsing is handled by this class.
The follow, resume follow, put auto follow pattern request classes originally
used optional non primitive fields, so FollowParameters has that too and the follow info api can handle that now too.
Also the followerIndex field can in production only be specified via
the url path. If it is also specified via the request body then
it must have the same value as is specified in the url path. This
option only existed to xcontent testing. However the AbstractSerializingTestCase
base class now also supports createXContextTestInstance() to provide
a different test instance when testing xcontent, so allowing followerIndex
to be specified via the request body is no longer needed.
By moving the followerIndex field from Body to ResumeFollowAction.Request
class and not allowing the followerIndex field to be specified via
the request body the Body class is redundant and can be removed. The
ResumeFollowAction.Request class can then directly use the
FollowParameters class.
For consistency I also removed the ability to specified followerIndex
in the put follow api and the name in put auto follow pattern api via
the request body.
Authn is enabled only if `license_type` is non `basic`, but `basic` is
what the `LicenseService` generates implicitly. This commit explicitly sets
license type to `trial`, which allows for authn, in the `SecuritySettingsSource`
which is the settings configuration parameter for `InternalTestCluster`s.
The real problem, that had created tests failures like #31028 and #32685, is
that the check `licenseState.isAuthAllowed()` can change sporadically. If it were
to return `true` or `false` during the whole test there would be no problem.
The problem manifests when it turns from `true` to `false` right before `Realms.asList()`.
There are other license checks before this one (request filter, token service, etc)
that would not cause a problem if they would suddenly see the check as `false`.
But switching to `false` before `Realms.asList()` makes it appear that no installed
realms could have handled the authn token which is an authentication error, as can
be seen in the failing tests.
Closes#31028#32685
Renames the following settings to remove the mention of `zen` in their names:
- `discovery.zen.hosts_provider` -> `discovery.seed_providers`
- `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers`
- `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout`
- `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`
* Adding apm_user
* Fixing SecurityDocumentationIT testGetRoles test
* Adding access to .ml-anomalies-*
* Fixing APM test, we don't have access to the ML state index
X-Pack security supports built-in authentication service
`token-service` that allows access tokens to be used to
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on
OAuth2 spec. The access token is a short-lived token
(defaults to 20m) and refresh token with a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline thereby failing refresh of tokens.
This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens aka API
keys to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.
The API keys:-
- by default do not have an expiration but expiration can be
configured where the API keys need to be expired after a
certain amount of time.
- when generated will keep authentication information of the user that
generated them.
- can be defined with a role describing the privileges for accessing
Elasticsearch and will be limited by the role of the user that
generated them
- can be invalidated via invalidation API
- information can be retrieved via a get API
- that have been expired or invalidated will be retained for 1 week
before being deleted. The expired API keys remover task handles this.
Following are the API key management APIs:-
1. Create API Key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API Key(s) `DELETE /_security/api_key`
The API keys can be used to access Elasticsearch using `Authorization`
header, where the auth scheme is `ApiKey` and the credentials, is the
base64 encoding of API key Id and API key separated by a colon.
Example:-
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```
Closes#34383
We mention in our documentation for the token
expiration configuration maximum value is 1 hour
but do not enforce it. This commit adds max limit
to the TOKEN_EXPIRATION setting.
This test should not pass until CCR finishes integrating shard history
retention leases. It currently sometimes passes (which is a bug in the
test), but cannot pass reliably until the linked issue is resolved.
There are two issues regarding the way that we sync mapping from leader
to follower when a ccr restore is completed:
1. The returned mapping from a cluster service might not be up to date
as the mapping of the restored index commit.
2. We should not compare the mapping version of the follower and the
leader. They are not related to one another.
Moreover, I think we should only ensure that once the restore is done,
the mapping on the follower should be at least the mapping of the copied
index commit. We don't have to sync the mapping which is updated after
we have opened a session.
Relates #36879Closes#37887
This is related to #35975. Currently when an index falls behind a leader
it encounters a fatal exception. This commit adds a test for that
scenario. Additionally, it tests that the user can stop following, close
the follower index, and put follow again. After the indexing is
re-bootstrapped, it will recover the documents it lost in normal
following operations.
This commit fixes the pinning of SSLContexts to TLSv1.2 in the
SSLConfigurationReloaderTests. The pinning was added for the initial
creation of clients and webservers but the updated contexts would
default to TLSv1.3, which is known to cause hangs with the
MockWebServer that we use.
Relates #38103Closes#38247
This PR removes the use of document types from the monitoring exporters and template + watches setup code.
It does not remove the notion of types from the monitoring bulk API endpoint "front end" code as that code will eventually just go away in 8.0 and be replaced with Beats as collectors/shippers directly to the monitoring cluster.
At times, we need to check for usage of deprecated settings in settings
which should not be returned by the NodeInfo API. This commit changes
the deprecation info API to run all node checks locally so that these
settings can be checked without exposing them via any externally
accessible API.
This commit introduces a background sync for retention leases. The idea
here is that we do a heavyweight sync when adding a new retention lease,
and then periodically we want to background sync any retention lease
renewals to the replicas. As long as the background sync interval is
significantly lower than the extended lifetime of a retention lease, it
is okay if from time to time a replica misses a sync (it will still have
an older version of the lease that is retaining more data as we assume
that renewals do not decrease the retaining sequence number). There are
two follow-ups that will come after this commit. The first is to address
the fact that we have not adapted the should periodically flush logic to
possibly flush the retention leases. We want to do something like flush
if we have not flushed in the last five minutes and there are renewed
retention leases since the last time that we flushed. An additional
follow-up will remove the syncing of retention leases when a retention
lease expires. Today this sync could be invoked in the background by a
merge operation. Rather, we will move the syncing of retention lease
expiration to be done under the background sync. The background sync
will use the heavyweight sync (write action) if a lease has expired, and
will use the lightweight background sync (replication action) otherwise.
The explanation so far can be invaluable for troubleshooting
as incorrect decisions made early on in the structure analysis
can result in seemingly crazy decisions or timeouts later on.
Relates elastic/kibana#29821
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
It would be beneficial to apply some of the request interceptors even
when features are disabled. This change reworks the way we build that
list so that the interceptors we always want to use are constructed
outside of the settings check.
Instead of throwing an exception, use an unresolved attribute to pass
the message to the Verifier.
Additionally improve the parser to save the extended source for the
Aggregate and OrderBy.
Close#38208
The culprit in #38097 is an `IndicesRequest` that has no indices,
but instead of `request.indices()` returning `null` or `String[0]`
it returned `String[] {null}` . This tripped the audit filter.
I have addressed this in two ways:
1. `request.indices()` returning `String[] {null}` is treated as `null`
or `String[0]`, i.e. no indices
2. `null` values among the roles and indices lists, which are
unexpected, will never again stumble the audit filter; `null` values
are treated as special values that will not match any policy,
i.e. their events will always be printed.
Closes#38097
* Add checks for Grouping functions restriction to be placed inside GROUP BY
* Fixed bug where GROUP BY HISTOGRAM (not using alias) wasn't recognized
properly in the Verifier due to functions equality not working correctly.
Adds a Step to the Shrink and Delete actions which prevents those
actions from running on a leader index - all follower indices must first
unfollow the leader index before these actions can run. This prevents
the loss of history before follower indices are ready, which might
otherwise result in the loss of data.
Introduce client-side sorting of groups based on aggregate
functions. To allow this, the Analyzer has been extended to push down
to underlying Aggregate, aggregate function and the Querier has been
extended to identify the case and consume the results in order and sort
them based on the given columns.
The underlying QueryContainer has been slightly modified to allow a view
of the underlying values being extracted as the columns used for sorting
might not be requested by the user.
The PR also adds minor tweaks, mainly related to tree output.
Close#35118
Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
There was a bug where creating a new policy would start
the ILM service, even if it was stopped. This change ensures
that there is no change to the existing operation mode
This suite still fails one per week sometimes with a worrying assertion.
Sadly we are still unable to find the actual source.
Expected: <SeqNoStats{maxSeqNo=229, localCheckpoint=86, globalCheckpoint=86}>
but: was <SeqNoStats{maxSeqNo=229, localCheckpoint=-1, globalCheckpoint=86}>
This change enables trace log in the suite so we will have a better
picture if this fails again.
Relates #3333
This PR removes the temporary change we made to the yml test harness in #37285
to automatically set `include_type_name` to `true` in index creation requests
if it's not already specified. This is possible now that the vast majority of
index creation requests were updated to be typeless in #37611. A few additional
tests also needed updating here.
Additionally, this PR updates the test harness to set `include_type_name` to
`false` in index creation requests when communicating with 6.x nodes. This
mirrors the logic added in #37611 to allow for typeless document write requests
in test set-up code. With this update in place, we can remove many references
to `include_type_name: false` from the yml tests.
Unlike assertBusy, awaitBusy does not retry if the code-block throws an
AssertionError. A refresh in atLeastDocsIndexed can fail because we call
this method while we are closing some node in FollowerFailOverIT.
This PR adds the `monitor/xpack/info` cluster-level privilege to the built-in `monitoring_user` role.
This privilege is required for the Monitoring UI to call the `GET _xpack API` on the Monitoring Cluster. It needs to do this in order to determine the license of the Monitoring Cluster, which further determines whether Cluster Alerts are shown to the user or not.
Resolves#37970.
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8". (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)
This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
This commit enables the use of TLSv1.3 with security by enabling us to
properly map `TLSv1.3` in the supported protocols setting to the
algorithm for a SSLContext. Additionally, we also enable TLSv1.3 by
default on JDKs that support it.
An issue was uncovered with the MockWebServer when TLSv1.3 is used that
ultimately winds up in an endless loop when the client does not trust
the server's certificate. Due to this, SSLConfigurationReloaderTests
has been pinned to TLSv1.2.
Closes#32276
In 6.3 trial licenses were changed to default to security
disabled, and ee added some heuristics to detect when security should
be automatically be enabled if `xpack.security.enabled` was not set.
This change removes those heuristics, and requires that security be
explicitly enabled (via the `xpack.security.enabled` setting) for
trial licenses.
Relates: #38009
Currently we use the raw byte array length when calling the IndexInput
read call to determine how many bytes we want to read. However, due to
how BigArrays works, the array length might be longer than the reference
length. This commit fixes the issue and uses the BytesRef length when
calling read. Additionally, it expands the index follow test to index
many more documents. These documents should potentially lead to large
enough segment files to trigger scenarios where this fix matters.
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes#37708.
FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.
Closes: #35639
With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests, that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested to each cluster, and the reduce phase in each cluster is non final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.
This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API accepts now a new parameter called ccs_minimize_roundtrips that allows to opt-out of the default behaviour.
Relates to #32125
* Added SSL configuration options tests
Removed the allow.self.signed option from the documentation since we allow
by default self signed certificates as well.
* Added more tests
The existing implementation was slow due to exceptions being thrown if
an accessor did not have a time zone. This implementation queries for
having a timezone, local time and local date and also checks for an
instant preventing to throw an exception and thus speeding up the conversion.
This removes the existing method and create a new one named
DateFormatters.from(TemporalAccessor accessor) to resemble the naming of
the java time ones.
Before this change an epoch millis parser using the toZonedDateTime
method took approximately 50x longer.
Relates #37826
* move watcher to seq# occ
* top level set
* fix parsing and missing setters
* share toXContent for PutResponse and rest end point
* fix redacted password
* fix username reference
* fix deactivate-watch.asciidoc have seq no references
* add seq# + term to activate-watch.asciidoc
* more doc fixes
This commit fixes the test case that ensures only a priority
less then 0 is used with testNonPositivePriority. This also
allows the HLRC to support a value of 0.
Closes#37652
Previously, ShrinkAction would fail if
it was executed on an index that had
the same number of shards as the target
shrunken number.
This PR introduced a new BranchingStep that
is used inside of ShrinkAction to branch which
step to move to next, depending on the
shard values. So no shrink will occur if the
shard count is unchanged.
This commit allows implementors of the `HandledTransportAction` to
specify what thread the action should be executed on. The motivation for
this commit is that certain CCR requests should be performed on the
generic threadpool.
The apache commons http client implementations recently released
versions that solve TLS compatibility issues with the new TLS engine
that supports TLSv1.3 with JDK 11. This change updates our code to
use these versions since JDK 11 is a supported JDK and we should
allow the use of TLSv1.3.
This fixes#38027. Currently we assert that all shards have failed.
However, it is possible that some shards do not have segement files
created yet. The action that we block is fetching these segement files
so it is possible that some shards successfully recover.
This commit changes the assertion to ensure that at least some of the
shards have failed.
Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in
tests, but for 7.x nodes this setting is not required as it has no effect.
This commit removes this setting so that nodes are started with more realistic
configurations, and deprecates it.
Types have been deprecated and this commit removes the documentation
for specifying types in the index action, and search input/transform.
Relates #37594#35190
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.
This deprecates the special `use_field_mapping` docvalue format which was added
in #29639 only to ease the transition to 7.x and it is not necessary anymore in
7.0.
The certgen, certutil and saml-metadata tools did not correctly return
their exit code to the calling shell.
These commands now explicitly exit with the code that was returned
from the main(args, terminal) method.
This commit fixes a potential race in the IndexFollowingIT. Currently it
is possible that we fetch the task metadata, it is null, and that throws
a null pointer exception. Assertbusy does not catch null pointer
exceptions. This commit assertions that the metadata is not null.
This is related to #35975. It adds a action timeout setting that allows
timeouts to be applied to the individual transport actions that are
used during a ccr recovery.
Restricted indices (currently only .security-6 and .security) are special
internal indices that require setting the `allow_restricted_indices` flag
on every index permission that covers them. If this flag is `false`
(default) the permission will not cover these and actions against them
will not be authorized.
However, the monitoring APIs were the only exception to this rule.
This exception is herein forfeited and index monitoring privileges have to be
granted explicitly, using the `allow_restricted_indices` flag on the permission,
as is the case for any other index privilege.
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes
the logic through the snapshot/restore process. A
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
This change adds a _meta field storing the version in which
the index mappings were last updated to the 3 ML indices
that didn't previously have one:
- .ml-annotations
- .ml-meta
- .ml-notifications
All other ML indices already had such a _meta field.
This field will be useful if we ever need to automatically
update the index mappings during a future upgrade.
Runnables can be submitted to
AutodetectProcessManager.AutodetectWorkerExecutorService
without error after it has been shutdown which can lead
to requests timing out as their handlers are never called
by the terminated executor.
This change throws an EsRejectedExecutionException if a
runnable is submitted after after the shutdown and calls
AbstractRunnable.onRejection on any tasks not run.
Closes#37108
This commit changes the TransportVerifyShardBeforeCloseAction so that it issues a
forced flush, forcing the translog and the Lucene commit to contain the same max seq
number and global checkpoint in the case the Translog contains operations that were
not written in the IndexWriter (like a Delete that touches a non existing doc). This way
the assertion added in #37426 won't trip.
Related to #33888
This commit moves the auditing of job deletion related errors
to the final listener in the job delete action. This ensures
any error that occurs during job deletion is audited.
In order to support JSON log format, a custom pattern layout was used and its configuration is enclosed in ESJsonLayout. Users are free to use their own patterns, but if smooth Beats integration is needed, they should use ESJsonLayout. EvilLoggerTests are left intact to make sure user's custom log patterns work fine.
To populate additional fields node.id and cluster.uuid which are not available at start time,
a cluster state update will have to be received and the values passed to log4j pattern converter.
A ClusterStateObserver.Listener is used to receive only one ClusteStateUpdate. Once update is received the nodeId and clusterUUid are set in a static field in a NodeAndClusterIdConverter.
Following fields are expected in JSON log lines: type, tiemstamp, level, component, cluster.name, node.name, node.id, cluster.uuid, message, stacktrace
see ESJsonLayout.java for more details and field descriptions
Docker log4j2 configuration is now almost the same as the one use for ES binary.
The only difference is that docker is using console appenders, whereas ES is using file appenders.
relates: #32850
This commit allows JIRA API fields that require a list of key/value
pairs (maps), such as JIRA "components" to use use template snippets
(e.g. {{ctx.payload.foo}}). Prior to this change the templated value
(not the de-referenced value) would be sent via the API and error.
Closes#30068
We inject an Unfollow action before Shrink because the Shrink action
cannot be safely used on a following index, as it may not be fully
caught up with the leader index before the "original" following index is
deleted and replaced with a non-following Shrunken index. The Unfollow
action will verify that 1) the index is marked as "complete", and 2) all
operations up to this point have been replicated from the leader to the
follower before explicitly disconnecting the follower from the leader.
Injecting an Unfollow action before the Rollover action is done mainly
as a convenience: This allow users to use the same lifecycle policy on
both the leader and follower cluster without having to explictly modify
the policy to unfollow the index, while doing what we expect users to
want in most cases.
If the index request is executed before the mapping update is applied on
the IndexShard, the index request will perform a dynamic mapping update.
This mapping update will be timeout (i.e, ProcessClusterEventTimeoutException)
because the latch is not open. This leads to the failure of the index
request and the test. This commit makes sure the mapping is ready
before we execute the index request.
Closes#37807
This commit adds deprecation warnings for index actions
and search actions when executed via watcher. Unit and
integration tests updated accordingly.
relates #35190
* ML: Add MlMetadata.upgrade_mode and API
* Adding tests
* Adding wait conditionals for the upgrade_mode call to return
* Adding tests
* adjusting format and tests
* Adjusting wait conditions for api return and msgs
* adjusting doc tests
* adding upgrade mode tests to black list
We have read and write aliases for the ML results indices. However,
the job still had methods that purported to reliably return the name
of the concrete results index being used by the job. After reindexing
prior to upgrade to 7.x this will be wrong, so the method has been
renamed and the comments made more explicit to say the returned index
name may not be the actual concrete index name for the lifetime of the
job. Additionally, the selection of indices when deleting the job
has been changed so that it works regardless of concrete index names.
All these changes are nice-to-have for 6.7 and 7.0, but will become
critical if we add rolling results indices in the 7.x release stream
as 6.7 and 7.0 nodes may have to operate in a mixed version cluster
that includes a version that can roll results indices.
* Changed `LuceneSnapshot` to throw an `OperationsMissingException` if the requested ops are missing.
* Changed the shard changes api to handle the `OperationsMissingException` and wrap the exception into `ResourceNotFound` exception and include metadata to indicate the requested range can no longer be retrieved.
* Changed `ShardFollowNodeTask` to handle this `ResourceNotFound` exception with the included metdata header.
Relates to #35975
Replace `threadPool().schedule()` / catch
`EsRejectedExecutionException` pattern with direct calls to
`ThreadPool#scheduleUnlessShuttingDown()`.
Closes#36318
This commit introduces the `create_snapshot` cluster privilege and
the `snapshot_user` role.
This role is to be used by "cronable" tools that call the snapshot API
periodically without recurring to the `manage` cluster privilege. The
`create_snapshot` cluster privilege is much more limited compared to
the `manage` privilege.
The `snapshot_user` role grants the privileges to view the metadata of
all indices (including restricted ones, i.e. .security). It obviously grants the
create snapshot privilege but the repository has to be created using another
role. In addition, it grants the privileges to (only) GET repositories and
snapshots, but not create and delete them.
The role does not allow to create repositories. This distinction is important
because snapshotting equates to the `read` index privilege if the user has
control of the snapshot destination, but this is not the case in this instance,
because the role does not grant control over repository configuration.
This commit introduces retention lease syncing from the primary to its
replicas when a new retention lease is added. A follow-up commit will
add a background sync of the retention leases as well so that renewed
retention leases are synced to replicas.
Made the test tolerant to index upgrade being run
in between the old/mixed/upgraded portions. This
can occur because the rolling upgrade tests all
share the same indices.
Fixes#37763
The ML file structure finder has always reported both Joda
and Java time format strings. This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.
The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.
This commit also removes direct imports of Joda classes in the
structure finder unit tests. Instead the core Joda BWC class
is used.
* Exit batch files explictly using ERRORLEVEL
This makes sure the exit code is preserved when calling the batch
files from different contexts other than DOS
Fixes#29582
This also fixes specific error codes being masked by an explict
exit /b 1
causing the useful exitcodes from ExitCodes to be lost.
* fix line breaks for calling cli to match the bash scripts
* indent size of bash files is 2, make sure editorconfig does the same for bat files
* update indenting to match bash files
* update elasticsearch-keystore.bat indenting
* Update elasticsearch-node.bat to exit outside of endlocal
The TransportUnfollowAction updates the index settings but does not
increase the settings version to reflect that change.
This issue has been caught while working on the replication of closed
indices (#33888). The IndexFollowingIT.testUnfollowIndex() started to
fail and this specific assertion tripped. It does not happen on master
branch today because index metadata for closed indices are never
updated in IndexService instances, but this is something that is going
to change with the replication of closed indices.
The unlucky timing can cause this test to fail when the indexing is triggered from `maybeTriggerAsyncJob`. As this is asynchronous, in can finish quicker then the test stepping over to next assertion
The introduced barrier solves the problem
closes#37695
* Remove empty statements
There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. Removing most of these instances and marking blocks
that look empty by intention as such.
* Change test, slightly more verbose but less confusing
When upgrading from 5.4 to 5.5 to 6.7 (inclusive) it was
necessary to ensure there was a mapping for type "doc" on
the ML state index before opening a job. This was because
5.4 created a multi-type ML state index.
In version 7.x we can be sure that any such 5.4 index is no
longer in use. It would have had to be reindexed into the
6.x index format prior to the upgrade to version 7.x.
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.
Closes#33028
When an index is frozen, two index settings are updated (index.frozen and
index.search.throttled) but the settings version is left unchanged and does
not reflect the settings update. This commit change the
TransportFreezeIndexAction so that it also increases the settings version
when an index is frozen/unfrozen.
This issue has been caught while working on the replication of closed
indices (#3388) in which index metadata for a closed index are updated
to frozen metadata and this specific assertion tripped.
The default value for ssl.supported_protocols no longer includes TLSv1
as this is an old protocol with known security issues.
Administrators can enable TLSv1.0 support by configuring the
appropriate `ssl.supported_protocols` setting, for example:
xpack.security.http.ssl.supported_protocols: ["TLSv1.2","TLSv1.1","TLSv1"]
Relates: #36021
This deprecates the `xpack.watcher.history.cleaner_service.enabled` setting,
since all newly created `.watch-history` indices in 7.0 will use ILM to manage
their retention.
In 8.0 the setting itself and cleanup actions will be removed.
Resolves#32041
Today, the mapping on the follower is managed and replicated from its
leader index by the ShardFollowTask. Thus, we should prevent users
from modifying the mapping on the follower indices.
Relates #30086
When the arguements of PERCENTILE and PERCENTILE_RANK can be folded,
the `ConstantFolding` rule kicks in and calls the `replaceChildren()`
method on `InnerAggregate` which is created from the aggregation rules
of the `Optimizerz. `InnerAggregate` in turn, cannot implement the method
as the logic of creating a new `InnerAggregate` instance from a list of
`Expression`s resides in the Optimizer. So, instead, `ConstantFolding`
should be applied before any of the aggregations related rules.
Fixes: #37099
* Testing conventions now checks for tests in main
This is the last outstanding feature of the old NamingConventionsTask,
so time to remove it.
* PR review
This change adds a docker compose configuration that's used with
the `elasticsearch.test.fixtures` plugin to start up the image
and check that the TCP ports are up.
We can build on this to add other checks for culster health,
run REST tests, etc.
We can add multiple containers and configurations to the compose
file (e.x. test different env vars) and form clusters.
When we can't map the principal attribute from the configured SAML
attribute in the realm settings, we can't complete the
authentication. We return an error to the user indicating this and
we present them with a list of attributes we did get from the SAML
response to point out that the expected one was not part of that
list. This list will never contain the NameIDs though as they are
not part of the SAMLAttribute list. So we might have a NameID but
just with a different format.
This commit removes the Index Audit Output type, following its deprecation
in 6.7 by 8765a31d4e6770. It also adds the migration notice (settings notice).
In general, the problem with the index audit output is that event indexing
can be slower than the rate with which audit events are generated,
especially during the daily rollovers or the rolling cluster upgrades.
In this situation audit events will be lost which is a terrible failure situation
for an audit system.
Besides of the settings under the `xpack.security.audit.index` namespace, the
`xpack.security.audit.outputs` setting has also been deprecated and will be
removed in 7. Although explicitly configuring the logfile output does not touch
any deprecation bits, this setting is made redundant in 7 so this PR deprecates
it as well.
Relates #29881
The filtering by follower index was completely broken.
Also the wrong persistent tasks were selected, causing the
wrong status to be reported.
Closes#37738
Today we keep the mapping on the follower in sync with the leader's
using the mapping version from changes requests. There are two rare
cases where the mapping on the follower is not synced properly:
1. The returned mapping version (from ClusterService) is outdated than
the actual mapping. This happens because we expose the latest cluster
state in ClusterService after applying it to IndexService.
2. It's possible for the FollowTask to receive an outdated mapping than
the min_required_mapping. In that case, it should fetch the mapping
again; otherwise, the follower won't have the right mapping.
Relates to #31140
In some cases we only have a string collection instead of a string list
that we want to serialize out. We have a convenience method for writing
a list of strings, but no such method for writing a collection of
strings. Yet, a list of strings is a collection of strings, so we can
simply liberalize StreamOutput#writeStringList to be more generous in
the collections that it accepts and write out collections of strings
too. On the other side, we do not have a convenience method for reading
a list of strings. This commit addresses both of these issues.
* Use ILM for Watcher history deletion
This commit adds an index lifecycle policy for the `.watch-history-*` indices.
This policy is automatically used for all new watch history indices.
This does not yet remove the automatic cleanup that the monitoring plugin does
for the .watch-history indices, and it does not touch the
`xpack.watcher.history.cleaner_service.enabled` setting.
Relates to #32041
Some steps, such as steps that delete, close, or freeze an index, may fail due to a currently running snapshot of the index. In those cases, rather than move to the ERROR step, we should retry the step when the snapshot has completed.
This change adds an abstract step (`AsyncRetryDuringSnapshotActionStep`) that certain steps (like the ones I mentioned above) can extend that will automatically handle a situation where a snapshot is taking place. When a `SnapshotInProgressException` is received by the listener wrapper, a `ClusterStateObserver` listener is registered to wait until the snapshot has completed, re-running the ILM action when no snapshot is occurring.
This also adds integration tests for these scenarios (thanks to @talevy in #37552).
Resolves#37541
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.
The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.
Relates #27330
This change moves the update to the results index mappings
from the open job action to the code that starts the
autodetect process.
When a rolling upgrade is performed we need to update the
mappings for already-open jobs that are reassigned from an
old version node to a new version node, but the open job
action is not called in this case.
Closes#37607
While tests migration from Zen1 to Zen2, we've encountered this test.
This test is organized as follows:
Starts the first cluster node.
Starts the second cluster node.
Checks that license is active.
Interesting fact that adding assertLicenseActive(true) between 1
and 2 also makes the test pass.
assertLicenseActive retrieves XPackLicenseState from the nodes
and checks that active flag is set. It's set to true even before
the cluster is initialized.
So this test does not make sense.
This adds a set of helper classes to determine if an agg "has a value".
This is needed because InternalAggs represent "empty" in different
manners according to convention. Some use `NaN`, `+/- Inf`, `0.0`, etc.
A user can pass the Internal agg type to one of these helper methods
and it will report if the agg contains a value or not, which allows the
user to differentiate "empty" from a real `NaN`.
These helpers are best-effort in some cases. For example, several
pipeline aggs share a single return class but use different conventions
to mark "empty", so the helper uses the loosest definition that applies
to all the aggs that use the class.
Sums in particular are unreliable. The InternalSum simply returns 0.0
if the agg is empty (which is correct, no values == sum of zero). But this
also means the helper cannot differentiate from "empty" and `+1 + -1`.
It looks like the output of FileUserPasswdStore.parseFile shouldn't be wrapped
into another map since its output can be null. Doing this wrapping after the null
check (which potentially raises an exception) instead.
Use PEM files for the key/cert for TLS on the http layer of the
node instead of a JKS keystore so that the tests can also run
in a FIPS 140 JVM .
Resolves: #37682
* Add separate CLI Mode
* Use the correct Mode for cursor close requests
* Renamed CliFormatter and have different formatting behavior for CLI and "text" format.
Due to missing stubbing for `NativePrivilegeStore#getPrivileges`
the test `testNegativeLookupsAreCached` failed
when the superuser role name was present in the role names.
This commit adds missing stubbing.
Closes: #37657
Currently we create dedicated network threads for both the http and
transport implementations. Since these these threads should never
perform blocking operations, these threads could be shared. This commit
modifies the nio-transport to have 0 http workers be default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commmit deletes dedicated acceptor threads. We have never had these
for the netty transport and they can be added back if a need is
determined in the future.
The integ tests currently use the raw zip project name as the
distribution type. This commit simplifies this specification to be
"default" or "oss". Whether zip or tar is used should be an internal
implementation detail of the integ test setup, which can (in the future)
be platform specific.
This grants the capability to grant privileges over certain restricted
indices (.security and .security-6 at the moment).
It also removes the special status of the superuser role.
IndicesPermission.Group is extended by adding the `allow_restricted_indices`
boolean flag. By default the flag is false. When it is toggled, you acknowledge
that the indices under the scope of the permission group can cover the
restricted indices as well. Otherwise, by default, restricted indices are ignored
when granting privileges, thus rendering them hidden for authorization purposes.
This effectively adds a confirmation "check-box" for roles that might grant
privileges to restricted indices.
The "special status" of the superuser role has been removed and coded as
any other role:
```
new RoleDescriptor("superuser",
new String[] { "all" },
new RoleDescriptor.IndicesPrivileges[] {
RoleDescriptor.IndicesPrivileges.builder()
.indices("*")
.privileges("all")
.allowRestrictedIndices(true)
// this ----^
.build() },
new RoleDescriptor.ApplicationResourcePrivileges[] {
RoleDescriptor.ApplicationResourcePrivileges.builder()
.application("*")
.privileges("*")
.resources("*")
.build()
},
null, new String[] { "*" },
MetadataUtils.DEFAULT_RESERVED_METADATA,
Collections.emptyMap());
```
In the context of the Backup .security work, this allows the creation of a
"curator role" that would permit listing (get settings) for all indices
(including the restricted ones). That way the curator role would be able to
ist and snapshot all indices, but not read or restore any of them.
Supersedes #36765
Relates #34454
Removes all sensitive settings (passwords, auth tokens, urls, etc...) for
watcher notifications accounts. These settings were deprecated (and
herein removed) in favor of their secure sibling that is set inside the
elasticsearch keystore. For example:
`xpack.notification.email.account.<id>.smtp.password`
is no longer a valid setting, and it is replaced by
`xpack.notification.email.account.<id>.smtp.secure_password`
The ML subproject of xpack has a cache for the cpp artifact snapshots
which is checked on each build. The cache is outside of the build dir so
that it is not wiped on a typical clean, as the artifacts can be large
and do not change often. This commit adds a cleanCache task which will
wipe the cache dir, as over time the size of the directory can become
bloated.
Currently we add the CcrRestoreSourceService as a index event
listener. However, if ccr is disabled, this service is null and we
attempt to add a null listener throwing an exception. This commit only
adds the listener if ccr is enabled.
This is related to #35975. This commit adds timeout functionality to
the local session on a leader node. When a session is started, a timeout
is scheduled using a repeatable runnable. If the session is not accessed
in between two runs the session is closed. When the sssion is closed,
the repeating task is cancelled.
Additionally, this commit moves session uuid generation to the leader
cluster. And renames the PutCcrRestoreSessionRequest to
StartCcrRestoreSessionRequest to reflect that change.
* Remove obsolete deprecation checks
This also updates the old-indices check to be appropriate for the 7.x
series of releases, and leaves it as the only deprecation check in
place.
* Add toString to DeprecationIssue
* Bring filterChecks across from 6.x
* License headers
This change adds the unfollow action for CCR follower indices.
This is needed for the shrink action in case an index is a follower index.
This will give the follower index the opportunity to fully catch up with
the leader index, pause index following and unfollow the leader index.
After this the shrink action can safely perform the ilm shrink.
The unfollow action needs to be added to the hot phase and acts as
barrier for going to the next phase (warm or delete phases), so that
follower indices are being unfollowed properly before indices are expected
to go in read-only mode. This allows the force merge action to execute
its steps safely.
The unfollow action has three steps:
* `wait-for-indexing-complete` step: waits for the index in question
to get the `index.lifecycle.indexing_complete` setting be set to `true`
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks
for the index being handled to report that the leader shard global checkpoint
is equal to the follower shard global checkpoint.
* `pause-follower-index` step: Pauses index following, necessary to unfollow
* `close-follower-index` step: Closes the index, necessary to unfollow
* `unfollow-follower-index` step: Actually unfollows the index using
the CCR Unfollow API
* `open-follower-index` step: Reopens the index now that it is a normal index
* `wait-for-yellow` step: Waits for primary shards to be allocated after
reopening the index to ensure the index is ready for the next step
In the case of the last two steps, if the index in being handled is
a regular index then the steps acts as a no-op.
Relates to #34648
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
This change fixes the setup of the SSL configuration for the test
openldap realm. The configuration was missing the realm identifier so
the SSL settings being used were just the default JDK ones that do not
trust the certificate of the idp fixture.
See #37591
Throws an exception if hit extractor tries to retrieve unsupported
object. For example, selecting "a" from `{"a": {"b": "c"}}` now throws
an exception instead of returning null.
Relates to #37364
* Add ccr follow info api
This api returns all follower indices and per follower index
the provided parameters at put follow / resume follow time and
whether index following is paused or active.
Closes#37127
* iter
* [DOCS] Edits the get follower info API
* [DOCS] Fixes link to remote cluster
* [DOCS] Clarifies descriptions for configured parameters
Commit #37535 removed an internal restore request in favor of the
RestoreSnapshotRequest. Commit #37449 added a new test that used the
internal restore request. This commit modifies the new test to use the
RestoreSnapshotRequest.
This is a continuation of #28667 and has as goal to convert all executors to propagate errors to the
uncaught exception handler. Notable missing ones were the direct executor and the scheduler. This
commit also makes it the property of the executor, not the runnable, to ensure this property. A big
part of this commit also consists of vastly improving the test coverage in this area.
This commit adds a set_priority action to the hot, warm, and cold
phases for an ILM policy. This action sets the `index.priority`
on the managed index to allow different priorities between the
hot, warm, and cold recoveries.
This commit also includes the HLRC and documentation changes.
closes#36905
* SQL: Rename SQL data type DATE to DATETIME
SQL data type DATE has only the date part (e.g.: 2019-01-14)
without any time information. Previously the SQL type DATE was
referring to the ES DATE which contains also the time part along
with TZ information. To conform with SQL data types the data type
`DATE` is renamed to `DATETIME`, since it includes also the time,
as a new runtime SQL `DATE` data type will be introduced down the road,
which only contains the date part and meets the SQL standard.
Closes: #36440
* Address comments
When reporting metadata, several clients have issues with the 'ALIAS'
type. To improve compatibility and be consistent with the ANSI SQL
expectations and because they are similar, aliases targets are now
reported as views.
Close#37422
Currently all proxied actions are denied for the `SystemPrivilege`.
Unfortunately, there are use cases (CCR) where we would like to proxy
actions to a remote node that are normally performed by the
system context. This commit allows the system context to perform
proxy actions if they are actions that the system context is normally
allowed to execute.
The AbstracLifecycleComponent used to extend AbstractComponent, so it had to pass settings to the constractor of its supper class.
It no longer extends the AbstractComponent so there is no need for this constructor
There is also no need for AbstracLifecycleComponent subclasses to have Settings in their constructors if they were only passing it over to super constructor.
This is part 1. which will be backported to 6.x with a migration guide/deprecation log.
part 2 will have this constructor removed in 7
relates #35560
relates #34488
This change fixes failures in the SslMultiPortTests where we attempt to
connect to a profile on a port it is listening on but the connection
fails. The failure is due to the profile being bound to multiple
addresses and randomization will pick one of these addresses to
determine the listening port. However, the address we get the port for
may not be the address we are actually connecting to. In order to
resolve this, the test now sets the bind host for profiles to the
loopback address and uses the same address for connecting.
Closes#37481
Migrate ml job and datafeed config of open jobs and update
the parameters of the persistent tasks as they become unallocated
during a rolling upgrade. Block allocation of ml persistent tasks
until the configs are migrated.
Currently when there are no more auto follow patterns for a remote cluster then
the AutoFollower instance for this remote cluster will be removed. If
a new auto follow pattern for this remote cluster gets added quickly enough
after the last delete then there may be two AutoFollower instance running
for this remote cluster instead of one.
Each AutoFollower instance stops automatically after it sees in the
start() method that there are no more auto follow patterns for the
remote cluster it is tracking. However when an auto follow pattern
gets removed and then added back quickly enough then old AutoFollower
may never detect that at some point there were no auto follow patterns
for the remote cluster it is monitoring. The creation and removal of
an AutoFollower instance happens independently in the `updateAutoFollowers()`
as part of a cluster state update.
By adding the `removed` field, an AutoFollower instance will not miss the
fact there were no auto follow patterns at some point in time. The
`updateAutoFollowers()` method now marks an AutoFollower instance as
removed when it sees that there are no more patterns for a remote cluster.
The updateAutoFollowers() method can then safely start a new AutoFollower
instance.
Relates to #36761
This change deletes the SslNullCipherTests from our codebase since it
will have issues with newer JDK versions and it is essentially testing
JDK functionality rather than our own. The upstream JDK issue for
disabling these ciphers by default is
https://bugs.openjdk.java.net/browse/JDK-8212823.
Closes#37403
The test that remote clusters used by ML datafeeds have
a license that allows ML was not accounting for the
possibility that the remote cluster name could be
wildcarded. This change fixes that omission.
Fixes#36228
The SourceOnlySnapshotIT class tests a source only repository
using the following scenario:
starts a master node
starts a data node
creates a source only repository
creates an index with documents
snapshots the index to the source only repository
deletes the index
stops the data node
starts a new data node
restores the index
Thanks to ESIntegTestCase the index is sometimes created using a custom
data path. With such a setting, when a shard is assigned to one of the data
node of the cluster the shard path is resolved using the index custom data
path and the node's lock id by the NodeEnvironment#resolveCustomLocation().
It should work nicely but in SourceOnlySnapshotIT.snashotAndRestore(), b
efore the change in this PR, the last data node was restarted using a different
path.home. At startup time this node was assigned a node lock based on other
locks in the data directory of this temporary path.home which is empty. So it
always got the 0 lock id. And when this new data node is assigned a shard for
the index and resolves it against the index custom data path, it also uses the
node lock id 0 which conflicts with another node of the cluster, resulting in
various errors with the most obvious one being LockObtainFailedException.
This commit removes the temporary home path for the last data node so that it
uses the same path home as other nodes of the cluster and then got assigned
a correct node lock id at startup.
Closes#36330Closes#36276
Adjust FieldExtractor to handle fields which contain `.` in their
name, regardless where they fall in, in the document hierarchy. E.g.:
```
{
"a.b": "Elastic Search"
}
{
"a": {
"b.c": "Elastic Search"
}
}
{
"a.b": {
"c": {
"d.e" : "Elastic Search"
}
}
}
```
Fixes: #37128
* Default include_type_name to false for get and put mappings.
* Default include_type_name to false for get field mappings.
* Add a constant for the default include_type_name value.
* Default include_type_name to false for get and put index templates.
* Default include_type_name to false for create index.
* Update create index calls in REST documentation to use include_type_name=true.
* Some minor clean-ups around the get index API.
* In REST tests, use include_type_name=true by default for index creation.
* Make sure to use 'expression == false'.
* Clarify the different IndexTemplateMetaData toXContent methods.
* Fix FullClusterRestartIT#testSnapshotRestore.
* Fix the ml_anomalies_default_mappings test.
* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.
We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.
This commit also refactors GetMappingsResponse to follow the same appraoch
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.
* Fix more REST tests.
* Improve some wording in the create index documentation.
* Add a note about types removal in the create index docs.
* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.
* Make sure to mention include_type_name in the REST docs for affected APIs.
* Make sure to use 'expression == false' in FullClusterRestartIT.
* Mention include_type_name in the REST templates docs.
This commit removes the fallback for SSL settings. While this may be
seen as a non user friendly change, the intention behind this change
is to simplify the reasoning needed to understand what is actually
being used for a given SSL configuration. Each configuration now needs
to be explicitly specified as there is no global configuration or
fallback to some other configuration.
Closes#29797
This new `hostname` field is meant to be a replacement for its sibling `name` field. See https://github.com/elastic/beats/pull/9943, particularly https://github.com/elastic/beats/pull/9943#discussion_r245932581.
This PR simply adds the new field (`hostname`) to the mapping without removing the old one (`name`), because a user might be running an older-version Beat (without this field rename in it) with a newer-version Monitoring ES cluster (with this PR's change in it).
AFAICT the Monitoring UI isn't currently using the `name` field so no changes are necessary there yet. If it decides to start using the `name` field, it will also want to look at the value of the `hostname` field.
This is related to #35975. It implements a file based restore in the
CcrRepository. The restore transfers files from the leader cluster
to the follower cluster. It does not implement any advanced resiliency
features at the moment. Any request failure will end the restore.
Adds another field, named "request.method", to the structured logfile audit.
This field is present for all events associated with a REST request (not a
transport request) and the value is one of GET, POST, PUT, DELETE, OPTIONS,
HEAD, PATCH, TRACE and CONNECT.
Improve error messages by returning the original SQL statement
declaration instead of trying to reproduce it as the casing and
whitespaces are not preserved accurately leading to small
differences.
Close#37161
Since `full` can be common as a field name or part of a field name
(e.g.: `full.name` or `name.full`), it's nice if it's not a reserved
keyword of the grammar so a user can use it without resorting to quotes.
Fixes: #37376
SqlPlugin cannot have more than one public constructor, so for the testing
purposes the `getLicenseState()` should be overriden.
Fixes: #37191
Co-authored-by: Michael Basnight <mbasnight@gmail.com>
When this message was first added the model debug config was
the only thing that could be updated, but now more aspects of
the config can be updated so the message needs to be more
general.
* ML: Updating .ml-state calls to be able to support > 1 index
* Matching bulk delete behavior with dbq
* Adjusting state name
* refreshing indices before search
* fixing line length
* adjusting index expansion options
This is a reinforcement of #37227. It turns out that
persistent tasks are not made stale if the node they
were running on is restarted and the master node does
not notice this. The main scenario where this happens
is when minimum master nodes is the same as the number
of nodes in the cluster, so the cluster cannot elect a
master node when any node is restarted.
When an ML node restarts we need the datafeeds for any
jobs that were running on that node to not just wait
until the jobs are allocated, but to wait for the
autodetect process of the job to start up. In the case
of reassignment of the job persistent task this was
dealt with by the stale status test. But in the case
where a node restarts but its persistent tasks are not
reassigned we need a deeper test.
Fixes#36810
This adds a configurable whitelist to the HTTP client in watcher. By
default every URL is allowed to retain BWC. A dynamically configurable
setting named "xpack.http.whitelist" was added that allows to
configure an array of URLs, which can also contain simple regexes.
Closes#29937
Added warnings checks to existing tests
Added “defaultTypeIfNull” to DocWriteRequest interface so that Bulk requests can override a null choice of document type with any global custom choice.
Related to #35190
This change ensures we always countdown the latch in the
SSLConfigurationReloaderTests to prevent the suite from timing out in
case of an exception. Additionally, we also increase the logging of the
resource watcher in case an IOException occurs.
See #36053
Field of types aliases that have dots in name are returned without a
hierarchy by field_caps, as oppose to the mapping api or field with
concrete types, which in turn breaks IndexResolver.
This commit fixes this by creating the backing hierarchy similar to the
mapping api.
Close#37224
Jobs created in version 6.1 or earlier can have a
null model_memory_limit. If these are parsed from
cluster state following a full cluster restart then
we replace the null with 4096mb to make the meaning
explicit. But if such jobs are streamed from an
old node in a mixed version cluster this does not
happen. Therefore we need to account for the
possibility of a null model_memory_limit in the ML
memory tracker.
This commit reorders the realm list for iteration based on the last
successful authentication for the given principal. This is an
optimization to prevent unnecessary iteration over realms if we can
make a smart guess on which realm to try first.
Fail with a 403 when indexing a document directly into a follower index.
In order to test this change, I had to move specific assertions into a dedicated class and
disable assertions for that class in the rest qa module. I think that is the right trade off.
If a running shard follow task needs to be restarted and
the remote connection seeds have changed then
a shard follow task currently fails with a fatal error.
The change creates the remote client lazily and adjusts
the errors a shard follow task should retry.
This issue was found in test failures in the recently added
ccr rolling upgrade tests. The reason why this issue occurs
more frequently in the rolling upgrade test is because ccr
is setup in local mode (so remote connection seed will become stale) and
all nodes are restarted, which forces the shard follow tasks to get
restarted at some point during the test. Note that these tests
cannot be enabled yet, because this change will need to be backported
to 6.x first. (otherwise the issue still occurs on non upgraded nodes)
I also changed the RestartIndexFollowingIT to setup remote cluster
via persistent settings and to also restart the leader cluster. This
way what happens during the ccr rolling upgrade qa tests, also happens
in this test.
Relates to #37231
* Tests: Add ElasticsearchAssertions.awaitLatch method
Some tests are using assertTrue(latch.await(...)) in their code. This
leads to an assertion error without any error message. This adds a
method which has a nicer error message and can be used in tests.
* fix forbidden apis
* fix spaces
* provide overriden `hashCode` and toString methods to account for `DISTINCT`
* change the analyzer for scenarios where `COUNT <field_name>` and `COUNT DISTINCT` have different paths
* defined a new `filter` aggregation encapsulating an `exists` query to filter out null or missing values
* ML: add migrate anomalies assistant
* adjusting failure handling for reindex
* Fixing request and tests
* Adding tests to blacklist
* adjusting test
* test fix: posting data directly to the job instead of relying on datafeed
* adjusting API usage
* adding Todos and adjusting endpoint
* Adding types to reindexRequest
* removing unreliable "live" data test
* adding index refresh to test
* adding index refresh to test
* adding index refresh to yaml test
* fixing bad exists call
* removing todo
* Addressing remove comments
* Adjusting rest endpoint name
* making service have its own logger
* adjusting validity check for newindex names
* fixing typos
* fixing renaming
This commit fixes a race condition in a test introduced by #36900 that
verifies concurrent authentications get a result propagated from the
first thread that attempts to authenticate. Previously, a thread may
be in a state where it had not attempted to authenticate when the first
thread that authenticates finishes the authentication, which would
cause the test to fail as there would be an additional authentication
attempt. This change adds additional latches to ensure all threads have
attempted to authenticate before a result gets returned in the
thread that is performing authentication.
Closing a channel using TLS/SSL requires reading and writing a
CLOSE_NOTIFY message (for pre-1.3 TLS versions). Many implementations do
not actually send the CLOSE_NOTIFY message, which means we are depending
on the TCP close from the other side to ensure channels are closed. In
case there is an issue with this, we need a timeout. This commit adds a
timeout to the channel close process for TLS secured channels.
As part of this change, we need a timer service. We could use the
generic Elasticsearch timeout threadpool. However, it would be nice to
have a local to the nio event loop timer service dedicated to network needs. In
the future this service could support read timeouts, connect timeouts,
request timeouts, etc. This commit adds a basic priority queue backed
service. Since our timeout volume (channel closes) is very low, this
should be fine. However, this can be updated to something more efficient
in the future if needed (timer wheel). Everything being local to the event loop
thread makes the logic simple as no locking or synchronization is necessary.
We already had logic to stop datafeeds running against
jobs that were OPENING, but a job that relocates from
one node to another while OPENED stays OPENED, and this
could cause the datafeed to fail when it sent data to
the OPENED job on its new node before it had a
corresponding autodetect process.
This change extends the check to stop datafeeds running
when their job is OPENING _or_ stale (i.e. has not had
its status reset since relocating to a different node).
Relates #36810
This bug was introduced in #36893 and had the effect that
execution would continue after calling onFailure on the the
listener in checkIfTokenIsValid in the case that the token is
expired. In a case of many consecutive requests this could lead to
the unwelcome side effect of an expired access token producing a
successful authentication response.
After #30794, our caching realms limit each principal to a single auth
attempt at a time. This prevents hammering of external servers but can
cause a significant performance hit when requests need to go through a
realm that takes a long time to attempt to authenticate in order to get
to the realm that actually authenticates. In order to address this,
this change will propagate failed results to listeners if they use the
same set of credentials that the authentication attempt used. This does
prevent these stalled requests from retrying the authentication attempt
but the implementation does allow for new requests to retry the
attempt.
* Use `_count` aggregation value only for not-DISTINCT COUNT function calls
* COUNT DISTINCT will use the _exact_ version of a field (the `keyword` sub-field for example), if there is one
This commit implements a straightforward approach to retention lease
expiration. Namely, we inspect which leases are expired when obtaining
the current leases through the replication tracker. At that moment, we
clean the map that persists the retention leases in memory.