This is in preparation to move to nested virtualization which is much slower
than the bare metal setup we use right now, but parallelizes better
resulting in a net win.t
We no longer run the sample tests in CI, so it's safe to create a task
for every project.
This will make it easier to set them up in a matrix like fashion.
This is a follow up of https://github.com/elastic/elasticsearch/issues/43453 where we added
a system property to disallow allocation awareness in search requests. Since search requests
will no longer check the allocation awareness attributes for routing in the next major version,
this change adds a deprecation warning on any setup that uses these attributes.
Relates #43453
* Use versions specific distribution folders so we don't need to clean up (#46539)
* Retry deleting distro dir on windows
When retarting the cluster we clean up old distribution files that might
still be in use by the OS.
Windows closes resources of ded processes async, so we do a couple of
retries to get arround it.
Closes#46014
* Avoid having to delete the distro folder.
* Remove the use of ClusterFormationTasks form RestTestTask (#47022)
This PR removes a use-case of the ClusterFormationTasks and converts a
project that flew under the radar so far.
There's probably more clean-up possible here, but for now the goal is
to be able to remove that code after `RunTask` is also updated.
* Migrate some 7.x only projects
* Bwc testclusters all (#46265)
Convert all bwc projects to testclusters
* Fix bwc versions config
* WIP fix rolling upgrade
* Fix bwc tests on old versions
* Fix rolling upgrade
On windows, it happens that the process we called terminates but some
other process it creates still has the same output strems and thus the
files open, so we can't clean it up.
This PR makes the cleanup a best effort.
This PR makes the necesary adaptations to the tests and adds a power shell script to
invoke the OS tests on GCP instances connected as CI workers.
Also noticed that logs were not being produced by the tests and that theses were not using log4j so fixed that too.
One of the difficulties in working on theses tests was that the tests just stalled with no indication where the problem is.
To ease with the debugging, after process explorer suggested that the tests are running some commands, we now have multiple timeouts: one for the tests ( which will generate a thread dump ) and one for individual commands ( that bails with the command being ran and output and error so far ) to make it easier to see what went wrong.
The tests were blocking because apparently the pipes to the sub-process were not closing, thus the threads were blocking on them and we were blocking indefinitely on the join. I'm not sure why this doesn't happen in vagrant, but we now properly deal with it.
The pattern in the latest failure is similar to the source fixed in #46956
but relates to synced-flush. If peer recovery happens after indexing,
and indexing flushes some shard at the end, then a synced flush in the
test will not roll or commit translog.
Closes#46712
Changes auto-id index requests to use optype CREATE, making it compliant with our docs.
This will also make these auto-id index requests compatible with the new "create-doc" index
privilege (which is based on the optype), the default optype is changed to create, just as it is
already documented.
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
* Add support for bwc for testclusters and convert full cluster restart (#45374)
* Testclusters fix bwc (#46740)
Additions to make testclsuters work with lather versions of ES
* Do common node config on bwc tests
Before this PR we always ever ran `ElasticsearchCluster.start` once, and
the common node config was never done.
This becomes apparent in upgrading from `6.x` to `7.x` as the new config
is missing preventing the cluster from starting.
* Do common node config on bwc tests
Before this PR we always ever ran `ElasticsearchCluster.start` once, and
the common node config was never done.
This becomes apparent in upgrading from `6.x` to `7.x` as the new config
is missing preventing the cluster from starting.
* Fix logic to pick up snapshot from 6.x
* Make sure ports are cleared
* Fix test
* Don't clear all the config as we rely on it
* Fix removal of keys
Since the bundled jdk was added to Elasticsearch, there are now 2 ways
java can be missing. Either JAVA_HOME is set but does not exist, or the
bundled jdk does not exist. This commit improves the error messages in
those two cases, and also ensures our tests cover both cases.
This commit removes a use of Setting#getRaw from the deprecation header
tests. The use of Setting#getRaw is not needed here, the x-content
infrastructure will take care of emitting the appropriate values here,
and so the caller does not need to convert these to string
representations of the settings values.
The archives stopElasticsearch utility method sends SIGTERM to the
elasticsearch process, but does not wait for it to exit. That can cause
subsequent tests to sometimes file. This commit adds wait logic to both
linux and windows for the stopElasticsearch method.
closes#44501
The test for java home with special characters on linux would create a
temporary java home under /home/elasticsearch. But our packaging
assertions expect that to not exist. Unfortunately this would fail much
later when the checks were actually done in bats tests. This commit
fixes the linux test to match the behavior of windows, which links the
entire java directory, and now does it into a /tmp dir.
closes#45903
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.
There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.
I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.
Other changes:
* Rework `waitForDocs` to scale its timeout. Instead of calling
`assertBusy` in a loop, work out a reasonable overall timeout and await
just once.
* Some tests failed after switching to `assertBusy` and had to be fixed.
* Correct the expect templates in AbstractUpgradeTestCase. The ES
Security team confirmed that they don't use templates any more, so
remove this from the expected templates. Also rewrite how the setup
code checks for templates, in order to give more information.
* Remove an expected ML template from XPackRestTestConstants The ML team
advised that the ML tests shouldn't be waiting for any
`.ml-notifications*` templates, since such checks should happen in the
production code instead.
* Also rework the template checking code in `XPackRestTestHelper` to give
more helpful failure messages.
* Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
This is the Java side of https://github.com/elastic/ml-cpp/pull/593
with a fallback so that ml-cpp bundles with either the
new or old directory structure work for the time being.
A few days after merging the C++ changes a followup to
this change will be made that removes the fallback.
Previously, queries on the _index field were not able to specify index aliases.
This was a regression in functionality compared to the 'indices' query that was
deprecated and removed in 6.0.
Now queries on _index can specify an alias, which is resolved to the concrete
index names when we check whether an index matches. To match a remote shard
target, the pattern needs to be of the form 'cluster:index' to match the
fully-qualified index name. Index aliases can be specified in the following query
types: term, terms, prefix, and wildcard.
If peer recovery happens after indexing, and indexing flushes some shard
at the end, then the explicit flush in the test will be a noop. Then
replicas will have some uncommitted translog , which is transferred in
peer recovery, although all of these operations are in the commit
already. If that replica becomes primary (after we restarted the
cluster), it will have translog to replay and the test will fail.
Another issue in this test is that synced_flush is not a replication
action, then the global checkpoint on replicas might be not up to date.
We need to either wait for the global checkpoint to be synced or call a
replication action to sync it.
Closes#46712
This test verifies automatic cancellation of search requests on connection close.
It was previously not present in 7.x as the http client was subject do a bug which
made testing cancellation of requests impossible. Now that the bug is fixed upstream,
we can also backport this test
This is a follow up of #19191 for 7.x.
This change adds a system property called "es.routing.search_ignore_awareness_attributes" that when set to true will
effectively ignore allocation awareness attributes when routing search and get requests. This is now the default in 8.x so this
commit adds a way to opt-in to this new behavior in a minor version of 7.x.
Relates #45735
When the Elasticsearch process does not have write permissions to
upgrade the Elasticsearch keystore, we bail with an error message that
indicates there is a filesystem permissions problem. This commit
clarifies that error message by pointing out the directory where write
permissions are required, or that the user can also run the
elasticsearch-keystore upgrade command manually before starting the
Elasticsearch process. In this case, the upgrade would not be needed at
runtime, so the permissions would not be needed then.
We hit a bug where we can't partially update documents created in a
mixed cluster between 5.x and 6.x. Although this bug does not affect
7.0 or later, we should have a good test that catches this issue.
Relates #46198
* Pass COMPUTERNAME env var to elasticsearch.bat
When we run bin/elasticsearch with bash, we get a $HOSTNAME builtin that
contains the hostname of the machine the script is running on. When
there's no provided nodename, Elasticsearch uses the HOSTNAME to create
a nodename. On Windows, Powershell provides a $COMPUTERNAME variable for
the same purpose. CMD.EXE provides the same thing, except it's called
%COMPUTERNAME%. bin/elasticsearch.bat sets $HOSTNAME to the value of
$COMPUTERNAME. However, when testclusters invokes bin/elasticsearch.bat,
the COMPUTERNAME variable doesn't get passed in, leaving HOSTNAME null
and breaking an integration test on Windows.
This commit sets COMPUTERNAME in the environment so that our tests get
the value that Elasticsearch would have when bin/elasticsearch.bat is
invoked from the shell.
* Add null check to protect in non-Windows case
What good is it a developer to gain the whole Windows if they forfeit
their Unix? The value that fixes things on Windows is null on
Linux/Darwin, so let's null-check it.
* Override system hostnames for testclusters
Rather than relying on variable system behavior, let's just override
HOSTNAME and COMPUTERNAME and test for correct values in the integration
test that was originally failing.
* Rename constants for clarity
Since we are setting HOSTNAME and COMPUTERNAME regardless of whether the
tests are running on Windows or Linux, we shouldn't imply that constants
are only used in one case or the other.
This commit moves many features of individual distro tests into the base
class so that other test cases can utilize them. It also standardizes
the pattern for tests adding assumptions for the particular
distributions to test.
Most of our CLI tools use the Terminal class, which previously did not provide methods for writing to standard output. When all output goes to standard out, there are two basic problems. First, errors and warnings are "swallowed" in pipelines, making it hard for a user to know when something's gone wrong. Second, errors and warnings are intermingled with legitimate output, making it difficult to pass the results of interactive scripts to other tools.
This commit adds a second set of print commands to Terminal for printing to standard error, with errorPrint corresponding to print and errorPrintln corresponding to println. This leaves it to developers to decide which output should go where. It also adjusts existing commands to send errors and warnings to stderr.
Usage is printed to standard output when it's correctly requested (e.g., bin/elasticsearch-keystore --help) but goes to standard error when a command is invoked incorrectly (e.g. bin/elasticsearch-keystore list-with-a-typo | sort).
The java based distribution tests currently have a single Tests class
which encapsulates all of the tests for a particular distribution. The
test task in gradle then depends on all distributions being built, and
each individual tests class looks for the particular distribution it is
trying to test. This means that reproducing a single test failure
triggers all the distributions to be built, even though only one is
needed for the test.
This commit reworks the java distribution tests to pass in a particular
distribution to be tested, and changes the base test classes to be
actual test classes which have assumptions around which distributions
they operate on. For example, the archives tests will be skipped when
run with an rpm distribution, and vice versa for the package tests. This
makes reproduction much more granular. It also also better splitting up
tests around a particular use case. For example, all tests for systemd
behavior can be in one test class, and run independently of all tests
against rpm/deb distributions.
This commit addresses an issue when trying to using Elasticsearch on
systems with Sys V init and the bundled JDK was not being used. Instead,
we were still inadvertently trying to fallback on the path. This commit
removes that fallback as that is against our intentions for 7.x where we
only support the bundled JDK or an explicit JDK via JAVA_HOME.
The system level tests for our distributions have historically be run in
vagrant, and thus the name of the gradle project has been "vagrant".
However, as we move to running these tests in other environments (eg
GCP) the name vagrant no longer makes sense. This commit renames the
project to "os" (short for operating system), since these tests ensure
all of our distributions run correctly on our supported operating
systems.
The bats tests currently require many additional artifacts to be built.
In addition to the current distributions, they need all the plugins to
be installed, as well as a randomly chosen bwc distribution. This commit
splits these two cases into their own bats task, so the dependencies do
not slow down other tasks like distroTests which do not need them.
The distro test plugin was originally designed to be applied within each
subproject, per operating system we run in a VM with vagrant. However,
for efficiency, and also ease of having a single task to run in CI when
launching within individual OS VMs, having the "destructive" tasks in a
single place is more convenient. This commit reworks the distro test
plugin to be applied to the qa/vagrant project, which now creates only
the wrapper tasks in each of the subprojects for each vagrant VM.
The vagrant based tests currently reside in a single project, creating
dozens of tasks to manage starting and stopping the vagrant VM along
with running java and bats tests within each image. This all-in-one
pattern makes parallelizing packaging tests difficult.
This commit rewrites the vagrant testing infrastructure to be
independent of the actual test runners, thus allowing each platform to
be handled in a separate subproject. Additionally, the java and bats
tests are changed to be run through a "destructive" gradle task, which
is run inside the VM. The combination of these will allow
parallelization both locally (through running several VMs at once) as
well as running the destructive tasks in CI machines dedicated to each
platform (thus removing the need for vagrant in CI).
* Restrict which tasks can use testclusters
This PR fixes a problem between the interaction of test-clusters and
build cache.
Before this any task could have used a cluster without tracking it as
input.
With this change a new interface is introduced to track the tasks that
can use clusters and we do consider the cluster as input for all of
them.
This commit applies a normalization process to environment paths, both
in how they are stored internally, also their settings values. This
normalization is done via two means:
- we make the paths absolute
- we remove redundant name elements from the path (what Java calls
"normalization")
This change ensures that when we compare and refer to these paths within
the system, we are using a common ground. For example, prior to the
change if the data path was relative, we would not compare it correctly
to paths from disk usage. This is because the paths in disk usage were
being made absolute.
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.
Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
Deprecation logger was filtering log entries by key, that means that if two log messages with the same key are logged from different users, then the second log messages will be filtered.
This change allows to log deprecation message with the same key by different users.
relates #41354
backport #44587
* Mute failing test
tracked in #44552
* mute EvilSecurityTests
tracking in #44558
* Fix line endings in ESJsonLayoutTests
* Mute failing ForecastIT test on windows
Tracking in #44609
* mute BasicRenormalizationIT.testDefaultRenormalization
tracked in #44613
* fix mute testDefaultRenormalization
* Increase busyWait timeout windows is slow
* Mute failure unconfigured node name
* mute x-pack internal cluster test windows
tracking #44610
* Mute JvmErgonomicsTests on windows
Tracking #44669
* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot
Tracking #44671
* Mute NodeTests on Windows
Tracking #44256
In https://github.com/elastic/elasticsearch/pull/41913 setting up the
temp dir for ES was moved from the env script to individual cli scripts.
However, moving it to the windows service cli was missed. This commit
restores setting up the temp dir for the windows service control script.
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This is a refactor to current JSON logging to make it more open for extensions
and support for custom ES log messages used inDeprecationLogger IndexingSlowLog , SearchSLowLog
We want to include x-opaque-id in deprecation logs. The easiest way to have this as an additional JSON field instead of part of the message is to create a custom DeprecatedMessage (extends ESLogMEssage)
These messages are regular log4j messages with a text, but also carry a map of fields which can then populate the log pattern. The logic for this lives in ESJsonLayout and ESMessageFieldConverter.
Similar approach can be used to refactor IndexingSlowLog and SearchSlowLog JSON logs to contain fields previously only present as escaped JSON string in a message field.
closes#41350
backport #41354
* Testclusters: Convert additional projects
Found some more that were not using testclusters from elasticsearch-ci/1
* Allow IOException too
* Make the client more resilient