A rolling upgrade from oss Elasticsearch to the default distribution of
Elasticsearch is significantly different than a full cluster restart to
install a plugin and is again different from starting a new cluster with
xpack installed. So this adds some basic tests to make sure that the
rolling upgrade that enables xpack works at all.
This also removes some unused imports from the tests that I modified in
PR #30728. I didn't mean to leave them.
Switches the rolling upgrade tests from upgrading two nodes to upgrading
three nodes which is much more realistic and much better able to find
unexpected bugs. It upgrades the nodes one at a time and runs tests
between each upgrade. As such this now has four test runs:
1. Old
2. One third upgraded
3. Two thirds upgraded
4. Upgraded
It sets system properties so the tests can figure out which stage they
are in. It reuses the same yml tests for the "one third" and "two
thirds" cases because they are *almost* the same case.
This rewrites the yml-based indexing tests to be Java based because the
yml-based tests can't handle different expected values for the counts.
And the indexing tests need that when they are run twice.
Applying the rest test gradle plugin already uses the zip distribution
by default, so specifying it explicitly is not necessary. These are
leftovers from before zip was the default for rest tests.
Nodes are reusing task ids after restart. So in some rare circumstances
the same task id might be assigned to the reindexing task stored by the
old cluster and the new task that is trying to retrieve the task
results. As a result, the get task request can timeout waiting on
itself. Since we already waited for the task to finish before restarting
the cluster, waiting for the task here doesn't make any sense to start
with.
Fixes#28732
Generalizing BWC building so that there is less code to modify for a release. This ensures we do not
need to think about what major or minor version is in the gradle code. It follows the general rules of the
elastic release structure. For more information on the rules, see the VersionCollection's javadoc.
This also removes the additional bwc snapshots that will never be released, such as 6.0.2, which were
being built and tested against every time we ran bwc tests.
Additionally, it creates 4 new projects that correspond to the different types of snapshots that may exist
for a given version. Its possible to now run those individual tasks to work out bwc logic whereas
previously it was impossible and the entire suite of bwc tests had to be run to work out any logic
changes in the build tools' bwc project. Please note that if the project does not make sense for the
version that is current, that an error will be thrown from that individual project if an attempt is made to
run it.
This should allow for automating the version bumps as well, since it removes all the hardcoded version
logic from the configs.
The configuration of the upgraded cluster task was missing a dependency
on the stopping of the second old node in the cluster. In some cases
(e.g., --parallel) Gradle would then try to run the configuration of a
node in the upgraded cluster before it had even configured the old nodes
in the cluster.
Relates #28036
extract all clauses from a conjunction query.
When clauses from a conjunction are extracted the number of clauses is
also stored in an internal doc values field (minimum_should_match field).
This field is used by the CoveringQuery and allows the percolator to
reduce the number of false positives when selecting candidate matches and
in certain cases be absolutely sure that a conjunction candidate match
will match and then skip MemoryIndex validation. This can greatly improve
performance.
Before this change only a single clause was extracted from a conjunction
query. The percolator tried to extract the clauses that was rarest in order
(based on term length) to attempt less candidate queries to be selected
in the first place. However this still method there is still a very high
chance that candidate query matches are false positives.
This change also removes the influencing query extraction added via #26081
as this is no longer needed because now all conjunction clauses are extracted.
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extractionCloses#26307
Today all these API calls have a sideeffect of making documents visible
to search requests. While this is sometimes desired it's an unnecessary sideeffect
and now that we have an internal (engine-private) index reader (#26972) we artificially
add a refresh call for bwc. This change removes this sideeffect in 7.0.
The rolling-upgrade test was only writing the "minimum_master_nodes" setting to the configuration file of the old nodes, but not the upgraded ones.
Also changes the value of "minimum_master_nodes" from "number_of_nodes" to "(number_of_nodes / 2) + 1".
The ensure green approach to avoid allocation delays caused problems with other indices created by other tests which didn't use ensure green in the various cluster stages. This aligns testHistoryUUIDIsGenerated to use the same approach used by the other test.
This commit increases the amount of time to wait for green to accound for unassigned shards that
have been delayed. The default delay is 60s, so we need to wait longer than that. Previously, the
wait would timeout at 30s due to the rest client and the default for the cluster health api.
Closes#26742
The test starts with two old nodes and creates indices (without waiting for green, which is fixed here too). Then it restarts one of the nodes and waits for it to join the cluster. This wait condition only uses wait for yellow as our generic infra doesn't how many nodes are there in total. Once the restarted node is part of the cluster (mixed mode) the second old node is restarted. If indices are not fully allocated when that happens, the shards will go into delayed unassigned mode. If the recovery of the replica never completed we may end up with corrupted / no secondary copy on the node. This will cause the shards to be delayed for 1m before being reassigned and the test will time out.
Restoring a shard from snapshot throws the primary back in time violating assumptions and bringing the validity of global checkpoints in question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `histroy_uuid` in #26577 and required that both source and target will have the same history to allow ops based recoveries. This PR make sure that a shard gets a new uuid after restore.
As suggested by @ywelsch , I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different than the approach in #26557, I went this way as it means that shards always have a history uuid after being recovered on a 6.x node and will also mean that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid.
Relates #10708Closes#26544
This commit updates the version for master to 7.0.0-alpha1. It also adds
the 6.1 version constant, and fixes many tests, as well as marking some
as awaits fix.
Closes#25893Closes#25870
This commit removes a rolling upgrade test for scripting that is totally
busted yet is preventing builds from succeeding. We elect to remove this
test as opposed to skipping the test as:
- it has beeen being skipped for months with no apparent loss
- it appears to need significant work to get to an unbusted state
In the rolling upgrade tests, there is a test to create an index with
replica shards and ensure that in the mixed cluster environment, the
cluster health is green before any other tests are executed. However,
there were two problems with this. First, if the replica shard was
residing on the restarted node, then delayed allocation will kick in and
cause the cluster health request to timeout after 1m. The fix to this
was to drastically lower the delayed allocation setting. Second, if the
primary exists on the higher version node, then the replica cannot be
assigned to the lower version node because recovery cannot happen from
lower lucene versions. The fix here was to wait for the cluster health
to be yellow instead of green in the mixed cluster environment. In the
fully upgraded cluster, the cluster health check waits for a green
cluster as before.
Closes#25185
In 6.x we prevent multiple types and default to `index.mapping.single_type: false`
This change removes the registered setting and ensures that it's preserved for
5.x indices.
Relates to #24961
This commit adds a gradle project, set inside the root build.gradle,
which controls all our bwc tests. This allows for seamless (ie no errant
CI failures) backporting of behavior.
In #25201, a setting was added to allow setting the retry timeout for the rest client under the
impression that this would allow requests to go longer than 30s. However, there is also a socket
timeout that needs to be set to greater than 30s, which this change adds a setting for.
This commit adds a setting to change the request timeout for the rest client. This is useful as the
default timeout is 30s, which is also the same default for calls like cluster health. If both are
the same then the response from the cluster health api will not be received as the client usually
times out first making test failures harder to debug.
Relates #25185
This commit adds back "id" as the key within a script to specify a
stored script (which with file scripts now gone is no longer ambiguous).
It also adds "source" as a replacement for "code". This is in an attempt
to normalize how scripts are specified across both put stored scripts and script usages, including search template requests. This also deprecates the old inline/stored keys.
We default to 0 replicas in the rolling restart scenario already to ensure
we test against worst case. Yet, this adds a dummy index to ensure we also
recover and index with replicas just fine.
These tests spin up two nodes of an older version of Elasticsearch,
create some stuff, shut down the nodes, start the current version,
and verify that the created stuff works.
You can run `gradle qa:full-cluster-restart:check` to run these
tests against the head of the previous branch of Elasticsearch
(5.x for master, 5.4 for 5.x, etc) or you can run
`gradle qa:full-cluster-restart:bwcTest` to run this test against
all "index compatible" versions, one after the other. For master
this is every released version in the 5.x.y version *and* the tip
of the 5.x branch.
I'd love to add more to these tests in the future but these
currently just cover the functionality of the `create_bwc_index.py`
script and start to cover the assertions in the
`OldIndexBackwardsCompatibilityIT` test.
This commit changes the rolling upgrade test to create a set of rest
test tasks per wire compat version. The most recent wire compat version
is always tested with the `integTest` task, and all versions can be
tested with `bwcTest`.
This commit expands the logic for version extraction from Version.java
to include a list of all versions for backcompat purposes. The tests
using bwcVersion are converted to use this list, but those tests
(rolling upgrade and backwards-5.0) are still not randomized; that will
happen in another followup.
This commit renames all rest test files to use the .yml extension
instead of .yaml. This way the extension used within all of
elasticsearch for yaml is consistent.
In #24251 we fix an issue with stored search templates that
this test would have discovered: stored search templates cause
the node to refuse to start. Technically a "restart" test would
have caught this as well and would have caught it more quickly.
But we already *have* an upgrade test and we don't have restart tests.
And testing this on upgrade is a good thing too.
This change simplifies how the rest test runner finds test files and
removes all leniency. Previously multiple prefixes and suffixes would
be tried, and tests could exist inside or outside of the classpath,
although outside of the classpath never quite worked. Now only classpath
tests are supported, and only one resource prefix is supported,
`/rest-api-spec/tests`.
closes#20240
After splitting integ tests into cluster configuration and the test
runner task, we still have dependencies of the test runner added as deps
of the cluster. This commit adds dependencies directly to the cluster,
so that the runner can have other dependencies independent of what is
needed for the cluster.
We currently have the last minor version of the previous major hardcoded
in tests like rolling upgrade. This change programatically finds this
during gradle initialization by parsing versions from Version.java.
Gradle's finalizedBy on tasks only ensures one task runs after another,
but not immediately after. This is problematic for our integration tests
since it allows multiple project's integ test clusters to be
simultaneously. While this has not been a problem thus far (gradle 2.13
happened to keep the finalizedBy tasks close enough that no clusters
were running in parallel), with gradle 3.3 the task graph generation has
changed, and numerous clusters may be running simultaneously, causing
memory pressure, and thus generally slower tests, or even failure if the
system has a limited amount of memory (eg in a vagrant host).
This commit reworks how integ tests are configured. It adds an
`integTestCluster` extension to gradle which is equivalent to the current
`integTest.cluster` and moves the rest test runner task to
`integTestRunner`. The `integTest` task is then just a dummy task,
which depends on the cluster runner task, as well as the cluster stop
task. This means running `integTest` in one project will both run the
rest tests, and shut down the cluster, before running `integTest` in
another project.
This commit enforces the requirement of Content-Type for the REST layer and removes the deprecated methods in transport
requests and their usages.
While doing this, it turns out that there are many places where *Entity classes are used from the apache http client
libraries and many of these usages did not specify the content type. The methods that do not specify a content type
explicitly have been added to forbidden apis to prevent more of these from entering our code base.
Relates #19388
Currently, stored scripts use a namespace of (lang, id) to be put, get, deleted, and executed. This is not necessary since the lang is stored with the stored script. A user should only have to specify an id to use a stored script. This change makes that possible while keeping backwards compatibility with the previous namespace of (lang, id). Anywhere the previous namespace is used will log deprecation warnings.
The new behavior is the following:
When a user specifies a stored script, that script will be stored under both the new namespace and old namespace.
Take for example script 'A' with lang 'L0' and data 'D0'. If we add script 'A' to the empty set, the scripts map will be ["A" -- D0, "A#L0" -- D0]. If a script 'A' with lang 'L1' and data 'D1' is then added, the scripts map will be ["A" -- D1, "A#L1" -- D1, "A#L0" -- D0].
When a user deletes a stored script, that script will be deleted from both the new namespace (if it exists) and the old namespace.
Take for example a scripts map with {"A" -- D1, "A#L1" -- D1, "A#L0" -- D0}. If a script is removed specified by an id 'A' and lang null then the scripts map will be {"A#L0" -- D0}. To remove the final script, the deprecated namespace must be used, so an id 'A' and lang 'L0' would need to be specified.
When a user gets/executes a stored script, if the new namespace is used then the script will be retrieved/executed using only 'id', and if the old namespace is used then the script will be retrieved/executed using 'id' and 'lang'
This is related to #22116. URLRepository requires SocketPermission
connect. This commit introduces a new module called "repository-url"
where URLRepository will reside. With the new module, permissions can
be removed from core.