Currently the BulkProcessor class uses a single scheduler to schedule
flushes and retries. Functionally these are very different concerns but
can result in a dead lock. Specifically, the single shared scheduler
can kick off a flush task, which only finishes it's task when the bulk
that is being flushed finishes. If (for what ever reason), any items in
that bulk fails it will (by default) schedule a retry. However, that retry
will never run it's task, since the flush task is consuming the 1 and
only thread available from the shared scheduler.
Since the BulkProcessor is mostly client based code, the client can
provide their own scheduler. As-is the scheduler would require
at minimum 2 worker threads to avoid the potential deadlock. Since the
number of threads is a configuration option in the scheduler, the code
can not enforce this 2 worker rule until runtime. For this reason this
commit splits the single task scheduler into 2 schedulers. This eliminates
the potential for the flush task to block the retry task and removes this
deadlock scenario.
This commit also deprecates the Java APIs that presume a single scheduler,
and updates any internal code to no longer use those APIs.
Fixes#47599
Note - #41451 fixed the general case where a bulk fails and is retried
that can result in a deadlock. This fix should address that case as well as
the case when a bulk failure *from the flush* needs to be retried.
This commit fixes the names of the UBI-based Docker build contexts to
lift the ubi component of the name into the archive base name, instead
of the classifier.
* [ML] Add new geo_results.(actual_point|typical_point) fields for `lat_long` results (#47050)
[ML] Add new geo_results.(actual_point|typical_point) fields for `lat_long` results (#47050)
Related PR: https://github.com/elastic/ml-cpp/pull/809
* adjusting bwc version
We were tripping the assertion that the makes sure we only have empty `ShardGenerations` in `RepositoryData` in the BwC case because shard generations were passed to the `Repository` in the BwC case. Fixed by only generating empty shard gen for BwC snapshots in `SnapshotsService`.
The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.
Related: #48381
Related: #48434
Related: #47950
Related: #40178
Backport of #48450.
Make a number of changes so that code in the `server` directory is more
resilient to automatic formatting. This covers:
* Reformatting multiline JSON to embed whitespace in the strings
* Move some comments around to they aren't auto-formatted to a strange
place. This also required moving some `&&` and `||` operators from the
end-of-line to start-of-line`.
* Add helper method `reformatJson()`, to strip whitespace from a JSON
document using XContent methods. This is sometimes necessary where
a test is comparing some machine-generated JSON with an expected
value.
Also, `HyperLogLogPlusPlus.java` is now excluded from formatting because it
contains large data tables that don't reformat well with the current settings,
and changing the settings would be worse for the rest of the codebase.
Backport of #48898.
We no longer configure distributions for prior versions for Docker. This
is because doing so prompts Gradle to try and resolve the Docker
dependencies, which doesn't work as they can't be downloaded via Ivy
(configured in DistributionDownloadPlugin). Since we need these for the
BATS upgrade tests, and those tests only cover .rpm and .deb, it's OK to
omit creating such distributions in the first place. We may need to
revisit this in the future, to allow upgrade testing using Docker
containers.
* Add links to infra-stats for scans generated in CI
It turns out we already gather system logs in infra-stats, and we have
system metrics too there.
This PR adds a links to the logs we gather for the host the build is
runnig on.
And a link to the host overview in the infrastructure app tuned to 5
minutes from before gradle started to 5 minutes after the scan was
generated.
* add buildFinished
The realtime GET API currently has erratic performance in case where a document is accessed
that has just been indexed but not refreshed yet, as the implementation will currently force an
internal refresh in that case. Refreshing can be an expensive operation, and also will block the
thread that executes the GET operation, blocking other GETs to be processed. In case of
frequent access of recently indexed documents, this can lead to a refresh storm and terrible
GET performance.
While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted
to read from the translog in case of realtime GET API or update API, this was removed in 5.0
(#20102) to avoid inconsistencies between values that were returned from the translog and
those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and
upsert to read from the translog again as it was easier to guarantee consistency for these, and
also brought back more predictable performance characteristics of this API. Calls to the realtime
GET API, however, would still always do a refresh if necessary to return consistent results. This
means that users that were calling realtime GET APIs to coordinate updates on client side
(realtime GET + CAS for conditional index of updated doc) would still see very erratic
performance.
This PR (together with #48707) resolves the inconsistencies between reading from translog and
index. In particular it fixes the inconsistencies that happen when requesting stored fields, which
were not available when reading from translog. In case where stored fields are requested, this
PR will reparse the _source from the translog and derive the stored fields to be returned. With
this, it changes the realtime GET API to allow reading from the translog again, avoid refresh
storms and blocking the GET threadpool, and provide overall much better and predictable
performance for this API.
We should not open new engines if a shard is closed. We break this
assumption in #45263 where we stop verifying the shard state before
creating an engine but only before swapping the engine reference.
We can fail to snapshot the store metadata or checkIndex a closed shard
if there's some IndexWriter holding the index lock.
Closes#47060
This change fixes a poisonous situation where an ongoing recovery was
canceled because a better copy was found on a node that the cluster had
previously tried allocating the shard to but failed. The solution is to
keep track of the set of nodes that an allocation was failed on so that
we can avoid canceling the current recovery for a copy on failed nodes.
Closes#47974
The first example of splitting rules for the `word_delimiter` token filter was spread across two bullet points. This makes it look like they are two separate splitting rules.
CCR follower stats can return information for persistent tasks that are in the process of being cleaned up. This is problematic for tests where CCR follower indices have been deleted, but their persistent follower task is only cleaned up asynchronously afterwards. If one of the following tests then accesses the follower stats, it might still get the stats for that follower task.
In addition, some tests were not cleaning up their auto-follow patterns, leaving orphaned patterns behind. Other tests cleaned up their auto-follow patterns. As always the same name was used, it just depended on the test execution order whether this led to a failure or not. This commit fixes the offensive tests, and will also automatically remove auto-follow-patterns at the end of tests, like we do for many other features.
Closes #48700
Added test demonstrating that grok using ignore case works, since this
does a minimal test that the `joni` and `jcodings` libraries are
compatible.
Forward-port of test from #43334
Similarly to what has be done for Azure in #48636, this commit
adds a new :test:fixtures:gcs-fixture project which provides two
docker-compose based fixtures that emulate a Google Cloud
Storage service.
Some code has been extracted from existing tests and placed
into this new project so that it can be easily reused in other
projects.
Backport of #48883.
Per elastic/infra#15864, the Elasticsearch CI images are failing due to
a packer_cache failure. This is because Gradle is trying to resolve
a `.docker` file through the Ivy repository, which doesn't work. Disable
the Docker tests again until we figure out the way forward.
Today if the primary discovers that an indexing request needs a mapping update
then it will send it to the master for validation and processing. If, however,
the put-mapping request is invalid then the master still processes it as a
(no-op) cluster state update. When there are a large number of indexing
operations that result in invalid mapping updates this can overwhelm the
master.
However, the primary already has a reasonably up-to-date mapping against which
it can check the (approximate) validity of the put-mapping request before
sending it to the master. For instance it is not possible to remove fields in a
mapping update, so if the primary detects that a mapping update will exceed the
fields limit then it can reject it itself and avoid bothering the master.
This commit adds a pre-flight check to the mapping update path so that the
primary can discard obviously-invalid put-mapping requests itself.
Fixes#35564
Backport of #48817
Backport of #46599 and #47640. Add packaging tests for Docker.
* Introduce packaging tests for Docker (#46599)
Closes#37617. Add packaging tests for our Docker images, similar to what
we have for RPMs or Debian packages. This works by running a container and
probing it e.g. via `docker exec`. Test can also be run in Vagrant, by
exporting the Docker images to disk and loading them again in VMs. Docker
is installed via `Vagrantfile` in a selection of boxes.
* Only define Docker pkg tests if Docker is available (#47640)
Closes#47639, and unmutes tests that were muted in b958467.
The Docker packaging tests were being defined irrespective of whether
Docker was actually available in the current environment. Instead,
implement exclude lists so that in environments where Docker is not
available, no Docker packaging tests are defined. For CI hosts, the build
checks `.ci/dockerOnLinuxExclusions`. The Vagrant VMs can defined the
extension property `shouldTestDocker` property to opt-in to packaging
tests.
As part of this, define a seperate utility class for checking Docker,
and call that instead of defining checks in-line in BuildPlugin.groovy
We might have some outstanding renew retention lease requests after a
shard has unfollowed. If testRetentionLeaseIsAddedIfItDisappearsWhileFollowing
intercepts a renew request from other tests then we will never unlatch
and the test will time out.
Closes#45192
This test failure manifests the limitation of the recovery source merge
policy explained in #41628. If we already merge down to a single segment
then subsequent force merges will be noop although they can prune
recovery source. We need to adjust this test until we have a fix for the
merge policy.
Relates #41628Closes#48735
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
Fixes#48122
testRetentionLeaseIsAddedIfItDisappearsWhileFollowing is still failing
although we already have several fixes. I think other tests interfere
and cause this test to fail. We can use the test scope to isolate them.
However, I prefer to add debug logs so we can find the source.
Relates #45192
This commit fixes the name of the UBI-based Docker image build contexts
to include "7" (to set us up for the future where we are likely to have
a ubi8-based image).
Previously this step moved to the forcemerge step, however, if the
machine running the test was fast enough, it would execute the
forcemerge and move to the next step (`segment-count`) so the comparison
would fail. This commit changes the step to be a step that will never go
anywhere else, the terminal step.
Resolves#48761
The problem with wrapping here is that it converts any exception into an
IAE, which we treat as a client error (400 status) whereas the exception
being wrapped here could be a server error (e.g., NPE). This commit
stops wrapping all ingest processor exceptions as IAEs.