When running Elasticsearch on a flaky network, we may see nodes leaving the
cluster with reason `disconnected`. It may be useful to the cluster
administrator to see the full exception that caused the disconnection, but this
is only available with `TRACE` level logging which commingles the details of
the problem with other messages that are not useful to end users.
This commit promotes logging of exceptions in `TcpTransport` from `TRACE` to
`DEBUG` to separate them from the truly `TRACE`-level messages.
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
* Fix InternalEngineTests.testSeqNoAndCheckpoints
If we force flush while possibly triggering a merge the local checkpoint may change
from the expectation from the loop that just increments on every operation.
Closes#51604
There is no reason not to allow deletes in parallel to restores
if they're dealing with different snapshots.
A delete will not remove any files related to the snapshot that
is being restored if it is different from the deleted snapshot
because those files will still be referenced by the restoring
snapshot.
Loading RepositoryData concurrently to modifying it is concurrency
safe nowadays as well since the repo generation is tracked in the
cluster state.
Closes#41463
ActionListener.map would call listener.onFailure for exceptions from
listener.onResponse, but this means we could double trigger some
listeners which is generally unexpected. Instead, we should assume that
a listener's onResponse (and onFailure) implementation is responsible
for its own exception handling.
In #50259 we redirected stdout and stderr to log4j, to capture jdk
and external library messages. However, a typo in the method name used
to redirect the stream in java means stdout is currently being
duplicated twice, and stderr not captured. This commit corrects that
mistake. Unfortunately this is at a level that cannot really be tested,
thus we are still missing tests for this behavior.
We need to warm up the engine (i.e., perform an external refresh) before
accessing the external refresh. Note that we refresh externally before
allowing reading from a shard.
Relates #48605Closes#51548
When checking if a device is up, today we can run into virtual ethernet
devices that disappear while we are in the middle of checking. This
leads to "no such device". This commit addresses such devices by
treating them as not being up, if they are virtual ethernet devices that
disappeared while we were checking.
This commit moves the logic that cancels search requests when the rest channel is closed
to a generic client that can be used by other APIs. This will be useful for any rest action
that wants to cancel the execution of a task if the underlying rest channel is closed by the
client before completion.
Relates #49931
Relates #50990
Relates #50990
* Allow Repository Plugins to Filter Metadata on Create
Add a hook that allows repository plugins to filter the repository metadata
before it gets written to the cluster state.
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.
This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
* Reload secure settings with password (#43197)
If a password is not set, we assume an empty string to be
compatible with previous behavior.
Only allow the reload to be broadcast to other nodes if TLS is
enabled for the transport layer.
* Add passphrase support to elasticsearch-keystore (#38498)
This change adds support for keystore passphrases to all subcommands
of the elasticsearch-keystore cli tool and adds a subcommand for
changing the passphrase of an existing keystore.
The work to read the passphrase in Elasticsearch when
loading, which will be addressed in a different PR.
Subcommands of elasticsearch-keystore can handle (open and create)
passphrase protected keystores
When reading a keystore, a user is only prompted for a passphrase
only if the keystore is passphrase protected.
When creating a keystore, a user is allowed (default behavior) to create one with an
empty passphrase
Passphrase can be set to be empty when changing/setting it for an
existing keystore
Relates to: #32691
Supersedes: #37472
* Restore behavior for force parameter (#44847)
Turns out that the behavior of `-f` for the add and add-file sub
commands where it would also forcibly create the keystore if it
didn't exist, was by design - although undocumented.
This change restores that behavior auto-creating a keystore that
is not password protected if the force flag is used. The force
OptionSpec is moved to the BaseKeyStoreCommand as we will presumably
want to maintain the same behavior in any other command that takes
a force option.
* Handle pwd protected keystores in all CLI tools (#45289)
This change ensures that `elasticsearch-setup-passwords` and
`elasticsearch-saml-metadata` can handle a password protected
elasticsearch.keystore.
For setup passwords the user would be prompted to add the
elasticsearch keystore password upon running the tool. There is no
option to pass the password as a parameter as we assume the user is
present in order to enter the desired passwords for the built-in
users.
For saml-metadata, we prompt for the keystore password at all times
even though we'd only need to read something from the keystore when
there is a signing or encryption configuration.
* Modify docs for setup passwords and saml metadata cli (#45797)
Adds a sentence in the documentation of `elasticsearch-setup-passwords`
and `elasticsearch-saml-metadata` to describe that users would be
prompted for the keystore's password when running these CLI tools,
when the keystore is password protected.
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* Elasticsearch keystore passphrase for startup scripts (#44775)
This commit allows a user to provide a keystore password on Elasticsearch
startup, but only prompts when the keystore exists and is encrypted.
The entrypoint in Java code is standard input. When the Bootstrap class is
checking for secure keystore settings, it checks whether or not the keystore
is encrypted. If so, we read one line from standard input and use this as the
password. For simplicity's sake, we allow a maximum passphrase length of 128
characters. (This is an arbitrary limit and could be increased or eliminated.
It is also enforced in the keystore tools, so that a user can't create a
password that's too long to enter at startup.)
In order to provide a password on standard input, we have to account for four
different ways of starting Elasticsearch: the bash startup script, the Windows
batch startup script, systemd startup, and docker startup. We use wrapper
scripts to reduce systemd and docker to the bash case: in both cases, a
wrapper script can read a passphrase from the filesystem and pass it to the
bash script.
In order to simplify testing the need for a passphrase, I have added a
has-passwd command to the keystore tool. This command can run silently, and
exit with status 0 when the keystore has a password. It exits with status 1 if
the keystore doesn't exist or exists and is unencrypted.
A good deal of the code-change in this commit has to do with refactoring
packaging tests to cleanly use the same tests for both the "archive" and the
"package" cases. This required not only moving tests around, but also adding
some convenience methods for an abstraction layer over distribution-specific
commands.
* Adjust docs for password protected keystore (#45054)
This commit adds relevant parts in the elasticsearch-keystore
sub-commands reference docs and in the reload secure settings API
doc.
* Fix failing Keystore Passphrase test for feature branch (#50154)
One problem with the passphrase-from-file tests, as written, is that
they would leave a SystemD environment variable set when they failed,
and this setting would cause elasticsearch startup to fail for other
tests as well. By using a try-finally, I hope that these tests will fail
more gracefully.
It appears that our Fedora and Ubuntu environments may be configured to
store journald information under /var rather than under /run, so that it
will persist between boots. Our destructive tests that read from the
journal need to account for this in order to avoid trying to limit the
output we check in tests.
* Run keystore management tests on docker distros (#50610)
* Add Docker handling to PackagingTestCase
Keystore tests need to be able to run in the Docker case. We can do this
by using a DockerShell instead of a plain Shell when Docker is running.
* Improve ES startup check for docker
Previously we were checking truncated output for the packaged JDK as
an indication that Elasticsearch had started. With new preliminary
password checks, we might get a false positive from ES keystore
commands, so we have to check specifically that the Elasticsearch
class from the Bootstrap package is what's running.
* Test password-protected keystore with Docker (#50803)
This commit adds two tests for the case where we mount a
password-protected keystore into a Docker container and provide a
password via a Docker environment variable.
We also fix a logging bug where we were logging the identifier for an
array of strings rather than the contents of that array.
* Add documentation for keystore startup prompting (#50821)
When a keystore is password-protected, Elasticsearch will prompt at
startup. This commit adds documentation for this prompt for the archive,
systemd, and Docker cases.
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
* Warn when unable to upgrade keystore on debian (#51011)
For Red Hat RPM upgrades, we warn if we can't upgrade the keystore. This
commit brings the same logic to the code for Debian packages. See the
posttrans file for gets executed for RPMs.
* Restore handling of string input
Adds tests that were mistakenly removed. One of these tests proved
we were not handling the the stdin (-x) option correctly when no
input was added. This commit restores the original approach of
reading stdin one char at a time until there is no more (-1, \r, \n)
instead of using readline() that might return null
* Apply spotless reformatting
* Use '--since' flag to get recent journal messages
When we get Elasticsearch logs from journald, we want to fetch only log
messages from the last run. There are two reasons for this. First, if
there are many logs, we might get a string that's too large for our
utility methods. Second, when we're looking for a specific message or
error, we almost certainly want to look only at messages from the last
execution.
Previously, we've been trying to do this by clearing out the physical
files under the journald process. But there seems to be some contention
over these directories: if journald writes a log file in between when
our deletion command deletes the file and when it deletes the log
directory, the deletion will fail.
It seems to me that we might be able to use journald's "--since" flag to
retrieve only log messages from the last run, and that this might be
less likely to fail due to race conditions in file deletion.
Unfortunately, it looks as if the "--since" flag has a granularity of
one-second. I've added a two-second sleep to make sure that there's a
sufficient gap between the test that will read from journald and the
test before it.
* Use new journald wrapper pattern
* Update version added in secure settings request
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
This commit modifies the bounding box for geogrid unit tests
to only consider bounding boxes that have significant longitudinal
width and whose coordinates are normalized to quantized space
Closes#51103.
We added a new rounding in #50609 that handles offsets to the start and
end of the rounding so that we could support `offset` in the `composite`
aggregation. This starts moving `date_histogram` to that new offset.
This is a redo of #50873 with more integration tests.
This reverts commit d114c9db3e1d1a766f9f48f846eed0466125ce83.
We've been parsing the `time_zone` parameter on `date_hitogram` for a
while but it hasn't *done* anything. This wires it up.
Closes#45199
Inspired by #45200
We shouldn't be using potentially changing versions of the cluster state
when answering a snapshot status API call by calling `SnapshotService#currentSnapshots` multiple times (each time using `ClusterService#state` under the hood) but instead pass down the state from the transport action.
Having these API behave more in a more deterministic way will make it easier to use them once parallel repository operations
are introduced.
This commit overrides the stdout and stderr print streams to be
redirected to the main elasticsearch.log file. While the Elasticsearch
project ensures stdout and stderr are not written to, the jdk or 3rd
party libs may do this, which can be unexepected for users used to
looking the elasticsearch log.
closes#50156
Wait for the cluster to have settled down and have the same accepted version on all nodes before
executing and cancelling request so that a slow CS accept on one node doesn't make it fall behind
and then get sent the full CS because of the diff-version mismatch, breaking the mechanics of this test.
Closes#51308
Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
* Fix Inconsistent Shard Failure Count in Failed Snapshots
This fix was necessary to allow for the below test enhancement:
We were not adding shard failure entries to a failed snapshot for those
snapshot entries that were never attempted because the snapshot failed during
the init stage and wasn't partial. This caused the never attempted snapshots
to be counted towards the successful shard count which seems wrong and
broke repository consistency tests.
Also, this change adjusts snapshot resiliency tests to run another snapshot
at the end of each test run to guarantee a correct `index.latest` blob exists
after each run.
Closes#47550
The order indices are returned in in the metadata is not guaranteed.
This commit accounts for any possible ordering in assertions about
hidden indices.
closes#51340
It is permitted for nodes to accept transport connections at addresses other
than their publish address, which allows a good deal of flexibility when
configuring discovery. However, it is not unusual for users to misconfigure
nodes to pick a publish address which is inaccessible to other nodes. We see
this happen a lot if the nodes are on different networks separated by a proxy,
or if the nodes are running in Docker with the wrong kind of network config.
In this case we offer no useful feedback to the user unless they enable
TRACE-level logs. It's particularly tricky to diagnose because if we test
connectivity between the nodes (using their discovery addresses) then all will
appear well.
This commit adds a WARN-level log if this kind of misconfiguration is detected:
the probe connection has succeeded (to indicate that we are really talking to a
healthy Elasticsearch node) but the followup connection attempt fails.
It also tidies up some loose ends in `HandshakingTransportAddressConnector`,
removing some TODOs that need not be completed, and registering its
accidentally-unregistered timeout settings.
IndexWriter might not filter out fully deleted segments if retention
leases exist or the number of the retaining operations is non-zero.
SoftDeletesDirectoryReaderWrapper, however, always filters out fully
deleted segments.
This change uses the original directory reader when calculating segment
stats instead.
Relates #51192Closes#51303
We were loading `RepositoryData` twice during snapshot initialization,
redundantly checking if a snapshot existed already.
The first snapshot existence check is somewhat redundant because a snapshot could be
created between loading `RepositoryData` and updating the cluster state with the `INIT`
state snapshot entry.
Also, it is much safer to do the subsequent checks for index existence in the repo and
and the presence of old version snapshots once the `INIT` state entry prevents further
snapshots from being created concurrently.
While the current state of things will never lead to corruption on a concurrent snapshot
creation, it could result in a situation (though unlikely) where all the snapshot's work
is done on the data nodes, only to find out that the repository generation was off during
snapshot finalization, failing there and leaving a bunch of dead data in the repository
that won't be used in a subsequent snapshot (because the shard generation was never referenced
due to the failed snapshot finalization).
Note: This is a step on the way to parallel repository operations by making snapshot related CS
and repo related CS more tightly correlated.
This reverts commit c7fd24ca1569a809b499caf34077599e463bb8d6.
Now that JDK-8236582 is fixed in JDK 14 EA, we can revert the workaround.
Relates #50523 and #50512
LuceneChangesSnapshot can be slow if nested documents are heavily used.
Also, it estimates the number of operations to be recovered in peer
recoveries inaccurately. With this change, we prefer excluding the
nested non-root documents in a Lucene query instead.
The method parameter is not used in the percentile aggs, instead
the method is determined by the presence of `hdr` or `tdigest`
objects.
Relates to #8324
After we rollover the index we wait for the configured number of shards for the
rolled index to become active (based on the index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case,
for the primaries to become active).
This wait might be long due to disk watermarks being tripped, replicas not
being able to spring to life due to cluster nodes reconfiguration and others
and, the RolloverStep might not complete successfully due to this inherent
transient situation, albeit the rolled index having been created.
(cherry picked from commit 457a92fb4c68c55976cc3c3e2f00a053dd2eac70)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
On master failover we have to resent all the shard failed messages,
but the transport requests remain the same in the eyes of `equals`.
If the master failover is registered and the requests to the new master
are sent before all the callbacks have executed and the request to the
old master removed from the deduplicator then the requuests to the new
master will incorrectly fail and the snapshot get stuck.
Closes#51253
Add the character position of a scripting error to error responses.
The contents of the `position` field are experimental and subject to
change. Currently, `offset` refers to the character location where the
error was encountered, `start` and `end` define a range of characters
that contain the error.
eg.
```
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"y = x;",
" ^---- HERE"
],
"script": "def x = new ArrayList(); Map y = x;",
"lang": "painless",
"position": {
"offset": 33,
"start": 29,
"end": 35
}
}
```
Refs: #50993
This replaces the message we return for unknown queries with the standard
one that we use for unknown fields from `ObjectParser`. This is nice
because it includes "did you mean". One day we might convert parsing
queries to using object parser, but that looks complex. This change is
much smaller and seems useful.
Adding back accidentally removed jvm option that is required to enforce
start of the week = Monday in IsoCalendarDataProvider.
Adding a `feature` to yml test in order to skip running it in JDK8
commit that removed it 398c802
commit that backports SystemJvmOptions c4fbda3
relates 7.x backport of code that enforces CalendarDataProvider use #48349
The tests, when creating broken serialized blobs could randomly create
a sequence of bytes that is partially readable by the deserializer and then
not throw `IOException` but instead `ElasticsearchParseException`.
We should just handle these unexpected exceptions downstream properly and pass them
wrapped as `RepositoryException` to the listener to fix the test and keep the API consistent.
This change introduces a new feature for indices so that they can be
hidden from wildcard expansion. The feature is referred to as hidden
indices. An index can be marked hidden through the use of an index
setting, `index.hidden`, at creation time. One primary use case for
this feature is to have a construct that fits indices that are created
by the stack that contain data used for display to the user and/or
intended for querying by the user. The desire to keep them hidden is
to avoid confusing users when searching all of the data they have
indexed and getting results returned from indices created by the
system.
Hidden indices have the following properties:
* API calls for all indices (empty indices array, _all, or *) will not
return hidden indices by default.
* Wildcard expansion will not return hidden indices by default unless
the wildcard pattern begins with a `.`. This behavior is similar to
shell expansion of wildcards.
* REST API calls can enable the expansion of wildcards to hidden
indices with the `expand_wildcards` parameter. To expand wildcards
to hidden indices, use the value `hidden` in conjunction with `open`
and/or `closed`.
* Creation of a hidden index will ignore global index templates. A
global index template is one with a match-all pattern.
* Index templates can make an index hidden, with the exception of a
global index template.
* Accessing a hidden index directly requires no additional parameters.
Backport of #50452
When you declare an ObjectParser with top level named objects like we do
with `significant_terms` we didn't support "did you mean". This fixes
that.
Relates #50938
* Fix Infinite Retry Loop in loading RepositoryData
We were running into an infinite loop when trying to load corrupted (or otherwise un-loadable)
repository data for a repo that uses best effort consistency (e.g. that was just freshly mounted
as done in the test) because we kepy resetting to `-1` on `IOException`, listing and finding the broken
generation `N` and then interpreted the subsequent reset to `-1` as a concurrent change to the repository.
This moves the testing of custom significance heuristic plugins from an
`ESIntegTestCase` to an example plugin. This is *much* more "real" and
can be used as an example for anyone that needs to actually build such a
plugin. The old test had testing concerns and the example all jumbled
together.
The PreConfiguredTokenFilter#singletonWithVersion uses the version
internally for the token filter factories but it registers only one
instance in the cache and not one instance per version. This can lead
to exceptions like the one described in #50734 since the singleton is
created and cached using the version created of the first index
that is processed.
Remove the singletonWithVersion() methods and use the
elasticsearchVersion() methods instead.
Fixes: #50734
(cherry picked from commit 24e1858)
Backport: #50467
This commit adds the name of the current pipeline to ingest metadata.
This pipeline name is accessible under the following key: '_ingest.pipeline'.
Example usage in pipeline:
PUT /_ingest/pipeline/2
{
"processors": [
{
"set": {
"field": "pipeline_name",
"value": "{{_ingest.pipeline}}"
}
}
]
}
Closes#42106
Index templates created in the 5x line can still be present in the cluster state
through multiple upgrades, and may have more than one mapping defined. 8x
will stop supporting templates with multiple mappings, and we should emit
deprecation warnings in 7x clusters to give users a chance to update their
templates before upgrading.
Check it out:
```
$ curl -u elastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_update/foo?pretty -d'{
"dac": {}
}'
{
"error" : {
"root_cause" : [
{
"type" : "x_content_parse_exception",
"reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
}
],
"type" : "x_content_parse_exception",
"reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
},
"status" : 400
}
```
The tricky thing about implementing this is that x-content doesn't
depend on Lucene. So this works by creating an extension point for the
error message using SPI. Elasticsearch's server module provides the
"spell checking" implementation.
s
We seem to have settled on the `ContextParser` interface for parsing
stuff, mostly because `ObjectParser` implements it. We don't really
*need* the old `Aggregator.Parser` interface any more because it
duplicates `ContextParser` but with the arguments reversed.
This adds support to `AggregationSpec` to declare aggregation parsers
using `ContextParser`. This should integrate cleanly with
`ObjectParser`. It doesn't drop support for `Aggregator.Parser` or
change the plugin intrface at all so it *should* be safe to backport to
7.x. And we can remove `Aggregator.Parser` in a follow up which is only
targeted to 8.0.
We added a new rounding in #50609 that handles offsets to the start and
end of the rounding so that we could support `offset` in the `composite`
aggregation. This starts moving `date_histogram` to that new offset.
* Adds support for geo-bounds filtering in geogrid aggregations (#50002)
It is fairly common to filter the geo point candidates in
geohash_grid and geotile_grid aggregations according to some
viewable bounding box. This change introduces the option of
specifying this filter directly in the tiling aggregation.
This is even more relevant to `geo_shape` where the bounds will restrict
the shape to be within the bounds
this optional `bounds` parameter is parsed in an equivalent fashion to
the bounds specified in the geo_bounding_box query.
* Track Snapshot Version in RepositoryData (#50930)
Add tracking of snapshot versions to RepositoryData to make BwC logic more efficient.
Follow up to #50853
Currently, proxy mode allows a remote cluster connection to be setup by
expecting all open connections to be routed through an intermediate
proxy. The proxy must use some logic to ensure that the connections end
up on the correct remote cluster. One mechanism provided is that the
default distribution TLS implementations will forward the host component
of the configured address to the remote connection using the SNI
extension. This is limiting as it requires that the proxy be configured
in a way that always uses a valid hostname as the proxy address.
Instead, this commit adds an additional setting to allow the server_name
to be configured independently. This allows the proxy address to be
specified as a IP literal, but the server_name specified as an arbitrary
string which still must be a valid hostname. It also decouples the
server_name from the requirement of being a DNS resolvable domain.
Generally speaking, deprecated analysis components in elasticsearch will issue deprecation
warnings when they are first used. However, this means that no warnings are emitted when
indexes are created with deprecated components, and users have to actually index a document
to see warnings. This makes it much harder to see these warnings and act on them at
appropriate times.
This is worse in the case where components throw exceptions on upgrade. In this case, users
will not be aware of a problem until a document is indexed, instead of at index creation time.
This commit adds a new check that pushes an empty string through all user-defined analyzers
and normalizers when an IndexAnalyzers object is built for each index; deprecation warnings
and exceptions are now emitted when indexes are created or opened.
Fixes#42349
* Move metadata storage to Lucene (#50907)
Today we split the on-disk cluster metadata across many files: one file for the metadata of each
index, plus one file for the global metadata and another for the manifest. Most metadata updates
only touch a few of these files, but some must write them all. If a node holds a large number of
indices then it's possible its disks are not fast enough to process a complete metadata update before timing out. In severe cases affecting master-eligible nodes this can prevent an election
from succeeding.
This commit uses Lucene as a metadata storage for the cluster state, and is a squashed version
of the following PRs that were targeting a feature branch:
* Introduce Lucene-based metadata persistence (#48733)
This commit introduces `LucenePersistedState` which master-eligible nodes
can use to persist the cluster metadata in a Lucene index rather than in
many separate files.
Relates #48701
* Remove per-index metadata without assigned shards (#49234)
Today on master-eligible nodes we maintain per-index metadata files for every
index. However, we also keep this metadata in the `LucenePersistedState`, and
only use the per-index metadata files for importing dangling indices. However
there is no point in importing a dangling index without any shard data, so we
do not need to maintain these extra files any more.
This commit removes per-index metadata files from nodes which do not hold any
shards of those indices.
Relates #48701
* Use Lucene exclusively for metadata storage (#50144)
This moves metadata persistence to Lucene for all node types. It also reenables BWC and adds
an interoperability layer for upgrades from prior versions.
This commit disables a number of tests related to dangling indices and command-line tools.
Those will be addressed in follow-ups.
Relates #48701
* Add command-line tool support for Lucene-based metadata storage (#50179)
Adds command-line tool support (unsafe-bootstrap, detach-cluster, repurpose, & shard
commands) for the Lucene-based metadata storage.
Relates #48701
* Use single directory for metadata (#50639)
Earlier PRs for #48701 introduced a separate directory for the cluster state. This is not needed
though, and introduces an additional unnecessary cognitive burden to the users.
Co-Authored-By: David Turner <david.turner@elastic.co>
* Add async dangling indices support (#50642)
Adds support for writing out dangling indices in an asynchronous way. Also provides an option to
avoid writing out dangling indices at all.
Relates #48701
* Fold node metadata into new node storage (#50741)
Moves node metadata to uses the new storage mechanism (see #48701) as the authoritative source.
* Write CS asynchronously on data-only nodes (#50782)
Writes cluster states out asynchronously on data-only nodes. The main reason for writing out
the cluster state at all is so that the data-only nodes can snap into a cluster, that they can do a
bit of bootstrap validation and so that the shard recovery tools work.
Cluster states that are written asynchronously have their voting configuration adapted to a non
existing configuration so that these nodes cannot mistakenly become master even if their node
role is changed back and forth.
Relates #48701
* Remove persistent cluster settings tool (#50694)
Adds the elasticsearch-node remove-settings tool to remove persistent settings from the on
disk cluster state in case where it contains incompatible settings that prevent the cluster from
forming.
Relates #48701
* Make cluster state writer resilient to disk issues (#50805)
Adds handling to make the cluster state writer resilient to disk issues. Relates to #48701
* Omit writing global metadata if no change (#50901)
Uses the same optimization for the new cluster state storage layer as the old one, writing global
metadata only when changed. Avoids writing out the global metadata if none of the persistent
fields changed. Speeds up server:integTest by ~10%.
Relates #48701
* DanglingIndicesIT should ensure node removed first (#50896)
These tests occasionally failed because the deletion was submitted before the
restarting node was removed from the cluster, causing the deletion not to be
fully acked. This commit fixes this by checking the restarting node has been
removed from the cluster.
Co-authored-by: David Turner <david.turner@elastic.co>
* fix tests
Co-authored-by: David Turner <david.turner@elastic.co>
Currently, the connection manager is configured with a default profile
for both the sniff and proxy connection stratgies. This profile
correctly reflects the expected number of connection (6 for sniff, 18
for proxy). This commit removes the proxy strategy usages of the per
connection attempt profile configuration.
Additionally, it refactors other unnecessary code around the connection
manager. The connection manager now can always be built inside the
remote connection.
Currently we reuse the same test connection for all connection attempts
in the testConcurrentConnectsAndDisconnects test. This means that if the
connection fails due to a pre-existing connection, the connection will
be closed impacting the state of all connection attempts. This commit
fixes the test, by returning a unique connection for each attempt.
Fixes#49903.
Follow up to #50692 that starts writing a `min_version` field to
the `RepositoryData` so that pre-7.6 ES versions can not read it
(and potentially corrupt it if they attempt to modify the repo contents)
after the repository moved to the new metadata format.
When deserializing time zones in the Rounding classes we used to include a tiny
normalization step via `DateUtils.of(in.readString())` that was lost in #50609.
Its at least necessary for some tests, e.g. the cause of #50827 is that when
sending the default time zone ZoneOffset.UTC on a stream pre 7.0 we convert it
to a "UTC" string id via `DateUtils.zoneIdToDateTimeZone`. This gets then read
back as a UTC ZoneRegion, which should behave the same but fails the equality
tests in our serialization tests. Reverting to the previous behaviour with an
additional normalization step on 7.x.
Co-authored-by: Nik Everett <nik9000@gmail.com>
Closes#50827
Today we make multiple attempts to corrupt the translog header in
`TranslogHeaderTests#testCurrentHeaderVersion`, but if we are extraordinarily
unlucky then this sequence of corruptions may restore the file to its original
state. This change adjusts the test to only corrupt the file once, which is
certain not to leave the file in its original state.
The test checked queue size and active count, however,
ThreadPoolExecutor pulls out the request from the queue before marking
the worker active, risking that we think all tasks are done when they
are not. Now check on completed-tasks metric instead, which is
guaranteed to be monotonic.
Relates #50769
Today we periodically check the indexing buffer memory every 5 seconds
or after we have used 1/30 of the configured memory. If the total used
memory is over the threshold, then we refresh the "largest" shards. If
refreshing takes longer these intervals (i.e., 5s or 1/30 buffer), then
we continue to enqueue refreshes to these shards. This leads to two
issues:
- The refresh thread pool can be exhausted and other shards can't refresh
- Execute too many refreshes for the "largest" shards
With this change, we only refresh the largest shards if they are not refreshing.
Here we rely on the periodic check to trigger another refresh if needed. We can
harden this by making the ongoing refresh triggers the memory check when
it's completed. I opted out this option in this PR for simplicity.
See: https://discuss.elastic.co/t/write-queue-continue-to-rise/213652/
When a composite aggregation is reduced using the results from an
index that has one of the fields unmapped we were throwing away the
formatter. This is mildly annoying, except in the case of IP addresses
which were coming out as non-utf-8-characters. And tripping assertions.
This carefully preserves the formatter from the working bucket.
Closes#50600
This change fixes the upgrade of index metadata that contain
a custom similarity with options that are not compatible with BM25.
The upgrade doesn't need a real similarity service so we fake one that
resolves all custom similarity to BM25 but this logic fails because the
BM25 provider checks that all options are compatible. This commit removes
the verification step as it is not needed during the upgrade (the verification
is done when the index is restored/opened).
Closes#50763
* Fix Snapshot Shard Status Request Deduplication
The request deduplication didn't actually work for these requests
since they had no `equals` and `hashCode` so the deduplicator wouldn't
actually recognize equal requests.
Replaces the "funny"
`Function<String, ConstructingObjectParser<T, Void>>` with a much
simpler `ConstructingObjectParser<T, String>`. This makes pretty much
all of our object parsers static.
* Fix Snapshot Repository Corruption in Downgrade Scenarios (#50692)
This PR introduces test infrastructure for downgrading a cluster while interacting with a given repository.
It fixes the fact that repository metadata in the new format could be written while there's still older snapshots in the repository that require the old-format metadata to be restorable.
A very large number of recursive calls can cause a stack overflow
exception. This commit forks the recursive calls for non-async
processors. Once forked, each thread will handle at most 10
recursive calls to help keep the stack size and thread count
down to a reasonable size.
Adds support for the `offset` parameter to the `date_histogram` source
of composite aggs. The `offset` parameter is supported by the normal
`date_histogram` aggregation and is useful for folks that need to
measure things from, say, 6am one day to 6am the next day.
This is implemented by creating a new `Rounding` that knows how to
handle offsets and delegates to other rounding implementations. That
implementation doesn't fully implement the `Rounding` contract, namely
`nextRoundingValue`. That method isn't used by composite aggs so I can't
be sure that any implementation that I add will be correct. I propose to
leave it throwing `UnsupportedOperationException` until I need it.
Closes#48757
Previously, the following situation would throw an error:
* A search contains a `collapse` on a particular field.
* The search spans multiple indices, and in one index the field is mapped as a
concrete field, but in another it is a field alias.
The error occurs when we attempt to merge `CollapseTopFieldDocs` across shards.
When merging, we validate that the name of the collapse field is the same across
shards. But the name has already been resolved to the concrete field name, so it
will be different on shards where the field was mapped as an alias vs. shards
where it was a concrete field.
This PR updates the collapse field name in `CollapseTopFieldDocs` to the
original requested field, so that it will always be consistent across shards.
Note that in #32648, we already made a fix around collapsing on field aliases.
However, we didn't test this specific scenario where the field was mapped as an
alias in only one of the indices being searched.
Currently, if an updateable synonym filter is included in a multiplexer filter,
it is not reloaded via the _reload_search_analyzers because the multiplexer
itself doesn't pass on the analysis mode of the filters it contains, so its not
recognized as "updateable" in itself. Instead we can check and merge the
AnalysisMode settings of all filters in the multiplexer and use the resulting
mode (e.g. search-time only) for the multiplexer itself, thus making any synonym
filters contained in it reloadable. This, of course, will also make the
analyzers using the multiplexer be usable at search-time only.
Closes#50554
strict_date_optional_time changes to have optional minute part.
It already allowed optional second and fraction of second part.
This allows parsing 2018-01-01T00+01 , 2018-01-01T00:00+01 , 2018-01-01T00:00:00+01 , 2018-01-01T00:00:00.000+01
It won't allow parsing a timezone without an hour part as this is not allowed by iso8601 spec
closes#49351
ElasticsearchException.guessRootCauses would return wrapper exception if
inner exception was not an ElasticsearchException. Fixed to never return
wrapper exceptions.
At least following APIs change root_cause.0.type as a result:
_update with bad script
_index with bad pipeline
Relates #50417
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.
In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
- keys can't be longer than 20 chars,
- values can only be numbers or strings of no more than 50 chars - no inner
arrays or objects,
- the metadata can't have more than 5 keys in total.
Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.
Here is how the meta might look like in mappings:
```json
{
"properties": {
"latency": {
"type": "long",
"meta": {
"unit": "ms"
}
}
}
}
```
And then in the field capabilities response:
```json
{
"latency": {
"long": {
"searchable": true,
"aggreggatable": true,
"meta": {
"unit": [ "ms" ]
}
}
}
}
```
When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:
```json
{
"latency": {
"long": {
"searchable": true,
"aggreggatable": true,
"meta": {
"unit": [ "ms", "ns" ]
}
}
}
}
```
Closes#33267
Introduce a new static setting, `gateway.auto_import_dangling_indices`, which prevents dangling indices from being automatically imported. Part of #48366.
In security we currently monitor a set of files for changes:
- config/role_mapping.yml (or alternative configured path)
- config/roles.yml
- config/users
- config/users_roles
This commit prevents unnecessary reloading when the file change actually doesn't change the internal structure.
Backport of: #50207
Co-authored-by: Anton Shuvaev <anton.shuvaev91@gmail.com>
We *very* commonly have object with ctors like:
```
public Foo(String name)
```
And then declare a bunch of setters on the object. Every aggregation
works like this, for example. This change teaches `ObjectParser` how to
build these aggregations all on its own, without any help. This'll make
it much cleaner to parse aggs, and, probably, a bunch of other things.
It'll let us remove lots of wrapping. I've used this new power for the
`avg` aggregation just to prove that it works outside of a unit test.
If an auto-refresh happens, then version_map_memory is reset to 0. By
default, the auto-refresh occurs for every second in the first 30
seconds until search becomes idle.
Closes#50362
In 7.x an internal API used for validating remote cluster does not throw, see #50420 for the
details. This change implements a workaround for remote cluster validation, only for 7.x branches.
fixes#50420
A failure of a recovering shard can race with a new allocation of the shard, and cause the new
allocation to be failed as well. This can result in a shard being marked as initializing in the cluster
state, but not exist on the node anymore.
Closes#50508
This change removes a no longer used method, `fromByte`, in
IndicesOptions. This method was necessary for backwards compatibility
with versions prior to 6.4.0 and was used when talking to those
versions. However, the minimum wire compatibility version has changed
and we no longer use this code.
Backport of #50665
This replaces the hand written xcontent parsers for significance
heristics with `ObjectParser` and parsing named xcontent.
As a happy accident, this was the last user of `ParseFieldRegistry` so
this PR entirely removes that class.
Closes#25519
Currently, we use delayed address resolution in the proxy strategy tests
to allow tests to connect to different addresses. Unfortunately, this
has the potential to introduce races as the address is resolved each
connection attempt. The number of connection attempts can vary based on
when connections are opening and closing. This commit modifies the test
be allowing them to specifically control which address is used.
Related to #50618
This test code fixes a serialization test bug:
https://gradle-enterprise.elastic.co/s/7x2ct6yywkw3o
Rarely stats for the same processor are generated and
the production code then sums up these stats. However
the test code wasn't summing up in that case,
which caused inconsistencies between the actual and expected results.
Closes#50507
Previously, as long as a deleted version value was kept as a tombstone,
another index or delete operation against the same id would leak that
the doc had existed (through seq_no info) or would allow the operation
if the client forged the seq_no. Fixed to disregard info on deleted docs
when doing seq_no based optimistic concurrency check.
Today the `InternalClusterInfoService` collects information on the sizes of
shards of open indices, but does not consider closed indices. This means that
shards of closed indices are treated as having zero size when they are being
allocated. This commit fixes this, obtaining the sizes of all shards.
Relates #33888
* Adds JavaDoc to `AbstractWireTestCase` and
`AbstractWireSerializingTestCase` so it is more obvious you should prefer
the latter if you have a choice
* Moves the `instanceReader` method out of `AbstractWireTestCase` becaue
it is no longer used.
* Marks a bunch of methods final so it is more obvious which classes are
for what.
* Cleans up the side effects of the above.
We have about 800 `ObjectParsers` in Elasticsearch, about 700 of which
are final. This is *probably* the right way to declare them because in
practice we never mutate them after they are built. And we certainly
don't change the static reference. Anyway, this adds `final` to these
parsers.
I found the non-final parsers with this:
```
diff \
<(find . -type f -name '*.java' -exec grep -iHe 'static.*PARSER\s*=' {} \+ | sort) \
<(find . -type f -name '*.java' -exec grep -iHe 'static.*final.*PARSER\s*=' {} \+ | sort) \
2>&1 | grep '^<'
```
A geo box with a top value of Double.NEGATIVE_INFINITY will yield an empty
xContent which translates to a null `geoBoundingBox`. This commit marks the
field as `Nullable` and guards against null when retrieving the `topLeft`
and `bottomRight` fields.
Fixes https://github.com/elastic/elasticsearch/issues/50505
(cherry picked from commit 051718f9b1e1ca957229b01e80d7b79d7e727e14)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This marks a couple of constants in the `DecayFunctionBuilder` as final.
They are written in CONSTANT_CASE and used as constants but not final
which is a little confusing and might lead to sneaky bugs.
FutureUtils.get() would unwrap ElasticsearchWrapperExceptions. This
is trappy, since nearly all usages of FutureUtils.get() expected only to
not have to deal with checked exceptions.
In particular, StepListener builds upon ListenableFuture which uses
FutureUtils.get to be informed about the exception passed to onFailure.
This had the bad consequence of masking away any exception that was an
ElasticsearchWrapperException like RemoteTransportException.
Specifically for recovery, this made CircuitBreakerExceptions happening
on the target node look like they originated from the source node.
The only usage that expected that behaviour was AdapterActionFuture.
The unwrap behaviour has been moved to that class.
Today we log changes to index settings like this:
updating [index.setting.blah] from [A] to [B]
The identity of the index whose settings were updated is conspicuously absent
from this message. This commit addresses this by adding the index name to these
messages.
Fixes#49818.
This intervals source will return terms that are similar to an input term, up to
an edit distance defined by fuzziness, similar to FuzzyQuery.
Closes#49595
The cat nodes API performs a `ClusterStateAction` then a `NodesInfoAction`.
Today it accepts the `?local` parameter and passes this to the
`ClusterStateAction` but this parameter has no effect on the `NodesInfoAction`.
This is surprising, because `GET _cat/nodes?local` looks like it might be a
completely local call but in fact it still depends on every node in the
cluster.
This commit deprecates the `?local` parameter on this API so that it can be
removed in 8.0.
Relates #50088
testCancelRecoveryDuringPhase1 uses a mock of IndexShard, which can't
create retention leases. We need to stub method createRetentionLease.
Relates #50351Closes#50424
The additional change to the original PR (#49657), is that `org.elasticsearch.client.cluster.RemoteConnectionInfo` now parses the initial_connect_timeout field as a string instead of a TimeValue instance.
The reason that this is needed is because that the initial_connect_timeout field in the remote connection api is serialized for human consumption, but not for parsing purposes.
Therefore the HLRC can't parse it correctly (which caused test failures in CI, but not in the PR CI
:( ). The way this field is serialized needs to be changed in the remote connection api, but that is a breaking change. We should wait making this change until rest api versioning is introduced.
Co-Authored-By: j-bean <anton.shuvaev91@gmail.com>
Co-authored-by: j-bean <anton.shuvaev91@gmail.com>
Today, the replica allocator uses peer recovery retention leases to
select the best-matched copies when allocating replicas of indices with
soft-deletes. We can employ this mechanism for indices without
soft-deletes because the retaining sequence number of a PRRL is the
persisted global checkpoint (plus one) of that copy. If the primary and
replica have the same retaining sequence number, then we should be able
to perform a noop recovery. The reason is that we must be retaining
translog up to the local checkpoint of the safe commit, which is at most
the global checkpoint of either copy). The only limitation is that we
might not cancel ongoing file-based recoveries with PRRLs for noop
recoveries. We can't make the translog retention policy comply with
PRRLs. We also have this problem with soft-deletes if a PRRL is about to
expire.
Relates #45136
Relates #46959
Both geo_bounding_box query and geo_bounds aggregation have
a very similar definition of a "bounding box". A lot of this
logic (serialization, xcontent-parsing, etc) can be centralized
instead of having separated efforts to do the same things
This commit makes the TransportRolloverAction more resilient, by having it execute
only one cluster state update that creates the new (rollover index), rolls over
the alias from the source to the target index and set the RolloverInfo on the
source index. Before these 3 steps were represented as 3 chained cluster state
updates, which would've seen the user manually intervene if, say, the alias
rollover cluster state update (second in the chain) failed but the creation of
the rollover index (first in the chain) update succeeded
* Rename innerExecute to applyAliasActions
(cherry picked from commit 1ba4339a0c73ef3354b8c8b44b628fc55f1dbc78)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Daniel Huang <danielhuang@tencent.com>
This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
* Multi-valued source, in such case early termination is not possible.
* missing_bucket is set to true
Follow-up to #48974 that ensures that replicas are only auto-expanded according to allocation
filtering rules once all nodes are upgraded to a version that supports this. Helps with
orchestrating cluster upgrades.
refactors source and dest validation, adds support for CCS, makes resolve work like reindex/search, allow aliased dest index with a single write index.
fixes#49988fixes#49851
relates #43201
The built-in task index mapping has a version field in its metadata, so that
the TaskResultsService can check to see if it needs to update mappings when
a new task result is stored. #48393 updated this version in TaskResultsService
but omitted to change the version in the mapping itself, so a mapping update
is applied every time a new task result is stored.
This commit updates the mapping version so that it corresponds to the version
in TaskResultsService.
Follow-up to #48974 that ensures that replicas are only auto-expanded according to allocation
filtering rules once all nodes are upgraded to a version that supports this. Helps with
orchestrating cluster upgrades.
* Update remote cluster stats to support simple mode (#49961)
Remote cluster stats API currently only returns useful information if
the strategy in use is the SNIFF mode. This PR modifies the API to
provide relevant information if the user is in the SIMPLE mode. This
information is the configured addresses, max socket connections, and
open socket connections.
* Send hostname in SNI header in simple remote mode (#50247)
Currently an intermediate proxy must route conncctions to the
appropriate remote cluster when using simple mode. This commit offers
a additional mechanism for the proxy to route the connections by
including the hostname in the TLS SNI header.
* Rename the remote connection mode simple to proxy (#50291)
This commit renames the simple connection mode to the proxy connection
mode for remote cluster connections. In order to do this, the mode specific
settings which we namespaced by their mode (ex: sniff.seed and
proxy.addresses) have been reverted.
* Modify proxy mode to support a single address (#50391)
Currently, the remote proxy connection mode uses a list setting for the
proxy address. This commit modifies this so that the setting is
proxy_address and only supports a single remote proxy address.
Avoid backwards incompatible changes for 8.x and 7.6 by removing type
restriction on compile and Factory. Factories may optionally implement
ScriptFactory. If so, then they can indicate determinism and thus
cacheability.
**Backport**
Relates: #49466
We randomly generate intervals sources to test serialization and query generation
in IntervalQueryBuilderTests. However, rarely we can generate a query that has
too many nested disjunctions, resulting in query rewrites running afoul of the maximum
boolean clause limit.
This commit reduces the maximum depth of the randomly generated intervals source
to make running into this limit much more unlikely.
* Extract IndexCreationTask execute into applyCreateIndexRequest
This is the first step in preparation for separating the index creation into a few
steps that only deal with the cluster state mutation and removing the IndexCreationTask
altogether.
* Split applyCreateIndexRequest
This breaks down the logic in applyCreateIndexRequest into multiple
steps that will hopefully make the service more readable and unit testable.
The service creation process now goes through a few well defined steps,
namely find the templates that possibly match the new index, parse the
requested and template matching mappings, process the index and template
matching settings, validate the wait for active shards request and
create the `IndexService`, update the mappings in the `MapperService` (which
is grouped together with creating the sort order for validation purposes),
validate the requested and templated matching aliases and finally update
the `ClusterState` to reflect the requested changes.
This also removes the `IndexCreationTask` as it was a shallow
indirection and migrates the tests from `IndexCreationTaskTests` to
`MetaDataCreateIndexServiceTests` (making them "real" unit tests
operating on the `ClusterState` rather than mocks).
* Add more unit tests.
* Add IT to verify we cleanup in case of failure
(cherry picked from commit 57e6269f750471f05a1a79539ca45361b9e3c2b5)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
# Conflicts:
# server/src/main/java/org/elasticsearch/cluster/metadata/MetaDataCreateIndexService.java
# server/src/test/java/org/elasticsearch/action/admin/indices/create/CreateIndexIT.java
# server/src/test/java/org/elasticsearch/cluster/metadata/IndexCreationTaskTests.java
Cache results from queries that use scripts if they use only
deterministic API calls. Nondeterministic API calls are marked in the
whitelist with the `@nondeterministic` annotation. Examples are
`Math.random()` and `new Date()`.
Refs: #49466
Loading shard state information during shard allocation sometimes runs into a situation where a
data node does not know yet how to look up the shard on disk if custom data paths are used.
The current implementation loads the index metadata from disk to determine what the custom
data path looks like. This PR removes this dependency, simplifying the lookup.
Relates #48701
Backport #50244 to 7.x branch.
If a processor executes asynchronously and the ingest simulate api simulates with
multiple documents then the order of the documents in the response may not match
the order of the documents in the request.
Alexander Reelsen discovered this issue with the enrich processor with the following reproduction:
```
PUT cities/_doc/munich
{"zip":"80331","city":"Munich"}
PUT cities/_doc/berlin
{"zip":"10965","city":"Berlin"}
PUT /_enrich/policy/zip-policy
{
"match": {
"indices": "cities",
"match_field": "zip",
"enrich_fields": [ "city" ]
}
}
POST /_enrich/policy/zip-policy/_execute
GET _cat/indices/.enrich-*
POST /_ingest/pipeline/_simulate
{
"pipeline": {
"processors" : [
{
"enrich" : {
"policy_name": "zip-policy",
"field" : "zip",
"target_field": "city",
"max_matches": "1"
}
}
]
},
"docs": [
{ "_id": "first", "_source" : { "zip" : "80331" } } ,
{ "_id": "second", "_source" : { "zip" : "50667" } }
]
}
```
* fixed test compile error
We can simply filter out shard generation updates for indices
that were removed from the cluster state concurrently to fix
index deletes during partial snapshots as that completely removes
any reference to those shards from the snapshot.
Follow up to #50202Closes#50200
Follow up to #49729
This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
G1GC will use humongous allocations when an allocation exceeds half the
chosen region size, which is minimum 1MB. By reducing the recovery
buffer size by 16 bytes we ensure that the recovery buffer is never
allocated as a humongous allocation.
Today we do not consider trimAboveSeqNo when calculating the translog
generation of an index commit. If there is no new indexing after the
primary promotion, then we won't be able to clean up the translog.
Backport of #49612.
The current Docker entrypoint script picks up environment variables and
translates them into -E command line arguments. However, since any tool
executes via `docker exec` doesn't run the entrypoint, it results in
a poorer user experience.
Therefore, refactor the env var handling so that the -E options are
generated in `elasticsearch-env`. These have to be appended to any
existing command arguments, since some CLI tools have subcommands and
-E arguments must come after the subcommand.
Also extract the support for `_FILE` env vars into a separate script, so
that it can be called from more than once place (the behaviour is
idempotent).
Finally, add noop -E handling to CronEvalTool for parity, and support
`-E` in MultiCommand before subcommands.
With #45689 making it so that index metadata is written
after all shards have been snapshotted we can't delete indices
that are part of the upcoming snapshot finalization any longer
and it is not sufficient to check if all shards of an index have been
snapshotted before deciding that it is safe to delete it.
This change forbids deleting any index that is in the process of being
snapshot to avoid issues during snapshot finalization.
Relates #50200 (doesn't fully fix yet because we're not fixing the `partial=true`
snapshot case here
* Remove BlobContainer Tests against Mocks
Removing all these weird mocks as asked for by #30424.
All these tests are now part of real repository ITs and otherwise left unchanged if they had
independent tests that didn't call the `createBlobStore` method previously.
The HDFS tests also get added coverage as a side-effect because they did not have an implementation
of the abstract repository ITs.
Closes#30424
Lucene 8.4 added support for "CONTAINS", therefore in this commit those
changes are integrated in Elasticsearch. This commit contains as well a
bug fix when querying with a geometry collection with "DISJOINT" relation.
Since 7.4, we switch from translog to Lucene as the source of history
for peer recoveries. However, we reduce the likelihood of
operation-based recoveries when performing a full cluster restart from
pre-7.4 because existing copies do not have PPRL.
To remedy this issue, we fallback using translog in peer recoveries if
the recovering replica does not have a peer recovery retention lease,
and the replication group hasn't fully migrated to PRRL.
Relates #45136
Today we do not use retention leases in peer recovery for closed indices
because we can't sync retention leases on closed indices. This change
allows that ability and adjusts peer recovery to use retention leases
for all indices with soft-deletes enabled.
Relates #45136
Co-authored-by: David Turner <david.turner@elastic.co>
A recent change around date parsing (#46675) made it stricter, so we should now also
catch DateTimeExceptions in DateFieldMapper and ignore those when the
`ignore_malformed` option is set.
Closes#50081
When decoupling the pipeline reduction from regular agg reduction,
MultiBucket aggs were modified to reduce their bucket's pipeline
aggs first before reducing the sibling aggs. This modification
was missed on SingleBucket aggs, meaning any SingleBucket would
fail to reduce any pipeline sub-aggs
* Remove Unused Single Delete in BlobStoreRepository
There are no more production uses of the non-bulk delete or the delete that throws
on missing so this commit removes both these methods.
Only the bulk delete logic remains. Where the bulk delete was derived from single deletes,
the single delete code was inlined into the bulk delete method.
Where single delete was used in tests it was replaced by bulk deleting.
* Better Logging GCS Blobstore Mock
Two things:
1. We should just throw a descriptive assertion error and figure out why we're not reading a multi-part instead of
returning a `400` and failing the tests that way here since we can't reproduce these 400s locally.
2. We were missing logging the exception on a cleanup delete failure that coincides with the `400` issue in tests.
Relates #49429
This commit ensures deseriable a GetResult from StreamInput does not
leave metaFields and documentFields null. This could cause an NPE in
situations where upsert response for a document that did not exist is
passed back to a node that forwarded the upsert request.
closes#48215
Currently we do not know the size of the transport header (map of
request response headers, features array, and action name). This means
that we must read the entire transport message to dependably act on the
headers. This commit adds an int indicating the size of the transport
headers. With this addition we can act upon the headers prior to reading
the entire message.
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.
Relates #50056.
Partial backport of #46431 to 7.x.
Removes a reference to shadow replicas from the cat shards API docs
and a comment in cluster/routing/UnassignedInfo.java.
Shadow replicas were removed with #23906.
* Improve Snapshot Finalization Ex. Handling
Like in #49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.
Closes#49989
The elasticsearch-node tools allow manipulating the on-disk cluster state. The tool is currently
unaware of plugins and will therefore drop custom metadata from the cluster state once the
state is written out again (as it skips over the custom metadata that it can't read). This commit
preserves unknown customs when editing on-disk metadata through the elasticsearch-node
command-line tools.
Today settings can declare dependencies on another setting. This
declaration is implemented so that if the declared setting is not set
when the declaring setting is, settings validation fails. Yet, in some
cases we want not only that the setting is set, but that it also has a
specific value. For example, with the monitoring exporter settings, if
xpack.monitoring.exporters.my_exporter.host is set, we not only want
that xpack.monitoring.exporters.my_exporter.type is set, but that it is
also set to local. This commit extends the settings infrastructure so
that this declaration is possible. The use of this in the monitoring
exporter settings will be implemented in a follow-up.
* Cleanup Old index-N Blobs in Repository Cleanup
Repository cleanup didn't deal with old index-N, this change adds
cleaning up all old index-N found in the repository.
Step on the road to #49060.
This commit adds the logic to keep track of a repository's generation
across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine.
It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer.
With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents.
The logic for that will be introduced in #49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In #49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
The fake translog corruption in the test sometimes generates invalid translog files where some
assertions do not hold (e.g. minSeqNo <= maxSeqNo or minTranslogGen <= translogGen)
Closes#49909
Makes sure that CCR also properly works with _source disabled.
Changes one exception in LuceneChangesSnapshot as the case of missing _recovery_source
because of a missing lease was not properly properly bubbled up to CCR (testIndexFallBehind
was failing).
If we have a nested `AbstractRunnable` inside of `TimedRunnable`
it's executed twice on `run` (once when its own `run` method is invoked and once when
the `onAfter` in the `TimedRunnable` is executed).
Simply removing the `onAfter` override in `TimedRunnable` makes sure that the `onAfter`
is only called once by the `run` on the nested `AbstractRunnable` itself.
Same was done for `onFailure` as it was double-triggering as well on exceptions in the inner `onFailure`.
In order to cache script results in the query shard cache, we need to
check if scripts are deterministic. This change adds a default method
to the script factories, `isResultDeterministic() -> false` which is
used by the `QueryShardContext`.
Script results were never cached and that does not change here. Future
changes will implement this method based on whether the results of the
scripts are deterministic or not and therefore cacheable.
Refs: #49466
**Backport**
Our docs specifically mention that CBOR is supported when ingesting attachments. However this is not tested anywhere.
This adds a test, that uses specifically CBOR format in its IndexRequest and another one that behaves like CBOR in the ingest attachment unit tests.
Historically only two things happened in the final reduction:
empty buckets were filled, and pipeline aggs were reduced (since it
was the final reduction, this was safe). Usage of the final reduction
is growing however. Auto-date-histo might need to perform
many reductions on final-reduce to merge down buckets, CCS
may need to side-step the final reduction if sending to a
different cluster, etc
Having pipelines generate their output in the final reduce was
convenient, but is becoming increasingly difficult to manage
as the rest of the agg framework advances.
This commit decouples pipeline aggs from the final reduction by
introducing a new "top level" reduce, which should be called
at the beginning of the reduce cycle (e.g. from the SearchPhaseController).
This will only reduce pipeline aggs on the final reduce after
the non-pipeline agg tree has been fully reduced.
By separating pipeline reduction into their own set of methods,
aggregations are free to use the final reduction for whatever
purpose without worrying about generating pipeline results
which are non-reducible
This is related to #49067. As part of this work a new sniff number of
node connections setting, a simple addresses setting, and a simple
number of sockets setting have been added. This commit ensures that
these settings are properly hooked up to support dynamic updates.
The list used by the search progress listener can be nullified
by another thread that reports a query result. This change replaces
the usage of this list with a new array that is synchronously modified.
Closes#49778
Adds `GET /_script_language` to support Kibana dynamic scripting
language selection.
Response contains whether `inline` and/or `stored` scripts are
enabled as determined by the `script.allowed_types` settings.
For each scripting language registered, such as `painless`,
`expression`, `mustache` or custom, available contexts for the language
are included as determined by the `script.allowed_contexts` setting.
Response format:
```
{
"types_allowed": [
"inline",
"stored"
],
"language_contexts": [
{
"language": "expression",
"contexts": [
"aggregation_selector",
"aggs"
...
]
},
{
"language": "painless",
"contexts": [
"aggregation_selector",
"aggs",
"aggs_combine",
...
]
}
...
]
}
```
Fixes: #49463
**Backport**
not adjust testCacheability(), which how fails occasionally when given a random
interval source containing a script. This commit overrides testCacheability() to
explicitly sources with and without script filters.
Fixes#49821
By default, AbstractQueryTestCase only changes name and boost in its mutateInstance
method, used when checking equals and hashcode implementations. This commit adds
a mutateInstance method to InveralQueryBuilderTests that will check hashcode and
equality when the field or intervals source are changed.
By default the unified highlighter splits the input into passages using
a sentence break iterator. However we don't check if the field is tokenized
or not so `keyword` field also applies the break iterator even though they can
only match on the entire content. This means that by default we'll split the
content of a `keyword` field on sentence break if the requested number of fragments
is set to a value different than 0 (default to 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight the entire values. The number of requested fragments control the number of
matched values are returned but the boundary_scanner_type is now ignored.
Note that this is the behavior in 6x but some refactoring of the Lucene's highlighter
exposed this bug in Elasticsearch 7x.
There is a possible NPE in IntervalFilter xcontent serialization when scripts are
used, and `equals` and `hashCode` are also incorrectly implemented for script
filters. This commit fixes both.
* Copying the request is not necessary here. We can simply release it once the response has been generated and a lot of `Unpooled` allocations that way
* Relates #32228
* I think the issue that preventet that PR that PR from being merged was solved by #39634 that moved the bulk index marker search to ByteBuf bulk access so the composite buffer shouldn't require many additional bounds checks (I'd argue the bounds checks we add, we save when copying the composite buffer)
* I couldn't neccessarily reproduce much of a speedup from this change, but I could reproduce a very measureable reduction in GC time with e.g. Rally's PMC (4g heap node and bulk requests of size 5k saw a reduction in young GC time by ~10% for me)
This commit fixes a number of issues with data replication:
- Local and global checkpoints are not updated after the new operations have been fsynced, but
might capture a state before the fsync. The reason why this probably went undetected for so
long is that AsyncIOProcessor is synchronous if you index one item at a time, and hence working
as intended unless you have a high enough level of concurrent indexing. As we rely in other
places on the assumption that we have an up-to-date local checkpoint in case of synchronous
translog durability, there's a risk for the local and global checkpoints not to be up-to-date after
replication completes, and that this won't be corrected by the periodic global checkpoint sync.
- AsyncIOProcessor also has another "bad" side effect here: if you index one bulk at a time, the
bulk is always first fsynced on the primary before being sent to the replica. Further, if one thread
is tasked by AsyncIOProcessor to drain the processing queue and fsync, other threads can
easily pile more bulk requests on top of that thread. Things are not very fair here, and the thread
might continue doing a lot more fsyncs before returning (as the other threads pile more and
more on top), which blocks it from returning as a replication request (e.g. if this thread is on the
primary, it blocks the replication requests to the replicas from going out, and delaying
checkpoint advancement).
This commit fixes all these issues, and also simplifies the code that coordinates all the after
write actions.
Some unit test checking locale sensitive functionality require the
-Djava.locale.providers=SPI,COMPAT flag to be set. When running tests though
gradle we pass this already to the BuildPlugin, but running from the IDE this
might need to be set manually. Adding a note explaining this to the
CONTRIBUTING.md doc and leaving a note in the test comment of
SearchQueryIT.testRangeQueryWithLocaleMapping which is a test we know
that suffers from this issue.
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.
It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.
Related to #47567
Some queries return bulk scorers that can be significantly faster than
iterating naively over the scorer. By giving script_score a BulkScorer
that would delegate to the wrapped query, we could make it faster in some cases.
Closes#40837
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.
The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.
Optimization is skipped when an index has 50% or more data with
the same value.
Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field(#44021)
This allows to skip non-competitive segments.
3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.
4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.
5. When index is force merged to a single segment, #48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.
Backport for #48804
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.
It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.
Related to #47567
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.
In contrast to prior PR (#49647), this PR also fixes (see b3697cc) a situation where the previous
index expression logic had an interesting side effect. For bulk requests (which had resolveIndex
= false), the reroute phase was waiting for the index to appear in case where it was not present,
and for all other replication requests (resolveIndex = true) it would right away throw an
IndexNotFoundException while resolving the name and exit. With #49647, every replication
request was now waiting for the index to appear, which was problematic when the given index
had just been deleted (e.g. deleting a follower index while it's still receiving requests from the
leader, where these requests would now wait up to a minute for the index to appear). This PR
now adds b3697cc on top of that prior PR to make sure to reestablish some of the prior behavior
where the reroute phase waits for the bulk request for the index to appear. That logic was in
place to ensure that when an index was created and not all nodes had learned about it yet, that
the bulk would not fail somewhere in the reroute phase. This is now only restricted to the
situation where the current node has an older cluster state than the one that coordinated the
bulk request (which checks that the index is present). This also means that when an index is
deleted, we will no longer unnecessarily wait up to the timeout for the index o appear, and
instead fail the request.
Closes#20279
* Make BlobStoreRepository Aware of ClusterState (#49639)
This is a preliminary to #49060.
It does not introduce any substantial behavior change to how the blob store repository
operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency
by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation
(create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple.
This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the
repository operates in #49060
This commit adds a function in NodeClient that allows to track the progress
of a search request locally. Progress is tracked through a SearchProgressListener
that exposes query and fetch responses as well as partial and final reduces.
This new method can be used by modules/plugins inside a node in order to track the
progress of a local search request.
Relates #49091
This stems from a time where index requests were directly forwarded to
TransportReplicationAction. Nowadays they are wrapped in a BulkShardRequest, and this logic is
obsolete.
Closes#20279
This change adds a dynamic cluster setting named `indices.id_field_data.enabled`.
When set to `false` any attempt to load the fielddata for the `_id` field will fail
with an exception. The default value in this change is set to `false` in order to prevent
fielddata usage on this field for future versions but it will be set to `true` when backporting
to 7x. When the setting is set to true (manually or by default in 7x) the loading will also issue
a deprecation warning since we want to disallow fielddata entirely when https://github.com/elastic/elasticsearch/issues/26472
is implemented.
Closes#43599
JavaDateFormatter should keep the pattern with the prefixed 8 as it will be used for serialisation. The stripped pattern should be used for the enclosed formatters.
closes#48698
Fixes a bug where a scripted upsert that causes a dynamic mapping update is retried (because
mapping update is still in-flight), and the request is mutated multiple times.
Closes#48670
Backport of #49076
In case an exception occurs inside a pipeline processor,
the pipeline stack is kept around as header in the exception.
Then in the on_failure processor the id of the pipeline the
exception occurred is made accessible via the `on_failure_pipeline`
ingest metadata.
Closes#44920
Preliminary to shorten the diff of #49060. In #49060
we execute cluster state updates during the writing of a new
index gen and thus it must be an async API.
All the implementations of `EsBlobStoreTestCase` use the exact same
bootstrap code that is also used by their implementation of
`EsBlobStoreContainerTestCase`.
This means all tests might as well live under `EsBlobStoreContainerTestCase`
saving a lot of code duplication. Also, there was no HDFS implementation for
`EsBlobStoreTestCase` which is now automatically resolved by moving the tests over
since there is a HDFS implementation for the container tests.
The annotated text mapper has a field type that currently extends StringFieldType,
which means that all the positional-related query factory methods need to be copied
over from TextFieldType. In addition, MappedFieldType.intervals() hasn't been
overridden, so you can't use intervals queries with annotated text - a major drawback,
since one of the purposes of annotated text is to be able to run positional queries against
annotations.
This commit changes the annotated text field type to extend TextFieldType instead,
adding tests to ensure that position queries work correctly.
Closes#49289
Same as #49518 pretty much but for GCS.
Fixing a few more spots where input stream can get closed
without being fully drained and adding assertions to make sure
it's always drained.
Moved the no-close stream wrapper to production code utilities since
there's a number of spots in production code where it's also useful
(will reuse it there in a follow-up).
Backport of #49552.
Closes#49428. The code that works out an HTTP code for an exception didn't
consider the JsonParseException case, meant that an invalid JSON request could
result in a 500 Internal Server Error. Now it returns 400 Bad Request.
This commit back ports three commits related to enabling the simple
connection strategy.
Allow simple connection strategy to be configured (#49066)
Currently the simple connection strategy only exists in the code. It
cannot be configured. This commit moves in the direction of allowing it
to be configured. It introduces settings for the addresses and socket
count. Additionally it introduces new settings for the sniff strategy
so that the more generic number of connections and seed node settings
can be deprecated.
The simple settings are not yet registered as the registration is
dependent on follow-up work to validate the settings.
Ensure at least 1 seed configured in remote test (#49389)
This fixes#49384. Currently when we select a random subset of seed
nodes from a list, it is possible for 0 seeds to be selected. This test
depends on at least 1 seed being selected.
Add the simple strategy to cluster settings (#49414)
This is related to #49067. This commit adds the simple connection
strategy settings and strategy mode setting to the cluster settings
registry. With these changes, the simple connection mode can be used.
Additionally, it adds validation to ensure that settings cannot be
misconfigured.
The new CompensatedSum is a nice DRY refactor, but had the unanticipated
side effect of creating a lot of object allocation in the aggregation hot collection
loop: one object per visited document, per aggregator. In some places it
created two per-doc-per-agg (weighted avg, geo centroids, etc) since there
were multiple compensations being maintained.
This PR moves the object creation out of the hot loop so that it is now
created once per segment, and resets the internal state each time through
the loop
A few enhancements to `SnapshotResiliencyTests`:
1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures)
2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response)
3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method
4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.
In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.
Closes#44256Closes#42350Closes#44435
This commit enhances the required pipeline functionality by changing it
so that default/request pipelines can also be executed, but the required
pipeline is always executed last. This gives users the flexibility to
execute their own indexing pipelines, but also ensure that any required
pipelines are also executed. Since such pipelines are executed last, we
change the name of required pipelines to final pipelines.
Just realized we were missing some annotations here which was somewhat
confusing since other methods/parameters have the `Nullable` annotation
wherever a `null` can be passed.
Backport of #47208.
Closes#46900. When running ES with `--quiet`, if ES then exits abnormally, a
user has to go hunting in the logs for the error. Instead, never close
System.err, and print more information to it if ES encounters a fatal error
e.g. config validation, or some fatal runtime exception. This is useful when
running under e.g. systemd, since the error will go into the journal.
Note that stderr is still closed in daemon (`-d`) mode.
This change automatically pre-sort search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to ensure that shards that contain recent data are executed first so that intermediate merge have more chance to contain contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least
recent documents from time-based indices.
Relates #49091
Currently the condtion that is supposed to test creation of test instances with
multiple indices is never true because it compares Strings with an enum. This
changes it so the condition uses the enum constants instead.
Lucene 8.3 included a root fix for #43976, which was temporarily fixed in elasticsearch
by #44340. Since we have upgraded to 8.3 we no longer need this workaround. This
commit fixes the test that was added to check the workaround, and instead checks that
fields with index_phrases enabled correctly build queries when used with multi-term
synonyms.
Closes#47777
When opening a translog file, we check whether the UUID matches what we expect (the UUID
from the latest commit). The UUID check can in certain cases fail when the translog is
corrupted. This commit changes the ordering of the checks so that corruption is detected first.
If the translog on a replica is corrupt, we should not perform an
operation-based recovery or utilize sync_id as we won't be able to open
an engine in the next step. This change adds an extra validation that
ensures translog is okay when preparing a peer recovery request.
This change disables the query and request cache when
profile is set to true in the request. This means that profiled queries
will not check caches to execute the query and the result will never be
added in the cache either.
Closes#33298
Lucene now allows us to explore the structure of a query using QueryVisitors,
delegating the knowledge of how to recurse through and collect terms to the
query implementations themselves. The percolator currently has a home-grown
external version of this API to construct sets of matching terms that must be
present in a document in order for it to possibly match the query.
This commit removes the home-grown implementation in favour of one using
QueryVisitor. This has the added benefit of making interval queries available
for percolator pre-filtering. Due to a bug in multi-term intervals (LUCENE-9050)
it also includes a clone of some of the lucene intervals logic, that can be removed
once upstream has been fixed.
Closes#45639
This is a pure code rearrangement refactor. Logic for what specific ValuesSource instance to use for a given type (e.g. script or field) moved out of ValuesSourceConfig and into CoreValuesSourceType (previously just ValueSourceType; we extract an interface for future extensibility). ValueSourceConfig still selects which case to use, and then the ValuesSourceType instance knows how to construct the ValuesSource for that case.
This commit changes the ThreadContext to just use a regular ThreadLocal
over the lucene CloseableThreadLocal. The CloseableThreadLocal solves
issues with ThreadLocals that are no longer needed during runtime but
in the case of the ThreadContext, we need it for the runtime of the
node and it is typically not closed until the node closes, so we miss
out on the benefits that this class provides.
Additionally by removing the close logic, we simplify code in other
places that deal with exceptions and tracking to see if it happens when
the node is closing.
Closes#42577
This API call in most implementations is fairly IO heavy and slow
so it is more natural to be async in the first place.
Concretely though, this change is a prerequisite of #49060 since
determining the repository generation from the cluster state
introduces situations where this call would have to wait for other
operations to finish. Doing so in a blocking manner would break
`SnapshotResiliencyTests` and waste a thread.
Also, this sets up the possibility to in the future make use of async IO
where provided by the underlying Repository implementation.
In a follow-up `SnapshotsService#getRepositoryData` will be made async
as well (did not do it here, since it's another huge change to do so).
Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
We shouldn't be throwing `RepositoryException` when the repository
wasn't concurrently modified in an unexpected fashion (i.e. on the blob/file level).
When we know that the known repo gen moved higher in terms of the generation
tracked in master memory we should throw the concurrent snapshot exception.
This change makes concurrent snapshot create and delete always throw the same exception,
prevents unnecessary listings when the generation is known to be off and prevents
future test failures in SLM tests that assume the concurrent snapshot exception
is always thrown here.
Without this change, the newly added test randomly fails the `instanceOf` assertion
by running into a `RepositoryException`.
Reindex, update by query and delete by query would silently disregard
RED/unavailable shards, thus not copying, updating or deleting matching
data in those shards. Now use `allow_partial_search_results=false` to
ensure these operations fail if the search crosses an unavailable chard.
Added the option to explicitly specify `allow_partial_search_results=true`
for reindex only (seemed too strange for update/delete by query).
Relates #45739 and #42612
* [ML] ML Model Inference Ingest Processor (#49052)
* [ML][Inference] adds lazy model loader and inference (#47410)
This adds a couple of things:
- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:
* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* fixing version of indexed docs for model inference
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.
This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.
This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.
Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
* Better Exceptions on Concurrent Snapshot Operations
It is somewhat tricky to debug test failures from concurrent operations
without having the exact knowledge of what ran concurrently so I added
it to these exceptions in all spots.
The network disruption was acting on node ids and node names
which made reconnects not work. Moved all usages to node names
to fix this. Since the map of all nodes in the test is indexed
by name this was easier to work with.
* Make FsBlobContainer Listing Resilient to Concurrent Modifications
If we list out files in a folder via the lazily computed directory
stream, we have to deal with concurrent deletes when reading the file
attributes since we don't have a lock on the directory in any way.
Closes#37581
Make queries on the “_index” field fast-fail if the target shard is an index that doesn’t match the query expression. Part of the “canMatch” phase optimisations.
Closes#48473
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.
Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of #38941
causing trouble via SLM (see https://github.com/elastic/elasticsearch/issues/47520).
Closes#47834Closes#49048
This commit introduces a new class called ESSloppyMath
that is meant to reflect the purpose of Lucene's SloppyMath,
but add additional unimplemented faster alternatives to math functions.
The two that are used by geotile-grid a lot are sinh/atan.
In a quick elasticsearch rally benchmark for geotile-grid on Switzerland
data points, this shows a (1.22x) 22% speed-up over using Math's functions.
closes#41166.
Currently the `_analyze` endpoint doesn't correctly use normalizers specified
in the request. This change fixes that by returning the resolved normalizer from
TransportAnalyzeAction#getAnalyzer and updates test to be able to catch this
in the future.
Closes#48650
Today we wrap exceptions that occur while executing an ingest processor
in an ElasticsearchException. Today, in ExceptionsHelper#unwrapCause we
only unwrap causes for exceptions that implement
ElasticsearchWrapperException, which the top-level
ElasticsearchException does not. Ultimately, this means that any
exception that occurs during processor execution does not have its cause
unwrapped, and so its status is blanket treated as a 500. This means
that while executing a bulk request with an ingest pipeline,
document-level failures that occur during a processor will cause the
status for that document to be treated as 500. Since that does not give
the client any indication that they made a mistake, it means some
clients will enter infinite retries, thinking that there is some
server-side problem that merely needs to clear. This commit addresses
this by introducing a dedicated ingest processor exception, so that its
causes can be unwrapped. While we could consider a broader change to
unwrap causes for more than just ElasticsearchWrapperExceptions, that is
a broad change with unclear implications. Since the problem of reporting
500s on client errors is a user-facing bug, we take the conservative
approach for now, and we can revisit the unwrapping in a future change.
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
When a node shuts down, `TransportService` moves to stopped state and
then closes connections. If a request is done in between, an exception
was thrown that was not retried in replication actions. Now throw a
wrapped `NodeClosedException` exception instead, which is correctly
handled in replication action. Fixed other usages too.
Relates #42612
Ensures that we always use the primary term established by the primary to index docs on the
replica. Makes the logic around replication less brittle by always using the operation primary
term on the replica that is coming from the primary.
Reverts #48947 and fixes the issue orginally addressed by removing the assertion.
It turns out we can't simply pass empty shard generations to the snapshot finalization in the
BwC case as that results in no indices being added to the meta for the given snapshot since
we take the indices from the shard generations (even in the BwC case the `null` generations work
fine for this).
Closes#48983
Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting
java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.
This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.
This commits sends the cluster name and discovery naode in the transport
level handshake response. This will allow us to stop sending the
transport service level handshake request in the 8.0-8.x release cycle.
It is necessary to start sending this in 7.x so that 8.0 is guaranteed
to be communicating with a version that sends the required information.
Currently the BulkProcessor class uses a single scheduler to schedule
flushes and retries. Functionally these are very different concerns but
can result in a dead lock. Specifically, the single shared scheduler
can kick off a flush task, which only finishes it's task when the bulk
that is being flushed finishes. If (for what ever reason), any items in
that bulk fails it will (by default) schedule a retry. However, that retry
will never run it's task, since the flush task is consuming the 1 and
only thread available from the shared scheduler.
Since the BulkProcessor is mostly client based code, the client can
provide their own scheduler. As-is the scheduler would require
at minimum 2 worker threads to avoid the potential deadlock. Since the
number of threads is a configuration option in the scheduler, the code
can not enforce this 2 worker rule until runtime. For this reason this
commit splits the single task scheduler into 2 schedulers. This eliminates
the potential for the flush task to block the retry task and removes this
deadlock scenario.
This commit also deprecates the Java APIs that presume a single scheduler,
and updates any internal code to no longer use those APIs.
Fixes#47599
Note - #41451 fixed the general case where a bulk fails and is retried
that can result in a deadlock. This fix should address that case as well as
the case when a bulk failure *from the flush* needs to be retried.
We were tripping the assertion that the makes sure we only have empty `ShardGenerations` in `RepositoryData` in the BwC case because shard generations were passed to the `Repository` in the BwC case. Fixed by only generating empty shard gen for BwC snapshots in `SnapshotsService`.
Backport of #48450.
Make a number of changes so that code in the `server` directory is more
resilient to automatic formatting. This covers:
* Reformatting multiline JSON to embed whitespace in the strings
* Move some comments around to they aren't auto-formatted to a strange
place. This also required moving some `&&` and `||` operators from the
end-of-line to start-of-line`.
* Add helper method `reformatJson()`, to strip whitespace from a JSON
document using XContent methods. This is sometimes necessary where
a test is comparing some machine-generated JSON with an expected
value.
Also, `HyperLogLogPlusPlus.java` is now excluded from formatting because it
contains large data tables that don't reformat well with the current settings,
and changing the settings would be worse for the rest of the codebase.
The realtime GET API currently has erratic performance in case where a document is accessed
that has just been indexed but not refreshed yet, as the implementation will currently force an
internal refresh in that case. Refreshing can be an expensive operation, and also will block the
thread that executes the GET operation, blocking other GETs to be processed. In case of
frequent access of recently indexed documents, this can lead to a refresh storm and terrible
GET performance.
While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted
to read from the translog in case of realtime GET API or update API, this was removed in 5.0
(#20102) to avoid inconsistencies between values that were returned from the translog and
those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and
upsert to read from the translog again as it was easier to guarantee consistency for these, and
also brought back more predictable performance characteristics of this API. Calls to the realtime
GET API, however, would still always do a refresh if necessary to return consistent results. This
means that users that were calling realtime GET APIs to coordinate updates on client side
(realtime GET + CAS for conditional index of updated doc) would still see very erratic
performance.
This PR (together with #48707) resolves the inconsistencies between reading from translog and
index. In particular it fixes the inconsistencies that happen when requesting stored fields, which
were not available when reading from translog. In case where stored fields are requested, this
PR will reparse the _source from the translog and derive the stored fields to be returned. With
this, it changes the realtime GET API to allow reading from the translog again, avoid refresh
storms and blocking the GET threadpool, and provide overall much better and predictable
performance for this API.
We should not open new engines if a shard is closed. We break this
assumption in #45263 where we stop verifying the shard state before
creating an engine but only before swapping the engine reference.
We can fail to snapshot the store metadata or checkIndex a closed shard
if there's some IndexWriter holding the index lock.
Closes#47060
This change fixes a poisonous situation where an ongoing recovery was
canceled because a better copy was found on a node that the cluster had
previously tried allocating the shard to but failed. The solution is to
keep track of the set of nodes that an allocation was failed on so that
we can avoid canceling the current recovery for a copy on failed nodes.
Closes#47974
Today if the primary discovers that an indexing request needs a mapping update
then it will send it to the master for validation and processing. If, however,
the put-mapping request is invalid then the master still processes it as a
(no-op) cluster state update. When there are a large number of indexing
operations that result in invalid mapping updates this can overwhelm the
master.
However, the primary already has a reasonably up-to-date mapping against which
it can check the (approximate) validity of the put-mapping request before
sending it to the master. For instance it is not possible to remove fields in a
mapping update, so if the primary detects that a mapping update will exceed the
fields limit then it can reject it itself and avoid bothering the master.
This commit adds a pre-flight check to the mapping update path so that the
primary can discard obviously-invalid put-mapping requests itself.
Fixes#35564
Backport of #48817
This test failure manifests the limitation of the recovery source merge
policy explained in #41628. If we already merge down to a single segment
then subsequent force merges will be noop although they can prune
recovery source. We need to adjust this test until we have a fix for the
merge policy.
Relates #41628Closes#48735
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.
Fixes#48122
The problem with wrapping here is that it converts any exception into an
IAE, which we treat as a client error (400 status) whereas the exception
being wrapped here could be a server error (e.g., NPE). This commit
stops wrapping all ingest processor exceptions as IAEs.
This commit introduces a consistent, and type-safe manner for handling
global build parameters through out our build logic. Primarily this
replaces the existing usages of extra properties with static accessors.
It also introduces and explicit API for initialization and mutation of
any such parameters, as well as better error handling for uninitialized
or eager access of parameter values.
Closes#42042
* Add ingest info to Cluster Stats (#48485)
This commit enhances the ClusterStatsNodes response to include global
processor usage stats on a per-processor basis.
example output:
```
...
"processor_stats": {
"gsub": {
"count": 0,
"failed": 0
"current": 0
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0
"current": 0,
"time_in_millis": 0
}
}
...
```
The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the current per-node usage statistics that currently exist in node stats.
Closes#46146.
* fix BWC of ingest stats
The introduction of processor types into IngestStats had a bug.
It was set to `null` and set as the key to the map. This would
throw a NPE. This commit resolves this by setting all the processor
types from previous versions that are not serializing it out to
`_NOT_AVAILABLE`.
Previous behavior while copying HTTP headers to the ThreadContext,
would allow multiple HTTP headers with the same name, handling only
the first occurrence and disregarding the rest of the values. This
can be confusing when dealing with multiple Headers as it is not
obvious which value is read and which ones are silently dropped.
According to RFC-7230, a client must not send multiple header fields
with the same field name in a HTTP message, unless the entire field
value for this header is defined as a comma separated list or this
specific header is a well-known exception.
This commits changes the behavior in order to be more compliant to
the aforementioned RFC by requiring the classes that implement
ActionPlugin to declare if a header can be multi-valued or not when
registering this header to be copied over to the ThreadContext in
ActionPlugin#getRestHeaders.
If the header is allowed to be multivalued, then all such headers
are read from the HTTP request and their values get concatenated in
a comma-separated string.
If the header is not allowed to be multivalued, and the HTTP
request contains multiple such Headers with different values, the
request is rejected with a 400 status.
Today a couple of allocation deciders iterate through all the shards on a node
to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster
state updates in clusters with very high-density nodes holding many thousands
of shards even if those shards belong to closed or frozen indices. This commit
pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up
this search.
Closes#46941
Relates #48579
Co-authored-by: "hongju.xhj" <hongju.xhj@alibaba-inc.com>
Backport of #48553. Make a number of changes so that JSON in the server
directory is more resilient to automatic formatting. This covers:
* Reformatting multiline JSON to embed whitespace in the strings
* Add helper method `stripWhitespace()`, to strip whitespace from a JSON
document using XContent methods. This is sometimes necessary where
a test is comparing some machine-generated JSON with an expected
value.
Today it is possible that we import a dangling index that was created in a
newer version than one or more of the nodes in the cluster. Such an index would
prevent the older node(s) from rejoining the cluster if they were to briefly
leave it for some reason. This commit prevents the import of such dangling
indices.
Fixes#34264
With this change, we won't warm up searchers until we externally refresh
an engine. We explicitly refresh before allowing reading from a shard
(i.e., move to post_recovery state) and during resetting. These
guarantees that we have warmed up the engine before exposing the
external searcher.
Another prerequisite for #47186.
Fixes the shard snapshot status reporting for failed shards
in the corner case of failing the shard because of an exception
thrown in `SnapshotShardsService` and not the repository.
We were missing the update on the `snapshotStatus` instance in
this case which made the transport APIs using this field report
back an incorrect status.
Fixed by moving the failure handling to the `SnapshotShardsService`
for all cases (which also simplifies the code, the ex. wrapping in
the repository was pointless as we only used the ex. trace upstream
anyway).
Also, added an assertion to another test that explicitly checks this
failure situation (ex. in the `SnapshotShardsService`) already.
Closes#48526
We can run into an already closed store here and hence
throw on trying to increment the ref count => moving to
the guarded ref count increment
closes#48625
Previously the functions accepted a doc values reference, whereas they now
accept the name of the vector field. Here's an example of how a vector function
was called before and after the change.
```
Before: cosineSimilarity(params.query_vector, doc['field'])
After: cosineSimilarity(params.query_vector, 'field')
```
This seems more intuitive, since we don't allow direct access to vector doc
values and the the meaning of `doc['field']` is unclear.
The PR makes the following changes (broken into distinct commits):
* Add new function signatures of the form `function(params.query_vector,
'field')` and deprecates the old ones. Because Painless doesn't allow two
methods with the same name and number of arguments, we allow a generic `Object`
to be passed in to the function and decide on the behavior through an
`instanceof` check.
* Refactor the class bindings so that the document field is passed to the
constructor instead of the instance method. This allows us to avoid retrieving
the vector doc values on every function invocation, which gives a tiny speed-up
in benchmarks.
Note that this PR adds new signatures for the sparse vector functions too, even
though sparse vectors are deprecated. It seemed simplest to understand (for both
us and users) to keep everything symmetric between dense and sparse vectors.
Today we won't advance the safe commit on a new global checkpoint unless
the last commit can become safe. This is not great if we have more than
two commits as we can have a new safe commit earlier.
Closes#4853
This commit fixes intermittent failures in ShuffleForcedMergePolicyTests#testDiagnostics
by setting a more restricted merge policy that ensures that extra merging will not happen
before the forced merge.
This commit fixes the expectations of SearchAfterIT#shouldFail regarding the inner exceptions that should be thrown
when testing failures. The exception is sometimes wrapped in a QueryShardException so this change only checks that
the toString representation contains the expected message.
Closes#43143
This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges
and MergePolicy#findForcedDeletesMerges. This allows time-based indices, that usually have the eldest documents
first, to be efficient at finding the most recent documents too. Although we wrap this merge policy for all indices
even though it is mostly useful for time-based but there should be no overhead for other type of indices so it's simpler
than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working
on in # remain efficient even after running a force merge.
Relates #37043
Today, we hold the engine readLock while refreshing. Although this
choice simplifies the correctness reasoning, it can block IndexShard
from closing if warming an external reader takes time. The current
implementation of refresh does not need to hold readLock as
ReferenceManager can handle errors correctly if the engine is closed in
midway.
This PR is a prerequisite that we need to solve #47186.
* Extract remote "sniffing" to connection strategy (#47253)
Currently the connection strategy used by the remote cluster service is
implemented as a multi-step sniffing process in the
RemoteClusterConnection. We intend to introduce a new connection strategy
that will operate in a different manner. This commit extracts the
sniffing logic to a dedicated strategy class. Additionally, it implements
dedicated tests for this class.
Additionally, in previous commits we moved away from a world where the
remote cluster connection was mutable. Instead, when setting updates are
made, the connection is torn down and rebuilt. We still had methods and
tests hanging around for the mutable behavior. This commit removes those.
* Introduce simple remote connection strategy (#47480)
This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
* Make remote setting updates support diff strategies (#47891)
Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.
As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
* Fix .tasks index strict mapping: parent_id should be parent_task_id
The .tasks index has mappings that's strictly defined. `parent_task_id`
was defined as `parent_id` though which would cause an exception in case
a task is persisted that has a parent task id set.
While at it, a couple of compiler warnings were addressed and a test
request builder was removed in favour of using its corresponding request.
* increment version
The expand phase is always created providing a function that builds
the next phase to be run, which has a single purpose: sending the
response back. Such small search phase is not necessary and causes some
issues when reporting search progress and counting the search phases
that need to be executed and that are already executed. We can simply
rather send back the response, without creating a specific phase for that.
This relates to the effort towards #46250. We added
tracking of the shard generation for successful
snapshots to `8.0`.
This assertion isn't correct though. While an `8.0`
master won't create an entry with sucess state and
a null shard generation it may still (on e.g. master
failover) send a success entry created by a 7.x master
with a `null` generation over the wire.
Closes#47406
This changes the queries equals() method so that the boost factors for each term
are considered for the equality calculation. This means queries are only equal
if both their terms and associated boosts match. The ordering of the terms
doesn't matter as before, which is why we internally need to sort the terms and
boost for comparison on the first equals() call like before. Boosts that are
`null` are considered equal to boosts of 1.0f because topLevelQuery() will only
wrap into BoostQuery if boost is not null and different from 1f.
Closes#48184
BytesReference is currently an abstract class which is extended by
various implementations. This makes it very difficult to use the
delegation pattern. The implication of this is that our releasable
BytesReference is a PagedBytesReference type and cannot be used as a
generic releasable bytes reference that delegates to any reference type.
This commit makes BytesReference an interface and introduces an
AbstractBytesReference for common functionality.
On data-only nodes we were not using the last persisted cluster state as base point to compute
what needed storage, but the last applied cluster state (but not necessarily properly persisted)
instead.
In #48392 we added a second computation of the sizes of the relocating shards
in `canRemain()` but passed the wrong value for `subtractLeavingShards`. This
fixes that. It also removes some unnecessary logging in a test case added in
the same commit.
This is a follow up of https://github.com/elastic/elasticsearch/issues/43453 where we added
a system property to disallow allocation awareness in search requests. Since search requests
will no longer check the allocation awareness attributes for routing in the next major version,
this change adds a deprecation warning on any setup that uses these attributes.
Relates #43453
Previously there was a bug when an query inside script_score query
was rewritten. If min_score was not set and was equal to null,
we were converting it to float value which resulted to NPE.
This commit corrects this.
Closes#48081
Brings handling of out of bounds points in linestrings in line with
points. Now points with latitude above 90 and below -90 are handled
the same way as for points by adjusting the longitude by moving it by
180 degrees.
Relates to #43916
* Do not throw errors on unknown types in SearchAfterBuilder
The support for BigInteger and BigDecimal was added for XContent in
https://github.com/elastic/elasticsearch/pull/32888. However the SearchAfterBuilder
xcontent parser doesn't expect them to be present so it throws an AssertionError.
This change fixes this discrepancy by changing the AssertionError into an
IllegalArgumentException that will not cause the node to die when thrown.
Closes#48074
Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.
The `DiskThresholdDecider` treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.
Fixes#48380
The comment says it needs random-access, but it passes `Long#MAX_VALUE` as the
lead cost, which forces sequential access, it should pass `0` instead. I took
advantage of this fix to improve the logic to leverage an estimation of the
number of times that `Bits#get` gets called to make better decisions.
Reverting the change introducing IsoLocal.ROOT and introducing IsoCalendarDataProvider that defaults start of the week to Monday and requires minimum 4 days in first week of a year. This extension is using java SPI mechanism and defaults for Locale.ROOT only.
It require jvm property java.locale.providers to be set with SPI,COMPAT
closes#41670
backport #48209
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).
This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).
Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.
This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__
Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).
This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.
Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
This commit fixes the usage of JsonStringEncoder#quoteAsUTF8 in the SearchSlowLog.
JsonStringEncoder#getInstance should always be called to get a thread local object
but this assumption was broken by #44642. This means that any slow log can throw
an AIOOBE since it uses the same byte array concurrently.
Closes#48358
The code here was needlessly complicated when it
enqueued all file uploads up-front. Instead, we can
go with a cleaner worker + queue pattern here by taking
the max-parallelism from the threadpool info.
Also, I slightly simplified the rethrow and
listener (step listener is pointless when you add the callback in the next line)
handling it since I noticed that we were needlessly rethrowing in the same
code and that wasn't worth a separate PR.
This class is only used by the blob store repository
and CCR and the abstractions didn't really make sense
with CCR ignoring the concrete `restoreFiles` method
completely and having a method used only by the blobstore
overriden as unsupported.
=> Moved to a more fitting set of abstractions
=> Dried up the stream wrapping in `BlobStoreRepository` a little
now that the `restoreFile` method could be simplified
Relates #48110 as it makes changing the API of `FileRestoreContext`
to what is needed for async restores simpler
Today it is possible that we create the `QueryCache` and then fail to create
the owning `IndexService` and this means we do not close the `QueryCache`
again. This commit addresses that leak.
Fixes#48186
Today if an Elasticsearch node reaches a disk watermark then it will repeatedly
emit logging about it, which implies that some action needs to be taken by the
administrator. This is misleading. Elasticsearch strives to keep nodes under
the high watermark, but it is normal to have a few nodes occasionally exceed
this level. Nodes may be over the low watermark for an extended period without
any ill effects.
This commit enhances the logging emitted by the `DiskThresholdMonitor` to be
less misleading. The expected case of hitting the high watermark and
immediately relocating one or more shards that to bring the node back under the
watermark again is reduced in severity to `INFO`. Additionally, `INFO` messages
are not emitted repeatedly.
Fixes#48038