* Wait for all Rec. to Stop on Node Close
* This issue is in the `RecoverySourceHandler#acquireStore`. If we submit the store release to the generic threadpool while it is getting shut down we never complete the futue we wait on (in the generic pool as well) and fail to ever release the store potentially.
* Fixed by waiting for all recoveries to end on node close so that we aways have a healthy thread pool here
* Closes#45956
This is the first PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.
The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.
When recovering a shard locally, we use a translog snapshot from
newSnapshotFromGen which consists of all readers from a certain
generation. In the test, we use newSnapshotFromMinSeqNo for the
expectation. The snapshot of this method includes only readers
containing operations in the requesting range.
Closes#46022
If a node is misconfigured to talk to remote node HTTP port (instead of
transport port) eventually it will receive an HTTP response from the
remote node on transport port (this happens when a node sends
accidentally line terminating byte in a transport request).
If this happens today it results in a non-friendly log message and a
long stack trace.
This commit adds a check if a malformed response is HTTP response. In
this case, a concise log message would appear.
(cherry picked from commit 911d02b7a9c3ce7fe316360c127a935ca4b11f37)
When a query contains a mandatory clause that doesn't track the max score per
block, we disable the max score optimization. Previously, we were doing this by
wrapping the collector with a FilterCollector that always returned
ScoreMode.COMPLETE.
However we weren't adjusting totalHitsThreshold, so the collector could still
call Scorer#setMinCompetitiveScore. It is against the method contract to call
setMinCompetitiveScore when the score mode is COMPLETE, and some scorers like
ReqOptSumScorer throw an error in this case.
This commit tries to disable the optimization by always setting
totalHitsThreshold to max int, as opposed to wrapping the collector.
Today we might carry on a big merge uncommitted and therefore
occupy a significant amount of diskspace for quite a long time
if for instance indexing load goes down and we are not quickly
reaching the translog size threshold. This change will cause a
flush if we hit a significant merge (512MB by default) which
frees diskspace sooner.
We only sync translog if the given offset hasn't synced yet. We can't
verify the global checkpoint from the latest translog checkpoint unless
a sync has occurred.
Closes#46065
Relates #45634
Today the `DiskThresholdDecider` attempts to account for already-relocating
shards when deciding how to allocate or relocate a shard. Its goal is to stop
relocating shards onto a node before that node exceeds the low watermark, and
to stop relocating shards away from a node as soon as the node drops below the
high watermark.
The decider handles multiple data paths by only accounting for relocating
shards that affect the appropriate data path. However, this mechanism does not
correctly account for _new_ relocating shards, which are unwittingly ignored.
This means that we may evict far too many shards from a node above the high
watermark, and may relocate far too many shards onto a node causing it to blow
right past the low watermark and potentially other watermarks too.
There are in fact two distinct issues that this PR fixes. New incoming shards
have an unknown data path until the `ClusterInfoService` refreshes its
statistics. New outgoing shards have a known data path, but we fail to account
for the change of the corresponding `ShardRouting` from `STARTED` to
`RELOCATING`, meaning that we fail to find the correct data path and treat the
path as unknown here too.
This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths
for all shards. With the changes here, the data paths are handled in tests as
they are in production, except that their sizes are fake.
Fixes#45177
Today we assume that document failures can not occur for deletes. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
Today we assume that document failures can not occur for no-ops. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
Backport of 1a0dddf4ad24b3f2c751a1fe0e024fdbf8754f94 (AKA #445395)
* Add support for a Range field ValuesSource, including decode logic for range doc values and exposing RangeType as a first class enum
* Provide hooks in ValuesSourceConfig for aggregations to control ValuesSource class selection on missing & script values
* Branch aggregator creation in Histogram and DateHistogram based on ValuesSource class, to enable specialization based on type. This is similar to how Terms aggregator works.
* Prioritize field type when available for selecting the ValuesSource class type to use for an aggregation
Searching with `allowPartialSearchResults=false` could still return
partial search results during recovery. If a shard copy fails
with a "shard not available" exception, the failure would be ignored and
a partial result returned. The one case where this is known to happen
is when a shard copy is recovering when searching, since
`IllegalIndexShardStateException` is considered a "shard not available"
exception.
Relates to #42612
Since #45473, we trim translog below the local checkpoint of the safe
commit immediately if soft-deletes enabled. In
testRestoreLocalHistoryFromTranslog, we should have a safe commit after
recoverFromTranslog is called; then we will trim translog files which
contain only operations that are at most the global checkpoint.
With this change, we relax the assertion to ensure that we don't put
operations to translog while recovering history from the local translog.
If two translog syncs happen concurrently, then one can return before
its operations are marked as persisted. In general, this should not be
an issue; however, peer recoveries currently rely on this assumption.
Closes#29161
Today we create new engines under IndexShard#mutex. This is not ideal
because it can block the cluster state updates which also execute under
the same mutex. We can avoid this problem by creating new engines under
a separate mutex.
Closes#43699
This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we move them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.
This adds a pipeline aggregation that calculates the cumulative
cardinality of a field. It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.
This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).
This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.
Backport of #45799
This PR modifies the logic in IngestService to preserve the original content type
on the IndexRequest, such that when a document with a content type like SMILE
is submitted to a pipeline, the resulting document that is persisted will remain in
the original content type (SMILE in this case).
This is essentially the same issue fixed in #43362 but for http request
version instead of the request method. We have to deal with the
case of not being able to parse the request version, otherwise
channel closing fails.
Fixes#43850
The snapshot status when blocking can still be INIT in rare cases when
the new cluster state that has the snapshot in `STARTED` hasn't yet
become visible.
Fixes#45917
This commit enhances logging for 2 cases:
1. If non-TLS enabled node receives transport message from TLS enabled
node on transport port.
2. If non-TLS enabled node receives HTTPs request on transport port.
(cherry picked from commit 4f52ebd32eb58526b4c8022f8863210bf88fc9be)
Closing a `RemoteClusterConnection` concurrently with trying to connect
could result in double invoking the listener.
This fixes
RemoteClusterConnectionTest#testCloseWhileConcurrentlyConnecting
Closes#45845
In case of an in-progress snapshot this endpoint was broken because
it tried to execute repository operations in the callback on a
transport thread which is not allowed (only generic or snapshot
pool are allowed here).
This commit namespaces the existing processors setting under the "node"
namespace. In doing so, we deprecate the existing processors setting in
favor of node.processors.
Since #45136, we use soft-deletes instead of translog in peer recovery.
There's no need to retain extra translog to increase a chance of
operation-based recoveries. This commit ignores the translog retention
policy if soft-deletes is enabled so we can discard translog more
quickly.
Backport of #45473
Relates #45136
Today, when rolling a new translog generation, we block all write
threads until a new generation is created. This choice is perfectly
fine except in a highly concurrent environment with the translog
async setting. We can reduce the blocking time by pre-sync the
current generation without writeLock before rolling. The new step
would fsync most of the data of the current generation without
blocking write threads.
Close#45371
Follow up on #32806.
The system property es.http.cname_in_publish_address is deprecated
starting from 7.0.0 and deprecation warning should be added if the
property is specified.
This PR will go to 7.x and master.
Follow-up PR to remove es.http.cname_in_publish_address property
completely will go to the master.
(cherry picked from commit a5ceca7715818f47ec87dd5f17f8812c584b592b)
* There is no point in listing out every shard over and over when the `index-N` blob in the shard contains a list of all the files
* Rebuilding the `index-N` from the `snap-${uuid}.dat` blobs does not provide any material benefit. It only would in the corner case of a corrupted `index-N` but otherwise uncorrupted blobs since we neither check the correctness of the content of all segment blobs nor do we do a similar recovery at the root of the repository.
* Also, at least in version `6.x` we only mark a shard snapshot as successful after writing out the updated `index-N` blob so all snapshots that would work with `7.x` and newer must have correct `index-N` blobs
=> Removed the rebuilding of the `index-N` content from `snap-${uuid}.dat` files and moved to only listing `index-N` when taking a snapshot instead of listing all files
=> Removed check of file existence against physical blob listing
=> Kept full listing on the delete side to retain full cleanup of blobs that aren't referenced by the `index-N`
This PR introduces a mechanism to cancel a search task when its corresponding connection gets closed. That would relief users from having to manually deal with tasks and cancel them if needed. Especially the process of finding the task_id requires calling get tasks which needs to call every node in the cluster.
The implementation is based on associating each http channel with its currently running search task, and cancelling the task when the previously registered close listener gets called.
Today we can release a Store using CancellableThreads. If we are holding
the last reference, then we will verify the node lock before deleting
the store. Checking node lock performs some I/O on FileChannel. If the
current thread is interrupted, then the channel will be closed and the
node lock will also be invalid.
Closes#45237
* Add is_write_index column to cat.aliases (#44772)
Aliases have had the option to set `is_write_index` since 6.4,
but the cat.aliases action was never updated.
* correct version bounds to 7.4
Most of our CLI tools use the Terminal class, which previously did not provide methods for writing to standard output. When all output goes to standard out, there are two basic problems. First, errors and warnings are "swallowed" in pipelines, making it hard for a user to know when something's gone wrong. Second, errors and warnings are intermingled with legitimate output, making it difficult to pass the results of interactive scripts to other tools.
This commit adds a second set of print commands to Terminal for printing to standard error, with errorPrint corresponding to print and errorPrintln corresponding to println. This leaves it to developers to decide which output should go where. It also adjusts existing commands to send errors and warnings to stderr.
Usage is printed to standard output when it's correctly requested (e.g., bin/elasticsearch-keystore --help) but goes to standard error when a command is invoked incorrectly (e.g. bin/elasticsearch-keystore list-with-a-typo | sort).
* Repository Cleanup Endpoint (#43900)
* Snapshot cleanup functionality via transport/REST endpoint.
* Added all the infrastructure for this with the HLRC and node client
* Made use of it in tests and resolved relevant TODO
* Added new `Custom` CS element that tracks the cleanup logic.
Kept it similar to the delete and in progress classes and gave it
some (for now) redundant way of handling multiple cleanups but only allow one
* Use the exact same mechanism used by deletes to have the combination
of CS entry and increment in repository state ID provide some
concurrency safety (the initial approach of just an entry in the CS
was not enough, we must increment the repository state ID to be safe
against concurrent modifications, otherwise we run the risk of "cleaning up"
blobs that just got created without noticing)
* Isolated the logic to the transport action class as much as I could.
It's not ideal, but we don't need to keep any state and do the same
for other repository operations
(like getting the detailed snapshot shard status)
This change adds a new option called user_dictionary_rules to
Kuromoji's tokenizer. It can be used to set additional tokenization rules
to the Japanese tokenizer directly in the settings (instead of using a file).
This commit also adds a check that no rules are duplicated since this is not allowed
in the UserDictionary.
Closes#25343
Backports PR #45737:
Similar to PR #45030 integration test testDontCacheScripts() was moved to unit test AvgAggregatorTests#testDontCacheScripts.
AvgIT class was removed.
If the background refresh is running, but not finished yet then the
document might not be visible to the next search. Thus, if
scheduledRefresh returns false, we need to wait until the background
refresh is done.
Closes#45571
`OsProbe` fetches cgroup data from the filesystem, and has asserts that
check for missing values. This PR changes most of these asserts into
runtime checks, since at least one user has reported an NPE where
a piece of cgroup data was missing.
Backport of #45606 to 7.x.
If a scheduled task of an AbstractAsyncTask starts after it was closed,
then isScheduledOrRunning can remain true forever although no task is
running or scheduled.
Closes#45576
The elasticsearch keystore was originally backed by a PKCS#12 keystore, which had several limitations. To overcome some of these limitations in encoding, the setting names existing within the keystore were limited to lowercase alphanumberic (with underscore). Now that the keystore is backed by an encrypted blob, this restriction is no longer relevant. This commit relaxes that restriction by allowing uppercase ascii characters as well.
closes#43835
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes are moved to the
org.elasticsearch.geomtery package.
Backport of #45332Closes#45048
This commit adds an explicit error message when a create index request
contains a settings key that is not a json object. Prior to this change
the user would be given a ClassCastException with no explanation of what
went wrong.
closes#45126
The current idiom is to have the InternalAggregator find all the
buckets sharing the same key, put them in a list, get the first bucket
and ask that bucket to reduce all the buckets (including itself).
This a somewhat confusing workflow, and feels like the aggregator should
be reducing the buckets (since the aggregator owns the buckets), rather
than asking one bucket to do all the reductions.
This commit basically moves the `Bucket.reduce()` method to the
InternalAgg and renames it `reduceBucket()`. It also moves the
`createBucket()` (or equivalent) method from the bucket to the
InternalAgg as well.
This commit adds CNAME reporting for transport.publish_address same way
it's done for http.publish_address.
Relates #32806
Relates #39970
(cherry picked from commit e0a2558a4c3a6b6fbfc6cd17ed34a6f6ef7b15a9)
* Since we're buffering network reads to the heap and then deserializing them it makes no sense to buffer a message that is 90% of the heap size since we couldn't deserialize it anyway
* I think `30%` is a more reasonable guess here given that we can reasonably assume that the deserialized message will be larger than the serialized message itself and processing it will take additional heap as well
When a persistent task attempts to register an allocated task locally,
this creates the Task object and starts tracking it locally. If there
is a failure while initializing the task, this is handled by a catch
and subsequent error handling (canceling, unregistering, etc).
But if the task fails to be created because an exception is thrown
in the tasks ctor, this is uncaught and fails the cluster update
thread. The ramification is that a persistent task remains in the
cluster state, but is unable to create the allocated task, and the
exception prevents other tasks "after" the poisoned task from starting
too.
Because the allocated task is never created, the cancellation tools
are not able to remove the persistent task and it is stuck as a
zombie in the CS.
This commit adds exception handling around the task creation,
and attempts to notify the master if there is a failure (so the
persistent task can be removed). Even if this notification fails,
the exception handling means the rest of the uninitialized tasks
can proceed as normal.
* Painless generates a ton of duplicate strings and empty `Hashmap` instances wrapped as unmodifiable
* This change brings down the static footprint of Painless on an idle node by 20MB (after running the PMC benchmark against said node)
* Since we were looking into ways of optimizing for smaller node sizes I think this is a worthwhile optimization
Moves methods added in #44213 and uses them to configure the port range
for `ExternalTestCluster` too.
These were still using `9300-9400` ( teh default ) and running into
races.
* Fixing this for two reasons:
1. Why not verify that the seed we wrote is actually there when we can
2. The AWS S3 SDK started to log a bunch of WARN messages about not fully reading the stream now that we started to abuse the read blob as an `exists` check after removing that method from the blob container
* Introduce Spatial Plugin (#44389)
Introduce a skeleton Spatial plugin that holds new licensed features coming to
Geo/Spatial land!
* [GEO] Refactor DeprecatedParameters in AbstractGeometryFieldMapper (#44923)
Refactor DeprecatedParameters specific to legacy geo_shape out of
AbstractGeometryFieldMapper.TypeParser#parse.
* [SPATIAL] New ShapeFieldMapper for indexing cartesian geometries (#44980)
Add a new ShapeFieldMapper to the xpack spatial module for
indexing arbitrary cartesian geometries using a new field type called shape.
The indexing approach leverages lucene's new XYShape field type which is
backed by BKD in the same manner as LatLonShape but without the WGS84
latitude longitude restrictions. The new field mapper builds on and
extends the refactoring effort in AbstractGeometryFieldMapper and accepts
shapes in either GeoJSON or WKT format (both of which support non geospatial
geometries).
Tests are provided in the ShapeFieldMapperTest class in the same manner
as GeoShapeFieldMapperTests and LegacyGeoShapeFieldMapperTests.
Documentation for how to use the new field type and what parameters are
accepted is included. The QueryBuilder for searching indexed shapes is
provided in a separate commit.
* [SPATIAL] New ShapeQueryBuilder for querying indexed cartesian geometry (#45108)
Add a new ShapeQueryBuilder to the xpack spatial module for
querying arbitrary Cartesian geometries indexed using the new shape field
type.
The query builder extends AbstractGeometryQueryBuilder and leverages the
ShapeQueryProcessor added in the previous field mapper commit.
Tests are provided in ShapeQueryTests in the same manner as
GeoShapeQueryTests and docs are updated to explain how the query works.
* It's in the title, follow up to #45233
* Flatten more listeners into `StepListener`
* Remove duplication from repo and index bootstrap and asserting that the steps execute successfully
* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Currently, when adding a new mapping, we attempt to parse + merge it before
checking whether its top-level document type matches the existing type. So when
a user attempts to introduce a new mapping type, we may give a confusing error
message around merging instead of complaining that it's not possible to add
more than one type ("Rejecting mapping update to [my-index] as the final
mapping would have more than 1 type...").
This PR moves the type validation to the start of
`MetaDataMappingService#applyRequest` so that we make sure the type matches
before performing any mapper merging.
We already partially addressed this issue in #29316, but the tests there
focused on `MapperService` and did not catch this problem with end-to-end
mapping updates.
Addresses #43012.
We accidentally introduced this bug when adding a typeless version of the
rollover request. The bug is not present if include_type_name is set to true.
The previous hasProcessors method would validate if a processor was
present within a pipeline, but would not return the contents of the
processors. This does not allow a consumer to inspect the processor for
specific metadata. The method now returns the list of processors based
on the class of the processor passed in.
Because auto-date-histo can perform multiple reductions while
merging buckets, we need to ensure that the intermediate reductions
are done with a `finalReduce` set to false to prevent Pipeline aggs
from generating their output.
Once all the buckets have been merged and the output is stable,
a mostly-noop reduction can be performed which will allow pipelines
to generate their output.
This commit makes sure that mapping parameters to `CreateIndex` and
`PutIndexTemplate` are keyed by the type name.
`IndexCreationTask` expects mappings to be keyed by the type name.
It asserts this for template mappings but not for the mappings in the request.
The `CreateIndexRequest` and `RestCreateIndexAction` mostly make it sure
that the mapping is keyed by a type name, but not always.
When building the create-index request outside of the REST handler, there are
a few methods to set the mapping for the request. Some of them add the type
name some of them do not.
For example, `CreateIndexRequest#mapping(String type, Map<String, ?> source)`
adds the type name, but
`CreateIndexRequest#mapping(String type, XContentBuilder source)` does not.
This PR asserts the type name in the request mapping inside `IndexCreationTask`
and makes all `CreateIndexRequest#mapping` methods add the type name.
testShouldFlushAfterPeerRecovery was added #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.
Relates #28350
Relates #45073
Today, if an operation-based peer recovery occurs, we won't trim
translog but leave it as is. Some unacknowledged operations existing in
translog of that replica might suddenly reappear when it gets promoted.
With this change, we ensure trimming translog above the starting
sequence number of phase 2. This change can allow us to read translog
forward.
* Follow up to #44949
* Stop using a special code path for multi-line JSON and instead handle its detection like that of other XContent types when creating the request
* Only leave a single path that holds a reference to the full REST request
* In the next step we can move the copying of request content to happen before the actual request handling and make it conditional on the handler in question to stop copying bulk requests as suggested in #44564
* Reduces complicated callback relations in `testSuccessfulSnapshotAndRestore` to flat steps of sequential actions
* Will refactor the other tests in this suit as a follow up
* This format certainly makes it easier to create more complicated tests that involve multiple subsequent snapshots as it would allow adding loops
When having a cluster state from 6.x, display the metadata version as the cluster state version.
Avoids confusion where a cluster state from 6.x is displayed as version 0 even if has some actual
content.
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.
This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.
The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.
Relates #41536
CellIdSource is a helper ValuesSource that encodes GeoPoint
into a long-encoded representation of the grid bucket the point
is associated with. This complicates thing as usage evolves to
support shapes that are associated with more than one bucket ordinal.
Today the test waits for one of the shards to be blocked, but this does not
mean that the block has been applied on all nodes, so a subsequent indexing
operation may still go through.
Fixes#45338
The client and remote hit sources had each their own retry mechanism,
which would do the same. Supporting resiliency we would have to expand
on the retry mechanisms and as a preparation for that, the retry
mechanism is now shared such that each sub class is only responsible for
sending requests and converting responses/failures to common format.
Part of #42612
Refreshes happening during indexing can result differen segment counts and
slightly skewed term statistics, which in turn has the potential to change
suggestion output slightly. In order to prevent this, disable refresh for the
affected tests.
Closes#43261
`newSearcher()` from lucene can randomly choose index readers which
are not compatible with our tests, like ParallelCompositeReader.
The `newIndexSearcher()` method on AggregatorTestCase is a wrapper
similar to newSearcher but compatible with our tests
This commit adds a helper method to the ingest service allowing it to
inspect a pipeline by id and verify the existence of a processor in the
pipeline. This work exposed a potential bug in that some processors
contain inner processors that are passed in at instantiation. These
processors needed a common way to expose their inner processors, so the
WrappingProcessor was created in order to expose the inner processor.
If a node exceeds the flood-stage disk watermark then we add a block to all of
its indices to prevent further writes as a last-ditch attempt to prevent the
node completely exhausting its disk space. However today this block remains in
place until manually removed, and this block is a source of confusion for users
who current have ample disk space and did not even realise they nearly ran out
at some point in the past.
This commit changes our behaviour to automatically remove this block when a
node drops below the high watermark again. The expectation is that the high
watermark is some distance below the flood-stage watermark and therefore the
disk space problem is truly resolved.
Fixes#39334
The reason field of DefaultShardOperationFailedException is lost during serialization.
This is sad because this field is checked for nullity during xcontent generation and it
means that the cause won't be included in the generated xcontent and won't be
printed in two REST API responses (Close Index API and Indices Shard Stores API).
This commit simply restores the reason from the cause during deserialization.
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.
Relates #45007
Backport of #45086
This commit adds a deprecation warning in 7.x for the Force Merge API
when both only_expunge_deletes and max_num_segments are set in a request.
Relates #44761
This commit applies a normalization process to environment paths, both
in how they are stored internally, also their settings values. This
normalization is done via two means:
- we make the paths absolute
- we remove redundant name elements from the path (what Java calls
"normalization")
This change ensures that when we compare and refer to these paths within
the system, we are using a common ground. For example, prior to the
change if the data path was relative, we would not compare it correctly
to paths from disk usage. This is because the paths in disk usage were
being made absolute.
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall-back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
This adjusts the `buckets_path` parser so that pipeline aggs can
select specific buckets (via their bucket keys) instead of fetching
the entire set of buckets. This is useful for bucket_script in
particular, which might want specific buckets for calculations.
It's possible to workaround this with `filter` aggs, but the workaround
is hacky and probably less performant.
- Adjusts documentation
- Adds a barebones AggregatorTestCase for bucket_script
- Tweaks AggTestCase to use getMockScriptService() for reductions and
pipelines. Previously pipelines could just pass in a script service
for testing, but this didnt work for regular aggs. The new
getMockScriptService() method fixes that issue, but needs to be used
for pipelines too. This had a knock-on effect of touching MovFn,
AvgBucket and ScriptedMetric
Introduce shift field to MovingFunction aggregation.
By default, shift = 0. Behavior, in this case, is the same as before.
Increasing shift by 1 moves starting window position by 1 to the right.
To simply include current bucket to the window, use shift = 1
For center alignment (n/2 values before and after the current bucket), use shift = window / 2
For right alignment (n values after the current bucket), use shift = window.
* Resolve TODO in `readString` by moving to reading chunks of `byte[]` instead of going byte by byte
* Motivated by `readString` showing up as a significant user of CPU time on the IO thread in Rally PMC benchmark
* Benchmarking this:
* Could not reproduce a slowdown in the potential worst case (one or two non-ascii chars) since in this case the cost of creating the string itself exceeds the read times anyway
* Speedup for 50%+ for reading 200 char ascii strings from `ByteBuf` or pages bytes backed streams
* Longer strings obviously get bigger speedups
* More ascii chars -> more speedup
This commit switches to using the full hash to build into the JAR
manifest, which is used in node startup and the REST main action to
display the build hash.
Currently in the transport-nio work we connect and bind channels on the
a thread before the channel is registered with a selector. Additionally,
it is at this point that we set all the socket options. This commit
moves these operations onto the event-loop after the channel has been
registered with a selector. It attempts to set the socket options for a
non-server channel at registration time. If that fails, it will attempt
to set the options after the channel is connected. This should fix
#41071.
Introduce shift field to MovingFunction aggregation.
By default, shift = 0. Behavior, in this case, is the same as before.
Increasing shift by 1 moves starting window position by 1 to the right.
To simply include current bucket to the window, use shift = 1
For center alignment (n/2 values before and after the current bucket), use shift = window / 2
For right alignment (n values after the current bucket), use shift = window.
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.
Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
* Stop Passing Around REST Request in Multiple Spots
* Motivated by #44564
* We are currently passing the REST request object around to a large number of places. This works fine since we simply copy the full request content before we handle the rest itself which is needlessly hard on GC and heap.
* This PR removes a number of spots where the request is passed around needlessly. There are many more spots to optimize in follow-ups to this, but this one would already enable bypassing the request copying for some error paths in a follow up.
Sparse role queries are executed differently than other queries in order
to account for the fact that most of the documents are filtered from search.
However this special execution does not set the scorer for the query so any
collector that needs to access the score of a document fails with an NPE.
This change fixed this bug by setting the scorer before collecting any hits
when intersecting the main query and the sparse role.
The Settings#processSetting method is intended to take a setting map and add a
setting to it, adjusting the keys as it goes in case of "conflicts" where the
new setting implies an object where there is currently a string, or vice
versa. processSetting was failing in two cases: adding a setting two levels
under a string, and adding a setting two levels under a string and four levels
under a map. This commit fixes the bug and adds test coverage for the
previously faulty edge cases.
* fix issue #43791 about settings
* add unit test in testProcessSetting()
We keep adding the current primary term to operations for which we do not assign a sequence
number. This does not make sense anymore as all operations which we care about have
sequence numbers now. The goal of this commit is to clean things up in InternalEngine and
reduce the complexity.
* There's no need to have the trie iterator hold another reference to the request object (which could be huge, see #44564)
* Also removed unused boolean field from trie node
Previously, we use ThreadPoolStats to ensure that the scheduledRefresh
triggered by the internal refresh setting update is executed before we
index a new document. With that change (#40387), this test did not fail for
the last 3 months. However, using ThreadPoolStats is not entirely watertight
as both "active" and "queue" count can be 0 in a very small interval
when ThreadPoolExecutor pulls a task from the queue but before marking
the corresponding worker as active (i.e., lock it).
Closes#39565
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in
`ESIntegTestCase#waitForRelocation()` to deal with the case that this health
request returns successfully despite the fact that there is a pending reroute task which
will relocate another shard.
Relates #44433Fixes#45003
Today the lag detector may remove nodes from the cluster if they fail to apply
a cluster state within a reasonable timeframe, but it is rather unclear from
the default logging that this has occurred and there is very little extra
information beyond the fact that the removed node was lagging. Moreover the
only forewarning that the lag detector might be invoked is a message indicating
that cluster state publication took unreasonably long, which does not contain
enough information to investigate the problem further.
This commit adds a good deal more detail to make the issues of slow nodes more
prominent:
- after 10 seconds (by default) we log an INFO message indicating that a
publication is still waiting for responses from some nodes, including the
identities of the problematic nodes.
- when the publication times out after 30 seconds (by default) we log a WARN
message identifying the nodes that are still pending.
- the lag detector logs a more detailed warning when a fatally-lagging node is
detected.
- if applying a cluster state takes too long then the cluster applier service
logs a breakdown of all the tasks it ran as part of that process.
* Dry up code for creating simple `ActionRunnable` a little
* Shorten some other code around `ActionListener` usage, in particular
when wrapping it in a `TransportResponseListener`
* As a result of #44096 this test shouldn't fail anymore on `master` and `7.4`+ so we should reenable it there
* For older versions we won't backport that change so the tests should stay disabled there
* Closes#44671
Reading up on #33673 it looks like parts of these tests have been reworked and
there is no intention to fix the remains on 7.x, so I think we can remove the
entire test.
The task that TaskManager#register returns cannot be null. The method
enforces that it is not null after calling request#createTask. It is
then needless to check for null in the listener later. Also, added the
call to the delegate listener in a finally block, just to make sure.
* We shouldn't be recreating wrapped REST handlers over and over for every request. We only use this hook in x-pack and the wrapper there does not have any per request state.
This is inefficient and could lead to some very unexpected memory behavior
=> I made the logic create the wrapper on handler registration and adjusted the x-pack wrapper implementation to correctly forward the circuit breaker and content stream flags
When finding nodes in a connected cluster for cross cluster
search the requests to get cluster state on the connected
cluster should be made in the system context because
logically they are equivalent to checking a single detail
in the local cluster state and should not require that the
user who made the request that is using this method in its
implementation is authorized to view the entire cluster
state.
Fixes#43974
Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses
`onNode()` so has no effect if the node is not in the current list of nodes.
It also discards the `Runnable` it creates without having run it, so has no
effect anyway.
This commit makes these tests much stricter about properly closing the nodes
started during `Coordinator` tests, by tracking the persisted states that are
opened, and adds an assertion to catch the trappy requirement that the closing
node still belongs to the cluster.
This commit fixes a bug when a deferred aggregator tries to early terminate the collection. In such case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.
Fixes#44909
Replaying operations from the local translog must never fail as those
operations were processed successfully on the primary before and the
mapping is up to update already. This change removes leniency during
resetting engine from translog in IndexShard and InternalEngine.
This is a temporary fix during the Joda to Java datetime transition. This will
implicitly cast a JodaCompatibleZonedDateTime to a ZonedDateTime for
both def and static types. This is necessary to insulate users from needing
to know about JodaCompatibleZonedDateTime explicitly.
TaskListener accepts today Throwable in its onFailure method. Though
looking at where it is called (TransportAction), it can never be
notified of a Throwable.
This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
We currently block the transport thread on startup, which has caused test failures. I think this is
some kind of deadlock situation. I don't think we should even block a transport thread, and
there's also no need to do so. We can just reject requests as long we're not fully set up. Note
that the HTTP layer is only started much later (after we've completed full start up of the
transport layer), so that one should be completely unaffected by this.
Closes#41745
Fixes an issue where a call to openConnection was not properly guarded, allowing an exception
to bubble up to the uncaught exception handler, causing test failures.
Closes#44912
SimpleClusterStateIT testIndicesOptions failed in #44817 because it tries to close
an index at the beginning of the test. With random index settings, it is possible that
the index has a high number of shards (10) and replicas (1), which means that on
CI this index can take time to be fully allocated.
The close index request can fail in the case where replicas are still recovering operations.
Thiscommit adds a simple ensureGreen() at the beginning of the test to be sure that all
replicas are started before trying to close the index.
closes#44817
The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails
because the resize operation is not acknowledged (see #44736). This resize
operation creates a new index "splitagain" and it results in a cluster state
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex()
to create the resized index). This cluster state update is expected to be
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but
this is not always true: the data node that was just stopped in the test before
executing the resize operation might still be considered as a "faulty" node
(and not yet removed from the cluster nodes) by the FollowersChecker. The
cluster state is then acked on all nodes but one, and it results in a non
acknowledged resize operation.
This commit adds an ensureStableCluster() check after stopping the node in
the test. The goal is to ensure that the data node has been correctly removed
from the cluster and that all nodes are fully connected to each before moving
forward with the resize operation.
Closes#44736
Today the processors setting is permitted to be set to more than the
number of processors available to the JVM. The processors setting
directly sizes the number of threads in the various thread pools, with
most of these sizes being a linear function in the number of
processors. It doesn't make any sense to set processors very high as the
overhead from context switching amongst all the threads will overwhelm,
and changing the setting does not control how many physical CPU
resources there are on which to schedule the additional threads. We have
to draw a line somewhere and this commit deprecates setting processors
to more than the number of available processors. This is the right place
to draw the line given the linear growth as a function of processors in
most of the thread pools, and that some are capped at the number of
available processors already.
With this change, we will return primary_term and seq_no of the current
document if an update is detected as a noop. We already return the
version; hence we should also return seq_no and primary_term.
Relates #42497
Adds an API to clone an index. This is similar to the index split and shrink APIs, just with the
difference that the number of primary shards is kept the same. In case where the filesystem
provides hard-linking capabilities, this is a very cheap operation.
Indexing cloning can be done by running `POST my_source_index/_clone/my_target_index` and it
supports the same options as the split and shrink APIs.
Closes#44128
While joda no longer exists in the apis for 7.x, the compatibility layer
still exists with helper methods mimicking the behavior of joda for
ZonedDateTime objects returned for date fields in scripts. This layer
was originally intended to be removed in 7.0, but is now likely to exist
for the lifetime of 7.x.
This commit adds missing methods from ChronoZonedDateTime to the compat
class. These methods were not part of joda, but are needed to act like a
real ZonedDateTime.
relates #44411
This PR makes two changes to FetchSourceSubPhase when _source is disabled and
we're in a nested context:
* If no source filters are provided, return early to avoid an NPE.
* If there are source filters, make sure to throw an exception.
The behavior was chosen to match what currently happens in a non-nested context.
Refactors GeoShapeQueryBuilder to derive from a new AbstractGeometryQueryBuilder that provides common parsing and build logic for spatial geometries. This will allow development of custom geometry queries by extending AbstractGeometryQueryBuilder preventing duplication of common spatial query logic.
The problem is that RemoteClusterConnection closes the connection manager asynchronously, which races with the threadpool being shutdown at the end of the test.
Closes#44339Closes#44610
Removes unnecessary now timeline decompositions from shape builders
and deprecates ShapeBuilders in QueryBuilder in favor of libs/geo
shapes.
Relates to #40908
This test can fail (super-rarely) if it generates a list of length 11
containing a duplicate, because the `.distinct()` reduces the list length to 10
and then it is not abbreviated any more. This change generalises the test to
cover lists of any random length.
* In both fake connection validators we were potentially executing the listener twice. This lead to the situation that the locking via `connectionLock` that ensures that each listener is only executed once ever
would fail and the lister would run twice (in which case the listeners for that node are already `null` and we get an NPE)
* The fact that two different tests fail is due to the fact that we weren't safely shutting down the threadpool which meant the the task that trips the assertion (on the generic pool) would leak into the next test and fail it
* Closes#44758
We have some old permissions lying around, granted to untrusted code
from the days of yore when we supported Groovy and Javascript
scripting. This commit removes these stale permissions.
Today our systemd service defaults to a service type of simple. This
means that systemd assumes Elasticsearch is ready as soon as the
ExecStart (bin/elasticsearch) process is forked off. This means that the
service appears ready long before it actually is, so before it is ready
to receive requests. It also means that services that want to depend on
Elasticsearch being ready to start can not as there is not a reliable
mechanism to determine this. This commit changes the service type to
notify. This requires that Elasticsearch sends a notification message
via libsystemd sd_notify method. This commit does that by using JNA to
invoke this native method. Additionally, we use this integration to also
notify systemd when we are stopping.
* Fix this test randomly failing when running into async translog persistence edge case and failing to successfully close index
* Also, slightly improve debug logging on close failure
* Closes#44681
Switches to more robust way of generating random test geometries by
reusing lucene's GeoTestUtil. Removes duplicate random geometry
generators by moving them to the test framework.
Closes#37278
The `elasticsearch-node override-version` command fails if it cannot read the
existing node metadata file. However, it reads this file strictly and fails if
there are any unknown fields, which means it will not be useful if we add
another field in future.
This commit adds leniency to this command, allowing it to ignore any unknown
fields and proceed with the downgrade. A downgrade is already unsafe, and the
user is already copiously warned about this, so being lenient in this case does
not make things much worse.
Today when creating an index and checking cluster shard limits, we check
the number of shards before applying index templates. At this point, we
do not know the actual number of shards that will be used to create the
index. In a case when the defaults are used and a template would
override, we could be grossly underestimating the number of shards that
would be created, and thus incorrectly applying the limits. This commit
addresses this by checking the shard limits after applying index
templates.
We often start testing with early access versions of new Java
versions and this have caused minor issues in our tests
(i.e. #43141) because the version string that the JVM reports
cannot be parsed as it ends with the string -ea.
This commit changes how we parse and compare Java versions to
allow correct parsing and comparison of the output of java.version
system property that might include an additional alphanumeric
part after the version numbers
(see [JEP 223[(https://openjdk.java.net/jeps/223)). In short it
handles a version number part, like before, but additionally a
PRE part that matches ([a-zA-Z0-9]+).
It also changes a number of tests that would attempt to parse
java.specification.version in order to get the full version
of Java. java.specification.version only contains the major
version and is thus inappropriate when trying to compare against
a version that might contain a minor, patch or an early access
part. We know parse java.version that can be consistently
parsed.
Resolves#43141
This commit removes the method AllocationService.reroute(ClusterState, String, boolean)
in favor of AllocationService.reroute(ClusterState, String).
Motivations are:
there are already 3 other reroute methods in this class
this method is always called with the debug parameter set to false
almost all tests use the method reroute(ClusterState, String)
* As a result of #44665 the collections returned by the deserialization methods on `StreamInput` may be either mutable or immutable now,
this PR adds documentation for that fact
In 7.x we cannot start a new master-eligible node before the cluster has formed
since we first try and update minimum_master_nodes and this is blocked. This
commit changes the test to start a data-only node so that no such adjustment is
necessary.
Relates #44685
Add more logging to indexRandom
Seems that asynchronous indexing from indexRandom sometimes indexes
the same document twice, which will mess up the expected score calculations.
For example, indexing:
{ "index" : {"_id" : "1" } }
{"important" :"phrase match", "less_important": "nothing important"}
{ "index" : {"_id" : "2" } }
{"important" :"nothing important", "less_important" :"phrase match"}
Produces the expected scores: 13.8 for doc1, and 1.38 for doc2
indexing:
{ "index" : {"_id" : "1" } }
{"important" :"phrase match", "less_important": "nothing important"}
{ "index" : {"_id" : "2" } }
{"important" :"nothing important", "less_important" :"phrase match"}
{ "index" : {"_id" : "3" } }
{"important" :"phrase match", "less_important": "nothing important"}
Produces scores: 9.4 for doc1, and 1.96 for doc2 which are found in the
error logs.
Relates to #43144
Fields in JSON logs should be an escaped JSON fields. It is a broken json value at the moment
"stats": "["group1", "group2"]", -> "stats": "[\"group1\", \"group2\"]",
This should later be refactored into a JSON array of strings (the same as types in 7.x)
Today we block access to the pending tasks API before the cluster has recovered
its state. There's no real need to do so, and the master does meaningful work
even before performing state recovery so it might sometimes be useful to allow
access to this API. This commit changes this API to ignore all cluster blocks.
Fixes#44652
The field has to be defined in log4j2.properties and should be an
escaped JSON for now (it is a broken JSON at the moment). This should later be refactored into a JSON array
of strings.
Deprecation logger was filtering log entries by key, that means that if two log messages with the same key are logged from different users, then the second log messages will be filtered.
This change allows to log deprecation message with the same key by different users.
relates #41354
backport #44587
* It seems redundant to synchronize here and check that the map hasn't checked via the `isRunning` under the mutex
* The map won't change if under the mutex that locks on all the updates to it
* Without the mutex it's very unlikely to change inside the method call relative to the likelihood of changing until the generic pool where we check for `isRunning` again anyway
-> just remove the synchronization (it's on the IO loop) and check since we do check the running state on the generic pool under the mutex anyway when we actually fail it
We have to copy the field names otherwise we either have a handle of a
list that a caller might mutate or we might mutate when they aren't
expecting it, or worse, a handle of a list that is not mutable (and we
end up mutating the list).
Relates #44665
* Mute failing test
tracked in #44552
* mute EvilSecurityTests
tracking in #44558
* Fix line endings in ESJsonLayoutTests
* Mute failing ForecastIT test on windows
Tracking in #44609
* mute BasicRenormalizationIT.testDefaultRenormalization
tracked in #44613
* fix mute testDefaultRenormalization
* Increase busyWait timeout windows is slow
* Mute failure unconfigured node name
* mute x-pack internal cluster test windows
tracking #44610
* Mute JvmErgonomicsTests on windows
Tracking #44669
* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot
Tracking #44671
* Mute NodeTests on Windows
Tracking #44256
* We only had the `size == 0` optimization in some but not all spots of deserializing collections in this class, fixed the remaining spots.
* Also fixed the a similar spot when deserializing `ThreadContextStruct` that could now be simplified (it was apparently doing it's own version of this optimization for the first map it deserialized before ... but not for the second map -> made it not instantiate anything if both maps are empty since it's always the same object here anyway)
* Optimize some StreamOutput Operations
* Writing numbers byte by byte adds a lot of unnecessary bounds checks to serialization
* Serializing to a threadlocal `byte[]` instead and bulk writing gives about a 50% speedup on `long` and `vlong` (for large numbers) writes and 30% for `int`, `vint` on Linux on an i9
* Using a threadlocal of the maximum string buffer size we used to allocate before also removes allocations when writing strings in general since we now never have to allocate a `byte[]` for that
* And don't have to GC one either resolving the TODO removed here
This commit converts all remaining TransportRequest and
TransportResponse classes to implement Writeable, and disallows
Streamable implementations.
relates #34389