This builds an `auto_date_histogram` aggregator that natively aggregates
from many buckets and uses it when the `auto_date_histogram` used to use
`asMultiBucketAggregator` which should save a significant amount of
memory in those cases. In particular, this happens when
`auto_date_histogram` is a sub-aggregator of a multi-bucketing aggregator
like `terms` or `histogram` or `filters`. For the most part we preserve
the original implementation when `auto_date_histogram` only collects from
a single bucket.
It isn't possible to "just port the aggregator" without taking a pretty
significant performance hit because we used to rewrite all of the
buckets every time we switched to a coarser and coarser rounding
configuration. Without some major surgery to how to delay sub-aggs
we'd end up rewriting the delay list zillions of time if there are many
buckets.
The multi-bucket version of the aggregator has a "budget" of "wasted"
buckets and only rewrites all of the buckets when we exceed that budget.
Now that we don't rebucket every time we increase the rounding we can no
longer get an accurate count of the number of buckets! So instead the
aggregator uses an estimate of the number of buckets to trigger switching
to a coarser rounding. This estimate is likely to be *terrible* when
buckets are far apart compared to the rounding. So it also uses the
difference between the first and last bucket to trigger switching to a
coarser rounding. Which covers for the shortcomings of the bucket
estimation technique pretty well. It also causes the aggregator to emit
fewer buckets in cases where they'd be reduced together on the
coordinating node. This is wonderful! But probably fairly rare.
All of that does buy us some speed improvements when the aggregator is
a child of multi-bucket aggregator:
Without metrics or time zone: 25% faster
With metrics: 15% faster
With time zone: 22% faster
Relates to #56487
This commit bumps our JNA dependency from 4.5.1 to 5.5.0, so that we are
now on the latest maintained line, and pick up a large collection of bug
fixes that have accumulated.
This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in aborted state but hasn't started snapshotting on a node
we can only send the failed notification for it if the shard was actually supposed
to execute on the local node.
Without this fix, if shard snapshots were spread out across at least two data nodes
(so that each data node does not have all the primaries) the abort would actually
never wait on the data nodes. This isn't a big deal with uuid shard generations
but could lead to potential corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3 minute wait there).
Another negative side-effect of this bug was that master would receive a lot more
shard status update messages for aborted shards since each data node not assigned
a primary would send one message for that primary.
The dangling indices action is not a proper master node action so it does not
retry when executed while the cluster hasn't fully formed yet.
Since we use node restarts when setting up the dangling indices state we need
to manually ensure a fully formed cluster before moving on with the tests to avoid
failures.
Backport of #50920. Part of #48366. Implement an API for listing,
importing and deleting dangling indices.
Co-authored-by: David Turner <david.turner@elastic.co>
MappedFieldType is a combination of two concerns:
* an extension of lucene's FieldType, defining how a field should be indexed
* a set of query factory methods, defining how a field should be searched
We want to break these two concerns apart. This commit is a first step to doing this, breaking
the inheritance relationship between MappedFieldType and FieldType. MappedFieldType
instead has a series of boolean flags defining whether or not the field is searchable or
aggregatable, and FieldMapper has a separate FieldType passed to its constructor defining
how indexing should be done.
Relates to #56814
* Normalized prefix for rollover API (#57271)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Lee Hinman <lee@writequit.org>
It fixes the issue #53388
by normalizing prefix at index creation request itself
* Fix compilation for backport
Co-authored-by: Gaurav Chandani <chngau@amazon.com>
After an index has been deleted it may take some time to cancel all the
maintenance tasks such as RetentionLeaseSync, it's possible that the
task is already executing before the cancellation. This commit just
avoids logging a warning message for those scenarios.
Closes#57864
Backport of (#58098)
This commit adds an optional field, `description`, to all ingest processors
so that users can explain the purpose of the specific processor instance.
Closes#56000.
Instead of serializing compilation using a plain lock / mutex combined with a double check, rely on the computeIfAbsent logic to prevent duplicated compilation of scripts. Made checkCompilationLimit to be thread-safe and lock free.
Backport: 865acad
Co-authored-by: Michael Bischoff <michael.bischoff@elastic.co>
This need some reorg of BinaryDV field data classes to allow specialisation of scripted doc values.
Moved common logic to a new abstract base class and added a new subclass to return string-based representations to scripts.
Closes#58044
* Remove usage of deprecated testCompile configuration
* Replace testCompile usage by testImplementation
* Make testImplementation non transitive by default (as we did for testCompile)
* Update CONTRIBUTING about using testImplementation for test dependencies
* Fail on testCompile configuration usage
We keep a static list of meta-fields: META_FIELDS_BEFORE_7_8
as it was before.
This is done to ensure the backwards compatability with pre 7.8 nodes.
Closes#57831
If `ExtraFS` decides to put `extra0/0` into the indices folder
then the previous logic in this test would have interpreted the `0`
as shard `0` of index `extra0` and fail to list its contents (since it's a file
and not an actual shard directory).
=> simplified the logic to use actually referenced `IndexId` for iterating over indices
instead.
Scheduling on the threadpool will throw if the scheduler is already
shut down. Handled by treating the rejection like any other non-retryable
exception.
Closes#58021
This moves the code to look up significance heuristics information like
background frequency and superset size out of
`SignificantTermsAggregatorFactory` and into its own home so that it is
easier to pass around. This will:
1. Make us feel better about ourselves for not passing around the
factory, which is really *supposed* to be a throw away thing.
2. Abstract the significance lookup logic so we can reuse it for the
`significant_text` aggregation.
3. Make if very simple to cache the background frequencies which should
speed up when the agg is a sub-agg. We had done this for numerics
but not string-shaped significant terms.
When a search phase fails, we release the context of all successful shards.
Successful shards that rewrite the request to match none will not create any context
since #. This change ensures that we don't try to release a `null` context on these
successful shards.
Closes#57945
Ensures that InternalClusterInfoService's internally cached stats are refreshed whenever the
shard size or disk usage function (to mock out disk usage) are overridden.
Closes#57888
Today `InternalEngine#releaseIndexCommit` fails with an
`AlreadyClosedException` if the engine is closed before the index commit is
released. This can happen if, for example, a node leaves and rejoins the
cluster and acquires an index commit for replica shard allocation concurrently
with shutting the shard down.
There's no need to fail the operation like this: if the engine is shut down
then we will clean up the unreferenced files when it's restarted (or if it's
allocated elsewhere) so we can suppress an `AlreadyClosedException` in this
case. This commit does so.
Fixes#57797
Per 49554 I added standard deviation sampling and variance sampling to the extended stats interface.
Closes#49554
Co-authored-by: Igor Motov <igor@motovs.org>
Co-authored-by: andrewjohnson2 <aj114114@gmail.com>
When reducing `auto_date_histogram` we were using `Rounding#round`
which is quite a bit more expensive than
```
Rounding.Prepared prepared = rounding.prepare(min, max);
long result = prepared.round(date);
```
when rounding to a non-fixed time zone like `America/New_York`. This
stops using the former and starts using the latter.
Relates to #56124
Use the the hack used in `CorruptedBlobStoreRepositoryIT` in more snapshot
failure tests to verify that BwC repository metadata is handled properly
in these so far not-test-covered scenarios.
Also, some minor related dry-up of snapshot tests.
Relates #57798
Adds assertions to Netty to make sure that its threads are not polluted by thread contexts (and
also that thread contexts are not leaked). Moves the ClusterApplierService to use the system
context (same as we do for MasterService), which allows to remove a hack from
TemplateUgradeService and makes it clearer that applying CS updates is fully executing under
system context.
If a node is disconnected we retry. It does not make sense
to retry the recovery if the node is removed from the cluster though.
=> added a CS listener that cancels the recovery for removed nodes
Also, we were running the retry on the `SAME` pool which for each retry will
be the scheduler pool. Since the error path of the listener we use here
will do blocking operations when closing the resources used by the recovery
we can't use the `SAME` pool here since not all exceptions go to the `ActionListenerResponseHandler`
threading like e.g. `NodeNotConnectedException`.
Closes#57585
In ff9e8c622427d42a2d87b4ceb298d043ae3c4e6a we changed the format
used when serializing snapshot failures in the cluster state and
`SnapshotInfo`. This turned them from a short string holding all the
nested exception messages into a multi kb stacktrace in many cases.
This is not great if you snapshot a large number of shards that all fail
for example and massively blows up the size of the GET snapshots response
if there are snapshots with failures in there.
This change reverts to the format used for exceptions before the above commit.
Also, this change short circuits logging and serialization of the failure
for an aborted snapshot where we don't care about the specific message at all
and aligns the message to "aborted" in all cases (current if we aborted before any IO,
it would have been "aborted" and an exception when aborting later during IO).
Previously, hidden indices were not included in snapshots by default, unless
specified using one of the usual methods for doing so: naming indices directly,
using index patterns starting with a ., or specifying expand_wildcards to
a value that includes hidden (e.g. all or hidden,open).
This commit changes the default expand_wildcards value to include hidden
indices.
Fixed two newly introduced issues with rollover:
1. Using auto-expand replicas, rollover could result in unexpected log
messages on future indexes.
2. It did a reroute and other heavy work on the network thread.
Closes#57706
Supersedes #57865
Relates #53965
Allow for optimistic concurrency control during ingest by checking the
sequence number and primary term. This is accomplished by defining
_if_seq_no and _if_primary_term in the pipeline, similarly to _version
and _version_type.
Closes#41255
Co-authored-by: Maria Ralli <mariai.ralli@gmail.com>
The shrink action creates a shrunken index with the target number of shards.
This makes the shrink action data stream aware. If the ILM managed index is
part of a data stream the shrink action will make sure to swap the original
managed index with the shrunken one as part of the data stream's backing
indices and then delete the original index.
(cherry picked from commit 99aeed6acf4ae7cbdd97a3bcfe54c5d37ab7a574)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
This commit fixes a bug on the composite aggregation when the index
is sorted and the primary composite source needs to round values (date_histo).
In such case, we cannot take into account the subsequent sources even if they
match the index sort because the rounding of the primary sort value may break
the original index order.
Fixes#57849
This deprecates `Rounding#round` and `Rounding#nextRoundingValue` in
favor of calling
```
Rounding.Prepared prepared = rounding.prepare(min, max);
...
prepared.round(val)
```
because it is always going to be faster to prepare once. There
are going to be some cases where we won't know what to prepare *for*
and in those cases you can call `prepareForUnknown` and stil be faster
than calling the deprecated method over and over and over again.
Ultimately, this is important because it doesn't look like there is an
easy way to cache `Rounding.Prepared` or any of its precursors like
`LocalTimeOffset.Lookup`. Instead, we can just build it at most once per
request.
Relates to #56124
Currently it is possible for a transient network error to disrupt the
start recovery request from the remote to source node. This disruption
is racy with the recovery occurring on the source node. It is possible
for the source node to finish and clear its recovery. When this occurs,
the recovery cannot be reestablished and the "no two start" assertion
is tripped. This commit fixes this issue by allowing two starts if the
finalize request has been received.
Fixes#57416.
Currently, the translog ops request is reentrent when there is a mapping
update. The impact of this is that a translog ops ends up waiting on the
pre-existing listener and it is never completed. This commit fixes this
by introducing a new code path to avoid the idempotency logic.
The action name is passed to the `ChannelListener` and is used for
logging purposes. Currently, we are using the incorrect action name for
the translog ops listener. This commit fixes the issue.
This reworks string flavored implementations of the `terms` aggregation
to save memory when it is under another bucket by dropping the usage of
`asMultiBucketAggregator`.
Adds assertions to Netty to make sure that its threads are not polluted by thread contexts (and
also that thread contexts are not leaked). Moves the ClusterApplierService to use the system
context (same as we do for MasterService), which allows to remove a hack from
TemplateUgradeService and makes it clearer that applying CS updates is fully executing under
system context.
Currently we check that exceptions are the same in the recovery request
tracker test. This is inconsistent because the future wraps the
exception in a new instance. This commit fixes the test by comparing a
random exception message.
Fixes#57199
In #57701 we changed mappings merging so that duplicate fields specified in mappings caused an
exception during validation. This change makes the same exception thrown when metadata fields are
duplicated. This will allow us to be strict currently with plans to make the merging more
fine-grained in a later release.
Currently a network disruption will fail a peer recovery. This commit
adds network errors as retryable actions for the source node.
Additionally, it adds sequence numbers to the recovery request to
ensure that the requests are idempotent.
Additionally it adds a reestablish recovery action. The target node
will attempt to reestablish an existing recovery after a network
failure. This is necessary to ensure that the retries occurring on the
source node provide value in bidirectional failures.
Fix broken numeric shard generations when reading them from the wire
or physically from the physical repository.
This should be the cheapest way to clean up broken shard generations
in a BwC and safe-to-backport manner for now. We can potentially
further optimize this by also not doing the checks on the generations
based on the versions we see in the `RepositoryData` but I don't think
it matters much since we will read `RepositoryData` from cache in almost
all cases.
Closes#57798
When you run a `significant_terms` aggregation on a field and it *is*
mapped but there aren't any values for it then the count of the
documents that match the query on that shard still have to be added to
the overall doc count. I broke that in #57361. This fixes that.
Closes#57402
Before to determine if a field is meta-field, a static method of MapperService
isMetadataField was used. This method was using an outdated static list
of meta-fields.
This PR instead changes this method to the instance method that
is also aware of meta-fields in all registered plugins.
Related #38373, #41656Closes#24422
We want to validate the DataStreams on creation to make sure the future backing
indices would not clash with existing indices in the system (so we can
always rollover the data stream).
This changes the validation logic to allow for a DataStream to be created
with a backing index that has a prefix (eg. `shrink-foo-000001`) even if the
former backing index (`foo-000001`) exists in the system.
The new validation logic will look for potential index conflicts with indices
in the system that have the counter in the name greater than the data stream's
generation.
This ensures that the `DataStream`'s future rollovers are safe because for a
`DataStream` `foo` of generation 4, we will look for standalone indices in the
form of `foo-%06d` with the counter greater than 4 (ie. validation will fail if
`foo-000006` exists in the system), but will also allow replacing a
backing index with an index named by prefixing the backing index it replaces.
(cherry picked from commit 695b242d69f0dc017e732b63737625adb01fe595)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
* Fix Bug With RepositoryData Caching
This fixes a really subtle bug with caching `RepositoryData`
that can corrupt a repository.
We were caching `RepositoryData` serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in `ShardGenerations`
that were not written to the repository because the repository
or cluster did not yet support `ShardGenerations`.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.
Relates #57798
This makes it easier to debug where such tasks come from in case they are returned from the get tasks API.
Also renamed the last occurrence of waitForCompletion to waitForCompletionTimeout in get async search request.
Improve efficiency of background indexer by allowing to add
an assertion for failures while they are produced to prevent
queuing them up.
Also, add non-blocking stop to the background indexer so that when
stopping multiple indexers we don't needlessly continue indexing
on some indexers while stopping another one.
Closes#57766
This removes the deprecated `asMultiBucketAggregator` wrapper from
`scripted_metric`. Unlike most other such removals, this isn't likely to
save much memory. But it does make the internals of the aggregator
slightly less twisted.
Relates to #56487
Backport of #57640 to 7.x branch.
Composable templates with exact matches, can match with the data stream name, but not with the backing index name.
Also if the backing index naming scheme changes, then a composable template may never match with a backing index.
In that case mappings and settings may not get applied.
#47711 and #47246 helped to validate that monitoring settings are
rejected at time of setting the monitoring settings. Else an invalid
monitoring setting can find it's way into the cluster state and result
in an exception thrown [1] on the cluster state application (there by
causing significant issues). Some additional monitoring settings have
been identified that can result in invalid cluster state that also
result in exceptions thrown on cluster state application.
All settings require a type of either http or local to be
applicable. When a setting is changed, the exporters are automatically
updated with the new settings. However, if the old or new settings lack
of a type setting an exception will be thrown (since exporters are
always of type 'http' or 'local'). Arguably we shouldn't blindly create
and destroy new exporters on each monitoring setting update, but the
lifecycle of the exporters is abit out the scope this PR is trying to
address.
This commit introduces a similar methodology to check for validity as
#47711 and #47246 but this time for ALL (including non-http) settings.
Monitoring settings are not useful unless there an exporter with a type
defined. The type is used as dependent setting, such that it must
exist to set the value. This ensures that when any monitoring settings
changes that they can only get added to cluster state if the type
exists. If the type exists (and the other validations pass) then the
exporters will get re-built and the cluster state remains valid.
Tests have been included to ensure that all dynamic monitoring settings
have the type as dependent settings.
[1]
org.elasticsearch.common.settings.SettingsException: missing exporter type for [found-user-defined] exporter
at org.elasticsearch.xpack.monitoring.exporter.Exporters.initExporters(Exporters.java:126) ~[?:?]
Prior to this commit, `cluster.max_shards_per_node` is not correctly handled
when it is set via the YAML config file, only when it is set via the Cluster
Settings API.
This commit refactors how the limit is implemented, both to enable correctly
handling the setting in the YAML and to more effectively centralize the logic
used to enforce the limit. The logic used to apply the limit, as well as the
setting value, has been moved to the new `ShardLimitValidator`.
Merges the remaining implementation of `significant_terms` into `terms`
so that we can more easilly make them work properly without
`asMultiBucketAggregator` which *should* save memory and speed them up.
Relates #56487
Today `GET _cluster/health?wait_for_events=...&timeout=...` will wait
indefinitely for the master to process the pending cluster health task,
ignoring the specified timeout. This could take a very long time if the master
is overloaded. This commit fixes this by adding a timeout to the pending
cluster health task.
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (#55363).
The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
The test for `auto_date_histogram` as trying to round `Long.MAX_VALUE`
if there were 0 buckets. That doesn't work.
Also, this replaces all of the class variables created to make
consistent random result when testing `InternalAutoDateHistogram` with
the newer `randomResultsToReduce` which is a little simpler to
understand.
The test failed when it was running with 4 replicas and 3 indexing
threads. The recovering replicas can prevent the global checkpoint from
advancing. This commit increases the timeout to 60 seconds for this
suite and the check for no inflight requests.
Closes#57204
SigTerms cannot run on fields that are not searchable, and SigText
cannot run on fields that do not have analyzers. Both of these
situations fail today with an esoteric exception, so this just formalizes
the constraint by throwing an IllegalArgumentException up front.
In practice, the only affected field seems to be the `binary` field,
which is neither searchable or has a default analyzer (e.g. even numeric
and keyword fields have a default analyzer despite not being tokenized)
Adds supported-type tests, and makes some changes to the test itself
to allow testing sigtext (indexing _source).
Also a few tweaks to the test to avoid bad randomization (negative
numbers, etc).
When the `terms` agg runs against strings and uses global ordinals it
has an optimization when it collects segments that only ever have a
single value for the particular string. This is *very* common. But I
broke it in #57241. This fixes that optimization and adds `debug`
information that you can use to see how often we collect segments of
each type. And adds a test to make sure that I don't break the
optimization again.
We also had a specialiation for when there isn't a filter on the terms
to aggregate. I had removed that specialization in #57241 which resulted
in some slow down as well. This adds it back but in a more clear way.
And, hopefully, a way that is marginally faster when there *is* a
filter.
Closes#57407
Almost every outbound message is serialized to buffers of 16k pagesize.
We were serializing these messages off the IO loop (and retaining the concrete message
instance as well) and would then enqueue it on the IO loop to be dealt with as soon as the
channel is ready.
1. This would cause buffers to be held onto for longer than necessary, causing less reuse on average.
2. If a channel was slow for some reason, not only would concrete message instances queue up for it, but also 16k of buffers would be reserved for each message until it would be written+flushed physically.
With this change, the serialization happens on the event loop which effectively limits the number of buffers that `N` IO-threads will ever use so long as messages are small and channels writable.
Also, this change dereferences the reference to the concrete outbound message as soon as it has been serialized to save some more on GC.
This reduces the GC time for a default PMC run by about 50% in experiments (3 nodes, 2G heap each, loopback ... obvious caveat is that GC isn't that heavy in the first place with recent changes but still a measurable gain).
I also expect it to be helpful for master node stability by causing less of a spike if master is e.g. hit by a large number of requests that are processed batched (e.g. shard snapshot status updates) and responded to in a short time frame all at once.
Obviously, the downside to this change is that it introduces more latency on the IO loop for the serialization. But since we read all of these messages on the IO loop as well I don't see it as much of a qualitative change really and the more predictable buffer use seems much more valuable relatively.
Allow for a fairer distribution of snapshot and restore operations
to enable parallel snapshots and improve behaviour for parallel snapshot + restore.
Closes#55803
There are several mapping settings that are currently re-parsed every
time they are read. This can be quite frequent, for example within every
document ingestion. This commit moves the parsed versions of these
mapping settings to be stored in IndexSettings, just as other index settings
are already.
closes#57395
The `routingNodes` variable is unused. Replace `clusterState.getRoutingNodes()` with `routingNodes`.
Co-authored-by: Boice Huang <boicehuang@tencent.com>
At some point, we changed the supported-type test to also catch
assertion errors. This has the side effect of also catching the
`fail()` call inside the try-catch, which silently smothered some
failures.
This modifies the test to throw at the end of the try-catch
block to prevent from accidentally catching itself.
Catching the AssertionError is convenient because there are other locations
that do throw an assertion in tests (due to hitting an assertion
before the exception is thrown) so I think we should keep it around.
Also includes a variety of fixes to other tests which were failing
but being silently smothered.
In case the local checkpoint in the latest commit is less
than the last processed local checkpoint we would recover
0 ops and hence not commit again.
This would lead to the logic in `IndexShard#recoverLocallyUpToGlobalCheckpoint`
not seeing the latest local checkpoint when it reload the safe commit from the store
and thus cause inefficient recoveries because the recoveries would work from a
lower than possible local checkpoint.
Closes#57010
This merges the global-ordinals-based implementation for
`significant_terms` into the global-ordinals-based implementation of
`terms`, removing a bunch of copy and pasted code that is subtly
different across the two implementations and replacing it with an
explicit `ResultStrategy` with nice stuff like Javadoc.
The actual behavior is mostly unchanged, though I was able to remove a
redundant copy of bytes representing the string from the result
construction phase of `significant_terms`.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Previously we'd get a `ClassCastException` when you tried to use
`numeric_type` on `scaled_float`. Oops! This cleans up the CCE and moves
some code around so the casting actually works.