It's believed that using diffs obsoletes the other mechanism for reusing the
bits of the ClusterState that didn't change between updates, but in fact we
don't know for sure how often the diff mechanism works successfully. The stats
collected here will tell us.
Right now if the number of shards for a particular index is equal across the
data paths, we tie-break on space. This changes to tie-break first on the total
number of shards for each path, and then, if that is the same, on the usable
bytes.
Relates to #26654 (it's a follow-up)
If timed runnable wraps an abstract runnable, then it should delegate to
the abstract runnable otherwise force execution and handling rejections
is dropped on the floor. Thus, timed runnable should itself be an
abstract runnable delegating all methods to the wrapped runnable in
cases when it is an abstract runnable. This commit causes this to be the
case.
Relates #27095
Memory usage of queries can't be properly accounted, which can be an issue when
large queries are cached since the actual memory usage will be much higher than
what the cache thinks. This problem is very hard if not impossible to fix so as
a workaround I would like to decrease the maximum number of cached queries so
that this problem is less likely to cause trouble in practice.
For the record, this problem is more likely to occur in envirenments that have
small shards or don't give much memory to the JVM.
Closes#26938
Today we internally accumulate elapsed scroll time in nanoseconds. The
problem here is that this can reasonably overflow. For example, on a
system with scrolls that are open for ten minutes on average, after
sixteen million scrolls the largest value that can be represented by a
long will be executed. To address this, we switch to internally
representing scrolls using microseconds as this enables with the same
number of scrolls scrolls that are open for seven days on average, or
with the same average elapsed time sixteen billion scrolls which will
never happen (executing one scroll a second until sixteen billion have
executed would not occur until more than five-hundred years had
elapsed).
Relates #27068
Upgrade to Jackson 2.9.2 and also use a boolean `closed` flag to
indicate that a FastStringReader instance is closed, so that length
is still correctly reported after the reader is closed.
Due to a change happened via #26102 to make the nested source consistent
with or without source filtering, the _source of a nested inner hit was
always wrapped in the parent path. This turned out to be not ideal for
users relying on the nested source, as it would require additional parsing
on the client side. This change fixes this, the _source of nested inner hits
is now no longer wrapped by parent json objects, irregardless of whether
the _source is included as is or source filtering is used.
Internally source filtering and highlighting relies on the fact that the
_source of nested inner hits are accessible by its full field path, so
in order to now break this, the conversion of the _source into its binary
form is performed in FetchSourceSubPhase, after any potential source filtering
is performed to make sure the structure of _source of the nested inner hit
is consistent irregardless if source filtering is performed.
PR for #26944Closes#26944
* Balance shards for an index more evenly across multiple data paths
When a node has multiple data paths configured, and is assigned all of the
shards for a particular index, it's possible now that all shards will be
assigned to the same path (see #16763).
This change keeps the same behavior around determining the "best" path for a
shard based on space, however, it enforces limits for the number of shards on a
path for an index from the single-node perspective. For example:
Assume you had a node with 4 data paths, where `/path1` has a tremendously high
amount of disk space available compared to the other paths. If you create an
index with 5 primary shards, the previous behavior would be to assign all 5
shards to `/path1`.
This change would enforce a limit of 2 shards to each data path for that
particular node, so you would end up with the following distribution:
- `/path1` - 2 shards (because it has the most usable space)
- `/path2` - 1 shard
- `/path3` - 1 shard
- `/path4` - 1 shard
Note, however, that this limit is only enforced at the local node level for
simplicity in implementation, so if you had multiple nodes, the "limit" for the
node is still 2, so assuming you had enough nodes that there was only 2 shards
for this index assigned to this node, they would still both be assigned to
`/path1`.
* Switch from ObjectLongHashMap to regular HashMap
* Remove unneeded Files.isDirectory check
* Skip iterating directories when not necessary
* Add message to assert
* Implement different (better) ranking for node paths
This is the method we discussed
* Remove unused pathHasEnoughSpace method
* Use findFirst instead of .get(0);
* Update for master merge to fix compilation
Settings.putArray -> Settings.putList
Today when getting ready to enter seccomp, we do some probes to ensure
that we are really talking to seccomp, etc. One of these probes is pure
paranoia. The paranoia was driven by a kernel bug
(https://lkml.org/lkml/2014/7/20/222) that only impacted 32-bit x86
kernels wherein invoking a non-existant syscall was not returning ENOSYS
(as it should). This probe causes problems though, for example in
containers with syscall filters, invoking a non-existant syscall will
lead to the process being sent SIGSYS and terminated. We do not need
this paranoid, we do not support 32-bit, and our other probes give us
enough of a defense to ensure that we are talking to seccomp (and we
hardcode the seccomp syscall number for platforms that we
support). Given that this probe offers us little value, but does cause
problems in valid use-cases, this commit removes this paranoia.
Relates #27016
The internal engine constructor declares a checked engine exception yet
this constructor does not actually throw this exception. This commit
removes this declaration from the internal engine constructor.
Relates #27022
Today all these API calls have a sideeffect of making documents visible
to search requests. While this is sometimes desired it's an unnecessary sideeffect
and now that we have an internal (engine-private) index reader (#26972) we artificially
add a refresh call for bwc. This change removes this sideeffect in 7.0.
Right now we are attempting to set SO_LINGER to 0 on server channels
when we are stopping the tcp transport. This is not a supported socket
option and throws an exception. This also prevents the channels from
being closed.
This commit 1. doesn't set SO_LINGER for server channges, 2. checks
that it is a supported option in nio, and 3. changes the log message
to warn for server channel close exceptions.
Today we only allow to decode byte arrays where the data has a 0 offset
and the same length as the array. Allowing to decode stuff from a slice will
make decoding IDs cheaper if the the ID is for instance coming from a term dictionary
or BytesRef.
Relates to #26931
Today, when ES detects it's using too much heap vs the configured indexing
buffer (default 10% of JVM heap) it opens a new searcher to force Lucene to move
the bytes to disk, clear version map, etc.
But this has the unexpected side effect of making newly indexed/deleted
documents visible to future searches, which is not nice for users who are trying
to prevent that, e.g. #3593.
This is also an indirect spinoff from #26802 where we potentially pay a big
price on rebuilding caches etc. when updates / realtime-get is used. We are
refreshing the internal reader for realtime gets which causes for instance
global ords to be rebuild. I think we can gain quite a bit if we'd use a reader
that is only used for GETs and not for searches etc. that way we can also solve
problems of searchers being refreshed unexpectedly aside of replica recovery /
relocation.
Closes#15768Closes#26912
Previous to this change the weights for the filter and filters aggregation were created in the `Filter(s)AggregatorFactory` which meant that they were created regardless of whether the aggregator actually collects any documents. This meant that for filters that are expensive to initialise, requests would not be quick when the query of the request was (or effectively was) a `match_none` query.
This change maintains a single Weight instance for each filter across parent buckets but passes a weight supplier to the aggregator instances which will create the weight on first call and then return that instance for subsequent calls.
Our convention is to use lower case when naming things "Tcp". For
example, `TcpTransport`. This commit renames the outlier
(TcpTransportTests) to use lower case.
When a node which contains the primary shard is unavailable, the primary
stats (and the total stats) of an `IndexStats` will be empty for a short
moment (while the primary shard is being relocated). However, we assume
that these stats are always non-empty when handling `_cat/indices` in
RestIndicesAction. This commit checks the content of these stats before
accessing.
Closes#26942
While opening a connection to a node, a channel can subsequently
close. If this happens, a future callback whose purpose is to close all
other channels and disconnect from the node will fire. However, this
future will not be ready to close all the channels because the
connection will not be exposed to the future callback yet. Since this
callback is run once, we will never try to disconnect from this node
again and we will be left with a closed channel. This commit adds a
check that all channels are open before exposing the channel and throws
a general connection exception. In this case, the usual connection retry
logic will take over.
Relates #26932
Previously collisions in headers between old and new contexts could be silently ignored, allowing the original context's headers to "win". This commit fixes the headers to require they are disjoint.
the only nodes preference was used as a replacement of `_primary` which was removed. Sadly, it's not the same as we also check that it makes sense - i.e., that the given node has a shard copy. Since the test uses indices with >1 shards, the primaries may be spread to multiple nodes. Using one (like it currently does) will fail for some primaries. Using all will probably end up hitting all nodes.
This commit removed the `_only_nodes` usage in favor a simple search
Relates to #26791
Cache final result instead of result of advanceExact.
Fix SortedNumericDoubleValues does not test MEDIAN mode
Replace deprecated random string generation method
Today we return a `String[]` that requires copying values for every
access. Yet, we already store the setting as a list so we can also directly
return the unmodifiable list directly. This makes list / array access in settings
a much cheaper operation especially if lists are large.
The shard preference _primary, _replica and its variants were useful
for the asynchronous replication. However, with the current impl, they
are no longer useful and should be removed.
Closes#26335
Support for search_after and geo distance sorting is broken when the optimized LatLonDocValuesField.distanceSort is used.
This commit fixes the parsing of the search_after value for this case.
* Add additional low-level logging handler
We have the trace handler which is useful for recording sent messages
but there are times where it would be useful to have more low-level
logging about the events occurring on a channel. This commit adds a
logging handler that can be enabled by setting a certain log level
(org.elasticsearch.transport.netty4.ESLoggingHandler) to trace that
provides trace logging on low-level channel events and includes some
information about the request/response read/write events on the channel
as well.
* Remove imports
* License header
* Remove redundant
* Add test
* More assertions
Today we represent each value of a list setting with it's own dedicated key
that ends with the index of the value in the list. Aside of the obvious
weirdness this has several issues especially if lists are massive since it
causes massive runtime penalties when validating settings. Like a list of 100k
words will literally cause a create index call to timeout and in-turn massive
slowdown on all subsequent validations runs.
With this change we use a simple string list to represent the list. This change
also forbids to add a settings that ends with a .0 which was internally used to
detect a list setting. Once this has been rolled out for an entire major
version all the internal .0 handling can be removed since all settings will be
converted.
Relates to #26723
Add fuzzy_transpositions parameter to multi_match and query_string queries.
Add fuzzy_transpositions, fuzzy_prefix_length and fuzzy_max_expansions
parameters to simple_query_string query.
The single shard optimization that we have in our search api changes the type of response returned by the query transport action name based on the shard search request. if the request goes to one shard, we will do query and fetch at the same time, hence the response will be different. The proxying layer used in cross cluster search was not aware of this distinction, which causes serialization issues every time a cross cluster search request goes to a single shard and goes through a gateway node which has to forward the shard request to a data node. The coordinating node would then expect a QueryFetchSearchResult while the gateway would return a QuerySearchResult.
Closes#26833
This is a follow up to #26764. That commit set SO_LINGER to 0 in order
to fix a scenario where we were running out of resources during CI. We
are primarily interested in setting this to 0 when stopping the
tranport. Allowing TIMED_WAIT is standard for other failure scenarios
during normal operation.
Unfortunately this commit set SO_LINGER to 0 every time we close
NodeChannels. NodeChannels can be closed in case of an exception or
other failures (such as parsing a response). We want to only disable
linger when actually shutting down.
Since `#getAsMap` exposes internal representation we are trying to remove it
step by step. This commit is cleaning up some xcontent writing as well as
usage in tests
This change adds cgroup memory usage/limit to the OS stats section of
the node stats on Linux. This information is useful because in Docker
containers the standard node stats report the host memory limit, not
taking account of extra restrictions that may have been applied to the
container.
The original idea was to store these values as Long, truncating any values
outside the range of long. However, this meant that in the relatively common
case of no limit being applied, users would not see the same value in the OS
stats as they see by querying Linux directly. So instead the values are stored
as String. This change places a burden on consumers of the strings to
convert the strings to numbers and decide what to do about extremely large
values, but there will be very few consumers and they would need to have a
policy for dealing with "no limit" in any case.
We use group settings historically instead of using a prefix setting which is more restrictive and type safe. The majority of the usecases needs to access a key, value map based on the _leave node_ of the setting ie. the setting `index.tag.*` might be used to tag an index with `index.tag.test=42` and `index.tag.staging=12` which then would be turned into a `{"test": 42, "staging": 12}` map. The group settings would always use `Settings#getAsMap` which is loosing type information and uses internal representation of the settings. Using prefix settings allows now to access such a method type-safe and natively.
Elasticsearch doesn't allow having an index alias named with the same name as an existing index. We currently have logic that tries to prevents that in the `MetaData.Builder#build()` method. Sadly that logic is flawed. Depending on iteration order, we may allow the above to happen (if we encounter the alias before the index).
This commit fixes the above and improves the error message while at it.
Note that we have a lot of protections in place before we end up relying on the metadata builder (validating this when we process APIs). I takes quite an abuse of the cluster to get that far.
Early termination with index sorting always return the best top N in the response but set the flag `terminated_early`
in the response. This can be confusing because we use the same flag for `terminate_after` which on the contrary returns partial results.
This change removes the flag when results are not partial (early termination due to index sorting) and keeps it only when `terminate_after` is used.
Closes#26408
Numeric fields no longer support the index_options parameter. This changes the parameter
to be rejected in numeric field types after it was deprecated in 6.0.
Closes#21475
The log message here is incorrect, a failure here is occuring on the
post-operation global checkpoint sync, not the background sync. This
commit fixes the log message.
We were accidentally defaulting it to the scroll size.
Untwists some of the tricks that we play with parsing
so that the size is no longer scrambled.
Closes#26761
The routing entry is used by external components to check whether the shard is ready to perform as primary. Most notably, the peer recovery source handler delays recoveries until the shard routing entry says the shard is ready.
When a shard is promoted to primary, we currently update the shard's routing entry before we finish all the work relating to the promotion. This can cause recoveries to fail later on because the `GlobalCheckpointTracker` isn't set (yet) to primary mode.
This commit fixes this issue by updating the routing entry last.
The previous test was too strict and enforced that the target object was a
parent. It has been relaxed so that fields that belong to the same nested
object can copy to each other.
The commit also improves error handling in case of multi-fields. The current
validation works but may throw confusing error messages since it assumes that
only object fields may introduce dots in fields names while multi fields may
too.
Closes#26763
Other tokenizers like the standard tokenizer allow overriding the default
maximum token length of 255 using the `"max_token_length` parameter. This change
enables using this parameter also with the whitespace tokenizer. The range that
is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT,
which is 1024 * 1024 = 1048576 characters.
Closes#26643
This change adds a fromXContent method to Settings that allows to read
the xcontent that is produced by toXContent. It also replaces the entire settings
loader infrastructure and removes the structured map representation. Future PRs will
also tackle the `getAsMap` that exposes the internal represenation of settings for
better encapsulation.
This commit fixes issues with the global checkpoint sync test. The test
was off in initializing the maximum sequence number on the primary
shard, and off setting the local knowledge on the primary of the global
checkpoint on the replica.
Both TopDocsCollector and LeafCollector were being kept around at the aggregator level. In case the nested aggregator would do a post collection then this could cause pushing down docids to top hits child aggregators that already moved the next LeafCollector (causing assertions to trip and incorrect results).
By keeping track of the LeafCollector in a seperate map at the leaf bucket level this problem can simply not happen any more as the place holding LeafCollector is no longer shared.
Also LeafCollector instances for TopDocsCollectors are no longer pre-created as the beginning a new segment is evaluated. There is no guarantee that TopHitsAggregator encounters a document for a particular bucket and there has to be logic to create LeafCollector instances which have not been seen before.
Closes#26738
Adds the wait_for_active_shards parameter to the index open command. Similar to the index creation command, the index open command will now, by default, wait until the primaries have been allocated.
Closes#20937
It is the exciting return of the global checkpoint background
sync. Long, long ago, in snapshot version far, far away we had and only
had a global checkpoint background sync. This sync would fire
periodically and send the global checkpoint from the primary shard to
the replicas so that they could update their local knowledge of the
global checkpoint. Later in time, as we sped ahead towards finalizing
the initial version of sequence IDs, we realized that we need the global
checkpoint updates to be inline. This means that on a replication
operation, the primary shard would piggy back the global checkpoint with
the replication operation to the replicas. The replicas would update
their local knowledge of the global checkpoint and reply with their
local checkpoint. However, this could allow the global checkpoint on the
primary to advance again and the replicas would fall behind in their
local knowledge of the global checkpoint. If another replication
operation never fired, then the replicas would be permanently behind. To
account for this, we added one more sync that would fire when the
primary shard fell idle. However, this has problems:
- the shard idle timer defaults to five minutes, a long time to wait
for the replicas to learn of the new global checkpoint
- if a replica missed the sync, there was no follow-up sync to catch
them up
- there is an inherent race condition where the primary shard could
fall idle mid-operation (after having sent the replication request to
the replicas); in this case, there would never be a background sync
after the operation completes
- tying the global checkpoint sync to the idle timer was never natural
To fix this, we add two additional changes for the global checkpoint to
be synced to the replicas. The first is that we add a post-operation
sync that only fires if there are no operations in flight and there is a
lagging replica. This gives us a chance to sync the global checkpoint to
the replicas immediately after an operation so that they are always kept
up to date. The second is that we add back a global checkpoint
background sync that fires on a timer. This timer fires every thirty
seconds, and is not configurable (for simplicity). This background sync
is smarter than what we had previously in the sense that it only sends a
sync if the global checkpoint on at least one replica is lagging that of
the primary. When the timer fires, we can compare the global checkpoint
on the primary to its knowledge of the global checkpoint on the replicas
and only send a sync if there is a shard behind.
Relates #26591
When using a bulk processor, the thread context was not preserved for the flush runnable which is
executed in another thread in the thread pool. This change wraps the flush runnable in a context
preserving runnable so that the headers and transients from the creation time of the bulk processor
are available during the execution of the flush.
Closes#26596
The `fielddata` field and the use of the `_name` field in the short syntax of the range
query have been deprecated in 5.0 and can be removed.
The same goes for the deprecated `score_mode` field in HasParentQueryBuilder,
the deprecated `like_text`, `ids` and `docs` parameter in the `more_like_this` query,
the deprecated query name in the short version of the `regexp` query, and several
deprecated alternative field names in other query builders.
The `type` field has been deprecated in 5.0 and can be removed. It has been
replaced by using the MatchPhraseQueryBuilder or the
MatchPhrasePrefixQueryBuilder. The `slop` field has also been deprecated and can
be removed, the phrase and phrase prefix query builders still provide this
parameter.
- Removes mutual dependency between GatewayMetaState and TransportNodesListGatewayMetaState
- Deguices MetaDataIndexUpgradeService
- Deguices GatewayMetaState
- Makes Gateway the master-level component that is only responsible for coordinating the state recovery