The Transport#connectToNodeLight concepts is confusing and not very flexible.
neither really testable on a unittest level. This commit cleans up the code used
to connect to nodes and simplifies transport implementations to share more code.
This also allows to connect to nodes with custom profiles if needed, for instance
future improvements can be added to connect to/from nodes that are non-data nodes without
dedicated bulks and recovery connections.
This commit improves the decision explanation messages,
particularly for NO decisions, in the various AllocationDecider
implementations by including the setting(s) in the explanation
message that led to the decision.
This commit also returns a THROTTLE decision instead of a NO
decision when the concurrent rebalances limit has been reached
in ConcurrentRebalanceAllocationDecider, because it more accurately
reflects a temporary throttling that will turn into a YES decision
once the number of concurrent rebalances lessens, as opposed to a
more permanent NO decision (e.g. due to filtering).
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.
This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.
This is a cleanup of the fix pushed in https://github.com/elastic/elasticsearch/pull/20400.
FiltersFunctionScoreQuery sub query should be extracted in CustomQueryScorer.extract (and not in CustomQueryScorer.extractUnknownQuery).
This does not fix any bug in this branch (it's just a cleanup) but the intent is first to clean up and then to backport in 2.x where there is a real bug.
The bug is in 2.x only because the backport of https://github.com/elastic/elasticsearch/pull/20400 in 2.x mistakenly renamed the FiltersFunctionScoreQuery to FunctionScoreQuery.
This leads to incorrect highlighting on FiltersFunctionScoreQuery in 2.x.
When a file script fails to compile, rather than logging the
exception that caused the failure this logs the xcontent of
that exception. This is both shorter and has the script stack
which is useful for figuring out why the compilation failed.
Still logs the entire stacktrace at debug level just in case
you need it.
Relates to #21733
In making changes for the 5.0 version of snapshots, a bug was
introduced where if an index-N file could not be found for an
individual shard, the backup was to iterate over all snap-*.dat
files in the shard folder to know which snapshots contain that
shard's data, but in 5.0, reading the snap-*.dat files as backup
was incorrectly passing in the blob name for the snap-*.dat file,
thereby failing to load all index files for a given snapshot
when the index-N file is missing. This condition should be rare
as there is no reason an index-N file should be absent (unless
it was deleted or there was corruption reading the file), but
nevertheless, this situation can be encountered and this commit
fixes the bug by reading the correct snap-*.dat blob name in the
shard data folder.
TransportAddress used to be customizable per transport but this has been removed
a while ago. Therefore we can remove all usage of this method as well.
Relates to #20695
* Fix cross_fields type on multi_match query with synonyms
This change fixes the cross_fields type of the multi_match query when synonyms are involved.
Since 2.x the Lucene query parser creates SynonymQuery for words that appear at the same position.
For simple term query the CrossFieldsQueryBuilder expands the term to all requested fields and creates a BlendedTermQuery.
This change adds the same mechanism for SynonymQuery which otherwise are not expanded to all requested fields.
As a side note I wonder if we should not replace the BlendedTermQuery with the SynonymQuery. They have the same purpose and behave similarly.
Fixes#21633
* Fallback to SynonymQuery for blended terms on a single field
The disruption type LongGCDisruption simulates GCs on a node by suspending all the threads of that node. If the suspended threads are in a code section with shared JVM locks, however, it can prevent the other nodes from doing their thing. The class LongGCDisruption has a list of class names for which we know that this can occur. Whenever a test using the GC disruption type fails in mysterious ways, it becomes a long guessing game to find the offending class. This commit adds code to LongGCDisruption to automatically detect these situations, fail the test early and report the offending class and all relevant context.
This commit removes the unused AllocationExplanation class. The
RoutingAllocation class only created an empty instance of it and
never used it anywhere else. The allocation explanations will be
encompassed in the various decision classes exposed via the cluster
allocation explain API. Therefore, there is no reason to keep the
AllocationExplanation class.
With #21738 we added an indices section to the search shards api, that will return the concrete indices hit by the request, and eventually the corresponding alias filter.
The java API returns the AliasFilter object, which holds the filter itself and an array of aliases that pointed to the index in the original request. The REST layer doesn't print out the aliases array though. This commit adds the aliases array as well and tests for this.
Group, List and Affix settings generate a bogus diff that turns the actual
diff into a string containing a json structure for instance:
```
"action" : {
"search" : {
"remote" : {
"" : "{\"my_remote_cluster\":\"[::1]:60378\"}"
}
}
}
```
which make reading the setting impossible. This happens for instance
if a group or affix setting is rendered via `_cluster/settings?include_defaults=true`
This change fixes the issue as well as several minor issues with affix settings that
where not accepted as valid setting today.
Today if a comma-separated list is passed to action.auto_create_index
leading and trailing whitespaces are not trimmed but since the values are
index expressions whitespaces should be removed for convenience.
Closes#21449
* ClusterSearchShardsGroup to return ShardId rather than the int shard id
This allows more info to be retrieved, like the index uuid which is exposed through the ShardId object but was not available before
* Make ClusterSearchShardsResponse empty constructor public
This allows to receive such responses when sending ClusterSearchShardsRequests directly through TransportService (not using ClusterSearchShardsAction via Client), otherwise an empty response cannot be created unless the class that does it is in org.elasticsearch.action, admin.cluster.shards package
* adjust visibility of ClusterSearchShards members
The index uuid is unique across multiple clusters, while the index name is not. Using the index uuid to look up filters in the alias filters map is better and will be needed for multi cluster search.
This commit makes sure that there is only one instance of the two services rather than one per transport action that uses it.
Also, we take their initialization out of guice's hands by binding it to a specific instance. Otherwise those two objects would get created within a constructor that is called by guice. That may cause problem for instance when throwing an exception from such constructors as guice tries all over again to re-initialize objects and fills up logs with stacktraces.
* replace ShardRouting argument in AbstractSearchAsyncAction#onFirstPhaseResult with more contained String nodeId
There is no need to pass in ShardRouting if the only info read from it is the current node id, the shard id can be read directly from the ShardIterator that's already provided as an argument.
* avoid creating a new ShardId when creating a SearchShardTarget in SnapshotsService
Today when handling unreleased versions for backwards compatilibity
support, we scatted version constants across the code base and add some
asserts to support removing these constants when the version in question
is actually released. This commit improves this situation, enabling us
to just add a single unreleased version constant that can be renamed
when the version is actually released. This should make maintenance of
these versions simpler.
Relates #21760
ShardSearchRequest was previously taking in the whole ShardRouting as a constructor argument while it only needs the ShardsId, changed that to carry over only the needed bits.
* Transport client: Fix remove address to actually work
The removeTransportAddress method of TransportClient removes the address
from the list of nodes that the client pings to sniff for nodes.
However, it does not remove it from the list of existing connected
nodes. This means removing a node is not possible, as long as that node
is still up.
This change removes the node from the connected nodes list before
triggering sampling (ie sniffing). While the fix is simple, testing was
not because there were no existing tests for sniffing. This change also
modifies the mocks used by transport client unit tests in order to allow
mocking sniffing.
* Scripting: Remove groovy scripting language
Groovy was deprecated in 5.0. This change removes it, along with the
legacy default language infrastructure in scripting.
The `error_trace` parameter turns on the `stack_trace` field
in errors which returns stack traces.
Removes documentation for `camelCase` because it hasn't worked
in a while....
Documents the internal parameters used to render stack traces as
internal only.
Closes#21708
This commit refactors the handling of bind permissions, which is in need
of a little cleanup. For example, in its current state, the code for
handling permissions for transport profiles is split across two
methods. This commit refactors this code hopefully making it easier to
work with in future changes. This change is mostly mechanical, no
functionality is changed.
Relates #21742
The test UnicastZenPing#testResolveTimeout chooses a random resolve
timeout between 1ms and 100ms. Close to the lower bound, this is far too
short and the test races against the concurrent resolves executing
before the timeout elapses. This commit increases the timeout to
something that is far less likely to race, yet will not slow the test
down since we are not doing resolves against a real DNS service anyway.
Note that we still want a short resolve timeout since we are testing
whether or not timeouts really work here (by latching one of the
resolves to respond slowly).
Add indices and filter information to search shards api output
The search shards api returns info about which shards are going to be hit by executing a search with provided parameters: indices, routing, preference. Indices can also be aliases, which can also hold filters. The output includes an array of shards and a summary of all the nodes the shards are allocated on. This commit adds a new indices section to the search shards output that includes one entry per index, where each index can be associated with an optional filter in case the index was hit through a filtered alias.
This is relevant since we have moved parsing of alias filters to the coordinating node.
Relates to #20916
Today there is no way to get notified if a node is disconnected. Client code
must poll the TransportClient constantly to detect that a node is not connected
anymore in order to react and add new nodes or notify altering etc. For instance
if a hostname gets resolved to an IP but that host is disconnected clients want
to reconnect by resolving the hostname again which is a common situation in cloud
environments.
Closes#21424
Today we eagerly resolve unicast hosts. This means that if DNS changes,
we will never find the host at the new address. Moreover, a single host
failng to resolve causes startup to abort. This commit introduces lazy
resolution of unicast hosts. If a DNS entry changes, there is an
opportunity for the host to be discovered. Note that under the Java
security manager, there is a default positive cache of infinity for
resolved hosts; this means that if a user does want to operate in an
environment where DNS can change, they must adjust
networkaddress.cache.ttl in their security policy. And if a host fails
to resolve, we warn log the hostname but continue pinging other
configured hosts.
When doing DNS resolutions for unicast hostnames, we wait until the DNS
lookups timeout. This appears to be forty-five seconds on modern JVMs,
and it is not configurable. If we do these serially, the cluster can be
blocked during ping for a lengthy period of time. This commit introduces
doing the DNS lookups in parallel, and adds a user-configurable timeout
for these lookups.
Relates #21630
PR #19416 added a safety mechanism to shard state fetching to only access the store when the shard lock can be acquired. This can lead to the following situation however where a shard has not fully shut down yet while the shard fetching is going on, resulting in a ShardLockObtainFailedException. PrimaryShardAllocator that decides where to allocate primary shards sees this exception and treats the shard as unusable. If this is the only shard copy in the cluster, the cluster stays red and a new shard fetching cycle will not be triggered as shard state fetching treats exceptions while opening the store as permanent failures.
This commit makes it so that PrimaryShardAllocator treats the locked shard as a possible allocation target (although with the least priority).
This commit clarifies the contract of Cache#computeIfAbsent so that an exception that occurs during the execution
of the loader is thrown to all callers. Prior to this commit, the first caller would get the ExecutionException
and other callers that called during the load execution would get null, which is confusing.
The `type` parameter has always been accepted by the search_shards api, probably to make the api and its urls the same as search. Truth is that the type never had any effect, it's been ignored from day one while accepting it may make users think that we actually do something with it.
This commit removes support for the type parameter from the REST layer and the Java API. Backwards compatibility is maintained on the transport layer though.
The new added serialization test also uncovered a bug in the java API where the `ClusterSearchShardsRequest` could be created with no arguments, but the indices were required to be not null otherwise the request couldn't be serialized as `writeTo` would throw NPE. Fixed by setting a default value (empty array) for indices.
When a fatal error tragically closes an index writer, such an error
never makes its way to the uncaught exception handler. This prevents the
node from being torn down if an out of memory error or other fatal error
is thrown in the Lucene layer. This commit ensures that such events
bubble their way up to the uncaught exception handler.
Relates #21721
The PR #21694 was initially planned to go into v6.0.0 and v5.1.0. Due to another PR relying on this one though for backport to v5.0.2, #21694 must go to v5.0.2
as well. As such, the initial backward compatibility rules established by the PR must be changed to include v5.0.2 and above.