Requests are handled by the worked thread pool of the target node instead of the generic thread pool of the source node.
Also this change is required in order to make GC disruption work with local transport. Previously the handling of the a request was performed on on a node that that was being GC disrupted, resulting in some actions being performed while GC was being simulated.
The cluster state version allows resolving the case where a old master node become unresponsive and later wakes up and pings all the nodes in the cluster, allowing the newly elected master to decide whether it should step down or ask the old master to rejoin.
Both master fault detection and nodes fault detection request should also send the cluster name, so that on the receiving side the handling of these requests can be failed with an error. This error can be caught on the sending side and for master fault detection the node can fail the master locally and for nodes fault detection the node can be failed.
Note this validation will most likely never fail in a production cluster, but in during automated tests where cluster / nodes are created and destroyed very frequently.
In large clusters when a new elected master is chosen, there are many join requests to handle. By batching them up the the cluster state doesn't get published for each individual join request, but many handled at the same time, which results into a single new cluster state which ends up be published.
Closes#6984
After master election, nodes send join requests to the elected master. Master is then responsible for publishing a new cluster state which sets the master on the local node's cluster state. If something goes wrong with the cluster state publishing, this process will not successfully complete. We should check it after the join request returns and if it failed, retry pinging.
Closes#6969
Currently, pinging results are only used if the local node is elected master or if they detect another *already* active master. This has the effect that master election requires two pinging rounds - one for the elected master to take is role and another for the other nodes to detect it and join the cluster. We can be smarter and use the election of the first round on other nodes as well. Those nodes can try to join the elected master immediately. There is a catch though - the elected master node may still be processing the election and may reject the join request if not ready yet. To compensate a retry mechanism is introduced to try again (up to 3 times by default) if this happens.
Closes#6943
Reduced expected time to heal to 0 (we interrupt and wait on stop disruption). It was also wrongly indicated in seconds.
We didn't properly wait between slow cluster state tasks
added explicit cleaning of temp unicast ping results
reduce gateway local.list_timeout to 10s.
testVerifyApiBlocksDuringPartition: verify master node has stepped down before restoring partition
Since it runs in a background thread after a node is added, or submits a cluster state update when a node leaves, it may be that by the time it is executed the local node is no longer master.
After a node joins the clusters, it starts pinging the master to verify it's health. Before, the cluster join request was processed async and we had to give some time to complete. With #6480 we changed this to wait for the join process to complete on the master. We can therefore start pinging immediately for fast detection of failures. Similar change can be made to the Node fault detection from the master side.
Closes#6706
* waiting time should be long enough depending on the type of the disruption scheme
* MockTransportService#addUnresponsiveRule if remaining delay is smaller than 0 don't double execute transport logic
This commit adds the notion of ServiceDisruptionScheme allowing for introducing disruptions in our test cluster. This
abstraction as used in a couple of wrappers around the functionality offered by MockTransportService to simulate various
network partions. There is also one implementation for causing a node to be slow in processing cluster state updates.
This new mechnaism is integrated into existing tests DiscoveryWithNetworkFailuresTests.
A new test called testAckedIndexing is added to verify retrieval of documents whose indexing was acked during various disruptions.
Closes#6505
If the master FD flags master as gone while there are still pending cluster states, the processing of those cluster states we re-instate that node a master again.
Closes#6526
The previous default was true, which means that after a node disconnected event we try to connect to it as an extra validation. This can result in slow detection of network partitions if the extra reconnect times out before failure.
Also added tests to verify the settings' behaviour
We have an optimization which compares routing/meta data version of cluster states and tries to reuse the current object if the versions are equal. This can cause rare failures during recovery from a minimum_master_node breach when using the "new light rejoin" mechanism and simulated network disconnects. This happens where the current master updates it's state, doesn't manage to broadcast it to other nodes due to the disconnect and then steps down. The new master will start with a previous version and continue to update it. When the old master rejoins, the versions of it's state can equal but the content is different.
Also improved DiscoveryWithNetworkFailuresTests to simulate this failure (and other improvements)
Closes#6466
When a node steps down from being a master (because, for example, min_master_node is breached), it may still have
cluster state update tasks queued up. Most (but not all) are tasks that should no longer be executed as the node
no longer has authority to do so. Other cluster states updates, like electing the current node as master, should be
executed even if the current node is no longer master.
This commit make sure that, by default, `ClusterStateUpdateTask` is not executed if the node is no longer master. Tasks
that should run on non masters are changed to implement a new interface called `ClusterStateNonMasterUpdateTask`
Closes#6230
Only the previous master node has been removed, so only shards allocated to that node will get failed.
This would have happened anyhow on later on when AllocationService#reroute is invoked (for example when a cluster setting changes or another cluster event),
but by cleaning the routing table pro-actively, the stale routing table is fixed sooner and therefor the shards
that are not accessible anyhow (because the node these shards were on has left the cluster) will get re-assigned sooner.
Switches TranslogStreams to check a header in the file to determine the
translog format, delegating to the version-specific stream.
Version 1 of the translog format writes a header using Lucene's
CodecUtil at the beginning of the file and appends a checksum for each
translog operation written.
Also refactors much of the translog operations, such as merging
.hasNext() and .next() in FsChannelSnapshot
Relates to #6554
These caches have no advantage compared to the default node cache. Additionally,
the soft cache makes use of soft references which make fielddata loading quite
unpredictable in addition to pushing more pressure on the garbage collector.
The `none` cache is still there because of tests. There is no other good
reason to use it.
LongFieldDataBenchmark has been removed because the refactoring exposed a
compilation error in this class, which seems to not having been working for a
long time. In addition it's not as much useful now that we are progressively
moving more fields to doc values.
Close#7443
This also fixes has_parent filters with a nested empty bool filter
(see test SimpleChildQuerySearchTests#test6722, the test should actually expect
either 0 results when searching for has_parent "test" or one result when
search for has_parent "foo")
closes#7240closes#7347
Complex explanations were always read as Explanations. Depending
on if the response was streamed or not the explanation was
therefore generated by a ComplexExplanation or by a regular
Explanation.
closes#7257
Added the ability to register template filters that are being applied when a new index is created. The default filter that checks whether the template pattern matches the index name always runs first, additional filters can also be registered so that templates can be filtered out based on custom logic.
Took the chance to add the handy source(Object... source) method to PutIndexTemplateRequest and corresponding builder
Closes#7459Closes#7454
This change forces the GetRequest when a script is being loaded from an index
to use preference("_local") and threaded(false) to prevent the script service from
forking for GetRequests.
The comparison and read code in the BlobStoreIndexShardRepository
used the physicalName and Name in reverse order. This caused
SnapshotBackwardsCompatibilityTest to fail.
This reverts commit 636af40da1
The "optimized" encoders/decoders have been unreliable and error prone.
Also, fix LZFCompressor.compress to use LZFEncoder.safeEncode, which
creates a new safe encoder, instead of using a shared encoder (which
is not threadsafe).
closes#7468
Due to additional safety added in #7351 we compute now a strong hash for
.si and segments_N files which are compared during snapshot / restore.
Old snapshots don't have this hash which can cause unnecessary copying
of large amount of data. This commit adds the ability to fetch this
hash from the blob store if needed.
Closes#7434
Today we have a small window where a searcher can be acquired but the
engine is in the state of starting up. This causes a NPE triggering a
shard failure if we are fast enough. This commit fixes this situation
gracefully.
Closes#7455
- _all field was never merged when mapping was updated and no conflict reported
- _all accepted doc_values format although it is always tokenized
relates to #777closes#7377
It's risky to have our own snapshot of Java 8's ConcurrentHashMap:
unless we keep the sources in sync over time (and OpenJDK's version
had already diverged), then we won't get bug/performance fixes. Users
can choose to upgrade to Java 8 to see the improvements of CHM.
Closes#7392Closes#7296
Shard requests like broadcast delete and delete by query, that needs to be executed on primary and all replicas, get read and written out to the transport on the same node. That means that if we add some field version checks are not enough to maintain bw comp since a newer node that holds the primary might receive the request from an older node, that didn't provide the field. Yet, when writing the request out again to a newer node that holds the replica, we do try and serialize the field although it's missing. The newer fields just needs to be set to optional in these cases, in addition to the version checks.
Re-enabled testDeleteByQuery and testDeleteRoutingRequired bw comp tests since this was the cause of their failures.
Closes#7406
The verifyThreadNames starts a node and checks that all new threads on the JVM are properly named. The current test uses the name of the new node which sometimes fails because our shared cluster spawns a new thread which is properly named but for not for the new name.
The commits relaxes the requirement of the test and on verify the threads are properly named (but not necessarily of the new node)
Items with no specified field now defaults to all the possible fields from the
document source. Previously, we had required 'fields' to be specified either
as a top level parameter or for each item. The default behavior is now similar
to the MLT API.
Closes#7382
This commit changes the way how files are selected for retransmission
on recovery / restore. Today this happens on a per-file basis where the
rather weak checksum and the file length in bytes is compared to check if
a file is identical. This is prone to fail in the case of a checksum collision
which can happen under certain circumstances.
The changes in this commit move the identity comparsion to a per-commit / per-segment
level where files are only treated as identical iff all the other files in the
commit / segment are the same. This "all or nothing" strategy is reducing the chance for
a collision dramatically since we also use a strong hash to identify commits / segments
based on the content of the ".si" / "segments.N" file.
Closes#7351
resetting the clients on each test (in after test) makes the tests running, especially in network mode, much slower, since transport client needs to be created each time when randmized to be used. Also, on OSX, the excessive connections causes bind exceptions eventually which makes running the network tests much harder on it.
closes#7329
Today we always use the latest commit point to return the metadata from
the store. This might cause problems for snapshot and restore since in
contrast to recovery it won't prevent concurrent flushes (lucene commits).
This can lead to all kinds of interesting effects if we are snapshotting
while flushing. This change uses the IndexCommit to open the metadata snapshot
from the store which is consistent with what we snapshot.
Closes#7376
Changes the serialisation of pre and post offset to use Long instead of VLong so that negative values are supported. This actually only showed up in the case where minDocCount=0 as the rounding is only serialised in this case.
Closes#7312
The term vector API can now generate term vectors on the fly, if the terms are
not already stored in the index. This commit exploits this new functionality
for the MLT query. Now the terms are directly retrieved using multi-
termvectors API, instead of generating them from the texts retrieved using the
multi-get API.
Closes#7014
Today we have very confusing naming since some methods names claim to
read binary but in fact read utf-16 and convert to utf-8 under the hood.
This commit clarifies the interfaces and adds additional documentation.
Closes#7367
These options are not used anymore. Instead numeric values can now contain dups
and it is the responsibility of the aggregation to deal with it (eg. terms).
And otherwise all values sources are now sorted, which is the contract of the
interfaces that they implement.
Close#7276
A request that relates to indices (`IndicesRequest` or `CompositeIndicesRequest`) might be converted to some other internal request(s) (e.g. shard level request) that get distributed over the cluster. Those requests contain the concrete index they refer to, but it is not known which indices (or aliases or expressions) the original request related to.
This commit makes sure that the original indices are available as part of the shard level requests and makes them implement `IndicesRequest` as well.
Also every internal request should be created passing in the original request, so that the original headers, together with the eventual original indices and options, get copied to it. Corrected some places where this information was lost.
NOTE: As for the bulk api and other multi items api (e.g. multi_get), their shard level requests won't keep around the whole set of original indices, but only the ones that related to the bulk items sent to each shard, the important bit is that we keep the original names though, not only the concrete ones.
Closes#7319
This change fixes the creation circle shapes o it calculates it correctly instead of essentially using the diameter as the radius. The radius has to be converted into degrees but calculating the ratio of the desired radius to the circumference of the earth and then multiplying it by 360 (number of degrees around the earths circumference). This issue here was that it was only multiplied by 180 making the result out by a factor of 2. Also made the test for circles actually check to make sure it has the correct centre and radius.
Closes#7301
Recent test failures triggered by #7289 were caused by this, simply because internal node settings (transport type key) that are not supported by the external older nodes were copied to them by mistake.