3446 Commits

Author SHA1 Message Date
David Turner 9ff320d967
Use index for peer recovery instead of translog (#45137)
Today we recover a replica by copying operations from the primary's translog.
However we also retain some historical operations in the index itself, as long
as soft-deletes are enabled. This commit adjusts peer recovery to use the
operations in the index for recovery rather than those in the translog, and
ensures that the replication group retains enough history for use in peer
recovery by means of retention leases.

Reverts #38904 and #42211
Relates #41536
Backport of #45136 to 7.x.
2019-08-02 15:00:43 +01:00
Armin Braun 9450505d5b
Stop Passing Around REST Request in Multiple Spots (#44949) (#45109)
* Stop Passing Around REST Request in Multiple Spots

* Motivated by #44564
  * We currently pass the REST request object around to a large number of places. This only works because we copy the full request content before handling the request itself, which is needlessly hard on GC and heap.
  * This PR removes a number of spots where the request is passed around needlessly. There are many more spots to optimize in follow-ups to this, but this one would already enable bypassing the request copying for some error paths in a follow up.
2019-08-02 07:31:38 +02:00
Jim Ferenczi 3f94e2ea43 Sparse role queries can throw an NPE (#45053)
Sparse role queries are executed differently than other queries in order
to account for the fact that most of the documents are filtered from search.
However this special execution does not set the scorer for the query so any
collector that needs to access the score of a document fails with an NPE.
This change fixes this bug by setting the scorer before collecting any hits
when intersecting the main query and the sparse role.
2019-08-01 20:21:53 +02:00
William Brafford 5f50da947a
Fix bug in the Settings#processSetting method (#45095)
The Settings#processSetting method is intended to take a setting map and add a
setting to it, adjusting the keys as it goes in case of "conflicts" where the
new setting implies an object where there is currently a string, or vice
versa. processSetting was failing in two cases: adding a setting two levels
under a string, and adding a setting two levels under a string and four levels
under a map. This commit fixes the bug and adds test coverage for the
previously faulty edge cases.

* fix issue #43791 about settings
* add unit test in testProcessSetting()
2019-08-01 13:27:08 -04:00
Yannick Welsch 917510d3e4 Always use primary term of operation in InternalEngine (#45083)
We keep adding the current primary term to operations for which we do not assign a sequence
number. This does not make sense anymore as all operations which we care about have
sequence numbers now. The goal of this commit is to clean things up in InternalEngine and
reduce the complexity.
2019-08-01 17:30:00 +02:00
Armin Braun 48dc53f8d2
Make PathTrieIterator a Little more Memory Efficient (#44951) (#45070)
* There's no need to have the trie iterator hold another reference to the request object (which could be huge, see #44564)
* Also removed unused boolean field from trie node
2019-08-01 17:26:08 +02:00
Nhat Nguyen 3a487379c3 Tighten no pending scheduled refresh check (#45025)
Previously, we used ThreadPoolStats to ensure that the scheduledRefresh
triggered by the internal refresh setting update is executed before we
index a new document. With that change (#40387), this test did not fail for 
the last 3 months. However, using ThreadPoolStats is not entirely watertight
as both "active" and "queue" count can be 0 in a very small interval
when ThreadPoolExecutor pulls a task from the queue but before marking
the corresponding worker as active (i.e., lock it).

Closes #39565
2019-08-01 09:06:22 -04:00
David Turner c088bafbbc Wait for events in waitForRelocation (#45074)
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in
`ESIntegTestCase#waitForRelocation()` to deal with the case that this health
request returns successfully despite the fact that there is a pending reroute task which
will relocate another shard.

Relates #44433
Fixes #45003
2019-08-01 13:47:39 +01:00
David Turner 532ade7816 More logging for slow cluster state application (#45007)
Today the lag detector may remove nodes from the cluster if they fail to apply
a cluster state within a reasonable timeframe, but it is rather unclear from
the default logging that this has occurred and there is very little extra
information beyond the fact that the removed node was lagging. Moreover the
only forewarning that the lag detector might be invoked is a message indicating
that cluster state publication took unreasonably long, which does not contain
enough information to investigate the problem further.

This commit adds a good deal more detail to make the issues of slow nodes more
prominent:

- after 10 seconds (by default) we log an INFO message indicating that a
  publication is still waiting for responses from some nodes, including the
  identities of the problematic nodes.

- when the publication times out after 30 seconds (by default) we log a WARN
  message identifying the nodes that are still pending.

- the lag detector logs a more detailed warning when a fatally-lagging node is
  detected.

- if applying a cluster state takes too long then the cluster applier service
  logs a breakdown of all the tasks it ran as part of that process.
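
The two publication thresholds listed above can be sketched as a simple classification of elapsed time. This is an illustration only, not the Elasticsearch implementation: `PublicationLoggingSketch`, `levelFor`, and the constant names are hypothetical, and only the 10-second and 30-second defaults come from the commit message.

```java
import java.time.Duration;

// Hypothetical sketch: maps how long a publication has been pending to the
// log level described in the commit message (defaults: 10s INFO, 30s WARN).
class PublicationLoggingSketch {
    static final Duration INFO_THRESHOLD = Duration.ofSeconds(10); // assumed default
    static final Duration TIMEOUT = Duration.ofSeconds(30);        // assumed default

    static String levelFor(Duration elapsed) {
        if (elapsed.compareTo(TIMEOUT) >= 0) {
            return "WARN"; // publication timed out; pending nodes are identified
        }
        if (elapsed.compareTo(INFO_THRESHOLD) >= 0) {
            return "INFO"; // still waiting for responses from some nodes
        }
        return "NONE";     // nothing extra is logged yet
    }
}
```

For example, `PublicationLoggingSketch.levelFor(Duration.ofSeconds(12))` falls between the two thresholds and yields the INFO case.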
2019-08-01 13:20:46 +01:00
Hendrik Muhs b3be8f75f0 Fix version logic after 7.3 release (BWC) (#45077)
Removes unreleased version 7.2.2 after the release of 7.3.0, as it breaks the version verifier, and adds documentation that explains the logic.
2019-08-01 12:43:23 +02:00
Christoph Büscher a669efd2a4
Remove left-over AwaitsFix in RateClusterStateIT (#45043)
Issues are closed and fixes in #42580 and #42430 seem to be merged to 7.x at
least.
2019-08-01 12:03:29 +02:00
Tim Brooks aff66e3ac5
Add Cors integration tests (#44361)
This commit adds integration tests to ensure that the basic cors
functionality works for the netty and nio transports.
2019-07-31 14:24:23 -06:00
Armin Braun 8d63bd1d1e
Cleanup Various Action- Listener and Runnable Usages (#42273) (#45052)
* Dry up code for creating simple `ActionRunnable` a little
* Shorten some other code around `ActionListener` usage, in particular
when wrapping it in a `TransportResponseListener`
2019-07-31 18:55:31 +02:00
Armin Braun ee663dc9ac
Reenable Parallel Restore Test on Windows (#45037) (#45050)
* As a result of #44096 this test shouldn't fail anymore on `master` and `7.4`+ so we should reenable it there
  * For older versions we won't backport that change so the tests should stay disabled there
* Closes #44671
2019-07-31 18:35:34 +02:00
Christoph Büscher 35291ae175
Remove muted AckIT and AckClusterUpdateSettingsIT (#45044)
Reading up on #33673 it looks like parts of these tests have been reworked and
there is no intention to fix the remains on 7.x, so I think we can remove the
entire test.
2019-07-31 17:17:21 +02:00
Luca Cavanna 8cc3c0dd93 Remove task null check in TransportAction (#45014)
The task that TaskManager#register returns cannot be null. The method
enforces that it is not null after calling request#createTask. It is
then needless to check for null in the listener later. Also, added the
call to the delegate listener in a finally block, just to make sure.
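
A minimal sketch of the "delegate in a finally block" idea, assuming the cleanup step is unregistering the task. The names `UnregisteringListenerSketch`, `Listener`, and `wrap` are hypothetical and do not mirror the real `TransportAction` code; the point is only that the delegate is notified even if the cleanup throws.

```java
// Hypothetical sketch: the wrapped listener runs a cleanup step and then
// always notifies the delegate, even when the cleanup step throws.
class UnregisteringListenerSketch {
    interface Listener<T> {
        void onResponse(T response);
    }

    static <T> Listener<T> wrap(Runnable unregisterTask, Listener<T> delegate) {
        return response -> {
            try {
                unregisterTask.run();          // stands in for unregistering the task
            } finally {
                delegate.onResponse(response); // runs whether or not cleanup failed
            }
        };
    }
}
```

If `unregisterTask` throws, the exception still propagates to the caller, but only after the delegate has received the response.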
2019-07-31 17:16:41 +02:00
Christoph Büscher e85b53a955
Remove left-over AwaitsFix in DedicatedClusterSnapshotRestoreIT (#45042)
The issue mentioned (#38845) seems to have been closed with #38891 so the test
can be re-activated.
2019-07-31 17:15:41 +02:00
Armin Braun c7d7230524
Stop Recreating Wrapped Handlers in RestController (#44964) (#45040)
* We shouldn't be recreating wrapped REST handlers over and over for every request. We only use this hook in x-pack and the wrapper there does not have any per request state.
  This is inefficient and could lead to some very unexpected memory behavior
   => I made the logic create the wrapper on handler registration and adjusted the x-pack wrapper implementation to correctly forward the circuit breaker and content stream flags
2019-07-31 17:11:34 +02:00
Zachary Tong c25f3dd5d0
Introduce 7.3.1 version (#45046) 2019-07-31 10:53:55 -04:00
Andrey Ershov c27ac3d24c Unmute testClusterJoinDespiteOfPublishingIssues and testElectMasterWithLatestVersion (#38555)
See my comments for #37539 and #37685

(cherry picked from commit 038d4ab2940340eca942e32b54044f183b7804d9)
2019-07-31 14:55:02 +02:00
David Roberts 5e3010a606 Use system context for looking up connected nodes (#43991)
When finding nodes in a connected cluster for cross cluster
search the requests to get cluster state on the connected
cluster should be made in the system context because
logically they are equivalent to checking a single detail
in the local cluster state and should not require that the
user who made the request that is using this method in its
implementation is authorized to view the entire cluster
state.

Fixes #43974
2019-07-31 09:09:56 +01:00
Igor Motov 1a1bb4707d Geo: move indexShape to AbstractGeometryFieldMapper.Indexer (#44979)
Move indexShape functionality into AbstractGeometryFieldMapper to make
it more unit testable.

Relates to #43644
2019-07-30 14:50:23 -04:00
Mayya Sharipova a154b73d99 Assure index ops are successful for SimpleNestedIT (#44815)
relates to #44486
2019-07-30 14:24:28 -04:00
Nhat Nguyen 979d0a71c7 Remove leniency during replay translog in peer recovery (#44989)
This change removes leniency in InternalEngine during replaying translog
in peer recovery.
2019-07-30 13:25:15 -04:00
Jake Landis 41a99c9e4a introduce 7.2.2 as a version (#44371)
* introduce 7.2.2 as a version
2019-07-30 18:52:34 +02:00
Jake Landis 03fea1c503 introduce 6.8.3 as a version (#44708) 2019-07-30 18:48:41 +02:00
David Kyle 78aa6143a6 Mute FilteringAllocationIT testTransientSettingsStillApplied
Relates to https://github.com/elastic/elasticsearch/issues/45003
2019-07-30 14:10:50 +01:00
Yannick Welsch c1b569ed4b Revert "Mute Zen1IT#testMixedClusterDisruption"
This reverts commit cf78ca58e3.
2019-07-30 13:10:14 +02:00
David Turner 55f1dd8da6 Close nodes properly in Coordinator tests (#44967)
Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses
`onNode()` so has no effect if the node is not in the current list of nodes.
It also discards the `Runnable` it creates without having run it, so has no
effect anyway.

This commit makes these tests much stricter about properly closing the nodes
started during `Coordinator` tests, by tracking the persisted states that are
opened, and adds an assertion to catch the trappy requirement that the closing
node still belongs to the cluster.
2019-07-30 11:47:36 +01:00
David Kyle cf78ca58e3 Mute Zen1IT#testMixedClusterDisruption 2019-07-30 11:33:39 +01:00
Jim Ferenczi 43bd8f2ba0 Fix aggregators early termination with breadth-first mode (#44963)
This commit fixes a bug that occurs when a deferred aggregator tries to terminate the collection early. In such a case the CollectionTerminatedException is not caught and
the search fails on the shard. This change makes sure that we catch the exception in order to continue the deferred collection on the next leaf.
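
The pattern described above can be sketched as follows. This is not the Lucene or Elasticsearch code: `LeafTerminated` is a hypothetical stand-in for Lucene's `CollectionTerminatedException`, and `Leaf` and `collectAll` are made-up names. Catching the exception inside the per-leaf loop lets collection continue on the next leaf instead of failing the whole search.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's CollectionTerminatedException.
class LeafTerminated extends RuntimeException {}

// Hypothetical leaf (segment): some docs plus a cap after which collection
// of this leaf terminates early.
class Leaf {
    final int[] docs;
    final int limit;

    Leaf(int[] docs, int limit) {
        this.docs = docs;
        this.limit = limit;
    }
}

class DeferredCollectionSketch {
    static List<Integer> collectAll(List<Leaf> leaves) {
        List<Integer> collected = new ArrayList<>();
        for (Leaf leaf : leaves) {
            try {
                int seen = 0;
                for (int doc : leaf.docs) {
                    if (seen == leaf.limit) {
                        throw new LeafTerminated(); // early termination for this leaf
                    }
                    collected.add(doc);
                    seen++;
                }
            } catch (LeafTerminated e) {
                // Termination applies to the current leaf only; move on to the next.
            }
        }
        return collected;
    }
}
```

Without the `catch`, the exception thrown for the first leaf would escape the loop and no later leaf would be collected.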

Fixes #44909
2019-07-30 11:26:40 +02:00
Andrey Ershov 5a0bd696fc
Snapshot tool S3 cleanup 7.x backport (#44575)
Backport of #44551
2019-07-30 11:02:08 +02:00
Nhat Nguyen 4813728783 Remove leniency in reset engine from translog (#44711)
Replaying operations from the local translog must never fail as those
operations were processed successfully on the primary before and the
mapping is up to date already. This change removes leniency during
resetting engine from translog in IndexShard and InternalEngine.
2019-07-29 16:31:45 -04:00
Jack Conradson 1a21682ed0 Fix JodaCompatibleZonedDateTime casts in Painless (#44874)
This is a temporary fix during the Joda to Java datetime transition. This will 
implicitly cast a JodaCompatibleZonedDateTime to a ZonedDateTime for 
both def and static types. This is necessary to insulate users from needing 
to know about JodaCompatibleZonedDateTime explicitly.
2019-07-29 12:05:26 -07:00
Igor Motov b6cef227a5 Geo: fix geo query decomposition (#44924)
The recent refactoring introduced an issue where queries were not
going through the decomposition processing.

Fixes #44891
2019-07-29 11:48:24 -04:00
Luca Cavanna a3cc32da64 TaskListener#onFailure to accept Exception instead of Throwable (#44946)
Today TaskListener accepts a Throwable in its onFailure method. However,
looking at where it is called (TransportAction), it can never be
notified of a Throwable that is not an Exception.

This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
2019-07-29 16:47:19 +02:00
Michał Perlak 245c9b7914 Optimize Min and Max BKD optimizations (#44315)
MinAggregator - skip BKD optimization when no result found after 1024 lookups.
MaxAggregator - skip unnecessary conversions.
2019-07-29 10:04:39 -04:00
Yannick Welsch 24873dd3e3 Do not block transport thread on startup (#44939)
We currently block the transport thread on startup, which has caused test failures. I think this is
some kind of deadlock situation. I don't think we should even block a transport thread, and
there's also no need to do so. We can just reject requests as long as we're not fully set up. Note
that the HTTP layer is only started much later (after we've completed full start up of the
transport layer), so that one should be completely unaffected by this.
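
A minimal sketch of rejecting instead of blocking, under the assumption that readiness can be modelled as a single flag. `StartupGateSketch`, `markStarted`, `handle`, and the error text are hypothetical and are not taken from the Elasticsearch transport code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: requests arriving before startup completes are
// rejected immediately rather than making the calling thread wait.
class StartupGateSketch {
    private final AtomicBoolean started = new AtomicBoolean(false);

    // Called once the service is fully set up.
    void markStarted() {
        started.set(true);
    }

    String handle(String request) {
        if (!started.get()) {
            // Fail fast; the caller can retry later.
            throw new IllegalStateException("not ready to handle incoming requests");
        }
        return "handled " + request;
    }
}
```

Because `handle` never waits on the flag, a thread calling it cannot be held up by a slow startup.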

Closes #41745
2019-07-29 11:35:17 +02:00
Armin Braun f5efafd4d6
Cleanup Deadcode o.e.indices (#44931) (#44938)
* none of this is used anywhere
2019-07-29 10:38:35 +02:00
Igor Motov cfc8d17bb4 Geo: refactor geo mapper and query builder (#44884)
Refactors the indexing and query generation logic out of the
mapper and query builder into separate unit-testable classes.
2019-07-26 16:48:31 -04:00
Yannick Welsch 1561ab5420 Guard open connection call in RemoteClusterConnection (#44921)
Fixes an issue where a call to openConnection was not properly guarded, allowing an exception
to bubble up to the uncaught exception handler, causing test failures.

Closes #44912
2019-07-26 22:27:45 +02:00
Tanguy Leroux e1b626b947 Ensure index is green in SimpleClusterStateIT.testIndicesOptions() (#44893)
SimpleClusterStateIT testIndicesOptions failed in #44817 because it tries to close 
an index at the beginning of the test. With random index settings, it is possible that 
the index has a high number of shards (10) and replicas (1), which means that on 
CI this index can take time to be fully allocated.

The close index request can fail in the case where replicas are still recovering operations. 
This commit adds a simple ensureGreen() at the beginning of the test to be sure that all 
replicas are started before trying to close the index.

closes #44817
2019-07-26 17:07:53 +02:00
Armin Braun 1340ff19bc
Fix Test Failure in ScalingThreadPoolTests (#44898) (#44901)
* Due to #44894 some constellations log a deprecation warning here now
* Fixed by checking for that
2019-07-26 17:05:50 +02:00
Tanguy Leroux 8848fcfb22 Ensure cluster is stable in ShrinkIndexIT.testShrinkThenSplitWithFailedNode (#44860)
The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails 
because the resize operation is not acknowledged (see #44736). This resize 
operation creates a new index "splitagain" and it results in a cluster state 
update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() 
to create the resized index). This cluster state update is expected to be 
acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()) but 
this is not always true: the data node that was just stopped in the test before 
executing the resize operation might still be considered as a "faulty" node
(and not yet removed from the cluster nodes) by the FollowersChecker. The 
cluster state is then acked on all nodes but one, and it results in a non 
acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in 
the test. The goal is to ensure that the data node has been correctly removed 
from the cluster and that all nodes are fully connected to each other before moving 
forward with the resize operation.

Closes #44736
2019-07-26 10:14:27 +02:00
Jason Tedor 6ea2b5dec0
Deprecate setting processors to more than available (#44889)
Today the processors setting is permitted to be set to more than the
number of processors available to the JVM. The processors setting
directly sizes the number of threads in the various thread pools, with
most of these sizes being a linear function in the number of
processors. It doesn't make any sense to set processors very high as the
overhead from context switching amongst all the threads will overwhelm,
and changing the setting does not control how many physical CPU
resources there are on which to schedule the additional threads. We have
to draw a line somewhere and this commit deprecates setting processors
to more than the number of available processors. This is the right place
to draw the line given the linear growth as a function of processors in
most of the thread pools, and that some are capped at the number of
available processors already.
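
The check described above might look roughly like this. It is a sketch under assumptions, not the actual setting validation: `ProcessorsSettingSketch`, `check`, `checkAgainstJvm`, and the message wording are hypothetical. Only `Runtime.getRuntime().availableProcessors()` is a real JDK call.

```java
import java.util.Optional;

// Hypothetical sketch: yields a deprecation message when the configured
// processors value exceeds the number of processors available.
class ProcessorsSettingSketch {
    static Optional<String> check(int configured, int available) {
        if (configured > available) {
            return Optional.of("setting processors to [" + configured
                + "], which is more than the available processors [" + available
                + "], is deprecated");
        }
        return Optional.empty(); // at or below the available count: no warning
    }

    // Compares against what the JVM reports as available.
    static Optional<String> checkAgainstJvm(int configured) {
        return check(configured, Runtime.getRuntime().availableProcessors());
    }
}
```

A value equal to the available count produces no message; only values strictly above it do.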
2019-07-26 17:06:44 +09:00
Ignacio Vera 821f6f893b
Upgrade to Lucene 8.2.0 release (#44859) (#44892) 2019-07-26 08:14:59 +02:00
Nhat Nguyen d128188c28 Return seq_no and primary_term in noop update (#44603)
With this change, we will return primary_term and seq_no of the current
document if an update is detected as a noop. We already return the
version; hence we should also return seq_no and primary_term.

Relates #42497
2019-07-25 19:16:56 -04:00
Yannick Welsch bd8470e738 Asynchronously connect to remote clusters (#44825)
Refactors RemoteClusterConnection so that it no longer blockingly connects to remote clusters.

Relates to #40150
2019-07-25 22:59:59 +02:00
Yannick Welsch 0ce841915c Add Clone Index API (#44267)
Adds an API to clone an index. This is similar to the index split and shrink APIs, just with the
difference that the number of primary shards is kept the same. In cases where the filesystem
provides hard-linking capabilities, this is a very cheap operation.

Index cloning can be done by running `POST my_source_index/_clone/my_target_index` and it
supports the same options as the split and shrink APIs.
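
As an illustration, the request above could be built with the JDK's `java.net.http` API (Java 11+). The class name `CloneIndexRequestSketch`, the helper `cloneRequest`, and the base URL `http://localhost:9200` are assumptions for the example; the sketch only constructs the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that builds (but does not send) a clone-index request.
class CloneIndexRequestSketch {
    static HttpRequest cloneRequest(String baseUrl, String source, String target) {
        // Endpoint shape taken from the commit message: POST <source>/_clone/<target>
        URI uri = URI.create(baseUrl + "/" + source + "/_clone/" + target);
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }
}
```

Calling `CloneIndexRequestSketch.cloneRequest("http://localhost:9200", "my_source_index", "my_target_index")` produces a POST request for `/my_source_index/_clone/my_target_index`, which could then be passed to an `HttpClient`.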

Closes #44128
2019-07-25 22:02:28 +02:00
Ryan Ernst 03dd22b56c Add missing ZonedDateTime methods for joda compat layer (#44829)
While joda no longer exists in the apis for 7.x, the compatibility layer
still exists with helper methods mimicking the behavior of joda for
ZonedDateTime objects returned for date fields in scripts. This layer
was originally intended to be removed in 7.0, but is now likely to exist
for the lifetime of 7.x.

This commit adds missing methods from ChronoZonedDateTime to the compat
class. These methods were not part of joda, but are needed to act like a
real ZonedDateTime.

relates #44411
2019-07-25 11:45:57 -07:00