OpenSearch

Commit Graph

Author	SHA1	Message	Date
Armin Braun	750ec8ba53	Minor Cleanups in QueryPhase (#39680 ) (#39694 ) * Soften redundant cast to allow use of `DeterministicTaskQueue` in this class for #39504 * Remove two redundant variables and lower visibility in two possible spots * Make field `final`	2019-03-05 15:04:16 +01:00
Christoph Büscher	5cdea6ef17	Fix Fuzziness#asDistance(String) (#39643 ) Currently Fuzziness#asDistance(String) doesn't work for custom AUTO values. If the fuzziness is AUTO, the method returns the correct edit distance to use, depending on the input string, but for custom AUTO values it currently always returns an edit distance of 1. Correcting this and adding unit and integration tests to catch these cases. Closes #39614	2019-03-05 14:31:07 +01:00
Simon Willnauer	19f6a35358	Move BWC Version to 7.1.0 after backport Relates to #39512	2019-03-05 14:11:59 +01:00
Simon Willnauer	d112c89041	Allow inclusion of unloaded segments in stats (#39512 ) Today we have no way to fetch actual segment stats for segments that are currently unloaded. This is relevant in the case of frozen indices. This change makes it possible to monitor how much memory a frozen index would use if it were unfrozen.	2019-03-05 14:02:20 +01:00
Armin Braun	e8d9744340	Use Threadpool Time in ClusterApplierService (#39679 ) (#39685 ) * Use threadpool's time in `ClusterApplierService` to allow for deterministic tests * This is a part of/requirement for #39504	2019-03-05 12:37:49 +01:00
Gordon Brown	380dc27d91	Mute testCloseWhileRelocatingShards (#39589 )	2019-03-05 13:34:43 +02:00
Alan Woodward	0b14782b23	Add stopword support to IntervalBuilder (#39637 ) The match interval builder analyses input text and converts it to an IntervalSource, and as such may generate token streams with stopwords. This commit deals with these by using the extend factory to cover the gaps produced by these stopwords so that phrase and ordered queries work correctly.	2019-03-05 10:50:45 +00:00
Christoph Büscher	2fe1fa8972	Shortcut counts on exists queries (#39570 ) (#39660 ) `TopDocsCollectorContext` can already shortcut hit counts on `match_all` and `term` queries when there are no deletions. This change adds this ability for `exists` queries if the index doesn't have deletions and fields are indexed. Closes #37475	2019-03-04 19:53:43 +01:00
Prabhakar S	98925e9a09	Fixing the custom object serialization bug in diffable utils. (#39544 ) While serializing custom objects, the length of the list is computed after filtering out the unsupported objects, but while writing objects the filter is not applied, resulting in writing unsupported objects that the receiver will fail to deserialize. Adding the condition to filter out unsupported custom objects.	2019-03-04 18:41:14 +01:00
Nhat Nguyen	801f13f201	Assert recovery done in testDoNotWaitForPendingSeqNo (#39595 ) Since #39006 we should be able to complete a peer-recovery without waiting for pending indexing operations. Thus, the assertion in testDoNotWaitForPendingSeqNo should be updated from false to true. Closes #39510	2019-03-04 10:21:23 -05:00
Yannick Welsch	936dbb00e3	Isolate Zen1 (#39470 ) Cherry-picks a few commits from #39466 to align 7.x with master branch.	2019-03-04 15:51:17 +01:00
Luca Cavanna	9ddaabba88	Remove private SearchHits.Total class (#39556 ) This is now possible as Lucene's `TotalHits` implements `equals`/`hashCode`; all the other methods can be inlined in `SearchHits` instead, with no need for a specific wrapper class.	2019-03-04 13:46:45 +01:00
Armin Braun	547af21a12	Introduce Mapping ActionListener (#39538 ) (#39636 ) * Introduce Safer Chaining of Listeners * The motivation here is to make reasoning about chains of `ActionListener` a little easier, by providing a safe method for nesting `ActionListener` that guarantees that a response is never dropped. Also, it dries up the code a little by removing the need to repeat `listener::onFailure` and `listener.onResponse` over and over. * Refactored a number of obvious/easy spots to use the new listener constructor	2019-03-04 12:56:46 +01:00
Daniel Mitterdorfer	fca6a2f006	Avoid deprecated API usage in TaskOperationFailure (#39303 ) (#39628 ) With this commit we remove usage of the deprecated method `ExceptionsHelper#detailedMessage` in the class `TaskOperationFailure`. Relates #19069	2019-03-04 11:37:59 +01:00
David Turner	dd68244841	Wait for state recovery in testFreshestMasterElectedAfterFullClusterRestart (#39602 ) Zen1IT#testFreshestMasterElectedAfterFullClusterRestart fails sometimes because we request the cluster state before state recovery has completed, and therefore obtain the default value for the setting we're relying on. Confusingly, we were starting out by setting this setting to its default value, so the test looked like it was failing because of a production bug. This commit avoids this confusion in future by setting it to a non-default value at the start of the test. Fixes #39586.	2019-03-04 10:26:07 +00:00
Adrien Grand	782f873165	Don't swallow exceptions in Store#close(). (#39035 ) (#39622 ) Store#close() swallows any `IOException`. Relates #39030	2019-03-04 10:58:43 +01:00
Adrien Grand	934946a232	Don't swallow exception in ThreadPool.terminate. (#39038 ) (#39623 ) The use of `closeWhileHandlingException` means that any exception while trying to close the threadpool is going to be swallowed. Relates #39030	2019-03-04 10:58:29 +01:00
Adrien Grand	21540a5ada	Enhancements to IndicesQueryCache. (#39099 ) (#39626 ) This commit adds the following: - more tests to IndicesServiceCloseTests, one of them found a bug in the order in which `IndicesQueryCache#onClose` and `IndicesService.indicesRefCount#decRef` are called. - made `IndicesQueryCache.stats2` a synchronized map. All writes to it are already protected by the lock of the Lucene cache, but the final read from an assertion in `IndicesQueryCache#close()` was not so this change should avoid any potential visibility issues. - human-readable `toString`s to make debugging easier. Relates #37117	2019-03-04 10:58:12 +01:00
Armin Braun	68bc178017	Disable Bwc Tests (#39551 ) * Disable Bwc Tests * For #39550	2019-03-04 10:41:52 +01:00
Yannick Welsch	0f65390c29	Do not mutate engine during planning step (#39571 ) This cleans up the Engine implementation by separating the sequence number generation from the planning step in the engine, so that the planning step has no side effects. This makes it easier to see that every sequence number is properly accounted for.	2019-03-04 10:11:39 +01:00
David Turner	9ec24bae80	Mute testDoNotWaitForPendingSeqNo Relates #39510, #39595.	2019-03-03 22:03:53 -05:00
Mayya Sharipova	d0e65a45a2	Add debug log for flush for IndicesRequestCacheIT (#39475 ) Add debug log when index is flushed to investigate a failure in IndicesRequestCacheIT "DEBUG" level is used as "TRACE" produces too much output irrelevant for this issue Relates to #32827	2019-03-01 13:12:45 -05:00
Luca Cavanna	29e3c18713	Mute failing IndexShardIT#testPendingRefreshWithIntervalChange Relates to #39565	2019-03-01 14:55:19 +01:00
Tanguy Leroux	e005eeb0b3	Backport support for replicating closed indices to 7.x (#39506 )(#39499 ) Backport support for replicating closed indices (#39499) Before this change, closed indexes were simply not replicated. It was therefore possible to close an index and then decommission a data node without knowing that this data node contained shards of the closed index, potentially leading to data loss. Shards of closed indices were not completely taken into account when balancing the shards within the cluster, or automatically replicated through shard copies, and they were not easily movable from node A to node B using APIs like Cluster Reroute without being fully reopened and closed again. This commit changes the logic executed when closing an index, so that its shards are not just removed and forgotten but are instead reinitialized and reallocated on data nodes using an engine implementation which does not allow searching or indexing, which has a low memory overhead (compared with searchable/indexable opened shards) and which allows shards to be recovered from peer or promoted as primaries when needed. This new closing logic is built on top of the new Close Index API introduced in 6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before closing them, and closing an index on a 8.0 cluster will reinitialize the index shards and therefore impact the cluster health. 
Some APIs have been adapted to make them work with closed indices: - Cluster Health API - Cluster Reroute API - Cluster Allocation Explain API - Recovery API - Cat Indices - Cat Shards - Cat Health - Cat Recovery This commit contains all the following changes (most recent first): * c6c42a1 Adapt NoOpEngineTests after #39006 * 3f9993d Wait for shards to be active after closing indices (#38854) * 5e7a428 Adapt the Cluster Health API to closed indices (#39364) * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767) * 71f5c34 Recover closed indices after a full cluster restart (#39249) * 4db7fd9 Adapt the Recovery API for closed indices (#38421) * 4fd1bb2 Adapt more tests suites to closed indices (#39186) * 0519016 Add replica to primary promotion test for closed indices (#39110) * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631) * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955) * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex() * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329) * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327) * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326) * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024) * e53a9be Fix compilation error in IndexShardIT after merge with master * cae4155 Relax NoOpEngine constraints (#37413) * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903) Relates to #33888	2019-03-01 14:48:26 +01:00
Yannick Welsch	1a50af7dd4	Do not close bad indices on startup (#39500 ) With #17187, we verified IndexService creation during initial state recovery on the master and if the recovery failed the index was imported as closed, not allocating any shards. This was mainly done to prevent endless allocation loops and full log files on data-nodes when the indexmetadata contained broken settings / analyzers. Zen2 loads the cluster state eagerly, and this check currently runs on all nodes (not only the elected master), which can significantly slow down startup on data nodes. Furthermore, with replicated closed indices (#33888) on the horizon, importing the index as closed will no longer prevent shards from being allocated. Fortunately, the original issue for endless allocation loops is no longer a problem due to #18467, where we limit the retries of failed allocations. The solution here is therefore to just undo #17187, as it's no longer necessary, and covered by #18467, which will solve the issue for Zen2 and replicated closed indices as well.	2019-03-01 09:23:46 +01:00
Tal Levy	b9b46fdec6	fix UpdateSettingsRequestStreamableTests.mutateInstance (#39386 ) (#39477 ) Mutations of the timeout values were using string-representations. This resulted in very rare cases where the original timeout value was represented as something like "0ms" and the new random time-value generated was "0s". Although their string representations differ, their underlying TimeValue does not. This resulted in `-Dtests.seed=7F4C034C43C22B1B` to fail.	2019-02-28 21:02:32 -08:00
Mark Tozzi	609118c229	Override and mute InternalAutoDateHistogramTests#testReduceRandom() (#39536 ) pending resolution of #39497	2019-02-28 16:00:32 -05:00
Lee Hinman	dae48ba262	Add details about what acquired the shard lock last (#38807 ) This adds a `details` parameter to shard locking in `NodeEnvironment`. This is intended to be used for diagnosing issues such as ``` 1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index 1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index 1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms 1> at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?] ``` In the hope that we will be able to determine why the shard is still locked. Relates to #30290 as well as some other CI failures	2019-02-28 10:50:47 -07:00
Armin Braun	e564c4d8ad	Add Package Level JavaDoc on Snapshots (#38108 ) (#39514 ) * Add Package Level JavaDoc on Snapshots	2019-02-28 18:23:01 +01:00
Simon Willnauer	5c96b90ed5	Never block on scheduled refresh if a refresh is running (#39462 ) Today we block on the ReferenceManager in the case of a scheduled refresh. Yet if there is a refresh happening concurrently we might block and create very smallish segments. Instead we should just move on to the next shard and free up the refresh thread instead.	2019-02-28 11:57:45 +01:00
Armin Braun	d3d7d9bb9d	Remove Dead Code + Duplication in o.e.c.routing (#36678 ) (#39493 ) * Removed obviously unused fields+methods * Inlined public methods that only had one caller * Simplified `Optional` chain * Simplified some obviously redundant conditions	2019-02-28 10:33:05 +01:00
Armin Braun	90ab4a6f6e	Stabilize RareClusterState (#38671 ) (#39468 ) * Use actual master node, not just a master-eligible node when trying to cancel publication. This only works on the master and for unlucky seeds we never try the master within the 10s that the busy assert runs. * Closes #36813	2019-02-28 08:01:52 +01:00
Tanguy Leroux	4dd274b51d	Unmute CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState() (#39452 ) This commit unmutes the test and comments out the offending call to linearizabilityChecker.isLinearizable() as suggested in #39437	2019-02-27 20:38:54 +01:00
Tanguy Leroux	983b5d1c0e	Mute SpecificMasterNodesIT.testElectOnlyBetweenMasterNodes() Tracked in #38331	2019-02-27 18:00:02 +01:00
Daniel Mitterdorfer	2ccba18809	Correct name of basic_date_time_no_millis (#39367 ) (#39454 ) With this commit we correct the name of the Java time based formatter for `basic_date_time_no_millis`.	2019-02-27 17:03:50 +01:00
Alan Woodward	71b8494181	Upgrade to lucene 8.0.0-snapshot-ff9509a8df (#39444 ) Backport of #39350 Contains the following: * LUCENE-8635: Move terms dictionary off-heap for non-primary-key fields in `MMapDirectory` * LUCENE-8292: `TermsEnum` is fully abstract * LUCENE-8679: Return WITHIN in `EdgeTree#relateTriangle` only when polygon and triangle share one edge * LUCENE-8676: Nori tokenizer deals correctly with large buffers * LUCENE-8697: `GraphTokenStreamFiniteStrings` better handles side paths with gaps * LUCENE-8664: Add `equals` and `hashCode` to `TotalHits` * LUCENE-8660: `TopDocsCollector` returns accurate hit counts if the total equals the threshold * LUCENE-8654: `Polygon2D#relateTriangle` fix for when the polygon is inside the triangle * LUCENE-8645: `Intervals#fixField` can merge intervals from different fields * LUCENE-8585: Create jump-tables for DocValues at index time	2019-02-27 14:36:08 +00:00
Armin Braun	f675b33d50	Increase Timeout in UnicastZenPingTests (#38893 ) (#39449 ) * Just like #37268 removing another 1s timeout, those are dangerous since they're easily exceeded by an untimely gc pause * Closes #26701	2019-02-27 15:22:17 +01:00
Jason Tedor	55e98f08d8	Provide a clearer error message on keystore add (#39327 ) When trying to add a setting to the keystore with an upper case name, we reject with an unclear error message. This commit makes that error message much clearer.	2019-02-27 08:10:23 -05:00
Armin Braun	27485871b8	Don't Ping on Handshake Connection (#39076 ) (#39446 ) * Don't Ping on Handshake Connection * It does not make sense to run pings on the handshake connection * Set the ping interval to `-1` to deactivate pings on it	2019-02-27 13:39:25 +01:00
Tanguy Leroux	6912e27ee0	Mute MinimumMasterNodesIT.testThreeNodesNoMasterBlock() Tracked in #39172	2019-02-27 13:13:22 +01:00
David Turner	41668f7723	Move PeerFinder's logger to the expected package (#39412 ) Today the abstract `org.elasticsearch.discovery.PeerFinder` uses the logger of its implementation, which in production is in `o.e.cluster.coordination`. This turns out to be confusing and unhelpful, so with this change we move to using the logger that belongs to `PeerFinder`.	2019-02-27 08:44:05 +00:00
Armin Braun	28b771f5db	Remove Dead Code Test Infrastructure (#39192 ) (#39436 ) * Just removing some obviously unused things	2019-02-27 09:38:47 +01:00
Tim Brooks	f24dae302d	Make security tests transport agnostic (#39411 ) Currently there are two security tests that specifically target the netty security transport. This PR moves the client authentication tests into `AbstractSimpleSecurityTransportTestCase` so that the nio transport will also be tested. Additionally the work to build transport configurations is moved out of the netty transport and tested independently.	2019-02-26 18:55:19 -07:00
Nhat Nguyen	a9e86bc941	Adjust testWaitForPendingSeqNo (#39404 ) Since #39006, we should either remove `testWaitForPendingSeqNo` or adjust it not to wait for the pending operations. This change picks the latter. Relates #39006	2019-02-26 16:21:56 -05:00
Mayya Sharipova	4ca514f18c	Fix testCacheWithFilteredAlias failure (#39401 ) Move refresh after Forcemerge Relates to #32827	2019-02-26 14:11:35 -05:00
Luca Cavanna	2619f48e4d	Rename SearchRequest#withLocalReduction (#39108 ) `withLocalReduction` is confusing as `local` effectively means "local to the remote clusters" rather than "local to the coordinating node" where the method is executed. I propose we rename the method to `crossClusterSearch` which better resembles what the static method is used for.	2019-02-26 16:30:54 +01:00
Luca Cavanna	c09773a76e	Completion suggestions to be reduced once instead of twice (#39255 ) We have been calling `reduce` against completion suggestions twice, once in `SearchPhaseController#reducedQueryPhase` where all suggestions get reduced, and once more in `SearchPhaseController#sortDocs` where we add the top completion suggestions to the `TopDocs` so their docs can be fetched. There is no need to do reduction twice. All suggestions can be reduced in one call, then we can filter the result and pass only the already reduced completion suggestions over to `sortDocs`. The small important detail is that `shardIndex`, which is currently used only to fetch suggestions hits, needs to be set before the first reduction, hence outside of `sortDocs` where we have been doing it until now.	2019-02-26 11:42:02 +01:00
Yannick Welsch	d42f422258	Add linearizability checker for coordination layer (#36943 ) Checks that the core coordination algorithm implemented as part of Zen2 (#32006) supports linearizable semantics. This commit adds a linearizability checker based on the Wing and Gong graph search algorithm with support for compositional checking and activates these checks for all CoordinatorTests.	2019-02-26 08:26:55 +01:00
Nhat Nguyen	575eed8582	Bubble up exception when processing NoOp (#39338 ) Today we do not bubble up exceptions when processing NoOps but always treat them as document-level failures. This incorrect treatment causes the assert_no_failure being tripped in peer-recovery if IndexWriter was closed exceptionally before. Closes #38898	2019-02-25 17:54:45 -05:00
Nhat Nguyen	e9dda75834	Enable soft-deletes by default for 7.0+ indices (#38929 ) Today when users upgrade to 7.0, existing indices will automatically switch to soft-deletes without an opt-out option. With this change, we only enable soft-deletes by default for new indices. Relates #36141	2019-02-25 17:54:29 -05:00
Igor Motov	d5046b1c25	[CI] Fixes testQueryRandomGeoCollection failure again (#39275 ) Moves the check for tiny polygons earlier in the test. It turned out that polygons can be so tiny that we cannot even figure out their orientation. Relates to #37356	2019-02-25 16:35:17 -05:00
Evgenia Badyanova	1ed3407930	Reduce garbage from allocations in deprecation logger (#38780 ) (#39370 ) 1. Setting length for formatWarning String to avoid AbstractStringBuilder.ensureCapacityInternal calls 2. Adding extra check for parameter array length == 0 to avoid unnecessarily creating StringBuilder in LoggerMessageFormat.format Helps to narrow the performance gap in throughput for geonames benchmark (#37411) by 3%. For more details: https://github.com/elastic/elasticsearch/issues/37530#issuecomment-462758384 Relates to #37530 Relates to #37411 Relates to #35754	2019-02-25 16:23:22 -05:00
Lee Hinman	5c7dd6f0ee	Set mappings when creating indices in SuggestSearchIT (#39323 ) * Set mappings when creating indices in SuggestSearchIT These tests don't test dynamic mapping, so they can use preset mappings. This removes the possibility they may fail due to the mapping not being available since mapping updates are asynchronous. Resolves #39315 * Wrap creates in assertAcked	2019-02-25 13:27:03 -07:00
Mayya Sharipova	bf058d6e4d	Fix analyze NullPointerException when AnalyzeTokenList tokens is null (#39332 ) (#39361 )	2019-02-25 12:49:18 -05:00
Nhat Nguyen	48219112e3	Do not wait for advancement of checkpoint in recovery (#39006 ) With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely do these thanks to Yannick's finding. The replication group to be used is currently sampled after indexing into the primary (see `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases: - There are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable). - There are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to `endingSeqNo` are available in the snapshot (i.e. also see comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. Relates #39000 Closes #38949 Co-authored-by: Yannick Welsch <yannick@welsch.lu>	2019-02-25 12:10:14 -05:00
David Turner	236db51d34	Fix testSnapshotFileFailureDuringSnapshot (#39362 ) Today this test catches an exception and asserts that its proximate cause has message `Random IOException` but occasionally this exception is wrapped two layers deep, causing the test to fail. This commit adjusts the test to look at the root cause of the exception instead. 1> [2019-02-25T12:31:50,837][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotFileFailureDuringSnapshot] --> caught a top level exception, asserting what's expected 1> org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] Snapshot could not be read 1> at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:212) ~[main/:?] 1> at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135) ~[main/:?] 1> at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54) ~[main/:?] 1> at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127) ~[main/:?] 1> at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208) ~[main/:?] 1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?] 1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?] 
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202] 1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202] 1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202] 1> Caused by: org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots 1> at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:564) ~[main/:?] 1> at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?] 1> ... 9 more 1> Caused by: java.io.IOException: Random IOException 1> at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:275) ~[test/:?] 1> at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.readBlob(MockRepository.java:317) ~[test/:?] 1> at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101) ~[main/:?] 1> at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:90) ~[main/:?] 1> at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:560) ~[main/:?] 1> at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?] 1> ... 9 more FAILURE 0.59s J0 \| SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot <<< FAILURES! 
> Throwable #1: java.lang.AssertionError: > Expected: a string containing "Random IOException" > but: was "[test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots" > at __randomizedtesting.SeedInfo.seed([B73CA847D4B4F52D:884E042D2D899330]:0) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot(SharedClusterSnapshotRestoreIT.java:821) > at java.lang.Thread.run(Thread.java:748)	2019-02-25 16:43:55 +00:00
Marios Trivyzas	11fe8cd16f	[Tests] Fix flakiness by ensuring stable cluster (#39300 ) (#39356 ) In integration tests where `setBootstrapMasterNodeIndex()` is used in combination with `autoMinMasterNodes = false` the cluster can start bootstrapping once the number of nodes set with the `setBootstrapMasterNodeIndex` have been started but it's not ensured that all nodes have successfully joined to form the cluster. This behaviour was introduced with `5db7ed22a0` and in order to ensure that the cluster is properly formed before proceeding with the integration test, use `ensureStableCluster()` with the appropriate number of expected nodes. Fixes: #39220	2019-02-25 17:26:15 +01:00
David Turner	dc23be5a9d	Avoid creating a green index in RetentionLeaseIT (#39347 ) In #39224 we made shard history retention lease syncing ignore the `index.write.wait_for_active_shards` setting on the index, and added a test that showed that it was ignored. However the test as merged actually creates a green index, so the `wait_for_active_shards` setting has no effect. This change adjusts the test to create a yellow index to verify that `wait_for_active_shards` really is ignored.	2019-02-25 15:33:09 +00:00
Yannick Welsch	a2bc41621c	Clean GatewayAllocator when stepping down as master (#38885 ) This fixes an issue where a messy master election might prevent shard allocation to properly proceed. I've encountered this in failing CI tests when we were bootstrapping multiple nodes. Tests would sometimes time out on an `ensureGreen` after an unclean master election. The reason for this is how the async shard information fetching works and how the clean-up logic in GatewayAllocator is integrated with the rest of the system. When a node becomes master, it will, as part of the first cluster state update where it becomes master, already try allocating shards (see `JoinTaskExecutor`, in particular the call to `reroute`). This process, which runs on the MasterService thread, will trigger async shard fetching. If the node is still processing an earlier election failure in ClusterApplierService (e.g. due to a messy election), that will possibly trigger the clean-up logic in GatewayAllocator after the shard fetching has been initiated by MasterService, thereby cancelling the fetching, which means that no subsequent reroute (allocation) is triggered after the shard fetching results return. This means that no shard allocation will happen unless the user triggers an explicit reroute command. The bug imo is that GatewayAllocator is called from both MasterService and ClusterApplierService threads, with no clear happens-before relation. The fix here makes it so that the clean-up logic is also run on the MasterService thread instead of the ClusterApplierService thread, reestablishing a clear happens-before relation. Note that testing this is tricky. With the newly added test, I can quite often reproduce this by adding `Thread.sleep(10);` in ClusterApplierService (to make sure it does not go too quickly) and adding `Thread.sleep(50);` in `TransportNodesListGatewayStartedShards` to make sure that shard state fetching does not go too quickly either. 
Note that older versions of Zen discovery are affected by this as well, but did not exhibit this issue as often because master elections are much slower there.	2019-02-25 10:37:31 +01:00
David Turner	96c09b032d	Ignore waitForActiveShards when syncing leases (#39224 ) Adjust the retention lease sync actions so that they do not respect the `index.write.wait_for_active_shards` setting on an index, allowing them to sync retention leases even if insufficiently many shards are currently active to accept writes. Relates #39089	2019-02-25 08:53:43 +00:00
Nhat Nguyen	f17d408fbb	Add cause to assert_no_failure when replay translog (#39333 ) We tripped this assertion three times for the last two weeks. However, it only says "this IndexWriter is closed" without the actual cause. ``` [2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed ``` This change replaces an assert with an AssertionError so that we will have the actual cause in the next build failures. Relates #38898	2019-02-23 13:04:43 -05:00
Zachary Tong	c7516b03b6	Better HoltWinters parameter validation (#38747 ) We validate HW parameters (namely, window > 2 * period) when parsing the XContent... but that means transport clients can configure bad params. This change allows models to validate the window and throw an exception if they wish. It also makes some test changes: - removes testBadModelParams(), which was a junk test (didn't do anything), and bad param checking is done elsewhere in unit tests - Fixes one of the windows in testHoltWintersNotEnoughData() - Ensures the period in testHoltWintersNotEnoughData() is >> window - Removes `setTypes()` since that's deprecated	2019-02-22 15:25:26 -05:00
Daniel Mitterdorfer	9fea21aca5	Remove ExceptionsHelper#detailedMessage in tests (#37921 ) (#39297 ) With this commit we remove all usages of the deprecated method `ExceptionsHelper#detailedMessage` in tests. We do not address production code here but rather in dedicated follow-up PRs to keep the individual changes manageable. Relates #19069	2019-02-22 14:03:29 +01:00
Tim Brooks	44df76251f	Rebuild remote connections on profile changes (#39146 ) Currently remote compression and ping schedule settings are dynamic. However, we do not listen for changes. This commit adds listeners for changes to those two settings. Additionally, when those settings change we now close existing connections and open new ones with the settings applied. Fixes #37201.	2019-02-21 14:00:39 -07:00
Tanguy Leroux	fc896e452c	ReadOnlyEngine should update translog recovery state information (#39238 ) (#39251 ) `ReadOnlyEngine` never recovers operations from translog and never updates translog information in the index shard's recovery state, even though the recovery state goes through the `TRANSLOG` stage during the recovery. It means that recovery information for frozen shards indicates an unknown number of recovered translog ops in the Recovery APIs (translog_ops: `-1` and translog_ops_percent: `-1.0%`) and this is confusing. This commit changes the `recoverFromTranslog()` method in `ReadOnlyEngine` so that it always recovers from an empty translog snapshot, allowing the recovery state translog information to be correctly updated. Related to #33888	2019-02-21 18:08:06 +01:00
Daniel Mitterdorfer	ef921fd157	Migrate Streamable to Writeable for cluster block package (#37391 ) (#39236 )	2019-02-21 15:21:44 +01:00
Marios Trivyzas	ecfd48b6d3	[Tests] Make testEngineGCDeletesSetting deterministic (#38942 ) (#39231 ) `InternalEngine.resolveDocVersion()` uses `relativeTimeInMillis()` from `ThreadPool`, so it needs the cached time to be advanced. Add a check to ensure that and decrease the `thread_pool.estimated_time_interval` to 1msec to prevent long running times for the test. Fixes: #38874 Co-authored-by: Boaz Leskes <b.leskes@gmail.com>	2019-02-21 14:30:59 +02:00
Marios Trivyzas	1316825f52	Replace superfluous usage of Counter with Supplier (#39048 ) (#39225 ) `Counter` was used as a means of a functional argument to pass the relative cached time before `Supplier` iface was introduced.	2019-02-21 12:42:54 +02:00
Ignacio Vera	be8a5315d7	Extend nextDoc to delegate to the wrapped doc-value iterator for date_nanos (#39176 ) The type date_nanos does not direct doc-value iterators and it needs to extend `next_doc` in order to delegate the call to the wrapped iterator.	2019-02-21 11:10:51 +01:00
Tal Levy	8150ca40f2	mute test3MasterNodes2Failed	2019-02-20 17:35:37 -08:00
Nhat Nguyen	820ba8169e	Add retention leases replication tests (#38857 ) This commit introduces the retention leases to ESIndexLevelReplicationTestCase, then adds some tests verifying that the retention leases replication works correctly in spite of the presence of the primary failover or out of order delivery of retention leases sync requests. Relates #37165	2019-02-20 19:21:00 -05:00
Mirko Jotic	a6ae146ccc	Converting Derivative Pipeline Agg integration test into AggregatorTestsCase. (#38679 ) Replicates the majority of existing Derivative pipeline integration tests into an AggregatorTestCase, with the goal of removing the integration tests in the near future.	2019-02-20 16:35:32 -05:00
Igor Motov	3d93011e32	Fix median calculation in MedianAbsoluteDeviationAggregatorTests (#38979 ) Fixes an error in median calculation in MedianAbsoluteDeviationAggregatorTests for odd number of sample points, which causes some rare test failures. Fixes #38937	2019-02-20 13:24:30 -05:00
Ioannis Kakavas	c783069804	Fix NPE on Stale Index in IndicesService (#39173 ) This is a backport of #38891 which closes #38845	2019-02-20 15:35:35 +02:00
David Turner	efffb3d5b7	Simplify calculation in AwarenessAllocationDecider (#38091 ) Today's calculation of the maximum number of shards per attribute is rather convoluted. This commit clarifies that it returns ceil(shardCount/numberOfAttributes).	2019-02-20 08:54:57 +00:00
Henning Andersen	00a26b9dd2	Blob store compression fix (#39073 ) Blob store compression was not enabled for some of the files in snapshots due to constructor accessing sub-class fields. Fixed to instead accept compress field as constructor param. Also fixed chunk size validation to work. Deprecated repositories.fs.compress setting as well to be able to unify in a future commit.	2019-02-20 09:24:41 +01:00
Hendrik Muhs	50b3858f7c	add version 6.6.2	2019-02-19 20:28:06 +01:00
David Turner	0a9574c9d4	Add some missing toString() implementations (#39124 ) Sometimes we turn objects into strings for logging or debugging using `toString()`, but the default implementation is often unhelpful. This change improves on this in two places I ran into recently.	2019-02-19 17:52:41 +00:00
Jason Tedor	fef9bdb23f	Allow retention lease operations under blocks (#39089 ) This commit allows manipulating retention leases under blocks.	2019-02-19 10:26:49 -05:00
Jason Tedor	12f6963456	Fix retention leases sync on recovery test This test had a bug. We attempt to allow only the primary to be allocated, to force all replicas to recover from the primary after we had set the state of the retention leases on the primary. However, in building the index settings, we were overwriting the settings that exclude the replicas from being allocated. This means that some of the replicas would end up assigned and rather than receive retention leases during recovery, they would be part of the replication group receiving retention leases as they are manipulated. Since retention lease renewals are only synced periodically, this means that the replica could be lagging a little behind in some cases leading to an assertion tripping in the test. This commit addresses this by ensuring that the replicas are indeed not allocated until after the retention leases are done being manipulated on the replica. We did this by not overwriting the exclude settings. Closes #39105	2019-02-19 09:07:33 -05:00
Alexander Reelsen	7f8a640363	Fix DateFormatters.parseMillis when no timezone is given (#39100 ) The parseMillis method was able to work on formats without timezones by falling back to UTC. The Date Formatter interface did not support this, as the calling code was using the `Instant.from` java time API. This switches over to an internal method which adds UTC as a timezone. Closes #39067	2019-02-19 14:12:22 +01:00
Jim Ferenczi	199155f5fb	Enforce Completion Context Limit (#38675 ) (#39075 ) This change adds a limit to the number of completion contexts that a completion field can define. Closes #32741	2019-02-19 08:52:24 +01:00
Albert Zaharovits	6bc88b00ec	Mute GatewayMetaStateTests.testAtomicityWithFailures (#39079 ) Mute test GatewayMetaStateTests.testAtomicityWithFailures	2019-02-19 00:25:45 +02:00
Jason Tedor	2d8f6b6501	Introduce retention lease state file (#39004 ) This commit moves retention leases from being persisted in the Lucene commit point to being persisted in a dedicated state file.	2019-02-18 16:53:46 -05:00
Jason Tedor	d43ac8fe11	Include in log retention leases that failed to sync When retention leases fail to sync after an expiration check, we emit a log message about this. This commit adds the retention leases that failed to sync.	2019-02-18 15:08:08 -05:00
Jason Tedor	bbb61002ba	Add some logging related to retention lease syncing (#39066 ) When the background retention lease sync fires, we check and see if any retention leases are expired. If any did expire, we execute a full retention lease sync (write action). Since this is happening on a background thread, we do not block that thread waiting for success (it will simply try again when the timer elapses). However, we were swallowing exceptions that indicate failure. This commit addresses that by logging the failures. Additionally, we add some trace logging to the execution of syncing retention leases.	2019-02-18 15:02:31 -05:00
Henning Andersen	99b2bc3461	Fix potential race during TcpTransport close (#39031 ) Fixed two potential causes for leaked threads during tests: 1. When adding a channel to serverChannels, we add it under a monitor that we do not use when reading from it. This is potentially unsafe if there is no other happens-before relationship ensuring the safety of this. 2. Long-shot but if the thread pool was shutdown before entering this code, we would silently forget about closing server channels so added assert. Strengthened the locking to ensure that once we stop the transport, no new server channels can be made. Relates to CI failure issue: #37543	2019-02-18 19:13:23 +01:00
Alan Woodward	ab4d5f404f	Add overlapping, before, after filters to intervals query (#38999 ) Lucene recently added `overlapping`, `before` and `after` filters to the intervals package. This commit exposes them in elasticsearch.	2019-02-18 15:06:24 +00:00
Adrien Grand	45b17e8645	Don't close caches while there might still be in-flight requests. (#38958 ) Many of our index components use ref-counting so that in the event that a shard is closed while there are still ongoing requests, then the index reader and the store only effectively get closed when ongoing requests have finished. However we don't apply the same principle to the request and query caches, which might get closed while there are still in-flight requests. This commit adds ref-counting to `IndicesService` so that the caches and other components it maintains only get closed when all shards are effectively closed. Closes #37117	2019-02-18 13:59:58 +01:00
Martijn van Groningen	ed08bc3537	Fix LocalIndexFollowingIT#testRemoveRemoteConnection() test (#38709 ) * During fetching remote mapping if remote client is missing then `NoSuchRemoteClusterException` was not handled. * When adding remote connection, check that it is really connected before continuing to run the tests. Relates to #38695	2019-02-18 09:41:44 +01:00
Jason Tedor	a5ce1e0bec	Integrate retention leases to recovery from remote (#38829 ) This commit is the first step in integrating shard history retention leases with CCR. In this commit we integrate shard history retention leases with recovery from remote. Before we start transferring files, we take out a retention lease on the primary. Then during the file copy phase, we repeatedly renew the retention lease. Finally, when recovery from remote is complete, we disable the background renewing of the retention lease.	2019-02-16 15:37:52 -05:00
Tim Brooks	b1c1daa63f	Add get file chunk timeouts with listener timeouts (#38758 ) This commit adds a `ListenerTimeouts` class that will wrap a `ActionListener` in a listener with a timeout scheduled on the generic thread pool. If the timeout expires before the listener is completed, `onFailure` will be called with an `ElasticsearchTimeoutException`. Timeouts for the get ccr file chunk action are implemented using this functionality. Additionally, this commit attempts to fix #38027 by also blocking proxied get ccr file chunk actions. This test being un-muted is useful to verify the timeout functionality.	2019-02-16 10:56:03 -07:00
Luca Cavanna	a1a49f201d	Tie break search shard iterator comparisons on cluster alias (#38853 ) `SearchShardIterator` inherits its `compareTo` implementation from `PlainShardIterator`. That is good in most of the cases, as such comparisons are based on the shard id which is unique, even when searching against indices with same names across multiple clusters (thanks to the index uuid being different). In case though the same cluster is registered multiple times with different aliases, the shard id is exactly the same, hence remote results will be returned before local ones with same shard id objects. That is because remote iterators are added before local ones, and we use a stable sorting method in `GroupShardIterators` constructor. This PR enhances `compareTo` for `SearchShardIterator` to tie break on cluster alias and introduces consistent `equals` and `hashcode` methods. This allows to remove a TODO in `SearchResponseMerger` which otherwise has to handle this special case specifically. Also, while at it I added missing tests around equals/hashcode and compareTo and expanded existing ones.	2019-02-16 09:41:03 +01:00
Nhat Nguyen	7e20a92888	Advance max_seq_no before add operation to Lucene (#38879 ) Today when processing an operation on a replica engine (or the following engine), we first add it to Lucene, then add it to translog, then finally mark its seq_no as completed. If a flush occurs after step 1 but before step 3, the max_seq_no in the commit's user_data will be smaller than the seq_no of some documents in the Lucene commit.	2019-02-15 21:04:28 -05:00
Nhat Nguyen	20755e666c	Reduce global checkpoint sync interval in disruption tests (#38931 ) We verify seq_no_stats is aligned between copies at the end of some disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due to a lagged global checkpoint on replicas. The global checkpoint on replicas is lagged because we sync the global checkpoint 30 seconds (by default) after the last replication operation. This change reduces the global checkpoint sync interval to 1s in the disruption tests. Closes #38318 Closes #36789	2019-02-15 21:04:20 -05:00
Nhat Nguyen	a67b9f6d1f	Relax testStressMaybeFlushOrRollTranslogGeneration (#38918 ) The predicate shouldPeriodicallyFlush is determined by the uncommitted translog size and the local checkpoint. The uncommitted translog size depends on the local checkpoint. The condition shouldPeriodicallyFlush can be true twice in the test in the following scenario: 1. Index doc-0 and advances the local checkpoint to 0, the condition shouldPeriodicallyFlush remains false. 2. Index doc-1 and add it to translog, but the local checkpoint is not advanced yet (still 0). The condition shouldPeriodicallyFlush becomes true because the uncommitted translog size is 216bytes (2ops + gen-1 + gen-2) > 180bytes and the translog generation of the new index commit would advance from 1 to 2. > [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine ] [node_s_0] > [test][0] committing writer with commit data [{local_checkpoint=0, > max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g, > min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q, > retention_leases=primary_term:1;version:0;, translog_generation=2, > max_seq_no=1}] 1. The shouldPeriodicallyFlush becomes true again after the local checkpoint is advanced to 1 because the uncommitted translog size is 216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation of the new index commit would advance from 2 to 4. > [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine ] [node_s_0] > [test][0] committing writer with commit data [{local_checkpoint=1, > max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g, > min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q, > retention_leases=primary_term:1;version:0;, translog_generation=4, > max_seq_no=1}] We need to relax the assertion in this test to cover this situation. Closes #31629	2019-02-15 21:04:12 -05:00
Armin Braun	238425e5e7	Fix Issue with Concurrent Snapshot Init + Delete (#38518 ) * Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread * Closes #38489	2019-02-15 16:50:47 -08:00
Alan Woodward	176013e23c	Avoid double term construction in DfsPhase (#38716 ) DfsPhase captures terms used for scoring a query in order to build global term statistics across multiple shards for more accurate scoring. It currently does this by building the query's `Weight` and calling `extractTerms` on it to collect terms, and then calling `IndexSearcher.termStatistics()` for each collected term. This duplicates work, however, as the various `Weight` implementations will already have collected these statistics at construction time. This commit replaces this round-about way of collecting stats, instead using a delegating IndexSearcher that collects the term contexts and statistics when `IndexSearcher.termStatistics()` is called from the Weight. It also fixes a bug when using rescorers, where a `QueryRescorer` would calculate distributed term statistics, but ignore field statistics. `Rescorer.extractTerms` has been removed, and replaced with a new method on `RescoreContext` that returns any queries used by the rescore implementation. The delegating IndexSearcher then collects term contexts and statistics in the same way described above for each Query.	2019-02-15 16:00:38 +00:00
Daniel Mitterdorfer	fcc7f553f5	Also mmap cfs files for hybridfs (#38940 ) (#38947 ) With this commit we add the `.cfs` file extension to the list of file types that are memory-mapped by hybridfs. `.cfs` files combine all files of a Lucene segment into a single file in order to save file handles. As this strategy is only used for "small" segments (less than 10% of the shard size), it is beneficial to memory-map them instead of accessing them via NIO. Relates #36668	2019-02-15 15:34:40 +01:00
David Turner	578514e892	Recover peers from translog, ignoring soft deletes (#38904 ) Today if soft deletes are enabled then we read the operations needed for peer recovery from Lucene. However we do not currently make any attempt to retain history in Lucene specifically for peer recoveries so we may discard it and fall back to a more expensive file-based recovery. Yet we still retain sufficient history in the translog to perform an operations-based peer recovery. In the long run we would like to fix this by retaining more history in Lucene, possibly using shard history retention leases (#37165). For now, however, this commit reverts to performing peer recoveries using the history retained in the translog regardless of whether soft deletes are enabled or not.	2019-02-15 10:45:15 +01:00

2708 Commits