Commit Graph

44792 Commits

Author SHA1 Message Date
Nhat Nguyen a9e86bc941 Adjust testWaitForPendingSeqNo (#39404)
Since #39006, we should either remove `testWaitForPendingSeqNo` 
or adjust it not to wait for the pending operations. This change picks 
the latter.

Relates #39006
2019-02-26 16:21:56 -05:00
Mayya Sharipova 4ca514f18c Fix testCacheWithFilteredAlias failure (#39401)
Move refresh after Forcemerge

Relates to #32827
2019-02-26 14:11:35 -05:00
Luca Cavanna 2619f48e4d Rename SearchRequest#withLocalReduction (#39108)
`withLocalReduction` is confusing as `local` effectively means "local
to the remote clusters" rather than "local the coordinating node" where
the method is executed. I propose we rename the method to
`crossClusterSearch` which better resembles what the static method is
used for.
2019-02-26 16:30:54 +01:00
Neeraj Jain 2f7206ada7 Use pkill to shutdown elasticsearch using pid file (#39135)
While running these commands from alias, facing issues using kill `cat pid`, In some situations, the more compact:
```
pkill -F /var/run/myProcess.pid
```
is the way to go.
2019-02-26 16:28:36 +01:00
Luca Cavanna c09773a76e Completion suggestions to be reduced once instead of twice (#39255)
We have been calling `reduce` against completion suggestions twice, once
in `SearchPhaseController#reducedQueryPhase` where all suggestions get
reduced, and once more in `SearchPhaseController#sortDocs` where we
add the top completion suggestions to the `TopDocs` so their docs can
be fetched. There is no need to do reduction twice. All suggestions can
be reduced in one call, then we can filter the result and pass only the
already reduced completion suggestions over to `sortDocs`. The small
important detail is that `shardIndex`, which is currently used only
to fetch suggestions hits, needs to be set before the first reduction,
hence outside of `sortDocs` where we have been doing it until now.
2019-02-26 11:42:02 +01:00
Martijn van Groningen 24e478c58e
Fix test, more than one node may be connected.
Relates to #37681
2019-02-26 10:40:09 +01:00
David Kyle f7cba82c77
[ML] Reenable ml rolling upgrade tests (#39290) 2019-02-26 08:51:59 +00:00
Ioannis Kakavas 7f999c43b3
[BACKPORT-7.x] Fix TokenBackwardsCompatibility tests (#39294)
This change is a backport of  #39252

- Fixes TokenBackwardsCompatibilityIT: Existing tests seemed to made
  the assumption that in the oneThirdUpgraded stage the master node
  will be on the old version and in the twoThirdsUpgraded stage, the
  master node will be one of the upgraded ones. However, there is no
  guarantee that the master node in any of the states will or will
  not be one of the upgraded ones.
  This class now tests:
  - That we can generate and consume tokens before we start the
  rolling upgrade.
  - That we can consume tokens generated in the old cluster during
  all the stages of the rolling upgrade.
  - That while on a mixed cluster, when/if the master node is
  upgraded, we can generate, consume and refresh a token
  - That after the rolling upgrade, we can consume a token
  generated in an old cluster and can invalidate it so that it
  can't be used any more.
- Ensures that during the rolling upgrade, the upgraded nodes have
the same configuration as the old nodes. Specifically that the
file realm we use is explicitly named `file1`. This is needed
because while attempting to refresh a token in a mixed cluster
we might create a token hitting an old node and attempt to refresh
it hitting a new node. If the file realm name is not the same, the
refresh will be seen as being made by a "different" client, and
will, thus, fail.
- Renames the Authentication variable we check while refreshing a
token to be clientAuth in order to make the code more readable.

Some of the above were possibly causing the flakiness of #37379
2019-02-26 10:42:36 +02:00
Martijn van Groningen b159cc51c0
Ensure remote connection established and
clean remote connection prior to leader cluster restart

Relates to #37681
2019-02-26 09:06:30 +01:00
Yannick Welsch d42f422258 Add linearizability checker for coordination layer (#36943)
Checks that the core coordination algorithm implemented as part of Zen2 (#32006) supports
linearizable semantics. This commit adds a linearizability checker based on the Wing and Gong
graph search algorithm with support for compositional checking and activates these checks for all
CoordinatorTests.
2019-02-26 08:26:55 +01:00
Alpar Torok b40628f6a6 Don't fail bwc check if these are disabled (#38919) 2019-02-26 09:14:57 +02:00
Nhat Nguyen 575eed8582 Bubble up exception when processing NoOp (#39338)
Today we do not bubble up exceptions when processing NoOps but always
treat them as document-level failures. This incorrect treatment causes
the assert_no_failure being tripped in peer-recovery if IndexWriter was
closed exceptionally before.

Closes #38898
2019-02-25 17:54:45 -05:00
Nhat Nguyen e9dda75834 Enable soft-deletes by default for 7.0+ indices (#38929)
Today when users upgrade to 7.0, existing indices will automatically
switch to soft-deletes without an opt-out option. With this change, 
we only enable soft-deletes by default for new indices.

Relates #36141
2019-02-25 17:54:29 -05:00
Jason Tedor a6c0166d68
Renew retention leases while following (#39335)
This commit is the final piece of the integration of CCR with retention
leases. Namely, we periodically renew retention leases and advance the
retaining sequence number while following.
2019-02-25 17:14:19 -05:00
Lee Hinman 7b8178c839
Remove Hipchat support from Watcher (#39374)
* Remove Hipchat support from Watcher (#39199)

Hipchat has been shut down and has previously been deprecated in
Watcher (#39160), therefore we should remove support for these actions.

* Add migrate note
2019-02-25 15:08:46 -07:00
Igor Motov d5046b1c25 [CI] Fixes testQueryRandomGeoCollection failure again (#39275)
Moves the check for tiny polygons earlier in the test. It turned out
that polygons can be so tiny that we cannot even figure out their
orientation.

Relates to #37356
2019-02-25 16:35:17 -05:00
Evgenia Badyanova 1ed3407930
Reduce garbage from allocations in deprecation logger (#38780) (#39370)
1. Setting length for formatWarning String to avoid AbstractStringBuilder.ensureCapacityInternal calls
2. Adding extra check for parameter array length == 0 to avoid unnecessarily creating StringBuilder in LoggerMessageFormat.format

Helps to narrow the performance gap in throughout for geonames benchmark (#37411) by 3%. For more details: https://github.com/elastic/elasticsearch/issues/37530#issuecomment-462758384 

Relates to #37530
Relates to #37411
Relates to #35754
2019-02-25 16:23:22 -05:00
Lee Hinman 5c7dd6f0ee Set mappings when creating indices in SuggestSearchIT (#39323)
* Set mappings when creating indices in SuggestSearchIT

These tests don't test dynamic mapping, so they can use preset mappings. This
removes the possibility they may fail due to the mapping not being available
since mapping updates are asynchronous.

Resolves #39315

* Wrap creates in assertAcked
2019-02-25 13:27:03 -07:00
Benjamin Trent 926291aac8
[DATA-FRAME] Sort `GET` transforms and stats by ID (#39365) (#39369)
* [Data-Frame] Sort `GET` transforms and stats by ID

* removing unused import
2019-02-25 14:22:41 -06:00
Nhat Nguyen 0f29b89655 Unmute FollowerFailOverIT#testFailOverOnFollower
Relates #38633
2019-02-25 14:44:44 -05:00
James Baiera e6a124c118
[Backport 7.x] Fix the OS sensing code in ClusterFormationTasks (#38457)
This fixes a bug in the sensing of the current OS family in the test cluster
formation code. Previously all builds would assume every environment 
was windows and would jump to using the windows zip build. This fixes 
the OS sensing code as well as updates some tests to account for 
different build flavors.

Backport of #38457
2019-02-25 14:39:34 -05:00
Hendrik Muhs 1897883adc
[ML-DataFrame] Dataframe access headers (#39289) (#39368)
store user headers as part of the config and run transform as user
2019-02-25 19:08:26 +01:00
Mayya Sharipova bf058d6e4d
Fix anaylze NullPointerException when AnalyzeTokenList tokens is null (#39332) (#39361) 2019-02-25 12:49:18 -05:00
Benjamin Trent 3d49523726
[DATA-FRAME] adds specs and yml tests for existing endpoints (#39326) (#39363)
* [DATA-FRAME] adds specs and yml tests for existing endpoints

* removing bad URL, adding test for _all
2019-02-25 11:19:49 -06:00
Nhat Nguyen 48219112e3 Do not wait for advancement of checkpoint in recovery (#39006)
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:

- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).

- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica.  However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (i.e.  also see
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.

Relates #39000
Closes #38949

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
2019-02-25 12:10:14 -05:00
David Turner 236db51d34 Fix testSnapshotFileFailureDuringSnapshot (#39362)
Today this test catches an exception and asserts that its proximate cause has
message `Random IOException` but occasionally this exception is wrapped two
layers deep, causing the test to fail. This commit adjusts the test to look at
the root cause of the exception instead.

      1> [2019-02-25T12:31:50,837][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotFileFailureDuringSnapshot] --> caught a top level exception, asserting what's expected
      1> org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] Snapshot could not be read
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:212) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
      1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
      1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
      1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
      1> Caused by: org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:564) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more
      1> Caused by: java.io.IOException: Random IOException
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:275) ~[test/:?]
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.readBlob(MockRepository.java:317) ~[test/:?]
      1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:90) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:560) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more

    FAILURE 0.59s J0 | SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot <<< FAILURES!
       > Throwable #1: java.lang.AssertionError:
       > Expected: a string containing "Random IOException"
       >      but: was "[test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots"
       > 	at __randomizedtesting.SeedInfo.seed([B73CA847D4B4F52D:884E042D2D899330]:0)
       > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
       > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot(SharedClusterSnapshotRestoreIT.java:821)
       > 	at java.lang.Thread.run(Thread.java:748)
2019-02-25 16:43:55 +00:00
Marios Trivyzas 11fe8cd16f
[Tests] Fix flakiness by ensuring stable cluster (#39300) (#39356)
In integration tests where `setBootstrapMasterNodeIndex()` is used in
combination with `autoMinMasterNodes = false` the cluster can start
bootstrapping once the number of nodes set with the
`setBootstrapMasterNodeIndex` have been started but it's not ensured
that all nodes have successfully joined to form the cluster.

This behaviour was introduced with 5db7ed22a0
and in order to ensure that the cluster is properly formed before proceeding
with the integration test, use `ensureStableCluster()` with the
appropriate number of expected nodes.

Fixes: #39220
2019-02-25 17:26:15 +01:00
Luca Cavanna b6734b412d [DOCS] Fix typo in network-host.asciidoc 2019-02-25 16:55:06 +01:00
Darren Meiss 1d902cce88 Edits to text in Optimistic Concurrency Ctrl doc (#39191) 2019-02-25 16:54:23 +01:00
Darren Meiss 8f0d864ae1 Minor edits to text in Reindex API doc (#39137) 2019-02-25 16:54:17 +01:00
David Turner dc23be5a9d Avoid creating a green index in RetentionLeaseIT (#39347)
In #39224 we made shard history retention lease syncing ignore the
`index.write.wait_for_active_shards` setting on the index, and added a test
that showed that it was ignored. However the test as merged actually creates a
green index, so the `wait_for_active_shards` setting has no effect. This change
adjusts the test to create a yellow index to verify that
`wait_for_active_shards` really is ignored.
2019-02-25 15:33:09 +00:00
Martijn van Groningen 6f69ef165b
Protect against the leader index being removed (#39351)
when dealing with TimeoutException

The `IndexFollowingIT#testDeleteLeaderIndex()`` test failed,
because a NPE was captured as fatal error instead of an IndexNotFoundException.

Closes #39308
2019-02-25 13:40:10 +01:00
Costin Leau 9d97f3289d Mute CcrRollingUpgradeIT#testCannotFollowLeaderInUpgradedCluster
See #39355
2019-02-25 14:06:27 +02:00
Martijn van Groningen ea4e9ed0a9
Add missing tests for CcrRequestConverters (#39228) 2019-02-25 12:48:43 +01:00
Martijn van Groningen 9bf0538878
Wait for index following is active for auto followed index (#39175)
before executing pause follow api:

https://github.com/elastic/elasticsearch/issues/39126#issuecomment-465512002

Closes #39126
2019-02-25 10:44:20 +01:00
Yannick Welsch a2bc41621c Clean GatewayAllocator when stepping down as master (#38885)
This fixes an issue where a messy master election might prevent shard allocation to properly
proceed. I've encountered this in failing CI tests when we were bootstrapping multiple nodes. Tests
would sometimes time out on an `ensureGreen` after an unclean master election. The reason for
this is how the async shard information fetching works and how the clean-up logic in
GatewayAllocator is integrated with the rest of the system. When a node becomes master, it will, as
part of the first cluster state update where it becomes master, already try allocating shards (see
`JoinTaskExecutor`, in particular the call to `reroute`). This process, which runs on the
MasterService thread, will trigger async shard fetching. If the node is still processing an earlier
election failure in ClusterApplierService (e.g. due to a messy election), that will possibly trigger the
clean-up logic in GatewayAllocator after the shard fetching has been initiated by MasterService,
thereby cancelling the fetching, which means that no subsequent reroute (allocation) is triggered
after the shard fetching results return. This means that no shard allocation will happen unless the
user triggers an explicit reroute command. The bug imo is that GatewayAllocator is called from both
MasterService and ClusterApplierService threads, with no clear happens-before relation. The fix
here makes it so that the clean-up logic is also run on the MasterService thread instead of the
ClusterApplierService thread, reestablishing a clear happens-before relation. Note that testing this
is tricky. With the newly added test, I can quite often reproduce this by adding `Thread.sleep(10);`
in ClusterApplierService (to make sure it does not go too quickly) and adding `Thread.sleep(50);` in
`TransportNodesListGatewayStartedShards` to make sure that shard state fetching does not go too
quickly either.

Note that older versions of Zen discovery are affected by this as well, but did not exhibit this issue
as often because master elections are much slower there.
2019-02-25 10:37:31 +01:00
Yogesh Gaikwad 7021e1bd3b
Add await busy loop for SimpleKdcLdapServer initialization (#39221) (#39342)
There have been intermittent failures where either
LDAP server could not be started or KDC server could
not be started causing failures during test runs.

`KdcNetwork` class from Apache kerby project does not set reuse
address to `true` on the socket so if the port that we found to be free
is in `TIME_WAIT` state it may fail to bind. As this is an internal
class for kerby, I could not find a way to extend.

This commit adds a retry loop for initialization. It will keep
trying in an await busy loop and fail after 10 seconds if not
initialized.

Closes #35982
2019-02-25 20:35:08 +11:00
David Turner 96c09b032d Ignore waitForActiveShards when syncing leases (#39224)
Adjust the retention lease sync actions so that they do not respect the
`index.write.wait_for_active_shards` setting on an index, allowing them to sync
retention leases even if insufficiently many shards are currently active to
accept writes.

Relates #39089
2019-02-25 08:53:43 +00:00
Armin Braun 50d2736746
Fix Deadlock from Thread.suspend in Test (#39261) (#39341)
* The lambda invoked by the `lockedExecutor` eventually gets JITed (which runs a static initializer that we will suspend in with a very tiny chance).
   * Fixed by creating the `Runnable` in the main test thread and using the same instance in all threads
* Closes #35686
2019-02-25 09:15:19 +01:00
Nhat Nguyen f17d408fbb Add cause to assert_no_failure when replay translog (#39333)
We tripped this assertion three times for the last two weeks. However,
it only says "this IndexWriter is closed" without the actual cause.

```
[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting

java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
```

This change replaces an assert with an AssertionError so that we will
have the actual cause in the next build failures.

Relates #38898
2019-02-23 13:04:43 -05:00
Mayya Sharipova e80284231d
Backport distance functions vectors (#39330)
Distance functions for dense and sparse vectors

Backport for #37947, #39313
2019-02-23 11:52:43 -05:00
Jason Tedor 6e06f82106
Fix failing CCR retention lease test
Finally! This commit should fix the issues with the CCR retention lease
that has been plaguing build failures. The issue here is that we are
trying to prevent the clear session requests from being executed until
after we have been able to validate that retention leases are being
renewed. However, we were only blocking the clear session requests but
not blocking them when they are proxied through another node. This
commit addresses that.

Relates #39268
2019-02-22 20:43:39 -05:00
Jason Tedor 2d4c98a991
Change sort order of shard stats in CCR test
This commit changes the sort order of shard stats that are collected in
CCR retention lease integration tests. This change is done so that
primaries appear first in sort order.
2019-02-22 18:17:28 -05:00
Jason Tedor e569cf8324
Address failing CCR retention lease test
This test fails rarely but it is flaky in its current form. The problem
here is that we lack a guarantee on the retention leases having been
synced to all shard copies. We need to sleep long enough to ensure that
that occurs, and then we can sample the retention leases, possibly sleep
again (we usually will not have too since the first sleep will have been
long enough to allow a sync and a renewal to happen, if one was going to
happen), and the sample the retention leases for comparison.

Closes #39331
2019-02-22 18:15:10 -05:00
Jason Tedor e4e96b8181
Fix shard logged in background lease renewal
The shard logged here is the leader shard but it should be the follower
shard since this background retention lease renewal is happening on the
follower side. This commit fixes that.
2019-02-22 17:32:51 -05:00
Jason Tedor feb25c71a0
Simplify mocking in CCR retention lease tests
This commit simplifies the use of transport mocking in the CCR retention
lease integration tests. Instead of adding a send rule between nodes, we
add a default send rule. This greatly simplifies the code here, and
speeds the test up a little bit too.
2019-02-22 17:24:12 -05:00
Zachary Tong c7516b03b6
Better HoltWinters parameter validation (#38747)
We validate HW parameters (namely, window > 2 * period) when parsing
the XContent... but that means transport clients can configure bad
params.

This change allows model to validate the window and throw an 
exception if they wish.

It also makes some test changes:

- removes testBadModelParams(), which was a junk test (didn't do
anything), and bad param checking is done elsewhere in units tests
- Fixes one of the windows in testHoltWintersNotEnoughData()
- Ensures the period in testHoltWintersNotEnoughData() is >> window
- Removes `setTypes()` since that's deprecated
2019-02-22 15:25:26 -05:00
Tim Brooks 931953a3ee
Ensure index commit released when testing timeouts (#39273)
This fixes #39245. Currently it is possible in this test that the clear
session call times-out. This means that the index commit will not be
released and there will be an assertion triggered in the test teardown.
This commit ensures that we wipe the leader index in the test to avoid
this assertion.

It is okay if the clear session call times-out in normal usage. This
scenario is unavoidable due to potential network issues. We have a local
timeout on the leader to clean it up when this scenario happens.
2019-02-22 11:14:42 -07:00
Benjamin Trent 3262d6c917
[ML-DataFrame] Add _preview endpoint (#38924) (#39319)
* [DATA-FRAME] add preview endpoint

* adjusting preview tests and fixing parser

* adjusing preview transport

* remove unused import

* adjusting test

* Addressing PR comments

* Fixing failing test and adjusting for pr comments

* fixing integration test
2019-02-22 10:55:38 -06:00
David Roberts 4f2bd238d2 [ML] Increase datafeed integration test timeout for slow machines (#39311)
The assertBusy() that waits the default 10 seconds for a
datafeed to complete very occasionally times out on slow
machines.  This commit increases the timeout to 60 seconds.
It will almost never actually take this long, but it's
better to have a timeout that will prevent time being
wasted looking at spurious test failures.
2019-02-22 15:35:32 +00:00