Commit Graph

3028 Commits

Author SHA1 Message Date
David Roberts 366eef99a1 Mute SharedClusterSnapshotRestoreIT.testCloseOrDeleteIndexDuringSnapshot
Due to https://github.com/elastic/elasticsearch/issues/39828
2019-03-08 11:42:13 +00:00
David Turner 5d68143b18 Reformat elasticsearch-node messages (#39811)
Flows the warning messages emitted by the `elasticsearch-node` tool to a width
of 72 characters and tweaks the wording slightly.
2019-03-08 10:01:29 +00:00
Jake Landis 797d6b8a66
Execute ingest node pipeline before creating the index (#39607) (#39796)
Prior to this commit (and after 6.5.0), if an ingest node changes
the _index in a pipeline, the original target index would be created.
For daily indexes this could create an extra, empty index per day.

This commit changes the TransportBulkAction to execute the ingest node
pipeline before attempting to create the index. This ensures that the 
only index created is the original or one set by the ingest node pipeline. 
This was the execution order prior to 6.5.0 (#32786). 

The execution order was changed in 6.5 to better support default pipelines.
Specifically, the execution order was changed to be able to read the settings
from the index metadata. This commit also changes the logic so that if the
target index does not exist when the ingest node pipeline runs, the default
pipeline (if one exists) is taken from the settings of the best-matching index
template.

Relates #32786
Relates #32758 
Closes #36545
2019-03-07 13:31:41 -06:00
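A minimal sketch of the new ordering, assuming a document is a plain `Map` and an ingest pipeline is a `UnaryOperator` that may rewrite `_index`. `BulkIngestOrder` and its in-memory set of indices are hypothetical stand-ins for `TransportBulkAction` and the cluster metadata. The default-pipeline lookup from the best-matching template is not modeled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the bulk action; only the ordering is modeled.
final class BulkIngestOrder {
    // Stands in for the indices that exist in the cluster.
    final Set<String> existingIndices = new HashSet<>();

    // Runs the pipeline first, then auto-creates only the index the document finally targets.
    String indexDocument(Map<String, Object> doc, UnaryOperator<Map<String, Object>> pipeline) {
        Map<String, Object> processed = pipeline.apply(new HashMap<>(doc));
        String finalIndex = (String) processed.get("_index");
        // Set#add is a no-op if the index already exists, mirroring "create if missing".
        existingIndices.add(finalIndex);
        return finalIndex;
    }
}
```

Because the pipeline runs before index creation, a pipeline that redirects documents away from a daily index no longer leaves an empty daily index behind.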
Jason Tedor 0250d554b6
Introduce forget follower API (#39718)
This commit introduces the forget follower API. This API is needed in cases where
unfollowing a following index fails to remove the shard history retention leases
on the leader index. This can happen explicitly through user action, or
implicitly through an index managed by ILM. When this occurs, history will be
retained longer than necessary. While the retention leases will eventually
expire, it can be expensive to let history persist for that long, and the leases
also prevent ILM from performing actions like shrink on the leader index. We
therefore introduce an API that allows manual removal of the shard history
retention leases in this case.
2019-03-07 11:08:45 -05:00
Armin Braun 213cc6673c
Remove Dead Code in o.e.util package (#39717) (#39779)
* None of this code is used so we should delete it, we can always bring it back if needed
2019-03-07 08:31:46 +01:00
Nhat Nguyen b69affda6a Use unwrapped cause to determine if node is closing (#39723)
We need to unwrap and use the actual cause when determining if the node
with primary shard is shutting down because TransportService will throw
a TransportException wrapped in a SendRequestTransportException.

Relates #39584
2019-03-06 15:30:55 -05:00
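The check described in this commit can be sketched in plain Java. `TransportException`, `SendRequestTransportException` and `NodeClosedException` below are simplified hypothetical stand-ins, not the real Elasticsearch classes, and `unwrapCause` only peels off the transport wrapper layers modeled here.

```java
// Simplified stand-ins for the Elasticsearch exception types (hypothetical, for illustration only).
class TransportException extends RuntimeException {
    TransportException(String msg, Throwable cause) { super(msg, cause); }
}

class SendRequestTransportException extends TransportException {
    SendRequestTransportException(String msg, Throwable cause) { super(msg, cause); }
}

class NodeClosedException extends RuntimeException {
    NodeClosedException(String msg) { super(msg); }
}

final class CauseUnwrapping {
    // Peels off transport-level wrappers until a non-wrapper cause is reached.
    static Throwable unwrapCause(Throwable t) {
        Throwable current = t;
        while (current instanceof TransportException && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Inspects the unwrapped cause, not the wrapper, to decide whether the node is closing.
    static boolean isNodeClosing(Throwable failure) {
        return unwrapCause(failure) instanceof NodeClosedException;
    }
}
```

Checking `failure instanceof NodeClosedException` directly would miss the wrapped case, which is the bug this commit fixes.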
Nhat Nguyen 1fe7cb594f Don’t ack if unable to remove failing replica (#39584)
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if the node holding the
primary is shutting down, we might ack such a request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore the NodeClosedException, which is thrown when the
ClusterService is being closed.

Closes #39467
2019-03-06 15:30:55 -05:00
markharwood 1873de5240
Bug fix for AnnotatedTextHighlighter - port of 39525 (#39749)
Relates to #39395
2019-03-06 19:02:04 +00:00
Yannick Welsch d094107592 Fix SharedClusterSnapshotRestoreIT
Relates to #39644
2019-03-06 17:51:23 +01:00
Yannick Welsch fef11f7efc Allow snapshotting replicated closed indices (#39644)
This adds the capability to snapshot replicated closed indices.

It also changes snapshot requests in v8.0.0 to automatically expand wildcards to closed indices and hence start snapshotting closed indices by default. In 7.x, from v7.1.0 onward, wildcards are by default only expanded to open indices; this can be changed by explicitly setting the expand_wildcards option to either all or closed.

Note that indices are always restored as open indices, even if they have been snapshotted as closed replicated indices.

Relates to #33888
2019-03-06 16:08:20 +01:00
Simon Willnauer e620fb2e4a Add option to force load term dict into memory (#39741)
Lucene added an optimization to leave the term dictionary on disk
for non-id-like fields. This change happened so late in the release
process that it's better to have an escape hatch in case certain
use cases are hurt by this optimization. This setting might be
removed in the future if it turns out to be unnecessary.
2019-03-06 15:29:04 +01:00
Christoph Büscher 6c503824c8 Fix occasional SearchServiceTests failure (#39697)
Currently SearchServiceTests.testCloseSearchContextOnRewriteException can fail
if a refresh happens while we test for the SearchPhaseExecutionException that is
thrown later in the test. The test takes the current Store#refCount and expects
it to be the same after the exception is thrown. If a refresh happens in that
interval, however, the refCount will be different, causing the test to fail. This
can be provoked, e.g., by running this section in a tight loop.
Switching off refresh for this test solves the issue.
2019-03-06 14:18:03 +01:00
Andrey Ershov 52fd102e23 Avoid serialising state if it was already serialised (#39179)
When preparing the state to send to other nodes, we're serializing it
for each node, despite using putIfAbsent.
This commit checks if the state was already serialized for this node
version before performing the potentially expensive computation.
The map is not used by multiple threads, so computeIfAbsent is not
needed (and could not be used here easily, because IOException could
be thrown).

(cherry picked from commit c99be63b43f5250f3cd220130df73c5e9e097459)
2019-03-06 11:54:13 +01:00
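A minimal sketch of the "serialize once per node version" pattern, assuming node versions are plain integers. `StateSerializationCache` and its `serialize` method are hypothetical stand-ins for the real cluster-state serialization; the point is the `containsKey` check, used instead of `computeIfAbsent` because the serializer throws a checked `IOException`.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StateSerializationCache {
    // Counts how often the expensive serialization ran, to make the saving visible.
    int serializations = 0;

    // Hypothetical stand-in for serializing the cluster state for one wire version.
    byte[] serialize(String state, int version) throws IOException {
        serializations++;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(version);
            out.write(state.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    Map<Integer, byte[]> serializeForNodes(String state, List<Integer> nodeVersions) throws IOException {
        Map<Integer, byte[]> byVersion = new HashMap<>();
        for (int version : nodeVersions) {
            // Check first instead of computeIfAbsent: serialize() throws a checked IOException.
            if (byVersion.containsKey(version) == false) {
                byVersion.put(version, serialize(state, version));
            }
        }
        return byVersion;
    }
}
```

With four nodes on two distinct versions, the state is serialized twice rather than four times.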
David Turner 295e39a8c8 Drop node if asymmetrically partitioned from master (#39598)
When a node is joining the cluster we ensure that it can send requests to the
master _at that time_. If it joins the cluster and _then_ loses the ability to
send requests to the master then it should be removed from the cluster. Today
this is not the case: the master can still receive responses to its follower
checks and still receives acknowledgements of cluster state publications, so it
has no reason to remove the node.

This commit changes the handling of follower checks so that they fail if they
come from a master that the other node was following but which it now believes
to have failed.
2019-03-06 09:41:57 +00:00
David Turner 77dd711847 Tidy up GroupedActionListener (#39633)
Today the `GroupedActionListener` accepts a `defaults` parameter but all
callers pass an empty list. Also it is permitted to pass an empty group but
this is trappy because the delegated listener is never called in that case.
This commit removes the `defaults` parameter and forbids an empty group.
2019-03-06 09:25:10 +00:00
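A simplified sketch of a grouped listener with the tidied-up contract: no `defaults` parameter, and an empty group is rejected up front. `Listener` is a hypothetical minimal interface standing in for `ActionListener`, and `GroupedListener` is not the real `GroupedActionListener`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical minimal listener interface standing in for ActionListener.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);
}

// Collects groupSize results and notifies the delegate exactly once.
final class GroupedListener<T> implements Listener<T> {
    private final Listener<Collection<T>> delegate;
    private final AtomicInteger countDown;
    private final List<T> results;
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    GroupedListener(Listener<Collection<T>> delegate, int groupSize) {
        // An empty group would never call the delegate, so it is forbidden.
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive but was " + groupSize);
        }
        this.delegate = delegate;
        this.countDown = new AtomicInteger(groupSize);
        this.results = Collections.synchronizedList(new ArrayList<>(groupSize));
    }

    @Override
    public void onResponse(T response) {
        results.add(response);
        if (countDown.decrementAndGet() == 0) {
            finish();
        }
    }

    @Override
    public void onFailure(Exception e) {
        // Keep the first failure and attach later ones as suppressed exceptions.
        if (failure.compareAndSet(null, e) == false && failure.get() != e) {
            failure.get().addSuppressed(e);
        }
        if (countDown.decrementAndGet() == 0) {
            finish();
        }
    }

    private void finish() {
        Exception e = failure.get();
        if (e != null) {
            delegate.onFailure(e);
        } else {
            delegate.onResponse(Collections.unmodifiableList(new ArrayList<>(results)));
        }
    }
}
```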
Armin Braun aaecaf59a4
Optimize Bulk Message Parsing and Message Length Parsing (#39634) (#39730)
* findNextMarker took almost 1ms per invocation during the PMC rally track
  * Fixed to be about an order of magnitude faster by using Netty's bulk `ByteBuf` search
* It is unnecessary to instantiate an object (the input stream wrapper) and throw it away, just to read the `int` length from the message bytes
  * Fixed by adding bulk `int` read to BytesReference
2019-03-06 08:13:15 +01:00
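The two operations this commit optimizes can be illustrated over a plain `byte[]`. This sketch does not reproduce Netty's `ByteBuf` search or the `BytesReference` change; `BulkBytes.indexOf` and `BulkBytes.readInt` are hypothetical helpers that only show what is computed: the position of the next marker byte, and a big-endian `int` read straight from the bytes without an input-stream wrapper.

```java
final class BulkBytes {
    // Returns the index of the next occurrence of marker at or after from, or -1 if absent.
    static int indexOf(byte[] data, int from, byte marker) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == marker) {
                return i;
            }
        }
        return -1;
    }

    // Reads a big-endian int directly from the array, without allocating a stream wrapper.
    static int readInt(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 24)
            | ((data[offset + 1] & 0xFF) << 16)
            | ((data[offset + 2] & 0xFF) << 8)
            | (data[offset + 3] & 0xFF);
    }
}
```

For a bulk body, the marker is the newline that separates the action line from the source line.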
Jason Tedor 75a0d4f470
Rename retention lease setting (#39719)
This commit renames the retention lease setting
index.soft_deletes.retention.lease so that it is under the namespace
index.soft_deletes.retention_lease. As such, we rename the setting to
index.soft_deletes.retention_lease.period.
2019-03-05 22:04:45 -05:00
Jason Tedor 504c792861
Add Docker build type (#39378)
This commit adds a new build type (together with deb/rpm/tar/zip) to
represent the official Docker images. This build type will be displayed
in APIs such as the main and nodes info APIs.
2019-03-05 22:03:15 -05:00
Luca Cavanna 9d0211485c Tie-break completion suggestions with same score and surface form (#39564)
In case multiple completion suggestion entries have the same score and
surface form, the order in which such options will be returned is
currently not deterministic.

With this commit we introduce tie-breaking for such situations, based
on shard id, index name, index uuid and doc id, like we already do for
ordinary search hits. With this change we also make shardIndex
mandatory when sorting and comparing completion suggestion options;
previously it was only needed later, when fetching hits.

Also, we need to make sure shardIndex is properly set when merging
completion suggestions coming from multiple clusters in
`SearchResponseMerger`.
2019-03-05 18:03:54 +01:00
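A sketch of a deterministic ordering for completion options, assuming the shard identity (index name, index uuid, shard id) has already been collapsed into a single `shardIndex` integer. `CompletionOption` is a hypothetical simplified class, not the real suggestion option type.

```java
import java.util.Comparator;

final class CompletionOption {
    final float score;
    final String text;
    final int shardIndex;
    final int doc;

    CompletionOption(float score, String text, int shardIndex, int doc) {
        this.score = score;
        this.text = text;
        this.shardIndex = shardIndex;
        this.doc = doc;
    }

    // Higher score first, then surface form, then shard index and doc id, so options
    // with identical score and text always come back in the same order.
    static final Comparator<CompletionOption> ORDER =
        Comparator.comparingDouble((CompletionOption o) -> o.score).reversed()
            .thenComparing(o -> o.text)
            .thenComparingInt(o -> o.shardIndex)
            .thenComparingInt(o -> o.doc);
}
```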
Jim Ferenczi 160dc29f0e Handle total hits equal to track_total_hits (#37907)
This change ensures that a total hit count equal to the value set for
track_total_hits is not considered a lower bound.
2019-03-05 16:28:48 +01:00
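A simplified counting sketch of the rule above: a count is reported as a lower bound only once a match beyond the threshold has been seen. `TotalHitsSketch` is hypothetical; its `Relation` enum mirrors the two relations of Lucene's `TotalHits`, but this is not Lucene's collector code.

```java
import java.util.Iterator;

final class TotalHitsSketch {
    // Mirrors the two relations used by Lucene's TotalHits.
    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    final long value;
    final Relation relation;

    private TotalHitsSketch(long value, Relation relation) {
        this.value = value;
        this.relation = relation;
    }

    // Counts matches until more than threshold have been seen. A count that lands
    // exactly on the threshold, with no further match, is still exact.
    static TotalHitsSketch count(Iterator<?> matches, long threshold) {
        long count = 0;
        while (matches.hasNext()) {
            matches.next();
            if (count == threshold) {
                // A (threshold + 1)-th match exists, so the threshold is only a lower bound.
                return new TotalHitsSketch(threshold, Relation.GREATER_THAN_OR_EQUAL_TO);
            }
            count++;
        }
        return new TotalHitsSketch(count, Relation.EQUAL_TO);
    }
}
```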
Armin Braun 750ec8ba53
Minor Cleanups in QueryPhase (#39680) (#39694)
* Soften redundant cast to allow use of `DeterministicTaskQueue` in this class for #39504
* Remove two redundant variables and lower visibility in two possible spots
* Make field `final`
2019-03-05 15:04:16 +01:00
Christoph Büscher 5cdea6ef17 Fix Fuzziness#asDistance(String) (#39643)
Currently Fuzziness#asDistance(String) doesn't work for custom AUTO values. If
the fuzziness is AUTO, the method returns the correct edit distance to use,
depending on the input string, but for custom AUTO values it currently always
returns an edit distance of 1. This commit corrects this and adds unit and
integration tests to catch these cases.

Closes #39614
2019-03-05 14:31:07 +01:00
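A sketch of how `AUTO:low,high` maps a term to an edit distance, using the documented defaults of 3 and 6 for plain `AUTO`. `AutoFuzziness` is a hypothetical helper, not the real `Fuzziness` class, and counting code points rather than chars is an assumption of this sketch.

```java
final class AutoFuzziness {
    // Edit distance for AUTO:low,high. Terms shorter than low must match exactly,
    // terms shorter than high allow one edit, and longer terms allow two.
    static int asDistance(String term, int low, int high) {
        int length = term.codePointCount(0, term.length());
        if (length < low) {
            return 0;
        } else if (length < high) {
            return 1;
        }
        return 2;
    }

    // Plain AUTO uses the default thresholds low = 3 and high = 6.
    static int asDistance(String term) {
        return asDistance(term, 3, 6);
    }
}
```

With a custom `AUTO:3,4`, a five-character term allows two edits; the bug described above would have returned 1.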
Simon Willnauer 19f6a35358 Move BWC Version to 7.1.0 after backport
Relates to #39512
2019-03-05 14:11:59 +01:00
Simon Willnauer d112c89041 Allow inclusion of unloaded segments in stats (#39512)
Today there is no way to fetch actual segment stats for segments that
are currently unloaded. This is relevant in the case of frozen indices.
This change makes it possible to monitor how much memory a frozen index
would use if it were unfrozen.
2019-03-05 14:02:20 +01:00
Armin Braun e8d9744340
Use Threadpool Time in ClusterApplierService (#39679) (#39685)
* Use threadpool's time in `ClusterApplierService` to allow for deterministic tests
* This is a part of/requirement for #39504
2019-03-05 12:37:49 +01:00
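The deterministic-test idea here amounts to injecting the time source instead of reading the system clock directly. A minimal sketch, where `ApplierTiming` is a hypothetical class and the `LongSupplier` stands in for the thread pool's relative time:

```java
import java.util.function.LongSupplier;

final class ApplierTiming {
    private final LongSupplier relativeTimeMillis;

    // The clock is injected rather than read from System.currentTimeMillis(),
    // so a test can drive it deterministically.
    ApplierTiming(LongSupplier relativeTimeMillis) {
        this.relativeTimeMillis = relativeTimeMillis;
    }

    // Returns how long the task took according to the injected clock.
    long timeTask(Runnable task) {
        long start = relativeTimeMillis.getAsLong();
        task.run();
        return relativeTimeMillis.getAsLong() - start;
    }
}
```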
Gordon Brown 380dc27d91 Mute testCloseWhileRelocatingShards (#39589) 2019-03-05 13:34:43 +02:00
Alan Woodward 0b14782b23 Add stopword support to IntervalBuilder (#39637)
The match interval builder analyses input text and converts it to an IntervalSource, and as such
may generate token streams with stopwords. This commit deals with these by using the extend
factory to cover the gaps produced by these stopwords so that phrase and ordered queries work
correctly.
2019-03-05 10:50:45 +00:00
Christoph Büscher 2fe1fa8972
Shortcut counts on exists queries (#39570) (#39660)
`TopDocsCollectorContext` can already shortcut hit counts on `match_all` and `term` queries when there are no deletions. 
This change adds this ability for `exists` queries if the index doesn't have deletions and fields are indexed.

Closes #37475
2019-03-04 19:53:43 +01:00
Prabhakar S 98925e9a09 Fixing the custom object serialization bug in diffable utils. (#39544)
While serializing custom objects, the length of the list is computed after
filtering out the unsupported objects, but the same filter is not applied
when the objects themselves are written. Unsupported objects are therefore
still written and fail to deserialize on the receiver. This commit adds the
condition to filter out unsupported custom objects when writing them as well.
2019-03-04 18:41:14 +01:00
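A sketch of the fix, with strings standing in for custom objects and a hypothetical `supported` predicate standing in for the version check. `FilteredListWriter` is hypothetical, not the real `DiffableUtils` code; the key point is that the length prefix and the written items come from the same filtered list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class FilteredListWriter {
    // Applies the filter once, so the count and the written items always agree.
    static byte[] write(List<String> items, Predicate<String> supported) throws IOException {
        List<String> toWrite = new ArrayList<>();
        for (String item : items) {
            if (supported.test(item)) {
                toWrite.add(item);
            }
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(toWrite.size());
            for (String item : toWrite) {
                out.writeUTF(item);
            }
        }
        return bytes.toByteArray();
    }

    // Reads back exactly as many items as the length prefix announces.
    static List<String> read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<String> items = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                items.add(in.readUTF());
            }
            return items;
        }
    }
}
```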
Nhat Nguyen 801f13f201 Assert recovery done in testDoNotWaitForPendingSeqNo (#39595)
Since #39006 we should be able to complete a peer-recovery without
waiting for pending indexing operations. Thus, the assertion in
testDoNotWaitForPendingSeqNo should be updated from false to true.

Closes #39510
2019-03-04 10:21:23 -05:00
Yannick Welsch 936dbb00e3
Isolate Zen1 (#39470)
Cherry-picks a few commits from #39466 to align 7.x with master branch.
2019-03-04 15:51:17 +01:00
Luca Cavanna 9ddaabba88 Remove private SearchHits.Total class (#39556)
This is now possible because Lucene's `TotalHits` implements `equals`/`hashCode`;
all the other methods can be inlined in `SearchHits` instead, so there is no need
for a specific wrapper class.
2019-03-04 13:46:45 +01:00
Armin Braun 547af21a12
Introduce Mapping ActionListener (#39538) (#39636)
* Introduce Safer Chaining of Listeners

* The motivation here is to make reasoning about chains of `ActionListener` a little easier, by providing a safe method for nesting `ActionListener` that guarantees that a response is never dropped. Also, it dries up the code a little by removing the need to repeat `listener::onFailure` and `listener.onResponse` over and over.
* Refactored a number of obvious/easy spots to use the new listener constructor
2019-03-04 12:56:46 +01:00
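A sketch of a mapping listener, using a hypothetical `Listener` interface and `CheckedFunction` rather than the real `ActionListener` API. An exception thrown by the mapping function is routed to `onFailure`, so a response is never silently dropped. In this sketch, an exception thrown by the delegate's own `onResponse` propagates to the caller rather than being routed to `onFailure`; that is a design choice of the sketch, not necessarily of Elasticsearch.

```java
// Hypothetical function type that may throw checked exceptions.
@FunctionalInterface
interface CheckedFunction<A, B> {
    B apply(A a) throws Exception;
}

// Hypothetical minimal listener interface standing in for ActionListener.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);

    // Returns a listener that maps each response with fn before passing it on.
    static <A, B> Listener<A> map(Listener<B> delegate, CheckedFunction<A, B> fn) {
        return new Listener<A>() {
            @Override
            public void onResponse(A response) {
                B mapped;
                try {
                    mapped = fn.apply(response);
                } catch (Exception e) {
                    // A failing mapping becomes a failure instead of a lost response.
                    delegate.onFailure(e);
                    return;
                }
                delegate.onResponse(mapped);
            }

            @Override
            public void onFailure(Exception e) {
                delegate.onFailure(e);
            }
        };
    }
}
```

Callers no longer need to repeat `listener::onFailure` in every nested callback; they only supply the mapping.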
Daniel Mitterdorfer fca6a2f006
Avoid deprecated API usage in TaskOperationFailure (#39303) (#39628)
With this commit we remove usage of the deprecated method
`ExceptionsHelper#detailedMessage` in the class `TaskOperationFailure`.

Relates #19069
2019-03-04 11:37:59 +01:00
David Turner dd68244841
Wait for state recovery in testFreshestMasterElectedAfterFullClusterRestart (#39602)
Zen1IT#testFreshestMasterElectedAfterFullClusterRestart fails sometimes because
we request the cluster state before state recovery has completed, and therefore
obtain the default value for the setting we're relying on.

Confusingly, we were starting out by setting this setting to its default value,
so the test looked like it was failing because of a production bug. This commit
avoids this confusion in future by setting it to a non-default value at the
start of the test.

Fixes #39586.
2019-03-04 10:26:07 +00:00
Adrien Grand 782f873165
Don't swallow exceptions in Store#close(). (#39035) (#39622)
Previously, `Store#close()` swallowed any `IOException`.

Relates #39030
2019-03-04 10:58:43 +01:00
Adrien Grand 934946a232
Don't swallow exception in ThreadPool.terminate. (#39038) (#39623)
The use of `closeWhileHandlingException` means that any exception while trying
to close the threadpool is going to be swallowed.

Relates #39030
2019-03-04 10:58:29 +01:00
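Both of the commits above replace exception swallowing with propagation. A minimal sketch of closing several resources without losing failures: every resource is still closed, the first `IOException` is rethrown, and later ones are attached as suppressed exceptions. `CloseAll` is a hypothetical helper, not the Elasticsearch utility used in these commits.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

final class CloseAll {
    // Closes every resource even if an earlier close throws an IOException, then
    // rethrows the first failure with later ones attached as suppressed exceptions.
    static void close(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```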
Adrien Grand 21540a5ada
Enhancements to IndicesQueryCache. (#39099) (#39626)
This commit adds the following:
 - more tests to IndicesServiceCloseTests, one of them found a bug in the order
   in which `IndicesQueryCache#onClose` and
   `IndicesService.indicesRefCount#decRef` are called.
 - made `IndicesQueryCache.stats2` a synchronized map. All writes to it are
   already protected by the lock of the Lucene cache, but the final read from
   an assertion in `IndicesQueryCache#close()` was not, so this change should
   avoid any potential visibility issues.
 - human-readable `toString`s to make debugging easier.

Relates #37117
2019-03-04 10:58:12 +01:00
Armin Braun 68bc178017
Disable Bwc Tests (#39551)
* For #39550
2019-03-04 10:41:52 +01:00
Yannick Welsch 0f65390c29 Do not mutate engine during planning step (#39571)
This cleans up the Engine implementation by separating the sequence number generation from the
planning step in the engine, so that the planning step has no side effects. This makes it
easier to see that every sequence number is properly accounted for.
2019-03-04 10:11:39 +01:00
David Turner 9ec24bae80 Mute testDoNotWaitForPendingSeqNo
Relates #39510, #39595.
2019-03-03 22:03:53 -05:00
Mayya Sharipova d0e65a45a2 Add debug log for flush for IndicesRequestCacheIT (#39475)
Add debug log when index is flushed to investigate a failure
in IndicesRequestCacheIT

The "DEBUG" level is used because "TRACE" produces too much output that is
irrelevant to this issue.

Relates to #32827
2019-03-01 13:12:45 -05:00
Luca Cavanna 29e3c18713 Mute failing IndexShardIT#testPendingRefreshWithIntervalChange
Relates to #39565
2019-03-01 14:55:19 +01:00
Tanguy Leroux e005eeb0b3
Backport support for replicating closed indices to 7.x (#39506)(#39499)
Backport support for replicating closed indices (#39499)
    
    Before this change, closed indexes were simply not replicated. It was therefore
    possible to close an index and then decommission a data node without knowing
    that this data node contained shards of the closed index, potentially leading to
    data loss. Shards of closed indices were not completely taken into account when
    balancing the shards within the cluster, or automatically replicated through shard
    copies, and they were not easily movable from node A to node B using APIs like
    Cluster Reroute without being fully reopened and closed again.
    
    This commit changes the logic executed when closing an index, so that its shards
    are not just removed and forgotten but are instead reinitialized and reallocated on
    data nodes using an engine implementation which does not allow searching or
    indexing, which has a low memory overhead (compared with searchable/indexable
    opened shards) and which allows shards to be recovered from peer or promoted
    as primaries when needed.
    
    This new closing logic is built on top of the new Close Index API introduced in
    6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before
    closing them, and closing an index on an 8.0 cluster will reinitialize the index shards
    and therefore impact the cluster health.
    
    Some APIs have been adapted to make them work with closed indices:
    - Cluster Health API
    - Cluster Reroute API
    - Cluster Allocation Explain API
    - Recovery API
    - Cat Indices
    - Cat Shards
    - Cat Health
    - Cat Recovery
    
    This commit contains all the following changes (most recent first):
    * c6c42a1 Adapt NoOpEngineTests after #39006
    * 3f9993d Wait for shards to be active after closing indices (#38854)
    * 5e7a428 Adapt the Cluster Health API to closed indices (#39364)
    * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767)
    * 71f5c34 Recover closed indices after a full cluster restart (#39249)
    * 4db7fd9 Adapt the Recovery API for closed indices (#38421)
    * 4fd1bb2 Adapt more tests suites to closed indices (#39186)
    * 0519016 Add replica to primary promotion test for closed indices (#39110)
    * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631)
    * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955)
    * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex()
    * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329)
    * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327)
    * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326)
    * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024)
    * e53a9be Fix compilation error in IndexShardIT after merge with master
    * cae4155 Relax NoOpEngine constraints (#37413)
    * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes
    * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903)
    
    Relates to #33888
2019-03-01 14:48:26 +01:00
Yannick Welsch 1a50af7dd4 Do not close bad indices on startup (#39500)
With #17187, we verified IndexService creation during initial state recovery on the master and if the
recovery failed the index was imported as closed, not allocating any shards. This was mainly done to
prevent endless allocation loops and full log files on data nodes when the index metadata contained
broken settings / analyzers. Zen2 loads the cluster state eagerly, and this check currently runs on all
nodes (not only the elected master), which can significantly slow down startup on data nodes.
Furthermore, with replicated closed indices (#33888) on the horizon, importing the index as closed
will no longer prevent its shards from being allocated. Fortunately, the original issue of endless allocation loops is
no longer a problem due to #18467, where we limit the retries of failed allocations. The solution here
is therefore to just undo #17187, as it's no longer necessary, and covered by #18467, which will solve
the issue for Zen2 and replicated closed indices as well.
2019-03-01 09:23:46 +01:00
Tal Levy b9b46fdec6
fix UpdateSettingsRequestStreamableTests.mutateInstance (#39386) (#39477)
Mutations of the timeout values were using string representations.

This resulted in very rare cases where the original timeout value was
represented as something like "0ms" and the new random time value generated
was "0s". Although their string representations differ, their underlying
TimeValues are equal. This caused the test to fail with
`-Dtests.seed=7F4C034C43C22B1B`.
2019-02-28 21:02:32 -08:00
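The pitfall is easy to reproduce with `java.time.Duration` standing in for Elasticsearch's `TimeValue`. `TimeValues` and its tiny parser, which only understands `ms` and `s`, are hypothetical; the point is that a mutation should be checked on the parsed value, not on its string form.

```java
import java.time.Duration;

final class TimeValues {
    // Hypothetical parser for "<number><unit>" strings; only "ms" and "s" are supported.
    static Duration parse(String value) {
        if (value.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(value.substring(0, value.length() - 2)));
        }
        if (value.endsWith("s")) {
            return Duration.ofSeconds(Long.parseLong(value.substring(0, value.length() - 1)));
        }
        throw new IllegalArgumentException("unsupported unit in [" + value + "]");
    }

    // A mutation only counts as different if the parsed durations differ.
    static boolean isDifferent(String original, String mutated) {
        return parse(original).equals(parse(mutated)) == false;
    }
}
```

`"0ms"` and `"0s"` differ as strings but parse to the same zero duration, so a string-based mutation can silently produce an equal instance.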
Mark Tozzi 609118c229 Override and mute InternalAutoDateHistogramTests#testReduceRandom() (#39536)
pending resolution of #39497
2019-02-28 16:00:32 -05:00
Lee Hinman dae48ba262 Add details about what acquired the shard lock last (#38807)
This adds a `details` parameter to shard locking in `NodeEnvironment`. This is
intended to be used for diagnosing issues such as

```
  1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index
  1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService     ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?]
```

The hope is that this will let us determine why the shard is still locked.

Relates to #30290 as well as some other CI failures
2019-02-28 10:50:47 -07:00
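A sketch of a shard lock that remembers who acquired it last and reports that in the timeout message. `DetailedShardLock` is hypothetical, not `NodeEnvironment`'s internal lock; it uses a `Semaphore` rather than a `ReentrantLock` so that a second acquire from the same thread also fails.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class DetailedShardLock {
    private final String shardId;
    // A semaphore, unlike a ReentrantLock, is not reentrant: a second acquire fails even on the same thread.
    private final Semaphore mutex = new Semaphore(1);
    // Details of the most recent successful acquirer; volatile so a failing thread sees the latest value.
    private volatile String lockDetails = "";

    DetailedShardLock(String shardId) {
        this.shardId = shardId;
    }

    void acquire(long timeoutMillis, String details) throws InterruptedException {
        if (mutex.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            lockDetails = details;
        } else {
            throw new IllegalStateException(shardId + ": obtaining shard lock timed out after "
                + timeoutMillis + "ms, previous lock details: [" + lockDetails + "]");
        }
    }

    void release() {
        mutex.release();
    }
}
```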
Armin Braun e564c4d8ad
Add Package Level JavaDoc on Snapshots (#38108) (#39514)
2019-02-28 18:23:01 +01:00
Simon Willnauer 5c96b90ed5 Never block on scheduled refresh if a refresh is running (#39462)
Today we block on the ReferenceManager in the case of a scheduled refresh.
Yet if a refresh is happening concurrently, we might block and create
very small segments. Instead, we should just move on to the next shard
and free up the refresh thread.
2019-02-28 11:57:45 +01:00
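Lucene's `ReferenceManager` offers both a non-blocking `maybeRefresh()` and a `maybeRefreshBlocking()`. The difference can be sketched with a `ReentrantLock` and `tryLock()`. `ScheduledRefresher` is hypothetical, and its counter stands in for the actual refresh work.

```java
import java.util.concurrent.locks.ReentrantLock;

final class ScheduledRefresher {
    private final ReentrantLock refreshLock;
    private int refreshes = 0; // guarded by refreshLock

    ScheduledRefresher(ReentrantLock refreshLock) {
        this.refreshLock = refreshLock;
    }

    // Scheduled path: if another thread is already refreshing, skip instead of blocking.
    boolean maybeRefresh() {
        if (refreshLock.tryLock() == false) {
            return false;
        }
        try {
            refreshes++; // stands in for the actual refresh
            return true;
        } finally {
            refreshLock.unlock();
        }
    }

    // Explicit path: wait for any concurrent refresh to finish, then refresh.
    void refreshBlocking() {
        refreshLock.lock();
        try {
            refreshes++;
        } finally {
            refreshLock.unlock();
        }
    }

    int refreshCount() {
        refreshLock.lock();
        try {
            return refreshes;
        } finally {
            refreshLock.unlock();
        }
    }
}
```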
Armin Braun d3d7d9bb9d
Remove Dead Code + Duplication in o.e.c.routing (#36678) (#39493)
* Removed obviously unused fields+methods
* Inlined public methods that only had one caller
* Simplified `Optional` chain
* Simplified some obviously redundant conditions
2019-02-28 10:33:05 +01:00
Armin Braun 90ab4a6f6e
Stabilize RareClusterState (#38671) (#39468)
* Use the actual master node, not just a master-eligible node, when trying to cancel publication. This only works on the master, and for unlucky seeds we never try the master within the 10s that the busy assert runs.
* Closes #36813
2019-02-28 08:01:52 +01:00
Tanguy Leroux 4dd274b51d Unmute CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState() (#39452)
This commit unmutes the test and comments out the
offending call to linearizabilityChecker.isLinearizable() as suggested
in #39437
2019-02-27 20:38:54 +01:00
Tanguy Leroux 983b5d1c0e Mute SpecificMasterNodesIT.testElectOnlyBetweenMasterNodes()
Tracked in #38331
2019-02-27 18:00:02 +01:00
Daniel Mitterdorfer 2ccba18809
Correct name of basic_date_time_no_millis (#39367) (#39454)
With this commit we correct the name of the Java time based formatter
for `basic_date_time_no_millis`.
2019-02-27 17:03:50 +01:00
Alan Woodward 71b8494181
Upgrade to lucene 8.0.0-snapshot-ff9509a8df (#39444)
Backport of #39350

Contains the following:

* LUCENE-8635: Move terms dictionary off-heap for non-primary-key fields in `MMapDirectory`
* LUCENE-8292: `TermsEnum` is fully abstract
* LUCENE-8679: Return WITHIN in `EdgeTree#relateTriangle` only when polygon and triangle share one edge
* LUCENE-8676: Nori tokenizer deals correctly with large buffers
* LUCENE-8697: `GraphTokenStreamFiniteStrings` better handles side paths with gaps
* LUCENE-8664: Add `equals` and `hashCode` to `TotalHits`
* LUCENE-8660: `TopDocsCollector` returns accurate hit counts if the total equals the threshold
* LUCENE-8654: `Polygon2D#relateTriangle` fix for when the polygon is inside the triangle
* LUCENE-8645: `Intervals#fixField` can merge intervals from different fields
* LUCENE-8585: Create jump-tables for DocValues at index time
2019-02-27 14:36:08 +00:00
Armin Braun f675b33d50
Increase Timeout in UnicastZenPingTests (#38893) (#39449)
* Like #37268, this removes another 1s timeout; such short timeouts are dangerous because they are easily exceeded by an untimely GC pause
* Closes #26701
2019-02-27 15:22:17 +01:00
Jason Tedor 55e98f08d8
Provide a clearer error message on keystore add (#39327)
When trying to add a setting to the keystore with an uppercase name, we
reject it with an unclear error message. This commit makes that error
message much clearer.
2019-02-27 08:10:23 -05:00
Armin Braun 27485871b8
Don't Ping on Handshake Connection (#39076) (#39446)
* It does not make sense to run pings on the handshake connection
   * Set the ping interval to `-1` to deactivate pings on it
2019-02-27 13:39:25 +01:00
Tanguy Leroux 6912e27ee0 Mute MinimumMasterNodesIT.testThreeNodesNoMasterBlock()
Tracked in #39172
2019-02-27 13:13:22 +01:00
David Turner 41668f7723 Move PeerFinder's logger to the expected package (#39412)
Today the abstract `org.elasticsearch.discovery.PeerFinder` uses the logger of
its implementation, which in production is in `o.e.cluster.coordination`. This
turns out to be confusing and unhelpful, so with this change we move to using
the logger that belongs to `PeerFinder`.
2019-02-27 08:44:05 +00:00
Armin Braun 28b771f5db
Remove Dead Code Test Infrastructure (#39192) (#39436)
* Just removing some obviously unused things
2019-02-27 09:38:47 +01:00
Tim Brooks f24dae302d
Make security tests transport agnostic (#39411)
Currently there are two security tests that specifically target the
netty security transport. This PR moves the client authentication tests
into `AbstractSimpleSecurityTransportTestCase` so that the nio transport
will also be tested.

Additionally the work to build transport configurations is moved out of
the netty transport and tested independently.
2019-02-26 18:55:19 -07:00
Nhat Nguyen a9e86bc941 Adjust testWaitForPendingSeqNo (#39404)
Since #39006, we should either remove `testWaitForPendingSeqNo` 
or adjust it not to wait for the pending operations. This change picks 
the latter.

Relates #39006
2019-02-26 16:21:56 -05:00
Mayya Sharipova 4ca514f18c Fix testCacheWithFilteredAlias failure (#39401)
Move refresh after Forcemerge

Relates to #32827
2019-02-26 14:11:35 -05:00
Luca Cavanna 2619f48e4d Rename SearchRequest#withLocalReduction (#39108)
`withLocalReduction` is confusing as `local` effectively means "local
to the remote clusters" rather than "local to the coordinating node" where
the method is executed. I propose we rename the method to
`crossClusterSearch`, which better reflects what the static method is
used for.
2019-02-26 16:30:54 +01:00
Luca Cavanna c09773a76e Completion suggestions to be reduced once instead of twice (#39255)
We have been calling `reduce` against completion suggestions twice, once
in `SearchPhaseController#reducedQueryPhase` where all suggestions get
reduced, and once more in `SearchPhaseController#sortDocs` where we
add the top completion suggestions to the `TopDocs` so their docs can
be fetched. There is no need to do reduction twice. All suggestions can
be reduced in one call, then we can filter the result and pass only the
already reduced completion suggestions over to `sortDocs`. One small but
important detail is that `shardIndex`, which is currently used only
to fetch suggestion hits, needs to be set before the first reduction,
hence outside of `sortDocs`, where we have been setting it until now.
2019-02-26 11:42:02 +01:00
Yannick Welsch d42f422258 Add linearizability checker for coordination layer (#36943)
Checks that the core coordination algorithm implemented as part of Zen2 (#32006) supports
linearizable semantics. This commit adds a linearizability checker based on the Wing and Gong
graph search algorithm with support for compositional checking and activates these checks for all
CoordinatorTests.
2019-02-26 08:26:55 +01:00
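A much-reduced sketch of the Wing and Gong search for a single integer register: repeatedly pick an operation that no remaining operation precedes in real time, check it against the sequential register, and backtrack if no choice leads to a complete linearization. `RegisterLinearizability` is hypothetical, and unlike the checker described above it has no memoization and no compositional (per-key) checking, so it is only suitable for tiny histories.

```java
import java.util.ArrayList;
import java.util.List;

final class RegisterLinearizability {
    // One completed operation on a single integer register.
    static final class Op {
        final boolean isWrite;
        final int value;     // value written, or value returned by the read
        final long invoke;   // invocation time
        final long response; // response time

        Op(boolean isWrite, int value, long invoke, long response) {
            this.isWrite = isWrite;
            this.value = value;
            this.invoke = invoke;
            this.response = response;
        }
    }

    static boolean isLinearizable(List<Op> history, int initialValue) {
        return search(new ArrayList<>(history), initialValue);
    }

    // Depth-first search over orders that respect real-time precedence.
    private static boolean search(List<Op> remaining, int state) {
        if (remaining.isEmpty()) {
            return true;
        }
        for (int i = 0; i < remaining.size(); i++) {
            Op candidate = remaining.get(i);
            if (isMinimal(candidate, remaining) == false) {
                continue;
            }
            // A read is only valid here if it returned the current register value.
            if (candidate.isWrite == false && candidate.value != state) {
                continue;
            }
            int nextState = candidate.isWrite ? candidate.value : state;
            List<Op> rest = new ArrayList<>(remaining);
            rest.remove(i);
            if (search(rest, nextState)) {
                return true;
            }
        }
        return false;
    }

    // An op may go next only if no other remaining op completed before it was invoked.
    private static boolean isMinimal(Op candidate, List<Op> remaining) {
        for (Op other : remaining) {
            if (other != candidate && other.response < candidate.invoke) {
                return false;
            }
        }
        return true;
    }
}
```

A read that starts after a write has completed must observe that write; concurrent operations may be ordered either way.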
Nhat Nguyen 575eed8582 Bubble up exception when processing NoOp (#39338)
Today we do not bubble up exceptions when processing NoOps but always
treat them as document-level failures. This incorrect treatment causes
the assert_no_failure assertion to trip in peer recovery if the IndexWriter was
closed exceptionally before.

Closes #38898
2019-02-25 17:54:45 -05:00
Nhat Nguyen e9dda75834 Enable soft-deletes by default for 7.0+ indices (#38929)
Today when users upgrade to 7.0, existing indices will automatically
switch to soft-deletes without an opt-out option. With this change, 
we only enable soft-deletes by default for new indices.

Relates #36141
2019-02-25 17:54:29 -05:00
Igor Motov d5046b1c25 [CI] Fixes testQueryRandomGeoCollection failure again (#39275)
Moves the check for tiny polygons earlier in the test. It turned out
that polygons can be so tiny that we cannot even figure out their
orientation.

Relates to #37356
2019-02-25 16:35:17 -05:00
Evgenia Badyanova 1ed3407930
Reduce garbage from allocations in deprecation logger (#38780) (#39370)
1. Setting length for formatWarning String to avoid AbstractStringBuilder.ensureCapacityInternal calls
2. Adding extra check for parameter array length == 0 to avoid unnecessarily creating StringBuilder in LoggerMessageFormat.format

This helps narrow the throughput gap for the geonames benchmark (#37411) by 3%. For more details: https://github.com/elastic/elasticsearch/issues/37530#issuecomment-462758384

Relates to #37530
Relates to #37411
Relates to #35754
2019-02-25 16:23:22 -05:00
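Both allocation savings can be sketched in a small formatter, assuming `{}` placeholders: return the pattern as-is when there are no parameters, and pre-size the `StringBuilder` otherwise. `WarningFormat` is hypothetical, not `LoggerMessageFormat`, and the 16 extra characters per parameter are an arbitrary guess.

```java
final class WarningFormat {
    private static final String DELIM = "{}";

    // Replaces each "{}" with the next parameter.
    static String format(String pattern, Object... params) {
        // No parameters: return the pattern itself without allocating a StringBuilder.
        if (params == null || params.length == 0) {
            return pattern;
        }
        // Pre-size the builder so appending parameters rarely has to grow it.
        StringBuilder sb = new StringBuilder(pattern.length() + 16 * params.length);
        int from = 0;
        for (Object param : params) {
            int at = pattern.indexOf(DELIM, from);
            if (at < 0) {
                break; // more parameters than placeholders: ignore the rest
            }
            sb.append(pattern, from, at).append(param);
            from = at + DELIM.length();
        }
        sb.append(pattern, from, pattern.length());
        return sb.toString();
    }
}
```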
Lee Hinman 5c7dd6f0ee Set mappings when creating indices in SuggestSearchIT (#39323)
* Set mappings when creating indices in SuggestSearchIT

These tests don't test dynamic mapping, so they can use preset mappings. This
removes the possibility they may fail due to the mapping not being available
since mapping updates are asynchronous.

Resolves #39315

* Wrap creates in assertAcked
2019-02-25 13:27:03 -07:00
Mayya Sharipova bf058d6e4d
Fix analyze NullPointerException when AnalyzeTokenList tokens is null (#39332) (#39361) 2019-02-25 12:49:18 -05:00
Nhat Nguyen 48219112e3 Do not wait for advancement of checkpoint in recovery (#39006)
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:

- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).

- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica.  However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (see also the
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.

Relates #39000
Closes #38949

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
2019-02-25 12:10:14 -05:00
David Turner 236db51d34 Fix testSnapshotFileFailureDuringSnapshot (#39362)
Today this test catches an exception and asserts that its proximate cause has
message `Random IOException` but occasionally this exception is wrapped two
layers deep, causing the test to fail. This commit adjusts the test to look at
the root cause of the exception instead.

      1> [2019-02-25T12:31:50,837][INFO ][o.e.s.SharedClusterSnapshotRestoreIT] [testSnapshotFileFailureDuringSnapshot] --> caught a top level exception, asserting what's expected
      1> org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] Snapshot could not be read
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:212) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:135) ~[main/:?]
      1> 	at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:54) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127) ~[main/:?]
      1> 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
      1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
      1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
      1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
      1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
      1> Caused by: org.elasticsearch.snapshots.SnapshotException: [test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:564) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more
      1> Caused by: java.io.IOException: Random IOException
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.maybeIOExceptionOrBlock(MockRepository.java:275) ~[test/:?]
      1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.readBlob(MockRepository.java:317) ~[test/:?]
      1> 	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:101) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreFormat.read(BlobStoreFormat.java:90) ~[main/:?]
      1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:560) ~[main/:?]
      1> 	at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:206) ~[main/:?]
      1> 	... 9 more

    FAILURE 0.59s J0 | SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot <<< FAILURES!
       > Throwable #1: java.lang.AssertionError:
       > Expected: a string containing "Random IOException"
       >      but: was "[test-repo:test-snap/e-hn_pLGRmOo97ENEXdQMQ] failed to get snapshots"
       > 	at __randomizedtesting.SeedInfo.seed([B73CA847D4B4F52D:884E042D2D899330]:0)
       > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
       > 	at org.elasticsearch.snapshots.SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot(SharedClusterSnapshotRestoreIT.java:821)
       > 	at java.lang.Thread.run(Thread.java:748)
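The fix boils down to asserting on the innermost cause instead of the proximate one. A minimal sketch of such a lookup, where `RootCauses.rootCause` is a hypothetical helper rather than an Elasticsearch API:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper; Elasticsearch's own exception utilities may differ.
final class RootCauses {

    /** Follows getCause() until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        // Track visited throwables by identity so a cyclic cause chain cannot loop forever.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }
}
```

With a helper like this, a check for `Random IOException` holds no matter how many layers of `SnapshotException` wrap the original `IOException`.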
2019-02-25 16:43:55 +00:00
Marios Trivyzas 11fe8cd16f
[Tests] Fix flakiness by ensuring stable cluster (#39300) (#39356)
In integration tests where `setBootstrapMasterNodeIndex()` is used in
combination with `autoMinMasterNodes = false` the cluster can start
bootstrapping once the number of nodes set with the
`setBootstrapMasterNodeIndex` have been started but it's not ensured
that all nodes have successfully joined to form the cluster.

This behaviour was introduced with 5db7ed22a0
and in order to ensure that the cluster is properly formed before proceeding
with the integration test, use `ensureStableCluster()` with the
appropriate number of expected nodes.

Fixes: #39220
2019-02-25 17:26:15 +01:00
David Turner dc23be5a9d Avoid creating a green index in RetentionLeaseIT (#39347)
In #39224 we made shard history retention lease syncing ignore the
`index.write.wait_for_active_shards` setting on the index, and added a test
that showed that it was ignored. However the test as merged actually creates a
green index, so the `wait_for_active_shards` setting has no effect. This change
adjusts the test to create a yellow index to verify that
`wait_for_active_shards` really is ignored.
2019-02-25 15:33:09 +00:00
Yannick Welsch a2bc41621c Clean GatewayAllocator when stepping down as master (#38885)
This fixes an issue where a messy master election might prevent shard allocation from properly
proceeding. I've encountered this in failing CI tests when we were bootstrapping multiple nodes. Tests
would sometimes time out on an `ensureGreen` after an unclean master election. The reason for
this is how the async shard information fetching works and how the clean-up logic in
GatewayAllocator is integrated with the rest of the system. When a node becomes master, it will, as
part of the first cluster state update where it becomes master, already try allocating shards (see
`JoinTaskExecutor`, in particular the call to `reroute`). This process, which runs on the
MasterService thread, will trigger async shard fetching. If the node is still processing an earlier
election failure in ClusterApplierService (e.g. due to a messy election), that will possibly trigger the
clean-up logic in GatewayAllocator after the shard fetching has been initiated by MasterService,
thereby cancelling the fetching, which means that no subsequent reroute (allocation) is triggered
after the shard fetching results return. This means that no shard allocation will happen unless the
user triggers an explicit reroute command. The bug imo is that GatewayAllocator is called from both
MasterService and ClusterApplierService threads, with no clear happens-before relation. The fix
here makes it so that the clean-up logic is also run on the MasterService thread instead of the
ClusterApplierService thread, reestablishing a clear happens-before relation. Note that testing this
is tricky. With the newly added test, I can quite often reproduce this by adding `Thread.sleep(10);`
in ClusterApplierService (to make sure it does not go too quickly) and adding `Thread.sleep(50);` in
`TransportNodesListGatewayStartedShards` to make sure that shard state fetching does not go too
quickly either.

Note that older versions of Zen discovery are affected by this as well, but did not exhibit this issue
as often because master elections are much slower there.
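The happens-before argument can be modelled with a single-threaded executor standing in for the MasterService thread. This is a hypothetical sketch, not the GatewayAllocator code: because both the fetch start and the clean-up are submitted to the same thread, they run in submission order, so a clean-up can no longer cancel a fetch that was started after it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: a single-threaded executor stands in for the MasterService
// thread, and a set of shard ids stands in for the allocator's fetch state.
final class MasterThreadSketch {
    private final ExecutorService masterThread = Executors.newSingleThreadExecutor();
    // Only touched from masterThread, so no extra locking is needed.
    private final Set<String> inFlightFetches = new HashSet<>();

    Future<?> startFetch(String shardId) {
        return masterThread.submit(() -> { inFlightFetches.add(shardId); });
    }

    Future<?> cleanUp() {
        return masterThread.submit(() -> { inFlightFetches.clear(); });
    }

    /** Reads the state on the master thread, after all previously submitted tasks. */
    int inFlight() {
        try {
            return masterThread.submit((Callable<Integer>) inFlightFetches::size).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void close() {
        masterThread.shutdown();
    }
}
```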
2019-02-25 10:37:31 +01:00
David Turner 96c09b032d Ignore waitForActiveShards when syncing leases (#39224)
Adjust the retention lease sync actions so that they do not respect the
`index.write.wait_for_active_shards` setting on an index, allowing them to sync
retention leases even if insufficiently many shards are currently active to
accept writes.

Relates #39089
2019-02-25 08:53:43 +00:00
Nhat Nguyen f17d408fbb Add cause to assert_no_failure when replay translog (#39333)
We tripped this assertion three times in the last two weeks. However,
it only says "this IndexWriter is closed" without the actual cause.

```
[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting

java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
```

This change replaces an assert with an AssertionError so that we will
have the actual cause in the next build failures.

Relates #38898
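The difference between the two styles can be sketched as follows. `TranslogReplaySketch` and `TranslogOp` are hypothetical names; the point is that the `AssertionError(String, Throwable)` constructor keeps the original exception as the cause, whereas `assert false : "..." + e` only embeds `e.toString()` in the message (and only fires when assertions are enabled).

```java
// Hypothetical sketch: TranslogOp and replay() are illustrative names, not the
// actual Elasticsearch code touched by #39333.
final class TranslogReplaySketch {

    interface TranslogOp {
        void apply() throws Exception;
    }

    static void replay(TranslogOp op) {
        try {
            op.apply();
        } catch (Exception e) {
            // Keep the original exception as the cause, so its stack trace
            // (including its own cause) is reported alongside the message.
            throw new AssertionError("unexpected failure while replicating translog entry", e);
        }
    }
}
```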
2019-02-23 13:04:43 -05:00
Zachary Tong c7516b03b6
Better HoltWinters parameter validation (#38747)
We validate HW parameters (namely, window > 2 * period) when parsing
the XContent, but that means transport clients can still configure bad
params.

This change allows the model to validate the window and throw an
exception if it wishes.

It also makes some test changes:

- removes testBadModelParams(), which was a junk test (it didn't do
anything); bad param checking is done elsewhere in unit tests
- Fixes one of the windows in testHoltWintersNotEnoughData()
- Ensures the period in testHoltWintersNotEnoughData() is >> window
- Removes `setTypes()` since that's deprecated
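A minimal sketch of the window check quoted above (window > 2 * period), assuming a hypothetical `validateWindow` method on the model; the exception type and message wording are illustrative:

```java
// Hypothetical sketch; the method name and message wording are assumptions.
final class HoltWintersParamsSketch {

    /** Rejects a window that does not cover more than two full periods. */
    static void validateWindow(int window, int period) {
        if (window <= 2 * period) {
            throw new IllegalArgumentException(
                "window [" + window + "] must be greater than 2 * period [" + period + "]");
        }
    }
}
```

Because a check like this lives on the model rather than in the XContent parser, it would apply to requests built through the transport client as well.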
2019-02-22 15:25:26 -05:00
Daniel Mitterdorfer 9fea21aca5
Remove ExceptionsHelper#detailedMessage in tests (#37921) (#39297)
With this commit we remove all usages of the deprecated method
`ExceptionsHelper#detailedMessage` in tests. We do not address
production code here but rather in dedicated follow-up PRs to keep the
individual changes manageable.

Relates #19069
2019-02-22 14:03:29 +01:00
Tim Brooks 44df76251f
Rebuild remote connections on profile changes (#39146)
Currently remote compression and ping schedule settings are dynamic.
However, we do not listen for changes. This commit adds listeners for
changes to those two settings. Additionally, when those settings change
we now close existing connections and open new ones with the settings
applied.

Fixes #37201.
2019-02-21 14:00:39 -07:00
Tanguy Leroux fc896e452c
ReadOnlyEngine should update translog recovery state information (#39238) (#39251)
`ReadOnlyEngine` never recovers operations from translog and never
updates translog information in the index shard's recovery state, even
though the recovery state goes through the `TRANSLOG` stage during
the recovery. As a result, recovery information for frozen shards reports
an unknown number of recovered translog ops in the Recovery APIs
(translog_ops: `-1` and translog_ops_percent: `-1.0%`), which is confusing.

This commit changes the `recoverFromTranslog()` method in `ReadOnlyEngine`
so that it always recovers from an empty translog snapshot, allowing the recovery
state translog information to be correctly updated.

Related to #33888
2019-02-21 18:08:06 +01:00
Daniel Mitterdorfer ef921fd157
Migrate Streamable to Writeable for cluster block package (#37391) (#39236) 2019-02-21 15:21:44 +01:00
Marios Trivyzas ecfd48b6d3
[Tests] Make testEngineGCDeletesSetting deterministic (#38942) (#39231)
`InternalEngine.resolveDocVersion()` uses `relativeTimeInMillis()` from
`ThreadPool`, so it needs the cached time to be advanced. Add a check
to ensure that, and decrease the `thread_pool.estimated_time_interval`
to 1msec to prevent long running times for the test.

Fixes: #38874

Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
2019-02-21 14:30:59 +02:00
Marios Trivyzas 1316825f52
Replace superfluous usage of Counter with Supplier (#39048) (#39225)
`Counter` was used as a functional argument to pass the relative
cached time before the `Supplier` interface was introduced.
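A minimal sketch of the `Supplier`-style replacement, assuming a hypothetical `ElapsedTimeSketch` consumer. Taking a `LongSupplier` lets production code pass a method reference such as `threadPool::relativeTimeInMillis`, while a test can pass a fake clock it advances by hand:

```java
import java.util.function.LongSupplier;

// Hypothetical consumer of a relative clock; the real call sites differ.
final class ElapsedTimeSketch {
    private final LongSupplier relativeTimeInMillis;
    private final long startMillis;

    ElapsedTimeSketch(LongSupplier relativeTimeInMillis) {
        this.relativeTimeInMillis = relativeTimeInMillis;
        this.startMillis = relativeTimeInMillis.getAsLong(); // read the clock once at construction
    }

    long elapsedMillis() {
        return relativeTimeInMillis.getAsLong() - startMillis;
    }
}
```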
2019-02-21 12:42:54 +02:00
Ignacio Vera be8a5315d7 Extend nextDoc to delegate to the wrapped doc-value iterator for date_nanos (#39176)
The date_nanos type does not use the doc-value iterators directly (it wraps them), so it needs to extend `nextDoc` in order to delegate the call to the wrapped iterator.
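The delegation pattern can be sketched with a hypothetical, minimal iterator contract loosely modelled on Lucene's `DocIdSetIterator` (it is not the Lucene API): the wrapper forwards `nextDoc()` to the wrapped iterator so both stay positioned on the same document.

```java
// Hypothetical minimal iterator contract, loosely modelled on Lucene's
// DocIdSetIterator; it is not the Lucene API.
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    int docID();

    int nextDoc();
}

/** Wraps another iterator and forwards all positioning calls to it. */
final class DelegatingDocIterator implements DocIterator {
    private final DocIterator in;

    DelegatingDocIterator(DocIterator in) {
        this.in = in;
    }

    @Override
    public int docID() {
        return in.docID();
    }

    @Override
    public int nextDoc() {
        // Forward the call so the wrapper and the wrapped iterator stay in sync.
        return in.nextDoc();
    }
}

/** Simple array-backed iterator used as the wrapped source. */
final class ArrayDocIterator implements DocIterator {
    private final int[] docs;
    private int idx = -1; // -1 means "not positioned yet"

    ArrayDocIterator(int... docs) {
        this.docs = docs;
    }

    @Override
    public int docID() {
        if (idx < 0) {
            return -1;
        }
        return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }

    @Override
    public int nextDoc() {
        idx++;
        return docID();
    }
}
```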
2019-02-21 11:10:51 +01:00
Tal Levy 8150ca40f2 mute test3MasterNodes2Failed 2019-02-20 17:35:37 -08:00
Nhat Nguyen 820ba8169e Add retention leases replication tests (#38857)
This commit introduces retention leases to ESIndexLevelReplicationTestCase,
then adds some tests verifying that retention lease replication works
correctly despite primary failover or out-of-order
delivery of retention lease sync requests.

Relates #37165
2019-02-20 19:21:00 -05:00
Mirko Jotic a6ae146ccc Converting Derivative Pipeline Agg integration test into AggregatorTestCase. (#38679)
Replicates the majority of existing Derivative pipeline integration tests into
an AggregatorTestCase, with the goal of removing the integration
tests in the near future.
2019-02-20 16:35:32 -05:00
Igor Motov 3d93011e32 Fix median calculation in MedianAbsoluteDeviationAggregatorTests (#38979)
Fixes an error in the median calculation in
MedianAbsoluteDeviationAggregatorTests for an odd number of sample points,
which caused some rare test failures.

Fixes #38937
2019-02-20 13:24:30 -05:00
Ioannis Kakavas c783069804
Fix NPE on Stale Index in IndicesService (#39173)
This is a backport of #38891 which closes #38845
2019-02-20 15:35:35 +02:00
David Turner efffb3d5b7 Simplify calculation in AwarenessAllocationDecider (#38091)
Today's calculation of the maximum number of shards per attribute is rather
convoluted. This commit clarifies that it returns
ceil(shardCount/numberOfAttributes).
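For non-negative shard counts and a positive number of attribute values, the ceiling can be computed with integer arithmetic only, since ceil(a / b) = floor((a + b - 1) / b). A sketch with a hypothetical method name:

```java
// Hypothetical helper illustrating ceil(shardCount / numberOfAttributes).
final class AwarenessMathSketch {

    /** Integer ceiling division; assumes shardCount >= 0, numberOfAttributes > 0 and no overflow. */
    static int maxShardsPerAttribute(int shardCount, int numberOfAttributes) {
        return (shardCount + numberOfAttributes - 1) / numberOfAttributes;
    }
}
```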
2019-02-20 08:54:57 +00:00
Henning Andersen 00a26b9dd2 Blob store compression fix (#39073)
Blob store compression was not enabled for some of the files in
snapshots due to the constructor accessing subclass fields. Fixed by
instead accepting the compress field as a constructor parameter. Also
fixed chunk size validation so that it works.

The repositories.fs.compress setting is deprecated as well, so that it
can be unified in a future commit.
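The underlying bug is a classic Java pitfall: a superclass constructor that calls an overridable method sees the subclass's fields at their default values, because the subclass constructor has not run yet. The classes below are hypothetical stand-ins for the blob store format classes, showing the broken shape and the constructor-parameter fix:

```java
// Hypothetical classes illustrating the pitfall; not the actual blob store code.
abstract class BrokenFormat {
    final boolean compressEnabled;

    BrokenFormat() {
        // Calls an overridable method before the subclass constructor has run,
        // so the subclass field it reads is still at its default value (false).
        this.compressEnabled = compress();
    }

    abstract boolean compress();
}

final class BrokenChecksumFormat extends BrokenFormat {
    private final boolean compress;

    BrokenChecksumFormat(boolean compress) {
        super(); // runs first and reads compress() too early
        this.compress = compress;
    }

    @Override
    boolean compress() {
        return compress;
    }
}

abstract class FixedFormat {
    final boolean compressEnabled;

    // The fix: the value is passed in, so no subclass state is read during construction.
    FixedFormat(boolean compress) {
        this.compressEnabled = compress;
    }
}

final class FixedChecksumFormat extends FixedFormat {
    FixedChecksumFormat(boolean compress) {
        super(compress);
    }
}
```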
2019-02-20 09:24:41 +01:00
Hendrik Muhs 50b3858f7c add version 6.6.2 2019-02-19 20:28:06 +01:00
David Turner 0a9574c9d4 Add some missing toString() implementations (#39124)
Sometimes we turn objects into strings for logging or debugging using
`toString()`, but the default implementation is often unhelpful. This change
improves on this in two places I ran into recently.
2019-02-19 17:52:41 +00:00
Jason Tedor fef9bdb23f
Allow retention lease operations under blocks (#39089)
This commit allows manipulating retention leases under blocks.
2019-02-19 10:26:49 -05:00
Jason Tedor 12f6963456
Fix retention leases sync on recovery test
This test had a bug. We attempt to allow only the primary to be
allocated, to force all replicas to recover from the primary after we
had set the state of the retention leases on the primary. However, in
building the index settings, we were overwriting the settings that
exclude the replicas from being allocated. This means that some of the
replicas would end up assigned and rather than receive retention leases
during recovery, they would be part of the replication group receiving
retention leases as they are manipulated. Since retention lease renewals
are only synced periodically, this means that the replica could be
lagging a little behind in some cases leading to an assertion tripping
in the test. This commit addresses this by ensuring that the replicas
are indeed not allocated until after the retention leases are done being
manipulated on the primary. We did this by not overwriting the exclude
settings.

Closes #39105
2019-02-19 09:07:33 -05:00
Alexander Reelsen 7f8a640363 Fix DateFormatters.parseMillis when no timezone is given (#39100)
The parseMillis method was able to work on formats without timezones by
falling back to UTC. The Date Formatter interface did not support this, as
the calling code was using the `Instant.from` java time API.

This switches over to an internal method which adds UTC as a timezone.

Closes #39067
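A minimal sketch of the UTC fallback using only `java.time`; `ParseMillisSketch.parseMillis` is a hypothetical helper, not the `DateFormatters` API, and it assumes the format contains both a date and a time. If the parsed result does not carry an instant (no zone or offset in the input), the local date-time is interpreted as UTC:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;

// Hypothetical helper; Elasticsearch's DateFormatters API is not reproduced here.
final class ParseMillisSketch {

    static long parseMillis(DateTimeFormatter formatter, String text) {
        TemporalAccessor parsed = formatter.parse(text);
        if (parsed.isSupported(ChronoField.INSTANT_SECONDS)) {
            // The input carried a zone or offset, so it identifies an instant.
            return Instant.from(parsed).toEpochMilli();
        }
        // No zone or offset in the input: interpret the local date-time as UTC.
        return LocalDateTime.from(parsed).toInstant(ZoneOffset.UTC).toEpochMilli();
    }
}
```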
2019-02-19 14:12:22 +01:00
Jim Ferenczi 199155f5fb
Enforce Completion Context Limit (#38675) (#39075)
This change adds a limit to the number of completion contexts that a completion field can define.

Closes #32741
2019-02-19 08:52:24 +01:00
Albert Zaharovits 6bc88b00ec Mute GatewayMetaStateTests.testAtomicityWithFailures (#39079)
Mute test GatewayMetaStateTests.testAtomicityWithFailures
2019-02-19 00:25:45 +02:00
Jason Tedor 2d8f6b6501
Introduce retention lease state file (#39004)
This commit moves retention leases from being persisted in the Lucene
commit point to being persisted in a dedicated state file.
2019-02-18 16:53:46 -05:00
Jason Tedor d43ac8fe11
Include in log retention leases that failed to sync
When retention leases fail to sync after an expiration check, we emit a
log message about this. This commit adds the retention leases that
failed to sync.
2019-02-18 15:08:08 -05:00
Jason Tedor bbb61002ba
Add some logging related to retention lease syncing (#39066)
When the background retention lease sync fires, we check to see if any
retention leases are expired. If any did expire, we execute a full
retention lease sync (write action). Since this is happening on a
background thread, we do not block that thread waiting for success (it
will simply try again when the timer elapses). However, we were
swallowing exceptions that indicate failure. This commit addresses that
by logging the failures. Additionally, we add some trace logging to the
execution of syncing retention leases.
2019-02-18 15:02:31 -05:00
Henning Andersen 99b2bc3461 Fix potential race during TcpTransport close (#39031)
Fixed two potential causes for leaked threads during tests:
1. When adding a channel to serverChannels, we add it under a monitor
that we do not use when reading from it. This is potentially unsafe if
there is no other happens-before relationship ensuring the safety of
this.
2. Long shot, but if the thread pool was shut down before entering this
code, we would silently forget about closing server channels, so we added
an assertion.

Strengthened the locking to ensure that once we stop the transport, no
new server channels can be made.

Relates to CI failure issue: #37543
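The locking pattern can be sketched as follows, with hypothetical names and a minimal `Channel` interface in place of the real transport channels: registration and stopping share one lock, and once `stopped` is set no new server channel is accepted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the locking pattern; not the TcpTransport code.
final class ServerChannelsSketch {

    interface Channel {
        void close();
    }

    private final Object lock = new Object();
    private final List<Channel> serverChannels = new ArrayList<>(); // guarded by lock
    private boolean stopped; // guarded by lock

    /** Registers a channel, or returns false (caller should close it) if already stopped. */
    boolean add(Channel channel) {
        synchronized (lock) {
            if (stopped) {
                return false;
            }
            serverChannels.add(channel);
            return true;
        }
    }

    /** Stops accepting new channels and closes the registered ones. */
    void stop() {
        List<Channel> toClose;
        synchronized (lock) {
            stopped = true;
            toClose = new ArrayList<>(serverChannels);
            serverChannels.clear();
        }
        // Close outside the lock; no new channel can be registered any more.
        for (Channel channel : toClose) {
            channel.close();
        }
    }
}
```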
2019-02-18 19:13:23 +01:00
Alan Woodward ab4d5f404f Add overlapping, before, after filters to intervals query (#38999)
Lucene recently added `overlapping`, `before` and `after` filters to the intervals package. This
commit exposes them in elasticsearch.
2019-02-18 15:06:24 +00:00
Adrien Grand 45b17e8645
Don't close caches while there might still be in-flight requests. (#38958)
Many of our index components use ref-counting so that in the event that a shard
is closed while there are still ongoing requests, then the index reader and the
store only effectively get closed when ongoing requests have finished. However
we don't apply the same principle to the request and query caches, which might
get closed while there are still in-flight requests.

This commit adds ref-counting to `IndicesService` so that the caches and other
components it maintains only get closed when all shards are effectively closed.

Closes #37117
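A minimal ref-counting sketch, assuming the service holds one reference for itself and each open shard holds another; `RefCountedCaches` is hypothetical and the `closed` flag stands in for actually closing the request and query caches:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the ref-counting pattern; Elasticsearch has its own
// ref-counting utilities, which are not reproduced here.
final class RefCountedCaches {
    private final AtomicInteger refCount = new AtomicInteger(1); // the owner's reference
    private volatile boolean closed;

    /** Takes a reference, e.g. when a shard is created; fails if already fully released. */
    void incRef() {
        int current;
        do {
            current = refCount.get();
            if (current <= 0) {
                throw new IllegalStateException("already closed");
            }
        } while (!refCount.compareAndSet(current, current + 1));
    }

    /** Releases a reference; the caches are closed only when the last one is released. */
    void decRef() {
        int remaining = refCount.decrementAndGet();
        if (remaining == 0) {
            closed = true; // stand-in for closing the request and query caches
        } else if (remaining < 0) {
            throw new IllegalStateException("decRef called too many times");
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```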
2019-02-18 13:59:58 +01:00
Martijn van Groningen ed08bc3537
Fix LocalIndexFollowingIT#testRemoveRemoteConnection() test (#38709)
* When fetching the remote mapping, a missing remote client resulted in a
`NoSuchRemoteClusterException` that was not handled.
* When adding a remote connection, check that it is really connected
before continuing to run the tests.

Relates to #38695
2019-02-18 09:41:44 +01:00
Jason Tedor a5ce1e0bec
Integrate retention leases to recovery from remote (#38829)
This commit is the first step in integrating shard history retention
leases with CCR. In this commit we integrate shard history retention
leases with recovery from remote. Before we start transferring files, we
take out a retention lease on the primary. Then during the file copy
phase, we repeatedly renew the retention lease. Finally, when recovery
from remote is complete, we disable the background renewing of the
retention lease.
2019-02-16 15:37:52 -05:00
Tim Brooks b1c1daa63f
Add get file chunk timeouts with listener timeouts (#38758)
This commit adds a `ListenerTimeouts` class that will wrap an
`ActionListener` in a listener with a timeout scheduled on the generic
thread pool. If the timeout expires before the listener is completed,
`onFailure` will be called with an `ElasticsearchTimeoutException`.

Timeouts for the get ccr file chunk action are implemented using this
functionality. Additionally, this commit attempts to fix #38027 by also
blocking proxied get ccr file chunk actions. This test being un-muted is
useful to verify the timeout functionality.
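A minimal sketch of the wrapping idea with plain JDK types: `Listener` stands in for `ActionListener`, a `ScheduledExecutorService` for the generic thread pool, and `java.util.concurrent.TimeoutException` for `ElasticsearchTimeoutException`. All names are hypothetical; an `AtomicBoolean` ensures that only the first of response, failure or timeout reaches the delegate:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the ListenerTimeouts implementation.
final class TimeoutListenerSketch {

    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    static <T> Listener<T> wrapWithTimeout(ScheduledExecutorService scheduler, long timeoutMillis,
                                           Listener<T> delegate) {
        AtomicBoolean done = new AtomicBoolean();
        ScheduledFuture<?> timeout = scheduler.schedule(() -> {
            // Fires only if neither a response nor a failure arrived first.
            if (done.compareAndSet(false, true)) {
                delegate.onFailure(new TimeoutException("timed out after " + timeoutMillis + "ms"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                if (done.compareAndSet(false, true)) {
                    timeout.cancel(false);
                    delegate.onResponse(response);
                }
            }

            @Override
            public void onFailure(Exception e) {
                if (done.compareAndSet(false, true)) {
                    timeout.cancel(false);
                    delegate.onFailure(e);
                }
            }
        };
    }
}
```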
2019-02-16 10:56:03 -07:00
Luca Cavanna a1a49f201d Tie break search shard iterator comparisons on cluster alias (#38853)
`SearchShardIterator` inherits its `compareTo` implementation from `PlainShardIterator`. That is fine in most cases, as such comparisons are based on the shard id, which is unique even when searching against indices with the same names across multiple clusters (thanks to the index uuid being different). However, if the same cluster is registered multiple times with different aliases, the shard ids are exactly the same, hence remote results will be returned before local ones with the same shard id objects. That is because remote iterators are added before local ones, and we use a stable sorting method in the `GroupShardIterators` constructor.

This PR enhances `compareTo` for `SearchShardIterator` to tie-break on cluster alias and introduces consistent `equals` and `hashcode` methods. This allows us to remove a TODO in `SearchResponseMerger`, which otherwise has to handle this special case specifically. Also, while at it, I added missing tests around equals/hashcode and compareTo and expanded existing ones.
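The tie-break can be sketched with a `Comparator` chain. `ShardKey` is a hypothetical stand-in for `SearchShardIterator` whose shard id is reduced to an index name and shard number; ordering `null` (local) aliases first is an arbitrary choice for this sketch, not necessarily what Elasticsearch does. `equals` and `hashCode` use the same fields, so they stay consistent with `compareTo`:

```java
import java.util.Comparator;
import java.util.Objects;

// Hypothetical model of a shard iterator identity; not the SearchShardIterator code.
final class ShardKey implements Comparable<ShardKey> {
    final String clusterAlias; // null for the local cluster (an assumption of this sketch)
    final String index;
    final int shard;

    ShardKey(String clusterAlias, String index, int shard) {
        this.clusterAlias = clusterAlias;
        this.index = index;
        this.shard = shard;
    }

    private static final Comparator<String> ALIAS_ORDER =
        Comparator.nullsFirst(Comparator.<String>naturalOrder());

    private static final Comparator<ShardKey> ORDER =
        Comparator.comparing((ShardKey k) -> k.index)
            .thenComparingInt(k -> k.shard)
            // Tie-break on the alias so identical shard ids from differently
            // aliased clusters still get a deterministic order.
            .thenComparing((ShardKey k) -> k.clusterAlias, ALIAS_ORDER);

    @Override
    public int compareTo(ShardKey other) {
        return ORDER.compare(this, other);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ShardKey)) {
            return false;
        }
        ShardKey other = (ShardKey) o;
        return shard == other.shard && index.equals(other.index)
            && Objects.equals(clusterAlias, other.clusterAlias);
    }

    @Override
    public int hashCode() {
        return Objects.hash(clusterAlias, index, shard);
    }
}
```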
2019-02-16 09:41:03 +01:00
Nhat Nguyen 7e20a92888 Advance max_seq_no before add operation to Lucene (#38879)
Today, when processing an operation on a replica engine (or the
following engine), we first add it to Lucene, then add it to the translog,
and finally mark its seq_no as completed. If a flush occurs after step 1
but before step 3, the max_seq_no in the commit's user_data will be
smaller than the seq_no of some documents in the Lucene commit.
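The race can be modelled single-threaded by calling the steps in the problematic order. `SeqNoOrderingSketch` is hypothetical: a flush records the current max_seq_no, and the commit is consistent only if no document in it has a higher seq_no. Advancing max_seq_no before adding to Lucene, as the title describes, keeps that invariant for any flush that can see the document:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the ordering fix; the list stands in for Lucene and
// maxSeqNo for the value recorded in a commit's user data.
final class SeqNoOrderingSketch {
    private long maxSeqNo = -1;
    private final List<Long> luceneSeqNos = new ArrayList<>();

    void advanceMaxSeqNo(long seqNo) {
        maxSeqNo = Math.max(maxSeqNo, seqNo);
    }

    void addToLucene(long seqNo) {
        luceneSeqNos.add(seqNo);
    }

    /** Simulates a flush: every document in the commit must have seq_no <= the recorded max. */
    boolean flushIsConsistent() {
        long recordedMax = maxSeqNo;
        for (long seqNo : luceneSeqNos) {
            if (seqNo > recordedMax) {
                return false;
            }
        }
        return true;
    }
}
```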
2019-02-15 21:04:28 -05:00
Nhat Nguyen 20755e666c Reduce global checkpoint sync interval in disruption tests (#38931)
We verify seq_no_stats is aligned between copies at the end of some
disruption tests. Sometimes, the assertion `assertSeqNos` is tripped due
to a lagged global checkpoint on replicas. The global checkpoint on
replicas is lagged because we sync the global checkpoint 30 seconds (by
default) after the last replication operation. This change reduces the
global checkpoint sync interval to 1s in the disruption tests.

Closes #38318
Closes #36789
2019-02-15 21:04:20 -05:00
Nhat Nguyen a67b9f6d1f Relax testStressMaybeFlushOrRollTranslogGeneration (#38918)
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in the test in the following scenario:

1. Index doc-0 and advance the local checkpoint to 0; the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

3. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
2019-02-15 21:04:12 -05:00
Armin Braun 238425e5e7 Fix Issue with Concurrent Snapshot Init + Delete (#38518)
* Fix Issue with Concurrent Snapshot Init + Delete by ensuring that we're not finalizing a snapshot in the repository while it is initializing on another thread

* Closes #38489
2019-02-15 16:50:47 -08:00
Alan Woodward 176013e23c Avoid double term construction in DfsPhase (#38716)
DfsPhase captures terms used for scoring a query in order to build global term statistics across
multiple shards for more accurate scoring. It currently does this by building the query's `Weight`
and calling `extractTerms` on it to collect terms, and then calling `IndexSearcher.termStatistics()`
for each collected term. This duplicates work, however, as the various `Weight` implementations 
will already have collected these statistics at construction time.

This commit replaces this round-about way of collecting stats, instead using a delegating
IndexSearcher that collects the term contexts and statistics when `IndexSearcher.termStatistics()`
is called from the Weight.

It also fixes a bug when using rescorers, where a `QueryRescorer` would calculate distributed term
statistics, but ignore field statistics.  `Rescorer.extractTerms` has been removed, and replaced with
a new method on `RescoreContext` that returns any queries used by the rescore implementation.
The delegating IndexSearcher then collects term contexts and statistics in the same way described
above for each Query.
2019-02-15 16:00:38 +00:00
Daniel Mitterdorfer fcc7f553f5
Also mmap cfs files for hybridfs (#38940) (#38947)
With this commit we add the `.cfs` file extension to the list of file
types that are memory-mapped by hybridfs. `.cfs` files combine all files
of a Lucene segment into a single file in order to save file handles. As
this strategy is only used for "small" segments (less than 10% of the
shard size), it is beneficial to memory-map them instead of accessing
them via NIO.

Relates #36668
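The extension-based choice between mmap and NIO can be sketched as follows. Only `cfs` comes from the text above; the rest of the real hybridfs extension list is deliberately omitted, and `HybridFsSketch.useMmap` is a hypothetical name:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of extension-based dispatch; not the hybridfs implementation.
final class HybridFsSketch {
    // Only "cfs" is taken from the commit message; other extensions are omitted on purpose.
    private static final Set<String> MMAP_EXTENSIONS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("cfs")));

    /** Returns true if the file should be memory-mapped, false if it should be read via NIO. */
    static boolean useMmap(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension (e.g. segments_N): use NIO
        }
        return MMAP_EXTENSIONS.contains(fileName.substring(dot + 1));
    }
}
```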
2019-02-15 15:34:40 +01:00
David Turner 578514e892 Recover peers from translog, ignoring soft deletes (#38904)
Today if soft deletes are enabled then we read the operations needed for peer
recovery from Lucene. However we do not currently make any attempt to retain
history in Lucene specifically for peer recoveries so we may discard it and
fall back to a more expensive file-based recovery. Yet we still retain
sufficient history in the translog to perform an operations-based peer
recovery.

In the long run we would like to fix this by retaining more history in Lucene,
possibly using shard history retention leases (#37165). For now, however, this
commit reverts to performing peer recoveries using the history retained in the
translog regardless of whether soft deletes are enabled or not.
2019-02-15 10:45:15 +01:00
Henning Andersen a211e51343 ShardBulkAction ignore primary response on primary (#38901)
Previously, if a version conflict occurred and a previous primary
response was present, the original primary response would be used both
for sending to the replica and back to the client. This was done in the past as an
attempt to fix issues with conflicts after relocations where a bulk request
would experience a closed shard half way through and thus have to retry
on the new primary. It could then fail on its own update.

With sequence numbers, this leads to an issue, since if a primary is
demoted (network partitions), it will send along the original response
in the request. In case of a conflict on the new primary, the old
response is sent to the replica. That data could be stale, leading to
inconsistency between primary and replica.

Relocations now do an explicit hand-off from old to new primary and
ensure that no operations are active while doing this. The above is thus no
longer necessary. This change removes the special handling of conflicts
and ignores primary responses when executing shard bulk requests on the
primary.
2019-02-15 10:13:11 +01:00
Jason Tedor 00cb8d0be8
Mark coordinator test as awaits fix
This test is failing frequently so this commit mutes it.

Relates #38867
2019-02-14 12:43:31 -05:00
Lee Hinman 0c733c04be Remove immediate operation retry after mapping update (#38873)
Prior to this commit, when an indexing operation resulted in an
`Engine.Result.Type.MAPPING_UPDATE_REQUIRED`, TransportShardBulkAction
immediately retried the indexing operation to see if it succeeded. In the event
that it succeeded, the context did not wait until the mapping update had
propagated through the cluster state before finishing the indexing.

In some of our tests we rely on mappings being available as soon as they've been
introduced in a document that indexed correctly. By removing the immediate retry
we always wait for this to be the case.

Resolves #38428
Supersedes #38579
Relates to #38711
2019-02-14 09:31:08 -07:00
Christoph Büscher 6c5cec4ff4 Enable silent FollowersCheckerTest (#38851)
One of the test methods wasn't run because it was private. This commit makes the
method public and fixes some issues around mocking the threadpool that would
otherwise lead to an NPE.
2019-02-14 16:16:48 +01:00
Albert Zaharovits 6243a9797f _cat/indices with Security, hide names when wildcard (#38824)
This changes the output of the `_cat/indices` API with `Security` enabled.

It is possible to only display the index name (and possibly the index
health, depending on the request options) but not its stats (doc count, merges,
size, etc). This is the case for closed indices which have index metadata in the
cluster state but no associated shards, hence no shard stats.
However, when `Security` is enabled, and the request contains wildcards,
**open** indices without stats are a common occurrence. This is because the
index names in the response table are picked up directly from the cluster state
which is not filtered by `Security`'s _indexNameExpressionResolver_, unlike the
stats data which is populated by the indices stats API which does go through the
index name resolver.
This is a bug, because it is circumventing `Security`'s function to hide
unauthorized indices.

This has been fixed by displaying the index names as they are resolved by the indices
stats API. The outputs of these two APIs are now very similar: same index names,
similar data but different format.

Closes #37190
2019-02-14 15:09:17 +02:00
David Roberts 6ea483a663 Mute DedicatedClusterSnapshotRestoreIT testRestoreShrinkIndex
Due to https://github.com/elastic/elasticsearch/issues/38845
2019-02-14 11:46:22 +00:00
Luca Cavanna 7456117019 [TEST] address testCollectNodes rare failure (#38559)
#37767 changed the expected exception for "no such cluster" error from
`IllegalStateException` to a dedicated `NoSuchRemoteClusterException`.
An assertion in `testCollectNodes` needs to be updated accordingly.
2019-02-14 10:57:14 +01:00
Nhat Nguyen 5d22e45990 Copy retention leases when trim unsafe commits (#37995)
When a primary shard is recovered from its store, we trim the last
commit (when it's unsafe). If that primary crashes before the recovery
completes, we will lose the committed retention leases because they are
baked into the last commit. With this change, we copy the retention leases
from the last commit to the safe commit when trimming unsafe commits.
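
A minimal Java sketch of the idea, not the actual engine code: commit user data is modeled as a plain `Map<String, String>`, and the `retention_leases` key, class, and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TrimmedCommitSketch {
    // Hypothetical user-data key under which a commit records its retention leases.
    static final String RETENTION_LEASES_KEY = "retention_leases";

    // Builds the user data for the commit that replaces an unsafe last commit:
    // it starts from the safe commit but carries over the leases from the last commit,
    // so they are not lost if the primary crashes before recovery completes.
    static Map<String, String> userDataAfterTrim(Map<String, String> safeCommit,
                                                 Map<String, String> lastCommit) {
        Map<String, String> userData = new HashMap<>(safeCommit);
        String leases = lastCommit.get(RETENTION_LEASES_KEY);
        if (leases != null) {
            userData.put(RETENTION_LEASES_KEY, leases);
        }
        return userData;
    }
}
```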

Relates #37165
2019-02-13 17:27:48 -05:00
Jason Tedor 062eea8fcc
Fix excessive increments in soft delete policy (#38813)
We were incrementing the policy too often. This meant that on
every iteration we kept increasing the minimum retained
sequence number, even with leases in place. The bug dates from when the
soft deletes policy had retention leases incorporated into it. This
commit fixes the bug by ensuring we only increment in the proper
places, and adds careful tests for the various situations.
2019-02-13 14:04:45 -05:00
Jake Landis 46bb663a09
Make 7.x like 6.7 user agent ecs, but default to true (#38828)
Forward port of https://github.com/elastic/elasticsearch/pull/38757

This change reverts the initial 7.0 commits and replaces them
with the 6.7 variant that still allows for the ecs flag. 
This commit differs from the 6.7 variant in that the ecs flag
now defaults to true.

6.7: `ecs` : default `false`
7.x: `ecs` : default `true`
8.0: no option, but behaves as `true`

* Revert "Ingest node - user agent, move device to an object (#38115)"
This reverts commit 5b008a34aa.

* Revert "Add ECS schema for user-agent ingest processor (#37727) (#37984)"
This reverts commit cac6b8e06f.

* cherry-pick 5dfe1935345da3799931fd4a3ebe0b6aa9c17f57 
Add ECS schema for user-agent ingest processor (#37727)

* cherry-pick ec8ddc890a34853ee8db6af66f608b0ad0cd1099 
Ingest node - user agent, move device to an object (#38115) (#38121)
  
* cherry-pick f63cbdb9b426ba24ee4d987ca767ca05a22f2fbb (with manual merge fixes)
Dep. check for ECS changes to User Agent processor (#38362)

* make true the default for the ecs option, and update 7.0 references and tests
2019-02-13 10:28:01 -06:00
Przemyslaw Gomulka 7404882105
Fix line separators in JSON logging tests backport#38771 #38834
The hard-coded '\n' in strings will not work on Windows, where there is a
different line separator. System.lineSeparator() should be used to make
it work on all platforms.
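
A minimal sketch of the portable approach (the class and method names are hypothetical, not the actual test code): expected multi-line output is joined with the platform separator.

```java
import java.util.List;

class ExpectedLogLines {
    // Joins expected log lines with the platform separator ("\n" on Linux/macOS,
    // "\r\n" on Windows) instead of a hard-coded "\n".
    static String joinLines(List<String> lines) {
        return String.join(System.lineSeparator(), lines);
    }
}
```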
closes #38705 
backport #38771
2019-02-13 13:34:33 +01:00
Zachary Tong 57f69082fd
Disable cache on QueryProfilerIT (#38748)
- Disables the request cache on the test, to prevent cached
values from potentially interfering with test results
- Changes the test to execute a single query, in hopes of making
failures more reproducible
Backport of #38583
2019-02-12 13:11:52 -05:00
Nhat Nguyen a3f39741be Adjust log and unmute testFailOverOnFollower (#38762)
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjusted the logging and
unmuted this test so we can collect more information.

Relates #38633
2019-02-12 11:42:25 -05:00
Nhat Nguyen 4a5070dcfb Use current term in initial leases in engine test (#38285)
We need to use the current primary term instead of 1L for the initial
retention leases; otherwise, the primary term of the committed
retention leases won't match the current primary term if the
retention leases never get updated.
2019-02-12 11:40:04 -05:00
Nhat Nguyen eca5404572 Fix synchronization in LocalCheckpointTracker#contains (#38755)
We are accessing the `CountedBitSet` in `LocalCheckpointTracker#contains`
without proper synchronization.
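
The pattern behind the fix can be sketched as follows. This is a simplified, hypothetical tracker, not `LocalCheckpointTracker` itself: a plain `java.util.BitSet` stands in for `CountedBitSet`, and the point is only that the read path takes the same lock as the write path.

```java
import java.util.BitSet;

// Hypothetical, simplified tracker. Every access to the shared BitSet,
// including the read in contains(), holds the same monitor.
class SimpleCheckpointTracker {
    private final BitSet processed = new BitSet();
    // highest seq no such that it and every lower seq no have been processed
    private long checkpoint = -1;

    synchronized void markProcessed(long seqNo) {
        processed.set(Math.toIntExact(seqNo));
        // advance the checkpoint over any contiguous run of processed sequence numbers
        while (processed.get(Math.toIntExact(checkpoint + 1))) {
            checkpoint++;
        }
    }

    // Without 'synchronized', this could read the BitSet while markProcessed mutates it.
    synchronized boolean contains(long seqNo) {
        return seqNo <= checkpoint || processed.get(Math.toIntExact(seqNo));
    }

    synchronized long getCheckpoint() {
        return checkpoint;
    }
}
```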

Relates #33871
2019-02-12 11:39:50 -05:00
Nhat Nguyen 225ebb6935 Ensure no snapshotted commit when close engine (#38663)
With this change, we can automatically detect an implementation
that acquires an index commit but fails to release it.
2019-02-12 11:39:35 -05:00
Tanguy Leroux 51d6b9ab31 Fix CloseWhileRelocatingShardsIT (#38728) 2019-02-12 14:04:44 +01:00
Jason Tedor bbc9aa9979
Introduce retention lease actions (#38756)
This commit introduces actions for some common retention lease
operations that clients need to be able to perform remotely. These
actions include add/renew/remove.
2019-02-12 07:38:03 -05:00
Przemyslaw Gomulka 7e178aa4a7
Enable IndexActionTests and WatcherIndexingListenerTests Backport #38738
Fix tests to use a clock with millisecond precision in Watcher code.
Make sure the date comparison in string format uses the same formatters.
Some of the code was modified in #38514, possibly because of merge conflicts.

closes #38581
Backport #38738
2019-02-12 13:05:44 +01:00
Luca Cavanna 90fff54954 Tie break on cluster alias when merging shard search failures (#38715)
A recent test failure triggered an edge case scenario where failures may be coming back with the same shard id, yet from different clusters.
This commit adapts the failures comparator to take the cluster alias into account when merging failures as part of CCS requests execution.
Also the corresponding test has been split in two: with and without
search shard target set to the failure.
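
A minimal sketch of such a tie-breaking comparator, assuming a hypothetical, simplified failure type (the real shard search failure carries more fields, and the real comparator may order keys differently):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class FailureMergingSketch {
    // Hypothetical, simplified view of a shard search failure.
    static final class ShardFailure {
        final String clusterAlias; // null for the local cluster
        final String index;
        final int shardId;

        ShardFailure(String clusterAlias, String index, int shardId) {
            this.clusterAlias = clusterAlias;
            this.index = index;
            this.shardId = shardId;
        }
    }

    // Orders by index and shard id, then tie-breaks on the cluster alias, so two failures
    // for the same shard id from different clusters are not treated as duplicates.
    static final Comparator<ShardFailure> ORDER =
            Comparator.comparing((ShardFailure f) -> f.index)
                    .thenComparingInt(f -> f.shardId)
                    .thenComparing((ShardFailure f) -> f.clusterAlias,
                            Comparator.nullsFirst(Comparator.<String>naturalOrder()));

    // Deduplicates failures according to ORDER and returns them sorted.
    static List<ShardFailure> merge(List<ShardFailure> failures) {
        TreeSet<ShardFailure> unique = new TreeSet<>(ORDER);
        unique.addAll(failures);
        return new ArrayList<>(unique);
    }
}
```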

Closes #38672
2019-02-12 11:25:44 +01:00
Jason Tedor c7cdd6a46a
Add dedicated retention lease exceptions (#38754)
When a retention lease already exists on an add retention lease
invocation, or a retention lease is not found on a renew retention lease
invocation, today we throw an illegal argument exception. This puts a
burden on the caller to catch that specific exception and parse the
message. This commit relieves the burden from the caller by adding
dedicated exception types for these situations.
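
A minimal sketch of how dedicated exception types let a caller react without parsing messages. The exception and registry names below are hypothetical, chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical exception types, named for illustration only.
class LeaseAlreadyExistsException extends RuntimeException {
    LeaseAlreadyExistsException(String id) {
        super("retention lease with id [" + id + "] already exists");
    }
}

class LeaseNotFoundException extends RuntimeException {
    LeaseNotFoundException(String id) {
        super("retention lease with id [" + id + "] not found");
    }
}

class LeaseRegistrySketch {
    // lease id -> retaining sequence number
    private final Map<String, Long> leases = new HashMap<>();

    synchronized void add(String id, long retainingSeqNo) {
        // putIfAbsent returns the existing value when the id is already taken
        if (leases.putIfAbsent(id, retainingSeqNo) != null) {
            throw new LeaseAlreadyExistsException(id);
        }
    }

    synchronized void renew(String id, long retainingSeqNo) {
        // replace returns null when there is no mapping for the id
        if (leases.replace(id, retainingSeqNo) == null) {
            throw new LeaseNotFoundException(id);
        }
    }
}
```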
2019-02-12 00:32:09 -05:00
Jason Tedor b97c74bbab
Enable removal of retention leases (#38751)
This commit introduces the ability to remove retention leases. Explicit
removal will be needed to manage retention leases used to increase the
likelihood of operation-based recoveries syncing, and for consumers such
as ILM.
2019-02-11 21:19:11 -05:00
Nick Knize e2f432a413 Fix the version check for LegacyGeoShapeFieldMapper (#38547)
Change version check from 7.0 to 6.6 in BaseGeoShapeFieldMapper to correctly use LegacyGeoShapeFieldMapper for indexes created prior to 6.6.
2019-02-11 16:27:47 -06:00
Nick Knize 078da6d9bd Fix GeoHash PrefixTree BWC (#38584)
geo_shape indexes created before 6.6 use geohash string encoding as the default tree parameter, while those created in 6.6 and later use quadtree encoding. This commit fixes BWC to use geohash encoding in LegacyGeoShapeFieldMapper for indexes created before 6.6.
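
A minimal sketch of the version-dependent default described above. It passes the index-created version as plain major/minor integers rather than Elasticsearch's `Version` class; the class and method names are hypothetical.

```java
class GeoShapeTreeDefaults {
    // Default prefix tree for a geo_shape field, based on the version the index was
    // created with: geohash before 6.6, quadtree from 6.6 onwards.
    static String defaultTree(int createdMajor, int createdMinor) {
        boolean createdBefore66 = createdMajor < 6 || (createdMajor == 6 && createdMinor < 6);
        return createdBefore66 ? "geohash" : "quadtree";
    }
}
```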
2019-02-11 11:59:51 -06:00
David Roberts d1848b96fc
Fix possible assertion failure in IndicesQueryCache.close (#38731)
The assertion that the stats2 map is empty in
IndicesQueryCache.close has been observed to
fail very occasionally in internal cluster tests.

The likely cause is a cross-thread visibility
problem for a count variable.  This change
makes that count volatile.

Relates #37117
Backport of #38714
2019-02-11 17:33:20 +00:00
Tanguy Leroux dc212de822
Specialize pre-closing checks for engine implementations (#38702) (#38722)
The Close Index API has been refactored in 6.7.0 and it now performs
pre-closing sanity checks on shards before an index is closed: the maximum
sequence number must be equal to the global checkpoint. While this is a
strong requirement for regular shards, we identified the need to relax this
check in the case of CCR following shards.

The following shards are not in charge of managing the max sequence
number or global checkpoint, which are pulled from a leader shard. They
also fetch and process batches of operations from the leader in an unordered
way, potentially leaving gaps in the history of ops. If the following shard lags
a lot, it's possible that the global checkpoint and max seq number never get
in sync, preventing the following shard from being closed and a new PUT Follow
action from being issued on this shard (which is our recommended way to
resume/restart CCR following).

This commit allows each Engine implementation to define the specific 
verification it must perform before closing the index. In order to allow 
following/frozen/closed shards to be closed whatever the max seq number 
or global checkpoint are, the FollowingEngine and ReadOnlyEngine do 
not perform any check before the index is closed.
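
The per-engine hook can be sketched as a simple template method. These are hypothetical, heavily simplified classes (only the pre-close check is modeled), and the method name is illustrative rather than the real `Engine` API.

```java
// Hypothetical, simplified engine: only the pre-close check is modeled.
abstract class EngineSketch {
    abstract long maxSeqNo();

    abstract long globalCheckpoint();

    // Default check used by regular shards: every operation up to the max sequence
    // number must be covered by the global checkpoint before the index can be closed.
    void verifyBeforeClose() {
        if (maxSeqNo() != globalCheckpoint()) {
            throw new IllegalStateException("max seq no [" + maxSeqNo()
                    + "] does not match global checkpoint [" + globalCheckpoint() + "]");
        }
    }
}

class RegularEngineSketch extends EngineSketch {
    private final long maxSeqNo;
    private final long globalCheckpoint;

    RegularEngineSketch(long maxSeqNo, long globalCheckpoint) {
        this.maxSeqNo = maxSeqNo;
        this.globalCheckpoint = globalCheckpoint;
    }

    @Override
    long maxSeqNo() {
        return maxSeqNo;
    }

    @Override
    long globalCheckpoint() {
        return globalCheckpoint;
    }
}

// A following (or read-only) engine does not own these values, so it skips the check.
class FollowingEngineSketch extends RegularEngineSketch {
    FollowingEngineSketch(long maxSeqNo, long globalCheckpoint) {
        super(maxSeqNo, globalCheckpoint);
    }

    @Override
    void verifyBeforeClose() {
        // intentionally no check
    }
}
```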

Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
2019-02-11 17:34:17 +01:00
Luca Cavanna 6443b46184
Clean up ShardSearchLocalRequest (#38574)
Added a constructor accepting `StreamInput` as an argument, which made it possible to
make most of the instance members final as well as to remove the default
constructor.
Removed a test-only constructor in favour of invoking the existing
constructor that takes a `SearchRequest` as its first argument.
Also removed the profile members and related methods, as they were all unused.
2019-02-11 15:55:46 +01:00
Alexander Reelsen 884b5063a4 Create ISO8601 joda compatible java time formatter (#38434)
The existing formatter being used was not on par with the joda formatter
as it was missing the ability to parse a comma as a separator between
seconds and milliseconds.

While a complete ISO 8601 implementation would be much more complex, this might be
sufficient for more use cases.

The ingest date formatter now also uses the iso8601 formatter by
default.
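
The comma-separator behaviour can be illustrated with plain `java.time`. This is a minimal, parse-only sketch with a hypothetical class name, not the formatter Elasticsearch actually builds: it accepts either `.` or `,` before the fractional seconds.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

class CommaFractionParsing {
    // Parsing-only formatter: accepts either '.' or ',' before the fractional seconds.
    // When formatting, both optional sections would print, so don't use it for output.
    static final DateTimeFormatter PARSER = new DateTimeFormatterBuilder()
            .appendPattern("uuuu-MM-dd'T'HH:mm:ss")
            .optionalStart()
            .appendLiteral('.')
            .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, false)
            .optionalEnd()
            .optionalStart()
            .appendLiteral(',')
            .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, false)
            .optionalEnd()
            .toFormatter();

    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, PARSER);
    }
}
```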

Closes #38345
2019-02-11 15:11:26 +01:00
Alexander Reelsen e7868e92bd
Restore date aggregation performance in UTC case (#38221) (#38700)
The benchmarks showed a sharp decrease in aggregation performance for
the UTC case.

This commit uses the same calculation as Joda-Time, which requires no
conversion into any java.time object. Also, the check for a fixed offset
has been moved into the constructor to reduce the need for runtime calculations.
The same goes for the length of the used unit in milliseconds.
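
A minimal sketch of rounding in the UTC (fixed-offset) case with plain `long` arithmetic. It assumes a fixed-length unit such as an hour or a UTC day (not calendar months or DST-observing zones), and the class name is hypothetical; it is not the actual `Rounding` code.

```java
class FixedUnitRoundingSketch {
    private final long unitMillis;

    // The unit length is computed once, up front, instead of on every call.
    FixedUnitRoundingSketch(long unitMillis) {
        if (unitMillis <= 0) {
            throw new IllegalArgumentException("unit must be positive");
        }
        this.unitMillis = unitMillis;
    }

    // Rounds a UTC timestamp down to the start of its unit without creating java.time objects.
    // Math.floorDiv makes negative (pre-1970) timestamps round downwards too.
    long round(long utcMillis) {
        return Math.floorDiv(utcMillis, unitMillis) * unitMillis;
    }
}
```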

Closes #37826
2019-02-11 16:30:48 +03:00
Luca Cavanna fe8bd757b2
Look up connection using the right cluster alias when releasing contexts (#38570)
Whenever a phase failure is raised in AbstractSearchAsyncAction, we go and
release the search contexts of shards that successfully returned their
results, prior to notifying the listener of the failure. In case we are
executing a CCS request, it's important to look up the right connection to
send the release context request to.

This commit makes sure that the lookup takes the cluster alias into
account. We used to always use `null` instead, which is not correct,
and this was not caught because any exception is caught without being re-thrown.
2019-02-11 13:40:42 +01:00
Przemyslaw Gomulka ba9a4d13e1
mute Failing tests related to logging and joda-java migration backport(#38704)(#38710)
The tests await fixes from #38693, #38705 and #38581.
2019-02-11 13:15:12 +01:00
Przemyslaw Gomulka ab9e2f2e69
Move testToUtc test to DateFormattersTests #38698 Backport #38610
The test relied on ZonedDateTime#toString, whose output differs from
what strict_date_time formats when milliseconds are 0.
The method just delegates to dateFormatter, so that scenario should
be covered there.
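
The difference is easy to reproduce with the JDK alone. In this sketch the pattern is an assumption that approximates the shape of `strict_date_time` output, not its actual Elasticsearch definition.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class ToStringVersusFormatter {
    // Approximation of a strict_date_time-style pattern (an assumption, not the ES definition).
    static final DateTimeFormatter STRICT_LIKE = DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSXXX");

    public static void main(String[] args) {
        ZonedDateTime t = ZonedDateTime.of(2019, 2, 11, 10, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(t);                     // 2019-02-11T10:00Z (zero seconds and millis omitted)
        System.out.println(STRICT_LIKE.format(t)); // 2019-02-11T10:00:00.000Z
    }
}
```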

closes #38359
Backport #38610
2019-02-11 11:34:25 +01:00
Like b8be6cb5c7
Reject index.optimize_auto_generated_id setting (#28895)
This commit rejects the index.optimize_auto_generated_id setting for
indices created on or after 7.0.0. This setting was deprecated in 6.7.0.
2019-02-10 13:46:09 -05:00
Tim Brooks 023e3c207a
Concurrent file chunk fetching for CCR restore (#38656)
Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR.

The implementation uses the parallel file writer functionality that is also used by peer recoveries.
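
Bounding the number of in-flight chunk requests can be sketched with a `Semaphore`. This is a hypothetical illustration, not the CCR restore code: `fetchChunk` is a placeholder for a remote request, and the constructor argument plays the role of `ccr.indices.recovery.max_concurrent_file_chunks`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedChunkFetcherSketch {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    // maxConcurrentFileChunks plays the role of ccr.indices.recovery.max_concurrent_file_chunks.
    BoundedChunkFetcherSketch(int maxConcurrentFileChunks) {
        this.permits = new Semaphore(maxConcurrentFileChunks);
    }

    // Fetches chunkCount chunks on the executor, with at most maxConcurrentFileChunks in flight.
    void fetchAll(int chunkCount, ExecutorService executor) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(chunkCount);
        for (int i = 0; i < chunkCount; i++) {
            permits.acquire(); // waits while the limit is reached
            executor.execute(() -> {
                try {
                    maxInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                    fetchChunk();
                } finally {
                    inFlight.decrementAndGet();
                    permits.release();
                    done.countDown();
                }
            });
        }
        done.await();
    }

    // Hypothetical stand-in for a remote "read file chunk" request.
    private void fetchChunk() {
        Thread.yield();
    }

    int maxInFlight() {
        return maxInFlight.get();
    }
}
```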
2019-02-09 21:19:57 -07:00
Christoph Büscher e3c7b93917 Mute failure in InternalEngineTests (#38622) 2019-02-08 16:29:54 +01:00
Dimitris Athanasiou fe8182ece2
Mute RetentionLeaseIT.testRetentionLeasesSyncOnRecovery on 7x (#38597) 2019-02-08 11:32:28 +02:00
Jason Tedor fdf6b3f23f
Add 7.1 version constant to 7.x branch (#38513)
This commit adds the 7.1 version constant to the 7.x branch.

Co-authored-by: Andy Bristol <andy.bristol@elastic.co>
Co-authored-by: Tim Brooks <tim@uncontended.net>
Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com>
Co-authored-by: markharwood <markharwood@gmail.com>
Co-authored-by: Ioannis Kakavas <ioannis@elastic.co>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Alpar Torok <torokalpar@gmail.com>
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: Albert Zaharovits <albert.zaharovits@gmail.com>
2019-02-07 16:32:27 -05:00
Jason Tedor f8ed6c15c4
Enable BWC after backport recovering leases (#38485)
This commit enables the BWC tests after backporting recovery of retention
leases during peer recovery.
2019-02-06 08:03:19 -05:00
Jason Tedor 4b42281a4e
Collapse retention lease integration tests (#38483)
This commit collapses the retention lease integration tests into a
single suite.
2019-02-06 07:55:41 -05:00
Tanguy Leroux 510829f9f7
TransportVerifyShardBeforeCloseAction should force a flush (#38401)
This commit changes the `TransportVerifyShardBeforeCloseAction` so that it
always forces the flush of the shard. It seems that #37961 is not sufficient to
ensure that the translog and the Lucene commit share the exact same max
seq no and global checkpoint information when one or more no-op
operations have been performed.

The `BulkWithUpdatesIT.testThatMissingIndexDoesNotAbortFullBulkRequest`
and `FrozenIndexTests.testFreezeEmptyIndexWithTranslogOps` tests exercise this trivial
situation, and they both fail in 1 out of 10 executions.

Relates to #33888
2019-02-06 13:22:54 +01:00
David Turner 5a3c452480
Align docs etc with new discovery setting names (#38492)
In #38333 and #38350 we moved away from the `discovery.zen` settings namespace
since these settings have an effect even though Zen Discovery itself is being
phased out. This change aligns the documentation and the names of related
classes and methods with the newly-introduced naming conventions.
2019-02-06 11:34:38 +00:00
Ioannis Kakavas e1d464b22c
Mute testRetentionLeasesSyncOnRecovery (#38488)
Relates: #38487
2019-02-06 08:52:54 +02:00
Armin Braun 34f2cc78f6
Fix Master Failover and DataNode Leave Blocking Snapshot (#38460)
* Closes #38447
2019-02-05 23:56:59 +01:00
Jason Tedor 79a45b47da
Recover retention leases during peer recovery (#38435)
This commit integrates retention leases with recovery. With this change,
we copy the current retention leases on primary to the replica during
phase two of recovery. At this point, the replica will have been added
to the replication group and so is already receiving retention lease
sync requests from the primary. This means that if any retention lease
syncs are triggered on the primary after we sample the retention leases
here during phase two, that sync request will also arrive on the replica
ensuring that the replica is from this point on up to date with the
retention leases on the primary. We have to copy these during phase two
since we will be applying indexing operations, potentially triggering
merges, and therefore must ensure the correct retention leases are in
place beforehand.
2019-02-05 17:43:41 -05:00
Henning Andersen 20c66c5a05
Bubble-up exceptions from scheduler (#38317)
Instead of logging warnings we now rethrow exceptions thrown inside
scheduled/submitted tasks. This will still log them as warnings in
production but has the added benefit that if they are thrown during
unit/integration test runs, the test will be flagged as an error.

This is a continuation of #38014

Fixed NPE that caused CCR tests (IndexFollowingIT and likely others)
to fail.

`schedule` could bubble a rejected execution exception up to the uncaught exception
handler when not using the SAME executor if the thread pool was terminated.
Rejected execution exceptions are now silently ignored if the executor is shut down.
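
The last point can be sketched with a plain `ExecutorService` rather than the Elasticsearch thread pool. The class and method names are hypothetical, and the rethrowing of exceptions from inside tasks is not modeled.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

class QuietOnShutdownScheduler {
    // Runs the task on the executor. A rejection is swallowed only when the executor has
    // been shut down; any other rejection still propagates to the caller.
    static void execute(ExecutorService executor, Runnable task) {
        try {
            executor.execute(task);
        } catch (RejectedExecutionException e) {
            if (!executor.isShutdown()) {
                throw e;
            }
        }
    }
}
```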
2019-02-05 21:48:24 +01:00
Boaz Leskes 033ba725af
Remove support for internal versioning for concurrency control (#38254)
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a. optimistic concurrency control) operations using internal document versioning. Sadly, that approach is flawed and can sometimes do the wrong thing. Here's the relevant excerpt from the resiliency status page:

> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document 

We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem. 

This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.
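
A minimal sketch of compare-and-set on sequence number plus primary term, in the spirit of the `if_seq_no` / `if_primary_term` request parameters. This is a hypothetical in-memory store, not Elasticsearch code; `IllegalStateException` stands in for the real version conflict exception.

```java
import java.util.HashMap;
import java.util.Map;

class SeqNoCasStoreSketch {
    static final class Doc {
        final String source;
        final long seqNo;
        final long primaryTerm;

        Doc(String source, long seqNo, long primaryTerm) {
            this.source = source;
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();
    private final long primaryTerm;
    private long nextSeqNo = 0;

    SeqNoCasStoreSketch(long primaryTerm) {
        this.primaryTerm = primaryTerm;
    }

    // Unconditional write: every operation gets a fresh sequence number.
    synchronized Doc index(String id, String source) {
        Doc doc = new Doc(source, nextSeqNo++, primaryTerm);
        docs.put(id, doc);
        return doc;
    }

    // Conditional write: succeeds only if the document was last written by the operation
    // identified by (ifSeqNo, ifPrimaryTerm).
    synchronized Doc indexIfMatch(String id, String source, long ifSeqNo, long ifPrimaryTerm) {
        Doc current = docs.get(id);
        if (current == null || current.seqNo != ifSeqNo || current.primaryTerm != ifPrimaryTerm) {
            throw new IllegalStateException("version conflict for [" + id + "]");
        }
        return index(id, source);
    }
}
```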

Relates to #1078
2019-02-05 20:53:35 +01:00
Jason Tedor b03d138122
Lift retention lease expiration to index shard (#38380)
This commit lifts the control of when retention leases are expired to
index shard. In this case, we move expiration to an explicit action
rather than a side-effect of calling
ReplicationTracker#getRetentionLeases. This explicit action is invoked
on a timer. If any retention leases expire, then we hard sync the
retention leases to the replicas. Otherwise, we proceed with a
background sync.
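
A minimal sketch of expiration as an explicit, timer-driven step that also decides which kind of sync to run. The class, enum, and method names are hypothetical, and leases are reduced to their last renewal timestamp.

```java
import java.util.HashMap;
import java.util.Map;

class LeaseExpirySketch {
    enum Sync { HARD, BACKGROUND }

    // lease id -> time (ms) of the last add/renew
    private final Map<String, Long> lastRenewed = new HashMap<>();
    private final long leaseLifetimeMillis;

    LeaseExpirySketch(long leaseLifetimeMillis) {
        this.leaseLifetimeMillis = leaseLifetimeMillis;
    }

    synchronized void addOrRenew(String id, long nowMillis) {
        lastRenewed.put(id, nowMillis);
    }

    // Reading the leases has no side effects; expiry does not happen here.
    synchronized Map<String, Long> leases() {
        return new HashMap<>(lastRenewed);
    }

    // Called from a timer: expire leases explicitly, then choose a hard sync if anything
    // expired, otherwise a background sync.
    synchronized Sync expireLeases(long nowMillis) {
        boolean anyExpired = lastRenewed.values().removeIf(t -> nowMillis - t > leaseLifetimeMillis);
        return anyExpired ? Sync.HARD : Sync.BACKGROUND;
    }
}
```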
2019-02-05 14:42:17 -05:00
Tim Brooks c2a8fe1f91
Prevent CCR recovery from missing documents (#38237)
Currently the snapshot/restore process manually sets the global
checkpoint to the max sequence number from the restored segments. This
does not work for CCR, as it prevents documents that would otherwise be
recovered by the normal following operation from being recovered.

This commit fixes this issue by setting the initial global checkpoint to
the existing local checkpoint.
2019-02-05 13:32:41 -06:00
Tal Levy aef5775561
re-enables awaitsfixed datemath tests (#38376)
Previously, date formats of `YYYY.MM.dd` would hit an issue
where the year would jump towards the end of the calendar year.
This was an issue that had since been resolved in tests by using
`yyyy` to be the more accurate representation of the year.
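
With `java.time` patterns, `Y` is the week-based year and `y` is the year of era, which explains the jump near year end. This sketch assumes `Locale.US` week rules (week definitions are locale-dependent); the class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class WeekBasedYearPitfall {
    static String format(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        LocalDate newYearsEve = LocalDate.of(2018, 12, 31);
        // 'Y' is the week-based year: Dec 31, 2018 falls in the first week of 2019 (US week rules).
        System.out.println(format("YYYY.MM.dd", newYearsEve)); // 2019.12.31
        // 'y' is the calendar year of era.
        System.out.println(format("yyyy.MM.dd", newYearsEve)); // 2018.12.31
    }
}
```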

Closes #37037.
2019-02-05 11:20:40 -08:00
Julie Tibshirani 3ce7d2c9b6
Make sure to reject mappings with type _doc when include_type_name is false. (#38270)
`CreateIndexRequest#source(Map<String, Object>, ... )`, which is used when
deserializing index creation requests, accidentally accepts mappings that are
nested twice under the type key (as described in the bug report #38266).

This in turn causes us to be too lenient in parsing typeless mappings. In
particular, we accept the following index creation request, even though it
should not contain the type key `_doc`:

```
PUT index?include_type_name=false
{
  "mappings": {
    "_doc": {
      "properties": { ... }
    }
  }
}
```

There is a similar issue for both 'put templates' and 'put mappings' requests
as well.

This PR makes the minimal changes to detect and reject these typed mappings in
requests. It does not address #38266 generally, or attempt a larger refactor
around types in these server-side requests, as I think this should be done at a
later time.
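
A minimal sketch of detecting the typed shape shown above on a parsed request body. It handles only the `_doc` case, and the class name and error message are hypothetical rather than the actual Elasticsearch check.

```java
import java.util.Map;

class TypelessMappingCheckSketch {
    // With include_type_name=false, the body of "mappings" must be the mapping itself.
    // A single "_doc" key wrapping an object means the request is still typed.
    static void rejectTypedMappings(Map<String, Object> mappings) {
        if (mappings.size() == 1 && mappings.get("_doc") instanceof Map) {
            throw new IllegalArgumentException(
                    "mappings must not be nested under the type [_doc] when include_type_name is false");
        }
    }
}
```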
2019-02-05 10:52:32 -08:00
David Turner f2dd5dd6eb
Remove DiscoveryPlugin#getDiscoveryTypes (#38414)
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.

Relates #38410
2019-02-05 17:42:24 +00:00
David Turner b7ab521eb1
Throw AssertionError when no master (#38432)
Today we throw a fatal `RuntimeException` if an exception occurs in
`getMasterName()`, and this includes the case where there is currently no
master. However, sometimes we call this method inside an `assertBusy()` in
order to allow for a cluster that is in the process of stabilising and electing
a master. The trouble is that `assertBusy()` only retries on an
`AssertionError` and not on a general `RuntimeException`, so the lack of a
master is immediately fatal.

This commit fixes the issue by asserting there is a master, triggering a retry
if there is not.
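
Why the exception type matters can be seen in a stripped-down retry helper. This is a hypothetical sketch, not the test framework's `assertBusy`: it retries only on `AssertionError` and does not sleep or back off between attempts.

```java
class AssertBusySketch {
    // Retries the check only while it throws AssertionError; any other exception
    // (e.g. a RuntimeException) escapes on the first attempt.
    static void assertBusy(Runnable check, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        AssertionError lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                lastFailure = e; // a real helper would also wait before retrying
            }
        }
        throw lastFailure;
    }
}
```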

Fixes #38331
2019-02-05 17:11:20 +00:00
Armin Braun 2f6afd290e
Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization (#38368)
* The problem in #38226 is that in some corner cases multiple calls to `endSnapshot` were made concurrently, leading to non-deterministic behavior (`beginSnapshot` was triggering a repository finalization while one that was triggered by a `deleteSnapshot` was already in progress)
   * Fixed by:
      * Making all `endSnapshot` calls originate from the cluster state being in a "completed" state (apart from one short-circuit when initializing an empty snapshot). This forced putting the failure string into `SnapshotsInProgress.Entry`.
      * Adding deduplication logic to `endSnapshot`
* Also:
  * Streamlined the init behavior to work the same way (keep state on the `SnapshotsService` to decide which snapshot entries are stale)
* closes #38226
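
Deduplicating concurrent `endSnapshot` calls can be sketched with a concurrent set. This is a hypothetical illustration (snapshots are identified by plain strings, and the repository finalization itself is omitted), not the `SnapshotsService` code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotEndingSketch {
    // Snapshots whose finalization is currently in progress.
    private final Set<String> endingSnapshots = ConcurrentHashMap.newKeySet();

    // Returns true if this call started finalization; a duplicate call for the same
    // snapshot while finalization is in progress returns false and does nothing.
    boolean endSnapshot(String snapshot) {
        if (!endingSnapshots.add(snapshot)) {
            return false;
        }
        // repository finalization would start here (omitted in this sketch)
        return true;
    }

    // Called once finalization has completed, successfully or not.
    void onFinalizationDone(String snapshot) {
        endingSnapshots.remove(snapshot);
    }
}
```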
2019-02-05 16:44:18 +01:00
Lee Hinman d862453d68
Support unknown fields in ingest pipeline map configuration (#38352)
We already support unknown objects in the list of pipelines, this changes the
`PipelineConfiguration` to support fields other than just `id` and `config`.

Relates to #36938
2019-02-05 07:52:17 -07:00
David Turner 3b2a0d7959
Rename no-master-block setting (#38350)
Replaces `discovery.zen.no_master_block` with `cluster.no_master_block`. Any
value set for the old setting is now ignored.
2019-02-05 08:47:56 +00:00
David Turner 2d114a02ff
Rename static Zen1 settings (#38333)
Renames the following settings to remove the mention of `zen` in their names:

- `discovery.zen.hosts_provider` -> `discovery.seed_providers`
- `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers`
- `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout`
- `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`
2019-02-05 08:46:52 +00:00
Yogesh Gaikwad fe36861ada
Add support for API keys to access Elasticsearch (#38291)
X-Pack security supports a built-in authentication service,
`token-service`, that allows access tokens to be used to
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on the
OAuth2 spec. The access token is short-lived
(defaults to 20m) and the refresh token has a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline and thereby fail to refresh the tokens.

This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens, a.k.a. API
keys, to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.

The API keys:
- by default do not have an expiration, but an expiration can be
  configured where the API keys need to expire after a
  certain amount of time.
- when generated, keep the authentication information of the user that
  generated them.
- can be defined with a role describing the privileges for accessing
  Elasticsearch and will be limited by the role of the user that
  generated them.
- can be invalidated via the invalidation API.
- can have their information retrieved via a get API.
- that have expired or been invalidated will be retained for 1 week
  before being deleted. The expired API keys remover task handles this.

The following are the API key management APIs:
1. Create API key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API key(s) - `DELETE /_security/api_key`

The API keys can be used to access Elasticsearch using the `Authorization`
header, where the auth scheme is `ApiKey` and the credentials are the
base64 encoding of the API key ID and the API key, separated by a colon.
Example:
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```
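
The header value in the example above can be reproduced with the JDK's `java.util.Base64`. In this sketch `api-key-id` and `api-key` are the placeholder ID and key that the example's credentials decode to; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ApiKeyHeader {
    // Builds the Authorization header value: "ApiKey " + base64("<id>:<api_key>").
    static String headerValue(String id, String apiKey) {
        String credentials = id + ":" + apiKey;
        return "ApiKey " + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

For instance, `ApiKeyHeader.headerValue("api-key-id", "api-key")` yields `ApiKey YXBpLWtleS1pZDphcGkta2V5`, the value used in the curl example.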

Closes #34383
2019-02-05 14:21:57 +11:00
Christoph Büscher d255303584
Add typeless client side GetIndexRequest calls and response class (#37778)
The HLRC client currently uses `org.elasticsearch.action.admin.indices.get.GetIndexRequest`
and `org.elasticsearch.action.admin.indices.get.GetIndexResponse` in its get index calls. Both request and
response are designed for the typed APIs, including some return types, e.g. `getMappings()`, whose
returned maps still contain a level keyed by the type name.
In order to change this without breaking existing users of the HLRC API, this PR introduces two new request
and response objects in the `org.elasticsearch.client.indices` client package. These are used by the
IndicesClient#get and IndicesClient#exists calls now by default and support the type-less API. The old request
and response objects are still kept for use in similarly named, but deprecated methods.

The newly introduced client side classes are simplified versions of the server side request/response classes since
they don't need to support wire serialization, and only the response needs fromXContent parsing (but no
xContent-serialization, since this is the responsibility of the server-side class).
This PR also changes the return type of `GetIndexResponse#getMapping` to
`Map<String, MappingMetaData> getMappings()`; previously it returned another map
keyed by the type name. Similar getters return simple Maps instead of the ImmutableOpenMaps that the
server-side response objects return.
2019-02-05 03:41:05 +01:00
Gordon Brown 292e0f6fb7
Deprecate `_type` in simulate pipeline requests (#37949)
As mapping types are being removed throughout Elasticsearch, the use of
`_type` in pipeline simulation requests is deprecated. Additionally, the
default `_type` used if one is not supplied has been changed to `_doc` for
consistency with the rest of Elasticsearch.
2019-02-04 16:11:44 -07:00
Christoph Büscher 0ced775389
Mute RareClusterStateIT.testDelayedMappingPropagationOnReplica (#38357) 2019-02-04 22:30:34 +01:00
Mayya Sharipova 641704464d
Deprecate types in rollover index API (#38039)
Relates to #35190
2019-02-04 16:07:45 -05:00
Zachary Tong ab1150378b
Add Composite to AggregationBuilders (#38207) 2019-02-04 13:47:04 -05:00
David Turner 2c1eab2b8a
Clarify slow cluster-state log messages (#38302)
The message `... took [31s] above the warn threshold of 30s` suggests
incorrectly that the task took 61 seconds. This commit adds the clarifying
words `which is`.
2019-02-04 17:44:00 +00:00
Andrey Ershov 7bc8bc9605
ensureGreen (#38324) 2019-02-04 16:36:04 +01:00
Jason Tedor 625d37a26a
Introduce retention lease background sync (#38262)
This commit introduces a background sync for retention leases. The idea
here is that we do a heavyweight sync when adding a new retention lease,
and then periodically we want to background sync any retention lease
renewals to the replicas. As long as the background sync interval is
significantly lower than the extended lifetime of a retention lease, it
is okay if from time to time a replica misses a sync (it will still have
an older version of the lease that is retaining more data as we assume
that renewals do not decrease the retaining sequence number). There are
two follow-ups that will come after this commit. The first is to address
the fact that we have not adapted the "should periodically flush" logic to
possibly flush the retention leases. We want to do something like flush
if we have not flushed in the last five minutes and there are renewed
retention leases since the last time that we flushed. An additional
follow-up will remove the syncing of retention leases when a retention
lease expires. Today this sync could be invoked in the background by a
merge operation. Rather, we will move the syncing of retention lease
expiration to be done under the background sync. The background sync
will use the heavyweight sync (write action) if a lease has expired, and
will use the lightweight background sync (replication action) otherwise.
2019-02-04 10:35:29 -05:00
Christoph Büscher 5ee7232379
Mute SpecificMasterNodesIT#testElectOnlyBetweenMasterNodes (#38334) 2019-02-04 16:10:06 +01:00
Christoph Büscher 715e581378
Mute DedicatedClusterSnapshotRestoreIT#testRestoreShrinkIndex (#38330) 2019-02-04 15:46:19 +01:00
Boaz Leskes e49b593c81
Move TokenService to seqno powered cas (#38311)
Relates #37872 
Relates #10708
2019-02-04 15:25:41 +01:00
Yannick Welsch ece8c659c5
Decrease leader and follower check timeout (#38298)
Reduces the leader and follower check timeout to 3 * 10 = 30s instead of 3 * 30 = 90s, with 30s still
being a very long time for a node to be completely unresponsive.
2019-02-04 15:11:12 +01:00
Przemyslaw Gomulka 9b64558efb
Migrating from joda to java.time. Watcher plugin (#35809)
Part of the work to migrate from Joda-Time: migrates the Watcher plugin to use JDK's java.time.

Refers to #27330
2019-02-04 15:08:31 +01:00
Alexander Reelsen 87f3579125
Add nanosecond field mapper (#37755)
This adds a dedicated field mapper that supports nanosecond resolution -
at the price of a reduced date range.

When using the date field mapper, the time is stored as milliseconds since the epoch
in a long in Lucene. This field mapper stores the time in nanoseconds
since the epoch, which means its range is much smaller, ranging roughly from
1970 to 2262.
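
The upper bound follows from storing nanoseconds in a signed 64-bit `long`, which the JDK can confirm directly. The class name is hypothetical.

```java
import java.time.Instant;

class NanosecondRange {
    // Largest instant representable as a signed 64-bit count of nanoseconds since the epoch.
    static Instant maxNanosInstant() {
        return Instant.ofEpochSecond(0, Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(maxNanosInstant()); // 2262-04-11T23:47:16.854775807Z
    }
}
```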

Note that aggregations will still be in milliseconds.
However, docvalue fields will have full nanosecond resolution.

Relates #27330
2019-02-04 11:31:16 +01:00
Christoph Büscher 15510da2af
Mute SharedClusterSnapshotRestoreIT#testAbortedSnapshotDuringInitDoesNotStart (#38304) 2019-02-04 10:41:35 +01:00
David Turner 1d82a6d9f9
Deprecate unused Zen1 settings (#38289)
Today the following settings in the `discovery.zen` namespace are still used:

- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`

This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
2019-02-04 08:52:08 +00:00
Armin Braun 4561f425db
Remove Redundant Loop in SnapshotShardsService (#38283)
* This was a merge mistake on my end, I think; we only need to loop over the shards once, not twice, to find those that we missed in the INIT state
2019-02-04 09:06:39 +01:00
Alpar Torok d58e899d45
Remove empty service files (#38192) 2019-02-04 10:05:04 +02:00
Jason Tedor d2cc1459a3
Fix ordering problem in add or renew lease test (#38280)
We have to set the primary term before we add a retention lease,
otherwise we cannot assert the correct primary term.
2019-02-03 12:54:31 -05:00
Christoph Büscher 6ca7a913ea
Mute ReplicationTrackerRetentionLeaseTests#testAddOrRenewRetentionLease (#38275) 2019-02-03 12:54:13 +01:00
Armin Braun 89d7c57bd9
Fix Incorrect Transport Response Handler Type (#38264)
* Fix Incorrect Transport Response Handler Type
* The response type here is not empty; it was always wrong, but this only became visible after 0a604e3b24 was introduced
   * As a result of 0a604e3b24 we started actually handling the response of this request and logging/handling exceptions; before that, we quietly dropped the `ClassCastException` here by using the empty response handler
* Fix busy assert not handling `Exception`
* Closes #38226
* Closes #38256
2019-02-03 08:48:15 +01:00
Nhat Nguyen 0861dc3581
Mute testCanRunUnsafeBootstrapAfterErroneousDetachWithoutLoosingMetaData (#38268)
Tracked at #38267
2019-02-02 20:02:21 -05:00
Christoph Büscher 50cdc61874
Mute DedicatedClusterSnapshotRestoreIT#testRestoreShrinkIndex (#38257) 2019-02-02 13:46:29 +01:00
David Turner c311062476
Add CoordinatorTests for empty unicast hosts list (#38209)
Today we have DiscoveryDisruptionIT tests for checking that discovery can still
work once the cluster has formed, even if the cluster is misconfigured and only
has a single master-eligible node in its unicast hosts list. In fact with Zen2
we can go one better: we do not need any nodes in the unicast hosts list,
because nodes also use the contents of the last-committed cluster state for
discovery. Additionally, the DiscoveryDisruptionIT tests were failing due to
the overenthusiastic fault-detection timeouts.

This commit replaces these tests with deterministic `CoordinatorTests` that
verify the same behaviour. It also removes some duplication by extracting a
test method called `testFollowerCheckerAfterMasterReelection()`.

Closes #37687
2019-02-02 07:54:56 +00:00
Nhat Nguyen 80d3092292
Fix primary term in testAddOrRenewRetentionLease (#38239)
We should increase the primary term before renewing leases; otherwise, the
term of the latest RetentionLeases will be lower than the current term.

Relates #37951
2019-02-02 02:38:53 -05:00
Nhat Nguyen 1ec04dff43
Fix testReplicaIgnoresOlderRetentionLeasesVersion (#38246)
If the innerLength is 0, the version won't be increased; then there will
be two RetentionLeases with the same term and version, but their leases
are different.

Relates #37951
Closes #38245
2019-02-02 02:37:37 -05:00
Nhat Nguyen 8bee5b8e06
Mute testAddOrRenewRetentionLease (#38240)
Relates #38239
2019-02-01 21:27:10 -05:00
Boaz Leskes f6e06a2b19 Adapt minimum versions for seq# powered operations in Watch related requests and UpdateRequest (#38231)
After backporting #37977, #37857 and #37872
2019-02-01 20:37:16 -05:00
Jason Tedor f181e17038
Introduce retention leases versioning (#37951)
Because concurrent sync requests from a primary to its replicas could be
in flight, it can be the case that an older retention leases collection
arrives and is processed on the replica after a newer retention leases
collection has arrived and been processed. Without a defense, in this
case the replica would overwrite the newer retention leases with the
older retention leases. This commit addresses this issue by introducing
a versioning scheme to retention leases. This versioning scheme is used
to resolve out-of-order processing on the replica. We persist this
version into Lucene and restore it on recovery. The encoding of
retention leases is starting to get a little ugly. We can consider
addressing this in a follow-up.
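
A minimal Python sketch of how a replica can use such a version to reject stale collections. It assumes, as other entries in this log describe (#38044, #38246), that collections are ordered by a (primary term, version) pair; the class and method names are hypothetical and are not the Elasticsearch API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetentionLeases:
    primary_term: int
    version: int
    leases: Tuple[str, ...] = ()

    def supersedes(self, other: "RetentionLeases") -> bool:
        # Lexicographic comparison: a higher primary term always wins;
        # within the same term, the higher version wins.
        return (self.primary_term, self.version) > (other.primary_term, other.version)


class ReplicaLeaseState:
    def __init__(self) -> None:
        self.current = RetentionLeases(primary_term=0, version=0)

    def on_sync(self, incoming: RetentionLeases) -> RetentionLeases:
        # Apply only collections newer than what we already hold, so a sync
        # request that arrives late cannot overwrite newer leases.
        if incoming.supersedes(self.current):
            self.current = incoming
        return self.current
```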
2019-02-01 17:19:19 -05:00
Nhat Nguyen 9c39dea7ae
AwaitsFix testAbortedSnapshotDuringInitDoesNotStart (#38227)
Tracked at #38226
2019-02-01 16:24:02 -05:00
Armin Braun 03a1d21070
SnapshotShardsService Simplifications (#38025)
* Instead of replacing the `shardSnapshots` field, we mutate it, explicitly removing entries from it in only a single spot
* Decreased the amount of indirection by moving all logic for starting a snapshot's newly discovered shard tasks into `startNewShards` (saves us two maps (keyed by snapshot) and iterations over them)
2019-02-01 20:46:14 +01:00
Luca Cavanna ee57420de6
Adjust SearchRequest version checks (#38181)
The finalReduce flag is now supported on 6.x too, hence we need to update the version checks in master.
2019-02-01 19:23:13 +01:00
Andrey Ershov 04dc41b99e
Zen2ify RareClusterStateIT (#38184)
In Zen1 there are both a commit timeout and a publish timeout, and these
settings could be changed on the fly.

In Zen2 there is only a commit timeout, and this setting is static.
RareClusterStateIT actively uses these settings and relies on the fact
that they are dynamic.

This commit adds a cancelCommitedPublication method to Coordinator, to
be used by tests. This method cancels the current committed publication,
if there is one.
When BlockClusterStateProcessing is active on the non-master node, the
publication is accepted and committed but not yet applied, so the
method above can be used to cancel it.

Also, this commit replaces callback + AtomicReference with ActionFuture,
which makes test code easier to read.
2019-02-01 18:18:11 +01:00
Yannick Welsch 025bf28405
Fix _host based require filters (#38173)
Using index.routing.allocation.require._host does not work correctly because the boolean logic in
filter matching is broken (DiscoveryNodeFilters.match(...) returns false) when
opType == OpType.AND.
2019-02-01 16:02:37 +01:00
Tanguy Leroux da6269b456
RestoreService should update primary terms when restoring shards of existing indices (#38177)
When restoring shards of existing indices, the RestoreService also
restores the values of primary terms stored in the snapshot index
metadata. The primary terms are not updated and could potentially
conflict with the current index primary terms if the restored primary terms
are lower than the existing ones.

This situation is likely to happen with replicated closed indices
(because primary terms are increased when the index transitions
from the open to the closed state, and the snapshotted primary terms are the
ones from the time the index was opened) (see #38024) and maybe also
with CCR.

This commit changes the RestoreService so that it updates the primary
terms using the maximum of the snapshotted values and
the existing values.
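
The per-shard rule is small enough to sketch directly. This is a Python illustration with a hypothetical function name, assuming the primary terms are given as one value per shard in shard order.

```python
def restored_primary_terms(snapshot_terms, current_terms):
    """Per-shard primary terms to use when restoring over an existing index."""
    if len(snapshot_terms) != len(current_terms):
        raise ValueError("snapshot and index have different shard counts")
    # Never move a shard's primary term backwards: take the larger of the two values.
    return [max(snapshotted, current) for snapshotted, current in zip(snapshot_terms, current_terms)]
```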

Related to #33888
2019-02-01 15:59:11 +01:00
Desmond Vehar c1c4abae10 Throw if two inner_hits have the same name (#37645)
This change throws an error if two inner_hits have the same name

Closes #37584
2019-02-01 15:53:50 +01:00
Alexander Reelsen 35ed137684
Ensure joda compatibility in custom date formats (#38171)
If custom date formats are used, there may be combinations that the new
performant DateFormatters.from() method does not cover yet. This adds a
few such corner cases and ensures the tests are correctly commented
out.
2019-02-01 15:42:56 +01:00
Jim Ferenczi 66e4fb4fb6
Do not compute cardinality if the `terms` execution mode does not use `global_ordinals` (#38169)
In #38158 we ensured that global ordinals are not loaded when another execution hint is explicitly set on the source. This change is a follow up that addresses a comment
dd6043c1c0 (r252984782) added after the merge.
2019-02-01 15:32:19 +01:00
Nhat Nguyen 2e475d63f7
Do not set timeout for IndexRequests in GatewayIndexStateIT (#38147)
CI might not be fast enough to publish a dynamic mapping update within 100ms.
2019-02-01 09:30:03 -05:00
Andrey Ershov c1270e97b0
Zen2ify testMasterFailoverDuringIndexingWithMappingChanges (#38178)
In Zen2, cluster bootstrapping is required and some parameters
have different names.
2019-02-01 15:24:08 +01:00
Andrey Ershov bda591453c
Add elasticsearch-node detach-cluster command (#37979)
This commit adds the second part of the `elasticsearch-node` tool: the
`detach-cluster` command, alongside the `unsafe-bootstrap` command.
This commit also changes the semantics of `unsafe-bootstrap`: it now
changes the clusterUUID.
The procedure for running the `elasticsearch-node` tool is therefore:
1) Stop all nodes in the cluster.
2) Pick the master-eligible node with the highest (term, version) pair and
run the `unsafe-bootstrap` command on it. If no master-eligible nodes
survived, skip this step.
3) Run the `detach-cluster` command on the remaining surviving nodes.
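
The node-selection part of steps 2 and 3 can be sketched in Python. This is only an illustration of the decision, not the tool itself; `NodeState` and `plan_recovery` are hypothetical names, and each node's (term, version) is assumed to be known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    name: str
    master_eligible: bool
    term: int
    version: int


def plan_recovery(surviving_nodes):
    """Return (node to run unsafe-bootstrap on, or None; nodes to run detach-cluster on)."""
    eligible = [n for n in surviving_nodes if n.master_eligible]
    # Highest (term, version) pair wins; None if no master-eligible node survived.
    bootstrap = max(eligible, key=lambda n: (n.term, n.version), default=None)
    detach = [n for n in surviving_nodes if n is not bootstrap]
    return bootstrap, detach
```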

Detach cluster makes the following changes to the node metadata:
1) Sets clusterUUID committed to false.
2) Sets currentTerm and term to 0. 
3) Removes voting tombstones and sets the voting configurations to the special
constant MUST_JOIN_ELECTED_MASTER, which prevents initial cluster
bootstrapping.

An abstract base class, `ElasticsearchNodeCommand`, is introduced because
`UnsafeBootstrapMasterCommand` and `DetachClusterCommand` have a lot in
common.
This commit also adds an "ordinal" parameter to both commands, because
otherwise it is impossible to write integration tests.
Special handling for the MUST_JOIN_ELECTED_MASTER case is introduced in
`ClusterFormationFailureHelper`.
Tests for both commands reside in `ElasticsearchNodeCommandIT` (renamed
from `UnsafeBootstrapMasterIT`).
2019-02-01 14:53:55 +01:00
Alexander Reelsen 979e5576e5
Add tests for fractional epoch parsing (#38162)
Fractional epoch parsing is supported; the tests we used covered edge cases
that did not make sense. This adds tests that properly check for it.
2019-02-01 14:48:37 +01:00
Tanguy Leroux 029e4b6278
Clear send behavior rule in CloseWhileRelocatingShardsIT (#38159)
The current CloseWhileRelocatingShardsIT test adds some "send behavior"
rules to a target node's mocked transport service in order to detect when shard
relocations start. These rules are never cleared, and they prevent the test from
completing normally after rebalancing is re-enabled.

This commit changes the test so that the rules are cleared and most verifications
are done before rebalancing is re-enabled.

Closes #38090
2019-02-01 12:58:46 +01:00
Yannick Welsch ce469cfda5
Fix testCorruptedIndex (#38161)
Folks at the Lucene project do not seem to be interested in classifying corruptions and
distinguishing them from file-system exceptions (see https://issues.apache.org/jira/browse/LUCENE-8525),
so we'll just cop out as well.

Closes #34322
2019-02-01 12:51:38 +01:00
Luca Cavanna e18cac3659
Add finalReduce flag to SearchRequest (#38104)
With #37000 we made sure that final reduction is automatically disabled
whenever a localClusterAlias is provided with a SearchRequest.

While working on #37838, we found a scenario where we do need to set a
localClusterAlias yet we would like to perform a final reduction in the
remote cluster: when searching on a single remote cluster.

Relates to #32125

This commit adds support for a separate finalReduce flag to
SearchRequest and makes use of it in TransportSearchAction in case we
are searching against a single remote cluster.

This also makes sure that num_reduce_phases is correct when searching
against a single remote cluster: it makes little sense to return
`num_reduce_phases` set to `2`, which looks especially weird in case
the search was performed against a single remote shard. We should
perform one reduction phase only in this case and `num_reduce_phases`
should reflect that.

* line length
2019-02-01 12:11:42 +01:00
Jim Ferenczi 6fa93ca493
Forbid negative field boosts in analyzed queries (#37930)
This change forbids negative field boosts in the `query_string`, `simple_query_string`
and `multi_match` queries.
Negative boosts are not allowed in Lucene 8 (scores must be positive).
The backport of this change to 6.x will turn the error into a deprecation warning
in order to raise awareness of this breaking change in 7.0.

Closes #33309
2019-02-01 11:41:40 +01:00
Jim Ferenczi 57b1d245e8
Remove AtomicFieldData#getLegacyFieldValues (#38087)
This function is unused now that we format the docvalue fields with the default
formatter on the field (#30831)
2019-02-01 11:41:17 +01:00
Andrey Ershov bfd618cf83
Universal cluster bootstrap method for tests with autoMinMasterNodes=false (#38038)
Currently, there are a few tests that use autoMinMasterNodes=false and
hence override addExtraClusterBootstrapSettings; mostly this is 10-30
lines of code that are copy-pasted from class to class.

This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex`,
which is suitable for all classes, so the copy-pasted code can be removed.

Removing code is always a good thing!
2019-02-01 11:34:31 +01:00
Jim Ferenczi b7308aa03c
Don't load global ordinals with the `map` execution_hint (#37833)
The terms aggregator loads the global ordinals to retrieve the cardinality of the field to aggregate on. This information is then used to select the strategy for the aggregation (breadth_first or depth_first). However, this should be avoided if the execution_hint is explicitly set to map, since this mode doesn't need the global ordinals. Since we still need the cardinality of the field, this change takes the maximum cardinality across the segments as an estimate of the total cardinality when selecting the strategy (breadth_first or depth_first). This estimate is only used if the execution hint is set to map; otherwise the global ordinals are still used to retrieve the accurate cardinality.
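
A Python sketch of why the per-segment maximum is a usable estimate: the shard-wide number of distinct terms is the size of the union of the segments' term sets, which can never be smaller than the largest segment's set, so the estimate is a lower bound. The threshold in `choose_collect_mode` is a hypothetical stand-in; the real selection logic may differ.

```python
def estimate_cardinality(segment_terms):
    """Largest per-segment distinct-term count: a lower bound on the shard-wide cardinality."""
    return max((len(terms) for terms in segment_terms), default=0)


def choose_collect_mode(cardinality, requested_size):
    # Hypothetical heuristic: defer sub-aggregations (breadth_first) when the
    # field has more distinct values than the number of buckets requested.
    return "breadth_first" if cardinality > requested_size else "depth_first"
```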

Closes #37705
2019-02-01 09:35:46 +01:00
David Turner 23f00e3676
Relax fault detector in some disruption tests (#38101)
Today we use `AbstractDisruptionTestCase` to test the behaviour of things like
master elections in the presence of cluster disruptions. These tests have
rather enthusiastic fault detection settings, detecting a fault if a single
ping fails, with a one-second timeout. Furthermore there are some tests that
assert the identity of the master remains unchanged during some disruption, and
these assertions fail rather often thanks to the overly sensitive fault
detector.

However in a number of these tests the fault detector need not be this
sensitive. This commit moves some such tests into their own test suite and uses
more sensible fault-detection settings to avoid the kind of master instability
that is causing CI failures.

Closes #37699
2019-02-01 08:10:49 +00:00
Alexander Reelsen c02cd3e2fd
Fix java time epoch date formatters (#37829)
The custom epoch date formatters were not able to format
an Instant to a string due to a misconfiguration.

This fix also removes a runtime behaviour, present until now under Java
8, affecting the names of the aggregation buckets; these names are now the same
as before, and as they have been under Java 11.
2019-02-01 09:03:48 +01:00
Yannick Welsch 859e2f5bc8 Adapt timeouts in UpdateMappingIntegrationIT
Relates to #37263 and possibly #36916
2019-02-01 08:58:31 +01:00
Adrien Grand d83c748417
Fix test bug in DynamicMappingsIT. (#37906)
Closes #37898
2019-02-01 08:35:29 +01:00
Przemyslaw Gomulka 2758578570
Trim the JSON source in indexing slow logs (#38081)
A '{' as the first character of a log line causes problems for Beats when parsing plaintext logs. This can happen if the submitted document has an additional '\n' at the beginning and we are not reformatting it.
Trimming the source part of a SlowLog solves that and keeps the logs readable.

closes #38080
2019-02-01 08:12:12 +01:00
Armin Braun 0a604e3b24
Fix Two Races that Lead to Stuck Snapshots (#37686)
* Fixes two broken spots:
    1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
    2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
  * Two new test runs:
      1. Reproducing the effects of node disconnects/restarts in isolation
      2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265 
* Closes #32348
2019-02-01 05:45:40 +01:00
Nhat Nguyen b8b843476d
Disable dynamic mapping in testSimpleGetFieldMappingsWithDefaults (#38045)
Since #31140 we no longer require acking on the dynamic mapping of index
requests. Thus, a returned mapping from a get mapping request does not
necessarily contain the dynamic updates from the index request. This
commit replaces the dynamic mapping update with a manual put mapping.

Relates #31140
Closes #37928
2019-01-31 21:01:41 -05:00
Nhat Nguyen a8ebe2a217
Fix random params in testSoftDeletesRetentionLock (#38114)
Since #37992 the retainingSequenceNumber is initialized with 0 
while the global checkpoint can be -1.

Relates #37992
2019-01-31 20:50:41 -05:00
Lee Hinman c67a9663af
Fix MasterServiceTests.testClusterStateUpdateLogging (#38116)
This changes the test to not use a `CountDownLatch`, instead adding an assertion
for the final logging message and waiting until the `MockAppender` has seen it
before proceeding.

Related to df2c06f6f30f7e23a6863a3f72fc3bdb7648885c
Resolves #23739
2019-01-31 17:13:19 -07:00
Yuri Astrakhan f3cde06a1d
geotile_grid implementation (#37842)
Implements `geotile_grid` aggregation

This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240

This code uses the same base classes as `geohash_grid` agg, but uses a different hashing
algorithm to allow zoom consistency.  Each grid bucket is aligned to Web Mercator tiles.
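
A Python sketch of the standard Web Mercator ("slippy map") tile formula, assuming the buckets follow it; the `zoom/x/y` key format used here is illustrative. Zoom consistency means a tile's parent at zoom z-1 is simply (x >> 1, y >> 1), so buckets at adjacent zoom levels nest exactly.

```python
import math


def geotile(lon, lat, zoom):
    """Web Mercator tile (x, y) containing the point at the given zoom level."""
    tiles = 1 << zoom  # number of tiles along each axis
    x = math.floor((lon + 180.0) / 360.0 * tiles)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * tiles)
    # Clamp edge cases: lon == 180, or latitudes beyond Web Mercator's ~85.05 degree limit.
    return min(max(x, 0), tiles - 1), min(max(y, 0), tiles - 1)


def geotile_key(lon, lat, zoom):
    x, y = geotile(lon, lat, zoom)
    return f"{zoom}/{x}/{y}"
```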
2019-01-31 19:11:30 -05:00
Pascal Christoph a3d9ba3f4b Log document id when MapperParsingException occurs (#37800)
Closes #37658
2019-01-31 16:33:13 -05:00
Nhat Nguyen 237fcda2cc
Disable dynamic mapping update in testTransportBulkTasks (#38073)
If a replica does not have the right mapping yet, we will retry the index
request on that replica; then the actual number of tasks is higher than the
expected number. Since #31140 this happens more frequently because we no
longer require acking on the dynamic mapping of index requests.

Relates #31140
Closes #37893
2019-01-31 13:16:52 -05:00
Przemyslaw Gomulka 28b5c7ce78
Do not set up NodeAndClusterIdStateListener in test (#38110)
When tests extending ESIntegTestCase are run in the same JVM, the static field in
NodeAndClusterIdConverter will throw an AlreadySet exception.
Overriding the configuration method Node.configureNodeAndClusterIdStateListener in MockNode prevents the listener from being registered.
Relates #32850
2019-01-31 18:59:40 +01:00
Nhat Nguyen 8e95780f98
Soft-deletes policy should always fetch latest leases (#37940)
If a new retention lease is added while a primary's soft-deletes policy
is locked for peer-recovery, that lease won't be baked into the Lucene
commit.

Relates #37165
Relates #37375
2019-01-31 12:02:57 -05:00
Henning Andersen 68ed72b923
Handle scheduler exceptions (#38014)
Scheduler.schedule(...) previously assumed that the caller would handle
exceptions by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
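
A Python analogue of this contract, built on the standard `sched` module rather than the Elasticsearch Scheduler: the returned handle can only cancel the task and exposes no result or exception, so the task is wrapped to log any exception as a warning. `schedule` and `ScheduledCancellable` here are illustrative names.

```python
import logging
import sched
import time

logger = logging.getLogger("scheduler-sketch")


class ScheduledCancellable:
    """Handle that can cancel a pending task but deliberately exposes no result or exception."""

    def __init__(self, scheduler, event):
        self._scheduler = scheduler
        self._event = event

    def cancel(self):
        try:
            self._scheduler.cancel(self._event)
            return True
        except ValueError:  # the task already ran or was already cancelled
            return False


def schedule(scheduler, delay_seconds, task):
    def run_and_log():
        try:
            task()
        except Exception:
            # Nobody can observe the failure through the handle, so log it instead.
            logger.warning("failed to run scheduled task %r", task, exc_info=True)

    event = scheduler.enter(delay_seconds, 1, run_and_log)
    return ScheduledCancellable(scheduler, event)
```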

This is a continuation of #28667, #36137 and also fixes #37708.
2019-01-31 17:51:45 +01:00
David Turner 7f738e8541
Minor logging improvements (#38084)
Fixes some log messages that caused some minor confusion when digging through a
log generated by a failing test.
2019-01-31 16:41:04 +00:00
Tal Levy 9923f0fe6a
fix a few versionAdded values in ElasticsearchExceptions (#37877)
TooManyBucketsException was introduced in v6.2
and SnapshotInProgressException was introduced in v6.7
2019-01-31 08:28:20 -08:00
Tanguy Leroux 7a597cad0d
Reenable BWC tests after backport of #37899 (#38093)
This commit adapts the version used in StartedShardEntry serialization
after the backport of #37899 and reenables BWC tests.

Related to #37899
Related to #38074
2019-01-31 16:53:28 +01:00
Henning Andersen 7487be3d3c Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock 2019-01-31 15:31:01 +01:00
Jason Tedor a9b12b38f0
Push primary term to replication tracker (#38044)
This commit pushes the primary term into the replication tracker. This
is a precursor to using the primary term to resolve ordering problems
for retention leases. Namely, out-of-order retention lease sync requests
can arrive on a replica. To resolve this, we need a tuple of
(primary term, version). For this to be possible, the primary term needs
to be accessible in the replication tracker. As the primary term is part
of the replication group anyway, this change conceptually makes sense.
2019-01-31 09:19:49 -05:00
Luca Cavanna 622fb7883b
Introduce ability to minimize round-trips in CCS (#37828)
With #37566 we introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested from each cluster, and the reduce phase in each cluster is non-final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters.

This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API now accepts a new parameter called ccs_minimize_roundtrips that allows opting out of the default behaviour.
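
A Python sketch of the hit-merging part of the final reduction, assuming hits are (id, score) pairs sorted by descending score. Each cluster returns its best from + size hits, which guarantees the global page is contained in the union; the coordinating node then merges the sorted lists and applies from/size exactly once. Aggregation reduction is not modelled.

```python
import heapq
from itertools import islice


def cluster_top_hits(hits, from_, size):
    """What each remote cluster returns: its best from_ + size hits, best first."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[: from_ + size]


def final_reduce(per_cluster_hits, from_, size):
    """Coordinating node: merge the already-sorted lists and apply from/size once."""
    merged = heapq.merge(*per_cluster_hits, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, from_, from_ + size))
```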

Relates to #32125
2019-01-31 15:12:14 +01:00
Armin Braun ae9f4df361
Don't Assert Ack on when Publish Timeout is 0 in Test (#38077)
* Publish timeout is set to `0` so out of order processing of states on the node can lead to a `false` ack response
  * See #30672
* Closes #36813
2019-01-31 14:35:11 +01:00
Alexander Reelsen 9f026bb8ad
Reduce object creation in Rounding class (#38061)
This reduces object creation in the Rounding class (used by aggs) by
creating the objects only once. Furthermore, a few unneeded ZonedDateTime objects
were created only in order to create other objects out of them. This was
changed as well.

Running the benchmarks shows much faster performance for all of the
java time based Rounding classes.
2019-01-31 14:18:28 +01:00
Adrien Grand a536fa7755
Treat put-mapping calls with `_doc` as a top-level key as typed calls. (#38032)
Currently the put-mapping API assumes that because the type name is `_doc` then
it is dealing with a typeless put-mapping call. Yet we still allow running the
put-mapping API in a typed fashion with `_doc` as a type name. The current logic
triggers surprising errors when doing a typed put-mapping call with `_doc` as a
type name on an index that has a type already.

This is a bit of a corner-case, but is more important on 6.x due to the fact
that using the index API with `_doc` as a type name triggers typed calls to the
put-mapping API with `_doc` as a type name.
2019-01-31 13:57:42 +01:00
David Turner eadcb5f0f8
Fix size of rolling-upgrade bootstrap config (#38031)
Zen2 nodes will bootstrap themselves once they believe there to be no remaining
Zen1 master-eligible nodes in the cluster, as long as minimum_master_nodes is
satisfied.

Today the bootstrap configuration comprises just the ids of the known
master-eligible nodes, and this might be too small to be safe. For instance, if
there are 5 master-eligible nodes (so that minimum_master_nodes is 3) then the
bootstrap configuration could comprise just 3 nodes, of which 2 form a quorum,
and this does not intersect other quorums that might arise, leading to a
split-brain.

This commit fixes this by expanding the bootstrap configuration so that its
quorums satisfy minimum_master_nodes, by adding some of the IDs of the other
master-eligible nodes in the last-published cluster state.
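
A brute-force Python check of the split-brain argument above, with 5 master-eligible nodes and minimum_master_nodes of 3. `expand_config` adds other known nodes until a majority of the configuration needs at least minimum_master_nodes votes; which extra nodes are chosen here (sorted order) is an assumption for illustration.

```python
from itertools import combinations


def majority(config):
    return len(config) // 2 + 1


def quorums_can_be_disjoint(config, all_nodes, minimum_master_nodes):
    """True if some majority of `config` shares no node with some set of
    `minimum_master_nodes` other master-eligible nodes (a split-brain risk)."""
    for config_quorum in combinations(sorted(config), majority(config)):
        others = set(all_nodes) - set(config_quorum)
        if len(others) >= minimum_master_nodes:
            return True
    return False


def expand_config(config, all_nodes, minimum_master_nodes):
    """Add other known master-eligible nodes until a config majority is at least minimum_master_nodes."""
    expanded = set(config)
    for node in sorted(set(all_nodes) - expanded):
        if majority(expanded) >= minimum_master_nodes:
            break
        expanded.add(node)
    return expanded
```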
2019-01-31 08:00:11 +00:00
Alexander Reelsen b94acb608b
Speed up converting of temporal accessor to zoned date time (#37915)
The existing implementation was slow due to exceptions being thrown if
an accessor did not have a time zone. This implementation queries whether
the accessor has a time zone, a local time and a local date, and also checks
for an instant, so that no exception is thrown, which speeds up the conversion.

This removes the existing method and creates a new one named
DateFormatters.from(TemporalAccessor accessor) to resemble the naming of
the java time ones.

Before this change an epoch millis parser using the toZonedDateTime
method took approximately 50x longer.

Relates #37826
2019-01-31 08:55:40 +01:00
Alexander Reelsen 160d1bd4dd
Work around JDK8 timezone bug in tests (#37968)
The timezone GMT0 cannot be properly parsed on Java 8.
The randomZone() method now excludes GMT0 if Java 8 is used.

Closes #37814
2019-01-31 08:52:35 +01:00
Nhat Nguyen f5398d6511 Mute testRetentionLeasesSyncOnExpiration
Tracked at #37963
2019-01-31 00:57:27 -05:00
Jason Tedor a6a534f1f0
Reenable BWC testing after retention lease stats (#38062)
This commit adjusts the BWC version on retention leases in stats and,
with this, also reenables BWC testing.
2019-01-30 20:34:27 -05:00
Tim Brooks b88bdfe958
Add dispatching to `HandledTransportAction` (#38050)
This commit allows implementors of the `HandledTransportAction` to
specify what thread the action should be executed on. The motivation for
this commit is that certain CCR requests should be performed on the
generic threadpool.
2019-01-30 15:40:49 -07:00
Michael Basnight 945ad05d54
Update verify repository to allow unknown fields (#37619)
The subparser in verify repository allows for unknown fields. This
commit sets the value to true for the parser and modifies the test such
that it accurately tests it.

Relates #36938
2019-01-30 14:31:16 -06:00
David Turner 81c443c9de
Deprecate minimum_master_nodes (#37868)
Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in
tests, but for 7.x nodes this setting is not required as it has no effect.
This commit removes this setting so that nodes are started with more realistic
configurations, and deprecates it.
2019-01-30 20:09:15 +00:00
Armin Braun a070b8acc0
Extract TransportRequestDeduplication from ShardStateAction (#37870)
* Extracted the logic for master request deduplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates #37686
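
A minimal Python sketch of the deduplication idea: while a request is in flight, identical requests do not trigger another send; their listeners are queued and all of them are notified with the single response. `RequestDeduplicator` and the `send(request, callback)` shape are hypothetical, not the Elasticsearch class.

```python
import threading


class RequestDeduplicator:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # request -> listeners waiting for its response

    def execute_once(self, request, listener, send):
        with self._lock:
            waiting = self._in_flight.get(request)
            if waiting is not None:
                # An identical request is already in flight: just wait for its result.
                waiting.append(listener)
                return
            self._in_flight[request] = [listener]

        def on_response(result):
            with self._lock:
                listeners = self._in_flight.pop(request)
            for waiting_listener in listeners:
                waiting_listener(result)

        send(request, on_response)
```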
2019-01-30 19:21:09 +01:00
Jason Tedor 6500b0cbd7
Expose retention leases in shard stats (#37991)
This commit exposes retention leases via shard-level stats.
2019-01-30 13:20:40 -05:00
Jason Tedor c468b2f7ca
Make primary terms fields private in index shard (#38036)
This commit encapsulates the primary terms fields in index shard. This
is a precursor to pushing the operation primary term down to the
replication tracker.
2019-01-30 12:56:58 -05:00
Nhat Nguyen ed460c2815 Log flush_stats and commit_stats in testMaybeFlush
This test failed a few times over the last several months. It seems that
we triggered a flush, but CI was too slow to finish it in several
seconds. I added the flush stats and commit stats and unmuted this test.
We should have a good clue if this test fails again.

Relates #37896
2019-01-30 12:54:31 -05:00
Christoph Büscher ecbaa38864
Remove deprecated Plugin#onModule extension points (#37866)
Removes some Guice index-level extension points that have been marked as @Deprecated since at
least 6.0. They served as a signpost for plugin authors upgrading from 2.x, but
this shouldn't be relevant in 7.0 anymore.
2019-01-30 17:17:54 +01:00
Igor Motov 23805fa41a
Geo: Fix Empty Geometry Collection Handling (#37978)
Fixes handling empty geometry collection and re-enables
testParseGeometryCollection test.

Fixes #37894
2019-01-30 09:20:30 -05:00
Luca Cavanna b91d587275
Move SearchHit and SearchHits to Writeable (#37931)
This made it possible to make SearchHits immutable, although quite a few fields in
SearchHit unfortunately have to stay mutable.

Relates to #34389
2019-01-30 12:05:54 +01:00
Jason Tedor ba285a56a7
Fix limit on retaining sequence number (#37992)
We only assign non-negative sequence numbers to operations, so the lower
limit on retaining sequence numbers should be that it is non-negative
only.
2019-01-30 05:25:17 -05:00
Alexander Reelsen 9ec4abc31e
Ensure date parsing BWC compatibility (#37929)
In order to retain BWC, this changes the java date formatters so that they can
parse nanosecond resolution, even if only milliseconds are supported.
This used to work with Joda-Time as well, so that a user could store a date
like `2018-10-03T14:42:44.613469+0000` and simply lose the precision
below the millisecond level.
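
A Python sketch of the behaviour described: accept up to nine fractional digits but keep only whole milliseconds. Python's `%f` accepts at most six digits, so the sketch first drops sub-microsecond digits; this is an illustration of the idea, not the Elasticsearch formatter.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
_FRACTION = re.compile(r"\.(\d{1,9})")


def parse_to_epoch_millis(text):
    # strptime's %f takes at most 6 digits, so drop any sub-microsecond digits first.
    text = _FRACTION.sub(lambda m: "." + m.group(1)[:6], text, count=1)
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")
    # Floor to whole milliseconds: anything below 1 ms is silently dropped.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```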
2019-01-30 10:47:12 +01:00
Adrien Grand c8af0f4bfa
Use mappings to format doc-value fields by default. (#30831)
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.

This deprecates the special `use_field_mapping` docvalue format which was added
in #29639 only to ease the transition to 7.x and it is not necessary anymore in
7.0.
2019-01-30 10:31:51 +01:00
Adrien Grand b63b50b945
Give precedence to index creation when mixing typed templates with typeless index creation and vice-versa. (#37871)
Currently if you mix typed templates and typeless index creation or typeless
templates and typed index creation then you will end up with an error because
Elasticsearch tries to create an index that has multiple types: `_doc` and
the explicit type name that you used.

This commit proposes to give precedence to the index creation call so that
the type from the template will be ignored if the index creation call is
typeless while the template is typed, and the type from the index creation
call will be used if there is a typeless template.

This is consistent with the fact that index creation already "wins" if a field
is defined differently in the index creation call and in a template: the
definition from the index creation call is used in such cases.

Closes #37773
2019-01-30 10:28:24 +01:00
Jim Ferenczi 2732bb5cf3
Fix fetch source option in expand search phase (#37908)
This change fixes the copy of the fetch source option into the
expand search request that is used to retrieve the documents of each
collapsed group.

Closes #23829
2019-01-30 08:46:14 +01:00
Jim Ferenczi 5dcc805dc9
Restore a noop _all metadata field for 6x indices (#37808)
This commit restores a noop version of the AllFieldMapper that is instantiated only
for indices created in 6.x. We need this metadata field mapper to be present in this version
in order to allow the upgrade of indices that explicitly disable _all (enabled: false).
The mapping of these indices contains a reference to the _all field that we cannot remove
in 7, so we'll need to keep this metadata mapper in 7.x. Since indices created in 6.x will not
be compatible with 8, we'll remove this noop mapper in the next major version.

Closes #37429
2019-01-30 08:45:50 +01:00
Marios Trivyzas f5b9b4d89c Add version 6.6.1 (#37975) 2019-01-30 15:33:01 +11:00
markharwood b889221f75
Types removal - deprecate include_type_name with index templates (#37484)
Added deprecation warnings for use of include_type_name in put/get index templates.
HLRC changes:
GetIndexTemplateRequest has a new client-side class which is a copy of server's GetIndexTemplateResponse but modified to be typeless.
PutIndexTemplateRequest has a new client-side counterpart which doesn't use types in the mappings
Relates to #35190
2019-01-29 20:52:41 +00:00
jimczi 193017672a Handle completion suggestion without contexts
This change fixes the handling of completion suggestion without contexts.

Relates #36996
2019-01-29 20:31:46 +01:00
Tim Brooks 00ace369af
Use `CcrRepository` to init follower index (#35719)
This commit modifies the put follow index action to use a
CcrRepository when creating a follower index. It routes 
the logic through the snapshot/restore process. A 
wait_for_active_shards parameter can be used to configure
how long to wait before returning the response.
2019-01-29 11:47:29 -07:00
Albert Zaharovits d05a4b9d14
Get Aliases with wildcard exclusion expression (#34230)
This commit adds the code in the HTTP layer that will parse exclusion wildcard
expressions.
The existing code issues 404s for wildcards as well as explicit indices.
However, in an expression where exclude wildcards (-...*) follow include
wildcards, there is in general no way to tell whether an include wildcard
produced no results or its results were subsequently excluded.
Therefore, this change breaks the behavior of returning 404s for
wildcards. Specifically, no 404s will be returned for wildcards, even
if they are not followed by exclude wildcards or the exclude wildcards
could not possibly exclude what has previously been included.
Only explicitly requested aliases will be called out as missing.
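To illustrate the new semantics, here is a minimal sketch, assuming a simple glob syntax where only `*` is special. `AliasExpressionResolver`, `matches` and `resolve` are hypothetical names, not the Elasticsearch implementation: expressions are applied left to right, `-pattern` removes earlier matches, and only explicit names can end up in the missing set.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical resolver: illustrates the semantics only, not the actual Elasticsearch code.
final class AliasExpressionResolver {

    // Simple glob matcher: '*' matches any run of characters, everything else is literal.
    static boolean matches(String glob, String name) {
        String regex = Arrays.stream(glob.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return name.matches(regex);
    }

    // Resolves expressions left to right. "-pattern" removes matches from the
    // result built so far; wildcards add matches; plain names add one alias.
    static Set<String> resolve(List<String> expressions, Set<String> aliases, Set<String> missing) {
        Set<String> result = new LinkedHashSet<>();
        for (String expression : expressions) {
            if (expression.startsWith("-")) {
                String pattern = expression.substring(1);
                result.removeIf(alias -> matches(pattern, alias));
            } else if (expression.contains("*")) {
                // A wildcard that matches nothing is not an error: after later
                // exclusions we could not tell "no match" from "excluded" anyway.
                for (String alias : aliases) {
                    if (matches(expression, alias)) {
                        result.add(alias);
                    }
                }
            } else if (aliases.contains(expression)) {
                result.add(expression);
            } else {
                // Only explicitly requested aliases are reported as missing (404).
                missing.add(expression);
            }
        }
        return result;
    }
}
```

For example, `logs-*,-logs-*` resolves to an empty set without any missing entry, while an unknown explicit name is still reported.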
2019-01-29 18:56:20 +02:00
Boaz Leskes 218df3009a
Move update and delete by query to use seq# for optimistic concurrency control (#37857)
The delete and update by query APIs both offer protection against overwriting concurrent user changes to the documents they touch. They currently use internal versioning. This PR changes that to rely on sequence numbers and primary terms.
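A minimal, single-node sketch of seq#/primary-term based optimistic concurrency control, assuming one fixed primary term and a global sequence counter. `VersionedStore` and `indexIfMatch` are hypothetical names; the conditional write plays the role of the `if_seq_no`/`if_primary_term` check, and a `false` result stands in for a version conflict.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store illustrating seq#/primary-term based
// optimistic concurrency control (not the Elasticsearch engine).
final class VersionedStore {

    static final class Doc {
        final String source;
        final long seqNo;
        final long primaryTerm;

        Doc(String source, long seqNo, long primaryTerm) {
            this.source = source;
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();
    private final long primaryTerm;
    private long nextSeqNo = 0;

    VersionedStore(long primaryTerm) {
        this.primaryTerm = primaryTerm;
    }

    synchronized Doc get(String id) {
        return docs.get(id);
    }

    // Unconditional write: every operation gets the next sequence number.
    synchronized Doc index(String id, String source) {
        Doc doc = new Doc(source, nextSeqNo++, primaryTerm);
        docs.put(id, doc);
        return doc;
    }

    // Compare-and-set write: succeeds only if the document still carries the
    // seq# and primary term the caller read; otherwise it is a conflict.
    synchronized boolean indexIfMatch(String id, String source, long ifSeqNo, long ifPrimaryTerm) {
        Doc current = docs.get(id);
        if (current == null || current.seqNo != ifSeqNo || current.primaryTerm != ifPrimaryTerm) {
            return false;
        }
        index(id, source);
        return true;
    }
}
```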

Relates #37639 
Relates #36148 
Relates #10708
2019-01-29 10:23:05 -05:00
Yannick Welsch 3c9f7031b9
Enforce cluster UUIDs (#37775)
This commit adds join validation around cluster UUIDs, preventing a node from joining a cluster if it was
previously part of another cluster. The commit introduces a new flag to the cluster state,
clusterUUIDCommitted, which denotes whether the node has locked into a cluster with the given
UUID. Once the cluster UUID is committed, this flag turns to true, and subsequent cluster state updates
keep the information about the commit. Note that coordinating-only nodes are still free to switch
clusters at will (after a restart), as they don't carry any persistent state.
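A minimal sketch of the join check, assuming the relevant state is reduced to the local cluster UUID, the committed flag and the UUID of the cluster being joined. `ClusterUuidJoinValidator` is a hypothetical name. A coordinating-only node restarts with no committed UUID, so this check never rejects it.

```java
// Hypothetical join check: once a node has committed to a cluster UUID, it
// refuses to join a cluster with a different UUID.
final class ClusterUuidJoinValidator {
    static void validateJoin(String localClusterUuid, boolean localUuidCommitted, String joiningClusterUuid) {
        if (localUuidCommitted && !localClusterUuid.equals(joiningClusterUuid)) {
            throw new IllegalStateException("this node previously joined cluster [" + localClusterUuid
                    + "] and cannot join a different cluster [" + joiningClusterUuid + "]");
        }
    }
}
```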
2019-01-29 15:41:05 +01:00
Luca Cavanna 09a11a34ef
Remove clusterAlias instance member from QueryShardContext (#37923)
The clusterAlias member is only used in the copy constructor, to be able
to reconstruct the fully qualified index. It is also possible to remove
the instance member and add a private constructor that accepts the already built Index object which contains the cluster alias.
2019-01-29 15:31:49 +01:00
Boaz Leskes 65a9b61a91
Add Seq# based optimistic concurrency control to UpdateRequest (#37872)
The update request has lesser-known support for a one-off update of a known document version. This PR adds a seq#-based alternative to power these operations.

Relates #36148 
Relates #10708
2019-01-29 09:18:05 -05:00
Tanguy Leroux 5d1964bcbf
Ignore shard started requests when primary term does not match (#37899)
This commit changes the StartedShardEntry so that it also contains the
primary term of the shard to start. This way the master node can also
check that the primary term from the start request is equal to the current
shard's primary term in the cluster state, and it can ignore any shard
started request that concerns a previous instance of the shard that
would have been allocated to the same node.

Such situations are likely to happen with frozen (or restored) indices and
the replication of closed indices, because with replicated closed indices
the shards will be initialized again after the index is closed and can
potentially be re-initialized again if the index is reopened as a frozen
index. In such cases the lifecycle of the shards would be something like:
* shard is STARTED
* index is closed
* shards are INITIALIZING (index state is CLOSED, primary term is X)
* index is reopened
* shards are INITIALIZING again (index state is OPENED, potentially frozen,
primary term is X+1)

Adding the primary term to the shard started request allows the master node
to discard potential StartedShardEntry requests that concern the shard with
primary term X, because the shard has been moved/reinitialized in the
meantime under the primary term X+1.
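The master-side check can be sketched as follows, assuming the shard's routing state is reduced to an "initializing" flag and its current primary term. `ShardStartedFilter` and its nested `StartedShardEntry` are simplified, hypothetical stand-ins for the real classes.

```java
// Hypothetical, simplified master-side filter for shard-started requests.
final class ShardStartedFilter {

    // Minimal stand-in for a shard-started request carrying the primary term.
    static final class StartedShardEntry {
        final String shardId;
        final long primaryTerm;

        StartedShardEntry(String shardId, long primaryTerm) {
            this.shardId = shardId;
            this.primaryTerm = primaryTerm;
        }
    }

    // Apply the request only if the shard is still initializing under the same
    // primary term; a request from an earlier incarnation (term X) is ignored
    // once the shard has been re-initialized under term X+1.
    static boolean shouldApply(StartedShardEntry entry, boolean shardInitializing, long currentPrimaryTerm) {
        return shardInitializing && entry.primaryTerm == currentPrimaryTerm;
    }
}
```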

Relates to #33888
2019-01-29 15:09:40 +01:00
Luca Cavanna 2325fb9cb3
Remove test only SearchShardTarget constructor (#37912)
Remove SearchShardTarget test only constructor and replace all the usages with calls to the other constructor that accepts a ShardId.
2019-01-29 14:58:11 +01:00
Luca Cavanna 42eec55837
Replace failure.get().addSuppressed with failure.accumulateAndGet() (#37649)
Also add a test for concurrent incoming failures
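A sketch of the `accumulateAndGet` pattern using only `java.util.concurrent`; `FailureCollector` is a hypothetical name. The accumulator keeps the first failure and attaches later ones as suppressed exceptions. Its only side effect happens once the reference is already set, and because the reference never changes after that, the function is not re-applied after the side effect has run.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical collector of concurrent failures: the first exception wins and
// later ones are attached to it as suppressed exceptions.
final class FailureCollector {
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    void onFailure(Exception e) {
        failure.accumulateAndGet(e, (current, update) -> {
            if (current == null) {
                return update; // first failure: no side effect, so a retry is harmless
            }
            if (current != update) {
                current.addSuppressed(update); // Throwable#addSuppressed is synchronized
            }
            return current; // the stored reference never changes once set
        });
    }

    Exception get() {
        return failure.get();
    }
}
```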
2019-01-29 14:57:33 +01:00
Luca Cavanna a6d4838a67
Clean up allowPartialSearchResults serialization (#37911)
When serializing allowPartialSearchResults to the shards through ShardSearchTransportRequest, we use an optional boolean field, though
the corresponding instance member is declared `boolean` which can never
be null. We also have an assert to verify that the incoming search
request provides a non-null value for the flag, and a comment explaining
that null should be considered a bug.

This commit makes the allowPartialSearchResults method in
ShardSearchRequest return a `boolean` rather than a `Boolean` and
changes the serialization from optional to non-optional, in a backwards-compatible manner.
2019-01-29 14:56:22 +01:00
Tanguy Leroux 460f10ce60
Close Index API should force a flush if a sync is needed (#37961)
This commit changes the TransportVerifyShardBeforeCloseAction so that it issues a
forced flush, forcing the translog and the Lucene commit to contain the same max seq
number and global checkpoint in case the translog contains operations that were
not written in the IndexWriter (like a delete that touches a non-existing document). This way
the assertion added in #37426 won't trip.

Related to #33888
2019-01-29 13:15:58 +01:00
Yannick Welsch 504a89feaf
Step down as master when configured out of voting configuration (#37802)
Abdicates to another master-eligible node once the active master is reconfigured out of the voting
configuration, for example through the use of voting configuration exclusions.

Follow-up to #37712
2019-01-29 12:43:04 +01:00
Yannick Welsch 827c4f6567 Make Version.java aware of 6.x Lucene upgrade
Relates to #37913
2019-01-29 10:44:01 +01:00
Przemyslaw Gomulka 891320f5ac
Elasticsearch support to JSON logging (#36833)
In order to support JSON log format, a custom pattern layout was used and its configuration is enclosed in ESJsonLayout. Users are free to use their own patterns, but if smooth Beats integration is needed, they should use ESJsonLayout. EvilLoggerTests are left intact to make sure users' custom log patterns work fine.

To populate the additional fields node.id and cluster.uuid, which are not available at start time,
a cluster state update has to be received and the values passed to the log4j pattern converter.
A ClusterStateObserver.Listener is used to receive only one cluster state update. Once the update is received, the nodeId and clusterUUID are set in a static field in NodeAndClusterIdConverter.

The following fields are expected in JSON log lines: type, timestamp, level, component, cluster.name, node.name, node.id, cluster.uuid, message, stacktrace.
See ESJsonLayout.java for more details and field descriptions.

The Docker log4j2 configuration is now almost the same as the one used for the ES binary.
The only difference is that Docker uses console appenders, whereas ES uses file appenders.

relates: #32850
2019-01-29 07:20:09 +01:00
Nhat Nguyen 9ceb218d85 Adjust bwc version for put mapping requests
Relates #37675
2019-01-28 10:57:11 -05:00
Armin Braun 0d109396fa
Increase Timeout in #testSnapshotCanceled (#37890)
* The test failure reported in the issue looks like a mere timeout. Logging suggests that the snapshot completes/aborts correctly but the busy
loop polling the snapshot state times out too early.
* Closes #37888
2019-01-28 14:13:02 +01:00
Luca Cavanna a9adc16922 Mute failing SearchQueryIT test
Relates to #37814
2019-01-28 13:41:13 +01:00
Alpar Torok 64b98db973
Add an alias for :server:integTest so it runs as part of internalClusterTest (#37910) 2019-01-28 14:26:22 +02:00
Jason Tedor 194cdfe208
Sync retention leases on expiration (#37902)
This commit introduces a sync of retention leases when a retention lease
expires. As expiration of retention leases is lazy, their expiration is
managed only when getting the current retention leases from the
replication tracker. At this point, we call back to our full retention
lease sync to sync and flush these on all shard copies. With this
change, replicas do not locally manage expiration of retention leases;
instead, that is done only on the primary.
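The lazy, primary-only expiration can be sketched as below, assuming an injectable millisecond clock and a `Runnable` standing in for the full retention lease sync. `LeaseTracker` and its methods are hypothetical; for brevity the sync callback runs while the lock is held.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical primary-side lease tracker: expiration is lazy and happens only
// when the current leases are read; if anything expired, a sync is triggered.
final class LeaseTracker {
    private final Map<String, Long> timestamps = new HashMap<>();
    private final long retentionMillis;
    private final LongSupplier clock;
    private final Runnable syncToReplicas; // stand-in for the full retention lease sync

    LeaseTracker(long retentionMillis, LongSupplier clock, Runnable syncToReplicas) {
        this.retentionMillis = retentionMillis;
        this.clock = clock;
        this.syncToReplicas = syncToReplicas;
    }

    synchronized void addOrRenew(String id) {
        timestamps.put(id, clock.getAsLong());
    }

    synchronized List<String> getRetentionLeases() {
        final long now = clock.getAsLong();
        boolean anyExpired = timestamps.values().removeIf(ts -> now - ts > retentionMillis);
        if (anyExpired) {
            // Replicas never expire leases themselves; they receive the result of this sync.
            syncToReplicas.run();
        }
        return new ArrayList<>(timestamps.keySet());
    }
}
```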
2019-01-28 07:11:51 -05:00
Tanguy Leroux 758eb9d451 Track accurate total hits in CloseIndexIT
The test was not using the TRACK_TOTAL_HITS_ACCURATE and thus
encountered a different issue tracked in #37907. In the meantime,
we can adapt the test so that it no longer fails.

Closes #37897
2019-01-28 11:30:20 +01:00
Martijn van Groningen 4e1a779773
Prepare ShardFollowNodeTask to bootstrap when it fall behind leader shard (#37562)
* Changed `LuceneSnapshot` to throw an `OperationsMissingException` if the requested ops are missing.
* Changed the shard changes api to handle the `OperationsMissingException` and wrap the exception into `ResourceNotFound` exception and include metadata to indicate the requested range can no longer be retrieved.
* Changed `ShardFollowNodeTask` to handle this `ResourceNotFound` exception with the included metadata header.

Relates to #35975
2019-01-28 09:30:04 +01:00
Jim Ferenczi a056804831 Track total hits in tests that index more than 10,000 docs
This change sets track_total_hits to true on a test that needs
to check the total hits of a query that can return more than 10,000 docs.

Closes #37895
2019-01-28 09:24:32 +01:00
Dimitrios Liappis 290c6637c2
Refactor into appropriate uses of scheduleUnlessShuttingDown (#37709)
Replace `threadPool().schedule()` / catch
`EsRejectedExecutionException` pattern with direct calls to
`ThreadPool#scheduleUnlessShuttingDown()`.
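A sketch of the pattern with a plain `ScheduledExecutorService`. `Scheduling.scheduleUnlessShuttingDown` is a hypothetical helper that mirrors the idea of `ThreadPool#scheduleUnlessShuttingDown()`, not its signature. A rejection is swallowed only when the executor is shutting down; any other rejection is rethrown.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: schedule a task, treating rejection during shutdown as a no-op.
final class Scheduling {
    // Returns the future, or null if the task was rejected because the executor is shutting down.
    static ScheduledFuture<?> scheduleUnlessShuttingDown(ScheduledExecutorService executor,
                                                         long delay, TimeUnit unit, Runnable task) {
        try {
            return executor.schedule(task, delay, unit);
        } catch (RejectedExecutionException e) {
            if (executor.isShutdown()) {
                return null; // expected while shutting down; nothing to do
            }
            throw e; // rejected for another reason: surface it
        }
    }
}
```

Centralizing the check in one helper removes the repeated try/catch blocks at each call site.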

Closes #36318
2019-01-28 10:01:26 +02:00
Julie Tibshirani b1735aa93b
Support both typed and typeless 'get mapping' requests in the HLRC. (#37796)
From previous PRs, we've already added support for include_type_name to
the get mapping API. We had also taken an approach to the HLRC where the
server-side `GetMappingResponse#fromXContent` could only handle typeless
input.

This PR updates the HLRC for 'get mapping' to be in line with our new approach:

* Add a typeless 'get mappings' method to the Java HLRC, that accepts new
client-side request and response objects. This new response only handles
typeless mapping definitions.
* Switch the old version of `GetMappingResponse` back to expecting typed
mappings, and deprecate the corresponding method on the HLRC.

Finally, the PR also does some small, related clean-up around 'get field mappings'.
2019-01-27 16:02:22 -08:00
Jason Tedor f24dce1122
Fix newlines in retention lease sync action tests
There is a method invocation here spanning multiple lines. This commit
breaks it up into a line per parameter as this is friendlier to future
changes and diffs.
2019-01-27 08:16:14 -05:00
Jason Tedor 3801925cf0
Copy retention leases under lock
When adding a retention lease, we make a reference copy of the retention
leases under lock and then make a copy of that collection outside of the
lock. However, since we merely copied a reference to the retention
leases, after leaving a lock the underlying collection could change on
us. Rather, we want to copy these under lock. This commit adds a
dedicated method for doing this, asserts that we hold a lock when we use
this method, and changes adding a retention lease to use this method.
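Copying the contents rather than the reference under the lock can be sketched as follows, assuming leases are plain strings. `RetentionLeases`, `copyUnderLock` and `addAndSnapshot` are hypothetical names. Reading only the `leases` reference inside a `synchronized` block and copying it after leaving the block would reintroduce the race, because the copy would then iterate a list that other threads may be modifying.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lease holder: snapshots are taken while the lock is held.
final class RetentionLeases {
    private final List<String> leases = new ArrayList<>();

    // Must be called while holding the lock on this; copies the contents, not the reference.
    private List<String> copyUnderLock() {
        assert Thread.holdsLock(this) : "caller must hold the lock";
        return new ArrayList<>(leases);
    }

    // Adds a lease and returns a snapshot that later modifications cannot affect.
    synchronized List<String> addAndSnapshot(String lease) {
        leases.add(lease);
        return copyUnderLock();
    }
}
```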

This commit was intended to be included with #37398 but was pushed to
the wrong branch.
2019-01-27 08:13:47 -05:00
Jason Tedor 5fddb631a2
Introduce retention lease syncing (#37398)
This commit introduces retention lease syncing from the primary to its
replicas when a new retention lease is added. A follow-up commit will
add a background sync of the retention leases as well so that renewed
retention leases are synced to replicas.
2019-01-27 07:49:56 -05:00
Nhat Nguyen 780b4c72fe
Make ChannelActionListener a top-level class (#37797)
We have started using this class more often, so let's make it a top-level class.
2019-01-26 22:01:30 -05:00
Julie Tibshirani afc60bb0e5 Mute DynamicMappingIT#testConflictingDynamicMappings
Tracked in #37898.
2019-01-25 18:09:34 -08:00
Tal Levy eb973a4744 fix GeoHashGridTests precision parsing error
Previously, a hardcoded precision value of 4 was
used by these tests, resulting in no approximation
errors. Now that the precision can be anywhere between 1 and 12,
precision values of 1 and 2 can result in
bucketing errors.

This commit adjusts the range to be 4-12.

Fixes #37892.
2019-01-25 17:29:04 -08:00
Julie Tibshirani 58301ead6d Mute IndexShardIT#testMaybeFlush
Tracked in #37896.
2019-01-25 17:12:16 -08:00
Julie Tibshirani 23b0d9b3ed Mute RecoveryWhileUnderLoadIT#testRecoverWhileUnderLoadAllocateReplicasRelocatePrimariesTest
Tracked in #37895.
2019-01-25 16:50:39 -08:00
Julie Tibshirani e41ccdc1a0 Mute GeoWKTShapeParserTests#testParseGeometryCollection
Tracked in #37894.
2019-01-25 16:15:16 -08:00
Julie Tibshirani 827ed12146 Mute TasksIT#testTransportBulkTasks
Tracked in #37893.
2019-01-25 15:29:24 -08:00
Julie Tibshirani a4020f4587 Mute SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard
Tracked in #37888.
2019-01-25 13:40:29 -08:00
Like eb7bf16427 Migrate o.e.i.r.RecoveryState to Writeable (#37380)
Relates to #34389
2019-01-25 15:52:04 -05:00
Nhat Nguyen 5cd4dfb0e4
Relax cluster metadata version check (#37834)
If the in_sync_allocations of index-1 or index-2 is changed, the
metadata version will be increased. This leads to the failure in
the metadata version checks. We need to relax them.

Closes #37820
2019-01-25 14:54:13 -05:00
Yuri Astrakhan f1e71be8b2
Refactored GeoHashGrid unit tests (#37832)
* Refactored GeoHashGrid unit tests

This change allows other grid aggregations to reuse the same tests.

The change mostly just moves code to the base classes, trying to
keep changes to a bare minimum.

* rename createInternalGeoHashGridBucket to createInternalGeoGridBucket

* indentation
2019-01-25 13:37:24 -05:00
Zachary Tong afd4618851 Fixes for a few randomized agg tests that fail hasValue() checks
Closes #37743
Closes #37873
2019-01-25 12:39:42 -05:00
Igor Motov 68149b6058
Geo: replace intermediate geo objects with libs/geo (#37721)
Replaces intermediate geo objects built by ShapeBuilders with
objects from the libs/geo hierarchy. This should allow us to build
all geo functionality around a single hierarchy.

Follow up for #35320
2019-01-25 11:37:27 -05:00
Tanguy Leroux a644bc095c
Add unit tests for ShardStateAction's ShardStartedClusterStateTaskExecutor (#37756) 2019-01-25 16:51:53 +01:00
Vishnu Gt 27c3fb8e0d Do not allow negative variances (#37384)
Due to floating point error, it was possible for variances to become negative, which should never happen. This bugfix sets the variance to zero if it becomes negative as a result of floating point error.
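A minimal sketch of the clamping, assuming the variance is derived from running sums as E[x^2] - E[x]^2. `Variance.variance` is a hypothetical helper, not the aggregation's actual code.

```java
// Hypothetical helper computing a population variance from running sums.
final class Variance {
    static double variance(double sum, double sumOfSquares, long count) {
        double mean = sum / count;
        double variance = sumOfSquares / count - mean * mean;
        // E[x^2] - E[x]^2 can come out slightly below zero due to floating
        // point error; a variance is never negative, so clamp it to zero.
        return variance < 0 ? 0.0 : variance;
    }
}
```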
2019-01-25 09:56:34 -05:00
Tanguy Leroux ef8dd12c6d Limit number of documents indexed in CloseIndexIT test
This test indexes an unlimited number of documents. This commit
reduces this number to 25K and also tracks the exact number of hits
when counting the docs.
2019-01-25 15:09:27 +01:00
Christoph Büscher b4b4cd6ebd
Clean codebase from empty statements (#37822)
* Remove empty statements

There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. This change removes most of these instances and marks
blocks that look intentionally empty as such.

* Change test, slightly more verbose but less confusing
2019-01-25 14:23:02 +01:00
Henning Andersen 49073dd2f6
Fail start on invalid index metadata (#37748)
Nodes started with node.data=false and node.master=false can no longer
start if they have index metadata. This avoids resurrecting old indices
into the cluster and ensures metadata is cleaned out before
re-purposing a node that was previously a master or data node.

Issue #27073
2019-01-25 14:22:48 +01:00
Jim Ferenczi cb451edb01
Allow nested fields in the composite aggregation (#37178)
This change adds support for `nested` fields in the `composite`
aggregation. A `nested` aggregation can be used as the parent of a `composite`
aggregation in order to target `nested` fields in the `sources`.

Closes #28611
2019-01-25 14:00:39 +01:00
Alexander Reelsen 9e350d027e
Add BWC compatible processing to ingest date processors (#37407)
The ingest date processor is currently only able to parse Joda formats.
However, it does not use the existing Elasticsearch classes but accesses
Joda directly. This means that our existing BWC layer does not notify
the user about deprecated formats. This commit switches to the
existing Elasticsearch Joda methods to acquire a date format, which
include the BWC check and the ability to parse Java 8 dates.

Date parsing in ingest also has an extra feature: the
fallback year, when a date format without a year is used, is the current
year, and not 1970 as usual. This is currently not properly supported
in the DateFormatter class. As this is the only case for this feature
and java time can take care of this using the toZonedDateTime() method,
a workaround has been created just for the Joda-Time parser, which can be
removed again soon in 7.0.
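With `java.time`, a fallback year can be supplied through `DateTimeFormatterBuilder#parseDefaulting`. The sketch below is a hypothetical helper (`YearlessDates.parse`) that only illustrates the idea; it is not the workaround added in this commit, which targets the Joda-Time parser.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

// Hypothetical helper: parse a year-less date pattern with java.time, filling
// in a caller-supplied year (e.g. the current year) instead of 1970.
final class YearlessDates {
    static LocalDate parse(String text, String pattern, int fallbackYear) {
        DateTimeFormatter formatter = new DateTimeFormatterBuilder()
                .appendPattern(pattern)
                .parseDefaulting(ChronoField.YEAR, fallbackYear)
                .toFormatter(Locale.ROOT);
        return LocalDate.parse(text, formatter);
    }
}
```

A caller wanting the current year could pass `java.time.Year.now(java.time.ZoneOffset.UTC).getValue()` as `fallbackYear`.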
2019-01-25 13:50:19 +01:00
Jim Ferenczi 787acb14b9
Track total hits up to 10,000 by default (#37466)
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents; requests that match more than this value will set the `"total.relation"`
to `"gte"` (i.e. greater than or equal to) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted; they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I chose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.
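The bounded counting can be sketched as follows, assuming matches arrive one at a time through an iterator. `HitCount` is a hypothetical class: it reports an exact count with relation `"eq"` up to the threshold and stops at the first match beyond it, returning the threshold with relation `"gte"`.

```java
import java.util.Iterator;

// Hypothetical bounded hit counter: counts accurately up to a threshold and
// stops as soon as it sees one match more than that.
final class HitCount {
    final long value;
    final String relation; // "eq" for an exact count, "gte" for a lower bound

    private HitCount(long value, String relation) {
        this.value = value;
        this.relation = relation;
    }

    static HitCount count(Iterator<?> matches, int trackTotalHitsUpTo) {
        long counted = 0;
        while (matches.hasNext()) {
            matches.next();
            if (counted == trackTotalHitsUpTo) {
                // More matches than the threshold: report a lower bound.
                return new HitCount(trackTotalHitsUpTo, "gte");
            }
            counted++;
        }
        return new HitCount(counted, "eq");
    }
}
```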

Closes #33028
2019-01-25 13:45:39 +01:00
Mayya Sharipova 70af3c7983
Correct deprec log in RestGetFieldMappingAction (#37843)
* Correct deprec log in RestGetFieldMappingAction

Correct a class used for deprecation logging in
RestGetFieldMappingAction

* Correct deprec log in RestCreateIndexAction

Correct a class used for deprecation logging in
RestCreateIndexAction
2019-01-25 07:13:46 -05:00
Andrey Ershov 9e7fd8caed
Migrate ZenDiscoveryIT to Zen2 (#37465)
ZenDiscoveryIT contained 5 tests: 3 run without changes, testNodeRejectsClusterStateWithWrongMasterNode was removed, and testHandleNodeJoin_incompatibleClusterState was changed.
2019-01-25 11:17:09 +01:00
Armin Braun 7692b607b9
Fix ClusterDisruptionIT#testAckedIndexing (#37853)
* Stop threads before logging the list of exceptions
* For the broken case of concurrent iteration in the finally block and the threads not having shut down,
use `CopyOnWriteArrayList` to have concurrency safe iteration
* Closes #37810
2019-01-25 09:38:29 +01:00
Martijn van Groningen 5a9dadb3ff
changed versionAdded now that #37767 is backported 2019-01-25 09:18:42 +01:00
Martijn van Groningen 1151f3b3ff
Fail with a dedicated exception if remote connection is missing or (#37767)
or connectivity to the remote connection is failing.

Relates to #37681
2019-01-25 08:53:18 +01:00
Ricardo Ferreira df8fa9781e Remove Abstract Component (#35898)
TransportAction and BaseRestHandler no longer extend AbstractComponent. AbstractComponent no longer had any usages, so it was deleted.

Closes #34488
2019-01-25 08:35:19 +01:00
Yuri Astrakhan 6a13a252e9
Abstract GeoHashGridAggregatorFactory creation, renamed geohash -> hash (#37836)
* Delegate `new GeoHashGridAggregatorFactory(...)` inside the `GeoGridAggregationBuilder` to the child classes.
* Rename all `geohash...` to `hash...`
2019-01-24 23:45:18 -05:00
Nhat Nguyen 3ccd488755 Remove testMappingsPropagatedToMasterNodeImmediately
This test is obsolete since #31140 where an index request with dynamic
mapping update no longer requires acking.

Closes #37816
2019-01-24 21:48:50 -05:00
Julie Tibshirani e1d8df4ffa
Deprecate types in create index requests. (#37134)
From #29453 and #37285, the include_type_name parameter was already present and defaulted to false. This PR makes the following updates:
* Add deprecation warnings to RestCreateIndexAction, plus tests in RestCreateIndexActionTests.
* Add a typeless 'create index' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I created new CreateIndexRequest and CreateIndexResponse objects that differ from the existing server ones.
2019-01-24 13:17:47 -08:00
Boaz Leskes af2f4c8f73 enable bwc tests and bump versions after backporting https://github.com/elastic/elasticsearch/pull/37639 2019-01-24 20:55:55 +01:00
Nhat Nguyen 864e465515 Adjust minRetainedSeqNo assertion in CombinedDeletionPolicyTests
In these tests, we initialize the retained_seq_no with NO_OPS_PERFORMED,
thus we should verify that the min of the retained_seq_no is at least
NO_OPS_PERFORMED, not 0.

Closes #35994
2019-01-24 13:43:51 -05:00
Andrey Ershov 4974684003
Add tool elasticsearch-node unsafe-bootstrap (#37696)
The elasticsearch-node tool helps to restore a cluster if half or more of its
master-eligible nodes are lost. Of course, all bets are off regarding
data consistency.
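The "half or more" condition follows from quorum arithmetic. A simplified sketch, assuming an election needs a strict majority of the voting configuration and ignoring exclusions and reconfiguration; `Quorum.hasQuorum` is a hypothetical helper.

```java
// Simplified quorum check: a strict majority of the voting configuration must be available.
final class Quorum {
    static boolean hasQuorum(int votingConfigSize, int availableVoters) {
        return availableVoters * 2 > votingConfigSize;
    }
}
```

For example, a three-node voting configuration that loses two nodes, or a four-node one that loses two, can no longer elect a master on its own.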

There are two parts to the tool: unsafe-bootstrap, to be used when there
is still at least one master-eligible node alive, and detach-cluster,
to be used when there are no master-eligible nodes left.
This commit implements the first part.

Docs for the tool will be added separately as a part of #37812.
2019-01-24 19:25:55 +01:00
Tal Levy 289106a578
Refactor GeoHashGrid to be abstract and re-usable (#37742)
This change splits out all the specific GeoHash
classes for the geohash_grid aggregation into
abstract GeoGrid classes that can be re-used for
specific hashing types, like `geohash`
2019-01-24 10:12:14 -08:00
Nhat Nguyen 76fb573569
Do not allow put mapping on follower (#37675)
Today, the mapping on the follower is managed and replicated from its
leader index by the ShardFollowTask. Thus, we should prevent users
from modifying the mapping on the follower indices.

Relates #30086
2019-01-24 12:13:00 -05:00
David Turner 187b233571 Read m_m_n from cluster states from 6.7
This completes the BWC serialisation changes required for a 6.7 master to
inform other nodes of the node-level value of the `minimum_master_nodes`
setting.

Relates #37701, #37811
2019-01-24 17:05:49 +00:00
David Roberts 0e36adc35f Mute SimpleClusterStateIT testMetadataVersion
Due to https://github.com/elastic/elasticsearch/issues/37820
2019-01-24 16:50:55 +00:00
David Roberts bd02ca4b7b Mute NoMasterNodeIT testNoMasterActionsWriteMasterBlock
Due to https://github.com/elastic/elasticsearch/issues/37823
2019-01-24 15:17:13 +00:00
Nhat Nguyen a6abb28abf
Fix InternalEngineTests#assertOpsOnPrimary (#37746)
The assertion `assertOpsOnPrimary` does not store seq_no and primary
term of successful deletes to the `lastOpSeqNo` and `lastOpTerm`. This
leads to failures of the subsequent CAS deletes or indexes with seq_no
and term. Moreover, this assertion trips a translog assertion because it
bumps the primary term of some operations but not the primary term of
the engine.

Relates #36467
Closes #37684
2019-01-24 10:02:48 -05:00
David Roberts a81931bb2a Mute DynamicMappingIT testMappingsPropagatedToMasterNodeImmediately
Due to https://github.com/elastic/elasticsearch/issues/37816
2019-01-24 14:32:44 +00:00
Jason Tedor 7517e3a7bd
Optimize warning header de-duplication (#37725)
Now that warning headers no longer contain a timestamp of when the
warning was generated, we no longer need to extract the warning value
from the warning to determine whether or not the warning value is
duplicated. Instead, we can compare strings directly.

Further, when de-duplicating warning headers, we are constantly rebuilding
sets. Instead of doing that, we can carry the set along with us and
only rebuild it if we find a new warning value.

This commit applies both of these optimizations.
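Both optimizations can be sketched with a single insertion-ordered set that lives as long as the response, assuming warning values are plain strings. `WarningHeaders` is a hypothetical class, not the `ThreadContext` code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: one insertion-ordered set is kept for the lifetime of
// the response, and warning values are compared as plain strings.
final class WarningHeaders {
    private final Set<String> warnings = new LinkedHashSet<>();

    // Returns true if the value was new; duplicates are dropped without
    // parsing the value or rebuilding any collection.
    boolean addWarning(String warningValue) {
        return warnings.add(warningValue);
    }

    List<String> values() {
        return new ArrayList<>(warnings);
    }
}
```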
2019-01-24 08:39:24 -05:00
Yannick Welsch feab59df03
Bubble exceptions up in ClusterApplierService (#37729)
Exceptions thrown by the cluster applier service's settings and cluster appliers are bubbled up, and
block the state from being applied instead of being silently ignored. In combination with the cluster
state publishing lag detector, this will throw a node out of the cluster that can't properly apply
cluster state updates.
2019-01-24 14:09:03 +01:00
Simon Willnauer c7b16162ae
Remove unused ThreadBarrier class (#37666)
This class is pretty complex and only used in a test where we can simply
fail the test with an assertion error.
2019-01-24 13:52:22 +01:00
Yannick Welsch 2bf269e628 Fix docs for MappingUpdatedAction
Follow-up to #31140
2019-01-24 12:44:36 +01:00
David Roberts bcf5a4ca47 Mute ClusterDisruptionIT testAckedIndexing
Due to https://github.com/elastic/elasticsearch/issues/37810
2019-01-24 10:58:02 +00:00
Yannick Welsch 64adb5ad5b
Set acking timeout to 0 on dynamic mapping update (#31140)
As acking can fail for any reason (unrelated node being too slow, node disconnecting), it should not
be required for acking to succeed in order for index requests with dynamic mapping updates to
successfully complete.

Relates to #30672 and Closes #30844
2019-01-24 11:39:46 +01:00
Armin Braun 36889e8a2f
Remove Custom Listeners from SnapshotsService (#37629)
* Remove Custom Listeners from SnapshotsService

Motivations:
    * Shorten the code some more
    * Use ActionListener#wrap to get easy to reason about behavior in failure scenarios
    * Remove duplication in the logic of handling snapshot completion listeners (listeners removing themselves and comparing snapshots to their targets)
        * Also here, move all listener handling into `SnapshotsService` and remove custom listener class by putting listeners in a map
2019-01-24 10:11:18 +01:00
David Turner bdef2ab8c0
Use m_m_nodes from Zen1 master for Zen2 bootstrap (#37701)
Today we support a smooth rolling upgrade from Zen1 to Zen2 by automatically
bootstrapping the cluster once all the Zen1 nodes have left, as long as the
`minimum_master_nodes` count is satisfied. However this means that Zen2 nodes
also require the `minimum_master_nodes` setting for this one specific and
transient situation.

Since nodes only perform this automatic bootstrapping if they previously
belonged to a Zen1 cluster, they can keep track of the `minimum_master_nodes`
setting from the previous master instead of requiring it to be set on the Zen2
node.
2019-01-24 08:57:40 +00:00
Mayya Sharipova fdb66039d4
Change `rational` to `saturation` in script_score (#37766)
This change of the function name is necessary for conformity
with feature queries.
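For reference, a sketch of the function under its new name, assuming the form documented for `script_score`, saturation(value, k) = value / (k + value), which grows towards 1 as the value increases. `ScoreFunctions.saturation` is a hypothetical Java helper, not the Painless implementation.

```java
// Hypothetical helper mirroring the saturation(value, k) scoring function.
final class ScoreFunctions {
    // Equals 0.5 when value == k and approaches 1 as value grows.
    static double saturation(double value, double k) {
        return value / (k + value);
    }
}
```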

Closes #37714
2019-01-23 14:28:20 -05:00
Mayya Sharipova c8565fe692
Deprecate types in get field mapping API (#37667)
- Add deprecation warning to RestGetFieldMappingAction
- Add two new Java HLRC classes GetFieldMappingsRequest and
GetFieldMappingsResponse. These classes use new typeless forms
of a request and response, and differ in that from the server
versions.

Relates to #35190
2019-01-23 14:24:35 -05:00
Tim Brooks f45b5fedb5
Add ability to listen to group of affix settings (#37679)
Currently we have the ability to listen for setting changes to two group
affix settings. However, it is possible that we might have the need to
listen to more than two. This commit adds a method that allows a consumer
to listen to a list of affix settings for changes.
2019-01-23 12:05:39 -07:00
Jason Tedor 169cb38778
Liberalize StreamOutput#writeStringList (#37768)
In some cases we only have a string collection instead of a string list
that we want to serialize out. We have a convenience method for writing
a list of strings, but no such method for writing a collection of
strings. Yet, a list of strings is a collection of strings, so we can
simply liberalize StreamOutput#writeStringList to be more generous in
the collections that it accepts and write out collections of strings
too. On the other side, we do not have a convenience method for reading
a list of strings. This commit addresses both of these issues.
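A sketch of the liberalized signature using `java.io.DataOutput`/`DataInput` in place of Elasticsearch's `StreamOutput`/`StreamInput`; `StringCollections` and its method names are hypothetical. Writing accepts any `Collection<String>`, and reading returns a `List<String>` in the order the strings were written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stream helpers: size prefix followed by each string.
final class StringCollections {
    // Accepts any Collection<String> (a List, a Set, ...).
    static void writeStringCollection(DataOutput out, Collection<String> strings) throws IOException {
        out.writeInt(strings.size());
        for (String s : strings) {
            out.writeUTF(s);
        }
    }

    // Reads back what writeStringCollection wrote, as a List in iteration order.
    static List<String> readStringList(DataInput in) throws IOException {
        int size = in.readInt();
        List<String> strings = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            strings.add(in.readUTF());
        }
        return strings;
    }

    static List<String> roundTrip(Collection<String> strings) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeStringCollection(out, strings);
            out.flush();
            return readStringList(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```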
2019-01-23 12:52:17 -05:00
Benjamin Trent 1c2ae9185c
Add PersistentTasksClusterService::unassignPersistentTask method (#37576)
* Add PersistentTasksClusterService::unassignPersistentTask method

* adding cancellation test

* Adding integration test for unallocating tasks from a node

* Addressing review comments

* addressing minor PR comments
2019-01-23 11:48:32 -06:00
Igor Motov e3672aa551
Tests: disable testRandomGeoCollectionQuery on tiny polygons (#37579)
Due to https://issues.apache.org/jira/browse/LUCENE-8634 this test
may fail if a really tiny polygon is generated. This commit checks for
tiny polygons and skips the final check, which is expected to fail
until the lucene bug is fixed and new version of lucene is released.
2019-01-23 12:25:54 -05:00
Julie Tibshirani f0fc6e8003
Make sure PutMappingRequest accepts content types other than JSON. (#37720) 2019-01-23 08:51:05 -08:00
David Kyle d193ca8aae
Use disassociate in preference to deassociate (#37704) 2019-01-23 16:06:25 +00:00
Armin Braun 2439f68745
Delete Redundant RoutingServiceTests (#37750)
* This test completely overrode the `reroute` method and hence did nothing but test the override itself
   * Removed the test since it tests nothing and simplified `reroute` accordingly
2019-01-23 16:39:02 +01:00
Nhat Nguyen 6a9838359c
Always return metadata version if metadata is requested (#37674)
If the indices of a ClusterStateRequest are specified, we fail to
include the cluster state metadata version in the response.

Relates  #37633
2019-01-23 10:24:51 -05:00
Luca Cavanna 12f5b02fd0
Streamline skip_unavailable handling (#37672)
This commit moves the collectSearchShards method out of RemoteClusterService into TransportSearchAction, which currently calls it. RemoteClusterService used to be used only for cross-cluster search but is now also used in cross-cluster replication, where different APIs are called through the RemoteClusterAwareClient. There is no reason for the collectSearchShards and fetchShards methods to be in RemoteClusterService and RemoteClusterConnection respectively. The search shards API can be called through the RemoteClusterAwareClient too; the only missing bit is a way to handle failures based on the skip_unavailable setting for each cluster (currently only supported in RemoteClusterConnection#fetchShards), which is achieved by adding an isSkipUnavailable(String clusterAlias) method to RemoteClusterService.
This change is useful for #32125 as we will very soon need to also call the search API against remote clusters, which will be done through RemoteClusterAwareClient. In that case we will also need to support skip_unavailable when calling the search API so we need some way to handle the skip_unavailable setting like we currently do for the search_shards call.

Relates to #32125
2019-01-23 13:53:37 +01:00
Yannick Welsch d5139e0590
Only bootstrap and elect node in current voting configuration (#37712)
Adapts bootstrapping and leader election to only trigger on nodes that are actually part of the voting
configuration.
2019-01-23 13:10:11 +01:00
Simon Willnauer 4ec3a6d922
Ensure either success or failure path for SearchOperationListener is called (#37467)
Today we have several implementations of executing SearchOperationListener
in SearchService. While most of them seem to be safe, at least one, the one that
executes scroll searches, can cause illegal execution of SearchOperationListener
that can then in turn trigger assertions in ShardSearchStats. This change
adds a SearchOperationListenerExecutor that uses try-with-resources blocks to ensure
listeners are called in a safe way.
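The try-with-resources pattern can be sketched as follows. `PhaseListener`, `PhaseListenerExecutor` and the callback names are hypothetical simplifications, not the actual SearchOperationListener interface; the point is that `close()` always runs exactly one of the success or failure callbacks, even when the phase throws.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a search operation listener. */
interface PhaseListener {
    void onPreQueryPhase();
    void onQueryPhase(long tookInNanos);  // success path
    void onFailedQueryPhase();            // failure path
}

/** Fires the pre-phase callback on creation; close() fires exactly one of success or failure. */
final class PhaseListenerExecutor implements AutoCloseable {
    private final PhaseListener listener;
    private final long startNanos = System.nanoTime();
    private boolean success = false;

    PhaseListenerExecutor(PhaseListener listener) {
        this.listener = listener;
        listener.onPreQueryPhase();
    }

    /** Call as the last statement of the try block, once the phase has completed. */
    void success() {
        success = true;
    }

    @Override
    public void close() {
        if (success) {
            listener.onQueryPhase(System.nanoTime() - startNanos);
        } else {
            listener.onFailedQueryPhase();
        }
    }
}

final class RecordingListener implements PhaseListener {
    final List<String> events = new ArrayList<>();

    @Override public void onPreQueryPhase() { events.add("pre"); }
    @Override public void onQueryPhase(long tookInNanos) { events.add("success"); }
    @Override public void onFailedQueryPhase() { events.add("failed"); }
}

final class ExecutorDemo {
    private ExecutorDemo() {}

    static List<String> run(boolean fail) {
        RecordingListener listener = new RecordingListener();
        try (PhaseListenerExecutor executor = new PhaseListenerExecutor(listener)) {
            if (fail) {
                throw new IllegalStateException("query phase failed");
            }
            executor.success();
        } catch (IllegalStateException e) {
            // close() has already run the failure callback by the time we get here
        }
        return listener.events;
    }
}
```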

Relates to #37185
2019-01-23 12:38:44 +01:00
Tanguy Leroux 6130d15172
Adapt SyncedFlushService (#37691) 2019-01-23 11:08:54 +01:00
Alexander Reelsen 701d89caa2 Mute FilterAggregatorTests#testRandom
Relates #37743
2019-01-23 11:00:37 +01:00
Alexander Reelsen daa2ec8a60
Switch mapping/aggregations over to java time (#36363)
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.

The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.

Relates #27330
2019-01-23 10:40:05 +01:00
Boaz Leskes 52ba407931
Expose sequence number and primary terms in search responses (#37639)
Users may require the sequence number and primary term to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalue_fields` API, but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it.

This commit adds a dedicated fetch sub-phase, connected to a new `seq_no_primary_term` parameter, that returns both numbers.
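For illustration, a hypothetical in-memory store showing why both numbers matter for optimistic concurrency control: a conditional write succeeds only if the (sequence number, primary term) pair the client read is still current. `OccStore` and its methods are invented for this sketch and are not an Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical store: writes can be conditioned on the (seqNo, primaryTerm) the caller last read. */
final class OccStore {
    static final class Doc {
        final String source;
        final long seqNo;
        final long primaryTerm;

        Doc(String source, long seqNo, long primaryTerm) {
            this.source = source;
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();
    private long nextSeqNo = 0;
    private final long primaryTerm = 1; // fixed here; a newly promoted primary would bump it

    synchronized Doc index(String id, String source) {
        Doc doc = new Doc(source, nextSeqNo++, primaryTerm);
        docs.put(id, doc);
        return doc;
    }

    /** Compare-and-set: fails if another write has happened since the caller read the doc. */
    synchronized boolean indexIfMatch(String id, String source, long ifSeqNo, long ifPrimaryTerm) {
        Doc current = docs.get(id);
        if (current == null || current.seqNo != ifSeqNo || current.primaryTerm != ifPrimaryTerm) {
            return false; // version conflict
        }
        index(id, source);
        return true;
    }
}
```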
2019-01-23 09:01:58 +01:00
Andrey Ershov 7c6566e14c
Migrate SpecificMasterNodesIT to Zen2 (#37532)
1. testSimpleOnlyMasterNodeElection - requires cluster bootstrap when
the first master node is started.
2. testElectOnlyBetweenMasterNodes - requires cluster bootstrap when
the first master node is started and requires adding voting exclusion
before shutting down the first master node.
3. testAliasFilterValidation - requires cluster bootstrap when the
first master node is started.
2019-01-23 07:22:41 +01:00
Andrey Ershov e2e00cd245
Fix MetaStateFormat tests
It's not safe to continue writing state using MetaDataStateFormat
after a dirty WriteStateException has occurred, unless it is followed by
a successful state write.

We've encountered a failure of testFailRandomlyAndReadAnyState.
There are 3 state paths, and the test breaks as follows:

1. A successful write at the beginning of the test yields generation 0
in all three directories: 0 0 0.
2. The 1st write in the loop is unsuccessful, but not dirty: 0 0 0.
3. The 2nd write in the loop is unsuccessful and dirty (it fails during
fsync). Before the new files are removed we have 1 1 1, but during
deletion the first deletion fails and we get 1 0 0.
4. The 3rd write in the loop is unsuccessful, but not dirty, so we want to
keep the old generation, which happens to be generation 1; now we
have 1 x x in the state folders. We then assert that we load either the 0
or the 1 state from the state folders, selecting only the 2nd and 3rd
folders to emulate disk failures. This results in an NPE because there
is nothing in these folders.

Fortunately, this won't be a problem in real life, because if there is
a dirty exception, we shut down the node and make sure we perform a
successful write on node startup.
2019-01-23 07:21:26 +01:00
Zachary Tong 2ba9e361ab
Add helper classes to determine if aggs have a value (#36020)
This adds a set of helper classes to determine if an agg "has a value".
This is needed because InternalAggs represent "empty" in different
ways, depending on convention. Some use `NaN`, `+/- Inf`, `0.0`, etc.

A user can pass the InternalAgg to one of these helper methods
and it will report whether the agg contains a value or not, which allows the
user to differentiate "empty" from a real `NaN`.

These helpers are best-effort in some cases. For example, several
pipeline aggs share a single return class but use different conventions
to mark "empty", so the helper uses the loosest definition that applies
to all the aggs that use the class.

Sums in particular are unreliable. InternalSum simply returns 0.0
if the agg is empty (which is correct: no values == sum of zero). But this
also means the helper cannot differentiate between "empty" and `+1 + -1`.
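A hypothetical sketch of type-specific "has a value" checks. The sentinels are assumptions for illustration (an empty max reported as negative infinity, an empty min as positive infinity, an empty avg detected through a count of zero); these are not the helper classes the commit adds. The sum case shows the ambiguity described above.

```java
/** Hypothetical helpers: each metric marks "empty" with its own sentinel, so checks are per type. */
final class AggValueHelpers {
    private AggValueHelpers() {}

    /** Assumes an empty max is reported as negative infinity. */
    static boolean maxHasValue(double value) {
        return value != Double.NEGATIVE_INFINITY;
    }

    /** Assumes an empty min is reported as positive infinity. */
    static boolean minHasValue(double value) {
        return value != Double.POSITIVE_INFINITY;
    }

    /** The count is more reliable than the value, which may legitimately be NaN. */
    static boolean avgHasValue(double value, long count) {
        return count > 0;
    }

    /** Best effort only: 0.0 means both "no values" and a genuine sum such as +1 + -1. */
    static boolean sumHasValue(double value) {
        return value != 0.0;
    }
}
```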
2019-01-22 12:38:55 -05:00
Jason Tedor 715719ee3b
Remove warn-date from warning headers (#37622)
This commit removes the warn-date from warning headers. Previously we
were stamping every warning header with the time at which the request
occurred. However, this has a severe performance penalty when
deprecation logging is called frequently, as obtaining the current time
and formatting it properly is expensive. A previous change moved to
using the startup time as the time to stamp on every warning header, but
this was only to prove that the timestamping was expensive. Since the
warn-date is optional, we elect to remove it from the warning
header. Prior to this commit, we worked in Kibana to treat the warn-date
as optional there so that we could follow up in Elasticsearch and
remove the warn-date. This commit does that.
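A minimal sketch of a Warning header value without the optional warn-date, following the RFC 7234 layout `warn-code SP warn-agent SP warn-text [SP warn-date]`. The `299` code is the RFC's "miscellaneous persistent warning" code; the agent string passed in is a placeholder, and `WarningHeaders.format` is a hypothetical helper rather than Elasticsearch's own formatter.

```java
/** Hypothetical formatter for a Warning header value that omits the optional warn-date. */
final class WarningHeaders {
    private WarningHeaders() {}

    /** Produces: 299 SP agent SP quoted-string (no trailing warn-date). */
    static String format(String agent, String message) {
        StringBuilder sb = new StringBuilder("299 ").append(agent).append(" \"");
        for (char c : message.toCharArray()) {
            // warn-text is a quoted-string, so embedded quotes and backslashes are escaped
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }
}
```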
2019-01-22 12:29:24 -05:00
Yannick Welsch 23ba900840
Publish to masters first (#37673)
Prefer publishing to master-eligible nodes first, so that cluster state updates are committed more
quickly, and master-eligible nodes are also turned into followers more quickly after a leader election.
2019-01-22 13:53:10 +01:00
David Kyle 3fad1eeaed
Un-assign persistent tasks as nodes exit the cluster (#37656)
PersistentTasksClusterService decides whether a task should be reassigned by
checking whether there is a node in the cluster with the same ID. If a node is
restarted, PersistentTasksClusterService may not observe the change and may
decide the task still has a valid assignment, because the node's
ephemeral ID is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
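A hypothetical sketch of the idea: treat a node as departed if it left the cluster or came back with a different ephemeral ID (that is, it restarted), and un-assign the tasks it held. Node and task identifiers are plain strings here, and `PersistentTaskUnassigner` is not the actual PersistentTasksClusterService code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: un-assign tasks whose node left or restarted between two cluster states. */
final class PersistentTaskUnassigner {
    private PersistentTaskUnassigner() {}

    /** Maps are nodeId -> ephemeralId; a changed ephemeral ID means the node restarted. */
    static Set<String> departedNodes(Map<String, String> previousEphemeralIds,
                                     Map<String, String> currentEphemeralIds) {
        Set<String> departed = new HashSet<>();
        for (Map.Entry<String, String> e : previousEphemeralIds.entrySet()) {
            String now = currentEphemeralIds.get(e.getKey());
            if (now == null || !now.equals(e.getValue())) {
                departed.add(e.getKey());
            }
        }
        return departed;
    }

    /** Returns taskId -> nodeId, with tasks on departed nodes mapped to null (unassigned). */
    static Map<String, String> unassign(Map<String, String> taskToNode, Set<String> departed) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : taskToNode.entrySet()) {
            result.put(e.getKey(), departed.contains(e.getValue()) ? null : e.getValue());
        }
        return result;
    }
}
```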
2019-01-22 12:44:45 +00:00
Henning Andersen 228611843c
Fail start of non-data node if node has data (#37347)
* Fail start of non-data node if node has data

Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.

Issue #27073
2019-01-22 13:27:12 +01:00
Yannick Welsch 2a7b7ccf1c
Use cancel instead of timeout for aborting publications (#37670)
When publications were cancelled because a node turned into a follower or candidate, they would still
show as timed out, which can be confusing in the logs. This change replaces the improper call to
onTimeout by generalizing it into a cancel method.
2019-01-22 12:51:03 +01:00
Christoph Büscher 0a93a0358b
Remove deprecated FieldNamesFieldMapper.Builder#index (#37305)
The method calls "enabled" in addition to what super.index() does, but this
seems to be done explicitly now in the TypeParser's `parse` method. The removed
method has been deprecated since at least 6.0. This also makes some of the Builder's
methods and constructors private, since they are only used internally in this class.
2019-01-22 12:12:21 +01:00
David Turner 5db7ed22a0
Bootstrap a Zen2 cluster once quorum is discovered (#37463)
Today when bootstrapping a Zen2 cluster we wait for every node in the
`initial_master_nodes` setting to be discovered, so that we can map the
node names or addresses in the `initial_master_nodes` list to their IDs for
inclusion in the initial voting configuration. This means that if any of
the expected master-eligible nodes fails to start then bootstrapping will
not occur and the cluster will not form. This is not ideal, and we would
prefer the cluster to bootstrap even if some of the master-eligible nodes
do not start.

Safe bootstrapping requires that all pairs of quorums of all initial
configurations overlap, and this is particularly troublesome to ensure
given that nodes may be concurrently and independently attempting to
bootstrap the cluster. The solution is to bootstrap using an initial
configuration whose size matches the size of the expected set of
master-eligible nodes, but with the unknown IDs replaced by "placeholder"
IDs that can never belong to any node.  Any quorum of received votes in any
of these placeholder-laden initial configurations is also a quorum of the
"true" initial set of master-eligible nodes, giving the guarantee that it
intersects all other quorums as required.
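To make the quorum argument concrete, here is a hypothetical sketch in which the expected nodes are `a`, `b` and `c` but only `a` and `b` have been discovered. The placeholder prefix and class name are invented for this example. Because a placeholder ID never casts a vote, any strict majority of this configuration consists of real nodes and is also a majority of the true three-node set.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an initial configuration padded with placeholder IDs. */
final class PlaceholderBootstrap {
    private PlaceholderBootstrap() {}

    // Hypothetical prefix; real node IDs never take this form, so a placeholder never votes.
    static final String PLACEHOLDER_PREFIX = "{placeholder}-";

    /** One entry per expected master-eligible node; undiscovered nodes get a placeholder ID. */
    static Set<String> initialConfiguration(Collection<String> expectedNodeNames,
                                            Map<String, String> discoveredIdsByName) {
        Set<String> config = new LinkedHashSet<>();
        for (String name : expectedNodeNames) {
            String id = discoveredIdsByName.get(name);
            config.add(id != null ? id : PLACEHOLDER_PREFIX + name);
        }
        return config;
    }

    /** A quorum is a strict majority of the configuration's entries. */
    static boolean isQuorum(Set<String> configuration, Set<String> votes) {
        Set<String> counted = new HashSet<>(votes);
        counted.retainAll(configuration);
        return counted.size() * 2 > configuration.size();
    }
}
```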

Note that this change means that the initial configuration is not
necessarily robust to any node failures. Normally the cluster will form and
then auto-reconfigure to a more robust configuration in which the
placeholder IDs are replaced by the IDs of genuine nodes as they join the
cluster; however if a node fails between bootstrapping and this
auto-reconfiguration then the cluster may become unavailable. This we feel
to be less likely than a node failing to start at all.

This commit also enormously simplifies the cluster bootstrapping process.
Today, the cluster bootstrapping process involves two (local) transport actions
in order to support a flexible bootstrapping API and to make it easily
accessible to plugins. However this flexibility is not required for the current
design so it is adding a good deal of unnecessary complexity. Here we remove
this complexity in favour of a much simpler ClusterBootstrapService
implementation that does all the work itself.
2019-01-22 11:03:51 +00:00
Adrien Grand e9fcb25a28
Upgrade to lucene-8.0.0-snapshot-83f9835. (#37668)
This snapshot uses a new file format for doc-values which is expected to make
advance/advanceExact perform faster on sparse fields:
https://issues.apache.org/jira/browse/LUCENE-8585
2019-01-22 11:44:29 +01:00
Alpar Torok 74d1cfbf7e Mute failing test
Tracking #37687
2019-01-22 10:50:27 +02:00
Alexander Reelsen 4fb68ea195
Fix java time formatters that round up (#37604)
In order to be able to parse epoch seconds and epoch milliseconds, custom
java time fields had been introduced. These fields are, however, not
compatible with the way that java time allows one to configure default
fields (when a part of a timestamp cannot be read, a default value
is added), which is used by the formatters that round up to the
next value.

This commit allows java date formatters to configure their round-up parsing
by setting default values via a consumer. By default all formats set
JavaDateFormatter.ROUND_UP_BASE_FIELDS for rounding up. The epoch
parsers, however, both need to set different fields. The merged date
formatters do not set any fields; they just append all the round-up formatters.

The formatter now also properly copies the locale and the time zone, and
fractional parsing has been set to nanoseconds with the proper width.
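The standard java.time mechanism for default fields is `DateTimeFormatterBuilder.parseDefaulting`, which injects a value only when the field was not parsed. A minimal sketch of a round-up parser built that way follows; `RoundUpFormatters` is a hypothetical helper, and the fields it defaults are an assumption, not the contents of JavaDateFormatter.ROUND_UP_BASE_FIELDS.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

/** Hypothetical helper: a parser that rounds missing time-of-day fields up to their maximum. */
final class RoundUpFormatters {
    private RoundUpFormatters() {}

    static DateTimeFormatter roundUp(String pattern) {
        return new DateTimeFormatterBuilder()
            .appendPattern(pattern)
            // parseDefaulting only injects a value when the field was not parsed from the input
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 23)
            .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 59)
            .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 59)
            .parseDefaulting(ChronoField.NANO_OF_SECOND, 999_999_999)
            .toFormatter(Locale.ROOT);
    }
}
```

With the pattern `uuuu-MM-dd['T'HH:mm]`, parsing `2019-01-22` into a `LocalDateTime` gives `2019-01-22T23:59:59.999999999`, while `2019-01-22T10:15` keeps the parsed hour and minute and gives `2019-01-22T10:15:59.999999999`.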
2019-01-22 09:42:17 +01:00
Alpar Torok 17d704347e Mute failing test
Tracking #37685
2019-01-22 10:31:23 +02:00
Tanguy Leroux 0290547ad7
Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (#37426)
Since version 6.7.0 the Close Index API guarantees that all translog
operations have been correctly flushed before the index is closed. If
the index is reopened as a Frozen index (which uses a ReadOnlyEngine),
we can verify that the maximum sequence number from the last Lucene
commit is indeed equal to the last known global checkpoint, and refuse
to open the read-only engine if it is not. In this PR the check is
only done for indices created on or after 6.7.0, as they are guaranteed
to be closed using the new Close Index API.
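A minimal sketch of the guard described above. The class and method names, the message text, and the boolean standing in for the "created on or after 6.7.0" version check are all hypothetical.

```java
/** Hypothetical guard applied before opening a read-only engine. */
final class ReadOnlyEngineCheck {
    private ReadOnlyEngineCheck() {}

    /**
     * Refuses to open when the last commit's max seq no differs from the global checkpoint.
     * The boolean stands in for an "index created on or after 6.7.0" version check.
     */
    static void ensureMaxSeqNoEqualsGlobalCheckpoint(boolean createdOnOrAfter670,
                                                     long maxSeqNo, long globalCheckpoint) {
        if (createdOnOrAfter670 && maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException("max seq no [" + maxSeqNo
                + "] from last commit does not match global checkpoint [" + globalCheckpoint + "]");
        }
    }
}
```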

Related #33888
2019-01-22 09:22:33 +01:00
Alpar Torok a713183cab Mute failing discovery disruption tests
Tracking #37539
2019-01-22 10:16:04 +02:00
Nhat Nguyen 7394892b4c
Make prepare engine step of recovery source non-blocking (#37573)
Relates #37174
2019-01-21 21:35:10 -05:00
Tim Brooks 21838d73b5
Extract message serialization from `TcpTransport` (#37034)
This commit introduces a NetworkMessage class. This class has two
subclasses - InboundMessage and OutboundMessage. These messages can
be serialized and deserialized independently of the transport. This allows
more granular testing. Additionally, the serialization mechanism is now
a simple Supplier. This builds the framework to eventually move the
serialization of transport messages to the network thread. This is the
one serialization component that is not currently performed on the
network thread (transport deserialization and http serialization and
deserialization are all on the network thread).
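A hypothetical sketch of the split: message metadata in a base class, outbound serialization exposed as a `Supplier<byte[]>`, and inbound deserialization that needs no transport. The class names and wire layout (request ID, action and payload written with `DataOutputStream`) are invented for illustration and are not the Elasticsearch wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** Hypothetical base class: message metadata with no dependency on a transport. */
abstract class NetMessage {
    final long requestId;
    final String action;

    NetMessage(long requestId, String action) {
        this.requestId = requestId;
        this.action = action;
    }
}

/** Serialization is exposed as a Supplier, so whichever thread calls get() does the work. */
final class OutboundNetMessage extends NetMessage {
    private final String payload;

    OutboundNetMessage(long requestId, String action, String payload) {
        super(requestId, action);
        this.payload = payload;
    }

    Supplier<byte[]> serializer() {
        return () -> {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeLong(requestId);
                out.writeUTF(action);
                out.writeUTF(payload);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return bytes.toByteArray();
        };
    }
}

final class InboundNetMessage extends NetMessage {
    final String payload;

    private InboundNetMessage(long requestId, String action, String payload) {
        super(requestId, action);
        this.payload = payload;
    }

    /** Deserializes without needing a transport instance, which keeps unit tests small. */
    static InboundNetMessage read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new InboundNetMessage(in.readLong(), in.readUTF(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```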
2019-01-21 14:14:18 -07:00
Tim Brooks f516d68fb2
Share `NioGroup` between http and transport impls (#37396)
Currently we create dedicated network threads for both the http and
transport implementations. Since these threads should never
perform blocking operations, they could be shared. This commit
modifies the nio-transport to use 0 http workers by default. If the
default configs are used, this will cause the http transport to be run
on the transport worker threads. The http worker setting will still exist
in case the user would like to configure dedicated workers. Additionally,
this commit deletes the dedicated acceptor threads. We have never had these
for the netty transport, and they can be added back if a need is
determined in the future.
2019-01-21 13:50:56 -07:00
Armin Braun 3a3f5b39c3
Fix Race in Concurrent Snapshot Delete and Create (#37612)
* The repository ID was determined incorrectly when the delete picked up an in-progress snapshot
  * NOTE: This solution is still a best-effort fix and there's a slight chance of running into concurrency issues here
    when multiple create and delete requests for the same snapshot name are happening concurrently, but these require a sequence
    of multiple cluster state updates between the changed method reading the genId and submitting its cluster state update task
* The added test reproduced the issue in about 50% of runs
* Closes #37581
2019-01-21 13:10:33 +01:00
Luca Cavanna 09a6ba50ef
Add support for merging multiple search responses into one (#37566)
This will be used in cross-cluster search when reduction will be
performed locally on each cluster. The CCS coordinating node will send
one search request per remote cluster involved and will get one search
response back from each one of them. Such responses contain all the info
to be able to perform an additional reduction and return results back
to the user.
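A heavily simplified, hypothetical sketch of merging per-cluster responses: total hit counts are summed and the best-scoring hits across all clusters are kept. `SearchResponseMerger` and its nested types are invented for this example; real responses also carry aggregations, suggestions, shard failures and more, none of which is modelled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, heavily simplified merge of per-cluster search responses. */
final class SearchResponseMerger {
    private SearchResponseMerger() {}

    static final class Hit {
        final String clusterAlias;
        final String id;
        final float score;

        Hit(String clusterAlias, String id, float score) {
            this.clusterAlias = clusterAlias;
            this.id = id;
            this.score = score;
        }
    }

    static final class Response {
        final long totalHits;
        final List<Hit> hits;

        Response(long totalHits, List<Hit> hits) {
            this.totalHits = totalHits;
            this.hits = hits;
        }
    }

    /** Sums the hit counts and keeps the `size` best-scoring hits across all responses. */
    static Response merge(List<Response> responses, int size) {
        long total = 0;
        List<Hit> all = new ArrayList<>();
        for (Response response : responses) {
            total += response.totalHits;
            all.addAll(response.hits);
        }
        // highest score first; List.sort is stable, so ties keep their input order
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new Response(total, new ArrayList<>(all.subList(0, Math.min(size, all.size()))));
    }
}
```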

Relates to #32125
2019-01-21 11:51:47 +01:00
Jason Tedor adae233f77
Add some deprecation optimizations (#37597)
This commit optimizes some of the performance issues from using
deprecation logging:
 - we optimize encoding the deprecation value
 - we optimize formatting the deprecation string
 - we optimize away getting the current time (by using cached startup
   time)
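A hypothetical sketch of the kind of fast path that keeps value encoding cheap, assuming the warning value is percent-encoded only when it contains characters outside a safe ASCII set. The safe set chosen here is an assumption, and `WarningValueEncoder` is not the Elasticsearch implementation; for plain ASCII messages the method returns its input without allocating.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical encoder with an allocation-free fast path for safe messages. */
final class WarningValueEncoder {
    private WarningValueEncoder() {}

    // Assumed safe set: tab, space and visible ASCII except '%', which must itself be escaped.
    private static boolean isSafe(char c) {
        return c == '\t' || (c >= 0x20 && c <= 0x7E && c != '%');
    }

    /** Fast path: a message made only of safe characters is returned as-is. */
    static String encode(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (!isSafe(value.charAt(i))) {
                return percentEncode(value);
            }
        }
        return value;
    }

    /** Slow path: percent-encode the UTF-8 bytes of every unsafe character. */
    private static String percentEncode(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (b >= 0 && isSafe(c)) { // non-negative bytes are single-byte ASCII characters
                sb.append(c);
            } else {
                sb.append('%')
                  .append(Character.toUpperCase(Character.forDigit((b >> 4) & 0xF, 16)))
                  .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return sb.toString();
    }
}
```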
2019-01-18 16:42:25 -05:00
Tal Levy 106f900dfb
refactor inner geogrid classes to own class files (#37596)
To make further refactoring of GeoGrid aggregations
easier (related: #30320), splitting out these inner
class dependencies into their own files makes it
easier to map the relationships between classes.
2019-01-18 13:40:00 -08:00
Julie Tibshirani 8da7a27f3b
Deprecate types in the put mapping API. (#37280)
From #29453 and #37285, the `include_type_name` parameter was already present and defaulted to false. This PR makes the following updates:
- Add deprecation warnings to `RestPutMappingAction`, plus tests in `RestPutMappingActionTests`.
- Add a typeless 'put mappings' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I opted to create a new `PutMappingRequest` object that differs from the existing server one.
2019-01-18 12:28:31 -08:00
Jack Conradson de55b4dfd1
Add types deprecation to script contexts (#37554)
This adds a deprecation warning for _type in the script contexts for ingest and update.
It does so with a DeprecationMap that wraps the ctx Map containing _type for these
specific contexts.
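A minimal sketch of such a wrapper, assuming the deprecation logger can be modelled as a `Consumer<String>`. `TypeDeprecatingMap` and its warning text are hypothetical, not the DeprecationMap from the commit; only `get` and `put` warn, while everything else goes straight to the wrapped map.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical wrapper that reports a deprecation whenever a script reads or writes _type. */
final class TypeDeprecatingMap extends AbstractMap<String, Object> {
    private final Map<String, Object> delegate;
    private final Consumer<String> deprecationLogger;

    TypeDeprecatingMap(Map<String, Object> delegate, Consumer<String> deprecationLogger) {
        this.delegate = delegate;
        this.deprecationLogger = deprecationLogger;
    }

    private void warnIfType(Object key) {
        if ("_type".equals(key)) {
            deprecationLogger.accept("accessing [_type] in scripts is deprecated"); // illustrative text
        }
    }

    @Override
    public Object get(Object key) {
        warnIfType(key);
        return delegate.get(key);
    }

    @Override
    public Object put(String key, Object value) {
        warnIfType(key);
        return delegate.put(key, value);
    }

    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        // other operations go straight to the wrapped map without a warning
        return delegate.entrySet();
    }
}
```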
2019-01-18 09:13:49 -08:00
Yannick Welsch 377d96e376
Remove initial_master_nodes on node restart (#37580)
Some tests (e.g. testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since
being switched to Zen2 because the bootstrap setting was left around when nodes got restarted
with data folders wiped.

The test in question here was starting one node (which autobootstrapped to that single node), then
another node. The first node was then shut down (after excluding it from the voting configuration),
its data folder wiped, and restarted. After restart, the node had an empty data folder yet
initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of
its own, and not rejoin the existing cluster with the other node.
2019-01-18 16:36:42 +01:00
Jason Tedor ed297b7369
Only update response headers if we have a new one (#37590)
Currently when adding a response header, we do some de-duplication, and
maybe drop the header on the floor if we have reached capacity. Yet, we
still update the thread local tracking the response headers. This is
really expensive because under the hood there is a shared reference that
we synchronize on. In the case of a request processed across many shards
in a tight loop, this contention can be detrimental to performance. We
can avoid updating the thread local in these cases, namely when the
response header is a duplicate of one that we have already seen, or when
it's dropped on the floor. This commit addresses these performance
issues by avoiding the unnecessary set.
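A hypothetical sketch of the optimization: the header holder is immutable and returns itself when a value is a duplicate or is dropped at capacity, so the caller can skip the thread-local write in those cases. `ResponseHeaders`, `HeaderContext` and the capacity limit are invented for illustration and do not mirror ThreadContext's internals.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Immutable holder; withHeader returns `this` when nothing changes. */
final class ResponseHeaders {
    private final Map<String, Set<String>> headers;
    private final int maxValues;   // hypothetical cap on the number of stored values
    private final int valueCount;

    ResponseHeaders(int maxValues) {
        this(Collections.emptyMap(), maxValues, 0);
    }

    private ResponseHeaders(Map<String, Set<String>> headers, int maxValues, int valueCount) {
        this.headers = headers;
        this.maxValues = maxValues;
        this.valueCount = valueCount;
    }

    ResponseHeaders withHeader(String key, String value) {
        Set<String> existing = headers.get(key);
        if ((existing != null && existing.contains(value)) || valueCount >= maxValues) {
            return this; // duplicate, or dropped on the floor because we are at capacity
        }
        Set<String> values = new LinkedHashSet<>();
        if (existing != null) {
            values.addAll(existing);
        }
        values.add(value);
        Map<String, Set<String>> copy = new LinkedHashMap<>(headers);
        copy.put(key, Collections.unmodifiableSet(values));
        return new ResponseHeaders(Collections.unmodifiableMap(copy), maxValues, valueCount + 1);
    }

    Set<String> values(String key) {
        return headers.getOrDefault(key, Collections.emptySet());
    }
}

final class HeaderContext {
    private final ThreadLocal<ResponseHeaders> current;
    private int threadLocalWrites = 0; // counts set() calls, for demonstration only

    HeaderContext(int maxValues) {
        this.current = ThreadLocal.withInitial(() -> new ResponseHeaders(maxValues));
    }

    void addResponseHeader(String key, String value) {
        ResponseHeaders before = current.get();
        ResponseHeaders after = before.withHeader(key, value);
        if (after != before) { // only touch the thread local when something actually changed
            current.set(after);
            threadLocalWrites++;
        }
    }

    int threadLocalWrites() {
        return threadLocalWrites;
    }

    Set<String> values(String key) {
        return current.get().values(key);
    }
}
```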
2019-01-18 08:20:05 -05:00
Tanguy Leroux 29d3a708da Fix BulkWithUpdatesIT and CloseIndexIT
As of today the Close Index API does its best to close indices,
but closing an index with ongoing recoveries might or might not
be acknowledged, depending on the values of the max seq number
and global checkpoint at the time the
TransportVerifyShardBeforeClose action is executed.

These tests failed because they always expected the index to be
closed on the first try, which is not always the case.
Instead we need to retry closing until it succeeds.
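A minimal sketch of the retry loop, assuming the close request can be modelled as a `BooleanSupplier` that reports whether it was acknowledged. `CloseRetry` is a hypothetical helper, not a test-framework method.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper that re-issues a close request until it is acknowledged. */
final class CloseRetry {
    private CloseRetry() {}

    /**
     * Returns the attempt number that succeeded, or fails after maxAttempts.
     * A real test would normally also wait between attempts.
     */
    static int closeUntilAcknowledged(BooleanSupplier closeIndex, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (closeIndex.getAsBoolean()) {
                return attempt;
            }
        }
        throw new AssertionError("index was not closed after " + maxAttempts + " attempts");
    }
}
```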

Closes #37571
2019-01-18 10:54:35 +01:00
David Turner 65e76b3f6f
Migrate RecoveryFromGatewayIT to Zen2 (#37520)
* Fixes `testTwoNodeFirstNodeCleared` by manipulating voting config exclusions.

* Removes `testRecoveryDifferentNodeOrderStartup` since state recovery is now
  handled entirely on the elected master, so the order in which the data nodes
  start is irrelevant.
2019-01-18 09:15:51 +00:00
David Turner 699d881739
Migrate IndicesExistsIT to Zen2 (#37526)
This test was actually passing, for the wrong reason: it asserts a
`MasterNotDiscoveredException` is thrown, expecting this to be due to a failure
to perform state recovery, but in fact it's thrown because the node is not
correctly bootstrapped.
2019-01-18 09:15:30 +00:00
Christoph Büscher 2f0e0b2426
Allow indices.get_mapping response parsing without types (#37492)
This change adds a deprecation warning to the indices.get_mapping API in case the
"include_type_name" parameter is set to "true", and changes the parsing code in
GetMappingsResponse to parse the typeless response instead of the one
containing types. As a consequence, the HLRC client no longer needs to force
"include_type_name=true", and the GetMappingsResponseTests can be
adapted to the new format as well. This also removes some "include_type_name"
parameters in yaml tests and docs where they are not necessary.
2019-01-18 09:33:36 +01:00
Armin Braun 62ddc8c776 Reenable UnicastZenPingTests#testSimplePings
* This was muted needlessly, the problem in #26701 only applies to `6.x`
* Relates #26701
2019-01-18 08:36:22 +01:00
Tim Brooks b6f06a48c0
Implement follower rate limiting for file restore (#37449)
This is related to #35975. This commit implements rate limiting on the
follower side using a new class `CombinedRateLimiter`.
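A hypothetical sketch of a byte-based limiter that many concurrent file-chunk requests could share, so the configured rate bounds their combined throughput. `SharedByteRateLimiter` is not the `CombinedRateLimiter` implementation; time is passed in explicitly to keep the example deterministic.

```java
/**
 * Hypothetical limiter shared by every concurrent restore session, so the
 * configured rate caps their combined throughput rather than each one individually.
 */
final class SharedByteRateLimiter {
    private final double nanosPerByte;
    private long nextFreeNanos; // earliest time the next chunk may start without exceeding the rate

    SharedByteRateLimiter(double bytesPerSecond, long startNanos) {
        this.nanosPerByte = 1_000_000_000.0 / bytesPerSecond;
        this.nextFreeNanos = startNanos;
    }

    /** Reserves capacity for `bytes` and returns how long (ns) the caller should wait first. */
    synchronized long reserve(long bytes, long nowNanos) {
        long start = Math.max(nowNanos, nextFreeNanos);
        nextFreeNanos = start + (long) (bytes * nanosPerByte);
        return start - nowNanos;
    }
}
```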
2019-01-17 14:58:46 -07:00
Armin Braun 381d035cd6
Remove Redundant RestoreRequest Class (#37535)
* Same as #37464 but for the restore side
2019-01-17 22:23:23 +01:00