Commit Graph

2975 Commits

Author SHA1 Message Date
Jason Tedor 3aae98f922
Add debug logging for leases sync on recovery test
This commit adds some debug logging for a retention leases sync on
recovery test.
2019-04-09 22:59:22 -04:00
Julie Tibshirani d38214060e Mute ClusterDisruptionIT#testCannotJoinIfMasterLostDataFolder.
Tracked in #41047.
2019-04-09 17:36:21 -07:00
Julie Tibshirani a417905098 Mute RareClusterStateIT#testDelayedMappingPropagationOnPrimary as we await a fix.
Tracked in #41030.
2019-04-09 13:41:53 -07:00
Julie Tibshirani a0fc2461d7 Mute DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode as we await a fix. 2019-04-09 12:03:33 -07:00
Mark Vieira 1287c7d91f
[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) (#40993)
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)

This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.

(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)

* Fix forking JVM runner

* Don't bump shadow plugin version
2019-04-09 11:52:50 -07:00
Jason Tedor 321f93c4f9
Wait for all listeners in checkpoint listeners test
It could be that we try to shutdown the executor pool before all the
listeners have been invoked. It can happen that one was not invoked if
it timed out and was in the process of being notified that it timed out
on the executor. If we do this shutdown then, a listener will be met
with rejected execution exception. To address this, we first wait until
all listeners have been notified (or timed out) before proceeding with
shutting down the executor.

Relates #40970
2019-04-09 14:27:09 -04:00
Henning Andersen c5a77e5d8c Node repurpose tool docs (#40525)
Added documentation for node repurpose tool and included documentation on how to repurpose nodes safely. Adjusted order of tools in `elasticsearch-node` tool since the repurpose tool is most likely to be used.

Co-Authored-By: David Turner <david.turner@elastic.co>
2019-04-09 15:07:37 +02:00
David Turner 08ecdfe20e Short-circuit rebalancing when disabled (#40966)
Today if `cluster.routing.rebalance.enable: none` then rebalancing is disabled,
but we still execute `balanceByWeights()` and perform some rather expensive
calculations before discovering that we cannot rebalance any shards. In a large
cluster this can make cluster state updates occur rather slowly. With this
change we check earlier whether rebalancing is globally disabled and, if so,
avoid the rebalancing process entirely.

Relates #40942 which was reverted because of egregiously faulty tests.
2019-04-09 07:59:52 +01:00
Nhat Nguyen 69421612e5 Mute testRecoverMissingAnalyzer
Tracked at #40867
2019-04-08 22:47:20 -04:00
Nhat Nguyen 713e5c987b Adjust init map size of user data of index commit (#40965)
The number of user data attributes of an index commit has increased 
from 6 to 8, but we forgot to adjust. This change increases the initial 
size of that map to avoid resizing.
2019-04-08 22:47:20 -04:00
Christoph Büscher 335955b874 Some internal refactorings in AnalysisRegistry (#40609)
Reducing some methods scope and marking them as static where possible. Removing
"alias" support from AnalysisRegistry#produceAnalyze and changing that method to
return a NamedAnalyzer instead of having a side effect on the analyzer map passed in.
Also, CustomAnalyzerProvider doesn't seem to need the `environment` field.
2019-04-08 20:48:34 +02:00
David Turner 8eef92fafd Revert "Short-circuit rebalancing when disabled (#40942)"
This reverts commit f78e6ef73b.
2019-04-08 15:58:56 +01:00
David Turner f78e6ef73b Short-circuit rebalancing when disabled (#40942)
Today if `cluster.routing.rebalance.enable: none` then rebalancing is disabled,
but we still execute `balanceByWeights()` and perform some rather expensive
calculations before discovering that we cannot rebalance any shards. In a large
cluster this can make cluster state updates occur rather slowly. With this
change we check earlier whether rebalancing is globally disabled and, if so,
avoid the rebalancing process entirely.
2019-04-08 14:57:29 +01:00
Jim Ferenczi bc0fe7d64d Handle min_doc_freq in phrase suggester (#40840)
The phrase suggesters have an option to remove terms that have
a frequency lower than a provided min_doc_freq. However this value is
overwritten by the frequency of the original term in the popular mode.
This change ensures that we keep the maximum value between the provided
min_doc_freq and the original term frequency as a threshold to select
candidates.

Fixes #16764
2019-04-08 12:23:54 +02:00
Jason Tedor 4163e59768
Mute failing IndexShard local history test
This test fails reliably, so this commit mutes that test until a
fix is available.
2019-04-07 10:17:46 -04:00
Jason Tedor 6900399144
Be lenient when parsing build flavor and type on the wire (#40734)
Today we are strict when parsing build flavor and types off the
wire. This means that if a later version introduces a new build flavor
or type, an older version would not be able to parse what that new
version is sending. For a practical example of this, we recently added
the build type "docker", and this means that in a rolling upgrade
scenario older nodes would not be able to understand the build type that
the newer node is sending. This breaks clusters and is bad. We do not
normally think of adding a new enumeration value as being a
serialization breaking change, it is just not a lesson that we have
learned before. We should be lenient here though, so that we can add
future changes without running the risk of breaking ourselves
horribly. It is either that, or we have super-strict testing
infrastructure here yet still I fear the possibility of mistakes. This
commit changes the parsing of build flavor and build type so that we are
still strict at startup, yet we are lenient with values coming across
the wire. This will help avoid us breaking rolling upgrades, or clients
that are on an older version.
2019-04-06 17:24:16 -04:00
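A minimal sketch of the strict-at-startup, lenient-on-the-wire parsing described in the commit above. The enum, its constants, and the `fromDisplayName` method are hypothetical names chosen for illustration and are not taken from the Elasticsearch source.

```java
/** Hypothetical stand-in for a build type enum; not the Elasticsearch class. */
enum BuildType {
    DEB("deb"), DOCKER("docker"), RPM("rpm"), TAR("tar"), ZIP("zip"), UNKNOWN("unknown");

    private final String displayName;

    BuildType(String displayName) {
        this.displayName = displayName;
    }

    String displayName() {
        return displayName;
    }

    /**
     * Strict mode (assumed to be used at startup) rejects unrecognised names.
     * Lenient mode (assumed to be used for values read off the wire) maps them
     * to UNKNOWN, so an older node can still read what a newer node sends.
     */
    static BuildType fromDisplayName(String name, boolean strict) {
        for (BuildType type : values()) {
            if (type.displayName.equals(name)) {
                return type;
            }
        }
        if (strict) {
            throw new IllegalStateException("unexpected build type [" + name + "]");
        }
        return UNKNOWN;
    }
}
```

With this shape, adding a new constant in a later version no longer breaks older readers, at the cost of them seeing `UNKNOWN` instead of the real value.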
Jason Tedor e44e84ab42
Suppress lease background sync failures if stopping (#40902)
If the transport service is stopped, likely because we are shutting
down, and a retention lease background sync fires the logs will display
a warn message and stacktrace. Yet, this situation is harmless and can
happen as a normal course of business when shutting down. This commit
suppresses the log messages in this case.
2019-04-06 10:18:52 -04:00
David Turner 2ff19bc1b7
Use Writeable for TransportReplAction derivatives (#40905)
Relates #34389, backport of #40894.
2019-04-05 19:10:10 +01:00
Colin Goodheart-Smithe 4452e8e10f Mutes GatewayIndexStateIT.testRecoverBrokenIndexMetadata 2019-04-05 10:53:52 -04:00
David Turner 922a70ce32 Remove unused import
Relates #40863
2019-04-05 09:21:34 +01:00
David Turner d8956d2601 Remove test-only customisation from TransReplAct (#40863)
The `getIndexShard()` and `sendReplicaRequest()` methods in
TransportReplicationAction are effectively only used to customise some
behaviour in tests. However there are other ways to do this that do not cause
such an obstacle to separating the TransportReplicationAction into its two
halves (see #40706).

This commit removes these customisation points and injects the test-only
behaviour using other techniques.
2019-04-05 08:54:41 +01:00
Martijn van Groningen 809a5f13a4
Make -try xlint warning disabled by default. (#40833)
Many gradle projects specifically use the -try exclude flag, because
there are many cases where auto-closeable resource ignore is never
referenced in body of corresponding try statement. Suppressing this
warning specifically in each case that it happens using
`@SuppressWarnings("try")` would be very verbose.

This change removes `-try` from any gradle project and adds it to the
build plugin. Also this change removes exclude flags from gradle projects
that are already specified in the build plugin (for example -deprecation).

Relates to #40366
2019-04-05 08:02:26 +02:00
Nhat Nguyen 5a2eb07c0e Primary replica resync should not send ops without seqno (#40433)
Primary-replica resync in a mixed-cluster between 6.x and 5.6 can send
operations without sequence number to a replica which already processed
operations with sequence number. This leads to the failure of that
replica for we trip the sequence number assertion when writing resync
operations without sequence number to translog.
2019-04-04 21:54:31 -04:00
Colin Goodheart-Smithe 402f312c5e
Adds version 6.7.2 2019-04-04 16:35:39 +01:00
Nhat Nguyen 2756a3936b Reject illegal flush parameters (#40213)
This change rejects an illegal combination of flush parameters where
force is true, but wait_if_ongoing is false. This combination is trappy
and should be forbidden.

Closes #36342
2019-04-04 09:02:31 -04:00
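The rejected combination from the commit above can be sketched as a small validation step. The class and method names are hypothetical; only the rule itself (force without wait_if_ongoing is illegal) comes from the commit message.

```java
/** Hypothetical validator for flush parameters; not the Elasticsearch request class. */
final class FlushParams {
    private FlushParams() {}

    /** Throws if force is requested without waiting for an ongoing flush. */
    static void validate(boolean force, boolean waitIfOngoing) {
        if (force && !waitIfOngoing) {
            throw new IllegalArgumentException("wait_if_ongoing must be true for a force flush");
        }
    }
}
```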
Nhat Nguyen c4960ad736 Ensure flush happens before closing an index (#40184)
If there's an ongoing flush triggered by the translog flush threshold,
we may fail to execute a flush because waitIfOngoing is false by
default.

Relates to #36342
2019-04-04 09:02:31 -04:00
Nhat Nguyen e716b9ceee Ensure no scheduled refresh in testPendingRefreshWithIntervalChange
If a refresh, which is scheduled by the setting change, executes after
the index-2 operation and wins the refresh race (i.e., maybeRefresh) with
the scheduledRefresh that we are going to check, then the latter will
return false.

Closes #39565
Relates #39462

PR #40387
2019-04-04 09:02:31 -04:00
Adrien Grand 670e76669c
Fix alias resolution runtime complexity. (#40263) (#40788)
A user reported that the same query that takes ~900ms when querying an index
pattern only takes ~50ms when only querying indices that have matches. The
query is a date range query and we confirmed that the `can_match` phase works
as expected. I was able to reproduce this issue locally with a single node: with
900 1-shard indices, a query to an index pattern that matches all indices runs
in ~90ms while a query to the only index that has matches runs in 0-1ms.

This ended up not being related to the `can_match` phase but to the cost of
resolving aliases when querying an index pattern that matches lots of indices.
In that case, we first resolve the index pattern to a list of concrete indices
and then for each concrete index, we check whether it was matched through an
alias, meaning we might have to apply alias filters. Unfortunately this second
per-index operation runs in linear time with the number of matched concrete
indices, which means that alias resolution runs in O(num_indices^2) overall.
So queries get quadratically slower as an index pattern matches more indices.

I reorganized alias resolution into a one-step operation that runs in linear
time with the number of matched indices, and then a per-index operation that
runs in linear time with the number of aliases of this index. This makes alias
resolution run in O(num_indices * num_aliases_per_index) overall instead. When
testing the scenario described above, the `took` went down from ~90ms to ~10ms.
It is still more than the 0-1ms latency that one gets when only querying the
single index that has data, but still much better than what we had before.

Closes #40248
2019-04-04 11:40:42 +02:00
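The reorganisation described in the commit above can be illustrated with a simplified sketch: one up-front step builds a hash set of the resolved names, and the per-index step only walks that index's own aliases. The class, method names, and data shapes below are assumptions for illustration and do not mirror the actual Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified, hypothetical model of two-step alias resolution. */
final class AliasResolution {
    private AliasResolution() {}

    /** One-time step: linear in the number of resolved names. */
    static Set<String> resolvedNames(Collection<String> names) {
        return new HashSet<>(names);
    }

    /**
     * Per-index step: linear in the number of aliases of this index, because
     * each membership test against the hash set takes expected constant time.
     */
    static List<String> matchingAliases(List<String> aliasesOfIndex, Set<String> resolved) {
        List<String> matches = new ArrayList<>();
        for (String alias : aliasesOfIndex) {
            if (resolved.contains(alias)) {
                matches.add(alias);
            }
        }
        return matches;
    }
}
```

Calling `matchingAliases` once per concrete index gives the O(num_indices * num_aliases_per_index) total from the commit message, rather than rescanning all matched indices for every index.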
Adrien Grand f5f5c3e429
Add unit test for MetaDataMappingService with typeless put mapping. (#40578) (#40720)
This is currently only tested via REST tests.

Closes #37450
2019-04-04 10:07:55 +02:00
Ryan Ernst a28d5f35d9 Fix geo points missing test (#40704)
This commit initializes the geo points for the missing doc values test.

fixes #40684
2019-04-03 16:48:09 -07:00
Mayya Sharipova a94e9500ac Correct bug in ScriptDocValues (#40488)
If a field `field_name` was missing in a document,
doc['field_name'].get(0) incorrectly retrieved
a value of the previously accessed document.
This happened because `get(int index)` function
was just accessing `values[index]` without
checking the number of values - `count`.

This PR fixes this.
2019-04-03 16:47:59 -07:00
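A minimal sketch of the kind of bug fixed in the commit above: a reused buffer keeps stale entries from the previous document, so `get` has to check `count` rather than the array length. The class below is a hypothetical stand-in, not the real `ScriptDocValues`, and the exception type is an assumption.

```java
/** Hypothetical doc-values holder that reuses one buffer across documents. */
final class ReusedLongValues {
    private long[] values = new long[0];
    private int count;

    /** Loads the values of the next document; the buffer only ever grows. */
    void setNextDoc(long... docValues) {
        if (docValues.length > values.length) {
            values = new long[docValues.length];
        }
        System.arraycopy(docValues, 0, values, 0, docValues.length);
        // Entries at positions >= count may still hold the previous document's values.
        count = docValues.length;
    }

    int size() {
        return count;
    }

    /** Checks the index against count, so stale entries are never returned. */
    long get(int index) {
        if (index < 0 || index >= count) {
            throw new IllegalStateException("no value at index " + index + " for the current document");
        }
        return values[index];
    }
}
```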
Yannick Welsch 6ae7d593ea Avoid background sync on relocated primary (#40800)
There were some test failures caused by the background retention lease sync running on a relocated
primary. This commit fixes the situation that triggered the assertion and reactivates the failing test.

Closes #40731
2019-04-03 20:28:48 +02:00
Christoph Büscher 89389197b3 Help Eclipse infer lambda parameter types (#40747)
The Eclipse compiler (4.10, Photon) cannot build this test because it cannot
correctly infer the type arguments of the functions. Explicitly adding them
helps in this case.
2019-04-03 17:51:22 +02:00
Christoph Büscher 09ba3ec677 Small refactorings to analysis components (#40745)
This change adds the following internal refactorings:

* wraps input analyzers into an unmodifiable map in IndexAnalyzers ctor
* removes duplicated indexSetting in IndexAnalyzers
* removes references to IndexAnalyzers from DocumentMapperParser and TypeParser.ParserContext.
  It can always be retrieved from MapperService directly in those cases
2019-04-03 14:22:16 +02:00
David Turner 1d2bc85586 Inline TransportReplAction#registerRequestHandlers (#40762)
It is important that resync actions are not rejected on the primary even if its
`write` threadpool is overloaded. Today we do this by exposing
`registerRequestHandlers` to subclasses and overriding it in
`TransportResyncReplicationAction`. This isn't ideal because it obscures the
difference between this action and other replication actions, and also might
allow subclasses to try and use some state before they are properly
initialised. This change replaces this override with a constructor parameter to
solve these issues.

Relates #40706
2019-04-03 12:12:26 +01:00
Jason Tedor df65e46d10
Deprecate versions of Java prior to Java 11 (#40756)
This commit deprecates versions of Java prior to Java 11. This commit
will cause a warning to be printed to standard error when any command
line tool is invoked, or when Elasticsearch is started. Additionally, we
log a deprecation message when Elasticsearch is started.
2019-04-03 06:39:40 -04:00
David Turner e64524c46f Remove some abstractions from `TransportReplicationAction` (#40706)
`TransportReplicationAction` is a rather complex beast, and some of its
concrete implementations do not need all of its features. More specifically, it
(a) chases a primary around the cluster until it manages to pin it down and
then (b) executes an action on that primary and all its replicas. There are
some actions that are coordinated by the primary itself, meaning that there is
no need for the chase-the-primary phases, and in the case of peer recovery
retention leases and primary/replica resync it is important to bypass these
first phases.

This commit is a step towards separating the `TransportReplicationAction` into
these two parts. It is a mostly mechanical sequence of steps to remove some
abstractions that are no longer in use.
2019-04-03 09:08:29 +01:00
Simon Willnauer dd624c31b0 Don't mark shard as refreshPending on stats fetching (#40458)
Completion and DocStats are pulled from internal readers
instead of external since #33835 and #33847 which doesn't require
us to refresh after a stats call since refreshes will happen internally
anyhow and that will cause updated stats on ongoing indexing.
2019-04-02 16:15:30 +02:00
David Turner 6f00952abd Use TAR instead of DOCKER build type before 6.7.0 (#40723)
In 6.7.0 (#39378) we added a build type of DOCKER for the docker images, but
unfortunately earlier versions do not understand this and will reject any
transport messages that mention this build type.

This commit fixes this by reporting TAR instead of DOCKER when talking to older
nodes.

Relates (but does not fix) #40511
Relates #39378
2019-04-02 13:17:50 +01:00
Alexander Reelsen c644fbfc6e Allow single digit milliseconds in strict date parsing (#40676)
In order to remain compatible with the existing joda based
implementation the parsing of milliseconds should support parsing single
digits instead of relying on three, even with strict formats.

This adds a few tests to duel against the existing joda based
implementation in order to ensure the parsing behaviour is the same.

Closes #40403
2019-04-02 10:27:50 +02:00
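One way to accept one to three fraction digits with `java.time` is `DateTimeFormatterBuilder.appendFraction`, sketched below. This illustrates the behaviour described in the commit above. It is not necessarily how the commit implements it, and the class name and the time-only pattern are assumptions.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

/** Hypothetical time parser that accepts 1 to 3 fractional-second digits. */
final class FlexibleMillis {
    private FlexibleMillis() {}

    // A fixed-width "SSS" pattern would require exactly three digits.
    // appendFraction with minWidth 1 and maxWidth 3 also accepts ".5" or ".12".
    static final DateTimeFormatter TIME = new DateTimeFormatterBuilder()
        .appendPattern("HH:mm:ss")
        .appendFraction(ChronoField.NANO_OF_SECOND, 1, 3, true)
        .toFormatter();

    static LocalTime parse(String text) {
        return LocalTime.parse(text, TIME);
    }
}
```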
Andrey Ershov 287e334ef3 Do not perform cleanup if Manifest write fails with dirty exception (#40519)
Currently, if Manifest write is unsuccessful (i.e. WriteStateException
is thrown) we perform cleanup of newly created metadata files.
However, this is wrong.
Consider the following sequence (caught by CI here
https://github.com/elastic/elasticsearch/issues/39077):

- cluster global data is written **successfully**
- the associated manifest write **fails** (during the fsync, ie files
have been written)
- deleting (revert) the manifest files, **fails**, metadata is
therefore persisted
- deleting (revert) the cluster global data is **successful**

In this case, when trying to load metadata (after node restart
because of dirty WriteStateException),  the following exception will
happen
```
java.io.IOException: failed to find global metadata [generation: 0]
```
because the manifest file is referencing missing global metadata file.

This commit checks whether the thrown WriteStateException is dirty and,
if it is, we don't perform any cleanup, because a new Manifest file might
have been created while its deletion failed.
In the future, we might add a more fine-grained check: perform the
clean up if WriteStateException is dirty, but Manifest deletion is
successful.

Closes https://github.com/elastic/elasticsearch/issues/39077

(cherry picked from commit 1fac56916bb3c4f3333c639e59188dbe743e385b)
2019-04-01 12:52:32 +03:00
Jim Ferenczi 7cc79123df Fix merging of text field mapper (#40627)
On mapping updates the `text` field mapper does not update
the field types for the underlying prefix and phrase fields.
In practice this shouldn't be considered as a bug but we have
an assert in the code that checks that field types in the mapper service
are identical to the ones present in field mappers.
2019-04-01 08:41:42 +02:00
Jason Tedor cebe509460
Fix bug in detecting use of bundled JDK on macOS
This commit fixes a bug in detecting the use of the bundled JDK on
macOS. This bug arose because the path of Java home is different on
macOS.
2019-03-31 19:43:17 -04:00
Henning Andersen 92d07e9377 Geo Point parse error fix (#40447)
When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to #17617
2019-03-29 17:39:12 +01:00
Luca Cavanna a0b02ce6ef Move top-level pipeline aggs out of QuerySearchResult (#40319)
As part of #40177 we have added top-level pipeline aggs to
`InternalAggregations`. Given that `QuerySearchResult` holds an
`InternalAggregations` instance, there is no need to keep on setting
top-level pipeline aggs separately. Top-level pipeline aggs can then
always be transported through `InternalAggregations`. Such change is
made in a backwards compatible manner.
2019-03-29 17:01:14 +01:00
Oghenovo Usiwoma 444b4c4136 Improve error message for absence of indices (#39789)
"no indices exist" has been added to the error message for absence of indices
2019-03-29 17:01:14 +01:00
Luca Cavanna 48b0deef4f Remove throws IOException from PipelineAggregationBuilder#create (#40222)
IOException is never thrown in any of the existing pipeline aggregation
builders. Removing the throws IOException from the create method allows
removing it from a couple of other methods as well, which ends up
simplifying AggregationPhase (one less catch).
2019-03-29 17:01:14 +01:00
Jason Tedor 585f38787c
Add usage indicators for the bundled JDK (#40616)
This commit adds indications whether or not a distribution is from the
bundled JDK, and whether or not we are using the bundled JDK.
2019-03-29 08:25:32 -04:00
Martijn van Groningen be31800154
Update ingest jdocs that a null return value will drop the current document. (#40359) 2019-03-29 09:46:54 +01:00
Jason Tedor 7255562afd
Add start and stop time to cat recovery API (#40378)
The cat recovery API is incredibly useful. Yet it is missing the start
and stop time as an option from the output. This commit adds these as
options to the cat recovery API. We elect to make these not visible by
default to avoid breaking the output that users might rely on.
2019-03-28 16:23:37 -04:00
Mayya Sharipova 24755209b4 Add randomScore function in script_score query (#40186)
To give the script_score query the same features
as the function_score query, we need to add a randomScore
function.

This function produces different
random scores on different index shards.
It is also able to produce random scores
based on the internal Lucene Document Ids.
2019-03-28 13:23:47 -04:00
David Turner 1a3916a8de Optimise rejection of out-of-range `long` values (#40325)
Today if you try and insert a very large number like `1e9999999` into a long
field we first construct this number as a `BigDecimal`, convert this to a
`BigInteger` and then reject it because it is out of range. Unfortunately
making such a large `BigInteger` is rather expensive.

We can avoid this expense by performing a (weaker) range check on the
`BigDecimal` representation of incoming `long`s too.

Relates #26137
Closes #40323
2019-03-28 12:27:34 +00:00
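A sketch of the cheaper rejection path described in the commit above: compare the `BigDecimal` against bounds just outside the `long` range before converting to `BigInteger`. The class name, bounds, and error message are assumptions for illustration.

```java
import java.math.BigDecimal;

/** Hypothetical helper that rejects out-of-range values before any BigInteger conversion. */
final class LongRangeCheck {
    private LongRangeCheck() {}

    private static final BigDecimal ABOVE_MAX = BigDecimal.valueOf(Long.MAX_VALUE).add(BigDecimal.ONE);
    private static final BigDecimal BELOW_MIN = BigDecimal.valueOf(Long.MIN_VALUE).subtract(BigDecimal.ONE);

    static long toLong(String text) {
        BigDecimal value = new BigDecimal(text);
        // Weaker range check on the BigDecimal itself: comparing does not need to
        // expand a value such as 1e9999999 into millions of digits.
        if (value.compareTo(ABOVE_MAX) >= 0 || value.compareTo(BELOW_MIN) <= 0) {
            throw new IllegalArgumentException("Value [" + text + "] is out of range for a long");
        }
        // Safe now: the value is small enough that the conversion is cheap.
        // Any fractional part is truncated toward zero by toBigInteger().
        return value.toBigInteger().longValueExact();
    }
}
```

The check is deliberately weaker than an exact one: a value such as `Long.MAX_VALUE + 0.5` passes it and is then truncated, while `longValueExact` remains as the final guard.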
David Turner 073b13f5b0 Add docs for cluster.remote.*.proxy setting (#40281)
In #33062 we introduced the `cluster.remote.*.proxy` setting for proxied
connections to remote clusters, but left it deliberately undocumented since it
needed followup work so that it could work with SNI. However, since #32517 is
now closed we can add this documentation and remove the comment about its lack
of documentation.
2019-03-28 12:11:24 +00:00
jimczi 8775e37d03 Fix SearchResponseMerger#testMergeSearchHits
This commit fixes an edge case in tests where search hits are empty
after the merge but some shards returned hits. This can happen if
the total number of merged hits is less than the provided `from`.

Closes #40553
2019-03-28 09:57:21 +01:00
Adrien Grand 65a35c985c
Remove type from VersionConflictEngineException. (#37490) (#40514)
It initially mentioned the type in the exception because the type used to be
required to uniquely identify a document. This is not necessary anymore given
that indices have at most one type.
2019-03-28 09:32:09 +01:00
Adrien Grand 2326a3dccb
Remove String interning from `o.e.index.Index`. (#40350) (#40517)
`Index` interns its name and uuid. My guess is that the main goal is to avoid
having duplicate strings in the representation of the cluster state. However
I doubt it helps much given that we have many other objects in the cluster state
that we don't try to reuse, and interning has some cost. When looking into
 #40263 my profiler pointed to string interning because of the `Index` object
that is created in `QueryShardContext` as one of the bottlenecks of the
`can_match` phase.
2019-03-28 09:31:42 +01:00
Andy Bristol 23395a9b9f
search as you type fieldmapper (#35600)
Adds the search_as_you_type field type that acts like a text field optimized
for as-you-type search completion. It creates a couple subfields that analyze
the indexed terms as shingles, against which full terms are queried, and a
prefix subfield that analyzes terms as the largest shingle size used and
edge-ngrams, against which partial terms are queried.

Adds a match_bool_prefix query type that creates a boolean clause of a term
query for each term except the last, for which a boolean clause with a prefix
query is created.

The match_bool_prefix query is the recommended way of querying a search as you
type field, which will boil down to term queries for each shingle of the input
text on the appropriate shingle field, and the final (possibly partial) term
as a term query on the prefix field. This field type also supports phrase and
phrase prefix queries, however.
2019-03-27 13:29:13 -07:00
Benjamin Trent 6563dc7ed9
Muting test for #40553 (#40555) 2019-03-27 14:52:12 -05:00
Tim Brooks ab44f5fd5d
Add InboundHandler for inbound message handling (#40430)
This commit adds an InboundHandler to handle inbound message processing.
With this commit, this code is moved out of the TcpTransport.
Additionally, finer grained unit tests are added to ensure that the
inbound processing works as expected.
2019-03-27 12:33:26 -06:00
Yannick Welsch 64b31f44af No mapper service and index caches for replicated closed indices (#40423)
Replicated closed indices can't be indexed into or searched, and therefore don't need a shard with
full indexing and search capabilities allocated. We can save on a lot of heap memory for those
indices by not allocating a mapper service and caching infrastructure (which preallocates a constant
amount per instance). Before this change, a 1GB ES instance could host 250 replicated closed
metricbeat indices (each index with one shard). After this change, the same instance can host 7300
replicated closed metricbeat indices (not that this would be a recommended configuration). Most
of the remaining memory is in the cluster state and the IndexSettings object.
2019-03-27 19:04:24 +01:00
Yannick Welsch 8f7c5732f1 Use default discovery implementation for single-node discovery (#40036)
Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with a few small adaptations:

- auto-bootstrapping, but requiring initial_master_nodes not to be set.
- not actively pinging other nodes using the Peerfinder
- not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).
2019-03-27 19:04:24 +01:00
Tim Brooks 3860ddd1a4
Move outbound message handling to OutboundHandler (#40336)
Currently there are some components of message serialization and sending
that still occur in TcpTransport. This commit makes it possible to
send a message without the TcpTransport by moving all of the remaining
application logic to the OutboundHandler. Additionally, it adds unit
tests to ensure that this logic works as expected.
2019-03-27 11:47:36 -06:00
David Turner 707d40ce06 Stabilise testStaleMasterNotHijackingMajority (#40253)
This test inadvertently asserts that the election occurs after a master failure
is clean. However, messy elections are a fact of life so we should not fail on
a messy election.

This change moves this test away from an `AbstractDisruptionTestCase` since it
does not need the fault detector to be so enthusiastic, and weakens the
assertions to merely say that we ignore states published by the old master
without saying anything about the cleanliness of the election.

Closes #36556
2019-03-27 16:00:14 +00:00
Tim Brooks 760cfffe4b
Move TransportMessageListener to TransportService (#40474)
Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propagate the events to the external listeners.

Additionally this commit back ports #40237

Remove Tracer from MockTransportService

Currently the TransportMessageListener is applied and used in the
Transport class. However, local requests and responses never make it to
this class. This PR moves the listener add/remove methods to the
TransportService. After this change the Transport can only have one
listener set with it. This one listener is the TransportService, which
will then propagate the events to the external listeners.
2019-03-27 09:24:20 -06:00
Przemyslaw Gomulka 65f01277ed
Parse composite patterns using ClassicFormat.parseObject backport(#40100) (#40501)
Java-time fails to parse composite patterns when the first pattern matches only a prefix of the input. It expects patterns in longest to shortest order. Because of this, constructing just one DateTimeFormatter with appendOptional is not sufficient. Parsers have to be iterated and, if parsing fails, the next one in order should be used. In order not to degrade performance, parsing should not throw exceptions on failure. Format.parseObject was used as it only returns null when parsing failed and allows checking whether the full input was read.
 
closes #39916
backport #40100
2019-03-27 13:51:44 +01:00
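The iteration described in the commit above can be sketched with `DateTimeFormatter.toFormat()` and a `ParsePosition`: a failed parse returns null instead of throwing, and the position shows whether the whole input was consumed. The class and method names are hypothetical, and this is a simplified illustration rather than the code from the commit.

```java
import java.text.ParsePosition;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.util.List;

/** Hypothetical composite parser that tries each formatter in order without throwing. */
final class CompositeDateParser {
    private final List<DateTimeFormatter> formatters;

    CompositeDateParser(List<DateTimeFormatter> formatters) {
        this.formatters = formatters;
    }

    /** Returns the result of the first formatter that consumes the whole input, or null. */
    TemporalAccessor parseOrNull(String input) {
        for (DateTimeFormatter formatter : formatters) {
            ParsePosition position = new ParsePosition(0);
            // parseObject returns null on failure rather than throwing.
            Object parsed = formatter.toFormat().parseObject(input, position);
            // A formatter that matched only a prefix leaves the index short of the end.
            if (parsed != null && position.getIndex() == input.length()) {
                return (TemporalAccessor) parsed;
            }
        }
        return null;
    }
}
```

Because a prefix-only match is rejected by the position check, the order of the formatters no longer has to be longest to shortest.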
Yannick Welsch b4b17e16e0 Remove TransportSingleItemBulkWriteAction as replication action (#40424)
The implementation of TransportIndexAction and TransportDeleteAction as
TransportReplicationAction existed for interoperability with older 5.x nodes, as these older nodes
coordinated single index / deletes as replication requests. This BWC layer is no longer needed in 7.x,
where these single actions are now mapped to bulk requests. Completely removing the deprecated
transport actions is not possible yet if we want to keep BWC with a 6.x transport client. The best
way here is to wait for the transport client to go away and then just remove the actions.
2019-03-27 13:16:58 +01:00
Jim Ferenczi fe05a4d511 Fix random failures in SearchResponseMerger#testMergeSearchHits (#40223)
This commit fixes the expectation in the test when the search hits are empty.

Closes #40214
2019-03-27 11:17:10 +01:00
alex101101 fb8ad0cf30 Add a soft limit to the field name length (#40309)
Adds an optional limit to the length of field names and throws an IllegalArgumentException if the limit is breached.
Closes #33651
2019-03-26 17:58:32 +01:00
Daniel Mitterdorfer f2b5960f90 Add version 6.7.1 2019-03-26 17:38:14 +01:00
Yannick Welsch bf7b167bba Remove timeout task after completing cluster state publication (#40411)
Each cluster state publication schedules a cancellation task with the provided publication timeout
(30s by default). This scheduled cancellation keeps a reference to the publication, and therefore the
full cluster state that was published. In case of frequently updating a large cluster state, this results
in a large number of cancellation tasks keeping references to all previously published cluster states.
2019-03-26 17:13:57 +01:00
Henning Andersen bf444b9f02 Store Pending Deletions Fix (#40345)
FilterDirectory.getPendingDeletions does not delegate, fixed
temporarily by overriding in StoreDirectory.

This in turn caused duplicate file name use after a trimUnsafeCommits
had been done, since a new IndexWriter would not consider the pending
deletes in IndexFileDeleter. This should only happen on Windows (AFAIK).

Reenabled doing index updates for all tests using
IndexShardTests.indexOnReplicaWithGaps (which could fail due to above
when using mocked WindowsFS).

Added getPendingDeletions delegation to all elasticsearch
FilterDirectory subclasses that were not trivial test-only overrides to
minimize the risk of hitting this issue in another case.
2019-03-26 15:30:44 +01:00
Alan Woodward 12634850d6 IntervalQueryBuilderTests#testNonIndexedFields test fix (#40418)
This test checks that interval queries constructed against a field with no indexed
positions will throw exceptions. It uses a randomly built IntervalsSourceProvider
against a fixed set of fields; however, the random source builder can occasionally
provide a source with a fixed field, meaning that even if the top-level query asks
for a set of intervals over a non-indexed field, the source will delegate to another
field, and no exception will be thrown.

This commit changes the test to always use a simple Match provider.

Fixes #40436
2019-03-26 08:33:42 +00:00
Nhat Nguyen 495dc11c9c Mute testPendingRefreshWithIntervalChange
Tracked at #39565
2019-03-25 11:47:08 -04:00
Armin Braun 3968d46a17
Remove Redundant Request Wrappers from RepositoryService (#40192) (#40404) 2019-03-25 16:36:02 +01:00
Armin Braun dc5ff0fffc
Log Warning on Failed Blob Deletes in BlobStoreRepository (#40188) (#40340)
* Log Warning on Failed Blob Deletes in BlobStoreRepository
* We should not just debug log these spots, they all can and will lead to leaked files when snapshot deletion fails
2019-03-25 08:52:09 +01:00
Nhat Nguyen b9f96a8e1f
Expose external refreshes through the stats API (#38643)
Right now, the stats API only provides refresh metrics regarding
internal refreshes. This isn't very useful and somewhat misleading for
cluster administrators since the internal refreshes are not indicative
of documents being available for search.

In this PR I added a new metric for collecting external refreshes as
they occur and exposing them through the stats API. Now, calling an
endpoint for stats will yield external refresh metrics as well.

Relates #36712
2019-03-24 22:21:00 -04:00
Armin Braun 13d76239a0
Use Netty ByteBuf Bulk Operations for Faster Deserialization (#40158) (#40339)
* Use bulk methods to read numbers faster from byte buffers
2019-03-24 19:08:51 +01:00
Jason Tedor 10bbb082a4
Only run retention lease actions on active primary (#40386)
In some cases, a request to perform a retention lease action can arrive
on a primary shard before it is active. In this case, the primary shard
would not yet be in primary mode, tripping an assertion in the
replication tracker. Instead, we should not attempt to perform such
actions on an initializing shard. This commit addresses this by not
returning the primary shard in the single shard iterator if the primary
shard is not yet active.
2019-03-23 09:39:39 -04:00
Zachary Tong 78f737dad3
Map value field to double in MovavgIT (#40230)
We were accidentally not mapping the index, which meant dynamic mapping
was choosing floats for the values.  This led to enough loss of precision
for the aggregated values to differ slightly from the test doubles,
which accumulated into large differences in the holt output.

This test fix adds an explicit mapping.
2019-03-21 14:03:14 -04:00
Jason Tedor 1e6941b138
Reduce retention lease sync intervals (#40302)
This commit adjusts the frequency with which CCR renews retention leases
and with which primaries sync retention leases to replicas. This helps
Lucene reclaim soft-deleted documents more aggressively, which we have
found in some use-cases can help improve performance, and either way
will help keep disk space under more control.
2019-03-21 07:37:44 -04:00
Alan Woodward 83d2870308 Add `use_field` option to intervals query (#40157)
This is the equivalent of the `field_masking_span` query, allowing users to
merge intervals from multiple fields - for example, to search for stemmed tokens
near unstemmed tokens.
2019-03-20 16:26:04 +00:00
Like 6f64267626 Make setting index.translog.sync_interval be dynamic (#37382)
Currently, the index setting index.translog.sync_interval cannot be updated while the index is open,
because it is not dynamic and can only be changed on a closed index.

Closes #32763
2019-03-20 17:12:45 +01:00
Yannick Welsch a5fb7fb17c Fix snapshot restore logging on fresh restore (#40252)
A recent refactoring (#37130) where imports got mixed up (changing Lucene's
IndexNotFoundException to Elasticsearch's IndexNotFoundException) led to many warnings being
logged in case of restoring a fresh snapshot.
2019-03-20 16:51:44 +01:00
Jim Ferenczi 3400483af4
Add date and date_nanos conversion to the numeric_type sort option (#40199) (#40224)
This change adds an option to convert a `date` field to nanoseconds resolution
and a `date_nanos` field to millisecond resolution when sorting.
The resolution of the sort can be set using the `numeric_type` option of the
field sort builder. The conversion is done at the shard level and is restricted
to dates from 1970 to 2262 for the nanoseconds resolution in order to avoid
numeric overflow.
2019-03-20 16:50:28 +01:00
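The 1970 to 2262 restriction mentioned in #40199 above follows from holding nanoseconds since the epoch in a signed 64-bit `long`. The following is a small sketch of that arithmetic only; `DateNanosRangeSketch` and its method names are hypothetical and this is not the conversion code Elasticsearch uses.

```java
import java.time.Instant;

class DateNanosRangeSketch {
    private static final long NANOS_PER_MILLI = 1_000_000L;
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    // Converts epoch milliseconds to epoch nanoseconds, rejecting values that
    // fall outside the range stated in the commit message.
    static long millisToNanos(long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("dates before 1970 cannot be converted: " + epochMillis);
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(epochMillis, NANOS_PER_MILLI);
    }

    // Converts epoch nanoseconds to epoch milliseconds; this direction cannot overflow.
    static long nanosToMillis(long epochNanos) {
        return epochNanos / NANOS_PER_MILLI;
    }

    // The latest instant whose nanoseconds-since-epoch value still fits in a long.
    static Instant maxNanosInstant() {
        return Instant.ofEpochSecond(Long.MAX_VALUE / NANOS_PER_SECOND, Long.MAX_VALUE % NANOS_PER_SECOND);
    }

    public static void main(String[] args) {
        System.out.println(maxNanosInstant()); // falls in April 2262
        System.out.println(millisToNanos(1_553_097_028_000L));
    }
}
```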
Nhat Nguyen efaf95628b Use separate translog dir in testDeleteWithFatalError
This test currently opens a new engine but shares the same translog
directory of the previous opening engine.
2019-03-20 10:22:27 -04:00
Mayya Sharipova 49a7c6e0e8
Expose proximity boosting (#39385) (#40251)
Expose DistanceFeatureQuery for geo, date and date_nanos types

Closes #33382
2019-03-20 09:24:41 -04:00
Henning Andersen 4c2a8638ca Cascading primary failure lead to MSU too low (#40249)
If a replica were first reset due to one primary failover and then
promoted (before resync completes), its MSU would not include changes
since global checkpoint, leading to errors during translog replay.

Fixed by re-initializing MSU before restoring local history.
2019-03-20 14:00:43 +01:00
Simon Willnauer 235f57989f Return cached segments stats if `include_unloaded_segments` is true (#39698)
Today we don't return segments stats for closed indices which makes it
hard to tell how much memory such an index would require. With this change
we return the statistics if requested by setting `include_unloaded_segments` to
true on the rest request.

Relates to #39512
2019-03-20 12:08:41 +01:00
Jason Tedor 9ce740a2eb
Modify casing in JVM home log message
This makes the log message consistent with the following line that shows
the JVM arguments.
2019-03-20 00:06:16 -04:00
Zachary Tong 69f5869707 Mute SearchResponseMergerTests#testMergeSearchHits
Tracking issue: https://github.com/elastic/elasticsearch/issues/40214
2019-03-19 13:40:38 -04:00
David Turner 33d8738c68 Fix RareClusterStateIT on MacOS (#40203)
Today RareClusterStateIT#testAssignmentWithJustAddedNodes fails on my Mac
because it waits for the default connection timeout of 30 seconds to connect to
a fake node with IP address 0.0.0.0. This connection attempt fails much more
quickly on Linux so the test passes.

This commit fixes this by reducing the connection timeout for this test.
2019-03-19 17:33:21 +00:00
Nhat Nguyen a13b4bc8c5 Always fail engine if delete operation fails (#40117)
Unlike index operations which can fail at the document level due to
analyzing errors, delete operations should never fail at the document
level whether soft-deletes is enabled or not. With this change, we will
always fail the engine if we fail to apply a delete operation to Lucene.

Closes #33256
2019-03-19 13:09:23 -04:00
Luca Cavanna d14e79e849 Serialize top-level pipeline aggs as part of InternalAggregations (#40177)
We currently convert pipeline aggregators to their corresponding
InternalAggregation instance as part of the final reduction phase.
They arrive to the coordinating node as part of QuerySearchResult
objects from the shards and, although we may incrementally reduce
aggs (hence we may have some non-final reduce and the final
one later) all the reduction phases happen on the same node.

With CCS minimizing roundtrips though, each cluster performs its
own non-final reduction, and then serializes the results back to
the CCS coordinating node which will perform the final coordination.
This breaks the assumptions made up until now around reductions
happening all on the same node.

With #40101 we have made sure that top-level pipeline aggs are not
reduced as part of the non-final reduction. The next step is to make
sure that they don't get lost, meaning that each coordinating node
needs to send them back to the CCS coordinating node as part of
the top-level `InternalAggregations` object.

Closes #40059
2019-03-19 14:43:39 +01:00
Luca Cavanna 803ec46331 Skip sibling pipeline aggregators reduction during non-final reduce (#40101)
Today a coordinating node forces a final reduction of sibling pipeline aggregators whenever reducing aggs, unless it is reducing aggs incrementally. This works well for incremental reduction of aggs, but breaks CCS when minimizing roundtrips, as each cluster ends up reducing its own pipeline aggregators locally while that should only be done by the CCS coordinating node later. This causes issues because, after their reduction, pipeline aggs cannot be further reduced, which is what happens with CCS, causing errors like "java.lang.UnsupportedOperationException: Not supported" being returned.

Each coordinating node should rather honour the reduce context flag that
indicates whether we are executing a final reduce or not. If not, it should leave the sibling pipeline aggregations alone.

Note that this bug affects only pipeline aggs that don't have a parent in
the aggs tree, while all the others work well.

Relates to #40059 but does not fix it yet, as the CCS coordinating node also needs to be adapted to recreate sibling pipeline aggregators from the request.
2019-03-19 14:43:39 +01:00
Luca Cavanna 83f12a3d9c CCS: skip empty search hits when minimizing round-trips (#40098)
When minimizing round-trips, each cluster returns its own independent
search response. In case sort by field and/or field collapsing were
requested, when one cluster has no results to return, the information
about the field that sorting was based on (SortField array) as well as
the field (and the values) that collapsing was performed on are missing
in the search response. That causes problems as we can't build the
proper `TopDocs` instance which would need to be either `TopFieldDocs`
or `CollapseTopFieldDocs`. The merge routine expects that all the top
docs are of the same exact type which can't be guaranteed. Given that
the problematic results are empty, hence have no impact on the final
results, we can simply skip them.

Relates to #32125
Closes #40067
2019-03-19 14:43:39 +01:00
Luca Cavanna 9c38fa6468 [TEST] Update TransportSearchActionTests#testShouldMinimizeRoundtrips
Relates to #40044

Closes #40051
2019-03-19 14:43:38 +01:00
Luca Cavanna 07bfb4c7f7 CCS: Disable minimizing round-trips when dfs is requested (#40044)
When using DFS_QUERY_THEN_FETCH search type, the dfs phase is run and
its results are used in the query phase to make scoring accurate.
When using CCS, depending on whether the DFS phase runs in the CCS
coordinating node (like if all shards were local) or in each remote
cluster (when minimizing round-trips), scoring will differ.

This commit disables minimizing round-trips whenever DFS is requested,
as it is not currently possible to ensure that scoring is accurate in
that case.

Relates to #32125
2019-03-19 14:43:38 +01:00
Nhat Nguyen 8dc6862b17 Unmute and trace testPendingRefreshWithIntervalChange
Tracked at #39565
2019-03-19 09:07:54 -04:00
Henning Andersen dde41cc2dd Node repurpose tool (#39403)
When a node is repurposed to master/no-data or no-master/no-data, v7.x
will not start (see #37748 and #37347). The `elasticsearch repurpose`
tool can fix this by cleaning up the problematic data.
2019-03-19 11:52:02 +01:00
Dimitris Athanasiou 95f660d577
Mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock test (#39689)
Relates #39688
2019-03-18 15:04:26 -06:00
Henning Andersen 0b214c1bfb Linearizability checker memory reduction (#40149)
The cache used in linearizability checker now uses approximately 6x less
memory by changing the cache from a set of (bits, state) tuples into a
map from bits -> { state }.

Each combination of states is kept once only, building on the
assumption that the number of state permutations is small compared to
the number of bits permutations. For those histories that are difficult
to check we will have many bits combinations that use the same state
permutations.

We end up now using approximately 15 bytes per entry compared to 101
bytes before, ie. a 6x improvement, allowing us to linearizability check
significantly longer histories.

Re-enabled linearizability checker in CoordinatorTests, hoping above
ensures we no longer run out of memory.

Resolves #39437
2019-03-18 21:16:59 +01:00
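The cache layout change described in #40149 above, from a set of (bits, state) tuples to a map from bits to a set of states, can be sketched as follows. Everything here is an assumption for illustration: `LinearizabilityCacheSketch` is a hypothetical name, `java.util.BitSet` and `Object` stand in for the checker's real bit and state types, and the sharing of identical state sets between entries that the commit message mentions is not shown.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LinearizabilityCacheSketch {
    // One entry per distinct bit pattern; the states seen with it are grouped
    // under that single key instead of being stored as separate tuples.
    private final Map<BitSet, Set<Object>> statesByBits = new HashMap<>();

    // Returns true if this (bits, state) pair has not been seen before.
    boolean add(BitSet bits, Object state) {
        Set<Object> states = statesByBits.get(bits);
        if (states == null) {
            states = new HashSet<>();
            // Clone the key so later mutation of the caller's BitSet cannot corrupt the map.
            statesByBits.put((BitSet) bits.clone(), states);
        }
        return states.add(state);
    }

    int distinctBitSets() {
        return statesByBits.size();
    }

    public static void main(String[] args) {
        LinearizabilityCacheSketch cache = new LinearizabilityCacheSketch();
        BitSet bits = new BitSet();
        bits.set(3);
        System.out.println(cache.add(bits, "state-a")); // first time this pair is seen
        System.out.println(cache.add(bits, "state-a")); // same pair again
    }
}
```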
Nhat Nguyen 38e9522218 Remove wait for cluster state step in peer recovery (#40004)
We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped
using it since #25692 (6.0). This change removes that action and related
code in 7.x and 8.0.

Relates #19287
Relates #25692
2019-03-18 15:17:21 -04:00
Nhat Nguyen d720a64b9e Ensure sendBatch not called recursively (#39988)
This PR introduces AsyncRecoveryTarget which executes remote calls of
peer recovery asynchronously. In this change, we also add a new
assertion to ensure that method sendBatch, which sends a batch of
history operations in phase2, is never called recursively on the same
thread. This new assertion will also be used in method sendFileChunks.
2019-03-18 15:17:21 -04:00
Jim Ferenczi eb540125ea
Fix IndexSearcherWrapper visibility (#39071) (#40145)
This change adds a wrapper for IndexSearcher that makes IndexSearcher#search(List, Weight, Collector) visible by
sub-classes. The wrapper is used by the ContextIndexSearcher to call this protected method on a searcher created by a plugin.
This ensures that an override of the protected method in an IndexSearcherWrapper plugin is called when a search is executed.

Closes #30758
2019-03-18 11:33:54 +01:00
Jim Ferenczi 5b73a1bc7d
Add an option to force the numeric type of a field sort (#38095) (#40084)
This change adds an option to the `FieldSortBuilder` that allows to transform the type
of a numeric field into another. Possible values for this option are `long` that transforms
the source field into an integer and `double` that transforms the source field into a floating point.
This new option is useful for cross-index search when the sort field is mapped differently on some
indices. For instance if a field is mapped as a floating point in one index and as an integer in another
it is possible to align the type for both indices using the `numeric_type` option:

```
{
   "sort": {
    "field": "my_field",
    "numeric_type": "double" <1>
   }
}
```

<1> Ensure that values for this field are transformed to a floating point if needed.
2019-03-18 09:32:45 +01:00
Albert Zaharovits 1b75ee0bd7 AuditTrail correctly handle ReplicatedWriteRequest (#39925)
This fix deduplicates index names in `BulkShardRequests` and only audits
the specific resolved index for every comprising `BulkItemRequest`.
2019-03-17 13:05:26 +02:00
Jason Tedor 86d1d03c37
Remove cluster state size (#40109)
This commit removes the cluster state size field from the cluster state
response, and drops the backwards compatibility layer added in 6.7.0 to
continue to support this field. As calculation of this field was
expensive and had dubious value, we have elected to remove this field.
2019-03-15 17:16:25 -04:00
Tim Brooks 0b50a670a4
Remove transport name from tcp channel (#40074)
Currently, we maintain a transport name ("mock-nio", "nio", "netty")
that is passed to a `TcpTransportChannel` when a request is received.
The value of this name is to associate with the task when we register a
task with the task manager. However, it is only possible to run ES with
one transport, so having an implementation specific name is unnecessary.
This commit removes the name and replaces it with the generic
"transport".
2019-03-15 12:04:13 -06:00
Zachary Tong c72feedd74 Do not allow Sampler to allocate more than maxDoc size, better CB accounting (#39381)
The `sampler` agg creates a BestDocsDeferringCollector, which internally
initializes a priority queue of size `shardSize`.  This queue is
populated with empty `Object` sentinels, which is roughly 16b per
object.

Similarly, the Diversified samplers create a DiversifiedTopDocsCollectors
which internally track PQ slots with ScoreDocKeys, weighing in around
28kb

If the user sets a very abusive `shard_size`, this could easily OOM
a node or cluster since these PQ are allocated up-front without
any checks.

This commit makes sure that when we create the collector, it
cannot be larger than the maxDoc so that we don't accidentally blow
up the node.  We ensure the size is not greater than the overall
index maxDoc. A similar treatment is applied to the `maxDocsPerValue`
parameter of the diversified samplers.

For good measure, this also adds in some CB accounting to try and track
memory usage.

Finally, a redundant array creation is removed to reduce a bit of
temporary memory.
2019-03-15 13:19:55 -04:00
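To put numbers on the sampler change in #39381 above, here is a hedged sketch of the clamp and of the up-front allocation it avoids. `SamplerSizeSketch` and its methods are hypothetical, the 16-byte figure is taken from the commit message (reading "16b" as 16 bytes), and the real collector and circuit-breaker accounting are not reproduced.

```java
class SamplerSizeSketch {
    // Rough per-slot cost of an empty sentinel, as quoted in the commit message.
    private static final long BYTES_PER_SENTINEL = 16L;

    // The queue never needs more slots than there are documents in the index.
    static int effectiveQueueSize(int shardSize, int maxDoc) {
        return Math.min(shardSize, maxDoc);
    }

    // Estimated bytes allocated up-front for the sentinels of a queue of this size.
    static long estimatedSentinelBytes(int queueSize) {
        return BYTES_PER_SENTINEL * queueSize;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        int abusiveShardSize = Integer.MAX_VALUE;
        // Unclamped: 16 * 2147483647 bytes, roughly 32 GiB of sentinels.
        System.out.println(estimatedSentinelBytes(abusiveShardSize));
        // Clamped to maxDoc: 16,000,000 bytes.
        System.out.println(estimatedSentinelBytes(effectiveQueueSize(abusiveShardSize, maxDoc)));
    }
}
```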
Yannick Welsch c74111ff8e Reduce logging noise when stepping down as master before state recovery (#39950)
Reduces the logging noise from the state recovery component when there are duelling elections.

Relates to #32006
2019-03-15 17:24:03 +01:00
David Turner 0d152a54f8 Await all pending activity in testConnectAndDisconnect (#40037)
We call `ensureConnections()` to undo the effects of a disruption. However, it
is possible that one or more targets are currently CONNECTING and have been
since the disruption was active, and that the connection attempt was thwarted
by a concurrent disruption to the connection.  If so, we cannot simply add our
listener to the queue because it will be notified when this CONNECTING activity
completes even though it was disrupted. We must therefore wait for all the
current activity to finish and then go through and reconnect to any missing
nodes.

Closes #40030.
2019-03-15 08:08:57 +00:00
David Turner a323132503 Create retention leases file during recovery (#39359)
Today we load the shard history retention leases from disk whenever opening the
engine, and treat a missing file as an empty set of leases. However in some
cases this is inappropriate: we might be restoring from a snapshot (if the
target index already exists then there may be leases on disk) or
force-allocating a stale primary, and in neither case does it make sense to
restore the retention leases from disk.

With this change we write an empty retention leases file during recovery,
except for the following cases:

- During peer recovery the on-disk leases may be accurate and could be needed
  if the recovery target is made into a primary.

- During recovery from an existing store, as long as we are not
  force-allocating a stale primary.

Relates #37165
2019-03-15 07:49:49 +00:00
David Turner 8d2184b315
Fix up committed configuration on fake Zen1 nodes (#40065)
Today we test Zen1/Zen2 compatibility by running 7.x nodes with a "fake" Zen1
implementation. However this is not a truly faithful test because these nodes
do know how to properly deserialize a 7.x cluster state, voting configurations
and all, whereas a real Zen1 node is in 6.7 and ignores the coordination
metadata.

We only ever apply a cluster state that's been committed, which in Zen2
involves setting the last-committed configuration to equal the last-accepted
configuration. Zen1 knows nothing about this adjustment, so it is possible for
these to differ. This breaks the assertion that the cluster states are equal on
all nodes after integration tests.

This commit fixes this by implementing this adjustment in Zen1 before applying
a cluster state.

Fixes #40055.
2019-03-15 07:44:31 +00:00
Ioannis Kakavas 35aaf04c8c Handle empty input in AddStringKeyStoreCommand (#39490)
This change ensures that we do not make assumptions about the length
of the input that we can read from the stdin. It still consumes only
one line, as the previous implementation
2019-03-15 09:38:22 +02:00
Tamara Braun e2b60c7141 Fix not Recognizing Disabled Object Mapper (#39862)
* Fixes not finding disabled object mapper when using dotted field name
notation
* Closes #39456
2019-03-14 10:57:00 -07:00
Ioannis Kakavas 8dc8fc507d Handle UTF-8 values in the keystore (#39496)
* Handle UTF8 values in the keystore

Our current implementation uses CharBuffer#array to get the chars
that were decoded from the UTF-8 bytes. The backing array of
CharBuffer is created in CharsetDecoder#decode and gets an initial
length that is the same as the length of the ByteBuffer it decodes,
hence the number of UTF-8 bytes.
This works fine for the first 128 characters where each one needs
one byte, but for the next UTF-8 characters (other Latin alphabets,
Greek, Cyrillic etc.) where we need 2 to 4 bytes per character, this
backing char array has a larger size than the number of the actual
chars this CharBuffer contains. Calling `array()` on it will return
a char array that can potentially have extra null chars so the
SecureString we get from the KeystoreWrapper, is not the same as the
one we entered.

This commit changes the behavior to use Arrays#copyOfRange to get
the necessary chars from the CharBuffer and adds a test with
random (possibly non-printable) UTF-8 strings.
2019-03-14 18:03:50 +02:00
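The `CharBuffer` behaviour described in #39496 above can be reproduced with the JDK alone. This is a minimal sketch, assuming nothing about the keystore code itself: `KeystoreCharsSketch` and its two methods are hypothetical names, and only `Charset#decode`, `CharBuffer#array` and `Arrays#copyOfRange` are real JDK calls.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeystoreCharsSketch {
    // Problematic variant: returns the whole backing array, which may be longer
    // than the decoded text and padded with trailing '\0' chars.
    static char[] viaBackingArray(byte[] utf8) {
        CharBuffer decoded = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
        return decoded.array();
    }

    // Fixed variant: copies only the chars between the buffer's position and limit.
    static char[] viaCopyOfRange(byte[] utf8) {
        CharBuffer decoded = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
        int start = decoded.arrayOffset() + decoded.position();
        int end = decoded.arrayOffset() + decoded.limit();
        return Arrays.copyOfRange(decoded.array(), start, end);
    }

    public static void main(String[] args) {
        // Three Greek letters, each encoded as two UTF-8 bytes.
        byte[] utf8 = "\u03b1\u03b2\u03b3".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes:         " + utf8.length);
        System.out.println("backing array: " + viaBackingArray(utf8).length);
        System.out.println("copied range:  " + viaCopyOfRange(utf8).length);
    }
}
```

For pure ASCII input the two variants return arrays of the same length, which is why the problem only shows up with multi-byte characters.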
Jason Tedor 9181668edf
Stop returning cluster state size by default (#40016)
Computing the compressed size of the cluster state on every invocation
of cluster:monitor/state action is expensive, and the value of this
field is dubious anyway. Therefore we want to remove computing this
field. As a first step, we stop computing and return this field by
default. To avoid breaking users, we will give them a system property to
use to tide them over until the next major release when we will actually
remove this field. This comes with a deprecation warning too, and the
backport to the appropriate minor will also include a note in the
migration guide. There will be a follow-up to remove this field in the
next major version.
2019-03-14 08:57:55 -04:00
Yogesh Gaikwad 20e5994179
Mute failing tests in NodeConnectionsServiceTests (#40034) (#40035) 2019-03-14 19:40:15 +11:00
Przemyslaw Gomulka 8a314a36db
Change zone formatting for all printers backport(#39568) #39952
After the joda-java time migration we were formatting zone ids with the zoneOrOffsetId method. This meant that when a date was provided with a ZoneRegion, for instance America/Edmonton, it was appending this zone identifier instead of the zone formatted as +HH:MM.
This fix is changing the format of zone suffix for all printers and also always wrapping a Temporal into a ZonedDateTime when formatting.

closes #38471
backport #39568
2019-03-13 18:27:37 +01:00
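The difference between the two zone suffixes discussed in #39568 above can be seen with `java.time` directly. This is a sketch under assumptions: `ZoneSuffixSketch` and the two formatter constants are hypothetical and are not the formatters Elasticsearch defines; they only contrast `appendZoneOrOffsetId()` with `appendOffset("+HH:MM", "Z")` on a region-based zone.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

class ZoneSuffixSketch {
    // Appends the zone as given, so a region prints as its identifier.
    static final DateTimeFormatter WITH_ZONE_ID = new DateTimeFormatterBuilder()
        .append(DateTimeFormatter.ISO_LOCAL_DATE_TIME)
        .appendZoneOrOffsetId()
        .toFormatter(Locale.ROOT);

    // Always appends the numeric offset in +HH:MM form ("Z" for a zero offset).
    static final DateTimeFormatter WITH_OFFSET = new DateTimeFormatterBuilder()
        .append(DateTimeFormatter.ISO_LOCAL_DATE_TIME)
        .appendOffset("+HH:MM", "Z")
        .toFormatter(Locale.ROOT);

    // A winter date, when America/Edmonton is on Mountain Standard Time (UTC-7).
    static ZonedDateTime sample() {
        return ZonedDateTime.of(2019, 1, 15, 12, 0, 0, 0, ZoneId.of("America/Edmonton"));
    }

    public static void main(String[] args) {
        System.out.println(WITH_ZONE_ID.format(sample())); // ends with America/Edmonton
        System.out.println(WITH_OFFSET.format(sample()));  // ends with -07:00
    }
}
```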
Jim Ferenczi 7a7658707a
Upgrade to Lucene release 8.0.0 (#39998)
This commit upgrades to the GA release of Lucene 8

Closes #39640
2019-03-13 18:11:50 +01:00
Tim Brooks 352f9f1f39
Remove sizing from `Recycler#obtain` (#39975)
Currently there is a method `Recycler#obtain(size)` that allows a size
parameter to be passed. However all implementations ignore this
parameter and just allocate a page size based on other settings. This
commit removes this method.
2019-03-13 09:32:31 -06:00
Andrey Ershov 9300826d8a Do not log unsuccessful join attempt each time (#39756)
When performing the test with 57 master-eligible nodes and one node
crash, we saw messy elections, when multiple nodes were attempting to
become master.
JoinHelper has logged 105 long log messages with lengthy stack
traces during one such election.
To address this, we decided to log each of these messages at debug
level only.
If the cluster is failing to form, we will log the last unsuccessful
join attempt, if any, at WARN level (along with a timestamp).

(cherry picked from commit 17a148cc27b5ac6c2e04ef5ae344da05a8a90902)
2019-03-13 13:30:31 +01:00
Christoph Büscher b10dd3769c Add analysis modes to restrict token filter use contexts (#36103)
Currently token filter settings are treated as fixed once they are declared and
used in an analyzer. This is done to prevent changes in analyzers that are already
used actively to index documents, since changes to the analysis chain could
corrupt the index. However, it would be safe to allow updates to token
filters at search time ("search_analyzer"). This change introduces a new
property of token filters that allows to mark them as only being usable at search
or at index time. Any analyzer that uses these tokenfilters inherits that property
and can be rejected if they are used in other contexts. This is a first step towards
making specific token filters (e.g. synonym filter) updateable.

Relates to #29051
2019-03-12 23:48:55 +01:00
Andy Bristol e2b88bc706 add version 6.6.3 2019-03-12 13:21:36 -07:00
David Turner 049970af3e Only connect to new nodes on new cluster state (#39629)
Today, when applying new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves #29025.
2019-03-12 19:26:13 +00:00
Przemyslaw Gomulka a29bba4ede
Migrate Streamable to writeable for index package backport(#37381) #39949
Migrate streamable classes from index package to Writeable and clean up access modifiers

Related to #34389
backport#37381
2019-03-12 12:10:36 +01:00
lzh3636 ad55e5b80d Log missing file exception when failing to read metadata snapshot (#32920)
Adds the exception to the logged output, which contains info about the file that's missing.
2019-03-12 10:41:44 +01:00
Nhat Nguyen ce5f09ab04 Enforce retention leases require soft deletes (#39922)
If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes
(retention leases background sync interval), the retention leases
background sync will be triggered, and it will trip 6.7 node due to the
illegal checkpoint value. We can fix the problem by making the returned
checkpoint depend on the node version. This PR, however, chooses to
enforce retention leases require soft deletes, and make retention leases
sync noop if soft deletes is disabled instead.

Closes #39914
2019-03-11 22:37:47 -04:00
Nhat Nguyen bf814357ad Enable soft deletes in RetentionLeaseIT
Relates #39922
2019-03-11 22:37:42 -04:00
Armin Braun 9eb4614fa6
More Verbose Assertion in testSnapshotWithStuckNode (#39893) (#39928)
* The test failure in #39852 is caused by a file in the initial repository when there should not be any
  * It seems that on a normal consistent file system no left-over file should exist ever here after the validation finishes and I can't reproduce or see any other path to a dangling file in the fresh repository
=> added a more verbose and strict assertion that will log what file is left over next time
* Relates #39852
2019-03-11 19:27:08 +01:00
Jake Landis b0b0f66669
Remove types from internal monitoring templates and bump to api 7 (#39888) (#39926)
This commit removes the "doc" type from monitoring internal indexes.
The template still carries the "_doc" type since that is needed for
the internal representation.

This change impacts the following templates:
monitoring-alerts.json
monitoring-beats.json
monitoring-es.json
monitoring-kibana.json
monitoring-logstash.json

As part of the required changes, the system_api_version has been
bumped from "6" to "7" and support for version "2" has been dropped.

A new empty pipeline is now introduced for the version "7", and
the formerly empty "6" pipeline will now remove the type and re-direct
the request to the "7" index.

Additionally, due to a difference in the internal representation
(which requires the inclusion of "_doc" type) and external representation
(which requires the exclusion of any type) a helper method is introduced
to help convert internal to external representation, and used by the
monitoring HTTP template exporter.

Relates #38637
2019-03-11 13:17:27 -05:00
Yannick Welsch 4f941c6963 Do not swallow exceptions in TimedRunnable (#39856)
Executors of type fixed_auto_queue_size (i.e. search / search_throttled) wrap runnables into
TimedRunnable, which is an AbstractRunnable. This is dangerous as it might silently swallow
exceptions, and possibly miss calling a response listener. While this has not triggered any failures in
the tests I have run so far, it might help uncover future problems.

Follow-up to #36137
2019-03-11 19:03:12 +01:00
Yannick Welsch 292eb8b001 Fix CoordinatorTests.testIncompatibleDiffResendsFullState (#39345)
This test started failing since decreasing the leader and follower check timeouts (#38298). The
reason is that the test was relying on the default publication timeout to come into effect before
leader / follower check timeouts, which is now not always true anymore.

Closes #38867
2019-03-11 19:03:10 +01:00
Tim Brooks dd77899278
Log send failure at debug level if channel closed (#39807)
Currently we log exceptions due to channel close at the debug level in
the normal exception handler. Currently we log all send failures due to
channel close at the warn level. This commit changes that to only log at
warn if the send failure is not due to channel closed. Additionally, it
adds the ssl engine closed as a channel close exception.
2019-03-11 10:33:02 -06:00
Yannick Welsch b7be724e50 Check term earlier in publication process (#39909)
in order to avoid tripping assertPreviousStateConsistency.

Closes #39314
2019-03-11 15:40:20 +01:00
David Turner 6e4f304f88 Synchronize pendingOutgoingJoins (#39900)
Today we use a ConcurrentHashSet to track the in-flight outgoing joins in the
`JoinHelper`. This is fine for adding and removing elements but not for the
emptiness test in `isJoinPending()` which might return false if one join
finishes just after another one starts, even though joins were pending
throughout.

As used today this is ok: it means the node was trying to join a master but
this join attempt just finished unsuccessfully, and causes it to (rightfully)
reject a `FollowerCheck` from the failed master. However this kind of API
inconsistency is trappy and there is no need to be clever here, so this change
replaces the set with a `synchronizedSet()`.
2019-03-11 12:13:21 +00:00
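The replacement described in #39900 above amounts to guarding every operation on the set, including the emptiness test, with one lock. A minimal sketch follows; `PendingJoinsSketch`, its method names other than `isJoinPending`, and the use of `String` for the join target are hypothetical and do not mirror the real `JoinHelper`.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class PendingJoinsSketch {
    // add, remove and isEmpty all synchronize on the same wrapper object,
    // so each call observes a consistent view of the set.
    private final Set<String> pendingOutgoingJoins = Collections.synchronizedSet(new HashSet<>());

    void joinStarted(String target) {
        pendingOutgoingJoins.add(target);
    }

    void joinFinished(String target) {
        pendingOutgoingJoins.remove(target);
    }

    boolean isJoinPending() {
        return pendingOutgoingJoins.isEmpty() == false;
    }

    public static void main(String[] args) {
        PendingJoinsSketch joins = new PendingJoinsSketch();
        joins.joinStarted("node-a");
        joins.joinStarted("node-b");
        joins.joinFinished("node-a");
        System.out.println(joins.isJoinPending()); // node-b is still in flight
    }
}
```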
Ankit Jain 471aa6a16a Fixing 503 Service Unavailable errors during fetch phase (#39086)
When ESRejectedExecutionException gets thrown on the coordinating node while trying to fetch hits, the resulting exception will hold no shard failures, hence `503` is used as the response status code. In that case, `429` should be returned instead. Also, the status code should be taken from the cause if available whenever there are no shard failures instead of blindly returning `503` like we currently do.

Closes #38586
2019-03-11 10:13:55 +01:00
Adrien Grand b841de2e38
Don't emit deprecation warnings on calls to the monitoring bulk API. (#39805) (#39838)
The monitoring bulk API accepts the same format as the bulk API, yet its concept
of types is different from "mapping types" and the deprecation warning is only
emitted as a side-effect of this API reusing the parsing logic of bulk requests.

This commit extracts the parsing logic from `_bulk` into its own class with a
new flag that allows to configure whether usage of `_type` should emit a warning
or not. Support for payloads has been removed for simplicity since they were
unused.

@jakelandis has a separate change that removes this notion of type from the
monitoring bulk API that we are considering bringing to 8.0.
2019-03-11 07:58:28 +01:00
Adrien Grand 2bbef67770
Propagate exceptions in o.e.common.io.Streams. (#39042) (#39848)
This commit propagates some exceptions that were previously swallowed and also
makes sure that exceptions closing streams are either propagated if the try
block succeeded or added as suppressed exceptions otherwise.
2019-03-11 07:58:01 +01:00
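The close-handling rule stated in #39042 above (propagate a close failure if the body succeeded, otherwise attach it as a suppressed exception) is the same contract that the JDK's try-with-resources statement provides. The sketch below uses that construct to show the behaviour; `StreamCopySketch` is a hypothetical name and this is not the code in `o.e.common.io.Streams`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopySketch {
    // Copies in to out and closes both. If the copy loop throws, exceptions from
    // close() are added to it as suppressed; if the loop succeeds, an exception
    // from close() propagates to the caller.
    static long copy(InputStream in, OutputStream out) throws IOException {
        try (InputStream source = in; OutputStream sink = out) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = source.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
                total += read;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(new byte[] {1, 2, 3}), out);
        System.out.println("copied bytes: " + copied);
    }
}
```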
Benjamin Trent 4da04616c9
[ML] refactoring lazy query and agg parsing (#39776) (#39881)
* [ML] refactoring lazy query and agg parsing

* Clean up and addressing PR comments

* removing unnecessary try/catch block

* removing bad call to logger

* removing unused import

* fixing bwc test failure due to serialization and config migrator test

* fixing style issues

* Adjusting DatafeedUpdate class serialization

* Adding todo for refactor in v8

* Making query non-optional so it does not write a boolean byte
2019-03-10 14:54:02 -05:00
Julie Tibshirani 8454cfc1b2 Move validation from FieldTypeLookup to MapperMergeValidator. (#39814)
This commit consolidates more mapping validation logic into the same class.
`FieldTypeLookup` is now a bit simpler, and has the sole responsibility of quickly
resolving field names to their types.

I have a broader refactor planned around mapping merge validation, but this
change should at least be a step in the right direction.
2019-03-08 18:05:21 -08:00
Nhat Nguyen 993182e426 Combine overriddenOps and skippedOps in translog (#39771)
These two stats are not important enough to be distinguishable.
This change combines them into a single stat.

Closes #33317
2019-03-08 16:28:50 -05:00
Julie Tibshirani be9c37fc76 Small simplifications to mapping validation. (#39777)
These simplifications to `MapperMergeValidator` are possible now that there is
always a single mapping definition.

* Remove the type argument in `validateMapperStructure`.
* Remove unnecessary checks against existing mappers.
2019-03-08 12:34:09 -08:00
Nhat Nguyen a0a91f74ff Treat TransportService stopped error as node is closing (#39800)
If TransportService is stopped before a shard-failure request is sent
but after the request is registered, TransportService will notify
ReplicationOperation of a TransportException with the error message:
"transport stop, action: internal:cluster/shard/failure".

Relates #39584
2019-03-08 15:15:56 -05:00
Ryan Ernst 465343f12a
Bundle java in distributions (#38013)
* Bundle java in distributions

Setting up a jdk is currently a required external step when installing
elasticsearch. This is particularly problematic for the rpm/deb packages
as installing a jdk in the same package installation command does not
guarantee any order, so must be done in separate steps. Additionally,
JAVA_HOME must be set and often causes problems in selecting a correct
jdk when, for example, the system java is an older unsupported version.

This commit bundles platform specific openjdks into each distribution.
In addition to eliminating the issues above, it also presents future
possible improvements like using jlink to build jdk images only
containing modules that elasticsearch uses.

closes #31845
2019-03-08 11:04:18 -08:00
Gordon Brown e6b9262a31 Mute testOpenCloseApiWildcards (#39578) (#39579) 2019-03-08 15:18:16 +00:00
David Roberts aec2db78ea Mute RareClusterStateIT.testDelayedMappingPropagationOnReplica
Due to https://github.com/elastic/elasticsearch/issues/36813
2019-03-08 13:28:27 +00:00
David Roberts 366eef99a1 Mute SharedClusterSnapshotRestoreIT.testCloseOrDeleteIndexDuringSnapshot
Due to https://github.com/elastic/elasticsearch/issues/39828
2019-03-08 11:42:13 +00:00
David Turner 5d68143b18 Reformat elasticsearch-node messages (#39811)
Flows the warning messages emitted by the `elasticsearch-node` tool to a width
of 72 characters and tweaks the wording slightly.
2019-03-08 10:01:29 +00:00
Jake Landis 797d6b8a66
Execute ingest node pipeline before creating the index (#39607) (#39796)
Prior to this commit (and after 6.5.0), if an ingest node changes
the _index in a pipeline, the original target index would be created.
For daily indexes this could create an extra, empty index per day.

This commit changes the TransportBulkAction to execute the ingest node
pipeline before attempting to create the index. This ensures that the 
only index created is the original or one set by the ingest node pipeline. 
This was the execution order prior to 6.5.0 (#32786). 

The execution order was changed in 6.5 to better support default pipelines. 
Specifically the execution order was changed to be able to read the settings
from the index meta data. This commit also includes a change in logic such 
that if the target index does not exist when ingest node pipeline runs, it 
will now pull the default pipeline (if one exists) from the settings of the 
best matched of the index template. 

Relates #32786
Relates #32758 
Closes #36545
2019-03-07 13:31:41 -06:00
Jason Tedor 0250d554b6
Introduce forget follower API (#39718)
This commit introduces the forget follower API. This API is needed in cases where
unfollowing a following index fails to remove the shard history retention leases
on the leader index. This can happen explicitly through user action, or
implicitly through an index managed by ILM. When this occurs, history will be
retained longer than necessary. While the retention lease will eventually
expire, it can be expensive to allow history to persist for that long, and also
prevent ILM from performing actions like shrink on the leader index. As such, we
introduce an API to allow for manual removal of the shard history retention
leases in this case.
2019-03-07 11:08:45 -05:00
Armin Braun 213cc6673c
Remove Dead Code in o.e.util package (#39717) (#39779)
* None of this code is used so we should delete it, we can always bring it back if needed
2019-03-07 08:31:46 +01:00
Nhat Nguyen b69affda6a Use unwrapped cause to determine if node is closing (#39723)
We need to unwrap and use the actual cause when determining if the node
with primary shard is shutting down because TransportService will throw
a TransportException wrapped in a SendRequestTransportException.
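
Unwrapping amounts to walking the cause chain until an exception of the wanted type is found. A minimal sketch, assuming a hypothetical `Causes.unwrap` helper rather than the utility Elasticsearch actually uses:

```java
// Hypothetical helper; the name and the depth bound are illustrative assumptions.
final class Causes {
    /** Returns the first throwable of the given type in the cause chain, or null. */
    static <T extends Throwable> T unwrap(Throwable t, Class<T> type) {
        int depth = 0;
        while (t != null && depth++ < 10) { // the bound guards against cyclic cause chains
            if (type.isInstance(t)) {
                return type.cast(t);
            }
            t = t.getCause();
        }
        return null;
    }
}
```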

Relates #39584
2019-03-06 15:30:55 -05:00
Nhat Nguyen 1fe7cb594f Don’t ack if unable to remove failing replica (#39584)
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.

Closes #39467
2019-03-06 15:30:55 -05:00
markharwood 1873de5240
Bug fix for AnnotatedTextHighlighter - port of 39525 (#39749)
Bug fix for AnnotatedTextHighlighter - port of 39525

Relates to #39395
2019-03-06 19:02:04 +00:00
Yannick Welsch d094107592 Fix SharedClusterSnapshotRestoreIT
Relates to #39644
2019-03-06 17:51:23 +01:00
Yannick Welsch fef11f7efc Allow snapshotting replicated closed indices (#39644)
This adds the capability to snapshot replicated closed indices.

It also changes snapshot requests in v8.0.0 to automatically expand wildcards to closed indices and hence start snapshotting closed indices by default. For v7.1.0 and above, wildcards are by default only expanded to open indices, which can be changed by explicitly setting the expand_wildcards option either to all or closed.

Note that indices are always restored as open indices, even if they have been snapshotted as closed replicated indices.

Relates to #33888
2019-03-06 16:08:20 +01:00
Simon Willnauer e620fb2e4a Add option to force load term dict into memory (#39741)
Lucene added an optimization to leave the term dictionary on disk
for non-id like fields. This change happened very late in the release
process, so it's better to have an escape hatch if certain
use-cases are hurt by this optimization. This setting might be
removed in the future if it turns out to be unnecessary.
2019-03-06 15:29:04 +01:00
Christoph Büscher 6c503824c8 Fix occasional SearchServiceTests failure (#39697)
Currently SearchServiceTests.testCloseSearchContextOnRewriteException can fail
if a refresh happens while we test for the SearchPhaseExecutionException that is
thrown later in the test. The test takes the current Store#refCount and expects
it to be the same after the exception is thrown. If a refresh happens in that
interval however, the refCount will be different, causing the test to fail. This
can be provoked e.g. by running this section in a tight loop.
Switching off refresh for this test solves the issue.
2019-03-06 14:18:03 +01:00
Andrey Ershov 52fd102e23 Avoid serialising state if it was already serialised (#39179)
When preparing the state to send to other nodes, we're serializing it
for each node, despite using putIfAbsent.
This commit checks if the state was already serialized for this node
version before performing the potentially expensive computation.
The map is not used by multiple threads, so computeIfAbsent is not
needed (and could not be used here easily, because IOException could
be thrown).
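
The check-before-compute pattern can be sketched like this. `SerializedStateCache` and its `Serializer` interface are hypothetical names; the point is only that an explicit lookup avoids the work, and that a serializer throwing a checked `IOException` does not fit `computeIfAbsent`.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded cache keyed by node version.
final class SerializedStateCache {
    interface Serializer {
        byte[] serialize(String version) throws IOException;
    }

    private final Map<String, byte[]> byVersion = new HashMap<>();

    byte[] get(String version, Serializer serializer) throws IOException {
        byte[] bytes = byVersion.get(version);
        if (bytes == null) {
            // Serialize only on a miss; putIfAbsent alone would still pay for
            // the serialization on every call because its argument is evaluated first.
            bytes = serializer.serialize(version);
            byVersion.put(version, bytes);
        }
        return bytes;
    }
}
```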

(cherry picked from commit c99be63b43f5250f3cd220130df73c5e9e097459)
2019-03-06 11:54:13 +01:00
David Turner 295e39a8c8 Drop node if asymmetrically partitioned from master (#39598)
When a node is joining the cluster we ensure that it can send requests to the
master _at that time_. If it joins the cluster and _then_ loses the ability to
send requests to the master then it should be removed from the cluster. Today
this is not the case: the master can still receive responses to its follower
checks, and receives acknowledgements to cluster state publications, so has no
reason to remove the node.

This commit changes the handling of follower checks so that they fail if they
come from a master that the other node was following but which it now believes
to have failed.
2019-03-06 09:41:57 +00:00
David Turner 77dd711847 Tidy up GroupedActionListener (#39633)
Today the `GroupedActionListener` accepts a `defaults` parameter but all
callers pass an empty list. Also it is permitted to pass an empty group but
this is trappy because the delegated listener is never called in that case.
This commit removes the `defaults` parameter and forbids an empty group.
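
To see why an empty group is a trap, consider a simplified grouped listener that notifies its delegate once the expected number of responses has arrived. The class below is a hypothetical sketch, not `GroupedActionListener` itself: with a group size of zero the count is never reached, so the constructor rejects it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a grouped listener; not the Elasticsearch class.
final class GroupedListener<T> {
    private final Consumer<List<T>> delegate;
    private final int groupSize;
    private final List<T> results = new ArrayList<>();

    GroupedListener(Consumer<List<T>> delegate, int groupSize) {
        if (groupSize <= 0) {
            // With zero expected responses the delegate could never be notified.
            throw new IllegalArgumentException("groupSize must be greater than 0 but was " + groupSize);
        }
        this.delegate = delegate;
        this.groupSize = groupSize;
    }

    synchronized void onResponse(T response) {
        results.add(response);
        if (results.size() == groupSize) {
            delegate.accept(new ArrayList<>(results));
        }
    }
}
```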
2019-03-06 09:25:10 +00:00
Armin Braun aaecaf59a4
Optimize Bulk Message Parsing and Message Length Parsing (#39634) (#39730)
* Optimize Bulk Message Parsing and Message Length Parsing

* findNextMarker took almost 1ms per invocation during the PMC rally track
  * Fixed to be about an order of magnitude faster by using Netty's bulk `ByteBuf` search
* It is unnecessary to instantiate an object (the input stream wrapper) and throw it away, just to read the `int` length from the message bytes
  * Fixed by adding bulk `int` read to BytesReference
2019-03-06 08:13:15 +01:00
Jason Tedor 75a0d4f470
Rename retention lease setting (#39719)
This commit renames the retention lease setting
index.soft_deletes.retention.lease so that it is under the namespace
index.soft_deletes.retention_lease. As such, we rename the setting to
index.soft_deletes.retention_lease.period.
2019-03-05 22:04:45 -05:00
Jason Tedor 504c792861
Add Docker build type (#39378)
This commit adds a new build type (together with deb/rpm/tar/zip) to
represent the official Docker images. This build type will be displayed
in APIs such as the main and nodes info APIs.
2019-03-05 22:03:15 -05:00
Luca Cavanna 9d0211485c Tie-break completion suggestions with same score and surface form (#39564)
In case multiple completion suggestion entries have the same score and
surface form, the order in which such options will be returned is
currently not deterministic.

With this commit we introduce tie-breaking for such situations, based
on shard id, index name, index uuid and doc id like we already do for
ordinary search hits. With this change we also make shardIndex
mandatory when sorting and comparing completion suggestion options
(which was previously only needed later when fetching hits).

Also, we need to make sure shardIndex is properly set when merging
completion suggestions coming from multiple clusters in
`SearchResponseMerger`
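
A deterministic order of this kind can be sketched with a comparator that falls through to stable identifiers when score and surface form are equal. The `Option` type is hypothetical, and the tie-break is reduced to `shardIndex` and `docId` for brevity; the commit also mentions index name and index uuid.

```java
import java.util.Comparator;

// Hypothetical sketch of tie-breaking; field names are assumptions.
final class SuggestionOrder {
    static final class Option {
        final float score;
        final String surfaceForm;
        final int shardIndex;
        final int docId;

        Option(float score, String surfaceForm, int shardIndex, int docId) {
            this.score = score;
            this.surfaceForm = surfaceForm;
            this.shardIndex = shardIndex;
            this.docId = docId;
        }
    }

    static final Comparator<Option> COMPARATOR = (a, b) -> {
        int c = Float.compare(b.score, a.score); // higher score first
        if (c != 0) return c;
        c = a.surfaceForm.compareTo(b.surfaceForm);
        if (c != 0) return c;
        c = Integer.compare(a.shardIndex, b.shardIndex); // tie-break: shard
        if (c != 0) return c;
        return Integer.compare(a.docId, b.docId); // tie-break: document
    };
}
```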
2019-03-05 18:03:54 +01:00
Jim Ferenczi 160dc29f0e Handle total hits equal to track_total_hits (#37907)
This change ensures that a total hits equal to the value set for
track_total_hits is not considered as a lower bound.
2019-03-05 16:28:48 +01:00
Armin Braun 750ec8ba53
Minor Cleanups in QueryPhase (#39680) (#39694)
* Soften redundant cast to allow use of `DeterministicTaskQueue` in this class for #39504
* Remove two redundant variables and lower visibility in two possible spots
* Make field `final`
2019-03-05 15:04:16 +01:00
Christoph Büscher 5cdea6ef17 Fix Fuzziness#asDistance(String) (#39643)
Currently Fuzziness#asDistance(String) doesn't work for custom AUTO values. If
the fuzziness is AUTO, the method returns the correct edit distance to use,
depending on the input string, but for custom AUTO values it currently always
returns an edit distance of 1. Correcting this and adding unit and integration
tests to catch these cases.
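
For illustration, the length-based rule can be sketched as below. The thresholds follow the documented `AUTO` behaviour (default low 3, high 6) and are an assumption here, not taken from this commit; `AutoFuzziness` is a hypothetical class, not `Fuzziness` itself.

```java
// Hypothetical sketch of AUTO and AUTO:low,high edit distances.
final class AutoFuzziness {
    /** Custom AUTO:low,high: shorter than low -> 0 edits, shorter than high -> 1, else 2. */
    static int asDistance(String text, int low, int high) {
        int length = text.codePointCount(0, text.length());
        if (length < low) {
            return 0;
        }
        if (length < high) {
            return 1;
        }
        return 2;
    }

    /** Plain AUTO, assuming the documented defaults of 3 and 6. */
    static int asDistance(String text) {
        return asDistance(text, 3, 6);
    }
}
```

With custom bounds such as `AUTO:5,10`, a four-character term yields 0 rather than the constant 1 that the bug produced.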

Closes #39614
2019-03-05 14:31:07 +01:00
Simon Willnauer 19f6a35358 Move BWC Version to 7.1.0 after backport
Relates to #39512
2019-03-05 14:11:59 +01:00
Simon Willnauer d112c89041 Allow inclusion of unloaded segments in stats (#39512)
Today we have no chance to fetch actual segment stats for segments that
are currently unloaded. This is relevant in the case of frozen indices.
This change allows monitoring how much memory a frozen index would use if it was
unfrozen.
2019-03-05 14:02:20 +01:00
Armin Braun e8d9744340
Use Threadpool Time in ClusterApplierService (#39679) (#39685)
* Use threadpool's time in `ClusterApplierService` to allow for deterministic tests
* This is a part of/requirement for #39504
2019-03-05 12:37:49 +01:00
Gordon Brown 380dc27d91 Mute testCloseWhileRelocatingShards (#39589) 2019-03-05 13:34:43 +02:00
Alan Woodward 0b14782b23 Add stopword support to IntervalBuilder (#39637)
The match interval builder analyses input text and converts it to an IntervalSource, and as such
may generate token streams with stopwords. This commit deals with these by using the extend
factory to cover the gaps produced by these stopwords so that phrase and ordered queries work
correctly.
2019-03-05 10:50:45 +00:00
Christoph Büscher 2fe1fa8972
Shortcut counts on exists queries (#39570) (#39660)
`TopDocsCollectorContext` can already shortcut hit counts on `match_all` and `term` queries when there are no deletions. 
This change adds this ability for `exists` queries if the index doesn't have deletions and fields are indexed.

Closes #37475
2019-03-04 19:53:43 +01:00
Prabhakar S 98925e9a09 Fixing the custom object serialization bug in diffable utils. (#39544)
While serializing custom objects, the length of the list is computed after
filtering out the unsupported objects but while writing objects the filter
is not applied thus resulting in writing unsupported objects which will fail
to deserialize by the receiver. Adding the condition to filter out unsupported
custom objects.
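
The invariant is that the written length and the written elements come from the same filtered collection. A minimal sketch, using strings in place of custom objects and a hypothetical `FilteredWriter`:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch; strings stand in for the custom objects.
final class FilteredWriter {
    static byte[] write(List<String> customs, Predicate<String> supported) throws IOException {
        // Filter once, then use the same list for both the count and the payload.
        List<String> toWrite = customs.stream().filter(supported).collect(Collectors.toList());
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(toWrite.size());
            for (String custom : toWrite) {
                out.writeUTF(custom);
            }
        }
        return bytes.toByteArray();
    }
}
```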
2019-03-04 18:41:14 +01:00
Nhat Nguyen 801f13f201 Assert recovery done in testDoNotWaitForPendingSeqNo (#39595)
Since #39006 we should be able to complete a peer-recovery without
waiting for pending indexing operations. Thus, the assertion in
testDoNotWaitForPendingSeqNo should be updated from false to true.

Closes #39510
2019-03-04 10:21:23 -05:00
Yannick Welsch 936dbb00e3
Isolate Zen1 (#39470)
Cherry-picks a few commits from #39466 to align 7.x with master branch.
2019-03-04 15:51:17 +01:00
Luca Cavanna 9ddaabba88 Remove private SearchHits.Total class (#39556)
This is now possible as Lucene's `TotalHits` implements `equals`/`hashcode`,
all the other methods can be in-lined in `SearchHits` instead, no need for
a specific wrapper class.
2019-03-04 13:46:45 +01:00
Armin Braun 547af21a12
Introduce Mapping ActionListener (#39538) (#39636)
* Introduce Safer Chaining of Listeners

* The motivation here is to make reasoning about chains of `ActionListener` a little easier, by providing a safe method for nesting `ActionListener` that guarantees that a response is never dropped. Also, it dries up the code a little by removing the need to repeat `listener::onFailure` and `listener.onResponse` over and over.
* Refactored a number of obvious/easy spots to use the new listener constructor
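
The idea of a mapping listener can be sketched as follows. `Listener` is a hypothetical two-method interface standing in for `ActionListener`; the sketch shows how a failure in the mapping function is routed to `onFailure`, so the delegate always hears back.

```java
import java.util.function.Function;

// Hypothetical stand-in for ActionListener; names are illustrative.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);

    /** Wraps a delegate so responses are transformed before being passed on. */
    static <I, O> Listener<I> map(Listener<O> delegate, Function<I, O> fn) {
        return new Listener<I>() {
            @Override
            public void onResponse(I response) {
                O mapped;
                try {
                    mapped = fn.apply(response);
                } catch (Exception e) {
                    delegate.onFailure(e); // a failed mapping is still reported
                    return;
                }
                delegate.onResponse(mapped);
            }

            @Override
            public void onFailure(Exception e) {
                delegate.onFailure(e);
            }
        };
    }
}
```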
2019-03-04 12:56:46 +01:00
Daniel Mitterdorfer fca6a2f006
Avoid deprecated API usage in TaskOperationFailure (#39303) (#39628)
With this commit we remove usage of the deprecated method
`ExceptionsHelper#detailedMessage` in the class `TaskOperationFailure`.

Relates #19069
2019-03-04 11:37:59 +01:00
David Turner dd68244841
Wait for state recovery in testFreshestMasterElectedAfterFullClusterRestart (#39602)
Zen1IT#testFreshestMasterElectedAfterFullClusterRestart fails sometimes because
we request the cluster state before state recovery has completed, and therefore
obtain the default value for the setting we're relying on.

Confusingly, we were starting out by setting this setting to its default value,
so the test looked like it was failing because of a production bug. This commit
avoids this confusion in future by setting it to a non-default value at the
start of the test.

Fixes #39586.
2019-03-04 10:26:07 +00:00
Adrien Grand 782f873165
Don't swallow exceptions in Store#close(). (#39035) (#39622)
Store#close() swallows any `IOException`.

Relates #39030
2019-03-04 10:58:43 +01:00
Adrien Grand 934946a232
Don't swallow exception in ThreadPool.terminate. (#39038) (#39623)
The use of `closeWhileHandlingException` means that any exception while trying
to close the threadpool is going to be swallowed.

Relates #39030
2019-03-04 10:58:29 +01:00
Adrien Grand 21540a5ada
Enhancements to IndicesQueryCache. (#39099) (#39626)
This commit adds the following:
 - more tests to IndicesServiceCloseTests, one of them found a bug in the order
   in which `IndicesQueryCache#onClose` and
   `IndicesService.indicesRefCount#decRef` are called.
 - made `IndicesQueryCache.stats2` a synchronized map. All writes to it are
   already protected by the lock of the Lucene cache, but the final read from
   an assertion in `IndicesQueryCache#close()` was not so this change should
   avoid any potential visibility issues.
 - human-readable `toString`s to make debugging easier.

Relates #37117
2019-03-04 10:58:12 +01:00
Armin Braun 68bc178017
Disable Bwc Tests (#39551)
* Disable Bwc Tests
* For #39550
2019-03-04 10:41:52 +01:00
Yannick Welsch 0f65390c29 Do not mutate engine during planning step (#39571)
This cleans up the Engine implementation by separating the sequence number generation from the
planning step in the engine, to avoid for the planning step to have any side effects. This makes it
easier to see that every sequence number is properly accounted for.
2019-03-04 10:11:39 +01:00
David Turner 9ec24bae80 Mute testDoNotWaitForPendingSeqNo
Relates #39510, #39595.
2019-03-03 22:03:53 -05:00
Mayya Sharipova d0e65a45a2 Add debug log for flush for IndicesRequestCacheIT (#39475)
Add debug log when index is flushed to investigate a failure
in IndicesRequestCacheIT

"DEBUG" level is used as "TRACE" produces too much output irrelevant for this
issue

Relates to #32827
2019-03-01 13:12:45 -05:00
Luca Cavanna 29e3c18713 Mute failing IndexShardIT#testPendingRefreshWithIntervalChange
Relates to #39565
2019-03-01 14:55:19 +01:00
Tanguy Leroux e005eeb0b3
Backport support for replicating closed indices to 7.x (#39506)(#39499)
Backport support for replicating closed indices (#39499)
    
    Before this change, closed indexes were simply not replicated. It was therefore
    possible to close an index and then decommission a data node without knowing
    that this data node contained shards of the closed index, potentially leading to
    data loss. Shards of closed indices were not completely taken into account when
    balancing the shards within the cluster, or automatically replicated through shard
    copies, and they were not easily movable from node A to node B using APIs like
    Cluster Reroute without being fully reopened and closed again.
    
    This commit changes the logic executed when closing an index, so that its shards
    are not just removed and forgotten but are instead reinitialized and reallocated on
    data nodes using an engine implementation which does not allow searching or
    indexing, which has a low memory overhead (compared with searchable/indexable
    opened shards) and which allows shards to be recovered from peer or promoted
    as primaries when needed.
    
    This new closing logic is built on top of the new Close Index API introduced in
    6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before
    closing them, and closing an index on an 8.0 cluster will reinitialize the index shards
    and therefore impact the cluster health.
    
    Some APIs have been adapted to make them work with closed indices:
    - Cluster Health API
    - Cluster Reroute API
    - Cluster Allocation Explain API
    - Recovery API
    - Cat Indices
    - Cat Shards
    - Cat Health
    - Cat Recovery
    
    This commit contains all the following changes (most recent first):
    * c6c42a1 Adapt NoOpEngineTests after #39006
    * 3f9993d Wait for shards to be active after closing indices (#38854)
    * 5e7a428 Adapt the Cluster Health API to closed indices (#39364)
    * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767)
    * 71f5c34 Recover closed indices after a full cluster restart (#39249)
    * 4db7fd9 Adapt the Recovery API for closed indices (#38421)
    * 4fd1bb2 Adapt more tests suites to closed indices (#39186)
    * 0519016 Add replica to primary promotion test for closed indices (#39110)
    * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631)
    * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955)
    * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex()
    * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329)
    * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327)
    * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326)
    * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024)
    * e53a9be Fix compilation error in IndexShardIT after merge with master
    * cae4155 Relax NoOpEngine constraints (#37413)
    * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes
    * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903)
    
    Relates to #33888
2019-03-01 14:48:26 +01:00
Yannick Welsch 1a50af7dd4 Do not close bad indices on startup (#39500)
With #17187, we verified IndexService creation during initial state recovery on the master and if the
recovery failed the index was imported as closed, not allocating any shards. This was mainly done to
prevent endless allocation loops and full log files on data-nodes when the indexmetadata contained
broken settings / analyzers. Zen2 loads the cluster state eagerly, and this check currently runs on all
nodes (not only the elected master), which can significantly slow down startup on data nodes.
Furthermore, with replicated closed indices (#33888) on the horizon, importing the index as closed
will no longer prevent shards from being allocated. Fortunately, the original issue for endless allocation loops is
no longer a problem due to #18467, where we limit the retries of failed allocations. The solution here
is therefore to just undo #17187, as it's no longer necessary, and covered by #18467, which will solve
the issue for Zen2 and replicated closed indices as well.
2019-03-01 09:23:46 +01:00
Tal Levy b9b46fdec6
fix UpdateSettingsRequestStreamableTests.mutateInstance (#39386) (#39477)
Mutations of the timeout values were using string-representations.

This resulted in very rare cases where the original timeout value was
represented as something like "0ms" and the new random time-value generated
was "0s". Although their string representations differ, their underlying
TimeValue does not. This resulted in `-Dtests.seed=7F4C034C43C22B1B` to
fail.
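
The pitfall can be reproduced with `java.time.Duration`, used here as a stand-in for `TimeValue`: zero milliseconds and zero seconds render differently but are equal as values. The `TimeoutMutation` helper below is hypothetical and compares values rather than strings.

```java
import java.time.Duration;

// Hypothetical sketch; Duration stands in for Elasticsearch's TimeValue.
final class TimeoutMutation {
    /** Returns a timeout guaranteed to differ in value from the original. */
    static Duration mutate(Duration original, Duration candidate) {
        // Compare values, not rendered strings: "0ms" and "0s" are the same duration.
        return candidate.equals(original) ? original.plusMillis(1) : candidate;
    }
}
```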
2019-02-28 21:02:32 -08:00
Mark Tozzi 609118c229 Override and mute InternalAutoDateHistogramTests#testReduceRandom() (#39536)
pending resolution of #39497
2019-02-28 16:00:32 -05:00
Lee Hinman dae48ba262 Add details about what acquired the shard lock last (#38807)
This adds a `details` parameter to shard locking in `NodeEnvironment`. This is
intended to be used for diagnosing issues such as

```
  1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index
  1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService     ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?]
```

In the hope that we will be able to determine why the shard is still locked.

Relates to #30290 as well as some other CI failures
2019-02-28 10:50:47 -07:00
Armin Braun e564c4d8ad
Add Package Level JavaDoc on Snapshots (#38108) (#39514)
* Add Package Level JavaDoc on Snapshots
2019-02-28 18:23:01 +01:00
Simon Willnauer 5c96b90ed5 Never block on scheduled refresh if a refresh is running (#39462)
Today we block on the ReferenceManager in the case of a scheduled refresh.
Yet if there is a refresh happening concurrently we might block and create
very smallish segments. Instead we should just move on to the next shard
and free up the refresh thread instead.
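
The non-blocking behaviour can be sketched with a try-acquire instead of a blocking acquire. `ScheduledRefresher` is a hypothetical class and a `Semaphore` stands in for the real refresh machinery: if a refresh is already running, the scheduled one gives up immediately so the caller can move on to the next shard.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch; not the Elasticsearch engine code.
final class ScheduledRefresher {
    private final Semaphore permit = new Semaphore(1);

    /** Runs the refresh unless one is already in progress; never blocks. */
    boolean maybeRefresh(Runnable refresh) {
        if (!permit.tryAcquire()) {
            return false; // someone else is refreshing, skip instead of waiting
        }
        try {
            refresh.run();
            return true;
        } finally {
            permit.release();
        }
    }
}
```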
2019-02-28 11:57:45 +01:00
Armin Braun d3d7d9bb9d
Remove Dead Code + Duplication in o.e.c.routing (#36678) (#39493)
* Removed obviously unused fields+methods
* Inlined public methods that only had one caller
* Simplified `Optional` chain
* Simplified some obviously redundant conditions
2019-02-28 10:33:05 +01:00
Armin Braun 90ab4a6f6e
Stabilize RareClusterState (#38671) (#39468)
* Use actual master node, not just a master eligible node when trying to cancel publication. This only works on the master and for unlucky seeds we never try the master within the 10s that the busy assert runs.
* Closes #36813
2019-02-28 08:01:52 +01:00
Tanguy Leroux 4dd274b51d Unmute CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState() (#39452)
This commit unmutes the test and comments out the
offending call to linearizabilityChecker.isLinearizable() as suggested
in #39437
2019-02-27 20:38:54 +01:00