Commit Graph

3934 Commits

Author SHA1 Message Date
Armin Braun 6bd033931b
Add Consistency Assertion to SnapshotsInProgress (#47598) (#47633)
Assert given input shards and indices are consistent.
Also, fixed the equality check for SnapshotsInProgress.
Before this change the tests never had more than a single waiting
shard per index so they never failed as a result of the
waiting shards list not being ordered.

Follow up to #47552
2019-10-07 10:37:56 +02:00
Luca Cavanna 736fceb18b
Fold InitialSearchPhase into AbstractSearchAsyncAction (#47182)
Historically, we have two base classes for search actions that generally need to fan out to multiple shards and then move on to the following phase: InitialSearchPhase and AbstractSearchAsyncAction that extends it. Practically, every search action extends the latter, and there are no direct subclasses of InitialSearchPhase in our codebase.

This commit folds InitialSearchPhase into AbstractSearchAsyncAction in the attempt of simplifying things and making the search code running on the coordinating node easier to reason about.
2019-10-07 10:10:04 +02:00
Martijn van Groningen f2f2304c75
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-10-07 10:07:56 +02:00
Armin Braun 22679c7932
Fix Snapshot Corruption in Edge Case (#47552) (#47620)
This fixes missing to marking shard snapshots as failures when
multiple data-nodes are lost during the snapshot process or
shard snapshot failures have occured before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard would lead to that shard not being marked as failed during
snapshot finalization. Fixed by corectly keeping track of all previous completed
shard states as well in this case.
Also, added an assertion that without this fix would trip on almost every run of the
resiliency tests and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.

Closes #47550
2019-10-05 15:01:06 +02:00
Armin Braun f2d2ca21e2
Cleaner Handling of Store Refcount in BlobStoreRepository (#47560) (#47594)
If a shard gets closed we properly abort its snapshot
before closing it. We should in thise case make sure to
not throw a confusing exception about trying to increment
the reference on an already closed shard in the async tasks
if the snapshot is already aborted.
Also, added an assertion to make sure that aborts are in
fact the only situation in which we run into a concurrently
closed store.
2019-10-05 09:45:10 +02:00
Gordon Brown e47bdf760e
Fix Rollover error when alias has closed indices (#47148) (#47539)
Rollover previously requested index stats for all indices in the
provided alias, which causes an exception when there is a closed index
with that alias.

This commit adjusts the IndicesOptions used on the index stats
request so that closed indices are ignored, rather than throwing
an exception.
2019-10-04 17:40:05 -06:00
Jason Tedor 35ca3d68d7
Validating monitoring hosts setting while parsing (#47571)
This commit lifts the validation of the monitoring hosts setting into
the setting itself, rather than when the setting is used. This prevents
a scenario where an invalid value for the setting is accepted, but then
later fails while applying a cluster state with the invalid setting.
2019-10-04 17:32:49 -04:00
Mark Tozzi e404f7ea80
DocValueFormat implementation for date range fields (#47472) (#47605) 2019-10-04 17:21:17 -04:00
Lee Hinman 79376b7219 Set default SLM retention invocation time (#47604)
This adds a default for the `slm.retention_schedule` setting, setting it
to `0 30 1 * * ?` which is 1:30am every day.

Having retention unset meant that it would never be invoked and clean up
snapshots. We determined it would be better to have a default than never
to be run. When coming to a decision, we weighed the option of an
absolute time (such as 1:30am) versus a periodic invocation (like every
12 hours). In the end we decided on the absolute time because it has
better predictability and consistency than a periodic invocation, which
would rely on when the master node were elected or restarted.

Relates to #43663
2019-10-04 15:00:20 -06:00
Armin Braun c1be7a802c
Simplify Snapshot Delete Process (#47439) (#47533)
We don't need to read the SnapshotInfo for a
snapshot to determine the indices that need to
be updated when it is deleted as the `RepositoryData`
contains that information already.
This PR makes it so the `RepositoryData` is used to
determine which indices to update and also removes
the special handling for deleting snapshot metadata
and the CS snapshot blob and has those simply be
deleted as part of the deleting of other unreferenced
blobs in the last step of the delete.

This makes the snapshot delete a little faster and
more resilient by removing two RPC calls
(the separate delete and the get).

Also, this shortens the diff with #46250 as a
side-effect.
2019-10-04 13:55:16 +02:00
David Roberts defc97a300 Remove fallback for controller location (#47104)
This change removes the temporary controller
location fallback introduced in #47013.

Relates elastic/ml-cpp#593
2019-10-04 09:50:26 +01:00
Ryan Ernst f32692208e
Add explanations to script score queries (#46693) (#47548)
While function scores using scripts do allow explanations, they are only
creatable with an expert plugin. This commit improves the situation for
the newer script score query by adding the ability to set the
explanation from the script itself.

To set the explanation, a user would check for `explanation != null` to
indicate an explanation is needed, and then call
`explanation.set("some description")`.
2019-10-03 21:05:05 -07:00
Nhat Nguyen 5e4732f2bb Limit number of retaining translog files for peer recovery (#47414)
Today we control the extra translog (when soft-deletes is disabled) for
peer recoveries by size and age. If users manually (force) flush many
times within a short period, we can keep many small (or empty) translog
files as neither the size or age condition is reached. We can protect
the cluster from running out of the file descriptors in such a situation
by limiting the number of retaining translog files.
2019-10-03 20:45:29 -04:00
Armin Braun bac119f672
Fix getSnapshotIndexMetaData Exception Behavior (#47488) (#47496)
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot being gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.

Closes #46508
2019-10-03 12:47:50 +02:00
Armin Braun 7549be4489
Fix es.http.cname_in_publish_address Deprecation Logging (#47451)
Since the property defaulted to `true` this deprecation logging
runs every time unless its set to `false` manually (in which case
it should've also logged but didn't).
I didn't add a tests and removed the tests we had in `7.x` that
covered this logging. I did move the check out of the `if (InetAddresses.isInetAddress(hostString) == false) {` condition so this is sort-of covered by the REST tests.
IMO, any unit-test of this would be somewhat redundant and would've forced
adding a field that just indicates that the deprecated property was used to
every instance which seemed pointless.

Closes #47436
2019-10-03 11:10:48 +02:00
Alpar Torok 0a14bb174f Remove eclipse conditionals (#44075)
* Remove eclipse conditionals

We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.

This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change eclipse can import projects, and the visibility rules
are correct e.x. auto compete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.

Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.

This might cause problems when building since Eclipse will probably not
know the right order to build things in so more wirk might be necesarry.
2019-10-03 11:55:00 +03:00
Armin Braun 0beb5263b4
Fix Snapshot Finalization not Waiting for Index Metadata (#47445) (#47459)
* Fix Snapshot Finalization not Waiting for Index Metadata

We were mixing up the listeners here which led to the final listener
that should be called after all the metadata has been written
to be called before that.
I fixed this by removing the one redundant listener and flattening
the logic out.

* Closes #47425
2019-10-02 23:26:18 +02:00
Jason Tedor 52b97ec539
Allow setting validation against arbitrary types (#47264)
Today when settings validate, they can only validate against settings
that are of the same type. While this strong-type is convenient from a
development perspective, it is too limiting in that some settings need
to validate against settings of a different type. For example, the list
setting xpack.monitoring.exporters.<namespace>.host wants to validate
that it is non-empty if and only if the string setting
xpack.monitoring.exporters.<namespace>.type is "http". Today this is
impossible since the settings validation framework only allows that
setting to validate against other list settings. This commit increases
the flexibility here to validate against settings of arbitrary type, at
the expense of losing strong-typing during development.
2019-10-02 16:31:06 -04:00
Jim Ferenczi c340814b34 Fix highlighting of overlapping terms in the unified highlighter (#47227)
The passage formatter that the unified highlighter use doesn't handle terms with overlapping offsets.
For tokenizer that provides multiple segmentation of the same terms (edge ngram for instance) the formatter
should select the largest span in order to highlight the term only once. This change implements this logic.
2019-10-02 16:34:12 +02:00
Yannick Welsch f7980e9745 Adapt version constants after backport (#47353) 2019-10-02 14:26:23 +02:00
Yannick Welsch 99d2fe295d Use optype CREATE for single auto-id index requests (#47353)
Changes auto-id index requests to use optype CREATE, making it compliant with our docs.
This will also make these auto-id index requests compatible with the new "create-doc" index
privilege (which is based on the optype), the default optype is changed to create, just as it is
already documented.
2019-10-02 14:16:52 +02:00
Yannick Welsch 0024695dd8 Disallow externally generated autoGeneratedTimestamp (#47341)
The autoGeneratedTimestamp field is internally used to speed up indexing of operations with
auto-ids, as we can rule out duplicates. Setting this field externally can make the index
inconsistent, resulting in duplicate documents with same id.
2019-10-02 14:16:52 +02:00
Yannick Welsch 8c11fe610e Use standard semantics for retried auto-id requests (#47311)
Adds support for handling auto-id requests with optype CREATE. Also simplifies the code
handling this by using the standard indexing path when dealing with possible retry conflicts.

Relates #47169
2019-10-02 14:16:52 +02:00
Yannick Welsch 7b2613db55 Allow optype CREATE for append-only indexing operations (#47169)
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
2019-10-02 14:16:52 +02:00
Jim Ferenczi 42c5054e52 Fix alias field resolution in match query (#47369)
Synonym queries (when two tokens/paths start at the same position) use the alias field instead
of the concrete field to build Lucene queries. This commit fixes this bug by resolving the alias field upfront in order to provide the concrete field to the actual query parser.
2019-10-02 11:45:43 +02:00
Nhat Nguyen 5cfcd7c458 Re-fetch shard info of primary when new node joins (#47035)
Today, we don't clear the shard info of the primary shard when a new
node joins; then we might risk of making replica allocation decisions
based on the stale information of the primary. The serious problem is
that we can cancel the current recovery which is more advanced than the
copy on the new node due to the old info we have from the primary.

With this change, we ensure the shard info from the primary is not older
than any node when allocating replicas.

Relates #46959

This work was done by Henning in #42518.

Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
2019-10-01 22:16:26 -04:00
Gordon Brown ba6ee2d40d
[7.x] Adjust randomization in cluster shard limit tests (#47254)
This commit adjusts randomization for the cluster shard limit tests so
that there is often more of a gap left between the limit and the size of
the first index. This allows the same randomization to be used for all
tests, and alleviates flakiness in
`testIndexCreationOverLimitFromTemplate`.
2019-10-01 14:53:10 -06:00
David Turner 99b25d3740 Keep nodes above watermark in testAutomaticReleaseOfIndexBlock (#47387)
Today the comment boldly claims that this line of code keeps nodes above the
10-byte low watermark when in fact this is not true at all. This change fixes
this so that it really does keep nodes above the low watermark.

Fixes #45338. Again.
2019-10-01 19:58:23 +01:00
Armin Braun 3d6ef6a90e
Speed up and Reorder Snapshot Delete Operations (#47293) (#47350)
This is a preliminary of #46250 making the snapshot
delete work by doing all the metadata updates first
and then bulk deleting all of the now unreferenced
blobs.
Before this change, the metadata updates for each shard
and subsequent deletion of the blobs that have become unreferenced
due to the delete would happen sequentially shard-by-shard
parallelising only over all the indices in the snapshot.
This change makes it so the all the metadata updates
happen in parallel on a shard level first.
Once all of the updates of shard-level metadata have finished,
all the now unreferenced blobs are deleted in bulk.
This has two benefits (outside of making #46250 a smaller change):
* We have a lower likelihood of failing to update shard level metadata because
it happens with priority and a higher degree of parallelism
* Deleting of unreferenced data in the shards should go much faster in many cases (rolling indices, large number of indices with many unchanged shards) as well because a number of small bulk deletions (just two blobs for `index-N` and `snap-` for each unchanged shard) are grouped into larger bulk deletes of `100-1000` blobs depending on Cloud provider (even though the final bulk deletes are happening sequentially this should be much faster in almost all cases as you'd parallelism of 50 (GCS) to 500 (S3) snapshot threads to achieve the same delete rates when deleting from unchanged shards).
2019-10-01 19:05:43 +02:00
Colin Goodheart-Smithe c93b39c65b
Adds version 7.4.1 2019-10-01 16:03:11 +01:00
Howard a9cd42c05d Cancel recoveries even if all shards assigned (#46520)
We cancel ongoing peer recoveries if a node joins the cluster with a completely
up-to-date copy of a shard, because we can use such a copy to recover a replica
instantly. However, today we only look for recoveries to cancel while there are
unassigned shards in the cluster. This means that we do not contemplate the
cancellation of the last few recoveries since recovering shards are not
unassigned.  It might take much longer for these recoveries to complete than
would be necessary if they were cancelled.

This commit fixes this by checking for cancellable recoveries even if all
shards are assigned.
2019-10-01 10:55:32 +01:00
Ignacio Vera 03d717dc32
Provide better error when updating geo_shape field mapper settings (#47281) (#47338) 2019-10-01 10:52:39 +02:00
Yannick Welsch dd0af2e425 Fix CloseIndexIT.testRelocatedClosedIndexIssue (#47169)
Closes #47330
2019-10-01 08:34:27 +02:00
Armin Braun 3d23cb44a3
Speed up Snapshot Finalization (#47283) (#47309)
As a result of #45689 snapshot finalization started to
take significantly longer than before. This may be a
little unfortunate since it increases the likelihood
of failing to finalize after having written out all
the segment blobs.
This change parallelizes all the metadata writes that
can safely run in parallel in the finalization step to
speed the finalization step up again. Also, this will
generally speed up the snapshot process overall in case
of large number of indices.

This is also a nice to have for #46250 since we add yet
another step (deleting of old index- blobs in the shards
to the finalization.
2019-09-30 23:28:59 +02:00
Jason Tedor 890951113f
Make Setting#getRaw have private access (#47287)
The method Setting#getRaw leaks implementation details about settings,
namely that they are backed by strings. We do not want code to rely upon
this, so this commit makes Setting#getRaw private as a first step
towards hiding the implementaton details of settings from the rest of
the codebase.
2019-09-30 14:14:30 -04:00
David Turner 72b63635de
Remove unused pluggable metadata upgraders (#47277)
Today plugins may provide upgraders for custom metadata and index metadata, but
these upgraders are bypassed during a rolling restart. Fortunately this
extension mechanism is unused by all known plugins. This commit removes these
extension points.

Relates #47297
2019-09-30 16:58:29 +01:00
Gaurav614 052c523d41 Fail allocation of new primaries in empty cluster (#43284)
Today if you create an index in a cluster without any data nodes then it will
report yellow health because it never attempts to assign any shards if there
are no data nodes, so the new shards remain at `AllocationStatus.NO_ATTEMPT`.
This commit moves the new primaries to `AllocationStatus.DECIDERS_NO` in this
situation, causing the cluster health to move to red.

Fixes #41073
2019-09-30 14:27:12 +01:00
Yannick Welsch 467596871a Omit writing index metadata for non-replicated closed indices on data-only node (#47285)
Fixes a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index
metadata storage mechanism, which has special handling for closed indices (but incorrectly
handles replicated closed indices). On non-master-eligible data nodes, it's possible for the
node's manifest file (which tracks the relevant metadata state that the node should persist) to
become out of sync with what's actually stored on disk, leading to an inconsistency that is then
detected at startup, refusing for the node to start up.

Closes #47276
2019-09-30 13:56:52 +02:00
Przemyslaw Gomulka d9a7bcef21
Support optional parsers in any order with DateMathParser Backport(46654) (#47217)
Currently DateMathParser with roundUp = true is relying on the DateFormatter build with
combined optional sub parsers with defaulted fields (depending on the formatter).
That means that for yyyy-MM-dd'T'HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss.SSS
Java.time implementation expects optional parsers in order from most specific to
least specific (reverse in the example above).
It is causing a problem because the first parsing succeeds but does not consume the full input.
The second parser should be used.
We can work around this with keeping a list of RoundUpParsers and iterate over them choosing
the one that parsed full input. The same approach we used for regular (non date math) in
relates #40100
The jdk is not considering this to be a bug https://bugs.openjdk.java.net/browse/JDK-8188771

Those below will expect this change first
relates #46242
relates #45284
backport #46654
2019-09-30 13:54:52 +02:00
Yannick Welsch 9dc90e41fc Remove "force" version type (#47228)
It's been deprecated long ago and can be removed.

Relates to #20377

Closes #19769
2019-09-30 11:58:34 +02:00
Martijn van Groningen 66f72bcdbc
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-09-30 08:12:28 +02:00
Rory Hunter 53a4d2176f
Convert most awaitBusy calls to assertBusy (#45794) (#47112)
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.

There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.

I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.

Other changes:

  * Rework `waitForDocs` to scale its timeout. Instead of calling
    `assertBusy` in a loop, work out a reasonable overall timeout and await
    just once.
  * Some tests failed after switching to `assertBusy` and had to be fixed.
  * Correct the expect templates in AbstractUpgradeTestCase.  The ES
    Security team confirmed that they don't use templates any more, so
    remove this from the expected templates. Also rewrite how the setup
    code checks for templates, in order to give more information.
  * Remove an expected ML template from XPackRestTestConstants The ML team
    advised that the ML tests shouldn't be waiting for any
    `.ml-notifications*` templates, since such checks should happen in the
    production code instead.
  * Also rework the template checking code in `XPackRestTestHelper` to give
    more helpful failure messages.
  * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
2019-09-29 12:21:46 +01:00
Jason Tedor 98989f7b37
Use fallback settings in throttling decider (#47261)
This commit replaces some uses of Setting#getRaw in the throttling
allocation decider settings. Instead, these settings should be using
fallback settings.
2019-09-28 08:06:24 -04:00
Jason Tedor bd603b0a7b
Remove dead leniency in allow rebalance setting use (#47259)
This commit removes some leniency that exists in getting the allow
rebalance setting. Fortunately, that leniency is dead code, this can
never happen. The reason this can never happen is because the settings
infrastructure will not allow setting an invalid value for this
setting. If you try to set this in the elasticsearch.yml, then the node
will fail to start, since parsing the setting will fail. If you try to
set this via an update settings API call, then parsing the setting will
fail and the settings update will be rejected. Therefore, this leniency
can never be activated, so we remove it.

This commit is the first of a few in an attempt to remove the public
uses of Setting#getRaw.
2019-09-28 08:05:37 -04:00
Martijn van Groningen 76b66634e9
put provided argument on the previous line just like in master branch,
that way this doesn't show in the final pr.
2019-09-27 15:22:09 +02:00
David Roberts e943e27954 Spawn controller processes from a different directory on macOS (#47013)
This is the Java side of https://github.com/elastic/ml-cpp/pull/593
with a fallback so that ml-cpp bundles with either the
new or old directory structure work for the time being.
A few days after merging the C++ changes a followup to
this change will be made that removes the fallback.
2019-09-27 14:02:40 +01:00
Martijn van Groningen 7ffe2e7e63
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-09-27 14:42:11 +02:00
Yannick Welsch 6fd3b4723f Remove write lock for Translog.getGeneration (#47036)
No need for the write lock, and currentFileGeneration is already protected by the read lock. 
Also removes the unused method "isCurrent".
2019-09-27 13:58:07 +02:00
Jim Ferenczi 73a09b34b8
Replace SearchContextException with SearchException (#47046)
This commit removes the SearchContextException in favor of a simpler
SearchException that doesn't leak the SearchContext.

Relates #46523
2019-09-26 14:21:23 +02:00
Tanguy Leroux 95e2ca741e
Remove unused private methods and fields (#47154)
This commit removes a bunch of unused private fields and unused
private methods from the code base.

Backport of (#47115)
2019-09-26 12:49:21 +02:00
jimczi 97d977f381 #47046 Fix serialization version check after backport 2019-09-26 09:56:24 +02:00
Jim Ferenczi 04972baffa
Merge ShardSearchTransportRequest and ShardSearchLocalRequest (#46996) (#47081)
This change merges the `ShardSearchTransportRequest` and `ShardSearchLocalRequest`
into a single `ShardSearchRequest` that can be used to create a SearchContext.

Relates #46523
2019-09-26 09:20:53 +02:00
Martijn van Groningen 429f23ea2f
Allow ingest processors to execute in a non blocking manner. (#47122)
Backport of #46241

This PR changes the ingest executing to be non blocking
by adding an additional method to the Processor interface
that accepts a BiConsumer as handler and changing
IngestService#executeBulkRequest(...) to ingest document
in a non blocking fashion iff a processor executes
in a non blocking fashion.

This is the second PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.

This change originates from the enrich branch and was introduced there
in #43361.
2019-09-26 08:55:28 +02:00
David Turner 45c7783018
Warn on slow metadata persistence (#47130)
Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.

Backport of #47005.
2019-09-26 07:40:54 +01:00
Tim Brooks 4f47e1f169
Extract proxy connection logic to specialized class (#47138)
Currently the logic to check if a connection to a remote discovery node
exists and otherwise create a proxy connection is mixed with the
collect nodes, cluster connection lifecycle, and other
RemoteClusterConnection logic. This commit introduces a specialized
RemoteConnectionManager class which handles the open connections.
Additionally, it reworks the "round-robin" proxy logic to create the list
of potential connections at connection open/close time, opposed to each
time a connection is requested.
2019-09-25 15:58:18 -06:00
Nhat Nguyen 7c5a088aa5 Increase ensureGreen timeout for testReplicaCorruption (#47136)
We can have a large number of shard copies in this test. For example,
the two recent failures have 24 and 27 copies respectively and all
replicas have to copy segment files as their stores are corrupted. Our
CI needs more than 30 seconds to start all these copies.

Note that in two recent failures, the cluster was green just after the
cluster health timed out.

Closes #41899
2019-09-25 17:04:08 -04:00
Lee Hinman a267df30fa Wait for snapshot completion in SLM snapshot invocation (#47051)
* Wait for snapshot completion in SLM snapshot invocation

This changes the snapshots internally invoked by SLM to wait for
completion. This allows us to capture more snapshotting failure
scenarios.

For example, previously a snapshot would be created and then registered
as a "success", however, the snapshot may have been aborted, or it may
have had a subset of its shards fail. These cases are now handled by
inspecting the response to the `CreateSnapshotRequest` and ensuring that
there are no failures. If any failures are present, the history store
now stores the action as a failure instead of a success.

Relates to #38461 and #43663
2019-09-25 14:25:22 -06:00
Armin Braun 93fcd23da8
Fail Snapshot on Corrupted Metadata Blob (#47009) (#47096)
We should not be quietly ignoring a corrupted shard-level index-N
blob. Simply creating a new empty shard-level index-N and moving
on means that all snapshots of that shard show `SUCESS` as their
state at the repository root but are in fact broken.
This change at least makes it visible to the user that they can't
snapshot the given shard any more and forces the user to move on
to a new repository since the current one is broken and will not
allow snapshotting the inconsistent shard again.

Also, this change stops the delete action for shards with broken
index-N blobs instead of simply deleting all blobs in the path
containing the broken index-N. This prevents a temporarily
broken/missing index-N blob from corrupting all snapshots of
that shard.
2019-09-25 15:55:33 +02:00
Nhat Nguyen 22575bd7e6 Remove isRecovering method from Engine (#47039)
We already prevent flushing in Engine if it's recovering. Hence, we can
remove the protection in IndexShard.
2019-09-25 08:58:08 -04:00
Armin Braun c4a166fc9a
Simplify SnapshotResiliencyTests (#46961) (#47108)
Simplify `SnapshotResiliencyTests` to more closely
match the structure of `AbstractCoordinatorTestCase` and allow for
future drying up between the two classes:

* Make the test cluster nodes a nested-class in the test cluster itself
* Remove the needless custom network disruption implementation and
  simply track disconnected node ids like `AbstractCoordinatorTestCase`
  does
2019-09-25 14:53:11 +02:00
Yannick Welsch 81cbd3fba4 Mute ClusterShardLimitIT.testIndexCreationOverLimitFromTemplate
Relates #47107
2019-09-25 14:03:08 +02:00
David Turner ac920e8e64 Assert no exceptions during state application (#47090)
Today we log and swallow exceptions during cluster state application, but such
an exception should not occur. This commit adds assertions of this fact, and
updates the Javadocs to explain it.

Relates #47038
2019-09-25 12:32:51 +01:00
Martijn van Groningen eef1ba3fad
Make ingest pipeline resolution logic unit testable (#47026)
Extracted ingest pipeline resolution logic into a static method
and added unit tests for pipeline resolution logic.

Followup from #46847
2019-09-25 11:35:00 +02:00
Daniel Mitterdorfer 48df560593
Emit log message when parent circuit breaker trips (#47000) (#47073)
We emit a debug log message whenever a child circuit breaker trips (in
`ChildMemoryCircuitBreaker#circuitBreak(String, long)`) but we never
emit a log message when the parent circuit breaker trips. As this is
more likely to happen with the real memory circuit breaker it is not
possible to detect this in the logs. With this commit we add a log
message on the same log level (debug) when the parent circuit breaker
trips.
2019-09-25 10:22:46 +02:00
Julie Tibshirani 41ee8aa6fc Reject regexp queries on the _index field. (#46945)
We speculatively added support for `regexp` queries on the `_index` field in
#34089 (this functionality was not actually requested by a user). Supporting
regex logic adds complexity to the `_index` field for not much gain, so we
would like to remove it.

From an end-to-end test it turns out this functionality never even worked in
the first place because of an error in how regex flags were interpreted! For
this reason, we can remove support for `regexp` on `_index` without a
deprecation period.

Relates to #46640.
2019-09-24 12:17:00 -07:00
Tim Brooks 71ec0707cf
Remove locking around connection attempts (#46845)
Currently in the ConnectionManager we lock around the node id. This is
odd because we key connections by the ephemeral id. Upon further
investigation it appears to me that we do not need the locking. Using
the concurrent map, we can ensure that only one connection attempt
completes. There is a very small chance that a new connection attempt
will proceed right as another connection attempt is completing. However,
since the whole process is asynchronous and event oriented
(lightweight), that does not seem to be an issue.
2019-09-24 11:05:42 -06:00
Tim Brooks f02582de4b
Reduce a bind failure to trace logging (#46891)
Due to recent changes in the nio transport, a failure to bind the server
channel has started to be logged at an error level. This exception leads
to an automatic retry on a different port, so it should only be logged
at a trace level.
2019-09-24 10:32:18 -06:00
David Turner 9135e2f9e3 Improve LeaderCheck rejection messages (#46998)
Today the `LeaderChecker` rejects checks from nodes that are not in the current
cluster with the exception message `leader check from unknown node` which
offers no information about why the node is unknown. In fact the node must have
been in the cluster in the recent past, so it might help guide the user to a
more useful log message if we describe it as a `removed node` instead of an
`unknown node`. This commit changes the exception message like this, and also
tidies up a few other loose ends in the `LeaderChecker`.
2019-09-24 13:41:37 +01:00
David Turner 6943a3101f
Cut PersistedState interface from GatewayMetaState (#46655)
Today `GatewayMetaState` implements `PersistedState` but it's an error to use
it as a `PersistedState` before it's been started, or if the node is
master-ineligible. It also holds some fields that are meaningless on nodes that
do not persist their states. Finally, it takes responsibility for both loading
the original cluster state and some of the high-level logic for writing the
cluster state back to disk.

This commit addresses these concerns by introducing a more specific
`PersistedState` implementation for use on master-eligible nodes which is only
instantiated if and when it's appropriate. It also moves the fields and
high-level persistence logic into a new `IncrementalClusterStateWriter` with a
more appropriate lifecycle.

Follow-up to #46326 and #46532
Relates #47001
2019-09-24 12:31:13 +01:00
Julie Tibshirani 9124c94a6c
Add support for aliases in queries on _index. (#46944)
Previously, queries on the _index field were not able to specify index aliases.
This was a regression in functionality compared to the 'indices' query that was
deprecated and removed in 6.0.

Now queries on _index can specify an alias, which is resolved to the concrete
index names when we check whether an index matches. To match a remote shard
target, the pattern needs to be of the form 'cluster:index' to match the
fully-qualified index name. Index aliases can be specified in the following query
types: term, terms, prefix, and wildcard.
2019-09-23 13:21:37 -07:00
Jim Ferenczi 08f28e642b Replace SearchContext with QueryShardContext in query builder tests (#46978)
This commit replaces the SearchContext used in AbstractQueryTestCase with
a QueryShardContext in order to reduce the visibility of search contexts.

Relates #46523
2019-09-23 20:24:02 +02:00
Eray 199fff8a55 Allow max_children only in top level nested sort (#46731)
This commit restricts the usage of max_children to the top level nested sort since it is ignored on the other levels.
2019-09-23 18:53:50 +02:00
Armin Braun 2da040601b
Fix Bug in Snapshot Status Response Timestamps (#46919) (#46970)
Fixing a corner case where snapshot total time calculation was off when
getting the `SnapshotStatus` of an in-progress snapshot.

Closes #46913
2019-09-23 15:01:47 +02:00
David Turner 7bc86f23ec Wait longer for leader failure in logs test (#46958)
`testLogsWarningPeriodicallyIfClusterNotFormed` simulates a leader failure and
waits for long enough that a failing leader check is scheduled. However it does
not wait for the failing check to actually fail, which requires another two
actions and therefore might take up to 200ms more. Unlucky timing would result
in this test failing, for instance:

    ./gradle ':server:test' \
      --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLogsWarningPeriodicallyIfClusterNotFormed" \
      -Dtests.jvm.argline="-Dhppc.bitmixer=DETERMINISTIC" \
      -Dtests.seed=F18CDD0EBEB5653:E9BC1A8B062E697A

This commit adds the extra delay needed for the leader failure to complete as
expected.

Fixes #46920
2019-09-23 10:52:13 +01:00
Martijn van Groningen 33bbc4798b
fixed compile errors after merging 2019-09-23 09:46:14 +02:00
Martijn van Groningen 0cfddca61d
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-09-23 09:46:05 +02:00
Armin Braun ee4e6b1382
Add TestLogging for #46701 (#46939) (#46949)
This at a very low rate and with the force merge in place
before checking the cache size it's not clear why the cache
is not of size `0` -> seems something else
must be happening here that is unexpected.
-> add debug logging to this test to find out

Relates #46701
2019-09-21 15:24:58 +02:00
Armin Braun 938648fcff
Remove Duplicate Shard Snapshot State Updates (#46862) (#46906)
We were repeatedly trying to send shard state updates for aborted
snapshots on every cluster state update.
This is simply dead-code since those updates are already safely
sent in the callbacks passed to `SnapshotShardsService#snapshot`.
On master failover, we ensure that the status update is resent
via `SnapshotShardsService#syncShardStatsOnNewMaster`.
=> there is no need for trying to send updates here over and over
and this logic can safely be removed
2019-09-20 14:30:03 +02:00
Jason Tedor 97acf353fa
Move pipelines resolved assertion (#46892)
This assertion was added during the development of required
pipelines. In the initial version of that work, the notion of whether or
not a request was forwarded from the coordinating node to an ingest node
was introduced. It was realized later that instead we needed to track
whether or not the pipeline for the request was resolved. When that
change was made, this assertion, while not incorrect, was left behind
and only applied if the coordnating node was forwarding the
request. Instead, the assertion applies whether or not the request is
being forwarded. This commit addresses that by moving the assertion and
renaming some variables.
2019-09-20 07:27:56 -04:00
Jason Tedor 2425fd1a50
Removed unused import from RequiredPipelineIT.java
This commit removes an unused import that was left behind after cleaning
up a backport. Sorry.
2019-09-19 16:46:27 -04:00
Jason Tedor bd77626177
Add the ability to require an ingest pipeline (#46847)
This commit adds the ability to require an ingest pipeline on an
index. Today we can have a default pipeline, but that could be
overridden by a request pipeline parameter. This commit introduces a new
index setting index.required_pipeline that acts similarly to
index.default_pipeline, except that it can not be overridden by a
request pipeline parameter. Additionally, a default pipeline and a
request pipeline can not both be set. The required pipeline can be set
to _none to ensure that no pipeline ever runs for index requests on that
index.
2019-09-19 16:37:45 -04:00
Armin Braun c922743c5d
Remove Bogus Test: testDeleteOrphanSnapshot (#46835) (#46874)
This test is broken with a very low failure rate after recent changes.
Particularly after #45689 which does not check for duplicate snapshot
uuids during snapshot finalization any more.
The check for duplicate uuids during finalization was removed
conciously since it lead to problems during master failover.
This test fails because it increments the repository state id in an
unexpected manner now, starting from the impossible situation of having
the same snapshot UUID for two different repository state ids.
This situation can't normally be reached, but we manually crafted it here.

This test didn't do anything before though, because the manually crafted
cluster state would simply result in an error during finalization before
and nothing but a normal snapshot delete would be tested.
=> removing this test here, it doesn't test anything.

Closes #46843
2019-09-19 18:52:35 +02:00
Yannick Welsch 9638ca20b0 Allow dropping documents with auto-generated ID (#46773)
When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat
as well) + coordinating nodes that do not have the ingest processor functionality, this can lead
to a NullPointerException.

The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when
the request contains auto-generated IDs. The response serialization is lenient for our
REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for
communication between nodes) assumes for this field to be non-null, which means that it can't
be serialized between nodes. Bulk requests with ingest functionality are processed on the
coordinating node if the node has the ingest capability, and only otherwise sent to a different
node. This means that, in order to reproduce this, one needs two nodes, with the coordinating
node not having the ingest functionality.

Closes #46678
2019-09-19 16:46:33 +02:00
Armin Braun a087553009
Rearrange BlobStoreRepository to Prepare #46250 (#46824) (#46853)
In #46250 we will have to move to two different delete
paths for BwC. Both paths will share the initial
reads from the repository though so I extracted/separated the
delete logic away from the initial reads to significantly simplify
the diff against #46250 here.
Also, I added some JavaDoc from #46250 here as well which makes the
code a little easier to follow even ignoring #46250 I think.
2019-09-19 13:07:00 +02:00
Tanguy Leroux 3ae51f25dd Move testSnapshotWithLargeSegmentFiles to ESMockAPIBasedRepositoryIntegTestCase (#46802)
This commit moves the common test testSnapshotWithLargeSegmentFiles 
to the ESMockAPIBasedRepositoryIntegTestCase base class.
2019-09-18 15:41:30 +02:00
Christos Soulios 0076083b35
Implement rounding optimization for fixed offset timezones (#46809)
Fixes #45702 with date_histogram aggregation when using fixed_interval.
Optimization has been implemented for both fixed and calendar intervals
2019-09-18 15:56:34 +03:00
Armin Braun 142b10604e
Fix testHistoryRetention (#46799) (#46805)
Suppress the reasonable-history check in this test to guarantee we're always getting ops based recovery even after a background sync.

Closes #45953

Co-Authored-By: David Turner <david.turner@elastic.co>
2019-09-18 13:22:55 +02:00
Martijn van Groningen ac4e990924
Add ingest cluster state listeners (#46650)
In the case that an ingest processor factory relies on other configuration
in the cluster state in order to construct a processor instance then
it is currently undetermined if a processor facotry can be notified about
a change if multiple cluster state updates are bundled together and
if a processor implement `ClusterStateApplier` interface.
(IngestService implements this interface too)

The idea with ingest cluster state listener is that it is guaranteed to
update the processor factory first before the ingest service creates
a pipeline with their respective processor instances.

Currently this concept is used in the enrich branch:
https://github.com/elastic/elasticsearch/blob/enrich/x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichProcessorFactory.java#L21

In this case it a processor factory is interested in enrich indices' _meta
mapping fields.

This is the third PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

Changes to the server module are merged separately from the pr that will
merge enrich into master, so that these changes can be reviewed in isolation.
2019-09-18 09:13:16 +02:00
Armin Braun 2c70d403fc
Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303) (#46747)
Reenable this test since it was fixed by #45689 in production
code (specifically, the fact that we write the `snap-` blobs
without overwrite checks now).
Only required adding the assumed blocking on index file writes
to test code to properly work again.

* Closes #25281
2019-09-17 18:09:48 +02:00
Armin Braun b0f09b279f
Make Snapshot Logic Write Metadata after Segments (#45689) (#46764)
* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
   * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes #41581
2019-09-17 13:09:39 +02:00
Armin Braun c045bc7f54
Minor Rearrangements in Snapshot Code (#46652) (#46752)
Inlining one trivial single-use method and extracting the
stale shard path blob calculation to make the diff with #46250
more manageable.
2019-09-17 09:23:00 +02:00
Armin Braun 20cb95ca5e
Fix testSnapshotRelocatingPrimary to Actually Run Relocations (#46594) (#46620)
Without replicas we won't actually get any relocations
going when removing the node constraints in this test.
Adjusted the code to force relocations by forbidding
nodes that hold primaries instead.
Also, fixed the timeouts and asserted that we actually
get relocations.

Fixes #46276
2019-09-16 15:15:33 +02:00
Andrei Dan c57cca98b2
[ILM] Add date setting to calculate index age (#46561) (#46697)
* [ILM] Add date setting to calculate index age

Add the `index.lifecycle.origination_date` to allow users to configure a
custom date that'll be used to calculate the index age for the phase
transmissions (as opposed to the default index creation date).

This could be useful for users to create an index with an "older"
origination date when indexing old data.

Relates to #42449.

* [ILM] Don't override creation date on policy init

The initial approach we took was to override the lifecycle creation date
if the `index.lifecycle.origination_date` setting was set. This had the
disadvantage of the user not being able to update the `origination_date`
anymore once set.

This commit changes the way we makes use of the
`index.lifecycle.origination_date` setting by checking its value when
we calculate the index age (ie. at "read time") and, in case it's not
set, default to the index creation date.

* Make origination date setting index scope dynamic

* Document orignation date setting in ilm settings

(cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2019-09-16 08:50:28 +01:00
Armin Braun 2b85dcb201
Parallelize Repository Cleanup Actions (#46647) (#46714)
* Parallelize Repository Cleanup Actions

Deleting root blobs and unreferenced indices can safely happen in parallel,
no need to have both operations run sequentially when they preclude all other
repository operations.
2019-09-16 07:52:03 +02:00
David Turner 272b0ecbdd Remove docs for proxy mode (#46677)
We added docs for proxy mode in #40281 but on reflection we should not be
documenting this setting since it does not play well with all proxies and we
can't recommend its use. This commit removes those docs and expands its Javadoc
instead.
2019-09-13 22:20:11 +01:00
Nhat Nguyen cabff5a7cd Handle lower retaining seqno retention lease error (#46420)
We renew the CCR retention lease at a fixed interval, therefore it's
possible to have more than one in-flight renewal requests at the same
time. If requests arrive out of order, then the assertion is violated.

Closes #46416
Closes #46013
2019-09-13 08:50:19 -04:00
Nhat Nguyen e1a33c6283 Fix false positive out of sync warning in synced-flush (#46576)
Synced-flush consists of three steps: (1) force-flush on every active
copy; (2) check for ongoing indexing operations; (3) seal copies if
there's no change since step 1. If some indexing operations are
completed on the primary but not replicas, then Lucene commits from step
1 on replicas won't be the same as the primary's. And step 2 would pass
if it's executed when all pending operations are done. Once step 2
passes, we will incorrectly emit the "out of sync" warning message
although nothing wrong here.

Relates #28464
Relates #30244
2019-09-12 16:34:33 -04:00
Nhat Nguyen 5465c8d095 Increase timeout for relocation tests (#46554)
There's nothing wrong in the logs from these failures. I think 30
seconds might not be enough to relocate shards with many documents as CI
is quite slow. This change increases the timeout to 60 seconds for these
relocation tests. It also dumps the hot threads in case of timed out.

Closes #46526
Closes #46439
2019-09-12 16:34:01 -04:00
Zachary Tong e1e06c2589 Add version constant for 7.3.3 2019-09-12 13:50:40 -04:00
Jim Ferenczi 4407f3af1b Delay the creation of SubSearchContext to the FetchSubPhase (#46598)
This change delays the creation of the SubSearchContext for nested and parent/child inner_hits
to the fetch sub phase in order to ensure that a SearchContext can built entirely from a
QueryShardContext. This commit also adds a validation step to the inner hits builder that ensures that we fail the request early if the inner hits path is invalid.

Relates #46523
2019-09-12 14:52:15 +02:00
Igor Motov 35cb93248d Geo: fix indexing of west to east linestrings crossing the antimeridian (#46601)
Fixes that way linestrings that are crossing the antimeridian are
indexed due to a normalization bug these lines were decomposed into
a line segment that was stretching entire globe.

Fixes #43775
2019-09-11 17:43:17 -04:00
Zachary Tong 6dc8ed5d57
[7.x Backport] Refactor AllocatedPersistentTask#init(), move rollup ctor logic (#46406)
This makes the AllocatedPersistentTask#init() method protected so that
implementing classes can perform their initialization logic there,
instead of the constructor.  Rollup's task is adjusted to use this
init method.

It also slightly refactors the methods to se a static logger in the
AllocatedTask instead of passing it in via an argument.  This is
simpler, logged messages come from the task instead of the
service, and is easier for tests
2019-09-11 17:00:28 -04:00
Ryan Ernst fa9327cdb9 Add more meaningful keystore version mismatch errors (#46291)
This commit changes the version bounds of keystore reading to give
better error messages when a user has a too new or too old format.

relates #44624
2019-09-11 09:55:19 -07:00
Jim Ferenczi 23bf310c84 Replace the SearchContext with QueryShardContext when building aggregator factories (#46527)
This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them.
The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`.

Relates #46523
2019-09-11 16:43:30 +02:00
Armin Braun 27c15f137e
Remove Unused Method from BlobStoreRepository (#46204) (#46593)
This method isn't used anymore and I forgot to delete it.
2019-09-11 16:34:24 +02:00
William Brafford 8c9f15db44
Fix Path comparisons for Windows tests (#46503) (#46566)
* Fix Path comparisons for Windows tests

The test NodeEnvironmentTests#testCustonDataPaths worked just fine on
Darwin and Linux, but the comparison was breaking in Windows because one
path had the "C:\" prefix and the other one didn't. The simple fix is to
compare absolute paths rather than potentially relative ones.
2019-09-11 09:33:00 -04:00
Christoph Büscher aa0c586b73 Deprecate `_field_names` disabling (#42854)
Currently we allow `_field_names` fields to be disabled explicitely, but since
the overhead is negligible now we decided to keep it turned on by default and
deprecate the `enable` option on the field type. This change adds a deprecation
warning whenever this setting is used, going forward we want to ignore and finally
remove it.

Closes #27239
2019-09-11 14:58:08 +02:00
Armin Braun 41633cb9b5
More Efficient Ordering of Shard Upload Execution (#42791) (#46588)
* More Efficient Ordering of Shard Upload Execution (#42791)
* Change the upload order of of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657
* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)
2019-09-11 13:59:20 +02:00
Jim Ferenczi 80bb08fbda Replace the SearchContext with QueryShardContext when building collapsing context (#46543)
This commit replaces the `SearchContext` with the `QueryShardContext` when building collapsing conteext
Collapse context is part of the `SearchContext` so it shouldn't require a `SearchContext` to create one.

Relates #46523
2019-09-11 12:25:38 +02:00
Jim Ferenczi 425b1a77e8 Add more context to QueryShardContext (#46584)
This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext.
It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the
query shard context.

Relates #46523
2019-09-11 12:24:51 +02:00
Armin Braun f8d5145472
Fix SnapshotStatusApisIT (#46563) (#46582)
Obviously we have to run the status request again to busy wait for
the `STARTED` state, just busy waiting on an existing response
won't do anything.

Closes #45917
2019-09-11 11:58:42 +02:00
Lee Hinman cdc3a260af
Add retention to Snapshot Lifecycle Management (backport of #4… (#46506)
* Add retention to Snapshot Lifecycle Management (#46407)

This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to #38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to #43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to #43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (#45018)

Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.

Relates to #43663

* Add min_count and max_count as SLM retention predicates (#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to #43663

* Time-bound deletion of snapshots in retention delete function (#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to #43663

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to #43663

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.

Relates to #43663

* Increase timeout waiting for snapshot completion

* Apply patch

From 2374316f0d.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to #43663

* Check all actions preventing snapshot delete during retention (#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to #43663

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to #43663

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown

* Update serialization versions

* Fix type inference

* Use non-Cancellable HLRC return value

* Fix Client mocking in test

* Fix SLMSnapshotBlockingIntegTests for 7.x branch

* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval

* Add serialization guards for SnapshotLifecyclePolicy
2019-09-10 09:08:09 -06:00
Mayya Sharipova 2c5f9b558b Fix highlighting for script_score query (#46507) 2019-09-10 08:26:47 -04:00
David Turner 6c67b53932
Load metadata at start time not construction time (#46326)
Today we load the metadata from disk while constructing the node. However there
is no real need to do so, and this commit moves that code to run later while
the node is starting instead.
2019-09-10 11:15:10 +01:00
Henning Andersen 9fce5a99d8 Rest Controller wildcard registration (#46487)
Registering two different http methods on the same path using different
wildcard names would result in the last wildcard name being active only.
Now throw an exception instead.

Closes #46482
2019-09-09 21:49:18 +02:00
Zachary Tong 8d17527050 [TEST] create larger cuckoo filters for tests (#46457)
The cuckoofilters could be randomly created with too small of
capacity or precision, which means that they can only absorb a few
values before collisions start to make all filters look identical.

This increases the size of filters we generate (capacity >> than
the test cases) and lower fpp rate.
2019-09-09 10:18:51 -04:00
David Turner 8428f8e6e8 Remove trailing comma from nodes lists (#46484)
Today when the membership of the cluster changes we log messages that describe
the change like this:

    added {{node-1}{OPdaTIGmSxaEXXOyg3o96w}{127.0.0.1}{127.0.0.1:9301}{di},}

The trailing comma suggests there is some missing string that might contain
extra information, but in fact it's an artefact of how these messages are
constructed. This commit removes the trailing comma from these lists.
2019-09-09 14:47:32 +01:00
Armin Braun ee3396735c
Execute SnapshotsService Error Callback on Generic Thread (#46277) (#46480)
I couldn't find a test for this, as it seems we only get
into this error handler on a bug. Regardless, we are
executing the snapshot finalization on the master update
thread here which shouldn't happen and will make
debugging a production issue resulting from this
trickier than it has to be (because we probably also
get a cluster state apply is slow warning in addition
to the original bug).
Used the generic pool here instead of the snapshot pool
because we're resolving the user callback here as
well and the generic pool seemed like the safer bet for
that.
2019-09-09 14:38:11 +02:00
Martijn van Groningen c057fce978
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-09-09 08:40:54 +02:00
Nhat Nguyen 24c3a1de3c Ignore replication for noop updates (#46458)
Previously, we ignore replication for noop updates because they do not
have sequence numbers. Since #44603, we started assigning sequence
numbers to noop updates leading them to be replicated to replicas.

This bug occurs only on 8.0 for it requires #41065 and #44603.

Closes #46366
2019-09-07 11:32:01 -04:00
markharwood 323ec022be
Deprecate the "index.max_adjacency_matrix_filters" index setting (#46394)
Following performance optimisations to the adjacency_matrix aggregation we no longer require this setting. Marked as deprecated and due for removal in 8.0

Related #46324
2019-09-06 13:59:47 +01:00
Yunfeng,Wu 7582af27b0 Resolve the incorrect scroll_current when delete or close index (#45226)
Resolve the incorrect current scroll for deleted or closed index
2019-09-06 09:45:53 +02:00
Jim Ferenczi f2a6c88f83
Add a system property to ignore awareness attributes (#46375)
This is a follow up of #19191 for 7.x.
This change adds a system property called "es.routing.search_ignore_awareness_attributes" that when set to true will
effectively ignore allocation awareness attributes when routing search and get requests. This is now the default in 8.x so this
commit adds a way to opt-in to this new behavior in a minor version of 7.x.

Relates #45735
2019-09-06 09:29:27 +02:00
Paul Sanwald 758680c549
version bump to 6.8.4 (#46409) 2019-09-05 15:14:36 -04:00
Jason Tedor 92866f977a
Clarify error message on keystore write permissions (#46321)
When the Elasticsearch process does not have write permissions to
upgrade the Elasticsearch keystore, we bail with an error message that
indicates there is a filesystem permissions problem. This commit
clarifies that error message by pointing out the directory where write
permissions are required, or that the user can also run the
elasticsearch-keystore upgrade command manually before starting the
Elasticsearch process. In this case, the upgrade would not be needed at
runtime, so the permissions would not be needed then.
2019-09-05 15:11:54 -04:00
Benjamin Trent d912a49c6f
[7.x] Support geotile_grid aggregation in composite agg sources (#45810) (#46399)
* Support geotile_grid aggregation in composite agg sources (#45810)

Adds support for `geotile_grid` as a source in composite aggs. 

Part of this change includes adding a new docFormat of `GEOTILE` that formats a hashed `long` value into a geotile formatting string `zoom/x/y`.
2019-09-05 13:22:57 -05:00
Armin Braun 7a9af874ad
Enable Debug Logging for Master and Coordination Packages (#46363) (#46374)
In order to track down #46091:
* Enables debug logging in REST tests for `master` and `coordination` packages
since we suspect that issues are caused by failed and then retried publications
2019-09-05 14:03:38 +02:00
Yannick Welsch 7e4c633ce3 Quiet down shard lock failures (#46368)
These were actually never intended to be logged at the warning level but made visible by a refactoring in #19991, which introduced a new exception type but forgot to adapt some of the consumers of the exception.
2019-09-05 13:08:11 +02:00
Nhat Nguyen 03ed18a010 Unmute testRecoveryFromFailureOnTrimming
Tracked at #46267
2019-09-04 22:33:17 -04:00
Julie Tibshirani 40c3225d26
First round of optimizations for vector functions. (#46294)
This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors and while computing the vector function. (#46103) 
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
2019-09-04 14:45:57 -07:00
Nhat Nguyen a16cb89956 Revert "Sync translog without lock when trim unreferenced readers (#46203)"
Unfortunately, with this change, we won't clean up all unreferenced
generations when reopening. We assume that there's at most one
unreferenced generation when reopening translog. The previous
implementation guarantees this assumption by syncing translog every time
after we remove a translog reader. This change, however, only syncs
translog once after we have removed all unreferenced readers (can be
more than one) and breaks the assumption.

Closes #46267

This reverts commit fd8183ee51d7cf08d9def58a2ae027714beb60de.
2019-09-04 17:09:39 -04:00
Jason Tedor 3cbdd84b89
Add test that get triggers shard search active (#46317)
This commit is a follow-up to a change that fixed that multi-get was not
triggering a shard to become search active. In that change, we added a
test that multi-get properly triggers a shard to become search
active. This commit is a follow-up to that change which adds a test for
the get case. While get is already handled correctly in production code,
there was not a test for it. This commit adds one. Additionally, we
factor all the search idle tests from IndexShardIT into a separate test
class, as an effort to keep related tests together instead of a single
large test class containing a jumble of tests, and also to keep test
classes smaller for better parallelization.
2019-09-04 11:53:32 -04:00
markharwood 408b58dd9d
Adjacency_matrix aggregation optimisation. (#46257) (#46315)
Avoid pre-allocating ((N * N) - N) / 2 “BitsIntersector” objects given N filters.
Most adjacency matrices will be sparse and we typically don’t need to allocate all of these objects - can save a lot of allocations when the number of filters is high.

Closes #46212
2019-09-04 16:45:32 +01:00
Nhat Nguyen eb56d23421 Do not send recovery requests with CancellableThreads (#46287)
Previously, we send recovery requests using CancellableThreads because
we send requests and wait for responses in a blocking manner. With async
recovery, we no longer need to do so. Moreover, if we fail to submit a
request, then we can release the Store using an interruptible thread
which can risk invalidating the node lock.

This PR is the first step to avoid forking when releasing the Store.

Relates #45409
Relates #46178
2019-09-04 11:26:11 -04:00
Henning Andersen 5066835569 Fix SearchService.createContext exception handling (#46258)
An exception from the DefaultSearchContext constructor could leak a
searcher, causing future issues like shard lock obtained exceptions. The
underlying cause of the exception in the constructor has been fixed, but
as a safety precaution we also fix the exception handling in
createContext.

Closes #45378
2019-09-04 14:46:30 +02:00
Nhat Nguyen 3f67cbe974 Suppress warning from background sync on relocated primary (#46247)
If a primary as being relocated, then the global checkpoint and
retention lease background sync can emit unnecessary warning logs.
This side effect was introduced in #42241.

Relates #40800
Relates #42241
2019-09-03 18:44:15 -04:00
Nhat Nguyen 5924df1764 Mute testRecoveryFromFailureOnTrimming
Tracked at #46267
2019-09-03 18:44:08 -04:00
Lee Hinman 57f322f85e Move MockRespository into test framework (#46298)
This moves the `MockRespository` class into `test/framework/src/main` so
it can be used across all modules and plugins in tests.
2019-09-03 16:21:10 -06:00
Jason Tedor b8c51ff894
Multi-get requests should wait for search active (#46283)
When a shard has fallen search idle, and a non-realtime multi-get
request is executed, today such requests do not wait for the shard to
become search active and therefore such requests do not wait for a
refresh to see the latest changes to the index. This also prevents such
requests from triggering the shard as non-search idle, influencing the
behavior of scheduled refreshes. This commit addresses this by attaching
a listener to the shard search active state for multi-get requests. In
this way, when the next scheduled refresh is executed, the multi-get
request will then proceed.
2019-09-03 14:31:37 -04:00
Henning Andersen 2383acaa89 Fix testSyncFailsIfOperationIsInFlight (#46269)
testSyncFailsIfOperationIsInFlight could fail due to the index request
spawing a GCP sync (new since 7.4). Test now waits for it to finish
before testing that flushed sync fails.
2019-09-03 17:30:00 +02:00
dengweisysu 416419e4c9 Sync translog without lock when trim unreferenced readers (#46203)
With this change, we can avoid blocking writing threads when trimming
unreferenced readers; hence improving the translog writing performance
in async durability mode.

Close #46201
2019-09-02 21:55:06 -04:00
Anup e01ec802e7 Remove duplicate line in SearchAfterBuilder (#45994) 2019-09-03 01:30:01 +02:00
Armin Braun 2662c1b417
Wait for all Rec. to Stop on Node Close (#46178) (#46237)
* Wait for all Rec. to Stop on Node Close

* This issue is in the `RecoverySourceHandler#acquireStore`. If we submit the store release to the generic threadpool while it is getting shut down we never complete the futue we wait on (in the generic pool as well) and fail to ever release the store potentially.
* Fixed by waiting for all recoveries to end on node close so that we aways have a healthy thread pool here
* Closes #45956
2019-09-02 18:04:37 +02:00
Martijn van Groningen 555b630160
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-09-02 09:16:55 +02:00
Martijn van Groningen 5747badaa8
Allow ingest processors access to node client. (#46077)
This is the first PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.
2019-09-02 08:24:26 +02:00
Nhat Nguyen db949847e5 Fix translog stats in testPrepareIndexForPeerRecovery (#46137)
When recovering a shard locally, we use a translog snapshot from
newSnapshotFromGen which consists of all readers from a certain
generation. In the test, we use newSnapshotFromMinSeqNo for the
expectation. The snapshot of this method includes only readers
containing operations in the requesting range.

Closes #46022
2019-08-30 08:53:27 -04:00
Andrey Ershov 152ce62c58 Enhanced logging when transport is misconfigured to talk to HTTP port (#45964)
If a node is misconfigured to talk to remote node HTTP port (instead of
transport port) eventually it will receive an HTTP response from the
remote node on transport port (this happens when a node sends
accidentally line terminating byte in a transport request).
If this happens today it results in a non-friendly log message and a
long stack trace.
This commit adds a check if a malformed response is HTTP response. In
this case, a concise log message would appear.

(cherry picked from commit 911d02b7a9c3ce7fe316360c127a935ca4b11f37)
2019-08-30 13:02:08 +02:00
Paul Sanwald 8bdbc7d9bf
Bump version from 7.4 to 7.5 (#46142) 2019-08-29 15:03:26 -04:00
Julie Tibshirani b5d8b364bb
Ensure top docs optimization is fully disabled for queries with unbounded max scores. (#46105) (#46139)
When a query contains a mandatory clause that doesn't track the max score per
block, we disable the max score optimization. Previously, we were doing this by
wrapping the collector with a FilterCollector that always returned
ScoreMode.COMPLETE.

However we weren't adjusting totalHitsThreshold, so the collector could still
call Scorer#setMinCompetitiveScore. It is against the method contract to call
setMinCompetitiveScore when the score mode is COMPLETE, and some scorers like
ReqOptSumScorer throw an error in this case.

This commit tries to disable the optimization by always setting
totalHitsThreshold to max int, as opposed to wrapping the collector.
2019-08-29 10:56:53 -07:00
Simon Willnauer 9b2ea07b17
Flush engine after big merge (#46066) (#46111)
Today we might carry on a big merge uncommitted and therefore
occupy a significant amount of diskspace for quite a long time
if for instance indexing load goes down and we are not quickly
reaching the translog size threshold. This change will cause a
flush if we hit a significant merge (512MB by default) which
frees diskspace sooner.
2019-08-29 17:54:15 +02:00
Nhat Nguyen bb49124690 Only verify global checkpoint if translog sync occurred (#45980)
We only sync translog if the given offset hasn't synced yet. We can't
verify the global checkpoint from the latest translog checkpoint unless
a sync has occurred.

Closes #46065
Relates #45634
2019-08-29 09:44:40 -04:00
David Turner d340530a47 Avoid overshooting watermarks during relocation (#46079)
Today the `DiskThresholdDecider` attempts to account for already-relocating
shards when deciding how to allocate or relocate a shard. Its goal is to stop
relocating shards onto a node before that node exceeds the low watermark, and
to stop relocating shards away from a node as soon as the node drops below the
high watermark.

The decider handles multiple data paths by only accounting for relocating
shards that affect the appropriate data path. However, this mechanism does not
correctly account for _new_ relocating shards, which are unwittingly ignored.
This means that we may evict far too many shards from a node above the high
watermark, and may relocate far too many shards onto a node causing it to blow
right past the low watermark and potentially other watermarks too.

There are in fact two distinct issues that this PR fixes. New incoming shards
have an unknown data path until the `ClusterInfoService` refreshes its
statistics. New outgoing shards have a known data path, but we fail to account
for the change of the corresponding `ShardRouting` from `STARTED` to
`RELOCATING`, meaning that we fail to find the correct data path and treat the
path as unknown here too.

This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths
for all shards. With the changes here, the data paths are handled in tests as
they are in production, except that their sizes are fake.

Fixes #45177
2019-08-29 12:40:55 +01:00
Jason Tedor 9bc4a24118
Handle delete document level failures (#46100)
Today we assume that document failures can not occur for deletes. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
2019-08-28 22:17:16 -04:00
Tal Levy a356bcff41
Add Circle Processor (#43851) (#46097)
add circle-processor that translates circles to polygons
2019-08-28 14:44:08 -07:00
Jason Tedor 1249e6ba5d
Handle no-op document level failures (#46083)
Today we assume that document failures can not occur for no-ops. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
2019-08-28 13:57:24 -04:00
Tanguy Leroux 9e14ffa8be Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068) 2019-08-28 16:29:46 +02:00
Mark Tozzi aec125faff
Support Range Fields in Histogram and Date Histogram (#46012)
Backport of 1a0dddf4ad24b3f2c751a1fe0e024fdbf8754f94 (AKA #445395)

     * Add support for a Range field ValuesSource, including decode logic for range doc values and exposing RangeType as a first class enum
     * Provide hooks in ValuesSourceConfig for aggregations to control ValuesSource class selection on missing & script values
     * Branch aggregator creation in Histogram and DateHistogram based on ValuesSource class, to enable specialization based on type.  This is similar to how Terms aggregator works.
     * Prioritize field type when available for selecting the ValuesSource class type to use for an aggregation
2019-08-28 09:06:09 -04:00
Martijn van Groningen 1157224a6b
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-28 10:14:07 +02:00
Henning Andersen 300e717e42 Disallow partial results when shard unavailable (#45739)
Searching with `allowPartialSearchResults=false` could still return
partial search results during recovery. If a shard copy fails
with a "shard not available" exception, the failure would be ignored and
a partial result returned. The one case where this is known to happen
is when a shard copy is recovering when searching, since
`IllegalIndexShardStateException` is considered a "shard not available"
exception.

Relates to #42612
2019-08-27 17:01:23 +02:00
Nhat Nguyen 146e23a8a9 Relax translog assertion in testRestoreLocalHistoryFromTranslog (#45943)
Since #45473, we trim translog below the local checkpoint of the safe
commit immediately if soft-deletes enabled. In
testRestoreLocalHistoryFromTranslog, we should have a safe commit after
recoverFromTranslog is called; then we will trim translog files which
contain only operations that are at most the global checkpoint.

With this change, we relax the assertion to ensure that we don't put
operations to translog while recovering history from the local translog.
2019-08-26 17:19:19 -04:00
Nhat Nguyen c66bae39c3 Update translog checkpoint after marking ops as persisted (#45634)
If two translog syncs happen concurrently, then one can return before
its operations are marked as persisted. In general, this should not be
an issue; however, peer recoveries currently rely on this assumption.

Closes #29161
2019-08-26 17:18:52 -04:00
Nhat Nguyen f2e8b17696 Do not create engine under IndexShard#mutex (#45263)
Today we create new engines under IndexShard#mutex. This is not ideal
because it can block the cluster state updates which also execute under
the same mutex. We can avoid this problem by creating new engines under
a separate mutex.

Closes #43699
2019-08-26 17:18:29 -04:00
Jason Tedor 3d64605075
Remove node settings from blob store repositories (#45991)
This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we move them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.
2019-08-26 16:26:13 -04:00
Zachary Tong 943a016bb2
Add Cumulative Cardinality agg (and Data Science plugin) (#45990)
This adds a pipeline aggregation that calculates the cumulative
cardinality of a field.  It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.

This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).

This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.
2019-08-26 16:19:55 -04:00
James Baiera 5535ff0a44
Fix IngestService to respect original document content type (#45799) (#45984)
Backport of #45799

This PR modifies the logic in IngestService to preserve the original content type 
on the IndexRequest, such that when a document with a content type like SMILE 
is submitted to a pipeline, the resulting document that is persisted will remain in 
the original content type (SMILE in this case).
2019-08-26 14:33:33 -04:00
Armin Braun af2bd75def
Fix Broken HTTP Request Breaking Channel Closing (#45958) (#45973)
This is essentially the same issue fixed in #43362 but for http request
version instead of the request method. We have to deal with the
case of not being able to parse the request version, otherwise
channel closing fails.

Fixes #43850
2019-08-26 16:20:58 +02:00
Armin Braun 5a17987e19
Fix SnapshotStatusApisIT (#45929) (#45971)
The snapshot status when blocking can still be INIT in rare cases when
the new cluster state that has the snapshot in `STARTED` hasn't yet
become visible.
Fixes #45917
2019-08-26 15:59:02 +02:00
Andrey Ershov d96469ddff Better logging for TLS message on non-secure transport channel (#45835)
This commit enhances logging for 2 cases:

1. If non-TLS enabled node receives transport message from TLS enabled
node on transport port.
2. If non-TLS enabled node receives HTTPs request on transport port.

(cherry picked from commit 4f52ebd32eb58526b4c8022f8863210bf88fc9be)
2019-08-26 15:07:13 +02:00
Jason Tedor 599bf2d68b
Deprecate the pidfile setting (#45938)
This commit deprecates the pidfile setting in favor of node.pidfile.
2019-08-23 21:31:35 -04:00
Mayya Sharipova 3bc1494d38 Correct warning testScalingThreadPoolConfiguration
Correct expected warning

Closes #45907
2019-08-23 10:30:36 -04:00
Henning Andersen 46d9a575db Fix RemoteClusterConnection close race (#45898)
Closing a `RemoteClusterConnection` concurrently with trying to connect
could result in double invoking the listener.

This fixes
RemoteClusterConnectionTest#testCloseWhileConcurrentlyConnecting

Closes #45845
2019-08-23 14:26:02 +02:00
Tanguy Leroux 8e66df9925 Move testRetentionLeasesClearedOnRestore (#45896) 2019-08-23 13:43:40 +02:00
Martijn van Groningen 837cfa2640
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-23 11:22:27 +02:00
Alexander Reelsen ecafe4f4ad Update joda to 2.10.3 (#45495) 2019-08-23 10:39:39 +02:00
Armin Braun ba6d72ea9f
Fix TransportSnapshotsStatusAction ThreadPool Use (#45824) (#45883)
In case of an in-progress snapshot this endpoint was broken because
it tried to execute repository operations in the callback on a
transport thread which is not allowed (only generic or snapshot
pool are allowed here).
2019-08-23 06:17:50 +02:00
Jason Tedor de6b6fd338
Add node.processors setting in favor of processors (#45885)
This commit namespaces the existing processors setting under the "node"
namespace. In doing so, we deprecate the existing processors setting in
favor of node.processors.
2019-08-22 22:18:37 -04:00
Nhat Nguyen 3393f9599e
Ignore translog retention policy if soft-deletes enabled (#45473)
Since #45136, we use soft-deletes instead of translog in peer recovery.
There's no need to retain extra translog to increase a chance of
operation-based recoveries. This commit ignores the translog retention
policy if soft-deletes is enabled so we can discard translog more
quickly.

Backport of #45473
Relates #45136
2019-08-22 16:40:06 -04:00
dengweisysu 72c6302d12 Fsync translog without writeLock before rolling (#45765)
Today, when rolling a new translog generation, we block all write
threads until a new generation is created. This choice is perfectly 
fine except in a highly concurrent environment with the translog 
async setting. We can reduce the blocking time by pre-sync the 
current generation without writeLock before rolling. The new step 
would fsync most of the data of the current generation without 
blocking write threads.

Close #45371
2019-08-22 16:18:42 -04:00
William Brafford f82c0f56a6
Mute flaky RemoteClusterConnection test (#45850) 2019-08-22 15:00:43 -04:00
Jake Landis c60399c77f
introduce 7.3.2 version to 7.x (#45864) 2019-08-22 12:24:19 -05:00
Andrey Ershov ed8307c198 Deprecate es.http.cname_in_publish_address setting (#45616)
Follow up on #32806.

The system property es.http.cname_in_publish_address is deprecated
starting from 7.0.0 and deprecation warning should be added if the
property is specified.
This PR will go to 7.x and master.
Follow-up PR to remove es.http.cname_in_publish_address property
completely will go to the master.

(cherry picked from commit a5ceca7715818f47ec87dd5f17f8812c584b592b)
2019-08-22 12:09:35 +02:00
Armin Braun 88acae48ce
Remove index-N Rebuild in Shard Snapshot Updates (#45740) (#45778)
* There is no point in listing out every shard over and over when the `index-N` blob in the shard contains a list of all the files
   * Rebuilding the `index-N` from the `snap-${uuid}.dat` blobs does not provide any material benefit. It only would in the corner case of a corrupted `index-N` but otherwise uncorrupted blobs since we neither check the correctness of the content of all segment blobs nor do we do a similar recovery at the root of the repository.
   * Also, at least in version `6.x` we only mark a shard snapshot as successful after writing out the updated `index-N` blob so all snapshots that would work with `7.x` and newer must have correct `index-N` blobs

=> Removed the rebuilding of the `index-N` content from `snap-${uuid}.dat` files and moved to only listing `index-N` when taking a snapshot instead of listing all files
=> Removed check of file existence against physical blob listing
=> Kept full listing on the delete side to retain full cleanup of blobs that aren't referenced by the `index-N`
2019-08-22 11:32:45 +02:00
Luca Cavanna b95ca9c3bb Fix compile errors in HttpChannelTaskHandler
Relates to #43332
2019-08-22 11:13:26 +02:00
Luca Cavanna a47ade3e64 Cancel search task on connection close (#43332)
This PR introduces a mechanism to cancel a search task when its corresponding connection gets closed. That would relief users from having to manually deal with tasks and cancel them if needed. Especially the process of finding the task_id requires calling get tasks which needs to call every node in the cluster.

The implementation is based on associating each http channel with its currently running search task, and cancelling the task when the previously registered close listener gets called.
2019-08-22 10:43:20 +02:00
Nhat Nguyen 3029887451 Never release store using CancellableThreads (#45409)
Today we can release a Store using CancellableThreads. If we are holding
the last reference, then we will verify the node lock before deleting
the store. Checking node lock performs some I/O on FileChannel. If the
current thread is interrupted, then the channel will be closed and the
node lock will also be invalid.

Closes #45237
2019-08-21 21:24:31 -04:00
Tal Levy 9b14b7298b
[7.x] Add is_write_index column to cat.aliases (#45798)
* Add is_write_index column to cat.aliases (#44772)

Aliases have had the option to set `is_write_index` since 6.4,
but the cat.aliases action was never updated.

* correct version bounds to 7.4
2019-08-21 14:15:49 -07:00
William Brafford 2b549e7342
CLI tools: write errors to stderr instead of stdout (#45586)
Most of our CLI tools use the Terminal class, which previously did not provide methods for writing to standard output. When all output goes to standard out, there are two basic problems. First, errors and warnings are "swallowed" in pipelines, making it hard for a user to know when something's gone wrong. Second, errors and warnings are intermingled with legitimate output, making it difficult to pass the results of interactive scripts to other tools.

This commit adds a second set of print commands to Terminal for printing to standard error, with errorPrint corresponding to print and errorPrintln corresponding to println. This leaves it to developers to decide which output should go where. It also adjusts existing commands to send errors and warnings to stderr.

Usage is printed to standard output when it's correctly requested (e.g., bin/elasticsearch-keystore --help) but goes to standard error when a command is invoked incorrectly (e.g. bin/elasticsearch-keystore list-with-a-typo | sort).
2019-08-21 14:46:07 -04:00
Armin Braun 790765d3f9
Remove Dep. on SnapshotsService in SnapshotShardsService (#45776) (#45791)
SnapshotShardsService depends on the RepositoriesService
not the SnapshotsService, no need to have this indirection.
2019-08-21 19:26:19 +02:00
Armin Braun 6aaee8aa0a
Repository Cleanup Endpoint (#43900) (#45780)
* Repository Cleanup Endpoint (#43900)

* Snapshot cleanup functionality via transport/REST endpoint.
* Added all the infrastructure for this with the HLRC and node client
* Made use of it in tests and resolved relevant TODO
* Added new `Custom` CS element that tracks the cleanup logic.
Kept it similar to the delete and in progress classes and gave it
some (for now) redundant way of handling multiple cleanups but only allow one
* Use the exact same mechanism used by deletes to have the combination
of CS entry and increment in repository state ID provide some
concurrency safety (the initial approach of just an entry in the CS
was not enough, we must increment the repository state ID to be safe
against concurrent modifications, otherwise we run the risk of "cleaning up"
blobs that just got created without noticing)
* Isolated the logic to the transport action class as much as I could.
It's not ideal, but we don't need to keep any state and do the same
for other repository operations
(like getting the detailed snapshot shard status)
2019-08-21 17:59:49 +02:00
Jim Ferenczi fe2a7523ec Add support for inlined user dictionary in the Kuromoji plugin (#45489)
This change adds a new option called user_dictionary_rules to
Kuromoji's tokenizer. It can be used to set additional tokenization rules
to the Japanese tokenizer directly in the settings (instead of using a file).
This commit also adds a check that no rules are duplicated since this is not allowed
in the UserDictionary.

Closes #25343
2019-08-21 16:28:30 +02:00
Martijn van Groningen 2677ac14d2
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-21 14:28:17 +02:00
Christos Soulios 2a0c7c40e5
[7.x] Implement AvgAggregatorTests#testDontCacheScripts and remove AvgIT #45746
Backports PR #45737:

    Similar to PR #45030 integration test testDontCacheScripts() was moved to unit test AvgAggregatorTests#testDontCacheScripts.

    AvgIT class was removed.
2019-08-20 20:19:51 +03:00
Christos Soulios 96a40acd82
[7.x] Migrate tests from MaxIT to MaxAggregatorTests (#45030) #45742
Backports PR #45030 to 7.x:

    This PR migrates tests from MaxIT integration test to MaxAggregatorTests, as described in #42893
2019-08-20 18:58:47 +03:00
Nhat Nguyen e9759b2b33 Wait for background refresh in testAutomaticRefresh (#45661)
If the background refresh is running, but not finished yet then the
document might not be visible to the next search. Thus, if
scheduledRefresh returns false, we need to wait until the background
refresh is done.

Closes #45571
2019-08-20 10:40:12 -04:00
Rory Hunter 47b3dccbc4
Always check that cgroup data is present (#45647)
`OsProbe` fetches cgroup data from the filesystem, and has asserts that
check for missing values. This PR changes most of these asserts into
runtime checks, since at least one user has reported an NPE where
a piece of cgroup data was missing.

Backport of #45606 to 7.x.
2019-08-19 10:29:41 +01:00
Nhat Nguyen 6f5d944fbd Ensure AsyncTask#isScheduled remain false after close (#45687)
If a scheduled task of an AbstractAsyncTask starts after it was closed,
then isScheduledOrRunning can remain true forever although no task is
running or scheduled.

Closes #45576
2019-08-17 13:48:50 -04:00
Vega 6f2daa85e3 Allow uppercase in keystore setting names (#45222)
The elasticsearch keystore was originally backed by a PKCS#12 keystore, which had several limitations. To overcome some of these limitations in encoding, the setting names existing within the keystore were limited to lowercase alphanumberic (with underscore). Now that the keystore is backed by an encrypted blob, this restriction is no longer relevant. This commit relaxes that restriction by allowing uppercase ascii characters as well.

closes #43835
2019-08-16 17:50:08 -07:00
Igor Motov 98c850c08b
Geo: Change order of parameter in Geometries to lon, lat 7.x (#45618)
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes are moved to the
org.elasticsearch.geomtery package.

Backport of #45332

Closes #45048
2019-08-16 14:42:02 -04:00
Ryan Ernst 742213d710 Improve error message when index settings are not a map (#45588)
This commit adds an explicit error message when a create index request
contains a settings key that is not a json object. Prior to this change
the user would be given a ClassCastException with no explanation of what
went wrong.

closes #45126
2019-08-16 11:39:26 -07:00
Zachary Tong 50c65d05ba Move bucket reduction from Bucket to the InternalAgg (#45566)
The current idiom is to have the InternalAggregator find all the
buckets sharing the same key, put them in a list, get the first bucket
and ask that bucket to reduce all the buckets (including itself).

This a somewhat confusing workflow, and feels like the aggregator should
be reducing the buckets (since the aggregator owns the buckets), rather
than asking one bucket to do all the reductions.

This commit basically moves the `Bucket.reduce()` method to the
InternalAgg and renames it `reduceBucket()`.  It also moves the
`createBucket()` (or equivalent) method from the bucket to the
InternalAgg as well.
2019-08-16 13:59:00 -04:00
Andrey Ershov dbc90653dc transport.publish_address should contain CNAME (#45626)
This commit adds CNAME reporting for transport.publish_address same way
it's done for http.publish_address.

Relates #32806
Relates #39970

(cherry picked from commit e0a2558a4c3a6b6fbfc6cd17ed34a6f6ef7b15a9)
2019-08-16 17:42:00 +02:00
Armin Braun d6a9edea16
Lower Limit for Maximum Message Size in TcpTransport (#44496) (#45635)
* Since we're buffering network reads to the heap and then deserializing them it makes no sense to buffer a message that is 90% of the heap size since we couldn't deserialize it anyway
* I think `30%` is a more reasonable guess here given that we can reasonably assume that the deserialized message will be larger than the serialized message itself and processing it will take additional heap as well
2019-08-16 12:27:54 +02:00
Martijn van Groningen 5ea0985711
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-16 09:47:11 +02:00
Armin Braun a48242c371
Cleanup Redundant TransportLogger Instantiation (#43265) (#45629)
* This class' methods are all effectively `static` => make them `static` and stop instantiating it needlessly
2019-08-15 21:16:56 +02:00
Zachary Tong cd441f6906 Catch AllocatedTask registration failures (#45300)
When a persistent task attempts to register an allocated task locally,
this creates the Task object and starts tracking it locally.  If there
is a failure while initializing the task, this is handled by a catch
and subsequent error handling (canceling, unregistering, etc).

But if the task fails to be created because an exception is thrown
in the tasks ctor, this is uncaught and fails the cluster update
thread.  The ramification is that a persistent task remains in the
cluster state, but is unable to create the allocated task, and the
exception prevents other tasks "after" the poisoned task from starting
too.

Because the allocated task is never created, the cancellation tools
are not able to remove the persistent task and it is stuck as a
zombie in the CS.

This commit adds exception handling around the task creation,
and attempts to notify the master if there is a failure (so the
persistent task can be removed).  Even if this notification fails,
the exception handling means the rest of the uninitialized tasks
can proceed as normal.
2019-08-15 15:14:19 -04:00
Armin Braun de58353722
Lower Painless Static Memory Footprint (#45487) (#45619)
* Painless generates a ton of duplicate strings and empty `Hashmap` instances wrapped as unmodifiable
* This change brings down the static footprint of Painless on an idle node by 20MB (after running the PMC benchmark against said node)
   * Since we were looking into ways of optimizing for smaller node sizes I think this is a worthwhile optimization
2019-08-15 19:41:45 +02:00
Alpar Torok 03a1645bc6 Use dynamic port ranges for ExternalTestCluster (#45601)
Moves methods added in #44213 and uses them to configure the port range
for `ExternalTestCluster` too.
These were still using `9300-9400` ( teh default ) and running into
races.
2019-08-15 16:40:12 +03:00
Armin Braun 1beea3588b
Make BlobStoreRepository Validation Read master.dat (#45546) (#45578)
* Fixing this for two reasons:
   1. Why not verify that the seed we wrote is actually there when we can
   2. The AWS S3 SDK started to log a bunch of WARN messages about not fully reading the stream now that we started to abuse the read blob as an `exists` check after removing that method from the blob container
2019-08-15 07:07:52 +02:00
Nick Knize 647a8308c3
[SPATIAL] Backport new ShapeFieldMapper and ShapeQueryBuilder to 7x (#45363)
* Introduce Spatial Plugin (#44389)

Introduce a skeleton Spatial plugin that holds new licensed features coming to 
Geo/Spatial land!

* [GEO] Refactor DeprecatedParameters in AbstractGeometryFieldMapper (#44923)

Refactor DeprecatedParameters specific to legacy geo_shape out of
AbstractGeometryFieldMapper.TypeParser#parse.

* [SPATIAL] New ShapeFieldMapper for indexing cartesian geometries (#44980)

Add a new ShapeFieldMapper to the xpack spatial module for
indexing arbitrary cartesian geometries using a new field type called shape.
The indexing approach leverages lucene's new XYShape field type which is
backed by BKD in the same manner as LatLonShape but without the WGS84
latitude longitude restrictions. The new field mapper builds on and
extends the refactoring effort in AbstractGeometryFieldMapper and accepts
shapes in either GeoJSON or WKT format (both of which support non geospatial
geometries).

Tests are provided in the ShapeFieldMapperTest class in the same manner
as GeoShapeFieldMapperTests and LegacyGeoShapeFieldMapperTests.
Documentation for how to use the new field type and what parameters are
accepted is included. The QueryBuilder for searching indexed shapes is
provided in a separate commit.

* [SPATIAL] New ShapeQueryBuilder for querying indexed cartesian geometry (#45108)

Add a new ShapeQueryBuilder to the xpack spatial module for
querying arbitrary Cartesian geometries indexed using the new shape field
type.

The query builder extends AbstractGeometryQueryBuilder and leverages the
ShapeQueryProcessor added in the previous field mapper commit.

Tests are provided in ShapeQueryTests in the same manner as
GeoShapeQueryTests and docs are updated to explain how the query works.
2019-08-14 16:35:10 -05:00
Armin Braun e0d84e7178
Clean up Callback Chains and Duplicate in SnapshotResiliencyTests (#45398) (#45563)
* It's in the title, follow up to #45233
* Flatten more listeners into `StepListener`
* Remove duplication from repo and index bootstrap and asserting that the steps execute successfully
2019-08-14 21:53:07 +02:00
Armin Braun 5f6bc6fc2d
Prevent Leaking Search Tasks on Exceptions in FetchSearchPhase and DfsQueryPhase (#45500) (#45540)
* If `counter.onResult` throws an exception we might leak a transport task because the failure is not handled as a phase failure (instead it bubbles up in the transport service eventually hitting the `onFailure` callback again and couting down the `counter` twice).

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
2019-08-14 14:49:38 +02:00
Armin Braun 00e4fba2fb
Simplify and Optimize RestController Slightly (#45419) (#45485)
* Simplify the path iterator to generate less garbage
* `dispatchRequest` always terminates, adjust code accordingly
2019-08-13 10:43:30 +02:00
Martijn van Groningen 1951cdf1cb
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-13 09:12:31 +02:00
Julie Tibshirani dc1856ca53 Make sure to validate the type before attempting to merge a new mapping. (#45157)
Currently, when adding a new mapping, we attempt to parse + merge it before
checking whether its top-level document type matches the existing type. So when
a user attempts to introduce a new mapping type, we may give a confusing error
message around merging instead of complaining that it's not possible to add
more than one type ("Rejecting mapping update to [my-index] as the final
mapping would have more than 1 type...").

This PR moves the type validation to the start of
`MetaDataMappingService#applyRequest` so that we make sure the type matches
before performing any mapper merging.

We already partially addressed this issue in #29316, but the tests there
focused on `MapperService` and did not catch this problem with end-to-end
mapping updates.

Addresses #43012.
2019-08-12 14:28:03 -07:00
Zachary Tong 4d97d2c50f Revert "Only execute one final reduction in InternalAutoDateHistogram (#45359)"
This reverts commit c0ea8a867e.
2019-08-12 17:17:17 -04:00
Julie Tibshirani 8c4394d5d7 Fix a bug where mappings are dropped from rollover requests. (#45411)
We accidentally introduced this bug when adding a typeless version of the
rollover request. The bug is not present if include_type_name is set to true.
2019-08-12 12:46:27 -07:00
Michael Basnight a521e4c86f Retrieve processors instead of checking existence (#45354)
The previous hasProcessors method would validate if a processor was
present within a pipeline, but would not return the contents of the
processors. This does not allow a consumer to inspect the processor for
specific metadata. The method now returns the list of processors based
on the class of the processor passed in.
2019-08-12 13:48:17 -05:00
Zachary Tong 472f6ef41a Mute InternalAutoDateHistogramTests#testReduceRandom() 2019-08-12 14:45:08 -04:00
Zachary Tong c0ea8a867e Only execute one final reduction in InternalAutoDateHistogram (#45359)
Because auto-date-histo can perform multiple reductions while
merging buckets, we need to ensure that the intermediate reductions
are done with a `finalReduce` set to false to prevent Pipeline aggs
from generating their output.

Once all the buckets have been merged and the output is stable,
a mostly-noop reduction can be performed which will allow pipelines
to generate their output.
2019-08-12 14:07:38 -04:00
Albert Zaharovits 2cb172f079
CreateIndex and PutIndexTemplate with typeless mapping (#45120)
This commit makes sure that mapping parameters to `CreateIndex` and
`PutIndexTemplate` are keyed by the type name. 

`IndexCreationTask` expects mappings to be keyed by the type name.
It asserts this for template mappings but not for the mappings in the request.
The `CreateIndexRequest` and `RestCreateIndexAction` mostly make it sure
that the mapping is keyed by a type name, but not always.
When building the create-index request outside of the REST handler, there are
a few methods to set the mapping for the request. Some of them add the type
name some of them do not.
For example, `CreateIndexRequest#mapping(String type, Map<String, ?> source)`
adds the type name, but
`CreateIndexRequest#mapping(String type, XContentBuilder source)` does not.
This PR asserts the type name in the request mapping inside `IndexCreationTask`
and makes all `CreateIndexRequest#mapping` methods add the type name.
2019-08-12 08:05:07 +03:00
Armin Braun a9e1402189
Remove Settings from BaseRestRequest Constructor (#45418) (#45429)
* Resolving the todo, cleaning up the unused `settings` parameter
* Cleaning up some other minor dead code in affected classes
2019-08-12 05:14:45 +02:00
Nhat Nguyen cf9a73b5ac Call afterWriteOperation after trim translog in peer recovery (#45182)
testShouldFlushAfterPeerRecovery was added #28350 to make sure the
flushing loop triggered by afterWriteOperation eventually terminates.
This test relies on the fact that we call afterWriteOperation after
making changes in translog. In #44756, we roll a new generation in
RecoveryTarget#finalizeRecovery but do not call afterWriteOperation.

Relates #28350
Relates #45073
2019-08-10 22:59:02 -04:00
Nhat Nguyen 25c6102101 Trim local translog in peer recovery (#44756)
Today, if an operation-based peer recovery occurs, we won't trim
translog but leave it as is. Some unacknowledged operations existing in
translog of that replica might suddenly reappear when it gets promoted.
With this change, we ensure trimming translog above the starting
sequence number of phase 2. This change can allow us to read translog
forward.
2019-08-10 22:59:02 -04:00
Armin Braun 1cd464d675
Isolate Request in Call-Chain for REST Request Handling (#45130) (#45417)
* Follow up to #44949
* Stop using a special code path for multi-line JSON and instead handle its detection like that of other XContent types when creating the request
* Only leave a single path that holds a reference to the full REST request
   * In the next step we can move the copying of request content to happen before the actual request handling and make it conditional on the handler in question to stop copying bulk requests as suggested in #44564
2019-08-10 10:21:01 +02:00
Armin Braun d1ed9bdbfd
Use StepListener to Simplify SnapshotResiliencyTests (#45233) (#45386)
* Reduces complicated callback relations in `testSuccessfulSnapshotAndRestore` to flat steps of sequential actions
* Will refactor the other tests in this suit as a follow up
   * This format certainly makes it easier to create more complicated tests that involve multiple subsequent snapshots as it would allow adding loops
2019-08-09 18:19:48 +02:00
Yannick Welsch 9e6d874a41
Show BWC version in ClusterFormationFailureHelper (#45352)
When having a cluster state from 6.x, display the metadata version as the cluster state version.
Avoids confusion where a cluster state from 6.x is displayed as version 0 even if has some actual
content.
2019-08-09 16:23:38 +02:00
Yannick Welsch 5ddeb488a6 Allow _update on write alias (#45318)
Using the document update API on aliases with a write index does not work.

Follow-up to #31520
2019-08-09 11:44:24 +02:00
Martijn van Groningen f1ee29f22e
Added a custom api to perform the msearch more efficiently for enrich processor (#43965)
Currently the msearch api is used to execute buffered search requests;
however the msearch api doesn't deal with search requests in an intelligent way.
It basically executes each search separately in a concurrent manner.

This api reuses the msearch request and response classes and executes
the searches as one request in the node holding the enrich index shard.
Things like engine.searcher and query shard context are only created once.
Also there are less layers than executing a regular msearch request. This
results in an interesting speedup.

Without this change, in a single node cluster, enriching documents
with a bulk size of 5000 items, the ingest time in each bulk response
varied from 174ms to 822ms. With this change the ingest time in each
bulk response varied from 54ms to 109ms.

I think we should add a change like this based on this improvement in ingest time.

However I do wonder if instead of doing this change, we should improve
the msearch api to execute more efficiently. That would be more complicated
then this change, because in this change the custom api can only search
enrich index shards and these are special because they always have a single
primary shard. If msearch api is to be improved then that should work for
any search request to any indices. Making the same optimization for
indices with more than 1 primary shard requires much more work.

The current change is isolated in the enrich plugin and LOC / complexity
is small. So this good enough for now.
2019-08-09 09:11:04 +02:00
Tal Levy 2a99eaa7c2 Revert "removes the CellIdSource abstraction from geo-grid aggs (#45307) (#45353)"
This reverts commit 7b0a8040de.
2019-08-08 17:40:03 -07:00
Armin Braun 12ed6dc999
Only retain reasonable history for peer recoveries (#45208) (#45355)
Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.

Relates #41536
2019-08-09 01:56:32 +02:00
Tal Levy 7b0a8040de
removes the CellIdSource abstraction from geo-grid aggs (#45307) (#45353)
CellIdSource is a helper ValuesSource that encodes GeoPoint
into a long-encoded representation of the grid bucket the point
is associated with. This complicates thing as usage evolves to
support shapes that are associated with more than one bucket ordinal.
2019-08-08 16:33:16 -07:00
Armin Braun b19de55095
Add missing wait to testAutomaticReleaseOfIndexBlock (#45342) (#45351)
Today the test waits for one of the shards to be blocked, but this does not
mean that the block has been applied on all nodes, so a subsequent indexing
operation may still go through.

Fixes #45338
2019-08-08 22:39:22 +02:00
Henning Andersen d139896b66
Reindex share retry between hit sources (#44203) (#45348)
The client and remote hit sources had each their own retry mechanism,
which would do the same. Supporting resiliency we would have to expand
on the retry mechanisms and as a preparation for that, the retry
mechanism is now shared such that each sub class is only responsible for
sending requests and converting responses/failures to common format.

Part of #42612
2019-08-08 22:01:29 +02:00
Christoph Büscher a552b33276 Fix occasional SuggestSearchIT failure (#45330)
Refreshes happening during indexing can result differen segment counts and
slightly skewed term statistics, which in turn has the potential to change
suggestion output slightly. In order to prevent this, disable refresh for the
affected tests.

Closes #43261
2019-08-08 21:06:32 +02:00
Martijn van Groningen bb429d3b5c
required changes after merge 2019-08-08 17:04:18 +02:00
Dimitris Athanasiou e53bb050db Mute testAutomaticReleaseOfIndexBlock
Relates #45338
2019-08-08 17:56:41 +03:00
Martijn van Groningen 708f856940
Merge remote-tracking branch 'es/7.x' into enrich-7.x 2019-08-08 16:52:45 +02:00
Andrey Ershov 07c656fba9 Mute testCustomDataPaths on Windows
See #45333

(cherry picked from commit 671e1ad1068aee4b593ad0c8ab13ff60b4f125b8)
2019-08-08 16:26:56 +02:00
Zachary Tong 86d6597890 Use newIndexSearcher() instead of newSearcher() (#45248)
`newSearcher()` from lucene can randomly choose index readers which
are not compatible with our tests, like ParallelCompositeReader.
The `newIndexSearcher()` method on AggregatorTestCase is a wrapper
similar to newSearcher but compatible with our tests
2019-08-08 09:34:38 -04:00
Martijn van Groningen e066133016
Change the ingest simulate api to not include dropped documents (#44161)
If documents are dropped by the `drop` processor then
these documents are returned as a `null` value in the response.

=== Example

Create pipeline:

```
PUT _ingest/pipeline/droppipeline
{
    "processors": [
        {
            "set": {
                "field": "bla",
                "value": "val"
            }
        },
        {
            "drop": {}
        }
    ]
}
```

Simulate request:

POST _ingest/pipeline/droppipeline/_simulate
{
    "docs": [
        {
            "_source": {
                "message": "text"
            }
        }
    ]
}

Response:

```
{
    "docs": [
        null
    ]
}
```

Response if verbose is enabled:

```
{
    "docs": [
        {
            "processor_results": [
                {
                    "doc": {
                        "_index": "_index",
                        "_type": "_doc",
                        "_id": "_id",
                        "_source": {
                            "message": "text",
                            "bla": "val"
                        },
                        "_ingest": {
                            "timestamp": "2019-07-10T11:07:10.758315Z"
                        }
                    }
                },
                null
            ]
        }
    ]
}
```

Closes #36150

* Abort pipeline simulation in verbose mode when document has been dropped
by drop processor
2019-08-08 13:04:33 +02:00
Martijn van Groningen fb959d188c
Backport: Add description to force-merge tasks (#41365) (#45191)
* Add description to force-merge tasks (#41365)

This is static information that is part of the force merge request.

Relates to #15975
2019-08-08 08:15:09 +02:00
Michael Basnight 89861d0884 Add ingest processor existence helper method (#45156)
This commit adds a helper method to the ingest service allowing it to
inspect a pipeline by id and verify the existence of a processor in the
pipeline. This work exposed a potential bug in that some processors
contain inner processors that are passed in at instantiation. These
processors needed a common way to expose their inner processors, so the
WrappingProcessor was created in order to expose the inner processor.
2019-08-07 11:19:04 -05:00
Bukhtawar cd304c4def Auto-release flood-stage write block (#42559)
If a node exceeds the flood-stage disk watermark then we add a block to all of
its indices to prevent further writes as a last-ditch attempt to prevent the
node completely exhausting its disk space. However today this block remains in
place until manually removed, and this block is a source of confusion for users
who current have ample disk space and did not even realise they nearly ran out
at some point in the past.

This commit changes our behaviour to automatically remove this block when a
node drops below the high watermark again. The expectation is that the high
watermark is some distance below the flood-stage watermark and therefore the
disk space problem is truly resolved.

Fixes #39334
2019-08-07 11:03:53 +01:00
Tanguy Leroux a869342910 Restore DefaultShardOperationFailedException's reason after deserialization (#45203)
The reason field of DefaultShardOperationFailedException is lost during serialization. 
This is sad because this field is checked for nullity during xcontent generation and it 
means that the cause won't be included in the generated xcontent and won't be 
printed in two REST API responses (Close Index API and Indices Shard Stores API).

This commit simply restores the reason from the cause during deserialization.
2019-08-07 10:37:15 +02:00
Jason Tedor bd59ee6c72
Fix clock used in update requests (#45262)
We accidentally switched to using the relative time provider here. This
commit fixes this by switching to the appropriate absolute clock.
2019-08-06 21:15:21 -04:00
David Turner f5d1381e01 Remove always-true param from IndicesService#stats (#45231)
Parameter `includePrevious` is always true, so this commit inlines it.
2019-08-06 17:22:11 +01:00
David Turner 355713b9ca
Improve slow logging in MasterService (#45241)
Adds a tighter threshold for logging a warning about slowness in the
`MasterService` instead of relying on the cluster service's 30-second warning
threshold. This new threshold applies to the computation of the cluster state
update in isolation, so we get a warning if computing a new cluster state
update takes longer than 10 seconds even if it is subsequently applied quickly.
It also applies independently to the length of time it takes to notify the
cluster state tasks on completion of publication, in case any of these
notifications holds up the master thread for too long.

Relates #45007
Backport of #45086
2019-08-06 17:01:49 +01:00
Tanguy Leroux 772ce1f599
Add deprecation warning for Force Merge API (#44903)
This commit adds a deprecation warning in 7.x for the Force Merge API 
when both only_expunge_deletes and max_num_segments are set in a request.

Relates #44761
2019-08-06 16:04:24 +02:00
Jason Tedor 5b1b146099
Normalize environment paths (#45179)
This commit applies a normalization process to environment paths, both
in how they are stored internally, also their settings values. This
normalization is done via two means:
 - we make the paths absolute
 - we remove redundant name elements from the path (what Java calls
   "normalization")

This change ensures that when we compare and refer to these paths within
the system, we are using a common ground. For example, prior to the
change if the data path was relative, we would not compare it correctly
to paths from disk usage. This is because the paths in disk usage were
being made absolute.
2019-08-06 06:04:30 -04:00
Yannick Welsch 7aeb2fe73c Add per-socket keepalive options (#44055)
Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see
https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings.
By default, these options are disabled for now (i.e. fall-back to OS behavior), but we would like
to explore whether we can enable them by default, in particular to force keepalive configurations
that are better tuned for running ES.
2019-08-06 10:45:44 +02:00