Commit Graph

2230 Commits

Author SHA1 Message Date
Igor Motov bdbc353dea Geo: improve handling of out of bounds points in linestrings (#47939)
Brings handling of out of bounds points in linestrings in line with
points. Now points with latitude above 90 and below -90 are handled
the same way as for points by adjusting the longitude by moving it by
180 degrees.

Relates to #43916
2019-10-23 14:17:44 -04:00
Armin Braun 7215201406
Track Shard-Snapshot Index Generation at Repository Root (#48371)
This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question).

This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing).

Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs.

This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob.
__No deletes are executed on the data nodes at all for any operation with this change.__

Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details).

This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository.
This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes.
Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented form doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur.
Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless).
With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused.

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
2019-10-23 10:58:26 +01:00
Tanguy Leroux 4790ee4c32 Reenable azure repository tests and remove some randomization in http servers (#48283)
Relates #47948
Relates #47380
2019-10-23 09:06:50 +02:00
Armin Braun 8a02a5fc7d
Simplify Shard Snapshot Upload Code (#48155) (#48345)
The code here was needlessly complicated when it
enqueued all file uploads up-front. Instead, we can
go with a cleaner worker + queue pattern here by taking
the max-parallelism from the threadpool info.

Also, I slightly simplified the rethrow and
listener (step listener is pointless when you add the callback in the next line)
handling it since I noticed that we were needlessly rethrowing in the same
code and that wasn't worth a separate PR.
2019-10-22 17:17:09 +01:00
Armin Braun dc08feadc6
Remove Redundant Version Param from Repository APIs (#48231) (#48298)
This parameter isn't used by any implementation
2019-10-21 16:20:45 +02:00
Ignacio Vera b1224fca8c
upgrade to Lucene-8.3.0-snapshot-25968e3b75e (#48227) 2019-10-21 08:21:09 +02:00
Alpar Torok cc26e30281
Increase timeout for yml tests (#48237)
Some of these are larger than what can complete in the regular timeout.

Closes #48212
2019-10-18 11:14:15 -07:00
jimczi b858e19bcc Revert #46598 that breaks the cachability of the sub search contexts. 2019-10-15 09:40:59 +02:00
Alpar Torok fbbe04b801 Add a verifyVersions to the test FW (#47192)
The test FW has a method to check that it's implementation of getting
index and wire compatible versions as well as reasoning about which
version is released or not produces the same rezults as the simillar
implementation in the build.

This PR adds the `verifyVersions` task to the test FW so we have one
task to check everything related to versions.
2019-10-10 11:23:56 +03:00
Armin Braun 302e09decf
Simplify some Common ActionRunnable Uses (#47799) (#47828)
Especially in the snapshot code there's a lot
of logic chaining `ActionRunnables` in tricky
ways now and the code is getting hard to follow.
This change introduces two convinience methods that
make it clear that a wrapped listener is invoked with
certainty in some trickier spots and shortens the code a bit.
2019-10-09 23:29:50 +02:00
Hendrik Muhs 5e0e54f455
[Transform] move root endpoint to _transform with BWC layer (#47127) (#47682)
move the main endpoint to /_transform/ from /_data_frame/transforms/ with providing backwards compatibility and deprecation warnings
2019-10-08 08:59:01 +02:00
Alpar Torok 2b16d7bcf8
Backport testclusters all (#47565)
* Bwc testclusters all (#46265)

Convert all bwc projects to testclusters

* Fix bwc versions config

* WIP fix rolling upgrade

* Fix bwc tests on old versions

* Fix rolling upgrade
2019-10-04 16:12:53 +03:00
Ryan Ernst f32692208e
Add explanations to script score queries (#46693) (#47548)
While function scores using scripts do allow explanations, they are only
creatable with an expert plugin. This commit improves the situation for
the newer script score query by adding the ability to set the
explanation from the script itself.

To set the explanation, a user would check for `explanation != null` to
indicate an explanation is needed, and then call
`explanation.set("some description")`.
2019-10-03 21:05:05 -07:00
Nhat Nguyen 5e4732f2bb Limit number of retaining translog files for peer recovery (#47414)
Today we control the extra translog (when soft-deletes is disabled) for
peer recoveries by size and age. If users manually (force) flush many
times within a short period, we can keep many small (or empty) translog
files as neither the size or age condition is reached. We can protect
the cluster from running out of the file descriptors in such a situation
by limiting the number of retaining translog files.
2019-10-03 20:45:29 -04:00
Yannick Welsch 99d2fe295d Use optype CREATE for single auto-id index requests (#47353)
Changes auto-id index requests to use optype CREATE, making it compliant with our docs.
This will also make these auto-id index requests compatible with the new "create-doc" index
privilege (which is based on the optype), the default optype is changed to create, just as it is
already documented.
2019-10-02 14:16:52 +02:00
Henning Andersen b5a2afccb2 MockSearchService concurrency fix (#47139)
Fixed MockSearchService concurrency, assertNoInFlightContext could
have false negative result (rarely).

Split out from #46060
Closes #47048
2019-10-02 12:33:18 +02:00
Tanguy Leroux f5c5411fe8
Differentiate base paths in repository integration tests (#47284) (#47300)
This commit change the repositories base paths used in Azure/S3/GCS
integration tests so that they don't conflict with each other when tests
 run in parallel on real storage services.

Closes #47202
2019-10-01 08:39:55 +02:00
Armin Braun 3d23cb44a3
Speed up Snapshot Finalization (#47283) (#47309)
As a result of #45689 snapshot finalization started to
take significantly longer than before. This may be a
little unfortunate since it increases the likelihood
of failing to finalize after having written out all
the segment blobs.
This change parallelizes all the metadata writes that
can safely run in parallel in the finalization step to
speed the finalization step up again. Also, this will
generally speed up the snapshot process overall in case
of large number of indices.

This is also a nice to have for #46250 since we add yet
another step (deleting of old index- blobs in the shards
to the finalization.
2019-09-30 23:28:59 +02:00
Yannick Welsch 9dc90e41fc Remove "force" version type (#47228)
It's been deprecated long ago and can be removed.

Relates to #20377

Closes #19769
2019-09-30 11:58:34 +02:00
Rory Hunter 53a4d2176f
Convert most awaitBusy calls to assertBusy (#45794) (#47112)
Backport of #45794 to 7.x. Convert most `awaitBusy` calls to
`assertBusy`, and use asserts where possible. Follows on from #28548 by
@liketic.

There were a small number of places where it didn't make sense to me to
call `assertBusy`, so I kept the existing calls but renamed the method to
`waitUntil`. This was partly to better reflect its usage, and partly so
that anyone trying to add a new call to awaitBusy wouldn't be able to find
it.

I also didn't change the usage in `TransportStopRollupAction` as the
comments state that the local awaitBusy method is a temporary
copy-and-paste.

Other changes:

  * Rework `waitForDocs` to scale its timeout. Instead of calling
    `assertBusy` in a loop, work out a reasonable overall timeout and await
    just once.
  * Some tests failed after switching to `assertBusy` and had to be fixed.
  * Correct the expect templates in AbstractUpgradeTestCase.  The ES
    Security team confirmed that they don't use templates any more, so
    remove this from the expected templates. Also rewrite how the setup
    code checks for templates, in order to give more information.
  * Remove an expected ML template from XPackRestTestConstants The ML team
    advised that the ML tests shouldn't be waiting for any
    `.ml-notifications*` templates, since such checks should happen in the
    production code instead.
  * Also rework the template checking code in `XPackRestTestHelper` to give
    more helpful failure messages.
  * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4
2019-09-29 12:21:46 +01:00
Tim Brooks e11c56760d
Fix bind failure logging for mock transport (#47150)
Currently the MockNioTransport uses a custom exception handler for
server channel exceptions. This means that bind failures are logged at
the warn level. This commit modifies the transport to use the common
TcpTransport exception handler which will log exceptions at the correct
level.
2019-09-27 13:53:48 -06:00
Nhat Nguyen 444b47ce88 Relax maxSeqNoOfUpdates assertion in FollowingEngine (#47188)
We disable MSU optimization if the local checkpoint is smaller than
max_seq_no_of_updates. Hence, we need to relax the MSU assertion in
FollowingEngine for that scenario. Suppose the leader has three
operations: index-0, delete-1, and index-2 for the same doc Id. MSU on
the leader is 1 as index-2 is an append. If the follower applies index-0
then index-2, then the assertion is violated.

Closes #47137
2019-09-27 14:00:20 -04:00
Tanguy Leroux 42ae76ab7c Injected response errors in Azure repository tests should have a body (#47176)
The Azure SDK client expects server errors to have a body, 
something that looks like:

<?xml version="1.0" encoding="utf-8"?>  
<Error>  
  <Code>string-value</Code>  
  <Message>string-value</Message>  
</Error> 

I've forgot to add such errors in Azure tests and that triggers 
some NPE in the client like the one reported in #47120.

Closes #47120
2019-09-27 09:43:29 +02:00
Jim Ferenczi 73a09b34b8
Replace SearchContextException with SearchException (#47046)
This commit removes the SearchContextException in favor of a simpler
SearchException that doesn't leak the SearchContext.

Relates #46523
2019-09-26 14:21:23 +02:00
Tanguy Leroux 95e2ca741e
Remove unused private methods and fields (#47154)
This commit removes a bunch of unused private fields and unused
private methods from the code base.

Backport of (#47115)
2019-09-26 12:49:21 +02:00
David Turner 45c7783018
Warn on slow metadata persistence (#47130)
Today if metadata persistence is excessively slow on a master-ineligible node
then the `ClusterApplierService` emits a warning indicating that the
`GatewayMetaState` applier was slow, but gives no further details. If it is
excessively slow on a master-eligible node then we do not see any warning at
all, although we might see other consequences such as a lagging node or a
master failure.

With this commit we emit a warning if metadata persistence takes longer than a
configurable threshold, which defaults to `10s`. We also emit statistics that
record how much index metadata was persisted and how much was skipped since
this can help distinguish cases where IO was slow from cases where there are
simply too many indices involved.

Backport of #47005.
2019-09-26 07:40:54 +01:00
Tim Brooks 4f47e1f169
Extract proxy connection logic to specialized class (#47138)
Currently the logic to check if a connection to a remote discovery node
exists and otherwise create a proxy connection is mixed with the
collect nodes, cluster connection lifecycle, and other
RemoteClusterConnection logic. This commit introduces a specialized
RemoteConnectionManager class which handles the open connections.
Additionally, it reworks the "round-robin" proxy logic to create the list
of potential connections at connection open/close time, opposed to each
time a connection is requested.
2019-09-25 15:58:18 -06:00
David Turner ac920e8e64 Assert no exceptions during state application (#47090)
Today we log and swallow exceptions during cluster state application, but such
an exception should not occur. This commit adds assertions of this fact, and
updates the Javadocs to explain it.

Relates #47038
2019-09-25 12:32:51 +01:00
Tim Brooks 6720c56bdd
Set netty system properties in BuildPlugin (#45881)
Currently in production instances of Elasticsearch we set a couple of
system properties by default. We currently do not apply all of these
system properties in tests. This commit applies these properties in the
tests.
2019-09-24 10:49:36 -06:00
David Turner 6943a3101f
Cut PersistedState interface from GatewayMetaState (#46655)
Today `GatewayMetaState` implements `PersistedState` but it's an error to use
it as a `PersistedState` before it's been started, or if the node is
master-ineligible. It also holds some fields that are meaningless on nodes that
do not persist their states. Finally, it takes responsibility for both loading
the original cluster state and some of the high-level logic for writing the
cluster state back to disk.

This commit addresses these concerns by introducing a more specific
`PersistedState` implementation for use on master-eligible nodes which is only
instantiated if and when it's appropriate. It also moves the fields and
high-level persistence logic into a new `IncrementalClusterStateWriter` with a
more appropriate lifecycle.

Follow-up to #46326 and #46532
Relates #47001
2019-09-24 12:31:13 +01:00
Julie Tibshirani 9124c94a6c
Add support for aliases in queries on _index. (#46944)
Previously, queries on the _index field were not able to specify index aliases.
This was a regression in functionality compared to the 'indices' query that was
deprecated and removed in 6.0.

Now queries on _index can specify an alias, which is resolved to the concrete
index names when we check whether an index matches. To match a remote shard
target, the pattern needs to be of the form 'cluster:index' to match the
fully-qualified index name. Index aliases can be specified in the following query
types: term, terms, prefix, and wildcard.
2019-09-23 13:21:37 -07:00
Jim Ferenczi 08f28e642b Replace SearchContext with QueryShardContext in query builder tests (#46978)
This commit replaces the SearchContext used in AbstractQueryTestCase with
a QueryShardContext in order to reduce the visibility of search contexts.

Relates #46523
2019-09-23 20:24:02 +02:00
Luca Cavanna d4d1182677 update _common.json format (#46872)
API spec now use an object for the documentation field. _common was not updated yet. This commit updates _common.json and its corresponding parser.

Closes #46744

Co-Authored-By: Tomas Della Vedova <delvedor@users.noreply.github.com>
2019-09-23 17:01:29 +02:00
Yannick Welsch 9638ca20b0 Allow dropping documents with auto-generated ID (#46773)
When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat
as well) + coordinating nodes that do not have the ingest processor functionality, this can lead
to a NullPointerException.

The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when
the request contains auto-generated IDs. The response serialization is lenient for our
REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for
communication between nodes) assumes for this field to be non-null, which means that it can't
be serialized between nodes. Bulk requests with ingest functionality are processed on the
coordinating node if the node has the ingest capability, and only otherwise sent to a different
node. This means that, in order to reproduce this, one needs two nodes, with the coordinating
node not having the ingest functionality.

Closes #46678
2019-09-19 16:46:33 +02:00
Tanguy Leroux 3ae51f25dd Move testSnapshotWithLargeSegmentFiles to ESMockAPIBasedRepositoryIntegTestCase (#46802)
This commit moves the common test testSnapshotWithLargeSegmentFiles 
to the ESMockAPIBasedRepositoryIntegTestCase base class.
2019-09-18 15:41:30 +02:00
Armin Braun f983b67fdc
Add Assertion About Leaking index-N to Repo Tests (#46774) (#46801)
This adds an assert to make sure we're not leaking
index-N blobs on the shard level to the repo consistency checks.
It is ok to have a single redundant index-N blob in a failure scenario
but additional index-N should always be cleaned up before adding more.
2019-09-18 13:15:56 +02:00
Tanguy Leroux 4db37801d0 Add resumable uploads support to GCS repository integration tests (#46562)
This commit adds support for resumable uploads to the internal HTTP 
server used in GoogleCloudStorageBlobStoreRepositoryTests. This 
way we can also test the behavior of the Google's client when the 
service returns server errors in response to resumable upload requests.

The BlobStore implementation for GCS has the choice between 2 
methods to upload a blob: resumable and multipart. In the current 
implementation, the client executes a resumable upload if the blob 
size is larger than LARGE_BLOB_THRESHOLD_BYTE_SIZE, 
otherwise it executes a multipart upload. This commit makes this 
logic overridable in tests, allowing to randomize the decision of 
using one method or the other.

The commit add support for single request resumable uploads 
and chunked resumable uploads (the blob is uploaded into multiple 
2Mb chunks; each chunk being a resumable upload). For this last 
case, this PR also adds a test testSnapshotWithLargeSegmentFiles 
which makes it more probable that a chunked resumable upload is 
executed.
2019-09-18 09:33:05 +02:00
Armin Braun 2c70d403fc
Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303) (#46747)
Reenable this test since it was fixed by #45689 in production
code (specifically, the fact that we write the `snap-` blobs
without overwrite checks now).
Only required adding the assumed blocking on index file writes
to test code to properly work again.

* Closes #25281
2019-09-17 18:09:48 +02:00
Armin Braun b00de8edf3
Ensure SAS Tokens in Test Use Minimal Permissions (#46112) (#46628)
There were some issues with the Azure implementation requiring
permissions to list all containers ue to a container exists
check. This was caught in CI this time, but going forward we
should ensure that CI is executed using a token that does not
allow listing containers.

Relates #43288
2019-09-17 15:40:11 +02:00
Armin Braun b0f09b279f
Make Snapshot Logic Write Metadata after Segments (#45689) (#46764)
* Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581
* Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons
   * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization
* Fixes #41581
2019-09-17 13:09:39 +02:00
Przemysław Witek e49be611ad
[7.x] Add audit messages for Data Frame Analytics (#46521) (#46738) 2019-09-16 21:21:38 +02:00
Nhat Nguyen 5465c8d095 Increase timeout for relocation tests (#46554)
There's nothing wrong in the logs from these failures. I think 30
seconds might not be enough to relocate shards with many documents as CI
is quite slow. This change increases the timeout to 60 seconds for these
relocation tests. It also dumps the hot threads in case of timed out.

Closes #46526
Closes #46439
2019-09-12 16:34:01 -04:00
Jim Ferenczi 4407f3af1b Delay the creation of SubSearchContext to the FetchSubPhase (#46598)
This change delays the creation of the SubSearchContext for nested and parent/child inner_hits
to the fetch sub phase in order to ensure that a SearchContext can built entirely from a
QueryShardContext. This commit also adds a validation step to the inner hits builder that ensures that we fail the request early if the inner hits path is invalid.

Relates #46523
2019-09-12 14:52:15 +02:00
Jim Ferenczi 23bf310c84 Replace the SearchContext with QueryShardContext when building aggregator factories (#46527)
This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them.
The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`.

Relates #46523
2019-09-11 16:43:30 +02:00
Armin Braun 41633cb9b5
More Efficient Ordering of Shard Upload Execution (#42791) (#46588)
* More Efficient Ordering of Shard Upload Execution (#42791)
* Change the upload order of of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657
* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)
2019-09-11 13:59:20 +02:00
Jim Ferenczi 425b1a77e8 Add more context to QueryShardContext (#46584)
This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext.
It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the
query shard context.

Relates #46523
2019-09-11 12:24:51 +02:00
David Turner 6c67b53932
Load metadata at start time not construction time (#46326)
Today we load the metadata from disk while constructing the node. However there
is no real need to do so, and this commit moves that code to run later while
the node is starting instead.
2019-09-10 11:15:10 +01:00
Tanguy Leroux 88bed09119 Mutualize code in cloud-based repository integration tests (#46483)
This commit factors out some common code between the cloud-based
repository integration tests that were recently improved.

Relates #46376
2019-09-09 16:02:14 +02:00
Armin Braun 1bb1c77885
Increase REST-Test Client Timeout to 60s (#46455) (#46461)
We are seeing requests take more than the default 30s
which leads to requests being retried and returning
unexpected failures like e.g. "index already exists"
because the initial requests that timed out, worked
out functionally anyway.
=> double the timeout to reduce the likelihood of
the failures described in #46091
=> As suggested in the issue, we should in a follow-up
turn off retrying all-together probably
2019-09-07 07:40:16 +02:00
Tanguy Leroux 28974b5723 Replace mocked client in GCSBlobStoreRepositoryTests by HTTP server (#46255)
This commit removes the usage of MockGoogleCloudStoragePlugin in
GoogleCloudStorageBlobStoreRepositoryTests and replaces it by a
HttpServer that emulates the Storage service. This allows the repository
tests to use the real Google's client under the hood in tests and will allow
us to test the behavior of the snapshot/restore feature for GCS repositories
by simulating random server-side internal errors.

The HTTP server used to emulate the Storage service is intentionally simple
and minimal to keep things understandable and maintainable. Testing full
client options on the server side (like authentication, chunked encoding
etc) remains the responsibility of the GoogleCloudStorageFixture.
2019-09-05 10:37:37 +02:00