OpenSearch

Commit Graph

Author	SHA1	Message	Date
Mayya Sharipova	7cf170830c	Optimize sort on numeric long and date fields. (#49732 ) This rewrites long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2 - 10 times. The optimization can be disabled with setting the system property `es.search.rewrite_sort` to `false`. Optimization is skipped when an index has 50% or more data with the same value. Optimization is done through: 1. Rewriting sort as `DistanceFeatureQuery` which can efficiently skip non-competitive blocks and segments of documents. 2. Sorting segments according to the primary numeric sort field(#44021) This allows to skip non-competitive segments. 3. Using collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we can not use a single collector for sorted segments. We use collectorManager, where for every segment a dedicated collector will be created. 4. Using Lucene's shared TopFieldCollector manager This collector manager is able to exchange minimum competitive score between collectors, which allows us to efficiently skip the whole segments that don't contain competitive scores. 5. When index is force merged to a single segment, #48533 interleaving old and new segments allows for this optimization as well, as blocks with non-competitive docs can be skipped. Backport for #48804 Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>	2019-11-29 15:37:40 -05:00
Armin Braun	813b49adb4	Make BlobStoreRepository Aware of ClusterState (#49639 ) (#49711 ) * Make BlobStoreRepository Aware of ClusterState (#49639) This is a preliminary to #49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple. This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the repository operates in #49060	2019-11-29 14:57:47 +01:00
Armin Braun	90e9d61f2b	Optimize GoogleCloudStorageHttpHandler (#49677 ) (#49707 ) Removing a lot of needless buffering and array creation to reduce the significant memory usage of tests using this. The incoming stream from the `exchange` is already buffered so there is no point in adding a ton of additional buffers everywhere.	2019-11-29 11:17:47 +01:00
Jim Ferenczi	496bb9e2ee	Add a listener to track the progress of a search request locally (#49471 ) (#49691 ) This commit adds a function in NodeClient that allows to track the progress of a search request locally. Progress is tracked through a SearchProgressListener that exposes query and fetch responses as well as partial and final reduces. This new method can be used by modules/plugins inside a node in order to track the progress of a local search request. Relates #49091	2019-11-28 18:23:09 +01:00
Jim Ferenczi	d6445fae4b	Add a cluster setting to disallow loading fielddata on _id field (#49166 ) This change adds a dynamic cluster setting named `indices.id_field_data.enabled`. When set to `false` any attempt to load the fielddata for the `_id` field will fail with an exception. The default value in this change is set to `false` in order to prevent fielddata usage on this field for future versions but it will be set to `true` when backporting to 7x. When the setting is set to true (manually or by default in 7x) the loading will also issue a deprecation warning since we want to disallow fielddata entirely when https://github.com/elastic/elasticsearch/issues/26472 is implemented. Closes #43599	2019-11-28 09:35:28 +01:00
Armin Braun	3862400270	Remove Redundant EsBlobStoreTestCase (#49603 ) (#49605 ) All the implementations of `EsBlobStoreTestCase` use the exact same bootstrap code that is also used by their implementation of `EsBlobStoreContainerTestCase`. This means all tests might as well live under `EsBlobStoreContainerTestCase` saving a lot of code duplication. Also, there was no HDFS implementation for `EsBlobStoreTestCase` which is now automatically resolved by moving the tests over since there is a HDFS implementation for the container tests.	2019-11-26 20:57:19 +01:00
Armin Braun	495b543e63	Improve Stability of GCS Mock API (#49592 ) (#49597 ) Same as #49518 pretty much but for GCS. Fixing a few more spots where input stream can get closed without being fully drained and adding assertions to make sure it's always drained. Moved the no-close stream wrapper to production code utilities since there's a number of spots in production code where it's also useful (will reuse it there in a follow-up).	2019-11-26 16:53:51 +01:00
Nhat Nguyen	d2e92a1791	EngineTestCase#getDocIds should use internal reader (#49564 ) We do not guarantee that EngineTestCase#getDocIds is called after the engine has been externally refreshed. Hence, we trip an assertion assertSearcherIsWarmedUp. CI: https://gradle-enterprise.elastic.co/s/pm2at5qmfm2iu Relates #48605	2019-11-25 21:07:30 -05:00
Armin Braun	a5fa86ed97	Improve Stability of Mock APIs (#49518 ) (#49524 ) This commit ensures that even for requests that are known to have an empty body we at least attempt to read one byte from the request body input stream. This is done to work around the behavior in `sun.net.httpserver.ServerImpl.Dispatcher#handleEvent` that will close a TCP/HTTP connection that does not have the `eof` flag (see `sun.net.httpserver.LeftOverInputStream#isEOF`) set on its input stream. As far as I can tell the only way to set this flag is to do a read when there's no more bytes buffered. This fixes the numerous connection closing issues because the `ServerImpl` stops closing connections that it thinks weren't fully drained. Also, I removed a now redundant drain loop in the Azure handler as well as removed the connection closing in the error handler's drain action (this shouldn't have an effect but makes things more predictable/easier to reason about IMO). I would suggest merging this and closing related issue after verifying that this fixes things on CI. The way to locally reproduce the issues we're seeing in tests is to make the retry timings more aggressive in e.g. the azure tests and move them to single digit values. This makes the retries happen quickly enough that they run into the async connection closing of allegedly non-eof connections by `ServerImpl` and produces the exact kinds of failures we're seeing currently. Relates #49401, #49429	2019-11-25 10:28:55 +01:00
Nhat Nguyen	8260cba629	Increase timeout while checking for no snapshotted commit (#49461 ) If some replica is performing a file-based recovery, then the check assertNoSnapshottedIndexCommit would fail. We should increase the timeout for this check so that we can wait until all recoveries are done or aborted. Closes #49403	2019-11-24 15:12:34 -05:00
Armin Braun	231d079bf8	Fix Azure Mock Issues (#49377 ) (#49381 ) Fixing a few small issues found in this code: 1. We weren't reading the request headers but the response headers when checking for blob existence in the mocked single upload path 2. Error code can never be `null`; removed the dead code that resulted 3. In the logging wrapper we weren't checking for `Throwable` so any failing assertions in the http mock would not show up since they run on a thread managed by the mock http server	2019-11-21 19:57:50 +01:00
Yannick Welsch	420825c3b5	Strengthen validateClusterFormed check (#49248 ) Strengthens the validateClusterFormed check that is used by the test infrastructure to make sure that nodes are properly connected and know about each other. Is used in situations where the cluster is scaled up and down, and where there previously was a network disruption that has been healed. Closes #49243	2019-11-21 17:38:12 +01:00
Armin Braun	df8d7b213b	Add Logging to Mock Repo API Server (#49409 ) While we log exceptions in the handler, we may still miss exceptions higher up the execution chain. This adds logging of exceptions to all operations on the IO loop including connection establishment. Relates #49401	2019-11-21 11:33:57 +01:00
Tanguy Leroux	f753fa2265	HttpHandlers should return correct list of objects (#49283 ) This commit fixes the server side logic of "List Objects" operations of Azure and S3 fixtures. Until today, the fixtures were returning a "flat" view of stored objects and were not correctly handling the delimiter parameter. This causes some object listings to be wrongly interpreted by the snapshot deletion logic in Elasticsearch which relies on the ability to list child containers of BlobContainer (#42653) to correctly delete stale indices. As a consequence, the blobs were not correctly deleted from the emulated storage service and stayed in heap until they got garbage collected, causing CI failures like #48978. This commit fixes the server side logic of the Azure and S3 fixtures when listing objects so that it now returns correct common blob prefixes as expected by the snapshot deletion process. It also adds an after-test check to ensure that tests leave the repository empty (besides the root index files). Closes #48978	2019-11-20 09:26:42 +01:00
Jay Modi	eed4cd25eb	ThreadPool and ThreadContext are not closeable (#43249 ) (#49273 ) This commit changes the ThreadContext to just use a regular ThreadLocal over the lucene CloseableThreadLocal. The CloseableThreadLocal solves issues with ThreadLocals that are no longer needed during runtime but in the case of the ThreadContext, we need it for the runtime of the node and it is typically not closed until the node closes, so we miss out on the benefits that this class provides. Additionally by removing the close logic, we simplify code in other places that deal with exceptions and tracking to see if it happens when the node is closing. Closes #42577	2019-11-19 13:15:16 -07:00
Armin Braun	0acba44a2e	Make Repository.getRepositoryData an Async API (#49299 ) (#49312 ) This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of #49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.	2019-11-19 16:49:12 +01:00
Tanguy Leroux	ca4f55f2e4	Add docker-compose fixtures for S3 integration tests (#49107 ) (#49229 ) Similarly to what has been done for Azure (#48636) and GCS (#48762), this commit removes the existing Ant fixture that emulates a S3 storage service in favor of multiple docker-compose based fixtures. The goals here are multiple: be able to reuse a s3-fixture outside of the repository-s3 plugin; allow parallel execution of integration tests; remove the existing AmazonS3Fixture, which has evolved into a weird beast, in favor of dedicated, more maintainable fixtures. The server side logic that emulates S3 mostly comes from the latest HttpHandler made for S3 blob store repository tests, with additional features extracted from the (now removed) AmazonS3Fixture: authentication checks, session token checks and improved response errors. Chunked upload request support for S3 object has been added too. The server side logic of all tests now resides in a single S3HttpHandler class. Whereas AmazonS3Fixture contained logic for basic tests, session token tests, EC2 tests or ECS tests, the S3 fixtures are now dedicated to each kind of test. Fixtures are inheriting from each other, making things easier to maintain.	2019-11-18 05:56:59 -05:00
markharwood	c3745b03ee	Search optimisation - add canMatch early aborts for queries on "_index" field (#49158 ) Make queries on the “_index” field fast-fail if the target shard is an index that doesn’t match the query expression. Part of the “canMatch” phase optimisations. Closes #48473	2019-11-15 16:50:32 +00:00
Jay Modi	b6ec066ca9	ESIntegTestCase always cleans up static fields (#49105 ) (#49108 ) ESIntegTestCase has logic to clean up static fields in a method annotated with `@AfterClass` so that these fields do not trigger the StaticFieldsInvariantRule. However, during the exceptional close of the test cluster, this cleanup can be missed. The StaticFieldsInvariantRule always runs and will attempt to inspect the size of the static fields that were not cleaned up. If the `currentCluster` field of ESIntegTestCase references an InternalTestCluster, this could hold a reference to an implementation of a `Path` that comes from the `sun.nio.fs` package, which the security manager will deny access to. This causes additional noise to be generated since the AccessControlException will cause the StaticFieldsInvariantRule to fail and also be reported along with the actual exception that occurred. This change clears the static fields of ESIntegTestCase in a finally block inside the `@AfterClass` method to prevent this unnecessary noise. Closes #41526	2019-11-15 09:39:57 -07:00
Rory Hunter	c46a0e8708	Apply 2-space indent to all gradle scripts (#49071 ) Backport of #48849. Update `.editorconfig` to make the Java settings the default for all files, and then apply a 2-space indent to all `*.gradle` files. Then reformat all the files.	2019-11-14 11:01:23 +00:00
Henning Andersen	66f0c8900f	Fix Transport Stopped Exception (#48930 ) (#49035 ) When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates #42612	2019-11-13 18:48:05 +01:00
Tanguy Leroux	20fc1dbe18	Move MinIO fixture in its own project (#49036 ) This commit moves the MinIO docker-compose fixture from the :plugins:repository-s3 to its own :test:minio-fixture Gradle project.	2019-11-13 10:03:59 -05:00
Yannick Welsch	2dfa0133d5	Always use primary term from primary to index docs on replica (#47583 ) Ensures that we always use the primary term established by the primary to index docs on the replica. Makes the logic around replication less brittle by always using the operation primary term on the replica that is coming from the primary.	2019-11-13 12:13:45 +01:00
Tanguy Leroux	1903505a3f	Log exceptions thrown by HttpHandlers in repository integration tests (#48991 ) This commit changes the ESMockAPIBasedRepositoryIntegTestCase so that HttpHandler are now wrapped in order to log any exceptions that could be thrown when executing the server side logic in repository integration tests.	2019-11-12 20:14:30 +01:00
Tim Brooks	0645ee88e2	Send cluster name and discovery node in handshake (#48916 ) This commit sends the cluster name and discovery node in the transport level handshake response. This will allow us to stop sending the transport service level handshake request in the 8.0-8.x release cycle. It is necessary to start sending this in 7.x so that 8.0 is guaranteed to be communicating with a version that sends the required information.	2019-11-11 18:42:02 -05:00
Yannick Welsch	87862868c6	Allow realtime get to read from translog (#48843 ) The realtime GET API currently has erratic performance in case where a document is accessed that has just been indexed but not refreshed yet, as the implementation will currently force an internal refresh in that case. Refreshing can be an expensive operation, and also will block the thread that executes the GET operation, blocking other GETs to be processed. In case of frequent access of recently indexed documents, this can lead to a refresh storm and terrible GET performance. While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted to read from the translog in case of realtime GET API or update API, this was removed in 5.0 (#20102) to avoid inconsistencies between values that were returned from the translog and those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and upsert to read from the translog again as it was easier to guarantee consistency for these, and also brought back more predictable performance characteristics of this API. Calls to the realtime GET API, however, would still always do a refresh if necessary to return consistent results. This means that users that were calling realtime GET APIs to coordinate updates on client side (realtime GET + CAS for conditional index of updated doc) would still see very erratic performance. This PR (together with #48707) resolves the inconsistencies between reading from translog and index. In particular it fixes the inconsistencies that happen when requesting stored fields, which were not available when reading from translog. In case where stored fields are requested, this PR will reparse the _source from the translog and derive the stored fields to be returned. 
With this, it changes the realtime GET API to allow reading from the translog again, avoid refresh storms and blocking the GET threadpool, and provide overall much better and predictable performance for this API.	2019-11-09 17:47:50 +01:00
Nhat Nguyen	ff6c121eb9	Closed shard should never open new engine (#47186 ) We should not open new engines if a shard is closed. We break this assumption in #45263 where we stop verifying the shard state before creating an engine but only before swapping the engine reference. We can fail to snapshot the store metadata or checkIndex a closed shard if there's some IndexWriter holding the index lock. Closes #47060	2019-11-08 23:40:34 -05:00
Yannick Welsch	af887be3e5	Hide orphaned tasks from follower stats (#48901 ) CCR follower stats can return information for persistent tasks that are in the process of being cleaned up. This is problematic for tests where CCR follower indices have been deleted, but their persistent follower task is only cleaned up asynchronously afterwards. If one of the following tests then accesses the follower stats, it might still get the stats for that follower task. In addition, some tests were not cleaning up their auto-follow patterns, leaving orphaned patterns behind. Other tests cleaned up their auto-follow patterns. As the same name was always used, it just depended on the test execution order whether this led to a failure or not. This commit fixes the offending tests, and will also automatically remove auto-follow-patterns at the end of tests, like we do for many other features. Closes #48700	2019-11-08 13:56:53 +01:00
Tanguy Leroux	8a14ea5567	Add docker-compose based test fixture for GCS (#48902 ) Similarly to what has been done for Azure in #48636, this commit adds a new :test:fixtures:gcs-fixture project which provides two docker-compose based fixtures that emulate a Google Cloud Storage service. Some code has been extracted from existing tests and placed into this new project so that it can be easily reused in other projects.	2019-11-07 13:27:22 -05:00
Armin Braun	d83e374062	Bound Linearizability Check in CoordinatorTests (#48751 ) (#48853 ) Same as #44444 but for the coordinator tests. Closes #48742	2019-11-04 21:36:17 +01:00
Armin Braun	a22f6fbe3c	Cleanup Redundant Futures in Recovery Code (#48805 ) (#48832 ) Follow up to #48110 cleaning up the redundant future uses that were left over from that change.	2019-11-02 17:28:12 +01:00
Tanguy Leroux	989467ca1e	Add docker-compose based test fixture for Azure (#48736 ) This commit adds a new :test:fixtures:azure-fixture project which provides a docker-compose based container that runs a AzureHttpFixture Java class that emulates an Azure Storage service. The logic to emulate the service is extracted from existing tests and placed in AzureHttpHandler into the new project so that it can be easily reused. The :plugins:repository-azure project is an example of such utilization. The AzureHttpFixture fixture is just a wrapper around AzureHttpHandler and is now executed within the docker container. The :plugins:repository-azure:qa:microsoft-azure project uses the new test fixture and the existing AzureStorageFixture has been removed.	2019-10-31 10:43:43 +01:00
Armin Braun	52e5ceb321	Restore from Individual Shard Snapshot Files in Parallel (#48110 ) (#48686 ) Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.	2019-10-30 14:36:30 +01:00
Tanguy Leroux	24f6985235	Reduce allocations when draining HTTP requests bodies in repository tests (#48541 ) In repository integration tests, we drain the HTTP request body before returning a response. Before this change this operation was done using Streams.readFully() which uses a 8kb buffer to read the input stream, it now uses a 1kb for the same operation. This should reduce the allocations made during the tests and speed them up a bit on CI. Co-authored-by: Armin Braun <me@obrown.io>	2019-10-29 09:15:06 +01:00
Rory Hunter	30389c6660	Improve SAML tests resiliency to auto-formatting (#48517 ) Backport of #48452. The SAML tests have large XML documents within which various parameters are replaced. At present, if these test are auto-formatted, the XML documents get strung out over many, many lines, and are basically illegible. Fix this by using named placeholders for variables, and indent the multiline XML documents. The tests in `SamlSpMetadataBuilderTests` deserve a special mention, because they include a number of certificates in Base64. I extracted these into variables, for additional legibility.	2019-10-27 16:06:23 +00:00
Tim Brooks	f5f1072824	Multiple remote connection strategy support (#48496 ) * Extract remote "sniffing" to connection strategy (#47253) Currently the connection strategy used by the remote cluster service is implemented as a multi-step sniffing process in the RemoteClusterConnection. We intend to introduce a new connection strategy that will operate in a different manner. This commit extracts the sniffing logic to a dedicated strategy class. Additionally, it implements dedicated tests for this class. Additionally, in previous commits we moved away from a world where the remote cluster connection was mutable. Instead, when setting updates are made, the connection is torn down and rebuilt. We still had methods and tests hanging around for the mutable behavior. This commit removes those. * Introduce simple remote connection strategy (#47480) This commit introduces a simple remote connection strategy which will open remote connections to a configurable list of user supplied addresses. These addresses can be remote Elasticsearch nodes or intermediate proxies. We will perform normal clustername and version validation, but otherwise rely on the remote cluster to route requests to the appropriate remote node. * Make remote setting updates support diff strategies (#47891) Currently the entire remote cluster settings infrastructure is designed around the sniff strategy. As we introduce an additional connection strategy this infrastructure needs to be modified to support it. This commit modifies the code so that the strategy implementations will tell the service if the connection needs to be torn down and rebuilt. As part of this commit, we will wait 10 seconds for new clusters to connect when they are added through the "update" settings infrastructure.	2019-10-25 09:29:41 -06:00
Tim Brooks	c0b545f325	Make BytesReference an interface (#48486 ) BytesReference is currently an abstract class which is extended by various implementations. This makes it very difficult to use the delegation pattern. The implication of this is that our releasable BytesReference is a PagedBytesReference type and cannot be used as a generic releasable bytes reference that delegates to any reference type. This commit makes BytesReference an interface and introduces an AbstractBytesReference for common functionality.	2019-10-24 15:39:30 -06:00
Igor Motov	bdbc353dea	Geo: improve handling of out of bounds points in linestrings (#47939 ) Brings handling of out of bounds points in linestrings in line with points. Now points with latitude above 90 and below -90 are handled the same way as for points by adjusting the longitude by moving it by 180 degrees. Relates to #43916	2019-10-23 14:17:44 -04:00
Armin Braun	7215201406	Track Shard-Snapshot Index Generation at Repository Root (#48371 ) This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question). This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing). Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs. This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob. __No deletes are executed on the data nodes at all for any operation with this change.__ Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details). This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. 
Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository. This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes. Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur. Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless). With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused. Co-authored-by: Yannick Welsch <yannick@welsch.lu> Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>	2019-10-23 10:58:26 +01:00
Tanguy Leroux	4790ee4c32	Reenable azure repository tests and remove some randomization in http servers (#48283 ) Relates #47948 Relates #47380	2019-10-23 09:06:50 +02:00
Armin Braun	8a02a5fc7d	Simplify Shard Snapshot Upload Code (#48155 ) (#48345 ) The code here was needlessly complicated when it enqueued all file uploads up-front. Instead, we can go with a cleaner worker + queue pattern here by taking the max-parallelism from the threadpool info. Also, I slightly simplified the rethrow and listener (step listener is pointless when you add the callback in the next line) handling it since I noticed that we were needlessly rethrowing in the same code and that wasn't worth a separate PR.	2019-10-22 17:17:09 +01:00
Armin Braun	dc08feadc6	Remove Redundant Version Param from Repository APIs (#48231 ) (#48298 ) This parameter isn't used by any implementation	2019-10-21 16:20:45 +02:00
Ignacio Vera	b1224fca8c	upgrade to Lucene-8.3.0-snapshot-25968e3b75e (#48227 )	2019-10-21 08:21:09 +02:00
Alpar Torok	cc26e30281	Increase timeout for yml tests (#48237 ) Some of these are larger than what can complete in the regular timeout. Closes #48212	2019-10-18 11:14:15 -07:00
jimczi	b858e19bcc	Revert #46598 that breaks the cachability of the sub search contexts.	2019-10-15 09:40:59 +02:00
Alpar Torok	fbbe04b801	Add a verifyVersions to the test FW (#47192 ) The test FW has a method to check that its implementation of getting index and wire compatible versions as well as reasoning about which version is released or not produces the same results as the similar implementation in the build. This PR adds the `verifyVersions` task to the test FW so we have one task to check everything related to versions.	2019-10-10 11:23:56 +03:00
Armin Braun	302e09decf	Simplify some Common ActionRunnable Uses (#47799 ) (#47828 ) Especially in the snapshot code there's a lot of logic chaining `ActionRunnables` in tricky ways now and the code is getting hard to follow. This change introduces two convenience methods that make it clear that a wrapped listener is invoked with certainty in some trickier spots and shortens the code a bit.	2019-10-09 23:29:50 +02:00
Hendrik Muhs	5e0e54f455	[Transform] move root endpoint to _transform with BWC layer (#47127 ) (#47682 ) move the main endpoint to /_transform/ from /_data_frame/transforms/ with providing backwards compatibility and deprecation warnings	2019-10-08 08:59:01 +02:00
Alpar Torok	2b16d7bcf8	Backport testclusters all (#47565 ) * Bwc testclusters all (#46265) Convert all bwc projects to testclusters * Fix bwc versions config * WIP fix rolling upgrade * Fix bwc tests on old versions * Fix rolling upgrade	2019-10-04 16:12:53 +03:00
Ryan Ernst	f32692208e	Add explanations to script score queries (#46693 ) (#47548 ) While function scores using scripts do allow explanations, they are only creatable with an expert plugin. This commit improves the situation for the newer script score query by adding the ability to set the explanation from the script itself. To set the explanation, a user would check for `explanation != null` to indicate an explanation is needed, and then call `explanation.set("some description")`.	2019-10-03 21:05:05 -07:00
Nhat Nguyen	5e4732f2bb	Limit number of retaining translog files for peer recovery (#47414 ) Today we control the extra translog (when soft-deletes is disabled) for peer recoveries by size and age. If users manually (force) flush many times within a short period, we can keep many small (or empty) translog files as neither the size nor the age condition is reached. We can protect the cluster from running out of file descriptors in such a situation by limiting the number of retained translog files.	2019-10-03 20:45:29 -04:00
Yannick Welsch	99d2fe295d	Use optype CREATE for single auto-id index requests (#47353 ) Changes auto-id index requests to use optype CREATE, making it compliant with our docs. This will also make these auto-id index requests compatible with the new "create-doc" index privilege (which is based on the optype), the default optype is changed to create, just as it is already documented.	2019-10-02 14:16:52 +02:00
Henning Andersen	b5a2afccb2	MockSearchService concurrency fix (#47139 ) Fixed MockSearchService concurrency, assertNoInFlightContext could have false negative result (rarely). Split out from #46060 Closes #47048	2019-10-02 12:33:18 +02:00
Tanguy Leroux	f5c5411fe8	Differentiate base paths in repository integration tests (#47284 ) (#47300 ) This commit changes the repositories base paths used in Azure/S3/GCS integration tests so that they don't conflict with each other when tests run in parallel on real storage services. Closes #47202	2019-10-01 08:39:55 +02:00
Armin Braun	3d23cb44a3	Speed up Snapshot Finalization (#47283 ) (#47309 ) As a result of #45689 snapshot finalization started to take significantly longer than before. This may be a little unfortunate since it increases the likelihood of failing to finalize after having written out all the segment blobs. This change parallelizes all the metadata writes that can safely run in parallel in the finalization step to speed the finalization step up again. Also, this will generally speed up the snapshot process overall in case of large number of indices. This is also a nice to have for #46250 since we add yet another step (deleting of old index- blobs in the shards) to the finalization.	2019-09-30 23:28:59 +02:00
Yannick Welsch	9dc90e41fc	Remove "force" version type (#47228 ) It's been deprecated long ago and can be removed. Relates to #20377 Closes #19769	2019-09-30 11:58:34 +02:00
Rory Hunter	53a4d2176f	Convert most awaitBusy calls to assertBusy (#45794 ) (#47112 ) Backport of #45794 to 7.x. Convert most `awaitBusy` calls to `assertBusy`, and use asserts where possible. Follows on from #28548 by @liketic. There were a small number of places where it didn't make sense to me to call `assertBusy`, so I kept the existing calls but renamed the method to `waitUntil`. This was partly to better reflect its usage, and partly so that anyone trying to add a new call to awaitBusy wouldn't be able to find it. I also didn't change the usage in `TransportStopRollupAction` as the comments state that the local awaitBusy method is a temporary copy-and-paste. Other changes: * Rework `waitForDocs` to scale its timeout. Instead of calling `assertBusy` in a loop, work out a reasonable overall timeout and await just once. * Some tests failed after switching to `assertBusy` and had to be fixed. * Correct the expect templates in AbstractUpgradeTestCase. The ES Security team confirmed that they don't use templates any more, so remove this from the expected templates. Also rewrite how the setup code checks for templates, in order to give more information. * Remove an expected ML template from XPackRestTestConstants The ML team advised that the ML tests shouldn't be waiting for any `.ml-notifications` templates, since such checks should happen in the production code instead. Also rework the template checking code in `XPackRestTestHelper` to give more helpful failure messages. * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4	2019-09-29 12:21:46 +01:00
Tim Brooks	e11c56760d	Fix bind failure logging for mock transport (#47150 ) Currently the MockNioTransport uses a custom exception handler for server channel exceptions. This means that bind failures are logged at the warn level. This commit modifies the transport to use the common TcpTransport exception handler which will log exceptions at the correct level.	2019-09-27 13:53:48 -06:00
Nhat Nguyen	444b47ce88	Relax maxSeqNoOfUpdates assertion in FollowingEngine (#47188 ) We disable MSU optimization if the local checkpoint is smaller than max_seq_no_of_updates. Hence, we need to relax the MSU assertion in FollowingEngine for that scenario. Suppose the leader has three operations: index-0, delete-1, and index-2 for the same doc Id. MSU on the leader is 1 as index-2 is an append. If the follower applies index-0 then index-2, then the assertion is violated. Closes #47137	2019-09-27 14:00:20 -04:00
Tanguy Leroux	42ae76ab7c	Injected response errors in Azure repository tests should have a body (#47176 ) The Azure SDK client expects server errors to have a body, something that looks like: <?xml version="1.0" encoding="utf-8"?> <Error> <Code>string-value</Code> <Message>string-value</Message> </Error> I forgot to add such errors in Azure tests and that triggers some NPE in the client like the one reported in #47120. Closes #47120	2019-09-27 09:43:29 +02:00
Jim Ferenczi	73a09b34b8	Replace SearchContextException with SearchException (#47046 ) This commit removes the SearchContextException in favor of a simpler SearchException that doesn't leak the SearchContext. Relates #46523	2019-09-26 14:21:23 +02:00
Tanguy Leroux	95e2ca741e	Remove unused private methods and fields (#47154 ) This commit removes a bunch of unused private fields and unused private methods from the code base. Backport of (#47115)	2019-09-26 12:49:21 +02:00
David Turner	45c7783018	Warn on slow metadata persistence (#47130 ) Today if metadata persistence is excessively slow on a master-ineligible node then the `ClusterApplierService` emits a warning indicating that the `GatewayMetaState` applier was slow, but gives no further details. If it is excessively slow on a master-eligible node then we do not see any warning at all, although we might see other consequences such as a lagging node or a master failure. With this commit we emit a warning if metadata persistence takes longer than a configurable threshold, which defaults to `10s`. We also emit statistics that record how much index metadata was persisted and how much was skipped since this can help distinguish cases where IO was slow from cases where there are simply too many indices involved. Backport of #47005.	2019-09-26 07:40:54 +01:00
Tim Brooks	4f47e1f169	Extract proxy connection logic to specialized class (#47138 ) Currently the logic to check if a connection to a remote discovery node exists and otherwise create a proxy connection is mixed with the collect nodes, cluster connection lifecycle, and other RemoteClusterConnection logic. This commit introduces a specialized RemoteConnectionManager class which handles the open connections. Additionally, it reworks the "round-robin" proxy logic to create the list of potential connections at connection open/close time, as opposed to each time a connection is requested.	2019-09-25 15:58:18 -06:00
David Turner	ac920e8e64	Assert no exceptions during state application (#47090 ) Today we log and swallow exceptions during cluster state application, but such an exception should not occur. This commit adds assertions of this fact, and updates the Javadocs to explain it. Relates #47038	2019-09-25 12:32:51 +01:00
Tim Brooks	6720c56bdd	Set netty system properties in BuildPlugin (#45881 ) Currently in production instances of Elasticsearch we set a couple of system properties by default. We currently do not apply all of these system properties in tests. This commit applies these properties in the tests.	2019-09-24 10:49:36 -06:00
David Turner	6943a3101f	Cut PersistedState interface from GatewayMetaState (#46655 ) Today `GatewayMetaState` implements `PersistedState` but it's an error to use it as a `PersistedState` before it's been started, or if the node is master-ineligible. It also holds some fields that are meaningless on nodes that do not persist their states. Finally, it takes responsibility for both loading the original cluster state and some of the high-level logic for writing the cluster state back to disk. This commit addresses these concerns by introducing a more specific `PersistedState` implementation for use on master-eligible nodes which is only instantiated if and when it's appropriate. It also moves the fields and high-level persistence logic into a new `IncrementalClusterStateWriter` with a more appropriate lifecycle. Follow-up to #46326 and #46532 Relates #47001	2019-09-24 12:31:13 +01:00
Julie Tibshirani	9124c94a6c	Add support for aliases in queries on _index. (#46944 ) Previously, queries on the _index field were not able to specify index aliases. This was a regression in functionality compared to the 'indices' query that was deprecated and removed in 6.0. Now queries on _index can specify an alias, which is resolved to the concrete index names when we check whether an index matches. To match a remote shard target, the pattern needs to be of the form 'cluster:index' to match the fully-qualified index name. Index aliases can be specified in the following query types: term, terms, prefix, and wildcard.	2019-09-23 13:21:37 -07:00
Jim Ferenczi	08f28e642b	Replace SearchContext with QueryShardContext in query builder tests (#46978 ) This commit replaces the SearchContext used in AbstractQueryTestCase with a QueryShardContext in order to reduce the visibility of search contexts. Relates #46523	2019-09-23 20:24:02 +02:00
Luca Cavanna	d4d1182677	update _common.json format (#46872 ) API spec now use an object for the documentation field. _common was not updated yet. This commit updates _common.json and its corresponding parser. Closes #46744 Co-Authored-By: Tomas Della Vedova <delvedor@users.noreply.github.com>	2019-09-23 17:01:29 +02:00
Yannick Welsch	9638ca20b0	Allow dropping documents with auto-generated ID (#46773 ) When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat as well) + coordinating nodes that do not have the ingest processor functionality, this can lead to a NullPointerException. The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when the request contains auto-generated IDs. The response serialization is lenient for our REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for communication between nodes) assumes for this field to be non-null, which means that it can't be serialized between nodes. Bulk requests with ingest functionality are processed on the coordinating node if the node has the ingest capability, and only otherwise sent to a different node. This means that, in order to reproduce this, one needs two nodes, with the coordinating node not having the ingest functionality. Closes #46678	2019-09-19 16:46:33 +02:00
Tanguy Leroux	3ae51f25dd	Move testSnapshotWithLargeSegmentFiles to ESMockAPIBasedRepositoryIntegTestCase (#46802 ) This commit moves the common test testSnapshotWithLargeSegmentFiles to the ESMockAPIBasedRepositoryIntegTestCase base class.	2019-09-18 15:41:30 +02:00
Armin Braun	f983b67fdc	Add Assertion About Leaking index-N to Repo Tests (#46774 ) (#46801 ) This adds an assert to make sure we're not leaking index-N blobs on the shard level to the repo consistency checks. It is ok to have a single redundant index-N blob in a failure scenario but additional index-N should always be cleaned up before adding more.	2019-09-18 13:15:56 +02:00
Tanguy Leroux	4db37801d0	Add resumable uploads support to GCS repository integration tests (#46562 ) This commit adds support for resumable uploads to the internal HTTP server used in GoogleCloudStorageBlobStoreRepositoryTests. This way we can also test the behavior of Google's client when the service returns server errors in response to resumable upload requests. The BlobStore implementation for GCS has the choice between 2 methods to upload a blob: resumable and multipart. In the current implementation, the client executes a resumable upload if the blob size is larger than LARGE_BLOB_THRESHOLD_BYTE_SIZE, otherwise it executes a multipart upload. This commit makes this logic overridable in tests, which allows randomizing the decision of using one method or the other. The commit adds support for single request resumable uploads and chunked resumable uploads (the blob is uploaded in multiple 2Mb chunks, each chunk being a resumable upload). For this last case, this PR also adds a test testSnapshotWithLargeSegmentFiles which makes it more probable that a chunked resumable upload is executed.	2019-09-18 09:33:05 +02:00
Armin Braun	2c70d403fc	Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303 ) (#46747 ) Reenable this test since it was fixed by #45689 in production code (specifically, the fact that we write the `snap-` blobs without overwrite checks now). Only required adding the assumed blocking on index file writes to test code to properly work again. * Closes #25281	2019-09-17 18:09:48 +02:00
Armin Braun	b00de8edf3	Ensure SAS Tokens in Test Use Minimal Permissions (#46112 ) (#46628 ) There were some issues with the Azure implementation requiring permissions to list all containers due to a container exists check. This was caught in CI this time, but going forward we should ensure that CI is executed using a token that does not allow listing containers. Relates #43288	2019-09-17 15:40:11 +02:00
Armin Braun	b0f09b279f	Make Snapshot Logic Write Metadata after Segments (#45689 ) (#46764 ) * Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581 * Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization * Fixes #41581	2019-09-17 13:09:39 +02:00
Przemysław Witek	e49be611ad	[7.x] Add audit messages for Data Frame Analytics (#46521 ) (#46738 )	2019-09-16 21:21:38 +02:00
Nhat Nguyen	5465c8d095	Increase timeout for relocation tests (#46554 ) There's nothing wrong in the logs from these failures. I think 30 seconds might not be enough to relocate shards with many documents as CI is quite slow. This change increases the timeout to 60 seconds for these relocation tests. It also dumps the hot threads in case of a timeout. Closes #46526 Closes #46439	2019-09-12 16:34:01 -04:00
Jim Ferenczi	4407f3af1b	Delay the creation of SubSearchContext to the FetchSubPhase (#46598 ) This change delays the creation of the SubSearchContext for nested and parent/child inner_hits to the fetch sub phase in order to ensure that a SearchContext can be built entirely from a QueryShardContext. This commit also adds a validation step to the inner hits builder that ensures that we fail the request early if the inner hits path is invalid. Relates #46523	2019-09-12 14:52:15 +02:00
Jim Ferenczi	23bf310c84	Replace the SearchContext with QueryShardContext when building aggregator factories (#46527 ) This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them. The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`. Relates #46523	2019-09-11 16:43:30 +02:00
Armin Braun	41633cb9b5	More Efficient Ordering of Shard Upload Execution (#42791 ) (#46588 ) * More Efficient Ordering of Shard Upload Execution (#42791) * Change the upload order of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard * Inspired by #39657 * Cleanup BlobStoreRepository Abort and Failure Handling (#46208)	2019-09-11 13:59:20 +02:00
Jim Ferenczi	425b1a77e8	Add more context to QueryShardContext (#46584 ) This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext. It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the query shard context. Relates #46523	2019-09-11 12:24:51 +02:00
David Turner	6c67b53932	Load metadata at start time not construction time (#46326 ) Today we load the metadata from disk while constructing the node. However there is no real need to do so, and this commit moves that code to run later while the node is starting instead.	2019-09-10 11:15:10 +01:00
Tanguy Leroux	88bed09119	Mutualize code in cloud-based repository integration tests (#46483 ) This commit factors out some common code between the cloud-based repository integration tests that were recently improved. Relates #46376	2019-09-09 16:02:14 +02:00
Armin Braun	1bb1c77885	Increase REST-Test Client Timeout to 60s (#46455 ) (#46461 ) We are seeing requests take more than the default 30s which leads to requests being retried and returning unexpected failures like e.g. "index already exists" because the initial requests that timed out worked out functionally anyway. => double the timeout to reduce the likelihood of the failures described in #46091 => As suggested in the issue, we should probably turn off retrying altogether in a follow-up	2019-09-07 07:40:16 +02:00
Tanguy Leroux	28974b5723	Replace mocked client in GCSBlobStoreRepositoryTests by HTTP server (#46255 ) This commit removes the usage of MockGoogleCloudStoragePlugin in GoogleCloudStorageBlobStoreRepositoryTests and replaces it by a HttpServer that emulates the Storage service. This allows the repository tests to use the real Google's client under the hood in tests and will allow us to test the behavior of the snapshot/restore feature for GCS repositories by simulating random server-side internal errors. The HTTP server used to emulate the Storage service is intentionally simple and minimal to keep things understandable and maintainable. Testing full client options on the server side (like authentication, chunked encoding etc) remains the responsibility of the GoogleCloudStorageFixture.	2019-09-05 10:37:37 +02:00
Alpar Torok	d709a5c193	Quote the task name in reproduction line printer (#46266 ) Some tasks have `#` in their name, for instance, which doesn't play well with some shells (e.g. zsh)	2019-09-04 12:22:58 +03:00
Lee Hinman	57f322f85e	Move MockRepository into test framework (#46298 ) This moves the `MockRepository` class into `test/framework/src/main` so it can be used across all modules and plugins in tests.	2019-09-03 16:21:10 -06:00
David Turner	d340530a47	Avoid overshooting watermarks during relocation (#46079 ) Today the `DiskThresholdDecider` attempts to account for already-relocating shards when deciding how to allocate or relocate a shard. Its goal is to stop relocating shards onto a node before that node exceeds the low watermark, and to stop relocating shards away from a node as soon as the node drops below the high watermark. The decider handles multiple data paths by only accounting for relocating shards that affect the appropriate data path. However, this mechanism does not correctly account for _new_ relocating shards, which are unwittingly ignored. This means that we may evict far too many shards from a node above the high watermark, and may relocate far too many shards onto a node causing it to blow right past the low watermark and potentially other watermarks too. There are in fact two distinct issues that this PR fixes. New incoming shards have an unknown data path until the `ClusterInfoService` refreshes its statistics. New outgoing shards have a known data path, but we fail to account for the change of the corresponding `ShardRouting` from `STARTED` to `RELOCATING`, meaning that we fail to find the correct data path and treat the path as unknown here too. This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths for all shards. With the changes here, the data paths are handled in tests as they are in production, except that their sizes are fake. Fixes #45177	2019-08-29 12:40:55 +01:00
Rory Hunter	3666bcfbd8	Handle multiple loopback addresses (#46061 ) AbstractSimpleTransportTestCase.testTransportProfilesWithPortAndHost expects a host to only have a single IPv4 loopback address, which isn't necessarily the case. Allow for >= 1 address. Backport of #45901.	2019-08-29 09:45:51 +01:00
Tanguy Leroux	9e14ffa8be	Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068 )	2019-08-28 16:29:46 +02:00
Luca Cavanna	267183998e	[TEST] wait for http channels to be closed in ESIntegTestCase (#45977 ) We recently added a check to `ESIntegTestCase` in order to verify that no http channels are being tracked when we close clusters and the REST client. Close listeners though are invoked asynchronously, hence this check may fail if we assert before the close listener that removes the channel from the map is invoked. With this commit we add an `assertBusy` so we try and wait for the map to be empty. Closes #45914 Closes #45955	2019-08-27 14:00:24 +02:00
Nhat Nguyen	f2e8b17696	Do not create engine under IndexShard#mutex (#45263 ) Today we create new engines under IndexShard#mutex. This is not ideal because it can block the cluster state updates which also execute under the same mutex. We can avoid this problem by creating new engines under a separate mutex. Closes #43699	2019-08-26 17:18:29 -04:00
Tanguy Leroux	8e66df9925	Move testRetentionLeasesClearedOnRestore (#45896 )	2019-08-23 13:43:40 +02:00
Jason Tedor	de6b6fd338	Add node.processors setting in favor of processors (#45885 ) This commit namespaces the existing processors setting under the "node" namespace. In doing so, we deprecate the existing processors setting in favor of node.processors.	2019-08-22 22:18:37 -04:00
Armin Braun	bfddaaa2ae	Acknowledge Indices Were Wiped Successfully in REST Tests (#45832 ) (#45842 ) In internal test clusters tests we check that wiping all indices was acknowledged but in REST tests we didn't. This aligns the behavior in both kinds of tests. Relates #45605 which might be caused by unacked deletes that were just slow.	2019-08-22 17:19:51 +02:00
Luca Cavanna	a47ade3e64	Cancel search task on connection close (#43332 ) This PR introduces a mechanism to cancel a search task when its corresponding connection gets closed. That would relieve users from having to manually deal with tasks and cancel them if needed. Especially the process of finding the task_id requires calling get tasks which needs to call every node in the cluster. The implementation is based on associating each http channel with its currently running search task, and cancelling the task when the previously registered close listener gets called.	2019-08-22 10:43:20 +02:00
Armin Braun	824f1090a9	Disable testTimeoutPerConnection on Windows (#45785 ) (#45818 ) * It appears this test that is specific to how the BSD network stack works does randomly fail on Windows => disabling it since it's not clear that it should work on Windows in a stable way * Fixes #45777	2019-08-22 06:06:09 +02:00
William Brafford	2b549e7342	CLI tools: write errors to stderr instead of stdout (#45586 ) Most of our CLI tools use the Terminal class, which previously did not provide methods for writing to standard error. When all output goes to standard out, there are two basic problems. First, errors and warnings are "swallowed" in pipelines, making it hard for a user to know when something's gone wrong. Second, errors and warnings are intermingled with legitimate output, making it difficult to pass the results of interactive scripts to other tools. This commit adds a second set of print commands to Terminal for printing to standard error, with errorPrint corresponding to print and errorPrintln corresponding to println. This leaves it to developers to decide which output should go where. It also adjusts existing commands to send errors and warnings to stderr. Usage is printed to standard output when it's correctly requested (e.g., bin/elasticsearch-keystore --help) but goes to standard error when a command is invoked incorrectly (e.g. bin/elasticsearch-keystore list-with-a-typo | sort).	2019-08-21 14:46:07 -04:00
Armin Braun	6aaee8aa0a	Repository Cleanup Endpoint (#43900 ) (#45780 ) * Repository Cleanup Endpoint (#43900) * Snapshot cleanup functionality via transport/REST endpoint. * Added all the infrastructure for this with the HLRC and node client * Made use of it in tests and resolved relevant TODO * Added new `Custom` CS element that tracks the cleanup logic. Kept it similar to the delete and in progress classes and gave it some (for now) redundant way of handling multiple cleanups but only allow one * Use the exact same mechanism used by deletes to have the combination of CS entry and increment in repository state ID provide some concurrency safety (the initial approach of just an entry in the CS was not enough, we must increment the repository state ID to be safe against concurrent modifications, otherwise we run the risk of "cleaning up" blobs that just got created without noticing) * Isolated the logic to the transport action class as much as I could. It's not ideal, but we don't need to keep any state and do the same for other repository operations (like getting the detailed snapshot shard status)	2019-08-21 17:59:49 +02:00
Gordon Brown	ecb3ebd796	Clean SLM and ongoing snapshots in test framework (#45564 ) Adjusts the cluster cleanup routine in ESRestTestCase to clean up SLM test cases, and optionally wait for all snapshots to be deleted. Waiting for all snapshots to be deleted, rather than failing if any are in progress, is necessary for tests which use SLM policies because SLM policies may be in the process of executing when the test ends.	2019-08-16 14:17:34 -06:00
Igor Motov	98c850c08b	Geo: Change order of parameter in Geometries to lon, lat 7.x (#45618 ) Changes the order of parameters in Geometries from lat, lon to lon, lat and moves all Geometry classes to the org.elasticsearch.geometry package. Backport of #45332 Closes #45048	2019-08-16 14:42:02 -04:00
Luca Cavanna	c31cddf27e	Update the schema for the REST API specification (#42346 ) * Update the REST API specification This patch updates the REST API specification in JSON files to better encode deprecated entities, to improve specification of URL paths, and to open up the schema for future extensions. Notably, it changes the `paths` from a list of strings to a list of objects, where each particular object encodes all the information for this particular path: the `parts` and the `methods`. Among the benefits of this approach is e.g. encoding the difference between using the `PUT` and `POST` methods in the Index API, to either use a specific document ID, or let Elasticsearch generate one. Also `documentation` becomes an object that supports a `url` and also a `description` which is a new field. * Adapt YAML runner to new REST API specification format The logic for choosing the path to use when running tests has been simplified, as a consequence of the path parts being listed under each path in the spec. The special case for create and index has been removed. Also the parsing code has been hardened so that errors are thrown earlier when the structure of the spec differs from what is expected, and their error messages should be more helpful.	2019-08-16 14:40:00 +02:00
Alpar Torok	4a67645e5d	Use dynamic ports for ESSingleNodeTestCase too Extends #45601 to cover all tests.	2019-08-16 09:17:19 +03:00
Armin Braun	73e266b2fd	Fix Failures when Closing Indices in EsBlobStoreRepositoryIntegTestCase (#45532 ) (#45614 ) * Same issue as in #44754 as far as I can see: in case of async translog persistence we randomly fail to close * Closes #45335 * Closes #45334	2019-08-15 19:45:17 +02:00
Alpar Torok	03a1645bc6	Use dynamic port ranges for ExternalTestCluster (#45601 ) Moves methods added in #44213 and uses them to configure the port range for `ExternalTestCluster` too. These were still using `9300-9400` (the default) and running into races.	2019-08-15 16:40:12 +03:00
Nick Knize	647a8308c3	[SPATIAL] Backport new ShapeFieldMapper and ShapeQueryBuilder to 7x (#45363 ) * Introduce Spatial Plugin (#44389) Introduce a skeleton Spatial plugin that holds new licensed features coming to Geo/Spatial land! * [GEO] Refactor DeprecatedParameters in AbstractGeometryFieldMapper (#44923) Refactor DeprecatedParameters specific to legacy geo_shape out of AbstractGeometryFieldMapper.TypeParser#parse. * [SPATIAL] New ShapeFieldMapper for indexing cartesian geometries (#44980) Add a new ShapeFieldMapper to the xpack spatial module for indexing arbitrary cartesian geometries using a new field type called shape. The indexing approach leverages lucene's new XYShape field type which is backed by BKD in the same manner as LatLonShape but without the WGS84 latitude longitude restrictions. The new field mapper builds on and extends the refactoring effort in AbstractGeometryFieldMapper and accepts shapes in either GeoJSON or WKT format (both of which support non geospatial geometries). Tests are provided in the ShapeFieldMapperTest class in the same manner as GeoShapeFieldMapperTests and LegacyGeoShapeFieldMapperTests. Documentation for how to use the new field type and what parameters are accepted is included. The QueryBuilder for searching indexed shapes is provided in a separate commit. * [SPATIAL] New ShapeQueryBuilder for querying indexed cartesian geometry (#45108) Add a new ShapeQueryBuilder to the xpack spatial module for querying arbitrary Cartesian geometries indexed using the new shape field type. The query builder extends AbstractGeometryQueryBuilder and leverages the ShapeQueryProcessor added in the previous field mapper commit. Tests are provided in ShapeQueryTests in the same manner as GeoShapeQueryTests and docs are updated to explain how the query works.	2019-08-14 16:35:10 -05:00
Nhat Nguyen	4fcf7bbd07	Do not hold writeLock while verifying Lucene/translog We should not hold Engine#writeLock while executing assertConsistentHistoryBetweenTranslogAndLuceneIndex for this check might acquire Engine#readLock. Relates #45461	2019-08-13 16:16:06 -04:00
Nhat Nguyen	24514275c7	Get max_seq_no after snapshot translog and Lucene (#45461 ) We should capture max_seq_no after snapshotting translog and Lucene; otherwise, that max_seq_no can be smaller than some operation in translog or Lucene. With this change, we also hold the Engine#writeLock during this check so that no indexing can happen. Closes #45454	2019-08-13 16:16:06 -04:00
Nhat Nguyen	25c6102101	Trim local translog in peer recovery (#44756 ) Today, if an operation-based peer recovery occurs, we won't trim translog but leave it as is. Some unacknowledged operations existing in translog of that replica might suddenly reappear when it gets promoted. With this change, we ensure trimming translog above the starting sequence number of phase 2. This change can allow us to read translog forward.	2019-08-10 22:59:02 -04:00
Armin Braun	12ed6dc999	Only retain reasonable history for peer recoveries (#45208 ) (#45355 ) Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates #41536	2019-08-09 01:56:32 +02:00
Tim Brooks	af908efa41	Disable netty direct buffer pooling by default (#44837 ) Elasticsearch does not grant Netty reflection access to get Unsafe. The only mechanism that currently exists to free direct buffers in a timely manner is to use Unsafe. This leads to the occasional scenario, under heavy network load, that direct byte buffers can slowly build up without being freed. This commit disables Netty direct buffer pooling and moves to a strategy of using a single thread-local direct buffer for interfacing with sockets. This will reduce the memory usage from networking. Elasticsearch currently derives very little value from direct buffer usage (TLS, compression, Lucene, Elasticsearch handling, etc all use heap bytes). So this seems like the correct trade-off until that changes.	2019-08-08 15:10:31 -06:00
David Turner	355713b9ca	Improve slow logging in MasterService (#45241 ) Adds a tighter threshold for logging a warning about slowness in the `MasterService` instead of relying on the cluster service's 30-second warning threshold. This new threshold applies to the computation of the cluster state update in isolation, so we get a warning if computing a new cluster state update takes longer than 10 seconds even if it is subsequently applied quickly. It also applies independently to the length of time it takes to notify the cluster state tasks on completion of publication, in case any of these notifications holds up the master thread for too long. Relates #45007 Backport of #45086	2019-08-06 17:01:49 +01:00
Jason Tedor	5b1b146099	Normalize environment paths (#45179 ) This commit applies a normalization process to environment paths, both in how they are stored internally and in their settings values. This normalization is done via two means: - we make the paths absolute - we remove redundant name elements from the path (what Java calls "normalization") This change ensures that when we compare and refer to these paths within the system, we are using a common ground. For example, prior to the change if the data path was relative, we would not compare it correctly to paths from disk usage. This is because the paths in disk usage were being made absolute.	2019-08-06 06:04:30 -04:00
Yannick Welsch	7aeb2fe73c	Add per-socket keepalive options (#44055 ) Uses JDK 11's per-socket configuration of TCP keepalive (supported on Linux and Mac), see https://bugs.openjdk.java.net/browse/JDK-8194298, and exposes these as transport settings. By default, these options are disabled for now (i.e. they fall back to OS behavior), but we would like to explore whether we can enable them by default, in particular to force keepalive configurations that are better tuned for running ES.	2019-08-06 10:45:44 +02:00
Zachary Tong	3df1c76f9b	Allow pipeline aggs to select specific buckets from multi-bucket aggs (#44179 ) This adjusts the `buckets_path` parser so that pipeline aggs can select specific buckets (via their bucket keys) instead of fetching the entire set of buckets. This is useful for bucket_script in particular, which might want specific buckets for calculations. It's possible to work around this with `filter` aggs, but the workaround is hacky and probably less performant. - Adjusts documentation - Adds a barebones AggregatorTestCase for bucket_script - Tweaks AggTestCase to use getMockScriptService() for reductions and pipelines. Previously pipelines could just pass in a script service for testing, but this didn't work for regular aggs. The new getMockScriptService() method fixes that issue, but needs to be used for pipelines too. This had a knock-on effect of touching MovFn, AvgBucket and ScriptedMetric	2019-08-05 12:18:40 -04:00
David Turner	13a167051f	Remove fileBasedRecovery flag (#45146 ) Today `RecoveryTarget#prepareForTranslogOperations` takes a boolean flag indicating whether the recovery is file-based or not. This was used in 6.x to bootstrap some commit data that were missing in indices created in 5.x: `b506955f8d/server/src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java (L298-L300)` This flag no longer has any effect, so this commit removes it. Backport of #45131 to 7.x.	2019-08-05 08:17:40 +01:00
Tim Brooks	984ba82251	Move nio channel initialization to event loop (#45155 ) Currently in the transport-nio work we connect and bind channels on a thread before the channel is registered with a selector. Additionally, it is at this point that we set all the socket options. This commit moves these operations onto the event-loop after the channel has been registered with a selector. It attempts to set the socket options for a non-server channel at registration time. If that fails, it will attempt to set the options after the channel is connected. This should fix #41071.	2019-08-02 17:31:31 -04:00
David Turner	9ff320d967	Use index for peer recovery instead of translog (#45137 ) Today we recover a replica by copying operations from the primary's translog. However we also retain some historical operations in the index itself, as long as soft-deletes are enabled. This commit adjusts peer recovery to use the operations in the index for recovery rather than those in the translog, and ensures that the replication group retains enough history for use in peer recovery by means of retention leases. Reverts #38904 and #42211 Relates #41536 Backport of #45136 to 7.x.	2019-08-02 15:00:43 +01:00
Armin Braun	9450505d5b	Stop Passing Around REST Request in Multiple Spots (#44949 ) (#45109 ) * Stop Passing Around REST Request in Multiple Spots * Motivated by #44564 * We are currently passing the REST request object around to a large number of places. This works fine since we simply copy the full request content before we handle the rest itself which is needlessly hard on GC and heap. * This PR removes a number of spots where the request is passed around needlessly. There are many more spots to optimize in follow-ups to this, but this one would already enable bypassing the request copying for some error paths in a follow up.	2019-08-02 07:31:38 +02:00
David Turner	c088bafbbc	Wait for events in waitForRelocation (#45074 ) Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in `ESIntegTestCase#waitForRelocation()` to deal with the case that this health request returns successfully despite the fact that there is a pending reroute task which will relocate another shard. Relates #44433 Fixes #45003	2019-08-01 13:47:39 +01:00
Nhat Nguyen	979d0a71c7	Remove leniency during replay translog in peer recovery (#44989 ) This change removes leniency in InternalEngine during replaying translog in peer recovery.	2019-07-30 13:25:15 -04:00
Armin Braun	548c767b6b	S3 3rd Party Test Goal (#44799 ) (#45004 ) * Create S3 Third Party Test Task that Covers the S3 CLI Tool * Adjust snapshot cli test tool tests to work with real S3 * Build adjustment * Clean up repo path before testing * Dedup the logic for asserting path contents by using the correct utility method here that somehow became unused	2019-07-30 17:16:41 +02:00
David Turner	55f1dd8da6	Close nodes properly in Coordinator tests (#44967 ) Today closing a `ClusterNode` in an `AbstractCoordinatorTestCase` uses `onNode()` so has no effect if the node is not in the current list of nodes. It also discards the `Runnable` it creates without having run it, so has no effect anyway. This commit makes these tests much stricter about properly closing the nodes started during `Coordinator` tests, by tracking the persisted states that are opened, and adds an assertion to catch the trappy requirement that the closing node still belongs to the cluster.	2019-07-30 11:47:36 +01:00
Andrey Ershov	5a0bd696fc	Snapshot tool S3 cleanup 7.x backport (#44575 ) Backport of #44551	2019-07-30 11:02:08 +02:00
Nhat Nguyen	4813728783	Remove leniency in reset engine from translog (#44711 ) Replaying operations from the local translog must never fail as those operations were processed successfully on the primary before and the mapping is up to date already. This change removes leniency during resetting engine from translog in IndexShard and InternalEngine.	2019-07-29 16:31:45 -04:00
Yannick Welsch	8653c33838	Fix testBlockingIncomingRequests (#44939 ) Adapted test to take non-blocking nature into account.	2019-07-29 16:37:53 +02:00
Yannick Welsch	24873dd3e3	Do not block transport thread on startup (#44939 ) We currently block the transport thread on startup, which has caused test failures. I think this is some kind of deadlock situation. I don't think we should even block a transport thread, and there's also no need to do so. We can just reject requests as long as we're not fully set up. Note that the HTTP layer is only started much later (after we've completed full start up of the transport layer), so that one should be completely unaffected by this. Closes #41745	2019-07-29 11:35:17 +02:00
Jason Tedor	6ea2b5dec0	Deprecate setting processors to more than available (#44889 ) Today the processors setting is permitted to be set to more than the number of processors available to the JVM. The processors setting directly sizes the number of threads in the various thread pools, with most of these sizes being a linear function in the number of processors. It doesn't make any sense to set processors very high as the overhead from context switching amongst all the threads will overwhelm, and changing the setting does not control how many physical CPU resources there are on which to schedule the additional threads. We have to draw a line somewhere and this commit deprecates setting processors to more than the number of available processors. This is the right place to draw the line given the linear growth as a function of processors in most of the thread pools, and that some are capped at the number of available processors already.	2019-07-26 17:06:44 +09:00
Yannick Welsch	0ce841915c	Add Clone Index API (#44267 ) Adds an API to clone an index. This is similar to the index split and shrink APIs, just with the difference that the number of primary shards is kept the same. In cases where the filesystem provides hard-linking capabilities, this is a very cheap operation. Index cloning can be done by running `POST my_source_index/_clone/my_target_index` and it supports the same options as the split and shrink APIs. Closes #44128	2019-07-25 22:02:28 +02:00
Andrei Stefan	2633d11eb7	Switch from using docvalue_fields to extracting values from _source (#44062 ) (#44804 ) * Switch from using docvalue_fields to extracting values from _source where applicable. Doing this means parsing the _source and handling number parsing just as Elasticsearch does when it indexes a document. * This also introduces a minor limitation: alias-type fields that are NOT part of a tree of sub-fields can no longer be retrieved. The field_caps API doesn't shed any light on whether a field is an alias, and at _source parsing time there is no way to know if a root field is an alias or not. Fields of the type "a.b.c.alias" can be extracted from docvalue_fields, only if the field they point to can be extracted from docvalue_fields. Also, not all fields in a hierarchy of fields can be evaluated to being an alias. (cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)	2019-07-25 10:02:41 +03:00
Igor Motov	f9943a3e53	Geo: deprecate ShapeBuilder in QueryBuilders (#44715 ) Removes now-unnecessary timeline decompositions from shape builders and deprecates ShapeBuilders in QueryBuilder in favor of libs/geo shapes. Relates to #40908	2019-07-24 14:27:58 -04:00
Armin Braun	d8be9244f9	Fix Repository Cleanup Test Correctness (#44738 ) (#44751 ) * The tests were creating the corruption and asserting its existence not on the repository base path but on a clean path. As a result the consistency assertion on the repository would never see the corruption and would pass even if the cleanup was broken for repositories that have a non-root base path	2019-07-24 16:03:37 +02:00
Armin Braun	818103ff1e	Fix testRetentionLeasesClearedOnRestore (#44754 ) (#44766 ) * Fix this test randomly failing when running into async translog persistence edge case and failing to successfully close index * Also, slightly improve debug logging on close failure * Closes #44681	2019-07-23 21:29:07 +02:00
Igor Motov	9338fc8536	GEO: Switch to using GeoTestUtil to generate random geo shapes (#44635 ) Switches to more robust way of generating random test geometries by reusing lucene's GeoTestUtil. Removes duplicate random geometry generators by moving them to the test framework. Closes #37278	2019-07-23 14:30:41 -04:00
Mayya Sharipova	972a49312c	Fix testQuotedQueryStringWithBoost test (#43385 ) Add more logging to indexRandom Seems that asynchronous indexing from indexRandom sometimes indexes the same document twice, which will mess up the expected score calculations. For example, indexing: { "index" : {"_id" : "1" } } {"important" :"phrase match", "less_important": "nothing important"} { "index" : {"_id" : "2" } } {"important" :"nothing important", "less_important" :"phrase match"} Produces the expected scores: 13.8 for doc1, and 1.38 for doc2 indexing: { "index" : {"_id" : "1" } } {"important" :"phrase match", "less_important": "nothing important"} { "index" : {"_id" : "2" } } {"important" :"nothing important", "less_important" :"phrase match"} { "index" : {"_id" : "3" } } {"important" :"phrase match", "less_important": "nothing important"} Produces scores: 9.4 for doc1, and 1.96 for doc2 which are found in the error logs. Relates to #43144	2019-07-22 08:44:31 -04:00
Ryan Ernst	4c05d25ec7	Convert Transport Request/Response to Writeable (#44636 ) (#44654 ) This commit converts all remaining TransportRequest and TransportResponse classes to implement Writeable, and disallows Streamable implementations. relates #34389	2019-07-20 11:25:58 -07:00
Ryan Ernst	f4ee2e9e91	Convert direct implementations of Streamable to Writeable (#44605 ) (#44646 ) This commit converts Streamable to Writeable for direct implementations. relates #34389	2019-07-20 08:32:29 -07:00
Ryan Ernst	f193d14764	Convert remaining Action Response/Request to writeable.reader (#44528 ) (#44607 ) This commit converts readFrom to ctor with StreamInput on the remaining ActionResponse and ActionRequest classes. relates #34389	2019-07-19 13:33:38 -07:00
Lee Hinman	fe2ef66e45	Expose index age in ILM explain output (#44457 ) * Expose index age in ILM explain output This adds the index's age to the ILM explain output, for example: ``` { "indices" : { "ilm-000001" : { "index" : "ilm-000001", "managed" : true, "policy" : "full-lifecycle", "lifecycle_date" : "2019-07-16T19:48:22.294Z", "lifecycle_date_millis" : 1563306502294, "age" : "1.34m", "phase" : "hot", "phase_time" : "2019-07-16T19:48:22.487Z", ... etc ... } } } ``` This age can be used to tell when ILM will transition the index to the next phase, based on that phase's `min_age`. Resolves #38988 * Expose age in getters and in HLRC	2019-07-18 15:33:45 -06:00
Andrey Ershov	ef6ddd15c6	Revert "Snapshot tool: S3 orphaned files cleanup (#44551)" This reverts commit `09edeeb3`	2019-07-18 17:21:45 +02:00
Andrey Ershov	6f5327ba45	Fix BlobStoreTestUtil	2019-07-18 17:00:23 +02:00
Andrey Ershov	09edeeb38e	Snapshot tool: S3 orphaned files cleanup (#44551 ) A tool to work with snapshots. Co-authored by @original-brownbear. This commit adds snapshot tool and the single command cleanup, that cleans up orphaned files for S3. Snapshot tool lives in x-pack/snapshot-tool. (cherry picked from commit fc4aed44dd975d83229561090f957a95cc76b287)	2019-07-18 16:38:00 +02:00
David Turner	452f7f67a0	Defer reroute when starting shards (#44539 ) Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. Backport of #44433 and #44543.	2019-07-18 14:10:40 +01:00
Nhat Nguyen	51180af91d	Make peer recovery send file chunks async (#44468 ) Relates #44040 Relates #36195	2019-07-17 22:25:43 -04:00
Nhat Nguyen	458f24c46a	Reenable accounting circuit breaker (#44495 ) We have a new Lucene 8.2 snapshot on master and 7.x; hence we can re-enable the accounting on these branches. Relates #30290	2019-07-17 22:25:43 -04:00
Jason Tedor	39c5f98de7	Introduce test issue logging (#44477 ) Today we have an annotation for controlling logging levels in tests. This annotation serves two purposes, one is to control the logging level used in tests, when such control is needed to impact and assert the behavior of loggers in tests. The other use is when a test is failing and additional logging is needed. This commit separates these two concerns into separate annotations. The primary motivation for this is that we have a history of leaving behind the annotation for the purpose of investigating test failures long after the test failure is resolved. The accumulation of these stale logging annotations has led to excessive disk consumption. Having recently cleaned this up, we would like to avoid falling into this state again. To do this, we are adding a link to the test failure under investigation to the annotation when used for the purpose of investigating test failures. We will add tooling to inspect these annotations, in the same way that we have tooling on awaits fix annotations. This will enable us to report on the use of these annotations, and report when stale uses of the annotation exist.	2019-07-18 05:33:33 +09:00
Yannick Welsch	f78e64e3e2	Terminate linearizability check early on large histories (#44444 ) Large histories can be problematic and have the linearizability checker occasionally run OOM. As it's very difficult to bound the size of the histories just right, this PR will let it instead run for 10 seconds on large histories and then abort. Closes #44429	2019-07-17 18:51:25 +02:00
Armin Braun	c8db0e9b7e	Remove blobExists Method from BlobContainer (#44472 ) (#44475 ) * We only use this method in one place in production code and can replace that with a read -> remove it to simplify the interface * Keep it as an implementation detail in the Azure repository	2019-07-17 11:56:02 +02:00
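
The "reasonable recovery" rule described in commit 12ed6dc999 above can be illustrated with a minimal sketch. It assumes only what the commit message states: an operations-based recovery is considered reasonable iff it would replay at most 10% of the documents in the index. The class name, method name, and parameters below are hypothetical and are not taken from the Elasticsearch code base.

```java
// Hypothetical illustration of the 10% rule from commit 12ed6dc999.
// None of these names come from the actual Elasticsearch sources.
public final class ReasonableRecoverySketch {

    private ReasonableRecoverySketch() {}

    /**
     * @param docsInSafeCommit   number of documents (including deleted ones) in all
     *                           segments of the current safe commit
     * @param retainedOperations number of operations a lease retains below the local
     *                           checkpoint of the safe commit
     * @return true if replaying the retained operations touches at most 10% of the docs
     */
    public static boolean isOperationsBasedRecoveryReasonable(long docsInSafeCommit,
                                                              long retainedOperations) {
        if (docsInSafeCommit < 0 || retainedOperations < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Integer division avoids floating-point rounding; for whole-number operation
        // counts, ops <= docs / 10 (floored) is equivalent to ops <= 0.1 * docs.
        return retainedOperations <= docsInSafeCommit / 10;
    }

    public static void main(String[] args) {
        // Large index, few operations to replay: operations-based recovery is reasonable.
        System.out.println(isOperationsBasedRecoveryReasonable(1_000_000L, 50_000L));
        // Small index, many operations: a file-based recovery makes more sense.
        System.out.println(isOperationsBasedRecoveryReasonable(1_000L, 500L));
    }
}
```

Per the commit message, the real mechanism acts on this decision by expiring peer-recovery retention leases early when the check fails; the sketch covers only the comparison itself.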

2367 Commits