OpenSearch

Commit Graph

Author	SHA1	Message	Date
Tim Brooks	0645ee88e2	Send cluster name and discovery node in handshake (#48916 ) This commit sends the cluster name and discovery node in the transport level handshake response. This will allow us to stop sending the transport service level handshake request in the 8.0-8.x release cycle. It is necessary to start sending this in 7.x so that 8.0 is guaranteed to be communicating with a version that sends the required information.	2019-11-11 18:42:02 -05:00
Jake Landis	c320b499a0	Prevent deadlock by using separate schedulers (#48697 ) (#48964 ) Currently the BulkProcessor class uses a single scheduler to schedule flushes and retries. Functionally these are very different concerns but can result in a deadlock. Specifically, the single shared scheduler can kick off a flush task, which only finishes its task when the bulk that is being flushed finishes. If (for whatever reason) any items in that bulk fail, it will (by default) schedule a retry. However, that retry will never run its task, since the flush task is consuming the one and only thread available from the shared scheduler. Since the BulkProcessor is mostly client based code, the client can provide their own scheduler. As-is the scheduler would require at minimum 2 worker threads to avoid the potential deadlock. Since the number of threads is a configuration option in the scheduler, the code cannot enforce this 2 worker rule until runtime. For this reason this commit splits the single task scheduler into 2 schedulers. This eliminates the potential for the flush task to block the retry task and removes this deadlock scenario. This commit also deprecates the Java APIs that presume a single scheduler, and updates any internal code to no longer use those APIs. Fixes #47599 Note - #41451 fixed the general case where a bulk fails and is retried that can result in a deadlock. This fix should address that case as well as the case when a bulk failure from the flush needs to be retried.	2019-11-11 16:31:21 -06:00
Mark Tozzi	d9e569278f	Refactor and DRY up Kahan Sum algorithm (#48558 ) (#48959 )	2019-11-11 15:09:19 -05:00
Armin Braun	c45470f84f	Fix ShardGenerations in RepositoryData in BwC Case (#48920 ) (#48947 ) We were tripping the assertion that makes sure we only have empty `ShardGenerations` in `RepositoryData` in the BwC case because shard generations were passed to the `Repository` in the BwC case. Fixed by only generating empty shard gen for BwC snapshots in `SnapshotsService`.	2019-11-11 18:02:53 +01:00
Rory Hunter	014e1b1090	Improve resiliency to auto-formatting in server (#48940 ) Backport of #48450. Make a number of changes so that code in the `server` directory is more resilient to automatic formatting. This covers: * Reformatting multiline JSON to embed whitespace in the strings * Move some comments around so they aren't auto-formatted to a strange place. This also required moving some `&&` and `||` operators from the end-of-line to start-of-line. * Add helper method `reformatJson()`, to strip whitespace from a JSON document using XContent methods. This is sometimes necessary where a test is comparing some machine-generated JSON with an expected value. Also, `HyperLogLogPlusPlus.java` is now excluded from formatting because it contains large data tables that don't reformat well with the current settings, and changing the settings would be worse for the rest of the codebase.	2019-11-11 14:33:04 +00:00
Yannick Welsch	87862868c6	Allow realtime get to read from translog (#48843 ) The realtime GET API currently has erratic performance in cases where a document is accessed that has just been indexed but not refreshed yet, as the implementation will currently force an internal refresh in that case. Refreshing can be an expensive operation, and also will block the thread that executes the GET operation, preventing other GETs from being processed. In case of frequent access of recently indexed documents, this can lead to a refresh storm and terrible GET performance. While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted to read from the translog in case of realtime GET API or update API, this was removed in 5.0 (#20102) to avoid inconsistencies between values that were returned from the translog and those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and upsert to read from the translog again as it was easier to guarantee consistency for these, and also brought back more predictable performance characteristics of this API. Calls to the realtime GET API, however, would still always do a refresh if necessary to return consistent results. This means that users that were calling realtime GET APIs to coordinate updates on client side (realtime GET + CAS for conditional index of updated doc) would still see very erratic performance. This PR (together with #48707) resolves the inconsistencies between reading from translog and index. In particular it fixes the inconsistencies that happen when requesting stored fields, which were not available when reading from translog. In cases where stored fields are requested, this PR will reparse the _source from the translog and derive the stored fields to be returned. 
With this, it changes the realtime GET API to allow reading from the translog again, avoid refresh storms and blocking the GET threadpool, and provide overall much better and predictable performance for this API.	2019-11-09 17:47:50 +01:00
Nhat Nguyen	ff6c121eb9	Closed shard should never open new engine (#47186 ) We should not open new engines if a shard is closed. We break this assumption in #45263 where we stop verifying the shard state before creating an engine but only before swapping the engine reference. We can fail to snapshot the store metadata or checkIndex a closed shard if there's some IndexWriter holding the index lock. Closes #47060	2019-11-08 23:40:34 -05:00
Nhat Nguyen	9a42e71dd9	Do not cancel recovery for copy on broken node (#48265 ) This change fixes a poisonous situation where an ongoing recovery was canceled because a better copy was found on a node that the cluster had previously tried allocating the shard to but failed. The solution is to keep track of the set of nodes that an allocation was failed on so that we can avoid canceling the current recovery for a copy on failed nodes. Closes #47974	2019-11-08 23:10:47 -05:00
Adrien Grand	3b9ce0a4f3	Elasticsearch 7.5 is on Lucene 8.3. (#48831 )	2019-11-06 10:13:09 -05:00
David Turner	bd5c6c4779	Add preflight check to dynamic mapping updates (#48867 ) Today if the primary discovers that an indexing request needs a mapping update then it will send it to the master for validation and processing. If, however, the put-mapping request is invalid then the master still processes it as a (no-op) cluster state update. When there are a large number of indexing operations that result in invalid mapping updates this can overwhelm the master. However, the primary already has a reasonably up-to-date mapping against which it can check the (approximate) validity of the put-mapping request before sending it to the master. For instance it is not possible to remove fields in a mapping update, so if the primary detects that a mapping update will exceed the fields limit then it can reject it itself and avoid bothering the master. This commit adds a pre-flight check to the mapping update path so that the primary can discard obviously-invalid put-mapping requests itself. Fixes #35564 Backport of #48817	2019-11-05 18:08:22 +01:00
Nhat Nguyen	0887cbc964	Fix testForceMergeWithSoftDeletesRetentionAndRecoverySource (#48766 ) This test failure manifests the limitation of the recovery source merge policy explained in #41628. If we already merge down to a single segment then subsequent force merges will be noop although they can prune recovery source. We need to adjust this test until we have a fix for the merge policy. Relates #41628 Closes #48735	2019-11-02 21:14:12 -04:00
Armin Braun	3c20541823	Cleanup Concurrent RepositoryData Loading (#48329 ) (#48834 ) The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This led to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this led to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely but possible event of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122	2019-11-02 20:42:29 +01:00
Armin Braun	a22f6fbe3c	Cleanup Redundant Futures in Recovery Code (#48805 ) (#48832 ) Follow up to #48110 cleaning up the redundant future uses that were left over from that change.	2019-11-02 17:28:12 +01:00
Jason Tedor	c82ecb664c	Do not wrap ingest processor exception with IAE (#48816 ) The problem with wrapping here is that it converts any exception into an IAE, which we treat as a client error (400 status) whereas the exception being wrapped here could be a server error (e.g., NPE). This commit stops wrapping all ingest processor exceptions as IAEs.	2019-11-01 15:11:35 -04:00
Mark Vieira	6ab4645f4e	[7.x] Introduce type-safe and consistent pattern for handling build globals (#48818 ) This commit introduces a consistent and type-safe manner for handling global build parameters throughout our build logic. Primarily this replaces the existing usages of extra properties with static accessors. It also introduces an explicit API for initialization and mutation of any such parameters, as well as better error handling for uninitialized or eager access of parameter values. Closes #42042	2019-11-01 11:33:11 -07:00
Tal Levy	4be54402de	[7.x] Add ingest info to Cluster Stats (#48485 ) (#48661 ) * Add ingest info to Cluster Stats (#48485) This commit enhances the ClusterStatsNodes response to include global processor usage stats on a per-processor basis. example output: ``` ... "processor_stats": { "gsub": { "count": 0, "failed": 0, "current": 0, "time_in_millis": 0 }, "script": { "count": 0, "failed": 0, "current": 0, "time_in_millis": 0 } } ... ``` The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the per-node usage statistics that currently exist in node stats. Closes #46146. * fix BWC of ingest stats The introduction of processor types into IngestStats had a bug. It was set to `null` and set as the key to the map. This would throw an NPE. This commit resolves this by setting all the processor types from previous versions that are not serializing it out to `_NOT_AVAILABLE`.	2019-10-31 14:36:54 -07:00
Ioannis Kakavas	99aedc844d	Copy http headers to ThreadContext strictly (#45945 ) (#48675 ) The previous behavior while copying HTTP headers to the ThreadContext would allow multiple HTTP headers with the same name, handling only the first occurrence and disregarding the rest of the values. This can be confusing when dealing with multiple headers as it is not obvious which value is read and which ones are silently dropped. According to RFC-7230, a client must not send multiple header fields with the same field name in an HTTP message, unless the entire field value for this header is defined as a comma separated list or this specific header is a well-known exception. This commit changes the behavior in order to be more compliant with the aforementioned RFC by requiring the classes that implement ActionPlugin to declare if a header can be multi-valued or not when registering this header to be copied over to the ThreadContext in ActionPlugin#getRestHeaders. If the header is allowed to be multivalued, then all such headers are read from the HTTP request and their values get concatenated in a comma-separated string. If the header is not allowed to be multivalued, and the HTTP request contains multiple such headers with different values, the request is rejected with a 400 status.	2019-10-31 23:05:12 +02:00
Zachary Tong	34c2375417	Add v7.4.3 version constant	2019-10-31 13:21:25 -04:00
Alexander Reelsen	4ecf234617	Upgrade to joda 2.10.4 (#47805 )	2019-10-31 14:49:50 +01:00
Stéphane Campinas	7ea74918e1	[DOCS] Fix typo in IndexFieldData.java comments (#48743 )	2019-10-31 09:40:35 -04:00
kkewwei	0366c4d4a9	Faster access to INITIALIZING/RELOCATING shards (#47817 ) Today a couple of allocation deciders iterate through all the shards on a node to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster state updates in clusters with very high-density nodes holding many thousands of shards even if those shards belong to closed or frozen indices. This commit pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up this search. Closes #46941 Relates #48579 Co-authored-by: "hongju.xhj" <hongju.xhj@alibaba-inc.com>	2019-10-31 10:55:59 +00:00
Rory Hunter	d96976e2b1	Improve resiliency to formatting JSON in server (#48706 ) Backport of #48553. Make a number of changes so that JSON in the server directory is more resilient to automatic formatting. This covers: * Reformatting multiline JSON to embed whitespace in the strings * Add helper method `stripWhitespace()`, to strip whitespace from a JSON document using XContent methods. This is sometimes necessary where a test is comparing some machine-generated JSON with an expected value.	2019-10-31 10:48:55 +00:00
Arvind Ramachandran	eefa84bc94	Ignore dangling indices created in newer versions (#48652 ) Today it is possible that we import a dangling index that was created in a newer version than one or more of the nodes in the cluster. Such an index would prevent the older node(s) from rejoining the cluster if they were to briefly leave it for some reason. This commit prevents the import of such dangling indices. Fixes #34264	2019-10-31 10:12:42 +00:00
Yannick Welsch	fe8901b00b	Return consistent source in updates (#48707 )	2019-10-31 10:00:40 +01:00
Ignacio Vera	5bea3898a9	Add IndexOrDocValuesQuery to GeoPolygonQueryBuilder (#48449 ) (#48731 )	2019-10-31 08:46:57 +01:00
Nhat Nguyen	f8ef402027	Do not warm up searcher in engine constructor (#48605 ) With this change, we won't warm up searchers until we externally refresh an engine. We explicitly refresh before allowing reading from a shard (i.e., move to post_recovery state) and during resetting. This guarantees that we have warmed up the engine before exposing the external searcher. Another prerequisite for #47186.	2019-10-30 14:22:59 -04:00
Armin Braun	36039706b5	Fix SnapshotShardStatus Reporting for Failed Shard (#48556 ) (#48687 ) Fixes the shard snapshot status reporting for failed shards in the corner case of failing the shard because of an exception thrown in `SnapshotShardsService` and not the repository. We were missing the update on the `snapshotStatus` instance in this case which made the transport APIs using this field report back an incorrect status. Fixed by moving the failure handling to the `SnapshotShardsService` for all cases (which also simplifies the code, the ex. wrapping in the repository was pointless as we only used the ex. trace upstream anyway). Also, added an assertion to another test that explicitly checks this failure situation (ex. in the `SnapshotShardsService`) already. Closes #48526	2019-10-30 15:43:41 +01:00
Armin Braun	52e5ceb321	Restore from Individual Shard Snapshot Files in Parallel (#48110 ) (#48686 ) Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.	2019-10-30 14:36:30 +01:00
Armin Braun	01e326d2e3	Fix ref count handling in Engine.failEngine (#48639 ) (#48646 ) We can run into an already closed store here and hence throw on trying to increment the ref count => moving to the guarded ref count increment closes #48625	2019-10-30 10:10:48 +01:00
Julie Tibshirani	89c65752dc	Update the signature of vector script functions. (#48653 ) Previously the functions accepted a doc values reference, whereas they now accept the name of the vector field. Here's an example of how a vector function was called before and after the change. ``` Before: cosineSimilarity(params.query_vector, doc['field']) After: cosineSimilarity(params.query_vector, 'field') ``` This seems more intuitive, since we don't allow direct access to vector doc values and the meaning of `doc['field']` is unclear. The PR makes the following changes (broken into distinct commits): * Add new function signatures of the form `function(params.query_vector, 'field')` and deprecate the old ones. Because Painless doesn't allow two methods with the same name and number of arguments, we allow a generic `Object` to be passed in to the function and decide on the behavior through an `instanceof` check. * Refactor the class bindings so that the document field is passed to the constructor instead of the instance method. This allows us to avoid retrieving the vector doc values on every function invocation, which gives a tiny speed-up in benchmarks. Note that this PR adds new signatures for the sparse vector functions too, even though sparse vectors are deprecated. It seemed simplest to understand (for both us and users) to keep everything symmetric between dense and sparse vectors.	2019-10-29 15:46:05 -07:00
Stuart Tettemer	55d00cf2b1	Scripting: fill in get contexts REST API (#48319 ) (#48602 ) Updates response for `GET /_script_context`, returning a `contexts` object with a list of context description objects. The description includes the context name and a list of methods available. The methods list has the signature for the `execute` method and any getters. eg. ``` { "contexts": [ { "name" : "moving-function", "methods" : [ { "name" : "execute", "return_type" : "double", "params" : [ { "type" : "java.util.Map", "name" : "params" }, { "type" : "double[]", "name" : "values" } ] } ] }, { "name" : "number_sort", "methods" : [ { "name" : "execute", "return_type" : "double", "params" : [ ] }, { "name" : "getDoc", "return_type" : "java.util.Map", "params" : [ ] }, { "name" : "getParams", "return_type" : "java.util.Map", "params" : [ ] }, { "name" : "get_score", "return_type" : "double", "params" : [ ] } ] }, ... ] } ``` fixes: #47411	2019-10-29 14:41:15 -06:00
Nhat Nguyen	2a863ac8ff	Fix testCleanUpCommitsWhenGlobalCheckpointAdvanced Relates #48559	2019-10-29 10:39:16 -04:00
Nhat Nguyen	b08cd058bc	Greedily advance safe commit on new global checkpoint (#48559 ) Today we won't advance the safe commit on a new global checkpoint unless the last commit can become safe. This is not great if we have more than two commits as we can have a new safe commit earlier. Closes #4853	2019-10-29 10:39:16 -04:00
Jim Ferenczi	aa70ff5ea4	Fix failures in ShuffleForcedMergePolicyTests#testDiagnostics (#48627 ) This commit fixes intermittent failures in ShuffleForcedMergePolicyTests#testDiagnostics by setting a more restricted merge policy that ensures that extra merging will not happen before the forced merge.	2019-10-29 13:46:55 +01:00
Jim Ferenczi	c6abe58f63	Fix expectations in SearchAfter integration tests (#48372 ) This commit fixes the expectations of SearchAfterIT#shouldFail regarding the inner exceptions that should be thrown when testing failures. The exception is sometimes wrapped in a QueryShardException so this change only checks that the toString representation contains the expected message. Closes #43143	2019-10-29 12:37:22 +01:00
Yannick Welsch	6af3ce58f8	Filter on node id in AllocationIdIT (#48623 ) Makes the assertions more targeted. Relates #48529	2019-10-29 12:10:48 +01:00
Jim Ferenczi	028084ce23	Add a new merge policy that interleaves old and new segments on force merge (#48533 ) This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the eldest documents first, to be efficient at finding the most recent documents too. We wrap this merge policy for all indices even though it is mostly useful for time-based ones; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working on in # remain efficient even after running a force merge. Relates #37043	2019-10-29 10:44:56 +01:00
Armin Braun	53a22b8a8a	Fix Validity of RepositoryDataTests Randomness (#48564 ) (#48566 ) Trivial point, but we were only testing shard generations for a single shard here, accidentally, and not testing the `null` generation case at all.	2019-10-28 11:04:57 +01:00
Nhat Nguyen	1ef87c9a68	Refresh should not acquire readLock (#48414 ) Today, we hold the engine readLock while refreshing. Although this choice simplifies the correctness reasoning, it can block IndexShard from closing if warming an external reader takes time. The current implementation of refresh does not need to hold readLock as ReferenceManager can handle errors correctly if the engine is closed midway. This PR is a prerequisite for solving #47186.	2019-10-25 17:32:35 -04:00
Dan Hermann	2e3db518c9	Do not reference values for filtered settings (#48066 ) (#48518 )	2019-10-25 16:22:11 -05:00
Tim Brooks	f5f1072824	Multiple remote connection strategy support (#48496 ) * Extract remote "sniffing" to connection strategy (#47253) Currently the connection strategy used by the remote cluster service is implemented as a multi-step sniffing process in the RemoteClusterConnection. We intend to introduce a new connection strategy that will operate in a different manner. This commit extracts the sniffing logic to a dedicated strategy class. Additionally, it implements dedicated tests for this class. Additionally, in previous commits we moved away from a world where the remote cluster connection was mutable. Instead, when setting updates are made, the connection is torn down and rebuilt. We still had methods and tests hanging around for the mutable behavior. This commit removes those. * Introduce simple remote connection strategy (#47480) This commit introduces a simple remote connection strategy which will open remote connections to a configurable list of user supplied addresses. These addresses can be remote Elasticsearch nodes or intermediate proxies. We will perform normal cluster name and version validation, but otherwise rely on the remote cluster to route requests to the appropriate remote node. * Make remote setting updates support diff strategies (#47891) Currently the entire remote cluster settings infrastructure is designed around the sniff strategy. As we introduce an additional connection strategy this infrastructure needs to be modified to support it. This commit modifies the code so that the strategy implementations will tell the service if the connection needs to be torn down and rebuilt. As part of this commit, we will wait 10 seconds for new clusters to connect when they are added through the "update" settings infrastructure.	2019-10-25 09:29:41 -06:00
Luca Cavanna	d6d2edf324	Fix .tasks index strict mapping: parent_id should be parent_task_id (#48393 ) * Fix .tasks index strict mapping: parent_id should be parent_task_id The .tasks index has mappings that are strictly defined. However, `parent_task_id` was defined as `parent_id`, which would cause an exception when a task with a parent task id set is persisted. While at it, a couple of compiler warnings were addressed and a test request builder was removed in favour of using its corresponding request. * increment version	2019-10-25 17:00:06 +02:00
Luca Cavanna	9c48ed12bc	Remove response search phase from ExpandSearchPhase (#48401 ) The expand phase is always created providing a function that builds the next phase to be run, which has a single purpose: sending the response back. Such small search phase is not necessary and causes some issues when reporting search progress and counting the search phases that need to be executed and that are already executed. We can simply rather send back the response, without creating a specific phase for that.	2019-10-25 17:00:06 +02:00
Armin Braun	84a47a9632	Remove Outdated AwaitsFix (#48513 ) (#48522 ) This `AwaitsFix` was accidentally added after the test was already fixed in #46594 => we can remove it.	2019-10-25 16:08:56 +02:00
Armin Braun	edab3748e9	Remove Incorrect Assertion from SnapshotsInProgress (#47458 ) (#48514 ) This relates to the effort towards #46250. We added tracking of the shard generation for successful snapshots to `8.0`. This assertion isn't correct though. While an `8.0` master won't create an entry with success state and a null shard generation it may still (on e.g. master failover) send a success entry created by a 7.x master with a `null` generation over the wire. Closes #47406	2019-10-25 15:03:23 +02:00
Christoph Büscher	3fb3397c12	BlendedTermQuery's equals method should consider boosts (#48193 ) This changes the queries equals() method so that the boost factors for each term are considered for the equality calculation. This means queries are only equal if both their terms and associated boosts match. The ordering of the terms doesn't matter as before, which is why we internally need to sort the terms and boost for comparison on the first equals() call like before. Boosts that are `null` are considered equal to boosts of 1.0f because topLevelQuery() will only wrap into BoostQuery if boost is not null and different from 1f. Closes #48184	2019-10-25 13:35:14 +02:00
Yannick Welsch	486794f24d	Show task ID in source of persistent task state update (#48483 ) Relates #48395	2019-10-25 10:29:16 +02:00
Tim Brooks	c0b545f325	Make BytesReference an interface (#48486 ) BytesReference is currently an abstract class which is extended by various implementations. This makes it very difficult to use the delegation pattern. The implication of this is that our releasable BytesReference is a PagedBytesReference type and cannot be used as a generic releasable bytes reference that delegates to any reference type. This commit makes BytesReference an interface and introduces an AbstractBytesReference for common functionality.	2019-10-24 15:39:30 -06:00
Yannick Welsch	acf6d34d69	Always use last properly persisted metadata as previous state (#47779 ) On data-only nodes we were not using the last persisted cluster state as base point to compute what needed storage, but the last applied cluster state (but not necessarily properly persisted) instead.	2019-10-24 13:30:59 +02:00
David Turner	50518359fe	Fix relocating shards size calculation (#48421 ) In #48392 we added a second computation of the sizes of the relocating shards in `canRemain()` but passed the wrong value for `subtractLeavingShards`. This fixes that. It also removes some unnecessary logging in a test case added in the same commit.	2019-10-24 08:58:50 +01:00
Jim Ferenczi	dc5c31d67a	Add a deprecation warning regarding allocation awareness in search request (#48351 ) This is a follow up of https://github.com/elastic/elasticsearch/issues/43453 where we added a system property to disallow allocation awareness in search requests. Since search requests will no longer check the allocation awareness attributes for routing in the next major version, this change adds a deprecation warning on any setup that uses these attributes. Relates #43453	2019-10-24 09:25:50 +02:00
Mayya Sharipova	9e9533f717	Correct syntax from backport Use older format of map Relates to #48425	2019-10-23 17:19:15 -04:00
Mayya Sharipova	975dbecfa9	Correct rewriting of script_score query (#48425 ) Previously there was a bug when a query inside a script_score query was rewritten. If min_score was not set and was equal to null, we were converting it to a float value, which resulted in an NPE. This commit corrects this. Closes #48081	2019-10-23 17:01:51 -04:00
Igor Motov	bdbc353dea	Geo: improve handling of out of bounds points in linestrings (#47939 ) Brings handling of out of bounds points in linestrings in line with points. Now points with latitude above 90 and below -90 are handled the same way as for points by adjusting the longitude by moving it by 180 degrees. Relates to #43916	2019-10-23 14:17:44 -04:00
Jim Ferenczi	41116eb7ea	Do not throw errors on unknown types in SearchAfterBuilder (#48147 ) * Do not throw errors on unknown types in SearchAfterBuilder The support for BigInteger and BigDecimal was added for XContent in https://github.com/elastic/elasticsearch/pull/32888. However the SearchAfterBuilder xcontent parser doesn't expect them to be present so it throws an AssertionError. This change fixes this discrepancy by changing the AssertionError into an IllegalArgumentException that will not cause the node to die when thrown. Closes #48074	2019-10-23 20:02:14 +02:00
Tom Callahan	892264a97a	Add versions 7.4.2 and 6.8.5	2019-10-23 13:32:51 -04:00
David Turner	c783a20560	Handle negative free disk space in deciders (#48392 ) Today it is possible that the total size of all relocating shards exceeds the total amount of free disk space. For instance, this may be caused by another user of the same disk increasing their disk usage, or may be due to how Elasticsearch double-counts relocations that are nearly complete particularly if there are many concurrent relocations in progress. The `DiskThresholdDecider` treats negative free space similarly to zero free space, but it then fails when rendering the messages that explain its decision. This commit fixes its handling of negative free space. Fixes #48380	2019-10-23 18:16:41 +01:00
Adrien Grand	81ef72d3ef	Lucene#asSequentialBits gets the leadCost backwards. (#48335 ) (#48403 ) The comment says it needs random-access, but it passes `Long#MAX_VALUE` as the lead cost, which forces sequential access, it should pass `0` instead. I took advantage of this fix to improve the logic to leverage an estimation of the number of times that `Bits#get` gets called to make better decisions.	2019-10-23 17:48:17 +02:00
Przemyslaw Gomulka	aaa6209be6	[7.x] [Java.time] Calculate week of a year with ISO rules BACKPORT(#48209 ) (#48349 ) Revert the change introducing IsoLocal.ROOT and introduce IsoCalendarDataProvider, which defaults the start of the week to Monday and requires a minimum of 4 days in the first week of a year. This extension uses the Java SPI mechanism and provides defaults for Locale.ROOT only. It requires the JVM property java.locale.providers to be set to SPI,COMPAT. Closes #41670. Backport of #48209	2019-10-23 17:39:38 +02:00
Armin Braun	7215201406	Track Shard-Snapshot Index Generation at Repository Root (#48371 ) This change adds a new field `"shards"` to `RepositoryData` that contains a mapping of `IndexId` to a `String[]`. This string array can be accessed by shard id to get the generation of a shard's shard folder (i.e. the `N` in the name of the currently valid `/indices/${indexId}/${shardId}/index-${N}` for the shard in question). This allows for creating a new snapshot in the shard without doing any LIST operations on the shard's folder. In the case of AWS S3, this saves about 1/3 of the cost for updating an empty shard (see #45736) and removes one out of two remaining potential issues with eventually consistent blob stores (see #38941 ... now only the root `index-${N}` is determined by listing). Also and equally if not more important, a number of possible failure modes on eventually consistent blob stores like AWS S3 are eliminated by moving all delete operations to the `master` node and moving from incremental naming of shard level index-N to uuid suffixes for these blobs. This change moves the deleting of the previous shard level `index-${uuid}` blob to the master node instead of the data node allowing for a safe and consistent update of the shard's generation in the `RepositoryData` by first updating `RepositoryData` and then deleting the now unreferenced `index-${newUUID}` blob. __No deletes are executed on the data nodes at all for any operation with this change.__ Note also: Previous issues with hanging data nodes interfering with master nodes are completely impossible, even on S3 (see next section for details). This change changes the naming of the shard level `index-${N}` blobs to a uuid suffix `index-${UUID}`. The reason for this is the fact that writing a new shard-level `index-` generation blob is not atomic anymore in its effect. 
Not only does the blob have to be written to have an effect, it must also be referenced by the root level `index-N` (`RepositoryData`) to become an effective part of the snapshot repository. This leads to a problem if we were to use incrementing names like we did before. If a blob `index-${N+1}` is written but due to the node/network/cluster/... crashes the root level `RepositoryData` has not been updated then a future operation will determine the shard's generation to be `N` and try to write a new `index-${N+1}` to the already existing path. Updates like that are problematic on S3 for consistency reasons, but also create numerous issues when thinking about stuck data nodes. Previously stuck data nodes that were tasked to write `index-${N+1}` but got stuck and tried to do so after some other node had already written `index-${N+1}` were prevented from doing so (except for on S3) by us not allowing overwrites for that blob and thus no corruption could occur. Were we to continue using incrementing names, we could not do this. The stuck node scenario would either allow for overwriting the `N+1` generation or force us to continue using a `LIST` operation to figure out the next `N` (which would make this change pointless). With uuid naming and moving all deletes to `master` this becomes a non-issue. Data nodes write updated shard generation `index-${uuid}` and `master` makes those `index-${uuid}` part of the `RepositoryData` that it deems correct and cleans up all those `index-` that are unused. Co-authored-by: Yannick Welsch <yannick@welsch.lu> Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>	2019-10-23 10:58:26 +01:00
Jim Ferenczi	50f565b158	SearchSlowLog uses a non thread-safe object to escape json (#48363 ) This commit fixes the usage of JsonStringEncoder#quoteAsUTF8 in the SearchSlowLog. JsonStringEncoder#getInstance should always be called to get a thread local object but this assumption was broken by #44642. This means that any slow log can throw an AIOOBE since it uses the same byte array concurrently. Closes #48358	2019-10-23 10:23:06 +02:00
Armin Braun	8a02a5fc7d	Simplify Shard Snapshot Upload Code (#48155 ) (#48345 ) The code here was needlessly complicated when it enqueued all file uploads up-front. Instead, we can go with a cleaner worker + queue pattern here by taking the max-parallelism from the threadpool info. Also, I slightly simplified the rethrow and listener (step listener is pointless when you add the callback in the next line) handling it since I noticed that we were needlessly rethrowing in the same code and that wasn't worth a separate PR.	2019-10-22 17:17:09 +01:00
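Note on 8a02a5fc7d: the worker + queue pattern mentioned in the message can be sketched as below. This is an illustration of the general pattern only, not the repository code; `upload_all`, `upload` and `max_parallelism` are hypothetical names, and the real change takes the parallelism from the threadpool info.

```python
import queue
import threading


def upload_all(files, upload, max_parallelism):
    """Upload every file using at most max_parallelism workers pulling from one queue."""
    pending = queue.Queue()
    for f in files:
        pending.put(f)
    errors = []

    def worker():
        # Each worker keeps taking the next file until the queue is drained.
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                return
            try:
                upload(item)
            except Exception as e:  # collect the failure, report it after joining
                errors.append(e)

    workers = [threading.Thread(target=worker)
               for _ in range(min(max_parallelism, len(files)))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    if errors:
        raise errors[0]
```

Compared with enqueueing one task per file up front, the number of in-flight uploads here is bounded by the worker count and no task sits in the pool waiting for its turn.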
Nhat Nguyen	d0a4bad95b	Use MultiFileTransfer in CCR remote recovery (#44514 ) Relates #44468	2019-10-21 23:30:52 -04:00
Armin Braun	e65c60915a	Cleanup FileRestoreContext Abstractions (#48173 ) (#48300 ) This class is only used by the blob store repository and CCR and the abstractions didn't really make sense with CCR ignoring the concrete `restoreFiles` method completely and having a method used only by the blobstore overridden as unsupported. => Moved to a more fitting set of abstractions => Dried up the stream wrapping in `BlobStoreRepository` a little now that the `restoreFile` method could be simplified Relates #48110 as it makes changing the API of `FileRestoreContext` to what is needed for async restores simpler	2019-10-21 17:30:35 +02:00
Armin Braun	dc08feadc6	Remove Redundant Version Param from Repository APIs (#48231 ) (#48298 ) This parameter isn't used by any implementation	2019-10-21 16:20:45 +02:00
David Turner	672b2a92ca	Fix compile error from previous commit (#48230 ) The previous commit, `3a6fa0bbdb` introduces a compile error that was fixed locally but not committed. This commit adds the missing change.	2019-10-21 08:54:04 +01:00
David Turner	3a6fa0bbdb	Close query cache on index service creation failure (#48230 ) Today it is possible that we create the `QueryCache` and then fail to create the owning `IndexService` and this means we do not close the `QueryCache` again. This commit addresses that leak. Fixes #48186	2019-10-21 08:46:53 +01:00
Ignacio Vera	b1224fca8c	upgrade to Lucene-8.3.0-snapshot-25968e3b75e (#48227 )	2019-10-21 08:21:09 +02:00
Takuya Kajiwara	a56daeae2d	[DOCS] Fix typos in InternalEngine.java comments (#46861 )	2019-10-18 10:36:58 -04:00
David Turner	a8bcbbc38a	Quieter logging from the DiskThresholdMonitor (#48115 ) Today if an Elasticsearch node reaches a disk watermark then it will repeatedly emit logging about it, which implies that some action needs to be taken by the administrator. This is misleading. Elasticsearch strives to keep nodes under the high watermark, but it is normal to have a few nodes occasionally exceed this level. Nodes may be over the low watermark for an extended period without any ill effects. This commit enhances the logging emitted by the `DiskThresholdMonitor` to be less misleading. The expected case of hitting the high watermark and immediately relocating one or more shards to bring the node back under the watermark again is reduced in severity to `INFO`. Additionally, `INFO` messages are not emitted repeatedly. Fixes #48038	2019-10-18 15:00:14 +01:00
Armin Braun	1157775074	Remove Support for pre-5.x Indices in Restore (#48181 ) (#48199 ) The logic for handling empty segment files has been unnecessary ever since #24021 which removes the support for these files in 6.x -> we can safely remove the support for restoring these from 7.x+ to simplify the code.	2019-10-18 09:45:07 +02:00
Przemyslaw Gomulka	02d18f5c1e	[7.x] Slow log must use separate underlying logger for each index BACKPORT(#47234 ) (#48176 ) * Slow log must use separate underlying logger for each index (#47234) SlowLog instances should not share the same underlying logger, as it would cause different indexes override each other levels. When creating underlying logger, unique per index identifier should be used. Name + IndexSettings.UUID Closes #42432	2019-10-17 20:04:57 +02:00
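Note on 02d18f5c1e: the effect of naming the underlying logger per index can be sketched with Python's standard `logging` module, where loggers are also looked up by name. `slow_log_logger` and the UUID strings are hypothetical; the message only says the name is built from a name plus `IndexSettings.UUID`.

```python
import logging


def slow_log_logger(prefix, index_uuid):
    """Return a logger unique to one index (hypothetical helper: prefix + index UUID)."""
    return logging.getLogger(f"{prefix}.{index_uuid}")


a = slow_log_logger("index.search.slowlog", "uuid-a")
b = slow_log_logger("index.search.slowlog", "uuid-b")
a.setLevel(logging.WARNING)
b.setLevel(logging.DEBUG)  # does not touch the level configured for index a

# With a shared name both indices would get the same object, and the last
# setLevel call would win for both of them.
shared_1 = logging.getLogger("index.search.slowlog")
shared_2 = logging.getLogger("index.search.slowlog")
```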
Armin Braun	04e3316408	Stop Resolving Fallback IndexId (#48141 ) (#48204 ) There is no reason to still resolve the fallback `IndexId` here. It only applies to `2.x` repos and those we can't read anymore anyway because they use an `/index` instead of an `/index-N` blob at the repo root for which at least 7.x+ does not contain the logic to find it.	2019-10-17 19:27:49 +02:00
Stuart Tettemer	356eef00c8	Scripting: get context names REST API (#48026 ) (#48168 ) Adds `GET /_script_context`, returning a `contexts` object with each available context as a key whose value is an empty object. eg. ``` { "contexts": { "aggregation_selector": {}, "aggs": {}, "aggs_combine": {}, ... } } ``` refs: #47411	2019-10-17 09:08:55 -06:00
Armin Braun	0ca7cc1848	Safely Close Repositories on Node Shutdown (#48020 ) (#48107 ) We were not closing repositories on Node shutdown. In production, this has little effect but in tests shutting down a node using `MockRepository` and is currently stuck in a simulated blocked-IO situation will only unblock when the node's threadpool is interrupted. This might in some edge cases (many snapshot threads and some CI slowness) result in the execution taking longer than 5s to release all the shard stores and thus we fail the assertion about unreleased shard stores in the internal test cluster. Regardless of tests, I think we should close repositories and release resources associated with them when closing a node and not just when removing a repository from the CS with running nodes as this behavior is really unexpected. Fixes #47689	2019-10-17 07:55:05 +02:00
Armin Braun	f1bc3a0753	Remove TestLogging for #46701 (#48156 ) (#48160 ) This hasn't failed in 5 weeks now. Removing the test logging and closing the issue. Closes #46701	2019-10-17 07:54:20 +02:00
Jack Conradson	fa99721295	Drop stored scripts with the old style-id (#48078 ) This PR fixes (#47593). Stored scripts with the old-style id of lang#id are saved through the upgrade process but are no longer accessible in recent versions. This fix will drop those scripts altogether since there is no way for a user to access them.	2019-10-16 16:10:31 -07:00
jimczi	b2dc98562b	Bump version to 7.6	2019-10-16 15:57:12 +02:00
Klemen Košir	8243e99134	Fix typo in QueryBuilders Javadoc. (#47362 ) This PR fixes a typo in the Javadoc for terms queries in QueryBuilders.	2019-10-15 16:16:21 -07:00
Martijn van Groningen	aff0c9babc	This commit merges (#48040 ) the enrich-7.x feature branch, which is a backport merge and adds a new ingest processor, named the enrich processor, that allows documents being ingested to be enriched with data from other indices. Besides a new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner. Related to #32789	2019-10-15 17:31:45 +02:00
jimczi	b858e19bcc	Revert #46598 that breaks the cachability of the sub search contexts.	2019-10-15 09:40:59 +02:00
Martijn van Groningen	cc4b6c43b3	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-10-15 07:23:47 +02:00
Jim Ferenczi	ef02a736ca	Don't apply the plugin's reader wrapper in can_match phase (#47816 ) This change modifies the local execution of the `can_match` phase to not apply the plugin's reader wrapper (if it is configured) when acquiring the searcher. We must ensure that the phase runs quickly and since we don't know the cost of applying the wrapper it is preferable to avoid it entirely. The can_match phase can afford false positives so it is also safe for the builtin plugins that use this functionality. Closes #46817	2019-10-14 13:07:05 +02:00
Martijn van Groningen	d4901a71d7	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-10-14 10:27:17 +02:00
Nhat Nguyen	8180cf1e68	Mute testDoNotInfinitelyWaitForMapping Tracked at #47974	2019-10-13 22:06:50 -04:00
Nhat Nguyen	2995d4a9c0	Sequence number based replica allocation (#46959 ) With this change, shard allocation prefers allocating replicas on a node that already has a copy of the shard that is as close as possible to the primary, so that it is as cheap as possible to bring the new replica in sync with the primary. Furthermore, if we find a copy that is identical to the primary then we cancel an ongoing recovery because the new copy which is identical to the primary needs no work to recover as a replica. We no longer need to perform a synced flush before performing a rolling upgrade or full cluster start with this improvement. Closes #46318	2019-10-13 22:06:50 -04:00
Nhat Nguyen	4f06225928	Avoid unneeded refresh with concurrent realtime gets (#47895 ) This change should reduce refreshes for a use-case where we perform multiple realtime gets at the same time on an active index. Currently, we only call refresh if the index operation is still on the versionMap. However, at the time we call refresh, that operation might be already or will be included in the latest reader. Hence, we do not need to refresh. Adding another lock here is not an issue as the refresh is already sequential.	2019-10-13 20:08:21 -04:00
Nhat Nguyen	4c1bb210cb	Force flush in translog retention policy test (#47879 ) If we roll translog but do not index, then a flush without force is a noop. In this case, the number of retained translog files will be higher than the value specified by the retention policy. Closes #4741	2019-10-13 20:08:21 -04:00
Przemyslaw Gomulka	6ab58de7ef	[7.x] Enable ResolverStyle.STRICT for java formatters backport(#46675 ) (#47913 ) Joda was using ResolverStyle.STRICT when parsing. This means that a date will be validated to have a correct year, month-of-year and day-of-month. However, we also want to make it work with Year-Of-Era as Joda used to, hence the custom temporalquery.localdate in DateFormatters.from. Within DateFormatters we use the correct uuuu year instead of the yyyy year of era. Worth noting: if yyyy (without an era) is used in code, the parsing result will be a TemporalAccessor which will fail to be converted into a LocalDate. We mostly use DateFormatters.from so this takes care of this. If possible the uuuu format should be used.	2019-10-11 21:19:56 +02:00
Christoph Büscher	2ef12c37f5	Add builder for distance_feature to QueryBuilders (#47846 ) The QueryBuilders convenience class is currently missing a shortcut to construct a DistanceFeatureQueryBuilder, which is added here. Closes #47767	2019-10-11 18:20:01 +02:00
Alan Woodward	ec9198d0e2	Adjust Version.V_6_8_4 to refer to Lucene 7.7.2 (#47926 ) 6.8.4 will ship with Lucene 7.7.2, so we need to change our version settings to reflect this. Relates #47901	2019-10-11 17:01:42 +01:00
David Turner	ba62eb3dce	Allow truncation of clean translog (#47866 ) Today the `elasticsearch-shard remove-corrupted-data` tool will only truncate a translog it determines to be corrupt. However there may be other cases in which it is desirable to truncate the translog, for instance if an operation in the translog cannot be replayed for some reason other than corruption. This commit adds a `--truncate-clean-translog` option to skip the corruption check on the translog and blindly truncate it.	2019-10-11 15:48:12 +01:00
Henning Andersen	a0d0866f59	Shrink should not touch max_retries (#47719 ) Shrink would set `max_retries=1` in order to avoid retrying. This however sticks to the shrunk index afterwards, causing issues when a shard copy later fails to allocate just once. Avoiding a retry of a shrink makes sense since there is no new node to allocate to and a retry will likely fail again. However, the downside of having max_retries=1 afterwards outweigh the benefit of not retrying the failed shrink a few times. This change ensures shrink no longer sets max_retries and also makes all resize operations (shrink, clone, split) leave the setting at default value rather than copy it from source.	2019-10-11 14:22:56 +02:00
Przemyslaw Gomulka	0c439fe495	[7.x] Allow partial parsing dates (#47872 ) backport(#46814 ) Enable partial parsing of date part. This is making the behaviour in java.time implementation the same as with joda. 2018, 2018-01 and 2018-01-01 are all valid dates for date_optional_time or strict_date_optional_time closes #45284 closes #47473	2019-10-11 11:17:19 +02:00
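Note on 0c439fe495: a minimal sketch of partial date parsing, accepting the three forms listed in the message. It is not the java.time formatter; `parse_partial_date` is a hypothetical helper, and defaulting a missing month or day to 1 is an assumption of this sketch.

```python
import re
from datetime import date

# Year is mandatory; month and day are optional, and a day requires a month.
_PARTIAL_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_partial_date(text):
    """Parse 'yyyy', 'yyyy-MM' or 'yyyy-MM-dd' (assumption: missing parts default to 1)."""
    m = _PARTIAL_DATE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a (partial) date: {text!r}")
    year, month, day = m.groups()
    return date(int(year), int(month or 1), int(day or 1))


parsed = [parse_partial_date(s) for s in ("2018", "2018-01", "2018-01-01")]
```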
Zachary Tong	2de3411c9c	Make sibling pipeline agg ctor's protected (#42808 ) SiblingPipelineAggregator is a public interface, but the ctor was package-private. These should be protected so that plugin authors can extend and implement their own sibling pipeline agg.	2019-10-10 12:31:14 -04:00
Martijn van Groningen	102016d571	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-10-10 14:44:05 +02:00
Jim Ferenczi	bd6e2592a7	Remove the SearchContext from the highlighter context (#47733 ) Today built-in highlighters and plugins have access to the SearchContext through the highlighter context. However most of the information exposed in the SearchContext is not needed and a QueryShardContext would be enough to perform highlighting. This change replaces the SearchContext with the information that is absolutely required by highlighters: a QueryShardContext and the SearchContextHighlight. This change reduces the exposure of the complex SearchContext and removes the need to clone it in the percolator sub phase. Relates #47198 Relates #46523	2019-10-10 10:34:10 +02:00
Jim Ferenczi	3d334a262b	Ensure that we don't call listener twice when detecting a partial failure in _search (#47694 ) This change fixes a bug that can occur when a shard failure is detected while we build the search response and accept partial failures is set to false. In this case we currently call onFailure on the provided listener but also continue the search as if the failure didn't occur. This can lead to a listener called twice, once with onFailure and once with onSuccess which is forbidden by design.	2019-10-10 09:59:49 +02:00
dengweisysu	dc4224fbdf	Sync translog without lock before trim unreferenced readers (#47790 ) This commit is similar to the optimization made in #45765. With this change, we fsync most of the data of the current generation without holding writeLock when trimming unreferenced readers. Relates #45765	2019-10-09 17:56:30 -04:00
Armin Braun	302e09decf	Simplify some Common ActionRunnable Uses (#47799 ) (#47828 ) Especially in the snapshot code there's a lot of logic chaining `ActionRunnables` in tricky ways now and the code is getting hard to follow. This change introduces two convenience methods that make it clear that a wrapped listener is invoked with certainty in some trickier spots and shortens the code a bit.	2019-10-09 23:29:50 +02:00
Igor Motov	12e4e7ef54	Geo: implement proper handling of out of bounds geo points (#47734 ) This is the first iteration in improving of handling of out of bounds geopoints with a latitude outside of the -90 - +90 range and a longitude outside of the -180 - +180 range. Relates to #43916	2019-10-09 20:30:59 +04:00
Igor Motov	f8b8afdc70	Geo: Fixes indexing of linestrings that go around the globe (#47471 ) LINESTRING (0 0, 720 20) is now decomposed into 3 strings: multilinestring ( (0.0 0.0, 180.0 5.0), (-180.0 5.0, 180 15), (-180.0 15.0, 0 20) ) It also fixes issues with linestrings that intersect antimeridian more than 5 times. Fixes #43837 Fixes #43826	2019-10-09 20:30:59 +04:00
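Note on f8b8afdc70: the decomposition quoted in the message can be reproduced with a small sketch that cuts a segment at every antimeridian crossing and interpolates the latitude linearly. This is an illustration under simplifying assumptions, not the indexing code: `split_eastward` is a hypothetical helper that handles a single eastward segment only.

```python
import math


def split_eastward(start, end):
    """Split one eastward (lon, lat) segment at each antimeridian crossing."""
    (lon0, lat0), (lon1, lat1) = start, end
    if lon1 <= lon0:
        raise ValueError("this sketch only handles eastward segments")
    # First crossing longitude (180 + 360*k) strictly east of the start.
    cross = 180.0 + 360.0 * (math.floor((lon0 - 180.0) / 360.0) + 1)
    pieces = []
    cur_lon, cur_lat = lon0, lat0
    while cross < lon1:
        t = (cross - lon0) / (lon1 - lon0)      # fraction of the segment travelled
        lat = lat0 + t * (lat1 - lat0)          # linear interpolation of latitude
        shift = cross - 180.0                   # maps this window onto [-180, 180]
        pieces.append(((cur_lon - shift, cur_lat), (180.0, lat)))
        cur_lon, cur_lat = cross, lat           # next piece restarts at -180
        cross += 360.0
    shift = cross - 180.0
    pieces.append(((cur_lon - shift, cur_lat), (lon1 - shift, lat1)))
    return pieces


pieces = split_eastward((0.0, 0.0), (720.0, 20.0))
```

For `LINESTRING (0 0, 720 20)` this yields the three pieces listed in the message: (0 0, 180 5), (-180 5, 180 15) and (-180 15, 0 20).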
Tim Brooks	d18ff24dbe	Fix BulkByScrollResponseTests exception assertions (#45519 ) Currently in the x content serialization tests we compare the exception messages that are serialized. These exceptions messages are not equivalent because the exception often changes when serialized to x content. This commit removes this assertion.	2019-10-09 10:15:58 -06:00
Tim Brooks	02622c1ef9	Fix issues with serializing BulkByScrollResponse (#45357 ) Currently there are two issues with serializing BulkByScrollResponse. First, when deserializing from XContent, indexing exceptions and search exceptions are switched. Additionally, search exceptions do not retain the appropriate RestStatus code, so you must evaluate the status code from the exception. However, the exception class is not always correctly retained when serialized. This commit adds tests in the failure case. Additionally, fixes the swapping of failure types and adds the rest status code to the search failure.	2019-10-09 10:12:14 -06:00
Martijn van Groningen	da1e2ea461	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-10-09 09:06:13 +02:00
Armin Braun	96b36b5a8c	Make loadShardSnapshot Exceptions Consistent (#47728 ) (#47735 ) Similar to #47507. We are throwing `SnapshotException` when you (and SLM tests) would expect a `SnapshotMissingException` for concurrent snapshot status and snapshot delete operations with a very low probability. Fixed the exception type and added a test for this scenario.	2019-10-08 21:04:51 +02:00
Armin Braun	5cef4752f7	Fix Ex. Handling in SnapshotsService#snapshots (#47507 ) (#47727 ) We're needlessly wrapping a `SnapshotMissingException` which itself is a `SnapshotException` when trying to load a missing snapshot. This leads to failure #47442 which expects a `SnapshotMissingException` in this case. Closes #47442	2019-10-08 17:01:54 +02:00
Henning Andersen	ce91ba7c25	Dangling indices strip aliases (#47581 ) Importing dangling indices with aliases risks breaking functionalities using those aliases. For instance, writing to an alias may break if there is no is_write_index indication on the existing alias and the dangling index import adds a second index to the alias. Or an application could have an assumption about the alias only ever pointing to one index and suddenly seeing the alias also linked to an old index could break it. With this change we strip aliases of the index meta data found before importing a dangling index.	2019-10-08 12:09:30 +02:00
David Turner	bb5f750ab4	Deprecate include_relocations setting (#47443 ) Setting `cluster.routing.allocation.disk.include_relocations` to `false` is a bad idea since it will lead to the kinds of overshoot that were otherwise fixed in #46079. This commit deprecates this setting so it can be removed in the next major release.	2019-10-08 08:19:04 +01:00
Tal Levy	a17f394e27	Geo-Match Enrich Processor (#47243 ) (#47701 ) this commit introduces a geo-match enrich processor that looks up a specific `geo_point` field in the enrich-index for all entries that have a geo_shape match field that meets some specific relation criteria with the input field. For example, the enrich index may contain documents with zipcodes and their respective geo_shape. Ingesting documents with a geo_point field can be enriched with which zipcode they associate according to which shape they are contained within. this commit also refactors some of the MatchProcessor by moving a lot of the shared code to AbstractEnrichProcessor. Closes #42639.	2019-10-07 15:03:46 -07:00
Armin Braun	b669b8f046	Simplify Snapshot Delete Further (#47626 ) (#47644 ) This change removes the special path for deleting the index metadata blobs and moves deleting them to the bulk delete of unreferenced blobs at the end of the snapshot delete process. This saves N RPC calls for a snapshot containing N indices and simplifies the code. Also, this change moves the unreferenced data cleanup up the stack to make it more obvious that any exceptions during this phase will be ignored and not fail the delete request. Lastly, this change removes the needless chaining of first deleting unreferenced data from the snapshot delete and then running the stale data cleanup (that would also run from the cleanup endpoint) and simply fires off the cleanup right after updating the repository data (index-N) in parallel to the other delete operations to speed up the delete some more.	2019-10-07 14:18:41 +02:00
Armin Braun	1359ef73a3	Add IT for Snapshot Issue in 47552 (#47627 ) (#47634 ) * Add IT for Snapshot Issue in 47552 (#47627) Adding a specific integration test that reproduces the problem fixed in #47552. The issue fixed otherwise only reproduces in the snapshot resiliency tests, which are not available in 6.8 where the fix is being backported to as well.	2019-10-07 10:38:19 +02:00
Armin Braun	6bd033931b	Add Consistency Assertion to SnapshotsInProgress (#47598 ) (#47633 ) Assert given input shards and indices are consistent. Also, fixed the equality check for SnapshotsInProgress. Before this change the tests never had more than a single waiting shard per index so they never failed as a result of the waiting shards list not being ordered. Follow up to #47552	2019-10-07 10:37:56 +02:00
Luca Cavanna	736fceb18b	Fold InitialSearchPhase into AbstractSearchAsyncAction (#47182 ) Historically, we have two base classes for search actions that generally need to fan out to multiple shards and then move on to the following phase: InitialSearchPhase and AbstractSearchAsyncAction that extends it. Practically, every search action extends the latter, and there are no direct subclasses of InitialSearchPhase in our codebase. This commit folds InitialSearchPhase into AbstractSearchAsyncAction in the attempt of simplifying things and making the search code running on the coordinating node easier to reason about.	2019-10-07 10:10:04 +02:00
Martijn van Groningen	f2f2304c75	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-10-07 10:07:56 +02:00
Armin Braun	22679c7932	Fix Snapshot Corruption in Edge Case (#47552 ) (#47620 ) This fixes shard snapshots not being marked as failures when multiple data-nodes are lost during the snapshot process or shard snapshot failures have occurred before a node left the cluster. The problem was that we were simply not adding any shard entries for completed shards on node-left events. This has no effect for a successful shard, but for a failed shard would lead to that shard not being marked as failed during snapshot finalization. Fixed by correctly keeping track of all previous completed shard states as well in this case. Also, added an assertion that without this fix would trip on almost every run of the resiliency tests and adjusted the serialization of SnapshotsInProgress.Entry so we have a proper assertion message. Closes #47550	2019-10-05 15:01:06 +02:00
Armin Braun	f2d2ca21e2	Cleaner Handling of Store Refcount in BlobStoreRepository (#47560 ) (#47594 ) If a shard gets closed we properly abort its snapshot before closing it. We should in this case make sure to not throw a confusing exception about trying to increment the reference on an already closed shard in the async tasks if the snapshot is already aborted. Also, added an assertion to make sure that aborts are in fact the only situation in which we run into a concurrently closed store.	2019-10-05 09:45:10 +02:00
Gordon Brown	e47bdf760e	Fix Rollover error when alias has closed indices (#47148 ) (#47539 ) Rollover previously requested index stats for all indices in the provided alias, which causes an exception when there is a closed index with that alias. This commit adjusts the IndicesOptions used on the index stats request so that closed indices are ignored, rather than throwing an exception.	2019-10-04 17:40:05 -06:00
Jason Tedor	35ca3d68d7	Validating monitoring hosts setting while parsing (#47571 ) This commit lifts the validation of the monitoring hosts setting into the setting itself, rather than when the setting is used. This prevents a scenario where an invalid value for the setting is accepted, but then later fails while applying a cluster state with the invalid setting.	2019-10-04 17:32:49 -04:00
Mark Tozzi	e404f7ea80	DocValueFormat implementation for date range fields (#47472 ) (#47605 )	2019-10-04 17:21:17 -04:00
Lee Hinman	79376b7219	Set default SLM retention invocation time (#47604 ) This adds a default for the `slm.retention_schedule` setting, setting it to `0 30 1 * * ?` which is 1:30am every day. Having retention unset meant that it would never be invoked and clean up snapshots. We determined it would be better to have a default than never to be run. When coming to a decision, we weighed the option of an absolute time (such as 1:30am) versus a periodic invocation (like every 12 hours). In the end we decided on the absolute time because it has better predictability and consistency than a periodic invocation, which would rely on when the master node were elected or restarted. Relates to #43663	2019-10-04 15:00:20 -06:00
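Note on 79376b7219: the default `0 30 1 * * ?` reads as 1:30am every day once the six fields are named. A minimal sketch that only labels the fields (seconds first, as in Quartz-style cron expressions); `describe_cron` is a hypothetical helper and does no scheduling.

```python
CRON_FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Label the six whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = describe_cron("0 30 1 * * ?")
# second=0, minute=30, hour=1; '*' for day_of_month and month means every day
# of every month, and '?' leaves the day of week unspecified.
```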
Armin Braun	c1be7a802c	Simplify Snapshot Delete Process (#47439 ) (#47533 ) We don't need to read the SnapshotInfo for a snapshot to determine the indices that need to be updated when it is deleted as the `RepositoryData` contains that information already. This PR makes it so the `RepositoryData` is used to determine which indices to update and also removes the special handling for deleting snapshot metadata and the CS snapshot blob and has those simply be deleted as part of the deleting of other unreferenced blobs in the last step of the delete. This makes the snapshot delete a little faster and more resilient by removing two RPC calls (the separate delete and the get). Also, this shortens the diff with #46250 as a side-effect.	2019-10-04 13:55:16 +02:00
David Roberts	defc97a300	Remove fallback for controller location (#47104 ) This change removes the temporary controller location fallback introduced in #47013. Relates elastic/ml-cpp#593	2019-10-04 09:50:26 +01:00
Ryan Ernst	f32692208e	Add explanations to script score queries (#46693 ) (#47548 ) While function scores using scripts do allow explanations, they are only creatable with an expert plugin. This commit improves the situation for the newer script score query by adding the ability to set the explanation from the script itself. To set the explanation, a user would check for `explanation != null` to indicate an explanation is needed, and then call `explanation.set("some description")`.	2019-10-03 21:05:05 -07:00
Nhat Nguyen	5e4732f2bb	Limit number of retaining translog files for peer recovery (#47414 ) Today we control the extra translog (when soft-deletes is disabled) for peer recoveries by size and age. If users manually (force) flush many times within a short period, we can keep many small (or empty) translog files as neither the size nor the age condition is reached. We can protect the cluster from running out of file descriptors in such a situation by limiting the number of retained translog files.	2019-10-03 20:45:29 -04:00
Armin Braun	bac119f672	Fix getSnapshotIndexMetaData Exception Behavior (#47488 ) (#47496 ) If we fail to read the global metadata in a snapshot we would throw `SnapshotMissingException` but wouldn't do so for the index metadata. This is breaking SLM tests at a low rate because they use `SnapshotMissingException` thrown from snapshot status APIs to wait for a snapshot being gone. Also, we should be consistent here in general and not leak the `NoSuchFileException` to the transport layer for index meta. Closes #46508	2019-10-03 12:47:50 +02:00
Armin Braun	7549be4489	Fix es.http.cname_in_publish_address Deprecation Logging (#47451 ) Since the property defaulted to `true` this deprecation logging runs every time unless it's set to `false` manually (in which case it should've also logged but didn't). I didn't add a test and removed the tests we had in `7.x` that covered this logging. I did move the check out of the `if (InetAddresses.isInetAddress(hostString) == false) {` condition so this is sort-of covered by the REST tests. IMO, any unit-test of this would be somewhat redundant and would've forced adding a field that just indicates that the deprecated property was used to every instance which seemed pointless. Closes #47436	2019-10-03 11:10:48 +02:00
Alpar Torok	0a14bb174f	Remove eclipse conditionals (#44075 ) * Remove eclipse conditionals We used to have some meta projects with a `-test` prefix because historically eclipse could not distinguish between test and main source-sets and could only use a single classpath. This is no longer the case for the past few Eclipse versions. This PR adds the necessary configuration to correctly categorize source folders and libraries. With this change eclipse can import projects, and the visibility rules are correct, e.g. auto complete doesn't offer classes from test code or `testCompile` dependencies when editing classes in `main`. Unfortunately the cyclic dependency detection in Eclipse doesn't seem to take the difference between test and non test source sets into account, but since we are checking this in Gradle anyhow, it's safe to set to `warning` in the settings. Unfortunately there is no setting to ignore it. This might cause problems when building since Eclipse will probably not know the right order to build things in so more work might be necessary.	2019-10-03 11:55:00 +03:00
Armin Braun	0beb5263b4	Fix Snapshot Finalization not Waiting for Index Metadata (#47445 ) (#47459 ) * Fix Snapshot Finalization not Waiting for Index Metadata We were mixing up the listeners here which led to the final listener that should be called after all the metadata has been written to be called before that. I fixed this by removing the one redundant listener and flattening the logic out. * Closes #47425	2019-10-02 23:26:18 +02:00
Jason Tedor	52b97ec539	Allow setting validation against arbitrary types (#47264 ) Today when settings validate, they can only validate against settings that are of the same type. While this strong-type is convenient from a development perspective, it is too limiting in that some settings need to validate against settings of a different type. For example, the list setting xpack.monitoring.exporters.<namespace>.host wants to validate that it is non-empty if and only if the string setting xpack.monitoring.exporters.<namespace>.type is "http". Today this is impossible since the settings validation framework only allows that setting to validate against other list settings. This commit increases the flexibility here to validate against settings of arbitrary type, at the expense of losing strong-typing during development.	2019-10-02 16:31:06 -04:00
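Note on 52b97ec539: the cross-type rule used as the example in the message (the `host` list is non-empty if and only if the `type` string is "http") can be sketched as a plain check over a settings map. This is not the settings validation framework; `validate_exporter` is a hypothetical helper and the settings are modelled as a Python dict.

```python
def validate_exporter(settings, namespace):
    """Check that the host list is non-empty if and only if the type is 'http'."""
    prefix = f"xpack.monitoring.exporters.{namespace}"
    exporter_type = settings.get(f"{prefix}.type")   # a string setting
    hosts = settings.get(f"{prefix}.host", [])       # a list setting
    is_http = exporter_type == "http"
    if is_http and not hosts:
        raise ValueError(f"[{prefix}.host] must be set when type is [http]")
    if hosts and not is_http:
        raise ValueError(f"[{prefix}.host] requires [{prefix}.type] to be [http]")


# Passes: an http exporter with a host, and a non-http exporter without one.
validate_exporter({
    "xpack.monitoring.exporters.e1.type": "http",
    "xpack.monitoring.exporters.e1.host": ["localhost:9200"],
}, "e1")
validate_exporter({"xpack.monitoring.exporters.e2.type": "local"}, "e2")
```

The validator for one setting has to read a setting of a different type, which is the flexibility the commit adds.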
Jim Ferenczi	c340814b34	Fix highlighting of overlapping terms in the unified highlighter (#47227 ) The passage formatter that the unified highlighter uses doesn't handle terms with overlapping offsets. For tokenizers that provide multiple segmentations of the same terms (edge ngram for instance) the formatter should select the largest span in order to highlight the term only once. This change implements this logic.	2019-10-02 16:34:12 +02:00
Yannick Welsch	f7980e9745	Adapt version constants after backport (#47353 )	2019-10-02 14:26:23 +02:00
Yannick Welsch	99d2fe295d	Use optype CREATE for single auto-id index requests (#47353 ) Changes auto-id index requests to use optype CREATE, making it compliant with our docs. This will also make these auto-id index requests compatible with the new "create-doc" index privilege (which is based on the optype). The default optype is changed to create, just as it is already documented.	2019-10-02 14:16:52 +02:00
Yannick Welsch	0024695dd8	Disallow externally generated autoGeneratedTimestamp (#47341 ) The autoGeneratedTimestamp field is internally used to speed up indexing of operations with auto-ids, as we can rule out duplicates. Setting this field externally can make the index inconsistent, resulting in duplicate documents with same id.	2019-10-02 14:16:52 +02:00
Yannick Welsch	8c11fe610e	Use standard semantics for retried auto-id requests (#47311 ) Adds support for handling auto-id requests with optype CREATE. Also simplifies the code handling this by using the standard indexing path when dealing with possible retry conflicts. Relates #47169	2019-10-02 14:16:52 +02:00
Yannick Welsch	7b2613db55	Allow optype CREATE for append-only indexing operations (#47169 ) Bulk requests currently do not allow adding "create" actions with auto-generated IDs. This commit allows using the optype CREATE for append-only indexing operations. This is mainly the user facing aspect of it.	2019-10-02 14:16:52 +02:00
Jim Ferenczi	42c5054e52	Fix alias field resolution in match query (#47369 ) Synonym queries (when two tokens/paths start at the same position) use the alias field instead of the concrete field to build Lucene queries. This commit fixes this bug by resolving the alias field upfront in order to provide the concrete field to the actual query parser.	2019-10-02 11:45:43 +02:00
Nhat Nguyen	5cfcd7c458	Re-fetch shard info of primary when new node joins (#47035 ) Today, we don't clear the shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on the stale information of the primary. The serious problem is that we can cancel the current recovery which is more advanced than the copy on the new node due to the old info we have from the primary. With this change, we ensure the shard info from the primary is not older than any node when allocating replicas. Relates #46959 This work was done by Henning in #42518. Co-authored-by: Henning Andersen <henning.andersen@elastic.co>	2019-10-01 22:16:26 -04:00
Gordon Brown	ba6ee2d40d	[7.x] Adjust randomization in cluster shard limit tests (#47254 ) This commit adjusts randomization for the cluster shard limit tests so that there is often more of a gap left between the limit and the size of the first index. This allows the same randomization to be used for all tests, and alleviates flakiness in `testIndexCreationOverLimitFromTemplate`.	2019-10-01 14:53:10 -06:00
David Turner	99b25d3740	Keep nodes above watermark in testAutomaticReleaseOfIndexBlock (#47387 ) Today the comment boldly claims that this line of code keeps nodes above the 10-byte low watermark when in fact this is not true at all. This change fixes this so that it really does keep nodes above the low watermark. Fixes #45338. Again.	2019-10-01 19:58:23 +01:00
Armin Braun	3d6ef6a90e	Speed up and Reorder Snapshot Delete Operations (#47293 ) (#47350 ) This is a preliminary of #46250 making the snapshot delete work by doing all the metadata updates first and then bulk deleting all of the now unreferenced blobs. Before this change, the metadata updates for each shard and subsequent deletion of the blobs that have become unreferenced due to the delete would happen sequentially shard-by-shard parallelising only over all the indices in the snapshot. This change makes it so that all the metadata updates happen in parallel on a shard level first. Once all of the updates of shard-level metadata have finished, all the now unreferenced blobs are deleted in bulk. This has two benefits (outside of making #46250 a smaller change): * We have a lower likelihood of failing to update shard level metadata because it happens with priority and a higher degree of parallelism * Deleting of unreferenced data in the shards should go much faster in many cases (rolling indices, large number of indices with many unchanged shards) as well because a number of small bulk deletions (just two blobs for `index-N` and `snap-` for each unchanged shard) are grouped into larger bulk deletes of `100-1000` blobs depending on Cloud provider (even though the final bulk deletes are happening sequentially this should be much faster in almost all cases as you'd need parallelism of 50 (GCS) to 500 (S3) snapshot threads to achieve the same delete rates when deleting from unchanged shards).	2019-10-01 19:05:43 +02:00
Colin Goodheart-Smithe	c93b39c65b	Adds version 7.4.1	2019-10-01 16:03:11 +01:00
Howard	a9cd42c05d	Cancel recoveries even if all shards assigned (#46520 ) We cancel ongoing peer recoveries if a node joins the cluster with a completely up-to-date copy of a shard, because we can use such a copy to recover a replica instantly. However, today we only look for recoveries to cancel while there are unassigned shards in the cluster. This means that we do not contemplate the cancellation of the last few recoveries since recovering shards are not unassigned. It might take much longer for these recoveries to complete than would be necessary if they were cancelled. This commit fixes this by checking for cancellable recoveries even if all shards are assigned.	2019-10-01 10:55:32 +01:00
Ignacio Vera	03d717dc32	Provide better error when updating geo_shape field mapper settings (#47281 ) (#47338 )	2019-10-01 10:52:39 +02:00
Yannick Welsch	dd0af2e425	Fix CloseIndexIT.testRelocatedClosedIndexIssue (#47169 ) Closes #47330	2019-10-01 08:34:27 +02:00
Armin Braun	3d23cb44a3	Speed up Snapshot Finalization (#47283 ) (#47309 ) As a result of #45689 snapshot finalization started to take significantly longer than before. This may be a little unfortunate since it increases the likelihood of failing to finalize after having written out all the segment blobs. This change parallelizes all the metadata writes that can safely run in parallel in the finalization step to speed the finalization step up again. Also, this will generally speed up the snapshot process overall in case of large number of indices. This is also a nice to have for #46250 since we add yet another step (deleting of old index- blobs in the shards) to the finalization.	2019-09-30 23:28:59 +02:00
Jason Tedor	890951113f	Make Setting#getRaw have private access (#47287 ) The method Setting#getRaw leaks implementation details about settings, namely that they are backed by strings. We do not want code to rely upon this, so this commit makes Setting#getRaw private as a first step towards hiding the implementation details of settings from the rest of the codebase.	2019-09-30 14:14:30 -04:00
David Turner	72b63635de	Remove unused pluggable metadata upgraders (#47277 ) Today plugins may provide upgraders for custom metadata and index metadata, but these upgraders are bypassed during a rolling restart. Fortunately this extension mechanism is unused by all known plugins. This commit removes these extension points. Relates #47297	2019-09-30 16:58:29 +01:00
Gaurav614	052c523d41	Fail allocation of new primaries in empty cluster (#43284 ) Today if you create an index in a cluster without any data nodes then it will report yellow health because it never attempts to assign any shards if there are no data nodes, so the new shards remain at `AllocationStatus.NO_ATTEMPT`. This commit moves the new primaries to `AllocationStatus.DECIDERS_NO` in this situation, causing the cluster health to move to red. Fixes #41073	2019-09-30 14:27:12 +01:00
Yannick Welsch	467596871a	Omit writing index metadata for non-replicated closed indices on data-only node (#47285 ) Fixes a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices (but incorrectly handles replicated closed indices). On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is then detected at startup and prevents the node from starting up. Closes #47276	2019-09-30 13:56:52 +02:00
Przemyslaw Gomulka	d9a7bcef21	Support optional parsers in any order with DateMathParser Backport(46654) (#47217 ) Currently DateMathParser with roundUp = true is relying on the DateFormatter built with combined optional sub parsers with defaulted fields (depending on the formatter). That means that for yyyy-MM-dd'T'HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss.SSS the java.time implementation expects optional parsers in order from most specific to least specific (reverse in the example above). It is causing a problem because the first parsing succeeds but does not consume the full input. The second parser should be used. We can work around this by keeping a list of RoundUpParsers and iterating over them, choosing the one that parsed the full input. The same approach we used for regular (non date math) in relates #40100 The jdk is not considering this to be a bug https://bugs.openjdk.java.net/browse/JDK-8188771 Those below will expect this change first relates #46242 relates #45284 backport #46654	2019-09-30 13:54:52 +02:00
Yannick Welsch	9dc90e41fc	Remove "force" version type (#47228 ) It's been deprecated long ago and can be removed. Relates to #20377 Closes #19769	2019-09-30 11:58:34 +02:00
Martijn van Groningen	66f72bcdbc	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-09-30 08:12:28 +02:00
Rory Hunter	53a4d2176f	Convert most awaitBusy calls to assertBusy (#45794 ) (#47112 ) Backport of #45794 to 7.x. Convert most `awaitBusy` calls to `assertBusy`, and use asserts where possible. Follows on from #28548 by @liketic. There were a small number of places where it didn't make sense to me to call `assertBusy`, so I kept the existing calls but renamed the method to `waitUntil`. This was partly to better reflect its usage, and partly so that anyone trying to add a new call to awaitBusy wouldn't be able to find it. I also didn't change the usage in `TransportStopRollupAction` as the comments state that the local awaitBusy method is a temporary copy-and-paste. Other changes: * Rework `waitForDocs` to scale its timeout. Instead of calling `assertBusy` in a loop, work out a reasonable overall timeout and await just once. * Some tests failed after switching to `assertBusy` and had to be fixed. * Correct the expected templates in AbstractUpgradeTestCase. The ES Security team confirmed that they don't use templates any more, so remove this from the expected templates. Also rewrite how the setup code checks for templates, in order to give more information. * Remove an expected ML template from XPackRestTestConstants The ML team advised that the ML tests shouldn't be waiting for any `.ml-notifications` templates, since such checks should happen in the production code instead. Also rework the template checking code in `XPackRestTestHelper` to give more helpful failure messages. * Fix issue in `DataFrameSurvivesUpgradeIT` when upgrading from < 7.4	2019-09-29 12:21:46 +01:00
Jason Tedor	98989f7b37	Use fallback settings in throttling decider (#47261 ) This commit replaces some uses of Setting#getRaw in the throttling allocation decider settings. Instead, these settings should be using fallback settings.	2019-09-28 08:06:24 -04:00
Jason Tedor	bd603b0a7b	Remove dead leniency in allow rebalance setting use (#47259 ) This commit removes some leniency that exists in getting the allow rebalance setting. Fortunately, that leniency is dead code, this can never happen. The reason this can never happen is because the settings infrastructure will not allow setting an invalid value for this setting. If you try to set this in the elasticsearch.yml, then the node will fail to start, since parsing the setting will fail. If you try to set this via an update settings API call, then parsing the setting will fail and the settings update will be rejected. Therefore, this leniency can never be activated, so we remove it. This commit is the first of a few in an attempt to remove the public uses of Setting#getRaw.	2019-09-28 08:05:37 -04:00
Martijn van Groningen	76b66634e9	put provided argument on the previous line just like in master branch, that way this doesn't show in the final pr.	2019-09-27 15:22:09 +02:00
David Roberts	e943e27954	Spawn controller processes from a different directory on macOS (#47013 ) This is the Java side of https://github.com/elastic/ml-cpp/pull/593 with a fallback so that ml-cpp bundles with either the new or old directory structure work for the time being. A few days after merging the C++ changes a followup to this change will be made that removes the fallback.	2019-09-27 14:02:40 +01:00
Martijn van Groningen	7ffe2e7e63	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-09-27 14:42:11 +02:00
Yannick Welsch	6fd3b4723f	Remove write lock for Translog.getGeneration (#47036 ) No need for the write lock, and currentFileGeneration is already protected by the read lock. Also removes the unused method "isCurrent".	2019-09-27 13:58:07 +02:00
Jim Ferenczi	73a09b34b8	Replace SearchContextException with SearchException (#47046 ) This commit removes the SearchContextException in favor of a simpler SearchException that doesn't leak the SearchContext. Relates #46523	2019-09-26 14:21:23 +02:00
Tanguy Leroux	95e2ca741e	Remove unused private methods and fields (#47154 ) This commit removes a bunch of unused private fields and unused private methods from the code base. Backport of (#47115)	2019-09-26 12:49:21 +02:00
jimczi	97d977f381	#47046 Fix serialization version check after backport	2019-09-26 09:56:24 +02:00
Jim Ferenczi	04972baffa	Merge ShardSearchTransportRequest and ShardSearchLocalRequest (#46996 ) (#47081 ) This change merges the `ShardSearchTransportRequest` and `ShardSearchLocalRequest` into a single `ShardSearchRequest` that can be used to create a SearchContext. Relates #46523	2019-09-26 09:20:53 +02:00
Martijn van Groningen	429f23ea2f	Allow ingest processors to execute in a non blocking manner. (#47122 ) Backport of #46241 This PR changes the ingest executing to be non blocking by adding an additional method to the Processor interface that accepts a BiConsumer as handler and changing IngestService#executeBulkRequest(...) to ingest document in a non blocking fashion iff a processor executes in a non blocking fashion. This is the second PR that merges changes made to server module from the enrich branch (see #32789) into the master branch. The plan is to merge changes made to the server module separately from the pr that will merge enrich into master, so that these changes can be reviewed in isolation. This change originates from the enrich branch and was introduced there in #43361.	2019-09-26 08:55:28 +02:00
David Turner	45c7783018	Warn on slow metadata persistence (#47130 ) Today if metadata persistence is excessively slow on a master-ineligible node then the `ClusterApplierService` emits a warning indicating that the `GatewayMetaState` applier was slow, but gives no further details. If it is excessively slow on a master-eligible node then we do not see any warning at all, although we might see other consequences such as a lagging node or a master failure. With this commit we emit a warning if metadata persistence takes longer than a configurable threshold, which defaults to `10s`. We also emit statistics that record how much index metadata was persisted and how much was skipped since this can help distinguish cases where IO was slow from cases where there are simply too many indices involved. Backport of #47005.	2019-09-26 07:40:54 +01:00
Tim Brooks	4f47e1f169	Extract proxy connection logic to specialized class (#47138 ) Currently the logic to check if a connection to a remote discovery node exists and otherwise create a proxy connection is mixed with the collect nodes, cluster connection lifecycle, and other RemoteClusterConnection logic. This commit introduces a specialized RemoteConnectionManager class which handles the open connections. Additionally, it reworks the "round-robin" proxy logic to create the list of potential connections at connection open/close time, opposed to each time a connection is requested.	2019-09-25 15:58:18 -06:00
Nhat Nguyen	7c5a088aa5	Increase ensureGreen timeout for testReplicaCorruption (#47136 ) We can have a large number of shard copies in this test. For example, the two recent failures have 24 and 27 copies respectively and all replicas have to copy segment files as their stores are corrupted. Our CI needs more than 30 seconds to start all these copies. Note that in two recent failures, the cluster was green just after the cluster health timed out. Closes #41899	2019-09-25 17:04:08 -04:00
Lee Hinman	a267df30fa	Wait for snapshot completion in SLM snapshot invocation (#47051 ) * Wait for snapshot completion in SLM snapshot invocation This changes the snapshots internally invoked by SLM to wait for completion. This allows us to capture more snapshotting failure scenarios. For example, previously a snapshot would be created and then registered as a "success", however, the snapshot may have been aborted, or it may have had a subset of its shards fail. These cases are now handled by inspecting the response to the `CreateSnapshotRequest` and ensuring that there are no failures. If any failures are present, the history store now stores the action as a failure instead of a success. Relates to #38461 and #43663	2019-09-25 14:25:22 -06:00
Armin Braun	93fcd23da8	Fail Snapshot on Corrupted Metadata Blob (#47009 ) (#47096 ) We should not be quietly ignoring a corrupted shard-level index-N blob. Simply creating a new empty shard-level index-N and moving on means that all snapshots of that shard show `SUCCESS` as their state at the repository root but are in fact broken. This change at least makes it visible to the user that they can't snapshot the given shard any more and forces the user to move on to a new repository since the current one is broken and will not allow snapshotting the inconsistent shard again. Also, this change stops the delete action for shards with broken index-N blobs instead of simply deleting all blobs in the path containing the broken index-N. This prevents a temporarily broken/missing index-N blob from corrupting all snapshots of that shard.	2019-09-25 15:55:33 +02:00
Nhat Nguyen	22575bd7e6	Remove isRecovering method from Engine (#47039 ) We already prevent flushing in Engine if it's recovering. Hence, we can remove the protection in IndexShard.	2019-09-25 08:58:08 -04:00
Armin Braun	c4a166fc9a	Simplify SnapshotResiliencyTests (#46961 ) (#47108 ) Simplify `SnapshotResiliencyTests` to more closely match the structure of `AbstractCoordinatorTestCase` and allow for future drying up between the two classes: * Make the test cluster nodes a nested-class in the test cluster itself * Remove the needless custom network disruption implementation and simply track disconnected node ids like `AbstractCoordinatorTestCase` does	2019-09-25 14:53:11 +02:00
Yannick Welsch	81cbd3fba4	Mute ClusterShardLimitIT.testIndexCreationOverLimitFromTemplate Relates #47107	2019-09-25 14:03:08 +02:00
David Turner	ac920e8e64	Assert no exceptions during state application (#47090 ) Today we log and swallow exceptions during cluster state application, but such an exception should not occur. This commit adds assertions of this fact, and updates the Javadocs to explain it. Relates #47038	2019-09-25 12:32:51 +01:00
Martijn van Groningen	eef1ba3fad	Make ingest pipeline resolution logic unit testable (#47026 ) Extracted ingest pipeline resolution logic into a static method and added unit tests for pipeline resolution logic. Followup from #46847	2019-09-25 11:35:00 +02:00
Daniel Mitterdorfer	48df560593	Emit log message when parent circuit breaker trips (#47000 ) (#47073 ) We emit a debug log message whenever a child circuit breaker trips (in `ChildMemoryCircuitBreaker#circuitBreak(String, long)`) but we never emit a log message when the parent circuit breaker trips. As this is more likely to happen with the real memory circuit breaker it is not possible to detect this in the logs. With this commit we add a log message on the same log level (debug) when the parent circuit breaker trips.	2019-09-25 10:22:46 +02:00
Julie Tibshirani	41ee8aa6fc	Reject regexp queries on the _index field. (#46945 ) We speculatively added support for `regexp` queries on the `_index` field in #34089 (this functionality was not actually requested by a user). Supporting regex logic adds complexity to the `_index` field for not much gain, so we would like to remove it. From an end-to-end test it turns out this functionality never even worked in the first place because of an error in how regex flags were interpreted! For this reason, we can remove support for `regexp` on `_index` without a deprecation period. Relates to #46640.	2019-09-24 12:17:00 -07:00
Tim Brooks	71ec0707cf	Remove locking around connection attempts (#46845 ) Currently in the ConnectionManager we lock around the node id. This is odd because we key connections by the ephemeral id. Upon further investigation it appears to me that we do not need the locking. Using the concurrent map, we can ensure that only one connection attempt completes. There is a very small chance that a new connection attempt will proceed right as another connection attempt is completing. However, since the whole process is asynchronous and event oriented (lightweight), that does not seem to be an issue.	2019-09-24 11:05:42 -06:00
Tim Brooks	f02582de4b	Reduce a bind failure to trace logging (#46891 ) Due to recent changes in the nio transport, a failure to bind the server channel has started to be logged at an error level. This exception leads to an automatic retry on a different port, so it should only be logged at a trace level.	2019-09-24 10:32:18 -06:00
David Turner	9135e2f9e3	Improve LeaderCheck rejection messages (#46998 ) Today the `LeaderChecker` rejects checks from nodes that are not in the current cluster with the exception message `leader check from unknown node` which offers no information about why the node is unknown. In fact the node must have been in the cluster in the recent past, so it might help guide the user to a more useful log message if we describe it as a `removed node` instead of an `unknown node`. This commit changes the exception message like this, and also tidies up a few other loose ends in the `LeaderChecker`.	2019-09-24 13:41:37 +01:00
David Turner	6943a3101f	Cut PersistedState interface from GatewayMetaState (#46655 ) Today `GatewayMetaState` implements `PersistedState` but it's an error to use it as a `PersistedState` before it's been started, or if the node is master-ineligible. It also holds some fields that are meaningless on nodes that do not persist their states. Finally, it takes responsibility for both loading the original cluster state and some of the high-level logic for writing the cluster state back to disk. This commit addresses these concerns by introducing a more specific `PersistedState` implementation for use on master-eligible nodes which is only instantiated if and when it's appropriate. It also moves the fields and high-level persistence logic into a new `IncrementalClusterStateWriter` with a more appropriate lifecycle. Follow-up to #46326 and #46532 Relates #47001	2019-09-24 12:31:13 +01:00
Julie Tibshirani	9124c94a6c	Add support for aliases in queries on _index. (#46944 ) Previously, queries on the _index field were not able to specify index aliases. This was a regression in functionality compared to the 'indices' query that was deprecated and removed in 6.0. Now queries on _index can specify an alias, which is resolved to the concrete index names when we check whether an index matches. To match a remote shard target, the pattern needs to be of the form 'cluster:index' to match the fully-qualified index name. Index aliases can be specified in the following query types: term, terms, prefix, and wildcard.	2019-09-23 13:21:37 -07:00
Jim Ferenczi	08f28e642b	Replace SearchContext with QueryShardContext in query builder tests (#46978 ) This commit replaces the SearchContext used in AbstractQueryTestCase with a QueryShardContext in order to reduce the visibility of search contexts. Relates #46523	2019-09-23 20:24:02 +02:00
Eray	199fff8a55	Allow max_children only in top level nested sort (#46731 ) This commit restricts the usage of max_children to the top level nested sort since it is ignored on the other levels.	2019-09-23 18:53:50 +02:00
Armin Braun	2da040601b	Fix Bug in Snapshot Status Response Timestamps (#46919 ) (#46970 ) Fixing a corner case where snapshot total time calculation was off when getting the `SnapshotStatus` of an in-progress snapshot. Closes #46913	2019-09-23 15:01:47 +02:00
David Turner	7bc86f23ec	Wait longer for leader failure in logs test (#46958 ) `testLogsWarningPeriodicallyIfClusterNotFormed` simulates a leader failure and waits for long enough that a failing leader check is scheduled. However it does not wait for the failing check to actually fail, which requires another two actions and therefore might take up to 200ms more. Unlucky timing would result in this test failing, for instance: ./gradle ':server:test' \ --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testLogsWarningPeriodicallyIfClusterNotFormed" \ -Dtests.jvm.argline="-Dhppc.bitmixer=DETERMINISTIC" \ -Dtests.seed=F18CDD0EBEB5653:E9BC1A8B062E697A This commit adds the extra delay needed for the leader failure to complete as expected. Fixes #46920	2019-09-23 10:52:13 +01:00
Martijn van Groningen	33bbc4798b	fixed compile errors after merging	2019-09-23 09:46:14 +02:00
Martijn van Groningen	0cfddca61d	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-09-23 09:46:05 +02:00
Armin Braun	ee4e6b1382	Add TestLogging for #46701 (#46939 ) (#46949 ) This fails at a very low rate and with the force merge in place before checking the cache size it's not clear why the cache is not of size `0` -> seems something else must be happening here that is unexpected. -> add debug logging to this test to find out Relates #46701	2019-09-21 15:24:58 +02:00
Armin Braun	938648fcff	Remove Duplicate Shard Snapshot State Updates (#46862 ) (#46906 ) We were repeatedly trying to send shard state updates for aborted snapshots on every cluster state update. This is simply dead-code since those updates are already safely sent in the callbacks passed to `SnapshotShardsService#snapshot`. On master failover, we ensure that the status update is resent via `SnapshotShardsService#syncShardStatsOnNewMaster`. => there is no need for trying to send updates here over and over and this logic can safely be removed	2019-09-20 14:30:03 +02:00
Jason Tedor	97acf353fa	Move pipelines resolved assertion (#46892 ) This assertion was added during the development of required pipelines. In the initial version of that work, the notion of whether or not a request was forwarded from the coordinating node to an ingest node was introduced. It was realized later that instead we needed to track whether or not the pipeline for the request was resolved. When that change was made, this assertion, while not incorrect, was left behind and only applied if the coordinating node was forwarding the request. Instead, the assertion applies whether or not the request is being forwarded. This commit addresses that by moving the assertion and renaming some variables.	2019-09-20 07:27:56 -04:00
Jason Tedor	2425fd1a50	Removed unused import from RequiredPipelineIT.java This commit removes an unused import that was left behind after cleaning up a backport. Sorry.	2019-09-19 16:46:27 -04:00
Jason Tedor	bd77626177	Add the ability to require an ingest pipeline (#46847 ) This commit adds the ability to require an ingest pipeline on an index. Today we can have a default pipeline, but that could be overridden by a request pipeline parameter. This commit introduces a new index setting index.required_pipeline that acts similarly to index.default_pipeline, except that it can not be overridden by a request pipeline parameter. Additionally, a default pipeline and a request pipeline can not both be set. The required pipeline can be set to _none to ensure that no pipeline ever runs for index requests on that index.	2019-09-19 16:37:45 -04:00
Armin Braun	c922743c5d	Remove Bogus Test: testDeleteOrphanSnapshot (#46835 ) (#46874 ) This test is broken with a very low failure rate after recent changes. Particularly after #45689 which does not check for duplicate snapshot uuids during snapshot finalization any more. The check for duplicate uuids during finalization was removed consciously since it led to problems during master failover. This test fails because it increments the repository state id in an unexpected manner now, starting from the impossible situation of having the same snapshot UUID for two different repository state ids. This situation can't normally be reached, but we manually crafted it here. This test didn't do anything before though, because the manually crafted cluster state would simply result in an error during finalization before and nothing but a normal snapshot delete would be tested. => removing this test here, it doesn't test anything. Closes #46843	2019-09-19 18:52:35 +02:00
Yannick Welsch	9638ca20b0	Allow dropping documents with auto-generated ID (#46773 ) When using auto-generated IDs + the ingest drop processor (which looks to be used by filebeat as well) + coordinating nodes that do not have the ingest processor functionality, this can lead to a NullPointerException. The issue is that markCurrentItemAsDropped() is creating an UpdateResponse with no id when the request contains auto-generated IDs. The response serialization is lenient for our REST/XContent format (i.e. we will send "id" : null) but the internal transport format (used for communication between nodes) assumes for this field to be non-null, which means that it can't be serialized between nodes. Bulk requests with ingest functionality are processed on the coordinating node if the node has the ingest capability, and only otherwise sent to a different node. This means that, in order to reproduce this, one needs two nodes, with the coordinating node not having the ingest functionality. Closes #46678	2019-09-19 16:46:33 +02:00
Armin Braun	a087553009	Rearrange BlobStoreRepository to Prepare #46250 (#46824 ) (#46853 ) In #46250 we will have to move to two different delete paths for BwC. Both paths will share the initial reads from the repository though so I extracted/separated the delete logic away from the initial reads to significantly simplify the diff against #46250 here. Also, I added some JavaDoc from #46250 here as well which makes the code a little easier to follow even ignoring #46250 I think.	2019-09-19 13:07:00 +02:00
Tanguy Leroux	3ae51f25dd	Move testSnapshotWithLargeSegmentFiles to ESMockAPIBasedRepositoryIntegTestCase (#46802 ) This commit moves the common test testSnapshotWithLargeSegmentFiles to the ESMockAPIBasedRepositoryIntegTestCase base class.	2019-09-18 15:41:30 +02:00
Christos Soulios	0076083b35	Implement rounding optimization for fixed offset timezones (#46809 ) Fixes #45702 with date_histogram aggregation when using fixed_interval. Optimization has been implemented for both fixed and calendar intervals	2019-09-18 15:56:34 +03:00
Armin Braun	142b10604e	Fix testHistoryRetention (#46799 ) (#46805 ) Suppress the reasonable-history check in this test to guarantee we're always getting ops based recovery even after a background sync. Closes #45953 Co-Authored-By: David Turner <david.turner@elastic.co>	2019-09-18 13:22:55 +02:00
Martijn van Groningen	ac4e990924	Add ingest cluster state listeners (#46650 ) In the case that an ingest processor factory relies on other configuration in the cluster state in order to construct a processor instance then it is currently undetermined if a processor factory can be notified about a change if multiple cluster state updates are bundled together and if a processor implements the `ClusterStateApplier` interface. (IngestService implements this interface too) The idea with ingest cluster state listener is that it is guaranteed to update the processor factory first before the ingest service creates a pipeline with their respective processor instances. Currently this concept is used in the enrich branch: https://github.com/elastic/elasticsearch/blob/enrich/x-pack/plugin/enrich/src/main/java/org/elasticsearch/xpack/enrich/EnrichProcessorFactory.java#L21 In this case a processor factory is interested in enrich indices' _meta mapping fields. This is the third PR that merges changes made to server module from the enrich branch (see #32789) into the master branch. Changes to the server module are merged separately from the PR that will merge enrich into master, so that these changes can be reviewed in isolation.	2019-09-18 09:13:16 +02:00
Armin Braun	2c70d403fc	Reenable+Fix testMasterShutdownDuringFailedSnapshot (#46303 ) (#46747 ) Reenable this test since it was fixed by #45689 in production code (specifically, the fact that we write the `snap-` blobs without overwrite checks now). Only required adding the assumed blocking on index file writes to test code to properly work again. * Closes #25281	2019-09-17 18:09:48 +02:00
Armin Braun	b0f09b279f	Make Snapshot Logic Write Metadata after Segments (#45689 ) (#46764 ) * Write metadata during snapshot finalization after segment files to prevent outdated metadata in case of dynamic mapping updates as explained in #41581 * Keep the old behavior of writing the metadata beforehand in the case of mixed version clusters for BwC reasons * Still overwrite the metadata in the end, so even a mixed version cluster is fixed by this change if a newer version master does the finalization * Fixes #41581	2019-09-17 13:09:39 +02:00
Armin Braun	c045bc7f54	Minor Rearrangements in Snapshot Code (#46652 ) (#46752 ) Inlining one trivial single-use method and extracting the stale shard path blob calculation to make the diff with #46250 more manageable.	2019-09-17 09:23:00 +02:00
Armin Braun	20cb95ca5e	Fix testSnapshotRelocatingPrimary to Actually Run Relocations (#46594 ) (#46620 ) Without replicas we won't actually get any relocations going when removing the node constraints in this test. Adjusted the code to force relocations by forbidding nodes that hold primaries instead. Also, fixed the timeouts and asserted that we actually get relocations. Fixes #46276	2019-09-16 15:15:33 +02:00
Andrei Dan	c57cca98b2	[ILM] Add date setting to calculate index age (#46561 ) (#46697 ) * [ILM] Add date setting to calculate index age Add the `index.lifecycle.origination_date` to allow users to configure a custom date that'll be used to calculate the index age for the phase transitions (as opposed to the default index creation date). This could be useful for users to create an index with an "older" origination date when indexing old data. Relates to #42449. * [ILM] Don't override creation date on policy init The initial approach we took was to override the lifecycle creation date if the `index.lifecycle.origination_date` setting was set. This had the disadvantage of the user not being able to update the `origination_date` anymore once set. This commit changes the way we make use of the `index.lifecycle.origination_date` setting by checking its value when we calculate the index age (ie. at "read time") and, in case it's not set, default to the index creation date. * Make origination date setting index scope dynamic * Document origination date setting in ilm settings (cherry picked from commit d5bd2bb77ee28c1978ab6679f941d7c02e389d32) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2019-09-16 08:50:28 +01:00
Armin Braun	2b85dcb201	Parallelize Repository Cleanup Actions (#46647 ) (#46714 ) * Parallelize Repository Cleanup Actions Deleting root blobs and unreferenced indices can safely happen in parallel, no need to have both operations run sequentially when they preclude all other repository operations.	2019-09-16 07:52:03 +02:00
David Turner	272b0ecbdd	Remove docs for proxy mode (#46677 ) We added docs for proxy mode in #40281 but on reflection we should not be documenting this setting since it does not play well with all proxies and we can't recommend its use. This commit removes those docs and expands its Javadoc instead.	2019-09-13 22:20:11 +01:00
Nhat Nguyen	cabff5a7cd	Handle lower retaining seqno retention lease error (#46420 ) We renew the CCR retention lease at a fixed interval, therefore it's possible to have more than one in-flight renewal request at the same time. If requests arrive out of order, then the assertion is violated. Closes #46416 Closes #46013	2019-09-13 08:50:19 -04:00
Nhat Nguyen	e1a33c6283	Fix false positive out of sync warning in synced-flush (#46576 ) Synced-flush consists of three steps: (1) force-flush on every active copy; (2) check for ongoing indexing operations; (3) seal copies if there's no change since step 1. If some indexing operations are completed on the primary but not replicas, then Lucene commits from step 1 on replicas won't be the same as the primary's. And step 2 would pass if it's executed when all pending operations are done. Once step 2 passes, we will incorrectly emit the "out of sync" warning message although nothing is wrong here. Relates #28464 Relates #30244	2019-09-12 16:34:33 -04:00
Nhat Nguyen	5465c8d095	Increase timeout for relocation tests (#46554 ) There's nothing wrong in the logs from these failures. I think 30 seconds might not be enough to relocate shards with many documents as CI is quite slow. This change increases the timeout to 60 seconds for these relocation tests. It also dumps the hot threads in case of a timeout. Closes #46526 Closes #46439	2019-09-12 16:34:01 -04:00
Zachary Tong	e1e06c2589	Add version constant for 7.3.3	2019-09-12 13:50:40 -04:00
Jim Ferenczi	4407f3af1b	Delay the creation of SubSearchContext to the FetchSubPhase (#46598 ) This change delays the creation of the SubSearchContext for nested and parent/child inner_hits to the fetch sub phase in order to ensure that a SearchContext can be built entirely from a QueryShardContext. This commit also adds a validation step to the inner hits builder that ensures that we fail the request early if the inner hits path is invalid. Relates #46523	2019-09-12 14:52:15 +02:00
Igor Motov	35cb93248d	Geo: fix indexing of west to east linestrings crossing the antimeridian (#46601 ) Fixes the way linestrings that cross the antimeridian are indexed. Due to a normalization bug, these lines were decomposed into a line segment that stretched across the entire globe. Fixes #43775	2019-09-11 17:43:17 -04:00
Zachary Tong	6dc8ed5d57	[7.x Backport] Refactor AllocatedPersistentTask#init(), move rollup ctor logic (#46406 ) This makes the AllocatedPersistentTask#init() method protected so that implementing classes can perform their initialization logic there, instead of the constructor. Rollup's task is adjusted to use this init method. It also slightly refactors the methods to use a static logger in the AllocatedTask instead of passing it in via an argument. This is simpler, logged messages come from the task instead of the service, and is easier for tests	2019-09-11 17:00:28 -04:00
Ryan Ernst	fa9327cdb9	Add more meaningful keystore version mismatch errors (#46291 ) This commit changes the version bounds of keystore reading to give better error messages when a user has a too new or too old format. relates #44624	2019-09-11 09:55:19 -07:00
Jim Ferenczi	23bf310c84	Replace the SearchContext with QueryShardContext when building aggregator factories (#46527 ) This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them. The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`. Relates #46523	2019-09-11 16:43:30 +02:00
Armin Braun	27c15f137e	Remove Unused Method from BlobStoreRepository (#46204 ) (#46593 ) This method isn't used anymore and I forgot to delete it.	2019-09-11 16:34:24 +02:00
William Brafford	8c9f15db44	Fix Path comparisons for Windows tests (#46503 ) (#46566 ) * Fix Path comparisons for Windows tests The test NodeEnvironmentTests#testCustonDataPaths worked just fine on Darwin and Linux, but the comparison was breaking in Windows because one path had the "C:\" prefix and the other one didn't. The simple fix is to compare absolute paths rather than potentially relative ones.	2019-09-11 09:33:00 -04:00
Christoph Büscher	aa0c586b73	Deprecate `_field_names` disabling (#42854 ) Currently we allow `_field_names` fields to be disabled explicitly, but since the overhead is negligible now we decided to keep it turned on by default and deprecate the `enable` option on the field type. This change adds a deprecation warning whenever this setting is used; going forward we want to ignore and finally remove it. Closes #27239	2019-09-11 14:58:08 +02:00
Armin Braun	41633cb9b5	More Efficient Ordering of Shard Upload Execution (#42791 ) (#46588 ) * More Efficient Ordering of Shard Upload Execution (#42791) * Change the upload order of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard * Inspired by #39657 * Cleanup BlobStoreRepository Abort and Failure Handling (#46208)	2019-09-11 13:59:20 +02:00
Jim Ferenczi	80bb08fbda	Replace the SearchContext with QueryShardContext when building collapsing context (#46543 ) This commit replaces the `SearchContext` with the `QueryShardContext` when building collapsing context. Collapse context is part of the `SearchContext` so it shouldn't require a `SearchContext` to create one. Relates #46523	2019-09-11 12:25:38 +02:00
Jim Ferenczi	425b1a77e8	Add more context to QueryShardContext (#46584 ) This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext. It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the query shard context. Relates #46523	2019-09-11 12:24:51 +02:00
Armin Braun	f8d5145472	Fix SnapshotStatusApisIT (#46563 ) (#46582 ) Obviously we have to run the status request again to busy wait for the `STARTED` state, just busy waiting on an existing response won't do anything. Closes #45917	2019-09-11 11:58:42 +02:00
Lee Hinman	cdc3a260af	Add retention to Snapshot Lifecycle Management (backport of #4… (#46506 ) * Add retention to Snapshot Lifecycle Management (#46407) This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria. An example policy would look like: ``` PUT /_slm/policy/snapshot-every-day { "schedule": "0 30 2 * * ?", "name": "<production-snap-{now/d}>", "repository": "my-s3-repository", "config": { "indices": ["foo-", "important"] }, // Newly configured retention options "retention": { // Snapshots should be deleted after 14 days "expire_after": "14d", // Keep a maximum of thirty snapshots "max_count": 30, // Keep a minimum of the four most recent snapshots "min_count": 4 } } ``` SLM Retention is run on a schedule configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour. Included in this work is a new SLM stats API endpoint available through ``` json GET /_slm/stats ``` That returns statistics about snapshots taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API. Add base framework for snapshot retention (#43605) * Add base framework for snapshot retention This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask` to start as the basis for SLM's retention implementation. 
Relates to #38461 * Remove extraneous 'public' * Use a local var instead of reading class var repeatedly * Add SnapshotRetentionConfiguration for retention configuration (#43777) * Add SnapshotRetentionConfiguration for retention configuration This commit adds the `SnapshotRetentionConfiguration` class and its HLRC counterpart to encapsulate the configuration for SLM retention. Currently only a single parameter is supported as an example (we still need to discuss the different options we want to support and their names) to keep the size of the PR down. It also does not yet include version serialization checks since the original SLM branch has not yet been merged. Relates to #43663 * Fix REST tests * Fix more documentation * Use Objects.equals to avoid NPE * Put `randomSnapshotLifecyclePolicy` in only one place * Occasionally return retention with no configuration * Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764) * Implement SnapshotRetentionTask's snapshot filtering and deletion This commit implements the snapshot filtering and deletion for `SnapshotRetentionTask`. Currently only the expire-after age is used for determining whether a snapshot is eligible for deletion. Relates to #43663 * Fix deletes running on the wrong thread * Handle missing or null policy in snap metadata differently * Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>> * Use the `OriginSettingClient` to work with security, enhance logging * Prevent NPE in test by mocking Client * Allow empty/missing SLM retention configuration (#45018) Semi-related to #44465, this allows the `"retention"` configuration map to be missing. Relates to #43663 * Add min_count and max_count as SLM retention predicates (#44926) This adds the configuration options for `min_count` and `max_count` as well as the logic for determining whether a snapshot meets this criteria to SLM's retention feature. 
These options are optional and one, two, or all three can be specified in an SLM policy. Relates to #43663 * Time-bound deletion of snapshots in retention delete function (#45065) * Time-bound deletion of snapshots in retention delete function With a cluster that has a large number of snapshots, it's possible that snapshot deletion can take a very long time (especially since deletes currently have to happen in a serial fashion). To prevent snapshot deletion from taking forever in a cluster and blocking other operations, this commit adds a setting to allow configuring a maximum time to spend deleting snapshots during retention. This dynamic setting defaults to 1 hour and is best-effort, meaning that it doesn't hard stop a deletion at an hour mark, but ensures that once the time has passed, all subsequent deletions are deferred until the next retention cycle. Relates to #43663 * Wow snapshots suuuure can take a long time. * Use a LongSupplier instead of actually sleeping * Remove TestLogging annotation * Remove rate limiting * Add SLM metrics gathering and endpoint (#45362) * Add SLM metrics gathering and endpoint This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The stats stored include the number of snapshots taken, failed, deleted, the number of retention runs, as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount of time spent deleting snapshots from SLM retention. 
This commit also adds an endpoint for retrieving all stats (further commits will expose this in the SLM get-policy API) that looks like: ``` GET /_slm/stats { "retention_runs" : 13, "retention_failed" : 0, "retention_timed_out" : 0, "retention_deletion_time" : "1.4s", "retention_deletion_time_millis" : 1404, "policy_metrics" : { "daily-snapshots2" : { "snapshots_taken" : 7, "snapshots_failed" : 0, "snapshots_deleted" : 6, "snapshot_deletion_failures" : 0 }, "daily-snapshots" : { "snapshots_taken" : 12, "snapshots_failed" : 0, "snapshots_deleted" : 12, "snapshot_deletion_failures" : 6 } }, "total_snapshots_taken" : 19, "total_snapshots_failed" : 0, "total_snapshots_deleted" : 18, "total_snapshot_deletion_failures" : 6 } ``` This does not yet include HLRC for this, as this commit is quite large on its own. That will be added in a subsequent commit. Relates to #43663 * Version qualify serialization * Initialize counters outside constructor * Use computeIfAbsent instead of being too verbose * Move part of XContent generation into subclass * Fix REST action for master merge * Unused import * Record history of SLM retention actions (#45513) This commit records the deletion of snapshots by the retention component of SLM into the SLM history index for the purposes of reviewing operations taken by SLM and alerting. * Retry SLM retention after currently running snapshot completes (#45802) * Retry SLM retention after currently running snapshot completes This commit adds a ClusterStateObserver to wait until the currently running snapshot is complete before proceeding with snapshot deletion. SLM retention waits for the maximum allowed deletion time for the snapshot to complete, however, the waiting time is not factored into the limit on actual deletions. 
Relates to #43663 * Increase timeout waiting for snapshot completion * Apply patch From `2374316f0d`.patch * Rename test variables * [TEST] Be less strict for stats checking * Skip SLM retention if ILM is STOPPING or STOPPED (#45869) This adds a check to ensure we take no action during SLM retention if ILM is currently stopped or in the process of stopping. Relates to #43663 * Check all actions preventing snapshot delete during retention (#45992) * Check all actions preventing snapshot delete during retention run Previously we only checked to see if a snapshot was currently running, but it turns out that more things can block snapshot deletion. This changes the check to be a check for: - a snapshot currently running - a deletion already in progress - a repo cleanup in progress - a restore currently running This was found by CI where a third party delete in a test caused SLM retention deletion to throw an exception. Relates to #43663 * Add unit test for okayToDeleteSnapshots * Fix bug where SLM retention task would be scheduled on every node * Enhance test logging * Ignore if snapshot is already deleted * Missing import * Fix SnapshotRetentionServiceTests * Expose SLM policy stats in get SLM policy API (#45989) This also adds support for the SLM stats endpoint to the high level rest client. 
Retrieving a policy now looks like: ```json { "daily-snapshots" : { "version": 1, "modified_date": "2019-04-23T01:30:00.000Z", "modified_date_millis": 1556048137314, "policy" : { "schedule": "0 30 1 * * ?", "name": "<daily-snap-{now/d}>", "repository": "my_repository", "config": { "indices": ["data-", "important"], "ignore_unavailable": false, "include_global_state": false }, "retention": {} }, "stats": { "snapshots_taken": 0, "snapshots_failed": 0, "snapshots_deleted": 0, "snapshot_deletion_failures": 0 }, "next_execution": "2019-04-24T01:30:00.000Z", "next_execution_millis": 1556048160000 } } ``` Relates to #43663 Rewrite SnapshotLifecycleIT as an ESIntegTestCase (#46356) * Rewrite SnapshotLifecycleIT as an ESIntegTestCase This commit splits `SnapshotLifecycleIT` into two different tests. `SnapshotLifecycleRestIT` which includes the tests that do not require slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an integration test using `MockRepository` to simulate a snapshot being in progress. Relates to #43663 Resolves #46205 * Add error logging when exceptions are thrown * Update serialization versions * Fix type inference * Use non-Cancellable HLRC return value * Fix Client mocking in test * Fix SLMSnapshotBlockingIntegTests for 7.x branch * Update SnapshotRetentionTask for non-multi-repo snapshot retrieval * Add serialization guards for SnapshotLifecyclePolicy	2019-09-10 09:08:09 -06:00
Mayya Sharipova	2c5f9b558b	Fix highlighting for script_score query (#46507 )	2019-09-10 08:26:47 -04:00
David Turner	6c67b53932	Load metadata at start time not construction time (#46326 ) Today we load the metadata from disk while constructing the node. However there is no real need to do so, and this commit moves that code to run later while the node is starting instead.	2019-09-10 11:15:10 +01:00
Henning Andersen	9fce5a99d8	Rest Controller wildcard registration (#46487 ) Registering two different http methods on the same path using different wildcard names would result in the last wildcard name being active only. Now throw an exception instead. Closes #46482	2019-09-09 21:49:18 +02:00
Zachary Tong	8d17527050	[TEST] create larger cuckoo filters for tests (#46457 ) The cuckoo filters could be randomly created with too small a capacity or precision, which means that they can only absorb a few values before collisions start to make all filters look identical. This increases the size of filters we generate (capacity >> than the test cases) and lowers the fpp rate.	2019-09-09 10:18:51 -04:00
David Turner	8428f8e6e8	Remove trailing comma from nodes lists (#46484 ) Today when the membership of the cluster changes we log messages that describe the change like this: added {{node-1}{OPdaTIGmSxaEXXOyg3o96w}{127.0.0.1}{127.0.0.1:9301}{di},} The trailing comma suggests there is some missing string that might contain extra information, but in fact it's an artefact of how these messages are constructed. This commit removes the trailing comma from these lists.	2019-09-09 14:47:32 +01:00
Armin Braun	ee3396735c	Execute SnapshotsService Error Callback on Generic Thread (#46277 ) (#46480 ) I couldn't find a test for this, as it seems we only get into this error handler on a bug. Regardless, we are executing the snapshot finalization on the master update thread here which shouldn't happen and will make debugging a production issue resulting from this trickier than it has to be (because we probably also get a cluster state apply is slow warning in addition to the original bug). Used the generic pool here instead of the snapshot pool because we're resolving the user callback here as well and the generic pool seemed like the safer bet for that.	2019-09-09 14:38:11 +02:00
Martijn van Groningen	c057fce978	Merge remote-tracking branch 'es/7.x' into enrich-7.x	2019-09-09 08:40:54 +02:00
Nhat Nguyen	24c3a1de3c	Ignore replication for noop updates (#46458 ) Previously, we ignored replication for noop updates because they did not have sequence numbers. Since #44603, we started assigning sequence numbers to noop updates, leading them to be replicated to replicas. This bug occurs only on 8.0 because it requires #41065 and #44603. Closes #46366	2019-09-07 11:32:01 -04:00
markharwood	323ec022be	Deprecate the "index.max_adjacency_matrix_filters" index setting (#46394 ) Following performance optimisations to the adjacency_matrix aggregation we no longer require this setting. Marked as deprecated and due for removal in 8.0 Related #46324	2019-09-06 13:59:47 +01:00
Yunfeng,Wu	7582af27b0	Resolve the incorrect scroll_current when delete or close index (#45226 ) Resolve the incorrect current scroll for deleted or closed index	2019-09-06 09:45:53 +02:00
Jim Ferenczi	f2a6c88f83	Add a system property to ignore awareness attributes (#46375 ) This is a follow up of #19191 for 7.x. This change adds a system property called "es.routing.search_ignore_awareness_attributes" that when set to true will effectively ignore allocation awareness attributes when routing search and get requests. This is now the default in 8.x so this commit adds a way to opt-in to this new behavior in a minor version of 7.x. Relates #45735	2019-09-06 09:29:27 +02:00
Paul Sanwald	758680c549	version bump to 6.8.4 (#46409 )	2019-09-05 15:14:36 -04:00
Jason Tedor	92866f977a	Clarify error message on keystore write permissions (#46321 ) When the Elasticsearch process does not have write permissions to upgrade the Elasticsearch keystore, we bail with an error message that indicates there is a filesystem permissions problem. This commit clarifies that error message by pointing out the directory where write permissions are required, or that the user can also run the elasticsearch-keystore upgrade command manually before starting the Elasticsearch process. In this case, the upgrade would not be needed at runtime, so the permissions would not be needed then.	2019-09-05 15:11:54 -04:00
Benjamin Trent	d912a49c6f	[7.x] Support geotile_grid aggregation in composite agg sources (#45810 ) (#46399 ) * Support geotile_grid aggregation in composite agg sources (#45810) Adds support for `geotile_grid` as a source in composite aggs. Part of this change includes adding a new docFormat of `GEOTILE` that formats a hashed `long` value into a geotile formatting string `zoom/x/y`.	2019-09-05 13:22:57 -05:00
Armin Braun	7a9af874ad	Enable Debug Logging for Master and Coordination Packages (#46363 ) (#46374 ) In order to track down #46091: * Enables debug logging in REST tests for `master` and `coordination` packages since we suspect that issues are caused by failed and then retried publications	2019-09-05 14:03:38 +02:00
Yannick Welsch	7e4c633ce3	Quiet down shard lock failures (#46368 ) These were actually never intended to be logged at the warning level but made visible by a refactoring in #19991, which introduced a new exception type but forgot to adapt some of the consumers of the exception.	2019-09-05 13:08:11 +02:00
Nhat Nguyen	03ed18a010	Unmute testRecoveryFromFailureOnTrimming Tracked at #46267	2019-09-04 22:33:17 -04:00
Julie Tibshirani	40c3225d26	First round of optimizations for vector functions. (#46294 ) This PR merges the `vectors-optimize-brute-force` feature branch, which makes the following changes to how vector functions are computed: * Precompute the L2 norm of each vector at indexing time. (#45390) * Switch to ByteBuffer for vector encoding. (#45936) * Decode vectors while computing the vector function. (#46103) * Use an array instead of a List for the query vector. (#46155) * Precompute the normalized query vector when using cosine similarity. (#46190) Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>	2019-09-04 14:45:57 -07:00
Nhat Nguyen	a16cb89956	Revert "Sync translog without lock when trim unreferenced readers (#46203 )" Unfortunately, with this change, we won't clean up all unreferenced generations when reopening. We assume that there's at most one unreferenced generation when reopening translog. The previous implementation guarantees this assumption by syncing translog every time after we remove a translog reader. This change, however, only syncs translog once after we have removed all unreferenced readers (can be more than one) and breaks the assumption. Closes #46267 This reverts commit fd8183ee51d7cf08d9def58a2ae027714beb60de.	2019-09-04 17:09:39 -04:00
Jason Tedor	3cbdd84b89	Add test that get triggers shard search active (#46317 ) This commit is a follow-up to a change that fixed that multi-get was not triggering a shard to become search active. In that change, we added a test that multi-get properly triggers a shard to become search active. This commit is a follow-up to that change which adds a test for the get case. While get is already handled correctly in production code, there was not a test for it. This commit adds one. Additionally, we factor all the search idle tests from IndexShardIT into a separate test class, as an effort to keep related tests together instead of a single large test class containing a jumble of tests, and also to keep test classes smaller for better parallelization.	2019-09-04 11:53:32 -04:00
markharwood	408b58dd9d	Adjacency_matrix aggregation optimisation. (#46257 ) (#46315 ) Avoid pre-allocating ((N * N) - N) / 2 “BitsIntersector” objects given N filters. Most adjacency matrices will be sparse and we typically don’t need to allocate all of these objects - can save a lot of allocations when the number of filters is high. Closes #46212	2019-09-04 16:45:32 +01:00
Nhat Nguyen	eb56d23421	Do not send recovery requests with CancellableThreads (#46287 ) Previously, we sent recovery requests using CancellableThreads because we sent requests and waited for responses in a blocking manner. With async recovery, we no longer need to do so. Moreover, if we fail to submit a request, then we can release the Store using an interruptible thread which can risk invalidating the node lock. This PR is the first step to avoid forking when releasing the Store. Relates #45409 Relates #46178	2019-09-04 11:26:11 -04:00
Henning Andersen	5066835569	Fix SearchService.createContext exception handling (#46258 ) An exception from the DefaultSearchContext constructor could leak a searcher, causing future issues like shard lock obtained exceptions. The underlying cause of the exception in the constructor has been fixed, but as a safety precaution we also fix the exception handling in createContext. Closes #45378	2019-09-04 14:46:30 +02:00
Nhat Nguyen	3f67cbe974	Suppress warning from background sync on relocated primary (#46247 ) If a primary is being relocated, then the global checkpoint and retention lease background sync can emit unnecessary warning logs. This side effect was introduced in #42241. Relates #40800 Relates #42241	2019-09-03 18:44:15 -04:00
Nhat Nguyen	5924df1764	Mute testRecoveryFromFailureOnTrimming Tracked at #46267	2019-09-03 18:44:08 -04:00
Lee Hinman	57f322f85e	Move MockRepository into test framework (#46298 ) This moves the `MockRepository` class into `test/framework/src/main` so it can be used across all modules and plugins in tests.	2019-09-03 16:21:10 -06:00

4046 Commits