OpenSearch

Commit Graph

Author	SHA1	Message	Date
Henning Andersen	2ac38fd315	Reindex and friends fail on RED shards (#45830 ) Reindex, update by query and delete by query would silently disregard RED/unavailable shards, thus not copying, updating or deleting matching data in those shards. Now use `allow_partial_search_results=false` to ensure these operations fail if the search crosses an unavailable shard. Added the option to explicitly specify `allow_partial_search_results=true` for reindex only (seemed too strange for update/delete by query). Relates #45739 and #42612	2019-11-18 21:23:08 +01:00
Benjamin Trent	eefe7688ce	[7.x][ML] ML Model Inference Ingest Processor (#49052 ) (#49257 ) * [ML] ML Model Inference Ingest Processor (#49052) * [ML][Inference] adds lazy model loader and inference (#47410) This adds a couple of things: - A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them - A Model class and its first sub-class LocalModel. Used to cache model information and run inference. - Transport action and handler for requests to infer against a local model Related Feature PRs: * [ML][Inference] Adjust inference configuration option API (#47812) * [ML][Inference] adds logistic_regression output aggregator (#48075) * [ML][Inference] Adding read/del trained models (#47882) * [ML][Inference] Adding inference ingest processor (#47859) * [ML][Inference] fixing classification inference for ensemble (#48463) * [ML][Inference] Adding model memory estimations (#48323) * [ML][Inference] adding more options to inference processor (#48545) * [ML][Inference] handle string values better in feature extraction (#48584) * [ML][Inference] Adding _stats endpoint for inference (#48492) * [ML][Inference] add inference processors and trained models to usage (#47869) * [ML][Inference] add new flag for optionally including model definition (#48718) * [ML][Inference] adding license checks (#49056) * [ML][Inference] Adding memory and compute estimates to inference (#48955) * fixing version of indexed docs for model inference	2019-11-18 13:19:17 -05:00
gpaimla	7d20b50f45	Implement Lucene EstonianAnalyzer, Stemmer (#49149 ) This PR adds a new analyzer and stemmer for the Estonian language. Closes #48895	2019-11-18 17:24:21 +01:00
Armin Braun	25cc8e3663	Fix RepoCleanup not Removed on Master-Failover (#49217 ) (#49239 ) The logic for `cleanupInProgress()` was backwards everywhere (method itself and all but one user). Also, we weren't checking it when removing a repository. This led to a bug (in the one spot that didn't use the method backwards) that prevented the cleanup cluster state entry from ever being removed from the cluster state if master failed over during the cleanup process. This change corrects the backwards logic, adds a test that makes sure the cleanup is always removed and adds a check that prevents repository removal during cleanup to the repositories service. Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.	2019-11-18 16:44:09 +01:00
Armin Braun	f7d9e7bdc4	Better Exceptions on Concurrent Snapshot Operations (#49220 ) (#49237 ) * Better Exceptions on Concurrent Snapshot Operations It is somewhat tricky to debug test failures from concurrent operations without having the exact knowledge of what ran concurrently so I added it to these exceptions in all spots.	2019-11-18 14:12:55 +01:00
Armin Braun	42268f0b0e	Fix Broken Network Disruption in SnapshotResiliencyTests (#49216 ) (#49231 ) The network disruption was acting on node ids and node names which made reconnects not work. Moved all usages to node names to fix this. Since the map of all nodes in the test is indexed by name this was easier to work with.	2019-11-18 12:02:27 +01:00
Yannick Welsch	af797a77a1	Auto-expand indices according to allocation filtering rules (#48974 ) Honours allocation filtering rules when auto-expanding indices.	2019-11-18 12:01:56 +01:00
Armin Braun	2886d4c6dd	Make FsBlobContainer Listing Resilient to Concurrent Modifications (#49142 ) (#49176 ) * Make FsBlobContainer Listing Resilient to Concurrent Modifications If we list out files in a folder via the lazily computed directory stream, we have to deal with concurrent deletes when reading the file attributes since we don't have a lock on the directory in any way. Closes #37581	2019-11-15 21:14:53 +01:00
Mark Tozzi	dad68c59fe	Avoid precision loss in DocValueFormat.RAW#parseLong (#49063 ) (#49169 )	2019-11-15 12:32:26 -05:00
markharwood	c3745b03ee	Search optimisation - add canMatch early aborts for queries on "_index" field (#49158 ) Make queries on the “_index” field fast-fail if the target shard is an index that doesn’t match the query expression. Part of the “canMatch” phase optimisations. Closes #48473	2019-11-15 16:50:32 +00:00
Jason Tedor	36dc544819	Adjust version on ingest processor exception The dedicated ingest processor exception was backported to 7.5. This commit updates the version in the 7.x branch.	2019-11-15 09:35:12 -05:00
Armin Braun	fc505aaa76	Track Repository Gen. in BlobStoreRepository (#48944 ) (#49116 ) This is intended as a stop-gap solution/improvement to #38941 that prevents repo modifications without an intermittent master failover from causing inconsistent (outdated due to inconsistent listing of index-N blobs) `RepositoryData` to be written. Tracking the latest repository generation will move to the cluster state in a separate pull request. This is intended as a low-risk change to be backported as far as possible and motivated by the recently increased chance of #38941 causing trouble via SLM (see https://github.com/elastic/elasticsearch/issues/47520). Closes #47834 Closes #49048	2019-11-15 09:54:53 +01:00
Tal Levy	5cd6f64f15	Introduce faster approximate sinh/atan math functions (#49009 ) (#49110 ) This commit introduces a new class called ESSloppyMath that is meant to reflect the purpose of Lucene's SloppyMath, but add additional unimplemented faster alternatives to math functions. The two that are used by geotile-grid a lot are sinh/atan. In a quick elasticsearch rally benchmark for geotile-grid on Switzerland data points, this shows a (1.22x) 22% speed-up over using Math's functions. closes #41166.	2019-11-14 14:15:34 -08:00
bellengao	6ce04429c6	Fix `_analyze` API to correctly use normalizers when specified (#48866 ) Currently the `_analyze` endpoint doesn't correctly use normalizers specified in the request. This change fixes that by returning the resolved normalizer from TransportAnalyzeAction#getAnalyzer and updates test to be able to catch this in the future. Closes #48650	2019-11-14 19:51:11 +01:00
Jason Tedor	2bcdcb17cd	Introduce dedicated ingest processor exception (#48810 ) Today we wrap exceptions that occur while executing an ingest processor in an ElasticsearchException. Today, in ExceptionsHelper#unwrapCause we only unwrap causes for exceptions that implement ElasticsearchWrapperException, which the top-level ElasticsearchException does not. Ultimately, this means that any exception that occurs during processor execution does not have its cause unwrapped, and so its status is blanket treated as a 500. This means that while executing a bulk request with an ingest pipeline, document-level failures that occur during a processor will cause the status for that document to be treated as 500. Since that does not give the client any indication that they made a mistake, it means some clients will enter infinite retries, thinking that there is some server-side problem that merely needs to clear. This commit addresses this by introducing a dedicated ingest processor exception, so that its causes can be unwrapped. While we could consider a broader change to unwrap causes for more than just ElasticsearchWrapperExceptions, that is a broad change with unclear implications. Since the problem of reporting 500s on client errors is a user-facing bug, we take the conservative approach for now, and we can revisit the unwrapping in a future change.	2019-11-14 11:04:53 -05:00
Christoph Büscher	6c5644335f	Simplify TransportMultiSearchActionTests (#48523 ) The test doesn't seem to need the threadpool that is created and destroyed in setup and teardown any longer, so it can be removed.	2019-11-14 14:48:16 +01:00
Henning Andersen	66f0c8900f	Fix Transport Stopped Exception (#48930 ) (#49035 ) When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates #42612	2019-11-13 18:48:05 +01:00
Yannick Welsch	2dfa0133d5	Always use primary term from primary to index docs on replica (#47583 ) Ensures that we always use the primary term established by the primary to index docs on the replica. Makes the logic around replication less brittle by always using the operation primary term on the replica that is coming from the primary.	2019-11-13 12:13:45 +01:00
Igor Motov	40776eedaf	Fix ignoring missing values in min/max aggregations (#48970 ) Fixes the issue when the missing values can be ignored in min/max due to BKD optimization. Fixes #48905	2019-11-12 19:57:28 -05:00
Armin Braun	0e1035241d	Fix Broken Snapshots in Mixed Clusters (#48993 ) (#48995 ) Reverts #48947 and fixes the issue originally addressed by removing the assertion. It turns out we can't simply pass empty shard generations to the snapshot finalization in the BwC case as that results in no indices being added to the meta for the given snapshot since we take the indices from the shard generations (even in the BwC case the `null` generations work fine for this). Closes #48983	2019-11-12 21:35:41 +01:00
David Turner	9baea80853	Ignore metadata of deleted indices at start (#48918 ) Today in 6.x it is possible to add an index tombstone to the graveyard without deleting the corresponding index metadata, because the deletion is slightly deferred. If you shut down the node and upgrade to 7.x when in this state then the node will fail to apply any cluster states, reporting java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state. This commit addresses this situation by skipping over any index metadata with a corresponding tombstone, allowing this metadata to be cleaned up by the 7.x node.	2019-11-12 11:16:54 +00:00
David Turner	dc441588b6	Remove support for ancient corrupted markers (#48858 ) Today we still support reading store corruption markers of versions that haven't been written since 1.7. This commit removes this legacy support.	2019-11-12 11:10:46 +00:00
Yannick Welsch	ab15bce4e7	Auto-expand replicated closed indices (#48973 ) Fixes a bug where replicated closed indices were not being auto-expanded.	2019-11-12 12:00:05 +01:00
Tim Brooks	0645ee88e2	Send cluster name and discovery node in handshake (#48916 ) This commit sends the cluster name and discovery node in the transport level handshake response. This will allow us to stop sending the transport service level handshake request in the 8.0-8.x release cycle. It is necessary to start sending this in 7.x so that 8.0 is guaranteed to be communicating with a version that sends the required information.	2019-11-11 18:42:02 -05:00
Jake Landis	c320b499a0	Prevent deadlock by using separate schedulers (#48697 ) (#48964 ) Currently the BulkProcessor class uses a single scheduler to schedule flushes and retries. Functionally these are very different concerns but can result in a deadlock. Specifically, the single shared scheduler can kick off a flush task, which only finishes its task when the bulk that is being flushed finishes. If (for whatever reason), any items in that bulk fail it will (by default) schedule a retry. However, that retry will never run its task, since the flush task is consuming the 1 and only thread available from the shared scheduler. Since the BulkProcessor is mostly client based code, the client can provide their own scheduler. As-is the scheduler would require at minimum 2 worker threads to avoid the potential deadlock. Since the number of threads is a configuration option in the scheduler, the code can not enforce this 2 worker rule until runtime. For this reason this commit splits the single task scheduler into 2 schedulers. This eliminates the potential for the flush task to block the retry task and removes this deadlock scenario. This commit also deprecates the Java APIs that presume a single scheduler, and updates any internal code to no longer use those APIs. Fixes #47599 Note - #41451 fixed the general case where a bulk fails and is retried that can result in a deadlock. This fix should address that case as well as the case when a bulk failure from the flush needs to be retried.	2019-11-11 16:31:21 -06:00
Mark Tozzi	d9e569278f	Refactor and DRY up Kahan Sum algorithm (#48558 ) (#48959 )	2019-11-11 15:09:19 -05:00
Armin Braun	c45470f84f	Fix ShardGenerations in RepositoryData in BwC Case (#48920 ) (#48947 ) We were tripping the assertion that makes sure we only have empty `ShardGenerations` in `RepositoryData` in the BwC case because shard generations were passed to the `Repository` in the BwC case. Fixed by only generating empty shard gen for BwC snapshots in `SnapshotsService`.	2019-11-11 18:02:53 +01:00
Rory Hunter	014e1b1090	Improve resiliency to auto-formatting in server (#48940 ) Backport of #48450. Make a number of changes so that code in the `server` directory is more resilient to automatic formatting. This covers: * Reformatting multiline JSON to embed whitespace in the strings * Move some comments around so they aren't auto-formatted to a strange place. This also required moving some `&&` and `||` operators from end-of-line to start-of-line. * Add helper method `reformatJson()`, to strip whitespace from a JSON document using XContent methods. This is sometimes necessary where a test is comparing some machine-generated JSON with an expected value. Also, `HyperLogLogPlusPlus.java` is now excluded from formatting because it contains large data tables that don't reformat well with the current settings, and changing the settings would be worse for the rest of the codebase.	2019-11-11 14:33:04 +00:00
Yannick Welsch	87862868c6	Allow realtime get to read from translog (#48843 ) The realtime GET API currently has erratic performance in case where a document is accessed that has just been indexed but not refreshed yet, as the implementation will currently force an internal refresh in that case. Refreshing can be an expensive operation, and also will block the thread that executes the GET operation, blocking other GETs from being processed. In case of frequent access of recently indexed documents, this can lead to a refresh storm and terrible GET performance. While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted to read from the translog in case of realtime GET API or update API, this was removed in 5.0 (#20102) to avoid inconsistencies between values that were returned from the translog and those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and upsert to read from the translog again as it was easier to guarantee consistency for these, and also brought back more predictable performance characteristics of this API. Calls to the realtime GET API, however, would still always do a refresh if necessary to return consistent results. This means that users that were calling realtime GET APIs to coordinate updates on client side (realtime GET + CAS for conditional index of updated doc) would still see very erratic performance. This PR (together with #48707) resolves the inconsistencies between reading from translog and index. In particular it fixes the inconsistencies that happen when requesting stored fields, which were not available when reading from translog. In case where stored fields are requested, this PR will reparse the _source from the translog and derive the stored fields to be returned. With this, it changes the realtime GET API to allow reading from the translog again, avoid refresh storms and blocking the GET threadpool, and provide overall much better and predictable performance for this API.	2019-11-09 17:47:50 +01:00
Nhat Nguyen	ff6c121eb9	Closed shard should never open new engine (#47186 ) We should not open new engines if a shard is closed. We break this assumption in #45263 where we stop verifying the shard state before creating an engine but only before swapping the engine reference. We can fail to snapshot the store metadata or checkIndex a closed shard if there's some IndexWriter holding the index lock. Closes #47060	2019-11-08 23:40:34 -05:00
Nhat Nguyen	9a42e71dd9	Do not cancel recovery for copy on broken node (#48265 ) This change fixes a poisonous situation where an ongoing recovery was canceled because a better copy was found on a node that the cluster had previously tried allocating the shard to but failed. The solution is to keep track of the set of nodes that an allocation was failed on so that we can avoid canceling the current recovery for a copy on failed nodes. Closes #47974	2019-11-08 23:10:47 -05:00
Adrien Grand	3b9ce0a4f3	Elasticsearch 7.5 is on Lucene 8.3. (#48831 )	2019-11-06 10:13:09 -05:00
David Turner	bd5c6c4779	Add preflight check to dynamic mapping updates (#48867 ) Today if the primary discovers that an indexing request needs a mapping update then it will send it to the master for validation and processing. If, however, the put-mapping request is invalid then the master still processes it as a (no-op) cluster state update. When there are a large number of indexing operations that result in invalid mapping updates this can overwhelm the master. However, the primary already has a reasonably up-to-date mapping against which it can check the (approximate) validity of the put-mapping request before sending it to the master. For instance it is not possible to remove fields in a mapping update, so if the primary detects that a mapping update will exceed the fields limit then it can reject it itself and avoid bothering the master. This commit adds a pre-flight check to the mapping update path so that the primary can discard obviously-invalid put-mapping requests itself. Fixes #35564 Backport of #48817	2019-11-05 18:08:22 +01:00
Nhat Nguyen	0887cbc964	Fix testForceMergeWithSoftDeletesRetentionAndRecoverySource (#48766 ) This test failure manifests the limitation of the recovery source merge policy explained in #41628. If we already merge down to a single segment then subsequent force merges will be noop although they can prune recovery source. We need to adjust this test until we have a fix for the merge policy. Relates #41628 Closes #48735	2019-11-02 21:14:12 -04:00
Armin Braun	3c20541823	Cleanup Concurrent RepositoryData Loading (#48329 ) (#48834 ) The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This led to accidentally returning an empty repository data for generations >=0 which can never not exist unless the repository is corrupted. In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario and the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N around at a time this led to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely but possible event of running into this issue in a snapshot create or delete operation on master failover on a repository like S3 which doesn't have the "no overwrites" protection on writing a new index-N. Fixes #48122	2019-11-02 20:42:29 +01:00
Armin Braun	a22f6fbe3c	Cleanup Redundant Futures in Recovery Code (#48805 ) (#48832 ) Follow up to #48110 cleaning up the redundant future uses that were left over from that change.	2019-11-02 17:28:12 +01:00
Jason Tedor	c82ecb664c	Do not wrap ingest processor exception with IAE (#48816 ) The problem with wrapping here is that it converts any exception into an IAE, which we treat as a client error (400 status) whereas the exception being wrapped here could be a server error (e.g., NPE). This commit stops wrapping all ingest processor exceptions as IAEs.	2019-11-01 15:11:35 -04:00
Tal Levy	4be54402de	[7.x] Add ingest info to Cluster Stats (#48485 ) (#48661 ) * Add ingest info to Cluster Stats (#48485) This commit enhances the ClusterStatsNodes response to include global processor usage stats on a per-processor basis. example output: ``` ... "processor_stats": { "gsub": { "count": 0, "failed": 0, "current": 0, "time_in_millis": 0 }, "script": { "count": 0, "failed": 0, "current": 0, "time_in_millis": 0 } } ... ``` The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the current per-node usage statistics that currently exist in node stats. Closes #46146. * fix BWC of ingest stats The introduction of processor types into IngestStats had a bug. It was set to `null` and set as the key to the map. This would throw a NPE. This commit resolves this by setting all the processor types from previous versions that are not serializing it out to `_NOT_AVAILABLE`.	2019-10-31 14:36:54 -07:00
Ioannis Kakavas	99aedc844d	Copy http headers to ThreadContext strictly (#45945 ) (#48675 ) Previous behavior while copying HTTP headers to the ThreadContext, would allow multiple HTTP headers with the same name, handling only the first occurrence and disregarding the rest of the values. This can be confusing when dealing with multiple Headers as it is not obvious which value is read and which ones are silently dropped. According to RFC-7230, a client must not send multiple header fields with the same field name in a HTTP message, unless the entire field value for this header is defined as a comma separated list or this specific header is a well-known exception. This commit changes the behavior in order to be more compliant with the aforementioned RFC by requiring the classes that implement ActionPlugin to declare if a header can be multi-valued or not when registering this header to be copied over to the ThreadContext in ActionPlugin#getRestHeaders. If the header is allowed to be multivalued, then all such headers are read from the HTTP request and their values get concatenated in a comma-separated string. If the header is not allowed to be multivalued, and the HTTP request contains multiple such Headers with different values, the request is rejected with a 400 status.	2019-10-31 23:05:12 +02:00
Zachary Tong	34c2375417	Add v7.4.3 version constant	2019-10-31 13:21:25 -04:00
Stéphane Campinas	7ea74918e1	[DOCS] Fix typo in IndexFieldData.java comments (#48743 )	2019-10-31 09:40:35 -04:00
kkewwei	0366c4d4a9	Faster access to INITIALIZING/RELOCATING shards (#47817 ) Today a couple of allocation deciders iterate through all the shards on a node to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster state updates in clusters with very high-density nodes holding many thousands of shards even if those shards belong to closed or frozen indices. This commit pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up this search. Closes #46941 Relates #48579 Co-authored-by: "hongju.xhj" <hongju.xhj@alibaba-inc.com>	2019-10-31 10:55:59 +00:00
Rory Hunter	d96976e2b1	Improve resiliency to formatting JSON in server (#48706 ) Backport of #48553. Make a number of changes so that JSON in the server directory is more resilient to automatic formatting. This covers: * Reformatting multiline JSON to embed whitespace in the strings * Add helper method `stripWhitespace()`, to strip whitespace from a JSON document using XContent methods. This is sometimes necessary where a test is comparing some machine-generated JSON with an expected value.	2019-10-31 10:48:55 +00:00
Arvind Ramachandran	eefa84bc94	Ignore dangling indices created in newer versions (#48652 ) Today it is possible that we import a dangling index that was created in a newer version than one or more of the nodes in the cluster. Such an index would prevent the older node(s) from rejoining the cluster if they were to briefly leave it for some reason. This commit prevents the import of such dangling indices. Fixes #34264	2019-10-31 10:12:42 +00:00
Yannick Welsch	fe8901b00b	Return consistent source in updates (#48707 )	2019-10-31 10:00:40 +01:00
Ignacio Vera	5bea3898a9	Add IndexOrDocValuesQuery to GeoPolygonQueryBuilder (#48449 ) (#48731 )	2019-10-31 08:46:57 +01:00
Nhat Nguyen	f8ef402027	Do not warm up searcher in engine constructor (#48605 ) With this change, we won't warm up searchers until we externally refresh an engine. We explicitly refresh before allowing reading from a shard (i.e., move to post_recovery state) and during resetting. This guarantees that we have warmed up the engine before exposing the external searcher. Another prerequisite for #47186.	2019-10-30 14:22:59 -04:00
Armin Braun	36039706b5	Fix SnapshotShardStatus Reporting for Failed Shard (#48556 ) (#48687 ) Fixes the shard snapshot status reporting for failed shards in the corner case of failing the shard because of an exception thrown in `SnapshotShardsService` and not the repository. We were missing the update on the `snapshotStatus` instance in this case which made the transport APIs using this field report back an incorrect status. Fixed by moving the failure handling to the `SnapshotShardsService` for all cases (which also simplifies the code, the ex. wrapping in the repository was pointless as we only used the ex. trace upstream anyway). Also, added an assertion to another test that explicitly checks this failure situation (ex. in the `SnapshotShardsService`) already. Closes #48526	2019-10-30 15:43:41 +01:00
Armin Braun	52e5ceb321	Restore from Individual Shard Snapshot Files in Parallel (#48110 ) (#48686 ) Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.	2019-10-30 14:36:30 +01:00
Armin Braun	01e326d2e3	Fix ref count handling in Engine.failEngine (#48639 ) (#48646 ) We can run into an already closed store here and hence throw on trying to increment the ref count => moving to the guarded ref count increment closes #48625	2019-10-30 10:10:48 +01:00
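
The header-copying rule described in commit 99aedc844d above ("Copy http headers to ThreadContext strictly") can be sketched roughly as follows. This is a minimal Python illustration of the stated rule only, not the Elasticsearch implementation (which is Java): the names `copy_headers`, `BadRequest` and the `registered` mapping are hypothetical, and matching header names case-insensitively is an assumption based on HTTP header names being case-insensitive.

```python
from typing import Dict, Iterable, List, Tuple


class BadRequest(Exception):
    """Hypothetical stand-in for rejecting the request with a 400 status."""


def copy_headers(
    request_headers: Iterable[Tuple[str, str]],
    registered: Dict[str, bool],
) -> Dict[str, str]:
    """Copy registered headers into a context dict.

    `registered` maps a header name to whether it may be multi-valued
    (hypothetical shape; the commit only says plugins must declare this).
    """
    # Group values by lower-cased header name, preserving arrival order.
    grouped: Dict[str, List[str]] = {}
    for name, value in request_headers:
        grouped.setdefault(name.lower(), []).append(value)

    context: Dict[str, str] = {}
    for name, multi_valued in registered.items():
        values = grouped.get(name.lower())
        if not values:
            continue  # header not present in this request
        if multi_valued:
            # Multi-valued: concatenate every occurrence, comma-separated.
            context[name] = ",".join(values)
        elif len(set(values)) > 1:
            # Single-valued but sent with different values: reject.
            raise BadRequest(f"multiple values for single-valued header [{name}]")
        else:
            # Single-valued; repeats of the same value are tolerated.
            context[name] = values[0]
    return context
```

For example, with `registered = {"X-Tags": True, "X-Id": False}`, two `X-Tags` headers are joined with a comma, while two `X-Id` headers with different values raise `BadRequest`.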


3825 Commits