Commit Graph

3954 Commits

Author SHA1 Message Date
Przemyslaw Gomulka 502873b144
[Java.time] Retain prefixed date pattern in formatter (#48703)
JavaDateFormatter should keep the pattern with the prefixed 8 as it will be used for serialisation. The stripped pattern should be used for the enclosed formatters.

closes #48698
2019-11-27 12:29:18 +01:00
Yannick Welsch 0a73ba05de Do not mutate request on scripted upsert (#49578)
Fixes a bug where a scripted upsert that causes a dynamic mapping update is retried (because
mapping update is still in-flight), and the request is mutated multiple times.

Closes #48670
2019-11-27 09:25:36 +01:00
Martijn van Groningen 09c4269097
Add templating support to enrich processor (#49093)
Adds support for templating to `field` and `target_field` options.
2019-11-27 08:53:11 +01:00
Martijn van Groningen 90850f4ea0
Backport: Introduce on_failure_pipeline ingest metadata inside on_failure block (#49596)
Backport of #49076

In case an exception occurs inside a pipeline processor,
the pipeline stack is kept around as header in the exception.
Then in the on_failure processor the id of the pipeline the
exception occurred is made accessible via the `on_failure_pipeline`
ingest metadata.

Closes #44920
2019-11-27 07:52:08 +01:00
Armin Braun 996cdebfb4
Make BlobStoreRepository#writeIndexGen API Async (#49584) (#49610)
Preliminary to shorten the diff of #49060. In #49060
we execute cluster state updates during the writing of a new
index gen and thus it must be an async API.
2019-11-26 22:37:31 +01:00
Armin Braun 3862400270
Remove Redundant EsBlobStoreTestCase (#49603) (#49605)
All the implementations of `EsBlobStoreTestCase` use the exact same
bootstrap code that is also used by their implementation of
`EsBlobStoreContainerTestCase`.
This means all tests might as well live under `EsBlobStoreContainerTestCase`
saving a lot of code duplication. Also, there was no HDFS implementation for
`EsBlobStoreTestCase` which is now automatically resolved by moving the tests over
since there is a HDFS implementation for the container tests.
2019-11-26 20:57:19 +01:00
Alan Woodward fe2c65185e Annotated text type should extend TextFieldType (#49555)
The annotated text mapper has a field type that currently extends StringFieldType,
which means that all the positional-related query factory methods need to be copied
over from TextFieldType. In addition, MappedFieldType.intervals() hasn't been
overridden, so you can't use intervals queries with annotated text - a major drawback,
since one of the purposes of annotated text is to be able to run positional queries against
annotations.

This commit changes the annotated text field type to extend TextFieldType instead,
adding tests to ensure that position queries work correctly.

Closes #49289
2019-11-26 16:52:21 +00:00
Armin Braun 495b543e63
Improve Stability of GCS Mock API (#49592) (#49597)
Same as #49518 pretty much but for GCS.
Fixing a few more spots where input stream can get closed
without being fully drained and adding assertions to make sure
it's always drained.
Moved the no-close stream wrapper to production code utilities since
there's a number of spots in production code where it's also useful
(will reuse it there in a follow-up).
2019-11-26 16:53:51 +01:00
Rory Hunter cf5f013033
Return 400 when handling invalid JSON (#49558)
Backport of #49552.

Closes #49428. The code that works out an HTTP code for an exception didn't
consider the JsonParseException case, meant that an invalid JSON request could
result in a 500 Internal Server Error. Now it returns 400 Bad Request.
2019-11-26 12:36:56 +00:00
Tim Brooks 416178c7c8
Enable simple remote connection strategy (#49561)
This commit back ports three commits related to enabling the simple
connection strategy.

Allow simple connection strategy to be configured (#49066)

Currently the simple connection strategy only exists in the code. It
cannot be configured. This commit moves in the direction of allowing it
to be configured. It introduces settings for the addresses and socket
count. Additionally it introduces new settings for the sniff strategy
so that the more generic number of connections and seed node settings
can be deprecated.

The simple settings are not yet registered as the registration is
dependent on follow-up work to validate the settings.

Ensure at least 1 seed configured in remote test (#49389)

This fixes #49384. Currently when we select a random subset of seed
nodes from a list, it is possible for 0 seeds to be selected. This test
depends on at least 1 seed being selected.

Add the simple strategy to cluster settings (#49414)

This is related to #49067. This commit adds the simple connection
strategy settings and strategy mode setting to the cluster settings
registry. With these changes, the simple connection mode can be used.
Additionally, it adds validation to ensure that settings cannot be
misconfigured.
2019-11-25 16:53:07 -07:00
Zachary Tong 99e313695f Reuse CompensatedSum object in agg collect loops (#49548)
The new CompensatedSum is a nice DRY refactor, but had the unanticipated 
side effect of creating a lot of object allocation in the aggregation hot collection 
loop: one object per visited document, per aggregator. In some places it 
created two per-doc-per-agg (weighted avg, geo centroids, etc) since there 
were multiple compensations being maintained.

This PR moves the object creation out of the hot loop so that it is now 
created once per segment, and resets the internal state each time through 
the loop
2019-11-25 16:46:48 -05:00
Armin Braun 2502ff39a0
Enhance SnapshotResiliencyTests (#49514) (#49541)
A few enhancements to `SnapshotResiliencyTests`:
1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures)
2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response)
3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method
4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
2019-11-25 13:31:28 +01:00
Jared Tan 1d2bfd1af6 Include id to the error msg when it's too long (#49433) 2019-11-24 13:08:26 -05:00
Jason Tedor 69f570ea5f
Adjust version on final pipeline serialization
This commit adjusts the version final pipeline serialization after it
was backported to the 7.5 branch.
2019-11-22 14:56:56 -05:00
Jay Modi 4fd5fb5297
Stop NodeTests from timing out in certain cases (#49202) (#49503)
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes #44256
Closes #42350
Closes #44435
2019-11-22 12:41:52 -07:00
Jason Tedor 71bcfbf1e3
Replace required pipeline with final pipeline (#49470)
This commit enhances the required pipeline functionality by changing it
so that default/request pipelines can also be executed, but the required
pipeline is always executed last. This gives users the flexibility to
execute their own indexing pipelines, but also ensure that any required
pipelines are also executed. Since such pipelines are executed last, we
change the name of required pipelines to final pipelines.
2019-11-22 14:37:36 -05:00
Armin Braun 97c7ea60b9
Add Missing Nullable Assertions in SnapshotsService (#49465) (#49492)
Just realized we were missing some annotations here which was somewhat
confusing since other methods/parameters have the `Nullable` annotation
wherever a `null` can be passed.
2019-11-22 17:27:27 +01:00
Rory Hunter 4fae2bb3b1
Don't close stderr under `--quiet` (#49431)
Backport of #47208.

Closes #46900. When running ES with `--quiet`, if ES then exits abnormally, a
user has to go hunting in the logs for the error. Instead, never close
System.err, and print more information to it if ES encounters a fatal error
e.g. config validation, or some fatal runtime exception. This is useful when
running under e.g. systemd, since the error will go into the journal.

Note that stderr is still closed in daemon (`-d`) mode.
2019-11-22 14:58:17 +00:00
Jim Ferenczi ed4eecc00e Pre-sort shards based on the max/min value of the primary sort field (#49092)
This change automatically pre-sort search shards on search requests that use a primary sort based on the value of a field. When possible, the can_match phase will extract the min/max (depending on the provided sort order) values of each shard and use it to pre-sort the shards prior to running the subsequent phases. This feature can be useful to ensure that shards that contain recent data are executed first so that intermediate merge have more chance to contain contiguous data (think of date_histogram for instance) but it could also be used in a follow up to early terminate sorted top-hits queries that don't require the total hit count. The latter could significantly speed up the retrieval of the most/least
recent documents from time-based indices.

Relates #49091
2019-11-22 11:02:12 +01:00
Igor Motov e8971ff367 Geo: Fix handling of circles in legacy geo_shape queries (#49410)
Brings back support for circles in legacy geo_shape queries that
was accidentally lost during query refactoring.

Fixes #49296
2019-11-21 14:03:31 -05:00
Christoph Büscher 138d16ab9e Fix ClusterHealthResponsesTests condition (#49360)
Currently the condtion that is supposed to test creation of test instances with
multiple indices is never true because it compares Strings with an enum. This
changes it so the condition uses the enum constants instead.
2019-11-21 17:14:23 +01:00
Alan Woodward d1eb7e749e Fix test for index phrases shortcut with multi-term synonyms (#49366)
Lucene 8.3 included a root fix for #43976, which was temporarily fixed in elasticsearch
by #44340. Since we have upgraded to 8.3 we no longer need this workaround. This
commit fixes the test that was added to check the workaround, and instead checks that
fields with index_phrases enabled correctly build queries when used with multi-term
synonyms.

Closes #47777
2019-11-21 09:49:58 +00:00
Yannick Welsch d72bd3a171 Verify translog checksum before UUID check (#49394)
When opening a translog file, we check whether the UUID matches what we expect (the UUID
from the latest commit). The UUID check can in certain cases fail when the translog is
corrupted. This commit changes the ordering of the checks so that corruption is detected first.
2019-11-21 10:12:49 +01:00
Yannick Welsch 8ee70fa9c6
Fix testPeerRecoveryTrimsLocalTranslog (#49385)
7.x uses the transport client, which, when being closed, can throw an IllegalStateException

Closes #49375
2019-11-21 10:03:25 +01:00
Nhat Nguyen 37a9cd677b Ignore Lucene index in peer recovery if translog corrupted (#49114)
If the translog on a replica is corrupt, we should not perform an 
operation-based recovery or utilize sync_id as we won't be able to open
an engine in the next step. This change adds an extra validation that
ensures translog is okay when preparing a peer recovery request.
2019-11-20 16:04:09 -05:00
jaymode d9fd4cc351 Add version 6.8.6 2019-11-20 11:01:57 -07:00
Jim Ferenczi 81548df2d9 Disable caching when queries are profiled (#48195)
This change disables the query and request cache when
profile is set to true in the request. This means that profiled queries
will not check caches to execute the query and the result will never be
added in the cache either.

Closes #33298
2019-11-20 16:02:59 +01:00
Armin Braun 1cde4a6364
Make SnapshotsService#getRepositoryData Async (#49322) (#49358)
* Make SnapshotsService#getRepositoryData Async (#49322)

Follow up to #49299 removing the blocking step for the
snapshot status APIs as well.
2019-11-20 15:22:10 +01:00
Alan Woodward c6b31162ba
Refactor percolator's QueryAnalyzer to use QueryVisitors
Lucene now allows us to explore the structure of a query using QueryVisitors,
delegating the knowledge of how to recurse through and collect terms to the
query implementations themselves. The percolator currently has a home-grown
external version of this API to construct sets of matching terms that must be
present in a document in order for it to possibly match the query.

This commit removes the home-grown implementation in favour of one using
QueryVisitor. This has the added benefit of making interval queries available
for percolator pre-filtering. Due to a bug in multi-term intervals (LUCENE-9050)
it also includes a clone of some of the lucene intervals logic, that can be removed
once upstream has been fixed.

Closes #45639
2019-11-20 09:21:01 +00:00
Mark Tozzi 17358b5af7
(refactor) Extract Empty/Script/Missing ValuesSource behavior to an interface (#48320) (#49330)
This is a pure code rearrangement refactor.  Logic for what specific ValuesSource instance to use for a given type (e.g. script or field) moved out of ValuesSourceConfig and into CoreValuesSourceType (previously just ValueSourceType; we extract an interface for future extensibility).  ValueSourceConfig still selects which case to use, and then the ValuesSourceType instance knows how to construct the ValuesSource for that case.
2019-11-19 16:44:29 -05:00
Jay Modi eed4cd25eb
ThreadPool and ThreadContext are not closeable (#43249) (#49273)
This commit changes the ThreadContext to just use a regular ThreadLocal
over the lucene CloseableThreadLocal. The CloseableThreadLocal solves
issues with ThreadLocals that are no longer needed during runtime but
in the case of the ThreadContext, we need it for the runtime of the
node and it is typically not closed until the node closes, so we miss
out on the benefits that this class provides.

Additionally by removing the close logic, we simplify code in other
places that deal with exceptions and tracking to see if it happens when
the node is closing.

Closes #42577
2019-11-19 13:15:16 -07:00
Jack Conradson 14d2e795ae make dim files mmapped (#49272)
This change mmaps dim files in HybridDirectory to take advantage of off-
heap BKD trees. This is based off of (#48509) via 
(https://issues.apache.org/jira/browse/LUCENE-8932).
2019-11-19 10:22:30 -08:00
Armin Braun 0acba44a2e
Make Repository.getRepositoryData an Async API (#49299) (#49312)
This API call in most implementations is fairly IO heavy and slow
so it is more natural to be async in the first place.
Concretely though, this change is a prerequisite of #49060 since
determining the repository generation from the cluster state
introduces situations where this call would have to wait for other
operations to finish. Doing so in a blocking manner would break
`SnapshotResiliencyTests` and waste a thread.
Also, this sets up the possibility to in the future make use of async IO
where provided by the underlying Repository implementation.

In a follow-up `SnapshotsService#getRepositoryData` will be made async
as well (did not do it here, since it's another huge change to do so).
Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
2019-11-19 16:49:12 +01:00
Armin Braun 9c00648314
Make Snapshot Delete Concurrency Exception Consistent (#49266) (#49281)
We shouldn't be throwing `RepositoryException` when the repository
wasn't concurrently modified in an unexpected fashion (i.e. on the blob/file level).
When we know that the known repo gen moved higher in terms of the generation
tracked in master memory we should throw the concurrent snapshot exception.

This change makes concurrent snapshot create and delete always throw the same exception,
prevents unnecessary listings when the generation is known to be off and prevents
future test failures in SLM tests that assume the concurrent snapshot exception
is always thrown here.
Without this change, the newly added test randomly fails the `instanceOf` assertion
by running into a `RepositoryException`.
2019-11-19 09:50:52 +01:00
Henning Andersen 2ac38fd315 Reindex and friends fail on RED shards (#45830)
Reindex, update by query and delete by query would silently disregard
RED/unavailable shards, thus not copying, updating or deleting matching
data in those shards. Now use `allow_partial_search_results=false` to
ensure these operations fail if the search crosses an unavailable chard.

Added the option to explicitly specify `allow_partial_search_results=true`
for reindex only (seemed too strange for update/delete by query).

Relates #45739 and #42612
2019-11-18 21:23:08 +01:00
Benjamin Trent eefe7688ce
[7.x][ML] ML Model Inference Ingest Processor (#49052) (#49257)
* [ML] ML Model Inference Ingest Processor (#49052)

* [ML][Inference] adds lazy model loader and inference (#47410)

This adds a couple of things:

- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:

* [ML][Inference] Adjust inference configuration option API (#47812)

* [ML][Inference] adds logistic_regression output aggregator (#48075)

* [ML][Inference] Adding read/del trained models (#47882)

* [ML][Inference] Adding inference ingest processor (#47859)

* [ML][Inference] fixing classification inference for ensemble (#48463)

* [ML][Inference] Adding model memory estimations (#48323)

* [ML][Inference] adding more options to inference processor (#48545)

* [ML][Inference] handle string values better in feature extraction (#48584)

* [ML][Inference] Adding _stats endpoint for inference (#48492)

* [ML][Inference] add inference processors and trained models to usage (#47869)

* [ML][Inference] add new flag for optionally including model definition (#48718)

* [ML][Inference] adding license checks (#49056)

* [ML][Inference] Adding memory and compute estimates to inference (#48955)

* fixing version of indexed docs for model inference
2019-11-18 13:19:17 -05:00
gpaimla 7d20b50f45 Implement Lucene EstonianAnalyzer, Stemmer (#49149)
This PR adds a new analyzer and stemmer for the Estonian language.

Closes #48895
2019-11-18 17:24:21 +01:00
Armin Braun 25cc8e3663
Fix RepoCleanup not Removed on Master-Failover (#49217) (#49239)
The logic for `cleanupInProgress()` was backwards everywhere (method itself and
all but one user). Also, we weren't checking it when removing a repository.

This lead to a bug (in the one spot that didn't use the method backwards) that prevented
the cleanup cluster state entry from ever being removed from the cluster state if master
failed over during the cleanup process.

This change corrects the backwards logic, adds a test that makes sure the cleanup
is always removed and adds a check that prevents repository removal during cleanup
to the repositories service.

Also, the failure handling logic in the cleanup action was broken. Repeated invocation would lead to the cleanup being removed from the cluster state even if it was in progress. Fixed by adding a flag that indicates whether or not any removal of the cleanup task from the cluster state must be executed. Sorry for mixing this in here, but I had to fix it in the same PR, as the first test (for master-failover) otherwise would often just delete the blocked cleanup action as a result of a transport master action retry.
2019-11-18 16:44:09 +01:00
Armin Braun f7d9e7bdc4
Better Exceptions on Concurrent Snapshot Operations (#49220) (#49237)
* Better Exceptions on Concurrent Snapshot Operations

It is somewhat tricky to debug test failures from concurrent operations
without having the exact knowledge of what ran concurrently so I added
it to these exceptions in all spots.
2019-11-18 14:12:55 +01:00
Armin Braun 42268f0b0e
Fix Broken Network Disruption in SnapshotResiliencyTests (#49216) (#49231)
The network disruption was acting on node ids and node names
which made reconnects not work. Moved all usages to node names
to fix this. Since the map of all nodes in the test is indexed
by name this was easier to work with.
2019-11-18 12:02:27 +01:00
Yannick Welsch af797a77a1 Auto-expand indices according to allocation filtering rules (#48974)
Honours allocation filtering rules when auto-expanding indices.
2019-11-18 12:01:56 +01:00
Armin Braun 2886d4c6dd
Make FsBlobContainer Listing Resilient to Concurrent Modifications (#49142) (#49176)
* Make FsBlobContainer Listing Resilient to Concurrent Modifications

If we list out files in a folder via the lazily computed directory
stream, we have to deal with concurrent deletes when reading the file
attributes since we don't have a lock on the directory in any way.

Closes #37581
2019-11-15 21:14:53 +01:00
Mark Tozzi dad68c59fe
Avoid precision loss in DocValueFormat.RAW#parseLong (#49063) (#49169) 2019-11-15 12:32:26 -05:00
markharwood c3745b03ee
Search optimisation - add canMatch early aborts for queries on "_index" field (#49158)
Make queries on the “_index” field fast-fail if the target shard is an index that doesn’t match the query expression. Part of the “canMatch” phase optimisations.

Closes #48473
2019-11-15 16:50:32 +00:00
Jason Tedor 36dc544819
Adjust version on ingest processor exception
The dedicated ingest processor exception was backported to 7.5. This
commit updates the version in the 7.x branch.
2019-11-15 09:35:12 -05:00
Armin Braun fc505aaa76
Track Repository Gen. in BlobStoreRepository (#48944) (#49116)
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible and motived by the recently increased chance of #38941
causing trouble via SLM (see https://github.com/elastic/elasticsearch/issues/47520).

Closes #47834
Closes #49048
2019-11-15 09:54:53 +01:00
Tal Levy 5cd6f64f15
Introduce faster approximate sinh/atan math functions (#49009) (#49110)
This commit introduces a new class called ESSloppyMath
that is meant to reflect the purpose of Lucene's SloppyMath,
but add additional unimplemented faster alternatives to math functions.

The two that are used by geotile-grid a lot are sinh/atan.

In a quick elasticsearch rally benchmark for geotile-grid on Switzerland
data points, this shows a (1.22x) 22% speed-up over using Math's functions.

closes #41166.
2019-11-14 14:15:34 -08:00
bellengao 6ce04429c6 Fix `_analyze` API to correctly use normalizers when specified (#48866)
Currently the `_analyze` endpoint doesn't correctly use normalizers specified
in the request. This change fixes that by returning the resolved normalizer from
TransportAnalyzeAction#getAnalyzer and updates test to be able to catch this
in the future.

Closes #48650
2019-11-14 19:51:11 +01:00
Jason Tedor 2bcdcb17cd
Introduce dedicated ingest processor exception (#48810)
Today we wrap exceptions that occur while executing an ingest processor
in an ElasticsearchException. Today, in ExceptionsHelper#unwrapCause we
only unwrap causes for exceptions that implement
ElasticsearchWrapperException, which the top-level
ElasticsearchException does not. Ultimately, this means that any
exception that occurs during processor execution does not have its cause
unwrapped, and so its status is blanket treated as a 500. This means
that while executing a bulk request with an ingest pipeline,
document-level failures that occur during a processor will cause the
status for that document to be treated as 500. Since that does not give
the client any indication that they made a mistake, it means some
clients will enter infinite retries, thinking that there is some
server-side problem that merely needs to clear. This commit addresses
this by introducing a dedicated ingest processor exception, so that its
causes can be unwrapped. While we could consider a broader change to
unwrap causes for more than just ElasticsearchWrapperExceptions, that is
a broad change with unclear implications. Since the problem of reporting
500s on client errors is a user-facing bug, we take the conservative
approach for now, and we can revisit the unwrapping in a future change.
2019-11-14 11:04:53 -05:00
Christoph Büscher 6c5644335f Simplify TransportMultiSearchActionTests (#48523)
The test doesn't seem to need the threadpool that is created and destroyed in
setup and teardown any longer, so it can be removed.
2019-11-14 14:48:16 +01:00
Rory Hunter c46a0e8708
Apply 2-space indent to all gradle scripts (#49071)
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
2019-11-14 11:01:23 +00:00
Henning Andersen 66f0c8900f
Fix Transport Stopped Exception (#48930) (#49035)
When a node shuts down, `TransportService` moves to stopped state and
then closes connections. If a request is done in between, an exception
was thrown that was not retried in replication actions. Now throw a
wrapped `NodeClosedException` exception instead, which is correctly
handled in replication action. Fixed other usages too.

Relates #42612
2019-11-13 18:48:05 +01:00
Yannick Welsch 2dfa0133d5 Always use primary term from primary to index docs on replica (#47583)
Ensures that we always use the primary term established by the primary to index docs on the
replica. Makes the logic around replication less brittle by always using the operation primary
term on the replica that is coming from the primary.
2019-11-13 12:13:45 +01:00
Igor Motov 40776eedaf Fix ignoring missing values in min/max aggregations (#48970)
Fixes the issue when the missing values can be ignored in min/max
due to BKD optimization.

Fixes #48905
2019-11-12 19:57:28 -05:00
Armin Braun 0e1035241d
Fix Broken Snapshots in Mixed Clusters (#48993) (#48995)
Reverts #48947 and fixes the issue orginally addressed by removing the assertion.
It turns out we can't simply pass empty shard generations to the snapshot finalization in the
BwC case as that results in no indices being added to the meta for the given snapshot since
we take the indices from the shard generations (even in the BwC case the `null` generations work
fine for this).

Closes #48983
2019-11-12 21:35:41 +01:00
David Turner 9baea80853 Ignore metadata of deleted indices at start (#48918)
Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting

    java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.

This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.
2019-11-12 11:16:54 +00:00
David Turner dc441588b6 Remove support for ancient corrupted markers (#48858)
Today we still support reading store corruption markers of versions that
haven't been written since 1.7. This commit removes this legacy support.
2019-11-12 11:10:46 +00:00
Yannick Welsch ab15bce4e7 Auto-expand replicated closed indices (#48973)
Fixes a bug where replicated closed indices were not being auto-expanded.
2019-11-12 12:00:05 +01:00
Tim Brooks 0645ee88e2
Send cluster name and discovery node in handshake (#48916)
This commits sends the cluster name and discovery naode in the transport
level handshake response. This will allow us to stop sending the
transport service level handshake request in the 8.0-8.x release cycle.
It is necessary to start sending this in 7.x so that 8.0 is guaranteed
to be communicating with a version that sends the required information.
2019-11-11 18:42:02 -05:00
Jake Landis c320b499a0
Prevent deadlock by using separate schedulers (#48697) (#48964)
Currently the BulkProcessor class uses a single scheduler to schedule
flushes and retries. Functionally these are very different concerns but
can result in a dead lock. Specifically, the single shared scheduler
can kick off a flush task, which only finishes it's task when the bulk
that is being flushed finishes. If (for what ever reason), any items in
that bulk fails it will (by default) schedule a retry. However, that retry
will never run it's task, since the flush task is consuming the 1 and
only thread available from the shared scheduler.

Since the BulkProcessor is mostly client based code, the client can
provide their own scheduler. As-is the scheduler would require
at minimum 2 worker threads to avoid the potential deadlock. Since the
number of threads is a configuration option in the scheduler, the code
can not enforce this 2 worker rule until runtime. For this reason this
commit splits the single task scheduler into 2 schedulers. This eliminates
the potential for the flush task to block the retry task and removes this
deadlock scenario.

This commit also deprecates the Java APIs that presume a single scheduler,
and updates any internal code to no longer use those APIs.

Fixes #47599

Note - #41451 fixed the general case where a bulk fails and is retried
that can result in a deadlock. This fix should address that case as well as
the case when a bulk failure *from the flush* needs to be retried.
2019-11-11 16:31:21 -06:00
Mark Tozzi d9e569278f
Refactor and DRY up Kahan Sum algorithm (#48558) (#48959) 2019-11-11 15:09:19 -05:00
Armin Braun c45470f84f
Fix ShardGenerations in RepositoryData in BwC Case (#48920) (#48947)
We were tripping the assertion that the makes sure we only have empty `ShardGenerations` in `RepositoryData` in the BwC case because shard generations were passed to the `Repository` in the BwC case. Fixed by only generating empty shard gen for BwC snapshots in `SnapshotsService`.
2019-11-11 18:02:53 +01:00
Rory Hunter 014e1b1090
Improve resiliency to auto-formatting in server (#48940)
Backport of #48450.

Make a number of changes so that code in the `server` directory is more
resilient to automatic formatting. This covers:

* Reformatting multiline JSON to embed whitespace in the strings
* Move some comments around to they aren't auto-formatted to a strange
  place. This also required moving some `&&` and `||` operators from the
  end-of-line to start-of-line`.
* Add helper method `reformatJson()`, to strip whitespace from a JSON
  document using XContent methods. This is sometimes necessary where
  a test is comparing some machine-generated JSON with an expected
  value.

Also, `HyperLogLogPlusPlus.java` is now excluded from formatting because it
contains large data tables that don't reformat well with the current settings,
and changing the settings would be worse for the rest of the codebase.
2019-11-11 14:33:04 +00:00
Yannick Welsch 87862868c6 Allow realtime get to read from translog (#48843)
The realtime GET API currently has erratic performance in case where a document is accessed
that has just been indexed but not refreshed yet, as the implementation will currently force an
internal refresh in that case. Refreshing can be an expensive operation, and also will block the
thread that executes the GET operation, blocking other GETs to be processed. In case of
frequent access of recently indexed documents, this can lead to a refresh storm and terrible
GET performance.

While older versions of Elasticsearch (2.x and older) did not trigger refreshes and instead opted
to read from the translog in case of realtime GET API or update API, this was removed in 5.0
(#20102) to avoid inconsistencies between values that were returned from the translog and
those returned by the index. This was partially reverted in 6.3 (#29264) to allow _update and
upsert to read from the translog again as it was easier to guarantee consistency for these, and
also brought back more predictable performance characteristics of this API. Calls to the realtime
GET API, however, would still always do a refresh if necessary to return consistent results. This
means that users that were calling realtime GET APIs to coordinate updates on client side
(realtime GET + CAS for conditional index of updated doc) would still see very erratic
performance.

This PR (together with #48707) resolves the inconsistencies between reading from translog and
index. In particular it fixes the inconsistencies that happen when requesting stored fields, which
were not available when reading from translog. In case where stored fields are requested, this
PR will reparse the _source from the translog and derive the stored fields to be returned. With
this, it changes the realtime GET API to allow reading from the translog again, avoid refresh
storms and blocking the GET threadpool, and provide overall much better and predictable
performance for this API.
2019-11-09 17:47:50 +01:00
Nhat Nguyen ff6c121eb9 Closed shard should never open new engine (#47186)
We should not open new engines if a shard is closed. We break this
assumption in #45263 where we stop verifying the shard state before
creating an engine but only before swapping the engine reference.
We can fail to snapshot the store metadata or checkIndex a closed shard
if there's some IndexWriter holding the index lock.

Closes #47060
2019-11-08 23:40:34 -05:00
Nhat Nguyen 9a42e71dd9 Do not cancel recovery for copy on broken node (#48265)
This change fixes a poisonous situation where an ongoing recovery was
canceled because a better copy was found on a node that the cluster had
previously tried allocating the shard to but failed. The solution is to
keep track of the set of nodes that an allocation was failed on so that
we can avoid canceling the current recovery for a copy on failed nodes.

Closes #47974
2019-11-08 23:10:47 -05:00
Adrien Grand 3b9ce0a4f3
Elasticsearch 7.5 is on Lucene 8.3. (#48831) 2019-11-06 10:13:09 -05:00
David Turner bd5c6c4779
Add preflight check to dynamic mapping updates (#48867)
Today if the primary discovers that an indexing request needs a mapping update
then it will send it to the master for validation and processing. If, however,
the put-mapping request is invalid then the master still processes it as a
(no-op) cluster state update. When there are a large number of indexing
operations that result in invalid mapping updates this can overwhelm the
master.

However, the primary already has a reasonably up-to-date mapping against which
it can check the (approximate) validity of the put-mapping request before
sending it to the master. For instance it is not possible to remove fields in a
mapping update, so if the primary detects that a mapping update will exceed the
fields limit then it can reject it itself and avoid bothering the master.

This commit adds a pre-flight check to the mapping update path so that the
primary can discard obviously-invalid put-mapping requests itself.

Fixes #35564
Backport of #48817
2019-11-05 18:08:22 +01:00
Nhat Nguyen 0887cbc964 Fix testForceMergeWithSoftDeletesRetentionAndRecoverySource (#48766)
This test failure manifests the limitation of the recovery source merge
policy explained in #41628. If we already merge down to a single segment
then subsequent force merges will be noop although they can prune
recovery source. We need to adjust this test until we have a fix for the
merge policy.

Relates #41628
Closes #48735
2019-11-02 21:14:12 -04:00
Armin Braun 3c20541823
Cleanup Concurrent RepositoryData Loading (#48329) (#48834)
The loading of `RepositoryData` is not an atomic operation.
It uses a list + get combination of calls.
This lead to accidentally returning an empty repository data
for generations >=0 which can never not exist unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario and the repository
actually moving two index generations between listing out the
index-N and loading the latest version of it. Since we only keep
two index-N around at a time this lead to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely event but possible of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3 which doesn't have the "no overwrites" protection on
writing a new index-N.

Fixes #48122
2019-11-02 20:42:29 +01:00
Armin Braun a22f6fbe3c
Cleanup Redundant Futures in Recovery Code (#48805) (#48832)
Follow up to #48110 cleaning up the redundant future
uses that were left over from that change.
2019-11-02 17:28:12 +01:00
Jason Tedor c82ecb664c
Do not wrap ingest processor exception with IAE (#48816)
The problem with wrapping here is that it converts any exception into an
IAE, which we treat as a client error (400 status) whereas the exception
being wrapped here could be a server error (e.g., NPE). This commit
stops wrapping all ingest processor exceptions as IAEs.
2019-11-01 15:11:35 -04:00
Mark Vieira 6ab4645f4e
[7.x] Introduce type-safe and consistent pattern for handling build globals (#48818)
This commit introduces a consistent, and type-safe manner for handling
global build parameters through out our build logic. Primarily this
replaces the existing usages of extra properties with static accessors.
It also introduces and explicit API for initialization and mutation of
any such parameters, as well as better error handling for uninitialized
or eager access of parameter values.

Closes #42042
2019-11-01 11:33:11 -07:00
Tal Levy 4be54402de
[7.x] Add ingest info to Cluster Stats (#48485) (#48661)
* Add ingest info to Cluster Stats (#48485)

This commit enhances the ClusterStatsNodes response to include global
processor usage stats on a per-processor basis.

example output:

```
...
    "processor_stats": {
      "gsub": {
        "count": 0,
        "failed": 0
        "current": 0
        "time_in_millis": 0
      },
      "script": {
        "count": 0,
        "failed": 0
        "current": 0,
        "time_in_millis": 0
      }
    }
...
```

The purpose for this enhancement is to make it easier to collect stats on how specific processors are being used across the cluster beyond the current per-node usage statistics that currently exist in node stats.

Closes #46146.

* fix BWC of ingest stats

The introduction of processor types into IngestStats had a bug.
It was set to `null` and set as the key to the map. This would
throw a NPE. This commit resolves this by setting all the processor
types from previous versions that are not serializing it out to
`_NOT_AVAILABLE`.
2019-10-31 14:36:54 -07:00
Ioannis Kakavas 99aedc844d
Copy http headers to ThreadContext strictly (#45945) (#48675)
Previous behavior while copying HTTP headers to the ThreadContext,
would allow multiple HTTP headers with the same name, handling only
the first occurrence and disregarding the rest of the values. This
can be confusing when dealing with multiple Headers as it is not
obvious which value is read and which ones are silently dropped.

According to RFC-7230, a client must not send multiple header fields
with the same field name in a HTTP message, unless the entire field
value for this header is defined as a comma separated list or this
specific header is a well-known exception.

This commits changes the behavior in order to be more compliant to
the aforementioned RFC by requiring the classes that implement
ActionPlugin to declare if a header can be multi-valued or not when
registering this header to be copied over to the ThreadContext in
ActionPlugin#getRestHeaders.
If the header is allowed to be multivalued, then all such headers
are read from the HTTP request and their values get concatenated in
a comma-separated string.
If the header is not allowed to be multivalued, and the HTTP
request contains multiple such Headers with different values, the
request is rejected with a 400 status.
2019-10-31 23:05:12 +02:00
Zachary Tong 34c2375417 Add v7.4.3 version constant 2019-10-31 13:21:25 -04:00
Alexander Reelsen 4ecf234617 Upgrade to joda 2.10.4 (#47805) 2019-10-31 14:49:50 +01:00
Stéphane Campinas 7ea74918e1 [DOCS] Fix typo in IndexFieldData.java comments (#48743) 2019-10-31 09:40:35 -04:00
kkewwei 0366c4d4a9 Faster access to INITIALIZING/RELOCATING shards (#47817)
Today a couple of allocation deciders iterate through all the shards on a node
to find the `INITIALIZING` or `RELOCATING` ones, and this can slow down cluster
state updates in clusters with very high-density nodes holding many thousands
of shards even if those shards belong to closed or frozen indices. This commit
pre-computes the sets of `INITIALIZING` and `RELOCATING` shards to speed up
this search.

Closes #46941
Relates #48579

Co-authored-by: "hongju.xhj" <hongju.xhj@alibaba-inc.com>
2019-10-31 10:55:59 +00:00
Rory Hunter d96976e2b1
Improve resiliency to formatting JSON in server (#48706)
Backport of #48553. Make a number of changes so that JSON in the server
directory is more resilient to automatic formatting. This covers:

   * Reformatting multiline JSON to embed whitespace in the strings
   * Add helper method `stripWhitespace()`, to strip whitespace from a JSON
     document using XContent methods. This is sometimes necessary where
     a test is comparing some machine-generated JSON with an expected
     value.
2019-10-31 10:48:55 +00:00
Arvind Ramachandran eefa84bc94 Ignore dangling indices created in newer versions (#48652)
Today it is possible that we import a dangling index that was created in a
newer version than one or more of the nodes in the cluster. Such an index would
prevent the older node(s) from rejoining the cluster if they were to briefly
leave it for some reason. This commit prevents the import of such dangling
indices.

Fixes #34264
2019-10-31 10:12:42 +00:00
Yannick Welsch fe8901b00b Return consistent source in updates (#48707) 2019-10-31 10:00:40 +01:00
Ignacio Vera 5bea3898a9
Add IndexOrDocValuesQuery to GeoPolygonQueryBuilder (#48449) (#48731) 2019-10-31 08:46:57 +01:00
Nhat Nguyen f8ef402027 Do not warm up searcher in engine constructor (#48605)
With this change, we won't warm up searchers until we externally refresh
an engine. We explicitly refresh before allowing reading from a shard
(i.e., move to post_recovery state) and during resetting. These
guarantees that we have warmed up the engine before exposing the
external searcher.

Another prerequisite for #47186.
2019-10-30 14:22:59 -04:00
Armin Braun 36039706b5
Fix SnapshotShardStatus Reporting for Failed Shard (#48556) (#48687)
Fixes the shard snapshot status reporting for failed shards
in the corner case of failing the shard because of an exception
thrown in `SnapshotShardsService` and not the repository.
We were missing the update on the `snapshotStatus` instance in
this case which made the transport APIs using this field report
back an incorrect status.
Fixed by moving the failure handling to the `SnapshotShardsService`
for all cases (which also simplifies the code, the ex. wrapping in
the repository was pointless as we only used the ex. trace upstream
anyway).
Also, added an assertion to another test that explicitly checks this
failure situation (ex. in the `SnapshotShardsService`) already.

Closes #48526
2019-10-30 15:43:41 +01:00
Armin Braun 52e5ceb321
Restore from Individual Shard Snapshot Files in Parallel (#48110) (#48686)
Make restoring shard snapshots run in parallel on the `SNAPSHOT` thread-pool.
2019-10-30 14:36:30 +01:00
Armin Braun 01e326d2e3
Fix ref count handling in Engine.failEngine (#48639) (#48646)
We can run into an already closed store here and hence
throw on trying to increment the ref count => moving to
the guarded ref count increment

closes #48625
2019-10-30 10:10:48 +01:00
Julie Tibshirani 89c65752dc
Update the signature of vector script functions. (#48653)
Previously the functions accepted a doc values reference, whereas they now
accept the name of the vector field. Here's an example of how a vector function
was called before and after the change.

```
Before: cosineSimilarity(params.query_vector, doc['field'])
After:  cosineSimilarity(params.query_vector, 'field')
```

This seems more intuitive, since we don't allow direct access to vector doc
values and the the meaning of `doc['field']` is unclear.

The PR makes the following changes (broken into distinct commits):
* Add new function signatures of the form `function(params.query_vector,
'field')` and deprecates the old ones. Because Painless doesn't allow two
methods with the same name and number of arguments, we allow a generic `Object`
to be passed in to the function and decide on the behavior through an
`instanceof` check.
* Refactor the class bindings so that the document field is passed to the
constructor instead of the instance method. This allows us to avoid retrieving
the vector doc values on every function invocation, which gives a tiny speed-up
in benchmarks.

Note that this PR adds new signatures for the sparse vector functions too, even
though sparse vectors are deprecated. It seemed simplest to understand (for both
us and users) to keep everything symmetric between dense and sparse vectors.
2019-10-29 15:46:05 -07:00
Stuart Tettemer 55d00cf2b1
Scripting: fill in get contexts REST API (#48319) (#48602)
Updates response for `GET /_script_context`, returning a `contexts`
object with a list of context description objects.  The description
includes the context name and a list of methods available.  The
methods list has the signature for the `execute` mathod and any
getters. eg.
```
{
  "contexts": [
     {
       "name" : "moving-function",
       "methods" : [
         {
           "name" : "execute",
           "return_type" : "double",
           "params" : [
             {
               "type" : "java.util.Map",
               "name" : "params"
             },
             {
               "type" : "double[]",
               "name" : "values"
             }
           ]
         }
       ]
     },
     {
       "name" : "number_sort",
       "methods" : [
         {
           "name" : "execute",
           "return_type" : "double",
           "params" : [ ]
         },
         {
           "name" : "getDoc",
           "return_type" : "java.util.Map",
           "params" : [ ]
         },
         {
           "name" : "getParams",
           "return_type" : "java.util.Map",
           "params" : [ ]
         },
         {
           "name" : "get_score",
           "return_type" : "double",
           "params" : [ ]
         }
       ]
     },
...
  ]
}
```

fixes: #47411
2019-10-29 14:41:15 -06:00
Nhat Nguyen 2a863ac8ff Fix testCleanUpCommitsWhenGlobalCheckpointAdvanced
Relates #48559
2019-10-29 10:39:16 -04:00
Nhat Nguyen b08cd058bc Greedily advance safe commit on new global checkpoint (#48559)
Today we won't advance the safe commit on a new global checkpoint unless 
the last commit can become safe. This is not great if we have more than
two commits as we can have a new safe commit earlier.

Closes #4853
2019-10-29 10:39:16 -04:00
Jim Ferenczi aa70ff5ea4 Fix failures in ShuffleForcedMergePolicyTests#testDiagnostics (#48627)
This commit fixes intermittent failures in ShuffleForcedMergePolicyTests#testDiagnostics
by setting a more restricted merge policy that ensures that extra merging will not happen
before the forced merge.
2019-10-29 13:46:55 +01:00
Jim Ferenczi c6abe58f63 Fix expectations in SearchAfter integration tests (#48372)
This commit fixes the expectations of SearchAfterIT#shouldFail regarding the inner exceptions that should be thrown
when testing failures. The exception is sometimes wrapped in a QueryShardException so this change only checks that
the toString representation contains the expected message.

Closes #43143
2019-10-29 12:37:22 +01:00
Yannick Welsch 6af3ce58f8 Filter on node id in AllocationIdIT (#48623)
Makes the assertions more targeted.

Relates #48529
2019-10-29 12:10:48 +01:00
Jim Ferenczi 028084ce23 Add a new merge policy that interleaves old and new segments on force merge (#48533)
This change adds a new merge policy that interleaves eldest and newest segments picked by MergePolicy#findForcedMerges
and MergePolicy#findForcedDeletesMerges. This allows time-based indices, that usually have the eldest documents
first, to be efficient at finding the most recent documents too. Although we wrap this merge policy for all indices
even though it is mostly useful for time-based but there should be no overhead for other type of indices so it's simpler
than adding a setting to enable it. This change is needed in order to ensure that the optimizations that we are working
on in # remain efficient even after running a force merge.

Relates #37043
2019-10-29 10:44:56 +01:00
Armin Braun 53a22b8a8a
Fix Validity of RepositoryDataTests Randomness (#48564) (#48566)
Trivial point, but we were only testing shard generations
for a single shard here, accidentally, and not testing the
`null` generation case at all.
2019-10-28 11:04:57 +01:00
Nhat Nguyen 1ef87c9a68 Refresh should not acquire readLock (#48414)
Today, we hold the engine readLock while refreshing. Although this 
choice simplifies the correctness reasoning, it can block IndexShard 
from closing if warming an external reader takes time. The current
implementation of refresh does not need to hold readLock as
ReferenceManager can handle errors correctly if the engine is closed in
midway.

This PR is a prerequisite that we need to solve #47186.
2019-10-25 17:32:35 -04:00
Dan Hermann 2e3db518c9
Do not reference values for filtered settings (#48066) (#48518) 2019-10-25 16:22:11 -05:00
Tim Brooks f5f1072824
Multiple remote connection strategy support (#48496)
* Extract remote "sniffing" to connection strategy (#47253)

Currently the connection strategy used by the remote cluster service is
implemented as a multi-step sniffing process in the
RemoteClusterConnection. We intend to introduce a new connection strategy
that will operate in a different manner. This commit extracts the
sniffing logic to a dedicated strategy class. Additionally, it implements
dedicated tests for this class.

Additionally, in previous commits we moved away from a world where the
remote cluster connection was mutable. Instead, when setting updates are
made, the connection is torn down and rebuilt. We still had methods and
tests hanging around for the mutable behavior. This commit removes those.

* Introduce simple remote connection strategy (#47480)

This commit introduces a simple remote connection strategy which will
open remote connections to a configurable list of user supplied
addresses. These addresses can be remote Elasticsearch nodes or
intermediate proxies. We will perform normal clustername and version
validation, but otherwise rely on the remote cluster to route requests
to the appropriate remote node.

* Make remote setting updates support diff strategies (#47891)

Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.

As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.

* Make remote setting updates support diff strategies (#47891)

Currently the entire remote cluster settings infrastructure is designed
around the sniff strategy. As we introduce an additional conneciton
strategy this infrastructure needs to be modified to support it. This
commit modifies the code so that the strategy implementations will tell
the service if the connection needs to be torn down and rebuilt.

As part of this commit, we will wait 10 seconds for new clusters to
connect when they are added through the "update" settings
infrastructure.
2019-10-25 09:29:41 -06:00
Luca Cavanna d6d2edf324 Fix .tasks index strict mapping: parent_id should be parent_task_id (#48393)
* Fix .tasks index strict mapping: parent_id should be parent_task_id

The .tasks index has mappings that's strictly defined. `parent_task_id`
was defined as `parent_id` though which would cause an exception in case
a task is persisted that has a parent task id set.

While at it, a couple of compiler warnings were addressed and a test
request builder was removed in favour of using its corresponding request.

* increment version
2019-10-25 17:00:06 +02:00