Commit Graph

3581 Commits

Author SHA1 Message Date
Zachary Tong 8d17527050 [TEST] create larger cuckoo filters for tests (#46457)
The cuckoo filters could be randomly created with too small a
capacity or precision, which means that they could only absorb a few
values before collisions started to make all filters look identical.

This increases the size of the filters we generate (capacity >> the
test cases) and lowers the fpp rate.
2019-09-09 10:18:51 -04:00
David Turner 8428f8e6e8 Remove trailing comma from nodes lists (#46484)
Today when the membership of the cluster changes we log messages that describe
the change like this:

    added {{node-1}{OPdaTIGmSxaEXXOyg3o96w}{127.0.0.1}{127.0.0.1:9301}{di},}

The trailing comma suggests there is some missing string that might contain
extra information, but in fact it's an artefact of how these messages are
constructed. This commit removes the trailing comma from these lists.
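
A minimal sketch of the idea, not the actual Elasticsearch code: the class and method names are hypothetical. Joining the elements with a delimiter, instead of appending a comma after each one, leaves no trailing comma.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the Elasticsearch implementation.
final class NodeListFormat {
    // Joins node descriptions with commas and wraps the result in braces,
    // so nothing is emitted after the last element.
    static String describe(List<String> nodes) {
        return nodes.stream().collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        System.out.println("added " + describe(List.of("{node-1}{127.0.0.1:9301}{di}")));
    }
}
```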
2019-09-09 14:47:32 +01:00
Armin Braun ee3396735c
Execute SnapshotsService Error Callback on Generic Thread (#46277) (#46480)
I couldn't find a test for this, as it seems we only get
into this error handler on a bug. Regardless, we are
executing the snapshot finalization on the master update
thread here which shouldn't happen and will make
debugging a production issue resulting from this
trickier than it has to be (because we probably also
get a cluster state apply is slow warning in addition
to the original bug).
Used the generic pool here instead of the snapshot pool
because we're resolving the user callback here as
well and the generic pool seemed like the safer bet for
that.
2019-09-09 14:38:11 +02:00
Nhat Nguyen 24c3a1de3c Ignore replication for noop updates (#46458)
Previously, we ignored replication for noop updates because they did not
have sequence numbers. Since #44603, we started assigning sequence
numbers to noop updates, leading them to be replicated to replicas.

This bug occurs only on 8.0 for it requires #41065 and #44603.

Closes #46366
2019-09-07 11:32:01 -04:00
markharwood 323ec022be
Deprecate the "index.max_adjacency_matrix_filters" index setting (#46394)
Following performance optimisations to the adjacency_matrix aggregation we no longer require this setting. Marked as deprecated and due for removal in 8.0

Related #46324
2019-09-06 13:59:47 +01:00
Yunfeng,Wu 7582af27b0 Resolve the incorrect scroll_current when delete or close index (#45226)
Resolve the incorrect current scroll for deleted or closed index
2019-09-06 09:45:53 +02:00
Jim Ferenczi f2a6c88f83
Add a system property to ignore awareness attributes (#46375)
This is a follow up of #19191 for 7.x.
This change adds a system property called "es.routing.search_ignore_awareness_attributes" that when set to true will
effectively ignore allocation awareness attributes when routing search and get requests. This is now the default in 8.x so this
commit adds a way to opt-in to this new behavior in a minor version of 7.x.
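
As a rough illustration of how such a flag is typically consumed, the sketch below reads the property as a boolean. Only the property name comes from the commit message; the class and method names are hypothetical, and the real lookup in Elasticsearch may differ. The property would be supplied to the JVM as `-Des.routing.search_ignore_awareness_attributes=true`.

```java
// Illustrative only: a conventional way to read a boolean JVM system property.
final class AwarenessProperty {
    static final String KEY = "es.routing.search_ignore_awareness_attributes";

    // Returns true only when the property is set to "true" (case-insensitive);
    // an unset property is treated as false.
    static boolean ignoreAwarenessAttributes() {
        return Boolean.parseBoolean(System.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        System.out.println(KEY + " = " + ignoreAwarenessAttributes());
    }
}
```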

Relates #45735
2019-09-06 09:29:27 +02:00
Paul Sanwald 758680c549
version bump to 6.8.4 (#46409) 2019-09-05 15:14:36 -04:00
Jason Tedor 92866f977a
Clarify error message on keystore write permissions (#46321)
When the Elasticsearch process does not have write permissions to
upgrade the Elasticsearch keystore, we bail with an error message that
indicates there is a filesystem permissions problem. This commit
clarifies that error message by pointing out the directory where write
permissions are required, or that the user can also run the
elasticsearch-keystore upgrade command manually before starting the
Elasticsearch process. In this case, the upgrade would not be needed at
runtime, so the permissions would not be needed then.
2019-09-05 15:11:54 -04:00
Benjamin Trent d912a49c6f
[7.x] Support geotile_grid aggregation in composite agg sources (#45810) (#46399)
* Support geotile_grid aggregation in composite agg sources (#45810)

Adds support for `geotile_grid` as a source in composite aggs. 

Part of this change includes adding a new docFormat of `GEOTILE` that formats a hashed `long` value into a geotile formatting string `zoom/x/y`.
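
The formatting step can be sketched as follows. The bit layout used here (zoom in the high bits, followed by 29 bits each for x and y) is an assumption made for the example, and the class is hypothetical; the layout actually used by the commit is not stated in this message.

```java
// Hypothetical sketch: packs a tile into a long and formats it as "zoom/x/y".
final class GeoTileFormatSketch {
    // Assumed layout: [zoom][29 bits of x][29 bits of y].
    private static final int TILE_BITS = 29;
    private static final long TILE_MASK = (1L << TILE_BITS) - 1;

    static long encode(int zoom, int x, int y) {
        return ((long) zoom << (2 * TILE_BITS)) | ((long) x << TILE_BITS) | (long) y;
    }

    // Unpacks the three components and renders them as a geotile string.
    static String format(long hash) {
        long zoom = hash >>> (2 * TILE_BITS);
        long x = (hash >>> TILE_BITS) & TILE_MASK;
        long y = hash & TILE_MASK;
        return zoom + "/" + x + "/" + y;
    }

    public static void main(String[] args) {
        System.out.println(format(encode(7, 33, 45)));
    }
}
```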
2019-09-05 13:22:57 -05:00
Armin Braun 7a9af874ad
Enable Debug Logging for Master and Coordination Packages (#46363) (#46374)
In order to track down #46091:
* Enables debug logging in REST tests for `master` and `coordination` packages
since we suspect that issues are caused by failed and then retried publications
2019-09-05 14:03:38 +02:00
Yannick Welsch 7e4c633ce3 Quiet down shard lock failures (#46368)
These were never intended to be logged at the warning level, but were made visible by a refactoring in #19991, which introduced a new exception type but forgot to adapt some of the consumers of the exception.
2019-09-05 13:08:11 +02:00
Nhat Nguyen 03ed18a010 Unmute testRecoveryFromFailureOnTrimming
Tracked at #46267
2019-09-04 22:33:17 -04:00
Julie Tibshirani 40c3225d26
First round of optimizations for vector functions. (#46294)
This PR merges the `vectors-optimize-brute-force` feature branch, which makes
the following changes to how vector functions are computed:
* Precompute the L2 norm of each vector at indexing time. (#45390)
* Switch to ByteBuffer for vector encoding. (#45936)
* Decode vectors while computing the vector function. (#46103)
* Use an array instead of a List for the query vector. (#46155)
* Precompute the normalized query vector when using cosine similarity. (#46190)
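
To make the list above concrete, here is a small sketch that combines several of the ideas: the L2 norm is stored alongside the vector at encoding time, a `ByteBuffer` holds the encoded floats, values are decoded while the dot product is accumulated, and the query is a plain array normalized once up front. The class, method names, and byte layout are assumptions for illustration, not the code from the feature branch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the optimizations listed above.
final class VectorEncodingSketch {
    // Encodes the vector as 4-byte floats followed by its precomputed L2 norm.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate((vector.length + 1) * Float.BYTES);
        double sumOfSquares = 0;
        for (float v : vector) {
            buffer.putFloat(v);
            sumOfSquares += v * v;
        }
        buffer.putFloat((float) Math.sqrt(sumOfSquares));
        return buffer.array();
    }

    // Normalizes the query once so it is not renormalized per document.
    static float[] normalize(float[] query) {
        double sumOfSquares = 0;
        for (float q : query) {
            sumOfSquares += q * q;
        }
        float norm = (float) Math.sqrt(sumOfSquares);
        float[] unit = new float[query.length];
        for (int i = 0; i < query.length; i++) {
            unit[i] = query[i] / norm;
        }
        return unit;
    }

    // Cosine similarity: decodes each stored value while accumulating the dot
    // product, then divides by the stored norm. Assumes a non-zero stored vector.
    static double cosine(float[] unitQuery, byte[] encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        double dot = 0;
        for (float q : unitQuery) {
            dot += q * buffer.getFloat();
        }
        float storedNorm = buffer.getFloat();
        return dot / storedNorm;
    }
}
```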

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
2019-09-04 14:45:57 -07:00
Nhat Nguyen a16cb89956 Revert "Sync translog without lock when trim unreferenced readers (#46203)"
Unfortunately, with this change, we won't clean up all unreferenced
generations when reopening. We assume that there's at most one
unreferenced generation when reopening translog. The previous
implementation guarantees this assumption by syncing translog every time
after we remove a translog reader. This change, however, only syncs
translog once after we have removed all unreferenced readers (can be
more than one) and breaks the assumption.

Closes #46267

This reverts commit fd8183ee51d7cf08d9def58a2ae027714beb60de.
2019-09-04 17:09:39 -04:00
Jason Tedor 3cbdd84b89
Add test that get triggers shard search active (#46317)
This commit is a follow-up to a change that fixed that multi-get was not
triggering a shard to become search active. In that change, we added a
test that multi-get properly triggers a shard to become search
active. This commit is a follow-up to that change which adds a test for
the get case. While get is already handled correctly in production code,
there was not a test for it. This commit adds one. Additionally, we
factor all the search idle tests from IndexShardIT into a separate test
class, as an effort to keep related tests together instead of a single
large test class containing a jumble of tests, and also to keep test
classes smaller for better parallelization.
2019-09-04 11:53:32 -04:00
markharwood 408b58dd9d
Adjacency_matrix aggregation optimisation. (#46257) (#46315)
Avoid pre-allocating ((N * N) - N) / 2 “BitsIntersector” objects given N filters.
Most adjacency matrices will be sparse and we typically don’t need to allocate all of these objects - can save a lot of allocations when the number of filters is high.
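
A sketch of the difference between eager and lazy allocation, under stated assumptions: the class below is hypothetical and uses a simple counter per pair in place of the real “BitsIntersector” objects. The point is only that per-pair state is created when a pair is first seen rather than for all ((N * N) - N) / 2 pairs up front.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a counter stands in for the per-pair intersection object.
final class LazyPairsSketch {
    private final int filterCount;
    private final Map<Long, int[]> pairs = new HashMap<>();

    LazyPairsSketch(int filterCount) {
        this.filterCount = filterCount;
    }

    // Number of unordered pairs that eager allocation would create.
    static long pairCount(long n) {
        return ((n * n) - n) / 2;
    }

    // Returns the state for the pair (i, j), creating it on first use.
    // The pair is unordered, so (i, j) and (j, i) share one entry.
    int[] counterFor(int i, int j) {
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        return pairs.computeIfAbsent((long) lo * filterCount + hi, key -> new int[1]);
    }

    int allocated() {
        return pairs.size();
    }
}
```

With 100 filters, eager allocation would create 4950 objects, while the lazy map holds only the pairs that actually co-occur.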

Closes #46212
2019-09-04 16:45:32 +01:00
Nhat Nguyen eb56d23421 Do not send recovery requests with CancellableThreads (#46287)
Previously, we sent recovery requests using CancellableThreads because
we sent requests and waited for responses in a blocking manner. With async
recovery, we no longer need to do so. Moreover, if we fail to submit a
request, then we can release the Store using an interruptible thread
which can risk invalidating the node lock.

This PR is the first step to avoid forking when releasing the Store.

Relates #45409
Relates #46178
2019-09-04 11:26:11 -04:00
Henning Andersen 5066835569 Fix SearchService.createContext exception handling (#46258)
An exception from the DefaultSearchContext constructor could leak a
searcher, causing future issues like shard lock obtained exceptions. The
underlying cause of the exception in the constructor has been fixed, but
as a safety precaution we also fix the exception handling in
createContext.

Closes #45378
2019-09-04 14:46:30 +02:00
Nhat Nguyen 3f67cbe974 Suppress warning from background sync on relocated primary (#46247)
If a primary is being relocated, then the global checkpoint and
retention lease background sync can emit unnecessary warning logs.
This side effect was introduced in #42241.

Relates #40800
Relates #42241
2019-09-03 18:44:15 -04:00
Nhat Nguyen 5924df1764 Mute testRecoveryFromFailureOnTrimming
Tracked at #46267
2019-09-03 18:44:08 -04:00
Lee Hinman 57f322f85e Move MockRepository into test framework (#46298)
This moves the `MockRepository` class into `test/framework/src/main` so
it can be used across all modules and plugins in tests.
2019-09-03 16:21:10 -06:00
Jason Tedor b8c51ff894
Multi-get requests should wait for search active (#46283)
When a shard has fallen search idle, and a non-realtime multi-get
request is executed, today such requests do not wait for the shard to
become search active and therefore such requests do not wait for a
refresh to see the latest changes to the index. This also prevents such
requests from triggering the shard as non-search idle, influencing the
behavior of scheduled refreshes. This commit addresses this by attaching
a listener to the shard search active state for multi-get requests. In
this way, when the next scheduled refresh is executed, the multi-get
request will then proceed.
2019-09-03 14:31:37 -04:00
Henning Andersen 2383acaa89 Fix testSyncFailsIfOperationIsInFlight (#46269)
testSyncFailsIfOperationIsInFlight could fail due to the index request
spawning a GCP sync (new since 7.4). Test now waits for it to finish
before testing that flushed sync fails.
2019-09-03 17:30:00 +02:00
dengweisysu 416419e4c9 Sync translog without lock when trim unreferenced readers (#46203)
With this change, we can avoid blocking writing threads when trimming
unreferenced readers; hence improving the translog writing performance
in async durability mode.

Close #46201
2019-09-02 21:55:06 -04:00
Anup e01ec802e7 Remove duplicate line in SearchAfterBuilder (#45994) 2019-09-03 01:30:01 +02:00
Armin Braun 2662c1b417
Wait for all Rec. to Stop on Node Close (#46178) (#46237)
* Wait for all Rec. to Stop on Node Close

* This issue is in the `RecoverySourceHandler#acquireStore`. If we submit the store release to the generic threadpool while it is getting shut down, we never complete the future we wait on (in the generic pool as well) and potentially fail to ever release the store.
* Fixed by waiting for all recoveries to end on node close so that we always have a healthy thread pool here
* Closes #45956
2019-09-02 18:04:37 +02:00
Martijn van Groningen 5747badaa8
Allow ingest processors access to node client. (#46077)
This is the first PR that merges changes made to server module from
the enrich branch (see #32789) into the master branch.

The plan is to merge changes made to the server module separately from
the pr that will merge enrich into master, so that these changes can
be reviewed in isolation.
2019-09-02 08:24:26 +02:00
Nhat Nguyen db949847e5 Fix translog stats in testPrepareIndexForPeerRecovery (#46137)
When recovering a shard locally, we use a translog snapshot from
newSnapshotFromGen which consists of all readers from a certain
generation. In the test, we use newSnapshotFromMinSeqNo for the
expectation. The snapshot of this method includes only readers
containing operations in the requesting range.

Closes #46022
2019-08-30 08:53:27 -04:00
Andrey Ershov 152ce62c58 Enhanced logging when transport is misconfigured to talk to HTTP port (#45964)
If a node is misconfigured to talk to a remote node's HTTP port (instead of
the transport port), it will eventually receive an HTTP response from the
remote node on the transport port (this happens when a node accidentally
sends a line-terminating byte in a transport request).
Today this results in an unfriendly log message and a long stack trace.
This commit adds a check for whether a malformed response is an HTTP
response. In that case, a concise log message is emitted instead.
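
One plausible way to perform such a check, shown as a sketch rather than the commit's actual code: an HTTP response starts with the ASCII bytes `HTTP` (as in `HTTP/1.1 400 Bad Request`), so the first bytes of the malformed message can be compared against that prefix. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the check used in Elasticsearch.
final class HttpResponseSniffer {
    private static final byte[] PREFIX = "HTTP".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the message begins with the ASCII bytes "HTTP".
    static boolean looksLikeHttpResponse(byte[] message) {
        if (message.length < PREFIX.length) {
            return false;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (message[i] != PREFIX[i]) {
                return false;
            }
        }
        return true;
    }
}
```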

(cherry picked from commit 911d02b7a9c3ce7fe316360c127a935ca4b11f37)
2019-08-30 13:02:08 +02:00
Paul Sanwald 8bdbc7d9bf
Bump version from 7.4 to 7.5 (#46142) 2019-08-29 15:03:26 -04:00
Julie Tibshirani b5d8b364bb
Ensure top docs optimization is fully disabled for queries with unbounded max scores. (#46105) (#46139)
When a query contains a mandatory clause that doesn't track the max score per
block, we disable the max score optimization. Previously, we were doing this by
wrapping the collector with a FilterCollector that always returned
ScoreMode.COMPLETE.

However we weren't adjusting totalHitsThreshold, so the collector could still
call Scorer#setMinCompetitiveScore. It is against the method contract to call
setMinCompetitiveScore when the score mode is COMPLETE, and some scorers like
ReqOptSumScorer throw an error in this case.

This commit tries to disable the optimization by always setting
totalHitsThreshold to max int, as opposed to wrapping the collector.
2019-08-29 10:56:53 -07:00
Simon Willnauer 9b2ea07b17
Flush engine after big merge (#46066) (#46111)
Today we might carry on a big merge uncommitted and therefore
occupy a significant amount of diskspace for quite a long time
if for instance indexing load goes down and we are not quickly
reaching the translog size threshold. This change will cause a
flush if we hit a significant merge (512MB by default) which
frees diskspace sooner.
2019-08-29 17:54:15 +02:00
Nhat Nguyen bb49124690 Only verify global checkpoint if translog sync occurred (#45980)
We only sync translog if the given offset hasn't synced yet. We can't
verify the global checkpoint from the latest translog checkpoint unless
a sync has occurred.

Closes #46065
Relates #45634
2019-08-29 09:44:40 -04:00
David Turner d340530a47 Avoid overshooting watermarks during relocation (#46079)
Today the `DiskThresholdDecider` attempts to account for already-relocating
shards when deciding how to allocate or relocate a shard. Its goal is to stop
relocating shards onto a node before that node exceeds the low watermark, and
to stop relocating shards away from a node as soon as the node drops below the
high watermark.

The decider handles multiple data paths by only accounting for relocating
shards that affect the appropriate data path. However, this mechanism does not
correctly account for _new_ relocating shards, which are unwittingly ignored.
This means that we may evict far too many shards from a node above the high
watermark, and may relocate far too many shards onto a node causing it to blow
right past the low watermark and potentially other watermarks too.

There are in fact two distinct issues that this PR fixes. New incoming shards
have an unknown data path until the `ClusterInfoService` refreshes its
statistics. New outgoing shards have a known data path, but we fail to account
for the change of the corresponding `ShardRouting` from `STARTED` to
`RELOCATING`, meaning that we fail to find the correct data path and treat the
path as unknown here too.

This PR also reworks the `MockDiskUsagesIT` test to avoid using fake data paths
for all shards. With the changes here, the data paths are handled in tests as
they are in production, except that their sizes are fake.

Fixes #45177
2019-08-29 12:40:55 +01:00
Jason Tedor 9bc4a24118
Handle delete document level failures (#46100)
Today we assume that document failures can not occur for deletes. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
2019-08-28 22:17:16 -04:00
Tal Levy a356bcff41
Add Circle Processor (#43851) (#46097)
add circle-processor that translates circles to polygons
2019-08-28 14:44:08 -07:00
Jason Tedor 1249e6ba5d
Handle no-op document level failures (#46083)
Today we assume that document failures can not occur for no-ops. This
assumption is bogus, as they can fail for a variety of reasons such as
the Lucene index having reached the document limit. Because of this
assumption, we were asserting that such a document-level failure would
never happen. When this bogus assertion is violated, we fail the node, a
catastrophe. Instead, we need to treat this as a fatal engine exception.
2019-08-28 13:57:24 -04:00
Tanguy Leroux 9e14ffa8be Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068) 2019-08-28 16:29:46 +02:00
Mark Tozzi aec125faff
Support Range Fields in Histogram and Date Histogram (#46012)
Backport of 1a0dddf4ad24b3f2c751a1fe0e024fdbf8754f94 (AKA #445395)

     * Add support for a Range field ValuesSource, including decode logic for range doc values and exposing RangeType as a first class enum
     * Provide hooks in ValuesSourceConfig for aggregations to control ValuesSource class selection on missing & script values
     * Branch aggregator creation in Histogram and DateHistogram based on ValuesSource class, to enable specialization based on type.  This is similar to how Terms aggregator works.
     * Prioritize field type when available for selecting the ValuesSource class type to use for an aggregation
2019-08-28 09:06:09 -04:00
Henning Andersen 300e717e42 Disallow partial results when shard unavailable (#45739)
Searching with `allowPartialSearchResults=false` could still return
partial search results during recovery. If a shard copy fails
with a "shard not available" exception, the failure would be ignored and
a partial result returned. The one case where this is known to happen
is when a shard copy is recovering when searching, since
`IllegalIndexShardStateException` is considered a "shard not available"
exception.

Relates to #42612
2019-08-27 17:01:23 +02:00
Nhat Nguyen 146e23a8a9 Relax translog assertion in testRestoreLocalHistoryFromTranslog (#45943)
Since #45473, we trim translog below the local checkpoint of the safe
commit immediately if soft-deletes are enabled. In
testRestoreLocalHistoryFromTranslog, we should have a safe commit after
recoverFromTranslog is called; then we will trim translog files which
contain only operations that are at most the global checkpoint.

With this change, we relax the assertion to ensure that we don't put
operations to translog while recovering history from the local translog.
2019-08-26 17:19:19 -04:00
Nhat Nguyen c66bae39c3 Update translog checkpoint after marking ops as persisted (#45634)
If two translog syncs happen concurrently, then one can return before
its operations are marked as persisted. In general, this should not be
an issue; however, peer recoveries currently rely on this assumption.

Closes #29161
2019-08-26 17:18:52 -04:00
Nhat Nguyen f2e8b17696 Do not create engine under IndexShard#mutex (#45263)
Today we create new engines under IndexShard#mutex. This is not ideal
because it can block the cluster state updates which also execute under
the same mutex. We can avoid this problem by creating new engines under
a separate mutex.
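
The locking change can be pictured with the sketch below. It is a simplified, hypothetical class, not `IndexShard`: the slow construction runs under a dedicated lock, and the shared state lock is held only briefly to publish the result. The lock ordering (engine lock first, then state lock) is an assumption of the example.

```java
import java.util.function.Supplier;

// Hypothetical sketch of creating an engine under a separate mutex.
final class ShardLockingSketch {
    private final Object mutex = new Object();       // guards shard state, shared with other updates
    private final Object engineMutex = new Object(); // guards engine creation only
    private Object engine;

    void createEngine(Supplier<Object> factory) {
        synchronized (engineMutex) {
            // Potentially slow work happens here without holding `mutex`,
            // so other users of `mutex` are not blocked while it runs.
            Object created = factory.get();
            synchronized (mutex) {
                engine = created; // brief publish under the state lock
            }
        }
    }

    Object engine() {
        synchronized (mutex) {
            return engine;
        }
    }
}
```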

Closes #43699
2019-08-26 17:18:29 -04:00
Jason Tedor 3d64605075
Remove node settings from blob store repositories (#45991)
This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we remove them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.
2019-08-26 16:26:13 -04:00
Zachary Tong 943a016bb2
Add Cumulative Cardinality agg (and Data Science plugin) (#45990)
This adds a pipeline aggregation that calculates the cumulative
cardinality of a field.  It does this by iteratively merging in the
HLL sketch from consecutive buckets and emitting the cardinality up
to that point.

This is useful for things like finding the total "new" users that have
visited a website (as opposed to "repeat" visitors).
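
The merging step can be sketched as follows. A `HashSet` stands in for the HLL sketch here, so the counts are exact rather than approximate; that substitution, and the class and method names, are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; a HashSet replaces the HLL sketch used by the aggregation.
final class CumulativeCardinalitySketch {
    // For each bucket in order, merges its values into a running set and
    // emits the number of distinct values seen up to that point.
    static List<Integer> cumulative(List<Set<String>> buckets) {
        Set<String> merged = new HashSet<>();
        List<Integer> result = new ArrayList<>();
        for (Set<String> bucket : buckets) {
            merged.addAll(bucket);
            result.add(merged.size());
        }
        return result;
    }
}
```

The difference between consecutive outputs gives the number of "new" values in each bucket, which corresponds to the new-versus-repeat visitor example above.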

This is a Basic+ aggregation and adds a new Data Science plugin
to house it and future advanced analytics/data science aggregations.
2019-08-26 16:19:55 -04:00
James Baiera 5535ff0a44
Fix IngestService to respect original document content type (#45799) (#45984)
Backport of #45799

This PR modifies the logic in IngestService to preserve the original content type 
on the IndexRequest, such that when a document with a content type like SMILE 
is submitted to a pipeline, the resulting document that is persisted will remain in 
the original content type (SMILE in this case).
2019-08-26 14:33:33 -04:00
Armin Braun af2bd75def
Fix Broken HTTP Request Breaking Channel Closing (#45958) (#45973)
This is essentially the same issue fixed in #43362 but for http request
version instead of the request method. We have to deal with the
case of not being able to parse the request version, otherwise
channel closing fails.

Fixes #43850
2019-08-26 16:20:58 +02:00
Armin Braun 5a17987e19
Fix SnapshotStatusApisIT (#45929) (#45971)
The snapshot status when blocking can still be INIT in rare cases when
the new cluster state that has the snapshot in `STARTED` hasn't yet
become visible.
Fixes #45917
2019-08-26 15:59:02 +02:00
Andrey Ershov d96469ddff Better logging for TLS message on non-secure transport channel (#45835)
This commit enhances logging for 2 cases:

1. If a non-TLS-enabled node receives a transport message from a
TLS-enabled node on the transport port.
2. If a non-TLS-enabled node receives an HTTPS request on the transport port.
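
Both cases involve TLS bytes arriving on a plaintext channel. As a hedged sketch of how such traffic can be recognized (the commit's actual detection logic is not described here, and the class name is hypothetical): a TLS record begins with a content-type byte, which is 0x16 for a handshake, followed by a major version byte of 0x03.

```java
// Hypothetical sketch; not the detection code from the commit.
final class TlsSniffer {
    // TLS record header: byte 0 is the content type (0x16 = handshake),
    // byte 1 is the major protocol version (0x03 for SSL 3.0 and TLS).
    static boolean looksLikeTlsHandshake(byte[] message) {
        return message.length >= 2 && message[0] == 0x16 && message[1] == 0x03;
    }
}
```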

(cherry picked from commit 4f52ebd32eb58526b4c8022f8863210bf88fc9be)
2019-08-26 15:07:13 +02:00