Stored fields were still being accessed for nested inner hits even if the _source was not requested.
This was done to figure out the id of the root document. However this is already known higher up the stack.
So instead this change adds the id to the nested search context, so that it is no longer required to be fetched via the stored fields.
In case the _source is large and no source is requested then hot threads like these ones would still appear:
```
100.3% (501.3ms out of 500ms) cpu usage by thread 'elasticsearch[AfXKKfq][search][T#6]'
2/10 snapshots sharing following 22 elements
org.apache.lucene.store.DataInput.skipBytes(DataInput.java:352)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.skipField(CompressingStoredFieldsReader.java:246)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:601)
org.apache.lucene.index.CodecReader.document(CodecReader.java:88)
org.apache.lucene.index.FilterLeafReader.document(FilterLeafReader.java:411)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:347)
org.elasticsearch.search.fetch.FetchPhase.createNestedSearchHit(FetchPhase.java:219)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150)
org.elasticsearch.search.fetch.subphase.InnerHitsFetchSubPhase.hitsExecute(InnerHitsFetchSubPhase.java:73)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:166)
org.elasticsearch.search.fetch.subphase.InnerHitsFetchSubPhase.hitsExecute(InnerHitsFetchSubPhase.java:73)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:166)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:422)
```
and:
```
8/10 snapshots sharing following 27 elements
org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:135)
org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:138)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState$1.fillBuffer(CompressingStoredFieldsReader.java:531)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState$1.readBytes(CompressingStoredFieldsReader.java:550)
org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
org.apache.lucene.store.DataInput.skipBytes(DataInput.java:350)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.skipField(CompressingStoredFieldsReader.java:246)
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:601)
org.apache.lucene.index.CodecReader.document(CodecReader.java:88)
org.apache.lucene.index.FilterLeafReader.document(FilterLeafReader.java:411)
org.elasticsearch.search.fetch.FetchPhase.loadStoredFields(FetchPhase.java:347)
org.elasticsearch.search.fetch.FetchPhase.createNestedSearchHit(FetchPhase.java:219)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150)
org.elasticsearch.search.fetch.subphase.InnerHitsFetchSubPhase.hitsExecute(InnerHitsFetchSubPhase.java:73)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:166)
org.elasticsearch.search.fetch.subphase.InnerHitsFetchSubPhase.hitsExecute(InnerHitsFetchSubPhase.java:73)
org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:166)
org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:422)
```
Currently Engine.close can return immediately if the engine is already at the process of shutting down (due to a concurrent close call or an engine failure). This is a shame because some of our testing infra wants to do things like checking the index. This commit changes the logic to make sure that all calls to close wait until resources are freed. Failing the engine is still non blocking.
Fixes#25817
This commit removes all external dependencies from the rest client jar
and shades them in an 'org.elasticsearch.client' package within the jar
using shadowJar gradle plugin. All projects that depended on the
existing jar have been converted to using the 'org.elasticsearch.client'
package prefixes to interact with the rest client.
Closes#25208
The context suggester extracts the context field values from the document but it does not filter doc values field coming from Keyword field.
This change filters doc values field when building the context values.
Fixes#25404
This change handles the case where a SpanNearQueryBuilder tries to create a query with a single clause.
This is not allowed in the SpanNearQuery so instead of throwing an exception when the weight is built, this change builds and returns
the singleton inner clause on toQuery.
Fixes#25630
The default _parent field tries to load global ordinals because it is created with eager_global_ordinals=true.
This leads to an IllegalStateException because this field does not have doc_values.
This change explicitely sets eager_global_ordinals to false in order to avoid the ISE on startup.
Fixes#25849
When a replica processes out of order operations, it can drop some due to version comparisons. In the past that would have resulted in a VersionConflictException being thrown and the operation was totally ignored. With the seq# push, we started storing these operations in the translog (but not indexing them into lucene) in order to have complete op histories to facilitate ops based recoveries. This in turn had the undesired effect that deleted docs may be resurrected during recovery in some extreme edge situation (see a complete explanation below). This PR contains a simple fix, which is also an optimization for the recovery process, incoming operation that have a seq# lower than the current local checkpoint (i.e., have already been processed) should not be indexed into lucene. Note that sometimes we can also skip storing them in the translog, but this is not required for the fix and is more complicated.
This is the equivalent of #25592
## More details on resurrected ops
Consider two operations:
- Index d1, seq no 1
- Delete d1, seq no 3
On a replica they come out of order:
- Translog gen 1 contains:
- delete (seqNo 3)
- Translog gen 2 contains:
- index (seqNo 1) (wasn't indexed into lucene, but put into the translog)
- another operation (seqNo 10)
- Translog gen 3
- another op (seqNo 9)
- Engine commits with:
- local checkpoint 9
- refers to gen 2
If this replica becomes a primary:
- Local recovery will replay translog gen 2 and up, causing index #1 to be re-index.
- Even if recovery will start at gen 3, the translog retention policy will cause file based recovery to replay the entire translog. If it happens to start at gen 2 (but not 1), we will run into the same problem.
#### Some context - out of order delivery involving deletes:
On normal operations, this relies on the gc_deletes setting. We assume that the setting represents an upper bound on the time between the index and the delete operation. The index operation will be detected as stale based on the tombstone map in the LiveVersionMap.
Recovery presents a challenge as it can replay an old index operation that was in the translog and override a delete operation that was done when the engine was opened (and is not part of the replayed snapshot). To deal with this situation, we disable GC deletes (i.e. retain all deletes) for the duration of recoveries. This means that the delete operation will be remembered and the index operation ignored.
Both of the above scenarios (local recover + peer recovery) create a situation where the delete operation is never replayed. It this "lost" as lucene doesn't remember it happened and our LiveVersionMap is populated with it.
#### Solution:
Note that both local and peer recovery represent a scenario where we replay translog ops on top of an existing lucene index, potentially with ongoing indexing. Therefore we can treat them the same.
The local checkpoint in Lucene represent a marker indicating that all operations below it were performed on the index. This is the only form of "memory" that we have that relates to deletes. If we can achieve the following:
1) All ops below the local checkpoint are not indexed to lucene.
2) All ops above the local checkpoint are
It will mean that all variants are covered: (i# == index op seq#, d# == delete op seq#, lc == local checkpoint in commit)
1) i# < d# <= lc - document is already deleted in lucene and stays that way.
2) i# <= lc < d# - delete is replayed on index - document is deleted
3) lc < i# < d# - index is replayed and then delete - document is deleted.
More formally - we want to make sure that for all ops that performed on the primary o1 and o2, if o2 is processed on a shard before o1, o1 will be dropped. We have the following scenarios
1) If both o1 or o2 are not included in the replayed snapshot and are above it (i.e., have a higher seq#), they fall under the gc deletes assumption.
2) If both o1 is part of the replayed snapshot but o2 is above it:
- if o2 arrives first, o1 must arrive due to the recovery and potentially via replication as well. since gc deletes is disabled we are guaranteed to know of o2's existence.
3) If both o2 and o1 are part of the replayed snapshot:
- we fall under the same scenarios as #2 - disabling GC deletes ensures we know of o2 if it arrives first.
4) If o1 falls before the snapshot and o2 is either part of the snapshot or higher:
- Since the snapshot is guaranteed to contain all ops that are not part of lucene and are above the lc in the commit used, this means that o1 is part of lucene and o1 < local checkpoint. This means it won't be processed and we're not in the scenario we're discussing.
5) If o2 falls before the snapshot but o1 is part of it:
- by the same reasoning above, o2 is < local checkpoint. Since o1 < o2, we also get o1 < local checkpoint and this will be dropped.
#### Implementation:
For local recovery, we can filter the ops we read of the translog and avoid replaying them. For peer recovery this is tricky as we do want to send the operations in order to have some history on the target shard. Filtering operations on the engine level (i.e., not indexing to lucene if op seq# <= lc) would work for both.
This commit changes the way we handle field expansion in `match`, `multi_match` and `query_string` query.
The main changes are:
- For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.
- For partial field names (with `*` suffix), the expansion is done only on `keyword`, `text`, `date`, `ip` and `number` field types. Other field types are simply ignored.
- For all fields (`*`), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.
- The `*` notation can also be used to set `default_field` option on`query_string` query. This should replace the needs for the extra option `use_all_fields` which is deprecated in this change.
This commit also rewrites simple `*` query to matchalldocs query when all fields are requested (Fixes#25556).
The same change should be done on `simple_query_string` for completeness.
`use_all_fields` option in `query_string` is also deprecated in this change, `default_field` should be set to `*` instead.
Relates #25551
Removes the primary term from the replication request and pushes it into the transport envelope. This makes it possible to remove the term from the ReplicationOperation universe. The primary term that is to be used for a replication operation is now determined in the reroute phase when the node decides to execute a primary action (and validated once the primary action gets to execute). This makes it possible to validate that the primary action was sent to the correct primary shard instance that it was meant to be sent to (currently we only validate primary actions using the allocation id, which can be reused for failed and reallocated primaries).
If a primary shard is relocated, and then subsequently closed, there is a short window where ReplicationOperation could access the
closed shard (engine is not shut down yet) and, because it does not know that the shard was relocated, try to update the local
checkpoint, tripping an assertion in GlobalCheckPointTracker that a local checkpoint cannot be updated if it's not in primary mode.
This change rewrites search requests on the coordinating node before
we send requests to the individual shards. This will reduce the rewrite load
and object creation for each rewrite on the executing nodes and will fetch
resources only once instead of N times once per shard for queries like `terms`
query with index lookups. (among percolator and geo-shape)
Relates to #25791
When we skip a shard we should first increment the skip and successful shard
counters before we notify the super class about a skipped shard which could
send back the result before we increment the stats.
This commit adds the min wire/index compat versions to the main action
output. Not only will this make the compatility expected more
transparent, but it also allows to test which version others think the
compat versions are, similar to how we test the lucene version.
When a node tries to join a cluster, it goes through a validation step to make sure the node is compatible with the cluster. Currently we validation that the node can read the cluster state and that it is compatible with the indexes of the cluster. This PR adds validation that the joining node's version is compatible with the versions of existing nodes. Concretely we check that:
1) The node's min compatible version is higher or equal to any node in the cluster (this prevents a too-new node from joining)
2) The node's version is higher or equal to the min compat version of all cluster nodes (this prevents a too old join where, for example, the master is on 5.6, there's another 6.0 node in the cluster and a 5.4 node tries to join).
3) The node's major version is at least as higher as the lowest node in the cluster. This is important as we use the minimum version in the cluster to stop executing bwc code for operations that require multiple nodes. If the nodes are already operating in "new cluster mode", we should prevent nodes from the previous major to join (even if they are wire level compatible). This does mean that if you have a very unlucky partition during the upgrade which partitions all old nodes which are also a minority / data nodes only, the may not be able to re-join the cluster. We feel this edge case risk is well worth the simplification it brings to BWC layers only going one way. This restriction only holds if the cluster state has been recovered (i.e., the cluster has properly formed).
Also, the node join validation can now selectively fail specific nodes (previously the entire batch was failed). This is an important preparation for a follow up PR where we plan to have a rejected joining node die with dignity.
Also has updates to ScriptMetaData for allowing the old namespace format to be loaded all the way back through 5.0; however, it will throw an exception if two scripts share the same id but different languages.
This commit removes legacy checks for unsupported an environment
variable and unsupported system properties. This environment variable
and these system properties have not been supported since 1.x so it is
safe to stop checking for the existence of these settings.
Relates #25809
The `QueryRewriteContext` used to provide a client object that can
be used to fetch geo-shapes, terms or documents for percolation. Unfortunately
all client calls used to be blocking calls which can have significant impact on the
rewrite phase since it occupies an entire search thread until the resource is
received. In the case that the index the resource is fetched from isn't on the local
node this can have significant impact on query throughput.
Note: this doesn't fix MLT since it fetches stuff in doQuery which is a different beast. Yet, it is a huge step in the right direction
This commit calls the `useSystemProperties` method on the HttpAsyncClientBuilder so that the jvm
system properties are used. The primary reason for doing this is to ensure the builder uses the
system default SSLContext rather than the default instance created by the http client library.
Closes#23231
Today we have duplicated code that is quite complicated to iterate
over rewriteable (`QueryBuilders` mainly) This change introduces a
`Rewriteable` interface that allow to share code to do the rewriting as
well as encapsulation and composition of queries.
Setting a timeout or enforcing low-level search cancellation used to make us
wrap the collector and check either the current time or whether the search
task was cancelled for every collected document. This can be significant
overhead on cheap queries that match many documents.
This commit changes the approach to wrap the bulk scorer rather than the
collector and exponentially increase the interval between two consecutive
checks in order to reduce the overhead of those checks.
We currently use fielddata on the `_id` field which is trappy, especially as we
do it implicitly. This changes the `random_score` function to use doc ids when
no seed is provided and to suggest a field when a seed is provided.
For now the change only emits a deprecation warning when no field is supplied
but this should be replaced by a strict check on 7.0.
Closes#25240
When a node tries to join a cluster, it goes through a validation step to make sure the node is compatible with the cluster. Currently we validation that the node can read the cluster state and that it is compatible with the indexes of the cluster. This PR adds validation that the joining node's version is compatible with the versions of existing nodes. Concretely we check that:
1) The node's min compatible version is higher or equal to any node in the cluster (this prevents a too-new node from joining)
2) The node's version is higher or equal to the min compat version of all cluster nodes (this prevents a too old join where, for example, the master is on 5.6, there's another 6.0 node in the cluster and a 5.4 node tries to join).
3) The node's major version is at least as higher as the lowest node in the cluster. This is important as we use the minimum version in the cluster to stop executing bwc code for operations that require multiple nodes. If the nodes are already operating in "new cluster mode", we should prevent nodes from the previous major to join (even if they are wire level compatible). This does mean that if you have a very unlucky partition during the upgrade which partitions all old nodes which are also a minority / data nodes only, the may not be able to re-join the cluster. We feel this edge case risk is well worth the simplification it brings to BWC layers only going one way.
Also, the node join validation can now selectively fail specific nodes (previously the entire batch was failed). This is an important preparation for a follow up PR where we plan to have a rejected joining node die with dignity.
Today we provide a lot of functionality on the `QueryRewriteContext` that
we potentially don't have ie. if we rewrite on a coordinating node or when
we percolating. This change moves most of the unnecessary shard level or
index level services and dependencies to `QueryShardContext` instead.
If a request contains an invalid error trace parameter, we send a error
on the channel. This should immediately abort any additional processing
of the request but instead we march on, dispatch the request and
subsequently send another message on the channel. The problem here is
this means two writes on the channel which leads to the request being
released twice ultimately raising in illegal reference count
exception. This commit addresses this by performing an early return in
the case that the request contained an invalid error trace parameter.
Relates #25785
This commit removes a timed latch await in a transport client listeners
test. The problem with a timed wait here is that on an overloaded
machine, the test can fail because the waiting thread was not unlatched
quickly enough. This makes the test unnecessarily flaky. Instead, we
should wait indefinitely and simply let the test fail by the test
timeout if the latch is not counted down for some reason.
Closes#25760
Currently we ignore unknown field names when parsing RangeAggregator.Range and
GeoDistanceAggregationBuilder.Range from `range`, `date_range` or `geo_distance`
aggregations. This can hide subtle errors in the query. This change makes parsing `ranges`
stricter.
This is an appealing assertion, but there scenarios where it can happen under normal operations. For example, when an index is created it may run into an exception when the lucene files have already been created. The master will try to assign the shard to another node (it's empty, so no need to look for data) but if there is no other node, it will reassign it to the same node. At that point the deletion will get a list of existing commits (which it will typically delete).
With #23997 and #25268 we have changed put alias, delete alias, update aliases and delete index to not accept aliases. Instead concrete indices should be provided as their index parameter.
This commit improves the error message in case aliases are provided, from an IndexNotFoundException (404 status code) with "no such index" message, to an IllegalArgumentException (400 status code) with "The provided expression [alias] matches an alias, specify the corresponding concrete indices instead." message.
Note that there is no specific error message for the case where wildcard expressions match one or more aliases. In fact, aliases are simply ignored when expanding wildcards for such APIs. An error is thrown only when the expression ends up matching no indices at all, and allow_no_indices is set to false. In that case the error is still the generic "404 - no such index".
403 can be confused with security. If an API doesn't support working against closed indices and closed indices are referred to in a request, that is a bad request, hence 400 is more appropriate.
The test checks if a file based or ops based recovery happened, but if the replica shard never finished recovering expectations are not met.
Fixes#25761
Currently the `to` and `from` parameter in the `date_range` aggregation is not
parsed with the correct date field format from the mappings or the aggregation
if the argument is numeric, but always treated as a long value specifying
`epoch_millis`. This leads to problems e.g. when the format is `epoch_second`,
but the `to` and `from` are currently treated as millis.
With this change, we interpret these parameters according to the `format` of the target field.
If the `format` in the mappings is not compatible with numeric input values,
a compatible `format` (e.g. `epoch_millis`, `epoch_second`) must be specified in
the `date_range` aggregation itself, otherwise an error is thrown.
#Closes #17920
This commit changes how the offset source is picked for each field using the es mapping rather than the underlying Lucene field infos.
It's mandatory for large mappings where field infos retrieval can be costly (the global field infos is merged for each highlighted field in every hit by the Lucene impl).
Fixes#25699
* Register data node stats from info carried back in search responses
This is part of #24915, where we now calculate the EWMA of service time for
tasks in the search threadpool, and send that as well as the current queue size
back to the coordinating node. The coordinating node now tracks this information
for each node in the cluster.
This information will be used in the future the determining the best replica a
search request should be routed to. This change has no user-visible difference.
* Move response time timing into ResponseListenerWrapper
* Move ResponseListenerWrapper to ActionListener instead of SearchActionListener
Also removes the logger
* Move `requestIndex` back to private
* De-guice-ify ResponseCollectorService \o/
* Undo all changes to SearchQueryThenFetchAsyncAction
* Remove unneeded response collector from TransportSearchAction
* Undo all changes to SearchDfsQueryThenFetchAsyncAction
* Completely rewrite the inside of ResponseCollectorService's record keeping
* Documentation and cleanups for ResponseCollectorService
* Add unit test for collection of queue size and service time
* Fix Guice construction error
* Add basic unit tests for ResponseCollectorService
* Fix version constant for the master merge
* Fix test compilation after master merge
* Add a test for node removal on cluster changed event
* Remove integration test as there are now unit tests
* Rename ResponseListenerWrapper -> SearchExecutionStatsCollector
* Fix line-length
* Make classes private and final where appropriate
* Pass nodeId into SearchExecutionStatsCollector and use only ActionListener
* Get nodeId from connection so searchShardTarget can be private
* Remove threadpool from SearchContext, get it from IndexShard instead
* Add missing import
* Use BiFunction for responseWrapper rather than passing in collector service
This change removes the leniency of having a `null` index to fetch
terms from in 6.0 onwards. This feature will be deprecated in the 5.x series
and 6.0 nodes will require the index to be set.
Closes#25750
We used to compare agaisnt the min compatible version which is misleading since
it might move over time and since we backported the `can_match` API entirely
it's better to compare against a version constant.
We can't do it in the general case because of prefix queries, but I believe this
is mostly used in query strings and not in explicit `terms` queries.
Closes#25667
When sending replica requests for replication operations, we skip
sending the request to pre-6.0 nodes for operations that such nodes
would not be aware of (e.g., the background global checkpoint sync, or
the primary/replica resync) since they would not know what to do with
these requests. Yet, we simulate that we received responses from these
nodes. Today, this is done by simulating that they sent us that their
local checkpoint is unassigned sequence number. However, for pre-6.0
nodes we have introduced a special local checkpoint used in the global
checkpoint tracker for such nodes and that is what we should use here
too. This commit fixes this issue.
Relates #25744
The following token filters were moved: arabic_normalization, german_normalization, hindi_normalization, indic_normalization, persian_normalization, scandinavian_normalization, serbian_normalization, sorani_normalization, cjk_width and cjk_width
Relates to #23658
Currently replication and recovery are both coordinated through the latest cluster state available on the ClusterService as well as through the GlobalCheckpointTracker (to have consistent local/global checkpoint information), making it difficult to understand the relation between recovery and replication, and requiring some tricky checks in the recovery code to coordinate between the two. This commit makes the primary the single owner of its replication group, which simplifies the replication model and allows to clean up corner cases we have in our recovery code. It also reduces the dependencies in the code, so that neither RecoverySourceXXX nor ReplicationOperation need access to the latest state on ClusterService anymore. Finally, it gives us the property that in-sync shard copies won't receive global checkpoint updates which are above their local checkpoint (relates #25485).
When parsing indices options from REST, we parse the optional parameters that are supported at REST (ignore_unavailable, allow_no_indices and expand_wildcards) and we provide the API default values for all the other (internal) options so that they are set to the new indices options while parsing. The `ignoreAliases` option was forgotten though, which means that whenever you pass in any index option at REST to the delete index API, you get to delete aliases like it was supported before (as ignoreAliases gets set to false like in all the other APIs).
Added unit tests for IndicesOptions parsing from REST parameters, and yaml tests for the delete index API.
This change refactors the query_string query to analyze the query text around logical operators of the query string the same way than a match_query/multi_match_query.
It also adds a type parameter that can be used to change the way multi fields query are built the same way than a multi_match query does.
Now that these queries share the same behavior regarding text analysis, some parameters are obsolete and have been deprecated:
split_on_whitespace: This setting is now ignored with a deprecation notice
if it is used explicitely. With this PR The query_string always splits on logical operator.
It simplifies the understanding of the other parameters that can have different meanings
depending on the value of split_on_whitespace.
auto_generate_phrase_queries: This setting is now ignored with a deprecation notice
if it is used explicitely. This setting only makes sense when the parser splits on whitespace.
use_dismax: This setting is now ignored with a deprecation notice
if it is used explicitely. The tie_breaker parameter is sufficient to handle best_fields/most_fields.
Fixes#25574
With cross cluster search we can potentially proxy `can_match` requests
to nodes that don't have the endpoint. This might not cause any problem
from a functional perspecitve but will cause ugly error messages on
the target node. This commit will cause an IAE if we try to talk to an
incompatible node via a proxy.
Relates to #25704
It was brought up that our current client artifacts have generic names like 'rest' that may cause conflicts with other artifacts.
This commit renames:
- rest -> elasticsearch-rest-client
- sniffer -> elasticsearch-rest-client-sniffer
- rest-high-level -> elasticsearch-rest-high-level-client
A couple of small changes are also preparing the high level client for its first release.
Closes#20248
The slop parameter defaults to 0 in the Lucene SpanNearQuery, so we can set it
to this default value also and don't have to require it being specified in the
query when using the Rest API. Leaving `slop` a ctro arg in the Java API as it
should normally be specified and we can keep it `final` that way.
Closes#25642
A shrunk index should ignore anything from templates and instead take
its mappings, aliases, and settings from the original index, plus any
new settings and aliases passed in with the shrink request. This commit
causes this to be the case.
Relates #25380
Today if we search across a large amount of shards we hit every shard. Yet, it's quite
common to search across an index pattern for time based indices but filtering will exclude
all results outside a certain time range ie. `now-3d`. While the search can potentially hit
hundreds of shards the majority of the shards might yield 0 results since there is not document
that is within this date range. Kibana for instance does this regularly but used `_field_stats`
to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results.
This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards
and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
Requests that execute a stored script will no longer be allowed to specify the lang of the script. This information is stored in the cluster state making only an id necessary to execute against. Putting a stored script will still require a lang.
* Changes DocValueFieldsFetchSubPhase to reuse doc values iterators for multiple hits
Closes#24986
* iter
* Update ScriptDocValues to not reuse GeoPoint and Date objects
* added Javadoc about script value re-use
* Enable doc values for range fields by default.
* Store ranges in a binary format that support multi field fields.
* Added BinaryDocValuesRangeQuery that can query ranges that have been encoded into a binary doc values field.
* Wrap range queries on a range field in IndexOrDocValuesQuery query.
Closes#24314
This method does exactly what getHits() does and is used in only a few places,
so it can safely be removed. It seems to be a left-over from when
InternalSearchHits was folded into the SearchHits interface, which didn't
contain this method.
In certain situations we can early terminate and just skip the entire
query phase or make the lucene level rewrite very cheap if we can already
tell that a query won't match any documents. For instance if there is a single
`match_none` ie. due to some range rewrite in a filter or must clause of a boolean
query it can just drop all it's other queries since it will never match.
Flake ids organize bytes in such a way that ids are ordered. However, we do not
need that property and could reorganize bytes in an order that would better suit
Lucene's terms dict instead.
Some synthetic tests suggest that this change decreases the disk footprint of
the `_id` field by about 50% in many cases (see `UUIDTests.testCompression`).
For instance, when simulating the indexing of 10M docs at a rate of 10k docs
per second, the current uid generator used 20.2 bytes per document on average,
while this new generator which only puts bytes in a different order uses 9.6
bytes per document on average.
We had already explored this idea in #18209 but the attempt to share long common
prefixes had had a bad impact on indexing speed. This time I have been more
careful about putting discriminant bytes early in the `_id` in a way that
preserves indexing speed on par with today, while still allowing for better
compression.
There is a bug when a call to `BytesReferenceStreamInput` skip is made
on a `BytesReference` that has an initial offset. The offset for the
current slice is added to the current index and then subtracted from the
length. This introduces the possibility of a negative number of bytes to
skip. This happens inside a loop, which leads to an infinte loop.
This commit correctly subtracts the current slice index from the
slice.length. Additionally, the `BytesArrayTests` are modified to test
instances that include an offset.
This is a protection mechanism to prevent a single search request from
hitting a large number of shards in the cluster concurrently. If a search is
executed against all indices in the cluster this can easily overload the cluster
causing rejections etc. which is not necessarily desirable. Instead this PR adds
a per request limit of `max_concurrent_shard_requests` that throttles the number of
concurrent initial phase requests to `256` by default. This limit can be increased per request
and protects single search requests from overloading the cluster. Subsequent PRs can introduces
addiontional improvemetns ie. limiting this on a `_msearch` level, making defaults a factor of
the number of nodes or sort shards iters such that we gain the best concurrency across nodes.
We lost the cluster alias due to some special caseing in inner hits
and due to the fact that we didn't pass on the alias to the shard request.
This change ensures that we have the cluster alias present on the shard to
ensure all SearchShardTarget reads preserve the alias.
Relates to #25606
Currently when we close a channel in Netty4Utils.closeChannels we
block until the closing is complete. This introduces the possibility
that a network selector thread will block while waiting until a
separate network selector thread closes a channel.
For instance: T1 closes channel 1 (which is assigned to a T1 selector).
Channel 1's close listener executes the closing of the node. That
means that T1 now tries to close channel 2. However, channel 2 is
assigned to a selector that is running on T2. T1 now must wait until T2
closes that channel at some point in the future.
This commit addresses this by adding a boolean to closeChannels
indicating if we should block on close. We only set this boolean to true
if we are closing down the server channels at shutdown. This call is
never made from a network thread. When we call the closeChannels method
with that boolean set to false, we do not block on close.
With #24236, tribe nodes submit cluster state changes to their MasterService, making it unnecessary to explicitly update the cluster state version. This PR fixes the double-incrementing of cluster state versions on tribe nodes, which are not harmful, but unnecessary.
This change collapses some of the packages for the bucket aggregations into their parent packages. This was done for the following aggregations:
* The variants of the range aggregation (geo_distance, date and ip) were moved into the `o.e.s.a.bucket.range` package
* The `o.e.s.a.bucket.terms.support` package was removed and the classes were moved to `o.e.s.a.bucket.terms`
* The filter aggregation was moved to `o.e.s.a.bucket.filter`
Since this PR is already relatively large with only the above changes subsequent PRs will do similar operations on relevant metric and pipeline aggregations
Relates to #22868
The test is currently serializing the cluster state using an older ES version format, but then deserializes those same bytes by
assuming they are of the current ES version.
When resolving wildcards, aliases should be treated as unavailable indices when the `ignoreAliases` option is set to `true` (currently enabled with delete index api and update aliases api). This way the `allow_no_indices` and `ignore_unavailable` options can be honoured, otherwise WildcardExpressionResolver ends up treating aliases differently and there is no way to control when an error is thrown.
The default behaviour for the delete index api, which has `ignore_unavailable` set to `false` and `allow_no_indices` set to `true` by default, is to throw an error when executed against an alias, same as when it's executed against an index that does not exist.
We currently check whether translog files can be trimmed whenever we create a new translog generation or close a view. However #25294 added a long translog retention period (12h, max 512MB by default), which means translog files should potentially be cleaned up long after there isn't any indexing activity to trigger flushes/the creation of new translog files. We therefore need a scheduled background check to clean up those files once they are no longer needed.
Relates to #10708
This commit does two things:
- bumps the version from 6.0.0-alpha3 to 6.0.0-beta1
- renames the 6.0.0-alpha3 version constant to 6.0.0-beta1
Relates #25621
This commit adjusts the expectation for the max number of threads in the
scaling thread pool configuration test. The reason that this expectation
is incorrect is because we removed the limitation that the number of
processors maxes out at 32, instead letting it be the true number of
logical processors on the machine. However, when we removed this
limitation, this test was never adjusted to reflect the new reality yet
it never arose since our tests were not running on machines with
incredibly high core counts.
Relates #20874
Updating the global checkpoint on a replica can occur for a few
different reasons:
- from inlined global checkpoint updates
- from a primary term transition
- from finalizing recovery
Yet, the trace logging for a global checkpoint update does not present
this information that can be useful when tracing test failures. This
commit adds a reason for the global checkpoint update on a replica so
that we can trace these updates.
Relates #25612
The current BWC code in `BulkItemRequest` mutates the underlying `DocWriteRequests` which causes test failures and unexpected state (our test infra checks bwc serialization on the fly). This PR removes this logic from master. Another PR will add a BWC layer to 5.x only.
This PR contains the logic in https://github.com/elastic/elasticsearch/pull/25510 , which is needed to run the tests.