This commit adds the support to early terminate the collection of a leaf
in the aggregation framework. This change introduces a MultiBucketCollector which
handles CollectionTerminatedException exactly like the Lucene MultiCollector.
Any aggregator can now throw a CollectionTerminatedException without stopping
the collection of a sibling aggregator. This is useful for aggregators that
can infer their result without visiting all documents (e.g.: a min/max aggregation on a match_all query).
* Search: Support of wildcard on docvalue_fields
For consistency with stored_fields, docvalue_fields should support the use of wildcards.
Documentation of doc values fields is updated accordingly.
See also: #26390Closes#26299
We used to set `maxScore` to `0` within `TopDocs` in situations where there is really no score as the size was set to `0` and scores were not even tracked. In such scenarios, `Float.Nan` is more appropriate, which gets converted to `max_score: null` on the REST layer. That's also more consistent with lucene which set `maxScore` to `Float.Nan` when merging empty `TopDocs` (see `TopDocs#merge`).
Today `_msearch` doesn't allow modifying the `max_concurrent_shard_requests`
per sub search request. This change adds support for setting this parameter on
all sub-search requests in an `_msearch`.
Relates to #31877
The notion of "quality" is an overloaded term in the search ranking evaluation
context. Its usually used to decribe certain levels of "good" vs. "bad" of a
seach result with respect to the users information need. We currently report the
result of the ranking evaluation as `quality_level` which is a bit missleading.
This changes the response parameter name to `metric_score` which fits better.
Currently the ranking evaluation response contains a 'unknown_docs' section
for each search use case in the evaluation set. It contains document ids for
results in the search hits that currently don't have a quality rating.
This change renames it to `unrated_docs`, which better reflects its purpose.
Today it is unclear what guarantees are offered by the search preference
feature, and we claim a guarantee that is stronger than what we really offer:
> A custom value will be used to guarantee that the same shards will be used
> for the same custom value.
This commit clarifies this documentation.
Forward-port of #32098 to `master`.
Adds support for `ignore_unmapped` parameter in geo distance sorting,
which is functionally equivalent to specifying an `unmapped_type` in
the field sort.
Closes#28152
This change deprecates completion queries and documents without context that target a
context enabled completion field. Querying without context degrades the search
performance considerably (even when the number of indexed contexts is low).
This commit targets master but the deprecation will take place in 6.x and the functionality
will be removed in 7 in a follow up.
Closes#29222
This commit adds the ability to configure how a docvalue field should be
formatted, so that it would be possible eg. to return a date field
formatted as the number of milliseconds since Epoch.
Closes#27740
Currently the first snippet in the documentation test in script-fields.asciidoc
isn't executed, although it has the CONSOLE annotation. Adding a test setup
annotation to it seems to fix the problem.
This commit changes the default out-of-the-box configuration for the
number of shards from five to one. We think this will help address a
common problem of oversharding. For users with time-based indices that
need a different default, this can be managed with index templates. For
users with non-time-based indices that find they need to re-shard with
the split API in place they no longer need to resort only to
reindexing.
Since this has the impact of changing the default number of shards used
in REST tests, we want to ensure that we still have coverage for issues
that could arise from multiple shards. As such, we randomize (rarely)
the default number of shards in REST tests to two. This is managed via a
global index template. However, some tests check the templates that are
in the cluster state during the test. Since this template is randomly
there, we need a way for tests to skip adding the template used to set
the number of shards to two. For this we add the default_shards feature
skip. To avoid having to write our docs in a complicated way because
sometimes they might be behind one shard, and sometimes they might be
behind two shards we apply the default_shards feature skip to all docs
tests. That is, these tests will always run with the default number of
shards (one).
* Clarify documentation of scroll_id
The Scroll API may return the same scroll ID for multiple requests due to server side state. This is not clear from the current documentation.
* Further clarify scroll ID return behaviour
Some features have been deprecated since `6.0` like the `_parent` field or the
ability to have multiple types per index. This allows to remove quite some
code, which in-turn will hopefully make it easier to proceed with the removal
of types.
The rank_eval documentation was missing an explanation of the parameter
`k` that controls the number of top hits that are used in the ranking evaluation.
Closes#29205
Increase the default limit of `index.highlight.max_analyzed_offset` to 1M instead of previous 10K.
Enhance an error message when offset increased to include field name, index name and doc_id.
Relates to https://github.com/elastic/kibana/issues/16764
* Search option terminate_after does not handle post_filters and aggregations correctly
This change fixes the handling of the `terminate_after` option when post_filters (or min_score) are used.
`post_filter` should be applied before `terminate_after` in order to terminate the query when enough document are accepted
by the post_filters.
This commit also changes the type of exception thrown by `terminate_after` in order to ensure that multi collectors (aggregations)
do not try to continue the collection when enough documents have been collected.
Closes#28411
Adds allow_partial_search_results flag to search requests with default setting = true.
When false, will error if search either timeouts, has partial errors or has missing shards rather
than returning partial search results. A cluster-level setting provides a default for search requests with no flag.
Closes#27435
This change adds support for the new ranking evaluation API to the High Level Rest Client.
This mostly means adding support for parsing the various response objects back from the
REST representation. It includes one change to the response syntax where previously we didn't
print the type of the metric details section but we now need it to pick the right parser to
parse this section back.
Closes#28198
* Limit the analyzed text for highlighting
- Introduce index level settings to control the max number of character
to be analyzed for highlighting
- Throw an error if analysis is required on a larger text
Closes#27517
Allowing `_doc` as a type will enable users to make the transition to 7.0
smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`.
This also moves most of the documentation to `_doc` as a type name.
Closes#27750Closes#27751
Also include _type and _id for parent/child hits inside inner hits.
In the case of top_hits aggregation the nested search hits are
directly returned and are not grouped by a root or parent document, so
it is important to include the _id and _index attributes in order to know
to what documents these nested search hits belong to.
Closes#27053
Today we require users to prepare their indices for split operations.
Yet, we can do this automatically when an index is created which would
make the split feature a much more appealing option since it doesn't have
any 3rd party prerequisites anymore.
This change automatically sets the number of routinng shards such that
an index is guaranteed to be able to split once into twice as many shards.
The number of routing shards is scaled towards the default shard limit per index
such that indices with a smaller amount of shards can be split more often than
larger ones. For instance an index with 1 or 2 shards can be split 10x
(until it approaches 1024 shards) while an index created with 128 shards can only
be split 3x by a factor of 2. Please note this is just a default value and users
can still prepare their indices with `index.number_of_routing_shards` for custom
splitting.
NOTE: this change has an impact on the document distribution since we are changing
the hash space. Documents are still uniformly distributed across all shards but since
we are artificually changing the number of buckets in the consistent hashign space
document might be hashed into different shards compared to previous versions.
This is a 7.0 only change.
Some code-paths use anonymous classes (such as NonCollectingAggregator
in terms agg), which messes up the display name of the profiler. If
we encounter an anonymous class, we need to grab the super's name.
Another naming issue was that ProfileAggs were not delegating to the
wrapped agg's name for toString(), leading to ugly display.
This PR also fixes up the profile documentation. Some of the examples were
executing against empty indices, which shows different profile results
than a populated index (and made for confusing examples).
Finally, I switched the agg display names from the fully qualified name
to the simple name, so that it's similar to how the query profiles work.
Closes#26405
Due to a change happened via #26102 to make the nested source consistent
with or without source filtering, the _source of a nested inner hit was
always wrapped in the parent path. This turned out to be not ideal for
users relying on the nested source, as it would require additional parsing
on the client side. This change fixes this, the _source of nested inner hits
is now no longer wrapped by parent json objects, irregardless of whether
the _source is included as is or source filtering is used.
Internally source filtering and highlighting relies on the fact that the
_source of nested inner hits are accessible by its full field path, so
in order to now break this, the conversion of the _source into its binary
form is performed in FetchSourceSubPhase, after any potential source filtering
is performed to make sure the structure of _source of the nested inner hit
is consistent irregardless if source filtering is performed.
PR for #26944Closes#26944
The shard preference _primary, _replica and its variants were useful
for the asynchronous replication. However, with the current impl, they
are no longer useful and should be removed.
Closes#26335
* Fix percolator highlight sub fetch phase to not highlight query twice
The PercolatorHighlightSubFetchPhase does not override hitExecute and since it extends HighlightPhase the search hits
are highlighted twice (by the highlight phase and then by the percolator). This does not alter the results, the second highlighting
just overrides the first one but this slow down the request because it duplicates the work.
This change exposes the duplicate removal option added in Lucene for the completion suggester
with a new option called `skip_duplicates` (defaults to false).
This commit also adapts the custom suggest collector to handle deduplication when multiple contexts match the input.
Closes#23364
Multi-level Nested Sort with Filters
Allow multiple levels of nested sorting where each level can have it's own filter.
Backward compatible with previous single-level nested sort.
* Remove the _all metadata field
This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
Today if we search across a large amount of shards we hit every shard. Yet, it's quite
common to search across an index pattern for time based indices but filtering will exclude
all results outside a certain time range ie. `now-3d`. While the search can potentially hit
hundreds of shards the majority of the shards might yield 0 results since there is not document
that is within this date range. Kibana for instance does this regularly but used `_field_stats`
to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results.
This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards
and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
This is a protection mechanism to prevent a single search request from
hitting a large number of shards in the cluster concurrently. If a search is
executed against all indices in the cluster this can easily overload the cluster
causing rejections etc. which is not necessarily desirable. Instead this PR adds
a per request limit of `max_concurrent_shard_requests` that throttles the number of
concurrent initial phase requests to `256` by default. This limit can be increased per request
and protects single search requests from overloading the cluster. Subsequent PRs can introduces
addiontional improvemetns ie. limiting this on a `_msearch` level, making defaults a factor of
the number of nodes or sort shards iters such that we gain the best concurrency across nodes.
* Add documentation for the new parent-join field
This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.
Relates #20257
This snapshot has faster range queries on range fields (LUCENE-7828), more
accurate norms (LUCENE-7730) and the ability to use fake term frequencies
(LUCENE-7854).
This commit adds back "id" as the key within a script to specify a
stored script (which with file scripts now gone is no longer ambiguous).
It also adds "source" as a replacement for "code". This is in an attempt
to normalize how scripts are specified across both put stored scripts and script usages, including search template requests. This also deprecates the old inline/stored keys.
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815
It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
This commit refactors the query phase in order to be able
to automatically detect queries that can be early terminated.
If the index sort matches the query sort, the top docs collection is early terminated
on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector.
This change also adds a new parameter to the search request called `track_total_hits`.
It indicates if the total number of hits that match the query should be tracked.
If false, queries sorted by the index sort will not try to compute this information and
and will limit the collection to the first N documents per segment.
Aggregations are not impacted and will continue to see every document
even when the index sort matches the query sort and `track_total_hits` is false.
Relates #6720
Adds a "magic" key to the yaml testing stash mostly for use with
documentation tests. When unstashing an object, `$_path` is the
path into the current position in the object you are unstashing.
This means that in docs tests you can use
`// TESTRESPONSEs/somevalue/$body.${_path}/` to mean "replace
`somevalue` with whatever is the response in the same position."
Compare how you must carefully mock out all the numbers in the profile
response without this change:
```
// TESTRESPONSE[s/"id": "\[2aE02wS1R8q_QFnYu6vDVQ\]\[twitter\]\[1\]"/"id": $body.profile.shards.0.id/]
// TESTRESPONSE[s/"rewrite_time": 51443/"rewrite_time": $body.profile.shards.0.searches.0.rewrite_time/]
// TESTRESPONSE[s/"score": 51306/"score": $body.profile.shards.0.searches.0.query.0.breakdown.score/]
// TESTRESPONSE[s/"time_in_nanos": "1873811"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.time_in_nanos/]
// TESTRESPONSE[s/"build_scorer": 2935582/"build_scorer": $body.profile.shards.0.searches.0.query.0.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 919297/"create_weight": $body.profile.shards.0.searches.0.query.0.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 53876/"next_doc": $body.profile.shards.0.searches.0.query.0.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "391943"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.children.0.time_in_nanos/]
// TESTRESPONSE[s/"score": 28776/"score": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.score/]
// TESTRESPONSE[s/"build_scorer": 784451/"build_scorer": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 1669564/"create_weight": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 10111/"next_doc": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "210682"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.children.1.time_in_nanos/]
// TESTRESPONSE[s/"score": 4552/"score": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.score/]
// TESTRESPONSE[s/"build_scorer": 42602/"build_scorer": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 89323/"create_weight": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 2852/"next_doc": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "304311"/"time_in_nanos": $body.profile.shards.0.searches.0.collector.0.time_in_nanos/]
// TESTRESPONSE[s/"time_in_nanos": "32273"/"time_in_nanos": $body.profile.shards.0.searches.0.collector.0.children.0.time_in_nanos/]
```
To how you can cavalierly mock all the numbers at once with this change:
```
// TESTRESPONSE[s/(?<=[" ])\d+(\.\d+)?/$body.$_path/]
```
Now that indices have a single type by default, we can move to the next step
and identify documents using their `_id` rather than the `_uid`.
One notable change in this commit is that I made deletions implicitly create
types. This helps with the live version map in the case that documents are
deleted before the first type is introduced. Otherwise there would be no way
to differenciate `DELETE index/foo/1` followed by `PUT index/foo/1` from
`DELETE index/bar/1` followed by `PUT index/foo/1`, even though those are
different if versioning is involved.
`_search_shards`API today only returns aliases names if there is an alias
filter associated with one of them. Now it can be useful to see which aliases
have been expanded for an index given the index expressions. This change also includes non-filtering aliases even without a filtering alias being present.
Rewrites most of the snippets in the `innert_hits` docs to be
complete examples and enables `VIEW IN CONSOLE`, `COPY AS CURL`,
and automatic testing of the snippets.
Now that we have incremental reduce functions for topN and aggregations
we can set the default for `action.search.shard_count.limit` to unlimited.
This still allows users to restrict these settings while by default we executed
across all shards matching the search requests index pattern.
_field_stats has evolved quite a lot to become a multi purpose API capable of retrieving the field capabilities and the min/max value for a field.
In the mean time a more focused API called `_field_caps` has been added, this enpoint is a good replacement for _field_stats since he can
retrieve the field capabilities by just looking at the field mapping (no lookup in the index structures).
Also the recent improvement made to range queries makes the _field_stats API obsolete since this queries are now rewritten per shard based on the min/max found for the field.
This means that a range query that does not match any document in a shard can return quickly and can be cached efficiently.
For these reasons this change deprecates _field_stats. The deprecation should happen in 5.4 but we won't remove this API in 6.x yet which is why
this PR is made directly to 6.0.
The rest tests have also been adapted to not throw an error while this change is backported to 5.4.
They needed to be updated now that Painless is the default and
the non-sandboxed scripting languages are going away or gone.
I dropped the entire section about customizing the classloader
whitelists. In master this barely does anything (exposes more
things to expressions).
This change introduces a new API called `_field_caps` that allows to retrieve the capabilities of specific fields.
Example:
````
GET t,s,v,w/_field_caps?fields=field1,field2
````
... returns:
````
{
"fields": {
"field1": {
"string": {
"searchable": true,
"aggregatable": true
}
},
"field2": {
"keyword": {
"searchable": false,
"aggregatable": true,
"non_searchable_indices": ["t"]
"indices": ["t", "s"]
},
"long": {
"searchable": true,
"aggregatable": false,
"non_aggregatable_indices": ["v"]
"indices": ["v", "w"]
}
}
}
}
````
In this example `field1` have the same type `text` across the requested indices `t`, `s`, `v`, `w`.
Conversely `field2` is defined with two conflicting types `keyword` and `long`.
Note that `_field_caps` does not treat this case as an error but rather return the list of unique types seen for this field.
This commit clarifies the preference docs regarding the explanation of
how operations are routed by default. In particular, the previous use of
"shard replicas" was confusing as it could imply an operation would only
be routed to replicas by default.
Relates #23794
This is especially useful when we rewrite the query because the result of the rewrite can be very different on different shards. See #18254 for example.
* Add support for fragment_length in the unified highlighter
This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter
that is able to limit the size of fragments produced by generic break iterator like `sentence`.
The `unified` highlighter now supports `boundary_scanner` which can `words` or `sentence`.
The `sentence` mode will use the bounded break iterator in order to limit the size of the sentence to `fragment_length`.
When sentences bigger than `fragment_length` are produced, this mode will break the sentence at the next word boundary **after**
`fragment_length` is reached.
This commit adds a boundary_scanner property to the search highlight
request so the user can specify different boundary scanners:
* `chars` (default, current behavior)
* `word` Use a WordBreakIterator
* `sentence` Use a SentenceBreakIterator
This commit also adds "boundary_scanner_locale" to define which locale
should be used when scanning the text.
When nested objects are present in the mappings, many queries get deoptimized
due to the need to exclude documents that are not in the right space. For
instance, a filter is applied to all queries that prevents them from matching
non-root documents (`+*:* -_type:__*`). Moreover, a filter is applied to all
child queries of `nested` queries in order to make sure that the child query
only matches child documents (`_type:__nested_path`), which is required by
`ToParentBlockJoinQuery` (the Lucene query behing Elasticsearch's `nested`
queries).
These additional filters slow down `nested` queries. In 1.7-, the cost was
somehow amortized by the fact that we cached filters very aggressively. However,
this has proven to be a significant source of slow downs since 2.0 for users
of `nested` mappings and queries, see #20797.
This change makes the filtering a bit smarter. For instance if the query is a
`match_all` query, then we need to exclude nested docs. However, if the query
is `foo: bar` then it may only match root documents since `foo` is a top-level
field, so no additional filtering is required.
Another improvement is to use a `FILTER` clause on all types rather than a
`MUST_NOT` clause on all nested paths when possible since `FILTER` clauses
are more efficient.
Here are some examples of queries and how they get rewritten:
```
"match_all": {}
```
This query gets rewritten to `ConstantScore(+*:* -_type:__*)` on master and
`ConstantScore(_type:AutomatonQuery {\norg.apache.lucene.util.automaton.Automaton@4371da44})`
with this change. The automaton is the complement of `_type:__*` so it matches
the same documents, but is faster since it is now a positive clause. Simplistic
performance testing on a 10M index where each root document has 5 nested
documents on average gave a latency of 420ms on master and 90ms with this change
applied.
```
"term": {
"foo": {
"value": "0"
}
}
```
This query is rewritten to `+foo:0 #(ConstantScore(+*:* -_type:__*))^0.0` on
master and `foo:0` with this change: we do not need to filter nested docs out
since the query cannot match nested docs. While doing performance testing in
the same conditions as above, response times went from 250ms to 50ms.
```
"nested": {
"path": "nested",
"query": {
"term": {
"nested.foo": {
"value": "0"
}
}
}
}
```
This query is rewritten to
`+ToParentBlockJoinQuery (+nested.foo:0 #_type:__nested) #(ConstantScore(+*:* -_type:__*))^0.0`
on master and `ToParentBlockJoinQuery (nested.foo:0)` with this change. The
top-level filter (`-_type:__*`) could be removed since `nested` queries only
match documents of the parent space, as well as the child filter
(`#_type:__nested`) since the child query may only match nested docs since the
`nested` object has both `include_in_parent` and `include_in_root` set to
`false`. While doing performance testing in the same conditions as above,
response times went from 850ms to 270ms.
This pull request reuses the typed_keys parameter added in #22965, but this time it applies it to suggesters. When set to true, the suggester names in the search response will be prefixed with a prefix that reflects their type.
This changes removes the SearchResponseListener that was used by the ExpandCollapseSearchResponseListener to expand collapsed hits.
The removal of SearchResponseListener is not a breaking change because it was never released.
This change also replace the blocking call in ExpandCollapseSearchResponseListener by a single asynchronous multi search request. The parallelism of the expand request can be set via CollapseBuilder#max_concurrent_group_searches
Closes#23048
This commit removes support for the `application/x-ldjson` Content-Type header as this was only used in the first draft
of the spec and had very little uptake. Additionally, the docs for bulk and msearch have been updated to specifically
call out ndjson and mention that the newline character may be preceded by a carriage return.
Finally, the bulk request handling of the carriage return has been improved to remove this character from the source.
Closes#23025