The default `false` for `require_field_match` is a bit odd and confusing for users, given that field names get ignored by default and every field gets highlighted if it contains terms extracted out of the query, regardless of which fields were queries. Changed the default to `true`, it can always be changed per request.
Closes#10627Closes#11067
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.
One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).
Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.
Closes#10625Closes#11077
Previously, collate feature would be executed on all shards of an index using the client,
this leads to a deadlock when concurrent collate requests are run from the _search API,
due to the fact that both the external request and internal collate requests use the
same search threadpool.
As phrase suggestions are generated from the terms of the local shard, in most cases the
generated suggestion, which does not yield a hit for the collate query on the local shard
would not yield a hit for collate query on non-local shards.
Instead of using the client for collating suggestions, collate query is executed against
the ContextIndexSearcher. This PR removes the ability to specify a preference for a collate
query, as the collate query is only run on the local shard.
closes#9377
Add methods to operate on multi-valued fields in the expressions language.
Note that users will still not be able to access individual values
within a multi-valued field.
The following methods will be included:
* min
* max
* avg
* median
* count
* sum
Additionally, changes have been made to MultiValueMode to support the
new median method.
closes#11105
Removes the More Like This API, users should now use the More Like This query.
The MLT API tests were converted to their query equivalent. Also some clean
ups in MLT tests.
Closes#10736Closes#11003
The assumption is that gaps in histogram are generally undesirable, for instance
if you want to build a visualization from it. Additionally, we are building new
aggregations that require that there are no gaps to work correctly (eg.
derivatives).
Remove the ability to specify search type ‘query_and_fetch’ and
‘df_query_and_fetch’ from the REST API.
- Adds REST tests
- Updates REST API spec to remove ‘query_and_fetch’ and
‘df_query_and_fetch’ as options
- Removes documentation for these options
Closes#9606
* Removed the docs for `index.compound_format` and `index.compound_on_flush` - these are expert settings which should probably be removed (see https://github.com/elastic/elasticsearch/issues/10778)
* Removed the docs for `index.index_concurrency` - another expert setting
* Labelled the segments verbose output as experimental
* Marked the `compression`, `precision_threshold` and `rehash` options as experimental in the cardinality and percentile aggs
* Improved the experimental text on `significant_terms`, `execution_hint` in the terms agg, and `terminate_after` param on count and search
* Removed the experimental flag on the `geobounds` agg
* Marked the settings in the `merge` and `store` modules as experimental, rather than the modules themselves
Closes#10782
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field.
An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.
This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick.
The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based.
Closes#10523
This commit adds a `rewrite` parameter to the validate API in order to shown
how the given query is re-written into primitive queries. For example, an MLT
query is re-written into a disjunction of the selected terms. Other use cases
include `fuzzy`, `common_terms`, or `match` query especially with a
`cutoff_frequency` parameter. Note that the explanation is only given for a
single randomly chosen shard only, so the output may vary from one shard to
another.
Relates #1412Closes#10147
Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and even the bottleneck
if your query is selective.
This commit switches to Lucene regular expressions instead of Java (not exactly
the same syntax yet most existing regular expressions should keep working) and
uses the same logic as RegExpQuery to intersect the regular expression with the
terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made
things faster and the same request that took 750ms on master now takes 74ms with
this change.
Close#7526
This commit brings the benefits of the `count` search type to search requests
that have a `size` of 0:
- a single round-trip to shards (no fetch phase)
- ability to use the query cache
Since `count` now provides no benefits over `query_then_fetch`, it has been
deprecated.
Close#7630
Allow to on/off scripting based on their source (where they get loaded from), the operation that executes them and their language.
The settings cover the following combinations:
- mode: on, off, sandbox
- source: indexed, dynamic, file
- engine: groovy, expressions, mustache, etc
- operation: update, search, aggs, mapping
The following settings are supported for every engine:
script.engine.groovy.indexed.update: sandbox/on/off
script.engine.groovy.indexed.search: sandbox/on/off
script.engine.groovy.indexed.aggs: sandbox/on/off
script.engine.groovy.indexed.mapping: sandbox/on/off
script.engine.groovy.dynamic.update: sandbox/on/off
script.engine.groovy.dynamic.search: sandbox/on/off
script.engine.groovy.dynamic.aggs: sandbox/on/off
script.engine.groovy.dynamic.mapping: sandbox/on/off
script.engine.groovy.file.update: sandbox/on/off
script.engine.groovy.file.search: sandbox/on/off
script.engine.groovy.file.aggs: sandbox/on/off
script.engine.groovy.file.mapping: sandbox/on/off
For ease of use, the following more generic settings are supported too:
script.indexed: sandbox/on/off
script.dynamic: sandbox/on/off
script.file: sandbox/on/off
script.update: sandbox/on/off
script.search: sandbox/on/off
script.aggs: sandbox/on/off
script.mapping: sandbox/on/off
These will be used to calculate the more specific settings, using the stricter setting of each combination. Operation based settings have precedence over conflicting source based ones.
Note that the `mustache` engine is affected by generic settings applied to any language, while native scripts aren't as they are static by definition.
Also, the previous `script.disable_dynamic` setting can now be deprecated.
Closes#6418Closes#10116Closes#10274
The behaviour is better in the case someone has multiple levels of nested object fields defined in the mapping and like to define a single inner_hits definition that is two or more levels deep.
If someone wants inner hits on a nested field that is 2 levels deep the following would need to be defined:
```
{
...
"inner_hits" : {
"path" : {
"level1" : {
"inner_hits" : {
"path" : {
"level2" : {
"query" : { .... }
}
}
}
}
}
}
}
```
With this change the above can be defined as:
```
{
...
"inner_hits" : {
"path" : {
"level1.level2" : {
"query" : { .... }
}
}
}
}
```
Closes#9251
The analysis chain should be used instead of relying on this, as it is
confusing when dealing with different per-field analysers.
The `locale` option was only used for `lowercase_expanded_terms`, which,
once removed, is no longer needed, so it was removed as well.
Fixes#9978
Relates to #9973
This commit adds scripting capability to significant_terms.
Custom heuristics can be implemented with a script that provides
parameters subset_freq, superset_freq,subset_size, superset_size.
closes#7850
Changed search_type docs to reflect that the `(dfs_)query_and_fetch` modes are an internal optimization and should not be specified explicitly by the user.
Relates to #9606
Removed the existing `pre_zone` and `post_zone` option in `date_histogram` in favor of
the simpler `time_zone` option. Previously, specifying different values for these could
lead to confusing scenarios where ES would return bucket keys that are not UTC.
Now `time_zone` is the only option setting, the calculation of date buckets to take place in the
preferred time zone, but after rounding converting the bucket key values back to UTC.
Closes#9062Closes#9637
Add offset option to 'date_histogram' replacing and simplifying the previous 'pre_offset' and 'post_offset' options.
This change is part of a larger clean up task for `date_histogram` from issue #9062.
We now have a very useful annotation to mark features or parameters as
experimental. Let's use it! This commit replaces some custom text warnings with
this annotation and adds this annotation to some existing features/parameters:
- inner_hits (unreleased yet)
- terminate_after (released in 1.4)
- per-bucket doc count errors in the terms agg (released in 1.4)
I also tagged with this annotation settings which should either be not needed
(like the ability to evict entries from the filter cache based on time) or that
are too deep into the way that Elasticsearch works like the Directory
implementation or merge settings.
Close#9563
These aggregations are not experimental anymore but some of their parameters
still are:
- `precision_threshold` and `rehash` on `cardinality`
- `compression` on percentiles(-ranks)
Close#9560
Histogram aggregation supports an 'offset' option to move bucket boundaries.
In a histogram with buckets of size X these can be moved from 0, X, 2X, 3X,...
by an offset value of Y to Y, X+Y, 2X+Y, 3X+Y... by using the 'offset' option.
The previous 'pre_offset' and 'post_offset' options are removed in favour of
the simplified 'offset' option.
Closes#9417Closes#9505
The `analyzer` setting is now the base setting, and `search_analyzer`
is simply an override of the search time analyzer. When setting
`search_analyzer`, `analyzer` must be set.
closes#9371
Extended_stats now displays the upper and lower bounds on standard deviations (e.g. avg +/- std).
Default is to show 2 std above/below, but can be changed using the `sigma` parameter.
Accepts non-negative doubles
Closes#9356
Inner hits allows to embed nested inner objects, children documents or the parent document that contributed to the matching of the returned search hit as inner hits, which would otherwise be hidden.
Closes#8153Closes#3022Closes#3152
This commit adds the ability to associate a bit of state with each
individual aggregation.
The aggregation response can be hard to stitch back together without
having a reference to the aggregation request. In many cases this is not
available, many json serializer frameworks cache types globally or have a
static deserialisation override mechanism. In these cases making the
original request available, if at all possible, would be a hack.
The old facets returned `_type` which was just enough metadata to know
what the originating facet type in the request was.
This PR takes `_type` one step further by introducing ANY arbitrary meta
data. This could be further <strike>ab</strike>used for instance by
generic/automated aggregations that include UI state (color information,
thresholds, user input states, etc) per aggregation.
This commit adds a new field to the response of the terms aggregation called
`sum_other_doc_count` which is equal to the sum of the doc counts of the buckets
that did not make it to the list of top buckets. It is typically useful to have
a sector called eg. `other` when using terms aggregations to build pie charts.
Example query and response:
```json
GET test/_search?search_type=count
{
"aggs": {
"colors": {
"terms": {
"field": "color",
"size": 3
}
}
}
}
```
```json
{
[...],
"aggregations": {
"colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 4,
"buckets": [
{
"key": "blue",
"doc_count": 65
},
{
"key": "red",
"doc_count": 14
},
{
"key": "brown",
"doc_count": 3
}
]
}
}
}
```
Close#8213
This is functionally equivalent to before, so there should be no
user-visible impact, except I added a NOTE in the docs warning about
the interaction of pagination and rescoring.
Closes#6232Closes#7707