Today we provide the ability to plug in MergePolicy and
we provide the once lucene ships with. We do not recommend to change
the default and even only a small number of expert users would ever touch
this. This commit removes the ancient log byte size and log doc count
merge policy providers, simplifies the MergePolicy wiring and makes the
tiered MP the one and only default. All notions of a merge policy has been
removed from the docs and should be deprecated in the previous version.
Closes#11588
While we had initially planned to keep rivers around in 2.0 to ease migration,
keeping support for rivers is challenging as it conflicts with other important
changes that we want to bring to 2.0 like synchronous dynamic mappings updates.
Nothing impossible to fix, but it would increase the complexity of how we
deal with dynamic mappings updates and manage rivers, while handling dynamic
mappings updates correctly is important for resiliency and rivers are on the go.
So removing rivers in 2.0 may well be a better trade-off.
The ResourceWatcher used settings prefixed `watcher.`, which
potentially could clash with the watcher plugin.
In order to prevent confusion, the settings have been renamed to
`resource.reload` prefixes.
This also uses the deprecation logging infrastructure introduced
in #11033 to log deprecated settings and their alternative at
startup.
Closes#11175
There are different ways to register custom query parsers through plugins, a couple of them work per index via index settings, which is probably even too flexible. There also three different ways to add a global custom query parser through either IndicesQueriesModule or IndicesQueriesRegistry. This commit consolidates the registration of custom query parsers via IndicesQueriesModule#addQuery(Class<? extends QueryParser>). The complexity of supporting parsers per index is not needed hence it got removed. Also the other ways of registering global custom parsers are dropped in favour of the one mentioned above.
Closes#11481
In #10918, we introduced the prompt placeholders. These were had a different format
than our existing placeholders. This changes the prompt placeholders to follow the
format of the existing placeholders.
Relates to #11455
Some of our meta fields (such as _id, _version, ...) are returned as top-level
properties of the json document, while other properties (_timestamp, _routing,
...) are returned under `fields`. This commit makes all meta fields returned
as top-level properties.
So eg. `GET test/test/1?fields=_timestamp,foo` would now return
```json
{
"_index": "test",
"_type": "test",
"_id": "1",
"_version": 1,
"_timestamp": 10000000,
"found": true,
"fields": {
"foo": [ "bar" ]
}
}
```
while it used to return
```json
{
"_index": "test",
"_type": "test",
"_id": "1",
"_version": 1,
"found": true,
"fields": {
"_timestamp": 10000000,
"foo": [ "bar" ]
}
}
```
To better distribute the memory allocating to indexing, the IndexingMemoryController periodically checks the different shard for their last indexing activity. If no activity has happened for a while, the controller marks the shards as in active and allocated it's memory buffer budget (but a small minimal budget) to other active shards. The recently added synced flush feature (#11179, #11336) uses this inactivity trigger to attempt as a trigger to attempt adding a sync id marker (which will speed up future recoveries).
We wait for 30m before declaring a shard inactive. However, these days the operation just requires a refresh and is light. We can be stricter (and 5m) increase the chance a synced flush will be triggered.
Closes#11479
This commit changes the date handling. First and foremost Elasticsearch
does not try to convert every date to a unix timestamp first and then
uses the configured date. This now allows for dates like `2015121212` to
be parsed correctly.
Instead it is now explicit by adding a `epoch_second` and `epoch_millis`
date format. This also means, that the default date format now is
`epoch_millis||dateOptionalTime` to remain backwards compatible.
Closes#5328
Relates #10971
Some settings may be considered sensitive, such as passwords, and storing them
in the configuration file on disk is not good from a security perspective. This change
allows settings to have a special value, `${prompt::text}` or `${prompt::secret}`, to
indicate that elasticsearch should prompt the user for the actual value on startup.
This only works when started in the foreground. In cases where elasticsearch is started
as a service or in the background, an exception will be thrown.
Closes#10838
* Cut the `has_child` and `has_parent` queries over to use Lucene's query time global ordinal join. The main benefit of this change is that parent/child queries can now efficiently execute if parent/child queries are wrapped in a bigger boolean query. If the rest of the query only hit a few documents both has_child and has_parent queries don't need to evaluate all parent or child documents any more.
* Cut the `_parent` field over to use doc values. This significantly reduces the on heap memory footprint of parent/child, because the parent id values are never loaded into memory.
Breaking changes:
* The `type` option on the `_parent` field can only point to a parent type that doesn't exist yet, so this means that an existing type/mapping can't become a parent type any longer.
* The `has_child` and `has_parent` queries can no longer be use in alias filters.
All these changes, improvements and breaks in compatibility only apply for indices created with ES version 2.0 or higher. For indices creates with ES <= 2.0 the older implementation is used.
It is highly recommended to re-index all your indices with parent and child documents to benefit from all the improvements that come with this refactoring. The easiest way to achieve this is by using the scan and bulk apis using a simple script.
Closes#6107Closes#8134
This change unifies the way scripts and templates are specified for all instances in the codebase. It builds on the Script class added previously and adds request building and parsing support as well as the ability to transfer script objects between nodes. It also adds a Template class which aims to provide the same functionality for template APIs
Closes#11091
In #11072 we are adding a check that will prevent opening of old indices. However, this check doesn't take into consideration the fact that indices can be made compatible with the current version through upgrade API. In order to make compatibility check aware of the upgrade, the upgrade API should write a new setting `index.version.minimum_compatible` that will indicate the minimum compatible version of lucene this index is compatible with and `index.version.upgraded` that will indicate the version of elasticsearch that performed the upgrade.
Closes#11095
Updated to not mislead the reader that the data is actually gone when a document is updated. For example if you have 100GB of docs and update each one you'll only be able to access 100GB of the data, but there would theoretically be 200GB of doc data.
Closes#10375
The count api used to have its own execution path, although it would do the same (up to bugs!) of the search api. This commit makes it a shortcut to the search api with size set to 0. The change is made in a backwards compatible manner, by leaving all of the java api code around too, given that you may not want to get back a whole SearchResponse when asking only for number of hits matching a query, also cause migrating from countResponse.getCount() to searchResponse.getHits().totalHits() doesn't look great from a user perspective. We can always decide to drop more code around the count api if we want to break backwards compatibility on the java api, making it a shortcut on the rest layer only.
Closes#9117Closes#11198
Add support for a specific deprecation logging that can be used to turn
on in order to notify users of a specific feature, flag, setting,
parameter, ... being deprecated.
The deprecation logger logs with a "deprecation." prefix logge
(or "org.elasticsearch.deprecation." if full name is used), and outputs
the logging to a dedicated deprecation log file.
Deprecation logging are logged under the DEBUG category. The idea is not to
enabled them by default (under WARN or ERROR) when running embedded in
another application.
By default they are turned off (INFO), in order to turn it on, the
"deprecation" category need to be set to DEBUG. This can be set in the
logging file or using the cluster update settings API, see the documentation
Closes#11033
This change adds a new "filter_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.
For example, returning only the shards that failed to be optimized:
```
curl -XPOST 'localhost:9200/beer/_optimize?filter_path=_shards.failed'
{"_shards":{"failed":0}}%
```
It supports multiple filters (separated by a comma):
```
curl -XGET 'localhost:9200/_mapping?pretty&filter_path=*.mappings.*.properties.name,*.mappings.*.properties.title'
```
It also supports the YAML response format. Here it returns only the `_id` field of a newly indexed document:
```
curl -XPOST 'localhost:9200/library/book?filter_path=_id' -d '---hello:\n world: 1\n'
---
_id: "AU0j64-b-stVfkvus5-A"
```
It also supports wildcards. Here it returns only the host name of every nodes in the cluster:
```
curl -XGET 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}
```
And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:
```
curl 'http://localhost:9200/_segments?pretty&filter_path=indices.**.version'
{
"indices" : {
"beer" : {
"shards" : {
"0" : [ {
"segments" : {
"_0" : {
"version" : "5.2.0"
},
"_1" : {
"version" : "5.2.0"
}
}
} ]
}
}
}
}
```
Note that elasticsearch sometimes returns directly the raw value of a field, like the _source field. If you want to filter _source fields, you should consider combining the already existing _source parameter (see Get API for more details) with the filter_path parameter like this:
```
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
"hits" : {
"hits" : [ {
"_source":{"title":"Book #2"}
}, {
"_source":{"title":"Book #1"}
}, {
"_source":{"title":"Book #3"}
} ]
}
}
```
#10032 introduced the notion of sealing an index by marking it with a special read only marker, allowing for a couple of optimization to happen. The most important one was to speed up recoveries of shards where we know nothing has changed since they were online by skipping the file based sync phase. During the implementation we came up with a light notion which achieves the same recovery benefits but without the read only aspects which we dubbed synced flush. The fact that it was light weight and didn't put the index in read only mode, allowed us to do it automatically in the background which has great advantage. However we also felt the need to allow users to manually trigger this operation.
The implementation at #11179 added the sync flush internal logic and the manual (rest) rest API. The name of the API was modeled after the sealing terminology which may end up being confusing. This commit changes the API name to match the internal synced flush naming, namely `{index}/_flush/synced'.
On top of that it contains a couple other changes:
- Remove all java client API. This feature is not supposed to be called programtically by applications but rather by admins.
- Improve rest responses making structure similar to other (flush) API
- Change IndexShard#getOperationsCount to exclude the internal +1 on open shard . it's confusing to get 1 while there are actually no ongoing operations
- Some minor other clean ups
This option is broken currently since it potentially interprets an incoming
binary value as compressed while it just happens that the first bytes are the
same as the LZF header.
Mappings conflicts should not be ignored. If I read the history correctly, this
option was added when a mapping update to an existing field was considered a
conflict, even if the new mapping was exactly the same. Now that mapping updates
are smart enough to detect conflicting options, we don't need an option to
ignore conflicts.
The default `false` for `require_field_match` is a bit odd and confusing for users, given that field names get ignored by default and every field gets highlighted if it contains terms extracted out of the query, regardless of which fields were queries. Changed the default to `true`, it can always be changed per request.
Closes#10627Closes#11067
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.
One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).
Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.
Closes#10625Closes#11077
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now
support a new `missing` option which defines the value to consider when a
field does not have a value. This can be handy if you eg. want a terms
aggregation to handle the same way documents that have "N/A" or no value
for a `tag` field.
This works in a very similar way to the `missing` option on the `sort`
element.
One known issue is that this option sometimes cannot make the right decision
in the unmapped case: it needs to replace all values with the `missing` value
but might not know what kind of values source should be produced (numerics,
strings, geo points?). For this reason, we might want to add an `unmapped_type`
option in the future like we did for sorting.
Related to #5324
Previously, collate feature would be executed on all shards of an index using the client,
this leads to a deadlock when concurrent collate requests are run from the _search API,
due to the fact that both the external request and internal collate requests use the
same search threadpool.
As phrase suggestions are generated from the terms of the local shard, in most cases the
generated suggestion, which does not yield a hit for the collate query on the local shard
would not yield a hit for collate query on non-local shards.
Instead of using the client for collating suggestions, collate query is executed against
the ContextIndexSearcher. This PR removes the ability to specify a preference for a collate
query, as the collate query is only run on the local shard.
closes#9377
Add methods to operate on multi-valued fields in the expressions language.
Note that users will still not be able to access individual values
within a multi-valued field.
The following methods will be included:
* min
* max
* avg
* median
* count
* sum
Additionally, changes have been made to MultiValueMode to support the
new median method.
closes#11105
These clauses filter the document space without affecting scoring and map to
Lucene's BooleanClause.Occur.FILTER. The `filtered` query is now deprecated and
```json
{
"filtered": {
"query": { //query },
"filter": { //filter }
}
}
```
should be replaced with
```json
{
"bool": {
"must": { //query },
"filter": { //filter }
}
}
```
A few meta fields can currently be set within a document's source.
However, the recommended way to set meta fields like this is through
the api, and setting within the document can be a performance trap
(e.g. needing to find _id in order to route the document).
This change removes the ability to set meta fields within
a document source for 2.0+ indexes.
closes#11051closes#11074
As a follow up to #10870, this removes support for
index templates on disk. It also removes a missed
place still allowing disk based mappings.
closes#11052
Meta fields were locked down to not allow exotic options to the
underlying field types in #8143. This change fixes the docs
to no longer refer to the old settings.
closes#10879
This commit makes queries and filters parsed the same way using the
QueryParser abstraction. This allowed to remove duplicate code that we had
for similar queries/filters such as `range`, `prefix` or `term`.
Getting this to work would be a lot of work (creating two different
repositories, having another GPG key, integrating this into our build).
Closes#6498
Removes the More Like This API, users should now use the More Like This query.
The MLT API tests were converted to their query equivalent. Also some clean
ups in MLT tests.
Closes#10736Closes#11003
Previously, we were using the "statistical", technically accurate name. Instead, we
should probably use the name that people are familiar with, e.g. "Holt Winters" instead
of "triple exponential". To that end:
- `single_exp` becomes `ewma` (exponentially weighted moving average)
- `double_exp` becomes `holt`
When the `triple_exp` is added, it will be called `holt_winters`.
Current features (eg. update API) and future features (eg. reindex API)
depend on _source. This change locks down the field so that
it can no longer be disabled. It also removes legacy settings
compress/compress_threshold.
closes#8142closes#10915
This change removes the multiple implementations of different admin interfaces and centralizes it with AbstractClient. It also makes sure *all* executions of actions now go through a single AbstractClient#execute method, taking care of copying headers and wrapping listener.
This also has the side benefit of removing all the code around differnet possible clients, and removes quite a bit of code (most of the + code is actually removal of generics and such).
This change also changes how TransportClient is constructed, requiring a Builder to create it, its a breaking change and its noted in the migration guide.
Yea another step towards simplifying the action infra and making it simpler...
This removes Elasticsearch's filter cache and uses Lucene's instead. It has some
implications:
- custom cache keys (`_cache_key`) are unsupported
- decisions are made internally and can't be overridden by users ('_cache`)
- not only filters can be cached but also all queries that do not need scores
- parent/child queries can now be cached, however cached entries are only
valid for the current top-level reader so in practice it will likely only
be used on read-only indices
- the cache deduplicates filters, which plays nicer with large keys (eg. `terms`)
- better stats: we already had ram usage and evictions, but now also hit count,
miss count, lookup count, number of cached doc id sets and current number of
doc id sets in the cache
- dynamically changing the filter cache size is not supported anymore
Internally, an important change is that it removes the NoCacheFilter infrastructure
in favour of making Query.rewrite specializing the query for the current reader so
that it will only be cached on this reader (look for IndexCacheableQuery).
Note that consuming filters with the query API (createWeight/scorer) instead of
the filter API (getDocIdSet) is important for parent/child queries because
otherwise a QueryWrapperFilter(ParentQuery) would run the wrapped query per
segment while relations might be cross segments.
Expose new span queries from https://issues.apache.org/jira/browse/LUCENE-6083
Within returns matches from 'little' that are enclosed inside of a match from 'big'.
Containing returns matches from 'big' that enclose matches from 'little'.
Added infrastructure to allow basic member methods in the expressions
language to be called. The methods must have a signature with no arguments. Also
added the following member methods for date fields (and it should be easy to add more)
* getYear
* getMonth
* getDayOfMonth
* getHourOfDay
* getMinutes
* getSeconds
Allow fields to be accessed without using the member variable [value].
(Note that both ways can be used to access fields for back-compat.)
closes#10890
Using files that must be specified on each node is an anti-pattern
from the API based goal of ES. This change removes the ability
to specify the default mapping with a file on each node.
closes#10620
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back.
To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood).
Change highlights:
- Refactor Translog file management to allow for multiple files.
- Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene.
- A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations.
- Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost.
- IndexShard now creates the translog together with the engine. The translog is closed by the engine on close. ShadowIndexShards do not open the translog.
- Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`)
Closes#10624
The assumption is that gaps in histogram are generally undesirable, for instance
if you want to build a visualization from it. Additionally, we are building new
aggregations that require that there are no gaps to work correctly (eg.
derivatives).
this commit removes the obsolete settings for distributors and updates
the documentation on multiple data.path. It also adds an explain to the
migration guide.
Relates to #9498Closes#10770
Remove the ability to specify search type ‘query_and_fetch’ and
‘df_query_and_fetch’ from the REST API.
- Adds REST tests
- Updates REST API spec to remove ‘query_and_fetch’ and
‘df_query_and_fetch’ as options
- Removes documentation for these options
Closes#9606
Regardless of the outcome of #8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.
closes#10814
field_value_factor now takes a default that is used if the document doesn't
have a value for that field. It looks like:
"field_value_factor": {
"field": "popularity",
"missing": 1
}
Closes#10841
Groovy sandboxing was disabled by default from 1.4.3 on though since we found out that it could be worked around, so it makes little sense to keep it and maintain it.
Closes#10156Closes#10480
This changes the default ranking behaviour of single-term queries on numeric fields to use the usual Lucene TermQuery scoring logic rather than a constant-scoring wrapper.
Closes#10628
* Removed the docs for `index.compound_format` and `index.compound_on_flush` - these are expert settings which should probably be removed (see https://github.com/elastic/elasticsearch/issues/10778)
* Removed the docs for `index.index_concurrency` - another expert setting
* Labelled the segments verbose output as experimental
* Marked the `compression`, `precision_threshold` and `rehash` options as experimental in the cardinality and percentile aggs
* Improved the experimental text on `significant_terms`, `execution_hint` in the terms agg, and `terminate_after` param on count and search
* Removed the experimental flag on the `geobounds` agg
* Marked the settings in the `merge` and `store` modules as experimental, rather than the modules themselves
Closes#10782
Hi there. I've been experimenting with the search templates recently and I'm a bit confused. Shouldn't the Mustache tags be written like `{{tagname}}` instead of `{tagname}`? Your using `{{...}}` [here](http://www.elastic.co/guide/en/elasticsearch/reference/current/search-template.html) BTW.
Using the first example in that page seems to indicate that something's wrong, or am I missing something?
```
$ curl 'localhost:9200/test/_search' -d '{"query":{"template":{"query":{"match":{"text":"{keywords}"}},"params":{"keywords":"value1_foo"}}}}'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
$ curl 'localhost:9200/test/_search' -d '{"query":{"template":{"query":{"match":{"text":"{{keywords}}"}},"params":{"keywords":"value1_foo"}}}}'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"testtype","_id":"1","_score":1.0,"_source":{"text":"value1_foo"}}]}}
```
Disabling doc values or trying to index hash values are not
correct uses of this the murmur3 field type, and just cause
problems. This disallows changing doc values or index options
for 2.0+.
closes#10465
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field.
An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.
This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick.
The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based.
Closes#10523
If a user explicitly defined the tree_level or precision parameter in a geo_shape mapping their specification was always overridden by the default_error_pct parameter (even though our docs say this parameter is a 'hint'). This lead to unexpected accuracy problems in the results of a geo_shape filter. (example provided in issue #9691)
This simple patch fixes the unexpected behavior by setting the default distance_error_pct parameter to zero when the tree_level or precision parameters are provided by the user. Under the covers the quadtree will now use the tree level defined by the user. The docs will be updated to alert the user to exercise caution with these parameters. Specifying a precision of "1m" for an index using large complex shapes can quickly lead to OOM issues.
closes#9691
In Lucene 5.1 lots of filters got deprecated in favour of equivalent queries.
Additionally, random-access to filters is now replaced with approximations on
scorers. This commit
- replaces the deprecated NumericRangeFilter, PrefixFilter, TermFilter and
TermsFilter with NumericRangeQuery, PrefixQuery, TermQuery and TermsQuery,
wrapped in a QueryWrapperFilter
- replaces XBooleanFilter, AndFilter and OrFilter with a BooleanQuery in a
QueryWrapperFilter
- removes DocIdSets.isBroken: the new two-phase iteration API will now help
execute slow filters efficiently
- replaces FilterCachingPolicy with QueryCachingPolicy
Close#8960
This option defaults to false, because it is also important to upgrade
the "merely old" segments since many Lucene improvements happen within
minor releases.
But you can pass true to do the minimal work necessary to upgrade to
the next major Elasticsearch release.
The HTTP GET upgrade request now also breaks out how many bytes of
ancient segments need upgrading.
Closes#10213Closes#10540
Conflicts:
dev-tools/create_bwc_index.py
rest-api-spec/api/indices.upgrade.json
src/main/java/org/elasticsearch/action/admin/indices/optimize/OptimizeRequest.java
src/main/java/org/elasticsearch/action/admin/indices/optimize/ShardOptimizeRequest.java
src/main/java/org/elasticsearch/action/admin/indices/optimize/TransportOptimizeAction.java
src/main/java/org/elasticsearch/index/engine/InternalEngine.java
src/test/java/org/elasticsearch/bwcompat/StaticIndexBackwardCompatibilityTest.java
src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
src/test/java/org/elasticsearch/rest/action/admin/indices/upgrade/UpgradeReallyOldIndexTest.java
This adds a new feature to the Term Vectors API which allows for filtering of
terms based on their tf-idf scores. With `dfs` option on, this could be useful
for finding out a good characteric vector of a document or a set of documents.
The parameters are similar to the ones used in the MLT Query.
Closes#9561
This commit adds a `rewrite` parameter to the validate API in order to shown
how the given query is re-written into primitive queries. For example, an MLT
query is re-written into a disjunction of the selected terms. Other use cases
include `fuzzy`, `common_terms`, or `match` query especially with a
`cutoff_frequency` parameter. Note that the explanation is only given for a
single randomly chosen shard only, so the output may vary from one shard to
another.
Relates #1412Closes#10147
This is really a Collector instead of a filter. This commit deprecates the
`limit` filter, makes it a no-op and recommends to use the `terminate_after`
parameter instead that we introduced in the meantime.
* In code, we mark `River`, `AbstractRiverComponent`, `RiverComponent` and `RiverName` classes as deprecated
* We log that information when a cluster is still using it
* We add this information in the plugins list as well
Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and even the bottleneck
if your query is selective.
This commit switches to Lucene regular expressions instead of Java (not exactly
the same syntax yet most existing regular expressions should keep working) and
uses the same logic as RegExpQuery to intersect the regular expression with the
terms dictionary. I wrote a quick benchmark (in the PR) to make sure it made
things faster and the same request that took 750ms on master now takes 74ms with
this change.
Close#7526
For bacwards compatibility reasons routing_nodes were previously printed out when routing_table was requested, together with the actual routing_table. Now they are printed out only when requests through `routing_nodes` flag.
Relates to #10412Closes#10486
Removed the following methods from `ScriptService`, which don't require the `ScriptContext` argument:
```
public CompiledScript compile(String lang, String script, ScriptType scriptType)
public ExecutableScript executable(String lang, String script, ScriptType scriptType, Map<String, Object> vars)
public SearchScript search(SearchLookup lookup, String lang, String script, ScriptType scriptType, @Nullable Map<String, Object> vars)
```
Also removed the ScriptContext.Standard.GENERIC_PLUGIN enum value, as it was used only for backwards compatibility.
Plugins that make use of scripts should declare their own script contexts through `ScriptModule#registerScriptContext` and use them when compiling/executing scripts.
Closes#10476
Plugins can now define multiple operations/contexts that they use scripts for. Fine-grained settings can then be used to enable/disable scripts based on each single registered context.
Also added a new generic category called `plugin`, which will be used as a default when the context is not specified. This allows us to restore backwards compatibility for plugins on `ScriptService` by restoring the old methods that don't require the script context and making them internally use the `plugin` context, as they can only be called from plugins.
Closes#10347Closes#10419
Relates to #10154 and #10150
Adds link to additional information on how document frequencies are treated across shards to the cutoff_frequency parameter documentation.
Closes#10451
We had an undocumented parameter called `numeric_resolution` which allows to
configure how to deal with dates when provided as a number. The default is to
handle them as milliseconds, but you can also opt-on for eg. seconds.
Close#10072
This pull request makes boolean handled like dates and ipv4 addresses: things
are stored as as numerics under the hood and aggregations add some special
formatting logic in order to return true/false in addition to 1/0.
For example, here is an output of a terms aggregation on a boolean field:
```
"aggregations": {
"top_f": {
"doc_count_error_upper_bound": 0,
"buckets": [
{
"key": 0,
"key_as_string": "false",
"doc_count": 2
},
{
"key": 1,
"key_as_string": "true",
"doc_count": 1
}
]
}
}
```
Sorted numeric doc values are used under the hood.
Close#4678Close#7851
Now that fine-grained script settings are supported (#10116) we can remove support for the script.disable_dynamic setting.
Same result as `script.disable_dynamic: false` can be obtained as follows:
```
script.inline: on
script.indexed: on
```
An exception is thrown at startup when the old setting is set, so we make sure we tell users they have to change it rather than ignoring the setting.
Closes#10286
This commit brings the benefits of the `count` search type to search requests
that have a `size` of 0:
- a single round-trip to shards (no fetch phase)
- ability to use the query cache
Since `count` now provides no benefits over `query_then_fetch`, it has been
deprecated.
Close#7630
Allow to on/off scripting based on their source (where they get loaded from), the operation that executes them and their language.
The settings cover the following combinations:
- mode: on, off, sandbox
- source: indexed, dynamic, file
- engine: groovy, expressions, mustache, etc
- operation: update, search, aggs, mapping
The following settings are supported for every engine:
script.engine.groovy.indexed.update: sandbox/on/off
script.engine.groovy.indexed.search: sandbox/on/off
script.engine.groovy.indexed.aggs: sandbox/on/off
script.engine.groovy.indexed.mapping: sandbox/on/off
script.engine.groovy.dynamic.update: sandbox/on/off
script.engine.groovy.dynamic.search: sandbox/on/off
script.engine.groovy.dynamic.aggs: sandbox/on/off
script.engine.groovy.dynamic.mapping: sandbox/on/off
script.engine.groovy.file.update: sandbox/on/off
script.engine.groovy.file.search: sandbox/on/off
script.engine.groovy.file.aggs: sandbox/on/off
script.engine.groovy.file.mapping: sandbox/on/off
For ease of use, the following more generic settings are supported too:
script.indexed: sandbox/on/off
script.dynamic: sandbox/on/off
script.file: sandbox/on/off
script.update: sandbox/on/off
script.search: sandbox/on/off
script.aggs: sandbox/on/off
script.mapping: sandbox/on/off
These will be used to calculate the more specific settings, using the stricter setting of each combination. Operation based settings have precedence over conflicting source based ones.
Note that the `mustache` engine is affected by generic settings applied to any language, while native scripts aren't as they are static by definition.
Also, the previous `script.disable_dynamic` setting can now be deprecated.
Closes#6418Closes#10116Closes#10274
Deleting a type from an index is inherently dangerous because
the type can be recreated with new mappings which may conflict
with existing segments still using the old mappings. This
removes the ability to delete a type (similar to how deleting
fields within a type is not allowed, for the same reason).
closes#8877closes#10231
Documentation states false as the default for "validate", "validate_lon", and "validate_lat" leading to confusion as described in issue #9539. This simple fix corrects the documentation and communicates that these fields will be deprecated and removed in upcoming versions.
closes#9539
I've been attempting to programatically verify that adding index templates via the `{path.conf}/templates/` directory works fine although I was never able to validate this via an API call to the `/_template/`. It seems that these templates do not appear in that API call, which I discovered in the following mail thread:
http://elasticsearch-users.115913.n3.nabble.com/Loading-of-index-settings-template-from-file-in-config-templates-td4024923.html#d1366317284000-912
My question is why wouldn't the `/_template/*` method return these templates? This tends to complicate things for those that want to perform automated tests to verify that they are in fact being recognized and used by Elasticsearch.
This commit changes the behaviour of the delete api when processing a delete request that refers to a type that has routing set to required in the mapping, and the routing is missing in the request. Up until now the delete api sent a broadcast delete request to all of the shards that belong to the index, making sure that the document could be found although the routing value wasn't specified. This was probably not the best choice: if the routing is set to required, an error should be thrown instead.
A `RoutingMissingException` gets now thrown instead, like it happens in the same situation with every other api (index, update, get etc.). Last but not least, this change allows to get rid of a couple of `TransportAction`s, `Request`s and `Response`s and simplify the codebase.
Closes#9123Closes#10136
Adds a setting to disable detailed error messages and full exception stack traces
in HTTP responses. When set to false, the error_trace request parameter will result
in a HTTP 400 response. When the error_trace parameter is not present, the message
of the first ElasticsearchException will be output and no nested exception messages
will be output.
This commit adds the current total number of translog operations to the recovery reporting API. We also expose the recovered / total percentage:
```
"translog": {
"recovered": 536,
"total": 986,
"percent": "54.3%",
"total_time": "2ms",
"total_time_in_millis": 2
},
```
Closes#9368Closes#10042
The behaviour is better in the case someone has multiple levels of nested object fields defined in the mapping and like to define a single inner_hits definition that is two or more levels deep.
If someone wants inner hits on a nested field that is 2 levels deep the following would need to be defined:
```
{
...
"inner_hits" : {
"path" : {
"level1" : {
"inner_hits" : {
"path" : {
"level2" : {
"query" : { .... }
}
}
}
}
}
}
}
```
With this change the above can be defined as:
```
{
...
"inner_hits" : {
"path" : {
"level1.level2" : {
"query" : { .... }
}
}
}
}
```
Closes#9251
The analysis chain should be used instead of relying on this, as it is
confusing when dealing with different per-field analysers.
The `locale` option was only used for `lowercase_expanded_terms`, which,
once removed, is no longer needed, so it was removed as well.
Fixes#9978
Relates to #9973
This commit adds scripting capability to significant_terms.
Custom heuristics can be implemented with a script that provides
parameters subset_freq, superset_freq,subset_size, superset_size.
closes#7850
Changed search_type docs to reflect that the `(dfs_)query_and_fetch` modes are an internal optimization and should not be specified explicitly by the user.
Relates to #9606
As explained in elasticsearch/elasticsearch-mapper-attachments#101, we should have consistent documentation.
The best option is to link the documentation in elasticsearch guide to the most recent README in the plugin repo.
Closes#9756
The request tracer logs in TRACE level under the `transport.tracer` log and is dynamically configurable with include and exclude arrays to filter out unneeded info. By default all requests are logged with the exception of fault detection pings (fired every second).
add the notion of tracers in the MockTransportService for testing purposes
Closes#9286
To support the `_recovery` API, the recovery process keeps track of current progress in a class called RecoveryState. This class currently have some issues, mostly around concurrency (see #6644 ). This PR cleans it up as well as other issues around it:
- Make the Index subsection API cleaner:
- remove redundant information - all calculation is done based on the underlying file map
- clearer definition of what is what: total files, vs reused files (local files that match the source) vs recovered files (copied over). % based progress is reported based on recovered files only.
- cleaned up json response to match other API (sadly this breaks the structure). We now properly report human values for dates and other units.
- Add more robust unit testing
- Detail flag was passed along as state (it's now a ToXContent param)
- State lookup during reporting is now always done via the IndexShard , no more fall backs to many other classes.
- Cleanup APIs around time and move the little computations to the state class as opposed to doing them out of the API
I also improved error messages out of the REST testing infra for things I run into.
Closes#6644Closes#9811
Together with #8782 it should help in the situations simliar to #8887 by adding an ability to get information about currently running snapshot without accessing the repository itself.
Closes#8887
The number of current pending tasks is useful to detect and overloaded master. This commit adds it to the cluster health API. The complete list can be retrieved from the dedicated pending tasks API.
It also adds rest tests for the cluster health variants.
Closes#9877
There are two implications to this change.
First, percolator now uses _uid internally, extracting the id portion
when needed. Second, sorting on _id is no longer possible, since you
can no longer index _id. However, _uid can still be used to sort, and
is better anyways as indexing _id just to make it available to
fielddata for sorting is wasteful.
see #8143closes#9842
Double negatives are confusing, but a triple negative (1 no, 2 non, 3 null)? It takes five minutes to understand this little sentence. Cleaned that up a bit.
Closes#9789
This change removes the deprecated script parameter names ('file', 'id', and 'scriptField').
It also removes the ability to load file scripts using the 'script' parameter. File scripts should be loaded using the 'script_file' parameter only.
This was previously attempted in #8854. I revived that branch and did
some performance testing as was suggested in the comments there.
I fixed all the errors, mostly just the rest tests, which
needed to have http enabled on the node settings (the global cluster
previously had this always enabled). I also addressed the comments from
that issue.
My performance tests involved running the entire test suite on my
desktop which has 6 cores, 16GB of ram, and nothing else was being
run on the box at the time. I ran each set of settings 3 times and
took the average time.
| mode | master | patch | diff |
| ------- | ------ | ----- | ---- |
| local | 409s | 417s | +2% |
| network | 368s | 380s | +3% |
This increase in average time is clearly worthwhile to pay to achieve
isolation of tests. One caveat is the way I fixed the rest tests
is still to have one cluster for the entire suite, so all the rest
tests can still potentially affect each other, but this is an
issue for another day.
There were some oddities that I noticed while running these tests
that I would like to point out, as they probably deserve some
investigation (but orthogonal to this PR):
* The total test run times are highly variable (more than a minute between the min and max)
* Running in network mode is on average actually *faster* than local mode. How is this possible!?
This commit makes the `postings_format` and `doc_values_format` options of
mappings illegal on 2.0 and ignored on 1.x (meaning that the default postings
and doc values formats from the codec will be used in such a case).
This removes a fair amount of code.
Close#8746#9741