This change adds a new "filter_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.
For example, returning only the shards that failed to be optimized:
```
curl -XPOST 'localhost:9200/beer/_optimize?filter_path=_shards.failed'
{"_shards":{"failed":0}}%
```
It supports multiple filters (separated by a comma):
```
curl -XGET 'localhost:9200/_mapping?pretty&filter_path=*.mappings.*.properties.name,*.mappings.*.properties.title'
```
It also supports the YAML response format. Here it returns only the `_id` field of a newly indexed document:
```
curl -XPOST 'localhost:9200/library/book?filter_path=_id' -d '---hello:\n world: 1\n'
---
_id: "AU0j64-b-stVfkvus5-A"
```
It also supports wildcards. Here it returns only the host name of every nodes in the cluster:
```
curl -XGET 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}
```
And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:
```
curl 'http://localhost:9200/_segments?pretty&filter_path=indices.**.version'
{
"indices" : {
"beer" : {
"shards" : {
"0" : [ {
"segments" : {
"_0" : {
"version" : "5.2.0"
},
"_1" : {
"version" : "5.2.0"
}
}
} ]
}
}
}
}
```
Note that elasticsearch sometimes returns directly the raw value of a field, like the _source field. If you want to filter _source fields, you should consider combining the already existing _source parameter (see Get API for more details) with the filter_path parameter like this:
```
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
"hits" : {
"hits" : [ {
"_source":{"title":"Book #2"}
}, {
"_source":{"title":"Book #1"}
}, {
"_source":{"title":"Book #3"}
} ]
}
}
```
#10032 introduced the notion of sealing an index by marking it with a special read only marker, allowing for a couple of optimization to happen. The most important one was to speed up recoveries of shards where we know nothing has changed since they were online by skipping the file based sync phase. During the implementation we came up with a light notion which achieves the same recovery benefits but without the read only aspects which we dubbed synced flush. The fact that it was light weight and didn't put the index in read only mode, allowed us to do it automatically in the background which has great advantage. However we also felt the need to allow users to manually trigger this operation.
The implementation at #11179 added the sync flush internal logic and the manual (rest) rest API. The name of the API was modeled after the sealing terminology which may end up being confusing. This commit changes the API name to match the internal synced flush naming, namely `{index}/_flush/synced'.
On top of that it contains a couple other changes:
- Remove all java client API. This feature is not supposed to be called programtically by applications but rather by admins.
- Improve rest responses making structure similar to other (flush) API
- Change IndexShard#getOperationsCount to exclude the internal +1 on open shard . it's confusing to get 1 while there are actually no ongoing operations
- Some minor other clean ups
This option is broken currently since it potentially interprets an incoming
binary value as compressed while it just happens that the first bytes are the
same as the LZF header.
Mappings conflicts should not be ignored. If I read the history correctly, this
option was added when a mapping update to an existing field was considered a
conflict, even if the new mapping was exactly the same. Now that mapping updates
are smart enough to detect conflicting options, we don't need an option to
ignore conflicts.
The default `false` for `require_field_match` is a bit odd and confusing for users, given that field names get ignored by default and every field gets highlighted if it contains terms extracted out of the query, regardless of which fields were queries. Changed the default to `true`, it can always be changed per request.
Closes#10627Closes#11067
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.
One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).
Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.
Closes#10625Closes#11077
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now
support a new `missing` option which defines the value to consider when a
field does not have a value. This can be handy if you eg. want a terms
aggregation to handle the same way documents that have "N/A" or no value
for a `tag` field.
This works in a very similar way to the `missing` option on the `sort`
element.
One known issue is that this option sometimes cannot make the right decision
in the unmapped case: it needs to replace all values with the `missing` value
but might not know what kind of values source should be produced (numerics,
strings, geo points?). For this reason, we might want to add an `unmapped_type`
option in the future like we did for sorting.
Related to #5324
Previously, collate feature would be executed on all shards of an index using the client,
this leads to a deadlock when concurrent collate requests are run from the _search API,
due to the fact that both the external request and internal collate requests use the
same search threadpool.
As phrase suggestions are generated from the terms of the local shard, in most cases the
generated suggestion, which does not yield a hit for the collate query on the local shard
would not yield a hit for collate query on non-local shards.
Instead of using the client for collating suggestions, collate query is executed against
the ContextIndexSearcher. This PR removes the ability to specify a preference for a collate
query, as the collate query is only run on the local shard.
closes#9377
Add methods to operate on multi-valued fields in the expressions language.
Note that users will still not be able to access individual values
within a multi-valued field.
The following methods will be included:
* min
* max
* avg
* median
* count
* sum
Additionally, changes have been made to MultiValueMode to support the
new median method.
closes#11105
These clauses filter the document space without affecting scoring and map to
Lucene's BooleanClause.Occur.FILTER. The `filtered` query is now deprecated and
```json
{
"filtered": {
"query": { //query },
"filter": { //filter }
}
}
```
should be replaced with
```json
{
"bool": {
"must": { //query },
"filter": { //filter }
}
}
```
A few meta fields can currently be set within a document's source.
However, the recommended way to set meta fields like this is through
the api, and setting within the document can be a performance trap
(e.g. needing to find _id in order to route the document).
This change removes the ability to set meta fields within
a document source for 2.0+ indexes.
closes#11051closes#11074
As a follow up to #10870, this removes support for
index templates on disk. It also removes a missed
place still allowing disk based mappings.
closes#11052
Meta fields were locked down to not allow exotic options to the
underlying field types in #8143. This change fixes the docs
to no longer refer to the old settings.
closes#10879
This commit makes queries and filters parsed the same way using the
QueryParser abstraction. This allowed to remove duplicate code that we had
for similar queries/filters such as `range`, `prefix` or `term`.
Getting this to work would be a lot of work (creating two different
repositories, having another GPG key, integrating this into our build).
Closes#6498
Removes the More Like This API, users should now use the More Like This query.
The MLT API tests were converted to their query equivalent. Also some clean
ups in MLT tests.
Closes#10736Closes#11003
Previously, we were using the "statistical", technically accurate name. Instead, we
should probably use the name that people are familiar with, e.g. "Holt Winters" instead
of "triple exponential". To that end:
- `single_exp` becomes `ewma` (exponentially weighted moving average)
- `double_exp` becomes `holt`
When the `triple_exp` is added, it will be called `holt_winters`.
Current features (eg. update API) and future features (eg. reindex API)
depend on _source. This change locks down the field so that
it can no longer be disabled. It also removes legacy settings
compress/compress_threshold.
closes#8142closes#10915
This change removes the multiple implementations of different admin interfaces and centralizes it with AbstractClient. It also makes sure *all* executions of actions now go through a single AbstractClient#execute method, taking care of copying headers and wrapping listener.
This also has the side benefit of removing all the code around differnet possible clients, and removes quite a bit of code (most of the + code is actually removal of generics and such).
This change also changes how TransportClient is constructed, requiring a Builder to create it, its a breaking change and its noted in the migration guide.
Yea another step towards simplifying the action infra and making it simpler...
This removes Elasticsearch's filter cache and uses Lucene's instead. It has some
implications:
- custom cache keys (`_cache_key`) are unsupported
- decisions are made internally and can't be overridden by users ('_cache`)
- not only filters can be cached but also all queries that do not need scores
- parent/child queries can now be cached, however cached entries are only
valid for the current top-level reader so in practice it will likely only
be used on read-only indices
- the cache deduplicates filters, which plays nicer with large keys (eg. `terms`)
- better stats: we already had ram usage and evictions, but now also hit count,
miss count, lookup count, number of cached doc id sets and current number of
doc id sets in the cache
- dynamically changing the filter cache size is not supported anymore
Internally, an important change is that it removes the NoCacheFilter infrastructure
in favour of making Query.rewrite specializing the query for the current reader so
that it will only be cached on this reader (look for IndexCacheableQuery).
Note that consuming filters with the query API (createWeight/scorer) instead of
the filter API (getDocIdSet) is important for parent/child queries because
otherwise a QueryWrapperFilter(ParentQuery) would run the wrapped query per
segment while relations might be cross segments.
Expose new span queries from https://issues.apache.org/jira/browse/LUCENE-6083
Within returns matches from 'little' that are enclosed inside of a match from 'big'.
Containing returns matches from 'big' that enclose matches from 'little'.
Added infrastructure to allow basic member methods in the expressions
language to be called. The methods must have a signature with no arguments. Also
added the following member methods for date fields (and it should be easy to add more)
* getYear
* getMonth
* getDayOfMonth
* getHourOfDay
* getMinutes
* getSeconds
Allow fields to be accessed without using the member variable [value].
(Note that both ways can be used to access fields for back-compat.)
closes#10890
Using files that must be specified on each node is an anti-pattern
from the API based goal of ES. This change removes the ability
to specify the default mapping with a file on each node.
closes#10620
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back.
To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood).
Change highlights:
- Refactor Translog file management to allow for multiple files.
- Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene.
- A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations.
- Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost.
- IndexShard now creates the translog together with the engine. The translog is closed by the engine on close. ShadowIndexShards do not open the translog.
- Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`)
Closes#10624
The assumption is that gaps in histogram are generally undesirable, for instance
if you want to build a visualization from it. Additionally, we are building new
aggregations that require that there are no gaps to work correctly (eg.
derivatives).
this commit removes the obsolete settings for distributors and updates
the documentation on multiple data.path. It also adds an explain to the
migration guide.
Relates to #9498Closes#10770
Remove the ability to specify search type ‘query_and_fetch’ and
‘df_query_and_fetch’ from the REST API.
- Adds REST tests
- Updates REST API spec to remove ‘query_and_fetch’ and
‘df_query_and_fetch’ as options
- Removes documentation for these options
Closes#9606
Regardless of the outcome of #8142, we should at least enforce that
when _source is enabled, it is sufficient to reindex. This change
removes the excludes and includes settings, since these modify
the source, causing us to lose the ability to reindex some fields.
closes#10814
field_value_factor now takes a default that is used if the document doesn't
have a value for that field. It looks like:
"field_value_factor": {
"field": "popularity",
"missing": 1
}
Closes#10841
Groovy sandboxing was disabled by default from 1.4.3 on though since we found out that it could be worked around, so it makes little sense to keep it and maintain it.
Closes#10156Closes#10480
This changes the default ranking behaviour of single-term queries on numeric fields to use the usual Lucene TermQuery scoring logic rather than a constant-scoring wrapper.
Closes#10628
* Removed the docs for `index.compound_format` and `index.compound_on_flush` - these are expert settings which should probably be removed (see https://github.com/elastic/elasticsearch/issues/10778)
* Removed the docs for `index.index_concurrency` - another expert setting
* Labelled the segments verbose output as experimental
* Marked the `compression`, `precision_threshold` and `rehash` options as experimental in the cardinality and percentile aggs
* Improved the experimental text on `significant_terms`, `execution_hint` in the terms agg, and `terminate_after` param on count and search
* Removed the experimental flag on the `geobounds` agg
* Marked the settings in the `merge` and `store` modules as experimental, rather than the modules themselves
Closes#10782
Hi there. I've been experimenting with the search templates recently and I'm a bit confused. Shouldn't the Mustache tags be written like `{{tagname}}` instead of `{tagname}`? Your using `{{...}}` [here](http://www.elastic.co/guide/en/elasticsearch/reference/current/search-template.html) BTW.
Using the first example in that page seems to indicate that something's wrong, or am I missing something?
```
$ curl 'localhost:9200/test/_search' -d '{"query":{"template":{"query":{"match":{"text":"{keywords}"}},"params":{"keywords":"value1_foo"}}}}'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
$ curl 'localhost:9200/test/_search' -d '{"query":{"template":{"query":{"match":{"text":"{{keywords}}"}},"params":{"keywords":"value1_foo"}}}}'
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"testtype","_id":"1","_score":1.0,"_source":{"text":"value1_foo"}}]}}
```
Disabling doc values or trying to index hash values are not
correct uses of this the murmur3 field type, and just cause
problems. This disallows changing doc values or index options
for 2.0+.
closes#10465
The field stats api returns field level statistics such as lowest, highest values and number of documents that have at least one value for a field.
An api like this can be useful to explore a data set you don't know much about. For example you can figure at with the lowest and highest response times are, so that you can create a histogram or range aggregation with sane settings.
This api doesn't run a search to figure this statistics out, but rather use the Lucene index look these statics up (using Terms class in Lucene). So finding out these stats for fields is cheap and quick.
The min/max values are based on the type of the field. So for a numeric field min/max are numbers and date field the min/max date and other fields the min/max are term based.
Closes#10523
If a user explicitly defined the tree_level or precision parameter in a geo_shape mapping their specification was always overridden by the default_error_pct parameter (even though our docs say this parameter is a 'hint'). This lead to unexpected accuracy problems in the results of a geo_shape filter. (example provided in issue #9691)
This simple patch fixes the unexpected behavior by setting the default distance_error_pct parameter to zero when the tree_level or precision parameters are provided by the user. Under the covers the quadtree will now use the tree level defined by the user. The docs will be updated to alert the user to exercise caution with these parameters. Specifying a precision of "1m" for an index using large complex shapes can quickly lead to OOM issues.
closes#9691
In Lucene 5.1 lots of filters got deprecated in favour of equivalent queries.
Additionally, random-access to filters is now replaced with approximations on
scorers. This commit
- replaces the deprecated NumericRangeFilter, PrefixFilter, TermFilter and
TermsFilter with NumericRangeQuery, PrefixQuery, TermQuery and TermsQuery,
wrapped in a QueryWrapperFilter
- replaces XBooleanFilter, AndFilter and OrFilter with a BooleanQuery in a
QueryWrapperFilter
- removes DocIdSets.isBroken: the new two-phase iteration API will now help
execute slow filters efficiently
- replaces FilterCachingPolicy with QueryCachingPolicy
Close#8960
This option defaults to false, because it is also important to upgrade
the "merely old" segments since many Lucene improvements happen within
minor releases.
But you can pass true to do the minimal work necessary to upgrade to
the next major Elasticsearch release.
The HTTP GET upgrade request now also breaks out how many bytes of
ancient segments need upgrading.
Closes#10213Closes#10540
Conflicts:
dev-tools/create_bwc_index.py
rest-api-spec/api/indices.upgrade.json
src/main/java/org/elasticsearch/action/admin/indices/optimize/OptimizeRequest.java
src/main/java/org/elasticsearch/action/admin/indices/optimize/ShardOptimizeRequest.java
src/main/java/org/elasticsearch/action/admin/indices/optimize/TransportOptimizeAction.java
src/main/java/org/elasticsearch/index/engine/InternalEngine.java
src/test/java/org/elasticsearch/bwcompat/StaticIndexBackwardCompatibilityTest.java
src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
src/test/java/org/elasticsearch/rest/action/admin/indices/upgrade/UpgradeReallyOldIndexTest.java
This adds a new feature to the Term Vectors API which allows for filtering of
terms based on their tf-idf scores. With `dfs` option on, this could be useful
for finding out a good characteric vector of a document or a set of documents.
The parameters are similar to the ones used in the MLT Query.
Closes#9561
This commit adds a `rewrite` parameter to the validate API in order to shown
how the given query is re-written into primitive queries. For example, an MLT
query is re-written into a disjunction of the selected terms. Other use cases
include `fuzzy`, `common_terms`, or `match` query especially with a
`cutoff_frequency` parameter. Note that the explanation is only given for a
single randomly chosen shard only, so the output may vary from one shard to
another.
Relates #1412Closes#10147