Fixed testX and testSingleNodeNoFlush by specifying the mapping on index creation instead of relying on dynamic mapping. Dynamic mapping is updated at the cluster level asynchronously, and if the mapping changes are not applied to the cluster state before the node is closed, they are not available after the node restarts. While the data added in the test is preserved, the test still fails because the mapping is missing. This is a known issue that we are not planning to fix at the moment.
Currently realtime GET does not take source includes/excludes into account.
This patch adds support for the source field mapper includes/excludes
when getting an entry from the transaction log. Even though it introduces
a slight performance penalty, it now adheres to the defined configuration
instead of returning all source data when a realtime get is done.
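For illustration, a minimal sketch of the intended behaviour with hypothetical index, type and field names: a `_source` mapping that defines includes/excludes is now honoured even when the document is fetched in realtime from the transaction log.
```json
curl -XPUT 'localhost:9200/myindex/tweet/_mapping' -d '{
    "tweet" : {
        "_source" : {
            "includes" : ["message", "user.name"],
            "excludes" : ["user.location"]
        }
    }
}'
curl -XGET 'localhost:9200/myindex/tweet/1?realtime=true'
```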
The filter method of XContentMapValues filtered out nested arrays/lists completely due to a bug that threw away all data inside such an array.
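As an illustration with hypothetical field names: with an exclude such as `user.location` configured, a document like the one below previously lost the whole `user` array during filtering; after the fix only the excluded field is stripped from each element.
```json
{
    "title" : "foo",
    "user" : [
        { "name" : "alice", "location" : "berlin" },
        { "name" : "bob", "location" : "paris" }
    ]
}
```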
Closes#2944
This bug was a follow-up problem caused by the filtering of nested arrays when source exclusion was configured.
Analysing a numeric field will return UTF-16 representations of Lucene's numeric prefix terms. Those terms are meaningless in general unless used for lookups in the Lucene index. Passing a numeric field to the analysis action is most likely a bug.
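For illustration only (hypothetical index and field names), a request like the following against a field mapped as a numeric type would previously return the encoded prefix terms rather than anything meaningful:
```json
curl -XGET 'localhost:9200/myindex/_analyze?field=my_integer_field' -d '42'
```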
Closes#2953#2952
use the latest Lucene version as specified in o.e.common.lucene.Lucene and must be upgraded with each Lucene release.
This commit adds an assert that fails once the actual Lucene version in use is higher than the current release's version.
Lucene ships with a version constant that is mainly used to provide consistent behaviour across Lucene releases. Lucene's analysis capabilities are commonly applied at index and search time, such that the search-time behaviour should be identical to the index-time behaviour in most cases. Currently ElasticSearch always uses the latest version from Lucene, which can break backwards compatibility with the index for users that rely on behaviour that changed in a new Lucene version.
Users should always use the version the index was created with unless a version is explicitly configured.
closes#2945
don't produce broken positions anymore and prevent certain highlighter bugs that fail with
StringIndexOutOfBoundsException as in #2931
This commit breaks backwards compatibility in terms of highlighting when NGramTokenFilter is used.
The highlighter will highlight the entire term as produced by the tokenizer instead of the individual
sub-grams. To do sub-gram highlighting, the ngram tokenizer should be used. This behavior was based on
broken NGramTokenFilter behavior which will be fixed in Lucene 4.4 but was ported in this commit
to elasticsearch 0.90. The broken behavior can still be used if a version < LUCENE_42 is used
in the token filter mapping.
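A sketch of how the old behaviour could still be requested, using illustrative names and assuming the usual `version` setting on analysis components:
```json
curl -XPUT 'localhost:9200/myindex' -d '{
    "settings" : {
        "analysis" : {
            "filter" : {
                "my_ngram" : {
                    "type" : "nGram",
                    "version" : "4.1",
                    "min_gram" : 2,
                    "max_gram" : 3
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_ngram"]
                }
            }
        }
    }
}'
```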
Closes#2931
- rename `open` to `open_contexts`, as we might have other search-related open stats in the future (Lucene index searchers?)
- add a test to verify it works
to the lucene version is that `discreteMultiValueHighlighting` does default to `true`. Yet
we set this anyway in the HighlightingPhase such that the classes are obsolete.
if a single highlight phrase or term was greater than the fragCharSize, producing negative string offsets
The fixed BaseFragListBuilder was added as XSimpleFragListBuilder which triggers an assert once Elasticsearch
upgrades to Lucene 4.3
In order to handle exceptions correctly when classes are not found, one needs to handle ClassNotFoundException as well as NoClassDefFoundError to be sure that every possible case is caught. We did not yet cater for the latter in ImmutableSettings.
This fix simply executes the same logic for both exceptions instead of bubbling up NoClassDefFoundError.
When specifying minimum_should_match in a multi_match query it was being applied
to the outer bool query instead of to each of the inner field-specific bool queries.
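For example (field names are illustrative), `minimum_should_match` is now applied to each field-specific bool query generated for a request such as:
```json
curl -XPOST 'localhost:9200/myindex/_search' -d '{
    "query" : {
        "multi_match" : {
            "query" : "quick brown fox",
            "fields" : ["title", "body"],
            "minimum_should_match" : "2"
        }
    }
}'
```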
Closes#2918
If cluster settings are updated, the REST API returns the accepted values. For
example, updating the `cluster.routing.allocation.disable_allocation` setting via
cluster settings:
```
curl -XPUT http://localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.disable_allocation" : "true"
    }
}'
```
will respond:
```
{
    "persistent" : {},
    "transient" : {
        "cluster.routing.allocation.disable_allocation" : "true"
    }
}
```
Closes#2907
FieldData is an in-memory representation of the term dictionary in an uninverted form. Under certain circumstances this FieldData representation can grow very large on high-cardinality fields such as tokenized full-text. Depending on the use-case, filtering the terms that are held in the FieldData representation can greatly improve execution performance and application stability.
FieldData filters can be applied on a per-segment basis. During FieldData loading, the terms enumeration is passed through a filter predicate that either accepts or rejects a term.
## Frequency Filter
The frequency filter acts as a high / low pass filter based on the document frequency of a term within the segment that is loaded into field data. It allows rejecting terms that have a very high or very low frequency, based on absolute frequencies or on percentages relative to the number of documents in the segment, or more precisely the number of documents that have at least one value in the field being loaded for the current segment.
Here is an example mapping:
```json
{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.frequency.min=0.001;filter.frequency.max=0.1",
                "index" : "analyzed"
            }
        }
    }
}
```
### Parameters
* `filter.frequency.min` - the minimum document frequency (inclusive) a term must have in order to be loaded into memory. Either a percentage if < `1.0` or an absolute value. `0` if omitted.
* `filter.frequency.max` - the maximum document frequency (inclusive) a term may have in order to be loaded into memory. Either a percentage if < `1.0` or an absolute value. `0` if omitted.
* `filter.frequency.min_segment_size` - the minimum number of documents a segment must contain for the filter to be applied. Small segments can be excluded from filtering with this setting.
## Regular Expression Filter
The regular expression filter applies a regular expression to each term during loading and only loads terms into memory that match the given regular expression.
Here is an example mapping:
```json
{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.regex=^en_.*",
                "index" : "analyzed"
            }
        }
    }
}
```
Closes#2874
Note: This has been disabled by default and is therefore not included in a
standard build. The main reason for this is that you need to have an rpm
binary and the RPM development packages installed, which is not the case
on many systems.
The package contains an init.d-script as well as systemd configurations.
You can build your own RPM package simply by running 'maven rpm:rpm'
* In the case where only should clauses with a specific type of filter were specified, the first clause determined which documents matched.
* In some cases the "at least one should clause must match" behaviour was broken; see the example below.
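A minimal sketch of the affected case, with illustrative field names: a bool filter that contains only should clauses.
```json
curl -XPOST 'localhost:9200/myindex/_search' -d '{
    "query" : {
        "constant_score" : {
            "filter" : {
                "bool" : {
                    "should" : [
                        { "term" : { "tag" : "wow" } },
                        { "term" : { "tag" : "elasticsearch" } }
                    ]
                }
            }
        }
    }
}'
```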
Have an explicit warmer threadpool that is dedicated to executing warmers. Currently, warmers use the search threadpool, which does not work well since the number of concurrent searches should be separate from the number of concurrent warmers allowed; also, the characteristics of the search pool (for example, a bounded queue_size) might not fit how warmers should be executed (they should not be "rejected").
closes#2815
The `geo_shape` precision could so far only be set via `tree_levels`. A new option `precision` now allows the levels of the underlying tree structure to be set via a distance such as `50m`. The default precision is set to `50m`.
## Example
```json
curl -XPUT 'http://127.0.0.1:9200/myindex/' -d '{
    "mappings" : {
        "type1" : {
            "dynamic" : "false",
            "properties" : {
                "location" : {
                    "type" : "geo_shape",
                    "geohash" : "true",
                    "store" : "yes",
                    "precision" : "50m"
                }
            }
        }
    }
}'
```
## Changes
- GeoUtils defines the [WGS84](http://en.wikipedia.org/wiki/WGS84) reference ellipsoid of earth
- DistanceUnits refer to a more precise definition of earth circumference
- DistanceUnits for inch, yard and meter have been defined
- Set default levels in GeoShapeFieldMapper to 50m precision
Closes#2803
this happens, for example, because we list assigned shards and they might not have been allocated on the relevant node yet; there is no need to list those as actual failures in some APIs
- the isAnnotationPresent bug is known and will probably be fixed in later versions, but it costs us nothing not to use it now
- some tests fail, mainly due to a consistent ordering expected from Map (within versions) which does not seem to be preserved; those tests need to be fixed to be agnostic to it
The REST Suggester API binds the 'Suggest API' to the REST Layer directly. Hence there is no need to touch the query layer for requesting suggestions.
This API extracts the Phrase Suggester API and makes 'suggestion requests' top-level objects. The complete API can be found in the
underlying ["Suggest Feature API"](http://www.elasticsearch.org/guide/reference/api/search/suggest.html).
# API Example
The following examples show how suggest actions work on the REST layer, using a simple request and its response.
## Suggestion Request
```json
curl -XPOST 'localhost:9200/_suggest?pretty=true' -d '{
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
        "phrase" : {
            "analyzer" : "bigram",
            "field" : "bigram",
            "size" : 1,
            "real_word_error_likelihood" : 0.95,
            "max_errors" : 0.5,
            "gram_size" : 2
        }
    }
}'
```
This example shows how to query a suggestion for the global text 'Xor the Got-Jewel'. A 'simple phrase' suggestion is requested and
a 'direct generator' is configured to generate the candidates.
## Suggestion Response
On success the request above will reply with a response like the following:
```json
{
    "simple_phrase" : [ {
        "text" : "Xor the Got-Jewel",
        "offset" : 0,
        "length" : 17,
        "options" : [ {
            "text" : "xorr the the got got jewel",
            "score" : 3.5283546E-4
        } ]
    } ]
}
```
The 'suggest' response contains a single 'simple phrase', which in turn contains an 'option'. This option represents a suggestion for the
queried text: it contains the corrected text and a score indicating how likely this option is to be the intended text.
Closes#2774
* Exposed the spatial strategy to be configurable as part of the geo_shape mappings
* Exposed the spatial strategy to be customizable at query time (will be used to generate the geo_shape filter/query)
* Removed XTermQueryPrefixTreeStrategy and reverted to use the lucene TermQueryPrefixTreeStrategy instead
* Made the RecursivePrefixTreeStrategy the default strategy to be used
* Removed support for all spatial operations except "intersects"
* Updated both the GeoShapeQueryBuilder and GeoShapeFilterBuilder with all the changes (removed the option of specifying the operation type, as only intersects is supported, and added the option of setting the filter/query spatial strategy; see the sketch below)
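A sketch of the mapping side, using illustrative index and field names; `strategy` selects the prefix tree strategy, with the recursive strategy being the default as noted above:
```json
curl -XPUT 'localhost:9200/myindex' -d '{
    "mappings" : {
        "type1" : {
            "properties" : {
                "location" : {
                    "type" : "geo_shape",
                    "strategy" : "recursive"
                }
            }
        }
    }
}'
```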
Closes#2720
The order in which routing and parent parameters are set is important. The
routing parameter must be set first or it will overwrite the parent routing
value.
The `term` suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while suggestion selection is left to the API consumer. Yet, often already ranked / selected suggestions are required in order to present them to the end-user.
Inside ElasticSearch we have the ability to quickly access far more statistics and information to make better decisions about which token alternative to pick, or whether to pick an alternative at all.
This `phrase` suggester adds some logic on top of the `term` suggester to select entire corrected phrases instead of individual tokens, weighted based on *n-gram language models*. In practice it will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
The current implementation is kept quite general and leaves room for future improvements.
# API Example
The `phrase` request is defined alongside the query part in the JSON request:
```json
curl -s -XPOST 'localhost:9200/_search' -d '{
    "suggest" : {
        "text" : "Xor the Got-Jewel",
        "simple_phrase" : {
            "phrase" : {
                "analyzer" : "body",
                "field" : "bigram",
                "size" : 1,
                "real_word_error_likelihood" : 0.95,
                "max_errors" : 0.5,
                "gram_size" : 2,
                "direct_generator" : [ {
                    "field" : "body",
                    "suggest_mode" : "always",
                    "min_word_len" : 1
                } ]
            }
        }
    }
}'
```
The response contains suggestions sorted by the most likely spell correction first. In this case we got the expected correction
`xorr the god jewel` first, while the second correction is less conservative, with only one of the errors corrected. Note that the request
is executed with `max_errors` set to `0.5`, so 50% of the terms can contain misspellings (see the parameter descriptions below).
```json
{
    "took" : 37,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 2938,
        "max_score" : 0.0,
        "hits" : [ ]
    },
    "suggest" : {
        "simple_phrase" : [ {
            "text" : "Xor the Got-Jewel",
            "offset" : 0,
            "length" : 17,
            "options" : [ {
                "text" : "xorr the god jewel",
                "score" : 0.17877324
            }, {
                "text" : "xor the god jewel",
                "score" : 0.14231323
            } ]
        } ]
    }
}
```
# Phrase suggest API
## Basic parameters
* `field` - the name of the field used to do n-gram lookups for the language model, the suggester will use this field to gain statistics to score corrections.
* `gram_size` - sets max size of the n-grams (shingles) in the `field`. If the field doesn't contain n-grams (shingles) this should be omitted or set to `1`.
* `real_word_error_likelihood` - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is `0.95`, corresponding to 5% of the real words being misspelled.
* `confidence` - the confidence level defines a factor applied to the input phrase's score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of `1.0` will only return suggestions that score higher than the input phrase. If set to `0.0` the top N candidates are returned. The default is `1.0`.
* `max_errors` - the maximum percentage of the terms that can be considered misspellings in order to form a correction. This method accepts a float value in the range `[0..1)` as a fraction of the actual query terms, or a number `>= 1` as an absolute number of query terms. The default is `1.0`, meaning only corrections with at most one misspelled term are returned.
* `separator` - the separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
* `size` - the number of candidates that are generated for each individual query term. Low numbers like `3` or `5` typically produce good results. Raising this can bring up terms with higher edit distances. The default is `5`.
* `analyzer` - sets the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field passed via `field`.
* `shard_size` - sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase only the top N suggestions are returned based on the `size` option. Defaults to `5`.
* `text` - Sets the text / query to provide suggestions for.
## Smoothing Models
The `phrase` suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (that appear at least once in the index); a configuration sketch is shown after this list.
* `laplace` - the default model, which uses additive smoothing where a constant (typically `1.0` or smaller) is added to all counts to balance weights. The default `alpha` is `0.5`.
* `stupid_backoff` - a simple backoff model that backs off to lower order n-gram models if the higher order count is `0` and discounts the lower order n-gram model by a constant factor. The default `discount` is `0.4`.
* `linear_interpolation` - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`) must be supplied.
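A minimal sketch of selecting a smoothing model inside the `phrase` request (values are illustrative):
```json
curl -s -XPOST 'localhost:9200/_search' -d '{
    "suggest" : {
        "text" : "Xor the Got-Jewel",
        "simple_phrase" : {
            "phrase" : {
                "field" : "bigram",
                "smoothing" : {
                    "linear_interpolation" : {
                        "trigram_lambda" : 0.65,
                        "bigram_lambda" : 0.3,
                        "unigram_lambda" : 0.05
                    }
                }
            }
        }
    }
}'
```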
## Candidate Generators
The `phrase` suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a `term` suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the `direct_generator`. The phrase suggest API accepts a list of generators under the key `direct_generator`; each of the generators in the list is called per term in the original text.
## Direct Generators
The direct generators support the following parameters:
* `field` - The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
* `analyzer` - The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - The maximum corrections to be returned per suggest text token.
* `suggest_mode` - The suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be generated. Three possible values can be specified:
* `missing` - Only suggest terms in the suggest text that aren't in the index. This is the default.
* `popular` - Only suggest suggestions that occur in more docs than the original suggest text term.
* `always` - Suggest any matching suggestions based on terms in the suggest text.
* `max_edits` - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
* `min_prefix` - The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms.
* `min_query_length` - The minimum length a suggest text term must have in order to be included. Defaults to 4.
* `max_inspections` - A factor that is used to multiply with the `shards_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - The minimum threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, then the number cannot be fractional. The shard-level document frequencies are used for this option.
* `max_query_frequency` - The maximum threshold in number of documents a suggest text token can exist in for it to be included. Can be a relative percentage (e.g. 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms from being spellchecked. High-frequency terms are usually spelled correctly; on top of this, excluding them also improves spellcheck performance. The shard-level document frequencies are used for this option.
* `pre_filter` - a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
* `post_filter` - a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)
The following example shows a `phrase` suggest call with two generators, the first one is using a field containing ordinary indexed terms and the second one uses a field that uses
terms indexed with a `reverse` filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators, which require a constant prefix, to provide high-performance suggestions. The `pre_filter` and `post_filter` options accept ordinary analyzer names.
```json
curl -s -XPOST 'localhost:9200/_search' -d '{
    "suggest" : {
        "text" : "Xor the Got-Jewel",
        "simple_phrase" : {
            "phrase" : {
                "analyzer" : "body",
                "field" : "bigram",
                "size" : 4,
                "real_word_error_likelihood" : 0.95,
                "confidence" : 2.0,
                "gram_size" : 2,
                "direct_generator" : [ {
                    "field" : "body",
                    "suggest_mode" : "always",
                    "min_word_len" : 1
                }, {
                    "field" : "reverse",
                    "suggest_mode" : "always",
                    "min_word_len" : 1,
                    "pre_filter" : "reverse",
                    "post_filter" : "reverse"
                } ]
            }
        }
    }
}'
```
`pre_filter` and `post_filter` can also be used to inject synonyms after candidates are generated. For instance, for the query `captain usq` we might generate the candidate `usa` for the term `usq`, which is a synonym for `america`; this allows presenting `captain america` to the user if this phrase scores high enough.
Closes#2709