Instead of specifying the 'path.plugins' configuration option, 'plugin.types'
is used to load plugins in integration tests. This ensures that JVM
plugins are not loaded in all subsequent tests.
Also removed the now unneeded es-plugin.properties files from JVM test
plugins.
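A minimal sketch of loading a plugin via settings in a test (MyPlugin is a hypothetical plugin class; the node is assumed to be created from these settings):
```java
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

// load the plugin only for nodes started with these settings,
// instead of globally via 'path.plugins'
Settings settings = ImmutableSettings.settingsBuilder()
        .put("plugin.types", MyPlugin.class.getName())
        .build();
```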
* RPM: Use the ES_USER variable to set the user (same name as in the Debian package
now), while retaining backwards compatibility with an existing /etc/sysconfig/elasticsearch
* RPM: Bugfix: Remove the user when uninstalling the package
* RPM: Set an existing homedir when adding the user (allows one to run cronjobs as this user)
* DEB & RPM: Unify Required-Start/Required-Stop fields in initscripts
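For example, the init scripts read the user from the sysconfig file (an illustrative sketch of /etc/sysconfig/elasticsearch):
```
# run elasticsearch as this user (same variable name as the Debian package)
ES_USER=elasticsearch
```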
Currently elasticsearch ships with the plain and the fast-vector highlighter.
In order to support arbitrary highlighters via plugins, you only need to
implement a Highlighter interface and register your implementation in your
plugin at the HighlightModule.
In addition you can also add arbitrary options via the 'options' field in
the highlight request, which can be parsed in the highlighter implementation.
In order to find out how to add your own highlighter, check out the test
classes (CustomHighlighterSearchTests and CustomHighlighter).
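For example, a highlight request could pass options through to a custom highlighter (a sketch; the highlighter type name and the option are hypothetical):
```json
{
  "query" : { "match" : { "content" : "search" } },
  "highlight" : {
    "fields" : {
      "content" : {
        "type" : "my-custom-highlighter",
        "options" : { "max_snippets" : 2 }
      }
    }
  }
}
```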
Closes#2828
Using an automatically detected 'min_doc_freq' if the suggest mode is set to
'always' is counter-intuitive. If we always suggest, we should ignore the frequency and
set the threshold frequency to 0 to allow all possible candidates to be drawn if
they are within the given bounds.
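For illustration, a term suggest request using that mode (a sketch; field and text are illustrative):
```json
{
  "suggest" : {
    "my-suggestion" : {
      "text" : "elasticsaerch",
      "term" : {
        "field" : "body",
        "suggest_mode" : "always"
      }
    }
  }
}
```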
Closes#3037
To prevent overly extensive resource use during recovery, we enable
recovery throttling by default to prevent unexpected peak load
on clusters. The default is set to 20 MB/sec.
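A sketch of the corresponding elasticsearch.yml entry, assuming this release's setting name `indices.recovery.max_size_per_sec`:
```
# throttle recovery to 20 MB/sec per node (the new default)
indices.recovery.max_size_per_sec: 20mb
```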
Closes#3035
Merge throttling is one of the most recommended settings and crucial in the
real-time indexing case. We should set the default to a reasonable value
that allows folks to index into a production index without seeing large merge
peaks by default. The default is set to 20 MB/sec on the node level.
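A sketch of the corresponding node-level settings, assuming the store-level throttling settings of this release:
```
# throttle merges to 20 MB/sec on the node level (the new default)
indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 20mb
```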
Closes#3033
The default size used to be 2x availableProcessors, which turned out to
be too low a value in practice. 3x appeared to be a sweet spot for
most applications. The default is now 3x availableProcessors.
Closes#3023
Added support for unmapped & partially mapped fields (partially mapped fields may occur when searching across multiple indices where the faceted field is mapped on some and unmapped on others). If a shard doesn't have mappings for a field, the matching documents count on that shard will be added to the missing count for that facet.
Both has_parent and has_child filters are internally executed in two rounds. In the second round all documents are evaluated whilst only specific documents need to be checked. In the has_child case only documents belonging to a specific parent type need to be checked and in the has_parent case only child documents need to be checked.
Closes#3034
Similar to the global cluster-wide disable allocation flags, allow setting those on a specific index by updating its settings. The keys are the same as the cluster ones, except that they start with an `index.` prefix, for example: index.routing.allocation.disable_allocation set to true.
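For example (index name is illustrative):
```
curl -XPUT 'localhost:9200/index1/_settings' -d '{
    "index.routing.allocation.disable_allocation" : true
}'
```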
closes#3031
The branches used in the score method can be moved into the
scorer call and be essentially a constant operation rather than
a linear operation depending on the number of parent docs.
Older OpenSUSE distributions do not ship with systemd and therefore are
using chkconfig, but do not have their scripts placed at /etc/init.d/
This patch is more defensive and adds additional checks in the postinstall
script to prevent aborted post-install scripts, which would make the RPM
uninstallable.
When resolving empty settings values, their value should be removed. For example, when using ${env.ENV_VAR} and ENV_VAR is not set, the setting should be removed.
This commit allows setting custom headers in HTTP responses (like
setting the WWW-Authenticate header for basic auth) by adding a
RestRequest.addHeader() method.
Closes#2936
Closes#2540
To get the history right: This is based on PR #2723
Update requests can now be put in the bulk api. All update request options are supported.
Example usage:
```
curl -XPOST 'localhost:9200/_bulk' --data-binary @bulk.json
```
Contents of bulk.json that contains two update request items:
```
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"} }
{ "update" : { "_id" : "0", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "script" : "counter += param1", "lang" : "js", "params" : {"param1" : 1}, "upsert" : {"counter" : 1}}
```
The `doc`, `upsert` and all script related options are part of the payload. The `retry_on_conflict` option is part of the header.
Closes#2982
Before this change, the GetField#getValue() method was returning a list of values of a multivalued field if the field values were obtained from source or if the field was stored and real-time get was used. If the field was stored but non-realtime get was used, GetField#getValue() was returning only the first element and GetField#getValues() was returning a list of elements. This change makes the behavior consistent: GetField#getValue() now always returns only the first value of the field and GetField#getValues() returns the entire list.
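A sketch of the now-consistent accessors in the Java API (index, type, and field names are illustrative):
```java
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.index.get.GetField;

// given an org.elasticsearch.client.Client named client
GetResponse response = client.prepareGet("index1", "type1", "1")
        .setFields("tags")
        .execute().actionGet();
GetField field = response.getField("tags");
Object first = field.getValue();                   // always just the first value
java.util.List<Object> values = field.getValues(); // the entire list
```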
Typically, the main reason to enable allow_primary in a reroute allocation command is to force-create a new, empty shard because a shard (and its replicas) was lost. This can't be done today because the shard expects to have a valid index where it is allocated; we need to clear its post-allocation flag to make sure it is allowed to create a fresh index.
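For example (a sketch against the cluster reroute API; index and node names are illustrative):
```
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands" : [ {
        "allocate" : {
            "index" : "index1", "shard" : 0,
            "node" : "node1", "allow_primary" : true
        }
    } ]
}'
```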
Lucene provides a set of statistics that depend on the codec / postings format
as well as on the index options used when the field is created / indexed.
If a certain stats value is not available, Lucene returns `-1` instead of the
correct value. We need to ensure that those values are encoded correctly when
we write vLongs as well as when we aggregate those values.
Closes#3012
NGramTokenizer and NGramTokenFilter are broken with a version < 4.2.
We should still support these filters but should prevent the
StringIndexOutOfBounds exceptions. Adding these filters to the FragmentBuilderHelper will
allow seamless highlighting on fields indexed with those tokenizers or
token filters.
Lucene's span queries are a different family than 'ordinary' queries
in Lucene. Spans only work with other spans, such that smart query
wrapping doesn't work with span queries at all, i.e. we can't wrap them
in a filtered query.
Closes#2994
Today an analysis chain with broken tokenfilters or tokenizers like
WordDelimiterFilter might produce somewhat broken term vectors that cause
`StringIndexOutOfBoundsExceptions` if FastVectorHighlighter is used
since the positions / offsets contract is violated and offsets of highlight
tokens are not increasing but decreasing even if their positions are increasing.
Yet, if we detect such a situation we can re-sort the tokens, which might cause
somewhat odd highlights but doesn't fail hard with a StringIndexOutOfBoundsException.
Closes#3006
All index meta data APIs have urgent priority when it comes to cluster state updates. We'd like to remove indices ASAP to avoid things like unnecessary shard relocations.
This Lucene release introduced a new API on DocIdSetIterator that requires each
implementation to return a `cost` upper bound as a function of the iterated documents.
This API allows for several optimizations during query execution especially in
Conjunction and Disjunction Queries with min_should_match set.
Closes#2990
Allow getting the source directly via a dedicated REST endpoint without any additional content around it; the endpoint is `{index}/{type}/{id}/_source`.
Note, HEAD now also supports the _source endpoint.
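For example (index/type/id are illustrative):
```
curl -XGET 'localhost:9200/index1/type1/1/_source'
```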
closes#2993, closes#2995
BytesRefOrdValComparator starts at 1 and ends at maxOrd - 1. Yet, numOrd is defined
as maxOrd - 1, excluding the 0 ord.
This causes wrong sort ords when the bottom of the queue is compared to the next
segment and the greatest term in the new segment is in fact less than the current
queue bottom. If that is true, we treat the values as equal and never include the right
value in the queue.
Closes#2991
This adds support for lucene span multi term queries. This lucene query
allows users to form complicated queries such as wildcards or prefix
queries embedded within span queries.
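For example, a prefix query wrapped into a span query (a sketch; field and value are illustrative):
```json
{
  "span_multi" : {
    "match" : {
      "prefix" : { "user" : { "value" : "ki" } }
    }
  }
}
```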
WeightFunction#weight now allows dedicated weight calculations per operation. In certain
circumstances it is more efficient / required to ignore certain factors in the weight
calculation to prevent, for instance, relocations if they are solely triggered by tie-breakers.
In particular, the primary balance property should not be taken into account if the delta for
early termination is calculated, since otherwise a relocation could be triggered solely by the
fact that two nodes have different numbers of primaries allocated to them.
Closes#2984
A first implementation of matchers and helper methods for elasticsearch tests.
The following ones are supported:
assertHitCount(searchResponse, 2);
// helper methods to easily access the first hits
assertFirstHit(searchResponse, hasId("foo"));
assertSecondHit(searchResponse, hasType("foo"));
assertThirdHit(searchResponse, hasIndex("foo"));
// methods to access all other hits
assertSearchHit(searchResponse, 5, hasId("10"));
// same as above, but maybe more readable
assertSearchHit(searchResponse.getHits().getAt(5), hasIndex("foo"));
I changed GeoFilterTests to show how it works.
Furthermore I inlined assertHighlight() from HighlighterSearchTests.
The ElasticsearchAssertions class can now be used as a centralized assertion class,
in order to have one central class for every developer to look at.
Custom settings are not always present in the `Settings` that are passed
to `NodeSettingsService.Listener#onRefreshSettings` such that using the defaults
will necessarily override the custom settings if set before.
Closes#2973
Fixed testX and testSingleNodeNoFlush by specifying the mapping on index creation instead of using dynamic mapping. Dynamic mapping is updated on the cluster level asynchronously, and if mapping changes are not applied to the cluster state before the node is closed, these changes are not available after node restart. While data added in the test is preserved, due to the absence of mapping, the test still fails. This is a known issue that we are not planning to fix at the moment.
Currently realtime GET does not take source includes/excludes into account.
This patch adds support for the source field mapper includes/excludes
when getting an entry from the transaction log. Even though it introduces
a slight performance penalty, it now adheres to the defined configuration
instead of returning all source data when a realtime get is done.
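For reference, includes/excludes are configured on the source field mapper (a sketch; field patterns are illustrative):
```json
{
  "tweet" : {
    "_source" : {
      "includes" : ["user.*", "message"],
      "excludes" : ["meta.*"]
    }
  }
}
```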
The filter method of XContentMapValues filtered out nested
arrays/lists completely, due to a bug in the filter method which threw
away all data inside such an array.
Closes#2944
This bug was a follow-up problem caused by the filtering of nested arrays
when source exclusion was configured.
Analysing a numeric field will return UTF-16 representations of
Lucene's numeric prefix terms. Those terms are meaningless in general
unless used for lookups in the Lucene index. Passing a numeric field
to the analysis action is most likely a bug.
Closes#2953
Closes#2952
Use the latest Lucene version as specified in o.e.common.lucene.Lucene;
it must be upgraded with each Lucene release.
This commit adds an assert that fails once the actual Lucene version
that is used is higher than the current release's version.
Lucene ships with a version constant that is mainly used to provide consistent behaviour across Lucene release versions. Lucene's analysis capabilities are commonly applied at index and search time, such that the search-time behaviour should be identical to the index-time behaviour in most cases. Currently Elasticsearch always uses the latest version from Lucene, which can break backwards compatibility with the index for users that rely on behaviour that changed in a new Lucene version.
Users should always use the version the index was created with unless it's explicitly configured.
closes#2945
NGramTokenizer and NGramTokenFilter don't produce broken positions anymore, preventing certain highlighter bugs that fail with
StringIndexOutOfBoundsExceptions as in #2931.
This commit breaks backwards compatibility in terms of highlighting when NGramTokenFilter is used.
The highlighter will highlight the entire terms as produced by the tokenizer instead of the individual
sub-gram. To do sub-gram highlighting, the ngram tokenizer should be used. This behavior was based on
broken NGramTokenFilter behavior which will be fixed in Lucene 4.4 but was ported in this commit
to elasticsearch 0.90. The broken behavior can still be used if a version < LUCENE_42 is used
in the token filter mapping.
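A sketch of pinning the old behavior via a version in the token filter mapping (filter name and gram sizes are illustrative):
```json
{
  "filter" : {
    "my_ngram" : {
      "type" : "nGram",
      "version" : "4.1",
      "min_gram" : 2,
      "max_gram" : 3
    }
  }
}
```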
Closes#2931
- rename to open_contexts from open, we might have other open stats in the future related to search (lucene index searchers?)
- add a test to verify it works
The only difference to the Lucene version is that `discreteMultiValueHighlighting` now defaults to `true`. Yet
we set this anyway in the HighlightingPhase, such that the classes are obsolete.
Fixed a bug that occurred if a single highlight phrase or term was greater than the fragCharSize, producing negative string offsets.
The fixed BaseFragListBuilder was added as XSimpleFragListBuilder, which triggers an assert once Elasticsearch
upgrades to Lucene 4.3.
In order to handle exceptions correctly when classes are not found, one
needs to handle ClassNotFoundException as well as NoClassDefFoundError
in order to be sure to have caught every possible case. We did not yet cater
for the latter in ImmutableSettings.
This fix is just executing the same logic for both exceptions instead of
simply bubbling up NoClassDefFoundError.
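A minimal sketch of the pattern (the loadClass helper is hypothetical):
```java
static Class<?> loadClass(String name) {
    try {
        return Class.forName(name);
    } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("failed to load class [" + name + "]", e);
    } catch (NoClassDefFoundError e) {
        // previously bubbled up; now handled the same way as ClassNotFoundException
        throw new IllegalArgumentException("failed to load class [" + name + "]", e);
    }
}
```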
When specifying minimum_should_match in a multi_match query it was being applied
to the outer bool query instead of to each of the inner field-specific bool queries.
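For example, the option now applies to each field-specific query (fields and query text are illustrative):
```json
{
  "multi_match" : {
    "query" : "quick brown fox",
    "fields" : ["subject", "message"],
    "minimum_should_match" : 2
  }
}
```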
Closes#2918
If cluster settings are updated, the REST API returns the accepted values. For
example, updating the `cluster.routing.allocation.disable_allocation` via
cluster settings:
```
curl -XPUT http://localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.disable_allocation" : "true"
    }
}'
```
will respond:
```
{
    "persistent" : {},
    "transient" : {
        "cluster.routing.allocation.disable_allocation" : "true"
    }
}
```
Closes#2907
FieldData is an in-memory representation of the term dictionary in an uninverted form. Under certain circumstances this FieldData representation can grow very large on high-cardinality fields like tokenized full-text. Depending on the use case, filtering the terms that are held in the FieldData representation can heavily improve execution performance and application stability.
FieldData Filters can be applied on a per-segment basis. During FieldData loading the terms enumeration is passed through a filter predicate that either accepts or rejects a term.
## Frequency Filter
The frequency filter acts as a high / low pass filter based on the document frequencies of a certain term within the segment that is loaded into field data. It allows rejecting terms that have a very high or very low frequency, based on absolute frequencies or on percentages relative to the number of documents in the segment, or more precisely the number of documents in the segment that have at least one value in the loaded field.
Here is an example mapping:
```json
{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.frequency.min=0.001;filter.frequency.max=0.1",
                "index" : "analyzed"
            }
        }
    }
}
```
### Parameters
* `filter.frequency.min` - the minimum document frequency (inclusive) for a term to be loaded into memory. Either a percentage if < `1.0` or an absolute value. `0` if omitted.
* `filter.frequency.max` - the maximum document frequency (inclusive) for a term to be loaded into memory. Either a percentage if < `1.0` or an absolute value. `0` if omitted.
* `filter.frequency.min_segment_size` - the minimum number of documents in a segment in order for the filter to be applied. Small segments can be skipped entirely with this setting.
## Regular Expression Filter
The regular expression filter applies a regular expression to each term during loading and only loads terms into memory that match the given regular expression.
Here is an example mapping:
```json
{
    "tweet" : {
        "properties" : {
            "locale" : {
                "type" : "string",
                "fielddata" : "format=paged_bytes;filter.regex=^en_.*",
                "index" : "analyzed"
            }
        }
    }
}
```
Closes#2874