URI parameters were not all parsed for the multi term vector request. This commit
makes sure that all parameters are parsed and used when creating the requests for the
multi term vector request.
In order to simplify both the code and the JSON request, the request structure now allows
two ways to use multi term vectors:
1. Give all parameters for each document requested in the docs array like this:
```
{
"docs": [
{
"_index": "testidx",
"_type": "test",
"_id": "2",
"terms": [
"fox"
],
"term_statistics": true
},
{
"_index": "testidx",
"_type": "test",
"_id": "1",
"terms": [
"quick",
"brown"
],
"term_statistics": false
}
]
}
```
2. Define a list of ids and give parameters in a separate parameters object like this:
```
{
"ids": [
"1",
"2"
],
"parameters": {
"_index": "testidx",
"_type": "test",
"terms": [
"brown"
]
}
}
```
URI parameters are global parameters that apply in both cases. They are overwritten
by parameter definitions in the body.
Also, this commit adds the missing setParent(..) and setPreference(..) to TermVectorRequestBuilder.
* Minor alignments (like setter to ctor)
* FuzzySuggester has a unicode-aware flag, which was not exposed in the fuzzy completion request parameters
* Made the XAnalyzingSuggester flags (PAYLOAD_SEP, END_BYTE, SEP_LABEL) be written into the postings format, so we can retain backwards compatibility
* The above change also implies that these flags can be set per instantiated XAnalyzingSuggester
* CompletionPostingsFormatTest now uses a randomProvider for writing data to check for bwc
This commit upgrades to Lucene 4.6 and contains the following improvements:
* Remove XIndexWriter in favor of the fixed IndexWriter
* Removes patched XLuceneConstantScoreQuery
* Now uses Lucene passage formatters contributed from Elasticsearch in PostingsHighlighter
* Upgrades to Lucene46 Codec from Lucene45 Codec
* Fixes problem in CommonTermsQueryParser where close was never called.
Closes #4241
When starting elasticsearch as the wrong Linux user, it could generate a `NullPointerException` when `PluginsService` tries to list available plugins in the `./plugins` dir.
To reproduce:
* create a plugins directory with `rwx` rights for root user only
* launch elasticsearch from another account (elasticsearch for example)
Related discussion: https://groups.google.com/forum/#!topic/elasticsearch/_WRW4Qfpo7M
Closes #4186.
Closes #4187.
Add FieldDataTermsFilter that compares terms out of
the fielddata cache. When filtering on a large
set of terms this filter can be considerably faster
than using a standard lucene terms filter.
Add the "fielddata" execution mode to the
terms filter parser to enable the use of
the new FieldDataTermsFilter.
Add supporting tests and documentation.
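A minimal sketch of a terms filter selecting the new execution mode (field name and values are hypothetical):
```
{
  "terms": {
    "user_id": [1, 2, 3],
    "execution": "fielddata"
  }
}
```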
Closes #4209
In case of a misconfigured slow search/index configuration (unparseable
TimeValue) an exception is thrown.
This is not a problem when creating a shard of an index, as an exception
is returned and all is good. However, this is a huge problem when
starting up a node, as the shard creation is repeated endlessly.
This patch changes the behaviour to carry on as usual and just disable the
slowlog, as an improper logging configuration should not affect the
allocation behaviour.
Closes #2730
The search request inside a put warmer request was nullable, but it is actually required by the transport action.
Validation and appropriate test added.
Closes #4196
Make the FilterBuilder interface consistent with the QueryBuilder
interface and replace usage of QueryBuilderException with
ElasticSearchIllegalArgumentException.
Allow the user to configure the number of hash functions as well as add
support for serializing/deserializing the bloom filter from a stream.
Add a hashCode to the bloom filter.
In this view you never care about the actual heap usage in bytes; you only
want to know that your max is set to what you meant and what
percentage you're currently using.
Closes #4151.
The multi_match query accepted only an array in the fields parameter. This patch allows using a single string as well, as in the sketch below.
Also added tests for parsing in both cases.
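A minimal sketch of the newly accepted single-string form (field and query values are hypothetical):
```
{
  "multi_match": {
    "query": "quick brown fox",
    "fields": "subject"
  }
}
```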
Closes #4164
There is an optimization that executes bit based (slow) filters at the end. Matched docs could be unset if they didn't match any of these filters. The bug was that iterator based (fast) filters should also have been checked.
This change checks all should filters in the end phase (if must or must_not clauses exist), so it can now correctly unset matched docs. The current bool filter requires that at least one should clause matches for a doc to be a match, regardless of any other clauses.
Closes #4130
At the end of recovery, the IndexShardGatewayService double checks that the gateway has moved the shard to POST_RECOVERY and, if not, does it itself.
The shard state could have already moved to STARTED, causing the post_recovery call to throw an exception and the entire shard recovery to fail.
This can happen if, after the gateway moved the shard to POST_RECOVERY:
1) master sent a new cluster state indicating shard is initializing
2) IndicesClusterStateService#applyInitializingShard will send a shard started event
3) Master will mark shard as started and this will be processed quickly and move the shard to STARTED.
Closes #4147
With #3782 we changed the execution order of dynamic mapping updates and index operations: we now first send the mapping update to the master node, and then we index the document. This makes sense but caused issues with rivers, which are started due to the cluster changed event that is triggered on the master node right after the mapping update has been applied. In order for a river to be started, its _meta document needs to be available, which is no longer the case as the index operation most likely hasn't happened yet. As a result, in most cases rivers don't get started.
What we want to do is retry a few times if the _meta document wasn't found, so that the river gets started anyway.
Closes #4089, #3840
This adds a delegate to CharTermAttributeImpl to be compatible
with the Percolator, which needs a CharTermAttribute. Compared
to CharTermAttributeImpl, we only fill the BytesRef with UTF-8 since
we already have it, and only convert to UTF-16 when we need to.
Closes #4028
Previously the field name specified in the search request was used, which isn't correct in case a custom index_name has been used for a field or the "path":"just_name" has been used in the mapping.
Closes #4116
Also fixed a bug in the fast vector highlighter, raised by enabling the object cache, where a null FieldQuery caused an NPE when the objects were taken from the cache.
Added tests to check for issues when highlighting multiple fields at the same time.
Closes #4106
Values returned by [Double|Long|Bytes]Values are sorted today which
is guaranteed by the underlying Lucene index. Several implementations can
make use of this property but the interfaces don't guarantee this behavior.
This commit adds the guarantees and makes use of them in several places.
Note: This change might require sorting for 3rd party implementations of these
interfaces.
The 'default' / 'standard' analyzer can be a trappy default since it filters
english stopwords by default. Yet a default should not be dedicated to a certain language,
since elasticsearch is used in many different scenarios where a standard analysis chain
specialized to english full-text might be rather counterproductive.
This commit changes the 'standard' analyzer to use an empty stopword list for indices
that are created from version 1.0.0.Beta1 onwards, but maintains backwards compatibility
for older indices.
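For indices that want the old behavior back, stopwords can still be configured explicitly. A minimal settings sketch, assuming the `default` analyzer override and the `_english_` stopword shorthand:
```
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```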
Closes #3775
Use .percolator as the internal (hidden) type name for percolators within the index. It seems a nicer name to represent "hidden" types within an index.
closes #4090
Also, fixed all the problems it brought up in tests.
Removed OverrideTypeMappingTests as it is no longer relevant.
Better naming for the default percolator mapping and changed its content to use _default_ as the root node.
Closes #4038
It creates the following challenges:
- It automatically loads all the norms for all fields. This should be an opt-in feature similar to the new loading feature in field data. Will open a separate issue for it.
- It automatically loads all doc values for all fields (if they have them), effectively overriding the loading option of field data when it's backed by doc values.
closes #4078
We now return acknowledged true when no wait is needed (mustAck always returns false). We do wait for the master node to complete its actions though. Previously it would try to time out and hang due to a CountDown#fastForward call when the internal counter is set to 0.
We now return acknowledged false without starting the timeout thread when the timeout is set to 0, as starting the wait and immediately stopping the thread seems pointless.
Added coverage for ack in ClusterServiceTests
this map is a "true" immutable map, encapsulating an open impl, and has a builder that allows it to be built easily.
the builder has the optimization of using clone if it's being built based on an existing immutable map.
when moving to started from post recovery on cluster level changes, we need to make sure we handle a global state change to relocating, which can happen (and not pass through started)
Improve the introduction of new fields into the concrete parsed mappings by not relying on immutable maps and copying over entries, but instead using open maps (which will also use less memory), and using clone to perform the copy-on-write logic
This allows the RegexpQueryBuilder to be used in span queries
Added tests for all span multi term queries.
Also updated the documentation and removed the mention of numeric range
queries for span queries (they have to be terms).
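A minimal sketch of a regexp query wrapped in a span_multi query (field and pattern are hypothetical):
```
{
  "span_multi": {
    "match": {
      "regexp": {
        "user": "ki.*y"
      }
    }
  }
}
```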
Closes #3392
This new API allows to get the mapping for a specific set of fields rather than get the whole index mapping and traverse it.
The fields to be retrieved can be specified by their full path, index name and field name and will be resolved in this order.
In case multiple fields match, the first one will be returned.
Since we are now generating the output (rather than falling back to the stored mapping), you can specify `include_defaults=true` on the request to have default values returned.
Closes #3941
it's preferable to execute the indices cluster state service as quickly as possible, as one of the first listeners, so it will apply the cluster state to the local state
potentially, it should even execute before the state is "visible" (through the state call), but that's another change...
make sure to throw the already exists exception, so that when indexing into an alias that has not yet propagated through the cluster state, it will end up being ignored if it already exists
Some FieldData consumers require hash values per byte. We provide an optimization
that allows caching the hashes internally if the consumer knows that they are needed.
This optimization got lost in a previous commit. This commit adds it back and folds
the dedicated method into AtomicFieldData#getBytesValues(true|false).
As a side note, the internal reroute call is now part of the ack mechanism. That means that if the response contains the acknowledged flag, the internal reroute that was eventually issued was acknowledged too. Also, even if the request is not acknowledged, the reroute is issued before returning, which means that there is no need to manually call reroute afterwards to make sure the new settings are immediately applied.
Closes #3995
Added support for serialization based on version to AcknowledgedResponse. Useful in APIs that don't yet support the acknowledged flag in the response.
Also moved ack warmer tests to the more specific AckTests class.
Closes #3983
This adds the _cat/recovery/{index} API endpoint, which displays
information about the status of recovering shards. An example of the
output:
```
index shard node                   target    recovered %
test2 0     Fwo7c_6MSdWM0uM1Ho4t-g 147304414 19236101  13.1%
test  0     Fwo7c_6MSdWM0uM1Ho4t-g 145891423 119640535 82.0%
```
Fixes #3969
This patch takes the version of the created index into account when a
prebuilt analyzer is created.
So, if an index was created with 0.90.4, then the prebuilt analyzers
will be the same as in the 0.90.4 release.
One reason for this feature is the possibility to change prebuilt
analyzers like the standard one.
The patch tries to reuse analyzers as much as possible. So even if
versions X.Y.Z and X.Y.A use the same Lucene analyzers, the same instance
is reused in order to prevent over-creation of Lucene analyzer instances.
Closes #3790
The setting causes the upper bound for a range query/filter to be rounded up,
therefore the name `round_ceil` seems to make more sense.
Also this commit removes the redundant fourth parameter to DateMathParser.parse(..)
which was never used.
was: parse(String text, long now, boolean roundUp, boolean upperInclusive)
is now: parse(String text, long now, boolean roundCeil)
closes #3914
When running tests, Engine.searcher() is going to be an AssertingIndexSearcher
so we definitely don't want to discard it. This commit fixes it as well as the
bugs it found.
Closes #3987
In order to make sure that people do not get confused, if they
index a float as weight, it makes more sense to reject it instead of
silently parsing it to an integer and using it.
The CompletionFieldMapper now checks the type of the number being
read and throws an exception if the number is anything other
than an int or long.
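For reference, the weight is supplied at index time roughly like this (field name and values are hypothetical); a float such as 2.5 is now rejected:
```
{
  "suggest": {
    "input": ["quick brown fox"],
    "weight": 2
  }
}
```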
Closes #3977
Requires the field's index_options to be set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require reanalyzing the text to be highlighted: the larger the documents, the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite as well if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.
Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same
The lucene postings highlighter api is quite different from the existing highlighters api, the main difference being that it allows highlighting multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise that tries not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.
Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.
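A minimal request sketch using the new highlighter, assuming a hypothetical `content` field mapped with `"index_options": "offsets"`:
```
{
  "query": {
    "match": { "content": "quick fox" }
  },
  "highlight": {
    "fields": {
      "content": { "number_of_fragments": 2 }
    }
  }
}
```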
Closes #3704
You can configure the highlighting api to return an excerpt of a field
even if there wasn't a match on the field.
The FVH makes excerpts from the beginning of the string to the first
boundary character after the requested length or the boundary_max_scan,
whichever comes first. The Plain highlighter makes excerpts from the
beginning of the string to the end of the last token before the requested
length.
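A minimal sketch of the option (field name and size are hypothetical):
```
{
  "highlight": {
    "fields": {
      "content": { "no_match_size": 100 }
    }
  }
}
```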
Closes #1171
Currently we have a marker interface for Acknowledged[Request|Response];
this doesn't make much sense since we duplicate the code in each subclass
or class that implements the interface. We can simply use abstract
classes and have it implemented only once.
This commit primarily folds [Double|Bytes|Long|GeoPoint]Values.Iter
into [Double|Bytes|Long|GeoPoint]Values. Iterations now don't require
an auxiliary class (Iter) but are instead driven by native for loops. All
[Double|Bytes|Long|GeoPoint]Values are stateful and provide `setDocId`
and `nextValue` methods to iterate over all values in a document.
This has several advantages:
* The amount of specialized classes is reduced
* Iteration is clearly stateful, ie. Iters can't be confused to be local.
* All iterations are size bounded, which prevents runtime checks and
allows JIT optimizations / loop un-rolling, and most iterations are
branch free.
* Due to the bounded iteration the need for a `hasNext` method call
is removed.
* Value iteration feels more native.
This commit also adds consistent documentation and unifies the calculation
when SortMode is involved.
This commit also changes the runtime behavior of BytesValues#getValue() such that it
will never return `null` anymore. If a document has no value in a field
this method still returns a `BytesRef` with a `length` of 0. To identify
documents with no values #hasValue() or #setDocument(int) should be used.
The latter should be preferred if the value will be consumed in the case
the document has a value.
Have a separate channel for recovery, so it won't overflow the "low" channel which is also used for bulk indexing.
Also, rename the channel names to be more descriptive. Change low to bulk (for bulk based operations, currently just bulk indexing), med to reg (for "regular" operations), and high to state (for state based communication). The new channel for recovery will be named recovery, and the ping channel will remain the same.
closes #3954
When setting track scores, the scan search type will return the scores for each document. The Java API builder did not properly set this value (it only set it if a sort was in place, which is not relevant for the scan search type).
closes #3949
Currently we don't allow resetting the awareness
attribute via the API since it requires at least one
non-empty string to update the setting. This commit
allows resetting this using an empty string.
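A minimal sketch of the reset via a cluster settings update, assuming the standard awareness setting key:
```
{
  "transient": {
    "cluster.routing.allocation.awareness.attributes": ""
  }
}
```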
Closes #3931
The #3526 fix was not complete: it handled cases of on-node execution, but didn't properly handle cases where the operation was executed over the network, forcing the execution of the replica operation when done over the wire.
This relates to #3854.
Closes #3929
Added a new AckedClusterStateUpdateTask interface that can be used to submit cluster state update tasks and allows actions to be notified back when a set of (configurable) nodes have acknowledged the cluster state update. Supports a configurable timeout, so that we wait for acknowledgement for a limited amount of time (provided in the request as currently happens, default 10s).
Internally, a low level AckListener is created (InternalClusterService) and passed to the publish method, so that it can be notified whenever each node responds to the publish request. Once all the expected nodes have responded or the timeout has expired, the AckListener notifies the action, which will return adding the proper acknowledged flag to the response.
Ideally, this new mechanism will gradually replace the existing ones based on custom endpoints and notifications (per api).
Closes #3786
The suggest stop filter is an improved version of the stop filter, which
takes stopwords into account only if the last char of a query is a
whitespace. This allows you to keep stopwords, yet still suggest for
"a".
Example: Index document content "a word". You are now able to suggest for
"a" and get back results in the completion suggester if the suggest stop
filter is used on the query side, but will not get back any results for
"a " as this is identified as a stopword.
The implementation allows setting the `remove_trailing` parameter for a
custom stop filter and thus using the suggest stop filter instead of the
standard stop filter.
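A minimal analysis settings sketch for such a custom stop filter (filter name and stopword list are hypothetical):
```
{
  "settings": {
    "analysis": {
      "filter": {
        "my_suggest_stop": {
          "type": "stop",
          "stopwords": ["a", "the"],
          "remove_trailing": false
        }
      }
    }
  }
}
```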
- "boost" should be "boost_factor"
- "mult" should be "multiply"
Also, store combine function names in an ImmutableMap instead of iterating
over all possible names each time.
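A minimal function_score sketch using the renamed options (query and factor values are hypothetical):
```
{
  "function_score": {
    "query": { "match_all": {} },
    "functions": [
      { "boost_factor": 2 }
    ],
    "score_mode": "multiply"
  }
}
```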
closes #3872 for master
SynonymFilters produce token streams with stacked tokens such that
conjunction queries need to be parsed in a special way in which the
stacked tokens are added as an inner disjunction.
Closes #3881
The array holding the payloads (TermVectorFields.payloads) is reused for each token. If the
previous token had payloads but the current token does not, then the payloads of the previous
token were returned, because they were never invalidated.
For example, if a field contained only two tokens, each occurring once, the first having a
payload and the second not, then for the second token the payload of the first was returned.
closes #3873