The 'default' / 'standard' analyzer can be a trappy default sicne it filters
english stopwords by default. Yet a default should not be dedicated to a certain language
since elasticsearch is used in many different scenarios where a standard analysis chain
with specialization to english full-text might be rather counter productive.
This commit changes the 'standard' analyzer to use an empty stopword list for indices
that are created from 1.0.0.Beta1 version onwards but will maintain backwards compatibiliy
for older indices.
Closes#3775
Use .percolator as the internal (hidden) type name for percolators within the index. Seems nicer name to represent "hidden" types within an index.
closes#4090
This allows the RegexpQueryBuilder to be used in span queries
Added tests for all span multi term queries.
Also updated the documentation and removed mentioning of numeric range
queries for span queries (they have to be terms).
Closes#3392
This new API allows to get the mapping for a specific set of fields rather than get the whole index mapping and traverse it.
The fields to be retrieved can be specified by their full path, index name and field name and will be resolved in this order.
In case multiple field match, the first one will be returned.
Since we are now generating the output (rather then fall back to the stored mapping), you can specify `include_defaults`=true on the request to have default values returned.
Closes#3941
The setting causes the upper bound for a range query/filter to be rounded up,
therefore the name `round_ceil` seems to make more sense.
Also this commit removes the redundant fourth parameter to DateMathParser.parse(..)
which was never used.
was: parse(String text, long now, boolean roundUp, boolean upperInclusive)
is now: parse(String text, long now, boolean roundCeil)
closes#3914
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.
Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same
The lucene postings highlighter api is quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.
Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.
Closes#3704
You can configure the highlighting api to return an excerpt of a field
even if there wasn't a match on the field.
The FVH makes excerpts from the beginning of the string to the first
boundary character after the requested length or the boundary_max_scan,
whichever comes first. The Plain highlighter makes excerpts from the
beginning of the string to the end of the last token before the requested
length.
Closes#1171
The description of the timeout parameter was worded misleadingly; it implied that the API would wait until the cluster reached the desired level and then stayed at that level for the timeout. I've tweaked the sentence to remove the risk of confusion.
The suggest stop filter is an improved version of the stop filter, which
takes stopwords only into account if the last char of a query is a
whitespace. This allows you to keep stopwords, but to allow suggesting for
"a".
Example: Index document content "a word". You are now able to suggest for
"a" and get back results in the completion suggester, if the suggest stop
filter is used on the query side, but will not get back any results for
"a " as this is identified as a stopword.
The implementation allows to set the `remove_trailing` parameter for a
custom stop filter and thus use the suggest stop filter instead of the
standard stop filter.
- "boost" should be "boost_factor"
- "mult" should be "multiply"
Also, store combine function names in ImmutableMap instead of iterating
over all possible names each time.
closes#3872 for master
Now that we properly fixed the ability to set the queue size on the index / bulk thread pool, we should actually set them to a somehow reasonable value to protect from users potentially overflowing our system.
I suggest defaults to be 50 for bulk, and 200 for indexing.
Also, set the thread pool for get, which we should set (in a similar value to a "read" queue size we have today).
closes#3888
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.
So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values
which is the only type to support multi-valued fields. Since the field data API
set is a bit wider than the doc values API set, some operations are not
supported:
- field data filtering: this will fail if doc values are enabled,
- field data cache clearing, even for memory-based doc values formats,
- getting the memory usage for a specific field,
- knowing whether a field is actually multi-valued.
This commit also allows for configuring doc-values formats on a per-field basis
similarly to postings formats. In particular the doc values format of the
_version field can be configured through its own field mapper (it used to be
handled in UidFieldMapper previously).
Closes#3806
* Merged segments are now warmed-up at the end of the merge operation instead
of _refresh, so that _refresh doesn't pay the price for the warm-up of merged
segments, which is often higher than flushed segments because of their size.
* Even when no _warmer is registered, some basic warm-up of the segments is
performed: norms, doc values (_version). This should help a bit people who
forget to register warmers.
* Eager loading support for the parent id cache and field data: when one
can't predict what terms will be present in the index, it is tempting to use
a match_all query in a warmer, but in that case, query execution might not be
much faster than field data loading so having a warmer that only loads field
data without running a query can be useful.
Closes#3819