currently, we treat all strings as shared (either by full equality or identity equality), even though we almost always know whether they should be serialized as shared or not. Add explicit writeSharedString/readSharedString methods and use them where applicable; all other writeString/readString calls will no longer treat strings as shared
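A minimal sketch of the shared / non-shared split described above, assuming a simple handle-based wire format. The class names `SharedStringWriter` and `SharedStringReader` and the byte layout are hypothetical and are not the actual stream classes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer: only writeSharedString tracks previously written strings.
class SharedStringWriter {
    private final ByteBuffer buf;
    private final Map<String, Integer> handles = new HashMap<>();

    SharedStringWriter(ByteBuffer buf) {
        this.buf = buf;
    }

    // Plain strings are always written in full: length prefix + UTF-8 bytes.
    void writeString(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length).put(bytes);
    }

    // Shared strings are written in full once, then referenced by handle.
    void writeSharedString(String s) {
        Integer handle = handles.get(s);
        if (handle == null) {
            handles.put(s, handles.size());
            buf.put((byte) 0);              // marker: full string follows
            writeString(s);
        } else {
            buf.put((byte) 1).putInt(handle); // marker: handle follows
        }
    }
}

// Hypothetical reader mirroring the writer above.
class SharedStringReader {
    private final ByteBuffer buf;
    private final List<String> seen = new ArrayList<>();

    SharedStringReader(ByteBuffer buf) {
        this.buf = buf;
    }

    String readString() {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns the same String instance for every repeat of a shared string.
    String readSharedString() {
        if (buf.get() == 0) {
            String s = readString();
            seen.add(s);
            return s;
        }
        return seen.get(buf.getInt());
    }
}
```

In this sketch a repeated shared string costs 5 bytes on the wire and is deserialized to a single instance, while plain strings pay no bookkeeping cost at all.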
relates to #3322
when parsing a filter, we use null to indicate that the filter should not match anything, but the top level filter doesn't take this into account
fixes #3356
Although segments are limited to 2B documents, there is no limit on the number
of unique values that a segment may store. This commit replaces 'int' with
'long' every time a number is used to represent an ordinal and modifies the
data-structures used to store ordinals so that they can actually support more
than 2B ordinals per segment.
This commit also improves memory usage of the multi-ordinals data-structures
and the transient memory usage which is required to build them (OrdinalsBuilder)
by using Lucene's PackedInts data-structures. In the end, loading the ordinals
mapping from disk may be a little slower, field-data-based features such as
faceting may be slightly slower or faster depending on whether being nicer to
the CPU caches balances the overhead of the additional abstraction or not, and
memory usage should be better in all cases, especially when the size of the
ordinals mapping is not negligible compared to the size of the values (numeric
data for example).
Close #3189
* If all parent ids have been emitted as hits, abort the query / filter execution.
* If a relatively small number of parent ids has been collected in the first phase, limit the number of second phase parent id lookups by putting a short circuit filter before parent document evaluation, or omit it in the case of the filter. This is controllable via the `short_circuit_cutoff` option, which is exposed in the `has_child` query & filter.
All parent / child queries and filters (except the `top_children` query) abort execution if no parent ids have been collected in the first phase.
Closes #3190
With this design, the percolate queries are stored in a special `_percolator` type with its own mapping, either in the same index as the actual data or in a different index (a dedicated percolation index, which might require different sharding behavior than the index that holds the actual data and is being searched on). This approach allows percolate requests to scale with the number of primary shards an index has been configured with and effectively distributes the percolate execution.
This commit doesn't add new percolate features other than scaling. The response remains similar, with the exception that a header similar to the one in the search api has been added to the percolate response.
Closes #3173
It can happen that Eclipse fails at correctly adding a new import entry to an
existing list of imports since we don't use its default rules. This commit
forces Eclipse to organize imports on save.
More-like-this and fuzzy-like-this queries expect analyzers which are able to
generate character terms (CharTermAttribute), so unfortunately this doesn't
work with analyzers which generate binary-only terms (BinaryTermAttribute,
the default CharTermAttribute impl being a special BinaryTermAttribute) such as
our analyzers for numeric fields (byte, short, integer, long, float, double but
also date and ip).
To work around this issue, this commit adds a fail_on_unsupported_field
parameter to the more-like-this and fuzzy-like-this parsers. When this parameter
is false, numeric fields will just be ignored and when it is true, an error will
be returned, saying that these queries don't support numeric fields. By default,
this setting is true but the mlt API sets it to false in order not to fail on
documents which contain numeric fields.
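The strict / lenient behavior of fail_on_unsupported_field can be sketched roughly as follows. The class `LikeThisFieldCheck`, the method `supportedFields` and the error message are hypothetical names used for illustration only; the real parsers work on field mappers rather than plain strings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the actual parser code.
class LikeThisFieldCheck {
    // Returns the fields the query can use. Numeric fields are either
    // rejected (strict mode) or silently skipped (lenient mode).
    static List<String> supportedFields(List<String> fields,
                                        Set<String> numericFields,
                                        boolean failOnUnsupportedField) {
        List<String> supported = new ArrayList<>();
        for (String field : fields) {
            if (numericFields.contains(field)) {
                if (failOnUnsupportedField) {
                    throw new IllegalArgumentException(
                            "query doesn't support numeric field [" + field + "]");
                }
                continue; // lenient mode: ignore the numeric field
            }
            supported.add(field);
        }
        return supported;
    }
}
```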
Close #3252
The XPatternCaptureGroupTokenFilter.java file can be removed once we
upgrade to Lucene 4.4.
This change required the addition of the commaDelimited flag to getAsArray()
to disable parsing strings as comma-delimited values.
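A rough sketch of what such a flag might do, assuming settings are a flat string map. `ArraySettingSketch` and this `getAsArray` signature are hypothetical and only illustrate the idea; presumably the flag matters here because regex patterns can themselves contain commas:

```java
import java.util.Map;

// Hypothetical stand-in for a settings lookup, not the real Settings class.
class ArraySettingSketch {
    static String[] getAsArray(Map<String, String> settings, String key,
                               boolean commaDelimited) {
        String value = settings.get(key);
        if (value == null) {
            return new String[0];
        }
        if (!commaDelimited) {
            // Keep the value intact, e.g. a regex such as "a{1,3}".
            return new String[] { value };
        }
        String[] parts = value.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }
}
```

With `commaDelimited` set to true, a pattern like `a{1,3}` would be split into two broken fragments; with the flag set to false it stays a single value.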
Closes #3340
The open/close index api now supports multiple indices, the same way the delete index api does. The only exception is when dealing with all indices: it's required to explicitly use _all or a pattern that identifies all the indices, not just an empty array of indices. The ignore_missing param is supported too.
Also added a new flag, action.disable_close_all_indices (default false), to disable closing all indices
Closes #3217
The index field was serialized as a boolean instead of showing the
'analyzed', 'not_analyzed', 'no' options. Fixed by calling
indexTokenizeOptionToString() in the builder.
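For illustration, the mapping from the two underlying flags to the three option names could look like the sketch below. The wrapper class `IndexOptionNames` and the two-boolean signature are assumptions; only the method name and the three option strings come from this commit:

```java
// Hypothetical wrapper class; the signature is an assumption.
class IndexOptionNames {
    static String indexTokenizeOptionToString(boolean indexed, boolean tokenized) {
        if (!indexed) {
            return "no";            // field is not indexed at all
        }
        // Indexed fields are either analyzed (tokenized) or indexed verbatim.
        return tokenized ? "analyzed" : "not_analyzed";
    }
}
```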
Closes #3174
This decision helps people who want to roll out Oracle Java without having OpenJDK installed.
* Removed any hard dependency on Java in the debian package
* The debian init script does not check for an existing JAVA_HOME anymore
* Debian and RedHat initscripts now exit if they do not find a java binary (instead of starting elasticsearch in the background and swallowing the error as there is no way to log it in that case)
* Changed the debian init script to rely on the pid file instead of the argument name of the process
* Added a useful error message in case no java binary is available (in elasticsearch shell script)
Closes #3304
Closes #3311
now that we have the concept of a shardIndex as part of our search execution, we can simply move to use ScoreDoc and FieldDoc instead of having our own wrappers that held the info
Also, rename shardRequestId where needed to be called shardIndex to conform with the variable name in Lucene
The previous loading of term vectors from the top level reader did not use the
correct docId. The docId in Versions.DocIdAndVersion is relative to the segment
reader and not to the top level reader. Consequently, the term vectors for the
wrong document were returned if the document was not in the first segment of
the shard.
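The relationship between the two kinds of docId can be sketched as below, assuming each segment has a doc base equal to the number of documents in the segments before it. `SegmentDocIdSketch` and `toTopLevelDocId` are hypothetical names, not part of the fix itself:

```java
// Hypothetical illustration of segment-relative vs. top-level docIds.
class SegmentDocIdSketch {
    // docBases[i] is the top-level docId of the first document in segment i.
    static int toTopLevelDocId(int[] docBases, int segmentOrd, int segmentDocId) {
        return docBases[segmentOrd] + segmentDocId;
    }
}
```

For the first segment the doc base is 0, so both docIds coincide, which is why the bug only showed up for documents in later segments.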
move away from maps for correlating responses from different shards, to a unique incremental integer representing a shardRequestId (unique within the specific search request)
this means we no longer need maps (or CHM) and can simply use atomic reference arrays, which rely on volatiles. it also removes the need to use a cache for heavy data structures since we don't really have them around anymore...
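a minimal sketch of the idea, assuming each shard response carries its integer index. the class `ShardResults` and its method names are hypothetical; `AtomicReferenceArray` is the standard java.util.concurrent class:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical per-request result holder, indexed by the shard's integer id.
class ShardResults<T> {
    private final AtomicReferenceArray<T> results;

    ShardResults(int numberOfShards) {
        this.results = new AtomicReferenceArray<>(numberOfShards);
    }

    // Safe to call from any response thread: set() has volatile write semantics.
    void onShardResponse(int shardIndex, T response) {
        results.set(shardIndex, response);
    }

    T get(int shardIndex) {
        return results.get(shardIndex);
    }

    // Counts the slots that have been filled so far.
    int received() {
        int count = 0;
        for (int i = 0; i < results.length(); i++) {
            if (results.get(i) != null) {
                count++;
            }
        }
        return count;
    }
}
```

the array is sized once per request, so there is no hashing, no resizing and no per-entry allocation as there would be with a map.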
When using PlainHighlighter, TokenStreams are reset both before highlighting
and at the beginning of highlighting, causing issues with analyzers that read
in reset() such as PatternAnalyzer. This commit removes the call to reset which
was performed before passing the TokenStream to the highlighter.
Close #3200
don't wrap the indices analyzers we have with a NamedAnalyzer in AnalysisService, since it effectively creates a new instance of an analyzer (with per field reuse strategy) and we don't benefit as much from reusing analyzers on the indices / node level
Now, the indices level analyzers return a NamedAnalyzer, and NamedAnalyzer will use the non per field reuse strategy since that's really the common case for it (no need for per field reuse there).
Also, try to reuse numeric analyzers globally instead of creating them per numeric mapper. Although those analyzers are not used during indexing (we have a custom numeric field for it), they can sometimes be used when searching, for example in a query string without a specific query implementation in the mappers
in guice, we always use eager loaded singletons for all modules we create, thus we can optimize the memory used by injectors by reducing the construction information they store per binding, resulting in an extensive reduction in memory usage for the many indices/shards case on a node
also, because all are eager singletons (and effectively read only), we can skip trying to create just in time bindings in the parent injector before trying to create them in the current injector, which improves object creation time and the time it takes to create an index or a shard on a node