NumericTokenizer is a simple wrapper aroung a NumericTokenStream. However, its
implementations had a few issues: its reset() method was not idempotent,
causing exceptions if reset() was called twice (causing #3211) and it had no
attributes, meaning that the only thing it allowed to do is counting the number
of generated tokens. The reason why indexing numeric data worked is that
the mapper's parseCreateField directly generates a NumericTokenStream and
by-passes the analyzer.
This commit makes NumericTokenizer.reset idempotent and makes consuming a
NumericTokenizer behave the same way as consuming the underlying
NumericTokenStream.
##############
Previous versions of the GeoPointFieldMapper just stored the actual geohash
of a point. This commit changes the behavior of storing geohashes by storing
the geohash and all its prefixes in decreasing order in the same field. To
enable this functionality the option geohash_prefix must be set in the mapping.
This behavior allows to filter GeoPoints by their geohashes. Basically a
geohash prefix is defined by the filter and all geohashes that match this
prefix will be returned. The neighbors flag allows to filter geohashes
that surround the given geohash cell. In general the neighborhood of a
geohash is defined by its eight adjacent cells.
To enable this, the type of filtered fields must be geo_point with geohashes
and geohash_prefix enabled.
For example:
curl -XPUT 'http://127.0.0.1:9200/locations/?pretty=true' -d '{
"mappings" : {
"location": {
"properties": {
"pin": {
"type": "geo_point",
"geohash": true,
"geohash_prefix": true
}
}
}
}
}'
This example defines a mapping for a type location in an index locations
with a field pin. The option geohash arranges storing the geohash of
the pin field.
To filter the results by the geohash a geohash_cell needs to be defined.
For example
curl -XGET 'http://127.0.0.1:9200/locations/_search?pretty=true' -d '{
"query": {
"match_all":{}
},
"filter": {
"geohash_cell": {
"field": "pin",
"geohash": "u30",
"neighbors": true
}
}
}'
This filter will match all geohashes that start with one of the following
prefixes: u30, u1r, u32, u33, u1p, u31, u0z, u2b and u2c.
Internally the GeoHashFilter is either a simple TermFilter, in case no
neighbors should be filtered or a BooleanFilter combining the TermFilters
of the geohash and all its neighbors.
Closes#2778
Lucene 4.4 will feature new n-gram tokenizers and filters that should not
generate broken offsets (that cause highlighting bugs) anymore. They also
correctly handle supplementary characters and the tokenizers can work in a
streaming fashion (they are not limited to the first 1024 chars of the
stream anymore).
Suggesters might need access to the shard they run on as well as the
index they operate on. This patch adds indexname and shard ID to the
SuggestionContext
Closes#3199
Using MonotonicAppendingLongBuffer instead of a GrowableWriter should help
save several bits per value, especially when the bytes to store have similar
lengths.
Closes#3186
if there is a failure during alias creation the tests don't fail with the
correct exception. This commit simplifies the debugging asserting on the ack
flag.
PhraseSuggester can be very slow and CPU intensive if a lot of terms
are suggested. Yet, to prevent cluster instabilty and long running requests
this commit adds a hard limit of by default 10 tokens where we just return
no correction for anymore if the query is parsed into more tokens.
Closes#3164
Until now 'named dates' like dateOptionalTime could not be used as a group
of dates. This patch allows it to group it arbitrarily like this:
* yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||dateOptionalTime
* dateOptionalTime||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd
* yyyy/MM/dd HH:mm:ss||dateOptionalTime||yyyy/MM/dd
* date_time||date_time_no_millis
Closes#2132
Character.codePointAt and codePointBefore have two versions: one which only
accepts an offset, and one which accepts an offset and a limit. The former can
be dangerous when working with buffers of characters because if the offset
is the last char of the buffer, a char outside the buffer might be used to
compute the code point, so one should always use the version which accepts a
limit.
Collections.sort is wasteful on random-access lists: it dumps data into an
array, sorts the list and then adds elements back to the list. However, the
sorting can easily be performed in-place by using Lucene's
CollectionUtil.(merge|quick|tim)Sort.