The Hunspell service would throw a confusing error message if more than
one affix file was present. This commit distinguishes between the two
error cases: where there are no affix files and when there are too many
affix files.
Also implements lazy dictionary loading, which was used in the tests
but not implemented.
Closes#6850
This commit adds the infrastructure to allow pluging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding
- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser
closes#6561
The `recovery_after_time` tells the gateway to wait before starting recovery from disk. The goal here is to allow for more nodes to join the cluster and thus not start potentially unneeded replications. The `expectedNodes` setting (and friends) tells the gateway when it can start recovering even if the `recover_after_time` has not yet elapsed. However, `expectedNodes` is useless if one doesn't set `recovery_after_time`. This commit changes that by setting a sensible default of 5m for `recover_after_time` *if* a `expectedNodes` setting is present.
Closes#6742
`mmapfs` is really good for random access but can have sideeffects if
memory maps are large depending on the operating system etc. A hybrid
solution where only selected files are actually memory mapped but others
mostly consumed sequentially brings the best of both worlds and
minimizes the memory map impact.
This commit mmaps only the `dvd` and `tim` file for fast random access
on docvalues and term dictionaries.
Closes#6636
Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`.
Closes#6615
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)
Add `classic` tokenizer: specific to english text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments thai text into words.
Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
Add support access to default Thai stopword set "_thai_"
Fix some bugs and broken links in documentation.
Closes#5935
For the casual reader, the reference to "term queries" may be glossed over, yielding an unexpected result when using `regexp` queries.
This attempts to make that distinction more prominent.
Closes#6698
If the match query with cutoff_frequency encounters stacked tokens,
like synonyms in the same position, it returns a boolean query instead
of a common terms query. However, if the original operator was set
to "and", it was ignoring that and resetting the operator to "or".
In fact, if operator is "and" then there is little benefit in using
a common terms query as a must query is already
executed efficiently.
The `exists` and `missing` filters need to merge postings lists of all existing
terms, which can be very costly, especially on high-cardinality fields. This
commit indexes the field names of a document under `_field_names` and reuses it
to speed up the `exists` and `missing` filters.
This is only enabled for indices that are created on or after Elasticsearch
1.3.0.
Close#5659
Added the http.jsonp.enable option to configure disabling of JSONP responses, as those
might pose a security risk, and can be disabled if unused.
This also fixes bugs in NettyHttpChannel
* JSONP responses were never setting application/javascript as the content-type
* The content-type and content-length headers were being overwritten even if they were set before
Closes#6164
Percentile Rank Aggregation is the reverse of the Percetiles aggregation. It determines the percentile rank (the proportion of values less than a given value) of the provided array of values.
Closes#6386