4495 Commits

Author SHA1 Message Date
Simon Willnauer
d4ec03ed76 # Phrase Suggester
The `term` suggester provides a very convenient API to access word alternatives on a per-token
basis within a certain string distance. The API allows accessing each token in the stream
individually, while suggestion selection is left to the API consumer. Often, however, already ranked
and selected suggestions are required in order to present them to the end user.
Inside ElasticSearch we have quick access to far more statistics and information, which lets us
make a better decision about which token alternative to pick, or whether to pick one at all.

This `phrase` suggester adds some logic on top of the `term` suggester to select entire
corrected phrases instead of individual tokens, weighted based on *n-gram language models*. In practice it
will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
The current implementation is kept quite general and leaves room for future improvements.

# API Example

The `phrase` request is defined alongside the query part in the JSON request:

curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'

The response contains suggestions sorted with the most likely spell correction first. In this case we got the expected correction
`xorr the god jewel` first, while the second correction is less conservative, with only one of the errors corrected. Note that the request
is executed with `max_errors` set to `0.5`, so 50% of the terms can contain misspellings (see the parameter descriptions below).

{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
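
A consumer will typically just want the top-ranked phrase out of this response. The following is a minimal sketch of that, not part of this commit: it parses a trimmed copy of the sample response above and picks the highest-scoring option. The helper name `best_suggestion` is hypothetical.

```python
import json

# Trimmed copy of the sample response shown above.
response_text = """
{
  "suggest": {
    "simple_phrase": [ {
      "text": "Xor the Got-Jewel",
      "offset": 0,
      "length": 17,
      "options": [
        { "text": "xorr the god jewel", "score": 0.17877324 },
        { "text": "xor the god jewel", "score": 0.14231323 }
      ]
    } ]
  }
}
"""

def best_suggestion(response, name):
    """Return the highest-scoring option text for the named suggestion, or None."""
    entries = response.get("suggest", {}).get(name, [])
    options = [opt for entry in entries for opt in entry.get("options", [])]
    if not options:
        return None
    return max(options, key=lambda opt: opt["score"])["text"]

response = json.loads(response_text)
best = best_suggestion(response, "simple_phrase")
print(best)
```

Taking the maximum by `score` rather than the first list element keeps the helper correct even if the options are ever reordered by the caller.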

# Phrase suggest API

## Basic parameters

* `field` - the name of the field used to do n-gram lookups for the language model. The suggester uses this field to gather statistics to score corrections.
* `gram_size` - sets max size of the n-grams (shingles) in the `field`. If the field doesn't contain n-grams (shingles) this should be omitted or set to `1`.
* `real_word_error_likelihood` - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is `0.95`, corresponding to 5% of the real words being misspelled.
* `confidence` - The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of `1.0` will only return suggestions that score higher than the input phrase. If set to `0.0` the top N candidates are returned. The default is `1.0`.
* `max_errors` - the maximum percentage of the terms that can be considered misspellings in order to form a correction. This parameter accepts a float value in the range `[0..1)` as a fraction of the actual query terms, or a number `>=1` as an absolute number of query terms. The default is `1.0`, which means that only corrections with at most 1 misspelled term are returned.
* `separator` - the separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
* `size` - the number of candidates that are generated for each individual query term. Low numbers like `3` or `5` typically produce good results. Raising this can bring up terms with higher edit distances. The default is `5`.
* `analyzer` - Sets the analyzer to analyze the suggest text with. Defaults to the search analyzer of the suggest field passed via `field`.
* `shard_size` - Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the `size` option. Defaults to `5`.
* `text` - Sets the text / query to provide suggestions for.
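
To make the `max_errors` and `confidence` descriptions concrete, here is a small sketch of how the two values could be interpreted. It is not the suggester's implementation: the function names are hypothetical, and the rounding of the fractional case and the floor of one error are assumptions.

```python
def allowed_errors(max_errors, num_terms):
    """Translate `max_errors` into a count of terms that may be corrected."""
    if max_errors >= 1:
        # Values >= 1 are read as an absolute number of query terms.
        return int(max_errors)
    # Values in [0..1) are read as a fraction of the query terms.
    # Rounding and the minimum of 1 are assumptions of this sketch.
    return max(1, round(max_errors * num_terms))

def passes_confidence(candidate_score, input_score, confidence=1.0):
    """Keep a candidate only if it scores above confidence * input phrase score."""
    return candidate_score > confidence * input_score

# Assuming the analyzer splits "Xor the Got-Jewel" into four terms,
# max_errors = 0.5 allows two of them to be corrected.
print(allowed_errors(0.5, 4))
print(passes_confidence(0.18, 0.10, confidence=1.0))
```

With `confidence` set to `0.0` the threshold becomes zero, so every candidate with a positive score passes and only `size` limits the result.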

## Smoothing Models
The `phrase` suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (grams that appear at least once in the index).
* `laplace` - the default model that uses an additive smoothing model where a constant (typically `1.0` or smaller) is added to all counts to balance weights. The default `alpha` is `0.5`.
* `stupid_backoff` - a simple backoff model that backs off to lower order n-gram models if the higher order count is `0` and discounts the lower order n-gram model by a constant factor. The default `discount` is `0.4`.
* `linear_interpolation` - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`) must be supplied.
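
The three models can be illustrated with textbook formulations over a tiny corpus. This is a sketch for intuition only: the corpus, the function signatures and the exact normalization are assumptions, and the suggester's own scoring may differ in detail. Only the default values for `alpha` and `discount` are taken from the list above.

```python
from collections import Counter

# Tiny made-up corpus standing in for the indexed shingles.
tokens = "the god jewel of the god king".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)      # number of tokens in the corpus
vocab = len(unigrams)    # number of distinct terms

def laplace(w1, w2, alpha=0.5):
    """Additive smoothing: alpha is added to every bigram count."""
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

def stupid_backoff(w1, w2, discount=0.4):
    """Use the bigram estimate if it was seen, else a discounted unigram estimate."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return discount * unigrams[w2] / total

def linear_interpolation(w1, w2, w3, trigram_lambda, bigram_lambda, unigram_lambda):
    """Weighted mean of the trigram, bigram and unigram estimates for w3."""
    tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    uni = unigrams[w3] / total
    return trigram_lambda * tri + bigram_lambda * bi + unigram_lambda * uni

print(laplace("the", "king"))         # unseen bigram still gets a non-zero weight
print(stupid_backoff("the", "king"))  # backs off to the discounted unigram estimate
print(linear_interpolation("the", "god", "jewel", 0.5, 0.3, 0.2))
```

The point of all three is the same: a gram that never occurs in the index should lower a candidate's score without forcing it to zero.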

## Candidate Generators
The `phrase` suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a `term` suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the `direct_generator`. The Phrase suggest API accepts a list of generators under the key `direct_generator`; each of the generators in the list is called per term in the original text.
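
As a rough picture of how per-term candidates turn into phrase candidates, the sketch below enumerates every combination, drops those that change too many terms, and ranks the rest. Everything here is illustrative: the bigram counts are made up, `toy_score` is a stand-in for the language model, and exhaustive enumeration is an assumption of the sketch rather than a description of the actual search strategy.

```python
from itertools import product

# Made-up bigram counts standing in for index statistics.
BIGRAM_COUNTS = {
    ("xorr", "the"): 3, ("xor", "the"): 1,
    ("the", "god"): 5, ("the", "got"): 1,
    ("god", "jewel"): 4,
}

def toy_score(terms):
    """Sum the counts of adjacent term pairs; unseen pairs count as 0."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(terms, terms[1:]))

def phrase_candidates(original_terms, per_term_candidates, max_changed):
    """Combine per-term candidates into phrases and rank them by score."""
    ranked = []
    for combo in product(*per_term_candidates):
        # Count how many terms differ from the original text.
        changed = sum(new != old for new, old in zip(combo, original_terms))
        if changed <= max_changed:
            ranked.append((toy_score(combo), " ".join(combo)))
    return sorted(ranked, reverse=True)

original = ["xor", "the", "got", "jewel"]
per_term = [["xor", "xorr"], ["the"], ["got", "god"], ["jewel"]]
ranked = phrase_candidates(original, per_term, max_changed=2)
print(ranked[0][1])
```

Lowering `max_changed` to 1 removes the phrase that corrects both terms, which mirrors the effect `max_errors` has on the results.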

## Direct Generators

The direct generators support the following parameters:

* `field` - The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
* `analyzer` - The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - The maximum corrections to be returned per suggest text token.
* `suggest_mode` - The suggest mode controls which suggestions are included, or for which suggest text terms suggestions should be generated. Three possible values can be specified:
 * `missing` - Only suggest terms in the suggest text that aren't in the index. This is the default.
 * `popular` - Only suggest suggestions that occur in more docs than the original suggest text term.
 * `always` - Suggest any matching suggestions based on terms in the suggest text.
* `max_edits` - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
* `min_prefix` - The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms.
* `min_query_length` -  The minimum length a suggest text term must have in order to be included. Defaults to 4.
* `max_inspections` - A factor that is multiplied with the `shard_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
* `max_query_frequency` - The maximum threshold in number of documents a suggest text token can exist in, in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified, then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly; on top of this, it also improves the spellcheck performance. The shard level document frequencies are used for this option.
* `pre_filter` - A filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
* `post_filter` - A filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)
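
The three `suggest_mode` values from the list above boil down to a comparison of document frequencies. The following is a minimal sketch of that decision, assuming the document frequency of the original term and of the candidate are already known; the function name `keep_candidate` is hypothetical.

```python
def keep_candidate(mode, original_doc_freq, candidate_doc_freq):
    """Decide whether a candidate is suggested for a term under a suggest mode."""
    if mode == "missing":
        # Only terms that are not in the index get suggestions at all.
        return original_doc_freq == 0
    if mode == "popular":
        # The candidate must occur in more docs than the original term.
        return candidate_doc_freq > original_doc_freq
    if mode == "always":
        return True
    raise ValueError("unknown suggest_mode: %r" % mode)

print(keep_candidate("missing", original_doc_freq=0, candidate_doc_freq=12))
print(keep_candidate("popular", original_doc_freq=40, candidate_doc_freq=12))
```

This also shows why the examples in this message use `always`: with the default `missing`, a real but wrong word such as `got` would never be replaced, because it exists in the index.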

The following example shows a `phrase` suggest call with two generators: the first one uses a field containing ordinary indexed terms and the second one uses a field whose
terms are indexed with a `reverse` filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The `pre_filter` and `post_filter` options accept ordinary analyzer names.

curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
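
Why the reversed field helps can be seen with a toy generator. The sketch below is not the direct generator itself: the dictionary, the `generate` helper and the plain Levenshtein distance are assumptions made for illustration. It only shows that a misspelling in the first character is missed when a matching prefix is required, but found once the token and the dictionary are both reversed.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def generate(term, dictionary, min_prefix=1, max_edits=2):
    """Toy generator: candidates must share the first min_prefix characters."""
    return [w for w in dictionary
            if w != term
            and w[:min_prefix] == term[:min_prefix]
            and edit_distance(term, w) <= max_edits]

def reverse(token):
    return token[::-1]

terms = ["jewel", "god", "gem"]
reversed_terms = [reverse(t) for t in terms]  # what the `reverse` field would hold

misspelled = "hewel"  # the error is in the first character
forward = generate(misspelled, terms)
# pre_filter: reverse the token; post_filter: reverse the candidates back.
backward = [reverse(c) for c in generate(reverse(misspelled), reversed_terms)]
print(forward, backward)
```

Running both generators side by side, as the request above does, covers errors at either end of a term while each generator still benefits from the prefix restriction.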

`pre_filter` and `post_filter` can also be used to inject synonyms after candidates are generated. For instance, for the query `captain usq` we might generate a candidate `usa` for the term `usq`, which is a synonym for `america`. This allows us to present `captain america` to the user if this phrase scores high enough.

Closes #2709
2013-02-28 16:17:59 +01:00
Shay Banon
2bc624806d not bytes... 2013-02-28 16:02:38 +01:00
Shay Banon
1d45edd856 upgrade to jackson 2.1.4 2013-02-28 09:10:43 +01:00
Shay Banon
7400c30eba fail a shard if a merge failure occurs 2013-02-27 23:44:55 +01:00
Shay Banon
e908c723f1 don't log merge failures twice 2013-02-27 20:23:40 +01:00
Simon Willnauer
7be8f431d5 move id tests into SimpleQueryTests 2013-02-27 19:03:42 +01:00
Simon Willnauer
8ab602ec81 Fix AIOOB exception in UID type/id tuple creation.
Closes #2695
2013-02-27 18:58:27 +01:00
Shay Banon
3b2d403292 malformed elasticsearch.yml causes unresponsive hang
fixes #2693
2013-02-27 18:58:08 +01:00
Shay Banon
f02c3ec39a upgrade to guava 14.0 2013-02-27 18:31:13 +01:00
Shay Banon
31c273231a remove compress flag, as its no longer relevant 2013-02-27 18:13:48 +01:00
Drew Raines
cb7a569f4b Include preference in _count serialization and builder. [#2698] 2013-02-27 08:15:02 -06:00
Martijn van Groningen
ffbdc0a4c3 Updated postings format jdocs 2013-02-27 10:46:55 +01:00
Drew Raines
b53a8aff6a Allow _count to take preference parameter. [#2698] 2013-02-26 16:24:52 -06:00
Shay Banon
1e937fd5d1 Allow index: "no" for _type
fixes #2696
2013-02-26 22:06:52 +01:00
Martijn van Groningen
7c53d22ce9 Moved resolveClosestNestedObjectMapper to MapperService 2013-02-26 17:48:02 +01:00
Igor Motov
de243493c9 Changing dynamic index and cluster settings should work on master-only nodes
Fixes #2675
2013-02-26 08:54:46 -05:00
Shay Banon
bd75b731c6 move to 0.90.0.Beta2 snap 2013-02-26 10:33:57 +01:00
Shay Banon
ab3a59e0bf release 0.90.0.Beta1 2013-02-26 10:32:50 +01:00
Martijn van Groningen
2b5e3f5586 Fixed resolving closest nested object when sorting on a field inside nested object 2013-02-25 16:21:22 +01:00
Martijn van Groningen
c751df5ee5 Removed unused nested children collector. 2013-02-25 14:13:59 +01:00
Shay Banon
c7a05b1dda add helper method to know if ObjectMappers have a nested mapping 2013-02-25 13:40:05 +01:00
Shay Banon
6e3300efd3 better error message on nested sorting 2013-02-25 13:32:00 +01:00
Shay Banon
4bb4e49155 Empty list in ids query should not fail, but match no docs
relates to #2687
2013-02-25 12:51:34 +01:00
Shay Banon
bde36647fb Terms/Ids filter: Support empty list of values, resulting in no match for it
closes #2687
also closes #2686
2013-02-25 12:26:49 +01:00
Shay Banon
4145d154bb add a test for empty lookup terms filter 2013-02-25 11:58:58 +01:00
Shay Banon
358c0e35fb upgrade to latest jackson 2013-02-25 11:15:45 +01:00
Shay Banon
10ca4d5305 move internal stream facet type lookup to work with bytes 2013-02-25 10:57:18 +01:00
David Pilato
c689651706 Merge pull request #2681 from lukas-vlcek/master
Fix exception typo
2013-02-24 00:14:11 -08:00
Lukas Vlcek
a42f9491b5 fix typo in exception 2013-02-24 07:47:25 +01:00
Shay Banon
595e0e254e [Code refactoring] IndicesStats -> IndicesStatsResponse
fixes #1782
2013-02-23 14:23:36 +01:00
Shay Banon
7787901a2d cleanup the pom 2013-02-23 10:52:30 +01:00
David Pilato
4c493ac71d Revert changes on *Request classes from issue
Relative to #2657
2013-02-23 10:37:56 +01:00
David Pilato
a646e126e9 Display list of all available site plugins on /_plugins/ end point fix #2664 2013-02-23 09:34:06 +01:00
Shay Banon
eea3a01765 only return 404 on actual index settings missing, on "_all", return 200
relates to #2676
2013-02-22 23:08:38 +01:00
Shay Banon
915019587d Get settings on empty node fails with ArrayIndexOutOfBoundsException[0]
fixes #2676
2013-02-22 23:08:33 +01:00
Igor Motov
b8cc8e56c4 Improve stability of SimpleRobinEngineTests 2013-02-22 14:59:49 -05:00
Shay Banon
a3096157f8 upgrade to netty 3.6.3 2013-02-22 17:20:41 +01:00
Shay Banon
ad70105c39 keep the rescorer builder consistent with other builders, without the use of setters 2013-02-22 14:06:39 +01:00
Shay Banon
03fdc6aa80 Query DSL: Terms filter to allow for terms lookup from another document
closes #2674
2013-02-22 14:04:10 +01:00
Shay Banon
6978aa2189 mark source as "safe" when copying it over 2013-02-22 12:59:41 +01:00
Shay Banon
a234e45b59 fix boolean to is from get
relates to #2657
2013-02-22 12:45:56 +01:00
Igor Motov
ec3492c67c Improve stability of the testReusePeerRecovery test 2013-02-21 16:06:33 -05:00
Shay Banon
b7f5295674 update jsr166y and jsr166e to latest versions 2013-02-21 21:11:14 +01:00
Shay Banon
4753ffdf1e allow to set which queue implementation to use
expert setting, but still would be great to be able to control it
2013-02-21 20:07:40 +01:00
Drew Raines
5cce40fa5e Merge branch 'pull/2670'
Add 32-bit v6 jdk check for deb.
2013-02-21 08:26:18 -06:00
Ilya Nazarov
da3d682f0e Check for java-6-openjdk-i386 in init.d
There is a check for /usr/lib/jvm/java-6-openjdk-amd64, but none for 32-bit systems (/usr/lib/jvm/java-6-openjdk-i386).
2013-02-21 21:13:51 +07:00
Igor Motov
4ea4de6f8d Add logging information for releasing node lock 2013-02-20 17:53:27 -05:00
Shay Banon
7bb092440a facet refactoring, default collector base post implementation
automatically implement post based on collector
2013-02-20 15:36:11 +01:00
Igor Motov
ce6f0e27bf Make file distribution among several disks configurable
Fixes #2650
2013-02-19 21:43:43 -05:00
David Pilato
b7afa0f44e Fix test for Support trailing slashes on plugin _site URLs #2654 2013-02-19 21:16:47 +01:00