OpenSearch/docs/reference/search/suggesters/phrase-suggest.asciidoc

[[search-suggesters-phrase]]
=== Phrase Suggester

NOTE: In order to understand the format of suggestions, please
read the <<search-suggesters>> page first.

The `term` suggester provides a convenient API to access word
alternatives on a per-token basis within a certain string distance. The API
allows accessing each token in the stream individually, while the
selection of suggestions is left to the API consumer. Often, however,
pre-selected suggestions are required in order to present them to the end user. The
`phrase` suggester adds logic on top of the `term` suggester
to select entire corrected phrases, instead of individual tokens, weighted
based on `ngram-language` models. In practice, this suggester can
make better decisions about which tokens to pick based on
co-occurrence and frequencies.

==== API Example

The `phrase` request is defined alongside the query part in the JSON
request:

[source,js]
--------------------------------------------------
curl -XPOST 'localhost:9200/_search' -H 'Content-Type: application/json' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'
--------------------------------------------------

The response contains suggestions scored by the most likely spell
correction first. In this case we got the expected correction
`xorr the god jewel` first, while the second correction is more
conservative: only one of the errors is corrected. Note that the
request is executed with `max_errors` set to `0.5`, so up to 50% of the terms
can contain misspellings (see the parameter descriptions below).

[source,js]
--------------------------------------------------
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
--------------------------------------------------
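
A client usually needs only the best-scored option from this structure.
The following Python sketch shows one way to read it, assuming the response
above has already been parsed into a dictionary. The helper name
`best_suggestion` is hypothetical and not part of any client library.

[source,python]
--------------------------------------------------
from typing import Optional


def best_suggestion(response: dict, name: str) -> Optional[str]:
    """Return the highest-scored corrected phrase for a named suggestion."""
    for entry in response.get("suggest", {}).get(name, []):
        options = entry.get("options", [])
        if options:
            # Options are listed best first in the example response, but
            # max() avoids depending on that ordering.
            return max(options, key=lambda option: option["score"])["text"]
    return None


# The "suggest" part of the example response shown above.
response = {
    "suggest": {
        "simple_phrase": [{
            "text": "Xor the Got-Jewel",
            "offset": 0,
            "length": 17,
            "options": [
                {"text": "xorr the god jewel", "score": 0.17877324},
                {"text": "xor the god jewel", "score": 0.14231323},
            ],
        }]
    }
}

print(best_suggestion(response, "simple_phrase"))
--------------------------------------------------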

==== Basic Phrase suggest API parameters

[horizontal]
`field`::
    The name of the field used to do n-gram lookups for the
    language model. The suggester uses this field to gather statistics to
    score corrections. This field is mandatory.

`gram_size`::
    Sets the maximum size of the n-grams (shingles) in the `field`.
    If the field doesn't contain n-grams (shingles), this should be omitted
    or set to `1`. Note that Elasticsearch tries to detect the gram size
    based on the specified `field`. If the field uses a `shingle` filter, the
    `gram_size` is set to the `max_shingle_size` if not explicitly set.

`real_word_error_likelihood`::
    The likelihood of a term being
    misspelled even if the term exists in the dictionary. The default is
    `0.95`, corresponding to 5% of the real words being misspelled.


`confidence`::
    The confidence level defines a factor applied to the
    input phrase's score, which is used as a threshold for other suggest
    candidates. Only candidates that score higher than the threshold will be
    included in the result. For instance, a confidence level of `1.0` will
    only return suggestions that score higher than the input phrase. If set
    to `0.0`, the top N candidates are returned. The default is `1.0`.

`max_errors`::
    The maximum percentage of the terms that can at most be
    considered misspellings in order to form a correction. This parameter
    accepts either a float value in the range `[0..1)` as a fraction of the actual
    query terms, or a number `>=1` as an absolute number of query terms. The
    default is `1.0`, which means that only corrections with
    at most 1 misspelled term are returned.

`separator`::
    The separator that is used to separate terms in the
    bigram field. If not set, the whitespace character is used as a
    separator.

`size`::
    The number of candidates that are generated for each
    individual query term. Low numbers like `3` or `5` typically produce good
    results. Raising this can bring up terms with higher edit distances. The
    default is `5`.

`analyzer`::
    Sets the analyzer used to analyze the suggest text.
    Defaults to the search analyzer of the suggest field passed via `field`.

`shard_size`::
    Sets the maximum number of suggested terms to be
    retrieved from each individual shard. During the reduce phase, only the
    top N suggestions are returned based on the `size` option. Defaults to
    `5`.

`text`:: 
    Sets the text / query to provide suggestions for.
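
The `max_errors` and `confidence` descriptions above can be read as simple
arithmetic. The following Python sketch is an illustration only, not the
suggester's implementation: the function names, the input phrase score of
`0.15`, and rounding the fractional form of `max_errors` down are all
assumptions.

[source,python]
--------------------------------------------------
def allowed_misspellings(max_errors: float, num_terms: int) -> int:
    """Interpret max_errors as an absolute count (>= 1) or a fraction ([0..1))."""
    if max_errors >= 1:
        return int(max_errors)
    # Assumption: the fraction of query terms is rounded down.
    return int(max_errors * num_terms)


def filter_by_confidence(candidates, input_score, confidence=1.0):
    """Keep candidates scoring strictly higher than confidence * input_score."""
    threshold = confidence * input_score
    return [(text, score) for text, score in candidates if score > threshold]


candidates = [
    ("xorr the god jewel", 0.17877324),
    ("xor the god jewel", 0.14231323),
]
input_score = 0.15  # hypothetical score of the uncorrected input phrase

print(allowed_misspellings(0.5, 4))   # fraction of a four-term query
print(allowed_misspellings(1.0, 4))   # absolute number of terms
print(filter_by_confidence(candidates, input_score, confidence=1.0))
print(filter_by_confidence(candidates, input_score, confidence=0.0))
--------------------------------------------------

With a confidence of `0.0` the threshold drops to zero, so every candidate
with a positive score passes and only `size` limits the result.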

==== Smoothing Models

The `phrase` suggester supports multiple smoothing models to balance
weight between infrequent grams (grams (shingles) that do not exist in
the index) and frequent grams (grams that appear at least once in the index).

[horizontal]
`stupid_backoff`::
    A simple backoff model that backs off to lower
    order n-gram models if the higher order count is `0` and discounts the
    lower order n-gram model by a constant factor. The default `discount` is
    `0.4`. Stupid Backoff is the default model. 

`laplace`::
    A smoothing model that uses additive smoothing, where a
    constant (typically `1.0` or smaller) is added to all counts to balance
    weights. The default `alpha` is `0.5`.

`linear_interpolation`::
    A smoothing model that takes the weighted
    mean of the unigrams, bigrams and trigrams based on user supplied
    weights (lambdas). Linear Interpolation doesn't have any default values.
    All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`)
    must be supplied.
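
To make the difference between the three models concrete, the following
Python sketch scores n-grams from a toy corpus using textbook forms of each
model. It is an assumption-laden illustration, not the scoring code used by
the suggester: the corpus, the function names, and the exact formulas are
chosen for clarity, and the lambda values are arbitrary weights that sum to 1.

[source,python]
--------------------------------------------------
from collections import Counter

tokens = "the god jewel of the god king".split()  # toy corpus
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
total = len(tokens)            # number of unigram occurrences
vocab_size = len(set(tokens))  # number of distinct terms


def ml(gram):
    """Maximum-likelihood estimate of the last word given the preceding words."""
    if len(gram) == 1:
        return counts[gram] / total
    context = counts[gram[:-1]]
    return counts[gram] / context if context else 0.0


def stupid_backoff(gram, discount=0.4):
    """Use the n-gram if it was seen, else a discounted lower-order estimate."""
    if len(gram) == 1 or counts[gram] > 0:
        return ml(gram)
    return discount * stupid_backoff(gram[1:], discount)


def laplace(gram, alpha=0.5):
    """Additive smoothing: add alpha to every count."""
    context = counts[gram[:-1]] if len(gram) > 1 else total
    return (counts[gram] + alpha) / (context + alpha * vocab_size)


def linear_interpolation(trigram, trigram_lambda, bigram_lambda, unigram_lambda):
    """Weighted mean of the trigram, bigram and unigram estimates."""
    return (trigram_lambda * ml(trigram)
            + bigram_lambda * ml(trigram[1:])
            + unigram_lambda * ml(trigram[2:]))


print(stupid_backoff(("god", "jewel")))   # seen bigram
print(stupid_backoff(("jewel", "king")))  # unseen bigram, backs off to "king"
print(laplace(("jewel", "king")))         # unseen bigram still gets some weight
print(linear_interpolation(("the", "god", "jewel"), 0.5, 0.3, 0.2))
--------------------------------------------------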

==== Candidate Generators

The `phrase` suggester uses candidate generators to produce a list of
possible terms per term in the given text. A single candidate generator
is similar to a `term` suggester called for each individual term in the
text. The output of the generators is subsequently scored in combination
with the candidates from the other terms to form suggestion candidates.

Currently only one type of candidate generator is supported, the
`direct_generator`. The Phrase suggest API accepts a list of generators
under the key `direct_generator`; each of the generators in the list is
called per term in the original text.

==== Direct Generators

The direct generators support the following parameters:

[horizontal]
`field`::
    The field to fetch the candidate suggestions from. This is
    a required option that either needs to be set globally or per
    suggestion.

`size`::
    The maximum number of corrections to be returned per suggest text token.

`suggest_mode`::
    The suggest mode controls which suggestions are
    included, or for which suggest text terms suggestions should be
    generated. Three possible values can be specified:
    ** `missing`: Only provide suggestions for terms in the suggest text that
                  aren't in the index. This is the default.
    ** `popular`: Only suggest suggestions that occur in more docs than the
                  original suggest text term.
    ** `always`: Suggest any matching suggestions based on terms in the
                 suggest text.

`max_edits`::
    The maximum edit distance candidate suggestions can have
    in order to be considered as a suggestion. Can only be a value between 1
    and 2. Any other value results in a bad request error being thrown.
    Defaults to 2.

`prefix_length`::
    The minimum number of prefix characters that must
    match in order to be a candidate suggestion. Defaults to 1. Increasing
    this number improves spellcheck performance. Usually misspellings don't
    occur at the beginning of terms.

`min_word_len`:: 
    The minimum length a suggest text term must have in
    order to be included. Defaults to 4.

`max_inspections`::
    A factor that is multiplied with the
    `shard_size` in order to inspect more candidate spell corrections on
    the shard level. Can improve accuracy at the cost of performance.
    Defaults to 5.

`min_doc_freq`::
    The minimum number of documents a
    suggestion should appear in. This can be specified as an absolute number
    or as a relative percentage of the number of documents. This can improve
    quality by only suggesting high frequency terms. Defaults to 0f, which means
    it is not enabled. If a value higher than 1 is specified, then the number
    cannot be fractional. The shard level document frequencies are used for
    this option.

`max_term_freq`::
    The maximum threshold in number of documents in which a
    suggest text token can exist in order to be included. Can be a relative
    percentage number (e.g. 0.4) or an absolute number to represent document
    frequencies. If a value higher than 1 is specified, then it cannot be
    fractional. Defaults to 0.01f. This can be used to exclude high
    frequency terms from being spellchecked. High frequency terms are
    usually spelled correctly; excluding them also improves the spellcheck
    performance. The shard level document frequencies are used for this
    option.

`pre_filter`::
    A filter (analyzer) that is applied to each of the
    tokens passed to this candidate generator. This filter is applied to the
    original token before candidates are generated.

`post_filter`::
    A filter (analyzer) that is applied to each of the
    generated tokens before they are passed to the actual phrase scorer.
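
The frequency-based options above (`suggest_mode`, `min_doc_freq` and
`max_term_freq`) amount to a few comparisons on document frequencies. The
Python sketch below is a simplified reading of those descriptions, not the
generator's actual code. The function names are hypothetical, and treating
values of `1` and above as absolute counts and the comparisons as inclusive
are assumptions. `num_docs` stands for the number of documents on the shard,
since shard level frequencies are used.

[source,python]
--------------------------------------------------
def resolve_threshold(value: float, num_docs: int) -> float:
    """Turn a fractional or absolute frequency setting into a document count."""
    if value >= 1:
        if value != int(value):
            raise ValueError("values of 1 or more must be whole numbers")
        return value
    return value * num_docs


def should_correct(term_df: int, num_docs: int, max_term_freq: float = 0.01) -> bool:
    """Skip input terms that are too frequent to be worth spellchecking."""
    return term_df <= resolve_threshold(max_term_freq, num_docs)


def keep_candidate(cand_df: int, term_df: int, num_docs: int,
                   suggest_mode: str = "missing", min_doc_freq: float = 0.0) -> bool:
    """Apply min_doc_freq and suggest_mode to a single candidate."""
    if cand_df < resolve_threshold(min_doc_freq, num_docs):
        return False
    if suggest_mode == "missing":
        return term_df == 0       # only correct terms absent from the index
    if suggest_mode == "popular":
        return cand_df > term_df  # candidate must be more frequent
    if suggest_mode == "always":
        return True
    raise ValueError("unknown suggest_mode: " + suggest_mode)


num_docs = 1000
print(should_correct(5, num_docs))                  # below 1% of documents
print(should_correct(50, num_docs))                 # above 1% of documents
print(keep_candidate(120, 3, num_docs, "popular"))
print(keep_candidate(120, 3, num_docs, "missing"))
--------------------------------------------------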

The following example shows a `phrase` suggest call with two generators.
The first one uses a field containing ordinary indexed terms, and the
second one uses a field whose terms are indexed with a `reverse` filter
(tokens are indexed in reverse order). This is used to overcome the limitation
of the direct generators, which require a constant prefix to provide
high-performance suggestions. The `pre_filter` and `post_filter` options
accept ordinary analyzer names.

[source,js]
--------------------------------------------------
curl -s -XPOST 'localhost:9200/_search' -H 'Content-Type: application/json' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
--------------------------------------------------
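
Why the reversed field helps can be shown with a small Python sketch. It
imitates a direct generator with a plain Levenshtein distance and a required
common prefix, which is a simplification of what the real generator does.
The term dictionaries and function names are hypothetical.

[source,python]
--------------------------------------------------
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def direct_candidates(term, dictionary, prefix_length=1, max_edits=2):
    """Terms sharing the required prefix and within max_edits of the input."""
    prefix = term[:prefix_length]
    return sorted(word for word in dictionary
                  if word != term
                  and word.startswith(prefix)
                  and edit_distance(term, word) <= max_edits)


def reversed_candidates(term, reversed_dictionary, **kwargs):
    """Reverse the token first (pre_filter) and the candidates after (post_filter)."""
    found = direct_candidates(term[::-1], reversed_dictionary, **kwargs)
    return [candidate[::-1] for candidate in found]


body_terms = {"god", "jewel", "the"}               # ordinary indexed terms
reverse_terms = {word[::-1] for word in body_terms}  # terms indexed reversed

# "jod" is misspelled in its first character, so the prefix check hides "god".
print(direct_candidates("jod", body_terms))
# Reversed, the error moves to the end of "doj" and "dog" matches the prefix.
print(reversed_candidates("jod", reverse_terms))
--------------------------------------------------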

`pre_filter` and `post_filter` can also be used to inject synonyms after
candidates are generated. For instance, for the query `captain usq` we
might generate a candidate `usa` for the term `usq`. Since `usa` is a synonym for
`america`, this allows us to present `captain america` to the user if this
phrase scores high enough.

// Migrated documentation into the main repo 2013-08-29 01:24:34 +02:00