372 lines
14 KiB
Plaintext
372 lines
14 KiB
Plaintext
[[search-suggesters-phrase]]
|
|
=== Phrase Suggester
|
|
|
|
NOTE: In order to understand the format of suggestions, please
|
|
read the <<search-suggesters>> page first.
|
|
|
|
The `term` suggester provides a very convenient API to access word
|
|
alternatives on a per token basis within a certain string distance. The API
|
|
allows accessing each token in the stream individually while
|
|
suggest-selection is left to the API consumer. Yet, often pre-selected
|
|
suggestions are required in order to present to the end-user. The
|
|
`phrase` suggester adds additional logic on top of the `term` suggester
|
|
to select entire corrected phrases instead of individual tokens weighted
|
|
based on `ngram-language` models. In practice this suggester will be
|
|
able to make better decisions about which tokens to pick based on
|
|
co-occurrence and frequencies.
|
|
|
|
==== API Example
|
|
|
|
The `phrase` request is defined along side the query part in the json
|
|
request:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XPOST 'localhost:9200/_search' -d {
|
|
"suggest" : {
|
|
"text" : "Xor the Got-Jewel",
|
|
"simple_phrase" : {
|
|
"phrase" : {
|
|
"analyzer" : "body",
|
|
"field" : "bigram",
|
|
"size" : 1,
|
|
"real_word_error_likelihood" : 0.95,
|
|
"max_errors" : 0.5,
|
|
"gram_size" : 2,
|
|
"direct_generator" : [ {
|
|
"field" : "body",
|
|
"suggest_mode" : "always",
|
|
"min_word_length" : 1
|
|
} ],
|
|
"highlight": {
|
|
"pre_tag": "<em>",
|
|
"post_tag": "</em>"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The response contains suggestions scored by the most likely spell
|
|
correction first. In this case we received the expected correction
|
|
`xorr the god jewel` first while the second correction is less
|
|
conservative where only one of the errors is corrected. Note, the
|
|
request is executed with `max_errors` set to `0.5` so 50% of the terms
|
|
can contain misspellings (See parameter descriptions below).
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took" : 5,
|
|
"timed_out" : false,
|
|
"_shards" : {
|
|
"total" : 5,
|
|
"successful" : 5,
|
|
"failed" : 0
|
|
},
|
|
"hits" : {
|
|
"total" : 2938,
|
|
"max_score" : 0.0,
|
|
"hits" : [ ]
|
|
},
|
|
"suggest" : {
|
|
"simple_phrase" : [ {
|
|
"text" : "Xor the Got-Jewel",
|
|
"offset" : 0,
|
|
"length" : 17,
|
|
"options" : [ {
|
|
"text" : "xorr the god jewel",
|
|
"highlighted": "<em>xorr</em> the <em>god</em> jewel",
|
|
"score" : 0.17877324
|
|
}, {
|
|
"text" : "xor the god jewel",
|
|
"highlighted": "xor the <em>god</em> jewel",
|
|
"score" : 0.14231323
|
|
} ]
|
|
} ]
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Basic Phrase suggest API parameters
|
|
|
|
[horizontal]
|
|
`field`::
|
|
the name of the field used to do n-gram lookups for the
|
|
language model, the suggester will use this field to gain statistics to
|
|
score corrections. This field is mandatory.
|
|
|
|
`gram_size`::
|
|
sets max size of the n-grams (shingles) in the `field`.
|
|
If the field doesn't contain n-grams (shingles) this should be omitted
|
|
or set to `1`. Note that Elasticsearch tries to detect the gram size
|
|
based on the specified `field`. If the field uses a `shingle` filter the
|
|
`gram_size` is set to the `max_shingle_size` if not explicitly set.
|
|
|
|
`real_word_error_likelihood`::
|
|
the likelihood of a term being a
|
|
misspelled even if the term exists in the dictionary. The default is
|
|
`0.95` corresponding to 5% of the real words are misspelled.
|
|
|
|
|
|
`confidence`::
|
|
The confidence level defines a factor applied to the
|
|
input phrases score which is used as a threshold for other suggest
|
|
candidates. Only candidates that score higher than the threshold will be
|
|
included in the result. For instance a confidence level of `1.0` will
|
|
only return suggestions that score higher than the input phrase. If set
|
|
to `0.0` the top N candidates are returned. The default is `1.0`.
|
|
|
|
`max_errors`::
|
|
the maximum percentage of the terms that at most
|
|
considered to be misspellings in order to form a correction. This method
|
|
accepts a float value in the range `[0..1)` as a fraction of the actual
|
|
query terms a number `>=1` as an absolute number of query terms. The
|
|
default is set to `1.0` which corresponds to that only corrections with
|
|
at most 1 misspelled term are returned. Note that setting this too high
|
|
can negatively impact performance. Low values like `1` or `2` are recommended
|
|
otherwise the time spend in suggest calls might exceed the time spend in
|
|
query execution.
|
|
|
|
`separator`::
|
|
the separator that is used to separate terms in the
|
|
bigram field. If not set the whitespace character is used as a
|
|
separator.
|
|
|
|
`size`::
|
|
the number of candidates that are generated for each
|
|
individual query term Low numbers like `3` or `5` typically produce good
|
|
results. Raising this can bring up terms with higher edit distances. The
|
|
default is `5`.
|
|
|
|
`analyzer`::
|
|
Sets the analyzer to analyse to suggest text with.
|
|
Defaults to the search analyzer of the suggest field passed via `field`.
|
|
|
|
`shard_size`::
|
|
Sets the maximum number of suggested term to be
|
|
retrieved from each individual shard. During the reduce phase, only the
|
|
top N suggestions are returned based on the `size` option. Defaults to
|
|
`5`.
|
|
|
|
`text`::
|
|
Sets the text / query to provide suggestions for.
|
|
|
|
`highlight`::
|
|
Sets up suggestion highlighting. If not provided then
|
|
no `highlighted` field is returned. If provided must
|
|
contain exactly `pre_tag` and `post_tag` which are
|
|
wrapped around the changed tokens. If multiple tokens
|
|
in a row are changed the entire phrase of changed tokens
|
|
is wrapped rather than each token.
|
|
|
|
`collate`::
|
|
Checks each suggestion against the specified `query` to prune suggestions
|
|
for which no matching docs exist in the index. The collate query for a
|
|
suggestion is run only on the local shard from which the suggestion has
|
|
been generated from. The `query` must be specified, and it is run as
|
|
a <<query-dsl-template-query,`template` query>>.
|
|
The current suggestion is automatically made available as the `{{suggestion}}`
|
|
variable, which should be used in your query. You can still specify
|
|
your own template `params` -- the `suggestion` value will be added to the
|
|
variables you specify. Additionally, you can specify a `prune` to control
|
|
if all phrase suggestions will be returned, when set to `true` the suggestions
|
|
will have an additional option `collate_match`, which will be `true` if
|
|
matching documents for the phrase was found, `false` otherwise.
|
|
The default value for `prune` is `false`.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XPOST 'localhost:9200/_search' -d {
|
|
"suggest" : {
|
|
"text" : "Xor the Got-Jewel",
|
|
"simple_phrase" : {
|
|
"phrase" : {
|
|
"field" : "bigram",
|
|
"size" : 1,
|
|
"direct_generator" : [ {
|
|
"field" : "body",
|
|
"suggest_mode" : "always",
|
|
"min_word_length" : 1
|
|
} ],
|
|
"collate": {
|
|
"query": { <1>
|
|
"match": {
|
|
"{{field_name}}" : "{{suggestion}}" <2>
|
|
}
|
|
},
|
|
"params": {"field_name" : "title"}, <3>
|
|
"prune": true <4>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
<1> This query will be run once for every suggestion.
|
|
<2> The `{{suggestion}}` variable will be replaced by the text
|
|
of each suggestion.
|
|
<3> An additional `field_name` variable has been specified in
|
|
`params` and is used by the `match` query.
|
|
<4> All suggestions will be returned with an extra `collate_match`
|
|
option indicating whether the generated phrase matched any
|
|
document.
|
|
|
|
==== Smoothing Models
|
|
|
|
The `phrase` suggester supports multiple smoothing models to balance
|
|
weight between infrequent grams (grams (shingles) are not existing in
|
|
the index) and frequent grams (appear at least once in the index).
|
|
|
|
[horizontal]
|
|
`stupid_backoff`::
|
|
a simple backoff model that backs off to lower
|
|
order n-gram models if the higher order count is `0` and discounts the
|
|
lower order n-gram model by a constant factor. The default `discount` is
|
|
`0.4`. Stupid Backoff is the default model.
|
|
|
|
`laplace`::
|
|
a smoothing model that uses an additive smoothing where a
|
|
constant (typically `1.0` or smaller) is added to all counts to balance
|
|
weights, The default `alpha` is `0.5`.
|
|
|
|
`linear_interpolation`::
|
|
a smoothing model that takes the weighted
|
|
mean of the unigrams, bigrams and trigrams based on user supplied
|
|
weights (lambdas). Linear Interpolation doesn't have any default values.
|
|
All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`)
|
|
must be supplied.
|
|
|
|
==== Candidate Generators
|
|
|
|
The `phrase` suggester uses candidate generators to produce a list of
|
|
possible terms per term in the given text. A single candidate generator
|
|
is similar to a `term` suggester called for each individual term in the
|
|
text. The output of the generators is subsequently scored in combination
|
|
with the candidates from the other terms to for suggestion candidates.
|
|
|
|
Currently only one type of candidate generator is supported, the
|
|
`direct_generator`. The Phrase suggest API accepts a list of generators
|
|
under the key `direct_generator` each of the generators in the list are
|
|
called per term in the original text.
|
|
|
|
==== Direct Generators
|
|
|
|
The direct generators support the following parameters:
|
|
|
|
[horizontal]
|
|
`field`::
|
|
The field to fetch the candidate suggestions from. This is
|
|
a required option that either needs to be set globally or per
|
|
suggestion.
|
|
|
|
`size`::
|
|
The maximum corrections to be returned per suggest text token.
|
|
|
|
`suggest_mode`::
|
|
The suggest mode controls what suggestions are
|
|
included or controls for what suggest text terms, suggestions should be
|
|
suggested. Three possible values can be specified:
|
|
** `missing`: Only suggest terms in the suggest text that aren't in the
|
|
index. This is the default.
|
|
** `popular`: Only suggest suggestions that occur in more docs then the
|
|
original suggest text term.
|
|
** `always`: Suggest any matching suggestions based on terms in the
|
|
suggest text.
|
|
|
|
`max_edits`::
|
|
The maximum edit distance candidate suggestions can have
|
|
in order to be considered as a suggestion. Can only be a value between 1
|
|
and 2. Any other value result in an bad request error being thrown.
|
|
Defaults to 2.
|
|
|
|
`prefix_length`::
|
|
The number of minimal prefix characters that must
|
|
match in order be a candidate suggestions. Defaults to 1. Increasing
|
|
this number improves spellcheck performance. Usually misspellings don't
|
|
occur in the beginning of terms. (Old name "prefix_len" is deprecated)
|
|
|
|
`min_word_length`::
|
|
The minimum length a suggest text term must have in
|
|
order to be included. Defaults to 4. (Old name "min_word_len" is deprecated)
|
|
|
|
`max_inspections`::
|
|
A factor that is used to multiply with the
|
|
`shards_size` in order to inspect more candidate spell corrections on
|
|
the shard level. Can improve accuracy at the cost of performance.
|
|
Defaults to 5.
|
|
|
|
`min_doc_freq`::
|
|
The minimal threshold in number of documents a
|
|
suggestion should appear in. This can be specified as an absolute number
|
|
or as a relative percentage of number of documents. This can improve
|
|
quality by only suggesting high frequency terms. Defaults to 0f and is
|
|
not enabled. If a value higher than 1 is specified then the number
|
|
cannot be fractional. The shard level document frequencies are used for
|
|
this option.
|
|
|
|
`max_term_freq`::
|
|
The maximum threshold in number of documents a
|
|
suggest text token can exist in order to be included. Can be a relative
|
|
percentage number (e.g 0.4) or an absolute number to represent document
|
|
frequencies. If an value higher than 1 is specified then fractional can
|
|
not be specified. Defaults to 0.01f. This can be used to exclude high
|
|
frequency terms from being spellchecked. High frequency terms are
|
|
usually spelled correctly on top of this also improves the spellcheck
|
|
performance. The shard level document frequencies are used for this
|
|
option.
|
|
|
|
`pre_filter`::
|
|
a filter (analyzer) that is applied to each of the
|
|
tokens passed to this candidate generator. This filter is applied to the
|
|
original token before candidates are generated.
|
|
|
|
`post_filter`::
|
|
a filter (analyzer) that is applied to each of the
|
|
generated tokens before they are passed to the actual phrase scorer.
|
|
|
|
The following example shows a `phrase` suggest call with two generators,
|
|
the first one is using a field containing ordinary indexed terms and the
|
|
second one uses a field that uses terms indexed with a `reverse` filter
|
|
(tokens are index in reverse order). This is used to overcome the limitation
|
|
of the direct generators to require a constant prefix to provide
|
|
high-performance suggestions. The `pre_filter` and `post_filter` options
|
|
accept ordinary analyzer names.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -s -XPOST 'localhost:9200/_search' -d {
|
|
"suggest" : {
|
|
"text" : "Xor the Got-Jewel",
|
|
"simple_phrase" : {
|
|
"phrase" : {
|
|
"analyzer" : "body",
|
|
"field" : "bigram",
|
|
"size" : 4,
|
|
"real_word_error_likelihood" : 0.95,
|
|
"confidence" : 2.0,
|
|
"gram_size" : 2,
|
|
"direct_generator" : [ {
|
|
"field" : "body",
|
|
"suggest_mode" : "always",
|
|
"min_word_length" : 1
|
|
}, {
|
|
"field" : "reverse",
|
|
"suggest_mode" : "always",
|
|
"min_word_length" : 1,
|
|
"pre_filter" : "reverse",
|
|
"post_filter" : "reverse"
|
|
} ]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
`pre_filter` and `post_filter` can also be used to inject synonyms after
|
|
candidates are generated. For instance for the query `captain usq` we
|
|
might generate a candidate `usa` for term `usq` which is a synonym for
|
|
`america` which allows to present `captain america` to the user if this
|
|
phrase scores high enough.
|