Did-you-mean

The Did-you-mean suggester shows suggested corrections for misspelled search terms.

For example, if a user types "fliud," OpenSearch suggests a corrected search term like "fluid." You can then suggest the corrected term to the user or even automatically correct the search term.

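You can implement the did-you-mean suggester using one of the following methods:

- Use a term suggester to suggest corrections for individual words.
- Use a phrase suggester to suggest corrections for whole phrases.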

Term suggester

Use the term suggester to suggest corrected spellings for individual words. The term suggester uses an edit distance to compute suggestions.

The edit distance is the number of single-character insertions, deletions, or substitutions that need to be performed for a term to match another term. For example, to change the word "cat" to "hats", you need to substitute "h" for "c" and insert an "s", so the edit distance in this case is 2.
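
The definition above can be checked with a short dynamic-programming sketch. This is a plain Levenshtein distance written for illustration; the function name `edit_distance` is hypothetical, and the code is not the implementation OpenSearch uses internally.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: each insertion, deletion, or substitution costs 1."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into an empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_char != target_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("cat", "hats"))     # 2
print(edit_distance("fliud", "fluid"))  # 2 (a swap counts as two substitutions here)
```

Algorithms in the Damerau-Levenshtein family additionally treat the swap of two adjacent characters as a single edit, which this sketch does not model.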

To use the term suggester, you don't need any special field mappings for your index. By default, string field types are mapped as text. A text field is analyzed, so the title in the following example is tokenized into individual words. Indexing the following documents creates a books index where title is a text field:

PUT books/_doc/1
{
  "title": "Design Patterns (Object-Oriented Software)"
}

PUT books/_doc/2
{
  "title": "Software Architecture Patterns Explained"
}

To check how a string is split into tokens, you can use the _analyze endpoint. To apply the same analyzer that the field uses, you can specify the field's name in the field parameter:

GET books/_analyze
{
  "text": "Design Patterns (Object-Oriented Software)",
  "field": "title"
}

The default analyzer (standard) splits a string at word boundaries, removes punctuation, and lowercases the tokens:

{
  "tokens" : [
    {
      "token" : "design",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "patterns",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "object",
      "start_offset" : 17,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "oriented",
      "start_offset" : 24,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "software",
      "start_offset" : 33,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
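
To get a feel for this output without a cluster, the following sketch roughly imitates the tokenization with a regular expression. It is an approximation under a strong assumption (tokens are runs of word characters); the real standard analyzer follows Unicode text segmentation rules and can differ on apostrophes, numbers, and non-Latin scripts. The function name is hypothetical.

```python
import re


def approximate_standard_analyzer(text: str) -> list:
    """Split on non-word characters, lowercase, and record offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append(
            {
                "token": match.group().lower(),
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


for token in approximate_standard_analyzer("Design Patterns (Object-Oriented Software)"):
    print(token)
```

For the sample title, the tokens and offsets produced by this sketch agree with the `_analyze` response shown above.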

To get suggestions for a misspelled search term, use the term suggester. Specify the input text that needs suggestions in the text field, and specify the field from which to get suggestions in the field field:

GET books/_search
{
  "suggest": {
    "spell-check": {
      "text": "patern",
      "term": {
        "field": "title"
      }
    }
  }
}

The term suggester returns a list of corrections for the input text in the options array:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "spell-check" : [
      {
        "text" : "patern",
        "offset" : 0,
        "length" : 6,
        "options" : [
          {
            "text" : "patterns",
            "score" : 0.6666666,
            "freq" : 2
          }
        ]
      }
    ]
  }
}

The score value is calculated based on the edit distance. The higher the score, the better the suggestion. The freq value is the document frequency of the suggested term: the number of documents in the specified index in which the term appears.
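
The score in the sample response is consistent with normalizing the edit distance by the length of the shorter term. The sketch below reproduces that number; the formula is an assumption inferred from this example for the default string distance, not a documented contract, and the function name is hypothetical.

```python
def similarity_score(distance: int, input_term: str, suggestion: str) -> float:
    """Assumed normalization: 1 - distance / length of the shorter term."""
    shorter = min(len(input_term), len(suggestion))
    return 1.0 - distance / shorter


# "patern" -> "patterns" takes two insertions ("t" and "s"), so the distance is 2.
score = similarity_score(2, "patern", "patterns")
print(round(score, 7))  # 0.6666667
```

The response shows 0.6666666 rather than 0.6666667, which is what you would expect if the server computes the value in single precision.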

You can include several suggestions in one request. The following example uses the term suggester for two different suggestions:

GET books/_search
{
  "suggest": {
    "spell-check1" : {
      "text" : "patern",
      "term" : {
        "field" : "title"
      }
    },
    "spell-check2" : {
      "text" : "desing",
      "term" : {
        "field" : "title"
      }
    }
  }
}

To receive suggestions for the same input text in multiple fields, you can define the text globally to avoid duplication:

GET books/_search
{
  "suggest": {
    "text" : "patern",
    "spell-check1" : {
      "term" : {
        "field" : "title"
      }
    },
    "spell-check2" : {
      "term" : {
        "field" : "subject"
      }
    }
  }
}

If text is specified both at the global and individual suggestion levels, the suggestion-level value overrides the global value.
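
The override rule can be modeled in a few lines. This sketch only mirrors the rule as stated above on a plain dictionary shaped like the `suggest` object; `effective_texts` is a hypothetical helper, not part of any OpenSearch client.

```python
def effective_texts(suggest_body: dict) -> dict:
    """Map each suggestion name to the input text it would use."""
    global_text = suggest_body.get("text")
    resolved = {}
    for name, definition in suggest_body.items():
        if name == "text":
            continue  # the global text is not a suggestion definition
        # A suggestion-level text wins over the global one.
        resolved[name] = definition.get("text", global_text)
    return resolved


body = {
    "text": "patern",
    "spell-check1": {"term": {"field": "title"}},
    "spell-check2": {"text": "desing", "term": {"field": "title"}},
}
print(effective_texts(body))
# {'spell-check1': 'patern', 'spell-check2': 'desing'}
```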

Term suggester options

You can specify the following options to the term suggester.

Option Description
field The field from which to source suggestions. Required. Can be set for each suggestion or globally.
analyzer The analyzer with which to analyze the input text. Defaults to the analyzer configured for the field.
size The maximum number of suggestions to return for each token in the input text.
sort Specifies how suggestions should be sorted in the response. Valid values are:
- score: Sort by similarity score, then document frequency, and then the term itself.
- frequency: Sort by document frequency, then similarity score, and then the term itself.
suggest_mode The suggest mode specifies the terms for which suggestions should be included in the response. Valid values are:
- missing: Return suggestions only for the input text terms that are not in the index.
- popular: Return suggestions only if they occur in the documents more frequently than the original input term.
- always: Always return suggestions for each term in the input text.
Default is missing.
max_edits The maximum edit distance for suggestions. Valid values are in the [1, 2] range. Default is 2.
prefix_length An integer that specifies the minimum length the matched prefix must be to start returning suggestions. If the prefix of prefix_length is not matched, but the search term is still within the edit distance, no suggestions are returned. Default is 1. Higher values improve spellcheck performance because misspellings don't tend to occur at the beginning of words.
min_word_length The minimum length a suggestion must be in order to be included in the response. Default is 4.
shard_size The maximum number of candidate suggestions to obtain from each shard. After all candidate suggestions are considered, the top size suggestions are returned. Default is equal to the size value. Shard-level document frequencies may not be exact because terms may reside in different shards. If shard_size is larger than size, the document frequencies for suggestions are more accurate, at the cost of decreased performance.
max_inspections The multiplication factor for shard_size. The maximum number of candidate suggestions OpenSearch inspects to find suggestions is calculated as shard_size multiplied by max_inspections. May improve accuracy at the cost of decreased performance. Default is 5.
min_doc_freq The minimum number or percentage of documents in which a suggestion should appear for it to be returned. May improve accuracy by returning only suggestions with high shard-level document frequencies. Valid values are integers that represent the document frequency or floats in the [0, 1] range that represent the percentage of documents. Default is 0 (feature disabled).
max_term_freq The maximum number of documents in which a suggestion should appear in order for it to be returned. Valid values are integers that represent the document frequency or floats in the [0, 1] range that represent the percentage of documents. Default is 0.01. Excluding high-frequency terms improves spellcheck performance because high-frequency terms are usually spelled correctly. Uses shard-level document frequencies.
string_distance The edit distance algorithm to use to determine similarity. Valid values are:
- internal: The default algorithm that is based on the Damerau-Levenshtein algorithm but is highly optimized for comparing edit distances for terms in the index.
- damerau_levenshtein: The edit distance algorithm based on the Damerau-Levenshtein algorithm.
- levenshtein: The edit distance algorithm based on the Levenshtein edit distance algorithm.
- jaro_winkler: The edit distance algorithm based on the Jaro-Winkler algorithm.
- ngram: The edit distance algorithm based on character n-grams.
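
To make the interaction of a few of these options concrete, here is a simplified model of candidate selection over a small in-memory vocabulary. It covers only max_edits, prefix_length, size, and the score sort order, uses plain Levenshtein distance as a stand-in for the similarity score, and ignores shards entirely. The function names and the vocabulary are hypothetical; this is not how OpenSearch implements the suggester.

```python
def edit_distance(source: str, target: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            current.append(
                min(
                    previous[j - 1] + (source_char != target_char),
                    current[j - 1] + 1,
                    previous[j] + 1,
                )
            )
        previous = current
    return previous[-1]


def term_candidates(term, vocabulary, max_edits=2, prefix_length=1, size=5):
    """vocabulary maps each indexed term to its document frequency."""
    candidates = []
    for word, doc_freq in vocabulary.items():
        if word == term:
            continue
        # The first prefix_length characters must match exactly.
        if word[:prefix_length] != term[:prefix_length]:
            continue
        distance = edit_distance(term, word)
        if distance <= max_edits:
            # Smaller distance first, then higher frequency, then the term itself.
            candidates.append((distance, -doc_freq, word))
    candidates.sort()
    return [word for _, _, word in candidates[:size]]


vocabulary = {"patterns": 2, "pattern": 1, "matter": 5, "design": 1}
print(term_candidates("patern", vocabulary))               # ['pattern', 'patterns']
print(term_candidates("patern", vocabulary, max_edits=1))  # ['pattern']
```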

Phrase suggester

To implement did-you-mean, use a phrase suggester. The phrase suggester is similar to the term suggester, except it uses n-gram language models to suggest whole phrases instead of individual words.

To set up a phrase suggester, create a custom analyzer called trigram that uses a shingle filter and lowercases tokens. This filter is similar to the edge_ngram filter, but it applies to words instead of letters. Then configure the field from which you'll be sourcing suggestions with the custom analyzer you created:

PUT books2
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "shingle"
            ]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "trigram": {
            "type": "text",
            "analyzer": "trigram"
          }
        }
      }
    }
  }
}
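
The effect of the shingle filter configured above can be previewed with a small sketch. It assumes whitespace tokenization and assumes that single tokens are emitted alongside the shingles; the function name `shingles` is hypothetical, and the ordering of the output is only illustrative.

```python
def shingles(text: str, min_size: int = 2, max_size: int = 3, output_unigrams: bool = True) -> list:
    """Emit word n-grams (shingles) of min_size..max_size words, lowercased."""
    tokens = text.lower().split()
    result = []
    for start in range(len(tokens)):
        if output_unigrams:
            result.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


print(shingles("Design Patterns"))
# ['design', 'design patterns', 'patterns']
print(shingles("Software Architecture Patterns"))
```

To inspect the shingles that the index really produces, you can call the _analyze endpoint on books2 with the title.trigram field.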

Index the documents into the new index:

PUT books2/_doc/1
{
  "title": "Design Patterns"
}

PUT books2/_doc/2
{
  "title": "Software Architecture Patterns Explained"
}

Suppose the user searches for an incorrect phrase:

GET books2/_search
{
  "suggest": {
    "phrase-check": {
      "text": "design paterns",
      "phrase": {
        "field": "title.trigram"
      }
    }
  }
}

The phrase suggester returns the corrected phrase:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "phrase-check" : [
      {
        "text" : "design paterns",
        "offset" : 0,
        "length" : 14,
        "options" : [
          {
            "text" : "design patterns",
            "score" : 0.31666178
          }
        ]
      }
    ]
  }
}

To highlight suggestions, set up the highlight field for the phrase suggester:

GET books2/_search
{
  "suggest": {
    "phrase-check": {
      "text": "design paterns",
      "phrase": {
        "field": "title.trigram",
        "gram_size": 3,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

The results contain the highlighted text:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "phrase-check" : [
      {
        "text" : "design paterns",
        "offset" : 0,
        "length" : 14,
        "options" : [
          {
            "text" : "design patterns",
            "highlighted" : "design <em>patterns</em>",
            "score" : 0.31666178
          }
        ]
      }
    ]
  }
}
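
The highlighted value wraps only the terms that were changed. The following sketch reproduces that output for the sample phrase by comparing tokens position by position; it assumes whitespace tokenization, and the function name is hypothetical.

```python
def highlight_changes(original: str, corrected: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Wrap each corrected token that differs from the token at the same position."""
    original_tokens = original.split()
    highlighted = []
    for index, token in enumerate(corrected.split()):
        unchanged = index < len(original_tokens) and original_tokens[index] == token
        highlighted.append(token if unchanged else f"{pre_tag}{token}{post_tag}")
    return " ".join(highlighted)


print(highlight_changes("design paterns", "design patterns"))
# design <em>patterns</em>
```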

Phrase suggester options

You can specify the following options to the phrase suggester.

Option Description
field The field to use for n-gram lookups. The phrase suggester uses this field to calculate suggestion scores. Required.
gram_size The maximum size n of the n-grams (shingles) in the field. If the field does not contain n-grams (shingles), omit this option or set it to 1. If the field uses a shingle filter, and gram_size is not set, gram_size is set to max_shingle_size.
real_word_error_likelihood The probability that a term is misspelled, even if it exists in the dictionary. Default is 0.95 (5% of the words in the dictionary are misspelled).
confidence The confidence level is a float factor that is multiplied by the input phrase's score to calculate a threshold score for other suggestions. Only suggestions with higher scores than the threshold are returned. A confidence level of 1.0 will only return suggestions that score higher than the input phrase. If confidence is set to 0, the top size candidates are returned. Default is 1.
max_errors The maximum number or percentage of the terms that can be erroneous (spelled incorrectly) in order to return a suggestion. Valid values are integers that represent the number of terms or floats in the (0, 1) range that represent the percentage of the terms. Default is 1 (return only suggestions with at most one misspelled term). Setting this value to a high number can decrease performance. We recommend setting max_errors to a low number like 1 or 2 to reduce the time spent in suggest calls relative to the time spent in query execution.
separator The separator for the terms in the bigram field. Defaults to the space character.
size The number of candidate suggestions to generate for each query term. Specifying a higher value can result in terms with higher edit distances being returned. Default is 5.
analyzer The analyzer with which to analyze the suggestion text. Defaults to the analyzer configured for the field.
shard_size The maximum number of candidate suggestions to obtain from each shard. After all candidate suggestions are considered, the top size suggestions are returned. Default is 5.
collate Used to prune suggestions for which there are no matching documents in the index.
collate.query Specifies a query against which suggestions are checked to prune the suggestions for which there are no matching documents in the index.
collate.prune Specifies whether to return all suggestions. If prune is set to false, only those suggestions that have matching documents are returned. If prune is set to true, all suggestions are returned; each suggestion has an additional collate_match field that is true if the suggestion has matching documents and is false otherwise. Default is false.
highlight Configures suggestion highlighting. Both pre_tag and post_tag values are required.
highlight.pre_tag The starting tag for highlighting.
highlight.post_tag The ending tag for highlighting.
smoothing Smoothing model to balance the weight of the shingles that exist in the index frequently with the weight of the shingles that exist in the index infrequently.
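
The confidence option is the least intuitive of these, so here is a minimal sketch of the thresholding it describes. The scores are made-up illustrative numbers, and `apply_confidence` is a hypothetical helper that only mirrors the rule stated in the table.

```python
def apply_confidence(input_score: float, candidates: list, confidence: float = 1.0, size: int = 5) -> list:
    """Keep candidates that score above confidence * input_score, best first."""
    threshold = confidence * input_score
    kept = [candidate for candidate in candidates if candidate[1] > threshold]
    kept.sort(key=lambda candidate: candidate[1], reverse=True)
    return kept[:size]


# (phrase, score) pairs; the numbers are invented for the example.
candidates = [("design patterns", 0.32), ("design pattern", 0.05), ("resign patterns", 0.01)]
input_score = 0.02  # assumed score of the phrase exactly as the user typed it

print(apply_confidence(input_score, candidates))                  # the two phrases scoring above 0.02
print(apply_confidence(input_score, candidates, confidence=0.0))  # all three, best first
```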

Collate field

To filter out spellchecked suggestions that will not return any results, you can use the collate field. This field contains a scripted query that is run for each returned suggestion. See Search templates for information on constructing a templated query. You can specify the current suggestion using the {{suggestion}} variable, or you can pass your own template parameters in the params field (the suggestion value will be added to the variables you specify).

The collate query for a suggestion is run only on the shard from which the suggestion was sourced. The query is required.

Additionally, if the prune parameter is set to true, a collate_match field is added to each suggestion. If a query returns no results, the collate_match value is false. You can then filter out suggestions based on the collate_match field. The prune parameter's default value is false.

For example, the following query configures the collate field to run a match_phrase query matching the title field to the current suggestion:

GET books2/_search
{
  "suggest": {
    "phrase-check": {
      "text": "design paterns",
      "phrase": {
        "field": "title.trigram",
        "collate" : {
          "query" : {
            "source": {
              "match_phrase" : {
                "title": "{{suggestion}}"
              }
            }
          },
          "prune": true
        }
      }
    }
  }
}

The resulting suggestion contains the collate_match field set to true, which means the match_phrase query will return matching documents for the suggestion:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "suggest" : {
    "phrase-check" : [
      {
        "text" : "design paterns",
        "offset" : 0,
        "length" : 14,
        "options" : [
          {
            "text" : "design patterns",
            "score" : 0.56759655,
            "collate_match" : true
          }
        ]
      }
    ]
  }
}
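
The collate behavior described in this section can be simulated offline. The sketch below fills the {{suggestion}} variable into a query template and then applies the prune rule; the in-memory `titles` list and the `has_match` callback stand in for running the query against an index, and all function names are hypothetical.

```python
import json


def render_collate_query(template: dict, suggestion: str, params: dict = None) -> dict:
    """Substitute {{name}} placeholders; the suggestion is added to the given params."""
    variables = dict(params or {})
    variables["suggestion"] = suggestion
    rendered = json.dumps(template)
    for name, value in variables.items():
        escaped = json.dumps(str(value))[1:-1]  # JSON-escape, then drop the quotes
        rendered = rendered.replace("{{" + name + "}}", escaped)
    return json.loads(rendered)


def collate(suggestions: list, has_match, prune: bool = False) -> list:
    """prune=False drops unmatched suggestions; prune=True keeps all and flags them."""
    results = []
    for suggestion in suggestions:
        matched = has_match(suggestion)
        if prune:
            results.append({"text": suggestion, "collate_match": matched})
        elif matched:
            results.append({"text": suggestion})
    return results


titles = ["design patterns", "software architecture patterns explained"]
has_match = lambda phrase: any(phrase in title for title in titles)

template = {"match_phrase": {"title": "{{suggestion}}"}}
print(render_collate_query(template, "design patterns"))
print(collate(["design patterns", "design pattern maker"], has_match, prune=True))
print(collate(["design patterns", "design pattern maker"], has_match, prune=False))
```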

Smoothing models

For most use cases, when calculating a suggestion's score, you want to take into account not only the frequency of a shingle but also the shingle's size. Smoothing models are used to calculate scores for shingles of different sizes, balancing the weight of frequent and infrequent shingles.

The following smoothing models are supported.

Model Description
stupid_backoff Backs off to lower-order n-gram models if the higher-order n-gram count is 0 and multiplies the lower-order n-gram model by a constant factor (discount). This is the default smoothing model.
stupid_backoff.discount The factor by which to multiply the lower-order n-gram model. Optional. Default is 0.4.
laplace Uses additive smoothing, adding a constant alpha to all counts to balance weights.
laplace.alpha The constant added to all counts to balance weights, typically 1.0 or smaller. Optional. Default is 0.5.

By default, OpenSearch uses the Stupid Backoff model—a simple algorithm that starts with the shingles of the highest order and takes lower-order shingles if higher-order shingles are not found. For example, if you set up the phrase suggester to have 3-grams, 2-grams, and 1-grams, the Stupid Backoff model first inspects the 3-grams. If there are no 3-grams, it inspects 2-grams but multiplies the score by the discount factor. If there are no 2-grams, it inspects 1-grams but again multiplies the score by the discount factor. The Stupid Backoff model works well in most cases. If you need to choose the Laplace smoothing model, specify it in the smoothing parameter:

GET books2/_search
{
  "suggest": {
    "phrase-check": {
      "text": "design paterns",
      "phrase": {
        "field": "title.trigram",
        "size" : 1,
        "smoothing" : {
          "laplace" : {
            "alpha" : 0.7
          }
        }
      }
    }
  }
}
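
The two smoothing models described above can be compared on a toy corpus. This is a textbook-style sketch built on raw n-gram counts from two titles; the function names are hypothetical, the Laplace variant here expects a non-empty context, and the scoring inside OpenSearch may differ in its details.

```python
from collections import Counter


def ngram_counts(sentences: list, max_order: int = 3) -> Counter:
    """Count all word n-grams of order 1..max_order as tuples."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for order in range(1, max_order + 1):
            for start in range(len(tokens) - order + 1):
                counts[tuple(tokens[start:start + order])] += 1
    return counts


def stupid_backoff(word, context, counts, total_unigrams, discount=0.4):
    """Use the longest context that was seen; multiply by discount per backoff step."""
    factor = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return factor * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest context word
        factor *= discount
    return factor * counts[(word,)] / total_unigrams


def laplace(word, context, counts, vocabulary_size, alpha=0.5):
    """Additive smoothing: add alpha to every count (context must be non-empty)."""
    ngram = context + (word,)
    return (counts[ngram] + alpha) / (counts[context] + alpha * vocabulary_size)


counts = ngram_counts(["design patterns", "software architecture patterns explained"])
unigrams = [gram for gram in counts if len(gram) == 1]
total_unigrams = sum(counts[gram] for gram in unigrams)  # 6 word occurrences

print(stupid_backoff("patterns", ("design",), counts, total_unigrams))    # seen bigram
print(stupid_backoff("patterns", ("software",), counts, total_unigrams))  # backs off to the unigram
print(laplace("patterns", ("software",), counts, len(unigrams)))
```

In this toy corpus the unseen bigram "software patterns" still receives a nonzero score under both models, which is the point of smoothing.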

Candidate generators

Candidate generators provide possible suggestion terms based on the terms in the input text. There is one candidate generator available—direct_generator. A direct generator functions similarly to a term suggester: It is also called for each term in the input text. The phrase suggester supports multiple candidate generators, where each generator is called for each term in the input text. It also lets you specify a pre-filter (an analyzer that analyzes the input text terms before they enter the spellcheck phase) and a post-filter (an analyzer that analyzes the generated suggestions before they are returned).

Set up a direct generator for a phrase suggester:

GET books2/_search
{
  "suggest": {
    "text": "design paterns",
    "phrase-check": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [
          {
            "field": "title.trigram",
            "suggest_mode": "always",
            "min_word_length": 3
          }
        ]
      }
    }
  }
}
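
Conceptually, a candidate generator proposes alternatives for each term, and the alternatives are then combined into candidate phrases for scoring. The sketch below models that flow with a hypothetical lookup table in place of a real generator. Keeping the original term as one of the options is a simplifying assumption, and the pre-filter and post-filter are plain functions rather than analyzers.

```python
from itertools import product


def candidate_phrases(text: str, generate, pre_filter=None, post_filter=None) -> list:
    """Build every phrase that combines one option per input term."""
    options_per_term = []
    for term in text.split():
        analyzed = pre_filter(term) if pre_filter else term
        options = [analyzed] + list(generate(analyzed))
        if post_filter:
            options = [post_filter(option) for option in options]
        # Remove duplicates while keeping the original order.
        options_per_term.append(list(dict.fromkeys(options)))
    return [" ".join(combination) for combination in product(*options_per_term)]


# Hypothetical generator output: corrections keyed by the misspelled term.
corrections = {"paterns": ["patterns", "pattern"]}
phrases = candidate_phrases(
    "Design paterns",
    generate=lambda term: corrections.get(term, []),
    pre_filter=str.lower,
)
print(phrases)
# ['design paterns', 'design patterns', 'design pattern']
```

A language model such as the smoothing models described earlier would then rank these phrases.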

You can specify the following direct generator options.

Option Description
field The field from which to source suggestions. Required. Can be set for each suggestion or globally.
size The maximum number of suggestions to return for each token in the input text.
suggest_mode The suggest mode specifies the terms for which suggestions generated on each shard should be included. The suggest mode is applied to suggestions for each shard and is not checked when combining suggestions from different shards. Therefore, if the suggest mode is missing, suggestions will be returned if the term is missing from one shard but exists on another shard. Valid values are:
- missing: Return suggestions only for the input text terms that are not in the shard.
- popular: Return suggestions only if they occur in the documents on the shard more frequently than the original input term.
- always: Always return suggestions.
Default is missing.
max_edits The maximum edit distance for suggestions. Valid values are in the [1, 2] range. Default is 2.
prefix_length An integer that specifies the minimum length the matched prefix must be to start returning suggestions. If the prefix of prefix_length is not matched but the search term is still within the edit distance, no suggestions are returned. Default is 1. Higher values improve spellcheck performance because misspellings don't tend to occur at the beginning of words.
min_word_length The minimum length a suggestion must be in order to be included. Default is 4.
max_inspections The multiplication factor for shard_size. The maximum number of candidate suggestions OpenSearch inspects to find suggestions is calculated as shard_size multiplied by max_inspections. May improve accuracy at the cost of decreased performance. Default is 5.
min_doc_freq The minimum number or percentage of documents in which a suggestion should appear in order for it to be returned. May improve accuracy by returning only suggestions with high shard-level document frequencies. Valid values are integers that represent the document frequency or floats in the [0, 1] range that represent the percentage of documents. Default is 0 (feature disabled).
max_term_freq The maximum number of documents in which a suggestion should appear in order for it to be returned. Valid values are integers that represent the document frequency or floats in the [0, 1] range that represent the percentage of documents. Default is 0.01. Excluding high-frequency terms improves spellcheck performance because high-frequency terms are usually spelled correctly. Uses shard-level document frequencies.
pre_filter An analyzer that is applied to each input text token passed to the generator before a suggestion is generated.
post_filter An analyzer that is applied to each generated suggestion before it is passed to the phrase scorer.