[DOCS] Note limitations of `max_gram` param in `edge_ngram` tokenizer for index analyzers (#49007)

The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
James Rodewig 2019-11-13 14:27:10 -05:00
parent e6ad3c29fd
commit 095c34359f
1 changed file with 36 additions and 5 deletions


@@ -72,12 +72,16 @@ configure the `edge_ngram` before using it.
The `edge_ngram` tokenizer accepts the following parameters:
[horizontal]
`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.
`max_gram`::
+
--
Maximum length of characters in a gram. Defaults to `2`.
See <<max-gram-limits>>.
--
`token_chars`::
@@ -93,6 +97,29 @@ Character classes may be any of the following:
* `punctuation` -- for example `!` or `"`
* `symbol` -- for example `$` or `√`
[[max-gram-limits]]
=== Limitations of the `max_gram` parameter
The `edge_ngram` tokenizer's `max_gram` value limits the character length of
tokens. When the `edge_ngram` tokenizer is used with an index analyzer, this
means search terms longer than the `max_gram` length may not match any indexed
terms.
For example, if the `max_gram` is `3`, searches for `apple` won't match the
indexed term `app`.
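To see why, consider a minimal Python sketch of edge n-gram generation (an illustration of the concept, not Elasticsearch's actual implementation): edge n-grams are the leading prefixes of a token, from `min_gram` up to `max_gram` characters, so a search term longer than `max_gram` can exceed every indexed gram.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the leading character n-grams of a token,
    from min_gram up to max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# With max_gram=3, the indexed grams for "apple" stop at "app",
# so the untruncated search term "apple" matches none of them.
grams = edge_ngrams("apple", min_gram=1, max_gram=3)
print(grams)                 # ['a', 'ap', 'app']
print("apple" in grams)      # False
```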
To account for this, you can use the <<analysis-truncate-tokenfilter,`truncate`
token filter>> with a search analyzer to shorten search terms to the `max_gram`
character length. However, this could return irrelevant results.
For example, if the `max_gram` is `3` and search terms are truncated to three
characters, the search term `apple` is shortened to `app`. This means searches
for `apple` return any indexed terms matching `app`, such as `apply`,
`approximate`, and `apple`.
We recommend testing both approaches to see which best fits your
use case and desired search experience.
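To make the tradeoff concrete, here is a small Python sketch (an illustration only, with a hypothetical word list; not Elasticsearch's implementation) of truncating the search term to the `max_gram` length: the truncated term now matches, but it also matches prefixes of unrelated words.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Leading character n-grams of a token, min_gram..max_gram chars long."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def truncate(term, length):
    """Mimic the truncate token filter: cut a term to `length` characters."""
    return term[:length]

# Hypothetical indexed vocabulary for illustration.
indexed = ["apple", "apply", "approximate", "snap"]
max_gram = 3

# Truncating the search term restores matching, but over-matches:
# every word whose grams include "app" is returned, relevant or not.
term = truncate("apple", max_gram)   # "app"
matches = [w for w in indexed if term in edge_ngrams(w, 1, max_gram)]
print(matches)   # ['apple', 'apply', 'approximate']
```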
[float]
=== Example configuration
@@ -209,12 +236,16 @@ The above example produces the following terms:
---------------------------
Usually we recommend using the same `analyzer` at index time and at search
time. In the case of the `edge_ngram` tokenizer, the advice is different. It
only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
that partial words are available for matching in the index. At search time,
just search for the terms the user has typed in, for instance: `Quick Fo`.

Below is an example of how to set up a field for _search-as-you-type_.
Note that the `max_gram` value for the index analyzer is `10`, which limits
indexed terms to 10 characters. Search terms are not truncated, meaning that
search terms longer than 10 characters may not match any indexed terms.
[source,console]
-----------------------------------