[DOCS] Note limitations of `max_gram` parm in `edge_ngram` tokenizer for index analyzers (#49007)
The `edge_ngram` tokenizer limits tokens to the `max_gram` character length. Autocomplete searches for terms longer than this limit return no results. To prevent this, you can use the `truncate` token filter to truncate tokens to the `max_gram` character length. However, this could return irrelevant results. This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach. Closes #48956.
This commit is contained in:
parent e6ad3c29fd
commit 095c34359f
@@ -72,12 +72,16 @@ configure the `edge_ngram` before using it.

The `edge_ngram` tokenizer accepts the following parameters:

[horizontal]
`min_gram`::
Minimum length of characters in a gram. Defaults to `1`.

`max_gram`::
+
--
Maximum length of characters in a gram. Defaults to `2`.

See <<max-gram-limits>>.
--

`token_chars`::
@@ -93,6 +97,29 @@ Character classes may be any of the following:

* `punctuation` -- for example `!` or `"`
* `symbol` -- for example `$` or `√`

[[max-gram-limits]]
=== Limitations of the `max_gram` parameter

The `edge_ngram` tokenizer's `max_gram` value limits the character length of
tokens. When the `edge_ngram` tokenizer is used with an index analyzer, this
means search terms longer than the `max_gram` length may not match any indexed
terms.

For example, if the `max_gram` is `3`, searches for `apple` won't match the
indexed term `app`.
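To see why, consider a minimal Python sketch of edge n-gram generation. This is an illustration only, not Elasticsearch's implementation: the tokenizer emits prefixes of each token whose lengths run from `min_gram` to `max_gram`.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    # Prefixes of `token`, from min_gram to max_gram characters long.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Index-time analysis of "apple" with max_gram=3:
indexed = edge_ngrams("apple", min_gram=1, max_gram=3)
print(indexed)             # ['a', 'ap', 'app']

# The untruncated search term "apple" is not among the indexed grams:
print("apple" in indexed)  # False
```

Because the longest indexed gram is only three characters, the five-character search term has nothing to match against.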
To account for this, you can use the <<analysis-truncate-tokenfilter,`truncate`
token filter>> with a search analyzer to shorten search terms to the
`max_gram` character length. However, this could return irrelevant results.

For example, if the `max_gram` is `3` and search terms are truncated to three
characters, the search term `apple` is shortened to `app`. This means searches
for `apple` return any indexed terms matching `app`, such as `apply`, `snapped`,
and `apple`.

We recommend testing both approaches to see which best fits your
use case and desired search experience.
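The tradeoff of the truncation approach can be sketched in a few lines of Python. This is illustrative only; a simple substring check stands in here for the real inverted-index lookup.

```python
def truncate(term, length=3):
    # Mimics the `truncate` token filter: keep at most `length` characters.
    return term[:length]

query = truncate("apple")  # 'app'
print(query)

# The truncated query now matches terms the user probably did not want:
for term in ["apply", "snapped", "apple"]:
    print(term, query in term)  # all three contain 'app'
```

Truncation guarantees the query is short enough to match indexed grams, but at the cost of pulling in every term that shares the same three-character prefix or substring.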

[float]
=== Example configuration

@@ -209,12 +236,16 @@ The above example produces the following terms:
---------------------------

Usually we recommend using the same `analyzer` at index time and at search
time. In the case of the `edge_ngram` tokenizer, the advice is different. It
only makes sense to use the `edge_ngram` tokenizer at index time, to ensure
that partial words are available for matching in the index. At search time,
just search for the terms the user has typed in, for instance: `Quick Fo`.

Below is an example of how to set up a field for _search-as-you-type_.

Note that the `max_gram` value for the index analyzer is `10`, which limits
indexed terms to 10 characters. Search terms are not truncated, meaning that
search terms longer than 10 characters may not match any indexed terms.
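The index-time/search-time asymmetry described above can be sketched in plain Python. This is an illustration only (the real analysis is configured via analyzers, as in the console example): the index side expands tokens into edge n-grams, while the search side leaves the user's terms as typed.

```python
# Illustrative sketch: index analyzer expands tokens into edge n-grams,
# search analyzer only lowercases and splits on whitespace.
def edge_ngrams(token, min_gram=1, max_gram=10):
    # Prefixes of `token`, from min_gram to max_gram characters long.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Index time: expand each token of the indexed text into edge n-grams.
indexed = set()
for token in "Quick Foxes".lower().split():
    indexed.update(edge_ngrams(token))

# Search time: the query is lowercased and split, but NOT expanded.
query_tokens = "Quick Fo".lower().split()
print(all(tok in indexed for tok in query_tokens))  # True
```

Both query tokens (`quick` and `fo`) are found among the indexed grams, so the partial input matches; a query token longer than `max_gram` would find no gram to match.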

[source,console]
-----------------------------------