Docfix: ignore_above uses string length, not UTF-8 byte length
ignore_above is used to guard against the Lucene limitation that a term cannot exceed 32766 bytes. However, the implementation uses the character count, which does not account for characters that have multi-byte UTF-8 encodings. This commit updates the docs to make this relationship clear. Closes #11563
parent 03148dc3b0
commit f86e8c33c1
@@ -143,6 +143,12 @@ defaults to `true` or to the parent `object` type setting.
 |`ignore_above` |The analyzer will ignore strings larger than this size.
 Useful for generic `not_analyzed` fields that should ignore long text.
 
+This option is also useful for protecting against Lucene's term byte-length
+limit of `32766`. Note: the value for `ignore_above` is the _character count_,
+but Lucene counts bytes, so if you have UTF-8 text, you may want to set the
+limit to `32766 / 3 = 10922` since UTF-8 characters may occupy at most 3
+bytes.
+
 |`position_offset_gap` |Position increment gap between field instances
 with the same field name. Defaults to 0.
 |=======================================================================
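
The arithmetic in the added paragraph can be checked with a short sketch. It assumes the "character count" is what Java's `String.length()` reports, that is, UTF-16 code units; the commit does not spell this out, so treat it as an assumption. Under that assumption a character outside the Basic Multilingual Plane counts as two units and encodes to 4 UTF-8 bytes, so the worst case remains 3 bytes per counted unit. The helper names `utf16_units`, `utf8_bytes`, and `safe_ignore_above` are hypothetical and exist only for this illustration.

```python
# Term byte-length limit quoted in the docs change above.
LUCENE_MAX_TERM_BYTES = 32766


def utf16_units(text: str) -> int:
    """Length in UTF-16 code units (assumed to match the 'character count')."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Length in bytes once the text is encoded as UTF-8."""
    return len(text.encode("utf-8"))


def safe_ignore_above(max_bytes_per_unit: int = 3) -> int:
    """Largest character count that cannot exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // max_bytes_per_unit


samples = {
    "ascii": "a",            # 1 unit, 1 byte
    "latin": "\u00e9",       # 1 unit, 2 bytes
    "cjk": "\u4e2d",         # 1 unit, 3 bytes (worst case per unit)
    "emoji": "\U0001F600",   # 2 units (surrogate pair), 4 bytes
}

for name, ch in samples.items():
    print(name, utf16_units(ch), utf8_bytes(ch))

# A value of 32766 // 3 keeps even all-3-byte text within the limit.
print(safe_ignore_above())
```

A string of `safe_ignore_above()` three-byte characters encodes to exactly 32766 bytes, and one more character pushes it over the limit, which is why the docs suggest `10922` rather than `32766` for non-ASCII text.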