Docfix: ignore_above uses string length, not utf-8

`ignore_above` is used to guard against the Lucene limitation that a term cannot exceed 32766 bytes. However, the implementation compares against the character count, which does not account for characters that have multi-byte UTF-8 encodings. This commit updates the docs to make this relationship clear. Closes #11563
2015-07-09 14:04:20 -05:00 · f86e8c33c1
parent 03148dc3b0
commit f86e8c33c1
1 changed file with 6 additions and 0 deletions
--- a/docs/reference/mapping/types/core-types.asciidoc
+++ b/docs/reference/mapping/types/core-types.asciidoc
@@ -143,6 +143,12 @@ defaults to `true` or to the parent `object` type setting.
 |`ignore_above` |The analyzer will ignore strings larger than this size.
 Useful for generic `not_analyzed` fields that should ignore long text.

+This option is also useful for protecting against Lucene's term byte-length
+limit of `32766`. Note: the value for `ignore_above` is the _character count_,
+but Lucene counts bytes, so if you have UTF-8 text, you may want to set the
+limit to `32766 / 3 = 10922` since UTF-8 characters may occupy at most 3
+bytes.
+
 |`position_offset_gap` |Position increment gap between field instances
 with the same field name. Defaults to 0.
 |=======================================================================
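
The gap between character count and byte length described in this change can be checked with a short sketch. It is illustrative only and not part of the commit. It assumes that "character count" means UTF-16 code units, as Java's `String.length()` reports them; under that assumption a code unit never needs more than 3 UTF-8 bytes, because a 4-byte character counts as two units. The helper names (`utf16_units`, `utf8_bytes`, `safe_ignore_above`) are hypothetical.

```python
# Illustrative sketch, not Elasticsearch code. All helper names are hypothetical.
LUCENE_MAX_TERM_BYTES = 32766  # term byte-length limit quoted in the doc change


def utf16_units(text: str) -> int:
    """Length in UTF-16 code units (assumed to match the 'character count')."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Length in UTF-8 bytes, which is what the byte limit applies to."""
    return len(text.encode("utf-8"))


def safe_ignore_above(max_bytes_per_unit: int = 3) -> int:
    """Largest character count that cannot exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // max_bytes_per_unit


samples = {
    "ascii": "a" * 10,        # 1 byte per character
    "cjk": "\u65e5" * 10,      # 3 bytes per character
    "emoji": "\U0001F600" * 10,  # 4 bytes, but 2 UTF-16 units per character
}

for name, text in samples.items():
    print(name, utf16_units(text), utf8_bytes(text))

print("safe ignore_above:", safe_ignore_above())
```

With these samples the ratio of UTF-8 bytes to counted units is 1, 3, and 2 respectively, so 3 is the worst case under the stated assumption and `32766 / 3 = 10922` keeps every accepted string under the limit.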