From f86e8c33c11a41f3601d86a782960a3036e863bb Mon Sep 17 00:00:00 2001
From: John Roesler
Date: Thu, 9 Jul 2015 14:04:20 -0500
Subject: [PATCH] Docfix: ignore_above uses string length, not utf-8

ignore_above is used to guard against the lucene limitation that a term
cannot exceed 32766 bytes. However, the implementation just used the
character count, which doesn't take into account the fact that some
characters have multi-byte utf-8 encodings.

This commit updates the docs to make this relationship clear.

Closes #11563
---
 docs/reference/mapping/types/core-types.asciidoc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/reference/mapping/types/core-types.asciidoc b/docs/reference/mapping/types/core-types.asciidoc
index 945a5c4e708..e1e92846bb7 100644
--- a/docs/reference/mapping/types/core-types.asciidoc
+++ b/docs/reference/mapping/types/core-types.asciidoc
@@ -143,6 +143,12 @@ defaults to `true` or to the parent `object` type setting.
 |`ignore_above` |The analyzer will ignore strings larger than this size.
 Useful for generic `not_analyzed` fields that should ignore long text.
 
+This option is also useful for protecting against Lucene's term byte-length
+limit of `32766`. Note: the value for `ignore_above` is the _character count_,
+but Lucene counts bytes, so if you have UTF-8 text, you may want to set the
+limit to `32766 / 3 = 10922` since UTF-8 characters may occupy at most 3
+bytes.
+
 |`position_offset_gap` |Position increment gap between field instances
 with the same field name. Defaults to 0.
 |=======================================================================
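
Review note on the `32766 / 3 = 10922` arithmetic in the added lines: the sketch below is illustrative Python, not Elasticsearch code. It assumes the "character count" is what Java's `String.length()` reports, i.e. UTF-16 code units; the patch itself does not say which unit is meant. Under that assumption each counted unit needs at most 3 UTF-8 bytes, because a character outside the Basic Multilingual Plane takes 4 bytes but is counted as 2 units. The helper names (`utf8_len`, `utf16_units`, `safe_ignore_above`) are hypothetical.

```python
LUCENE_MAX_TERM_BYTES = 32766  # term byte-length limit quoted in the patch


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units (assumed to match Java's String.length())."""
    return len(text.encode("utf-16-le")) // 2


def safe_ignore_above(max_bytes: int = LUCENE_MAX_TERM_BYTES,
                      bytes_per_unit: int = 3) -> int:
    """Largest count that cannot exceed max_bytes at the given worst case."""
    return max_bytes // bytes_per_unit


if __name__ == "__main__":
    # One sample each: ASCII, Latin-1 range, CJK (3 bytes), emoji (4 bytes, 2 units).
    for name, ch in {"ascii": "a", "latin": "é", "cjk": "中", "emoji": "😀"}.items():
        print(f"{name}: utf16_units={utf16_units(ch)} utf8_bytes={utf8_len(ch)}")
    print("ignore_above =", safe_ignore_above())
```

If the count were instead taken in Unicode code points, the worst case would be 4 bytes per character, and `safe_ignore_above(bytes_per_unit=4)` would give the more conservative `8191`.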