Docfix: ignore_above uses string length, not UTF-8 byte length

ignore_above is used to guard against the Lucene limitation
that a term cannot exceed 32766 bytes.

However, the implementation just used the character count, which
doesn't take into account the fact that some characters have
multi-byte UTF-8 encodings.

This commit updates the docs to make this relationship clear.

Closes #11563
John Roesler 2015-07-09 14:04:20 -05:00 committed by Clinton Gormley
parent 03148dc3b0
commit f86e8c33c1
1 changed file with 6 additions and 0 deletions


@@ -143,6 +143,12 @@ defaults to `true` or to the parent `object` type setting.
|`ignore_above` |The analyzer will ignore strings larger than this size.
Useful for generic `not_analyzed` fields that should ignore long text.
This option is also useful for protecting against Lucene's term byte-length
limit of `32766`. Note: the value for `ignore_above` is the _character count_,
but Lucene counts bytes, so if you have UTF-8 text, you may want to set the
limit to `32766 / 3 = 10922` since UTF-8 characters may occupy at most 3
bytes.
|`position_offset_gap` |Position increment gap between field instances
with the same field name. Defaults to 0.
|=======================================================================
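
The arithmetic in the added note can be checked with a short sketch. It is only an illustration, not Elasticsearch code: the helper names (`utf16_units`, `utf8_bytes`, `safe_ignore_above`) are hypothetical, and it assumes that the "character count" compared against `ignore_above` is the Java string length, which counts UTF-16 code units. Under that assumption the 3-byte bound holds per unit: a code point outside the Basic Multilingual Plane takes 4 UTF-8 bytes but also counts as two units.

```python
# Limit quoted in the documentation change above.
LUCENE_MAX_TERM_BYTES = 32766


def utf16_units(text: str) -> int:
    """Count UTF-16 code units (assumed to match Java's String.length())."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Count the bytes of the UTF-8 encoding, which is what Lucene limits."""
    return len(text.encode("utf-8"))


def safe_ignore_above(max_bytes_per_unit: int = 3) -> int:
    """Largest character count that cannot exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // max_bytes_per_unit


if __name__ == "__main__":
    samples = {"ascii": "a" * 10, "cjk": "語" * 10, "emoji": "😀" * 10}
    for name, text in samples.items():
        # Print the unit count next to the byte count for each sample.
        print(name, utf16_units(text), utf8_bytes(text))
    print("suggested ignore_above:", safe_ignore_above())
```

A string of 10922 three-byte characters encodes to exactly 32766 bytes, so it just fits; one more character would push the term over the limit.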