From 32eb5ffa925211030f4e3dd200118297e0a151c5 Mon Sep 17 00:00:00 2001
From: Adrien Grand
Date: Fri, 6 Dec 2013 22:37:04 +0100
Subject: [PATCH] [Docs] Document which encoding should be used in order to make
 sense of the offsets returned by the term vectors API.

Close #4363
---
 docs/reference/search/termvectors.asciidoc | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/reference/search/termvectors.asciidoc b/docs/reference/search/termvectors.asciidoc
index edcb2bbf735..a0bdaea967a 100644
--- a/docs/reference/search/termvectors.asciidoc
+++ b/docs/reference/search/termvectors.asciidoc
@@ -41,6 +41,14 @@ If the requested information wasn't stored in the index, it will be
 omitted without further warning. See <>
 for how to configure your index to store term vectors.
 
+[WARNING]
+======
+Start and end offsets assume UTF-16 encoding is being used. If you want to use
+these offsets in order to get the original text that produced this token, you
+should make sure that the string you are taking a sub-string of is also encoded
+using UTF-16.
+======
+
 [float]
 ==== Term statistics
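
Note on the UTF-16 warning: the pitfall is easiest to see in a client language whose strings are not indexed by UTF-16 code units. The following is a minimal Python sketch, not part of the patch. The helper name `slice_by_utf16_offsets`, the sample text, and the offsets are all hypothetical and chosen only for illustration; the offsets are what a UTF-16 based tokenizer would be expected to report for the token "quick".

```python
def slice_by_utf16_offsets(text: str, start: int, end: int) -> str:
    """Return the part of `text` covered by UTF-16 code-unit offsets.

    Hypothetical helper: `start` and `end` are assumed to be offsets
    counted in UTF-16 code units, as the warning describes.
    """
    # Every UTF-16 code unit is 2 bytes; the "-le" variant adds no BOM.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start:2 * end].decode("utf-16-le")


# The emoji lies outside the Basic Multilingual Plane, so it occupies two
# UTF-16 code units but only one Python code point.
text = "😀 quick fox"

# Assumed offsets for the token "quick", counted in UTF-16 code units.
start_offset, end_offset = 3, 8

# Python indexes str by code point, so direct slicing is shifted by one.
naive = text[start_offset:end_offset]

# Slicing the UTF-16 encoded form recovers the token.
correct = slice_by_utf16_offsets(text, start_offset, end_offset)

print(repr(naive))    # 'uick '
print(repr(correct))  # 'quick'
```

In a language whose strings are already sequences of UTF-16 code units, such as Java, the offsets can be passed to the ordinary substring operation directly. The conversion above only matters when the client indexes strings by code point or by byte.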