From 955473f475ef66a503c98cfbd4a14c3938306131 Mon Sep 17 00:00:00 2001
From: Mikhail Korobov
Date: Wed, 25 Jun 2014 23:18:43 +0600
Subject: [PATCH] Docs: unescape regexes in Pattern Tokenizer docs

Currently regexes in Pattern Tokenizer docs are escaped (it seems
according to Java rules). I think it is better not to escape them
because JSON escaping should be automatic in client libraries, and
string escaping depends on a client language used. The default pattern
is `\W+`, not `\\W+`.

Closes #6615
---
 .../tokenizers/pattern-tokenizer.asciidoc | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
index 72ca6041020..9a148456195 100644
--- a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -7,7 +7,7 @@ via a regular expression. Accepts the following settings:
 [cols="<,<",options="header",]
 |======================================================================
 |Setting |Description
-|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`pattern` |The regular expression pattern, defaults to `\W+`.
 |`flags` |The regular expression flags.
 |`group` |Which group to extract into tokens. Defaults to `-1` (split).
 |======================================================================
@@ -15,15 +15,24 @@ via a regular expression. Accepts the following settings:
 *IMPORTANT*: The regular expression should match the *token separators*,
 not the tokens themselves.
 
+*********************************************
+Note that you may need to escape `pattern` string literal according to
+your client language rules. For example, in many programming languages
+a string literal for `\W+` pattern is written as `"\\W+"`.
+There is nothing special about `pattern` (you may have to escape other
+string literals as well); escaping `pattern` is common just because it
+often contains characters that should be escaped.
+*********************************************
+
 `group` set to `-1` (the default) is equivalent to "split". Using group
 >= 0 selects the matching group as the token. For example, if you have:
 
 ------------------------
-pattern = \\'([^\']+)\\'
+pattern = '([^']+)'
 group = 0
 input = aaa 'bbb' 'ccc'
 ------------------------
 
-the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
-With the same input but using group=1, the output would be: bbb and ccc
-(no ' marks).
+the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
+marks). With the same input but using group=1, the output would be:
+`bbb` and `ccc` (no `'` marks).
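
Reviewer illustration (not part of the patch): the `group` semantics and the
escaping note described in the diff can be sketched in a few lines of Python.
This is only an approximation under stated assumptions: it uses Python's `re`
module rather than the Java regex engine behind the tokenizer, the helper name
`pattern_tokenize` is hypothetical, and dropping empty pieces in split mode is
an assumption of this sketch, not something the docs above state.

```python
import re


def pattern_tokenize(text, pattern=r"\W+", group=-1):
    """Hypothetical helper imitating the documented `pattern`/`group` settings."""
    regex = re.compile(pattern)
    if group == -1:
        # Split mode: the pattern matches the separators, so the tokens are
        # the stretches of text between consecutive matches.
        tokens, start = [], 0
        for match in regex.finditer(text):
            if match.start() > start:
                tokens.append(text[start:match.start()])
            start = match.end()
        if start < len(text):
            tokens.append(text[start:])
        return tokens
    # Group mode: each match contributes the selected group as one token
    # (group 0 is the whole match); empty or unmatched groups are skipped.
    return [m.group(group) for m in regex.finditer(text) if m.group(group)]


# The escaped literal and the raw literal denote the same three-character
# pattern; only the client-language spelling differs.
assert "\\W+" == r"\W+"

quoted = r"'([^']+)'"
print(pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=0))
print(pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=1))
print(pattern_tokenize("foo, bar!baz"))
```

With the example input from the docs, group 0 keeps the quote marks and group 1
drops them, while the default `\W+` pattern in split mode yields the plain words.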