Docs: unescape regexes in Pattern Tokenizer docs

Currently the regexes in the Pattern Tokenizer docs are escaped (apparently following Java string-literal rules). It is better not to escape them: JSON escaping should be handled automatically by client libraries, and string-literal escaping depends on the client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615
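
To illustrate the point about automatic JSON escaping, here is a minimal sketch in Python. The `tokenizer_settings` dictionary is a hypothetical fragment made up for this example, not a complete Elasticsearch request; the only thing it is meant to show is that the regex is written once as `\W+` and the JSON encoder adds the second backslash on its own.

```python
import json

# The regex exactly as the tokenizer should receive it: backslash, W, plus.
# A raw string literal avoids Python's own string escaping.
pattern = r"\W+"

# Hypothetical settings fragment, used only for illustration.
tokenizer_settings = {"type": "pattern", "pattern": pattern}

# The JSON encoder escapes the backslash; no manual escaping is needed.
body = json.dumps(tokenizer_settings)
print(body)  # {"type": "pattern", "pattern": "\\W+"}

# Decoding the JSON gives back the original three-character regex.
decoded = json.loads(body)["pattern"]
print(decoded)  # \W+
```

In a language without raw strings the same regex would be written as the literal `"\\W+"`, which is the client-language escaping the new note in the docs refers to.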
2025-03-09 14:34:43 +00:00 · 2014-06-25 23:18:43 +06:00 · 2014-06-25 23:18:43 +06:00 · 955473f475
commit 955473f475
parent 6e6f4def5d
1 changed file with 14 additions and 5 deletions
--- a/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
+++ b/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc
@@ -7,7 +7,7 @@ via a regular expression. Accepts the following settings:
 [cols="<,<",options="header",]
 |======================================================================
 |Setting |Description
-|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`pattern` |The regular expression pattern, defaults to `\W+`.
 |`flags` |The regular expression flags.
 |`group` |Which group to extract into tokens. Defaults to `-1` (split).
 |======================================================================
@@ -15,15 +15,24 @@ via a regular expression. Accepts the following settings:
 *IMPORTANT*: The regular expression should match the *token separators*,
 not the tokens themselves.

+*********************************************
+Note that you may need to escape the `pattern` string literal according to
+the rules of your client language. For example, in many programming
+languages a string literal for the `\W+` pattern is written as `"\\W+"`.
+There is nothing special about `pattern` (you may have to escape other
+string literals as well); escaping `pattern` is common only because it
+often contains characters that must be escaped.
+*********************************************
+
 `group` set to `-1` (the default) is equivalent to "split". Using group
 >= 0 selects the matching group as the token. For example, if you have:

 ------------------------
-pattern = \\'([^\']+)\\'
+pattern = '([^']+)'
 group   = 0
 input   = aaa 'bbb' 'ccc'
 ------------------------

-the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
-With the same input but using group=1, the output would be: bbb and ccc
-(no ' marks).
+the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
+marks). With the same input but using group=1, the output would be:
+`bbb` and `ccc` (no `'` marks).
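
The `group` behaviour described in the diff above can be approximated outside Elasticsearch. The following is a minimal sketch that uses Python's `re` module as a stand-in for the Java regex engine. `pattern_tokenize` is a hypothetical helper written for this example, not the Lucene implementation, and dropping empty tokens is an assumption about how the tokenizer treats zero-length results.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Imitate the documented `group` semantics (illustrative only)."""
    regex = re.compile(pattern)
    if group == -1:
        # Split mode: the pattern matches the separators between tokens.
        tokens, last = [], 0
        for match in regex.finditer(text):
            tokens.append(text[last:match.start()])
            last = match.end()
        tokens.append(text[last:])
    else:
        # Group mode: each match contributes the selected group as a token.
        tokens = [match.group(group) for match in regex.finditer(text)]
    # Assumption: empty tokens are discarded.
    return [token for token in tokens if token]


text = "aaa 'bbb' 'ccc'"
quoted = r"'([^']+)'"

print(pattern_tokenize(text, quoted, group=0))  # ["'bbb'", "'ccc'"]
print(pattern_tokenize(text, quoted, group=1))  # ['bbb', 'ccc']
print(pattern_tokenize(text, r"\W+"))           # ['aaa', 'bbb', 'ccc']
```

With `group=0` the whole match is kept, including the `'` marks; with `group=1` only the first capturing group is kept; with the default `-1` the pattern is treated as a separator.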