Docs: unescape regexes in Pattern Tokenizer docs
Currently, the regexes in the Pattern Tokenizer docs are escaped (apparently following Java string-literal rules). It is better not to escape them: JSON escaping should be handled automatically by client libraries, and string-literal escaping depends on the client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615
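The escaping layers described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `to_json_body` helper and the `"type": "pattern"` key are hypothetical, chosen to show a settings object being serialized, and are not taken from this commit.

```python
import json

# The regex as the tokenizer should receive it: backslash, W, plus (3 characters).
# A raw string avoids Python's own string-literal escaping.
PATTERN = r"\W+"


def to_json_body(pattern: str) -> str:
    """Serialize a hypothetical tokenizer settings object.

    The JSON library adds JSON-level backslash escaping by itself.
    """
    return json.dumps({"type": "pattern", "pattern": pattern})


body = to_json_body(PATTERN)
print(body)  # prints {"type": "pattern", "pattern": "\\W+"}

# Decoding the JSON gives back the original three-character regex.
print(json.loads(body)["pattern"] == PATTERN)  # prints True
```

The doubled backslash appears only in the serialized JSON text (and in non-raw string literals such as `"\\W+"`); the pattern value itself stays `\W+`, which is why the docs can show it unescaped.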
parent 6e6f4def5d
commit 955473f475
@@ -7,7 +7,7 @@ via a regular expression. Accepts the following settings:
 [cols="<,<",options="header",]
 |======================================================================
 |Setting |Description
-|`pattern` |The regular expression pattern, defaults to `\\W+`.
+|`pattern` |The regular expression pattern, defaults to `\W+`.
 |`flags` |The regular expression flags.
 |`group` |Which group to extract into tokens. Defaults to `-1` (split).
 |======================================================================
@@ -15,15 +15,24 @@ via a regular expression. Accepts the following settings:
 *IMPORTANT*: The regular expression should match the *token separators*,
 not the tokens themselves.

+*********************************************
+Note that you may need to escape `pattern` string literal according to
+your client language rules. For example, in many programming languages
+a string literal for `\W+` pattern is written as `"\\W+"`.
+There is nothing special about `pattern` (you may have to escape other
+string literals as well); escaping `pattern` is common just because it
+often contains characters that should be escaped.
+*********************************************
+
 `group` set to `-1` (the default) is equivalent to "split". Using group
 >= 0 selects the matching group as the token. For example, if you have:

 ------------------------
-pattern = \\'([^\']+)\\'
+pattern = '([^']+)'
 group = 0
 input = aaa 'bbb' 'ccc'
 ------------------------

-the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
-With the same input but using group=1, the output would be: bbb and ccc
-(no ' marks).
+the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
+marks). With the same input but using group=1, the output would be:
+`bbb` and `ccc` (no `'` marks).
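The `group` behaviour documented in this hunk can be approximated with a small Python sketch. It is a rough emulation, not the tokenizer's actual implementation: `pattern_tokenize` is a hypothetical helper, Python's `re` module stands in for the Java regex engine, and dropping empty tokens in split mode is an assumption.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Rough emulation of the documented `group` setting.

    group == -1: the pattern matches separators; emit the text between matches.
    group >= 0:  the pattern matches tokens; emit that group of every match.
    """
    regex = re.compile(pattern)
    if group == -1:
        tokens, last = [], 0
        for m in regex.finditer(text):
            if m.start() > last:  # skip empty tokens (assumed behaviour)
                tokens.append(text[last:m.start()])
            last = m.end()
        if last < len(text):
            tokens.append(text[last:])
        return tokens
    return [m.group(group) for m in regex.finditer(text)]


text = "aaa 'bbb' 'ccc'"
quoted = r"'([^']+)'"  # the unescaped pattern from the docs example

print(pattern_tokenize(text, quoted, group=0))  # prints ["'bbb'", "'ccc'"]
print(pattern_tokenize(text, quoted, group=1))  # prints ['bbb', 'ccc']
print(pattern_tokenize(text, r"\W+"))           # prints ['aaa', 'bbb', 'ccc']
```

The split branch walks the matches by hand instead of calling `re.split`, because Python's `re.split` also returns the text of capturing groups, which would not match the "split" semantics described here.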