[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer

A tokenizer of type `pattern` that uses a regular expression to separate
text into terms. It accepts the following settings:

[cols="<,<",options="header",]
|======================================================================
|Setting |Description
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags.
|`group` |Which group to extract into tokens. Defaults to `-1` (split).
|======================================================================

*IMPORTANT*: In the default split mode (`group` set to `-1`), the regular
expression should match the *token separators*, not the tokens themselves.
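As a rough illustration of split mode, the sketch below uses Python's `re` module as a stand-in for the tokenizer's regular expression engine. The helper name `split_tokens` is hypothetical, and dropping empty strings is an assumption made here to keep the output readable; the tokenizer's own implementation may differ in details.

```python
import re

def split_tokens(text, pattern=r"\W+"):
    """Split text on every match of pattern (the separators) and keep the rest."""
    # re.split can yield empty strings at the edges; drop them (assumption).
    return [piece for piece in re.split(pattern, text) if piece]

print(split_tokens("The quick, brown fox!"))  # ['The', 'quick', 'brown', 'fox']
```

The pattern describes what lies between the tokens (runs of non-word characters), which is why the punctuation and spaces disappear from the result.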

*********************************************
Note that you may need to escape the `pattern` string literal according to
your client language's rules. For example, in many programming languages
a string literal for the `\W+` pattern is written as `"\\W+"`.
There is nothing special about `pattern` (you may have to escape other
string literals as well); escaping `pattern` is common just because it
often contains characters that should be escaped.
*********************************************
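To make the escaping concrete, here is a small sketch in Python (chosen only for illustration). It shows that the escaped literal and a raw string denote the same three-character pattern, and that JSON encoding, assumed here as the format a client would send, adds its own backslash. The settings dictionary is a hypothetical fragment, not a complete tokenizer definition.

```python
import json

escaped = "\\W+"   # ordinary string literal: the backslash must be doubled
raw = r"\W+"       # raw string literal: no doubling needed

# Both literals denote the same three characters: backslash, W, plus.
assert escaped == raw
assert len(raw) == 3

# Hypothetical settings fragment; JSON escapes the backslash once more.
settings_json = json.dumps({"type": "pattern", "pattern": raw})
print(settings_json)
```

Decoding the JSON again yields the original three-character pattern, so the extra backslashes exist only in the written form and never reach the regular expression itself.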

Setting `group` to `-1` (the default) is equivalent to "split". A `group`
value >= 0 selects that capture group of each match as the token (`0` is
the entire match). For example, if you have:

------------------------
pattern = '([^']+)'
group = 0
input = aaa 'bbb' 'ccc'
------------------------

the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
marks). With the same input but using group=1, the output would be:
`bbb` and `ccc` (no `'` marks).
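The group example can be reproduced with a short sketch. It again uses Python's `re` module as a stand-in for the tokenizer's regular expression engine, and the helper name `group_tokens` is hypothetical; it only mirrors the behaviour described above for `group` values of `0` or greater.

```python
import re

def group_tokens(text, pattern, group):
    """Emit the selected capture group of every match as a token."""
    return [match.group(group) for match in re.finditer(pattern, text)]

text = "aaa 'bbb' 'ccc'"
pattern = r"'([^']+)'"

print(group_tokens(text, pattern, 0))  # ["'bbb'", "'ccc'"]
print(group_tokens(text, pattern, 1))  # ['bbb', 'ccc']
```

Text that no match covers, such as `aaa`, produces no token in this mode, because only the matched groups are emitted.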