30 lines
1.1 KiB
Plaintext
30 lines
1.1 KiB
Plaintext
|
[[analysis-pattern-tokenizer]]
|
||
|
=== Pattern Tokenizer
|
||
|
|
||
|
A tokenizer of type `pattern` that can flexibly separate text into terms
|
||
|
via a regular expression. Accepts the following settings:
|
||
|
|
||
|
[cols="<,<",options="header",]
|
||
|
|======================================================================
|
||
|
|Setting |Description
|
||
|
|`pattern` |The regular expression pattern, defaults to `\\W+`.
|
||
|
|`flags` |The regular expression flags.
|
||
|
|`group` |Which group to extract into tokens. Defaults to `-1` (split).
|
||
|
|======================================================================
|
||
|
|
||
|
*IMPORTANT*: The regular expression should match the *token separators*,
|
||
|
not the tokens themselves.
|
||
|
|
||
|
`group` set to `-1` (the default) is equivalent to "split". Using group
|
||
|
>= 0 selects the matching group as the token. For example, if you have:
|
||
|
|
||
|
------------------------
|
||
|
pattern = \\'([^\']+)\\'
|
||
|
group = 0
|
||
|
input = aaa 'bbb' 'ccc'
|
||
|
------------------------
|
||
|
|
||
|
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
|
||
|
With the same input but using group=1, the output would be: bbb and ccc
|
||
|
(no ' marks).
|