OpenSearch/docs/reference/analysis/tokenizers/pattern-tokenizer.asciidoc

[[analysis-pattern-tokenizer]]
=== Pattern Tokenizer

A tokenizer of type `pattern` that can flexibly separate text into terms
via a regular expression. Accepts the following settings:

[cols="<,<",options="header",]
|======================================================================
|Setting |Description
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags.
|`group` |Which group to extract into tokens. Defaults to `-1` (split).
|======================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

*********************************************
Note that you may need to escape `pattern` string literal according to
your client language rules. For example, in many programming languages
a string literal for `\W+` pattern is written as `"\\W+"`.
There is nothing special about `pattern` (you may have to escape other
string literals as well); escaping `pattern` is common just because it
often contains characters that should be escaped.
*********************************************

`group` set to `-1` (the default) is equivalent to "split". Using group
>= 0 selects the matching group as the token. For example, if you have:

------------------------
pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'
------------------------

the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
marks). With the same input but using group=1, the output would be:
`bbb` and `ccc` (no `'` marks).
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`[[analysis-pattern-tokenizer]]`
			`=== Pattern Tokenizer`

			A tokenizer of type `pattern` that can flexibly separate text into terms
			`via a regular expression. Accepts the following settings:`

			`[cols="<,<",options="header",]`
			`\|======================================================================`
			`\|Setting \|Description`
Docs: unescape regexes in Pattern Tokenizer docs Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615 2014-06-25 13:18:43 -04:00			\|`pattern` \|The regular expression pattern, defaults to `\W+`.
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			\|`flags` \|The regular expression flags.
			\|`group` \|Which group to extract into tokens. Defaults to `-1` (split).
			`\|======================================================================`

			`IMPORTANT: The regular expression should match the token separators,`
			`not the tokens themselves.`

Docs: unescape regexes in Pattern Tokenizer docs Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615 2014-06-25 13:18:43 -04:00			`*********************************************`
			Note that you may need to escape `pattern` string literal according to
			`your client language rules. For example, in many programming languages`
			a string literal for `\W+` pattern is written as `"\\W+"`.
			There is nothing special about `pattern` (you may have to escape other
			string literals as well); escaping `pattern` is common just because it
			`often contains characters that should be escaped.`
			`*********************************************`

Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`group` set to `-1` (the default) is equivalent to "split". Using group
			`>= 0 selects the matching group as the token. For example, if you have:`

			`------------------------`
Docs: unescape regexes in Pattern Tokenizer docs Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615 2014-06-25 13:18:43 -04:00			`pattern = '([^']+)'`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`group = 0`
			`input = aaa 'bbb' 'ccc'`
			`------------------------`

Docs: unescape regexes in Pattern Tokenizer docs Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`. Closes #6615 2014-06-25 13:18:43 -04:00			the output will be two tokens: `'bbb'` and `'ccc'` (including the `'`
			`marks). With the same input but using group=1, the output would be:`
			`bbb` and `ccc` (no `'` marks).