
---
layout: default
title: Tokenizers
nav_order: 60
has_children: false
has_toc: false
---

# Tokenizers

A tokenizer receives a stream of characters and splits the text into individual tokens. A token consists of a term (usually a word) and metadata about that term. For example, a tokenizer can split text on white space so that the text `Actions speak louder than words.` becomes `[Actions, speak, louder, than, words.]`.

The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:

- The order or position of each token: This information is used for word and phrase proximity queries.
- The starting and ending positions (offsets) of the tokens in the text: This information is used for highlighting search terms.
- The token type: Some tokenizers (for example, `standard`) classify tokens by type, for example, `<ALPHANUM>` or `<NUM>`. Simpler tokenizers (for example, `letter`) only classify tokens as type `word`.
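
To inspect this metadata directly, you can pass a tokenizer to the `_analyze` API. For example, the following request tokenizes the sentence above with the `standard` tokenizer:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "text": "Actions speak louder than words."
}
```

Each token in the response includes its term (`token`) along with the `start_offset`, `end_offset`, `type`, and `position` fields described in the preceding list.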

You can use tokenizers to define custom analyzers.
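
As a minimal sketch (the index and analyzer names here are illustrative), the following request defines a custom analyzer that pairs the `whitespace` tokenizer with a `lowercase` token filter:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```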

## Built-in tokenizers

The following tables list the built-in tokenizers that OpenSearch provides.

### Word tokenizers

Word tokenizers parse full text into words.

| Tokenizer | Description | Example |
| :--- | :--- | :--- |
| `standard` | - Parses strings into tokens at word boundaries <br> - Removes most punctuation | `It's fun to contribute a brand-new PR or 2 to OpenSearch!` <br> becomes <br> `[It's, fun, to, contribute, a, brand, new, PR, or, 2, to, OpenSearch]` |
| `letter` | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters | `It's fun to contribute a brand-new PR or 2 to OpenSearch!` <br> becomes <br> `[It, s, fun, to, contribute, a, brand, new, PR, or, to, OpenSearch]` |
| `lowercase` | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts terms to lowercase | `It's fun to contribute a brand-new PR or 2 to OpenSearch!` <br> becomes <br> `[it, s, fun, to, contribute, a, brand, new, pr, or, to, opensearch]` |
| `whitespace` | - Parses strings into tokens at white space characters | `It's fun to contribute a brand-new PR or 2 to OpenSearch!` <br> becomes <br> `[It's, fun, to, contribute, a, brand-new, PR, or, 2, to, OpenSearch!]` |
| `uax_url_email` | - Similar to the `standard` tokenizer <br> - Unlike the `standard` tokenizer, leaves URLs and email addresses as single terms | `It's fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!` <br> becomes <br> `[It's, fun, to, contribute, a, brand, new, PR, or, 2, to, OpenSearch, opensearch-project@github.com]` |
| `classic` | - Parses strings into tokens on punctuation characters that are followed by a white space character and on hyphens if the term does not contain numbers <br> - Removes punctuation <br> - Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)` <br> becomes <br> `[Part, number, PA-35234, single, use, product, 128.32]` |
| `thai` | - Parses Thai text into terms | `สวัสดีและยินดีต้อนรับ` <br> becomes <br> `[สวัสดี, และ, ยินดี, ต้อนรับ]` |
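
You can reproduce any row in this table with the `_analyze` API. For example, the following request shows that `uax_url_email` keeps the email address as a single term:

```json
POST /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "It's fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!"
}
```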

### Partial word tokenizers

Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.

| Tokenizer | Description | Example |
| :--- | :--- | :--- |
| `ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo` <br> becomes <br> `[M, My, y, y ,  ,  r, r, re, e, ep, p, po, o]` <br> because the default n-gram length is 1–2 characters |
| `edge_ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo` <br> becomes <br> `[M, My]` <br> because the default n-gram length is 1–2 characters |
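
Both tokenizers accept settings that override the 1–2 character default, most notably `min_gram` and `max_gram`. The following sketch (the index, tokenizer, and analyzer names are illustrative) configures an `edge_ngram` tokenizer for autocomplete-style prefix matching:

```json
PUT /my-autocomplete-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 6,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_ngram"
        }
      }
    }
  }
}
```

With these settings, `My repo` would produce `[My, re, rep, repo]` rather than the single-character grams shown in the table.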

### Structured text tokenizers

Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.

| Tokenizer | Description | Example |
| :--- | :--- | :--- |
| `keyword` | - No-op tokenizer <br> - Outputs the entire string unchanged <br> - Can be combined with token filters, like `lowercase`, to normalize terms | `My repo` <br> becomes <br> `My repo` |
| `pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms <br> - Uses Java regular expressions | `https://opensearch.org/forum` <br> becomes <br> `[https, opensearch, org, forum]` because by default the tokenizer splits terms at word boundaries (`\W+`) <br> Can be configured with a regex pattern |
| `simple_pattern` | - Uses a regular expression pattern to return matching text as terms <br> - Uses Lucene regular expressions <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default <br> Must be configured with a pattern because the pattern defaults to an empty string |
| `simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms <br> - Uses Lucene regular expressions <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default <br> Must be configured with a pattern |
| `char_group` | - Parses on a set of configurable characters <br> - Faster than tokenizers that run regular expressions | No-op by default <br> Must be configured with a list of characters |
| `path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` <br> becomes <br> `[one, one/two, one/two/three]` |
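
Because `simple_pattern` matches nothing by default, it is only useful with an explicit pattern. The following sketch (the index, tokenizer, and analyzer names are illustrative) defines a tokenizer that emits three-digit sequences as terms:

```json
PUT /my-pattern-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "three_digit_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0-9]{3}"
        }
      },
      "analyzer": {
        "digit_analyzer": {
          "type": "custom",
          "tokenizer": "three_digit_tokenizer"
        }
      }
    }
  }
}
```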