OpenSearch/docs/reference/analysis/tokenizers.asciidoc

[[analysis-tokenizers]]
== Tokenizer reference

A _tokenizer_  receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace.  It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following:

* Order or _position_ of each term (used for phrase and word proximity queries)
* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets).
* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.

Elasticsearch has a number of built in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:

<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages.

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer,  divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar based tokenizer for the English Language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word: a sliding window of continuous letters, e.g. `quick` ->
`[qu, ui, ic, ck]`.

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
`[q, qu, qui, quic, quick]`.


[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term.  It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the `pattern` tokenizer.

<<analysis-chargroup-tokenizer,Char Group Tokenizer>>::

The `char_group` tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simple_pattern_split` tokenizer uses the same restricted regular expression
subset as the `simple_pattern` tokenizer, but splits the input at matches rather
than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz ]`.


include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]