[[analysis-tokenizers]]
== Tokenizers

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following:

* Order or _position_ of each term (used for phrase and word proximity queries).
* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets).
* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
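To see what a tokenizer records, you can run it directly through the
`_analyze` API. The request below is a minimal illustration using the sample
text from above; it applies the `whitespace` tokenizer and returns each term
together with its position, character offsets, and token type:

[source,console]
----
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
----

For this text, the `whitespace` tokenizer should return the terms `Quick`,
`brown`, and `fox!` at positions 0, 1, and 2, with character offsets 0-5,
6-11, and 12-16, each with the token type `word`.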
Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:

<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages.

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar-based tokenizer for the English language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it
returns n-grams of each word: a sliding window of continuous letters, e.g.
`quick` -> `[qu, ui, ic, ck]`.

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any
of a list of specified characters (e.g. whitespace or punctuation), then it
returns n-grams of each word which are anchored to the start of the word, e.g.
`quick` -> `[q, qu, qui, quic, quick]`.

[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features and
is generally faster than the `pattern` tokenizer.

<<analysis-chargroup-tokenizer,Char Group Tokenizer>>::

The `char_group` tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simple_pattern_split` tokenizer uses the same restricted regular
expression subset as the `simple_pattern` tokenizer, but splits the input at
matches rather than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Hierarchy Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz]`.

include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]