[[analysis-tokenizers]]
== Tokenizers

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following:

* Order or _position_ of each term (used for phrase and word proximity queries)
* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets).
* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
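
As a rough illustration of the bookkeeping described above, the following Python sketch splits text on whitespace and records a position, start and end character offsets, and a token type for each term. It is not Elasticsearch's implementation. The function name `whitespace_tokenize` is hypothetical, and the dictionary keys are only modelled loosely on the fields an analysis response might contain.

```python
import re
from typing import Dict, List, Union

Token = Dict[str, Union[str, int]]


def whitespace_tokenize(text: str) -> List[Token]:
    """Split text on whitespace, recording position, offsets, and type."""
    tokens: List[Token] = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "position": position,           # order of the term in the stream
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index one past the last character
            "type": "word",                 # assumed type for this simple sketch
        })
    return tokens


for token in whitespace_tokenize("Quick brown fox!"):
    print(token)
```

For `"Quick brown fox!"` the sketch yields the terms `Quick`, `brown`, and `fox!` at positions 0, 1, and 2, with `fox!` spanning character offsets 12 to 16.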

Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.
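
To make the relationship between a tokenizer and a custom analyzer concrete, here is a minimal sketch of index settings expressed as a Python dictionary. The analyzer name `my_custom_analyzer` is hypothetical, and the choice of the `standard` tokenizer with a `lowercase` token filter is only an example. Treat the exact shape of the body as an assumption to check against the custom analyzer reference.

```python
import json

# Hypothetical index settings; "my_custom_analyzer" is an illustrative name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # a built-in tokenizer
                    "filter": ["lowercase"],  # optional token filters
                }
            }
        }
    }
}

# Serialise the settings as JSON, e.g. for use as a request body.
print(json.dumps(index_settings, indent=2))
```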

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:

<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages.

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar-based tokenizer for the English language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.
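
The difference between the `letter`, `lowercase`, and `whitespace` tokenizers is easiest to see on the same input. The Python sketch below only approximates their splitting rules with a regular expression and `str.split`. The function names are hypothetical, and the letter test is an approximation of "is a letter" rather than the exact rule Elasticsearch applies.

```python
import re
from typing import List

# [^\W\d_] matches a word character that is neither a digit nor an
# underscore, which approximates "any letter" for this sketch.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_tokenize(text: str) -> List[str]:
    """Split whenever a non-letter character is encountered."""
    return LETTER_RUN.findall(text)


def lowercase_tokenize(text: str) -> List[str]:
    """Split like letter_tokenize, then lowercase every term."""
    return [term.lower() for term in letter_tokenize(text)]


def whitespace_split(text: str) -> List[str]:
    """Split only on whitespace, keeping digits and punctuation."""
    return text.split()


sample = "The 2 QUICK Brown-Foxes"
print(letter_tokenize(sample))     # ['The', 'QUICK', 'Brown', 'Foxes']
print(lowercase_tokenize(sample))  # ['the', 'quick', 'brown', 'foxes']
print(whitespace_split(sample))    # ['The', '2', 'QUICK', 'Brown-Foxes']
```

Note that the letter-based variants drop the digit `2` and split `Brown-Foxes` at the hyphen, while the whitespace variant keeps both intact.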

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), and then returns
n-grams of each word: a sliding window of contiguous letters, e.g. `quick` ->
`[qu, ui, ic, ck]`.

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), and then returns
n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
`[q, qu, qui, quic, quick]`.
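
The two examples above can be reproduced with a short Python sketch. It covers only the n-gram step for a single word and leaves out the initial splitting of text into words. The function names are hypothetical, and the `min_gram` and `max_gram` arguments are named after the tokenizers' length settings purely for readability.

```python
from typing import List


def ngrams(word: str, min_gram: int, max_gram: int) -> List[str]:
    """Return every substring of length min_gram..max_gram, left to right."""
    grams: List[str] = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word: str, min_gram: int, max_gram: int) -> List[str]:
    """Return only the n-grams anchored to the start of the word."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(ngrams("quick", 2, 2))       # ['qu', 'ui', 'ic', 'ck']
print(edge_ngrams("quick", 1, 5))  # ['q', 'qu', 'qui', 'quic', 'quick']
```

The sliding window in `ngrams` moves one character at a time, whereas `edge_ngrams` keeps the window pinned to the first character and only grows it, which is why its output suits prefix-style matching.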

[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the `pattern` tokenizer.

<<analysis-chargroup-tokenizer,Char Group Tokenizer>>::

The `char_group` tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simple_pattern_split` tokenizer uses the same restricted regular expression
subset as the `simple_pattern` tokenizer, but splits the input at matches rather
than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Hierarchy Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz]`.
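
Two of the tokenizers above are simple enough to sketch in a few lines of Python. The functions below are illustrations under stated assumptions, not the Elasticsearch implementations. `path_hierarchy` assumes a single-character delimiter. `char_group` takes a plain set of literal characters to split on, and its `tokenize_on_chars` argument is named after the tokenizer's setting only for readability.

```python
from typing import List, Set


def path_hierarchy(path: str, delimiter: str = "/") -> List[str]:
    """Emit one term per level of the hierarchy, each including its ancestors."""
    terms: List[str] = []
    # Start searching at index 1 so a leading delimiter stays in the first term.
    end = path.find(delimiter, 1)
    while end != -1:
        terms.append(path[:end])
        end = path.find(delimiter, end + 1)
    if path:
        terms.append(path)  # the full path is the deepest term
    return terms


def char_group(text: str, tokenize_on_chars: Set[str]) -> List[str]:
    """Start a new term whenever a character from the given set is seen."""
    terms: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in tokenize_on_chars:
            if current:  # skip empty terms between adjacent separators
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


print(path_hierarchy("/foo/bar/baz"))               # ['/foo', '/foo/bar', '/foo/bar/baz']
print(char_group("Brown-Foxes jumped", {" ", "-"}))  # ['Brown', 'Foxes', 'jumped']
```

Because `char_group` only tests set membership for each character, it avoids the cost of running a regular expression engine, which is the trade-off the description above alludes to.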

include::tokenizers/chargroup-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer-examples.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]