[[analysis-tokenizers]]
== Tokenizers

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term (used for phrase and word proximity queries) and the start and end
_character offsets_ of the original word which the term represents (used for
highlighting search snippets).
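
For example, you can see the terms, positions, and character offsets that a
tokenizer produces by running text through the `_analyze` API. The snippet
below is a minimal sketch using the `whitespace` tokenizer from the example
above:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
--------------------------------------------------

The response lists the terms `[Quick, brown, fox!]`, each with its `position`
and its `start_offset` and `end_offset` in the original text.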

Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.
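
For instance, a tokenizer can be combined with token filters in the index
settings to define a custom analyzer. The following is only a sketch; the
index name `my_index` and analyzer name `my_analyzer` are placeholders:

[source,console]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
--------------------------------------------------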

[float]
=== Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into
individual words:

<<analysis-standard-tokenizer,Standard Tokenizer>>::

The `standard` tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages. See the
example after this list.

<<analysis-letter-tokenizer,Letter Tokenizer>>::

The `letter` tokenizer divides text into terms whenever it encounters a
character which is not a letter.

<<analysis-lowercase-tokenizer,Lowercase Tokenizer>>::

The `lowercase` tokenizer, like the `letter` tokenizer, divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.

<<analysis-whitespace-tokenizer,Whitespace Tokenizer>>::

The `whitespace` tokenizer divides text into terms whenever it encounters any
whitespace character.

<<analysis-uaxurlemail-tokenizer,UAX URL Email Tokenizer>>::

The `uax_url_email` tokenizer is like the `standard` tokenizer except that it
recognises URLs and email addresses as single tokens.

<<analysis-classic-tokenizer,Classic Tokenizer>>::

The `classic` tokenizer is a grammar-based tokenizer for the English language.

<<analysis-thai-tokenizer,Thai Tokenizer>>::

The `thai` tokenizer segments Thai text into words.
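
Any of these word oriented tokenizers can be tried with the `_analyze` API.
As a minimal sketch, running the text from the earlier example through the
`standard` tokenizer strips the trailing punctuation that the `whitespace`
tokenizer kept:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox!"
}
--------------------------------------------------

Here the response contains the terms `[Quick, brown, fox]`, because the
`standard` tokenizer removes most punctuation symbols.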

[float]
=== Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word
matching:

<<analysis-ngram-tokenizer,N-Gram Tokenizer>>::

The `ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word: a sliding window of contiguous letters, e.g. `quick` ->
`[qu, ui, ic, ck]`.

<<analysis-edgengram-tokenizer,Edge N-Gram Tokenizer>>::

The `edge_ngram` tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word which are anchored to the start of the word, e.g. `quick` ->
`[q, qu, qui, quic, quick]`. See the sketch after this list.
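
Both tokenizers accept `min_gram` and `max_gram` settings that control the
gram lengths. As a minimal sketch, the output shown above for `quick` can be
reproduced by passing an inline `edge_ngram` tokenizer definition to the
`_analyze` API with gram sizes that cover the whole word:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5
  },
  "text": "quick"
}
--------------------------------------------------

This returns the terms `[q, qu, qui, quic, quick]`.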

[float]
=== Structured Text Tokenizers

The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:

<<analysis-keyword-tokenizer,Keyword Tokenizer>>::

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters like <<analysis-lowercase-tokenfilter,`lowercase`>> to
normalise the analysed terms.

<<analysis-pattern-tokenizer,Pattern Tokenizer>>::

The `pattern` tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.

<<analysis-simplepattern-tokenizer,Simple Pattern Tokenizer>>::

The `simplepattern` tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the `pattern` tokenizer.

<<analysis-simplepatternsplit-tokenizer,Simple Pattern Split Tokenizer>>::

The `simplepatternsplit` tokenizer uses the same restricted regular expression
subset as the `simplepattern` tokenizer, but splits the input at matches rather
than returning the matches as terms.

<<analysis-pathhierarchy-tokenizer,Path Hierarchy Tokenizer>>::

The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. `/foo/bar/baz` -> `[/foo, /foo/bar, /foo/bar/baz]`. See the
example after this list.
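
The `path_hierarchy` output shown above can be reproduced with the `_analyze`
API. This is a minimal sketch relying on the tokenizer's default `/` separator:

[source,console]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}
--------------------------------------------------

The response contains the terms `[/foo, /foo/bar, /foo/bar/baz]`.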

include::tokenizers/standard-tokenizer.asciidoc[]

include::tokenizers/letter-tokenizer.asciidoc[]

include::tokenizers/lowercase-tokenizer.asciidoc[]

include::tokenizers/whitespace-tokenizer.asciidoc[]

include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

include::tokenizers/classic-tokenizer.asciidoc[]

include::tokenizers/thai-tokenizer.asciidoc[]

include::tokenizers/ngram-tokenizer.asciidoc[]

include::tokenizers/edgengram-tokenizer.asciidoc[]

include::tokenizers/keyword-tokenizer.asciidoc[]

include::tokenizers/pattern-tokenizer.asciidoc[]

include::tokenizers/simplepattern-tokenizer.asciidoc[]

include::tokenizers/simplepatternsplit-tokenizer.asciidoc[]

include::tokenizers/pathhierarchy-tokenizer.asciidoc[]