61 lines
2.3 KiB
Plaintext
61 lines
2.3 KiB
Plaintext
[[analyzer-anatomy]]
|
|
== Anatomy of an analyzer
|
|
|
|
An _analyzer_ -- whether built-in or custom -- is just a package which
|
|
contains three lower-level building blocks: _character filters_,
|
|
_tokenizers_, and _token filters_.
|
|
|
|
The built-in <<analysis-analyzers,analyzers>> pre-package these building
|
|
blocks into analyzers suitable for different languages and types of text.
|
|
Elasticsearch also exposes the individual building blocks so that they can be
|
|
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
|
|
|
|
[float]
|
|
=== Character filters
|
|
|
|
A _character filter_ receives the original text as a stream of characters and
|
|
can transform the stream by adding, removing, or changing characters. For
|
|
instance, a character filter could be used to convert Hindu-Arabic numerals
|
|
(٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML
|
|
elements like `<b>` from the stream.
|
|
|
|
An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
|
|
which are applied in order.
|
|
|
|
[float]
|
|
=== Tokenizer
|
|
|
|
A _tokenizer_ receives a stream of characters, breaks it up into individual
|
|
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
|
|
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
|
|
text into tokens whenever it sees any whitespace. It would convert the text
|
|
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
|
|
|
|
The tokenizer is also responsible for recording the order or _position_ of
|
|
each term and the start and end _character offsets_ of the original word which
|
|
the term represents.
|
|
|
|
An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
|
|
|
|
|
|
[float]
|
|
=== Token filters
|
|
|
|
A _token filter_ receives the token stream and may add, remove, or change
|
|
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
|
|
filter converts all tokens to lowercase, a
|
|
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
|
|
(_stop words_) like `the` from the token stream, and a
|
|
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
|
|
into the token stream.
|
|
|
|
Token filters are not allowed to change the position or character offsets of
|
|
each token.
|
|
|
|
An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
|
|
which are applied in order.
|
|
|
|
|
|
|
|
|