OpenSearch/docs/reference/analysis/anatomy.asciidoc

[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is a package that
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.

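
As a rough illustration of how the three building blocks are combined, the sketch below assembles the settings body for a `custom` analyzer as a Python dictionary. The analyzer name `my_custom_analyzer` is hypothetical, and the particular choice of `html_strip`, `standard`, `lowercase`, and `stop` is only one possible combination, not a recommended one.

```python
import json

# Hypothetical custom analyzer: the name and the chosen components are
# assumptions made for illustration only.
custom_analyzer = {
    "type": "custom",
    "char_filter": ["html_strip"],    # zero or more character filters
    "tokenizer": "standard",          # exactly one tokenizer
    "filter": ["lowercase", "stop"],  # zero or more token filters
}

# Index settings body that registers the analyzer under a chosen name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {"my_custom_analyzer": custom_analyzer}
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The printed JSON could be sent as the body of an index-creation request. The `char_filter` and `filter` lists are ordered, which matters because each stage feeds the next.
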

[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Arabic-Indic digits
(٠١٢٣٤٥٦٧٨٩) into their Western Arabic (ASCII) equivalents (0123456789), or to
strip HTML elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.

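
The behaviour described above can be modelled in a few lines of Python. This is a conceptual sketch only, not how the engine implements character filters: the function names are hypothetical, and the regular expression is a naive stand-in for real HTML stripping.

```python
import re

# Map Arabic-Indic digits (U+0660..U+0669) to the ASCII digits 0-9.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}


def map_digits(text: str) -> str:
    """Replace Arabic-Indic digits with their ASCII equivalents."""
    return text.translate(DIGIT_MAP)


def strip_html(text: str) -> str:
    """Remove anything that looks like a tag (naive, for illustration)."""
    return re.sub(r"<[^>]+>", "", text)


def apply_char_filters(text, char_filters):
    """Apply zero or more character filters, in the order given."""
    for char_filter in char_filters:
        text = char_filter(text)
    return text


# "\u0664\u0662" is the Arabic-Indic spelling of 42.
print(apply_char_filters("<b>\u0664\u0662</b> foxes", [strip_html, map_digits]))
# prints: 42 foxes
```

Passing an empty list returns the text unchanged, which mirrors the "zero or more" rule.
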

[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.

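
To make positions and offsets concrete, here is a minimal Python model of a whitespace-style tokenizer. The `Token` type and `whitespace_tokenize` function are hypothetical names introduced for this sketch, and the end offset is assumed to be exclusive (the index just past the last character).

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str          # the token text
    position: int      # order of the token in the stream, starting at 0
    start_offset: int  # index of the first character in the original text
    end_offset: int    # index just past the last character (assumed exclusive)


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on runs of whitespace, recording position and offsets."""
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokenize("Quick brown fox!"):
    print(token)
# Token(term='Quick', position=0, start_offset=0, end_offset=5)
# Token(term='brown', position=1, start_offset=6, end_offset=11)
# Token(term='fox!', position=2, start_offset=12, end_offset=16)
```

The punctuation stays attached to `fox!` because this tokenizer splits only on whitespace.
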

[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
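
The following Python sketch models a chain of token filters under simplifying assumptions. The `Token` type, the function names, and the three-word stop list are all hypothetical. It illustrates two points from the text: filters change or remove tokens without rewriting the positions and offsets of the tokens that remain, and the order of the filters affects the result.

```python
from typing import Callable, List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    start_offset: int
    end_offset: int


TokenFilter = Callable[[List[Token]], List[Token]]

# Tiny illustrative stop list; real stop filters ship much larger lists.
STOP_WORDS = {"the", "a", "an"}


def lowercase(tokens: List[Token]) -> List[Token]:
    """Change each term, leaving position and offsets untouched."""
    return [t._replace(term=t.term.lower()) for t in tokens]


def stop(tokens: List[Token]) -> List[Token]:
    """Drop stop words; surviving tokens keep their original positions."""
    return [t for t in tokens if t.term not in STOP_WORDS]


def apply_token_filters(tokens: List[Token], filters: List[TokenFilter]) -> List[Token]:
    """Apply zero or more token filters, in the order given."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = [Token("The", 0, 0, 3), Token("Quick", 1, 4, 9), Token("fox", 2, 10, 13)]
for token in apply_token_filters(tokens, [lowercase, stop]):
    print(token)
# Token(term='quick', position=1, start_offset=4, end_offset=9)
# Token(term='fox', position=2, start_offset=10, end_offset=13)
```

Reversing the order to `[stop, lowercase]` would leave `the` in the output, because this `stop` sketch compares terms case-sensitively and sees `The` before it has been lowercased.
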