[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_, whether built-in or custom, is a package of three lower-level
building blocks: _character filters_, _tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
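
As a rough mental model only, the way the three building blocks combine can be
sketched in a few lines of Python. This is not how the analysis chain is
actually implemented; `make_analyzer`, `strip_html`, `whitespace`, and
`lowercase` are hypothetical names invented for this illustration.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def make_analyzer(char_filters: List[CharFilter],
                  tokenizer: Tokenizer,
                  token_filters: List[TokenFilter]) -> Callable[[str], List[str]]:
    """Compose the three building blocks into one analyze() function."""
    def analyze(text: str) -> List[str]:
        for char_filter in char_filters:    # zero or more, applied in order
            text = char_filter(text)
        tokens = tokenizer(text)            # exactly one
        for token_filter in token_filters:  # zero or more, applied in order
            tokens = token_filter(tokens)
        return tokens
    return analyze


# Hypothetical stand-ins for real building blocks (assumptions, not real APIs).
def strip_html(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)     # crude regex, not an HTML parser


def whitespace(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


analyze = make_analyzer([strip_html], whitespace, [lowercase])
print(analyze("<b>Quick</b> brown fox!"))   # ['quick', 'brown', 'fox!']
```

For brevity, tokens are modeled here as plain strings; real tokens also carry
positions and character offsets.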

[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Eastern Arabic numerals
(٠١٢٣٤٥٦٧٨٩) into their Western Arabic (Latin) equivalents (0123456789), or to
strip HTML elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
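
The two transformations mentioned above, and the "applied in order" rule, can
be illustrated with a small Python sketch. It models character filters as plain
string functions, which is a simplification; `map_digits`, `strip_html`, and
`apply_char_filters` are hypothetical names, and the regex is a crude stand-in
for real HTML handling.

```python
import re
from typing import Callable, Iterable

# Eastern Arabic digits occupy U+0660..U+0669; map each one to 0..9.
EASTERN_ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_MAP = str.maketrans(EASTERN_ARABIC_DIGITS, "0123456789")


def map_digits(text: str) -> str:
    """Replace Eastern Arabic digits with their Latin equivalents."""
    return text.translate(DIGIT_MAP)


def strip_html(text: str) -> str:
    """Remove anything that looks like an HTML tag (illustration only)."""
    return re.sub(r"<[^>]+>", "", text)


def apply_char_filters(text: str,
                       char_filters: Iterable[Callable[[str], str]]) -> str:
    """Run each character filter over the output of the previous one."""
    for char_filter in char_filters:
        text = char_filter(text)
    return text


# "\u0664\u0662" is the Eastern Arabic spelling of 42.
print(apply_char_filters("<b>\u0664\u0662</b> items", [strip_html, map_digits]))
# 42 items
```

An empty list of filters returns the text unchanged, matching the "zero or
more" rule.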

[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs them as a token stream. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
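
To make positions and character offsets concrete, here is a minimal Python
sketch of a whitespace-style tokenizer. It is a toy model, not the actual
`whitespace` tokenizer; the `Token` type and `whitespace_tokenize` function are
hypothetical, and the sketch assumes zero-based positions and an exclusive end
offset.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int      # order of the term in the stream, starting at 0
    start_offset: int  # index of the term's first character in the text
    end_offset: int    # index just past the term's last character


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on runs of whitespace, recording position and offsets."""
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokenize("Quick brown fox!"):
    print(token)
# Token(term='Quick', position=0, start_offset=0, end_offset=5)
# Token(term='brown', position=1, start_offset=6, end_offset=11)
# Token(term='fox!', position=2, start_offset=12, end_offset=16)
```

Slicing the original text with a token's offsets recovers the word the term
came from.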

[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
tokens.  For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.

Token filters are not allowed to change the position or character offsets of
each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
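
The three example filters can be sketched in Python as functions over a list of
tokens. This is an illustration under simplifying assumptions, not the real
filter implementations: the `Token` type, the function names, the stop word
list, and the synonym table are all hypothetical. The sketch also shows one way
to read the rule above: removing or adding tokens leaves the positions and
offsets of the surviving tokens untouched, and an injected synonym reuses the
position and offsets of the token it was derived from.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    start_offset: int
    end_offset: int


STOP_WORDS: FrozenSet[str] = frozenset({"the"})    # assumed stop word list
SYNONYMS: Dict[str, List[str]] = {"quick": ["fast"]}  # assumed synonym table


def lowercase(tokens: List[Token]) -> List[Token]:
    """Change each term; position and offsets are carried over as-is."""
    return [t._replace(term=t.term.lower()) for t in tokens]


def stop(tokens: List[Token]) -> List[Token]:
    """Remove stop words; remaining tokens keep their original positions."""
    return [t for t in tokens if t.term not in STOP_WORDS]


def synonym(tokens: List[Token]) -> List[Token]:
    """Add synonyms at the same position and offsets as the source token."""
    result: List[Token] = []
    for t in tokens:
        result.append(t)
        result.extend(t._replace(term=alt) for alt in SYNONYMS.get(t.term, []))
    return result


# Tokens a whitespace-style tokenizer might emit for "The Quick fox".
tokens = [Token("The", 0, 0, 3), Token("Quick", 1, 4, 9), Token("fox", 2, 10, 13)]

# Filters are applied in order: lowercase first, so `stop` sees "the".
for token_filter in (lowercase, stop, synonym):
    tokens = token_filter(tokens)

print([(t.term, t.position) for t in tokens])
# [('quick', 1), ('fast', 1), ('fox', 2)]
```

Order matters here: running `stop` before `lowercase` would leave `The` in the
stream, because the assumed stop word list contains only the lowercase form.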