[[analyzer-anatomy]]
=== Anatomy of an analyzer

An _analyzer_ -- whether built-in or custom -- is just a package which contains three lower-level building blocks: _character filters_, _tokenizers_, and _token filters_.

The built-in <<analysis-analyzers,analyzers>> pre-package these building blocks into analyzers suitable for different languages and types of text. Elasticsearch also exposes the individual building blocks so that they can be combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.

[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like `<b>` from the stream.

An analyzer may have *zero or more* <<analysis-charfilters,character filters>>, which are applied in order.

[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual _tokens_ (usually individual words), and outputs a stream of _tokens_. For instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text `"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the order or _position_ of each term and the start and end _character offsets_ of the original word which the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.

[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token filter converts all tokens to lowercase, a <<analysis-stop-tokenfilter,`stop`>> token filter removes common words (_stop words_) like `the` from the token stream, and a <<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>, which are applied in order.
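
The three building blocks can be exercised together with the `_analyze` API, which accepts an ad hoc combination of character filters, a tokenizer, and token filters. The following request is a minimal sketch (the choice of `html_strip`, `whitespace`, and `lowercase`, and the sample text, are illustrative only):

[source,console]
----
POST _analyze
{
  "char_filter": [ "html_strip" ], <1>
  "tokenizer": "whitespace",       <2>
  "filter": [ "lowercase" ],       <3>
  "text": "The <b>Quick</b> brown fox!"
}
----
<1> A built-in character filter that strips HTML elements such as `<b>` from the character stream.
<2> The `whitespace` tokenizer then splits the remaining text into tokens on whitespace.
<3> The `lowercase` token filter converts each token to lowercase.

The response lists each resulting token along with its `position`, `start_offset`, and `end_offset`, which correspond to the position and character offsets recorded by the tokenizer as described above.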
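
The same building blocks can also be packaged into a new `custom` analyzer in the index settings. The sketch below assumes a hypothetical index named `my-index` and analyzer named `my_custom_analyzer`, reusing the components from the request above:

[source,console]
----
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
----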