OpenSearch/docs/reference/analysis/analyzers.asciidoc

[[analysis-analyzers]]
== Built-in analyzer reference

Elasticsearch ships with a wide range of built-in analyzers, which can be used
in any index without further configuration:

<<analysis-standard-analyzer,Standard Analyzer>>::

The `standard` analyzer divides text into terms on word boundaries, as defined
by the Unicode Text Segmentation algorithm. It removes most punctuation,
lowercases terms, and supports removing stop words.

<<analysis-simple-analyzer,Simple Analyzer>>::

The `simple` analyzer divides text into terms whenever it encounters a
character that is not a letter. It lowercases all terms.

<<analysis-whitespace-analyzer,Whitespace Analyzer>>::

The `whitespace` analyzer divides text into terms whenever it encounters any
whitespace character.  It does not lowercase terms.

<<analysis-stop-analyzer,Stop Analyzer>>::

The `stop` analyzer is like the `simple` analyzer, but also supports removal
of stop words.

<<analysis-keyword-analyzer,Keyword Analyzer>>::

The `keyword` analyzer is a ``noop'' analyzer that accepts whatever text it is
given and outputs the exact same text as a single term.

<<analysis-pattern-analyzer,Pattern Analyzer>>::

The `pattern` analyzer uses a regular expression to split the text into terms.
It supports lowercasing and stop word removal.

<<analysis-lang-analyzer,Language Analyzers>>::

Elasticsearch provides many language-specific analyzers, such as `english` and
`french`.

<<analysis-fingerprint-analyzer,Fingerprint Analyzer>>::

The `fingerprint` analyzer is a specialist analyzer that creates a
fingerprint, which can be used for duplicate detection.
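
To make the differences between the simpler analyzers above more concrete, the
following Python sketch approximates their behavior. It is an illustration
only, not the Lucene implementation: the function names and the short stop
word list are assumptions made for this example, the fingerprint sketch omits
ASCII folding, and Unicode edge cases may differ from the real analyzers.

```python
import re

# Hypothetical, abbreviated stop word list used only for this sketch.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def simple_analyze(text):
    """Approximate `simple`: split on non-letters, lowercase every term."""
    # [^\W\d_]+ matches runs of word characters that are neither digits
    # nor underscores, which is roughly "runs of letters".
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyze(text):
    """Approximate `whitespace`: split on whitespace, keep case."""
    return text.split()


def stop_analyze(text):
    """Approximate `stop`: like `simple`, then drop stop words."""
    return [term for term in simple_analyze(text) if term not in STOP_WORDS]


def keyword_analyze(text):
    """Approximate `keyword`: emit the whole input as a single term."""
    return [text]


def fingerprint_analyze(text):
    """Approximate `fingerprint`: lowercase, deduplicate, sort, and join."""
    terms = re.findall(r"\w+", text.lower())
    return [" ".join(sorted(set(terms)))]


if __name__ == "__main__":
    sample = "The QUICK Brown-Foxes"
    for analyze in (simple_analyze, whitespace_analyze, stop_analyze,
                    keyword_analyze, fingerprint_analyze):
        print(analyze.__name__, analyze(sample))
```

Running the script prints the terms each sketch produces for the same input,
which shows, for example, that only the `whitespace` and `keyword` sketches
preserve the original case.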

[float]
=== Custom analyzers

If none of the built-in analyzers suits your needs, you can create a
<<analysis-custom-analyzer,`custom`>> analyzer that combines the appropriate
<<analysis-charfilters,character filters>>,
<<analysis-tokenizers,tokenizer>>, and <<analysis-tokenfilters,token filters>>.
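
As a minimal sketch of what such a combination might look like, the Python
snippet below builds the JSON body for a create-index request that defines a
custom analyzer. The analyzer name `my_custom_analyzer` is hypothetical, and
the particular choice of the `html_strip` character filter, the `standard`
tokenizer, and the `lowercase` and `asciifolding` token filters is only one
possible combination.

```python
import json

# Body of a create-index request that defines a custom analyzer.
# "my_custom_analyzer" is a hypothetical name chosen for this example.
settings_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    # Character filters run first, on the raw text.
                    "char_filter": ["html_strip"],
                    # Exactly one tokenizer splits the text into terms.
                    "tokenizer": "standard",
                    # Token filters run in order on the resulting terms.
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(settings_body, indent=2))
```

The printed JSON could then be sent as the body of a create-index request
with whichever HTTP client you prefer.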


include::analyzers/fingerprint-analyzer.asciidoc[]

include::analyzers/keyword-analyzer.asciidoc[]

include::analyzers/lang-analyzer.asciidoc[]

include::analyzers/pattern-analyzer.asciidoc[]

include::analyzers/simple-analyzer.asciidoc[]

include::analyzers/standard-analyzer.asciidoc[]

include::analyzers/stop-analyzer.asciidoc[]

include::analyzers/whitespace-analyzer.asciidoc[]