[[analysis]]
= Text analysis

:lucene-analysis-docs:  https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis
:lucene-stop-word-link: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis

[partintro]
--

_Text analysis_ is the process of converting unstructured text, like
the body of an email or a product description, into a structured format that's
optimized for search.
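
As a rough illustration only, the following Python sketch shows the general idea
of turning free text into a structure that is cheap to search. It is not how
{es} implements analysis. The `analyze` and `build_inverted_index` functions and
the `STOP_WORDS` list are hypothetical names made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; not a built-in list.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def analyze(text):
    """Toy analysis: split text into word tokens, lowercase them, drop stop words."""
    tokens = re.findall(r"\w+", text)        # tokenize on runs of word characters
    lowered = [t.lower() for t in tokens]    # normalize case
    return [t for t in lowered if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each analyzed token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


docs = {1: "The quick fox", 2: "A QUICK reply"}
index = build_inverted_index(docs)
print(analyze("The QUICK brown fox"))  # ['quick', 'brown', 'fox']
print(sorted(index["quick"]))          # [1, 2]
```

Because both documents are reduced to the same lowercase token `quick`, a lookup
for that token finds them regardless of the original capitalization.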

[discrete]
[[when-to-configure-analysis]]
=== When to configure text analysis

{es} performs text analysis when indexing or searching <<text,`text`>> fields.

If your index doesn't contain `text` fields, no further setup is needed; you can
skip the pages in this section.
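
One way to check whether an index has any `text` fields is to walk its mapping.
The Python sketch below assumes you have already loaded the index's `mappings`
object as a dictionary. The `has_text_fields` helper is a hypothetical name for
this example, not part of any client library.

```python
def has_text_fields(mapping):
    """Return True if any field in the mapping, at any depth, has type "text"."""
    for field in mapping.get("properties", {}).values():
        if field.get("type") == "text":
            return True
        # Multi-fields are declared under "fields".
        if any(sub.get("type") == "text" for sub in field.get("fields", {}).values()):
            return True
        # Object and nested fields keep their sub-fields under "properties".
        if has_text_fields(field):
            return True
    return False


# Assumed example mappings, for illustration only.
keyword_only = {"properties": {"sku": {"type": "keyword"}}}
with_text = {
    "properties": {
        "sku": {"type": "keyword"},
        "details": {"properties": {"description": {"type": "text"}}},
    }
}

print(has_text_fields(keyword_only))  # False
print(has_text_fields(with_text))     # True
```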

However, if you use `text` fields or your text searches aren't returning results
as expected, configuring text analysis can often help. You should also look into
analysis configuration if you're using {es} to:

* Build a search engine
* Mine unstructured data
* Fine-tune search for a specific language
* Perform lexicographic or linguistic research

[discrete]
[[analysis-toc]]
=== In this section

* <<analysis-overview>>
* <<analysis-concepts>>
* <<configure-text-analysis>>
* <<analysis-analyzers>>
* <<analysis-tokenizers>>
* <<analysis-tokenfilters>>
* <<analysis-charfilters>>
* <<analysis-normalizers>>

--

include::analysis/overview.asciidoc[]

include::analysis/concepts.asciidoc[]

include::analysis/configure-text-analysis.asciidoc[]

include::analysis/analyzers.asciidoc[]

include::analysis/tokenizers.asciidoc[]

include::analysis/tokenfilters.asciidoc[]

include::analysis/charfilters.asciidoc[]

include::analysis/normalizers.asciidoc[]