OpenSearch/docs/reference/analysis/stemming.asciidoc

[[stemming]]
=== Stemming

_Stemming_ is the process of reducing a word to its root form. This ensures
variants of a word match during a search.

For example, `walking` and `walked` can be stemmed to the same root word:
`walk`. Once stemmed, an occurrence of either word would match the other in a
search.

Stemming is language-dependent but often involves removing prefixes and
suffixes from words.

In some cases, the root form of a stemmed word may not be a real word. For
example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
isn't a real English word, it doesn't matter for search; if all variants of a
word are reduced to the same root form, they will match correctly.

[[temmer-token-filters]]
==== Stemmer token filters

In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token
filters>>. These token filters can be categorized based on how they stem words:

* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set
of rules
* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them
up in a dictionary

Because stemming changes tokens, we recommend using the same stemmer token
filters during <<analysis-index-search-time,index and search analysis>>.

[[algorithmic-stemmers]]
==== Algorithmic stemmers

Algorithmic stemmers apply a series of rules to each word to reduce it to its
root form. For example, an algorithmic stemmer for English may remove the `-s`
and `-es` prefixes from the end of plural words.

Algorithmic stemmers have a few advantages:

* They require little setup and usually work well out of the box.
* They use little memory.
* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.

However, most algorithmic stemmers only alter the existing text of a word. This
means they may not work well with irregular words that don't contain their root
form, such as:

* `be`, `are`, and `am`
* `mouse` and `mice`
* `foot` and `feet`

The following token filters use algorithmic stemming:

* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
stemming for several languages, some with additional variants.
* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
algorithmic stemming with a built-in dictionary.
* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
stemmer for English.
* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
http://snowball.tartarus.org/[Snowball]-based stemming rules for several
languages.

[[dictionary-stemmers]]
==== Dictionary stemmers

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
word variants with stemmed words from the dictionary.

In theory, dictionary stemmers are well suited for:

* Stemming irregular words
* Discerning between words that are spelled similarly but not related
conceptually, such as:
** `organ` and `organization`
** `broker` and `broken`

In practice, algorithmic stemmers typically outperform dictionary stemmers. This
is because dictionary stemmers have the following disadvantages:

* *Dictionary quality* +
A dictionary stemmer is only as good as its dictionary. To work well, these
dictionaries must include a significant number of words, be updated regularly,
and change with language trends. Often, by the time a dictionary has been made
available, it's incomplete and some of its entries are already outdated.

* *Size and performance* +
Dictionary stemmers must load all words, prefixes, and suffixes from its
dictionary into memory. This can use a significant amount of RAM. Low-quality
dictionaries may also be less efficient with prefix and suffix removal, which
can slow the stemming process significantly.

You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
perform dictionary stemming.

[TIP]
====
If available, we recommend trying an algorithmic stemmer for your language
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
====

[[control-stemming]]
==== Control stemming

Sometimes stemming can produce shared root words that are spelled similarly but
not related conceptually. For example, a stemmer may reduce both `skies` and
`skiing` to the same root word: `ski`.

To prevent this and better control stemming, you can use the following token
filters:

* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
define rules for stemming specific tokens.
* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
specified tokens as keywords. Keyword tokens are not stemmed by subsequent
stemmer token filters.
* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
tokens as keywords, similar to the `keyword_marker` filter.


For built-in <<analysis-lang-analyzer,language analyzers>>, you also can use the
<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
of words that won't be stemmed.
[DOCS] Add stemming concept docs (#55156) Adds conceptual documentation for stemming, including: * An overview of why stemming is helpful in search * Algorithmic vs. dictionary stemming * Token filters used to control stemming, such as `stemmer_override`, `keyword_marker`, and `conditional` 2020-04-24 11:01:28 -04:00			`[[stemming]]`
			`=== Stemming`

			`_Stemming_ is the process of reducing a word to its root form. This ensures`
			`variants of a word match during a search.`

			For example, `walking` and `walked` can be stemmed to the same root word:
			`walk`. Once stemmed, an occurrence of either word would match the other in a
			`search.`

			`Stemming is language-dependent but often involves removing prefixes and`
			`suffixes from words.`

			`In some cases, the root form of a stemmed word may not be a real word. For`
			example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
			`isn't a real English word, it doesn't matter for search; if all variants of a`
			`word are reduced to the same root form, they will match correctly.`

			`[[temmer-token-filters]]`
			`==== Stemmer token filters`

			`In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token`
			`filters>>. These token filters can be categorized based on how they stem words:`

			`* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set`
			`of rules`
			`* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them`
			`up in a dictionary`

			`Because stemming changes tokens, we recommend using the same stemmer token`
			`filters during <<analysis-index-search-time,index and search analysis>>.`

			`[[algorithmic-stemmers]]`
			`==== Algorithmic stemmers`

			`Algorithmic stemmers apply a series of rules to each word to reduce it to its`
			root form. For example, an algorithmic stemmer for English may remove the `-s`
			and `-es` prefixes from the end of plural words.

			`Algorithmic stemmers have a few advantages:`

			`* They require little setup and usually work well out of the box.`
			`* They use little memory.`
			`* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.`

			`However, most algorithmic stemmers only alter the existing text of a word. This`
			`means they may not work well with irregular words that don't contain their root`
			`form, such as:`

			* `be`, `are`, and `am`
			* `mouse` and `mice`
			* `foot` and `feet`

			`The following token filters use algorithmic stemming:`

			* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
			`stemming for several languages, some with additional variants.`
			* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
			`algorithmic stemming with a built-in dictionary.`
			* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
			`stemmer for English.`
			* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
			`http://snowball.tartarus.org/[Snowball]-based stemming rules for several`
			`languages.`

			`[[dictionary-stemmers]]`
			`==== Dictionary stemmers`

			`Dictionary stemmers look up words in a provided dictionary, replacing unstemmed`
			`word variants with stemmed words from the dictionary.`

			`In theory, dictionary stemmers are well suited for:`

			`* Stemming irregular words`
			`* Discerning between words that are spelled similarly but not related`
			`conceptually, such as:`
			** `organ` and `organization`
			** `broker` and `broken`

			`In practice, algorithmic stemmers typically outperform dictionary stemmers. This`
			`is because dictionary stemmers have the following disadvantages:`

			`* Dictionary quality +`
			`A dictionary stemmer is only as good as its dictionary. To work well, these`
			`dictionaries must include a significant number of words, be updated regularly,`
			`and change with language trends. Often, by the time a dictionary has been made`
			`available, it's incomplete and some of its entries are already outdated.`

			`* Size and performance +`
			`Dictionary stemmers must load all words, prefixes, and suffixes from its`
			`dictionary into memory. This can use a significant amount of RAM. Low-quality`
			`dictionaries may also be less efficient with prefix and suffix removal, which`
			`can slow the stemming process significantly.`

			You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
			`perform dictionary stemming.`

			`[TIP]`
			`====`
			`If available, we recommend trying an algorithmic stemmer for your language`
			before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
			`====`

			`[[control-stemming]]`
			`==== Control stemming`

			`Sometimes stemming can produce shared root words that are spelled similarly but`
			not related conceptually. For example, a stemmer may reduce both `skies` and
			`skiing` to the same root word: `ski`.

			`To prevent this and better control stemming, you can use the following token`
			`filters:`

			* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
			`define rules for stemming specific tokens.`
			* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
			`specified tokens as keywords. Keyword tokens are not stemmed by subsequent`
			`stemmer token filters.`
			* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
			tokens as keywords, similar to the `keyword_marker` filter.


			`For built-in <<analysis-lang-analyzer,language analyzers>>, you also can use the`
			<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
			`of words that won't be stemmed.`