[DOCS] Add stemming concept docs (#55156)
Adds conceptual documentation for stemming, including:

* An overview of why stemming is helpful in search
* Algorithmic vs. dictionary stemming
* Token filters used to control stemming, such as `stemmer_override`, `keyword_marker`, and `conditional`
@@ -48,6 +48,7 @@ the config, or by using an external stopwords file by setting
`stopwords_path`. Check <<analysis-stop-analyzer,Stop Analyzer>> for
more details.

[[_excluding_words_from_stemming]]
===== Excluding words from stemming

The `stem_exclusion` parameter allows you to specify an array
@@ -8,8 +8,10 @@ This section explains the fundamental concepts of text analysis in {es}.

* <<analyzer-anatomy>>
* <<analysis-index-search-time>>
* <<stemming>>
* <<token-graphs>>

include::anatomy.asciidoc[]
include::index-search-time.asciidoc[]
include::stemming.asciidoc[]
include::token-graphs.asciidoc[]
@@ -0,0 +1,125 @@
[[stemming]]
=== Stemming

_Stemming_ is the process of reducing a word to its root form. This ensures
variants of a word match during a search.

For example, `walking` and `walked` can be stemmed to the same root word:
`walk`. Once stemmed, an occurrence of either word would match the other in a
search.

Stemming is language-dependent but often involves removing prefixes and
suffixes from words.

In some cases, the root form of a stemmed word may not be a real word. For
example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
isn't a real English word, it doesn't matter for search; if all variants of a
word are reduced to the same root form, they will match correctly.
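
For example, you can test stemming with the analyze API and the default
`stemmer` token filter. The request below is a minimal sketch; the exact tokens
returned depend on the stemmer and its language settings.

[source,console]
----
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stemmer" ],
  "text": "the foxes jumping quickly"
}
----

With the default English stemmer, terms such as `foxes` and `jumping` are
typically reduced to `fox` and `jump`.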

[[stemmer-token-filters]]
==== Stemmer token filters

In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token
filters>>. These token filters can be categorized based on how they stem words:

* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set
of rules
* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them
up in a dictionary

Because stemming changes tokens, we recommend using the same stemmer token
filters during <<analysis-index-search-time,index and search analysis>>.
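
For instance, the following request is a configuration sketch that defines a
custom analyzer using the default `stemmer` filter and applies it to a field.
Because the mapping only sets `analyzer`, the same analyzer is used at both
index and search time. The index, analyzer, and field names are illustrative.

[source,console]
----
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stem_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "my_stem_analyzer"
      }
    }
  }
}
----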

[[algorithmic-stemmers]]
==== Algorithmic stemmers

Algorithmic stemmers apply a series of rules to each word to reduce it to its
root form. For example, an algorithmic stemmer for English may remove the `-s`
and `-es` suffixes from the end of plural words.

Algorithmic stemmers have a few advantages:

* They require little setup and usually work well out of the box.
* They use little memory.
* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.

However, most algorithmic stemmers only alter the existing text of a word. This
means they may not work well with irregular words that don't contain their root
form, such as:

* `be`, `are`, and `am`
* `mouse` and `mice`
* `foot` and `feet`

The following token filters use algorithmic stemming (see the sketch after this
list for an example):

* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
stemming for several languages, some with additional variants.
* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
algorithmic stemming with a built-in dictionary.
* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
stemmer for English.
* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
http://snowball.tartarus.org/[Snowball]-based stemming rules for several
languages.
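
As a sketch, you can also test a specific algorithmic stemmer directly with the
analyze API by defining the filter inline. The example below uses the `english`
variant of the `stemmer` filter; regular forms such as `walked` are typically
reduced to `walk`, while irregular forms such as `feet` and `mice` are left
unchanged.

[source,console]
----
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "feet mice walked"
}
----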

[[dictionary-stemmers]]
==== Dictionary stemmers

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
word variants with stemmed words from the dictionary.

In theory, dictionary stemmers are well suited for:

* Stemming irregular words
* Discerning between words that are spelled similarly but not related
conceptually, such as:
** `organ` and `organization`
** `broker` and `broken`

In practice, algorithmic stemmers typically outperform dictionary stemmers. This
is because dictionary stemmers have the following disadvantages:

* *Dictionary quality* +
A dictionary stemmer is only as good as its dictionary. To work well, these
dictionaries must include a significant number of words, be updated regularly,
and change with language trends. Often, by the time a dictionary has been made
available, it's incomplete and some of its entries are already outdated.

* *Size and performance* +
Dictionary stemmers must load all words, prefixes, and suffixes from their
dictionary into memory. This can use a significant amount of RAM. Low-quality
dictionaries may also be less efficient with prefix and suffix removal, which
can slow the stemming process significantly.

You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
perform dictionary stemming.
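
The following request is a configuration sketch that assumes an `en_US` Hunspell
dictionary has already been placed in each node's `hunspell/en_US` config
directory; the filter and analyzer names are illustrative.

[source,console]
----
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "filter": {
        "my_en_hunspell": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "my_hunspell_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_en_hunspell" ]
        }
      }
    }
  }
}
----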

[TIP]
====
If available, we recommend trying an algorithmic stemmer for your language
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
====

[[control-stemming]]
==== Control stemming

Sometimes stemming can produce shared root words that are spelled similarly but
not related conceptually. For example, a stemmer may reduce both `skies` and
`skiing` to the same root word: `ski`.

To prevent this and better control stemming, you can use the following token
filters (see the example after this list):

* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
define rules for stemming specific tokens.
* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
specified tokens as keywords. Keyword tokens are not stemmed by subsequent
stemmer token filters.
* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
tokens as keywords, similar to the `keyword_marker` filter.
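
For example, the following analyze request is a sketch that uses an inline
`keyword_marker` filter to protect `skiing` from the stemmer that follows it;
the keyword list is illustrative.

[source,console]
----
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "keyword_marker", "keywords": [ "skiing" ] },
    "stemmer"
  ],
  "text": "skies skiing"
}
----

With this setup, `skies` is typically still reduced to `ski`, while `skiing` is
left unchanged.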

For built-in <<analysis-lang-analyzer,language analyzers>>, you can also use the
<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
of words that won't be stemmed.
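
For instance, the following sketch configures the built-in `english` analyzer so
that the listed words are excluded from stemming. The index name and word list
are illustrative, and not every language analyzer supports this parameter.

[source,console]
----
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ]
        }
      }
    }
  }
}
----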