[[stemming]]
=== Stemming

_Stemming_ is the process of reducing a word to its root form. This ensures
variants of a word match during a search.

For example, `walking` and `walked` can be stemmed to the same root word:
`walk`. Once stemmed, an occurrence of either word would match the other in a
search.

Stemming is language-dependent but often involves removing prefixes and
suffixes from words.

In some cases, the root form of a stemmed word may not be a real word. For
example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
isn't a real English word, it doesn't matter for search; if all variants of a
word are reduced to the same root form, they will match correctly.

[[stemmer-token-filters]]
==== Stemmer token filters

In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token
filters>>. These token filters can be categorized based on how they stem words:

* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set
of rules
* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them
up in a dictionary

Because stemming changes tokens, we recommend using the same stemmer token
filters during <<analysis-index-search-time,index and search analysis>>.

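As a minimal sketch of that recommendation, the Python snippet below builds
index settings in which a single custom analyzer, containing the `stemmer`
token filter, is assigned to a text field. Because the mapping sets only
`analyzer` and no separate `search_analyzer`, the same filters apply at index
and search time. The analyzer name `my_stemmed_analyzer` and the field name
`body` are hypothetical and chosen only for illustration.

```python
import json

# Illustrative settings only: "my_stemmed_analyzer" and "body" are made-up names.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stemmed_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first, then apply the `stemmer` token filter.
                    "filter": ["lowercase", "stemmer"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                # No `search_analyzer` is set, so this analyzer is used
                # for both index and search analysis.
                "analyzer": "my_stemmed_analyzer",
            }
        }
    },
}

# Print the JSON body that could be sent when creating the index.
print(json.dumps(index_settings, indent=2))
```

The printed JSON is only a request body; how it is sent to a cluster depends on
the client you use.
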
[[algorithmic-stemmers]]
==== Algorithmic stemmers

Algorithmic stemmers apply a series of rules to each word to reduce it to its
root form. For example, an algorithmic stemmer for English may remove the `-s`
and `-es` suffixes from the end of plural words.

Algorithmic stemmers have a few advantages:

* They require little setup and usually work well out of the box.
* They use little memory.
* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.

However, most algorithmic stemmers only alter the existing text of a word. This
means they may not work well with irregular words that don't contain their root
form, such as:

* `be`, `are`, and `am`
* `mouse` and `mice`
* `foot` and `feet`

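To make the rule-based approach concrete, here is a toy suffix-stripping
stemmer in Python. The rule list and the `toy_stem` function are assumptions
made for this illustration; they are far simpler than the rules used by any
real stemmer token filter.

```python
# Toy suffix rules, checked in order. These are NOT the rules of a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


for word in ["walking", "walked", "walks", "mouse", "mice", "foot", "feet"]:
    print(word, "->", toy_stem(word))
```

Regular variants such as `walking`, `walked`, and `walks` all reduce to `walk`,
so they would match each other. The irregular pairs stay apart: no suffix rule
can turn `mice` into `mouse` or `feet` into `foot`, because the root form is not
contained in the word.
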
The following token filters use algorithmic stemming:

* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
stemming for several languages, some with additional variants.
* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
algorithmic stemming with a built-in dictionary.
* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
stemmer for English.
* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
http://snowball.tartarus.org/[Snowball]-based stemming rules for several
languages.

[[dictionary-stemmers]]
==== Dictionary stemmers

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
word variants with stemmed words from the dictionary.

In theory, dictionary stemmers are well suited for:

* Stemming irregular words
* Discerning between words that are spelled similarly but not related
conceptually, such as:
** `organ` and `organization`
** `broker` and `broken`

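The lookup idea can be sketched in a few lines of Python. The hand-written
`STEM_DICTIONARY` and the `dictionary_stem` function are hypothetical. A real
dictionary stemmer such as `hunspell` works from full dictionary files rather
than a flat table like this one.

```python
# Hypothetical, hand-written dictionary mapping word variants to their stems.
STEM_DICTIONARY = {
    "am": "be",
    "are": "be",
    "mice": "mouse",
    "feet": "foot",
    "organs": "organ",
    "organizations": "organization",
}


def dictionary_stem(word: str) -> str:
    """Return the listed stem, or the word unchanged if it is not in the dictionary."""
    return STEM_DICTIONARY.get(word, word)


for word in ["are", "mice", "feet", "organs", "organizations", "selfies"]:
    print(word, "->", dictionary_stem(word))
```

Irregular forms resolve correctly, and `organs` and `organizations` keep
distinct stems. A word missing from the dictionary, such as `selfies` here,
passes through unstemmed.
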
In practice, algorithmic stemmers typically outperform dictionary stemmers. This
is because dictionary stemmers have the following disadvantages:

* *Dictionary quality* +
A dictionary stemmer is only as good as its dictionary. To work well, these
dictionaries must include a significant number of words, be updated regularly,
and change with language trends. Often, by the time a dictionary has been made
available, it's incomplete and some of its entries are already outdated.

* *Size and performance* +
Dictionary stemmers must load all words, prefixes, and suffixes from their
dictionaries into memory. This can use a significant amount of RAM. Low-quality
dictionaries may also be less efficient with prefix and suffix removal, which
can slow the stemming process significantly.

You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
perform dictionary stemming.

[TIP]
====
If available, we recommend trying an algorithmic stemmer for your language
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
====

[[control-stemming]]
==== Control stemming

Sometimes stemming reduces words that are spelled similarly but are not
conceptually related to the same root word. For example, a stemmer may reduce
both `skies` and `skiing` to the same root word: `ski`.

To prevent this and better control stemming, you can use the following token
filters:

* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
define rules for stemming specific tokens.
* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
specified tokens as keywords. Keyword tokens are not stemmed by subsequent
stemmer token filters.
* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
tokens as keywords, similar to the `keyword_marker` filter.

For built-in <<analysis-lang-analyzer,language analyzers>>, you can also use the
<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
of words that won't be stemmed.

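The effect of overriding and protecting tokens can be sketched in Python. This
is only a model of the idea, not of how the token filters are implemented:
`toy_stem`, `stem_tokens`, and their `keywords` and `overrides` arguments are
hypothetical names. In this sketch an override takes precedence over a keyword.

```python
from typing import Dict, Iterable, List, Optional, Set


def toy_stem(word: str) -> str:
    """Toy stemmer that reduces both `skies` and `skiing` to `ski`."""
    for suffix in ("ing", "es"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def stem_tokens(
    tokens: Iterable[str],
    keywords: Optional[Set[str]] = None,
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Stem tokens, honoring explicit overrides and protected keywords."""
    keywords = keywords or set()
    overrides = overrides or {}
    result = []
    for token in tokens:
        if token in overrides:
            result.append(overrides[token])  # explicit stemming rule
        elif token in keywords:
            result.append(token)  # protected keyword: left unstemmed
        else:
            result.append(toy_stem(token))
    return result


tokens = ["skies", "skiing"]
print(stem_tokens(tokens))
print(stem_tokens(tokens, keywords={"skies"}))
print(stem_tokens(tokens, overrides={"skies": "sky"}))
```

Without any control, both tokens collapse to `ski`. Marking `skies` as a keyword
leaves it unstemmed, while an override maps it to `sky`. In both cases it no
longer collides with `skiing`.
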