[DOCS] Reformat `stemmer` token filter (#55693)

Makes the following changes to the `stemmer` token filter docs:

* Adds detailed analyze example
* Rewrites parameter definitions
* Adds custom analyzer example
* Adds a `language` value for the `estonian` stemmer
* Reorders the `language` values to show recommended algorithms first, followed by other values alphabetically
2020-04-24 11:25:01 -04:00 · 2020-04-24 11:25:01 -04:00 · e0a8adb5b2
parent 96285b90c1
commit e0a8adb5b2
1 changed files with 164 additions and 75 deletions
--- a/docs/reference/analysis/tokenfilters/stemmer-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/stemmer-tokenfilter.asciidoc
@@ -4,189 +4,278 @@
 <titleabbrev>Stemmer</titleabbrev>
 ++++
-// Adds attribute for the 'minimal_portuguese' stemmer values link.
+Provides <<algorithmic-stemmers,algorithmic stemming>> for several languages,
-// This link contains ~, which is converted to subscript.
+some with additional variants. For a list of supported languages, see the
-// This attribute prevents that substitution.
+<<analysis-stemmer-tokenfilter-language-parm,`language`>> parameter.
 // See https://github.com/asciidoctor/asciidoctor/wiki/How-to-prevent-URLs-containing-formatting-characters-from-getting-mangled
 :min-pt-stemmer-values-url: http://www.inf.ufrgs.br/~buriol/papers/Orengo_CLEF07.pdf
-A filter that provides access to (almost) all of the available stemming token
+When not customized, the filter uses the
-filters through a single unified interface. For example:
+http://snowball.tartarus.org/algorithms/porter/stemmer.html[porter stemming
 algorithm] for English.
 [[analysis-stemmer-graph-tokenfilter-analyze-ex]]
 ==== Example
 The following analyze API request uses the `stemmer` filter's default porter
 stemming algorithm to stem `the foxes jumping quickly` to `the fox jump
 quickli`:
 [source,console]
--------------------------------------------------
+----
 GET /_analyze
 {
  "tokenizer": "standard",
  "filter": [ "stemmer" ],
  "text": "the foxes jumping quickly"
 }
 ----
 The filter produces the following tokens:
 [source,text]
 ----
 [ the, fox, jump, quickli ]
 ----
 ////
 [source,console-result]
 ----
 {
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jump",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "quickli",
      "start_offset": 18,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
 }
 ----
 ////
 [[analysis-stemmer-tokenfilter-analyzer-ex]]
 ==== Add to an analyzer
 The following <<indices-create-index,create index API>> request uses the
 `stemmer` filter to configure a new <<analysis-custom-analyzer,custom
 analyzer>>.
 [source,console]
 ----
 PUT /my_index
 {
-    "settings": {
+  "settings": {
-        "analysis" : {
+    "analysis": {
-            "analyzer" : {
+      "analyzer": {
-                "my_analyzer" : {
+        "my_analyzer": {
-                    "tokenizer" : "standard",
+          "tokenizer": "whitespace",
-                    "filter" : ["lowercase", "my_stemmer"]
+          "filter": [ "stemmer" ]
-                }
-            },
-            "filter" : {
-                "my_stemmer" : {
-                    "type" : "stemmer",
-                    "name" : "light_german"
-                }
-            }
        }
+      }
    }
+  }
 }
--------------------------------------------------
+----
-The `language`/`name` parameter controls the stemmer with the following
+[role="child_attributes"]
-available values (the preferred filters are marked in *bold*):
+[[analysis-stemmer-tokenfilter-configure-parms]]
 ==== Configurable parameters
 [[analysis-stemmer-tokenfilter-language-parm]]
 `language`::
 (Optional, string)
 Language-dependent stemming algorithm used to stem tokens. If both this and the
 `name` parameter are specified, the `language` parameter argument is used.
 +
 [%collapsible%open]
 .Valid values for `language`
 ====
 Valid values are sorted by language. Defaults to
 http://snowball.tartarus.org/algorithms/porter/stemmer.html[*`english`*].
 Recommended algorithms are *bolded*.
 [horizontal]
 Arabic::
-
+{lucene-analysis-docs}/ar/ArabicStemmer.html[*`arabic`*]
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicStemmer.html[*`arabic`*]
 Armenian::
 http://snowball.tartarus.org/algorithms/armenian/stemmer.html[*`armenian`*]
 Basque::
 http://snowball.tartarus.org/algorithms/basque/stemmer.html[*`basque`*]
 Bengali::
 http://www.tandfonline.com/doi/abs/10.1080/02564602.1993.11437284[*`bengali`*]
 Brazilian Portuguese::
-
+{lucene-analysis-docs}/br/BrazilianStemmer.html[*`brazilian`*]
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html[*`brazilian`*]
 Bulgarian::
 http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf[*`bulgarian`*]
 Catalan::
 http://snowball.tartarus.org/algorithms/catalan/stemmer.html[*`catalan`*]
 Czech::
 http://portal.acm.org/citation.cfm?id=1598600[*`czech`*]
 Danish::
 http://snowball.tartarus.org/algorithms/danish/stemmer.html[*`danish`*]
 Dutch::
 http://snowball.tartarus.org/algorithms/dutch/stemmer.html[*`dutch`*],
 http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html[`dutch_kp`]
 English::
 http://snowball.tartarus.org/algorithms/porter/stemmer.html[*`english`*],
 http://ciir.cs.umass.edu/pubfiles/ir-35.pdf[`light_english`],
+http://snowball.tartarus.org/algorithms/lovins/stemmer.html[`lovins`],
 http://www.researchgate.net/publication/220433848_How_effective_is_suffixing[`minimal_english`],
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilter.html[`possessive_english`],
 http://snowball.tartarus.org/algorithms/english/stemmer.html[`porter2`],
-http://snowball.tartarus.org/algorithms/lovins/stemmer.html[`lovins`]
+{lucene-analysis-docs}/en/EnglishPossessiveFilter.html[`possessive_english`]
 Estonian::
 https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/tartarus/snowball/ext/EstonianStemmer.html[*`estonian`*]
 Finnish::
 http://snowball.tartarus.org/algorithms/finnish/stemmer.html[*`finnish`*],
 http://clef.isti.cnr.it/2003/WN_web/22.pdf[`light_finnish`]
 French::
-http://snowball.tartarus.org/algorithms/french/stemmer.html[`french`],
 http://dl.acm.org/citation.cfm?id=1141523[*`light_french`*],
+http://snowball.tartarus.org/algorithms/french/stemmer.html[`french`],
 http://dl.acm.org/citation.cfm?id=318984[`minimal_french`]
 Galician::
 http://bvg.udc.es/recursos_lingua/stemming.jsp[*`galician`*],
 http://bvg.udc.es/recursos_lingua/stemming.jsp[`minimal_galician`] (Plural step only)
 German::
-
+http://dl.acm.org/citation.cfm?id=1141523[*`light_german`*],
 http://snowball.tartarus.org/algorithms/german/stemmer.html[`german`],
 http://snowball.tartarus.org/algorithms/german2/stemmer.html[`german2`],
-http://dl.acm.org/citation.cfm?id=1141523[*`light_german`*],
 http://members.unine.ch/jacques.savoy/clef/morpho.pdf[`minimal_german`]
 Greek::
 http://sais.se/mthprize/2007/ntais2007.pdf[*`greek`*]
 Hindi::
 http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf[*`hindi`*]
 Hungarian::
 http://snowball.tartarus.org/algorithms/hungarian/stemmer.html[*`hungarian`*],
 http://dl.acm.org/citation.cfm?id=1141523&dl=ACM&coll=DL&CFID=179095584&CFTOKEN=80067181[`light_hungarian`]
 Indonesian::
 http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf[*`indonesian`*]
 Irish::
 http://snowball.tartarus.org/otherapps/oregan/intro.html[*`irish`*]
 Italian::
-
+http://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf[*`light_italian`*],
-http://snowball.tartarus.org/algorithms/italian/stemmer.html[`italian`],
+http://snowball.tartarus.org/algorithms/italian/stemmer.html[`italian`]
-http://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf[*`light_italian`*]
 Kurdish (Sorani)::
-
+{lucene-analysis-docs}/ckb/SoraniStemmer.html[*`sorani`*]
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniStemmer.html[*`sorani`*]
 Latvian::
-
+{lucene-analysis-docs}/lv/LatvianStemmer.html[*`latvian`*]
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/lv/LatvianStemmer.html[*`latvian`*]
 Lithuanian::
 http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_5_3/lucene/analysis/common/src/java/org/apache/lucene/analysis/lt/stem_ISO_8859_1.sbl?view=markup[*`lithuanian`*]
 Norwegian (Bokmål)::
 http://snowball.tartarus.org/algorithms/norwegian/stemmer.html[*`norwegian`*],
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html[*`light_norwegian`*],
+{lucene-analysis-docs}/no/NorwegianLightStemmer.html[*`light_norwegian`*],
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html[`minimal_norwegian`]
+{lucene-analysis-docs}/no/NorwegianMinimalStemmer.html[`minimal_norwegian`]
 Norwegian (Nynorsk)::
-
+{lucene-analysis-docs}/no/NorwegianLightStemmer.html[*`light_nynorsk`*],
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html[*`light_nynorsk`*],
+{lucene-analysis-docs}/no/NorwegianMinimalStemmer.html[`minimal_nynorsk`]
-http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html[`minimal_nynorsk`]
 Portuguese::
-http://snowball.tartarus.org/algorithms/portuguese/stemmer.html[`portuguese`],
 http://dl.acm.org/citation.cfm?id=1141523&dl=ACM&coll=DL&CFID=179095584&CFTOKEN=80067181[*`light_portuguese`*],
-{min-pt-stemmer-values-url}[`minimal_portuguese`],
+pass:macros[http://www.inf.ufrgs.br/~buriol/papers/Orengo_CLEF07.pdf[`minimal_portuguese`\]],
+http://snowball.tartarus.org/algorithms/portuguese/stemmer.html[`portuguese`],
 http://www.inf.ufrgs.br/\~viviane/rslp/index.htm[`portuguese_rslp`]
 Romanian::
 http://snowball.tartarus.org/algorithms/romanian/stemmer.html[*`romanian`*]
 Russian::
 http://snowball.tartarus.org/algorithms/russian/stemmer.html[*`russian`*],
 http://doc.rero.ch/lm.php?url=1000%2C43%2C4%2C20091209094227-CA%2FDolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf[`light_russian`]
 Spanish::
-
+http://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf[*`light_spanish`*],
-http://snowball.tartarus.org/algorithms/spanish/stemmer.html[`spanish`],
+http://snowball.tartarus.org/algorithms/spanish/stemmer.html[`spanish`]
-http://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf[*`light_spanish`*]
 Swedish::
 http://snowball.tartarus.org/algorithms/swedish/stemmer.html[*`swedish`*],
 http://clef.isti.cnr.it/2003/WN_web/22.pdf[`light_swedish`]
 Turkish::
 http://snowball.tartarus.org/algorithms/turkish/stemmer.html[*`turkish`*]
 ====
 `name`::
 An alias for the <<analysis-stemmer-tokenfilter-language-parm,`language`>>
 parameter. If both this and the `language` parameter are specified, the
 `language` parameter argument is used.
 [[analysis-stemmer-tokenfilter-customize]]
 ==== Customize
 To customize the `stemmer` filter, duplicate it to create the basis for a new
 custom token filter. You can modify the filter using its configurable
 parameters.
 For example, the following request creates a custom `stemmer` filter that stems
 words using the `light_german` algorithm:
 [source,console]
 ----
 PUT /my_index
 {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      }
    }
  }
 }
 ----
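
One way to try out the custom filter from the last example is to run the analyze API against the index. The request below is a sketch and is not part of this commit. It assumes `my_index` already exists with the `my_analyzer` settings shown above, and the sample text is an arbitrary German phrase chosen only for illustration.

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Die Häuser der Städte"
}
----

The response lists the tokens after the `lowercase` and `my_stemmer` filters have run, which makes it easy to compare `light_german` with another `language` value by changing the filter definition and repeating the request.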