[DOCS] Reformat hunspell token filter (#56955)

Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Rewrites parameter documentation
* Updates custom analyzer example
* Rewrites related setting documentation
2025-03-09 14:34:43 +00:00 · 2020-05-20 14:47:53 -04:00
commit 5cb34d9a6e
parent ec41d36c62
1 changed files with 201 additions and 73 deletions
--- a/docs/reference/analysis/tokenfilters/hunspell-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/hunspell-tokenfilter.asciidoc
@@ -4,18 +4,37 @@
 <titleabbrev>Hunspell</titleabbrev>
 ++++
-Basic support for hunspell stemming. Hunspell dictionaries will be
-picked up from a dedicated hunspell directory on the filesystem
-(`<path.conf>/hunspell`). Each dictionary is expected to
-have its own directory named after its associated locale (language).
-This dictionary directory is expected to hold a single `*.aff` and
-one or more `*.dic` files (all of which will automatically be picked up).
-For example, assuming the default hunspell location is used, the
-following directory layout will define the `en_US` dictionary:
+Provides <<dictionary-stemmers,dictionary stemming>> based on a provided
+http://en.wikipedia.org/wiki/Hunspell[Hunspell dictionary]. The `hunspell`
+filter requires
+<<analysis-hunspell-tokenfilter-dictionary-config,configuration>> of one or more
+language-specific Hunspell dictionaries.
+
+This filter uses Lucene's
+{lucene-analysis-docs}/hunspell/HunspellStemFilter.html[HunspellStemFilter].
 [TIP]
 ====
 If available, we recommend trying an algorithmic stemmer for your language
 before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
 In practice, algorithmic stemmers typically outperform dictionary stemmers.
 See <<dictionary-stemmers>>.
 ====
 [[analysis-hunspell-tokenfilter-dictionary-config]]
 ==== Configure Hunspell dictionaries
 By default, Hunspell dictionaries are stored and detected in a dedicated
 hunspell directory on the filesystem: `<path.config>/hunspell`. Each dictionary
 is expected to have its own directory, named after its associated language and
 locale (e.g., `pt_BR`, `en_GB`). This dictionary directory is expected to hold a
 single `.aff` and one or more `.dic` files, all of which will automatically be
 picked up. For example, assuming the default `<path.config>/hunspell` path
 is used, the following directory layout will define the `en_US` dictionary:
 [source,txt]
 --------------------------------------------------
- conf
+- config
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
@@ -24,96 +43,205 @@ following directory layout will define the `en_US` dictionary:
 Each dictionary can be configured with one setting:
 [[analysis-hunspell-ignore-case-settings]]
 `ignore_case`::
-    If true, dictionary matching will be case insensitive
-    (defaults to `false`)
+(Static, boolean)
+If true, dictionary matching will be case insensitive. Defaults to `false`.
 This setting can be configured globally in `elasticsearch.yml` using
 `indices.analysis.hunspell.dictionary.ignore_case`.
-* `indices.analysis.hunspell.dictionary.ignore_case`
-
-or for specific dictionaries:
-
+To configure the setting for a specific locale, use the
+`indices.analysis.hunspell.dictionary.<locale>.ignore_case` setting (e.g., for
+the `en_US` (American English) locale, the setting is
+`indices.analysis.hunspell.dictionary.en_US.ignore_case`).
 * `indices.analysis.hunspell.dictionary.en_US.ignore_case`.
 It is also possible to add a `settings.yml` file under the dictionary
-directory which holds these settings (this will override any other
-settings defined in the `elasticsearch.yml`).
+directory which holds these settings. This overrides any other `ignore_case`
+settings defined in `elasticsearch.yml`.
-One can use the hunspell stem filter by configuring it the analysis
-settings:
+[[analysis-hunspell-tokenfilter-analyze-ex]]
+==== Example
 The following analyze API request uses the `hunspell` filter to stem 
 `the foxes jumping quickly` to `the fox jump quick`.
 The request specifies the `en_US` locale, meaning that the
 `.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are used
 for the Hunspell dictionary.
 [source,console]
--------------------------------------------------
-PUT /hunspell_example
+----
+GET /_analyze
 {
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hunspell",
      "locale": "en_US"
    }
  ],
  "text": "the foxes jumping quickly"
 }
 ----
 The filter produces the following tokens:
 [source,text]
 ----
 [ the, fox, jump, quick ]
 ----
 ////
 [source,console-result]
 ----
 {
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jump",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "quick",
      "start_offset": 18,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
 }
 ----
 ////
 [[analysis-hunspell-tokenfilter-configure-parms]]
 ==== Configurable parameters
 [[analysis-hunspell-tokenfilter-dictionary-param]]
 `dictionary`::
 (Optional, string or array of strings)
 One or more `.dic` files (e.g., `en_US.dic, my_custom.dic`) to use for the
 Hunspell dictionary.
 +
 By default, the `hunspell` filter uses all `.dic` files in the
 `<path.config>/hunspell/<locale>` directory specified using the
 `lang`, `language`, or `locale` parameter. To use another directory, the
 directory's path must be registered using the
 <<indices-analysis-hunspell-dictionary-location,
 `indices.analysis.hunspell.dictionary.location`>> setting.
 `dedup`::
 (Optional, boolean)
 If `true`, duplicate tokens are removed from the filter's output. Defaults to
 `true`.
 `lang`::
 (Required*, string)
 An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
 parameter>>.
 +
 If this parameter is not specified, the `language` or `locale` parameter is
 required.
 `language`::
 (Required*, string)
 An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
 parameter>>.
 +
 If this parameter is not specified, the `lang` or `locale` parameter is
 required.
 [[analysis-hunspell-tokenfilter-locale-param]]
 `locale`::
 (Required*, string)
 Locale directory used to specify the `.aff` and `.dic` files for a Hunspell
 dictionary. See <<analysis-hunspell-tokenfilter-dictionary-config>>.
 +
 If this parameter is not specified, the `lang` or `language` parameter is
 required.
 `longest_only`::
 (Optional, boolean)
 If `true`, only the longest stemmed version of each token is
 included in the output. If `false`, all stemmed versions of the token are
 included. Defaults to `false`.
 [[analysis-hunspell-tokenfilter-analyzer-ex]]
 ==== Customize and add to an analyzer
 To customize the `hunspell` filter, duplicate it to create the
 basis for a new custom token filter. You can modify the filter using its
 configurable parameters.
 For example, the following <<indices-create-index,create index API>> request
 uses a custom `hunspell` filter, `my_en_US_dict_stemmer`, to configure a new
 <<analysis-custom-analyzer,custom analyzer>>.
 The `my_en_US_dict_stemmer` filter uses a `locale` of `en_US`, meaning that the
 `.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are
 used. The filter also includes a `dedup` argument of `false`, meaning that
 duplicate tokens added from the dictionary are not removed from the filter's
 output.
 [source,console]
 ----
 PUT /my_index
 {
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
-                    "filter" : [ "lowercase", "en_US" ]
+          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
      "filter": {
-                "en_US" : {
+        "my_en_US_dict_stemmer": {
          "type": "hunspell",
          "locale": "en_US",
-                    "dedup" : true
+          "dedup": false
        }
      }
    }
  }
 }
--------------------------------------------------
+----
-The hunspell token filter accepts four options:
+[[analysis-hunspell-tokenfilter-settings]]
 ==== Settings
-`locale`::
-    A locale for this filter. If this is unset, the `lang` or
-    `language` are used instead - so one of these has to be set.
-`dictionary`::
-    The name of a dictionary. The path to your hunspell
-    dictionaries should be configured via
-    `indices.analysis.hunspell.dictionary.location` before.
+In addition to the <<analysis-hunspell-ignore-case-settings,`ignore_case`
+settings>>, you can configure the following global settings for the `hunspell`
+filter using `elasticsearch.yml`:
+`indices.analysis.hunspell.dictionary.lazy`::
+(Static, boolean)
+If `true`, the loading of Hunspell dictionaries is deferred until a dictionary
+is used. If `false`, the dictionary directory is checked for dictionaries when
 the node starts, and any dictionaries are automatically loaded. Defaults to
 `false`.
-`dedup`::
-    If only unique terms should be returned, this needs to be
-    set to `true`. Defaults to `true`.
-
-`longest_only`::
-    If only the longest term should be returned, set this to `true`.
-    Defaults to `false`: all possible stems are returned.
-
+[[indices-analysis-hunspell-dictionary-location]]
+`indices.analysis.hunspell.dictionary.location`::
+(Static, string)
+Path to a Hunspell dictionary directory. This path must be absolute or
+relative to the `config` location.
+
+By default, the `<path.config>/hunspell` directory is used, as described in
+<<analysis-hunspell-tokenfilter-dictionary-config>>.
 NOTE: As opposed to the snowball stemmers (which are algorithm based)
 this is a dictionary lookup based stemmer and therefore the quality of
 the stemming is determined by the quality of the dictionary.
 [float]
 ==== Dictionary loading
 By default, the Hunspell directory (`config/hunspell/`) is checked
 for dictionaries when the node starts up, and any dictionaries are
 automatically loaded.
 Dictionary loading can be deferred until a dictionary is actually used by
 setting `indices.analysis.hunspell.dictionary.lazy` to `true` in the config file.
 [float]
 ==== References
 Hunspell is a spell checker and morphological analyzer designed for
 languages with rich morphology and complex word compounding and
 character encoding.
 1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell
 2. Source code, http://hunspell.sourceforge.net/
 3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
 4. Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
 5. Chromium Hunspell dictionaries,
   http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/