[DOCS] Reformat `hunspell` token filter (#56955)

Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Rewrites parameter documentation
* Updates custom analyzer example
* Rewrites related setting documentation
2020-05-20 14:47:53 -04:00 · 2020-05-20 14:47:53 -04:00 · 5cb34d9a6e
parent ec41d36c62
commit 5cb34d9a6e
1 changed files with 201 additions and 73 deletions
--- a/docs/reference/analysis/tokenfilters/hunspell-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/hunspell-tokenfilter.asciidoc
@@ -4,18 +4,37 @@
 <titleabbrev>Hunspell</titleabbrev>
 ++++

-Basic support for hunspell stemming. Hunspell dictionaries will be
-picked up from a dedicated hunspell directory on the filesystem
-(`<path.conf>/hunspell`). Each dictionary is expected to
-have its own directory named after its associated locale (language).
-This dictionary directory is expected to hold a single `*.aff` and
-one or more `*.dic` files (all of which will automatically be picked up).
-For example, assuming the default hunspell location is used, the
-following directory layout will define the `en_US` dictionary:
+Provides <<dictionary-stemmers,dictionary stemming>> based on a provided
+http://en.wikipedia.org/wiki/Hunspell[Hunspell dictionary]. The `hunspell`
+filter requires
+<<analysis-hunspell-tokenfilter-dictionary-config,configuration>> of one or more
+language-specific Hunspell dictionaries.
+
+This filter uses Lucene's
+{lucene-analysis-docs}/hunspell/HunspellStemFilter.html[HunspellStemFilter].
+
+[TIP]
+====
+If available, we recommend trying an algorithmic stemmer for your language
+before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
+In practice, algorithmic stemmers typically outperform dictionary stemmers.
+See <<dictionary-stemmers>>.
+====
+
+[[analysis-hunspell-tokenfilter-dictionary-config]]
+==== Configure Hunspell dictionaries
+
+By default, Hunspell dictionaries are stored in and detected from a dedicated
+Hunspell directory on the filesystem: `<path.config>/hunspell`. Each dictionary
+is expected to have its own directory, named after its associated language and
+locale (e.g., `pt_BR`, `en_GB`). This dictionary directory is expected to hold a
+single `.aff` and one or more `.dic` files, all of which will automatically be
+picked up. For example, assuming the default `<path.config>/hunspell` path
+is used, the following directory layout will define the `en_US` dictionary:

 [source,txt]
 --------------------------------------------------
- conf
+- config
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
@@ -24,96 +43,205 @@ following directory layout will define the `en_US` dictionary:

 Each dictionary can be configured with one setting:

+[[analysis-hunspell-ignore-case-settings]]
 `ignore_case`::
-    If true, dictionary matching will be case insensitive
-    (defaults to `false`)
+(Static, boolean)
+If true, dictionary matching will be case insensitive. Defaults to `false`.

 This setting can be configured globally in `elasticsearch.yml` using
+`indices.analysis.hunspell.dictionary.ignore_case`.

-* `indices.analysis.hunspell.dictionary.ignore_case`
-
-or for specific dictionaries:
-
-* `indices.analysis.hunspell.dictionary.en_US.ignore_case`.
+To configure the setting for a specific locale, use the
+`indices.analysis.hunspell.dictionary.<locale>.ignore_case` setting (e.g., for
+the `en_US` (American English) locale, the setting is
+`indices.analysis.hunspell.dictionary.en_US.ignore_case`).

 It is also possible to add `settings.yml` file under the dictionary
-directory which holds these settings (this will override any other
-settings defined in the `elasticsearch.yml`).
+directory which holds these settings. This overrides any other `ignore_case`
+settings defined in `elasticsearch.yml`.

-One can use the hunspell stem filter by configuring it the analysis
-settings:
+[[analysis-hunspell-tokenfilter-analyze-ex]]
+==== Example
+
+The following analyze API request uses the `hunspell` filter to stem 
+`the foxes jumping quickly` to `the fox jump quick`.
+
+The request specifies the `en_US` locale, meaning that the
+`.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are used
+for the Hunspell dictionary.

 [source,console]
--------------------------------------------------
-PUT /hunspell_example
+----
+GET /_analyze
+{
+  "tokenizer": "standard",
+  "filter": [
+    {
+      "type": "hunspell",
+      "locale": "en_US"
+    }
+  ],
+  "text": "the foxes jumping quickly"
+}
+----
+
+The filter produces the following tokens:
+
+[source,text]
+----
+[ the, fox, jump, quick ]
+----
+
+////
+[source,console-result]
+----
+{
+  "tokens": [
+    {
+      "token": "the",
+      "start_offset": 0,
+      "end_offset": 3,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "fox",
+      "start_offset": 4,
+      "end_offset": 9,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "jump",
+      "start_offset": 10,
+      "end_offset": 17,
+      "type": "<ALPHANUM>",
+      "position": 2
+    },
+    {
+      "token": "quick",
+      "start_offset": 18,
+      "end_offset": 25,
+      "type": "<ALPHANUM>",
+      "position": 3
+    }
+  ]
+}
+----
+////
+
+[[analysis-hunspell-tokenfilter-configure-parms]]
+==== Configurable parameters
+
+[[analysis-hunspell-tokenfilter-dictionary-param]]
+`dictionary`::
+(Optional, string or array of strings)
+One or more `.dic` files (e.g., `en_US.dic, my_custom.dic`) to use for the
+Hunspell dictionary.
+
+By default, the `hunspell` filter uses all `.dic` files in the
+`<path.config>/hunspell/<locale>` directory specified using the
+`lang`, `language`, or `locale` parameter. To use another directory, the
+directory's path must be registered using the
+<<indices-analysis-hunspell-dictionary-location,
+`indices.analysis.hunspell.dictionary.location`>> setting.
+
+`dedup`::
+(Optional, boolean)
+If `true`, duplicate tokens are removed from the filter's output. Defaults to
+`true`.
+
+`lang`::
+(Required*, string)
+An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
+parameter>>.
+
+If this parameter is not specified, the `language` or `locale` parameter is
+required.
+
+`language`::
+(Required*, string)
+An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
+parameter>>.
+
+If this parameter is not specified, the `lang` or `locale` parameter is
+required.
+
+[[analysis-hunspell-tokenfilter-locale-param]]
+`locale`::
+(Required*, string)
+Locale directory used to specify the `.aff` and `.dic` files for a Hunspell
+dictionary. See <<analysis-hunspell-tokenfilter-dictionary-config>>.
+
+If this parameter is not specified, the `lang` or `language` parameter is
+required.
+
+`longest_only`::
+(Optional, boolean)
+If `true`, only the longest stemmed version of each token is
+included in the output. If `false`, all stemmed versions of the token are
+included. Defaults to `false`.
+
+[[analysis-hunspell-tokenfilter-analyzer-ex]]
+==== Customize and add to an analyzer
+
+To customize the `hunspell` filter, duplicate it to create the
+basis for a new custom token filter. You can modify the filter using its
+configurable parameters.
+
+For example, the following <<indices-create-index,create index API>> request
+uses a custom `hunspell` filter, `my_en_US_dict_stemmer`, to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
+
+The `my_en_US_dict_stemmer` filter uses a `locale` of `en_US`, meaning that the
+`.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are
+used. The filter also includes a `dedup` argument of `false`, meaning that
+duplicate tokens added from the dictionary are not removed from the filter's
+output.
+
+[source,console]
+----
+PUT /my_index
 {
  "settings": {
-        "analysis" : {
-            "analyzer" : {
-                "en" : {
-                    "tokenizer" : "standard",
-                    "filter" : [ "lowercase", "en_US" ]
+    "analysis": {
+      "analyzer": {
+        "en": {
+          "tokenizer": "standard",
+          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
-            "filter" : {
-                "en_US" : {
-                    "type" : "hunspell",
-                    "locale" : "en_US",
-                    "dedup" : true
+      "filter": {
+        "my_en_US_dict_stemmer": {
+          "type": "hunspell",
+          "locale": "en_US",
+          "dedup": false
        }
      }
    }
  }
 }
--------------------------------------------------
+----

-The hunspell token filter accepts four options:
+[[analysis-hunspell-tokenfilter-settings]]
+==== Settings

-`locale`::
-    A locale for this filter. If this is unset, the `lang` or
-    `language` are used instead - so one of these has to be set.
+In addition to the <<analysis-hunspell-ignore-case-settings,`ignore_case`
+settings>>, you can configure the following global settings for the `hunspell`
+filter using `elasticsearch.yml`:

-`dictionary`::
-    The name of a dictionary. The path to your hunspell
-    dictionaries should be configured via
-    `indices.analysis.hunspell.dictionary.location` before.
+`indices.analysis.hunspell.dictionary.lazy`::
+(Static, boolean)
+If `true`, the loading of Hunspell dictionaries is deferred until a dictionary
+is used. If `false`, the dictionary directory is checked for dictionaries when
+the node starts, and any dictionaries are automatically loaded. Defaults to
+`false`.

-`dedup`::
-    If only unique terms should be returned, this needs to be
-    set to `true`. Defaults to `true`.
-
-`longest_only`::
-    If only the longest term should be returned, set this to `true`.
-    Defaults to `false`: all possible stems are returned.
-
-NOTE: As opposed to the snowball stemmers (which are algorithm based)
-this is a dictionary lookup based stemmer and therefore the quality of
-the stemming is determined by the quality of the dictionary.
-
-[float]
-==== Dictionary loading
-
-By default, the default Hunspell directory (`config/hunspell/`) is checked
-for dictionaries when the node starts up, and any dictionaries are
-automatically loaded.
-
-Dictionary loading can be deferred until they are actually used by setting
-`indices.analysis.hunspell.dictionary.lazy` to `true` in the config file.
-
-[float]
-==== References
-
-Hunspell is a spell checker and morphological analyzer designed for
-languages with rich morphology and complex word compounding and
-character encoding.
-
-1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell
-
-2. Source code, http://hunspell.sourceforge.net/
-
-3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
-
-4.  Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
-
-5. Chromium Hunspell dictionaries,
-   http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/
+[[indices-analysis-hunspell-dictionary-location]]
+`indices.analysis.hunspell.dictionary.location`::
+(Static, string)
+Path to a Hunspell dictionary directory. This path must be absolute or
+relative to the `config` location.
+
+By default, the `<path.config>/hunspell` directory is used, as described in
+<<analysis-hunspell-tokenfilter-dictionary-config>>.
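
The rewritten "Configure Hunspell dictionaries" section says each locale directory is expected to hold a single `.aff` file and one or more `.dic` files. As a rough illustration of that rule, the sketch below walks a Hunspell directory and reports locale directories that do not match the layout. It is not part of Elasticsearch or of this commit: the function name `check_hunspell_dir` and the wording of its messages are hypothetical, and it checks only file counts, not file contents.

```python
from pathlib import Path


def check_hunspell_dir(hunspell_dir):
    """Return {locale: [problems]} for each locale directory under hunspell_dir.

    Hypothetical helper, not an Elasticsearch API. An empty problem list means
    the directory matches the documented layout: exactly one .aff file and at
    least one .dic file.
    """
    report = {}
    for locale_dir in sorted(Path(hunspell_dir).iterdir()):
        if not locale_dir.is_dir():
            continue  # only directories (e.g. en_US, pt_BR) define dictionaries
        aff_count = len(list(locale_dir.glob("*.aff")))
        dic_count = len(list(locale_dir.glob("*.dic")))
        problems = []
        if aff_count != 1:
            problems.append(f"expected exactly one .aff file, found {aff_count}")
        if dic_count == 0:
            problems.append("expected at least one .dic file, found none")
        report[locale_dir.name] = problems
    return report
```

Calling `check_hunspell_dir` on the `<path.config>/hunspell` directory (or on whatever path `indices.analysis.hunspell.dictionary.location` points to) returns one entry per locale directory, which makes a missing `.aff` file easier to spot before a node starts.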