[DOCS] Reformat `hunspell` token filter (#56955)

Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Rewrites parameter documentation
* Updates custom analyzer example
* Rewrites related setting documentation
James Rodewig 2020-05-20 14:47:53 -04:00 committed by GitHub
parent ec41d36c62
commit 5cb34d9a6e
1 changed files with 201 additions and 73 deletions

@ -4,18 +4,37 @@
<titleabbrev>Hunspell</titleabbrev>
++++
Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(`<path.conf>/hunspell`). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold a single `*.aff` and
one or more `*.dic` files (all of which will automatically be picked up).
For example, assuming the default hunspell location is used, the
following directory layout will define the `en_US` dictionary:
Provides <<dictionary-stemmers,dictionary stemming>> based on a provided
http://en.wikipedia.org/wiki/Hunspell[Hunspell dictionary]. The `hunspell`
filter requires
<<analysis-hunspell-tokenfilter-dictionary-config,configuration>> of one or more
language-specific Hunspell dictionaries.
This filter uses Lucene's
{lucene-analysis-docs}/hunspell/HunspellStemFilter.html[HunspellStemFilter].
[TIP]
====
If available, we recommend trying an algorithmic stemmer for your language
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
In practice, algorithmic stemmers typically outperform dictionary stemmers.
See <<dictionary-stemmers>>.
====
[[analysis-hunspell-tokenfilter-dictionary-config]]
==== Configure Hunspell dictionaries
By default, Hunspell dictionaries are stored in, and detected from, a dedicated
hunspell directory on the filesystem: `<path.config>/hunspell`. Each dictionary
is expected to have its own directory, named after its associated language and
locale (e.g., `pt_BR`, `en_GB`). This dictionary directory is expected to hold a
single `.aff` file and one or more `.dic` files, all of which are picked up
automatically. For example, assuming the default `<path.config>/hunspell` path
is used, the following directory layout defines the `en_US` dictionary:
[source,txt]
--------------------------------------------------
- conf
- config
|-- hunspell
| |-- en_US
| | |-- en_US.dic
@ -24,96 +43,205 @@ following directory layout will define the `en_US` dictionary:
Each dictionary can be configured with one setting:
[[analysis-hunspell-ignore-case-settings]]
`ignore_case`::
If true, dictionary matching will be case insensitive
(defaults to `false`)
(Static, boolean)
If true, dictionary matching will be case insensitive. Defaults to `false`.
This setting can be configured globally in `elasticsearch.yml` using
`indices.analysis.hunspell.dictionary.ignore_case`.
* `indices.analysis.hunspell.dictionary.ignore_case`
or for specific dictionaries:
* `indices.analysis.hunspell.dictionary.en_US.ignore_case`.
To configure the setting for a specific locale, use the
`indices.analysis.hunspell.dictionary.<locale>.ignore_case` setting (e.g., for
the `en_US` (American English) locale, the setting is
`indices.analysis.hunspell.dictionary.en_US.ignore_case`).
It is also possible to add a `settings.yml` file under the dictionary
directory which holds these settings (this will override any other
settings defined in the `elasticsearch.yml`).
directory which holds these settings. This overrides any other `ignore_case`
settings defined in `elasticsearch.yml`.
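
The precedence described above can be sketched in Python. This is a minimal illustration, not Elasticsearch's implementation: the helper names `locale_key` and `resolve_ignore_case` are hypothetical, settings are modeled as plain dictionaries, and two points are assumptions rather than statements from this page: that a locale-specific setting takes precedence over the global one, and that the key inside `settings.yml` is named `ignore_case`.

```python
from typing import Mapping, Optional

# Global setting name, as documented above.
GLOBAL_KEY = "indices.analysis.hunspell.dictionary.ignore_case"


def locale_key(locale: str) -> str:
    """Build the locale-specific setting name, e.g. for `en_US`."""
    return f"indices.analysis.hunspell.dictionary.{locale}.ignore_case"


def resolve_ignore_case(
    locale: str,
    node_settings: Mapping[str, bool],
    dictionary_settings: Optional[Mapping[str, bool]] = None,
) -> bool:
    """Return the effective `ignore_case` value for one dictionary.

    node_settings models elasticsearch.yml; dictionary_settings models the
    optional settings.yml in the dictionary directory (hypothetical shape).
    """
    # 1. settings.yml in the dictionary directory overrides elasticsearch.yml.
    #    ASSUMPTION: the key in that file is named `ignore_case`.
    if dictionary_settings and "ignore_case" in dictionary_settings:
        return bool(dictionary_settings["ignore_case"])
    # 2. ASSUMPTION: a locale-specific setting wins over the global one.
    key = locale_key(locale)
    if key in node_settings:
        return bool(node_settings[key])
    # 3. Fall back to the global setting, then to the documented default.
    return bool(node_settings.get(GLOBAL_KEY, False))
```

For example, `resolve_ignore_case("en_US", {GLOBAL_KEY: True})` yields `True`, while adding `{"ignore_case": False}` as the third argument yields `False`.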
One can use the hunspell stem filter by configuring it in the analysis
settings:
[[analysis-hunspell-tokenfilter-analyze-ex]]
==== Example
The following analyze API request uses the `hunspell` filter to stem
`the foxes jumping quickly` to `the fox jump quick`.
The request specifies the `en_US` locale, meaning that the
`.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are used
for the Hunspell dictionary.
[source,console]
--------------------------------------------------
PUT /hunspell_example
----
GET /_analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "hunspell",
"locale": "en_US"
}
],
"text": "the foxes jumping quickly"
}
----
The filter produces the following tokens:
[source,text]
----
[ the, fox, jump, quick ]
----
////
[source,console-result]
----
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "fox",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jump",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "quick",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 3
}
]
}
----
////
[[analysis-hunspell-tokenfilter-configure-parms]]
==== Configurable parameters
[[analysis-hunspell-tokenfilter-dictionary-param]]
`dictionary`::
(Optional, string or array of strings)
One or more `.dic` files (e.g., `en_US.dic, my_custom.dic`) to use for the
Hunspell dictionary.
+
By default, the `hunspell` filter uses all `.dic` files in the
`<path.config>/hunspell/<locale>` directory specified using the
`lang`, `language`, or `locale` parameter. To use another directory, the
directory's path must be registered using the
<<indices-analysis-hunspell-dictionary-location,
`indices.analysis.hunspell.dictionary.location`>> setting.
`dedup`::
(Optional, boolean)
If `true`, duplicate tokens are removed from the filter's output. Defaults to
`true`.
`lang`::
(Required*, string)
An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
parameter>>.
+
If this parameter is not specified, the `language` or `locale` parameter is
required.
`language`::
(Required*, string)
An alias for the <<analysis-hunspell-tokenfilter-locale-param,`locale`
parameter>>.
+
If this parameter is not specified, the `lang` or `locale` parameter is
required.
[[analysis-hunspell-tokenfilter-locale-param]]
`locale`::
(Required*, string)
Locale directory used to specify the `.aff` and `.dic` files for a Hunspell
dictionary. See <<analysis-hunspell-tokenfilter-dictionary-config>>.
+
If this parameter is not specified, the `lang` or `language` parameter is
required.
`longest_only`::
(Optional, boolean)
If `true`, only the longest stemmed version of each token is
included in the output. If `false`, all stemmed versions of the token are
included. Defaults to `false`.
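
As an illustration of how these parameters combine, the following Python sketch builds a `hunspell` filter definition and an analyze API request body as plain dictionaries. The helper names `hunspell_filter` and `analyze_body` are hypothetical and exist only for this example. The sketch only constructs JSON; it does not contact a cluster, and it omits parameters left at their documented defaults.

```python
import json
from typing import Any, Dict, List, Optional, Union


def hunspell_filter(
    locale: str,
    dictionary: Optional[Union[str, List[str]]] = None,
    dedup: bool = True,
    longest_only: bool = False,
) -> Dict[str, Any]:
    """Build a `hunspell` filter definition (hypothetical helper)."""
    # One of `lang`, `language`, or `locale` is required; this sketch
    # always uses `locale`.
    if not locale:
        raise ValueError("a locale such as 'en_US' is required")
    definition: Dict[str, Any] = {"type": "hunspell", "locale": locale}
    if dictionary is not None:
        definition["dictionary"] = dictionary
    # Only emit values that differ from the documented defaults.
    if not dedup:
        definition["dedup"] = False
    if longest_only:
        definition["longest_only"] = True
    return definition


def analyze_body(text: str, filter_definition: Dict[str, Any]) -> Dict[str, Any]:
    """Build a request body for the analyze API using the standard tokenizer."""
    return {"tokenizer": "standard", "filter": [filter_definition], "text": text}


if __name__ == "__main__":
    body = analyze_body(
        "the foxes jumping quickly",
        hunspell_filter("en_US", longest_only=True),
    )
    print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `GET /_analyze` request, provided the `en_US` dictionary is installed on the node.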
[[analysis-hunspell-tokenfilter-analyzer-ex]]
==== Customize and add to an analyzer
To customize the `hunspell` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.
For example, the following <<indices-create-index,create index API>> request
uses a custom `hunspell` filter, `my_en_US_dict_stemmer`, to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
The `my_en_US_dict_stemmer` filter uses a `locale` of `en_US`, meaning that the
`.aff` and `.dic` files in the `<path.config>/hunspell/en_US` directory are
used. The filter also includes a `dedup` argument of `false`, meaning that
duplicate tokens added from the dictionary are not removed from the filter's
output.
[source,console]
----
PUT /my_index
{
"settings": {
"analysis" : {
"analyzer" : {
"en" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "en_US" ]
"analysis": {
"analyzer": {
"en": {
"tokenizer": "standard",
"filter": [ "my_en_US_dict_stemmer" ]
}
},
"filter" : {
"en_US" : {
"type" : "hunspell",
"locale" : "en_US",
"dedup" : true
"filter": {
"my_en_US_dict_stemmer": {
"type": "hunspell",
"locale": "en_US",
"dedup": false
}
}
}
}
}
--------------------------------------------------
----
The hunspell token filter accepts four options:
[[analysis-hunspell-tokenfilter-settings]]
==== Settings
`locale`::
A locale for this filter. If this is unset, the `lang` or
`language` are used instead - so one of these has to be set.
In addition to the <<analysis-hunspell-ignore-case-settings,`ignore_case`
settings>>, you can configure the following global settings for the `hunspell`
filter using `elasticsearch.yml`:
`dictionary`::
The name of a dictionary. The path to your hunspell
dictionaries should be configured via
`indices.analysis.hunspell.dictionary.location` before.
`indices.analysis.hunspell.dictionary.lazy`::
(Static, boolean)
If `true`, the loading of Hunspell dictionaries is deferred until a dictionary
is used. If `false`, the dictionary directory is checked for dictionaries when
the node starts, and any dictionaries are automatically loaded. Defaults to
`false`.
`dedup`::
If only unique terms should be returned, this needs to be
set to `true`. Defaults to `true`.
`longest_only`::
If only the longest term should be returned, set this to `true`.
Defaults to `false`: all possible stems are returned.
NOTE: As opposed to the snowball stemmers (which are algorithm based)
this is a dictionary lookup based stemmer and therefore the quality of
the stemming is determined by the quality of the dictionary.
[float]
==== Dictionary loading
By default, the default Hunspell directory (`config/hunspell/`) is checked
for dictionaries when the node starts up, and any dictionaries are
automatically loaded.
Dictionary loading can be deferred until they are actually used by setting
`indices.analysis.hunspell.dictionary.lazy` to `true` in the config file.
[float]
==== References
Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and
character encoding.
1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell
2. Source code, http://hunspell.sourceforge.net/
3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
4. Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
5. Chromium Hunspell dictionaries,
http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/
[[indices-analysis-hunspell-dictionary-location]]
`indices.analysis.hunspell.dictionary.location`::
(Static, string)
Path to a Hunspell dictionary directory. This path must be absolute or
relative to the `config` location.
+
By default, the `<path.config>/hunspell` directory is used, as described in
<<analysis-hunspell-tokenfilter-dictionary-config>>.
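
Before pointing a node at a dictionary location, it can help to check that a locale directory has the expected layout: a single `.aff` file and one or more `.dic` files. The following Python sketch performs that check on the local filesystem. The function name `find_dictionary_files` is hypothetical, and the check mirrors only the layout rules described on this page, not Elasticsearch's own loading logic.

```python
from pathlib import Path
from typing import List, Tuple


def find_dictionary_files(hunspell_dir: Path, locale: str) -> Tuple[Path, List[Path]]:
    """Return the `.aff` file and the `.dic` files for one locale directory.

    hunspell_dir is the dictionary location, for example
    <path.config>/hunspell. Raises if the layout does not match the
    documented expectations.
    """
    locale_dir = hunspell_dir / locale
    if not locale_dir.is_dir():
        raise FileNotFoundError(f"no dictionary directory for locale {locale!r}")

    aff_files = sorted(locale_dir.glob("*.aff"))
    dic_files = sorted(locale_dir.glob("*.dic"))

    # The dictionary directory is expected to hold a single .aff file...
    if len(aff_files) != 1:
        raise ValueError(f"expected exactly one .aff file, found {len(aff_files)}")
    # ...and one or more .dic files.
    if not dic_files:
        raise ValueError("expected at least one .dic file")
    return aff_files[0], dic_files
```

For the `en_US` layout shown earlier, `find_dictionary_files(Path("config/hunspell"), "en_US")` would return the `.aff` path and a list of the `.dic` paths found in that directory.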