[DOCS] Reformat compound word token filters (#49006)

* Separates the compound token filters doc page into separate token
  filter pages:
  * Dictionary decompounder token filter
  * Hyphenation decompounder token filter

* Adds analyze API examples for each compound token filter

* Adds a redirect for the removed compound token filters page

Co-Authored-By: debadair <debadair@elastic.co>
James Rodewig 2019-11-13 09:35:00 -05:00
parent b55022b59f
commit 838af15d29
5 changed files with 337 additions and 117 deletions


@@ -22,14 +22,14 @@ include::tokenfilters/classic-tokenfilter.asciidoc[]
include::tokenfilters/common-grams-tokenfilter.asciidoc[]
include::tokenfilters/compound-word-tokenfilter.asciidoc[]
include::tokenfilters/condition-tokenfilter.asciidoc[]
include::tokenfilters/decimal-digit-tokenfilter.asciidoc[]
include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]
include::tokenfilters/dictionary-decompounder-tokenfilter.asciidoc[]
include::tokenfilters/edgengram-tokenfilter.asciidoc[]
include::tokenfilters/elision-tokenfilter.asciidoc[]
@@ -40,6 +40,8 @@ include::tokenfilters/flatten-graph-tokenfilter.asciidoc[]
include::tokenfilters/hunspell-tokenfilter.asciidoc[]
include::tokenfilters/hyphenation-decompounder-tokenfilter.asciidoc[]
include::tokenfilters/keep-types-tokenfilter.asciidoc[]
include::tokenfilters/keep-words-tokenfilter.asciidoc[]


@@ -1,115 +0,0 @@
[[analysis-compound-word-tokenfilter]]
=== Compound Word Token Filters

The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
decompose compound words found in many Germanic languages into word parts.

Both token filters require a dictionary of word parts, which can be provided
as:
[horizontal]
`word_list`::
An array of words, specified inline in the token filter configuration, or
`word_list_path`::
The path (either absolute or relative to the `config` directory) to a UTF-8
encoded file containing one word per line.
[float]
=== Hyphenation decompounder
The `hyphenation_decompounder` uses hyphenation grammars to find potential
subwords that are then checked against the word dictionary. The quality of the
output tokens is directly connected to the quality of the grammar file you
use. For languages like German, the available grammar files are quite good.
XML-based hyphenation grammar files can be found in the
http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
(OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files
are supported. You can download https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip]
directly and look in the `offo-hyphenation/hyph/` directory.

Credits for the hyphenation code go to the Apache FOP project.
[float]
=== Dictionary decompounder
The `dictionary_decompounder` uses a brute force approach in conjunction with
only the word dictionary to find subwords in a compound word. It is much
slower than the hyphenation decompounder, but can be used as a starting point
to check the quality of your dictionary.
[float]
=== Compound token filter parameters
The following parameters can be used to configure a compound word token
filter:
[horizontal]
`type`::
Either `dictionary_decompounder` or `hyphenation_decompounder`.
`word_list`::
An array containing a list of words to use for the word dictionary.
`word_list_path`::
The path (either absolute or relative to the `config` directory) to the word dictionary.
`hyphenation_patterns_path`::
The path (either absolute or relative to the `config` directory) to a FOP XML
hyphenation pattern file (required for the hyphenation decompounder).
`min_word_size`::
Minimum word size. Defaults to `5`.
`min_subword_size`::
Minimum subword size. Defaults to `2`.
`max_subword_size`::
Maximum subword size. Defaults to `15`.
`only_longest_match`::
Whether to include only the longest matching subword. Defaults to `false`.
Here is an example:
[source,console]
--------------------------------------------------
PUT /compound_word_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["dictionary_decompounder", "hyphenation_decompounder"]
          }
        },
        "filter": {
          "dictionary_decompounder": {
            "type": "dictionary_decompounder",
            "word_list": ["one", "two", "three"]
          },
          "hyphenation_decompounder": {
            "type": "hyphenation_decompounder",
            "word_list_path": "analysis/example_word_list.txt",
            "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
            "max_subword_size": 22
          }
        }
      }
    }
  }
}
--------------------------------------------------


@@ -0,0 +1,173 @@
[[analysis-dict-decomp-tokenfilter]]
=== Dictionary decompounder token filter
++++
<titleabbrev>Dictionary decompounder</titleabbrev>
++++
[NOTE]
====
In most cases, we recommend using the faster
<<analysis-hyp-decomp-tokenfilter,`hyphenation_decompounder`>> token filter
in place of this filter. However, you can use the
`dictionary_decompounder` filter to check the quality of a word list before
implementing it in the `hyphenation_decompounder` filter.
====
Uses a specified list of words and a brute force approach to find subwords in
compound words. If found, these subwords are included in the token output.
This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html[DictionaryCompoundWordTokenFilter],
which was built for Germanic languages.
[[analysis-dict-decomp-tokenfilter-analyze-ex]]
==== Example
The following <<indices-analyze,analyze API>> request uses the
`dictionary_decompounder` filter to find subwords in `Donaudampfschiff`. The
filter then checks these subwords against the specified list of words: `Donau`,
`dampf`, `meer`, and `schiff`.
[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------
The filter produces the following tokens:
[source,text]
--------------------------------------------------
[ Donaudampfschiff, Donau, dampf, schiff ]
--------------------------------------------------
/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "Donaudampfschiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "Donau",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dampf",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////
[[analysis-dict-decomp-tokenfilter-configure-parms]]
==== Configurable parameters
`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords to look for in the token stream. If found, the subword is
included in the token output.
Either this parameter or `word_list_path` must be specified.
--
`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of subwords to find in the token stream. If
found, the subword is included in the token output.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.
Either this parameter or `word_list` must be specified.
--
`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.
`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.
`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.
`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.
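
As a combined sketch of these parameters, the following analyze API request
sets `min_subword_size` and `only_longest_match` alongside an inline word
list. The word list and sample text here are illustrative assumptions, not
values the API requires. With `only_longest_match` enabled, overlapping
matches such as `schiff` and `schifffahrt` should be reduced to the longest
one.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "schiff", "schifffahrt"],
      "min_subword_size": 4,
      "only_longest_match": true
    }
  ],
  "text": "Donaudampfschifffahrt"
}
--------------------------------------------------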
[[analysis-dict-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer
To customize the `dictionary_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.
For example, the following <<indices-create-index,create index API>> request
uses a custom `dictionary_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
The custom `dictionary_decompounder` filter finds subwords in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.
[source,console]
--------------------------------------------------
PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
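
Once the index is created, you can sanity-check the custom analyzer with the
<<indices-analyze,analyze API>>. This is a minimal sketch: it assumes
`analysis/example_word_list.txt` is present on each node, and the sample text
is an illustrative assumption, not part of the original example.

[source,console]
--------------------------------------------------
GET dictionary_decompound_example/_analyze
{
  "analyzer": "standard_dictionary_decompound",
  "text": "Donaudampfschiff"
}
--------------------------------------------------
// TEST[skip: requires analysis/example_word_list.txt]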


@@ -0,0 +1,154 @@
[[analysis-hyp-decomp-tokenfilter]]
=== Hyphenation decompounder token filter
++++
<titleabbrev>Hyphenation decompounder</titleabbrev>
++++
Uses XML-based hyphenation patterns to find potential subwords in compound
words. These subwords are then checked against the specified word list. Subwords not
in the list are excluded from the token output.
This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html[HyphenationCompoundWordTokenFilter],
which was built for Germanic languages.
[[analysis-hyp-decomp-tokenfilter-analyze-ex]]
==== Example
The following <<indices-analyze,analyze API>> request uses the
`hyphenation_decompounder` filter to find subwords in `Kaffeetasse` based on
German hyphenation patterns in the `analysis/hyphenation_patterns.xml` file. The
filter then checks these subwords against a list of specified words: `kaffee`,
`zucker`, and `tasse`.
[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}
--------------------------------------------------
// TEST[skip: requires a valid hyphenation_patterns.xml file for DE-DR]
The filter produces the following tokens:
[source,text]
--------------------------------------------------
[ Kaffeetasse, Kaffee, tasse ]
--------------------------------------------------
[[analysis-hyp-decomp-tokenfilter-configure-parms]]
==== Configurable parameters
`hyphenation_patterns_path`::
+
--
(Required, string)
Path to an Apache FOP (Formatting Objects Processor) XML hyphenation pattern file.
This path must be absolute or relative to the `config` location. Only FOP v1.2
compatible files are supported.
For examples of FOP XML hyphenation pattern files, refer to:
* http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects (OFFO) Sourceforge project]
* https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip direct download]
--
`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords. Subwords found using the hyphenation pattern but not in this
list are excluded from the token output.
You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.
Either this parameter or `word_list_path` must be specified.
--
`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of subwords. Subwords found using the
hyphenation pattern but not in this list are excluded from the token output.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.
You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.
Either this parameter or `word_list` must be specified.
--
`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.
`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.
`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.
`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.
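
As with the `dictionary_decompounder` filter, these parameters can be combined
in a single analyze API request. The following sketch tightens
`min_subword_size` and enables `only_longest_match`; it assumes a valid FOP
v1.2 patterns file at `analysis/hyphenation_patterns.xml`, and the inline word
list and sample text are illustrative assumptions.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"],
      "min_subword_size": 4,
      "only_longest_match": true
    }
  ],
  "text": "Kaffeetasse"
}
--------------------------------------------------
// TEST[skip: requires a valid hyphenation_patterns.xml file]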
[[analysis-hyp-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer
To customize the `hyphenation_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.
For example, the following <<indices-create-index,create index API>> request
uses a custom `hyphenation_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
The custom `hyphenation_decompounder` filter finds subwords based on hyphenation
patterns in the `analysis/hyphenation_patterns.xml` file. The filter then checks
these subwords against the list of words specified in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.
[source,console]
--------------------------------------------------
PUT hyphenation_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_hyphenation_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_hyphenation_decompound" ]
        }
      },
      "filter": {
        "22_char_hyphenation_decompound": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
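
As with the dictionary example, you can verify the analyzer with the
<<indices-analyze,analyze API>> once the index exists. A minimal sketch,
assuming both `analysis/example_word_list.txt` and
`analysis/hyphenation_patterns.xml` are present on each node; the sample text
is an illustrative assumption.

[source,console]
--------------------------------------------------
GET hyphenation_decompound_example/_analyze
{
  "analyzer": "standard_hyphenation_decompound",
  "text": "Kaffeetasse"
}
--------------------------------------------------
// TEST[skip: requires word list and hyphenation pattern files]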


@@ -903,3 +903,9 @@ See <<monitor-elasticsearch-cluster>>.
[role="exclude",id="docker-cli-run"]
See <<docker-cli-run-dev-mode>>.
[role="exclude",id="analysis-compound-word-tokenfilter"]
=== Compound word token filters
See <<analysis-dict-decomp-tokenfilter>> and
<<analysis-hyp-decomp-tokenfilter>>.