[DOCS] Reformat compound word token filters (#49006)
* Separates the compound token filters doc pages into separate token filter
  pages:
  * Dictionary decompounder token filter
  * Hyphenation decompounder token filter
* Adds analyze API examples for each compound token filter
* Adds a redirect for the removed compound token filters page

Co-Authored-By: debadair <debadair@elastic.co>

parent b55022b59f
commit 838af15d29

@@ -22,14 +22,14 @@ include::tokenfilters/classic-tokenfilter.asciidoc[]

include::tokenfilters/common-grams-tokenfilter.asciidoc[]

include::tokenfilters/compound-word-tokenfilter.asciidoc[]

include::tokenfilters/condition-tokenfilter.asciidoc[]

include::tokenfilters/decimal-digit-tokenfilter.asciidoc[]

include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]

include::tokenfilters/dictionary-decompounder-tokenfilter.asciidoc[]

include::tokenfilters/edgengram-tokenfilter.asciidoc[]

include::tokenfilters/elision-tokenfilter.asciidoc[]

@@ -40,6 +40,8 @@ include::tokenfilters/flatten-graph-tokenfilter.asciidoc[]

include::tokenfilters/hunspell-tokenfilter.asciidoc[]

include::tokenfilters/hyphenation-decompounder-tokenfilter.asciidoc[]

include::tokenfilters/keep-types-tokenfilter.asciidoc[]

include::tokenfilters/keep-words-tokenfilter.asciidoc[]

@@ -1,115 +0,0 @@
[[analysis-compound-word-tokenfilter]]
=== Compound Word Token Filters

The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
decompose compound words found in many Germanic languages into word parts.

Both token filters require a dictionary of word parts, which can be provided
as:

[horizontal]
`word_list`::

An array of words, specified inline in the token filter configuration, or

`word_list_path`::

The path (either absolute or relative to the `config` directory) to a UTF-8
encoded file containing one word per line.

[float]
=== Hyphenation decompounder

The `hyphenation_decompounder` uses hyphenation grammars to find potential
subwords that are then checked against the word dictionary. The quality of the
output tokens is directly connected to the quality of the grammar file you
use. For languages like German, they are quite good.

XML-based hyphenation grammar files can be found in the
http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
(OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files
are supported. You can download https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip]
directly and look in the `offo-hyphenation/hyph/` directory.
Credits for the hyphenation code go to the Apache FOP project.

[float]
=== Dictionary decompounder

The `dictionary_decompounder` uses a brute force approach in conjunction with
only the word dictionary to find subwords in a compound word. It is much
slower than the hyphenation decompounder but can serve as a starting point for
checking the quality of your dictionary.

[float]
=== Compound token filter parameters

The following parameters can be used to configure a compound word token
filter:

[horizontal]
`type`::

Either `dictionary_decompounder` or `hyphenation_decompounder`.

`word_list`::

An array containing a list of words to use for the word dictionary.

`word_list_path`::

The path (either absolute or relative to the `config` directory) to the word
dictionary.

`hyphenation_patterns_path`::

The path (either absolute or relative to the `config` directory) to a FOP XML
hyphenation pattern file. (Required for the hyphenation decompounder.)

`min_word_size`::

Minimum word size. Defaults to `5`.

`min_subword_size`::

Minimum subword size. Defaults to `2`.

`max_subword_size`::

Maximum subword size. Defaults to `15`.

`only_longest_match`::

Whether to include only the longest matching subword or not. Defaults to
`false`.

Here is an example:

[source,console]
--------------------------------------------------
PUT /compound_word_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["dictionary_decompounder", "hyphenation_decompounder"]
          }
        },
        "filter": {
          "dictionary_decompounder": {
            "type": "dictionary_decompounder",
            "word_list": ["one", "two", "three"]
          },
          "hyphenation_decompounder": {
            "type": "hyphenation_decompounder",
            "word_list_path": "analysis/example_word_list.txt",
            "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
            "max_subword_size": 22
          }
        }
      }
    }
  }
}
--------------------------------------------------

@@ -0,0 +1,173 @@
[[analysis-dict-decomp-tokenfilter]]
=== Dictionary decompounder token filter
++++
<titleabbrev>Dictionary decompounder</titleabbrev>
++++

[NOTE]
====
In most cases, we recommend using the faster
<<analysis-hyp-decomp-tokenfilter,`hyphenation_decompounder`>> token filter
in place of this filter. However, you can use the
`dictionary_decompounder` filter to check the quality of a word list before
implementing it in the `hyphenation_decompounder` filter.
====

Uses a specified list of words and a brute force approach to find subwords in
compound words. If found, these subwords are included in the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html[DictionaryCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-dict-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`dictionary_decompounder` filter to find subwords in `Donaudampfschiff`. The
filter then checks these subwords against the specified list of words: `Donau`,
`dampf`, `meer`, and `schiff`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Donaudampfschiff, Donau, dampf, schiff ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "Donaudampfschiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "Donau",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dampf",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-dict-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords to look for in the token stream. If found, the subword is
included in the token output.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of subwords to find in the token stream. If
found, the subword is included in the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.
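
For example, a word list file matching the inline list used earlier (a
hypothetical `example_word_list.txt`, shown here for illustration) might
contain:

[source,text]
--------------------------------------------------
Donau
dampf
meer
schiff
--------------------------------------------------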

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.

[[analysis-dict-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `dictionary_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `dictionary_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `dictionary_decompounder` filter finds subwords in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.

[source,console]
--------------------------------------------------
PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
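
As a quick sanity check, you could then run the new analyzer with the analyze
API. The request below is an illustrative sketch, not part of the original
example, and assumes `analysis/example_word_list.txt` exists in the `config`
directory of each node:

[source,console]
--------------------------------------------------
GET dictionary_decompound_example/_analyze
{
  "analyzer": "standard_dictionary_decompound",
  "text": "Donaudampfschiff"
}
--------------------------------------------------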

@@ -0,0 +1,154 @@
[[analysis-hyp-decomp-tokenfilter]]
=== Hyphenation decompounder token filter
++++
<titleabbrev>Hyphenation decompounder</titleabbrev>
++++

Uses XML-based hyphenation patterns to find potential subwords in compound
words. These subwords are then checked against the specified word list.
Subwords not in the list are excluded from the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html[HyphenationCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-hyp-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`hyphenation_decompounder` filter to find subwords in `Kaffeetasse` based on
German hyphenation patterns in the `analysis/hyphenation_patterns.xml` file. The
filter then checks these subwords against the specified list of words: `Kaffee`,
`zucker`, and `tasse`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}
--------------------------------------------------
// TEST[skip: requires a valid hyphenation_patterns.xml file for DE-DR]

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Kaffeetasse, Kaffee, tasse ]
--------------------------------------------------

[[analysis-hyp-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`hyphenation_patterns_path`::
+
--
(Required, string)
Path to an Apache FOP (Formatting Objects Processor) XML hyphenation pattern
file.

This path must be absolute or relative to the `config` location. Only FOP v1.2
compatible files are supported.

For sample FOP XML hyphenation pattern files, refer to:

* http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects (OFFO) Sourceforge project]
* https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip direct download]
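
Structurally, these files wrap TeX-style hyphenation patterns in FOP's XML
format. The heavily abridged sketch below is illustrative only: the patterns
shown are placeholders, not a usable pattern set, so download a real file from
the sources above instead of writing your own.

[source,xml]
--------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<hyphenation-info>
  <!-- minimum characters before and after a hyphenation point -->
  <hyphen-min before="2" after="2"/>
  <patterns>
    <!-- placeholder TeX-style patterns; real files contain thousands -->
    .ab3a 4abe. a1ba
  </patterns>
</hyphenation-info>
--------------------------------------------------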
--

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords. Subwords found using the hyphenation pattern but not in this
list are excluded from the token output.

You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of subwords. Subwords found using the
hyphenation pattern but not in this list are excluded from the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.

You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.

[[analysis-hyp-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `hyphenation_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `hyphenation_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `hyphenation_decompounder` filter finds subwords based on hyphenation
patterns in the `analysis/hyphenation_patterns.xml` file. The filter then checks
these subwords against the list of words specified in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.

[source,console]
--------------------------------------------------
PUT hyphenation_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_hyphenation_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_hyphenation_decompound" ]
        }
      },
      "filter": {
        "22_char_hyphenation_decompound": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
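
As with the dictionary example, you could sanity-check the analyzer with the
analyze API. The request below is an illustrative sketch, not part of the
original example, and assumes both `analysis/example_word_list.txt` and
`analysis/hyphenation_patterns.xml` exist in the `config` directory of each
node:

[source,console]
--------------------------------------------------
GET hyphenation_decompound_example/_analyze
{
  "analyzer": "standard_hyphenation_decompound",
  "text": "Kaffeetasse"
}
--------------------------------------------------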

@@ -903,3 +903,9 @@ See <<monitor-elasticsearch-cluster>>.

[role="exclude",id="docker-cli-run"]

See <<docker-cli-run-dev-mode>>.

[role="exclude",id="analysis-compound-word-tokenfilter"]
=== Compound word token filters

See <<analysis-dict-decomp-tokenfilter>> and
<<analysis-hyp-decomp-tokenfilter>>.