[DOCS] Reformat compound word token filters (#49006)
* Separates the compound token filters doc pages into separate token filter
  pages:
  * Dictionary decompounder token filter
  * Hyphenation decompounder token filter
* Adds analyze API examples for each compound token filter
* Adds a redirect for the removed compound token filters page

Co-Authored-By: debadair <debadair@elastic.co>

parent b55022b59f
commit 838af15d29

@@ -22,14 +22,14 @@ include::tokenfilters/classic-tokenfilter.asciidoc[]

include::tokenfilters/common-grams-tokenfilter.asciidoc[]

include::tokenfilters/compound-word-tokenfilter.asciidoc[]

include::tokenfilters/condition-tokenfilter.asciidoc[]

include::tokenfilters/decimal-digit-tokenfilter.asciidoc[]

include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]

include::tokenfilters/dictionary-decompounder-tokenfilter.asciidoc[]

include::tokenfilters/edgengram-tokenfilter.asciidoc[]

include::tokenfilters/elision-tokenfilter.asciidoc[]

@@ -40,6 +40,8 @@ include::tokenfilters/flatten-graph-tokenfilter.asciidoc[]

include::tokenfilters/hunspell-tokenfilter.asciidoc[]

include::tokenfilters/hyphenation-decompounder-tokenfilter.asciidoc[]

include::tokenfilters/keep-types-tokenfilter.asciidoc[]

include::tokenfilters/keep-words-tokenfilter.asciidoc[]

@@ -1,115 +0,0 @@
[[analysis-compound-word-tokenfilter]]
=== Compound Word Token Filters

The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
decompose compound words found in many Germanic languages into word parts.

Both token filters require a dictionary of word parts, which can be provided
as:

[horizontal]
`word_list`::

An array of words, specified inline in the token filter configuration, or

`word_list_path`::

The path (either absolute or relative to the `config` directory) to a UTF-8
encoded file containing one word per line.

[float]
=== Hyphenation decompounder

The `hyphenation_decompounder` uses hyphenation grammars to find potential
subwords that are then checked against the word dictionary. The quality of the
output tokens is directly connected to the quality of the grammar file you
use. For languages like German, they are quite good.

XML-based hyphenation grammar files can be found in the
http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
(OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files
are supported. You can download https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip]
directly and look in the `offo-hyphenation/hyph/` directory.
Credits for the hyphenation code go to the Apache FOP project.

[float]
=== Dictionary decompounder

The `dictionary_decompounder` uses a brute force approach in conjunction with
only the word dictionary to find subwords in a compound word. It is much
slower than the hyphenation decompounder but can serve as a starting point for
checking the quality of your dictionary.

[float]
=== Compound token filter parameters

The following parameters can be used to configure a compound word token
filter:

[horizontal]
`type`::

Either `dictionary_decompounder` or `hyphenation_decompounder`.

`word_list`::

An array containing a list of words to use for the word dictionary.

`word_list_path`::

The path (either absolute or relative to the `config` directory) to the word
dictionary.

`hyphenation_patterns_path`::

The path (either absolute or relative to the `config` directory) to a FOP XML
hyphenation pattern file. (Required for the hyphenation decompounder.)

`min_word_size`::

Minimum word size. Defaults to `5`.

`min_subword_size`::

Minimum subword size. Defaults to `2`.

`max_subword_size`::

Maximum subword size. Defaults to `15`.

`only_longest_match`::

Whether to include only the longest matching subword or not. Defaults to
`false`.

Here is an example:

[source,console]
--------------------------------------------------
PUT /compound_word_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["dictionary_decompounder", "hyphenation_decompounder"]
          }
        },
        "filter": {
          "dictionary_decompounder": {
            "type": "dictionary_decompounder",
            "word_list": ["one", "two", "three"]
          },
          "hyphenation_decompounder": {
            "type": "hyphenation_decompounder",
            "word_list_path": "analysis/example_word_list.txt",
            "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
            "max_subword_size": 22
          }
        }
      }
    }
  }
}
--------------------------------------------------

@@ -0,0 +1,173 @@
[[analysis-dict-decomp-tokenfilter]]
=== Dictionary decompounder token filter
++++
<titleabbrev>Dictionary decompounder</titleabbrev>
++++

[NOTE]
====
In most cases, we recommend using the faster
<<analysis-hyp-decomp-tokenfilter,`hyphenation_decompounder`>> token filter
in place of this filter. However, you can use the
`dictionary_decompounder` filter to check the quality of a word list before
implementing it in the `hyphenation_decompounder` filter.
====

Uses a specified list of words and a brute force approach to find subwords in
compound words. If found, these subwords are included in the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html[DictionaryCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-dict-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`dictionary_decompounder` filter to find subwords in `Donaudampfschiff`. The
filter then checks these subwords against the specified list of words: `Donau`,
`dampf`, `meer`, and `schiff`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Donaudampfschiff, Donau, dampf, schiff ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "Donaudampfschiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "Donau",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dampf",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-dict-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords to look for in the token stream. If found, the subword is
included in the token output.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of subwords to find in the token stream. If
found, the subword is included in the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.
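
For example, a word list file matching the inline list used earlier (a
hypothetical `example_word_list.txt`, shown here for illustration) might
contain:

[source,text]
--------------------------------------------------
Donau
dampf
meer
schiff
--------------------------------------------------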

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.

[[analysis-dict-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `dictionary_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `dictionary_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `dictionary_decompounder` filter finds subwords in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.

[source,console]
--------------------------------------------------
PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
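
As a quick sanity check, you could then run the new analyzer with the analyze
API. The request below is an illustrative sketch, not part of the original
example, and assumes `analysis/example_word_list.txt` exists in the `config`
directory of each node:

[source,console]
--------------------------------------------------
GET dictionary_decompound_example/_analyze
{
  "analyzer": "standard_dictionary_decompound",
  "text": "Donaudampfschiff"
}
--------------------------------------------------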

@@ -0,0 +1,154 @@
[[analysis-hyp-decomp-tokenfilter]]
=== Hyphenation decompounder token filter
++++
<titleabbrev>Hyphenation decompounder</titleabbrev>
++++

Uses XML-based hyphenation patterns to find potential subwords in compound
words. These subwords are then checked against the specified word list.
Subwords not in the list are excluded from the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html[HyphenationCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-hyp-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`hyphenation_decompounder` filter to find subwords in `Kaffeetasse` based on
German hyphenation patterns in the `analysis/hyphenation_patterns.xml` file. The
filter then checks these subwords against the specified list of words: `Kaffee`,
`zucker`, and `tasse`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}
--------------------------------------------------
// TEST[skip: requires a valid hyphenation_patterns.xml file for DE-DR]

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Kaffeetasse, Kaffee, tasse ]
--------------------------------------------------

[[analysis-hyp-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`hyphenation_patterns_path`::
+
--
(Required, string)
Path to an Apache FOP (Formatting Objects Processor) XML hyphenation pattern
file.

This path must be absolute or relative to the `config` location. Only FOP v1.2
compatible files are supported.

For sample FOP XML hyphenation pattern files, refer to:

* http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects (OFFO) Sourceforge project]
* https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download[offo-hyphenation_v1.2.zip direct download]
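
Structurally, these files wrap TeX-style hyphenation patterns in FOP's XML
format. The heavily abridged sketch below is illustrative only: the patterns
shown are placeholders, not a usable pattern set, so download a real file from
the sources above instead of writing your own.

[source,xml]
--------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<hyphenation-info>
  <!-- minimum characters before and after a hyphenation point -->
  <hyphen-min before="2" after="2"/>
  <patterns>
    <!-- placeholder TeX-style patterns; real files contain thousands -->
    .ab3a 4abe. a1ba
  </patterns>
</hyphenation-info>
--------------------------------------------------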
--

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords. Subwords found using the hyphenation pattern but not in this
list are excluded from the token output.

You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of subwords. Subwords found using the
hyphenation pattern but not in this list are excluded from the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.

You can use the <<analysis-dict-decomp-tokenfilter,`dictionary_decompounder`>>
filter to test the quality of word lists before implementing them.

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.

[[analysis-hyp-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `hyphenation_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `hyphenation_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `hyphenation_decompounder` filter finds subwords based on hyphenation
patterns in the `analysis/hyphenation_patterns.xml` file. The filter then checks
these subwords against the list of words specified in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.

[source,console]
--------------------------------------------------
PUT hyphenation_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_hyphenation_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_hyphenation_decompound" ]
        }
      },
      "filter": {
        "22_char_hyphenation_decompound": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
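
As with the dictionary example, you could sanity-check the analyzer with the
analyze API. The request below is an illustrative sketch, not part of the
original example, and assumes both `analysis/example_word_list.txt` and
`analysis/hyphenation_patterns.xml` exist in the `config` directory of each
node:

[source,console]
--------------------------------------------------
GET hyphenation_decompound_example/_analyze
{
  "analyzer": "standard_hyphenation_decompound",
  "text": "Kaffeetasse"
}
--------------------------------------------------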

@@ -903,3 +903,9 @@ See <<monitor-elasticsearch-cluster>>.

[role="exclude",id="docker-cli-run"]

See <<docker-cli-run-dev-mode>>.

[role="exclude",id="analysis-compound-word-tokenfilter"]
=== Compound word token filters

See <<analysis-dict-decomp-tokenfilter>> and
<<analysis-hyp-decomp-tokenfilter>>.