Update compound-word-tokenfilter.asciidoc

Improved the docs for the compound word token filter.

Closes #13670
Closes #13595
This commit is contained in:
Clinton Gormley 2015-09-21 11:18:18 +02:00
parent 3f94b5a395
commit 1f76f49003
1 changed file with 71 additions and 21 deletions


@@ -1,34 +1,83 @@
[[analysis-compound-word-tokenfilter]]
=== Compound Word Token Filter
The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
decompose compound words found in many Germanic languages into word parts.

Both token filters require a dictionary of word parts, which can be provided
as:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`word_list` |A list of words to use.
[horizontal]
`word_list`::
|`word_list_path` |A path (either relative to `config` location, or
absolute) to a list of words.
An array of words, specified inline in the token filter configuration, or
|`hyphenation_patterns_path` |A path (either relative to `config` location, or
absolute) to a FOP XML hyphenation pattern file. (See http://offo.sourceforge.net/hyphenation/)
Required for `hyphenation_decompounder`.
`word_list_path`::
|`min_word_size` |Minimum word size(Integer). Defaults to 5.
The path (either absolute or relative to the `config` directory) to a UTF-8
encoded file containing one word per line.
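
For example, the same word-part dictionary could be provided either inline or
from a file. The sketch below is illustrative only; the filter names and the
`analysis/words.txt` path are placeholders:

--------------------------------------------------
index :
    analysis :
        filter :
            inline_decompounder :
                type : dictionary_decompounder
                word_list : [Donau, Dampf, Schiff]
            file_decompounder :
                type : dictionary_decompounder
                word_list_path : analysis/words.txt
--------------------------------------------------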

[float]
=== Hyphenation decompounder

The `hyphenation_decompounder` uses hyphenation grammars to find potential
subwords that are then checked against the word dictionary. The quality of the
output tokens is directly connected to the quality of the grammar file you
use. For languages like German they are quite good.

XML-based hyphenation grammar files can be found in the
http://offo.sourceforge.net/hyphenation/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
(OFFO) Sourceforge project. You can download http://downloads.sourceforge.net/offo/offo-hyphenation.zip[offo-hyphenation.zip]
directly and look in the `offo-hyphenation/hyph/` directory.
Credits for the hyphenation code go to the Apache FOP project.
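
As an illustrative sketch (the filter name and both file paths are
placeholders, and `de_DR.xml` stands in for whichever grammar file you
extracted from the zip), a German hyphenation decompounder could be
configured as:

--------------------------------------------------
index :
    analysis :
        filter :
            german_decompounder :
                type : hyphenation_decompounder
                word_list_path : analysis/german-words.txt
                hyphenation_patterns_path : analysis/de_DR.xml
--------------------------------------------------
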
[float]
=== Dictionary decompounder
The `dictionary_decompounder` uses a brute force approach in conjunction with
only the word dictionary to find subwords in a compound word. It is much
slower than the hyphenation decompounder, but can be used as a starting point
to check the quality of your dictionary.
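
For example, given a word dictionary containing only `donau`, `dampf` and
`schiff`, a filter along the lines of the sketch below would typically leave a
token such as `donaudampfschiff` in place and add the subword tokens `donau`,
`dampf` and `schiff` alongside it, assuming the compound has already been
lowercased by an earlier filter (the filter name and word list are illustrative
only):

--------------------------------------------------
index :
    analysis :
        filter :
            brute_force_decompounder :
                type : dictionary_decompounder
                word_list : [donau, dampf, schiff]
--------------------------------------------------
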
[float]
=== Compound token filter parameters
The following parameters can be used to configure a compound word token
filter:
[horizontal]
`type`::
Either `dictionary_decompounder` or `hyphenation_decompounder`.
`word_list`::
An array containing a list of words to use for the word dictionary.
`word_list_path`::
The path (either absolute or relative to the `config` directory) to the word dictionary.
`hyphenation_patterns_path`::
The path (either absolute or relative to the `config` directory) to a FOP XML hyphenation pattern file. (Required for the `hyphenation_decompounder`.)
`min_word_size`::
Minimum word size. Defaults to 5.
`min_subword_size`::
Minimum subword size. Defaults to 2.
`max_subword_size`::
Maximum subword size. Defaults to 15.
`only_longest_match`::
Whether to include only the longest matching subword or not. Defaults to `false`.

Here is an example:
@@ -48,5 +97,6 @@ index :
myTokenFilter2 :
    type : hyphenation_decompounder
    word_list_path: path/to/words.txt
    hyphenation_patterns_path: path/to/fop.xml
    max_subword_size : 22
--------------------------------------------------
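
The remaining parameters can be set in the same way. Purely as an illustration
(the filter name and the values below are arbitrary, not recommendations):

--------------------------------------------------
index :
    analysis :
        filter :
            myTokenFilter3 :
                type : dictionary_decompounder
                word_list_path : path/to/words.txt
                min_word_size : 5
                min_subword_size : 4
                max_subword_size : 15
                only_longest_match : true
--------------------------------------------------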