LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter

Mike McCandless 2018-04-06 15:20:22 -04:00
parent 5c37b07a3d
commit 0f53adbee4
3 changed files with 28 additions and 12 deletions


@@ -153,6 +153,11 @@ Build
 * LUCENE-8230: Upgrade forbiddenapis to version 2.5. (Uwe Schindler)
+Documentation
+* LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
+  (Mike Sokolov via Mike McCandless)
 ======================= Lucene 7.3.0 =======================
 API Changes


@@ -55,11 +55,14 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  *
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -72,12 +75,15 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
+ *
  * One use for {@link WordDelimiterFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and
+ * not in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).


@@ -62,11 +62,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  *
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -79,12 +82,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
  * One use for {@link WordDelimiterGraphFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and not
+ * in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).
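
Note on the javadoc above: the splitting and CATENATE_WORDS rules it describes can be illustrated with a small stand-alone toy model. This is only a sketch and not Lucene code. The class name WordDelimiterSketch, the methods analyze, terms, and split, and the flag value are hypothetical. It assumes that word and number parts are always generated, that a part boundary falls at every non-alphanumeric character, lower-to-upper case change, and letter/digit change, and that a query "matches" when all of its terms appear among the indexed terms (positions are ignored).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the rules described in the javadoc; NOT Lucene's implementation. */
final class WordDelimiterSketch {
    // Hypothetical flag value for this sketch; only the name mirrors the javadoc.
    static final int CATENATE_WORDS = 1;

    /** Returns the emitted tokens rendered as "position:term". */
    static List<String> analyze(String token, int flags) {
        List<String> parts = split(token);
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder(); // current run of non-numeric parts
        int runLength = 0;
        for (int pos = 0; pos < parts.size(); pos++) {
            String part = parts.get(pos);
            out.add(pos + ":" + part);
            boolean isWord = Character.isLetter(part.charAt(0));
            if (isWord) {
                run.append(part);
                runLength++;
            }
            boolean runEnds = isWord
                && (pos + 1 == parts.size() || !Character.isLetter(parts.get(pos + 1).charAt(0)));
            if (runEnds) {
                // The catenated run is emitted at the position of its last subword.
                if ((flags & CATENATE_WORDS) != 0 && runLength > 1) {
                    out.add(pos + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
        }
        return out;
    }

    /** Same as analyze, but with the positions stripped. */
    static Set<String> terms(String token, int flags) {
        Set<String> result = new LinkedHashSet<>();
        for (String t : analyze(token, flags)) {
            result.add(t.substring(t.indexOf(':') + 1));
        }
        return result;
    }

    /** Splits on delimiters, lower-to-upper case changes, and letter/digit changes. */
    private static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : token.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = alnum && Character.isLetter(prev) != Character.isLetter(c);
                if (!alnum || caseChange || typeChange) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
            }
            if (alnum) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(analyze("PowerShot", 0));              // [0:Power, 1:Shot]
        System.out.println(analyze("PowerShot", CATENATE_WORDS)); // [0:Power, 1:Shot, 1:PowerShot]
        // Catenate at index time only, as the javadoc suggests.
        Set<String> indexed = terms("wi-fi", CATENATE_WORDS);
        for (String query : new String[] {"wifi", "wi-fi", "wi+fi"}) {
            System.out.println(query + " matches: " + indexed.containsAll(terms(query, 0)));
        }
    }
}
```

In this model the query "WiFi" splits into "Wi" and "Fi", which would only match the indexed "wi" and "fi" if both sides were also lowercased; the sketch leaves that step out.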