mirror of https://github.com/apache/lucene.git
LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter
parent 5c37b07a3d
commit 0f53adbee4
@@ -153,6 +153,11 @@ Build
 
 * LUCENE-8230: Upgrade forbiddenapis to version 2.5. (Uwe Schindler)
 
+Documentation
+
+* LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
+xo (Mike Sokolov via Mike McCandless)
+
 ======================= Lucene 7.3.0 =======================
 
 API Changes
@@ -55,11 +55,14 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  *
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * → <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> → <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -72,12 +75,15 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
+ *
  * One use for {@link WordDelimiterFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and
+ * not in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).
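The positions described in the javadoc above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lucene's implementation: the class `WordDelimiterSketch` and its methods `split` and `analyze` are hypothetical names, the splitting rules are reduced to delimiters, lower-to-upper case changes, and letter/digit boundaries, and features such as possessive stripping or protected words are left out. The `catenateWords` flag stands in for the CATENATE_WORDS option.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of subword splitting plus CATENATE_WORDS; hypothetical, not Lucene code. */
class WordDelimiterSketch {

    /** Splits on non-alphanumerics, lower-to-upper case changes, and letter/digit boundaries. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(current, parts); // delimiter ends the current part and is dropped
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isDigit(prev) != Character.isDigit(c);
                if (caseChange || typeChange) {
                    flush(current, parts);
                }
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    /** Returns "position:term" strings; catenated runs share the position of their last subword. */
    static List<String> analyze(String token, boolean catenateWords) {
        List<String> parts = split(token);
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < parts.size(); pos++) {
            String part = parts.get(pos);
            out.add(pos + ":" + part);
            boolean isWord = Character.isLetter(part.charAt(0));
            if (isWord) {
                run.append(part);
                runLength++;
            }
            boolean nextIsWord = pos + 1 < parts.size()
                    && Character.isLetter(parts.get(pos + 1).charAt(0));
            if (!isWord || !nextIsWord) {
                // A maximal run of non-numeric subwords ends here.
                if (catenateWords && runLength > 1) {
                    out.add(pos + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("PowerShot", false));
        System.out.println(analyze("PowerShot", true));
        System.out.println(analyze("Super-Duper-XL500-42-AutoCoder!", true));
    }
}
```

In this model, `analyze("PowerShot", false)` yields `[0:Power, 1:Shot]`, and enabling the flag adds `1:PowerShot` at the position of the last subword in the run, which mirrors the behavior the javadoc describes for CATENATE_WORDS.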
@@ -62,11 +62,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  *
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * → <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> → <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -79,12 +82,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
  * One use for {@link WordDelimiterGraphFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and not
+ * in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).
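The "wi-fi" use case in the javadoc above relies on an asymmetry: catenated forms are produced at index time only. A rough way to see why that makes all four query spellings match is the toy model below. It is a sketch with several assumptions, not Lucene behavior: `IndexQueryAsymmetrySketch`, `queryTerms`, `indexTerms`, and `matches` are hypothetical names, terms are lowercased as if a lowercasing filter followed in the analysis chain, and matching is reduced to "every query term is among the indexed terms", ignoring token positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of index-time catenation vs. query-time splitting; hypothetical, not Lucene code. */
class IndexQueryAsymmetrySketch {

    /** Query-time analysis: split on delimiters and lower-to-upper case changes, then lowercase. */
    static List<String> queryTerms(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : token.toCharArray()) {
            boolean delimiter = !Character.isLetterOrDigit(c);
            boolean caseChange = current.length() > 0
                    && Character.isLowerCase(current.charAt(current.length() - 1))
                    && Character.isUpperCase(c);
            if ((delimiter || caseChange) && current.length() > 0) {
                parts.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
            if (!delimiter) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return parts;
    }

    /** Index-time analysis: the same subwords plus the catenated form (a CATENATE-style option). */
    static List<String> indexTerms(String token) {
        List<String> terms = queryTerms(token);
        if (terms.size() > 1) {
            terms.add(String.join("", terms));
        }
        return terms;
    }

    /** Simplified matching: every query term must appear among the indexed terms. */
    static boolean matches(String indexedToken, String query) {
        return indexTerms(indexedToken).containsAll(queryTerms(query));
    }

    public static void main(String[] args) {
        for (String query : new String[] {"wifi", "WiFi", "wi-fi", "wi+fi", "wireless"}) {
            System.out.println(query + " -> " + matches("wi-fi", query));
        }
    }
}
```

Under these assumptions, indexing "wi-fi" produces the terms `wi`, `fi`, and `wifi`, so the queries "wifi", "WiFi", "wi-fi", and "wi+fi" all match while "wireless" does not. Catenating at query time as well is unnecessary in this model, because the catenated term is already in the index.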