LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter

2018-04-06 15:20:22 -04:00 · 2018-04-06 15:20:22 -04:00 · 0f53adbee4
parent 5c37b07a3d
commit 0f53adbee4
3 changed files with 28 additions and 12 deletions
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -153,6 +153,11 @@ Build

 * LUCENE-8230: Upgrade forbiddenapis to version 2.5.  (Uwe Schindler)

+Documentation
+
+* LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
+  (Mike Sokolov via Mike McCandless)
+
 ======================= Lucene 7.3.0 =======================

 API Changes
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
@@ -55,11 +55,14 @@ import org.apache.lucene.util.InPlaceMergeSorter;
 * </li>
 * </ul>
 * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
 * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
 * non-numeric subwords are catenated and produced at the same position of the
 * last subword in the run:
 * <ul>
@@ -72,12 +75,15 @@ import org.apache.lucene.util.InPlaceMergeSorter;
 * </li>
 * </ul>
 * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
 * </ul>
+ *
 * One use for {@link WordDelimiterFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and
+ * not in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
@@ -62,11 +62,14 @@ import org.apache.lucene.util.RamUsageEstimator;
 * </li>
 * </ul>
 * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
 * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
 * non-numeric subwords are catenated and produced at the same position of the
 * last subword in the run:
 * <ul>
@@ -79,12 +82,14 @@ import org.apache.lucene.util.RamUsageEstimator;
 * </li>
 * </ul>
 * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
 * </ul>
 * One use for {@link WordDelimiterGraphFilter} is to help match words with different
 * subword delimiters. For example, if the source text contained "wi-fi" one may
 * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and not
+ * in the analyzer used for querying. Given that
 * the current {@link StandardTokenizer} immediately removes many intra-word
 * delimiters, it is recommended that this filter be used after a tokenizer that
 * does not do this (such as {@link WhitespaceTokenizer}).
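
Note on the javadoc above: the index-time versus query-time advice for "wi-fi" can be illustrated with a small plain-Java model. This is only a sketch of the behavior the javadoc describes, not Lucene's implementation. The class CatenateSketch and its methods split and analyze are hypothetical names, and the model assumes subwords are split only at delimiters and lower-to-upper case changes, with the catenated token emitted at the position of the last subword. Digit handling (CATENATE_NUMBERS, CATENATE_ALL) is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the CATENATE_WORDS behavior described in
// the javadoc. It does not use or reproduce Lucene's actual filter code.
final class CatenateSketch {

    // Split a token at non-alphanumeric delimiters and at lower-to-upper
    // case changes, e.g. "PowerShot" -> ["Power", "Shot"].
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(current, parts);      // a delimiter ends the current part
                continue;
            }
            if (current.length() > 0
                    && Character.isLowerCase(current.charAt(current.length() - 1))
                    && Character.isUpperCase(c)) {
                flush(current, parts);      // a case change ends the current part
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Emit "position:term" entries. With catenateWords set, the joined run is
    // added at the position of the last subword, as the javadoc describes.
    static List<String> analyze(String token, boolean catenateWords) {
        List<String> parts = split(token);
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < parts.size(); pos++) {
            out.add(pos + ":" + parts.get(pos));
        }
        if (catenateWords && parts.size() > 1) {
            out.add((parts.size() - 1) + ":" + String.join("", parts));
        }
        return out;
    }

    public static void main(String[] args) {
        // Index side, catenation on: the document "wi-fi" also yields "wifi".
        System.out.println(analyze("wi-fi", true));   // [0:wi, 1:fi, 1:wifi]
        // Query side, catenation off: "wifi" stays whole, "WiFi" is split.
        System.out.println(analyze("wifi", false));   // [0:wifi]
        System.out.println(analyze("WiFi", false));   // [0:Wi, 1:Fi]
    }
}
```

In this model the indexed terms for "wi-fi" include both the parts and "wifi", so an unsplit query for "wifi" and a split query for "wi-fi" or "wi+fi" each find a matching term. Matching "WiFi" against "wi" and "fi" would additionally need case folding, which the sketch leaves out.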