LUCENE-5389: Add more guidance in the analysis documentation package overview (closes #14)

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1557010 13f79535-47bb-0310-9956-ffa450edef68
Robert Muir 2014-01-10 02:10:45 +00:00
parent 6b33a8c593
commit ca6454bab4
2 changed files with 70 additions and 1 deletions


@@ -167,6 +167,10 @@ Documentation
to the analysis package overview.
(Benson Margulies via Robert Muir - pull request #12)
* LUCENE-5389: Add more guidance in the analysis documentation
package overview.
(Benson Margulies via Robert Muir - pull request #14)
======================= Lucene 4.6.0 =======================
New Features


@@ -179,7 +179,7 @@ and proximity searches (though sentence identification is not provided by Lucene
<p>
However an application might invoke Analysis of any text for testing or for any other purpose, something like:
</p>
<PRE class="prettyprint">
<PRE class="prettyprint" id="analysis-workflow">
Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
@@ -476,6 +476,71 @@ and proximity searches (though sentence identification is not provided by Lucene
</td>
</tr>
</table>
<h3>More Requirements for Analysis Component Classes</h3>
Due to the historical development of the API, there are some perhaps
less than obvious requirements for implementing analysis component
classes.
<h4 id="analysis-lifetime">Token Stream Lifetime</h4>
The code fragment of the <a href="#analysis-workflow">analysis workflow
protocol</a> above shows a token stream being obtained, used, and then
left for garbage. However, that does not mean that the components of
that token stream will, in fact, be discarded. The default is just the
opposite. {@link org.apache.lucene.analysis.Analyzer} applies a reuse
strategy to the tokenizer and the token filters. It will reuse
them. For each new input, it calls {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)}
to set the input. Your components must be prepared for this scenario,
as described below.
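<p>
For example, when the same analyzer processes two inputs one after the
other, the second call to <code>tokenStream()</code> will normally hand
back the same underlying components with only their reader replaced. The
following sketch (reusing <code>matchVersion</code> from the workflow
example above) illustrates the scenario your components must support:
</p>
<pre class="prettyprint">
Analyzer analyzer = new StandardAnalyzer(matchVersion);

// First input: the analyzer creates the tokenizer and token filters.
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("first text"));
try {
  ts.reset();
  while (ts.incrementToken()) {
    // use the token attributes
  }
  ts.end();
} finally {
  ts.close();
}

// Second input: rather than constructing new components, the analyzer
// calls setReader() on the same tokenizer; the stream is then consumed
// again with the usual reset()/incrementToken()/end()/close() sequence.
ts = analyzer.tokenStream("myfield", new StringReader("second text"));
try {
  ts.reset();
  while (ts.incrementToken()) {
    // use the token attributes
  }
  ts.end();
} finally {
  ts.close();
}
</pre>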
<h4>Tokenizer</h4>
<ul>
<li>
You should create your tokenizer class by extending {@link org.apache.lucene.analysis.Tokenizer}.
</li>
<li>
Your tokenizer must <strong>never</strong> make direct use of the
{@link java.io.Reader} supplied to its constructor(s). (A future
release of Apache Lucene may remove the reader parameters from the
Tokenizer constructors.)
{@link org.apache.lucene.analysis.Tokenizer} wraps the reader in an
object that helps enforce that applications comply with the <a
href="#analysis-workflow">analysis workflow</a>. Thus, your class
should only reference the input via the protected 'input' field
of Tokenizer.
</li>
<li>
Your tokenizer <strong>must</strong> override {@link org.apache.lucene.analysis.TokenStream#end()}.
Your implementation <strong>must</strong> call
<code>super.end()</code>. It must set a correct final offset into
the offset attribute, and finish up any other attributes to reflect
the end of the stream.
</li>
<li>
If your tokenizer overrides {@link org.apache.lucene.analysis.TokenStream#reset()}
or {@link org.apache.lucene.analysis.TokenStream#close()}, it
<strong>must</strong> call the corresponding superclass method.
</li>
</ul>
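<p>
Putting these requirements together, a minimal tokenizer that emits the
entire input as a single token might look like the following sketch (the
class name is illustrative only):
</p>
<pre class="prettyprint">
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class WholeInputTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean done = false;
  private int finalOffset = 0;

  public WholeInputTokenizer(Reader reader) {
    super(reader); // the superclass manages the reader via the 'input' field
  }

  {@literal @Override}
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    clearAttributes();
    // Read only through the protected 'input' field, never a saved Reader.
    char[] buffer = termAtt.buffer();
    int length = 0;
    int c;
    while ((c = input.read()) != -1) {
      if (length == buffer.length) {
        buffer = termAtt.resizeBuffer(1 + length);
      }
      buffer[length++] = (char) c;
    }
    termAtt.setLength(length);
    finalOffset = correctOffset(length);
    offsetAtt.setOffset(correctOffset(0), finalOffset);
    done = true;
    return length > 0;
  }

  {@literal @Override}
  public void end() throws IOException {
    super.end();                                   // required
    offsetAtt.setOffset(finalOffset, finalOffset); // required: correct final offset
  }

  {@literal @Override}
  public void reset() throws IOException {
    super.reset();                                 // required when overriding reset()
    done = false;
    finalOffset = 0;
  }
}
</pre>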
<h4>Token Filter</h4>
You should create your token filter class by extending {@link org.apache.lucene.analysis.TokenFilter}.
If your token filter overrides {@link org.apache.lucene.analysis.TokenStream#reset()},
{@link org.apache.lucene.analysis.TokenStream#end()}
or {@link org.apache.lucene.analysis.TokenStream#close()}, it
<strong>must</strong> call the corresponding superclass method.
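<p>
As an illustration, here is a sketch of a filter that drops zero-length
tokens produced by its input (the class name and the dropped-token counter
are illustrative only):
</p>
<pre class="prettyprint">
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DropEmptyTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private int dropped; // per-stream state, cleared in reset()

  public DropEmptyTokenFilter(TokenStream in) {
    super(in);
  }

  {@literal @Override}
  public boolean incrementToken() throws IOException {
    // Pull tokens from the wrapped stream via the protected 'input' field
    // and skip those whose term text is empty.
    while (input.incrementToken()) {
      if (termAtt.length() > 0) {
        return true;
      }
      dropped++;
    }
    return false;
  }

  {@literal @Override}
  public void reset() throws IOException {
    super.reset(); // required when overriding reset(): resets the wrapped stream
    dropped = 0;
  }
}
</pre>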
<h4>Creating delegates</h4>
Forwarding classes (those which extend {@link org.apache.lucene.analysis.Tokenizer} but delegate
selected logic to another tokenizer) must also set the reader to the delegate in the overridden
{@link org.apache.lucene.analysis.Tokenizer#reset()} method, e.g.:
<pre class="prettyprint">
public class ForwardingTokenizer extends Tokenizer {
   private Tokenizer delegate;
   ...
   {@literal @Override}
   public void reset() throws IOException {
      super.reset();
      delegate.setReader(this.input);
      delegate.reset();
   }
}
</pre>
<h3>Testing Your Analysis Component</h3>
<p>
The lucene-test-framework component defines