LUCENE-5389: Add more guidance in the analyis documentation package overview (closes #14)

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1557010 13f79535-47bb-0310-9956-ffa450edef68
2014-01-10 02:10:45 +00:00 · 2014-01-10 02:10:45 +00:00 · ca6454bab4
parent 6b33a8c593
commit ca6454bab4
2 changed files with 70 additions and 1 deletions
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@ -167,6 +167,10 @@ Documentation
  to the analysis package overview.  
  (Benson Margulies via Robert Muir - pull request #12)
 * LUCENE-5389: Add more guidance in the analyis documentation 
  package overview.
  (Benson Margulies via Robert Muir - pull request #14)
 ======================= Lucene 4.6.0 =======================
 New Features
--- a/lucene/core/src/java/org/apache/lucene/analysis/package.html
+++ b/lucene/core/src/java/org/apache/lucene/analysis/package.html
@ -179,7 +179,7 @@ and proximity searches (though sentence identification is not provided by Lucene
 <p>
  However an application might invoke Analysis of any text for testing or for any other purpose, something like:
 </p>
-<PRE class="prettyprint">
+<PRE class="prettyprint" id="analysis-workflow">
    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
    Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
@ -476,6 +476,71 @@ and proximity searches (though sentence identification is not provided by Lucene
    </td>
  </tr>
 </table>
 <h3>More Requirements for Analysis Component Classes</h3>
 Due to the historical development of the API, there are some perhaps
 less than obvious requirements to implement analysis components
 classes.
 <h4 id="analysis-lifetime">Token Stream Lifetime</h4>
 The code fragment of the <a href="#analysis-workflow">analysis workflow
 protocol</a> above shows a token stream being obtained, used, and then
 left for garbage. However, that does not mean that the components of
 that token stream will, in fact, be discarded. The default is just the
 opposite. {@link org.apache.lucene.analysis.Analyzer} applies a reuse
 strategy to the tokenizer and the token filters. It will reuse
 them. For each new input, it calls {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)} 
 to set the input. Your components must be prepared for this scenario,
 as described below.
 <h4>Tokenizer</h4>
 <ul>
  <li>
  You should create your tokenizer class by extending {@link org.apache.lucene.analysis.Tokenizer}.
  </li>
  <li>
  Your tokenizer must <strong>never</strong> make direct use of the
  {@link java.io.Reader} supplied to its constructor(s). (A future
  release of Apache Lucene may remove the reader parameters from the
  Tokenizer constructors.)
  {@link org.apache.lucene.analysis.Tokenizer} wraps the reader in an
  object that helps enforce that applications comply with the <a
  href="#analysis-workflow">analysis workflow</a>. Thus, your class
  should only reference the input via the protected 'input' field
  of Tokenizer.
  </li>
  <li>
  Your tokenizer <strong>must</strong> override {@link org.apache.lucene.analysis.TokenStream#end()}.
  Your implementation <strong>must</strong> call
  <code>super.end()</code>. It must set a correct final offset into
  the offset attribute, and finish up and other attributes to reflect
  the end of the stream.
  </li>
  <li>
  If your tokenizer overrides {@link org.apache.lucene.analysis.TokenStream#reset()}
  or {@link org.apache.lucene.analysis.TokenStream#close()}, it
  <strong>must</strong> call the corresponding superclass method.
  </li>
 </ul>
 <h4>Token Filter</h4>
  You should create your token filter class by extending {@link org.apache.lucene.analysis.TokenFilter}.
  If your token filter overrides {@link org.apache.lucene.analysis.TokenStream#reset()},
  {@link org.apache.lucene.analysis.TokenStream#end()}
  or {@link org.apache.lucene.analysis.TokenStream#close()}, it
  <strong>must</strong> call the corresponding superclass method.
 <h4>Creating delegates</h4>
  Forwarding classes (those which extend {@link org.apache.lucene.analysis.Tokenizer} but delegate
  selected logic to another tokenizer) must also set the reader to the delegate in the overridden
  {@link org.apache.lucene.analysis.Tokenizer#reset()} method, e.g.:
  <pre class="prettyprint">
    public class ForwardingTokenizer extends Tokenizer {
       private Tokenizer delegate;
       ...
       {@literal @Override}
       public void reset() {
          super.reset();
          delegate.setReader(this.input);
          delegate.reset();
       }
    }
  </pre>
 <h3>Testing Your Analysis Component</h3>
 <p>
    The lucene-test-framework component defines