diff --git a/lucene/CHANGES.txt b/lucene/CHANGES.txt
index 9c412782dcc..bfbea0565bc 100644
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -167,6 +167,10 @@ Documentation
   to the analysis package overview.
   (Benson Margulies via Robert Muir - pull request #12)
 
+* LUCENE-5389: Add more guidance in the analysis documentation
+  package overview.
+  (Benson Margulies via Robert Muir - pull request #14)
+
 ======================= Lucene 4.6.0 =======================
 
 New Features
 
diff --git a/lucene/core/src/java/org/apache/lucene/analysis/package.html b/lucene/core/src/java/org/apache/lucene/analysis/package.html
index 5d5b65aa347..c76666d05f8 100644
--- a/lucene/core/src/java/org/apache/lucene/analysis/package.html
+++ b/lucene/core/src/java/org/apache/lucene/analysis/package.html
@@ -179,7 +179,7 @@ and proximity searches (though sentence identification is not provided by Lucene

However an application might invoke Analysis of any text for testing or for any other purpose, something like:

-
+
     Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
     Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
     TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
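As a reviewer's aside: the snippet above stops at obtaining the stream. A fuller sketch of the consumer workflow the surrounding docs describe (addAttribute, reset, incrementToken loop, end, close) might look like the following. This assumes Lucene 4.x on the classpath; the class name `AnalyzeText`, the helper `analyze`, and the field name `"myfield"` are illustrative, not part of the patch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzeText {

  /** Runs the full consumer workflow and collects the produced terms. */
  public static List<String> analyze(Analyzer analyzer, String field, String text)
      throws IOException {
    List<String> terms = new ArrayList<String>();
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    // Obtain the term attribute before consuming the stream.
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    try {
      ts.reset();                     // must be called before incrementToken()
      while (ts.incrementToken()) {
        terms.add(termAtt.toString());
      }
      ts.end();                       // perform end-of-stream operations
    } finally {
      ts.close();                     // release resources held by the stream
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    Version matchVersion = Version.LUCENE_46; // substitute the desired version
    Analyzer analyzer = new StandardAnalyzer(matchVersion);
    System.out.println(analyze(analyzer, "myfield", "some text goes here"));
  }
}
```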
@@ -476,6 +476,71 @@ and proximity searches (though sentence identification is not provided by Lucene
     
   
 
+
+<h3>More Requirements for Analysis Component Classes</h3>
+
+Due to the historical development of the API, there are some perhaps
+less than obvious requirements to implement analysis component
+classes.
+

+<h4>Token Stream Lifetime</h4>
+
+The code fragment of the analysis workflow
+protocol above shows a token stream being obtained, used, and then
+left for garbage. However, that does not mean that the components of
+that token stream will, in fact, be discarded. The default is just the
+opposite. {@link org.apache.lucene.analysis.Analyzer} applies a reuse
+strategy to the tokenizer and the token filters. It will reuse
+them. For each new input, it calls {@link org.apache.lucene.analysis.Tokenizer#setReader(java.io.Reader)}
+to set the input. Your components must be prepared for this scenario,
+as described below.
+
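Reviewer's note on the reuse strategy described above: it is driven by {@code Analyzer.createComponents}, which is called only when no reusable components exist; afterwards the same instances are fed each new input. A minimal sketch, assuming the Lucene 4.x API (the class name {@code MyAnalyzer} and the whitespace/lowercase chain are illustrative only):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
  private final Version matchVersion = Version.LUCENE_46; // substitute as desired

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Called only when no reusable components exist for this thread;
    // later inputs reuse these instances via setReader() and reset().
    Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    return new TokenStreamComponents(source, new LowerCaseFilter(matchVersion, source));
  }
}
```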

+<h4>Tokenizer</h4>
+
+

+<h4>Token Filter</h4>
+
+  You should create your token filter class by extending {@link org.apache.lucene.analysis.TokenFilter}.
+  If your token filter overrides {@link org.apache.lucene.analysis.TokenStream#reset()},
+  {@link org.apache.lucene.analysis.TokenStream#end()}
+  or {@link org.apache.lucene.analysis.TokenStream#close()}, it
+  must call the corresponding superclass method.
+
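To make the rule above concrete for reviewers: a hypothetical filter (not part of the patch; {@code TokenCountingFilter} is an invented name, assuming Lucene 4.x) that keeps per-stream state and must therefore both call {@code super.reset()} and clear that state, so the instance behaves correctly when the reuse strategy feeds it a new input:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class TokenCountingFilter extends TokenFilter {
  private int tokenCount; // per-stream state that must be cleared on reuse

  public TokenCountingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    tokenCount++;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();   // required: propagates reset() down the chain
    tokenCount = 0;  // then clear this filter's own state for the next input
  }
}
```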

+<h4>Creating delegates</h4>
+
+  Forwarding classes (those which extend {@link org.apache.lucene.analysis.Tokenizer} but delegate
+  selected logic to another tokenizer) must also set the reader to the delegate in the overridden
+  {@link org.apache.lucene.analysis.Tokenizer#reset()} method, e.g.:
+  <pre class="prettyprint">
+    public class ForwardingTokenizer extends Tokenizer {
+       private Tokenizer delegate;
+       ...
+       {@literal @Override}
+       public void reset() {
+          super.reset();
+          delegate.setReader(this.input);
+          delegate.reset();
+       }
+    }
+  </pre>
+

+<h4>Testing Your Analysis Component</h4>
+
+The lucene-test-framework component defines