LUCENE-1760: javadoc improvements for TokenStream

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@807645 13f79535-47bb-0310-9956-ffa450edef68
2009-08-25 14:16:00 +00:00
parent 1a23145d53
commit 3519f543e7
1 changed file with 148 additions and 122 deletions
--- a/src/java/org/apache/lucene/analysis/TokenStream.java
+++ b/src/java/org/apache/lucene/analysis/TokenStream.java
@@ -26,48 +26,62 @@ import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
 import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.util.Attribute;
 import org.apache.lucene.util.AttributeImpl;
 import org.apache.lucene.util.AttributeSource;
-/** A TokenStream enumerates the sequence of tokens, either from
+/**
-  fields of a document or from query text.
+ * A {@link TokenStream} enumerates the sequence of tokens, either from
-  <p>
+ * {@link Field}s of a {@link Document} or from query text.
-  This is an abstract class.  Concrete subclasses are:
+ * <p>
-  <ul>
+ * This is an abstract class. Concrete subclasses are:
-  <li>{@link Tokenizer}, a TokenStream
+ * <ul>
-  whose input is a Reader; and
+ * <li>{@link Tokenizer}, a {@link TokenStream} whose input is a Reader; and
-  <li>{@link TokenFilter}, a TokenStream
+ * <li>{@link TokenFilter}, a {@link TokenStream} whose input is another
-  whose input is another TokenStream.
+ * {@link TokenStream}.
-  </ul>
+ * </ul>
-  A new TokenStream API is introduced with Lucene 2.9. While Token still 
+ * A new {@link TokenStream} API has been introduced with Lucene 2.9. This API
-  exists in 2.9 as a convenience class, the preferred way to store
+ * has moved from being {@link Token} based to {@link Attribute} based. While
-  the information of a token is to use {@link AttributeImpl}s.
+ * {@link Token} still exists in 2.9 as a convenience class, the preferred way
-  <p>
+ * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
-  For that reason TokenStream extends {@link AttributeSource}
+ * <p>
-  now. Note that only one instance per {@link AttributeImpl} is
+ * {@link TokenStream} now extends {@link AttributeSource}, which provides
-  created and reused for every token. This approach reduces
+ * access to all of the token {@link Attribute}s for the {@link TokenStream}.
-  object creations and allows local caching of references to
+ * Note that only one instance per {@link AttributeImpl} is created and reused
-  the {@link AttributeImpl}s. See {@link #incrementToken()} for further details.
+ * for every token. This approach reduces object creation and allows local
-  <p>
+ * caching of references to the {@link AttributeImpl}s. See
-  <b>The workflow of the new TokenStream API is as follows:</b>
+ * {@link #incrementToken()} for further details.
-  <ol>
+ * <p>
-    <li>Instantiation of TokenStream/TokenFilters which add/get attributes
+ * <b>The workflow of the new {@link TokenStream} API is as follows:</b>
-        to/from the {@link AttributeSource}. 
+ * <ol>
-    <li>The consumer calls {@link TokenStream#reset()}.
+ * <li>Instantiation of {@link TokenStream}/{@link TokenFilter}s which add/get
-    <li>the consumer retrieves attributes from the
+ * attributes to/from the {@link AttributeSource}.
-        stream and stores local references to all attributes it wants to access
+ * <li>The consumer calls {@link TokenStream#reset()}.
-    <li>The consumer calls {@link #incrementToken()} until it returns false and
+ * <li>the consumer retrieves attributes from the stream and stores local
-        consumes the attributes after each call.    
+ * references to all attributes it wants to access
-  </ol>
+ * <li>The consumer calls {@link #incrementToken()} until it returns false and
-  To make sure that filters and consumers know which attributes are available
+ * consumes the attributes after each call.
-  the attributes must be added during instantiation. Filters and 
+ * <li>The consumer calls {@link #end()} so that any end-of-stream operations
-  consumers are not required to check for availability of attributes in {@link #incrementToken()}.
+ * can be performed.
-  <p>
+ * <li>The consumer calls {@link #close()} to release any resource when finished
-  Sometimes it is desirable to capture a current state of a
+ * using the {@link TokenStream}
-  TokenStream, e. g. for buffering purposes (see {@link CachingTokenFilter},
+ * </ol>
-  {@link TeeSinkTokenFilter}). For this usecase
+ * To make sure that filters and consumers know which attributes are available,
-  {@link AttributeSource#captureState} and {@link AttributeSource#restoreState} can be used.  
+ * the attributes must be added during instantiation. Filters and consumers are
 * not required to check for availability of attributes in
 * {@link #incrementToken()}.
 * <p>
 * You can find some example code for the new API in the analysis package level
 * Javadoc.
 * <p>
 * Sometimes it is desirable to capture a current state of a {@link TokenStream}
 * , e. g. for buffering purposes (see {@link CachingTokenFilter},
 * {@link TeeSinkTokenFilter}). For this usecase
 * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
 * can be used.
 */
 public abstract class TokenStream extends AttributeSource {
@@ -228,54 +242,67 @@ public abstract class TokenStream extends AttributeSource {
  }
  /**
-   * For extra performance you can globally enable the new {@link #incrementToken}
+   * For extra performance you can globally enable the new
-   * API using {@link Attribute}s. There will be a small, but in most cases neglectible performance 
+   * {@link #incrementToken} API using {@link Attribute}s. There will be a
-   * increase by enabling this, but it only works if <b>all</b> TokenStreams and -Filters
+   * small, but in most cases negligible performance increase by enabling this,
-   * use the new API and implement {@link #incrementToken}. This setting can only be enabled
+   * but it only works if <b>all</b> {@link TokenStream}s use the new API and
   * implement {@link #incrementToken}. This setting can only be enabled
   * globally.
-   * <P>This setting only affects TokenStreams instantiated after this call. All TokenStreams
+   * <P>
-   * already created use the other setting.
+   * This setting only affects {@link TokenStream}s instantiated after this
-   * <P>All core analyzers are compatible with this setting, if you have own
+   * call. All {@link TokenStream}s already created use the other setting.
-   * TokenStreams/-Filters, that are also compatible, enable this.
+   * <P>
-   * <P>When enabled, tokenization may throw {@link UnsupportedOperationException}s,
+   * All core {@link Analyzer}s are compatible with this setting, if you have
-   * if the whole tokenizer chain is not compatible.
+   * your own {@link TokenStream}s that are also compatible, you should enable
-   * <P>The default is <code>false</code>, so there is the fallback to the old API available.
+   * this.
-   * @deprecated This setting will be <code>true</code> per default in Lucene 3.0,
+   * <P>
-   * when {@link #incrementToken} is abstract and must be always implemented.
+   * When enabled, tokenization may throw {@link UnsupportedOperationException}
   * s, if the whole tokenizer chain is not compatible eg one of the
   * {@link TokenStream}s does not implement the new {@link TokenStream} API.
   * <P>
   * The default is <code>false</code>, so there is the fallback to the old API
   * available.
   * 
   * @deprecated This setting will no longer be needed in Lucene 3.0 as the old
   *             API will be removed.
   */
  public static void setOnlyUseNewAPI(boolean onlyUseNewAPI) {
    TokenStream.onlyUseNewAPI = onlyUseNewAPI;
  }
-  /** Returns if only the new API is used.
+  /**
   * Returns if only the new API is used.
   * 
   * @see #setOnlyUseNewAPI
-   * @deprecated This setting will be <code>true</code> per default in Lucene 3.0,
+   * @deprecated This setting will no longer be needed in Lucene 3.0 as
-   * when {@link #incrementToken} is abstract and must be always implemented.
+   *             the old API will be removed.
   */
  public static boolean getOnlyUseNewAPI() {
    return onlyUseNewAPI;
  }
  /**
-   * Consumers (eg the indexer) use this method to advance the stream 
+   * Consumers (ie {@link IndexWriter}) use this method to advance the stream to
-   * to the next token. Implementing classes must implement this method 
+   * the next token. Implementing classes must implement this method and update
-   * and update the appropriate {@link AttributeImpl}s with content of the 
+   * the appropriate {@link AttributeImpl}s with the attributes of the next
-   * next token.
+   * token.
   * <p>
   * This method is called for every token of a document, so an efficient
   * implementation is crucial for good performance. To avoid calls to
-   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)} and
+   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)} or downcasts,
-   * downcasts, references to all {@link AttributeImpl}s that this stream uses 
+   * references to all {@link AttributeImpl}s that this stream uses should be
-   * should be retrieved during instantiation.   
+   * retrieved during instantiation.
   * <p>
-   * To make sure that filters and consumers know which attributes are available
+   * To ensure that filters and consumers know which attributes are available,
-   * the attributes must be added during instantiation. Filters and 
+   * the attributes must be added during instantiation. Filters and consumers
-   * consumers are not required to check for availability of attributes in {@link #incrementToken()}.
+   * are not required to check for availability of attributes in
   * {@link #incrementToken()}.
   * 
   * @return false for end of stream; true otherwise
   * 
   *         <p>
-   * <b>Note that this method will be defined abstract in Lucene 3.0.</b>
+   *         <b>Note that this method will be defined abstract in Lucene
   *         3.0.</b>
   */
  public boolean incrementToken() throws IOException {
    assert !onlyUseNewAPI && tokenWrapper != null;
@@ -293,14 +320,15 @@ public abstract class TokenStream extends AttributeSource {
  }
  /**
-   * This method is called by the consumer after the last token has been consumed, 
+   * This method is called by the consumer after the last token has been
-   * ie after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
+   * consumed, eg after {@link #incrementToken()} returned <code>false</code>
-   * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
+   * (using the new {@link TokenStream} API) or after {@link #next(Token)} or
+   * {@link #next()} returned <code>null</code> (old {@link TokenStream} API).
   * <p/>
-   * This method can be used to perform any end-of-stream operations, such as setting the final
+   * This method can be used to perform any end-of-stream operations, such as
-   * offset of a stream. The final offset of a stream might differ from the offset of the last token
+   * setting the final offset of a stream. The final offset of a stream might
-   * eg in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
+   * differ from the offset of the last token eg in case one or more whitespaces
-   * was used.
+   * followed after the last token, but a {@link WhitespaceTokenizer} was used.
   * 
   * @throws IOException
   */
@@ -308,34 +336,33 @@ public abstract class TokenStream extends AttributeSource {
    // do nothing by default
  }
-  /** Returns the next token in the stream, or null at EOS.
+  /**
-   *  When possible, the input Token should be used as the
+   * Returns the next token in the stream, or null at EOS. When possible, the
-   *  returned Token (this gives fastest tokenization
+   * input Token should be used as the returned Token (this gives fastest
-   *  performance), but this is not required and a new Token
+   * tokenization performance), but this is not required and a new Token may be
-   *  may be returned. Callers may re-use a single Token
+   * returned. Callers may re-use a single Token instance for successive calls
-   *  instance for successive calls to this method.
+   * to this method.
   * <p>
-   *  This implicitly defines a "contract" between 
+   * This implicitly defines a "contract" between consumers (callers of this
-   *  consumers (callers of this method) and 
+   * method) and producers (implementations of this method that are the source
-   *  producers (implementations of this method 
+   * for tokens):
-   *  that are the source for tokens):
   * <ul>
-   *   <li>A consumer must fully consume the previously 
+   * <li>A consumer must fully consume the previously returned {@link Token}
-   *       returned Token before calling this method again.</li>
+   * before calling this method again.</li>
-   *   <li>A producer must call {@link Token#clear()}
+   * <li>A producer must call {@link Token#clear()} before setting the fields in
-   *       before setting the fields in it & returning it</li>
+   * it and returning it</li>
   * </ul>
-   *  Also, the producer must make no assumptions about a
+   * Also, the producer must make no assumptions about a {@link Token} after it
-   *  Token after it has been returned: the caller may
+   * has been returned: the caller may arbitrarily change it. If the producer
-   *  arbitrarily change it.  If the producer needs to hold
+   * needs to hold onto the {@link Token} for subsequent calls, it must clone()
-   *  onto the token for subsequent calls, it must clone()
+   * it before storing it. Note that a {@link TokenFilter} is considered a
-   *  it before storing it.
+   * consumer.
-   *  Note that a {@link TokenFilter} is considered a consumer.
+   * 
-   *  @param reusableToken a Token that may or may not be used to
+   * @param reusableToken a {@link Token} that may or may not be used to return;
-   *  return; this parameter should never be null (the callee
+   *        this parameter should never be null (the callee is not required to
-   *  is not required to check for null before using it, but it is a
+   *        check for null before using it, but it is a good idea to assert that
-   *  good idea to assert that it is not null.)
+   *        it is not null.)
-   *  @return next token in the stream or null if end-of-stream was hit
+   * @return next {@link Token} in the stream or null if end-of-stream was hit
   * @deprecated The new {@link #incrementToken()} and {@link AttributeSource}
   *             APIs should be used instead.
   */
@@ -357,12 +384,13 @@ public abstract class TokenStream extends AttributeSource {
    }
  }
-  /** Returns the next token in the stream, or null at EOS.
+  /**
-   * @deprecated The returned Token is a "full private copy" (not
+   * Returns the next {@link Token} in the stream, or null at EOS.
-   * re-used across calls to next()) but will be slower
+   * 
-   * than calling {@link #next(Token)} or using the new
+   * @deprecated The returned Token is a "full private copy" (not re-used across
-   * {@link #incrementToken()} method with the new
+   *             calls to {@link #next()}) but will be slower than calling
-   * {@link AttributeSource} API.
+   *             {@link #next(Token)} or using the new {@link #incrementToken()}
+   *             method with the new {@link AttributeSource} API.
   */
  public Token next() throws IOException {
    if (onlyUseNewAPI)
@@ -379,17 +407,15 @@ public abstract class TokenStream extends AttributeSource {
    }
  }
-  /** Resets this stream to the beginning. This is an
+  /**
-   *  optional operation, so subclasses may or may not
+   * Resets this stream to the beginning. This is an optional operation, so
-   *  implement this method. Reset() is not needed for
+   * subclasses may or may not implement this method. {@link #reset()} is not needed for
-   *  the standard indexing process. However, if the Tokens 
+   * the standard indexing process. However, if the tokens of a
-   *  of a TokenStream are intended to be consumed more than 
+   * {@link TokenStream} are intended to be consumed more than once, it is
-   *  once, it is necessary to implement reset().  Note that
+   * necessary to implement {@link #reset()}. Note that if your TokenStream
-   *  if your TokenStream caches tokens and feeds them back
+   * caches tokens and feeds them back again after a reset, it is imperative
-   *  again after a reset, it is imperative that you
+   * that you clone the tokens when you store them away (on the first pass) as
-   *  clone the tokens when you store them away (on the
+   * well as when you return them (on future passes after {@link #reset()}).
-   *  first pass) as well as when you return them (on future
-   *  passes after reset()).
   */
  public void reset() throws IOException {}
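
The consumer workflow listed in the class Javadoc of this commit (reset the stream, cache attribute references once, loop on `incrementToken()` until it returns false, then call `end()` and `close()`) can be sketched without the Lucene jar. The following is only a minimal illustration under stated assumptions: `SketchTermAttribute`, `SketchTokenStream`, and `WorkflowSketch` are hypothetical stand-ins invented for this example, not Lucene classes, and the whitespace splitting is an arbitrary choice to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an attribute: one instance is reused for every token.
final class SketchTermAttribute {
    private String term = "";
    void setTerm(String term) { this.term = term; }
    String term() { return term; }
}

// Hypothetical stand-in for a token stream that splits its input on whitespace.
final class SketchTokenStream {
    private final String[] parts;
    // The attribute is created once, at instantiation, as the Javadoc requires.
    private final SketchTermAttribute termAtt = new SketchTermAttribute();
    private int pos = 0;
    boolean ended = false;
    boolean closed = false;

    SketchTokenStream(String text) {
        String trimmed = text.trim();
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    SketchTermAttribute termAttribute() { return termAtt; }

    void reset() { pos = 0; ended = false; }

    // Overwrites the shared attribute with the next token; false signals end of stream.
    boolean incrementToken() {
        if (pos >= parts.length) return false;
        termAtt.setTerm(parts[pos++]);
        return true;
    }

    void end() { ended = true; }     // end-of-stream bookkeeping would go here
    void close() { closed = true; }  // resource release would go here
}

public class WorkflowSketch {
    // Follows the documented consumer steps in order.
    static List<String> consume(SketchTokenStream stream) {
        List<String> terms = new ArrayList<>();
        stream.reset();                                        // consumer calls reset()
        SketchTermAttribute termAtt = stream.termAttribute();  // cache the reference once
        try {
            while (stream.incrementToken()) {                  // advance until false
                terms.add(termAtt.term());                     // copy out the current value
            }
            stream.end();                                      // end-of-stream operations
        } finally {
            stream.close();                                    // always release resources
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchTokenStream("a quick  brown fox")));
    }
}
```

Because the single attribute instance is overwritten on every call to `incrementToken()`, the consumer copies the term out inside the loop rather than keeping the attribute object per token; this mirrors the reuse contract the Javadoc describes.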