LUCENE-1760: javadoc improvements for TokenStream

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@807645 13f79535-47bb-0310-9956-ffa450edef68
Mark Robert Miller 2009-08-25 14:16:00 +00:00
parent 1a23145d53
commit 3519f543e7
1 changed file with 148 additions and 122 deletions

@ -26,48 +26,62 @@ import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute; import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute; import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Attribute; import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl; import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource; import org.apache.lucene.util.AttributeSource;
/** A TokenStream enumerates the sequence of tokens, either from /**
fields of a document or from query text. * A {@link TokenStream} enumerates the sequence of tokens, either from
<p> * {@link Field}s of a {@link Document} or from query text.
This is an abstract class. Concrete subclasses are: * <p>
<ul> * This is an abstract class. Concrete subclasses are:
<li>{@link Tokenizer}, a TokenStream * <ul>
whose input is a Reader; and * <li>{@link Tokenizer}, a {@link TokenStream} whose input is a Reader; and
<li>{@link TokenFilter}, a TokenStream * <li>{@link TokenFilter}, a {@link TokenStream} whose input is another
whose input is another TokenStream. * {@link TokenStream}.
</ul> * </ul>
A new TokenStream API is introduced with Lucene 2.9. While Token still * A new {@link TokenStream} API has been introduced with Lucene 2.9. This API
exists in 2.9 as a convenience class, the preferred way to store * has moved from being {@link Token} based to {@link Attribute} based. While
the information of a token is to use {@link AttributeImpl}s. * {@link Token} still exists in 2.9 as a convenience class, the preferred way
<p> * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
For that reason TokenStream extends {@link AttributeSource} * <p>
now. Note that only one instance per {@link AttributeImpl} is * {@link TokenStream} now extends {@link AttributeSource}, which provides
created and reused for every token. This approach reduces * access to all of the token {@link Attribute}s for the {@link TokenStream}.
object creations and allows local caching of references to * Note that only one instance per {@link AttributeImpl} is created and reused
the {@link AttributeImpl}s. See {@link #incrementToken()} for further details. * for every token. This approach reduces object creation and allows local
<p> * caching of references to the {@link AttributeImpl}s. See
<b>The workflow of the new TokenStream API is as follows:</b> * {@link #incrementToken()} for further details.
<ol> * <p>
<li>Instantiation of TokenStream/TokenFilters which add/get attributes * <b>The workflow of the new {@link TokenStream} API is as follows:</b>
to/from the {@link AttributeSource}. * <ol>
<li>The consumer calls {@link TokenStream#reset()}. * <li>Instantiation of {@link TokenStream}/{@link TokenFilter}s which add/get
<li>the consumer retrieves attributes from the * attributes to/from the {@link AttributeSource}.
stream and stores local references to all attributes it wants to access * <li>The consumer calls {@link TokenStream#reset()}.
<li>The consumer calls {@link #incrementToken()} until it returns false and * <li>the consumer retrieves attributes from the stream and stores local
consumes the attributes after each call. * references to all attributes it wants to access
</ol> * <li>The consumer calls {@link #incrementToken()} until it returns false and
To make sure that filters and consumers know which attributes are available * consumes the attributes after each call.
the attributes must be added during instantiation. Filters and * <li>The consumer calls {@link #end()} so that any end-of-stream operations
consumers are not required to check for availability of attributes in {@link #incrementToken()}. * can be performed.
<p> * <li>The consumer calls {@link #close()} to release any resource when finished
Sometimes it is desirable to capture a current state of a * using the {@link TokenStream}
TokenStream, e. g. for buffering purposes (see {@link CachingTokenFilter}, * </ol>
{@link TeeSinkTokenFilter}). For this usecase * To make sure that filters and consumers know which attributes are available,
{@link AttributeSource#captureState} and {@link AttributeSource#restoreState} can be used. * the attributes must be added during instantiation. Filters and consumers are
* not required to check for availability of attributes in
* {@link #incrementToken()}.
* <p>
* You can find some example code for the new API in the analysis package level
* Javadoc.
* <p>
* Sometimes it is desirable to capture a current state of a {@link TokenStream}
* , e. g. for buffering purposes (see {@link CachingTokenFilter},
* {@link TeeSinkTokenFilter}). For this usecase
* {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
* can be used.
*/ */
public abstract class TokenStream extends AttributeSource { public abstract class TokenStream extends AttributeSource {
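Note on the workflow in the class Javadoc above: the ordering (reset, fetch attribute references once, loop on incrementToken(), then end() and close()) may be easier to follow as code. The sketch below is a plain-Java model of that ordering only. `WorkflowSketch`, `TermAtt` and `SimpleStream` are hypothetical stand-ins written for illustration and are not Lucene classes, so the example runs without the Lucene jar.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the consumer workflow; none of these are Lucene classes. */
public class WorkflowSketch {

    /** Stand-in for an attribute: one mutable instance, reused for every token. */
    static final class TermAtt {
        String term = "";
    }

    /** Stand-in for a whitespace tokenizer over an in-memory string. */
    static final class SimpleStream {
        private final String[] words;
        private final TermAtt termAtt = new TermAtt(); // created once, at instantiation
        private int pos;
        boolean ended;
        boolean closed;

        SimpleStream(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        TermAtt termAttribute() {
            return termAtt;
        }

        void reset() {
            pos = 0;
            ended = false;
        }

        /** Overwrites the shared attribute; returns false at end of stream. */
        boolean incrementToken() {
            if (pos >= words.length) {
                return false;
            }
            termAtt.term = words[pos++];
            return true;
        }

        void end() {
            ended = true; // end-of-stream bookkeeping would go here
        }

        void close() {
            closed = true; // resources would be released here
        }
    }

    /** Runs the workflow steps in order and returns the terms that were seen. */
    public static List<String> consume(String text) {
        SimpleStream stream = new SimpleStream(text);
        stream.reset();
        TermAtt termAtt = stream.termAttribute(); // local reference, fetched once
        List<String> terms = new ArrayList<>();
        try {
            while (stream.incrementToken()) {
                terms.add(termAtt.term); // copy the value out before the next call
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume("a quick  brown fox"));
    }
}
```

The consumer never asks for the attribute inside the loop. It reads the same `TermAtt` instance after each successful `incrementToken()`, which is the reuse the Javadoc describes.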
@ -228,54 +242,67 @@ public abstract class TokenStream extends AttributeSource {
} }
/** /**
* For extra performance you can globally enable the new {@link #incrementToken} * For extra performance you can globally enable the new
* API using {@link Attribute}s. There will be a small, but in most cases neglectible performance * {@link #incrementToken} API using {@link Attribute}s. There will be a
* increase by enabling this, but it only works if <b>all</b> TokenStreams and -Filters * small, but in most cases negligible performance increase by enabling this,
* use the new API and implement {@link #incrementToken}. This setting can only be enabled * but it only works if <b>all</b> {@link TokenStream}s use the new API and
* implement {@link #incrementToken}. This setting can only be enabled
* globally. * globally.
* <P>This setting only affects TokenStreams instantiated after this call. All TokenStreams * <P>
* already created use the other setting. * This setting only affects {@link TokenStream}s instantiated after this
* <P>All core analyzers are compatible with this setting, if you have own * call. All {@link TokenStream}s already created use the other setting.
* TokenStreams/-Filters, that are also compatible, enable this. * <P>
* <P>When enabled, tokenization may throw {@link UnsupportedOperationException}s, * All core {@link Analyzer}s are compatible with this setting, if you have
* if the whole tokenizer chain is not compatible. * your own {@link TokenStream}s that are also compatible, you should enable
* <P>The default is <code>false</code>, so there is the fallback to the old API available. * this.
* @deprecated This setting will be <code>true</code> per default in Lucene 3.0, * <P>
* when {@link #incrementToken} is abstract and must be always implemented. * When enabled, tokenization may throw {@link UnsupportedOperationException}
* s, if the whole tokenizer chain is not compatible eg one of the
* {@link TokenStream}s does not implement the new {@link TokenStream} API.
* <P>
* The default is <code>false</code>, so there is the fallback to the old API
* available.
*
* @deprecated This setting will no longer be needed in Lucene 3.0 as the old
* API will be removed.
*/ */
public static void setOnlyUseNewAPI(boolean onlyUseNewAPI) { public static void setOnlyUseNewAPI(boolean onlyUseNewAPI) {
TokenStream.onlyUseNewAPI = onlyUseNewAPI; TokenStream.onlyUseNewAPI = onlyUseNewAPI;
} }
/** Returns if only the new API is used. /**
* Returns if only the new API is used.
*
* @see #setOnlyUseNewAPI * @see #setOnlyUseNewAPI
* @deprecated This setting will be <code>true</code> per default in Lucene 3.0, * @deprecated This setting will no longer be needed in Lucene 3.0 as
* when {@link #incrementToken} is abstract and must be always implemented. * the old API will be removed.
*/ */
public static boolean getOnlyUseNewAPI() { public static boolean getOnlyUseNewAPI() {
return onlyUseNewAPI; return onlyUseNewAPI;
} }
/** /**
* Consumers (eg the indexer) use this method to advance the stream * Consumers (ie {@link IndexWriter}) use this method to advance the stream to
* to the next token. Implementing classes must implement this method * the next token. Implementing classes must implement this method and update
* and update the appropriate {@link AttributeImpl}s with content of the * the appropriate {@link AttributeImpl}s with the attributes of the next
* next token. * token.
* <p> * <p>
* This method is called for every token of a document, so an efficient * This method is called for every token of a document, so an efficient
* implementation is crucial for good performance. To avoid calls to * implementation is crucial for good performance. To avoid calls to
* {@link #addAttribute(Class)} and {@link #getAttribute(Class)} and * {@link #addAttribute(Class)} and {@link #getAttribute(Class)} or downcasts,
* downcasts, references to all {@link AttributeImpl}s that this stream uses * references to all {@link AttributeImpl}s that this stream uses should be
* should be retrieved during instantiation. * retrieved during instantiation.
* <p> * <p>
* To make sure that filters and consumers know which attributes are available * To ensure that filters and consumers know which attributes are available,
* the attributes must be added during instantiation. Filters and * the attributes must be added during instantiation. Filters and consumers
* consumers are not required to check for availability of attributes in {@link #incrementToken()}. * are not required to check for availability of attributes in
* {@link #incrementToken()}.
* *
* @return false for end of stream; true otherwise * @return false for end of stream; true otherwise
* *
* <p> * <p>
* <b>Note that this method will be defined abstract in Lucene 3.0.</b> * <b>Note that this method will be defined abstract in Lucene
* 3.0.</b>
*/ */
public boolean incrementToken() throws IOException { public boolean incrementToken() throws IOException {
assert !onlyUseNewAPI && tokenWrapper != null; assert !onlyUseNewAPI && tokenWrapper != null;
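Note on the incrementToken() Javadoc above: the advice to retrieve attribute references during instantiation applies to producers and filters as well as consumers. The sketch below models a source and a filter that share one attribute instance, with the filter caching its reference in the constructor. `FilterSketch`, `Stream`, `WordSource` and `LowerCasingFilter` are hypothetical names in a self-contained model, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of a filter chain that shares one attribute instance. */
public class FilterSketch {

    /** Stand-in for a term attribute. */
    static final class TermAtt {
        String term = "";
    }

    /** Minimal stream contract: every stage exposes the shared attribute. */
    abstract static class Stream {
        final TermAtt termAtt;

        Stream(TermAtt termAtt) {
            this.termAtt = termAtt;
        }

        abstract boolean incrementToken();
    }

    /** Source stage: owns the attribute and fills it with the next word. */
    static final class WordSource extends Stream {
        private final String[] words;
        private int pos;

        WordSource(String... words) {
            super(new TermAtt());
            this.words = words;
        }

        @Override
        boolean incrementToken() {
            if (pos >= words.length) {
                return false;
            }
            termAtt.term = words[pos++];
            return true;
        }
    }

    /** Filter stage: reuses the input's attribute instance, cached at construction. */
    static final class LowerCasingFilter extends Stream {
        private final Stream input;

        LowerCasingFilter(Stream input) {
            super(input.termAtt); // same instance as the source, no lookup per token
            this.input = input;
        }

        @Override
        boolean incrementToken() {
            if (!input.incrementToken()) {
                return false;
            }
            termAtt.term = termAtt.term.toLowerCase(Locale.ROOT); // modify in place
            return true;
        }
    }

    /** Builds the chain and collects the lower-cased terms. */
    public static List<String> run(String... words) {
        Stream chain = new LowerCasingFilter(new WordSource(words));
        TermAtt termAtt = chain.termAtt;
        List<String> out = new ArrayList<>();
        while (chain.incrementToken()) {
            out.add(termAtt.term);
        }
        return out;
    }
}
```

Because every stage holds the same instance, the per-token path contains no lookups or casts, which is the performance point the Javadoc makes.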
@ -293,14 +320,15 @@ public abstract class TokenStream extends AttributeSource {
} }
/** /**
* This method is called by the consumer after the last token has been consumed, * This method is called by the consumer after the last token has been
* ie after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API) * consumed, eg after {@link #incrementToken()} returned <code>false</code>
* or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API). * (using the new {@link TokenStream} API) or after {@link #next(Token)} or
* {@link #next()} returned <code>null</code> (old {@link TokenStream} API).
* <p/> * <p/>
* This method can be used to perform any end-of-stream operations, such as setting the final * This method can be used to perform any end-of-stream operations, such as
* offset of a stream. The final offset of a stream might differ from the offset of the last token * setting the final offset of a stream. The final offset of a stream might
* eg in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer} * differ from the offset of the last token eg in case one or more whitespaces
* was used. * followed after the last token, but a {@link WhitespaceTokenizer} was used.
* *
* @throws IOException * @throws IOException
*/ */
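Note on the end() Javadoc above: the difference between the last token's end offset and the final offset of the stream can be shown with trailing whitespace. The sketch below is a self-contained model under that assumption. `EndOffsetSketch`, `OffsetAtt` and `WhitespaceStream` are hypothetical names and not Lucene classes.

```java
/** Plain-Java model of why end() may report an offset past the last token. */
public class EndOffsetSketch {

    /** Stand-in for an offset attribute. */
    static final class OffsetAtt {
        int start;
        int end;
    }

    /** Splits on whitespace and records character offsets for each token. */
    static final class WhitespaceStream {
        private final String text;
        private final OffsetAtt offsetAtt = new OffsetAtt();
        private int pos;

        WhitespaceStream(String text) {
            this.text = text;
        }

        OffsetAtt offsetAttribute() {
            return offsetAtt;
        }

        boolean incrementToken() {
            // Skip whitespace before the next token.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return false;
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            offsetAtt.start = start;
            offsetAtt.end = pos;
            return true;
        }

        /** End-of-stream operation: the final offset is the full input length. */
        void end() {
            offsetAtt.start = text.length();
            offsetAtt.end = text.length();
        }
    }

    /** Returns { end offset of the last token, final offset reported after end() }. */
    public static int[] offsets(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        OffsetAtt att = stream.offsetAttribute();
        int lastTokenEnd = 0;
        while (stream.incrementToken()) {
            lastTokenEnd = att.end;
        }
        stream.end();
        return new int[] { lastTokenEnd, att.end };
    }
}
```

For the input `"foo bar  "` the last token ends at offset 7, while the value set by `end()` is 9, the length of the input including the two trailing spaces.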
@ -308,34 +336,33 @@ public abstract class TokenStream extends AttributeSource {
// do nothing by default // do nothing by default
} }
/** Returns the next token in the stream, or null at EOS. /**
* When possible, the input Token should be used as the * Returns the next token in the stream, or null at EOS. When possible, the
* returned Token (this gives fastest tokenization * input Token should be used as the returned Token (this gives fastest
* performance), but this is not required and a new Token * tokenization performance), but this is not required and a new Token may be
* may be returned. Callers may re-use a single Token * returned. Callers may re-use a single Token instance for successive calls
* instance for successive calls to this method. * to this method.
* <p> * <p>
* This implicitly defines a "contract" between * This implicitly defines a "contract" between consumers (callers of this
* consumers (callers of this method) and * method) and producers (implementations of this method that are the source
* producers (implementations of this method * for tokens):
* that are the source for tokens):
* <ul> * <ul>
* <li>A consumer must fully consume the previously * <li>A consumer must fully consume the previously returned {@link Token}
* returned Token before calling this method again.</li> * before calling this method again.</li>
* <li>A producer must call {@link Token#clear()} * <li>A producer must call {@link Token#clear()} before setting the fields in
* before setting the fields in it & returning it</li> * it and returning it</li>
* </ul> * </ul>
* Also, the producer must make no assumptions about a * Also, the producer must make no assumptions about a {@link Token} after it
* Token after it has been returned: the caller may * has been returned: the caller may arbitrarily change it. If the producer
* arbitrarily change it. If the producer needs to hold * needs to hold onto the {@link Token} for subsequent calls, it must clone()
* onto the token for subsequent calls, it must clone() * it before storing it. Note that a {@link TokenFilter} is considered a
* it before storing it. * consumer.
* Note that a {@link TokenFilter} is considered a consumer. *
* @param reusableToken a Token that may or may not be used to * @param reusableToken a {@link Token} that may or may not be used to return;
* return; this parameter should never be null (the callee * this parameter should never be null (the callee is not required to
* is not required to check for null before using it, but it is a * check for null before using it, but it is a good idea to assert that
* good idea to assert that it is not null.) * it is not null.)
* @return next token in the stream or null if end-of-stream was hit * @return next {@link Token} in the stream or null if end-of-stream was hit
* @deprecated The new {@link #incrementToken()} and {@link AttributeSource} * @deprecated The new {@link #incrementToken()} and {@link AttributeSource}
* APIs should be used instead. * APIs should be used instead.
*/ */
@ -357,12 +384,13 @@ public abstract class TokenStream extends AttributeSource {
} }
} }
/** Returns the next token in the stream, or null at EOS. /**
* @deprecated The returned Token is a "full private copy" (not * Returns the next {@link Token} in the stream, or null at EOS.
* re-used across calls to next()) but will be slower *
* than calling {@link #next(Token)} or using the new * @deprecated The returned Token is a "full private copy" (not re-used across
* {@link #incrementToken()} method with the new * calls to {@link #next()}) but will be slower than calling
* {@link AttributeSource} API. * {@link #next(Token)} or using the new {@link #incrementToken()}
* method with the new {@link AttributeSource} API.
*/ */
public Token next() throws IOException { public Token next() throws IOException {
if (onlyUseNewAPI) if (onlyUseNewAPI)
@ -379,17 +407,15 @@ public abstract class TokenStream extends AttributeSource {
} }
} }
/** Resets this stream to the beginning. This is an /**
* optional operation, so subclasses may or may not * Resets this stream to the beginning. This is an optional operation, so
* implement this method. Reset() is not needed for * subclasses may or may not implement this method. {@link #reset()} is not needed for
* the standard indexing process. However, if the Tokens * the standard indexing process. However, if the tokens of a
* of a TokenStream are intended to be consumed more than * {@link TokenStream} are intended to be consumed more than once, it is
* once, it is necessary to implement reset(). Note that * necessary to implement {@link #reset()}. Note that if your TokenStream
* if your TokenStream caches tokens and feeds them back * caches tokens and feeds them back again after a reset, it is imperative
* again after a reset, it is imperative that you * that you clone the tokens when you store them away (on the first pass) as
* clone the tokens when you store them away (on the * well as when you return them (on future passes after {@link #reset()}).
* first pass) as well as when you return them (on future
* passes after reset()).
*/ */
public void reset() throws IOException {} public void reset() throws IOException {}
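Note on the reset() Javadoc above: the requirement to clone tokens both when storing them and when returning them follows from two facts stated in this file, namely that producers may reuse one instance and that callers may arbitrarily change what they receive. The sketch below models this in plain Java. `CachingSketch`, `Tok`, `ReusingSource` and `CachingStream` are hypothetical names, and the fill-on-first-call behaviour is an assumption chosen for brevity, not a description of any Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a caching stream that survives reset(). */
public class CachingSketch {

    /** Stand-in for a mutable token. */
    static final class Tok {
        String term;

        Tok(String term) {
            this.term = term;
        }

        Tok copy() {
            return new Tok(term);
        }
    }

    /** Producer that reuses a single Tok instance for every call. */
    static final class ReusingSource {
        private final String[] words;
        private final Tok shared = new Tok("");
        private int pos;

        ReusingSource(String... words) {
            this.words = words;
        }

        Tok next() {
            if (pos >= words.length) {
                return null;
            }
            shared.term = words[pos++];
            return shared;
        }
    }

    /** Fills its cache on the first call, then replays it after each reset(). */
    static final class CachingStream {
        private final ReusingSource input;
        private List<Tok> cache;
        private int pos;

        CachingStream(ReusingSource input) {
            this.input = input;
        }

        Tok next() {
            if (cache == null) {
                cache = new ArrayList<>();
                for (Tok t = input.next(); t != null; t = input.next()) {
                    cache.add(t.copy()); // clone when storing: the source reuses t
                }
            }
            if (pos >= cache.size()) {
                return null;
            }
            return cache.get(pos++).copy(); // clone when returning: callers may mutate
        }

        void reset() {
            pos = 0;
        }
    }

    /** Consumes the stream and deliberately modifies every returned token. */
    static List<String> drain(CachingStream stream) {
        List<String> out = new ArrayList<>();
        for (Tok t = stream.next(); t != null; t = stream.next()) {
            out.add(t.term);
            t.term = "clobbered"; // allowed: the caller owns the returned token
        }
        return out;
    }

    /** Runs two passes separated by reset() and returns the terms of each pass. */
    public static List<List<String>> twoPasses(String... words) {
        CachingStream stream = new CachingStream(new ReusingSource(words));
        List<List<String>> passes = new ArrayList<>();
        passes.add(drain(stream));
        stream.reset();
        passes.add(drain(stream));
        return passes;
    }
}
```

Dropping either `copy()` call breaks the second pass. Without the copy on store, every cache entry is the same reused instance; without the copy on return, the first pass overwrites the cached terms.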