LUCENE-1487: improve javadoc for FieldCacheTermsFilter

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@738862 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael McCandless 2009-01-29 14:13:35 +00:00
parent 994ae0e18a
commit 48c3220021
1 changed files with 65 additions and 14 deletions

View File

@ -18,31 +18,82 @@ package org.apache.lucene.search;
*/
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs; // for javadoc
import org.apache.lucene.util.OpenBitSet;
import java.io.IOException;
import java.util.Iterator;
/**
* A term filter built on top of a cached single field (in FieldCache). It can be used only
* with single-valued fields.
* A {@link Filter} that only accepts documents whose single
* term value in the specified field is contained in the
* provided set of allowed terms.
*
* <p/>
* FieldCacheTermsFilter builds a single cache for the field the first time it is used. Each
* subsequent FieldCacheTermsFilter on the same field then re-uses this cache even if the terms
* themselves are different.
*
* This is the same functionality as TermsFilter (from
* contrib/queries), except this filter requires that the
* field contains only a single term for all documents.
* Because of drastically different implementations, they
* also have different performance characteristics, as
* described below.
*
* <p/>
* The FieldCacheTermsFilter is faster than building a TermsFilter each time.
* FieldCacheTermsFilter are fast to build in cases where number of documents are far more than
* unique terms. Internally, it creates a BitSet by term number and scans by document id.
*
* The first invocation of this filter on a given field will
* be slower, since a {@link FieldCache.StringIndex} must be
* created. Subsequent invocations using the same field
* will re-use this cache. However, as with all
* functionality based on {@link FieldCache}, persistent RAM
* is consumed to hold the cache, and is not freed until the
* {@link IndexReader} is closed. In contrast, TermsFilter
* has no persistent RAM consumption.
*
*
* <p/>
* As with all FieldCache based functionality, FieldCacheTermsFilter is only valid for fields
* which contain zero or one terms for each document. Thus it works on dates, prices and other
* single value fields but will not work on regular text fields. It is preferable to use an
* NOT_ANALYZED field to ensure that there is only a single term.
*
* With each search, this filter translates the specified
* set of Terms into a private {@link OpenBitSet} keyed by
* term number per unique {@link IndexReader} (normally one
* reader per segment). Then, during matching, the term
* number for each docID is retrieved from the cache and
* then checked for inclusion using the {@link OpenBitSet}.
* Since all testing is done using RAM resident data
* structures, performance should be very fast, most likely
* fast enough to not require further caching of the
* DocIdSet for each possible combination of terms.
* However, because docIDs are simply scanned linearly, an
* index with a great many small documents may find this
* linear scan too costly.
*
* <p/>
* Also, collation is performed at the time the FieldCache is built; to change collation you
* need to override the getFieldCache() method to change the underlying cache.
*
* In contrast, TermsFilter builds up an {@link OpenBitSet},
* keyed by docID, every time it's created, by enumerating
* through all matching docs using {@link TermDocs} to seek
* and scan through each term's docID list. While there is
* no linear scan of all docIDs, besides the allocation of
* the underlying array in the {@link OpenBitSet}, this
* approach requires a number of "disk seeks" in proportion
* to the number of terms, which can be exceptionally costly
* when there are cache misses in the OS's IO cache.
*
* <p/>
*
* Generally, this filter will be slower on the first
* invocation for a given field, but subsequent invocations,
* even if you change the allowed set of Terms, should be
* faster than TermsFilter, especially as the number of
* Terms being matched increases. If you are matching only
* a very small number of terms, and those terms in turn
* match a very small number of documents, TermsFilter may
* perform faster.
*
* <p/>
*
* Which filter is best is very application dependent.
*/
public class FieldCacheTermsFilter extends Filter {
private String field;
private Iterable terms;