LUCENE-1487: improve javadoc for FieldCacheTermsFilter

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@738862 13f79535-47bb-0310-9956-ffa450edef68
2009-01-29 14:13:35 +00:00 · 2009-01-29 14:13:35 +00:00 · 48c3220021
parent 994ae0e18a
commit 48c3220021
1 changed files with 65 additions and 14 deletions
--- a/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java
+++ b/src/java/org/apache/lucene/search/FieldCacheTermsFilter.java
@ -18,31 +18,82 @@ package org.apache.lucene.search;
 */

 import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.TermDocs;          // for javadoc
 import org.apache.lucene.util.OpenBitSet;

 import java.io.IOException;
 import java.util.Iterator;

 /**
- * A term filter built on top of a cached single field (in FieldCache). It can be used only
- * with single-valued fields.
+ * A {@link Filter} that only accepts documents whose single
+ * term value in the specified field is contained in the
+ * provided set of allowed terms.
+ * 
 * <p/>
- * FieldCacheTermsFilter builds a single cache for the field the first time it is used. Each
- * subsequent FieldCacheTermsFilter on the same field then re-uses this cache even if the terms
- * themselves are different.
+ * 
+ * This is the same functionality as TermsFilter (from
+ * contrib/queries), except this filter requires that the
+ * field contains only a single term for all documents.
+ * Because of drastically different implementations, they
+ * also have different performance characteristics, as
+ * described below.
+ * 
 * <p/>
- * The FieldCacheTermsFilter is faster than building a TermsFilter each time.
- * FieldCacheTermsFilter are fast to build in cases where number of documents are far more than
- * unique terms. Internally, it creates a BitSet by term number and scans by document id.
+ * 
+ * The first invocation of this filter on a given field will
+ * be slower, since a {@link FieldCache.StringIndex} must be
+ * created.  Subsequent invocations using the same field
+ * will re-use this cache.  However, as with all
+ * functionality based on {@link FieldCache}, persistent RAM
+ * is consumed to hold the cache, and is not freed until the
+ * {@link IndexReader} is closed.  In contrast, TermsFilter
+ * has no persistent RAM consumption.
+ * 
+ * 
 * <p/>
- * As with all FieldCache based functionality, FieldCacheTermsFilter is only valid for fields
- * which contain zero or one terms for each document. Thus it works on dates, prices and other
- * single value fields but will not work on regular text fields. It is preferable to use an
- * NOT_ANALYZED field to ensure that there is only a single term.
+ * 
+ * With each search, this filter translates the specified
+ * set of Terms into a private {@link OpenBitSet} keyed by
+ * term number per unique {@link IndexReader} (normally one
+ * reader per segment).  Then, during matching, the term
+ * number for each docID is retrieved from the cache and
+ * then checked for inclusion using the {@link OpenBitSet}.
+ * Since all testing is done using RAM resident data
+ * structures, performance should be very fast, most likely
+ * fast enough to not require further caching of the
+ * DocIdSet for each possible combination of terms.
+ * However, because docIDs are simply scanned linearly, an
+ * index with a great many small documents may find this
+ * linear scan too costly.
+ * 
 * <p/>
- * Also, collation is performed at the time the FieldCache is built; to change collation you
- * need to override the getFieldCache() method to change the underlying cache.
+ * 
+ * In contrast, TermsFilter builds up an {@link OpenBitSet},
+ * keyed by docID, every time it's created, by enumerating
+ * through all matching docs using {@link TermDocs} to seek
+ * and scan through each term's docID list.  While there is
+ * no linear scan of all docIDs, besides the allocation of
+ * the underlying array in the {@link OpenBitSet}, this
+ * approach requires a number of "disk seeks" in proportion
+ * to the number of terms, which can be exceptionally costly
+ * when there are cache misses in the OS's IO cache.
+ * 
+ * <p/>
+ * 
+ * Generally, this filter will be slower on the first
+ * invocation for a given field, but subsequent invocations,
+ * even if you change the allowed set of Terms, should be
+ * faster than TermsFilter, especially as the number of
+ * Terms being matched increases.  If you are matching only
+ * a very small number of terms, and those terms in turn
+ * match a very small number of documents, TermsFilter may
+ * perform faster.
+ *
+ * <p/>
+ *
+ * Which filter is best is very application dependent.
 */
+
 public class FieldCacheTermsFilter extends Filter {
  private String field;
  private Iterable terms;