mirror of https://github.com/apache/lucene.git
LUCENE-1487: improve javadoc for FieldCacheTermsFilter
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@738862 13f79535-47bb-0310-9956-ffa450edef68
parent 994ae0e18a
commit 48c3220021
@@ -18,31 +18,82 @@ package org.apache.lucene.search;
 */

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs; // for javadoc
import org.apache.lucene.util.OpenBitSet;

import java.io.IOException;
import java.util.Iterator;

/**
 * A term filter built on top of a cached single field (in FieldCache). It can be used only
 * with single-valued fields.
 * A {@link Filter} that only accepts documents whose single
 * term value in the specified field is contained in the
 * provided set of allowed terms.
 *
 * <p/>
 * FieldCacheTermsFilter builds a single cache for the field the first time it is used. Each
 * subsequent FieldCacheTermsFilter on the same field then re-uses this cache even if the terms
 * themselves are different.
 *
 * This is the same functionality as TermsFilter (from
 * contrib/queries), except this filter requires that the
 * field contains only a single term for all documents.
 * Because of drastically different implementations, they
 * also have different performance characteristics, as
 * described below.
 *
 * <p/>
 * The FieldCacheTermsFilter is faster than building a TermsFilter each time.
 * FieldCacheTermsFilter are fast to build in cases where number of documents are far more than
 * unique terms. Internally, it creates a BitSet by term number and scans by document id.
 *
 * The first invocation of this filter on a given field will
 * be slower, since a {@link FieldCache.StringIndex} must be
 * created. Subsequent invocations using the same field
 * will re-use this cache. However, as with all
 * functionality based on {@link FieldCache}, persistent RAM
 * is consumed to hold the cache, and is not freed until the
 * {@link IndexReader} is closed. In contrast, TermsFilter
 * has no persistent RAM consumption.
 *
 *
 * <p/>
 * As with all FieldCache based functionality, FieldCacheTermsFilter is only valid for fields
 * which contain zero or one terms for each document. Thus it works on dates, prices and other
 * single value fields but will not work on regular text fields. It is preferable to use an
 * NOT_ANALYZED field to ensure that there is only a single term.
 *
 * With each search, this filter translates the specified
 * set of Terms into a private {@link OpenBitSet} keyed by
 * term number per unique {@link IndexReader} (normally one
 * reader per segment). Then, during matching, the term
 * number for each docID is retrieved from the cache and
 * then checked for inclusion using the {@link OpenBitSet}.
 * Since all testing is done using RAM resident data
 * structures, performance should be very fast, most likely
 * fast enough to not require further caching of the
 * DocIdSet for each possible combination of terms.
 * However, because docIDs are simply scanned linearly, an
 * index with a great many small documents may find this
 * linear scan too costly.
 *
 * <p/>
 * Also, collation is performed at the time the FieldCache is built; to change collation you
 * need to override the getFieldCache() method to change the underlying cache.
 *
 * In contrast, TermsFilter builds up an {@link OpenBitSet},
 * keyed by docID, every time it's created, by enumerating
 * through all matching docs using {@link TermDocs} to seek
 * and scan through each term's docID list. While there is
 * no linear scan of all docIDs, besides the allocation of
 * the underlying array in the {@link OpenBitSet}, this
 * approach requires a number of "disk seeks" in proportion
 * to the number of terms, which can be exceptionally costly
 * when there are cache misses in the OS's IO cache.
 *
 * <p/>
 *
 * Generally, this filter will be slower on the first
 * invocation for a given field, but subsequent invocations,
 * even if you change the allowed set of Terms, should be
 * faster than TermsFilter, especially as the number of
 * Terms being matched increases. If you are matching only
 * a very small number of terms, and those terms in turn
 * match a very small number of documents, TermsFilter may
 * perform faster.
 *
 * <p/>
 *
 * Which filter is best is very application dependent.
 */

public class FieldCacheTermsFilter extends Filter {
  private String field;
  private Iterable terms;
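
The matching strategy described in the new javadoc (translate the allowed terms into a bit set keyed by term number, then scan docIDs linearly and test each document's term number against that bit set) can be illustrated with a minimal, self-contained sketch. This is not the code from this commit and uses no Lucene classes: the class name TermOrdinalFilterSketch, the method matchingDocs, and the two plain arrays standing in for the cached field data are all hypothetical, and java.util.BitSet stands in for OpenBitSet.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not the FieldCacheTermsFilter implementation.
public class TermOrdinalFilterSketch {

    /**
     * @param sortedTerms unique term values of the field in sorted order
     *                    (assumed stand-in for the cached term table)
     * @param docToOrd    for each docID, the index of its term in sortedTerms
     *                    (assumed stand-in for the cached per-document term numbers)
     * @param allowed     the terms a document must carry to be accepted
     * @return a bit set keyed by docID with one bit set per accepted document
     */
    static BitSet matchingDocs(String[] sortedTerms, int[] docToOrd, List<String> allowed) {
        // Step 1: translate the allowed terms into a bit set keyed by term number.
        BitSet allowedOrds = new BitSet(sortedTerms.length);
        for (String term : allowed) {
            int ord = Arrays.binarySearch(sortedTerms, term);
            if (ord >= 0) {          // terms absent from the field are ignored
                allowedOrds.set(ord);
            }
        }
        // Step 2: linear scan over docIDs, checking each document's term number.
        BitSet docs = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            if (allowedOrds.get(docToOrd[doc])) {
                docs.set(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String[] terms = {"blue", "green", "red"};
        int[] docToOrd = {2, 0, 1, 2, 0};   // doc 0 -> "red", doc 1 -> "blue", ...
        BitSet hits = matchingDocs(terms, docToOrd, Arrays.asList("red", "yellow"));
        System.out.println(hits);           // prints {0, 3}
    }
}
```

In this sketch, step 1 costs one lookup per allowed term and step 2 touches every document once. That matches the trade-off the javadoc describes: the per-search work is done entirely on in-memory data, but it grows with the number of documents rather than with the number of matches.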