git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1332797 13f79535-47bb-0310-9956-ffa450edef68
Michael McCandless 2012-05-01 19:52:39 +00:00
parent 9554a045e9
commit 22f2879134
1 changed files with 27 additions and 8 deletions

@ -51,15 +51,23 @@ import org.apache.lucene.util.fst.FST; // javadocs
* <p> * <p>
* <a name="Termdictionary" id="Termdictionary"></a> * <a name="Termdictionary" id="Termdictionary"></a>
* <h3>Term Dictionary</h3> * <h3>Term Dictionary</h3>
* <p>The .tim file contains the list of terms in each field, in UTF-8 order, *
* along with per-term statistics (such as docfreq) and pointers to the frequencies, * <p>The .tim file contains the list of terms in each
* positions, and skip data in the .frq and .prx files. * field along with per-term statistics (such as docfreq)
* and pointers to the frequencies, positions, payloads and
* skip data in the .frq and .prx files.
* </p> * </p>
* <p>The .tim file is arranged in blocks, each containing either terms or *
* sub-blocks.</p> * <p>The .tim file is arranged in blocks, each containing
* a variable number of entries (by default 25-48), where
* each entry is either a term or a reference to a
* sub-block. It's written by {@link BlockTreeTermsWriter}
* and read by {@link BlockTreeTermsReader}.</p>
*
* <p>NOTE: The term dictionary can plug into different postings implementations: * <p>NOTE: The term dictionary can plug into different postings implementations:
* for example the postings writer/reader are actually responsible for encoding * for example the postings writer/reader are actually responsible for encoding
* and decoding the MetadataBlock.</p> * and decoding the MetadataBlock.</p>
*
* <ul> * <ul>
* <!-- TODO: expand on this, it's not really correct and doesn't explain sub-blocks etc --> * <!-- TODO: expand on this, it's not really correct and doesn't explain sub-blocks etc -->
* <li>TermsDict (.tim) --&gt; Header, DirOffset, PostingsHeader, SkipInterval, * <li>TermsDict (.tim) --&gt; Header, DirOffset, PostingsHeader, SkipInterval,
@ -122,7 +130,8 @@ import org.apache.lucene.util.fst.FST; // javadocs
* <a name="Termindex" id="Termindex"></a> * <a name="Termindex" id="Termindex"></a>
* <h3>Term Index</h3> * <h3>Term Index</h3>
* <p>The .tip file contains an index into the term dictionary, so that it can be * <p>The .tip file contains an index into the term dictionary, so that it can be
* accessed randomly.</p> * accessed randomly. The index is also used to determine
* when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
* <ul> * <ul>
* <li>TermsIndex (.tip) --&gt; Header, &lt;IndexStartFP&gt;<sup>NumFields</sup>, * <li>TermsIndex (.tip) --&gt; Header, &lt;IndexStartFP&gt;<sup>NumFields</sup>,
* FSTIndex<sup>NumFields</sup></li> * FSTIndex<sup>NumFields</sup></li>
@ -133,8 +142,18 @@ import org.apache.lucene.util.fst.FST; // javadocs
* </ul> * </ul>
* <p>Notes:</p> * <p>Notes:</p>
* <ul> * <ul>
* <li>The .tip file contains a separate FST for each field. Each field's IndexStartFP points * <li>The .tip file contains a separate FST for each
* to its FST.</li> * field. The FST maps a term prefix to the on-disk
* block that holds all terms starting with that
* prefix. Each field's IndexStartFP points to its
* FST.</li>
* <li>It's possible that an on-disk block would contain
* too many terms (more than the allowed maximum,
* 48 by default). When this happens, the block is
* sub-divided into new blocks (called "floor
* blocks"), and then the output in the FST for the
* block's prefix encodes the leading byte of each
* sub-block, and its file pointer.</li>
* </ul> * </ul>
* <a name="Frequencies" id="Frequencies"></a> * <a name="Frequencies" id="Frequencies"></a>
* <h3>Frequencies</h3> * <h3>Frequencies</h3>