mirror of https://github.com/apache/lucene.git
LUCENE-4446: factor blocktree's documentation out of Lucene40/Block, minor cleanups to Block's docs
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1396060 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
2830b50786
commit
41f3f73539
|
@ -38,8 +38,8 @@ import org.apache.lucene.util.IOUtils;
|
|||
import org.apache.lucene.util.packed.PackedInts;
|
||||
|
||||
/**
|
||||
* Block postings format, which encodes postings in packed int blocks
|
||||
* for faster decode.
|
||||
* Block postings format, which encodes postings in packed integer blocks
|
||||
* for fast decode.
|
||||
*
|
||||
* <p><b>NOTE</b>: this format is still experimental and
|
||||
* subject to change without backwards compatibility.
|
||||
|
@ -48,21 +48,22 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* Basic idea:
|
||||
* <ul>
|
||||
* <li>
|
||||
* <b>Packed Block and VInt Block</b>:
|
||||
* <p>In packed block, integers are encoded with the same bit width ({@link PackedInts packed format}),
|
||||
* the block size (i.e. number of integers inside block) is fixed. </p>
|
||||
* <p>In VInt block, integers are encoded as {@link DataOutput#writeVInt VInt},
|
||||
* <b>Packed Blocks and VInt Blocks</b>:
|
||||
* <p>In packed blocks, integers are encoded with the same bit width ({@link PackedInts packed format}):
|
||||
* the block size (i.e. number of integers inside block) is fixed (currently 128). Additionally blocks
|
||||
* that are all the same value are encoded in an optimized way.</p>
|
||||
* <p>In VInt blocks, integers are encoded as {@link DataOutput#writeVInt VInt}:
|
||||
* the block size is variable.</p>
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <b>Block structure</b>:
|
||||
* <p>When the postings is long enough, BlockPostingsFormat will try to encode most integer data
|
||||
* as packed block.</p>
|
||||
* <p>Take a term with 259 documents as example, the first 256 document ids are encoded as two packed
|
||||
* blocks, while the remaining 3 as one VInt block. </p>
|
||||
* <p>When the postings are long enough, BlockPostingsFormat will try to encode most integer data
|
||||
* as a packed block.</p>
|
||||
* <p>Take a term with 259 documents as an example, the first 256 document ids are encoded as two packed
|
||||
* blocks, while the remaining 3 are encoded as one VInt block. </p>
|
||||
* <p>Different kinds of data are always encoded separately into different packed blocks, but may
|
||||
* possible be encoded into a same VInt block. </p>
|
||||
* possibly be interleaved into the same VInt block. </p>
|
||||
* <p>This strategy is applied to pairs:
|
||||
* <document number, frequency>,
|
||||
* <position, payload length>,
|
||||
|
@ -71,25 +72,25 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <b>Skipper setting</b>:
|
||||
* <p>The structure of skip table is quite similar to Lucene40PostingsFormat. Skip interval is the
|
||||
* <b>Skipdata settings</b>:
|
||||
* <p>The structure of skip table is quite similar to previous version of Lucene. Skip interval is the
|
||||
* same as block size, and each skip entry points to the beginning of each block. However, for
|
||||
* the first block, skip data is omitted.</p>
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <b>Positions, Payloads, and Offsets</b>:
|
||||
* <p>A position is an integer indicating where the term occurs at within one document.
|
||||
* <p>A position is an integer indicating where the term occurs within one document.
|
||||
* A payload is a blob of metadata associated with current position.
|
||||
* An offset is a pair of integers indicating the tokenized start/end offsets for given term
|
||||
* in current position. </p>
|
||||
* in current position: it is essentially a specialized payload. </p>
|
||||
* <p>When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a
|
||||
* null payload contributes one count). As mentioned in block structure, it is possible to encode
|
||||
* these three either combined or separately.
|
||||
* <p>For all the cases, payloads and offsets are stored together. When encoded as packed block,
|
||||
* <p>In all cases, payloads and offsets are stored together. When encoded as a packed block,
|
||||
* position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload
|
||||
* metadata will also be stored directly in .pay). When encoded as VInt block, all these three are
|
||||
* stored in .pos (so as payload metadata).</p>
|
||||
* metadata will also be stored directly in .pay). When encoded as VInt blocks, all these three are
|
||||
* stored interleaved into the .pos (so is payload metadata).</p>
|
||||
* <p>With this strategy, the majority of payload and offset data will be outside .pos file.
|
||||
* So for queries that require only position data, running on a full index with payloads and offsets,
|
||||
* this reduces disk pre-fetches.</p>
|
||||
|
@ -113,35 +114,29 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* <dd>
|
||||
* <b>Term Dictionary</b>
|
||||
*
|
||||
* <p>The .tim file format is quite similar to Lucene40PostingsFormat,
|
||||
* with minor difference in MetadataBlock</p>
|
||||
* <p>The .tim file contains the list of terms in each
|
||||
* field along with per-term statistics (such as docfreq)
|
||||
* and pointers to the frequencies, positions, payload and
|
||||
* skip data in the .doc, .pos, and .pay files.
|
||||
* See {@link BlockTreeTermsWriter} for more details on the format.
|
||||
* </p>
|
||||
*
|
||||
* <p>NOTE: The term dictionary can plug into different postings implementations:
|
||||
* the postings writer/reader are actually responsible for encoding
|
||||
* and decoding the Postings Metadata and Term Metadata sections described here:</p>
|
||||
*
|
||||
* <ul>
|
||||
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
|
||||
* <li>TermDictionary(.tim) --> Header, PostingsHeader, PackedBlockSize,
|
||||
* <Block><sup>NumBlocks</sup>, FieldSummary, DirOffset</li>
|
||||
* <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
|
||||
* <li>SuffixBlock --> EntryCount, SuffixLength, {@link DataOutput#writeByte byte}<sup>SuffixLength</sup></li>
|
||||
* <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
|
||||
* <li>MetadataBlock --> MetaLength, <DocFPDelta,
|
||||
* <PosFPDelta, PosVIntBlockFPDelta?, PayFPDelta?>?,
|
||||
* SkipFPDelta?><sup>EntryCount</sup></li>
|
||||
* <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength,
|
||||
* {@link DataOutput#writeByte byte}<sup>RootCodeLength</sup>, SumDocFreq, DocCount>
|
||||
* <sup>NumFields</sup></li>
|
||||
* <li>Header, PostingsHeader --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
|
||||
* <li>PackedBlockSize, EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength,
|
||||
* PosVIntBlockFPDelta, SkipFPDelta, NumFields, FieldNumber, RootCodeLength, DocCount -->
|
||||
* {@link DataOutput#writeVInt VInt}</li>
|
||||
* <li>TotalTermFreq, DocFPDelta, PosFPDelta, PayFPDelta, NumTerms, SumTotalTermFreq, SumDocFreq -->
|
||||
* {@link DataOutput#writeVLong VLong}</li>
|
||||
* <li>Postings Metadata --> Header, PackedBlockSize</li>
|
||||
* <li>Term Metadata --> DocFPDelta, PosFPDelta?, PosVIntBlockFPDelta?, PayFPDelta?,
|
||||
* SkipFPDelta?</li>
|
||||
* <li>Header, --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>PackedBlockSize, PosVIntBlockFPDelta, SkipFPDelta --> {@link DataOutput#writeVInt VInt}</li>
|
||||
* <li>DocFPDelta, PosFPDelta, PayFPDelta --> {@link DataOutput#writeVLong VLong}</li>
|
||||
* </ul>
|
||||
* <p>Notes:</p>
|
||||
* <ul>
|
||||
* <li>Here explains MetadataBlock only, other fields are mentioned in
|
||||
* <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary">Lucene40PostingsFormat:TermDictionary</a>
|
||||
* </li>
|
||||
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
|
||||
* for the postings.</li>
|
||||
* <li>PackedBlockSize is the fixed block size for packed blocks. In packed block, bit width is
|
||||
* determined by the largest integer. Smaller block size result in smaller variance among width
|
||||
* of integers hence smaller indexes. Larger block size result in more efficient bulk i/o hence
|
||||
|
@ -175,8 +170,8 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* <dl>
|
||||
* <dd>
|
||||
* <b>Term Index</b>
|
||||
* <p>The .tim file format is mentioned in
|
||||
* <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termindex">Lucene40PostingsFormat:TermIndex</a>
|
||||
* <p>The .tip file contains an index into the term dictionary, so that it can be
|
||||
* accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
|
||||
* </dd>
|
||||
* </dl>
|
||||
*
|
||||
|
@ -298,7 +293,7 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* <dd>
|
||||
* <b>Payloads and Offsets</b>
|
||||
* <p>The .pay file will store payloads and offsets associated with certain term-document positions.
|
||||
* Some payloads and offsets will be separated out into .pos file, for speedup reason.</p>
|
||||
* Some payloads and offsets will be separated out into .pos file, for performance reasons.</p>
|
||||
* <ul>
|
||||
* <li>PayFile(.pay): --> Header, <TermPayloads, TermOffsets?> <sup>TermCount</sup></li>
|
||||
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
|
@ -319,7 +314,7 @@ import org.apache.lucene.util.packed.PackedInts;
|
|||
* for PackedOffsetBlockNum.</li>
|
||||
* <li>SumPayLength is the total length of payloads written within one block, should be the sum
|
||||
* of PayLengths in one packed block.</li>
|
||||
* <li>PayLength in PackedPayLengthBlock is the length of each payload, associated with current
|
||||
* <li>PayLength in PackedPayLengthBlock is the length of each payload associated with the current
|
||||
* position.</li>
|
||||
* </ul>
|
||||
* </dd>
|
||||
|
|
|
@ -23,10 +23,12 @@ import java.util.Comparator;
|
|||
import java.util.List;
|
||||
|
||||
import org.apache.lucene.index.FieldInfo.IndexOptions;
|
||||
import org.apache.lucene.index.DocsEnum;
|
||||
import org.apache.lucene.index.FieldInfo;
|
||||
import org.apache.lucene.index.FieldInfos;
|
||||
import org.apache.lucene.index.IndexFileNames;
|
||||
import org.apache.lucene.index.SegmentWriteState;
|
||||
import org.apache.lucene.store.DataOutput;
|
||||
import org.apache.lucene.store.IndexOutput;
|
||||
import org.apache.lucene.store.RAMOutputStream;
|
||||
import org.apache.lucene.util.ArrayUtil;
|
||||
|
@ -76,6 +78,97 @@ import org.apache.lucene.util.fst.Util;
|
|||
* Writes terms dict and index, block-encoding (column
|
||||
* stride) each term's metadata for each set of terms
|
||||
* between two index terms.
|
||||
* <p>
|
||||
* Files:
|
||||
* <ul>
|
||||
* <li><tt>.tim</tt>: <a href="#Termdictionary">Term Dictionary</a></li>
|
||||
* <li><tt>.tip</tt>: <a href="#Termindex">Term Index</a></li>
|
||||
* </ul>
|
||||
* <p>
|
||||
* <a name="Termdictionary" id="Termdictionary"></a>
|
||||
* <h3>Term Dictionary</h3>
|
||||
*
|
||||
* <p>The .tim file contains the list of terms in each
|
||||
* field along with per-term statistics (such as docfreq)
|
||||
* and per-term metadata (typically pointers to the postings list
|
||||
* for that term in the inverted index).
|
||||
* </p>
|
||||
*
|
||||
* <p>The .tim is arranged in blocks: with blocks containing
|
||||
* a variable number of entries (by default 25-48), where
|
||||
* each entry is either a term or a reference to a
|
||||
* sub-block.</p>
|
||||
*
|
||||
* <p>NOTE: The term dictionary can plug into different postings implementations:
|
||||
* for example the postings writer/reader are actually responsible for encoding
|
||||
* and decoding the MetadataBlock.</p>
|
||||
*
|
||||
* <ul>
|
||||
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
|
||||
* <li>TermsDict (.tim) --> Header, <i>Postings Metadata</i>, Block<sup>NumBlocks</sup>,
|
||||
* FieldSummary, DirOffset</li>
|
||||
* <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
|
||||
* <li>SuffixBlock --> EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
|
||||
* <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
|
||||
* <li>MetadataBlock --> MetaLength, <<i>Term Metadata</i>><sup>EntryCount</sup></li>
|
||||
* <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
|
||||
* SumDocFreq, DocCount><sup>NumFields</sup></li>
|
||||
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
|
||||
* <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
|
||||
* FieldNumber,RootCodeLength,DocCount --> {@link DataOutput#writeVInt VInt}</li>
|
||||
* <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq -->
|
||||
* {@link DataOutput#writeVLong VLong}</li>
|
||||
* </ul>
|
||||
* <p>Notes:</p>
|
||||
* <ul>
|
||||
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
|
||||
* for the BlockTree implementation.</li>
|
||||
* <li>DirOffset is a pointer to the FieldSummary section.</li>
|
||||
* <li>DocFreq is the count of documents which contain the term.</li>
|
||||
* <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
|
||||
* as the difference between the total number of occurrences and the DocFreq.</li>
|
||||
* <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
|
||||
* <li>NumTerms is the number of unique terms for the field.</li>
|
||||
* <li>RootCode points to the root block for the field.</li>
|
||||
* <li>SumDocFreq is the total number of postings, the number of term-document pairs across
|
||||
* the entire field.</li>
|
||||
* <li>DocCount is the number of documents that have at least one posting for this field.</li>
|
||||
* <li>PostingsMetadata and TermMetadata are plugged into by the specific postings implementation:
|
||||
* these contain arbitrary per-file data (such as parameters or versioning information)
|
||||
* and per-term data (such as pointers to inverted files).
|
||||
* </ul>
|
||||
* <a name="Termindex" id="Termindex"></a>
|
||||
* <h3>Term Index</h3>
|
||||
* <p>The .tip file contains an index into the term dictionary, so that it can be
|
||||
* accessed randomly. The index is also used to determine
|
||||
* when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
|
||||
* <ul>
|
||||
* <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>
|
||||
* <IndexStartFP><sup>NumFields</sup>, DirOffset</li>
|
||||
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
|
||||
* <li>IndexStartFP --> {@link DataOutput#writeVLong VLong}</li>
|
||||
* <!-- TODO: better describe FST output here -->
|
||||
* <li>FSTIndex --> {@link FST FST<byte[]>}</li>
|
||||
* </ul>
|
||||
* <p>Notes:</p>
|
||||
* <ul>
|
||||
* <li>The .tip file contains a separate FST for each
|
||||
* field. The FST maps a term prefix to the on-disk
|
||||
* block that holds all terms starting with that
|
||||
* prefix. Each field's IndexStartFP points to its
|
||||
* FST.</li>
|
||||
* <li>DirOffset is a pointer to the start of the IndexStartFPs
|
||||
* for all fields</li>
|
||||
* <li>It's possible that an on-disk block would contain
|
||||
* too many terms (more than the allowed maximum
|
||||
* (default: 48)). When this happens, the block is
|
||||
* sub-divided into new blocks (called "floor
|
||||
* blocks"), and then the output in the FST for the
|
||||
* block's prefix encodes the leading byte of each
|
||||
* sub-block, and its file pointer.
|
||||
* </ul>
|
||||
*
|
||||
* @see BlockTreeTermsReader
|
||||
* @lucene.experimental
|
||||
|
|
|
@ -54,43 +54,24 @@ import org.apache.lucene.util.fst.FST; // javadocs
|
|||
* field along with per-term statistics (such as docfreq)
|
||||
* and pointers to the frequencies, positions and
|
||||
* skip data in the .frq and .prx files.
|
||||
* See {@link BlockTreeTermsWriter} for more details on the format.
|
||||
* </p>
|
||||
*
|
||||
* <p>The .tim is arranged in blocks: with blocks containing
|
||||
* a variable number of entries (by default 25-48), where
|
||||
* each entry is either a term or a reference to a
|
||||
* sub-block. It's written by {@link BlockTreeTermsWriter}
|
||||
* and read by {@link BlockTreeTermsReader}.</p>
|
||||
*
|
||||
* <p>NOTE: The term dictionary can plug into different postings implementations:
|
||||
* for example the postings writer/reader are actually responsible for encoding
|
||||
* and decoding the MetadataBlock.</p>
|
||||
*
|
||||
* the postings writer/reader are actually responsible for encoding
|
||||
* and decoding the Postings Metadata and Term Metadata sections described here:</p>
|
||||
* <ul>
|
||||
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
|
||||
* <li>TermsDict (.tim) --> Header, PostingsHeader, SkipInterval,
|
||||
* MaxSkipLevels, SkipMinimum, Block<sup>NumBlocks</sup>,
|
||||
* FieldSummary, DirOffset</li>
|
||||
* <li>Block --> SuffixBlock, StatsBlock, MetadataBlock</li>
|
||||
* <li>SuffixBlock --> EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
|
||||
* <li>StatsBlock --> StatsLength, <DocFreq, TotalTermFreq><sup>EntryCount</sup></li>
|
||||
* <li>MetadataBlock --> MetaLength, <FreqDelta, SkipDelta?, ProxDelta?><sup>EntryCount</sup></li>
|
||||
* <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
|
||||
* SumDocFreq, DocCount><sup>NumFields</sup></li>
|
||||
* <li>Header,PostingsHeader --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
|
||||
* <li>Postings Metadata --> Header, SkipInterval, MaxSkipLevels, SkipMinimum</li>
|
||||
* <li>Term Metadata --> FreqDelta, SkipDelta?, ProxDelta?
|
||||
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>SkipInterval,MaxSkipLevels,SkipMinimum --> {@link DataOutput#writeInt Uint32}</li>
|
||||
* <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,SkipDelta,NumFields,
|
||||
* FieldNumber,RootCodeLength,DocCount --> {@link DataOutput#writeVInt VInt}</li>
|
||||
* <li>TotalTermFreq,FreqDelta,ProxDelta,NumTerms,SumTotalTermFreq,SumDocFreq -->
|
||||
* {@link DataOutput#writeVLong VLong}</li>
|
||||
* <li>SkipDelta --> {@link DataOutput#writeVInt VInt}</li>
|
||||
* <li>FreqDelta,ProxDelta --> {@link DataOutput#writeVLong VLong}</li>
|
||||
* </ul>
|
||||
* <p>Notes:</p>
|
||||
* <ul>
|
||||
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
|
||||
* for the BlockTree implementation. On the other hand, PostingsHeader stores the version
|
||||
* information for the postings reader/writer.</li>
|
||||
* <li>DirOffset is a pointer to the FieldSummary section.</li>
|
||||
* for the postings.</li>
|
||||
* <li>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate
|
||||
* {@link DocsEnum#advance(int)}. Larger values result in smaller indexes, greater
|
||||
* acceleration, but fewer accelerable cases, while smaller values result in bigger indexes,
|
||||
|
@ -102,9 +83,6 @@ import org.apache.lucene.util.fst.FST; // javadocs
|
|||
* information about skip levels.</li>
|
||||
* <li>SkipMinimum is the minimum document frequency a term must have in order to write any
|
||||
* skip data at all.</li>
|
||||
* <li>DocFreq is the count of documents which contain the term.</li>
|
||||
* <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
|
||||
* as the difference between the total number of occurrences and the DocFreq.</li>
|
||||
* <li>FreqDelta determines the position of this term's TermFreqs within the .frq
|
||||
* file. In particular, it is the difference between the position of this term's
|
||||
* data in that file and the position of the previous term's data (or zero, for
|
||||
|
@ -118,44 +96,11 @@ import org.apache.lucene.util.fst.FST; // javadocs
|
|||
* file. In particular, it is the number of bytes after TermFreqs that the
|
||||
* SkipData starts. In other words, it is the length of the TermFreq data.
|
||||
* SkipDelta is only stored if DocFreq is not smaller than SkipMinimum.</li>
|
||||
* <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
|
||||
* <li>NumTerms is the number of unique terms for the field.</li>
|
||||
* <li>RootCode points to the root block for the field.</li>
|
||||
* <li>SumDocFreq is the total number of postings, the number of term-document pairs across
|
||||
* the entire field.</li>
|
||||
* <li>DocCount is the number of documents that have at least one posting for this field.</li>
|
||||
* </ul>
|
||||
* <a name="Termindex" id="Termindex"></a>
|
||||
* <h3>Term Index</h3>
|
||||
* <p>The .tip file contains an index into the term dictionary, so that it can be
|
||||
* accessed randomly. The index is also used to determine
|
||||
* when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
|
||||
* <ul>
|
||||
* <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>,
|
||||
* <IndexStartFP><sup>NumFields</sup>, DirOffset</li>
|
||||
* <li>Header --> {@link CodecUtil#writeHeader CodecHeader}</li>
|
||||
* <li>IndexStartFP --> {@link DataOutput#writeVLong VLong}</li>
|
||||
* <!-- TODO: better describe FST output here -->
|
||||
* <li>FSTIndex --> {@link FST FST<byte[]>}</li>
|
||||
* <li>DirOffset --> {@link DataOutput#writeLong Uint64}</li>
|
||||
* </ul>
|
||||
* <p>Notes:</p>
|
||||
* <ul>
|
||||
* <li>The .tip file contains a separate FST for each
|
||||
* field. The FST maps a term prefix to the on-disk
|
||||
* block that holds all terms starting with that
|
||||
* prefix. Each field's IndexStartFP points to its
|
||||
* FST.</li>
|
||||
* <li>DirOffset is a pointer to the start of the IndexStartFPs
|
||||
* for all fields</li>
|
||||
* <li>It's possible that an on-disk block would contain
|
||||
* too many terms (more than the allowed maximum
|
||||
* (default: 48)). When this happens, the block is
|
||||
* sub-divided into new blocks (called "floor
|
||||
* blocks"), and then the output in the FST for the
|
||||
* block's prefix encodes the leading byte of each
|
||||
* sub-block, and its file pointer.
|
||||
* </ul>
|
||||
* accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
|
||||
* <a name="Frequencies" id="Frequencies"></a>
|
||||
* <h3>Frequencies</h3>
|
||||
* <p>The .frq file contains the lists of documents which contain each term, along
|
||||
|
|
Loading…
Reference in New Issue