LUCENE-4446: factor blocktree's documentation out of Lucene40/Block, minor cleanups to Block's docs

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1396060 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Robert Muir 2012-10-09 15:12:30 +00:00
parent 2830b50786
commit 41f3f73539
3 changed files with 144 additions and 111 deletions

View File

@ -38,8 +38,8 @@ import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.packed.PackedInts;
/**
* Block postings format, which encodes postings in packed int blocks
* for faster decode.
* Block postings format, which encodes postings in packed integer blocks
* for fast decode.
*
* <p><b>NOTE</b>: this format is still experimental and
* subject to change without backwards compatibility.
@ -48,21 +48,22 @@ import org.apache.lucene.util.packed.PackedInts;
* Basic idea:
* <ul>
* <li>
* <b>Packed Block and VInt Block</b>:
* <p>In packed block, integers are encoded with the same bit width ({@link PackedInts packed format}),
* the block size (i.e. number of integers inside block) is fixed. </p>
* <p>In VInt block, integers are encoded as {@link DataOutput#writeVInt VInt},
* <b>Packed Blocks and VInt Blocks</b>:
* <p>In packed blocks, integers are encoded with the same bit width ({@link PackedInts packed format}):
* the block size (i.e. number of integers inside block) is fixed (currently 128). Additionally blocks
* that are all the same value are encoded in an optimized way.</p>
* <p>In VInt blocks, integers are encoded as {@link DataOutput#writeVInt VInt}:
* the block size is variable.</p>
* </li>
*
* <li>
* <b>Block structure</b>:
* <p>When the postings is long enough, BlockPostingsFormat will try to encode most integer data
* as packed block.</p>
* <p>Take a term with 259 documents as example, the first 256 document ids are encoded as two packed
* blocks, while the remaining 3 as one VInt block. </p>
* <p>When the postings are long enough, BlockPostingsFormat will try to encode most integer data
* as a packed block.</p>
* <p>Take a term with 259 documents as an example, the first 256 document ids are encoded as two packed
* blocks, while the remaining 3 are encoded as one VInt block. </p>
* <p>Different kinds of data are always encoded separately into different packed blocks, but may
* possible be encoded into a same VInt block. </p>
* possibly be interleaved into the same VInt block. </p>
* <p>This strategy is applied to pairs:
* &lt;document number, frequency&gt;,
* &lt;position, payload length&gt;,
@ -71,25 +72,25 @@ import org.apache.lucene.util.packed.PackedInts;
* </li>
*
* <li>
* <b>Skipper setting</b>:
* <p>The structure of skip table is quite similar to Lucene40PostingsFormat. Skip interval is the
* <b>Skipdata settings</b>:
* <p>The structure of skip table is quite similar to previous version of Lucene. Skip interval is the
* same as block size, and each skip entry points to the beginning of each block. However, for
* the first block, skip data is omitted.</p>
* </li>
*
* <li>
* <b>Positions, Payloads, and Offsets</b>:
* <p>A position is an integer indicating where the term occurs at within one document.
* <p>A position is an integer indicating where the term occurs within one document.
* A payload is a blob of metadata associated with current position.
* An offset is a pair of integers indicating the tokenized start/end offsets for given term
* in current position. </p>
* in current position: it is essentially a specialized payload. </p>
* <p>When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a
* null payload contributes one count). As mentioned in block structure, it is possible to encode
* these three either combined or separately.
* <p>For all the cases, payloads and offsets are stored together. When encoded as packed block,
* <p>In all cases, payloads and offsets are stored together. When encoded as a packed block,
* position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload
* metadata will also be stored directly in .pay). When encoded as VInt block, all these three are
* stored in .pos (so as payload metadata).</p>
* metadata will also be stored directly in .pay). When encoded as VInt blocks, all these three are
* stored interleaved into the .pos (so is payload metadata).</p>
* <p>With this strategy, the majority of payload and offset data will be outside .pos file.
* So for queries that require only position data, running on a full index with payloads and offsets,
* this reduces disk pre-fetches.</p>
@ -113,35 +114,29 @@ import org.apache.lucene.util.packed.PackedInts;
* <dd>
* <b>Term Dictionary</b>
*
* <p>The .tim file format is quite similar to Lucene40PostingsFormat,
* with minor difference in MetadataBlock</p>
* <p>The .tim file contains the list of terms in each
* field along with per-term statistics (such as docfreq)
* and pointers to the frequencies, positions, payload and
* skip data in the .doc, .pos, and .pay files.
* See {@link BlockTreeTermsWriter} for more details on the format.
* </p>
*
* <p>NOTE: The term dictionary can plug into different postings implementations:
* the postings writer/reader are actually responsible for encoding
* and decoding the Postings Metadata and Term Metadata sections described here:</p>
*
* <ul>
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
* <li>TermDictionary(.tim) --&gt; Header, PostingsHeader, PackedBlockSize,
* &lt;Block&gt;<sup>NumBlocks</sup>, FieldSummary, DirOffset</li>
* <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
* <li>SuffixBlock --&gt; EntryCount, SuffixLength, {@link DataOutput#writeByte byte}<sup>SuffixLength</sup></li>
* <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
* <li>MetadataBlock --&gt; MetaLength, &lt;DocFPDelta,
* &lt;PosFPDelta, PosVIntBlockFPDelta?, PayFPDelta?&gt;?,
* SkipFPDelta?&gt;<sup>EntryCount</sup></li>
* <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength,
* {@link DataOutput#writeByte byte}<sup>RootCodeLength</sup>, SumDocFreq, DocCount&gt;
* <sup>NumFields</sup></li>
* <li>Header, PostingsHeader --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
* <li>PackedBlockSize, EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength,
* PosVIntBlockFPDelta, SkipFPDelta, NumFields, FieldNumber, RootCodeLength, DocCount --&gt;
* {@link DataOutput#writeVInt VInt}</li>
* <li>TotalTermFreq, DocFPDelta, PosFPDelta, PayFPDelta, NumTerms, SumTotalTermFreq, SumDocFreq --&gt;
* {@link DataOutput#writeVLong VLong}</li>
* <li>Postings Metadata --&gt; Header, PackedBlockSize</li>
* <li>Term Metadata --&gt; DocFPDelta, PosFPDelta?, PosVIntBlockFPDelta?, PayFPDelta?,
* SkipFPDelta?</li>
* <li>Header, --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>PackedBlockSize, PosVIntBlockFPDelta, SkipFPDelta --&gt; {@link DataOutput#writeVInt VInt}</li>
* <li>DocFPDelta, PosFPDelta, PayFPDelta --&gt; {@link DataOutput#writeVLong VLong}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>Here explains MetadataBlock only, other fields are mentioned in
* <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary">Lucene40PostingsFormat:TermDictionary</a>
* </li>
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
* for the postings.</li>
* <li>PackedBlockSize is the fixed block size for packed blocks. In packed block, bit width is
* determined by the largest integer. Smaller block size result in smaller variance among width
* of integers hence smaller indexes. Larger block size result in more efficient bulk i/o hence
@ -175,8 +170,8 @@ import org.apache.lucene.util.packed.PackedInts;
* <dl>
* <dd>
* <b>Term Index</b>
* <p>The .tim file format is mentioned in
* <a href="{@docRoot}/../core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termindex">Lucene40PostingsFormat:TermIndex</a>
* <p>The .tip file contains an index into the term dictionary, so that it can be
* accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
* </dd>
* </dl>
*
@ -298,7 +293,7 @@ import org.apache.lucene.util.packed.PackedInts;
* <dd>
* <b>Payloads and Offsets</b>
* <p>The .pay file will store payloads and offsets associated with certain term-document positions.
* Some payloads and offsets will be separated out into .pos file, for speedup reason.</p>
* Some payloads and offsets will be separated out into .pos file, for performance reasons.</p>
* <ul>
* <li>PayFile(.pay): --&gt; Header, &lt;TermPayloads, TermOffsets?&gt; <sup>TermCount</sup></li>
* <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
@ -319,7 +314,7 @@ import org.apache.lucene.util.packed.PackedInts;
* for PackedOffsetBlockNum.</li>
* <li>SumPayLength is the total length of payloads written within one block, should be the sum
* of PayLengths in one packed block.</li>
* <li>PayLength in PackedPayLengthBlock is the length of each payload, associated with current
* <li>PayLength in PackedPayLengthBlock is the length of each payload associated with the current
* position.</li>
* </ul>
* </dd>

View File

@ -23,10 +23,12 @@ import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMOutputStream;
import org.apache.lucene.util.ArrayUtil;
@ -76,6 +78,97 @@ import org.apache.lucene.util.fst.Util;
* Writes terms dict and index, block-encoding (column
* stride) each term's metadata for each set of terms
* between two index terms.
* <p>
* Files:
* <ul>
* <li><tt>.tim</tt>: <a href="#Termdictionary">Term Dictionary</a></li>
* <li><tt>.tip</tt>: <a href="#Termindex">Term Index</a></li>
* </ul>
* <p>
* <a name="Termdictionary" id="Termdictionary"></a>
* <h3>Term Dictionary</h3>
*
* <p>The .tim file contains the list of terms in each
* field along with per-term statistics (such as docfreq)
* and per-term metadata (typically pointers to the postings list
* for that term in the inverted index).
* </p>
*
* <p>The .tim is arranged in blocks: with blocks containing
* a variable number of entries (by default 25-48), where
* each entry is either a term or a reference to a
* sub-block.</p>
*
* <p>NOTE: The term dictionary can plug into different postings implementations:
* for example the postings writer/reader are actually responsible for encoding
* and decoding the MetadataBlock.</p>
*
* <ul>
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
* <li>TermsDict (.tim) --&gt; Header, <i>Postings Metadata</i>, Block<sup>NumBlocks</sup>,
* FieldSummary, DirOffset</li>
* <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
* <li>SuffixBlock --&gt; EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
* <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
* <li>MetadataBlock --&gt; MetaLength, &lt;<i>Term Metadata</i>&gt;<sup>EntryCount</sup></li>
* <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
* SumDocFreq, DocCount&gt;<sup>NumFields</sup></li>
* <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
* <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
* FieldNumber,RootCodeLength,DocCount --&gt; {@link DataOutput#writeVInt VInt}</li>
* <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq --&gt;
* {@link DataOutput#writeVLong VLong}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
* for the BlockTree implementation.</li>
* <li>DirOffset is a pointer to the FieldSummary section.</li>
* <li>DocFreq is the count of documents which contain the term.</li>
* <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
* as the difference between the total number of occurrences and the DocFreq.</li>
* <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
* <li>NumTerms is the number of unique terms for the field.</li>
* <li>RootCode points to the root block for the field.</li>
* <li>SumDocFreq is the total number of postings, the number of term-document pairs across
* the entire field.</li>
* <li>DocCount is the number of documents that have at least one posting for this field.</li>
* <li>PostingsMetadata and TermMetadata are plugged into by the specific postings implementation:
* these contain arbitrary per-file data (such as parameters or versioning information)
* and per-term data (such as pointers to inverted files).
* </ul>
* <a name="Termindex" id="Termindex"></a>
* <h3>Term Index</h3>
* <p>The .tip file contains an index into the term dictionary, so that it can be
* accessed randomly. The index is also used to determine
* when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
* <ul>
* <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>
* &lt;IndexStartFP&gt;<sup>NumFields</sup>, DirOffset</li>
* <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
* <li>IndexStartFP --&gt; {@link DataOutput#writeVLong VLong}</li>
* <!-- TODO: better describe FST output here -->
* <li>FSTIndex --&gt; {@link FST FST&lt;byte[]&gt;}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>The .tip file contains a separate FST for each
* field. The FST maps a term prefix to the on-disk
* block that holds all terms starting with that
* prefix. Each field's IndexStartFP points to its
* FST.</li>
* <li>DirOffset is a pointer to the start of the IndexStartFPs
* for all fields</li>
* <li>It's possible that an on-disk block would contain
* too many terms (more than the allowed maximum
* (default: 48)). When this happens, the block is
* sub-divided into new blocks (called "floor
* blocks"), and then the output in the FST for the
* block's prefix encodes the leading byte of each
* sub-block, and its file pointer.
* </ul>
*
* @see BlockTreeTermsReader
* @lucene.experimental

View File

@ -54,43 +54,24 @@ import org.apache.lucene.util.fst.FST; // javadocs
* field along with per-term statistics (such as docfreq)
* and pointers to the frequencies, positions and
* skip data in the .frq and .prx files.
* See {@link BlockTreeTermsWriter} for more details on the format.
* </p>
*
* <p>The .tim is arranged in blocks: with blocks containing
* a variable number of entries (by default 25-48), where
* each entry is either a term or a reference to a
* sub-block. It's written by {@link BlockTreeTermsWriter}
* and read by {@link BlockTreeTermsReader}.</p>
*
* <p>NOTE: The term dictionary can plug into different postings implementations:
* for example the postings writer/reader are actually responsible for encoding
* and decoding the MetadataBlock.</p>
*
* the postings writer/reader are actually responsible for encoding
* and decoding the Postings Metadata and Term Metadata sections described here:</p>
* <ul>
* <!-- TODO: expand on this, its not really correct and doesnt explain sub-blocks etc -->
* <li>TermsDict (.tim) --&gt; Header, PostingsHeader, SkipInterval,
* MaxSkipLevels, SkipMinimum, Block<sup>NumBlocks</sup>,
* FieldSummary, DirOffset</li>
* <li>Block --&gt; SuffixBlock, StatsBlock, MetadataBlock</li>
* <li>SuffixBlock --&gt; EntryCount, SuffixLength, Byte<sup>SuffixLength</sup></li>
* <li>StatsBlock --&gt; StatsLength, &lt;DocFreq, TotalTermFreq&gt;<sup>EntryCount</sup></li>
* <li>MetadataBlock --&gt; MetaLength, &lt;FreqDelta, SkipDelta?, ProxDelta?&gt;<sup>EntryCount</sup></li>
* <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
* SumDocFreq, DocCount&gt;<sup>NumFields</sup></li>
* <li>Header,PostingsHeader --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
* <li>Postings Metadata --&gt; Header, SkipInterval, MaxSkipLevels, SkipMinimum</li>
* <li>Term Metadata --&gt; FreqDelta, SkipDelta?, ProxDelta?
* <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>SkipInterval,MaxSkipLevels,SkipMinimum --&gt; {@link DataOutput#writeInt Uint32}</li>
* <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,SkipDelta,NumFields,
* FieldNumber,RootCodeLength,DocCount --&gt; {@link DataOutput#writeVInt VInt}</li>
* <li>TotalTermFreq,FreqDelta,ProxDelta,NumTerms,SumTotalTermFreq,SumDocFreq --&gt;
* {@link DataOutput#writeVLong VLong}</li>
* <li>SkipDelta --&gt; {@link DataOutput#writeVInt VInt}</li>
* <li>FreqDelta,ProxDelta --&gt; {@link DataOutput#writeVLong VLong}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information
* for the BlockTree implementation. On the other hand, PostingsHeader stores the version
* information for the postings reader/writer.</li>
* <li>DirOffset is a pointer to the FieldSummary section.</li>
* for the postings.</li>
* <li>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate
* {@link DocsEnum#advance(int)}. Larger values result in smaller indexes, greater
* acceleration, but fewer accelerable cases, while smaller values result in bigger indexes,
@ -102,9 +83,6 @@ import org.apache.lucene.util.fst.FST; // javadocs
* information about skip levels.</li>
* <li>SkipMinimum is the minimum document frequency a term must have in order to write any
* skip data at all.</li>
* <li>DocFreq is the count of documents which contain the term.</li>
* <li>TotalTermFreq is the total number of occurrences of the term. This is encoded
* as the difference between the total number of occurrences and the DocFreq.</li>
* <li>FreqDelta determines the position of this term's TermFreqs within the .frq
* file. In particular, it is the difference between the position of this term's
* data in that file and the position of the previous term's data (or zero, for
@ -118,44 +96,11 @@ import org.apache.lucene.util.fst.FST; // javadocs
* file. In particular, it is the number of bytes after TermFreqs that the
* SkipData starts. In other words, it is the length of the TermFreq data.
* SkipDelta is only stored if DocFreq is not smaller than SkipMinimum.</li>
* <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)</li>
* <li>NumTerms is the number of unique terms for the field.</li>
* <li>RootCode points to the root block for the field.</li>
* <li>SumDocFreq is the total number of postings, the number of term-document pairs across
* the entire field.</li>
* <li>DocCount is the number of documents that have at least one posting for this field.</li>
* </ul>
* <a name="Termindex" id="Termindex"></a>
* <h3>Term Index</h3>
* <p>The .tip file contains an index into the term dictionary, so that it can be
* accessed randomly. The index is also used to determine
* when a given term cannot exist on disk (in the .tim file), saving a disk seek.</p>
* <ul>
* <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>,
* &lt;IndexStartFP&gt;<sup>NumFields</sup>, DirOffset</li>
* <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}</li>
* <li>IndexStartFP --&gt; {@link DataOutput#writeVLong VLong}</li>
* <!-- TODO: better describe FST output here -->
* <li>FSTIndex --&gt; {@link FST FST&lt;byte[]&gt;}</li>
* <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}</li>
* </ul>
* <p>Notes:</p>
* <ul>
* <li>The .tip file contains a separate FST for each
* field. The FST maps a term prefix to the on-disk
* block that holds all terms starting with that
* prefix. Each field's IndexStartFP points to its
* FST.</li>
* <li>DirOffset is a pointer to the start of the IndexStartFPs
* for all fields</li>
* <li>It's possible that an on-disk block would contain
* too many terms (more than the allowed maximum
* (default: 48)). When this happens, the block is
* sub-divided into new blocks (called "floor
* blocks"), and then the output in the FST for the
* block's prefix encodes the leading byte of each
* sub-block, and its file pointer.
* </ul>
* accessed randomly. See {@link BlockTreeTermsWriter} for more details on the format.</p>
* <a name="Frequencies" id="Frequencies"></a>
* <h3>Frequencies</h3>
* <p>The .frq file contains the lists of documents which contain each term, along