mirror of https://github.com/apache/lucene.git
reorganize termvectors format description (javadocs). (#130)
This commit is contained in:
parent
891b192dcf
commit
6ebf959502
|
@ -56,15 +56,20 @@ import org.apache.lucene.util.packed.PackedInts;
|
||||||
* <li>VectorMeta (.tvm) --> <Header>, PackedIntsVersion, ChunkSize,
|
* <li>VectorMeta (.tvm) --> <Header>, PackedIntsVersion, ChunkSize,
|
||||||
* ChunkIndexMetadata, ChunkCount, DirtyChunkCount, DirtyDocsCount, Footer
|
* ChunkIndexMetadata, ChunkCount, DirtyChunkCount, DirtyDocsCount, Footer
|
||||||
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
|
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
|
||||||
* <li>PackedIntsVersion --> {@link PackedInts#VERSION_CURRENT} as a {@link
|
* <li>PackedIntsVersion, ChunkSize --> {@link DataOutput#writeVInt VInt}
|
||||||
* DataOutput#writeVInt VInt}
|
* <li>ChunkCount, DirtyChunkCount, DirtyDocsCount --> {@link DataOutput#writeVLong
|
||||||
* <li>ChunkSize is the number of bytes of terms to accumulate before flushing, as a {@link
|
* VLong}
|
||||||
* DataOutput#writeVInt VInt}
|
* <li>ChunkIndexMetadata --> {@link FieldsIndexWriter}
|
||||||
* <li>ChunkCount is not known in advance and is the number of chunks necessary to store all
|
|
||||||
* documents of the segment
|
|
||||||
* <li>DirtyChunkCount --> the number of prematurely flushed chunks in the .tvd file
|
|
||||||
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
||||||
* </ul>
|
* </ul>
|
||||||
|
* <p>Notes:
|
||||||
|
* <ul>
|
||||||
|
* <li>PackedIntsVersion is {@link PackedInts#VERSION_CURRENT}.
|
||||||
|
* <li>ChunkSize is the number of bytes of terms to accumulate before flushing.
|
||||||
|
* <li>ChunkCount is not known in advance and is the number of chunks necessary to store all
|
||||||
|
* documents of the segment.
|
||||||
|
* <li>DirtyChunkCount is the number of prematurely flushed chunks in the .tvd file.
|
||||||
|
* </ul>
|
||||||
* <li><a id="vector_data"></a>
|
* <li><a id="vector_data"></a>
|
||||||
* <p>A vector data file (extension <code>.tvd</code>). This file stores terms, frequencies,
|
* <p>A vector data file (extension <code>.tvd</code>). This file stores terms, frequencies,
|
||||||
* positions, offsets and payloads for every document. Upon writing a new segment, it
|
* positions, offsets and payloads for every document. Upon writing a new segment, it
|
||||||
|
@ -80,76 +85,78 @@ import org.apache.lucene.util.packed.PackedInts;
|
||||||
* FieldNumOffs >, < Flags >, < NumTerms >, < TermLengths >, <
|
* FieldNumOffs >, < Flags >, < NumTerms >, < TermLengths >, <
|
||||||
* TermFreqs >, < Positions >, < StartOffsets >, < Lengths >, <
|
* TermFreqs >, < Positions >, < StartOffsets >, < Lengths >, <
|
||||||
* PayloadLengths >, < TermAndPayloads >
|
* PayloadLengths >, < TermAndPayloads >
|
||||||
* <li>DocBase is the ID of the first doc of the chunk as a {@link DataOutput#writeVInt
|
|
||||||
* VInt}
|
|
||||||
* <li>ChunkDocs is the number of documents in the chunk
|
|
||||||
* <li>NumFields --> DocNumFields<sup>ChunkDocs</sup>
|
* <li>NumFields --> DocNumFields<sup>ChunkDocs</sup>
|
||||||
* <li>DocNumFields is the number of fields for each doc, written as a {@link
|
* <li>FieldNums --> FieldNumDelta<sup>TotalDistinctFields</sup>
|
||||||
* DataOutput#writeVInt VInt} if ChunkDocs==1 and as a {@link PackedInts} array
|
|
||||||
* otherwise
|
|
||||||
* <li>FieldNums --> FieldNumDelta<sup>TotalDistinctFields</sup>, a delta-encoded list of
|
|
||||||
* the sorted unique field numbers present in the chunk
|
|
||||||
* <li>FieldNumOffs --> FieldNumOff<sup>TotalFields</sup>, as a {@link PackedInts} array
|
|
||||||
* <li>FieldNumOff is the offset of the field number in FieldNums
|
|
||||||
* <li>TotalFields is the total number of fields (sum of the values of NumFields)
|
|
||||||
* <li>Flags --> Bit < FieldFlags >
|
* <li>Flags --> Bit < FieldFlags >
|
||||||
* <li>Bit is a single bit which when true means that fields have the same options for every
|
|
||||||
* document in the chunk
|
|
||||||
* <li>FieldFlags --> if Bit==1: Flag<sup>TotalDistinctFields</sup> else
|
* <li>FieldFlags --> if Bit==1: Flag<sup>TotalDistinctFields</sup> else
|
||||||
* Flag<sup>TotalFields</sup>
|
* Flag<sup>TotalFields</sup>
|
||||||
|
* <li>NumTerms --> FieldNumTerms<sup>TotalFields</sup>
|
||||||
|
* <li>TermLengths --> PrefixLength<sup>TotalTerms</sup>
|
||||||
|
* SuffixLength<sup>TotalTerms</sup>
|
||||||
|
* <li>TermFreqs --> TermFreqMinus1<sup>TotalTerms</sup>
|
||||||
|
* <li>Positions --> PositionDelta<sup>TotalPositions</sup>
|
||||||
|
* <li>StartOffsets --> (AvgCharsPerTerm<sup>TotalDistinctFields</sup>)
|
||||||
|
* StartOffsetDelta<sup>TotalOffsets</sup>
|
||||||
|
* <li>Lengths --> LengthMinusTermLength<sup>TotalOffsets</sup>
|
||||||
|
* <li>PayloadLengths --> PayloadLength<sup>TotalPayloads</sup>
|
||||||
|
* <li>TermAndPayloads --> LZ4-compressed representation of < FieldTermsAndPayLoads
|
||||||
|
* ><sup>TotalFields</sup>
|
||||||
|
* <li>FieldTermsAndPayLoads --> Terms (Payloads)
|
||||||
|
* <li>DocBase, ChunkDocs, DocNumFields (with ChunkDocs==1) --> {@link
|
||||||
|
* DataOutput#writeVInt VInt}
|
||||||
|
* <li>AvgCharsPerTerm --> {@link DataOutput#writeInt Int}
|
||||||
|
* <li>DocNumFields (with ChunkDocs>1), FieldNumOffs --> {@link PackedInts} array
|
||||||
|
* <li>FieldNumTerms, PrefixLength, SuffixLength, TermFreqMinus1, PositionDelta,
|
||||||
|
* StartOffsetDelta, LengthMinusTermLength, PayloadLength --> {@link
|
||||||
|
* BlockPackedWriter blocks of 64 packed ints}
|
||||||
|
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
||||||
|
* </ul>
|
||||||
|
* <p>Notes:
|
||||||
|
* <ul>
|
||||||
|
* <li>DocBase is the ID of the first doc of the chunk.
|
||||||
|
* <li>ChunkDocs is the number of documents in the chunk.
|
||||||
|
* <li>DocNumFields is the number of fields for each doc.
|
||||||
|
* <li>FieldNums is a delta-encoded list of the sorted unique field numbers present in the
|
||||||
|
* chunk.
|
||||||
|
* <li>FieldNumOffs is the array of FieldNumOff; array size is the total number of fields in
|
||||||
|
* the chunk.
|
||||||
|
* <li>FieldNumOff is the offset of the field number in FieldNums.
|
||||||
|
* <li>TotalFields is the total number of fields (sum of the values of NumFields).
|
||||||
|
* <li>Bit in Flags is a single bit which when true means that fields have the same options
|
||||||
|
* for every document in the chunk.
|
||||||
* <li>Flag: a 3-bits int where:
|
* <li>Flag: a 3-bits int where:
|
||||||
* <ul>
|
* <ul>
|
||||||
* <li>the first bit means that the field has positions
|
* <li>the first bit means that the field has positions
|
||||||
* <li>the second bit means that the field has offsets
|
* <li>the second bit means that the field has offsets
|
||||||
* <li>the third bit means that the field has payloads
|
* <li>the third bit means that the field has payloads
|
||||||
* </ul>
|
* </ul>
|
||||||
* <li>NumTerms --> FieldNumTerms<sup>TotalFields</sup>
|
* <li>FieldNumTerms is the number of terms for each field.
|
||||||
* <li>FieldNumTerms: the number of terms for each field, using {@link BlockPackedWriter
|
* <li>TotalTerms is the total number of terms (sum of NumTerms).
|
||||||
* blocks of 64 packed ints}
|
* <li>PrefixLength is 0 for the first term of a field, the common prefix with the previous
|
||||||
* <li>TermLengths --> PrefixLength<sup>TotalTerms</sup>
|
* term otherwise.
|
||||||
* SuffixLength<sup>TotalTerms</sup>
|
* <li>SuffixLength is the length of the term minus PrefixLength for every term.
|
||||||
* <li>TotalTerms: total number of terms (sum of NumTerms)
|
* <li>TermFreqMinus1 is (frequency - 1) for each term.
|
||||||
* <li>PrefixLength: 0 for the first term of a field, the common prefix with the previous
|
* <li>TotalPositions is the sum of frequencies of terms of all fields that have positions.
|
||||||
* term otherwise using {@link BlockPackedWriter blocks of 64 packed ints}
|
* <li>PositionDelta is the absolute position for the first position of a term, and the
|
||||||
* <li>SuffixLength: length of the term minus PrefixLength for every term using {@link
|
* difference with the previous positions for following positions.
|
||||||
* BlockPackedWriter blocks of 64 packed ints}
|
* <li>TotalOffsets is the sum of frequencies of terms of all fields that have offsets.
|
||||||
* <li>TermFreqs --> TermFreqMinus1<sup>TotalTerms</sup>
|
* <li>AvgCharsPerTerm is the average number of chars per term, encoded as a float on 4
|
||||||
* <li>TermFreqMinus1: (frequency - 1) for each term using {@link BlockPackedWriter blocks
|
* bytes. They are not present if no field has both positions and offsets enabled.
|
||||||
* of 64 packed ints}
|
* <li>StartOffsetDelta is the (startOffset - previousStartOffset - AvgCharsPerTerm *
|
||||||
* <li>Positions --> PositionDelta<sup>TotalPositions</sup>
|
|
||||||
* <li>TotalPositions is the sum of frequencies of terms of all fields that have positions
|
|
||||||
* <li>PositionDelta: the absolute position for the first position of a term, and the
|
|
||||||
* difference with the previous positions for following positions using {@link
|
|
||||||
* BlockPackedWriter blocks of 64 packed ints}
|
|
||||||
* <li>StartOffsets --> (AvgCharsPerTerm<sup>TotalDistinctFields</sup>)
|
|
||||||
* StartOffsetDelta<sup>TotalOffsets</sup>
|
|
||||||
* <li>TotalOffsets is the sum of frequencies of terms of all fields that have offsets
|
|
||||||
* <li>AvgCharsPerTerm: average number of chars per term, encoded as a float on 4 bytes.
|
|
||||||
* They are not present if no field has both positions and offsets enabled.
|
|
||||||
* <li>StartOffsetDelta: (startOffset - previousStartOffset - AvgCharsPerTerm *
|
|
||||||
* PositionDelta). previousStartOffset is 0 for the first offset and AvgCharsPerTerm is
|
* PositionDelta). previousStartOffset is 0 for the first offset and AvgCharsPerTerm is
|
||||||
* 0 if the field has no positions using {@link BlockPackedWriter blocks of 64 packed
|
* 0 if the field has no positions.
|
||||||
* ints}
|
* <li>LengthMinusTermLength is (endOffset - startOffset - termLength).
|
||||||
* <li>Lengths --> LengthMinusTermLength<sup>TotalOffsets</sup>
|
* <li>TotalPayloads is the sum of frequencies of terms of all fields that have payloads.
|
||||||
* <li>LengthMinusTermLength: (endOffset - startOffset - termLength) using {@link
|
* <li>PayloadLength is the length of the payload.
|
||||||
* BlockPackedWriter blocks of 64 packed ints}
|
* <li>Terms is term bytes.
|
||||||
* <li>PayloadLengths --> PayloadLength<sup>TotalPayloads</sup>
|
* <li>Payloads is payload bytes (if the field has payloads).
|
||||||
* <li>TotalPayloads is the sum of frequencies of terms of all fields that have payloads
|
|
||||||
* <li>PayloadLength is the payload length encoded using {@link BlockPackedWriter blocks of
|
|
||||||
* 64 packed ints}
|
|
||||||
* <li>TermAndPayloads --> LZ4-compressed representation of < FieldTermsAndPayLoads
|
|
||||||
* ><sup>TotalFields</sup>
|
|
||||||
* <li>FieldTermsAndPayLoads --> Terms (Payloads)
|
|
||||||
* <li>Terms: term bytes
|
|
||||||
* <li>Payloads: payload bytes (if the field has payloads)
|
|
||||||
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
|
||||||
* </ul>
|
* </ul>
|
||||||
* <li><a id="vector_index"></a>
|
* <li><a id="vector_index"></a>
|
||||||
* <p>An index file (extension <code>.tvx</code>).
|
* <p>An index file (extension <code>.tvx</code>).
|
||||||
* <ul>
|
* <ul>
|
||||||
* <li>VectorIndex (.tvx) --> <Header>, <ChunkIndex>, Footer
|
* <li>VectorIndex (.tvx) --> <Header>, <ChunkIndex>, Footer
|
||||||
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
|
* <li>Header --> {@link CodecUtil#writeIndexHeader IndexHeader}
|
||||||
* <li>ChunkIndex: See {@link FieldsIndexWriter}
|
* <li>ChunkIndex --> {@link FieldsIndexWriter}
|
||||||
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
* <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
|
||||||
* </ul>
|
* </ul>
|
||||||
* </ol>
|
* </ol>
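The PrefixLength and SuffixLength notes above can be illustrated with a small sketch. It assumes the terms of one field arrive sorted and are compared as UTF-8 bytes; the class and method names (`TermLengthsSketch`, `commonPrefix`, `termLengths`) are hypothetical and are not part of Lucene. It only computes the two length arrays and does not pack them into blocks of 64 ints.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Lucene code.
class TermLengthsSketch {
  /** Length of the common byte prefix of a and b. */
  static int commonPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /** Returns {prefixLengths, suffixLengths} for the sorted terms of one field. */
  static int[][] termLengths(String[] sortedTerms) {
    int[] prefix = new int[sortedTerms.length];
    int[] suffix = new int[sortedTerms.length];
    byte[] previous = new byte[0];
    for (int i = 0; i < sortedTerms.length; i++) {
      byte[] term = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
      // PrefixLength is 0 for the first term of a field,
      // the common prefix with the previous term otherwise.
      prefix[i] = (i == 0) ? 0 : commonPrefix(previous, term);
      // SuffixLength is the length of the term minus PrefixLength.
      suffix[i] = term.length - prefix[i];
      previous = term;
    }
    return new int[][] {prefix, suffix};
  }

  public static void main(String[] args) {
    int[][] lengths =
        termLengths(new String[] {"apple", "application", "apply", "banana"});
    System.out.println(Arrays.toString(lengths[0])); // [0, 4, 4, 0]
    System.out.println(Arrays.toString(lengths[1])); // [5, 7, 1, 6]
  }
}
```

Only the suffix bytes of each term would then need to be stored, which is why both lengths are recorded per term.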
|
||||||
|
|
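As a rough worked example of the PositionDelta, StartOffsetDelta and LengthMinusTermLength definitions in the notes above, the following sketch computes them for the occurrences of a single term. The names (`OffsetDeltaSketch` and its methods) are hypothetical, and truncating the product AvgCharsPerTerm * PositionDelta to an int is an assumption of this sketch rather than something the javadoc states.

```java
import java.util.Arrays;

// Hypothetical illustration, not Lucene code.
class OffsetDeltaSketch {
  /** Absolute position for the first occurrence, difference with the previous one afterwards. */
  static int[] positionDeltas(int[] positions) {
    int[] deltas = new int[positions.length];
    int previous = 0;
    for (int i = 0; i < positions.length; i++) {
      deltas[i] = positions[i] - previous;
      previous = positions[i];
    }
    return deltas;
  }

  /** startOffset - previousStartOffset - AvgCharsPerTerm * PositionDelta, per occurrence. */
  static int[] startOffsetDeltas(int[] positions, int[] startOffsets, float avgCharsPerTerm) {
    int[] deltas = new int[startOffsets.length];
    int previousPosition = 0;
    int previousStartOffset = 0; // 0 for the first offset
    for (int i = 0; i < startOffsets.length; i++) {
      int positionDelta = positions[i] - previousPosition;
      // Assumption: the expected advance is truncated to an int.
      int expected = (int) (avgCharsPerTerm * positionDelta);
      deltas[i] = startOffsets[i] - previousStartOffset - expected;
      previousPosition = positions[i];
      previousStartOffset = startOffsets[i];
    }
    return deltas;
  }

  /** endOffset - startOffset - termLength. */
  static int lengthMinusTermLength(int startOffset, int endOffset, int termLength) {
    return endOffset - startOffset - termLength;
  }

  public static void main(String[] args) {
    int[] positions = {0, 3, 7};
    int[] startOffsets = {0, 18, 41};
    System.out.println(Arrays.toString(positionDeltas(positions)));
    System.out.println(Arrays.toString(startOffsetDeltas(positions, startOffsets, 6.0f)));
    System.out.println(lengthMinusTermLength(18, 23, 5));
  }
}
```

When offsets grow roughly in proportion to positions, the residuals stay close to zero, so they should pack into few bits per value.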