mirror of https://github.com/apache/lucene.git
Updated file format documentation to note skip data.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@150258 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
3f54fbaaea
commit
8dd2e3e7f1
|
@ -1332,9 +1332,18 @@ limitations under the License.
|
|||
|
||||
<p>
|
||||
TermInfoFile (.tis)-->
|
||||
TermCount, TermInfos
|
||||
TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
|
||||
</p>
|
||||
<p>TIVersion -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>TermCount -->
|
||||
UInt64
|
||||
</p>
|
||||
<p>IndexInterval -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>SkipInterval -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>TermInfos -->
|
||||
|
@ -1357,6 +1366,9 @@ limitations under the License.
|
|||
by the term's field name, and within that lexicographically by the
|
||||
term's text.
|
||||
</p>
|
||||
<p>TIVersion names the version of the format
|
||||
of this file and is -1 in Lucene 1.4.
|
||||
</p>
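<p>A minimal sketch of reading the four header fields listed above, in the order the grammar gives them. It assumes UInt32 and UInt64 are stored high-order byte first, which matches what <code>DataInputStream</code> reads. The class <code>TisHeaderSketch</code> and its field names are hypothetical and are not part of Lucene.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical holder for the .tis header fields; not a Lucene class. */
final class TisHeaderSketch {
    public final int version;        // TIVersion
    public final long termCount;     // TermCount
    public final int indexInterval;  // IndexInterval
    public final int skipInterval;   // SkipInterval

    TisHeaderSketch(int version, long termCount, int indexInterval, int skipInterval) {
        this.version = version;
        this.termCount = termCount;
        this.indexInterval = indexInterval;
        this.skipInterval = skipInterval;
    }

    /** Reads TIVersion, TermCount, IndexInterval, SkipInterval in that order. */
    public static TisHeaderSketch read(DataInputStream in) throws IOException {
        int version = in.readInt();          // UInt32, assumed big-endian
        if (version != -1) {
            // The text above says TIVersion is -1 in Lucene 1.4.
            throw new IOException("Unexpected TIVersion: " + version);
        }
        long termCount = in.readLong();      // UInt64, assumed big-endian
        int indexInterval = in.readInt();    // UInt32
        int skipInterval = in.readInt();     // UInt32
        return new TisHeaderSketch(version, termCount, indexInterval, skipInterval);
    }

    public static void main(String[] args) throws IOException {
        // Hand-built header: version -1, 35 terms, IndexInterval 128, SkipInterval 16.
        byte[] header = {
            (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF,
            0, 0, 0, 0, 0, 0, 0, 35,
            0, 0, 0, (byte) 128,
            0, 0, 0, 16
        };
        TisHeaderSketch h = read(new DataInputStream(new ByteArrayInputStream(header)));
        System.out.println(h.version + " " + h.termCount + " "
            + h.indexInterval + " " + h.skipInterval);
    }
}
```

<p>Rejecting any version other than -1 is a choice made for this sketch; a real reader might instead fall back to an older layout that begins directly with TermCount.</p>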
|
||||
<p>Term
|
||||
text prefixes are shared. The PrefixLength is the number of initial
|
||||
characters from the previous term which must be pre-pended to a
|
||||
|
@ -1389,7 +1401,7 @@ limitations under the License.
|
|||
</p>
|
||||
|
||||
<p>
|
||||
This contains every 128th entry from the .tis
|
||||
This contains every IndexInterval<sup>th</sup> entry from the .tis
|
||||
file, along with its location in the "tis" file. This is
|
||||
designed to be read entirely into memory and used to provide random
|
||||
access to the "tis" file.
|
||||
|
@ -1440,6 +1452,7 @@ limitations under the License.
|
|||
</p>
|
||||
<p>FreqFile (.frq) -->
|
||||
<TermFreqs><sup>TermCount</sup>
|
||||
<SkipDatum><sup>TermCount/SkipInterval</sup>
|
||||
</p>
|
||||
<p>TermFreqs -->
|
||||
<TermFreq><sup>DocFreq</sup>
|
||||
|
@ -1447,7 +1460,10 @@ limitations under the License.
|
|||
<p>TermFreq -->
|
||||
DocDelta, Freq?
|
||||
</p>
|
||||
<p>DocDelta,Freq -->
|
||||
<p>SkipDatum -->
|
||||
DocSkip,FreqSkip,ProxSkip
|
||||
</p>
|
||||
<p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip -->
|
||||
VInt
|
||||
</p>
|
||||
<p>TermFreqs
|
||||
|
@ -1470,6 +1486,29 @@ limitations under the License.
|
|||
</p>
|
||||
<p> 15,
|
||||
22, 3
|
||||
</p>
|
||||
<p>DocSkip records the document number before every
|
||||
SkipInterval<sup>th</sup> document in TermFreqs.
|
||||
Document numbers are represented as differences
|
||||
from the previous value in the sequence. FreqSkip
|
||||
and ProxSkip record the position of every
|
||||
SkipInterval<sup>th</sup> entry in FreqFile and
|
||||
ProxFile, respectively. File positions are
|
||||
relative to the start of TermFreqs and Positions,
|
||||
or, for later entries, to the previous SkipDatum in the sequence.
|
||||
</p>
|
||||
<p>For example, if TermCount=35 and SkipInterval=16,
|
||||
then there are two SkipData entries, containing
|
||||
the 15<sup>th</sup> and 31<sup>st</sup> document
|
||||
numbers in TermFreqs. The first FreqSkip names
|
||||
the number of bytes after the beginning of
|
||||
TermFreqs that the 16<sup>th</sup> SkipDatum
|
||||
starts, and the second the number of bytes after
|
||||
that that the 32<sup>nd</sup> starts. The first
|
||||
ProxSkip names the number of bytes after the
|
||||
beginning of Positions that the 16<sup>th</sup>
|
||||
SkipDatum starts, and the second the number of
|
||||
bytes after that that the 32<sup>nd</sup> starts.
|
||||
</p>
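<p>The delta encoding described above can be sketched as follows. This is an illustration, not Lucene's reader: <code>SkipDataSketch</code> and its method names are hypothetical, and the VInt layout (seven data bits per byte, low-order group first, high bit set on every byte except the last) is assumed from the VInt definition elsewhere in this format. Each SkipDatum is turned into running totals, so the resulting file positions are offsets from the start of TermFreqs and Positions.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for a run of SkipDatum entries; not a Lucene class. */
final class SkipDataSketch {

    /** Reads one VInt: 7 data bits per byte, low-order group first. */
    public static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) throw new EOFException();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            if (b < 0) throw new EOFException();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /**
     * Decodes total / skipInterval entries of DocSkip, FreqSkip, ProxSkip.
     * Each result is {docNumber, freqOffset, proxOffset} as running totals.
     */
    public static List<long[]> readSkipData(InputStream in, int total, int skipInterval)
            throws IOException {
        int count = total / skipInterval;   // e.g. 35 / 16 = 2 entries
        List<long[]> entries = new ArrayList<>();
        long doc = 0, freq = 0, prox = 0;
        for (int i = 0; i < count; i++) {
            doc += readVInt(in);    // DocSkip: difference from previous value
            freq += readVInt(in);   // FreqSkip: bytes past the previous position
            prox += readVInt(in);   // ProxSkip: bytes past the previous position
            entries.add(new long[] {doc, freq, prox});
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up entries: (15, 40, 90) then deltas (16, 200, 5).
        byte[] data = {0x0F, 0x28, 0x5A, 0x10, (byte) 0xC8, 0x01, 0x05};
        for (long[] e : readSkipData(new ByteArrayInputStream(data), 35, 16)) {
            System.out.println(e[0] + " " + e[1] + " " + e[2]);
        }
    }
}
```

<p>The byte values in <code>main</code> are invented for illustration. The <code>total</code> parameter stands for the count that the example above divides by SkipInterval.</p>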
|
||||
</blockquote>
|
||||
</td></tr>
|
||||
|
@ -1588,8 +1627,8 @@ limitations under the License.
|
|||
<p>This contains, for each document, a pointer to the document data in the Document
|
||||
(.tvd) file.
|
||||
</p>
|
||||
<p>DocumentIndex (.tvx) --> FormatVersion<DocumentPosition><sup>NumDocs</sup></p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p>
|
||||
<p>TVXVersion --> Int</p>
|
||||
<p>DocumentPosition --> UInt64</p>
|
||||
<p>This is used to find the position of the Document in the .tvd file.</p>
|
||||
</li>
|
||||
|
@ -1599,9 +1638,9 @@ limitations under the License.
|
|||
term vector info and finally a list of pointers to the field information in the .tvf
|
||||
(Term Vector Fields) file.</p>
|
||||
<p>
|
||||
Document (.tvd) --> FormatVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
|
||||
Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
|
||||
</p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>TVDVersion --> Int</p>
|
||||
<p>NumFields --> VInt</p>
|
||||
<p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
|
||||
<p>FieldNumDelta --> VInt</p>
|
||||
|
@ -1614,8 +1653,8 @@ limitations under the License.
|
|||
<p>The Field or .tvf file.</p>
|
||||
<p>This file contains, for each field that has a term vector stored, a list of
|
||||
the terms and their frequencies.</p>
|
||||
<p>Field (.tvf) --> FormatVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
|
||||
<p>TVFVersion --> Int</p>
|
||||
<p>NumTerms --> VInt</p>
|
||||
<p>NumDistinct --> VInt -- Future Use</p>
|
||||
<p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
|
||||
|
|
|
@ -167,7 +167,7 @@ patents</a>.</p>
|
|||
limited contract work.</p>
|
||||
|
||||
</li>
|
||||
<li><b>Otis Gospodnetić</b> (otis at apache.org)</li>
|
||||
<li><b>Otis Gospodneti?</b> (otis at apache.org)</li>
|
||||
<li><b>Brian Goetz</b> (briangoetz at apache.org)</li>
|
||||
<li><b>Scott Ganyo</b> (scottganyo at apache.org)</li>
|
||||
<li><b>Eugene Gluzberg</b> (drag0n at apache.org)</li>
|
||||
|
|
|
@ -905,9 +905,18 @@
|
|||
|
||||
<p>
|
||||
TermInfoFile (.tis)-->
|
||||
TermCount, TermInfos
|
||||
TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
|
||||
</p>
|
||||
<p>TIVersion -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>TermCount -->
|
||||
UInt64
|
||||
</p>
|
||||
<p>IndexInterval -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>SkipInterval -->
|
||||
UInt32
|
||||
</p>
|
||||
<p>TermInfos -->
|
||||
|
@ -930,6 +939,9 @@
|
|||
by the term's field name, and within that lexicographically by the
|
||||
term's text.
|
||||
</p>
|
||||
<p>TIVersion names the version of the format
|
||||
of this file and is -1 in Lucene 1.4.
|
||||
</p>
|
||||
<p>Term
|
||||
text prefixes are shared. The PrefixLength is the number of initial
|
||||
characters from the previous term which must be pre-pended to a
|
||||
|
@ -962,7 +974,7 @@
|
|||
</p>
|
||||
|
||||
<p>
|
||||
This contains every 128th entry from the .tis
|
||||
This contains every IndexInterval<sup>th</sup> entry from the .tis
|
||||
file, along with its location in the "tis" file. This is
|
||||
designed to be read entirely into memory and used to provide random
|
||||
access to the "tis" file.
|
||||
|
@ -1005,6 +1017,7 @@
|
|||
</p>
|
||||
<p>FreqFile (.frq) -->
|
||||
<TermFreqs><sup>TermCount</sup>
|
||||
<SkipDatum><sup>TermCount/SkipInterval</sup>
|
||||
</p>
|
||||
<p>TermFreqs -->
|
||||
<TermFreq><sup>DocFreq</sup>
|
||||
|
@ -1012,7 +1025,10 @@
|
|||
<p>TermFreq -->
|
||||
DocDelta, Freq?
|
||||
</p>
|
||||
<p>DocDelta,Freq -->
|
||||
<p>SkipDatum -->
|
||||
DocSkip,FreqSkip,ProxSkip
|
||||
</p>
|
||||
<p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip -->
|
||||
VInt
|
||||
</p>
|
||||
<p>TermFreqs
|
||||
|
@ -1036,6 +1052,30 @@
|
|||
<p> 15,
|
||||
22, 3
|
||||
</p>
|
||||
<p>DocSkip records the document number before every
|
||||
SkipInterval<sup>th</sup> document in TermFreqs.
|
||||
Document numbers are represented as differences
|
||||
from the previous value in the sequence. FreqSkip
|
||||
and ProxSkip record the position of every
|
||||
SkipInterval<sup>th</sup> entry in FreqFile and
|
||||
ProxFile, respectively. File positions are
|
||||
relative to the start of TermFreqs and Positions,
|
||||
or, for later entries, to the previous SkipDatum in the sequence.
|
||||
</p>
|
||||
<p>For example, if TermCount=35 and SkipInterval=16,
|
||||
then there are two SkipData entries, containing
|
||||
the 15<sup>th</sup> and 31<sup>st</sup> document
|
||||
numbers in TermFreqs. The first FreqSkip names
|
||||
the number of bytes after the beginning of
|
||||
TermFreqs that the 16<sup>th</sup> SkipDatum
|
||||
starts, and the second the number of bytes after
|
||||
that that the 32<sup>nd</sup> starts. The first
|
||||
ProxSkip names the number of bytes after the
|
||||
beginning of Positions that the 16<sup>th</sup>
|
||||
SkipDatum starts, and the second the number of
|
||||
bytes after that that the 32<sup>nd</sup> starts.
|
||||
</p>
|
||||
|
||||
</subsection>
|
||||
<subsection name="Positions">
|
||||
|
||||
|
@ -1127,8 +1167,8 @@
|
|||
<p>This contains, for each document, a pointer to the document data in the Document
|
||||
(.tvd) file.
|
||||
</p>
|
||||
<p>DocumentIndex (.tvx) --> FormatVersion<DocumentPosition><sup>NumDocs</sup></p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p>
|
||||
<p>TVXVersion --> Int</p>
|
||||
<p>DocumentPosition --> UInt64</p>
|
||||
<p>This is used to find the position of the Document in the .tvd file.</p>
|
||||
</li>
|
||||
|
@ -1138,9 +1178,9 @@
|
|||
term vector info and finally a list of pointers to the field information in the .tvf
|
||||
(Term Vector Fields) file.</p>
|
||||
<p>
|
||||
Document (.tvd) --> FormatVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
|
||||
Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions,><sup>NumDocs</sup>
|
||||
</p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>TVDVersion --> Int</p>
|
||||
<p>NumFields --> VInt</p>
|
||||
<p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
|
||||
<p>FieldNumDelta --> VInt</p>
|
||||
|
@ -1153,8 +1193,8 @@
|
|||
<p>The Field or .tvf file.</p>
|
||||
<p>This file contains, for each field that has a term vector stored, a list of
|
||||
the terms and their frequencies.</p>
|
||||
<p>Field (.tvf) --> FormatVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
|
||||
<p>FormatVersion --> Int</p>
|
||||
<p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
|
||||
<p>TVFVersion --> Int</p>
|
||||
<p>NumTerms --> VInt</p>
|
||||
<p>NumDistinct --> VInt -- Future Use</p>
|
||||
<p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
|
||||
|
|