mirror of https://github.com/apache/lucene.git
add missing details about file format changes by version
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@922013 13f79535-47bb-0310-9956-ffa450edef68
parent b51587f72d
commit f44f23026f
@@ -419,10 +419,32 @@ document.write("Last Published: " + document.lastModified);
 compatible (in the same way as the lock-less commits
 change in 2.1).
 </p>
+<p>
+In version 2.4, Strings are now written as true UTF-8
+byte sequences, not Java's modified UTF-8. See issue
+LUCENE-510 for details.
+</p>
+<p>
+In version 2.9, an optional opaque Map&lt;String,String&gt;
+CommitUserData may be passed to IndexWriter's commit
+methods (and later retrieved), which is recorded in
+the segments_N file. See issue LUCENE-1382 for
+details. Also, diagnostics were added to each segment
+written recording details about why it was written
+(due to flush, merge; which OS/JRE was used; etc.).
+See issue LUCENE-1654 for details.
+</p>
+<p>
+In version 3.0, compressed fields are no longer
+written to the index (they can still be read, but on
+merge the new segment will write them,
+uncompressed). See issue LUCENE-1960 for details.
+
+</p>
 </div>


-<a name="N1002B"></a><a name="Definitions"></a>
+<a name="N10034"></a><a name="Definitions"></a>
 <h2 class="boxed">Definitions</h2>
 <div class="section">
 <p>
@@ -463,7 +485,7 @@ document.write("Last Published: " + document.lastModified);
 strings, the first naming the field, and the second naming text
 within the field.
 </p>
-<a name="N1004B"></a><a name="Inverted Indexing"></a>
+<a name="N10054"></a><a name="Inverted Indexing"></a>
 <h3 class="boxed">Inverted Indexing</h3>
 <p>
 The index stores statistics about terms in order
@@ -473,7 +495,7 @@ document.write("Last Published: " + document.lastModified);
 it. This is the inverse of the natural relationship, in which
 documents list terms.
 </p>
-<a name="N10057"></a><a name="Types of Fields"></a>
+<a name="N10060"></a><a name="Types of Fields"></a>
 <h3 class="boxed">Types of Fields</h3>
 <p>
 In Lucene, fields may be <i>stored</i>, in which
@@ -487,7 +509,7 @@ document.write("Last Published: " + document.lastModified);
 to be indexed literally.
 </p>
 <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on Fields.</p>
-<a name="N10074"></a><a name="Segments"></a>
+<a name="N1007D"></a><a name="Segments"></a>
 <h3 class="boxed">Segments</h3>
 <p>
 Lucene indexes may be composed of multiple sub-indexes, or
@@ -513,7 +535,7 @@ document.write("Last Published: " + document.lastModified);
 Searches may involve multiple segments and/or multiple indexes, each
 index potentially composed of a set of segments.
 </p>
-<a name="N10092"></a><a name="Document Numbers"></a>
+<a name="N1009B"></a><a name="Document Numbers"></a>
 <h3 class="boxed">Document Numbers</h3>
 <p>
 Internally, Lucene refers to documents by an integer <i>document
@@ -568,7 +590,7 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N100B9"></a><a name="Overview"></a>
+<a name="N100C2"></a><a name="Overview"></a>
 <h2 class="boxed">Overview</h2>
 <div class="section">
 <p>
@@ -667,7 +689,7 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N100FC"></a><a name="File Naming"></a>
+<a name="N10105"></a><a name="File Naming"></a>
 <h2 class="boxed">File Naming</h2>
 <div class="section">
 <p>
@@ -694,7 +716,7 @@ document.write("Last Published: " + document.lastModified);
 </p>
 </div>

-<a name="N1010B"></a><a name="file-names"></a>
+<a name="N10114"></a><a name="file-names"></a>
 <h2 class="boxed">Summary of File Extensions</h2>
 <div class="section">
 <p>The following table summarizes the names and extensions of the files in Lucene:
@@ -836,10 +858,10 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N101F5"></a><a name="Primitive Types"></a>
+<a name="N101FE"></a><a name="Primitive Types"></a>
 <h2 class="boxed">Primitive Types</h2>
 <div class="section">
-<a name="N101FA"></a><a name="Byte"></a>
+<a name="N10203"></a><a name="Byte"></a>
 <h3 class="boxed">Byte</h3>
 <p>
 The most primitive type
@@ -847,7 +869,7 @@ document.write("Last Published: " + document.lastModified);
 other data types are defined as sequences
 of bytes, so file formats are byte-order independent.
 </p>
-<a name="N10203"></a><a name="UInt32"></a>
+<a name="N1020C"></a><a name="UInt32"></a>
 <h3 class="boxed">UInt32</h3>
 <p>
 32-bit unsigned integers are written as four
@@ -857,7 +879,7 @@ document.write("Last Published: " + document.lastModified);
 UInt32 --&gt; &lt;Byte&gt;<sup>4</sup>

 </p>
-<a name="N10212"></a><a name="Uint64"></a>
+<a name="N1021B"></a><a name="Uint64"></a>
 <h3 class="boxed">Uint64</h3>
 <p>
 64-bit unsigned integers are written as eight
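The UInt32 rule in the hunk above says that a value occupies exactly four bytes. The following Java sketch shows one way to read that rule. It assumes the high-order-byte-first order used by Java's `DataOutput.writeInt`, which is not visible in this hunk, and the class and method names (`UInt32Codec`, `encode`, `decode`) are hypothetical, not Lucene API.

```java
class UInt32Codec {
    /** Encodes a value in [0, 2^32) as four bytes, high-order byte first (assumed order). */
    public static byte[] encode(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a 32-bit unsigned value: " + value);
        }
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    /** Decodes four bytes written by encode back into a non-negative long. */
    public static long decode(byte[] b) {
        if (b.length != 4) {
            throw new IllegalArgumentException("expected exactly 4 bytes");
        }
        // Mask each byte to undo Java's sign extension before shifting.
        return ((b[0] & 0xFFL) << 24)
             | ((b[1] & 0xFFL) << 16)
             | ((b[2] & 0xFFL) << 8)
             |  (b[3] & 0xFFL);
    }

    public static void main(String[] args) {
        long value = 4000000000L; // larger than Integer.MAX_VALUE, still a valid UInt32
        System.out.println(decode(encode(value)) == value); // prints "true"
    }
}
```

A `long` is used for the decoded value because Java's `int` is signed and cannot hold the upper half of the unsigned 32-bit range.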
@@ -866,7 +888,7 @@ document.write("Last Published: " + document.lastModified);
 <p>UInt64 --&gt; &lt;Byte&gt;<sup>8</sup>

 </p>
-<a name="N10221"></a><a name="VInt"></a>
+<a name="N1022A"></a><a name="VInt"></a>
 <h3 class="boxed">VInt</h3>
 <p>
 A variable-length format for positive integers is
@@ -1416,13 +1438,13 @@ document.write("Last Published: " + document.lastModified);
 This provides compression while still being
 efficient to decode.
 </p>
-<a name="N10506"></a><a name="Chars"></a>
+<a name="N1050F"></a><a name="Chars"></a>
 <h3 class="boxed">Chars</h3>
 <p>
 Lucene writes Unicode
 character sequences as UTF-8 encoded bytes.
 </p>
-<a name="N1050F"></a><a name="String"></a>
+<a name="N10518"></a><a name="String"></a>
 <h3 class="boxed">String</h3>
 <p>
 Lucene writes strings as UTF-8 encoded bytes.
@@ -1435,10 +1457,10 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N1051C"></a><a name="Compound Types"></a>
+<a name="N10525"></a><a name="Compound Types"></a>
 <h2 class="boxed">Compound Types</h2>
 <div class="section">
-<a name="N10521"></a><a name="MapStringString"></a>
+<a name="N1052A"></a><a name="MapStringString"></a>
 <h3 class="boxed">Map&lt;String,String&gt;</h3>
 <p>
 In a couple of places Lucene stores a Map
@@ -1451,13 +1473,13 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N10531"></a><a name="Per-Index Files"></a>
+<a name="N1053A"></a><a name="Per-Index Files"></a>
 <h2 class="boxed">Per-Index Files</h2>
 <div class="section">
 <p>
 The files in this section exist one-per-index.
 </p>
-<a name="N10539"></a><a name="Segments File"></a>
+<a name="N10542"></a><a name="Segments File"></a>
 <h3 class="boxed">Segments File</h3>
 <p>
 The active segments in the index are stored in the
@@ -1624,7 +1646,7 @@ document.write("Last Published: " + document.lastModified);
 Lucene version, OS, Java version, why the segment
 was created (merge, flush, addIndexes), etc.
 </p>
-<a name="N105BE"></a><a name="Lock File"></a>
+<a name="N105C7"></a><a name="Lock File"></a>
 <h3 class="boxed">Lock File</h3>
 <p>
 The write lock, which is stored in the index
@@ -1638,14 +1660,14 @@ document.write("Last Published: " + document.lastModified);
 documents). This lock file ensures that only one
 writer is modifying the index at a time.
 </p>
-<a name="N105C7"></a><a name="Deletable File"></a>
+<a name="N105D0"></a><a name="Deletable File"></a>
 <h3 class="boxed">Deletable File</h3>
 <p>
 A writer dynamically computes
 the files that are deletable, so no file
 is written.
 </p>
-<a name="N105D0"></a><a name="Compound Files"></a>
+<a name="N105D9"></a><a name="Compound Files"></a>
 <h3 class="boxed">Compound Files</h3>
 <p>Starting with Lucene 1.4, the compound file format became the default. This
 is simply a container for all files described in the next section
@@ -1672,14 +1694,14 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N105F8"></a><a name="Per-Segment Files"></a>
+<a name="N10601"></a><a name="Per-Segment Files"></a>
 <h2 class="boxed">Per-Segment Files</h2>
 <div class="section">
 <p>
 The remaining files are all per-segment, and are
 thus defined by suffix.
 </p>
-<a name="N10600"></a><a name="Fields"></a>
+<a name="N10609"></a><a name="Fields"></a>
 <h3 class="boxed">Fields</h3>
 <p>

@@ -1873,7 +1895,7 @@ document.write("Last Published: " + document.lastModified);
 </li>

 </ol>
-<a name="N106A7"></a><a name="Term Dictionary"></a>
+<a name="N106B0"></a><a name="Term Dictionary"></a>
 <h3 class="boxed">Term Dictionary</h3>
 <p>
 The term dictionary is represented as two files:
@@ -2065,7 +2087,7 @@ document.write("Last Published: " + document.lastModified);
 </li>

 </ol>
-<a name="N1072B"></a><a name="Frequencies"></a>
+<a name="N10734"></a><a name="Frequencies"></a>
 <h3 class="boxed">Frequencies</h3>
 <p>
 The .frq file contains the lists of documents
@@ -2193,7 +2215,7 @@ document.write("Last Published: " + document.lastModified);
 entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
 to entry 31 on level 0.
 </p>
-<a name="N107B3"></a><a name="Positions"></a>
+<a name="N107BC"></a><a name="Positions"></a>
 <h3 class="boxed">Positions</h3>
 <p>
 The .prx file contains the lists of positions that
@@ -2263,7 +2285,7 @@ document.write("Last Published: " + document.lastModified);
 Payload. If PayloadLength is not stored, then this Payload has the same
 length as the Payload at the previous position.
 </p>
-<a name="N107EF"></a><a name="Normalization Factors"></a>
+<a name="N107F8"></a><a name="Normalization Factors"></a>
 <h3 class="boxed">Normalization Factors</h3>
 <p>There's a single .nrm file containing all norms:
 </p>
@@ -2343,7 +2365,7 @@ document.write("Last Published: " + document.lastModified);
 </p>
 <p>Separate norm files are created (when adequate) for both compound and non-compound segments.
 </p>
-<a name="N10840"></a><a name="Term Vectors"></a>
+<a name="N10849"></a><a name="Term Vectors"></a>
 <h3 class="boxed">Term Vectors</h3>
 <p>
 Term Vector support is optional on a field by
@@ -2479,7 +2501,7 @@ document.write("Last Published: " + document.lastModified);
 </li>

 </ol>
-<a name="N108DC"></a><a name="Deleted Documents"></a>
+<a name="N108E5"></a><a name="Deleted Documents"></a>
 <h3 class="boxed">Deleted Documents</h3>
 <p>The .del file is
 optional, and only exists when a segment contains deletions.
@@ -2543,7 +2565,7 @@ document.write("Last Published: " + document.lastModified);
 </div>


-<a name="N10916"></a><a name="Limitations"></a>
+<a name="N1091F"></a><a name="Limitations"></a>
 <h2 class="boxed">Limitations</h2>
 <div class="section">
 <p>
File diff suppressed because it is too large
@@ -63,6 +63,30 @@
 change in 2.1).
 </p>

+<p>
+In version 2.4, Strings are now written as true UTF-8
+byte sequences, not Java's modified UTF-8. See issue
+LUCENE-510 for details.
+</p>
+
+<p>
+In version 2.9, an optional opaque Map&lt;String,String&gt;
+CommitUserData may be passed to IndexWriter's commit
+methods (and later retrieved), which is recorded in
+the segments_N file. See issue LUCENE-1382 for
+details. Also, diagnostics were added to each segment
+written recording details about why it was written
+(due to flush, merge; which OS/JRE was used; etc.).
+See issue LUCENE-1654 for details.
+</p>
+
+<p>
+In version 3.0, compressed fields are no longer
+written to the index (they can still be read, but on
+merge the new segment will write them,
+uncompressed). See issue LUCENE-1960 for details.
+</p>
+
 </section>

 <section id="Definitions"><title>Definitions</title>
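The 2.4 note in this commit contrasts true UTF-8 with Java's modified UTF-8. The Java sketch below illustrates the difference between the two encodings. It is not Lucene code: `DataOutputStream.writeUTF` stands in here for the modified encoding, and the names `Utf8Comparison`, `trueUtf8` and `modifiedUtf8` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Comparison {
    /** Standard ("true") UTF-8 bytes of the string. */
    public static byte[] trueUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java's modified UTF-8 bytes, with writeUTF's 2-byte length prefix stripped. */
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\0";                                     // the NUL character, U+0000
        String clef = new String(Character.toChars(0x1D11E)); // a supplementary character

        // NUL: one byte in true UTF-8, two bytes (0xC0 0x80) in modified UTF-8.
        System.out.println(trueUtf8(nul).length + " vs " + modifiedUtf8(nul).length);   // 1 vs 2

        // Supplementary character: four bytes in true UTF-8, but modified UTF-8
        // encodes each surrogate separately as three bytes, six in total.
        System.out.println(trueUtf8(clef).length + " vs " + modifiedUtf8(clef).length); // 4 vs 6
    }
}
```

These two cases, the NUL character and characters outside the Basic Multilingual Plane, are where the encodings diverge; for all other characters they produce identical bytes.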