<?xml version="1.0"?>
|
|
|
|
<document>
|
|
|
|
<properties>
|
|
<title>Index File Formats</title>
|
|
<authors>
|
|
<person email="cutting@apache.org" name="Doug Cutting"/>
|
|
</authors>
|
|
</properties>
|
|
|
|
<body>
|
|
<section name="Index File Formats">
|
|
|
|
<p>
|
|
This document defines the index file formats used
|
|
in Lucene version 1.4.
|
|
</p>
|
|
|
|
<p>
|
|
Jakarta Lucene is written in Java, but several
|
|
efforts are underway to write versions of Lucene in other programming
|
|
languages. If these versions are to remain compatible with Jakarta
|
|
Lucene, then a language-independent definition of the Lucene index
|
|
format is required. This document thus attempts to provide a
|
|
complete and independent definition of the Jakarta Lucene 1.4 file
|
|
formats.
|
|
</p>
|
|
|
|
<p>
|
|
As Lucene evolves, this document should evolve.
|
|
Versions of Lucene in different programming languages should endeavor
|
|
to agree on file formats, and generate new versions of this document.
|
|
</p>
|
|
|
|
<p>
|
|
Compatibility notes are provided in this document,
|
|
describing how file formats have changed from prior versions.
|
|
</p>
|
|
|
|
</section>
|
|
|
|
<section name="Definitions">
|
|
|
|
<p>
|
|
The fundamental concepts in Lucene are index,
|
|
document, field and term.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
An index contains a sequence of documents.
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
A document is a sequence of fields.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
A field is a named sequence of terms.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
A term is a string.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
The same string in two different fields is
|
|
considered a different term. Thus terms are represented as a pair of
|
|
strings, the first naming the field, and the second naming text
|
|
within the field.
|
|
</p>
|
|
|
|
<subsection name="Inverted Indexing">
|
|
|
|
<p>
|
|
The index stores statistics about terms in order
to make term-based search more efficient. Lucene's
index belongs to the family of indexes known as <i>inverted
indexes</i>, because it can list, for a term, the documents that contain
it. This is the inverse of the natural relationship, in which
documents list terms.
|
|
</p>
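<p>
The inversion described above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not Lucene's API: the class
ToyInvertedIndex and its methods are hypothetical, tokenization is a plain
whitespace split, and a term is modelled as the field name and the text
joined by a colon.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model, not Lucene's API: terms are "field:text" keys.
final class ToyInvertedIndex {
    // term -> sorted numbers of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDoc = 0;

    // Adds one document with a single field whose text is split on whitespace.
    // Returns the document number assigned to it (zero for the first document).
    int add(String field, String text) {
        int doc = nextDoc++;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(doc);
        }
        return doc;
    }

    // The inverted lookup: which documents contain this term?
    List<Integer> docsFor(String field, String text) {
        SortedSet<Integer> docs = postings.get(field + ":" + text);
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }
}
```
]]></source>
<p>
Because the field name is part of the key, the same string in two different
fields is looked up as two different terms.
</p>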
|
|
</subsection>
|
|
<subsection name="Types of Fields">
|
|
|
|
<p>
|
|
In Lucene, fields may be <i>stored</i>, in which
|
|
case their text is stored in the index literally, in a non-inverted
|
|
manner. Fields that are inverted are called <i>indexed</i>. A field
|
|
may be both stored and indexed.</p>
|
|
|
|
<p>The text of a field may be <i>tokenized</i> into terms to be
|
|
indexed, or the text of a field may be used literally as a term to be indexed.
|
|
Most fields are
|
|
tokenized, but sometimes it is useful for certain identifier fields
|
|
to be indexed literally.
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
<subsection name="Segments">
|
|
|
|
<p>
|
|
Lucene indexes may be composed of multiple sub-indexes, or
<i>segments</i>. Each segment is a fully independent index, which could be searched
|
|
separately. Indexes evolve by:
|
|
</p>
|
|
|
|
<ol>
|
|
<li><p>Creating new segments for newly added documents.</p>
|
|
</li>
|
|
<li><p>Merging existing segments.</p>
|
|
</li>
|
|
</ol>
|
|
|
|
<p>
|
|
Searches may involve multiple segments and/or multiple indexes, each
|
|
index potentially composed of a set of segments.
|
|
</p>
|
|
</subsection>
|
|
|
|
<subsection name="Document Numbers">
|
|
|
|
<p>
|
|
Internally, Lucene refers to documents by an integer <i>document
|
|
number</i>. The first document added to an index is numbered zero, and each
|
|
subsequent document added gets a number one greater than the previous.
|
|
</p>
|
|
|
|
<p>
|
|
<br/>
|
|
</p>
|
|
|
|
<p>
|
|
Note that a document's number may change, so caution should be taken
|
|
when storing these numbers outside of Lucene. In particular, numbers may
|
|
change in the following situations:
|
|
</p>
|
|
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
The
|
|
numbers stored in each segment are unique only within the segment,
|
|
and must be converted before they can be used in a larger context.
|
|
The standard technique is to allocate each segment a range of
|
|
values, based on the range of numbers used in that segment. To
|
|
convert a document number from a segment to an external value, the
|
|
segment's <i>base</i> document
|
|
number is added. To convert an external value back to a
|
|
segment-specific value, the segment is identified by the range that
|
|
the external value is in, and the segment's base value is
|
|
subtracted. For example, two five-document segments might be
|
|
combined, so that the first segment has a base value of zero, and
|
|
the second of five. Document three from the second segment would
|
|
have an external value of eight.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
When documents are deleted, gaps are created
|
|
in the numbering. These are eventually removed as the index evolves
|
|
through merging. Deleted documents are dropped when segments are
|
|
merged. A freshly-merged segment thus has no gaps in its numbering.
|
|
</p>
|
|
</li>
|
|
</ul>
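<p>
The base-value technique described above can be sketched as follows. This is
a minimal illustration, assuming each segment's size is known; the class
DocNumberMapper and its method names are hypothetical and not part of Lucene.
</p>
<source><![CDATA[
```java
// Hypothetical helper, not part of Lucene: maps between segment-local and
// index-wide ("external") document numbers using per-segment base values.
final class DocNumberMapper {
    private final int[] bases; // bases[i] = external number of segment i's document 0
    private final int total;   // total number of documents across all segments

    DocNumberMapper(int[] segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        total = next;
    }

    // Segment-local number to external value: add the segment's base.
    int toExternal(int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    // Finds the segment whose range contains the external value.
    int segmentOf(int externalDoc) {
        if (externalDoc < 0 || externalDoc >= total) {
            throw new IllegalArgumentException("no such document: " + externalDoc);
        }
        int seg = bases.length - 1;
        while (bases[seg] > externalDoc) {
            seg--;
        }
        return seg;
    }

    // External value to segment-local number: subtract the segment's base.
    int toLocal(int externalDoc) {
        return externalDoc - bases[segmentOf(externalDoc)];
    }
}
```
]]></source>
<p>
With two five-document segments, toExternal(1, 3) gives eight, and
toLocal(8) gives three again.
</p>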
|
|
|
|
</subsection>
|
|
|
|
</section>
|
|
|
|
<section name="Overview">
|
|
|
|
<p>
|
|
Each segment index maintains the following:
|
|
</p>
|
|
<ul>
|
|
<li><p>Field names. This
|
|
contains the set of field names used in the index.
|
|
|
|
</p>
|
|
</li>
|
|
<li><p>Stored Field
|
|
values. This contains, for each document, a list of attribute-value
|
|
pairs, where the attributes are field names. These are used to
|
|
store auxiliary information about the document, such as its title,
URL, or an identifier to access a
database. The set of stored fields is what is returned for each hit
when searching. This is keyed by document number.
|
|
</p>
|
|
</li>
|
|
<li><p>Term dictionary.
|
|
A dictionary containing all of the terms used in all of the indexed
|
|
fields of all of the documents. The dictionary also contains the
|
|
number of documents which contain the term, and pointers to the
|
|
term's frequency and proximity data.
|
|
</p>
|
|
</li>
|
|
|
|
<li><p>Term Frequency
|
|
data. For each term in the dictionary, the numbers of all the
|
|
documents that contain that term, and the frequency of the term in
|
|
that document.
|
|
</p>
|
|
</li>
|
|
|
|
<li><p>Term Proximity
|
|
data. For each term in the dictionary, the positions that the term
|
|
occurs in each document.
|
|
</p>
|
|
</li>
|
|
|
|
<li><p>Normalization
|
|
factors. For each field in each document, a value is stored that is
|
|
multiplied into the score for hits on that field.
|
|
</p>
|
|
</li>
|
|
<li><p>Term Vectors. For each field in each document, the term vector
|
|
(sometimes called document vector) is stored. A term vector consists
|
|
of term text and term frequency.
|
|
</p>
|
|
</li>
|
|
<li><p>Deleted documents.
|
|
An optional file indicating which documents are deleted.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>Details on each of these are provided in subsequent sections.
|
|
</p>
|
|
</section>
|
|
|
|
<section name="File Naming">
|
|
|
|
<p>
|
|
All files belonging to a segment have the same name with varying
|
|
extensions. The extensions correspond to the different file formats
|
|
described below.
|
|
</p>
|
|
|
|
<p>
|
|
Typically, all segments
|
|
in an index are stored in a single directory, although this is not
|
|
required.
|
|
</p>
|
|
|
|
</section>
|
|
|
|
<section name="Primitive Types">
|
|
|
|
<subsection name="Byte">
|
|
|
|
<p>
|
|
The most primitive type
|
|
is an eight-bit byte. Files are accessed as sequences of bytes. All
|
|
other data types are defined as sequences
|
|
of bytes, so file formats are byte-order independent.
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
<subsection name="UInt32">
|
|
|
|
<p>
|
|
32-bit unsigned integers are written as four
|
|
bytes, high-order bytes first.
|
|
</p>
|
|
<p>
|
|
UInt32 --> <Byte><sup>4</sup>
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
<subsection name="UInt64">
|
|
|
|
<p>
|
|
64-bit unsigned integers are written as eight
|
|
bytes, high-order bytes first.
|
|
</p>
|
|
|
|
<p>UInt64 --> <Byte><sup>8</sup>
|
|
</p>
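<p>
As an illustration of "high-order bytes first", the fixed-width integers can
be read and written one byte at a time. This is a sketch only; the class
FixedInts and its methods are hypothetical names. Since Java has no unsigned
primitive types, a UInt32 is returned in a long, and a UInt64 above
Long.MAX_VALUE would appear negative.
</p>
<source><![CDATA[
```java
// Hypothetical helper: reads and writes the fixed-width integers byte by byte.
final class FixedInts {
    // Reads a UInt32 starting at offset; a long is used so the result is never negative.
    static long readUInt32(byte[] buf, int offset) {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (buf[offset + i] & 0xFF);
        }
        return value;
    }

    // Reads a UInt64 starting at offset, high-order byte first.
    static long readUInt64(byte[] buf, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buf[offset + i] & 0xFF);
        }
        return value;
    }

    // Writes the low 32 bits of value as four bytes, high-order byte first.
    static byte[] writeUInt32(long value) {
        byte[] out = new byte[4];
        for (int i = 3; i >= 0; i--) {
            out[i] = (byte) value;
            value >>>= 8;
        }
        return out;
    }
}
```
]]></source>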
|
|
|
|
</subsection>
|
|
|
|
<subsection name="VInt">
|
|
|
|
<p>
|
|
A variable-length format for non-negative integers is
|
|
defined where the high-order bit of each byte indicates whether more
|
|
bytes remain to be read. The low-order seven bits are appended as
|
|
increasingly more significant bits in the resulting integer value.
|
|
Thus values from zero to 127 may be stored in a single byte, values
|
|
from 128 to 16,383 may be stored in two bytes, and so on.
|
|
</p>
|
|
|
|
<p><b>VInt Encoding Example</b></p>
|
|
|
|
<table width="100%" border="0" cellpadding="4" cellspacing="0">
|
|
<col width="64*" />
|
|
<col width="64*" />
|
|
<col width="64*" />
|
|
<col width="64*" />
|
|
<tr valign="TOP">
|
|
<td width="25%">
|
|
<p align="RIGHT"><b>Value</b>
|
|
</p>
|
|
</td>
|
|
<td width="25%">
|
|
<p align="RIGHT"><b>First byte</b>
|
|
</p>
|
|
</td>
|
|
<td width="25%">
|
|
<p align="RIGHT"><b>Second byte</b>
|
|
</p>
|
|
</td>
|
|
<td width="25%">
|
|
<p align="RIGHT"><b>Third byte</b>
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="0" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">0
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="0" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
00000000
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="1" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">1
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="2" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">2
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
00000010
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="25%" valign="TOP">
|
|
<p align="RIGHT">...
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: 0.11cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="127" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">127
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1111111" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
01111111
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="128" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">128
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000000" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="129" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">129
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000001" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
10000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="130" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">130
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000010" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
10000010
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="25%" valign="TOP">
|
|
<p align="RIGHT">...
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: 0.11cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="16383" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">16,383
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="11111111" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
11111111
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1111111" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
01111111
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdnum="1033;0;00000000">
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right:
|
|
0.01cm"><br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="16384" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">16,384
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000000" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000000" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.47cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr valign="BOTTOM">
|
|
<td width="25%" sdval="16385" sdnum="1033;0;#,##0">
|
|
<p align="RIGHT">16,385
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000001" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
10000001
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="10000000" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
</td>
|
|
<td width="25%" sdval="1" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.47cm;
|
|
margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="25%" valign="TOP">
|
|
<p align="RIGHT">...
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: 0.11cm;
|
|
margin-right: 0.01cm">
|
|
<br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.07cm;
|
|
margin-right: 0.01cm">
|
|
<br/>
|
|
|
|
</p>
|
|
</td>
|
|
<td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
|
|
<p class="western" align="RIGHT" style="margin-left: -0.47cm;
|
|
margin-right: 0.01cm">
|
|
<br/>
|
|
|
|
</p>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<p>
|
|
This provides compression while still being
|
|
efficient to decode.
|
|
</p>
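<p>
The encoding shown in the table can be sketched as follows. This is an
illustrative sketch, not Lucene's own code: the class VIntCodec is a
hypothetical name, and rejecting negative input is an assumption made here
for simplicity.
</p>
<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec for the VInt format described above.
final class VIntCodec {
    // Emits the low-order seven bits first; the high bit marks "more bytes follow".
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads one VInt starting at offset, appending each group of seven bits
    // as increasingly more significant bits of the result.
    static int decode(byte[] buf, int offset) {
        int value = 0;
        int shift = 0;
        int i = offset;
        byte b;
        do {
            b = buf[i++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```
]]></source>
<p>
For instance, encode(128) yields the two bytes 10000000 and 00000001, which
matches the corresponding row of the table.
</p>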
|
|
|
|
</subsection>
|
|
|
|
<subsection name="Chars">
|
|
|
|
<p>
|
|
Lucene writes Unicode
|
|
character sequences using the standard UTF-8 encoding.
|
|
</p>
|
|
|
|
|
|
</subsection>
|
|
|
|
<subsection name="String">
|
|
|
|
<p>
|
|
Lucene writes strings as a VInt representing the length, followed by
|
|
the character data.
|
|
</p>
|
|
|
|
<p>
|
|
String --> VInt, Chars
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
</section>
|
|
|
|
<section name="Per-Index Files">
|
|
|
|
<p>
|
|
The files in this section exist one-per-index.
|
|
</p>
|
|
|
|
<subsection name="Segments File">
|
|
|
|
<p>
|
|
The active segments in the index are stored in the
segment info file. An index has only
a single file in this format, and it is named "segments".
This lists each segment by name, and also contains the size of each
segment.
|
|
</p>
|
|
|
|
<p>
|
|
Segments --> Format, Version, SegCount, <SegName, SegSize><sup>SegCount</sup>
|
|
</p>
|
|
|
|
<p>
|
|
Format, SegCount, SegSize --> UInt32
|
|
</p>
|
|
|
|
<p>
|
|
Version --> UInt64
|
|
</p>
|
|
|
|
<p>
|
|
SegName --> String
|
|
</p>
|
|
|
|
<p>
|
|
Format is -1 in Lucene 1.4.
|
|
</p>
|
|
|
|
<p>
|
|
Version counts how often the index has been
|
|
changed by adding or deleting documents.
|
|
</p>
|
|
|
|
<p>
|
|
SegName is the name of the segment, and is used as the file name prefix
|
|
for all of the files that compose the segment's index.
|
|
</p>
|
|
|
|
<p>
|
|
SegSize is the number of documents contained in the segment index.
|
|
</p>
|
|
|
|
|
|
</subsection>
|
|
|
|
<subsection name="Lock Files">
|
|
|
|
<p>
|
|
Several files are used to indicate that another
|
|
process is using an index. Note that these files are not
|
|
stored in the index directory itself, but rather in the
|
|
system's temporary directory, as indicated in the Java
|
|
system property "java.io.tmpdir".
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
When a file named "commit.lock"
|
|
is present, a process is currently re-writing the "segments"
|
|
file and deleting outdated segment index files, or a process is
|
|
reading the "segments"
|
|
file and opening the files of the segments it names. This lock file
|
|
prevents files from being deleted by another process after a process
|
|
has read the "segments"
|
|
file but before it has managed to open all of the files of the
|
|
segments named therein.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
When a file named "write.lock"
|
|
is present, a process is currently adding documents to an index, or
|
|
removing files from that index. This lock file prevents several
|
|
processes from attempting to modify an index at the same time.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
</subsection>
|
|
|
|
<subsection name="Deletable File">
|
|
|
|
<p>
|
|
A file named "deletable"
|
|
contains the names of files that are no longer used by the index, but
|
|
which could not be deleted. This is only used on Win32, where a
|
|
file may not be deleted while it is still open. On other platforms
|
|
the file contains only null bytes.
|
|
</p>
|
|
|
|
<p>
|
|
Deletable --> DeletableCount,
|
|
<DeletableName><sup>DeletableCount</sup>
|
|
</p>
|
|
|
|
<p>DeletableCount --> UInt32
|
|
</p>
|
|
<p>DeletableName -->
|
|
String
|
|
</p>
|
|
</subsection>
|
|
</section>
|
|
|
|
<section name="Per-Segment Files">
|
|
|
|
<p>
|
|
The remaining files are all per-segment, and are
|
|
thus defined by suffix.
|
|
</p>
|
|
<subsection name="Fields">
|
|
<p><br/><b>Field Info</b><br/></p>
|
|
|
|
<p>
|
|
Field names are
|
|
stored in the field info file, with suffix .fnm.
|
|
</p>
|
|
<p>
|
|
FieldInfos
|
|
(.fnm) --> FieldsCount, <FieldName,
|
|
FieldBits><sup>FieldsCount</sup>
|
|
</p>
|
|
|
|
<p>
|
|
FieldsCount --> VInt
|
|
</p>
|
|
|
|
<p>
|
|
FieldName --> String
|
|
</p>
|
|
|
|
<p>
|
|
FieldBits --> Byte
|
|
</p>
|
|
|
|
<p>
|
|
The low-order bit is one for
|
|
indexed fields, and zero for non-indexed fields. The second lowest-order
|
|
bit is one for fields that have term vectors stored, and zero for fields
|
|
without term vectors.
|
|
</p>
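<p>
The two flags in a FieldBits byte can be tested with simple masks, as in the
following sketch. The class and constant names are hypothetical and chosen
only for illustration.
</p>
<source><![CDATA[
```java
// Hypothetical helper for the FieldBits byte of the .fnm file.
final class FieldBits {
    static final int INDEXED = 0x01;     // low-order bit
    static final int TERM_VECTOR = 0x02; // second lowest-order bit

    static boolean isIndexed(byte bits) {
        return (bits & INDEXED) != 0;
    }

    static boolean hasTermVector(byte bits) {
        return (bits & TERM_VECTOR) != 0;
    }

    // Builds a FieldBits byte from the two flags.
    static byte pack(boolean indexed, boolean termVector) {
        return (byte) ((indexed ? INDEXED : 0) | (termVector ? TERM_VECTOR : 0));
    }
}
```
]]></source>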
|
|
|
|
<p>
|
|
Fields are numbered by their order in this file. Thus field zero is
|
|
the
|
|
first field in the file, field one the next, and so on. Note that,
|
|
like document numbers, field numbers are segment relative.
|
|
</p>
|
|
|
|
<p><br/><b>Stored Fields</b><br/></p>
|
|
|
|
<p>
|
|
Stored fields are represented by two files:
|
|
</p>
|
|
|
|
<ol>
|
|
<li>
|
|
<p>
|
|
The field index, or .fdx file.
|
|
</p>
|
|
|
|
<p>
|
|
This contains, for each document, a pointer to
|
|
its field data, as follows:
|
|
</p>
|
|
|
|
<p>
|
|
FieldIndex
|
|
(.fdx) -->
|
|
<FieldValuesPosition><sup>SegSize</sup>
|
|
</p>
|
|
<p>FieldValuesPosition
|
|
--> UInt64
|
|
</p>
|
|
<p>This
|
|
is used to find the location within the field data file of the
|
|
fields of a particular document. Because it contains fixed-length
|
|
data, this file may be easily randomly accessed. The position of
document <i>n</i>'s field data is the UInt64 at byte offset <i>n*8</i> in
this file.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The field data, or .fdt file.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
This contains the stored fields of each document,
|
|
as follows:
|
|
</p>
|
|
|
|
<p>
|
|
FieldData (.fdt) -->
|
|
<DocFieldData><sup>SegSize</sup>
|
|
</p>
|
|
<p>DocFieldData -->
|
|
FieldCount, <FieldNum, Bits, Value><sup>FieldCount</sup>
|
|
</p>
|
|
<p>FieldCount -->
|
|
VInt
|
|
</p>
|
|
<p>FieldNum -->
|
|
VInt
|
|
</p>
|
|
<p>Bits -->
|
|
Byte
|
|
</p>
|
|
<p>Value -->
|
|
String
|
|
</p>
|
|
<p>Currently
|
|
only the low-order bit of Bits is used. It is one for
|
|
tokenized fields, and zero for non-tokenized fields.
|
|
</p>
|
|
</li>
|
|
</ol>
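<p>
The fixed-width lookup in the field index can be sketched as below. This
assumes, purely for illustration, that the whole .fdx file has been read into
a byte array; a real reader would more likely seek within the file. The class
and method names are hypothetical.
</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;

// Hypothetical helper: finds where a document's stored fields begin in the .fdt file.
final class FieldIndexLookup {
    // fdx holds the complete contents of the .fdx file; n is the document number.
    static long fieldValuesPosition(byte[] fdx, int n) {
        // ByteBuffer is big-endian by default, matching "high-order bytes first".
        return ByteBuffer.wrap(fdx).getLong(n * 8);
    }
}
```
]]></source>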
|
|
|
|
</subsection>
|
|
<subsection name="Term Dictionary">
|
|
|
|
<p>
|
|
The term dictionary is represented as two files:
|
|
</p>
|
|
<ol>
|
|
<li>
|
|
<p>
|
|
The term infos, or .tis file.
|
|
</p>
|
|
|
|
<p>
|
|
TermInfoFile (.tis)-->
|
|
TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
|
|
</p>
|
|
<p>TIVersion -->
|
|
UInt32
|
|
</p>
|
|
<p>TermCount -->
|
|
UInt64
|
|
</p>
|
|
<p>IndexInterval -->
|
|
UInt32
|
|
</p>
|
|
<p>SkipInterval -->
|
|
UInt32
|
|
</p>
|
|
<p>TermInfos -->
|
|
<TermInfo><sup>TermCount</sup>
|
|
</p>
|
|
<p>TermInfo -->
|
|
<Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
|
|
</p>
|
|
<p>Term -->
|
|
<PrefixLength, Suffix, FieldNum>
|
|
</p>
|
|
<p>Suffix -->
|
|
String
|
|
</p>
|
|
<p>PrefixLength,
|
|
DocFreq, FreqDelta, ProxDelta, SkipDelta<br/> --> VInt
|
|
</p>
|
|
<p>This
|
|
file is sorted by Term. Terms are ordered first lexicographically
|
|
by the term's field name, and within that lexicographically by the
|
|
term's text.
|
|
</p>
|
|
<p>TIVersion names the version of the format
|
|
of this file and is -2 in Lucene 1.4.
|
|
</p>
|
|
<p>Term
|
|
text prefixes are shared. The PrefixLength is the number of initial
|
|
characters from the previous term which must be pre-pended to a
|
|
term's suffix in order to form the term's text. Thus, if the
|
|
previous term's text was "bone" and the term is "boy",
|
|
the PrefixLength is two and the suffix is "y".
|
|
</p>
|
|
<p>FieldNum
determines the term's field, whose name is stored in the .fnm file.
|
|
</p>
|
|
<p>DocFreq
|
|
is the count of documents which contain the term.
|
|
</p>
|
|
<p>FreqDelta
|
|
determines the position of this term's TermFreqs within the .frq
|
|
file. In particular, it is the difference between the position of
|
|
this term's data in that file and the position of the previous
|
|
term's data (or zero, for the first term in the file).
|
|
</p>
|
|
<p>ProxDelta
|
|
determines the position of this term's TermPositions within the .prx
|
|
file. In particular, it is the difference between the position of
|
|
this term's data in that file and the position of the previous
|
|
term's data (or zero, for the first term in the file).
|
|
</p>
|
|
<p>SkipDelta determines the position of this
term's SkipData within the .frq file. In
particular, it is the number of bytes
after the start of TermFreqs at which the SkipData begins.
In other words, it is the length of the
TermFreqs data.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The term info index, or .tii file.
|
|
</p>
|
|
|
|
<p>
|
|
This contains every IndexInterval<sup>th</sup> entry from the .tis
file, along with its location in the .tis file. This is
designed to be read entirely into memory and used to provide random
access to the .tis file.
|
|
</p>
|
|
|
|
<p>
|
|
The structure of this file is very similar to the
|
|
.tis file, with the addition of one item per record, the IndexDelta.
|
|
</p>
|
|
|
|
<p>
|
|
TermInfoIndex (.tii)-->
|
|
IndexTermCount, TermIndices
|
|
</p>
|
|
<p>IndexTermCount -->
|
|
UInt32
|
|
</p>
|
|
<p>TermIndices -->
|
|
<TermInfo, IndexDelta><sup>IndexTermCount</sup>
|
|
</p>
|
|
<p>IndexDelta -->
|
|
VInt
|
|
</p>
|
|
<p>IndexDelta
|
|
determines the position of this term's TermInfo within the .tis file. In
|
|
particular, it is the difference between the position of this term's
|
|
entry in that file and the position of the previous term's entry (or
|
|
zero for the first term in the file).
|
|
</p>
|
|
</li>
|
|
</ol>
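<p>
The prefix sharing used for term text can be sketched as follows. This is a
minimal illustration operating on values that have already been read from the
file; the class PrefixSharing and its methods are hypothetical names.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating shared term-text prefixes.
final class PrefixSharing {
    // Rebuilds full term texts from parallel arrays of prefix lengths and suffixes.
    static List<String> decode(int[] prefixLengths, String[] suffixes) {
        List<String> terms = new ArrayList<>();
        String previous = "";
        for (int i = 0; i < suffixes.length; i++) {
            // Take the shared prefix from the previous term, then append the suffix.
            String term = previous.substring(0, prefixLengths[i]) + suffixes[i];
            terms.add(term);
            previous = term;
        }
        return terms;
    }

    // Number of leading characters shared by two consecutive terms,
    // which is the PrefixLength a writer would store for the second one.
    static int sharedPrefix(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int n = 0;
        while (n < limit && previous.charAt(n) == current.charAt(n)) {
            n++;
        }
        return n;
    }
}
```
]]></source>
<p>
Given the prefix lengths 0 and 2 with the suffixes "bone" and "y", decode
returns "bone" followed by "boy".
</p>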
|
|
</subsection>
|
|
|
|
<subsection name="Frequencies">
|
|
|
|
<p>
|
|
The .frq file contains the lists of documents
|
|
which contain each term, along with the frequency of the term in that
|
|
document.
|
|
</p>
|
|
<p>FreqFile (.frq) -->
|
|
<TermFreqs, SkipData><sup>TermCount</sup>
|
|
</p>
|
|
<p>TermFreqs -->
|
|
<TermFreq><sup>DocFreq</sup>
|
|
</p>
|
|
<p>TermFreq -->
|
|
DocDelta, Freq?
|
|
</p>
|
|
<p>SkipData -->
|
|
<SkipDatum><sup>DocFreq/SkipInterval</sup>
|
|
</p>
|
|
<p>SkipDatum -->
|
|
DocSkip,FreqSkip,ProxSkip
|
|
</p>
|
|
<p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip -->
|
|
VInt
|
|
</p>
|
|
<p>TermFreqs
|
|
are ordered by term (the term is implicit, from the .tis file).
|
|
</p>
|
|
<p>TermFreq
|
|
entries are ordered by increasing document number.
|
|
</p>
|
|
<p>DocDelta
|
|
determines both the document number and the frequency. In
|
|
particular, DocDelta/2 is the difference between this document number
|
|
and the previous document number (or zero when this is the first
|
|
document in a TermFreqs). When DocDelta is odd, the frequency is
|
|
one. When DocDelta is even, the frequency is read as another VInt.
|
|
</p>
|
|
<p>For
|
|
example, the TermFreqs for a term which occurs once in document seven
|
|
and three times in document eleven would be the following sequence of
|
|
VInts:
|
|
</p>
|
|
<p>15, 8, 3
|
|
</p>
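<p>
The DocDelta rules can be checked with a short sketch. The class
TermFreqsEncoder is a hypothetical name, and the method produces the sequence
of integer values that would then each be written as a VInt. For a term that
occurs once in document seven and three times in document eleven it produces
15 (seven doubled, plus one), then 8 (a gap of four, doubled), then 3.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: produces the integer values of one term's TermFreqs.
final class TermFreqsEncoder {
    // docs must be strictly increasing; freqs[i] (at least 1) is the
    // frequency of the term in document docs[i].
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> values = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int gap = docs[i] - previousDoc;
            if (freqs[i] == 1) {
                values.add(gap * 2 + 1); // odd DocDelta: a frequency of one is implied
            } else {
                values.add(gap * 2);     // even DocDelta: the frequency follows
                values.add(freqs[i]);
            }
            previousDoc = docs[i];
        }
        return values;
    }
}
```
]]></source>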
|
|
<p>DocSkip records the document number before every
|
|
SkipInterval<sup>th</sup> document in TermFreqs.
|
|
Document numbers are represented as differences
|
|
from the previous value in the sequence. FreqSkip
|
|
and ProxSkip record the position of every
|
|
SkipInterval<sup>th</sup> entry in FreqFile and
|
|
ProxFile, respectively. File positions are
relative to the start of TermFreqs and Positions for the first SkipDatum,
and to the previous SkipDatum in the sequence thereafter.
|
|
</p>
|
|
<p>For example, if DocFreq=35 and SkipInterval=16,
then there are two SkipData entries, containing
the 15<sup>th</sup> and 31<sup>st</sup> document
numbers in TermFreqs. The first FreqSkip names
the number of bytes after the beginning of
TermFreqs at which the 16<sup>th</sup> TermFreq
starts, and the second the number of bytes after
that at which the 32<sup>nd</sup> starts. The first
ProxSkip names the number of bytes after the
beginning of Positions at which the 16<sup>th</sup>
Positions entry starts, and the second the number of
bytes after that at which the 32<sup>nd</sup> starts.
|
|
</p>
|
|
|
|
</subsection>
|
|
<subsection name="Positions">
|
|
|
|
<p>
|
|
The .prx file contains the lists of positions that
|
|
each term occurs at within documents.
|
|
</p>
|
|
<p>ProxFile (.prx) -->
|
|
<TermPositions><sup>TermCount</sup>
|
|
</p>
|
|
<p>TermPositions -->
|
|
<Positions><sup>DocFreq</sup>
|
|
</p>
|
|
<p>Positions -->
|
|
<PositionDelta><sup>Freq</sup>
|
|
</p>
|
|
<p>PositionDelta -->
|
|
VInt
|
|
</p>
|
|
<p>TermPositions
|
|
are ordered by term (the term is implicit, from the .tis file).
|
|
</p>
|
|
<p>Positions
|
|
entries are ordered by increasing document number (the document
|
|
number is implicit from the .frq file).
|
|
</p>
|
|
<p>PositionDelta
|
|
is the difference between the position of the current occurrence in
|
|
the document and the previous occurrence (or zero, if this is the
|
|
first occurrence in this document).
|
|
</p>
|
|
<p>
|
|
For example, the TermPositions for a
|
|
term which occurs as the fourth term in one document, and as the
|
|
fifth and ninth term in a subsequent document, would be the following
|
|
sequence of VInts:
|
|
</p>
|
|
<p> 4,
|
|
5, 4
|
|
</p>
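<p>
Position deltas within a single document can be computed and reversed as in
the following sketch. The class PositionDeltas is a hypothetical name; each
resulting value would be written as a VInt. Positions 5 and 9 become the
deltas 5 and 4.
</p>
<source><![CDATA[
```java
// Hypothetical helper: delta-codes the positions of a term within one document.
final class PositionDeltas {
    // positions must be in increasing order; the first delta is taken from zero.
    static int[] encode(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    // Reverses encode by keeping a running sum of the deltas.
    static int[] decode(int[] deltas) {
        int[] positions = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            positions[i] = current;
        }
        return positions;
    }
}
```
]]></source>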
|
|
</subsection>
|
|
<subsection name="Normalization Factors">
|
|
<p>There's a norm file for each indexed field with a byte for
|
|
each document. The .f[0-9]* file contains,
|
|
for each document, a byte that encodes a value that is multiplied
|
|
into the score for hits on that field:
|
|
</p>
|
|
<p>Norms
|
|
(.f[0-9]*) --> <Byte><sup>SegSize</sup>
|
|
</p>
|
|
<p>Each
|
|
byte encodes a floating point value. Bits 0-2 contain the 3-bit
|
|
mantissa, and bits 3-7 contain the 5-bit exponent.
|
|
</p>
|
|
<p>These
|
|
are converted to an IEEE single float value as follows:
|
|
</p>
|
|
<ol>
|
|
<li><p>If
|
|
the byte is zero, use a zero float.
|
|
</p>
|
|
</li>
|
|
<li><p>Otherwise,
|
|
set the sign bit of the float to zero;
|
|
</p>
|
|
</li>
|
|
<li><p>add
|
|
48 to the exponent and use this as the float's exponent;
|
|
</p>
|
|
</li>
|
|
<li><p>map
|
|
the mantissa to the high-order 3 bits of the float's mantissa; and
|
|
|
|
</p>
|
|
</li>
|
|
<li><p>set
|
|
the low-order 21 bits of the float's mantissa to zero.
|
|
</p>
|
|
</li>
|
|
</ol>
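<p>
The conversion steps above can be sketched in Java as follows. This is a
minimal sketch rather than a definitive implementation: the class name
NormDecoder is hypothetical, and the shift amounts (24 for the biased
exponent, 21 for the mantissa) are an assumption, chosen so that the
low-order 21 bits of the result stay zero as the last step requires.
</p>
<source><![CDATA[
```java
// Hypothetical decoder for a norm byte; shift amounts are an assumption (see text).
final class NormDecoder {
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;                  // step 1: a zero byte is a zero float
        }
        int mantissa = b & 0x07;          // the three low-order bits
        int exponent = (b >> 3) & 0x1F;   // the five remaining bits
        // The sign bit stays zero; 48 is added to the exponent;
        // the 21 low-order bits of the result remain zero.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }
}
```
]]></source>
<p>
Under these assumptions the byte 0x7C (exponent 15, mantissa 4) decodes to
1.0, and larger byte values decode to larger floats.
</p>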
|
|
|
|
</subsection>
|
|
<subsection name="Term Vectors">
|
|
Term vector support is optional on a field-by-field basis. It consists of three
files.
|
|
<ol>
|
|
<li>
|
|
<p>The Document Index or .tvx file.</p>
|
|
<p>This contains, for each document, a pointer to the document data in the Document
|
|
(.tvd) file.
|
|
</p>
|
|
<p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p>
|
|
<p>TVXVersion --> Int</p>
|
|
<p>DocumentPosition --> UInt64</p>
|
|
<p>This is used to find the position of the Document in the .tvd file.</p>
|
|
</li>
|
|
<li>
|
|
<p>The Document or .tvd file.</p>
|
|
<p>This contains, for each document, the number of fields, a list of the fields with
|
|
term vector info and finally a list of pointers to the field information in the .tvf
|
|
(Term Vector Fields) file.</p>
|
|
<p>
|
|
Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions><sup>NumDocs</sup>
|
|
</p>
|
|
<p>TVDVersion --> Int</p>
|
|
<p>NumFields --> VInt</p>
|
|
<p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
|
|
<p>FieldNumDelta --> VInt</p>
|
|
<p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p>
|
|
<p>FieldPosition --> VLong</p>
|
|
<p>The .tvd file is used to map out the fields that have term vectors stored and
|
|
where the field information is in the .tvf file.</p>
|
|
</li>
|
|
<li>
|
|
<p>The Field or .tvf file.</p>
|
|
<p>This file contains, for each field that has a term vector stored, a list of
|
|
the terms and their frequencies.</p>
|
|
<p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
|
|
<p>TVFVersion --> Int</p>
|
|
<p>NumTerms --> VInt</p>
|
|
<p>NumDistinct --> VInt -- Future Use</p>
|
|
<p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
|
|
<p>TermText --> <PrefixLength, Suffix></p>
|
|
<p>PrefixLength --> VInt</p>
|
|
<p>Suffix --> String</p>
|
|
<p>TermFreq --> VInt</p>
|
|
<p>Term
|
|
text prefixes are shared. The PrefixLength is the number of initial
|
|
characters from the previous term which must be pre-pended to a
|
|
term's suffix in order to form the term's text. Thus, if the
|
|
previous term's text was "bone" and the term is "boy",
|
|
the PrefixLength is two and the suffix is "y".
|
|
</p>
|
|
</li>
|
|
</ol>
|
|
</subsection>
|
|
|
|
<subsection name="Deleted Documents">
|
|
|
|
<p>The .del file is
|
|
optional, and only exists when a segment contains deletions:
|
|
</p>
|
|
|
|
<p>Deletions
|
|
(.del) --> ByteCount,BitCount,Bits
|
|
</p>
|
|
|
|
<p>ByteCount,BitCount -->
UInt32
|
|
</p>
|
|
|
|
<p>Bits -->
|
|
<Byte><sup>ByteCount</sup>
|
|
</p>
|
|
|
|
<p>ByteCount
|
|
indicates the number of bytes in Bits. It is typically
|
|
(SegSize/8)+1.
|
|
</p>
|
|
|
|
<p>
|
|
BitCount
|
|
indicates the number of bits that are currently set in Bits.
|
|
</p>
|
|
|
|
<p>Bits
|
|
contains one bit for each document indexed. When the bit
|
|
corresponding to a document number is set, that document is marked as
|
|
deleted. Bit ordering is from least to most significant. Thus, if
|
|
Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
|
|
deleted.
|
|
</p>
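<p>
Testing whether a document is marked as deleted can be sketched as follows,
given the Bits array already read from the .del file. The class DeletedDocs
and its method names are hypothetical.
</p>
<source><![CDATA[
```java
// Hypothetical helper for the Bits array of a .del file.
final class DeletedDocs {
    // Bit ordering is from least to most significant within each byte.
    static boolean isDeleted(byte[] bits, int doc) {
        return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
    }

    // Counts the bits that are set, which is the value stored as BitCount.
    static int countSet(byte[] bits) {
        int count = 0;
        for (byte b : bits) {
            count += Integer.bitCount(b & 0xFF);
        }
        return count;
    }
}
```
]]></source>
<p>
For the two bytes 0x00 and 0x02, isDeleted reports document 9 as deleted and
every other document as live, and countSet returns one.
</p>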
|
|
</subsection>
|
|
</section>
|
|
|
|
<section name="Limitations">
|
|
|
|
<p>There
|
|
are a few places where these file formats limit the maximum number of
|
|
terms and documents to a 32-bit quantity, or to approximately 4
billion. This is not a problem today, but in the long term it
probably will be. These should therefore be replaced with either
UInt64 values, or better yet, with VInt values, which have no limit.
|
|
</p>
|
|
|
|
</section>
|
|
|
|
</body>
|
|
|
|
</document>
|