From 533f3e360f0897277d9ae07d8ceed95773fe671d Mon Sep 17 00:00:00 2001
From: Otis Gospodnetic
Date: Wed, 30 Oct 2002 04:10:30 +0000
Subject: [PATCH] - Lucene file formats. PR: Obtained from: Submitted by: Doug Cutting, Clemens Marschner. Reviewed by:

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149873 13f79535-47bb-0310-9956-ffa450edef68
---
 xdocs/fileformats.xml | 1184 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1184 insertions(+)
 create mode 100644 xdocs/fileformats.xml

diff --git a/xdocs/fileformats.xml b/xdocs/fileformats.xml
new file mode 100644
index 00000000000..4a798a156b1
--- /dev/null
+++ b/xdocs/fileformats.xml
@@ -0,0 +1,1184 @@
+ + + + + +
+ +

+ This document defines the index file formats used + in Lucene version 1.3. +

+ +

+ Jakarta Lucene is written in Java, but several + efforts are underway to write versions of Lucene in other programming + languages. If these versions are to remain compatible with Jakarta + Lucene, then a language-independent definition of the Lucene index + format is required. This document thus attempts to provide a + complete and independent definition of the Jakarta Lucene 1.3 file + formats. +

+ +

+ As Lucene evolves, this document should evolve. + Versions of Lucene in different programming languages should endeavor + to agree on file formats, and generate new versions of this document. + 

+ +

+ Compatibility notes are provided in this document, + describing how file formats have changed from prior versions. +

+ +
+ +
+ +

+ The fundamental concepts in Lucene are index, + document, field and term. +

+ + +

+ An index contains a sequence of documents. +

+ +
    +
  • +

    + A document is a sequence of fields. +

    +
  • + +
  • +

    + A field is a named sequence of terms. +

    +
  • + +
  • + A term is a string. +
  • +
+ +

+ The same string in two different fields is + considered a different term. Thus terms are represented as a pair of + strings, the first naming the field, and the second naming text + within the field. +

+ + + +

+ The index stores statistics about terms in order + to make term-based search more efficient. Lucene's + index belongs to the family of indexes known as inverted + indexes, because it can list, for a term, the documents that contain + it. This is the inverse of the natural relationship, in which + documents list terms. + 

+
+ + +

+ In Lucene, fields may be stored, in which + case their text is stored in the index literally, in a non-inverted + manner. Fields that are inverted are called indexed. A field + may be both stored and indexed.

+ +

The text of a field may be tokenized into terms to be + indexed, or the text of a field may be used literally as a term to be indexed. + Most fields are + tokenized, but sometimes it is useful for certain identifier fields + to be indexed literally. +

+ +
+ + + +

+ Lucene indexes may be composed of multiple sub-indexes, or + segments. Each segment is a fully independent index, which could be searched + separately. Indexes evolve by: +

+ +
    +
  1. Creating new segments for newly added documents.

    +
  2. +
  3. Merging existing segments.

    +
  4. +
+ +

+ Searches may involve multiple segments and/or multiple indexes, each + index potentially composed of a set of segments. +

+
+ + + +

+ Internally, Lucene refers to documents by an integer document + number. The first document added to an index is numbered zero, and each + subsequent document added gets a number one greater than the previous. +

+ +

+
+

+ +

+ Note that a document's number may change, so caution should be taken + when storing these numbers outside of Lucene. In particular, numbers may + change in the following situations: +

+ + +
    +
  • +

    + The + numbers stored in each segment are unique only within the segment, + and must be converted before they can be used in a larger context. + The standard technique is to allocate each segment a range of + values, based on the range of numbers used in that segment. To + convert a document number from a segment to an external value, the + segment's base document + number is added. To convert an external value back to a + segment-specific value, the segment is identified by the range that + the external value is in, and the segment's base value is + subtracted. For example, two five-document segments might be + combined, so that the first segment has a base value of zero, and + the second of five. Document three from the second segment would + have an external value of eight. + 

    +
  • +
  • +

    + When documents are deleted, gaps are created + in the numbering. These are eventually removed as the index evolves + through merging. Deleted documents are dropped when segments are + merged. A freshly-merged segment thus has no gaps in its numbering. +

    +
  • +
+ +
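The base-value technique described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`segment_bases`, `to_external`, `to_segment`) are hypothetical and do not come from Lucene, and segments are assumed to be numbered in the order they are combined.

```python
from bisect import bisect_right

def segment_bases(segment_sizes):
    """Return the base document number of each segment, in order."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases

def to_external(bases, segment, local_doc):
    """Segment-relative document number -> external value (add the base)."""
    return bases[segment] + local_doc

def to_segment(bases, external_doc):
    """External value -> (segment, segment-relative document number)."""
    # The segment is the last one whose base is <= the external value.
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]

# Two five-document segments: bases are zero and five.
bases = segment_bases([5, 5])
print(to_external(bases, 1, 3))  # document three of the second segment
print(to_segment(bases, 8))
```

Because deletions and merges change segment sizes, the bases would have to be recomputed whenever the set of segments changes.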
+ +
+ +
+ +

+ Each segment index maintains the following: +

+
    +
  • Field names. This + contains the set of field names used in the index. + +

    +
  • +
  • Stored Field + values. This contains, for each document, a list of attribute-value + pairs, where the attributes are field names. These are used to + store auxiliary information about the document, such as its title, + URL, or an identifier to access a + database. The set of stored fields is what is returned for each hit + when searching. This is keyed by document number. + 

    +
  • +
  • Term dictionary. + A dictionary containing all of the terms used in all of the indexed + fields of all of the documents. The dictionary also contains the + number of documents which contain the term, and pointers to the + term's frequency and proximity data. +

    +
  • + +
  • Term Frequency + data. For each term in the dictionary, the numbers of all the + documents that contain that term, and the frequency of the term in + that document. +

    +
  • + +
  • Term Proximity + data. For each term in the dictionary, the positions at which the term + occurs in each document. + 

    +
  • + +
  • Normalization + factors. For each field in each document, a value is stored that is + multiplied into the score for hits on that field. +

    +
  • + +
  • Deleted documents. + An optional file indicating which documents are deleted. +

    +
  • +
+ +

Details on each of these are provided in subsequent sections. +

+
+ +
+ +

+ All files belonging to a segment have the same name with varying + extensions. The extensions correspond to the different file formats + described below. +

+ +

+ Typically, all segments + in an index are stored in a single directory, although this is not + required. +

+ +
+ +
+ + + +

+ The most primitive type + is an eight-bit byte. Files are accessed as sequences of bytes. All + other data types are defined as sequences + of bytes, so file formats are byte-order independent. +

+ +
+ + + +

+ 32-bit unsigned integers are written as four + bytes, high-order bytes first. +

+

+ UInt32 --> <Byte>4 +

+ +
+ + + +

+ 64-bit unsigned integers are written as eight + bytes, high-order bytes first. +

+ +

UInt64 --> <Byte>8 + 

+ +
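As a rough illustration of the two fixed-width types, the sketch below uses Python's `struct` module, where `>` selects high-order-byte-first layout, `I` is a 32-bit unsigned integer and `Q` a 64-bit one. The function names are hypothetical.

```python
import struct

def write_uint32(value):
    """Four bytes, high-order byte first."""
    return struct.pack(">I", value)

def read_uint32(data, offset=0):
    return struct.unpack_from(">I", data, offset)[0]

def write_uint64(value):
    """Eight bytes, high-order byte first."""
    return struct.pack(">Q", value)

def read_uint64(data, offset=0):
    return struct.unpack_from(">Q", data, offset)[0]

print(write_uint32(1).hex())
print(write_uint64(258).hex())
```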
+ + + +

+ A variable-length format for non-negative integers is + defined where the high-order bit of each byte indicates whether more + bytes remain to be read. The low-order seven bits are appended as + increasingly more significant bits in the resulting integer value. + Thus values from zero to 127 may be stored in a single byte, values + from 128 to 16,383 may be stored in two bytes, and so on. + 

+ +

VInt Encoding Example

+ Value     First byte   Second byte   Third byte
+ 0         00000000
+ 1         00000001
+ 2         00000010
+ ...
+ 127       01111111
+ 128       10000000     00000001
+ 129       10000001     00000001
+ 130       10000010     00000001
+ ...
+ 16,383    11111111     01111111
+ 16,384    10000000     10000000      00000001
+ 16,385    10000001     10000000      00000001
+ ...
+ 

+ This provides compression while still being + efficient to decode. +
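A minimal Python sketch of the VInt scheme follows. It is an illustration of the encoding as described, not Lucene's own code; `write_vint` and `read_vint` are hypothetical names.

```python
def write_vint(value):
    """Encode a non-negative integer, seven bits per byte, low bits first."""
    if value < 0:
        raise ValueError("VInt cannot represent negative values")
    out = bytearray()
    while value > 0x7F:
        # Set the high-order bit: more bytes remain to be read.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(data, offset=0):
    """Decode one VInt; return (value, offset of the next unread byte)."""
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7

for n in (0, 127, 128, 16383, 16384):
    print(n, " ".join(format(b, "08b") for b in write_vint(n)))
```

The loop prints the same bit patterns as the corresponding rows of the table.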

+ +
+ + + +

+ Lucene writes Unicode + character sequences using the standard UTF-8 encoding. + 

+ + +
+ + + +

+ Lucene writes strings as a VInt representing the length, followed by + the character data. +

+ +

+ String --> VInt, Chars +
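The String layout could be read and written along the following lines. One point is an assumption, since the text does not say which unit the length uses: the sketch takes the VInt to count characters rather than bytes, so the reader must walk the UTF-8 data one character at a time. All function names are hypothetical.

```python
def write_vint(value):
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(data, offset=0):
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7

def utf8_char_length(lead_byte):
    """Byte length of a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    return 4

def write_string(text):
    # ASSUMPTION: the length prefix counts characters, not bytes.
    return write_vint(len(text)) + text.encode("utf-8")

def read_string(data, offset=0):
    """Return (text, offset of the next unread byte)."""
    count, offset = read_vint(data, offset)
    start = offset
    for _ in range(count):
        offset += utf8_char_length(data[offset])
    return data[start:offset].decode("utf-8"), offset

print(write_string("boy"))
print(read_string(write_string("café")))
```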

+ +
+ +
+ +
+ +

+ The files in this section exist one-per-index. +

+ + + +

+ The active segments in the index are stored in the + segment info file. An index only has + a single file in this format, and it is named "segments". + This lists each segment by name, and also contains the size of each + segment. +

+ +

+ Segments --> SegCount, <SegName, SegSize>SegCount +

+ +

+ SegCount, SegSize --> UInt32 +

+ +

+ SegName --> String +

+ +

+ SegName is the name of the segment, and is used as the file name prefix + for + all of the files that compose the segment's index. +

+ +

+ SegSize is the number of documents contained in the segment index. +
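As a sketch of how the "segments" grammar above might be parsed, the Python below reads SegCount and then SegCount pairs of SegName and SegSize. It assumes the String length prefix counts characters (the text leaves the unit open), and `read_segments` is a hypothetical name.

```python
import struct

def read_vint(data, offset):
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7

def read_string(data, offset):
    # ASSUMPTION: the VInt prefix counts characters, not bytes.
    count, offset = read_vint(data, offset)
    start = offset
    for _ in range(count):
        lead = data[offset]
        offset += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[start:offset].decode("utf-8"), offset

def read_segments(data):
    """Segments --> SegCount, <SegName, SegSize>^SegCount"""
    seg_count = struct.unpack_from(">I", data, 0)[0]
    offset = 4
    segments = []
    for _ in range(seg_count):
        name, offset = read_string(data, offset)
        size = struct.unpack_from(">I", data, offset)[0]
        offset += 4
        segments.append((name, size))
    return segments

# Hand-built sample: two segments, "_1" with 5 documents and "_2" with 7.
sample = (b"\x00\x00\x00\x02"
          + b"\x02_1" + b"\x00\x00\x00\x05"
          + b"\x02_2" + b"\x00\x00\x00\x07")
print(read_segments(sample))
```

The segment names in the sample are made up for the example.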

+ + +
+ + + +

+ Several files are used to indicate that another + process is using an index. +

+ +
    +
  • +

    + When a file named "commit.lock" + is present, a process is currently re-writing the "segments" + file and deleting outdated segment index files, or a process is + reading the "segments" + file and opening the files of the segments it names. This lock file + prevents files from being deleted by another process after a process + has read the "segments" + file but before it has managed to open all of the files of the + segments named therein. +

    +
  • + +
  • +

    + When a file + named "index.lock" + is present, a process is currently adding documents to an index, or + removing files from that index. This lock file prevents several + processes from attempting to modify an index at the same time. +

    +
  • +
+
+ + + +

+ A file named "deletable" + contains the names of files that are no longer used by the index, but + which could not be deleted. This is only generated on Win32, where a + file may not be deleted while it is still open. + 

+ +

+ Deleteable --> DelableCount, + <DelableName>DelableCount +

+ +

DelableCount --> UInt32 +

+

DelableName --> + String +

+
+
+ +
+ +

+ The remaining files are all per-segment, and are + thus defined by suffix. +

+ +


Field Info

+ +

+ Field names are + stored in the field info file, with suffix .fnm. +

+

+ FieldInfos + (.fnm) --> FieldsCount, <FieldName, + FieldBits>FieldsCount +

+ +

+ FieldsCount --> VInt +

+ +

+ FieldName --> String +

+ +

+ FieldBits --> Byte +

+ +

+ Currently only the low-order bit of FieldBits is used. It is + one for + indexed fields, and zero for non-indexed fields. + 

+ +

+ Fields are numbered by their order in this file. Thus field zero is + the + first field in the file, field one the next, and so on. Note that, + like document numbers, field numbers are segment relative. +
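A reader for the .fnm layout might look like the sketch below. It assumes the String length prefix counts characters, and `read_field_infos` is a hypothetical name; a field's number is simply its index in the returned list.

```python
def read_vint(data, offset):
    value, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7

def read_string(data, offset):
    # ASSUMPTION: the VInt prefix counts characters, not bytes.
    count, offset = read_vint(data, offset)
    start = offset
    for _ in range(count):
        lead = data[offset]
        offset += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[start:offset].decode("utf-8"), offset

def read_field_infos(data):
    """FieldInfos --> FieldsCount, <FieldName, FieldBits>^FieldsCount"""
    count, offset = read_vint(data, 0)
    fields = []
    for _ in range(count):
        name, offset = read_string(data, offset)
        indexed = bool(data[offset] & 1)  # only the low-order bit is used
        offset += 1
        fields.append((name, indexed))
    return fields

# Hand-built sample: field 0 is "title" (indexed), field 1 is "url" (not indexed).
sample = b"\x02" + b"\x05title\x01" + b"\x03url\x00"
for number, (name, indexed) in enumerate(read_field_infos(sample)):
    print(number, name, indexed)
```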

+ +


Stored Fields

+ +

+ Stored fields are represented by two files: +

+ +
    +
  1. +

    + The field index, or .fdx file. +

    + +

    + This contains, for each document, a pointer to + its field data, as follows: +

    + +

    + FieldIndex + (.fdx) --> + <FieldValuesPosition>SegSize +

    +

    FieldValuesPosition + --> UInt64 + 

    + 

    This + is used to find the location within the field data file of the + fields of a particular document. Because it contains fixed-length + data, this file may be easily randomly accessed. The position of + document n's field data is the UInt64 at n*8 in + this file. + 

    +
  2. +
  3. +

    + The field data, or .fdt file. + +

    + +

    + This contains the stored fields of each document, + as follows: +

    + +

    + FieldData (.fdt) --> + <DocFieldData>SegSize +

    +

    DocFieldData --> + FieldCount, <FieldNum, Bits, Value>FieldCount +

    +

    FieldCount --> + VInt + 

    +

    FieldNum --> + VInt +

    +

    Bits --> + Byte +

    +

    Value --> + String +

    +

    Currently + only the low-order bit of Bits is used. It is one for + tokenized fields, and zero for non-tokenized fields. + 

    +
  4. +
+ +
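Because every .fdx entry is eight bytes wide, locating a document's stored fields is a single seek. The sketch below illustrates this with an in-memory file; `field_data_position` is a hypothetical name and the sample positions are made up.

```python
import io
import struct

def field_data_position(fdx_file, doc_number):
    """Return the offset in the .fdt file of document doc_number's fields."""
    fdx_file.seek(doc_number * 8)       # entry n starts at byte n*8
    raw = fdx_file.read(8)
    if len(raw) != 8:
        raise IndexError("document number out of range")
    return struct.unpack(">Q", raw)[0]  # UInt64, high-order byte first

# Stand-in for a three-document .fdx file.
fdx = io.BytesIO(struct.pack(">3Q", 0, 17, 40))
print(field_data_position(fdx, 1))
```

With a real index the same function would accept a file opened in binary mode.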
+ + +

+ The term dictionary is represented as two files: +

+
    +
  1. +

    + The term infos, or .tis file. + 

    + +

    + TermInfoFile (.tis)--> + TermCount, TermInfos +

    +

    TermCount --> + UInt32 +

    +

    TermInfos --> + <TermInfo>TermCount +

    +

    TermInfo --> + <Term, DocFreq, FreqDelta, ProxDelta> +

    +

    Term --> + <PrefixLength, Suffix, FieldNum> +

    +

    Suffix --> + String +

    +

    PrefixLength, + DocFreq, FreqDelta, ProxDelta
    --> VInt +

    +

    This + file is sorted by Term. Terms are ordered first lexicographically + by the term's field name, and within that lexicographically by the + term's text. +

    +

    Term + text prefixes are shared. The PrefixLength is the number of initial + characters from the previous term which must be pre-pended to a + term's suffix in order to form the term's text. Thus, if the + previous term's text was "bone" and the term is "boy", + the PrefixLength is two and the suffix is "y". +

    +

    FieldNum + determines the term's field, whose name is stored in the .fnm file. + 

    +

    DocFreq + is the count of documents which contain the term. +

    +

    FreqDelta + determines the position of this term's TermFreqs within the .frq + file. In particular, it is the difference between the position of + this term's data in that file and the position of the previous + term's data (or zero, for the first term in the file). +

    +

    ProxDelta + determines the position of this term's TermPositions within the .prx + file. In particular, it is the difference between the position of + this term's data in that file and the position of the previous + term's data (or zero, for the first term in the file). + 

    +
  2. +
  3. +

    + The term info index, or .tii file. +

    + +

    + This contains every 128th entry from the .tis + file, along with its location in the .tis file. This is + designed to be read entirely into memory and used to provide random + access to the .tis file. + 

    + +

    + The structure of this file is very similar to the + .tis file, with the addition of one item per record, the IndexDelta. +

    + +

    + TermInfoIndex (.tii)--> + IndexTermCount, TermIndices +

    +

    IndexTermCount --> + UInt32 +

    +

    TermIndices --> + <TermInfo, IndexDelta>IndexTermCount +

    +

    IndexDelta --> + VInt +

    +

    IndexDelta + determines the position of this term's TermInfo within the .tis file. In + particular, it is the difference between the position of this term's + entry in that file and the position of the previous term's entry (or + zero for the first term in the file). + 

    +
  4. +
+
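The shared-prefix scheme used for term text can be illustrated with a short Python sketch. It works on already-decoded (PrefixLength, Suffix) pairs and ignores the field number and the delta values; both function names are hypothetical.

```python
def compress_terms(sorted_terms):
    """Turn sorted term texts into (prefix_length, suffix) pairs."""
    entries, previous = [], ""
    for text in sorted_terms:
        shared = 0
        limit = min(len(previous), len(text))
        while shared < limit and previous[shared] == text[shared]:
            shared += 1
        entries.append((shared, text[shared:]))
        previous = text
    return entries

def expand_terms(entries):
    """Rebuild term texts from (prefix_length, suffix) pairs."""
    terms, previous = [], ""
    for prefix_length, suffix in entries:
        # Prepend the first prefix_length characters of the previous term.
        text = previous[:prefix_length] + suffix
        terms.append(text)
        previous = text
    return terms

print(compress_terms(["bone", "boy"]))
print(expand_terms([(0, "bone"), (2, "y")]))
```

Since each entry depends on the one before it, a reader can only start decoding at a point where the previous term is known, which is what the entries in the .tii file provide.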
+ + + +

+ The .frq file contains the lists of documents + which contain each term, along with the frequency of the term in that + document. +

+

FreqFile (.frq) --> + <TermFreqs>TermCount +

+

TermFreqs --> + <TermFreq>DocFreq +

+

TermFreq --> + DocDelta, Freq? +

+

DocDelta,Freq --> + VInt +

+

TermFreqs + are ordered by term (the term is implicit, from the .tis file). +

+

TermFreq + entries are ordered by increasing document number. +

+

DocDelta + determines both the document number and the frequency. In + particular, DocDelta/2 is the difference between this document number + and the previous document number (or zero when this is the first + document in a TermFreqs). When DocDelta is odd, the frequency is + one. When DocDelta is even, the frequency is read as another VInt. +

+

For + example, the TermFreqs for a term which occurs once in document seven + and three times in document eleven would be the following sequence of + VInts: +

+

15, + 8, 3 + 

+
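The DocDelta rule can be sketched in Python as follows. The sketch works on a list of already-decoded VInt values, takes DocFreq from the term dictionary as a parameter, and uses hypothetical function names. Its demonstration uses a different term: one occurrence in document three and two occurrences in document ten.

```python
def encode_term_freqs(postings):
    """postings: list of (document number, frequency), by increasing document."""
    vints, previous_doc = [], 0
    for doc, freq in postings:
        delta = doc - previous_doc
        if freq == 1:
            vints.append(delta * 2 + 1)  # odd: the frequency is one
        else:
            vints.append(delta * 2)      # even: the frequency follows
            vints.append(freq)
        previous_doc = doc
    return vints

def decode_term_freqs(vints, doc_freq):
    """Return doc_freq pairs of (document number, frequency)."""
    values = iter(vints)
    doc, postings = 0, []
    for _ in range(doc_freq):
        doc_delta = next(values)
        doc += doc_delta // 2
        freq = 1 if doc_delta % 2 == 1 else next(values)
        postings.append((doc, freq))
    return postings

encoded = encode_term_freqs([(3, 1), (10, 2)])
print(encoded)
print(decode_term_freqs(encoded, 2))
```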
+ + +

+ The .prx file contains the lists of positions that + each term occurs at within documents. +

+

ProxFile (.prx) --> + <TermPositions>TermCount +

+

TermPositions --> + <Positions>DocFreq +

+

Positions --> + <PositionDelta>Freq +

+

PositionDelta --> + VInt +

+

TermPositions + are ordered by term (the term is implicit, from the .tis file). +

+

Positions + entries are ordered by increasing document number (the document + number is implicit from the .frq file). +

+

PositionDelta + is the difference between the position of the current occurrence in + the document and the previous occurrence (or zero, if this is the + first occurrence in this document). +

+

+ For example, the TermPositions for a + term which occurs as the fourth term in one document, and as the + fifth and ninth term in a subsequent document, would be the following + sequence of VInts: +

+

4, + 5, 4 +

+
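Decoding a term's positions needs the per-document frequencies from the .frq file, since they say how many PositionDelta values belong to each document. A minimal sketch, with a hypothetical function name and already-decoded VInt values as input:

```python
def decode_positions(vints, freqs):
    """vints: PositionDelta values for one term.
    freqs: the term's frequency in each document, in .frq order.
    Returns one list of positions per document."""
    values = iter(vints)
    result = []
    for freq in freqs:
        position, positions = 0, []  # the running position restarts per document
        for _ in range(freq):
            position += next(values)
            positions.append(position)
        result.append(positions)
    return result

# Fourth term in one document; fifth and ninth term in the next.
print(decode_positions([4, 5, 4], [1, 2]))
```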
+ +

The .nrm file contains, + for each document, a byte that encodes a value that is multiplied + into the score for hits on that field: +

+

Norms + (.nrm) --> <Byte>SegSize +

+

Each + byte encodes a floating point value. Bits 0-2 contain the 3-bit + mantissa, and bits 3-7 contain the 5-bit exponent. + 

+

These + are converted to an IEEE single float value as follows: +

+
    +
  1. If + the byte is zero, use a zero float. +

    +
  2. +
  3. Otherwise, + set the sign bit of the float to zero; +

    +
  4. +
  5. add + 48 to the exponent and use this as the float's exponent; +

    +
  6. +
  7. map + the mantissa to the high-order 3 bits of the float's mantissa; and + +

    +
  8. +
  9. set + the low-order 21 bits of the float's mantissa to zero. +

    +
  10. +
+ +
+ + + +

The .del file is + optional, and only exists when a segment contains deletions: +

+ +

Deletions + (.del) --> ByteCount,BitCount,Bits +

+ +

ByteCount,BitCount --> + UInt32 + 

+ +

Bits --> + <Byte>ByteCount +

+ +

ByteCount + indicates the number of bytes in Bits. It is typically + (SegSize/8)+1. +

+ +

+ BitCount + indicates the number of bits that are currently set in Bits. +

+ +

Bits + contains one bit for each document indexed. When the bit + corresponding to a document number is set, that document is marked as + deleted. Bit ordering is from least to most significant. Thus, if + Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as + deleted. +

+
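The .del layout and its bit ordering can be illustrated as below. This is a sketch under the grammar given above, with hypothetical function names; it treats the two counts as four-byte, high-order-byte-first integers.

```python
import struct

def read_deletions(data):
    """Deletions --> ByteCount, BitCount, Bits. Returns (bit_count, bits)."""
    byte_count, bit_count = struct.unpack_from(">II", data, 0)
    bits = data[8:8 + byte_count]
    return bit_count, bits

def is_deleted(bits, doc_number):
    # Document n is bit (n mod 8) of byte (n div 8), least significant bit first.
    return bool((bits[doc_number >> 3] >> (doc_number & 7)) & 1)

def deleted_documents(bits):
    return [n for n in range(len(bits) * 8) if is_deleted(bits, n)]

# Two bytes of Bits, 0x00 and 0x02, with one bit set.
sample = struct.pack(">II", 2, 1) + bytes([0x00, 0x02])
bit_count, bits = read_deletions(sample)
print(bit_count, deleted_documents(bits))
```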
+
+ +
+ +

There + are a few places where these file formats limit the maximum number of + terms and documents to a 32-bit quantity, or to approximately 4 + billion. This is not a problem today, but in the long term it + probably will be. These should therefore be replaced with either + UInt64 values, or better yet, with VInt values which have no limit. + 

+

There + are only two places where the code requires that a value be fixed + size. These are: +

+
    +
  1. + The FieldValuesPosition (in the stored field index file, .fdx). + This already uses a UInt64, and so is not a problem. +

  2. +
  3. The + TermCount (in the term info file, .tis). This is written last but + is read when the file is first opened, and so is stored at the + front. The indexing code first writes a zero here, then overwrites + it after the rest of the file has been written. So unless this is + stored elsewhere, it must be fixed size and should be changed to a + UInt64. + 

    +
  4. +
+

Other + than these, all UInt values could be converted to VInt to remove + limitations. +

+



+ +

+
+ + + +