From d634ccf4e933374ea0cba98a986fd602d8a25c72 Mon Sep 17 00:00:00 2001
From: Michael McCandless
This document defines the index file formats used
- in Lucene version 2.0. If you are using a different
+ in Lucene version 2.1. If you are using a different
version of Lucene, please consult the copy of
docs/fileformats.html
that was distributed
with the version you are using.
+ In version 2.1, the file format was changed to allow
+ lock-less commits (i.e., no more commit lock). The
+ change is fully backwards compatible: you can open a
+ pre-2.1 index for searching or adding/deleting of
+ docs. When the new segments file is saved
+ (committed), it will be written in the new file format
+ (meaning no specific "upgrade" process is needed).
+ But note that once a commit has occurred, pre-2.1
+ Lucene will not be able to read the index.
+
@@ -143,6 +143,17 @@ limitations under the License.
Compatibility notes are provided in this document,
describing how file formats have changed from prior versions.
+ As of version 2.1 (lock-less commits), file names are
+ never re-used (there is one exception, "segments.gen",
+ see below). That is, when any file is saved to the
+ Directory it is given a never before used filename.
+ This is achieved using a simple generations approach.
+ For example, the first segments file is segments_1,
+ then segments_2, etc. The generation is a sequential
+ long integer represented in alpha-numeric (base 36)
+ form.
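The generations naming scheme described above can be sketched in a few lines of Java. This is only an illustration: the class `GenerationNames` and the method `segmentsFileName` are hypothetical names, not part of Lucene's API, and the check that generations start at 1 is an assumption drawn from the "segments_1" example.

```java
// Hypothetical helper for illustration only; not Lucene's own code.
final class GenerationNames {
    private GenerationNames() {}

    /**
     * Builds the segments file name for a generation, writing the
     * generation in base 36 (digits 0-9, then a-z).
     */
    static String segmentsFileName(long generation) {
        // Assumption: generations start at 1, as in the "segments_1" example.
        if (generation < 1) {
            throw new IllegalArgumentException("generation must be >= 1: " + generation);
        }
        // Character.MAX_RADIX is 36.
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }
}
```

With this sketch, generations 1 through 9 give segments_1 through segments_9, generation 10 gives segments_a, and generation 36 gives segments_10.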
@@ -1080,25 +1102,53 @@ limitations under the License.
@@ -1121,42 +1200,31 @@ limitations under the License.
The active segments in the index are stored in the
- segment info file. An index only has
- a single file in this format, and it is named "segments".
- This lists each segment by name, and also contains the size of each
- segment.
+ segment info file, segments_N. There may
+ be one or more segments_N files in the
+ index; however, the one with the largest
+ generation is the active one (when older
+ segments_N files are present it's because they
+ temporarily cannot be deleted, or, a writer is in
+ the process of committing). This file lists each
+ segment by name, has details about the separate
+ norms and deletion files, and also contains the
+ size of each segment.
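Picking the active segments_N file from a directory listing could look roughly like the sketch below. The class `ActiveSegments` and the method `currentGeneration` are hypothetical names for illustration, and returning -1 when no segments_N file is present is an assumption of this sketch, not documented behavior.

```java
import java.util.List;

// Hypothetical helper for illustration only; not Lucene's own code.
final class ActiveSegments {
    private static final String PREFIX = "segments_";

    private ActiveSegments() {}

    /**
     * Returns the largest generation found among segments_N names,
     * or -1 if the listing contains none (an assumption of this sketch).
     */
    static long currentGeneration(List<String> fileNames) {
        long max = -1;
        for (String name : fileNames) {
            // Skips "segments.gen", a pre-2.1 "segments" file, and all other files.
            if (!name.startsWith(PREFIX)) {
                continue;
            }
            try {
                // The N suffix is the generation written in base 36.
                long gen = Long.parseLong(name.substring(PREFIX.length()), Character.MAX_RADIX);
                max = Math.max(max, gen);
            } catch (NumberFormatException e) {
                // Not a valid generation suffix; ignore this file.
            }
        }
        return max;
    }
}
```

For a listing containing segments_1, segments_a and segments_z, the sketch returns 35, because "z" is 35 in base 36.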
+ As of 2.1, there is also a file
+ segments.gen. This file contains the
+ current generation (the _N in
+ segments_N) of the index. This is
+ used only as a fallback in case the current
+ generation cannot be accurately determined by
+ directory listing alone (as is the case for some
+ NFS clients with time-based directory cache
+ expiration). This file simply contains an Int32
+ version header (SegmentInfos.FORMAT_LOCKLESS =
+ -2), followed by the generation recorded as Int64,
+ written twice.
+
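The segments.gen layout described above (an Int32 header followed by the generation as Int64, written twice) might be read as in the following sketch. Several things here are assumptions rather than statements from this patch: that values are stored high-order byte first (which is what `DataInputStream` expects), that the two copies exist so a reader can detect a partial write, and that -1 is a sensible "unusable" result. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch for illustration only; not Lucene's own code.
final class SegmentsGenSketch {
    // Header value named in the text (SegmentInfos.FORMAT_LOCKLESS).
    static final int FORMAT_LOCKLESS = -2;

    private SegmentsGenSketch() {}

    /**
     * Returns the generation stored in the given segments.gen bytes, or -1 if
     * the header is unexpected, the two copies differ, or the data is truncated.
     */
    static long readGeneration(byte[] contents) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(contents))) {
            if (in.readInt() != FORMAT_LOCKLESS) {
                return -1;
            }
            long first = in.readLong();
            long second = in.readLong();
            // Assumption: a mismatch means the file was caught mid-write.
            return first == second ? first : -1;
        } catch (IOException e) {
            // Includes EOFException for a truncated file.
            return -1;
        }
    }

    /** Produces the 20-byte layout: Int32 header, then the Int64 generation twice. */
    static byte[] encode(long generation) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(FORMAT_LOCKLESS);
            out.writeLong(generation);
            out.writeLong(generation);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Because the file is described only as a fallback, a caller would presumably ignore an unusable result and rely on the directory listing instead.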
+
+ Pre-2.1: Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
- Format, NameCounter, SegCount, SegSize --> UInt32
+ 2.1 and above:
+ Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, NumField, NormGen^NumField>^SegCount, IsCompoundFile
- Version --> UInt64
+ Format, NameCounter, SegCount, SegSize, NumField --> Int32
+
+
+ Version, DelGen, NormGen --> Int64
SegName --> String
- Format is -1 in Lucene 1.4.
+ IsCompoundFile --> Int8
+
+
+ Format is -1 as of Lucene 1.4 and -2 as of Lucene 2.1.
Version counts how often the index has been
@@ -1113,6 +1163,35 @@ limitations under the License.
SegSize is the number of documents contained in the segment index.
+
+
+ DelGen is the generation count of the separate
+ deletes file. If this is -1, there are no
+ separate deletes. If it is 0, this is a pre-2.1
+ segment and you must check filesystem for the
+ existence of _X.del. Anything above zero means
+ there are separate deletes (_X_N.del).
+
+
+ NumField is the size of the array for NormGen, or
+ -1 if there are no NormGens stored.
+
+
+ NormGen records the generation of the separate
+ norms files. If NumField is -1, there are no
+ normGens stored and they are all assumed to be 0
+ when the segment file was written pre-2.1 and all
+ assumed to be -1 when the segments file is 2.1 or
+ above. The generation then has the same meaning
+ as delGen (above).
+
+
+ IsCompoundFile records whether the segment is
+ written as a compound file or not. If this is -1,
+ the segment is not a compound file. If it is 1,
+ the segment is a compound file. Else it is 0,
+ which means we check filesystem to see if _X.cfs
+ exists.
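The three-way meaning of DelGen and IsCompoundFile described above can be made explicit with a small sketch. The names `SegmentFlags`, `deletesFileName`, `Compound` and `compound` are hypothetical, the use of base 36 for the N in _X_N.del is an assumption carried over from the generations naming scheme, and rejecting values other than -1, 0 and 1 is a choice made for this sketch only.

```java
// Hypothetical helper for illustration only; not Lucene's own code.
final class SegmentFlags {
    /** The three states encoded by the IsCompoundFile byte. */
    enum Compound { YES, NO, CHECK_FILESYSTEM }

    private SegmentFlags() {}

    /**
     * Maps DelGen to a deletes file name for segment name X (for example "_3").
     * Returns null when there are no separate deletes. For DelGen == 0 the
     * returned pre-2.1 name may not exist, so the caller must check the filesystem.
     */
    static String deletesFileName(String segName, long delGen) {
        if (delGen == -1) {
            return null;                  // no separate deletes
        }
        if (delGen == 0) {
            return segName + ".del";      // pre-2.1 segment: existence must be checked
        }
        // Assumption: the generation suffix is written in base 36.
        return segName + "_" + Long.toString(delGen, Character.MAX_RADIX) + ".del";
    }

    /** Decodes the IsCompoundFile byte into one of its three states. */
    static Compound compound(byte isCompoundFile) {
        switch (isCompoundFile) {
            case 1:
                return Compound.YES;
            case -1:
                return Compound.NO;
            case 0:
                return Compound.CHECK_FILESYSTEM; // look for _X.cfs on disk
            default:
                // Assumption: any other value is treated as invalid in this sketch.
                throw new IllegalArgumentException("unexpected IsCompoundFile: " + isCompoundFile);
        }
    }
}
```

Since NormGen is said to carry the same meaning as DelGen, the same three-way test would apply to each norms generation.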
- Lock Files
+ Lock File