diff --git a/lucene/site/html/fileformats.html b/lucene/site/html/fileformats.html
index 3ce4940caee..b1608471fa0 100644
--- a/lucene/site/html/fileformats.html
+++ b/lucene/site/html/fileformats.html
@@ -16,19 +16,19 @@

Apache Lucene - Index File Formats

- +

Index File Formats

This document defines the index file formats used in this version of Lucene.
@@ -129,14 +129,14 @@ frequencies.

The same string in two different fields is considered a different term. Thus terms are represented as a pair of strings, the first naming the field, and the second naming text within the field.

- +

Inverted Indexing

The index stores statistics about terms in order to make term-based search more efficient. Lucene's index belongs to the family of indexes known as inverted indexes, because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.
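The inverted relationship described above can be illustrated with a minimal sketch. It is not Lucene's implementation: the function name `build_inverted_index`, the list-of-dicts document shape, and the lowercase whitespace tokenization are assumptions made for illustration. It does follow the text in keying each term by a (field, text) pair and in numbering documents from zero.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each (field, term text) pair to the sorted document numbers containing it.

    `docs` is a list of dicts mapping field name to field text (hypothetical shape).
    Document numbers are assigned in insertion order, starting at zero.
    """
    postings = defaultdict(set)
    for doc_number, fields in enumerate(docs):
        for field, text in fields.items():
            # Assumption: naive lowercase + whitespace tokenization.
            for token in text.lower().split():
                postings[(field, token)].add(doc_number)
    return {term: sorted(doc_numbers) for term, doc_numbers in postings.items()}


docs = [
    {"title": "lucene index", "body": "index files"},
    {"title": "index"},
]
index = build_inverted_index(docs)
# The same string in two different fields is a different term.
print(index[("title", "index")])
print(index[("body", "index")])
```

Keying on the pair rather than the bare string is what keeps "index" in the title field separate from "index" in the body field.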

- +

Types of Fields

In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are
@@ -145,7 +145,7 @@ called indexed. A field may be both stored and indexed.

text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.

-See the Field
+See the Field
 java docs for more information on Fields.

Segments

@@ -162,7 +162,7 @@ Indexes evolve by:

Searches may involve multiple segments and/or multiple indexes, each index potentially composed of a set of segments.

- +

Document Numbers

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent
@@ -231,7 +231,7 @@ that is multiplied into the score for hits on that field.

Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors

+"core/org/apache/lucene/document/Field.html">Field constructors

  • Deleted documents. An optional file indicating which documents are
@@ -240,7 +240,7 @@ deleted.

    Details on each of these are provided in subsequent sections.

  • - +

    File Naming

    All files belonging to a segment have the same name with varying extensions.
    @@ -268,24 +268,24 @@ Lucene:

    Brief Description
    -Segments File
    +Segments File
    segments.gen, segments_N
    Stores information about segments
    -Lock File
    +Lock File
    write.lock
    The Write lock prevents multiple IndexWriters from writing to the same file.
    -Compound File
    +Compound File
    .cfs
    An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
    -Compound File Entry table
    +Compound File Entry table
    .cfe
    The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)
    @@ -326,7 +326,7 @@ corresponding .cfs file (Since 3.4)
    Stores position information about where a term occurs in the index
    -Norms
    +Norms
    .nrm
    Encodes length and boost factors for docs and fields
    @@ -346,13 +346,13 @@ corresponding .cfs file (Since 3.4)
    The field level info about term vectors
    -Deleted Documents
    +Deleted Documents
    .del
    Info about which documents are deleted
    - +

    Primitive Types

    @@ -590,7 +590,7 @@ byte, values from 128 to 16,383 may be stored in two bytes, and so on.

    written as a VInt, followed by the bytes.

    String --> VInt, Chars
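The VInt and String rules above can be sketched as follows. This is an illustrative sketch, not Lucene's code: the function names are hypothetical, and it assumes seven payload bits per byte with the low-order bits first and the high bit marking a continuation, plus standard UTF-8 for the string bytes.

```python
def write_vint(value):
    """Encode a non-negative integer as a VInt (assumed layout: 7 bits per byte,
    low-order bits first, high bit set when more bytes follow)."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def read_vint(data, pos=0):
    """Decode a VInt starting at `pos`; return (value, position after the VInt)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def write_string(text):
    """Encode a string as its byte length (a VInt) followed by the bytes.
    Assumption: the bytes are standard UTF-8."""
    encoded = text.encode("utf-8")
    return write_vint(len(encoded)) + encoded


# Values up to 127 fit in one byte; 128 to 16,383 need two.
print(len(write_vint(127)), len(write_vint(128)), len(write_vint(16383)))
```

Under these assumptions the encoded length grows by one byte for every additional seven bits of magnitude, which matches the byte ranges quoted in the text.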

    - +

    Compound Types

    @@ -599,18 +599,18 @@ id="MapStringString">

    Map<String,String> --> Count<String,String>^Count

    - +

    Per-Index Files

    The files in this section exist one-per-index.

    - +

    Segments File

    The active segments in the index are stored in the segment info file, segments_N. There may be one or more segments_N files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or, a writer is in the process of committing, or a custom IndexDeletionPolicy
    +"core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy
    is in use). This file lists each segment by name, has details about the separate norms and deletion files, and also contains the size of each segment.
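Picking the active segments_N file by largest generation could look like the sketch below. The function name `active_segments_file` is hypothetical, and the code assumes that the N suffix is the generation written in base 36 with lowercase digits, which the text above does not state.

```python
import re

# Assumption: the suffix after "segments_" is a base-36 generation number.
_SEGMENTS_RE = re.compile(r"^segments_([0-9a-z]+)$")


def active_segments_file(file_names):
    """Return the segments_N name with the largest generation, or None if absent."""
    best_name = None
    best_generation = -1
    for name in file_names:
        match = _SEGMENTS_RE.match(name)
        if match is None:
            continue  # skips segments.gen, write.lock, and per-segment files
        generation = int(match.group(1), 36)
        if generation > best_generation:
            best_name = name
            best_generation = generation
    return best_name


listing = ["segments.gen", "segments_9", "segments_a", "_0.cfs", "write.lock"]
print(active_segments_file(listing))
```

Comparing parsed generations rather than file names matters because a plain string comparison would order the suffixes lexicographically rather than numerically.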

    @@ -687,7 +687,7 @@ for each segment it creates. It includes metadata like the current Lucene version, OS, Java version, why the segment was created (merge, flush, addIndexes), etc.

    HasVectors is 1 if this segment stores term vectors, else it's 0.

    - +

    Lock File

    The write lock, which is stored in the index directory by default, is named "write.lock". If the lock directory is different from the index directory then
    @@ -695,11 +695,11 @@ the write lock will be named "XXXX-write.lock" where XXXX is a unique prefix derived from the full path to the index directory. When this file is present, a writer is currently modifying the index (adding or removing documents). This lock file ensures that only one writer is modifying the index at a time.

    - +

    Deletable File

    A writer dynamically computes which files are deletable, so no such file is written.

    - +

    Compound Files

    Starting with Lucene 1.4 the compound file format became default. This is simply a container for all files described in the next section (except for the
    @@ -719,7 +719,7 @@ vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.

    - +

    Per-Segment Files

    The remaining files are all per-segment, and are thus defined by suffix.

    @@ -797,7 +797,7 @@ Lucene version 2.9.x

    ValueSize --> VInt

    - +

    Term Dictionary

    The term dictionary is represented as two files:

      @@ -971,7 +971,7 @@ be the following sequence of VInts (payloads disabled):

      PayloadLength is stored at the current position, then it indicates the length of this Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position.

      - +

      Normalization Factors

      There's a single .nrm file containing all norms:

      AllNorms (.nrm) --> NormsHeader,<Norms>
      @@ -1006,7 +1006,7 @@ are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field.

      Separate norm files are created (when adequate) for both compound and non compound segments.

      - +

      Term Vectors

      Term Vector support is optional on a field-by-field basis. It consists of 3 files.

      @@ -1071,7 +1071,7 @@ startOffset, the second is the endOffset.
    - +

    Deleted Documents

    The .del file is optional, and only exists when a segment contains deletions.