diff --git a/lucene/site/html/fileformats.html b/lucene/site/html/fileformats.html
index 3ce4940caee..b1608471fa0 100644
--- a/lucene/site/html/fileformats.html
+++ b/lucene/site/html/fileformats.html
@@ -16,19 +16,19 @@
This document defines the index file formats used in this version of Lucene.
@@ -129,14 +129,14 @@ frequencies.
The same string in two different fields is considered a different term. Thus terms are represented as a pair of strings, the first naming the field, and the second naming text within the field.
-
+
The index stores statistics about terms in order to make term-based search more efficient. Lucene's index falls into the family of indexes known as an inverted index. This is because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.
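The inversion described above can be illustrated with a minimal sketch. This is a toy in-memory structure, not Lucene's on-disk format or API; the class `ToyInvertedIndex` and its methods are hypothetical names chosen for illustration. It treats a term as a (field, text) pair and assumes documents arrive already tokenized.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, for illustration): maps (field, text) terms to document numbers. */
final class ToyInvertedIndex {
    // field name -> term text -> sorted numbers of the documents containing that term
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();
    private int nextDocNumber = 0; // the first document added is numbered zero

    /** Adds one document (field name -> already-tokenized terms) and returns its document number. */
    int addDocument(Map<String, List<String>> fields) {
        int doc = nextDocNumber++;
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            Map<String, SortedSet<Integer>> byText =
                postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String text : field.getValue()) {
                byText.computeIfAbsent(text, k -> new TreeSet<>()).add(doc);
            }
        }
        return doc;
    }

    /** Lists, for the term (field, text), the documents that contain it; empty if the term is unknown. */
    SortedSet<Integer> documentsFor(String field, String text) {
        Map<String, SortedSet<Integer>> byText = postings.get(field);
        if (byText == null) {
            return new TreeSet<>();
        }
        SortedSet<Integer> docs = byText.get(text);
        return docs == null ? new TreeSet<>() : docs;
    }
}
```

Because the field name is part of the lookup key, the same string in two different fields resolves to two independent document lists.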
-
+
In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are
@@ -145,7 +145,7 @@ called indexed. A field may be both stored and indexed.
text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
-See the Field
+
See the Field java docs for more information on Fields.
Searches may involve multiple segments and/or multiple indexes, each index potentially composed of a set of segments.
-
+
Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent
@@ -231,7 +231,7 @@ that is multiplied into the score for hits on that field.
Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors
+"core/org/apache/lucene/document/Field.html">Field constructors
Deleted documents. An optional file indicating which documents are
@@ -240,7 +240,7 @@ deleted.
Details on each of these are provided in subsequent sections.
All files belonging to a segment have the same name with varying extensions.
@@ -268,24 +268,24 @@ Lucene:
String --> VInt, Chars
The files in this section exist one-per-index.
-
+
The active segments in the index are stored in the segment info file, segments_N. There may be one or more segments_N files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or, a writer is in the process of committing, or a custom IndexDeletionPolicy
+"core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy
is in use). This file lists each segment by name, has details about the separate norms and deletion files, and also contains the size of each segment.
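As a rough illustration of the "largest generation wins" rule, the sketch below picks the active file from a directory listing. The class `ActiveSegmentsFile` and its methods are hypothetical, and the code assumes the generation suffix N is written in base 36 (`Character.MAX_RADIX`), which this section does not state; adjust the radix if that assumption does not hold for your index.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: picks the segments_N file with the largest generation. */
final class ActiveSegmentsFile {
    private static final String PREFIX = "segments_";

    /** Parses N out of "segments_N". Assumption: N is encoded in base 36. */
    static long generationOf(String fileName) {
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    /** Returns the name with the largest generation, or empty if no segments_N file is listed. */
    static Optional<String> pickActive(List<String> fileNames) {
        String best = null;
        long bestGeneration = -1;
        for (String name : fileNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not a segments_N file
            }
            long generation;
            try {
                generation = generationOf(name);
            } catch (NumberFormatException e) {
                continue; // suffix is not a valid generation; ignore the file
            }
            if (generation > bestGeneration) {
                bestGeneration = generation;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Older segments_N files that have not yet been deleted are simply outranked by the newest one, so no extra bookkeeping is needed to ignore them.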
@@ -687,7 +687,7 @@ for each segment it creates. It includes metadata like the current Lucene version, OS, Java version, why the segment was created (merge, flush, addIndexes), etc.
HasVectors is 1 if this segment stores term vectors, else it's 0.
-
+
The write lock, which is stored in the index directory by default, is named "write.lock". If the lock directory is different from the index directory then
@@ -695,11 +695,11 @@ the write lock will be named "XXXX-write.lock" where XXXX is a unique prefix derived from the full path to the index directory. When this file is present, a writer is currently modifying the index (adding or removing documents). This lock file ensures that only one writer is modifying the index at a time.
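The idea of a lock file whose presence marks an active writer can be sketched as follows. This is not Lucene's lock implementation; `ToyWriteLock` is a hypothetical class that relies on `java.nio.file.Files.createFile`, which fails atomically when the file already exists. It covers only the default case where the lock lives in the index directory, since the derivation of the "XXXX" prefix is not specified here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a presence-based write lock stored in the index directory. */
final class ToyWriteLock {
    static final String LOCK_NAME = "write.lock";

    /** Tries to take the lock; returns false if the lock file already exists. */
    static boolean tryObtain(Path indexDir) {
        try {
            // createFile is an atomic check-and-create: it throws if the file is already there
            Files.createFile(indexDir.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another writer is modifying the index
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the lock file, if present. */
    static void release(Path indexDir) {
        try {
            Files.deleteIfExists(indexDir.resolve(LOCK_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A second call to `tryObtain` on the same directory returns false until `release` is called, which is the "only one writer at a time" guarantee in miniature.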
-
+
A writer dynamically computes the files that are deletable, instead, so no file is written.
-
+
Starting with Lucene 1.4 the compound file format became default. This is simply a container for all files described in the next section (except for the
@@ -719,7 +719,7 @@ vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx.
The remaining files are all per-segment, and are thus defined by suffix.
@@ -797,7 +797,7 @@ Lucene version 2.9.x
ValueSize --> VInt
-
+
The term dictionary is represented as two files:
There's a single .nrm file containing all norms:
AllNorms (.nrm) --> NormsHeader,<Norms>
@@ -1006,7 +1006,7 @@ are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field.
Separate norm files are created (when adequate) for both compound and non compound segments.
-
+
Term Vector support is optional on a field-by-field basis. It consists of 3 files.
@@ -1071,7 +1071,7 @@ startOffset, the second is the endOffset.
The .del file is optional, and only exists when a segment contains deletions.