From 615375d45a9da435aa9e3ca83197e79311169dc2 Mon Sep 17 00:00:00 2001
From: Simon Willnauer
Date: Mon, 27 Jun 2011 17:12:52 +0000
Subject: [PATCH] LUCENE-3247: Update CompoundFile format on the website

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1140243 13f79535-47bb-0310-9956-ffa450edef68
---
 lucene/src/site/build/site/fileformats.html | 64 +++++++++++--------
 .../content/xdocs/fileformats.xml | 18 ++++--
 2 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/lucene/src/site/build/site/fileformats.html b/lucene/src/site/build/site/fileformats.html
index 8d563031918..ef91a18e36f 100644
--- a/lucene/src/site/build/site/fileformats.html
+++ b/lucene/src/site/build/site/fileformats.html
@@ -728,6 +728,14 @@ document.write("Last Published: " + document.lastModified);
 that frequently run out of file handles.
+
+
+
+Compound File Entry table
+ .cfe
+ The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)
+
+
@@ -832,10 +840,10 @@ document.write("Last Published: " + document.lastModified);
- +

Primitive Types

- +

Byte

The most primitive type
@@ -843,7 +851,7 @@ document.write("Last Published: " + document.lastModified);
other data types are defined as sequences of bytes, so file formats are byte-order independent.

- +

UInt32

32-bit unsigned integers are written as four
@@ -853,7 +861,7 @@ document.write("Last Published: " + document.lastModified);
UInt32 --> <Byte>4

- +

UInt64

64-bit unsigned integers are written as eight
@@ -862,7 +870,7 @@ document.write("Last Published: " + document.lastModified);

UInt64 --> <Byte>8
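
As a rough illustration of the UInt32 and UInt64 definitions above, the sketch below decodes both from a byte array. It assumes the high-order byte is stored first, which is how Lucene's stock readers and writers handle fixed-width integers. The class and method names are hypothetical and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: decodes the fixed-width types
// described above, assuming the high-order byte is stored first.
final class FixedWidthSketch {

    // UInt32 --> <Byte>4. The result is returned in a long so that values
    // above Integer.MAX_VALUE stay positive.
    static long readUInt32(byte[] buf, int pos) {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (buf[pos + i] & 0xFFL);
        }
        return value;
    }

    // UInt64 --> <Byte>8. Java has no unsigned 64-bit primitive, so values
    // at or above 2^63 come back negative; the bit pattern is still exact.
    static long readUInt64(byte[] buf, int pos) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buf[pos + i] & 0xFFL);
        }
        return value;
    }
}
```

Because every type is defined in terms of bytes in a fixed order, the same decoder works regardless of the byte order of the machine that wrote the file.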

- +

VInt

A variable-length format for positive integers is
@@ -1412,13 +1420,13 @@ document.write("Last Published: " + document.lastModified);
This provides compression while still being efficient to decode.
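
The part of the VInt description elided by this hunk defines the byte layout. As a minimal sketch, assuming the usual Lucene layout (seven payload bits per byte, low-order group first, high bit set when another byte follows), encoding and decoding could look like this. `VIntSketch` is a hypothetical name, not a Lucene class.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of Lucene. Assumed layout: each byte carries
// seven bits of the value, low-order group first; the high bit of a byte is
// set when at least one more byte follows.
final class VIntSketch {

    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit seven bits at a time while bits above the low seven remain.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    static int decode(byte[] buf, int pos) {
        byte b = buf[pos++];
        int value = b & 0x7F;
        // Keep reading while the continuation bit is set.
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

Under this layout, values below 128 take a single byte, which is where the compression comes from when most stored numbers are small.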

- +

Chars

Lucene writes Unicode character sequences as UTF-8 encoded bytes.

- +

String

Lucene writes strings as UTF-8 encoded bytes.
@@ -1431,10 +1439,10 @@ document.write("Last Published: " + document.lastModified);

- +

Compound Types

- +

Map<String,String>

In a couple of places Lucene stores a Map
@@ -1447,13 +1455,13 @@ document.write("Last Published: " + document.lastModified);

- +

Per-Index Files

The files in this section exist one-per-index.

- +

Segments File

The active segments in the index are stored in the
@@ -1626,7 +1634,7 @@ document.write("Last Published: " + document.lastModified);

HasVectors is 1 if this segment stores term vectors, else it's 0.

- +

Lock File

The write lock, which is stored in the index
@@ -1640,27 +1648,29 @@ document.write("Last Published: " + document.lastModified);
documents). This lock file ensures that only one writer is modifying the index at a time.

- +

Deletable File

Instead, a writer dynamically computes which files are deletable, so no file is written.

- +

Compound Files

Starting with Lucene 1.4 the compound file format became the default. This is simply a container for all files described in the next section (except for the .del file).

-Compound (.cfs) --> FileCount, <DataOffset, FileName> FileCount, FileData

+Compound Entry Table (.cfe) --> Version, FileCount, <FileName, DataOffset, DataLength> FileCount

+Compound (.cfs) --> FileData FileCount
+
+

+

Version --> Int

FileCount --> VInt

DataOffset --> Long

+

DataLength --> Long

FileName --> String

FileData --> raw file data

The raw file data is the data from the individual files named above.
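
To make the new .cfe layout concrete, here is a minimal sketch of a reader for the entry table: a Version, a FileCount, then FileCount repetitions of FileName, DataOffset, DataLength. Several details are assumptions rather than statements from this patch: Int and Long are taken to be four and eight bytes with the high-order byte first, a String is taken to be a VInt byte count followed by that many UTF-8 bytes, and Version is read but not interpreted. `CfeSketch` and `Entry` are hypothetical names, not Lucene classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader, not part of Lucene, for the layout
//   .cfe --> Version, FileCount, <FileName, DataOffset, DataLength> x FileCount
final class CfeSketch {

    static final class Entry {
        final String fileName;
        final long dataOffset; // where this file's data starts in the .cfs
        final long dataLength; // how many bytes of the .cfs belong to it

        Entry(String fileName, long dataOffset, long dataLength) {
            this.fileName = fileName;
            this.dataOffset = dataOffset;
            this.dataLength = dataLength;
        }
    }

    // Assumed VInt layout: seven bits per byte, low-order group first,
    // high bit set when another byte follows.
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Assumed String layout: VInt byte count, then UTF-8 bytes.
    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[readVInt(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    static List<Entry> readEntryTable(byte[] cfe) {
        ByteBuffer in = ByteBuffer.wrap(cfe); // big-endian by default
        in.getInt();                          // Version, not interpreted here
        int fileCount = readVInt(in);
        List<Entry> entries = new ArrayList<>(fileCount);
        for (int i = 0; i < fileCount; i++) {
            String name = readString(in);
            long offset = in.getLong();
            long length = in.getLong();
            entries.add(new Entry(name, offset, length));
        }
        return entries;
    }
}
```

With the table loaded, a reader can locate any file by seeking to its DataOffset in the .cfs and reading DataLength bytes. The .cfs no longer needs a header of its own.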

@@ -1674,14 +1684,14 @@ document.write("Last Published: " + document.lastModified);
- +

Per-Segment Files

The remaining files are all per-segment, and are thus defined by suffix.

- +

Fields

@@ -1891,7 +1901,7 @@ document.write("Last Published: " + document.lastModified);
- +

Term Dictionary

The term dictionary is represented as two files:
@@ -2083,7 +2093,7 @@ document.write("Last Published: " + document.lastModified);
- +

Frequencies

The .frq file contains the lists of documents
@@ -2211,7 +2221,7 @@ document.write("Last Published: " + document.lastModified);
entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.

- +

Positions

The .prx file contains the lists of positions that
@@ -2281,7 +2291,7 @@ document.write("Last Published: " + document.lastModified);
Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position.

- +

Normalization Factors

There's a single .nrm file containing all norms:

@@ -2361,7 +2371,7 @@ document.write("Last Published: " + document.lastModified);

Separate norm files are created (when adequate) for both compound and non-compound segments.

- +

Term Vectors

Term Vector support is optional on a field by
@@ -2497,7 +2507,7 @@ document.write("Last Published: " + document.lastModified);
- +

Deleted Documents

The .del file is optional, and only exists when a segment contains deletions.
@@ -2561,7 +2571,7 @@ document.write("Last Published: " + document.lastModified);

- +

Limitations

diff --git a/lucene/src/site/src/documentation/content/xdocs/fileformats.xml b/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
index 1f797c33f2f..090c32a5c1f 100644
--- a/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
+++ b/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
@@ -365,6 +365,11 @@
 .cfs
 An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
+
+
+ Compound File Entry table
+ .cfe
+ The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)
 Fields
@@ -1129,17 +1134,20 @@

Starting with Lucene 1.4 the compound file format became the default. This is simply a container for all files described in the next section (except for the .del file).

-
-Compound (.cfs) --> FileCount, <DataOffset, FileName> FileCount, FileData

+Compound Entry Table (.cfe) --> Version, FileCount, <FileName, DataOffset, DataLength> FileCount

+Compound (.cfs) --> FileData FileCount
+

+ +

Version --> Int

+

FileCount --> VInt

DataOffset --> Long

+ +

DataLength --> Long

FileName --> String