diff --git a/lucene/src/site/build/site/fileformats.html b/lucene/src/site/build/site/fileformats.html
index da02cf70a98..a8e75bbb731 100644
--- a/lucene/src/site/build/site/fileformats.html
+++ b/lucene/src/site/build/site/fileformats.html
@@ -412,10 +412,14 @@ document.write("Last Published: " + document.lastModified);
 to stored fields file, previously they were stored in text format only.

+

+ In version 3.4, fields can omit position data while + still indexing term frequencies. +

- +

Definitions

@@ -456,7 +460,7 @@ document.write("Last Published: " + document.lastModified); strings, the first naming the field, and the second naming text within the field.

- +

Inverted Indexing

The index stores statistics about terms in order @@ -466,7 +470,7 @@ document.write("Last Published: " + document.lastModified); it. This is the inverse of the natural relationship, in which documents list terms.

- +

Types of Fields

In Lucene, fields may be stored, in which @@ -480,7 +484,7 @@ document.write("Last Published: " + document.lastModified); to be indexed literally.

See the Field java docs for more information on Fields.

- +

Segments

Lucene indexes may be composed of multiple sub-indexes, or @@ -506,7 +510,7 @@ document.write("Last Published: " + document.lastModified); Searches may involve multiple segments and/or multiple indexes, each index potentially composed of a set of segments.

- +

Document Numbers

Internally, Lucene refers to documents by an integer document @@ -561,7 +565,7 @@ document.write("Last Published: " + document.lastModified);

- +

Overview

@@ -608,7 +612,7 @@ document.write("Last Published: " + document.lastModified);

Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in - that document if omitTf is false. + that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY)

@@ -619,8 +623,7 @@ document.write("Last Published: " + document.lastModified);

Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will - not exist if all fields in all documents set - omitTf to true. + not exist if all fields in all documents omit position data.

@@ -660,7 +663,7 @@ document.write("Last Published: " + document.lastModified);
- +

File Naming

@@ -687,7 +690,7 @@ document.write("Last Published: " + document.lastModified);

- +

Summary of File Extensions

The following table summarizes the names and extensions of the files in Lucene: @@ -837,10 +840,10 @@ document.write("Last Published: " + document.lastModified);

- +

Primitive Types

- +

Byte

The most primitive type @@ -848,7 +851,7 @@ document.write("Last Published: " + document.lastModified); other data types are defined as sequences of bytes, so file formats are byte-order independent.

- +

UInt32

32-bit unsigned integers are written as four @@ -858,7 +861,7 @@ document.write("Last Published: " + document.lastModified); UInt32 --> <Byte>4

- +

UInt64

64-bit unsigned integers are written as eight @@ -867,7 +870,7 @@ document.write("Last Published: " + document.lastModified);

UInt64 --> <Byte>8

- +

VInt

A variable-length format for positive integers is @@ -1417,13 +1420,13 @@ document.write("Last Published: " + document.lastModified); This provides compression while still being efficient to decode.
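The VInt definition itself falls outside this hunk. As a minimal sketch, assuming the usual scheme (seven data bits per byte, low-order group first, high bit set on every byte except the last), encoding and decoding could look like this. The function names `encode_vint` and `decode_vint` are hypothetical and are not part of Lucene's API.

```python
from typing import Tuple


def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer as a VInt (assumed 7-bits-per-byte layout)."""
    if value < 0:
        raise ValueError("VInt is defined for non-negative integers only")
    out = bytearray()
    # Emit 7 bits at a time, low-order group first; the high bit marks "more bytes follow".
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_vint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one VInt starting at offset; return (value, offset after the VInt)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, offset
        shift += 7
```

Small values occupy a single byte, which is where the compression comes from; decoding is a simple shift-and-mask loop.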

- +

Chars

Lucene writes Unicode character sequences as UTF-8 encoded bytes.

- +

String

Lucene writes strings as UTF-8 encoded bytes. @@ -1436,10 +1439,10 @@ document.write("Last Published: " + document.lastModified);

- +

Compound Types

- +

Map<String,String>

In a couple of places Lucene stores a Map @@ -1452,13 +1455,13 @@ document.write("Last Published: " + document.lastModified);

- +

Per-Index Files

The files in this section exist one-per-index.

- +

Segments File

The active segments in the index are stored in the @@ -1613,7 +1616,7 @@ document.write("Last Published: " + document.lastModified);

HasProx is 1 if any fields in this segment have - omitTf set to false; else, it's 0. + position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.

CommitUserData stores an optional user-supplied @@ -1631,7 +1634,7 @@ document.write("Last Published: " + document.lastModified);

HasVectors is 1 if this segment stores term vectors, else it's 0.

- +

Lock File

The write lock, which is stored in the index @@ -1645,14 +1648,14 @@ document.write("Last Published: " + document.lastModified); documents). This lock file ensures that only one writer is modifying the index at a time.

- +

Deletable File

A writer computes the deletable files dynamically instead, so no deletable file is written.

- +

Compound Files

Starting with Lucene 1.4 the compound file format became the default. This is simply a container for all files described in the next section @@ -1681,14 +1684,14 @@ document.write("Last Published: " + document.lastModified);

- +

Per-Segment Files

The remaining files are all per-segment, and are thus defined by suffix.

- +

Fields

@@ -1741,12 +1744,16 @@ document.write("Last Published: " + document.lastModified);

  • If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.
  • If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.
  • + +
  • If the seventh lowest-order bit is set (0x40), term frequencies and positions are omitted for the indexed field.
  • + +
  • If the eighth lowest-order bit is set (0x80), positions are omitted for the indexed field.
  • - FNMVersion (added in 2.9) is always -2. + FNMVersion (added in 2.9) is -2 for indexes from 2.9 - 3.3. It is -3 for indexes in Lucene 3.4+
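The flag bits listed above can be read as a small decoder. This is a sketch only: the constant and function names are hypothetical, the lower four bits are defined outside this hunk and are ignored here, and the intermediate option name `DOCS_AND_FREQS` is assumed to be the `IndexOptions` value for "frequencies but no positions" (only `DOCS_ONLY` and `DOCS_AND_FREQS_AND_POSITIONS` appear in this diff).

```python
# Bit masks taken from the list above; names are illustrative, not Lucene constants.
OMIT_NORMS = 0x10
STORE_PAYLOADS = 0x20
OMIT_TF_AND_POSITIONS = 0x40
OMIT_POSITIONS = 0x80


def index_options(field_bits: int) -> str:
    """Map the two omission bits of an indexed field to an IndexOptions-style name."""
    if field_bits & OMIT_TF_AND_POSITIONS:
        return "DOCS_ONLY"
    if field_bits & OMIT_POSITIONS:
        return "DOCS_AND_FREQS"  # assumed name for the positions-only omission
    return "DOCS_AND_FREQS_AND_POSITIONS"


def describe_field_bits(field_bits: int) -> dict:
    """Summarize the upper four flag bits of a field's bits byte."""
    return {
        "omit_norms": bool(field_bits & OMIT_NORMS),
        "store_payloads": bool(field_bits & STORE_PAYLOADS),
        "index_options": index_options(field_bits),
    }
```

For instance, `describe_field_bits(0x90)` reports omitted norms together with `DOCS_AND_FREQS`. Since the 0x80 bit is introduced alongside the FNMVersion change, a reader would presumably expect it only in files whose FNMVersion is -3.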

    Fields are numbered by their order in this file. Thus field zero is @@ -1898,7 +1905,7 @@ document.write("Last Published: " + document.lastModified); - +

    Term Dictionary

    The term dictionary is represented as two files: @@ -2002,7 +2009,7 @@ document.write("Last Published: " + document.lastModified); file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields - with omitTf true, this will be 0 since + that omit position data, this will be 0 since prox information is not stored.

    @@ -2090,12 +2097,12 @@ document.write("Last Published: " + document.lastModified); - +

    Frequencies

    The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that - document (if omitTf is false). + document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).

    FreqFile (.frq) --> <TermFreqs, SkipData> @@ -2135,26 +2142,26 @@ document.write("Last Published: " + document.lastModified);

    TermFreq entries are ordered by increasing document number.

    -

    DocDelta: if omitTf is false, this determines both +

    DocDelta: if frequencies are indexed, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is - read as another VInt. If omitTf is true, DocDelta + read as another VInt. If frequencies are omitted, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.

    For example, the TermFreqs for a term which occurs once in document seven and three times in document - eleven, with omitTf false, would be the following + eleven, with frequencies indexed, would be the following sequence of VInts:

    15, 8, 3
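The DocDelta rule can be checked with a short sketch that works on the logical VInt values rather than on bytes. The function names and the `omit_freqs` flag are hypothetical; `omit_freqs=True` stands for the `IndexOptions.DOCS_ONLY` case.

```python
from typing import Iterable, List, Optional, Tuple


def encode_term_freqs(postings: Iterable[Tuple[int, int]], omit_freqs: bool = False) -> List[int]:
    """Turn (doc_number, frequency) pairs, sorted by doc number, into VInt values."""
    out: List[int] = []
    prev_doc = 0
    for doc, freq in postings:
        gap = doc - prev_doc
        prev_doc = doc
        if omit_freqs:
            out.append(gap)            # plain gap, no frequency stored
        elif freq == 1:
            out.append(gap * 2 + 1)    # odd DocDelta: frequency is one
        else:
            out.append(gap * 2)        # even DocDelta: frequency follows
            out.append(freq)
    return out


def decode_term_freqs(vints: Iterable[int], omit_freqs: bool = False) -> List[Tuple[int, Optional[int]]]:
    """Inverse of encode_term_freqs; frequency is None when frequencies are omitted."""
    postings: List[Tuple[int, Optional[int]]] = []
    doc = 0
    it = iter(vints)
    for code in it:
        if omit_freqs:
            doc += code
            postings.append((doc, None))
        else:
            doc += code >> 1
            freq = 1 if code & 1 else next(it)
            postings.append((doc, freq))
    return postings
```

With frequencies indexed, `encode_term_freqs([(7, 1), (11, 3)])` yields `[15, 8, 3]`: document seven gives a gap of 7, doubled plus one for frequency one, and document eleven gives a gap of 4, doubled to 8 and followed by the frequency 3.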

    -

    If omitTf were true it would be this sequence +

    If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence of VInts instead:

    @@ -2218,14 +2225,14 @@ document.write("Last Published: " + document.lastModified); entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.

    - +

    Positions

    The .prx file contains the lists of positions that each term occurs at within documents. Note that - fields with omitTf true do not store + fields omitting positional data do not store anything into this file, and if all fields in the - index have omitTf true then the .prx file will not + index omit positional data then the .prx file will not exist.

    ProxFile (.prx) --> @@ -2288,7 +2295,7 @@ document.write("Last Published: " + document.lastModified); Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position.

    - +

    Normalization Factors

    There's a single .nrm file containing all norms:

    @@ -2368,7 +2375,7 @@ document.write("Last Published: " + document.lastModified);

    Separate norm files are created (when adequate) for both compound and non-compound segments.

    - +

    Term Vectors

    Term Vector support is optional on a field by @@ -2504,7 +2511,7 @@ document.write("Last Published: " + document.lastModified); - +

    Deleted Documents

    The .del file is optional, and only exists when a segment contains deletions. @@ -2568,7 +2575,7 @@ document.write("Last Published: " + document.lastModified);

    - +

    Limitations

diff --git a/lucene/src/site/build/site/skin/basic.css b/lucene/src/site/build/site/skin/basic.css
index 4ed58b99ae7..eb24c326c6c 100644
--- a/lucene/src/site/build/site/skin/basic.css
+++ b/lucene/src/site/build/site/skin/basic.css
@@ -163,4 +163,4 @@ p { .codefrag { font-family: "Courier New", Courier, monospace; font-size: 110%;
-}
+}
\ No newline at end of file
diff --git a/lucene/src/site/build/site/skin/print.css b/lucene/src/site/build/site/skin/print.css
index 8916b9fc01e..aaa99319acd 100644
--- a/lucene/src/site/build/site/skin/print.css
+++ b/lucene/src/site/build/site/skin/print.css
@@ -51,4 +51,4 @@ a:link, a:visited { acronym { border: 0;
-}
+}
\ No newline at end of file
diff --git a/lucene/src/site/build/site/skin/profile.css b/lucene/src/site/build/site/skin/profile.css
index ca72cdbd10b..2ed95546ec6 100644
--- a/lucene/src/site/build/site/skin/profile.css
+++ b/lucene/src/site/build/site/skin/profile.css
@@ -172,4 +172,4 @@ a:hover { color:#6587ff} }
-
+
\ No newline at end of file
diff --git a/lucene/src/site/build/site/skin/screen.css b/lucene/src/site/build/site/skin/screen.css
index aa8c457cb30..c6084f81df3 100644
--- a/lucene/src/site/build/site/skin/screen.css
+++ b/lucene/src/site/build/site/skin/screen.css
@@ -584,4 +584,4 @@ p.instruction { list-style-image: url('../images/instruction_arrow.png'); list-style-position: outside; margin-left: 2em;
-}
+}
\ No newline at end of file