mirror of https://github.com/apache/lucene.git
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
|
<meta content="Apache Forrest" name="Generator">
|
|
<meta name="Forrest-version" content="0.8">
|
|
<meta name="Forrest-skin-name" content="pelt">
|
|
<title>
|
|
Apache Lucene - Index File Formats
|
|
</title>
|
|
<link type="text/css" href="skin/basic.css" rel="stylesheet">
|
|
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
|
|
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
|
|
<link type="text/css" href="skin/profile.css" rel="stylesheet">
|
|
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
|
|
<link rel="shortcut icon" href="images/favicon.ico">
|
|
</head>
|
|
<body onload="init()">
|
|
<script type="text/javascript">ndeSetTextSize();</script>
|
|
<div id="top">
|
|
<!--+
|
|
|breadtrail
|
|
+-->
|
|
<div class="breadtrail">
|
|
<a href="http://www.apache.org/">Apache</a> > <a href="http://lucene.apache.org/">Lucene</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
|
|
</div>
|
|
<!--+
|
|
|header
|
|
+-->
|
|
<div class="header">
|
|
<!--+
|
|
|start group logo
|
|
+-->
|
|
<div class="grouplogo">
|
|
<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="http://www.apache.org/images/asf_logo_simple.png" title="Apache Lucene"></a>
|
|
</div>
|
|
<!--+
|
|
|end group logo
|
|
+-->
|
|
<!--+
|
|
|start Project Logo
|
|
+-->
|
|
<div class="projectlogo">
|
|
<a href="http://lucene.apache.org/java/"><img class="logoImage" alt="Lucene" src="http://lucene.apache.org/images/lucene_green_300.gif" title="Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform."></a>
|
|
</div>
|
|
<!--+
|
|
|end Project Logo
|
|
+-->
|
|
<!--+
|
|
|start Search
|
|
+-->
|
|
<div class="searchbox">
|
|
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
|
|
<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">
|
|
<input name="Search" value="Search" type="submit">
|
|
</form>
|
|
</div>
|
|
<!--+
|
|
|end search
|
|
+-->
|
|
<!--+
|
|
|start Tabs
|
|
+-->
|
|
<ul id="tabs">
|
|
<li class="current">
|
|
<a class="selected" href="http://lucene.apache.org/java/docs/">Main</a>
|
|
</li>
|
|
<li>
|
|
<a class="unselected" href="http://wiki.apache.org/lucene-java">Wiki</a>
|
|
</li>
|
|
<li class="current">
|
|
<a class="selected" href="index.html">Lucene 2.4-dev Documentation</a>
|
|
</li>
|
|
</ul>
|
|
<!--+
|
|
|end Tabs
|
|
+-->
|
|
</div>
|
|
</div>
|
|
<div id="main">
|
|
<div id="publishedStrip">
|
|
<!--+
|
|
|start Subtabs
|
|
+-->
|
|
<div id="level2tabs"></div>
|
|
<!--+
|
|
|end Endtabs
|
|
+-->
|
|
<script type="text/javascript"><!--
|
|
document.write("Last Published: " + document.lastModified);
|
|
// --></script>
|
|
</div>
|
|
<!--+
|
|
|breadtrail
|
|
+-->
|
|
<div class="breadtrail">
|
|
|
|
|
|
</div>
|
|
<!--+
|
|
|start Menu, mainarea
|
|
+-->
|
|
<!--+
|
|
|start Menu
|
|
+-->
|
|
<div id="menu">
|
|
<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
|
|
<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
|
|
<div class="menuitem">
|
|
<a href="index.html">Overview</a>
|
|
</div>
|
|
<div onclick="SwitchMenu('menu_1.1.2', 'skin/')" id="menu_1.1.2Title" class="menutitle">Javadocs</div>
|
|
<div id="menu_1.1.2" class="menuitemgroup">
|
|
<div class="menuitem">
|
|
<a href="api/index.html">All</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/core/index.html">Core</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/demo/index.html">Demo</a>
|
|
</div>
|
|
<div onclick="SwitchMenu('menu_1.1.2.4', 'skin/')" id="menu_1.1.2.4Title" class="menutitle">Contrib</div>
|
|
<div id="menu_1.1.2.4" class="menuitemgroup">
|
|
<div class="menuitem">
|
|
<a href="api/contrib-analyzers/index.html">Analyzers</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-ant/index.html">Ant</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-bdb/index.html">Bdb</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-bdb-je/index.html">Bdb-je</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-benchmark/index.html">Benchmark</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-highlighter/index.html">Highlighter</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-lucli/index.html">Lucli</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-memory/index.html">Memory</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-misc/index.html">Miscellaneous</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-queries/index.html">Queries</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-regex/index.html">Regex</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-snowball/index.html">Snowball</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-spellchecker/index.html">Spellchecker</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-surround/index.html">Surround</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-swing/index.html">Swing</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-wikipedia/index.html">Wikipedia</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-wordnet/index.html">Wordnet</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="api/contrib-xml-query-parser/index.html">XML Query Parser</a>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="benchmarks.html">Benchmarks</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="contributions.html">Contributions</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="http://wiki.apache.org/lucene-java/LuceneFAQ">FAQ</a>
|
|
</div>
|
|
<div class="menupage">
|
|
<div class="menupagetitle">File Formats</div>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="gettingstarted.html">Getting Started</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="lucene-sandbox/index.html">Lucene Sandbox</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="queryparsersyntax.html">Query Syntax</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="scoring.html">Scoring</a>
|
|
</div>
|
|
<div class="menuitem">
|
|
<a href="http://wiki.apache.org/lucene-java">Wiki</a>
|
|
</div>
|
|
</div>
|
|
<div id="credit"></div>
|
|
<div id="roundbottom">
|
|
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
|
|
<!--+
|
|
|alternative credits
|
|
+-->
|
|
<div id="credit2"></div>
|
|
</div>
|
|
<!--+
|
|
|end Menu
|
|
+-->
|
|
<!--+
|
|
|start content
|
|
+-->
|
|
<div id="content">
|
|
<div title="Portable Document Format" class="pdflink">
|
|
<a class="dida" href="fileformats.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
|
|
PDF</a>
|
|
</div>
|
|
<h1>
|
|
Apache Lucene - Index File Formats
|
|
</h1>
|
|
<div id="minitoc-area">
|
|
<ul class="minitoc">
|
|
<li>
|
|
<a href="#Index File Formats">Index File Formats</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Definitions">Definitions</a>
|
|
<ul class="minitoc">
|
|
<li>
|
|
<a href="#Inverted Indexing">Inverted Indexing</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Types of Fields">Types of Fields</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Segments">Segments</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Document Numbers">Document Numbers</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#Overview">Overview</a>
|
|
</li>
|
|
<li>
|
|
<a href="#File Naming">File Naming</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Primitive Types">Primitive Types</a>
|
|
<ul class="minitoc">
|
|
<li>
|
|
<a href="#Byte">Byte</a>
|
|
</li>
|
|
<li>
|
|
<a href="#UInt32">UInt32</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Uint64">UInt64</a>
|
|
</li>
|
|
<li>
|
|
<a href="#VInt">VInt</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Chars">Chars</a>
|
|
</li>
|
|
<li>
|
|
<a href="#String">String</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#Per-Index Files">Per-Index Files</a>
|
|
<ul class="minitoc">
|
|
<li>
|
|
<a href="#Segments File">Segments File</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Lock File">Lock File</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Deletable File">Deletable File</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Compound Files">Compound Files</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#Per-Segment Files">Per-Segment Files</a>
|
|
<ul class="minitoc">
|
|
<li>
|
|
<a href="#Fields">Fields</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Term Dictionary">Term Dictionary</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Frequencies">Frequencies</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Positions">Positions</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Normalization Factors">Normalization Factors</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Term Vectors">Term Vectors</a>
|
|
</li>
|
|
<li>
|
|
<a href="#Deleted Documents">Deleted Documents</a>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<a href="#Limitations">Limitations</a>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<a name="N10016"></a><a name="Index File Formats"></a>
|
|
<h2 class="boxed">Index File Formats</h2>
|
|
<div class="section">
|
|
<p>
|
|
This document defines the index file formats used
|
|
in Lucene version 2.1. If you are using a different
|
|
version of Lucene, please consult the copy of
|
|
<span class="codefrag">docs/fileformats.html</span>
|
|
that was distributed
|
|
with the version you are using.
|
|
</p>
|
|
<p>
|
|
Apache Lucene is written in Java, but several
|
|
efforts are underway to write
|
|
<a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
|
|
of Lucene in other programming
|
|
languages</a>. If these versions are to remain compatible with Apache
|
|
Lucene, then a language-independent definition of the Lucene index
|
|
format is required. This document thus attempts to provide a
|
|
complete and independent definition of the Apache Lucene 2.1 file
|
|
formats.
|
|
</p>
|
|
<p>
|
|
As Lucene evolves, this document should evolve.
|
|
Versions of Lucene in different programming languages should endeavor
|
|
to agree on file formats, and generate new versions of this document.
|
|
</p>
|
|
<p>
|
|
Compatibility notes are provided in this document,
|
|
describing how file formats have changed from prior versions.
|
|
</p>
|
|
<p>
|
|
In version 2.1, the file format was changed to allow
|
|
lock-less commits (i.e., no more commit lock). The
|
|
change is fully backwards compatible: you can open a
|
|
pre-2.1 index for searching or adding/deleting of
|
|
docs. When the new segments file is saved
|
|
(committed), it will be written in the new file format
|
|
(meaning no specific "upgrade" process is needed).
|
|
But note that once a commit has occurred, pre-2.1
|
|
Lucene will not be able to read the index.
|
|
</p>
|
|
<p>
|
|
In version 2.3, the file format was changed to allow
|
|
segments to share a single set of doc store (vectors &amp;
|
|
stored fields) files. This allows for faster indexing
|
|
in certain cases. The change is fully backwards
|
|
compatible (in the same way as the lock-less commits
|
|
change in 2.1).
|
|
</p>
|
|
</div>
|
|
|
|
|
|
<a name="N10035"></a><a name="Definitions"></a>
|
|
<h2 class="boxed">Definitions</h2>
|
|
<div class="section">
|
|
<p>
|
|
The fundamental concepts in Lucene are index,
|
|
document, field and term.
|
|
</p>
|
|
<p>
|
|
An index contains a sequence of documents.
|
|
</p>
|
|
<ul>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
A document is a sequence of fields.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
|
|
<li>
|
|
|
|
<p>
|
|
A field is a named sequence of terms.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
|
|
<li>
|
|
A term is a string.
|
|
</li>
|
|
|
|
</ul>
|
|
<p>
|
|
The same string in two different fields is
|
|
considered a different term. Thus terms are represented as a pair of
|
|
strings, the first naming the field, and the second naming text
|
|
within the field.
|
|
</p>
|
|
<a name="N10055"></a><a name="Inverted Indexing"></a>
|
|
<h3 class="boxed">Inverted Indexing</h3>
|
|
<p>
|
|
The index stores statistics about terms in order
|
|
to make term-based search more efficient. Lucene's
|
|
index belongs to the family of indexes known as <i>inverted indexes</i>.
This is because it can list, for a term, the documents that contain
|
|
it. This is the inverse of the natural relationship, in which
|
|
documents list terms.
|
|
</p>
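<p>
The idea can be illustrated with a toy in-memory structure. This is only a sketch: the class
<span class="codefrag">TinyInvertedIndex</span> and its methods are hypothetical names, whitespace splitting stands in
for real tokenization, and nothing here reflects how Lucene lays the data out on disk.
</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: maps each (field, text) term to the documents containing it. */
final class TinyInvertedIndex {
    // Key is "field:text", so the same text in two fields is two different terms.
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDoc = 0;

    /** Adds a one-field document and returns its document number (0, 1, 2, ...). */
    int addDocument(String field, String text) {
        int doc = nextDoc++;
        for (String token : text.split("\\s+")) {   // assumption: whitespace tokenization
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs =
                postings.computeIfAbsent(field + ":" + token, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                docs.add(doc);
            }
        }
        return doc;
    }

    /** The inverted lookup: from a term to the numbers of the documents that contain it. */
    List<Integer> documentsContaining(String field, String text) {
        return postings.getOrDefault(field + ":" + text, Collections.emptyList());
    }
}
```

<p>
Looking up a term returns document numbers directly, without scanning any document text.
</p>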
|
|
<a name="N10061"></a><a name="Types of Fields"></a>
|
|
<h3 class="boxed">Types of Fields</h3>
|
|
<p>
|
|
In Lucene, fields may be <i>stored</i>, in which
|
|
case their text is stored in the index literally, in a non-inverted
|
|
manner. Fields that are inverted are called <i>indexed</i>. A field
|
|
may be both stored and indexed.</p>
|
|
<p>The text of a field may be <i>tokenized</i> into terms to be
|
|
indexed, or the text of a field may be used literally as a term to be indexed.
|
|
Most fields are
|
|
tokenized, but sometimes it is useful for certain identifier fields
|
|
to be indexed literally.
|
|
</p>
|
|
<p>See the <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on fields.</p>
|
|
<a name="N1007E"></a><a name="Segments"></a>
|
|
<h3 class="boxed">Segments</h3>
|
|
<p>
|
|
Lucene indexes may be composed of multiple sub-indexes, or
|
|
<i>segments</i>. Each segment is a fully independent index, which could be searched
|
|
separately. Indexes evolve by:
|
|
</p>
|
|
<ol>
|
|
|
|
<li>
|
|
|
|
<p>Creating new segments for newly added documents.</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Merging existing segments.</p>
|
|
|
|
</li>
|
|
|
|
</ol>
|
|
<p>
|
|
Searches may involve multiple segments and/or multiple indexes, each
|
|
index potentially composed of a set of segments.
|
|
</p>
|
|
<a name="N1009C"></a><a name="Document Numbers"></a>
|
|
<h3 class="boxed">Document Numbers</h3>
|
|
<p>
|
|
Internally, Lucene refers to documents by an integer <i>document
|
|
number</i>. The first document added to an index is numbered zero, and each
|
|
subsequent document added gets a number one greater than the previous.
|
|
</p>
|
|
<p>
|
|
|
|
<br>
|
|
|
|
</p>
|
|
<p>
|
|
Note that a document's number may change, so caution should be taken
|
|
when storing these numbers outside of Lucene. In particular, numbers may
|
|
change in the following situations:
|
|
</p>
|
|
<ul>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
The
|
|
numbers stored in each segment are unique only within the segment,
|
|
and must be converted before they can be used in a larger context.
|
|
The standard technique is to allocate each segment a range of
|
|
values, based on the range of numbers used in that segment. To
|
|
convert a document number from a segment to an external value, the
|
|
segment's <i>base</i> document
|
|
number is added. To convert an external value back to a
|
|
segment-specific value, the segment is identified by the range that
|
|
the external value is in, and the segment's base value is
|
|
subtracted. For example, two five-document segments might be
|
|
combined, so that the first segment has a base value of zero, and
|
|
the second of five. Document three from the second segment would
|
|
have an external value of eight.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
When documents are deleted, gaps are created
|
|
in the numbering. These are eventually removed as the index evolves
|
|
through merging. Deleted documents are dropped when segments are
|
|
merged. A freshly-merged segment thus has no gaps in its numbering.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
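<p>
The base-value conversion described above can be sketched as follows. The class
<span class="codefrag">DocNumberMapper</span> and its method names are hypothetical, chosen for illustration only;
the sketch assumes segments are combined in a fixed order and ignores deletions.
</p>

```java
/** Sketch: converts between segment-local and external (index-wide) document numbers. */
final class DocNumberMapper {
    private final int[] bases;   // bases[i] = external number of document 0 in segment i
    private final int total;     // total number of documents across all segments

    DocNumberMapper(int[] segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;          // each segment's range starts where the previous one ends
            next += segmentSizes[i];
        }
        total = next;
    }

    /** Segment-local number to external value: add the segment's base. */
    int toExternal(int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    /** Identifies the segment by the range that the external value falls in. */
    int segmentOf(int externalDoc) {
        if (externalDoc < 0 || externalDoc >= total) {
            throw new IllegalArgumentException("no such document: " + externalDoc);
        }
        // Scan from the last segment so that empty segments are skipped correctly.
        for (int i = bases.length - 1; i > 0; i--) {
            if (bases[i] <= externalDoc) {
                return i;
            }
        }
        return 0;
    }

    /** External value to segment-local number: subtract the segment's base. */
    int toLocal(int externalDoc) {
        return externalDoc - bases[segmentOf(externalDoc)];
    }
}
```

<p>
With two five-document segments, <span class="codefrag">new DocNumberMapper(new int[] {5, 5}).toExternal(1, 3)</span>
yields eight, matching the example in the text.
</p>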
|
|
</div>
|
|
|
|
|
|
<a name="N100C3"></a><a name="Overview"></a>
|
|
<h2 class="boxed">Overview</h2>
|
|
<div class="section">
|
|
<p>
|
|
Each segment index maintains the following:
|
|
</p>
|
|
<ul>
|
|
|
|
<li>
|
|
|
|
<p>Field names. This
|
|
contains the set of field names used in the index.
|
|
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Stored Field
|
|
values. This contains, for each document, a list of attribute-value
|
|
pairs, where the attributes are field names. These are used to
|
|
store auxiliary information about the document, such as its title,
|
|
url, or an identifier to access a
|
|
database. The set of stored fields is what is returned for each hit
|
|
when searching. This is keyed by document number.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Term dictionary.
|
|
A dictionary containing all of the terms used in all of the indexed
|
|
fields of all of the documents. The dictionary also contains the
|
|
number of documents which contain the term, and pointers to the
|
|
term's frequency and proximity data.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
|
|
<li>
|
|
|
|
<p>Term Frequency
|
|
data. For each term in the dictionary, the numbers of all the
|
|
documents that contain that term, and the frequency of the term in
|
|
that document.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
|
|
<li>
|
|
|
|
<p>Term Proximity
|
|
data. For each term in the dictionary, the positions at which the term
|
|
occurs in each document.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
|
|
<li>
|
|
|
|
<p>Normalization
|
|
factors. For each field in each document, a value is stored that is
|
|
multiplied into the score for hits on that field.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Term Vectors. For each field in each document, the term vector
|
|
(sometimes called document vector) may be stored. A term vector consists
|
|
of term text and term frequency. To add Term Vectors to your index see the
|
|
<a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a>
|
|
constructors.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Deleted documents.
|
|
An optional file indicating which documents are deleted.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
<p>Details on each of these are provided in subsequent sections.
|
|
</p>
|
|
</div>
|
|
|
|
|
|
<a name="N10106"></a><a name="File Naming"></a>
|
|
<h2 class="boxed">File Naming</h2>
|
|
<div class="section">
|
|
<p>
|
|
All files belonging to a segment have the same name with varying
|
|
extensions. The extensions correspond to the different file formats
|
|
described below. When using the Compound File format (the default in 1.4 and greater), these files are
collapsed into a single .cfs file (see below for details).
|
|
</p>
|
|
<p>
|
|
Typically, all segments
|
|
in an index are stored in a single directory, although this is not
|
|
required.
|
|
</p>
|
|
<p>
|
|
As of version 2.1 (lock-less commits), file names are
|
|
never re-used (there is one exception, "segments.gen",
|
|
see below). That is, when any file is saved to the
|
|
Directory, it is given a never-before-used file name.
|
|
This is achieved using a simple generations approach.
|
|
For example, the first segments file is segments_1,
|
|
then segments_2, etc. The generation is a sequential
|
|
long integer represented in alpha-numeric (base 36)
|
|
form.
|
|
</p>
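<p>
As an illustration of the generation scheme, the name of a segments file can be derived from its generation
with a base-36 conversion. The class and method names below are hypothetical and are not taken from the Lucene
sources; the sketch assumes generations start at one, as in the example above.
</p>

```java
/** Sketch: derives a segments file name from a generation number. */
final class GenerationNames {
    /** Returns "segments_" followed by the generation in base 36 (digits 0-9, then a-z). */
    static String segmentsFileName(long generation) {
        if (generation < 1) {
            throw new IllegalArgumentException("generation must be >= 1: " + generation);
        }
        // Character.MAX_RADIX is 36, the alpha-numeric radix described in the text.
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }
}
```

<p>
Generations 1, 35 and 36 give <span class="codefrag">segments_1</span>, <span class="codefrag">segments_z</span>
and <span class="codefrag">segments_10</span> respectively.
</p>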
|
|
</div>
|
|
|
|
|
|
<a name="N10115"></a><a name="Primitive Types"></a>
|
|
<h2 class="boxed">Primitive Types</h2>
|
|
<div class="section">
|
|
<a name="N1011A"></a><a name="Byte"></a>
|
|
<h3 class="boxed">Byte</h3>
|
|
<p>
|
|
The most primitive type
|
|
is an eight-bit byte. Files are accessed as sequences of bytes. All
|
|
other data types are defined as sequences
|
|
of bytes, so file formats are byte-order independent.
|
|
</p>
|
|
<a name="N10123"></a><a name="UInt32"></a>
|
|
<h3 class="boxed">UInt32</h3>
|
|
<p>
|
|
32-bit unsigned integers are written as four
|
|
bytes, high-order bytes first.
|
|
</p>
|
|
<p>
|
|
UInt32 --&gt; &lt;Byte&gt;<sup>4</sup>
|
|
|
|
</p>
|
|
<a name="N10132"></a><a name="Uint64"></a>
|
|
<h3 class="boxed">UInt64</h3>
|
|
<p>
|
|
64-bit unsigned integers are written as eight
|
|
bytes, high-order bytes first.
|
|
</p>
|
|
<p>UInt64 --&gt; &lt;Byte&gt;<sup>8</sup>
|
|
|
|
</p>
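<p>
A minimal sketch of the two fixed-width encodings, high-order byte first. The class
<span class="codefrag">BigEndianInts</span> is a hypothetical helper, not part of Lucene. Java has no unsigned
integer types, so the sketch assumes a 32-bit unsigned value is carried in a <span class="codefrag">long</span>,
and a 64-bit value is carried as its raw bit pattern.
</p>

```java
/** Sketch: fixed-width unsigned integers written high-order byte first. */
final class BigEndianInts {
    /** Encodes a value in the range 0..4294967295 as four bytes. */
    static byte[] writeUInt32(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a 32-bit unsigned value: " + value);
        }
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }

    /** Decodes four bytes starting at offset into a value in the range 0..4294967295. */
    static long readUInt32(byte[] b, int offset) {
        return ((b[offset] & 0xFFL) << 24)
             | ((b[offset + 1] & 0xFFL) << 16)
             | ((b[offset + 2] & 0xFFL) << 8)
             |  (b[offset + 3] & 0xFFL);
    }

    /** Encodes a 64-bit pattern as eight bytes. */
    static byte[] writeUInt64(long bits) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (bits >>> (56 - 8 * i));   // most significant byte goes first
        }
        return out;
    }

    /** Decodes eight bytes starting at offset into a 64-bit pattern. */
    static long readUInt64(byte[] b, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (b[offset + i] & 0xFFL);
        }
        return value;
    }
}
```

<p>
A 64-bit pattern whose top bit is set can be printed as an unsigned number with
<span class="codefrag">Long.toUnsignedString</span>.
</p>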
|
|
<a name="N10141"></a><a name="VInt"></a>
|
|
<h3 class="boxed">VInt</h3>
|
|
<p>
|
|
A variable-length format for non-negative integers is
|
|
defined where the high-order bit of each byte indicates whether more
|
|
bytes remain to be read. The low-order seven bits are appended as
|
|
increasingly more significant bits in the resulting integer value.
|
|
Thus values from zero to 127 may be stored in a single byte, values
|
|
from 128 to 16,383 may be stored in two bytes, and so on.
|
|
</p>
|
|
<p>
|
|
|
|
<b>VInt Encoding Example</b>
|
|
|
|
</p>
|
|
<table class="ForrestTable" cellspacing="0" cellpadding="4" border="0">
|
|
|
|
<col width="64*">
|
|
|
|
<col width="64*">
|
|
|
|
<col width="64*">
|
|
|
|
<col width="64*">
|
|
|
|
<tr valign="TOP">
|
|
|
|
<td width="25%">
|
|
|
|
<p align="RIGHT">
|
|
|
|
<b>Value</b>
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td width="25%">
|
|
|
|
<p align="RIGHT">
|
|
|
|
<b>First byte</b>
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td width="25%">
|
|
|
|
<p align="RIGHT">
|
|
|
|
<b>Second byte</b>
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td width="25%">
|
|
|
|
<p align="RIGHT">
|
|
|
|
<b>Third byte</b>
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="0" width="25%">
|
|
|
|
<p align="RIGHT">0
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="0" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
00000000
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT">1
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="2" width="25%">
|
|
|
|
<p align="RIGHT">2
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
00000010
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr>
|
|
|
|
<td valign="TOP" width="25%">
|
|
|
|
<p align="RIGHT">...
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="127" width="25%">
|
|
|
|
<p align="RIGHT">127
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1111111" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
01111111
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="128" width="25%">
|
|
|
|
<p align="RIGHT">128
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000000" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="129" width="25%">
|
|
|
|
<p align="RIGHT">129
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000001" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
10000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="130" width="25%">
|
|
|
|
<p align="RIGHT">130
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000010" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
10000010
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr>
|
|
|
|
<td valign="TOP" width="25%">
|
|
|
|
<p align="RIGHT">...
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="16383" width="25%">
|
|
|
|
<p align="RIGHT">16,383
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="11111111" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
11111111
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1111111" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
01111111
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" width="25%">
|
|
|
|
<p align="RIGHT" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="16384" width="25%">
|
|
|
|
<p align="RIGHT">16,384
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000000" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000000" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr valign="BOTTOM">
|
|
|
|
<td sdnum="1033;0;#,##0" sdval="16385" width="25%">
|
|
|
|
<p align="RIGHT">16,385
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000001" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
10000001
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="10000000" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
10000000
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" sdval="1" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
00000001
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
<tr>
|
|
|
|
<td valign="TOP" width="25%">
|
|
|
|
<p align="RIGHT">...
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: 0.11cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.07cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
<td sdnum="1033;0;00000000" valign="BOTTOM" width="25%">
|
|
|
|
<p align="RIGHT" class="western" style="margin-left: -0.47cm; margin-right: 0.01cm">
|
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
<p>
|
|
This provides compression while still being
|
|
efficient to decode.
|
|
</p>
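<p>
The encoding rule and the table above can be reproduced with a short sketch. The class
<span class="codefrag">VIntCodec</span> is a hypothetical name used for illustration; rejecting negative input
is an assumption of this sketch rather than something this document specifies.
</p>

```java
import java.io.ByteArrayOutputStream;

/** Sketch of VInt: seven payload bits per byte, low-order group first. */
final class VIntCodec {
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        int v = value;
        while ((v & ~0x7F) != 0) {            // more than seven bits remain
            out.write((v & 0x7F) | 0x80);     // low seven bits, high bit set: more bytes follow
            v >>>= 7;
        }
        out.write(v);                         // final byte, high bit clear
        return out.toByteArray();
    }

    static int decode(byte[] bytes, int offset) {
        int b = bytes[offset++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;     // each byte supplies more significant bits
        }
        return value;
    }
}
```

<p>
For instance, <span class="codefrag">VIntCodec.encode(16384)</span> produces the three bytes
10000000, 10000000, 00000001 shown in the table.
</p>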
|
|
<a name="N10426"></a><a name="Chars"></a>
|
|
<h3 class="boxed">Chars</h3>
|
|
<p>
|
|
Lucene writes Unicode
|
|
character sequences as UTF-8 encoded bytes.
|
|
</p>
|
|
<a name="N1042F"></a><a name="String"></a>
|
|
<h3 class="boxed">String</h3>
|
|
<p>
|
|
Lucene writes strings as UTF-8 encoded bytes.
|
|
First the length, in bytes, is written as a VInt,
|
|
followed by the bytes.
|
|
</p>
|
|
<p>
|
|
String --> VInt, Chars
|
|
</p>
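<p>
A sketch of the String layout as described here: a VInt holding the byte length, followed by the UTF-8 bytes.
The class <span class="codefrag">LuceneStringCodec</span> is a hypothetical name, and the sketch assumes the whole
encoded string is available in a single byte array.
</p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch of String: VInt byte length, then that many UTF-8 bytes. */
final class LuceneStringCodec {
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length + 5);
        int len = utf8.length;                 // length in bytes, not in characters
        while ((len & ~0x7F) != 0) {           // VInt: seven bits per byte, low-order first
            out.write((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.write(len);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        int pos = 0;
        int b = data[pos++];
        int len = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos++];
            len |= (b & 0x7F) << shift;
        }
        return new String(data, pos, len, StandardCharsets.UTF_8);
    }
}
```

<p>
Because the prefix counts bytes, a string with non-ASCII characters has a length prefix larger than its
character count.
</p>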
|
|
</div>
|
|
|
|
|
|
<a name="N1043C"></a><a name="Per-Index Files"></a>
|
|
<h2 class="boxed">Per-Index Files</h2>
|
|
<div class="section">
|
|
<p>
|
|
The files in this section exist one-per-index.
|
|
</p>
|
|
<a name="N10444"></a><a name="Segments File"></a>
|
|
<h3 class="boxed">Segments File</h3>
|
|
<p>
|
|
The active segments in the index are stored in the
|
|
segment info file,
|
|
<tt>segments_N</tt>.
|
|
There may
|
|
be one or more
|
|
<tt>segments_N</tt>
|
|
files in the
|
|
index; however, the one with the largest
|
|
generation is the active one (older
segments_N files may be present because they
temporarily cannot be deleted, because a writer is in
the process of committing, or because a custom
|
|
<a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a>
|
|
is in use). This file lists each
|
|
segment by name, has details about the separate
|
|
norms and deletion files, and also contains the
|
|
size of each segment.
|
|
</p>
|
|
<p>
|
|
As of 2.1, there is also a file
|
|
<tt>segments.gen</tt>.
|
|
This file contains the
|
|
current generation (the
|
|
<tt>_N</tt>
|
|
in
|
|
<tt>segments_N</tt>)
|
|
of the index. This is
|
|
used only as a fallback in case the current
|
|
generation cannot be accurately determined by
|
|
directory listing alone (as is the case for some
|
|
NFS clients with time-based directory cache
|
|
expiration). This file simply contains an Int32
|
|
version header (SegmentInfos.FORMAT_LOCKLESS =
|
|
-2), followed by the generation recorded as Int64,
|
|
written twice.
|
|
</p>
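<p>The layout of <tt>segments.gen</tt> can be sketched as follows. This is an illustration only: the class name is hypothetical, integers are assumed to be stored high-order byte first, and treating two differing copies of the generation as untrustworthy is an assumption about why the value is written twice.</p>

```java
import java.nio.ByteBuffer;

final class SegmentsGenSketch {
    static final int FORMAT_LOCKLESS = -2;

    // Returns the recorded generation, or -1 if the contents cannot be trusted.
    static long readGeneration(byte[] contents) {
        // Int32 header + two Int64 copies = 20 bytes.
        if (contents.length < 20) {
            return -1;
        }
        ByteBuffer in = ByteBuffer.wrap(contents); // big-endian by default
        if (in.getInt() != FORMAT_LOCKLESS) {
            return -1;
        }
        long first = in.getLong();
        long second = in.getLong();
        return first == second ? first : -1;
    }
}
```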
|
|
<p>
|
|
|
|
<b>Pre-2.1:</b>
|
|
Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>
|
|
<sup>SegCount</sup>
|
|
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.1 and above:</b>
|
|
Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, HasSingleNormFile, NumField,
|
|
NormGen<sup>NumField</sup>,
|
|
IsCompoundFile><sup>SegCount</sup>
|
|
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.3:</b>
|
|
Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
|
|
NormGen<sup>NumField</sup>,
|
|
IsCompoundFile><sup>SegCount</sup>
|
|
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.4 and above:</b>
|
|
Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
|
|
NormGen<sup>NumField</sup>,
|
|
IsCompoundFile><sup>SegCount</sup>, Checksum
|
|
</p>
|
|
<p>
|
|
Format, NameCounter, SegCount, SegSize, NumField, DocStoreOffset --> Int32
|
|
</p>
|
|
<p>
|
|
Version, DelGen, NormGen, Checksum --> Int64
|
|
</p>
|
|
<p>
|
|
SegName, DocStoreSegment --> String
|
|
</p>
|
|
<p>
|
|
IsCompoundFile, HasSingleNormFile, DocStoreIsCompoundFile --> Int8
|
|
</p>
|
|
<p>
|
|
Format is -1 as of Lucene 1.4, -3 (SegmentInfos.FORMAT_SINGLE_NORM_FILE) as of Lucene 2.1 and 2.2, -4 (SegmentInfos.FORMAT_SHARED_DOC_STORE) as of Lucene 2.3 and -5 (SegmentInfos.FORMAT_CHECKSUM) as of Lucene 2.4.
|
|
</p>
|
|
<p>
|
|
Version counts how often the index has been
|
|
changed by adding or deleting documents.
|
|
</p>
|
|
<p>
|
|
NameCounter is used to generate names for new segment files.
|
|
</p>
|
|
<p>
|
|
SegName is the name of the segment, and is used as the file name prefix
|
|
for all of the files that compose the segment's index.
|
|
</p>
|
|
<p>
|
|
SegSize is the number of documents contained in the segment index.
|
|
</p>
|
|
<p>
|
|
DelGen is the generation count of the separate
|
|
deletes file. If this is -1, there are no
|
|
separate deletes. If it is 0, this is a pre-2.1
|
|
segment and you must check the filesystem for the
|
|
existence of _X.del. Anything above zero means
|
|
there are separate deletes (_X_N.del).
|
|
</p>
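<p>The three DelGen cases can be expressed as a small lookup. The enum and class names below are hypothetical, and rejecting values below -1 is an assumption, since the text does not say how such values should be treated.</p>

```java
final class DelGenSketch {
    enum Deletes { NONE, CHECK_FILESYSTEM, SEPARATE_FILE }

    static Deletes interpret(long delGen) {
        if (delGen == -1) {
            return Deletes.NONE;             // no separate deletes
        }
        if (delGen == 0) {
            return Deletes.CHECK_FILESYSTEM; // pre-2.1 segment: look for _X.del
        }
        if (delGen > 0) {
            return Deletes.SEPARATE_FILE;    // deletes live in _X_N.del
        }
        // Assumption: values below -1 are not described, so treat them as invalid.
        throw new IllegalArgumentException("unexpected DelGen: " + delGen);
    }
}
```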
|
|
<p>
|
|
NumField is the size of the array for NormGen, or
|
|
-1 if there are no NormGens stored.
|
|
</p>
|
|
<p>
|
|
NormGen records the generation of the separate
|
|
norms files. If NumField is -1, there are no
NormGens stored: they are all assumed to be 0
when the segments file was written pre-2.1, and all
assumed to be -1 when the segments file is 2.1 or
above. The generation then has the same meaning
as DelGen (above).
|
|
</p>
|
|
<p>
|
|
IsCompoundFile records whether the segment is
|
|
written as a compound file or not. If this is -1,
|
|
the segment is not a compound file. If it is 1,
|
|
the segment is a compound file. Else it is 0,
|
|
which means we check the filesystem to see if _X.cfs
|
|
exists.
|
|
</p>
|
|
<p>
|
|
If HasSingleNormFile is 1, then the field norms are
|
|
written as a single joined file (with extension
|
|
<tt>.nrm</tt>); if it is 0 then each field's norms
|
|
are stored as separate <tt>.fN</tt> files. See
|
|
"Normalization Factors" below for details.
|
|
</p>
|
|
<p>
|
|
DocStoreOffset, DocStoreSegment,
|
|
DocStoreIsCompoundFile: If DocStoreOffset is -1,
|
|
this segment has its own doc store (stored fields
|
|
values and term vectors) files and DocStoreSegment
|
|
and DocStoreIsCompoundFile are not stored. In
|
|
this case all files for stored field values
|
|
(<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term
|
|
vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and
|
|
<tt>*.tvx</tt>) will be stored with this segment.
|
|
Otherwise, DocStoreSegment is the name of the
|
|
segment that has the shared doc store files;
|
|
DocStoreIsCompoundFile is 1 if that segment is
|
|
stored in compound file format (as a <tt>.cfx</tt>
|
|
file); and DocStoreOffset is the starting document
|
|
in the shared doc store files where this segment's
|
|
documents begin. In this case, this segment does
|
|
not store its own doc store files but instead
|
|
shares a single set of these files with other
|
|
segments.
|
|
</p>
|
|
<p>
|
|
Checksum contains the CRC32 checksum of all bytes
|
|
in the segments_N file up until the checksum.
|
|
This is used to verify integrity of the file on
|
|
opening the index.
|
|
</p>
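<p>A minimal sketch of the integrity check, assuming the Checksum is the final Int64 of the file, stored high-order byte first, and covers every byte before it. It uses the JDK's <tt>java.util.zip.CRC32</tt>; the class name <tt>SegmentsChecksumSketch</tt> is hypothetical.</p>

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class SegmentsChecksumSketch {
    // True if the trailing Int64 equals the CRC32 of all preceding bytes.
    static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false;
        }
        int bodyLength = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        // Read the stored checksum from the last 8 bytes (big-endian).
        long stored = ByteBuffer.wrap(file, bodyLength, 8).getLong();
        return stored == crc.getValue();
    }
}
```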
|
|
<a name="N104D8"></a><a name="Lock File"></a>
|
|
<h3 class="boxed">Lock File</h3>
|
|
<p>
|
|
The write lock, which is stored in the index
|
|
directory by default, is named "write.lock". If
|
|
the lock directory is different from the index
|
|
directory then the write lock will be named
|
|
"XXXX-write.lock" where XXXX is a unique prefix
|
|
derived from the full path to the index directory.
|
|
When this file is present, a writer is currently
|
|
modifying the index (adding or removing
|
|
documents). This lock file ensures that only one
|
|
writer is modifying the index at a time.
|
|
</p>
|
|
<p>
|
|
Note that prior to version 2.1, Lucene also used a
|
|
commit lock. This was removed in 2.1.
|
|
</p>
|
|
<a name="N104E4"></a><a name="Deletable File"></a>
|
|
<h3 class="boxed">Deletable File</h3>
|
|
<p>
|
|
Prior to Lucene 2.1 there was a file "deletable"
|
|
that contained details about files that need to be
|
|
deleted. As of 2.1, a writer dynamically computes
|
|
the files that are deletable, instead, so no file
|
|
is written.
|
|
</p>
|
|
<a name="N104ED"></a><a name="Compound Files"></a>
|
|
<h3 class="boxed">Compound Files</h3>
|
|
<p>Starting with Lucene 1.4 the compound file format became the default. This
|
|
is simply a container for all files described in the next section
|
|
(except for the .del file).</p>
|
|
<p>Compound (.cfs) --> FileCount, <DataOffset, FileName>
|
|
<sup>FileCount</sup>
|
|
,
|
|
FileData
|
|
<sup>FileCount</sup>
|
|
|
|
</p>
|
|
<p>FileCount --> VInt</p>
|
|
<p>DataOffset --> Long</p>
|
|
<p>FileName --> String</p>
|
|
<p>FileData --> raw file data</p>
|
|
<p>The raw file data is the data from the individual files named above.</p>
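<p>The directory at the start of a compound file could be read along these lines. This is a sketch under stated assumptions: DataOffset is taken to be an 8-byte value stored high-order byte first, FileName follows the String layout (VInt byte length, then UTF-8 bytes), and the class and method names are hypothetical.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class CompoundDirectorySketch {
    // Reads FileCount, then FileCount pairs of (DataOffset, FileName),
    // returning file names mapped to their data offsets in file order.
    static Map<String, Long> readEntries(byte[] cfs) {
        ByteBuffer in = ByteBuffer.wrap(cfs); // big-endian by default
        int fileCount = readVInt(in);
        Map<String, Long> entries = new LinkedHashMap<>();
        for (int i = 0; i < fileCount; i++) {
            long dataOffset = in.getLong();
            int nameLength = readVInt(in);
            byte[] name = new byte[nameLength];
            in.get(name);
            entries.put(new String(name, StandardCharsets.UTF_8), dataOffset);
        }
        return entries;
    }

    // VInt: 7 data bits per byte, low-order group first, high bit = "more".
    private static int readVInt(ByteBuffer in) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```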
|
|
<p>Starting with Lucene 2.3, doc store files (stored
|
|
field values and term vectors) can be shared in a
|
|
single set of files for more than one segment. When
|
|
the compound file format is enabled, these shared files will be
|
|
added into a single compound file (same format as
|
|
above) but with the extension <tt>.cfx</tt>.
|
|
</p>
|
|
</div>
|
|
|
|
|
|
<a name="N10515"></a><a name="Per-Segment Files"></a>
|
|
<h2 class="boxed">Per-Segment Files</h2>
|
|
<div class="section">
|
|
<p>
|
|
The remaining files are all per-segment, and are
|
|
thus defined by suffix.
|
|
</p>
|
|
<a name="N1051D"></a><a name="Fields"></a>
|
|
<h3 class="boxed">Fields</h3>
|
|
<p>
|
|
|
|
<br>
|
|
|
|
<b>Field Info</b>
|
|
|
|
<br>
|
|
|
|
</p>
|
|
<p>
|
|
Field names are
|
|
stored in the field info file, with suffix .fnm.
|
|
</p>
|
|
<p>
|
|
FieldInfos
|
|
(.fnm) --> FieldsCount, <FieldName,
|
|
FieldBits>
|
|
<sup>FieldsCount</sup>
|
|
|
|
</p>
|
|
<p>
|
|
FieldsCount --> VInt
|
|
</p>
|
|
<p>
|
|
FieldName --> String
|
|
</p>
|
|
<p>
|
|
FieldBits --> Byte
|
|
</p>
|
|
<p>
|
|
|
|
<ul>
|
|
|
|
<li>
|
|
The low-order bit is one for
|
|
indexed fields, and zero for non-indexed fields.
|
|
</li>
|
|
|
|
<li>
|
|
The second lowest-order
|
|
bit is one for fields that have term vectors stored, and zero for fields
|
|
without term vectors.
|
|
</li>
|
|
|
|
<p>
|
|
|
|
<b>Lucene >= 1.9:</b>
|
|
|
|
</p>
|
|
|
|
<li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
|
|
|
|
<li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
|
|
|
|
<li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
|
|
|
|
<li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li>
|
|
|
|
</ul>
|
|
|
|
</p>
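<p>The FieldBits flags listed above map naturally onto bit masks. The constant names in this sketch are illustrative choices and are not taken from the Lucene source; only the mask values come from the list above.</p>

```java
final class FieldBitsSketch {
    static final int IS_INDEXED = 0x01;
    static final int STORE_TERM_VECTOR = 0x02;
    static final int STORE_POSITIONS_WITH_TERM_VECTOR = 0x04;
    static final int STORE_OFFSETS_WITH_TERM_VECTOR = 0x08;
    static final int OMIT_NORMS = 0x10;
    static final int STORE_PAYLOADS = 0x20;

    // Tests whether the flag selected by mask is set in a FieldBits byte.
    static boolean isSet(byte fieldBits, int mask) {
        return (fieldBits & mask) != 0;
    }
}
```

<p>For instance, a FieldBits value of 0x13 describes an indexed field with term vectors and omitted norms.</p>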
|
|
<p>
|
|
Fields are numbered by their order in this file. Thus field zero is
|
|
the
|
|
first field in the file, field one the next, and so on. Note that,
|
|
like document numbers, field numbers are segment relative.
|
|
</p>
|
|
<p>
|
|
|
|
<br>
|
|
|
|
<b>Stored Fields</b>
|
|
|
|
<br>
|
|
|
|
</p>
|
|
<p>
|
|
Stored fields are represented by two files:
|
|
</p>
|
|
<ol>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
The field index, or .fdx file.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
This contains, for each document, a pointer to
|
|
its field data, as follows:
|
|
</p>
|
|
|
|
|
|
<p>
|
|
FieldIndex
|
|
(.fdx) -->
|
|
<FieldValuesPosition>
|
|
<sup>SegSize</sup>
|
|
|
|
</p>
|
|
|
|
<p>FieldValuesPosition
|
|
--> Uint64
|
|
</p>
|
|
|
|
<p>This
|
|
is used to find the location within the field data file of the
|
|
fields of a particular document. Because it contains fixed-length
|
|
data, this file may be easily randomly accessed. The position of
|
|
document
|
|
<i>n</i>'s
|
|
field data is the Uint64 at
|
|
<i>n*8</i>
|
|
in
|
|
this file.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
The field data, or .fdt file.
|
|
|
|
</p>
|
|
|
|
|
|
<p>
|
|
This contains the stored fields of each document,
|
|
as follows:
|
|
</p>
|
|
|
|
|
|
<p>
|
|
FieldData (.fdt) -->
|
|
<DocFieldData>
|
|
<sup>SegSize</sup>
|
|
|
|
</p>
|
|
|
|
<p>DocFieldData -->
|
|
FieldCount, <FieldNum, Bits, Value>
|
|
<sup>FieldCount</sup>
|
|
|
|
</p>
|
|
|
|
<p>FieldCount -->
|
|
VInt
|
|
</p>
|
|
|
|
<p>FieldNum -->
|
|
VInt
|
|
</p>
|
|
|
|
|
|
<p>
|
|
|
|
<b>Lucene <= 1.4:</b>
|
|
|
|
</p>
|
|
|
|
<p>Bits -->
|
|
Byte
|
|
</p>
|
|
|
|
<p>Value -->
|
|
String
|
|
</p>
|
|
|
|
<p>Only the low-order bit of Bits is used. It is one for
|
|
tokenized fields, and zero for non-tokenized fields.
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<b>Lucene >= 1.9:</b>
|
|
|
|
</p>
|
|
|
|
<p>Bits -->
|
|
Byte
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<ul>
|
|
|
|
<li>low order bit is one for tokenized fields</li>
|
|
|
|
<li>second bit is one for fields containing binary data</li>
|
|
|
|
<li>third bit is one for fields with compression option enabled
|
|
(if compression is enabled, the algorithm used is ZLIB)</li>
|
|
|
|
</ul>
|
|
|
|
</p>
|
|
|
|
<p>Value -->
|
|
String | BinaryValue (depending on Bits)
|
|
</p>
|
|
|
|
<p>BinaryValue -->
|
|
ValueSize, <Byte>^ValueSize
|
|
</p>
|
|
|
|
<p>ValueSize -->
|
|
VInt
|
|
</p>
|
|
|
|
|
|
</li>
|
|
|
|
</ol>
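<p>Because every .fdx entry has a fixed width, looking up a document's stored fields is a single indexed read. The sketch below assumes the file holds nothing but the 8-byte positions described above, stored high-order byte first, and has been loaded into memory; the class name is hypothetical.</p>

```java
import java.nio.ByteBuffer;

final class FieldIndexSketch {
    // Returns the .fdt position of document n's field data,
    // read as the 8-byte value at byte offset n * 8 of the .fdx contents.
    static long fieldDataPosition(byte[] fdx, int n) {
        return ByteBuffer.wrap(fdx).getLong(n * 8);
    }
}
```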
|
|
<a name="N105D8"></a><a name="Term Dictionary"></a>
|
|
<h3 class="boxed">Term Dictionary</h3>
|
|
<p>
|
|
The term dictionary is represented as two files:
|
|
</p>
|
|
<ol>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
The term infos, or .tis file.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
TermInfoFile (.tis)-->
|
|
TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
|
|
</p>
|
|
|
|
<p>TIVersion -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>TermCount -->
|
|
UInt64
|
|
</p>
|
|
|
|
<p>IndexInterval -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>SkipInterval -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>MaxSkipLevels -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>TermInfos -->
|
|
<TermInfo>
|
|
<sup>TermCount</sup>
|
|
|
|
</p>
|
|
|
|
<p>TermInfo -->
|
|
<Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
|
|
</p>
|
|
|
|
<p>Term -->
|
|
<PrefixLength, Suffix, FieldNum>
|
|
</p>
|
|
|
|
<p>Suffix -->
|
|
String
|
|
</p>
|
|
|
|
<p>PrefixLength,
|
|
DocFreq, FreqDelta, ProxDelta, SkipDelta
|
|
<br>
|
|
--> VInt
|
|
</p>
|
|
|
|
<p>
|
|
This file is sorted by Term. Terms are
|
|
ordered first lexicographically (by UTF-16
character code) by the term's field name,
and within that lexicographically (by
UTF-16 character code) by the term's text.
|
|
</p>
|
|
|
|
<p>TIVersion names the version of the format
|
|
of this file and is -2 in Lucene 1.4.
|
|
</p>
|
|
|
|
<p>Term
|
|
text prefixes are shared. The PrefixLength is the number of initial
|
|
characters from the previous term which must be pre-pended to a
|
|
term's suffix in order to form the term's text. Thus, if the
|
|
previous term's text was "bone" and the term is "boy",
|
|
the PrefixLength is two and the suffix is "y".
|
|
</p>
|
|
|
|
<p>FieldNum
determines the term's field, whose name is stored in the .fnm file.
|
|
</p>
|
|
|
|
<p>DocFreq
|
|
is the count of documents which contain the term.
|
|
</p>
|
|
|
|
<p>FreqDelta
|
|
determines the position of this term's TermFreqs within the .frq
|
|
file. In particular, it is the difference between the position of
|
|
this term's data in that file and the position of the previous
|
|
term's data (or zero, for the first term in the file).
|
|
</p>
|
|
|
|
<p>ProxDelta
|
|
determines the position of this term's TermPositions within the .prx
|
|
file. In particular, it is the difference between the position of
|
|
this term's data in that file and the position of the previous
|
|
term's data (or zero, for the first term in the file).
|
|
</p>
|
|
|
|
<p>SkipDelta determines the position of this
|
|
term's SkipData within the .frq file. In
|
|
particular, it is the number of bytes
|
|
after TermFreqs that the SkipData starts.
|
|
In other words, it is the length of the
|
|
TermFreqs data. SkipDelta is only stored
|
|
if DocFreq is not smaller than SkipInterval.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>
|
|
The term info index, or .tii file.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
This contains every IndexInterval
|
|
<sup>th</sup>
|
|
entry from the .tis
|
|
file, along with its location in the .tis file. This is
designed to be read entirely into memory and used to provide random
access to the .tis file.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
The structure of this file is very similar to the
|
|
.tis file, with the addition of one item per record, the IndexDelta.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
TermInfoIndex (.tii)-->
|
|
TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
|
|
</p>
|
|
|
|
<p>TIVersion -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>IndexTermCount -->
|
|
UInt64
|
|
</p>
|
|
|
|
<p>IndexInterval -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>SkipInterval -->
|
|
UInt32
|
|
</p>
|
|
|
|
<p>TermIndices -->
|
|
<TermInfo, IndexDelta>
|
|
<sup>IndexTermCount</sup>
|
|
|
|
</p>
|
|
|
|
<p>IndexDelta -->
|
|
VLong
|
|
</p>
|
|
|
|
<p>IndexDelta
|
|
determines the position of this term's TermInfo within the .tis file. In
|
|
particular, it is the difference between the position of this term's
|
|
entry in that file and the position of the previous term's entry.
|
|
</p>
|
|
|
|
<p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
|
|
Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
|
|
smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
|
|
accelerable cases.</p>
|
|
|
|
<p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
See format of .frq file for more information about skip levels.</p>
|
|
|
|
</li>
|
|
|
|
</ol>
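<p>The prefix sharing used for term text can be sketched as follows, using the "bone" and "boy" example from above. The class name is hypothetical, and lengths are counted in Java <tt>char</tt> units here, which is an assumption about how the on-disk format counts characters.</p>

```java
final class TermPrefixSketch {
    // Rebuilds a term's text from the previous term's text, the shared
    // prefix length, and the stored suffix.
    static String decode(String previousTerm, int prefixLength, String suffix) {
        return previousTerm.substring(0, prefixLength) + suffix;
    }

    // Length of the prefix the current term shares with the previous one.
    static int sharedPrefixLength(String previousTerm, String term) {
        int limit = Math.min(previousTerm.length(), term.length());
        int i = 0;
        while (i < limit && previousTerm.charAt(i) == term.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

<p>Since the file is sorted by term, neighbouring terms tend to share long prefixes, which is what makes this encoding compact.</p>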
|
|
<a name="N10658"></a><a name="Frequencies"></a>
|
|
<h3 class="boxed">Frequencies</h3>
|
|
<p>
|
|
The .frq file contains the lists of documents
|
|
which contain each term, along with the frequency of the term in that
|
|
document.
|
|
</p>
|
|
<p>FreqFile (.frq) -->
|
|
<TermFreqs, SkipData>
|
|
<sup>TermCount</sup>
|
|
|
|
</p>
|
|
<p>TermFreqs -->
|
|
<TermFreq>
|
|
<sup>DocFreq</sup>
|
|
|
|
</p>
|
|
<p>TermFreq -->
|
|
DocDelta, Freq?
|
|
</p>
|
|
<p>SkipData -->
|
|
<<SkipLevelLength, SkipLevel>
|
|
<sup>NumSkipLevels-1</sup>, SkipLevel>
|
|
<SkipDatum>
|
|
</p>
|
|
<p>SkipLevel -->
|
|
<SkipDatum>
|
|
<sup>DocFreq/(SkipInterval^(Level + 1))</sup>
|
|
|
|
</p>
|
|
<p>SkipDatum -->
|
|
DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
|
|
</p>
|
|
<p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip -->
|
|
VInt
|
|
</p>
|
|
<p>SkipChildLevelPointer -->
|
|
VLong
|
|
</p>
|
|
<p>TermFreqs
|
|
are ordered by term (the term is implicit, from the .tis file).
|
|
</p>
|
|
<p>TermFreq
|
|
entries are ordered by increasing document number.
|
|
</p>
|
|
<p>DocDelta
|
|
determines both the document number and the frequency. In
|
|
particular, DocDelta/2 is the difference between this document number
|
|
and the previous document number (or zero when this is the first
|
|
document in a TermFreqs). When DocDelta is odd, the frequency is
|
|
one. When DocDelta is even, the frequency is read as another VInt.
|
|
</p>
|
|
<p>For
|
|
example, the TermFreqs for a term which occurs once in document seven
|
|
and three times in document eleven would be the following sequence of
|
|
VInts:
|
|
</p>
|
|
<p>15,
|
|
8, 3
|
|
</p>
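<p>The DocDelta rule can be illustrated with a short encoder that reproduces the sequence above. It is a sketch only: the class name is hypothetical, the output is a list of VInt values instead of encoded bytes, and every frequency is assumed to be at least one.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class TermFreqsSketch {
    // docs must be in increasing order; freqs[i] is the frequency in docs[i].
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> vints = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - previousDoc;
            previousDoc = docs[i];
            if (freqs[i] == 1) {
                // Odd DocDelta: frequency of one is implied.
                vints.add(delta * 2 + 1);
            } else {
                // Even DocDelta: the frequency follows as its own VInt.
                vints.add(delta * 2);
                vints.add(freqs[i]);
            }
        }
        return vints;
    }
}
```

<p>Calling <tt>encode</tt> with documents 7 and 11 and frequencies 1 and 3 yields 15, 8, 3.</p>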
|
|
<p>DocSkip records the document number before every
|
|
SkipInterval
|
|
<sup>th</sup>
|
|
document in TermFreqs.
|
|
If payloads are disabled for the term's field,
|
|
then DocSkip represents the difference from the
|
|
previous value in the sequence.
|
|
If payloads are enabled for the term's field,
|
|
then DocSkip/2 represents the difference from the
|
|
previous value in the sequence. If payloads are enabled
|
|
and DocSkip is odd,
|
|
then PayloadLength is stored indicating the length
|
|
of the last payload before the SkipInterval<sup>th</sup>
|
|
document in TermPositions.
|
|
FreqSkip and ProxSkip record the position of every
|
|
SkipInterval
|
|
<sup>th</sup>
|
|
entry in FreqFile and
|
|
ProxFile, respectively. File positions are
relative to the start of TermFreqs and Positions for the first SkipDatum, and
to the previous SkipDatum in the sequence thereafter.
|
|
</p>
|
|
<p>For example, if DocFreq=35 and SkipInterval=16,
|
|
then there are two SkipData entries, containing
|
|
the 15
|
|
<sup>th</sup>
|
|
and 31
|
|
<sup>st</sup>
|
|
document
|
|
numbers in TermFreqs. The first FreqSkip names
|
|
the number of bytes after the beginning of
|
|
TermFreqs that the 16
|
|
<sup>th</sup>
|
|
SkipDatum
|
|
starts, and the second the number of bytes after
|
|
that that the 32
|
|
<sup>nd</sup>
|
|
starts. The first
|
|
ProxSkip names the number of bytes after the
|
|
beginning of Positions that the 16
|
|
<sup>th</sup>
|
|
SkipDatum starts, and the second the number of
|
|
bytes after that that the 32
|
|
<sup>nd</sup>
|
|
starts.
|
|
</p>
|
|
<p>Lucene 2.2 introduces the notion of skip levels. Each term can have multiple skip levels.
The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip
level is Level=0. <br>
Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>,
27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br>
The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
to entry 31 on level 0.
|
|
</p>
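<p>The skip level arithmetic can be checked with a small sketch. It reads the level count as the smaller of MaxSkipLevels and floor(log(DocFreq)/log(SkipInterval)), which agrees with the worked example, and uses integer division instead of floating-point logarithms. The class name is hypothetical and SkipInterval is assumed to be at least 2.</p>

```java
final class SkipLevelsSketch {
    // min(maxSkipLevels, floor(log(docFreq) / log(skipInterval))),
    // computed by counting how often docFreq can be divided by skipInterval.
    static int numSkipLevels(int docFreq, int skipInterval, int maxSkipLevels) {
        int levels = 0;
        for (int n = docFreq; n >= skipInterval && levels < maxSkipLevels; n /= skipInterval) {
            levels++;
        }
        return levels;
    }

    // Number of entries on a level: docFreq / skipInterval^(level + 1).
    static int entriesOnLevel(int docFreq, int skipInterval, int level) {
        int entries = docFreq;
        for (int i = 0; i <= level; i++) {
            entries /= skipInterval;
        }
        return entries;
    }
}
```

<p>With SkipInterval = 4, MaxSkipLevels = 2 and DocFreq = 35 this gives two levels holding 8 and 2 entries.</p>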
|
|
<a name="N106DA"></a><a name="Positions"></a>
|
|
<h3 class="boxed">Positions</h3>
|
|
<p>
|
|
The .prx file contains the lists of positions that
|
|
each term occurs at within documents.
|
|
</p>
|
|
<p>ProxFile (.prx) -->
|
|
<TermPositions>
|
|
<sup>TermCount</sup>
|
|
|
|
</p>
|
|
<p>TermPositions -->
|
|
<Positions>
|
|
<sup>DocFreq</sup>
|
|
|
|
</p>
|
|
<p>Positions -->
|
|
<PositionDelta,Payload?>
|
|
<sup>Freq</sup>
|
|
|
|
</p>
|
|
<p>Payload -->
|
|
<PayloadLength?,PayloadData>
|
|
</p>
|
|
<p>PositionDelta -->
|
|
VInt
|
|
</p>
|
|
<p>PayloadLength -->
|
|
VInt
|
|
</p>
|
|
<p>PayloadData -->
|
|
byte<sup>PayloadLength</sup>
|
|
|
|
</p>
|
|
<p>TermPositions
|
|
are ordered by term (the term is implicit, from the .tis file).
|
|
</p>
|
|
<p>Positions
|
|
entries are ordered by increasing document number (the document
|
|
number is implicit from the .frq file).
|
|
</p>
|
|
<p>PositionDelta
|
|
is, if payloads are disabled for the term's field, the difference
|
|
between the position of the current occurrence in
|
|
the document and the previous occurrence (or zero, if this is the
|
|
first occurrence in this document).
|
|
If payloads are enabled for the term's field, then PositionDelta/2
|
|
is the difference between the current and the previous position. If
|
|
payloads are enabled and PositionDelta is odd, then PayloadLength is
|
|
stored, indicating the length of the payload at the current term position.
|
|
</p>
|
|
<p>
|
|
For example, the TermPositions for a
|
|
term which occurs as the fourth term in one document, and as the
|
|
fifth and ninth term in a subsequent document, would be the following
|
|
sequence of VInts (payloads disabled):
|
|
</p>
|
|
<p>4,
|
|
5, 4
|
|
</p>
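<p>The delta encoding of positions, with payloads disabled, can be sketched like this. The class name is hypothetical and the output is a list of VInt values instead of encoded bytes.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class PositionsSketch {
    // positionsPerDoc[d] holds the increasing positions of the term in the
    // d-th document that contains it; payloads are assumed to be disabled.
    static List<Integer> encode(int[][] positionsPerDoc) {
        List<Integer> vints = new ArrayList<>();
        for (int[] positions : positionsPerDoc) {
            int previous = 0; // deltas restart from zero for every document
            for (int position : positions) {
                vints.add(position - previous);
                previous = position;
            }
        }
        return vints;
    }
}
```

<p>Positions {4} in one document and {5, 9} in the next encode to 4, 5, 4.</p>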
|
|
<p>PayloadData
|
|
is metadata associated with the current term position. If PayloadLength
|
|
is stored at the current position, then it indicates the length of this
|
|
Payload. If PayloadLength is not stored, then this Payload has the same
|
|
length as the Payload at the previous position.
|
|
</p>
|
|
<a name="N10716"></a><a name="Normalization Factors"></a>
|
|
<h3 class="boxed">Normalization Factors</h3>
|
|
<p>
|
|
|
|
<b>Pre-2.1:</b>
|
|
There's a norm file for each indexed field with a byte for
|
|
each document. The .f[0-9]* file contains,
|
|
for each document, a byte that encodes a value that is multiplied
|
|
into the score for hits on that field:
|
|
</p>
|
|
<p>Norms
|
|
(.f[0-9]*) --> <Byte>
|
|
<sup>SegSize</sup>
|
|
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.1 and above:</b>
|
|
There's a single .nrm file containing all norms:
|
|
</p>
|
|
<p>AllNorms
|
|
(.nrm) --> NormsHeader,<Norms>
|
|
<sup>NumFieldsWithNorms</sup>
|
|
|
|
</p>
|
|
<p>Norms
|
|
--> <Byte>
|
|
<sup>SegSize</sup>
|
|
|
|
</p>
|
|
<p>NormsHeader
|
|
--> 'N','R','M',Version
|
|
</p>
|
|
<p>Version
|
|
--> Byte
|
|
</p>
|
|
<p>NormsHeader
|
|
has 4 bytes, the last of which is the format version for this file, currently -1.
|
|
</p>
|
|
<p>Each
|
|
byte encodes a floating point value. Bits 0-2 contain the 3-bit
|
|
mantissa, and bits 3-7 contain the 5-bit exponent.
|
|
</p>
|
|
<p>These
|
|
are converted to an IEEE single float value as follows:
|
|
</p>
|
|
<ol>
|
|
|
|
<li>
|
|
|
|
<p>If
|
|
the byte is zero, use a zero float.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>Otherwise,
|
|
set the sign bit of the float to zero;
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>add
|
|
48 to the exponent and use this as the float's exponent;
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>map
|
|
the mantissa to the high-order 3 bits of the float's mantissa; and
|
|
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>set
|
|
the low-order 21 bits of the float's mantissa to zero.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ol>
|
|
<p>A separate norm file is created when the norm values of an existing segment are modified.
|
|
When field <em>N</em> is modified, a separate norm file <em>.sN</em>
|
|
is created, to maintain the norm values for that field.
|
|
</p>
|
|
<p>
|
|
|
|
<b>Pre-2.1:</b>
|
|
Separate norm files are created only for compound segments.
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.1 and above:</b>
|
|
Separate norm files are created (when appropriate) for both compound and non-compound segments.
|
|
</p>
|
|
<a name="N1077F"></a><a name="Term Vectors"></a>
|
|
<h3 class="boxed">Term Vectors</h3>
|
|
<p>
|
|
Term Vector support is optional on a field-by-field
basis. It consists of 3 files.
|
|
</p>
|
|
<ol>
|
|
|
|
<li>
|
|
|
|
<p>The Document Index or .tvx file.</p>
|
|
|
|
<p>For each document, this stores the offset
|
|
into the document data (.tvd) and field
|
|
data (.tvf) files.
|
|
</p>
|
|
|
|
<p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition,FieldPosition>
|
|
<sup>NumDocs</sup>
|
|
|
|
</p>
|
|
|
|
<p>TVXVersion --> Int (3 (TermVectorsReader.FORMAT_VERSION2) for Lucene 2.4)</p>
|
|
|
|
<p>DocumentPosition --> UInt64 (offset in
|
|
the .tvd file)</p>
|
|
|
|
<p>FieldPosition --> UInt64 (offset in the
|
|
.tvf file)</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>The Document or .tvd file.</p>
|
|
|
|
<p>This contains, for each document, the number of fields, a list of the fields with
|
|
term vector info and finally a list of pointers to the field information in the .tvf
|
|
(Term Vector Fields) file.</p>
|
|
|
|
<p>
|
|
Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>
|
|
<sup>NumDocs</sup>
|
|
|
|
</p>
|
|
|
|
<p>TVDVersion --> Int (3 (TermVectorsReader.FORMAT_VERSION2) for Lucene 2.4)</p>
|
|
|
|
<p>NumFields --> VInt</p>
|
|
|
|
<p>FieldNums --> <FieldNumDelta>
|
|
<sup>NumFields</sup>
|
|
|
|
</p>
|
|
|
|
<p>FieldNumDelta --> VInt</p>
|
|
|
|
<p>FieldPositions --> <FieldPositionDelta>
|
|
<sup>NumFields-1</sup>
|
|
|
|
</p>
|
|
|
|
<p>FieldPositionDelta --> VLong</p>
|
|
|
|
<p>The .tvd file is used to map out the fields that have term vectors stored and
|
|
where the field information is in the .tvf file.</p>
|
|
|
|
</li>
|
|
|
|
<li>
|
|
|
|
<p>The Field or .tvf file.</p>
|
|
|
|
<p>This file contains, for each field that has a term vector stored, a list of
|
|
the terms, their frequencies and, optionally, position and offset information.</p>
|
|
|
|
<p>Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>
|
|
<sup>NumFields</sup>
|
|
|
|
</p>
|
|
|
|
<p>TVFVersion --> Int (3 (TermVectorsReader.FORMAT_VERSION2) for Lucene 2.4)</p>
|
|
|
|
<p>NumTerms --> VInt</p>
|
|
|
|
<p>Position/Offset --> Byte</p>
|
|
|
|
<p>TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>
|
|
<sup>NumTerms</sup>
|
|
|
|
</p>
|
|
|
|
<p>TermText --> <PrefixLength, Suffix></p>
|
|
|
|
<p>PrefixLength --> VInt</p>
|
|
|
|
<p>Suffix --> String</p>
|
|
|
|
<p>TermFreq --> VInt</p>
|
|
|
|
<p>Positions --> <VInt><sup>TermFreq</sup>
|
|
</p>
|
|
|
|
<p>Offsets --> <VInt, VInt><sup>TermFreq</sup>
|
|
</p>
|
|
|
|
<br>
|
|
|
|
<p>Notes:</p>
|
|
|
|
<ul>
|
|
|
|
<li>Position/Offset byte stores whether this term vector has position or offset information stored.</li>
|
|
|
|
<li>Term
|
|
text prefixes are shared. The PrefixLength is the number of initial
|
|
characters from the previous term which must be pre-pended to a
|
|
term's suffix in order to form the term's text. Thus, if the
|
|
previous term's text was "bone" and the term is "boy",
|
|
the PrefixLength is two and the suffix is "y".
|
|
</li>
|
|
|
|
<li>Positions are stored as delta encoded VInts. This means we only store the difference of the current position from the last position.</li>
|
|
|
|
<li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li>
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
</li>
|
|
|
|
</ol>
|
|
<a name="N10815"></a><a name="Deleted Documents"></a>
|
|
<h3 class="boxed">Deleted Documents</h3>
|
|
<p>The .del file is
|
|
optional, and only exists when a segment contains deletions.
|
|
</p>
|
|
<p>Although per-segment, this file is maintained exterior to compound segment files.
|
|
</p>
|
|
<p>
|
|
|
|
<b>Pre-2.1:</b>
|
|
Deletions
|
|
(.del) --> ByteCount,BitCount,Bits
|
|
</p>
|
|
<p>
|
|
|
|
<b>2.1 and above:</b>
|
|
Deletions
|
|
(.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
|
|
</p>
|
|
<p>Format,ByteCount,BitCount -->
|
|
Uint32
|
|
</p>
|
|
<p>Bits -->
|
|
<Byte>
|
|
<sup>ByteCount</sup>
|
|
|
|
</p>
|
|
<p>DGaps -->
|
|
<DGap,NonzeroByte>
|
|
<sup>NonzeroBytesCount</sup>
|
|
|
|
</p>
|
|
<p>DGap -->
|
|
VInt
|
|
</p>
|
|
<p>NonzeroByte -->
|
|
Byte
|
|
</p>
|
|
<p>Format
|
|
is optional. A value of -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded.
|
|
</p>
|
|
<p>ByteCount
|
|
indicates the number of bytes in Bits. It is typically
|
|
(SegSize/8)+1.
|
|
</p>
|
|
<p>
|
|
BitCount
|
|
indicates the number of bits that are currently set in Bits.
|
|
</p>
|
|
<p>Bits
|
|
contains one bit for each document indexed. When the bit
|
|
corresponding to a document number is set, that document is marked as
|
|
deleted. Bit ordering is from least to most significant. Thus, if
|
|
Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
|
|
deleted.
|
|
</p>
|
|
<p>DGaps
|
|
represents sparse bit-vectors more efficiently than Bits.
|
|
It is made of DGaps on indexes of nonzero bytes in Bits,
|
|
and the nonzero bytes themselves. The number of nonzero bytes
|
|
in Bits (NonzeroBytesCount) is not stored.
|
|
</p>
|
|
<p>For example,
|
|
if there are 8000 bits and only bits 10,12,32 are set,
|
|
DGaps would be used:
|
|
</p>
|
|
<p>
|
|
(VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1
|
|
</p>
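<p>The DGaps example can be reproduced with a short sketch. The class and method names are hypothetical, and the output is a flat list of alternating DGap and NonzeroByte values instead of encoded bytes.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class DGapsSketch {
    // Marks document doc as deleted; bits are numbered from least significant.
    static void set(byte[] bits, int doc) {
        bits[doc >> 3] |= (byte) (1 << (doc & 7));
    }

    // Emits, for every nonzero byte of bits, the gap from the previous
    // nonzero byte's index (or from index 0) followed by the byte value.
    static List<Integer> encode(byte[] bits) {
        List<Integer> out = new ArrayList<>();
        int last = 0;
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                out.add(i - last);
                out.add(bits[i] & 0xFF);
                last = i;
            }
        }
        return out;
    }
}
```

<p>Setting bits 10, 12 and 32 in a 1000-byte vector produces 1, 20, 3, 1: byte 1 holds 0x14 (decimal 20) and byte 4, three bytes later, holds 1.</p>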
|
|
</div>
|
|
|
|
|
|
<a name="N10858"></a><a name="Limitations"></a>
|
|
<h2 class="boxed">Limitations</h2>
|
|
<div class="section">
|
|
<p>There
|
|
are a few places where these file formats limit the maximum number of
|
|
terms and documents to a 32-bit quantity, or to approximately 4
|
|
billion. This is not a problem today but, in the long term,
|
|
probably will be. These should therefore be replaced with either
|
|
UInt64 values, or better yet, with VInt values which have no limit.
|
|
</p>
|
|
</div>
|
|
|
|
|
|
</div>
|
|
<!--+
|
|
|end content
|
|
+-->
|
|
<div class="clearboth"> </div>
|
|
</div>
|
|
<div id="footer">
|
|
<!--+
|
|
|start bottomstrip
|
|
+-->
|
|
<div class="lastmodified">
|
|
<script type="text/javascript"><!--
|
|
document.write("Last Published: " + document.lastModified);
|
|
// --></script>
|
|
</div>
|
|
<div class="copyright">
|
|
Copyright ©
|
|
2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
|
|
</div>
|
|
<!--+
|
|
|end bottomstrip
|
|
+-->
|
|
</div>
|
|
</body>
|
|
</html>
|