file formats

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4765@1446988 13f79535-47bb-0310-9956-ffa450edef68
Robert Muir 2013-02-17 01:16:53 +00:00
parent a4af217886
commit 4991491bda
3 changed files with 15 additions and 4 deletions

@@ -375,7 +375,8 @@ can optionally be indexed into the postings lists. Payloads can be stored in the
term vectors.</li>
<li>In version 4.1, the format of the postings list changed to use either
of FOR compression or variable-byte encoding, depending upon the frequency
-of the term.</li>
+of the term. Terms appearing only once were changed to inline directly into
+the term dictionary. Stored fields are compressed by default. </li>
</ul>
<a name="Limitations" id="Limitations"></a>
<h2>Limitations</h2>

@@ -34,7 +34,7 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
/**
* Lucene 4.2 DocValues format.
* <p>
-* Encodes the three per-document value types (Numeric,Binary,Sorted) with five basic strategies.
+* Encodes the four per-document value types (Numeric,Binary,Sorted,SortedSet) with seven basic strategies.
* <p>
* <ul>
* <li>Delta-compressed Numerics: per-document integers written in blocks of 4096. For each block
@@ -51,7 +51,9 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
* start for the block, and the average (expected) delta per entry. For each document the
* deviation from the delta (actual - expected) is written.
* <li>Sorted: an FST mapping deduplicated terms to ordinals is written, along with the per-document
-* ordinals written using one of the numeric stratgies above.
+* ordinals written using one of the numeric strategies above.
+* <li>SortedSet: an FST mapping deduplicated terms to ordinals is written, along with the per-document
+* ordinal list written using one of the binary strategies above.
* </ul>
* <p>
* Files:
@@ -77,6 +79,8 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
* </ul>
* <p>Sorted fields have two entries: a SortedEntry with the FST metadata,
* and an ordinary NumericEntry for the document-to-ord metadata.</p>
+* <p>SortedSet fields have two entries: a SortedEntry with the FST metadata,
+* and an ordinary BinaryEntry for the document-to-ord-list metadata.</p>
* <p>FieldNumber of -1 indicates the end of metadata.</p>
* <p>EntryType is a 0 (NumericEntry), 1 (BinaryEntry, or 2 (SortedEntry)</p>
* <p>DataOffset is the pointer to the start of the data in the DocValues data (.dvd)</p>
@@ -107,6 +111,8 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
* <li>UncompressedNumerics --&gt; {@link DataOutput#writeByte Byte}<sup>maxdoc</sup></li>
* <li>Addresses --&gt; {@link MonotonicBlockPackedWriter MonotonicBlockPackedInts(blockSize=4096)}</li>
* </ul>
+* <p>SortedSet entries store the list of ordinals in their BinaryData as a
+* sequences of increasing {@link DataOutput#writeVLong vLong}s, delta-encoded.</p>
* </ol>
*/
public final class Lucene42DocValuesFormat extends DocValuesFormat {
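
The Javadoc added above says SortedSet entries store each document's ordinals as increasing vLongs, delta-encoded. The following is a minimal, self-contained sketch of that idea, not the code from this commit. The class `OrdListCodec` and its `encode`/`decode` helpers are hypothetical names. The local `writeVLong` re-implements the usual variable-length layout (7 payload bits per byte, high bit set when more bytes follow) so the example runs without Lucene. It also assumes the first ordinal is written as a delta from 0.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene.
class OrdListCodec {

    // Variable-length long: 7 payload bits per byte, low-order bits first.
    // The high bit of a byte is set when another byte follows.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes an increasing list of ordinals as deltas from the previous
    // ordinal (assumption: the first delta is taken from 0).
    static byte[] encode(long[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long ord : sortedOrds) {
            writeVLong(out, ord - previous);
            previous = ord;
        }
        return out.toByteArray();
    }

    // Reads vLong deltas until the buffer is exhausted and rebuilds the ordinals.
    static long[] decode(byte[] data) {
        long[] ords = new long[data.length]; // upper bound: at least one byte per ordinal
        int count = 0;
        int pos = 0;
        long previous = 0;
        while (pos < data.length) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (b & 0x7FL) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ords[count++] = previous;
        }
        return Arrays.copyOf(ords, count);
    }

    public static void main(String[] args) {
        long[] ords = {3, 7, 8, 200};
        byte[] encoded = encode(ords);
        System.out.println(encoded.length + " bytes: " + Arrays.toString(decode(encoded)));
    }
}
```

Because the ordinals within a document are increasing, the deltas stay small, and most of them fit in a single byte. In the sample list only the jump from 8 to 200 needs two bytes.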

@@ -375,7 +375,11 @@ can optionally be indexed into the postings lists. Payloads can be stored in the
term vectors.</li>
<li>In version 4.1, the format of the postings list changed to use either
of FOR compression or variable-byte encoding, depending upon the frequency
-of the term.</li>
+of the term. Terms appearing only once were changed to inline directly into
+the term dictionary. Stored fields are compressed by default. </li>
+<li>In version 4.2, term vectors are compressed by default. DocValues has
+a new multi-valued type (SortedSet), that can be used for faceting/grouping/joining
+on multi-valued fields.</li>
</ul>
<a name="Limitations" id="Limitations"></a>
<h2>Limitations</h2>