file formats

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4765@1446988 13f79535-47bb-0310-9956-ffa450edef68
2013-02-17 01:16:53 +00:00 · 2013-02-17 01:16:53 +00:00 · 4991491bda
parent a4af217886
commit 4991491bda
3 changed files with 15 additions and 4 deletions
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene41/package.html
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene41/package.html
@ -375,7 +375,8 @@ can optionally be indexed into the postings lists. Payloads can be stored in the
 term vectors.</li>
 <li>In version 4.1, the format of the postings list changed to use either
 of FOR compression or variable-byte encoding, depending upon the frequency
-of the term.</li>
+of the term. Terms appearing only once were changed to inline directly into
+the term dictionary. Stored fields are compressed by default. </li>
 </ul>
 <a name="Limitations" id="Limitations"></a>
 <h2>Limitations</h2>
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42DocValuesFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42DocValuesFormat.java
@ -34,7 +34,7 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
 /**
 * Lucene 4.2 DocValues format.
 * <p>
- * Encodes the three per-document value types (Numeric,Binary,Sorted) with five basic strategies.
+ * Encodes the four per-document value types (Numeric,Binary,Sorted,SortedSet) with seven basic strategies.
 * <p>
 * <ul>
 *    <li>Delta-compressed Numerics: per-document integers written in blocks of 4096. For each block
@ -51,7 +51,9 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
 *        start for the block, and the average (expected) delta per entry. For each document the 
 *        deviation from the delta (actual - expected) is written.
 *    <li>Sorted: an FST mapping deduplicated terms to ordinals is written, along with the per-document
- *        ordinals written using one of the numeric stratgies above.
+ *        ordinals written using one of the numeric strategies above.
+ *    <li>SortedSet: an FST mapping deduplicated terms to ordinals is written, along with the per-document
+ *        ordinal list written using one of the binary strategies above.  
 * </ul>
 * <p>
 * Files:
@ -77,6 +79,8 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
 *   </ul>
 *   <p>Sorted fields have two entries: a SortedEntry with the FST metadata,
 *      and an ordinary NumericEntry for the document-to-ord metadata.</p>
+ *   <p>SortedSet fields have two entries: a SortedEntry with the FST metadata,
+ *      and an ordinary BinaryEntry for the document-to-ord-list metadata.</p>
 *   <p>FieldNumber of -1 indicates the end of metadata.</p>
 *   <p>EntryType is a 0 (NumericEntry), 1 (BinaryEntry, or 2 (SortedEntry)</p>
 *   <p>DataOffset is the pointer to the start of the data in the DocValues data (.dvd)</p>
@ -107,6 +111,8 @@ import org.apache.lucene.util.packed.BlockPackedWriter;
 *     <li>UncompressedNumerics --&gt; {@link DataOutput#writeByte Byte}<sup>maxdoc</sup></li>
 *     <li>Addresses --&gt; {@link MonotonicBlockPackedWriter MonotonicBlockPackedInts(blockSize=4096)}</li>
 *   </ul>
+ *   <p>SortedSet entries store the list of ordinals in their BinaryData as a
+ *      sequences of increasing {@link DataOutput#writeVLong vLong}s, delta-encoded.</p>       
 * </ol>
 */
 public final class Lucene42DocValuesFormat extends DocValuesFormat {
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene42/package.html
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene42/package.html
@ -375,7 +375,11 @@ can optionally be indexed into the postings lists. Payloads can be stored in the
 term vectors.</li>
 <li>In version 4.1, the format of the postings list changed to use either
 of FOR compression or variable-byte encoding, depending upon the frequency
-of the term.</li>
+of the term. Terms appearing only once were changed to inline directly into
+the term dictionary. Stored fields are compressed by default. </li>
+<li>In version 4.2, term vectors are compressed by default. DocValues has 
+a new multi-valued type (SortedSet), that can be used for faceting/grouping/joining
+on multi-valued fields.</li>
 </ul>
 <a name="Limitations" id="Limitations"></a>
 <h2>Limitations</h2>