git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4547@1436696 13f79535-47bb-0310-9956-ffa450edef68
Robert Muir 2013-01-22 00:01:38 +00:00
parent 28567c2327
commit a53962cf5f
13 changed files with 62 additions and 27 deletions


@@ -37,10 +37,7 @@ import org.apache.lucene.util.PriorityQueue;
 // prototype streaming DV api
 public abstract class DocValuesConsumer implements Closeable {
-// TODO: are any of these params too "infringing" on codec?
-// we want codec to get necessary stuff from IW, but trading off against merge complexity.
-// nocommit should we pass SegmentWriteState...?
 public abstract void addNumericField(FieldInfo field, Iterable<Number> values) throws IOException;
 public abstract void addBinaryField(FieldInfo field, Iterable<BytesRef> values) throws IOException;
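
Because the values arrive as an `Iterable<Number>` rather than an array, a consumer can iterate them more than once without the caller materializing them. The following is a minimal sketch of that pattern, not Lucene code: `MinMaxConsumer` is a hypothetical class, a plain `String` stands in for `FieldInfo`, and the "write" step is only simulated by collecting deltas in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a DocValuesConsumer-style writer; not part of Lucene.
final class MinMaxConsumer {
  // Values relative to the minimum, as "written" by the second pass.
  final List<Long> deltas = new ArrayList<>();
  long min;
  long max;

  // Assumption: fieldName replaces Lucene's FieldInfo to keep the sketch self-contained.
  void addNumericField(String fieldName, Iterable<Number> values) {
    // Pass 1: gather statistics without buffering the values.
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (Number n : values) {
      long v = n.longValue();
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    min = lo;
    max = hi;
    // Pass 2: iterate again and encode each value as an offset from the minimum.
    for (Number n : values) {
      deltas.add(n.longValue() - lo);
    }
  }

  public static void main(String[] args) {
    MinMaxConsumer consumer = new MinMaxConsumer();
    consumer.addNumericField("popularity", Arrays.<Number>asList(7L, 3L, 10L));
    System.out.println(consumer.min + " " + consumer.max + " " + consumer.deltas);
  }
}
```

The two-pass shape is the design point here: statistics gathered in the first pass decide the encoding used in the second, which only works if the input can be iterated again.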


@@ -363,11 +363,11 @@ file, previously they were stored in text format only.</li>
 frequencies.</li>
 <li>In version 4.0, the format of the inverted index became extensible via
 the {@link org.apache.lucene.codecs.Codec Codec} api. Fast per-document storage
-({@link org.apache.lucene.index.DocValues DocValues}) was introduced. Normalization
-factors need no longer be a single byte, they can be any DocValues
-{@link org.apache.lucene.index.DocValues.Type type}. Terms need not be unicode
-strings, they can be any byte sequence. Term offsets can optionally be indexed
-into the postings lists. Payloads can be stored in the term vectors.</li>
+({@code DocValues}) was introduced. Normalization factors need no longer be a
+single byte, they can be any {@link org.apache.lucene.index.NumericDocValues NumericDocValues}.
+Terms need not be unicode strings, they can be any byte sequence. Term offsets
+can optionally be indexed into the postings lists. Payloads can be stored in the
+term vectors.</li>
 </ul>
 <a name="Limitations" id="Limitations"></a>
 <h2>Limitations</h2>


@@ -368,11 +368,11 @@ file, previously they were stored in text format only.</li>
 frequencies.</li>
 <li>In version 4.0, the format of the inverted index became extensible via
 the {@link org.apache.lucene.codecs.Codec Codec} api. Fast per-document storage
-({@link org.apache.lucene.index.DocValues DocValues}) was introduced. Normalization
-factors need no longer be a single byte, they can be any DocValues
-{@link org.apache.lucene.index.DocValues.Type type}. Terms need not be unicode
-strings, they can be any byte sequence. Term offsets can optionally be indexed
-into the postings lists. Payloads can be stored in the term vectors.</li>
+({@code DocValues}) was introduced. Normalization factors need no longer be a
+single byte, they can be any {@link org.apache.lucene.index.NumericDocValues NumericDocValues}.
+Terms need not be unicode strings, they can be any byte sequence. Term offsets
+can optionally be indexed into the postings lists. Payloads can be stored in the
+term vectors.</li>
 <li>In version 4.1, the format of the postings list changed to use either
 of FOR compression or variable-byte encoding, depending upon the frequency
 of the term.</li>


@@ -368,11 +368,11 @@ file, previously they were stored in text format only.</li>
 frequencies.</li>
 <li>In version 4.0, the format of the inverted index became extensible via
 the {@link org.apache.lucene.codecs.Codec Codec} api. Fast per-document storage
-({@link org.apache.lucene.index.DocValues DocValues}) was introduced. Normalization
-factors need no longer be a single byte, they can be any DocValues
-{@link org.apache.lucene.index.DocValues.Type type}. Terms need not be unicode
-strings, they can be any byte sequence. Term offsets can optionally be indexed
-into the postings lists. Payloads can be stored in the term vectors.</li>
+({@code DocValues}) was introduced. Normalization factors need no longer be a
+single byte, they can be any {@link org.apache.lucene.index.NumericDocValues NumericDocValues}.
+Terms need not be unicode strings, they can be any byte sequence. Term offsets
+can optionally be indexed into the postings lists. Payloads can be stored in the
+term vectors.</li>
 <li>In version 4.1, the format of the postings list changed to use either
 of FOR compression or variable-byte encoding, depending upon the frequency
 of the term.</li>


@@ -182,9 +182,6 @@ public abstract class PerFieldDocValuesFormat extends DocValuesFormat {
 }
 }
-// nocommit what if SimpleNormsFormat wants to use this
-// ...? we have a "boolean isNorms" issue...? I guess we
-// just need to make a PerFieldNormsFormat?
 private class FieldsReader extends DocValuesProducer {
 private final Map<String,DocValuesProducer> fields = new TreeMap<String,DocValuesProducer>();


@@ -416,7 +416,7 @@ public class FieldType implements IndexableFieldType {
 * {@inheritDoc}
 * <p>
 * The default is <code>null</code> (no docValues)
-* @see #setDocValueType(DocValuesType)
+* @see #setDocValueType(org.apache.lucene.index.FieldInfo.DocValuesType)
 */
 @Override
 public DocValuesType docValueType() {


@@ -175,10 +175,10 @@ public abstract class AtomicReader extends IndexReader {
 * used by a single thread. */
 public abstract SortedDocValues getSortedDocValues(String field) throws IOException;
-// nocommit document that these are thread-private:
 /** Returns {@link NumericDocValues} representing norms
 * for this field, or null if no {@link NumericDocValues}
-* were indexed. */
+* were indexed. The returned instance should only be
+* used by a single thread. */
 public abstract NumericDocValues getNormValues(String field) throws IOException;
 /**
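
One way a caller might honor the "single thread" restriction documented here is to obtain a separate instance per thread. This is a minimal sketch under stated assumptions: `Norms` and `PerThreadNorms` are hypothetical stand-ins, not Lucene classes, and the supplier plays the role of a call such as `getNormValues(field)`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a per-document lookup that is not safe to share across threads.
abstract class Norms {
  abstract long get(int docID);
}

// Hands each calling thread its own Norms instance, created lazily on first use.
final class PerThreadNorms {
  // Counts how many instances were created, purely for demonstration.
  final AtomicInteger created = new AtomicInteger();
  private final ThreadLocal<Norms> local;

  PerThreadNorms(Supplier<Norms> factory) {
    local = ThreadLocal.withInitial(() -> {
      created.incrementAndGet();
      return factory.get();
    });
  }

  long get(int docID) {
    return local.get().get(docID);
  }

  public static void main(String[] args) throws InterruptedException {
    long[] data = {5, 8, 13};
    PerThreadNorms norms = new PerThreadNorms(() -> new Norms() {
      @Override
      long get(int docID) {
        return data[docID];
      }
    });
    Thread worker = new Thread(() -> norms.get(1));
    worker.start();
    worker.join();
    norms.get(2);
    // One instance for the worker thread plus one for the main thread.
    System.out.println(norms.created.get());
  }
}
```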


@@ -19,6 +19,9 @@ package org.apache.lucene.index;
 import org.apache.lucene.util.BytesRef;
+/**
+* A per-document byte[]
+*/
 public abstract class BinaryDocValues {
 /** Lookup the value for document.
@@ -29,8 +32,12 @@ public abstract class BinaryDocValues {
 * "private" instance should be used for each source. */
 public abstract void get(int docID, BytesRef result);
+/**
+* Indicates the value was missing for the document.
+*/
 public static final byte[] MISSING = new byte[0];
+/** An empty BinaryDocValues which returns empty bytes for every document */
 public static final BinaryDocValues EMPTY = new BinaryDocValues() {
 @Override
 public void get(int docID, BytesRef result) {


@@ -17,9 +17,18 @@ package org.apache.lucene.index;
 * limitations under the License.
 */
+/**
+* A per-document numeric value.
+*/
 public abstract class NumericDocValues {
+/**
+* Returns the numeric value for the specified document ID.
+* @param docID document ID to lookup
+* @return numeric value
+*/
 public abstract long get(int docID);
+/** An empty NumericDocValues which returns zero for every document */
 public static final NumericDocValues EMPTY = new NumericDocValues() {
 @Override
 public long get(int docID) {


@@ -19,11 +19,35 @@ package org.apache.lucene.index;
 import org.apache.lucene.util.BytesRef;
+/**
+* A per-document byte[] with presorted values.
+* <p>
+* Per-Document values in a SortedDocValues are deduplicated, dereferenced,
+* and sorted into a dictionary of unique values. A pointer to the
+* dictionary value (ordinal) can be retrieved for each document. Ordinals
+* are dense and in increasing sorted order.
+*/
 public abstract class SortedDocValues extends BinaryDocValues {
+/**
+* Returns the ordinal for the specified docID.
+* @param docID document ID to lookup
+* @return ordinal for the document: this is dense, starts at 0, then
+* increments by 1 for the next value in sorted order.
+*/
 public abstract int getOrd(int docID);
+/** Retrieves the value for the specified ordinal.
+* @param ord ordinal to lookup
+* @param result will be populated with the ordinal's value
+* @see #getOrd(int)
+*/
 public abstract void lookupOrd(int ord, BytesRef result);
+/**
+* Returns the number of unique values.
+* @return number of unique values in this SortedDocValues. This is
+* also equivalent to one plus the maximum ordinal.
+*/
 public abstract int getValueCount();
 @Override
@@ -37,6 +61,7 @@ public abstract class SortedDocValues extends BinaryDocValues {
 }
 }
+/** An empty SortedDocValues which returns empty bytes for every document */
 public static final SortedDocValues EMPTY = new SortedDocValues() {
 @Override
 public int getOrd(int docID) {
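
The ordinal scheme described in the new `SortedDocValues` javadoc can be modeled in memory as a sorted dictionary of unique values plus one ordinal per document. This is an illustrative sketch only, not how Lucene stores the data: `OrdinalDictionarySketch` is a hypothetical class, and it uses `String` values in natural order where the real API works with `BytesRef`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of deduplicated, sorted per-document values; not Lucene code.
final class OrdinalDictionarySketch {
  private final List<String> dictionary; // unique values in sorted order
  private final int[] ords;              // docID -> ordinal into the dictionary

  OrdinalDictionarySketch(List<String> perDocValues) {
    // A TreeSet both deduplicates and sorts the values.
    dictionary = new ArrayList<>(new TreeSet<>(perDocValues));
    ords = new int[perDocValues.size()];
    for (int docID = 0; docID < ords.length; docID++) {
      // Every value is present in the dictionary, so the search result is its ordinal.
      ords[docID] = Collections.binarySearch(dictionary, perDocValues.get(docID));
    }
  }

  int getOrd(int docID) {
    return ords[docID];
  }

  String lookupOrd(int ord) {
    return dictionary.get(ord);
  }

  // Equals one plus the maximum ordinal, because ordinals are dense and start at 0.
  int getValueCount() {
    return dictionary.size();
  }

  public static void main(String[] args) {
    OrdinalDictionarySketch dv =
        new OrdinalDictionarySketch(Arrays.asList("b", "a", "b", "c"));
    for (int docID = 0; docID < 4; docID++) {
      System.out.println(docID + " -> ord " + dv.getOrd(docID) + " = " + dv.lookupOrd(dv.getOrd(docID)));
    }
    System.out.println("valueCount = " + dv.getValueCount());
  }
}
```

Because ordinals follow sorted order, two documents can be compared by their ordinals alone, without looking up the underlying values.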


@@ -254,7 +254,7 @@ its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
 </p>
 <p>
 Additional user-supplied statistics can be added to the document as DocValues fields and
-accessed via {@link org.apache.lucene.index.AtomicReader#docValues}.
+accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.
 </p>
 <p>
 </body>


@@ -338,7 +338,7 @@ extend by plugging in a different component (e.g. term frequency normalizer).
 Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
 to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
 a custom Similarity can access per-document values via {@link org.apache.lucene.search.FieldCache FieldCache} or
-{@link org.apache.lucene.index.DocValues} and integrate them into the score.
+{@link org.apache.lucene.index.NumericDocValues} and integrate them into the score.
 </p>
 <p>
 See the {@link org.apache.lucene.search.similarities} package documentation for information


@@ -132,7 +132,7 @@ subclassing the Similarity, one can simply introduce a new basic model and tell
 matching term occurs. In these
 cases people have overridden Similarity to return 1 from the tf() method.</p></li>
 <li><p>Changing Length Normalization &mdash; By overriding
-{@link org.apache.lucene.search.similarities.Similarity#computeNorm(FieldInvertState state, Norm)},
+{@link org.apache.lucene.search.similarities.Similarity#computeNorm(FieldInvertState state)},
 it is possible to discount how the length of a field contributes
 to a score. In {@link org.apache.lucene.search.similarities.DefaultSimilarity},
 lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be