diff --git a/CHANGES.txt b/CHANGES.txt index 0b96235e3ca..497e7e17573 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -76,9 +76,9 @@ API Changes SingleInstanceLockFactory (ie, in memory locking) locking with an FSDirectory. Note that now you must call setDisableLocks before the instantiation a FSDirectory if you wish to disable locking - for that Directory. + for that Directory. (Michael McCandless, Jeff Patterson via Yonik Seeley) - + Bug fixes 1. Fixed the web application demo (built with "ant war-demo") which @@ -127,7 +127,7 @@ Bug fixes has no value. (Oliver Hutchison via Chris Hostetter) - + Optimizations 1. LUCENE-586: TermDocs.skipTo() is now more efficient for multi-segment @@ -164,7 +164,8 @@ Documentation 1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll) - 2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless) + 2. Added scoring.xml document into xdocs. Updated Similarity.java scoring formula.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting). Issue 664. + Release 2.0.0 2006-05-26 diff --git a/docs/scoring.html b/docs/scoring.html index 7ddb74b9151..76b3d7639f3 100644 --- a/docs/scoring.html +++ b/docs/scoring.html @@ -188,6 +188,63 @@ limitations under the License.
+ + + + +
+ + Score Boosting + +
+
+

Lucene allows influencing search results by "boosting" at more than one level: +

    +
  • Document level boosting + - while indexing - by calling + document.setBoost() + before a document is added to the index. +
  • +
  • Document's Field level boosting + - while indexing - by calling + field.setBoost() + before adding a field to the document (and before adding the document to the index). +
  • +
  • Query level boosting + - during search, by setting a boost on a query clause, calling + Query.setBoost(). +
  • +
+

+

Indexing time boosts are preprocessed for storage efficiency and written to + the directory (when writing the document) in a single byte (!) as follows: + For each field of a document, all boosts of that field + (i.e. all boosts under the same field name in that doc) are multiplied. + The result is multiplied by the boost of the document, + and also multiplied by a "field length norm" value + that represents the length of that field in that doc + (so shorter fields are automatically boosted up). + The result is encoded as a single byte + (with some precision loss of course) and stored in the directory. + The similarity object in effect at indexing computes the length-norm of the field. +
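The multiplication described above can be sketched in plain Java. This is a minimal illustration, not Lucene code: the class `NormCompositionSketch` and its methods are hypothetical, and the length norm is assumed to be 1/sqrt(numTerms), which is how the default similarity is understood to compute it.

```java
/** Hypothetical helper, not part of Lucene: multiplies the index-time factors. */
final class NormCompositionSketch {

    /** Assumed default length norm: shorter fields yield larger values. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /**
     * Document boost times length norm times the product of the boosts of
     * every field instance that shares the same field name.
     */
    static float norm(float docBoost, int numTerms, float... fieldBoosts) {
        float product = 1.0f;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // Document boost 1.0, a 4-term field, two same-named instances boosted 2.0 and 1.5.
        System.out.println(norm(1.0f, 4, 2.0f, 1.5f)); // prints 1.5
    }
}
```

The float computed here is the value that is subsequently squeezed into one byte, so the precision actually stored in the index is lower than this sketch suggests.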

+

This composition of 1-byte representation of norms + (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) + is nicely described in + Fieldable.setBoost(). +

+

Encoding and decoding of the resulting float norm in a single byte are done by the + static methods of the class Similarity: + encodeNorm() and + decodeNorm(). + Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, + e.g. decode(encode(0.89)) = 0.75. + At scoring (search) time, this norm is brought into the score of the document + as norm(t,d), as shown by the formula in + Similarity. +
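To make the precision loss concrete, here is a minimal sketch of a truncating float-to-byte encoding. It is not Lucene's implementation: the class `ByteNormSketch`, the shift of 21 bits and the exponent offset are assumptions chosen for illustration, so the values it produces are not guaranteed to match those of encodeNorm() and decodeNorm(), including the 0.89 example quoted above.

```java
/** Hypothetical 8-bit float codec, for illustration only. */
final class ByteNormSketch {
    // Assumed layout: keep 8 bits of the IEEE-754 pattern starting at bit 21.
    private static final int SHIFT = 21;
    // Assumed exponent offset so that typical norm values fit in 0..255.
    private static final int OFFSET = 48 << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= OFFSET) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= OFFSET + 0x100) {
            return (byte) -1; // overflow: clamp to the largest code
        }
        return (byte) (small - OFFSET);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) + OFFSET) << SHIFT;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(1.0f))); // prints 1.0
        System.out.println(decode(encode(0.3f))); // prints 0.25
    }
}
```

Because the low mantissa bits are simply dropped, every value between two representable steps collapses to the lower step, which is why decode(encode(x)) = x cannot be relied on.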

+
+

@@ -295,7 +284,7 @@ limitations under the License. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query - section below + section below highlights some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring -- diff --git a/src/java/org/apache/lucene/search/Similarity.java b/src/java/org/apache/lucene/search/Similarity.java index 1e6e152bf55..98799311650 100644 --- a/src/java/org/apache/lucene/search/Similarity.java +++ b/src/java/org/apache/lucene/search/Similarity.java @@ -16,67 +16,271 @@ package org.apache.lucene.search; * limitations under the License. */ -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.IndexWriter; -import org.apache.lucene.index.Term; -import org.apache.lucene.util.SmallFloat; - import java.io.IOException; import java.io.Serializable; import java.util.Collection; import java.util.Iterator; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexWriter; +import org.apache.lucene.index.Term; +import org.apache.lucene.util.SmallFloat; + /** Expert: Scoring API. *

Subclasses implement search scoring. * - *

The score of query q for document d is defined - * in terms of these methods as follows: + *

The score of query q for document d correlates to the + * cosine-distance or dot-product between document and query vectors in a + * + * Vector Space Model (VSM) of Information Retrieval. + * A document whose vector is closer to the query vector in that model is scored higher. * - *

@@ -198,78 +255,10 @@ limitations under the License.

- Lucene's scoring formula computes the score of one document d for a given query q across each - term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more - relevant document d is to the query q. This is taken from - Similarity: - -

- - score(q,d) = - - sum t in q( - tf - (t in d) * - idf - (t)^2 * - - getBoost - - (t in q) * - getBoost - (t.field in d) * - - lengthNorm - - (t.field in d) ) * - - coord - - (q,d) * - - queryNorm - (sumOfSquaredWeights) -
-

-

- where - -

- sumOfSquaredWeights = - sumt in q( - - idf - - (t) * - - getBoost - - (t in q) )^2 -
-

-

- This scoring formula is mostly implemented in the - TermScorer class, where it makes calls to the - Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation: -

    - -
  1. tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
  2. - -
  3. idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.

  4. - -
  5. getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.

  6. - -
  7. lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.

  8. - -
  9. coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.

  10. - -
  11. queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable - GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) - that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem - to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

  12. -
- Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided - for context and are not authoratitive. + This scoring formula is described in the + Similarity class. Please take the time to study this formula, as it contains much of the information about how the + basics of Lucene scoring work, especially the + TermScorer.

+ * The score is computed as follows: + * + *

+ *

+ * + *
+ * * - * - * - * - * + * + * * - * - * + * + * + * + * * *
score(q,d) =
- * Σ - * ( {@link #tf(int) tf}(t in d) * - * {@link #idf(Term,Searcher) idf}(t)^2 * - * {@link Query#getBoost getBoost}(t in q) * - * {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) * - * {@link #lengthNorm(String,int) lengthNorm}(t.field in d) ) - *  * - * {@link #coord(int,int) coord}(q,d) * - * {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
- * t in q
+ * score(q,d)   =   coord(q,d)  ·  queryNorm(q)  ·  Σ_{t in q} ( tf(t in d)  ·  idf(t)^2  ·  t.getBoost()  ·  norm(t,d) )
- * + *
+ * *

where - * - * - * - * - * - * - * - * - * - * - *
sumOfSqaredWeights =
- * Σ - * ( {@link #idf(Term,Searcher) idf}(t) * - * {@link Query#getBoost getBoost}(t in q) )^2 - *
- * t in q - *
- * - *

Note that the above formula is motivated by the cosine-distance or dot-product - * between document and query vector, which is implemented by {@link DefaultSimilarity}. + *

    + *
  1. + * + * tf(t in d) + * correlates to the term's frequency, + * defined as the number of times term t appears in the currently scored document d. + * Documents that have more occurrences of a given term receive a higher score. + * The default computation for tf(t in d) in + * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is: + * + *
     
    + * + * + * + * + * + *
    + * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)}   =   frequency^(1/2)
    + *
     
    + *
  2. + * + *
  3. + * + * idf(t) stands for Inverse Document Frequency. This value + * correlates to the inverse of docFreq + * (the number of documents in which the term t appears). + * This means rarer terms give higher contribution to the total score. + * The default computation for idf(t) in + * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is: + * + *
     
    + * + * + * + * + * + * + * + *
    + * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)}   =   1 + log( numDocs / (docFreq + 1) )
     
    + *
  4. + * + *
  5. + * + * coord(q,d) + * is a score factor based on how many of the query terms are found in the specified document. + * Typically, a document that contains more of the query's terms will receive a higher score + * than another document with fewer query terms. + * This is a search time factor computed in + * {@link #coord(int, int) coord(q,d)} + * by the Similarity in effect at search time. + *
     
    + *
  6. + * + *
  7. + * + * queryNorm(q) + * + * is a normalizing factor used to make scores between queries comparable. + * This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), + * but rather just attempts to make scores from different queries (or even different indexes) comparable. + * This is a search time factor computed by the Similarity in effect at search time. + * + * The default computation in + * {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity} + * is: + *
     
    + * + * + * + * + * + *
    + * queryNorm(q)   =   {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}   =   1 / sumOfSquaredWeights^(1/2)
     
    + * + * The sum of squared weights (of the query terms) is + * computed by the query {@link org.apache.lucene.search.Weight} object. + * For example, a {@link org.apache.lucene.search.BooleanQuery boolean query} + * computes this value as: + * + *
     
    + * + * + * + * + * + * + * + * + * + * + * + *
    + * {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights}   =   {@link org.apache.lucene.search.Query#getBoost() q.getBoost()}^2  ·  Σ_{t in q} ( idf(t)  ·  t.getBoost() )^2
     
    + * + *
  8. + * + *
  9. + * + * t.getBoost() + * is a search time boost of term t in the query q as + * specified in the query text + * (see query syntax), + * or as set by application calls to + * {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}. + * Notice that there is really no direct API for accessing the boost of one term in a multi-term query; + * rather, multiple terms are represented in a query as multiple + * {@link org.apache.lucene.search.TermQuery TermQuery} objects, + * and so the boost of a term in the query is accessible by calling the sub-query + * {@link org.apache.lucene.search.Query#getBoost() getBoost()}. + *
     
    + *
  10. + * + *
  11. + * + * norm(t,d) encapsulates a few (indexing time) boost and length factors: + * + * + * + *

    + * When a document is added to the index, all the above factors are multiplied. + * If the document has multiple fields with the same name, all their boosts are multiplied together: + * + *
     
    + * + * + * + * + * + * + * + * + * + * + * + *
    + * norm(t,d)   =   {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}  ·  {@link #lengthNorm(String, int) lengthNorm(field)}  ·  Π_{field f in d named as t} {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
     
    + * However, the resulting norm value is {@link #encodeNorm(float) encoded} as a single byte + * before being stored. + * At search time, the norm byte value is read from the index + * {@link org.apache.lucene.store.Directory directory} and + * {@link #decodeNorm(byte) decoded} back to a float norm value. + * This encoding/decoding, while reducing index size, comes at the price of + * precision loss - it is not guaranteed that decode(encode(x)) = x. + * For instance, decode(encode(0.89)) = 0.75. + * Also notice that search time is too late to modify this norm part of scoring, e.g. by + * using a different {@link Similarity} for search. + *
     
    + *

  12. + *
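The factor definitions above can be combined in a short stand-alone sketch. It is only an illustration under stated assumptions: the class `ScoreFormulaSketch` and its method signatures are hypothetical and do not call Lucene, and the coord computation (overlap divided by maxOverlap) is an assumption about the default behaviour that the text above does not spell out.

```java
/** Hypothetical stand-alone rendering of the scoring factors; not Lucene code. */
final class ScoreFormulaSketch {

    /** tf(t in d) = frequency^(1/2). */
    static double tf(double frequency) {
        return Math.sqrt(frequency);
    }

    /** idf(t) = 1 + log(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Assumed coord: fraction of query terms found in the document. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** sumOfSquaredWeights = queryBoost^2 * sum over t of (idf(t) * boost(t))^2. */
    static double sumOfSquaredWeights(double queryBoost, double[] idfs, double[] boosts) {
        double sum = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double w = idfs[i] * boosts[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }

    /** queryNorm = 1 / sumOfSquaredWeights^(1/2). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /** score = coord * queryNorm * sum over t of tf * idf^2 * boost * norm. */
    static double score(double coord, double queryNorm, double[] freqs,
                        double[] idfs, double[] boosts, double[] norms) {
        double sum = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            sum += tf(freqs[i]) * idfs[i] * idfs[i] * boosts[i] * norms[i];
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // One query term, found 4 times; index of 1 document, docFreq 0 gives idf 1.0.
        double[] idfs = { idf(0, 1) };
        double[] boosts = { 1.0 };
        double qn = queryNorm(sumOfSquaredWeights(1.0, idfs, boosts));
        double s = score(coord(1, 1), qn, new double[] { 4.0 }, idfs, boosts, new double[] { 1.0 });
        System.out.println(s); // prints 2.0
    }
}
```

Note that the norm values passed in here are plain doubles; in an index they would first have gone through the single-byte encoding, so real scores carry that rounding.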
* * @see #setDefault(Similarity) * @see IndexWriter#setSimilarity(Similarity) diff --git a/xdocs/scoring.xml b/xdocs/scoring.xml index 6da61ac704c..44bd59f7bb8 100644 --- a/xdocs/scoring.xml +++ b/xdocs/scoring.xml @@ -61,80 +61,60 @@ on the Fields).

+ +

Lucene allows influencing search results by "boosting" at more than one level: +

+

+

Indexing time boosts are preprocessed for storage efficiency and written to + the directory (when writing the document) in a single byte (!) as follows: + For each field of a document, all boosts of that field + (i.e. all boosts under the same field name in that doc) are multiplied. + The result is multiplied by the boost of the document, + and also multiplied by a "field length norm" value + that represents the length of that field in that doc + (so shorter fields are automatically boosted up). + The result is encoded as a single byte + (with some precision loss of course) and stored in the directory. + The similarity object in effect at indexing computes the length-norm of the field. +

+

This composition of 1-byte representation of norms + (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) + is nicely described in + Fieldable.setBoost(). +

+

Encoding and decoding of the resulting float norm in a single byte are done by the + static methods of the class Similarity: + encodeNorm() and + decodeNorm(). + Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, + e.g. decode(encode(0.89)) = 0.75. + At scoring (search) time, this norm is brought into the score of the document + as norm(t,d), as shown by the formula in + Similarity. +

+
+

- Lucene's scoring formula computes the score of one document d for a given query q across each - term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more - relevant document d is to the query q. This is taken from - Similarity: - -

- - score(q,d) = - - sum t in q( - tf - (t in d) * - idf - (t)^2 * - - getBoost - - (t in q) * - getBoost - (t.field in d) * - - lengthNorm - - (t.field in d) ) * - - coord - - (q,d) * - - queryNorm - (sumOfSquaredWeights) -
-

-

- where - -

- sumOfSquaredWeights = - sumt in q( - - idf - - (t) * - - getBoost - - (t in q) )^2 -
-

-

- This scoring formula is mostly implemented in the - TermScorer class, where it makes calls to the - Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation: -

    - -
  1. tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
  2. - -
  3. idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.

  4. - -
  5. getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.

  6. - -
  7. lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.

  8. - -
  9. coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.

  10. - -
  11. queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable - GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) - that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem - to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

  12. -
- Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided - for context and are not authoratitive. + This scoring formula is described in the + Similarity class. Please take the time to study this formula, as it contains much of the information about how the + basics of Lucene scoring work, especially the + TermScorer.

@@ -150,7 +130,7 @@ These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query - section below + section below highlights some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring --