diff --git a/CHANGES.txt b/CHANGES.txt
index 0b96235e3ca..497e7e17573 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -76,9 +76,9 @@ API Changes
    SingleInstanceLockFactory (ie, in memory locking) locking
    with an FSDirectory.  Note that now you must call setDisableLocks
    before the instantiation of a FSDirectory if you wish to disable locking
-   for that Directory.
+   for that Directory. (Michael McCandless, Jeff Patterson via Yonik Seeley)
-
+
 Bug fixes

 1. Fixed the web application demo (built with "ant war-demo") which
@@ -127,7 +127,7 @@ Bug fixes
    has no value. (Oliver Hutchison via Chris Hostetter)
-
+
 Optimizations

 1. LUCENE-586: TermDocs.skipTo() is now more efficient for multi-segment
@@ -164,7 +164,8 @@ Documentation
 1. Added style sheet to xdocs named lucene.css and included in the
    Anakia VSL descriptor. (Grant Ingersoll)

- 2. Added scoring.xml document into xdocs. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)
+ 2. Added scoring.xml document into xdocs. Updated Similarity.java scoring formula. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting). Issue 664.
+
 Release 2.0.0 2006-05-26

diff --git a/docs/scoring.html b/docs/scoring.html
index 7ddb74b9151..76b3d7639f3 100644
--- a/docs/scoring.html
+++ b/docs/scoring.html
@@ -188,6 +188,63 @@ limitations under the License.
+        <h3>Score Boosting</h3>
@@ -198,78 +255,10 @@ limitations under the License.
diff --git a/src/java/org/apache/lucene/search/Similarity.java b/src/java/org/apache/lucene/search/Similarity.java
--- a/src/java/org/apache/lucene/search/Similarity.java
+++ b/src/java/org/apache/lucene/search/Similarity.java
- *       <small>where</small>
- *
- *   sumOfSqaredWeights =
- *       ∑ ( {@link #idf(Term,Searcher) idf}(t) *
- *           {@link Query#getBoost getBoost}(t in q) )^2
- *       (summed over terms t in q)
- *
- * Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
+ *
+ *   {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} =
+ *       frequency½
+ *
+ *   {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)} =
+ *       1 + log ( numDocs / (docFreq + 1) )
+ *
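The two defaults above are one-liners in practice. Below is a small sketch of the arithmetic, calling the real DefaultSimilarity tf(float) and idf(int, int) methods named in the formulas; the demo class and the example counts (a term frequency of 4, a term appearing in 9 of 1000 documents) are invented for illustration.

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch only: exercises the two default formulas shown above.
    public class TfIdfSketch {
        public static void main(String[] args) {
            DefaultSimilarity sim = new DefaultSimilarity();

            // tf(t in d) = frequency^0.5: a term occurring 4 times gives sqrt(4) = 2.0
            float tf = sim.tf(4.0f);

            // idf(t) = 1 + log(numDocs / (docFreq + 1)), natural log:
            // a term in 9 of 1000 docs gives 1 + ln(100) = 5.605...
            float idf = sim.idf(9, 1000);

            System.out.println("tf = " + tf + ", idf = " + idf);
        }
    }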
+ *   queryNorm(q) =
+ *   {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)} =
+ *       1 / sumOfSquaredWeights½
+ *
+ *   {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} =
+ *       {@link org.apache.lucene.search.Query#getBoost() q.getBoost()}²
+ *       · ∑ ( idf(t) · t.getBoost() )²
+ *       (summed over terms t in q)
+ *
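To see how queryNorm and sumOfSquaredWeights interact, here is a hand computation for a hypothetical two-term query; the idf values and boosts are made up, and the 1/√sumOfSquaredWeights step assumes DefaultSimilarity's queryNorm as given above.

    // Sketch only: queryNorm for a made-up two-term query, all boosts 1.0.
    public class QueryNormSketch {
        public static void main(String[] args) {
            float queryBoost = 1.0f;
            float[] idf       = { 5.6f, 2.3f };  // invented idf values
            float[] termBoost = { 1.0f, 1.0f };

            // sumOfSquaredWeights = q.getBoost()^2 * sum of (idf(t) * t.getBoost())^2
            float sum = 0f;
            for (int i = 0; i < idf.length; i++) {
                float w = idf[i] * termBoost[i];  // weight of term i
                sum += w * w;
            }
            float sumOfSquaredWeights = queryBoost * queryBoost * sum;

            // DefaultSimilarity.queryNorm: 1 / sqrt(sumOfSquaredWeights)
            float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
            System.out.println("queryNorm = " + queryNorm);  // ~0.165
        }
    }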
+ *
+ * When a document is added to the index, all the above factors are multiplied.
+ * If the document has multiple fields with the same name, all their boosts are multiplied together:
+ *
+ *   norm(t,d) =
+ *       {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
+ *       · {@link #lengthNorm(String, int) lengthNorm(field)}
+ *       · ∏ {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost()}
+ *       (product over fields f in d named as t)
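A worked instance of the norm(t,d) product above; the document boost, field boosts, and term count are invented, and lengthNorm is assumed to be DefaultSimilarity's 1/√numTerms.

    // Sketch only: norm(t,d) for a doc with boost 1.2 whose "title" field was
    // added twice (field boosts 2.0 and 1.5) and holds 16 terms in total.
    public class FieldNormSketch {
        public static void main(String[] args) {
            float docBoost = 1.2f;
            float[] fieldBoosts = { 2.0f, 1.5f };
            int numTerms = 16;

            // DefaultSimilarity.lengthNorm: 1 / sqrt(numTerms) = 0.25
            float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));

            float norm = docBoost * lengthNorm;
            for (int i = 0; i < fieldBoosts.length; i++) {
                norm *= fieldBoosts[i];  // product over fields named "title"
            }
            System.out.println("norm(t,d) before 1-byte encoding = " + norm);  // 0.9
        }
    }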
+<p>Lucene allows influencing search results by "boosting" at more than one level:</p>
+<ul>
+ <li>Document level boosting - while indexing - by calling doc.setBoost() before adding the document to the index.</li>
+ <li>Document's Field level boosting - while indexing - by calling field.setBoost() before adding the field to the document.</li>
+ <li>Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().</li>
+</ul>
+<p>Indexing time boosts are preprocessed for storage efficiency and written to
+ the directory (when writing the document) in a single byte (!) as follows:
+ for each field of a document, all boosts of that field
+ (i.e. all boosts under the same field name in that doc) are multiplied.
+ The result is multiplied by the boost of the document,
+ and also multiplied by a "field length norm" value
+ that represents the length of that field in that doc
+ (so shorter fields are automatically boosted up).
+ The result is then encoded as a single byte
+ (with some precision loss, of course) and stored in the directory.
+ The similarity object in effect at indexing time computes the length-norm of the field.</p>
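For reference, a sketch of where those boosts enter the Lucene 2.x indexing API; the field name, text, and boost values are invented.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch only: document- and field-level boosts set at indexing time.
    // Both are folded, together with the field length norm, into the
    // single norm byte described above.
    public class BoostAtIndexTime {
        public static Document buildDoc() {
            Document doc = new Document();
            doc.setBoost(1.2f);                       // document level boost

            Field title = new Field("title", "Lucene in Action",
                                    Field.Store.YES, Field.Index.TOKENIZED);
            title.setBoost(2.0f);                     // field level boost
            doc.add(title);
            return doc;
        }
    }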
+<p>This composition of the 1-byte representation of norms
+ (that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
+ is nicely described in
+ Fieldable.setBoost().</p>
+<p>Encoding and decoding of the resulting float norm in a single byte are done by the
+ static methods of the class Similarity:
+ encodeNorm() and
+ decodeNorm().
+ Due to loss of precision, it is not guaranteed that decode(encode(x)) = x;
+ for example, decode(encode(0.89)) = 0.75.
+ At search (scoring) time, this norm is brought into the score of the document
+ as indexBoost, as shown by the formula in
+ Similarity.</p>
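The precision loss is easy to observe directly with the two static methods named above; this sketch simply replays the document's own decode(encode(0.89)) = 0.75 example.

    import org.apache.lucene.search.Similarity;

    // Sketch only: the lossy one-byte norm round trip described above.
    public class NormPrecision {
        public static void main(String[] args) {
            byte b = Similarity.encodeNorm(0.89f);
            float back = Similarity.decodeNorm(b);
            System.out.println("decode(encode(0.89)) = " + back);  // prints 0.75
        }
    }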
-<p>Lucene's scoring formula computes the score of one document d for a given query q across each
- term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more
- relevant document d is to the query q. This is taken from
- Similarity:</p>
-<p>score(q,d) = ∑ tf(t in d) · idf(t)² · getBoost(t in q) ·
- lengthNorm(t.field in d) · coord(q,d) · queryNorm(sumOfSquaredWeights)
- (summed over terms t in q)</p>
-<p>where</p>
-<p>This scoring formula is mostly implemented in the
- TermScorer class, where it makes calls to the
- Similarity class to retrieve values for the following. Note that the descriptions apply to the DefaultSimilarity implementation:</p>
-<ul>
- <li>idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give a higher contribution to the total score.</li>
- <li>getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</li>
- <li>lengthNorm(t.field in d) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value, so matches against shorter fields receive a higher score than matches against longer fields.</li>
- <li>coord(q,d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</li>
- <li>queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable. GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but am not 100% sure)
- that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
- to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</li>
-</ul>