Updated scoring information to only have one copy of the Scoring Formula. Implemented Doron Cohen's new scoring formula description in the javadoc.

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@454767 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Grant Ingersoll 2006-10-10 14:57:25 +00:00
parent 0d4e1b171d
commit 75f561901e
4 changed files with 370 additions and 196 deletions

View File

@ -164,7 +164,8 @@ Documentation
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll) 1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless) 2. Added scoring.xml document into xdocs. Updated Similarity.java scoring formula.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting). Issue 664.
Release 2.0.0 2006-05-26 Release 2.0.0 2006-05-26

View File

@ -188,6 +188,63 @@ limitations under the License.
</blockquote> </blockquote>
</td></tr> </td></tr>
<tr><td><br/></td></tr> <tr><td><br/></td></tr>
</table>
<table border="0" cellspacing="0" cellpadding="2" width="100%">
<tr><td bgcolor="#828DA6">
<font color="#ffffff" face="arial,helvetica,sanserif">
<a name="Score Boosting"><strong>Score Boosting</strong></a>
</font>
</td></tr>
<tr><td>
<blockquote>
<p>Lucene allows influencing search results by "boosting" in more than one level:
<ul>
<li><b>Document level boosting</b>
- while indexing - by calling
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
before a document is added to the index.
</li>
<li><b>Document's Field level boosting</b>
- while indexing - by calling
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
before adding a field to the document (and before adding the document to the index).
</li>
<li><b>Query level boosting</b>
- during search, by setting a boost on a query clause, calling
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
</li>
</ul>
</p>
<p>Indexing time boosts are preprocessed for storage efficiency and written to
the directory (when writing the document) in a single byte (!) as follows:
For each field of a document, all boosts of that field
(i.e. all boosts under the same field name in that doc) are multiplied.
The result is multiplied by the boost of the document,
and also multiplied by a "field length norm" value
that represents the length of that field in that doc
(so shorter fields are automatically boosted up).
The result is decoded as a single byte
(with some precision loss of course) and stored in the directory.
The similarity object in effect at indexing computes the length-norm of the field.
</p>
<p>This composition of 1-byte representation of norms
(that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
is nicely described in
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
</p>
<p>Encoding and decoding of the resulted float norm in a single byte are done by the
static methods of the class Similarity:
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
e.g. decode(encode(0.89)) = 0.75.
At scoring (search) time, this norm is brought into the score of document
as <b>indexBoost</b>, as shown by the formula in
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
</p>
</blockquote>
</td></tr>
<tr><td><br/></td></tr>
</table> </table>
<table border="0" cellspacing="0" cellpadding="2" width="100%"> <table border="0" cellspacing="0" cellpadding="2" width="100%">
<tr><td bgcolor="#828DA6"> <tr><td bgcolor="#828DA6">
@ -198,78 +255,10 @@ limitations under the License.
<tr><td> <tr><td>
<blockquote> <blockquote>
<p> <p>
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each This scoring formula is described in the
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
relevant document <i>d</i> is to the query <i>q</i>. This is taken from basics of Lucene scoring work, especially the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>: <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
<div class="formula">
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
score(q,d) =
<span class="big" id="summation">
sum </span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
(t in d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
(t)^2 *
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
getBoost
</A>
(t in q) *
getBoost
(t.field in d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
lengthNorm
</A>
(t.field in d) )</span> <span> *
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
coord
</A>
(q,d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
queryNorm
</A>(sumOfSquaredWeights)</span>
</div>
</p>
<p>
where
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
<div id="#sumOfSquares">
sumOfSquaredWeights =
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
idf
</A>
(t) *
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
getBoost
</A>
(t in q) )^2</span>
</div>
</p>
<p>
This scoring formula is mostly implemented in the
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<ol>
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
</ol>
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
for context and are not authoratitive.
</p> </p>
</blockquote> </blockquote>
</td></tr> </td></tr>

View File

@ -16,67 +16,271 @@ package org.apache.lucene.search;
* limitations under the License. * limitations under the License.
*/ */
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.SmallFloat;
import java.io.IOException; import java.io.IOException;
import java.io.Serializable; import java.io.Serializable;
import java.util.Collection; import java.util.Collection;
import java.util.Iterator; import java.util.Iterator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.SmallFloat;
/** Expert: Scoring API. /** Expert: Scoring API.
* <p>Subclasses implement search scoring. * <p>Subclasses implement search scoring.
* *
* <p>The score of query <code>q</code> for document <code>d</code> is defined * <p>The score of query <code>q</code> for document <code>d</code> correlates to the
* in terms of these methods as follows: * cosine-distance or dot-product between document and query vectors in a
* <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
* Vector Space Model (VSM) of Information Retrieval</a>.
* A document whose vector is closer to the query vector in that model is scored higher.
* *
* <table cellpadding="0" cellspacing="0" border="0"> * The score is computed as follows:
*
* <P>
* <table cellpadding="1" cellspacing="0" border="1" align="center">
* <tr><td>
* <table cellpadding="1" cellspacing="0" border="0" align="center">
* <tr> * <tr>
* <td valign="middle" align="right" rowspan="2">score(q,d) =<br></td> * <td valign="middle" align="right" rowspan="1">
* <td valign="middle" align="center"> * score(q,d) &nbsp; = &nbsp;
* <big><big><big><big><big>&Sigma;</big></big></big></big></big></td> * <A HREF="#formula_coord">coord(q,d)</A> &nbsp;&middot;&nbsp;
* <td valign="middle"><small> * <A HREF="#formula_queryNorm">queryNorm(q)</A> &nbsp;&middot;&nbsp;
* ( {@link #tf(int) tf}(t in d) * * </td>
* {@link #idf(Term,Searcher) idf}(t)^2 * * <td valign="bottom" align="center" rowspan="1">
* {@link Query#getBoost getBoost}(t in q) * * <big><big><big>&sum;</big></big></big>
* {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) * * </td>
* {@link #lengthNorm(String,int) lengthNorm}(t.field in d) ) * <td valign="middle" align="right" rowspan="1">
* </small></td> * <big><big>(</big></big>
* <td valign="middle" rowspan="2">&nbsp;* * <A HREF="#formula_tf">tf(t in d)</A> &nbsp;&middot;&nbsp;
* {@link #coord(int,int) coord}(q,d) * * <A HREF="#formula_idf">idf(t)</A><sup>2</sup> &nbsp;&middot;&nbsp;
* {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights) * <A HREF="#formula_termBoost">t.getBoost()</A>&nbsp;&middot;&nbsp;
* <A HREF="#formula_norm">norm(t,d)</A>
* <big><big>)</big></big>
* </td> * </td>
* </tr> * </tr>
* <tr> * <tr valigh="top">
* <td valign="top" align="right"> * <td></td>
* <small>t in q</small> * <td align="center"><small>t in q</small></td>
* </td> * <td></td>
* </tr> * </tr>
* </table> * </table>
* </td></tr>
* </table>
* *
* <p> where * <p> where
* <ol>
* <li>
* <A NAME="formula_tf"></A>
* <b>tf(t in d)</b>
* correlates to the term's <i>frequency</i>,
* defined as the number of times term <i>t</i> appears in the currently scored document <i>d</i>.
* Documents that have more occurrences of a given term receive a higher score.
* The default computation for <i>tf(t in d)</i> in
* {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
* *
* <table cellpadding="0" cellspacing="0" border="0"> * <br>&nbsp;<br>
* <tr> * <table cellpadding="2" cellspacing="2" border="0" align="center">
* <td valign="middle" align="right" rowspan="2">sumOfSqaredWeights =<br></td> * <tr>
* <td valign="middle" align="center"> * <td valign="middle" align="right" rowspan="1">
* <big><big><big><big><big>&Sigma;</big></big></big></big></big></td> * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} &nbsp; = &nbsp;
* <td valign="middle"><small> * </td>
* ( {@link #idf(Term,Searcher) idf}(t) * * <td valign="top" align="center" rowspan="1">
* {@link Query#getBoost getBoost}(t in q) )^2 * frequency<sup><big>&frac12;</big></sup>
* </small></td> * </td>
* </tr> * </tr>
* <tr> * </table>
* <td valign="top" align="right"> * <br>&nbsp;<br>
* <small>t in q</small> * </li>
* </td>
* </tr>
* </table>
* *
* <p> Note that the above formula is motivated by the cosine-distance or dot-product * <li>
* between document and query vector, which is implemented by {@link DefaultSimilarity}. * <A NAME="formula_idf"></A>
* <b>idf(t)</b> stands for Inverse Document Frequency. This value
* correlates to the inverse of <i>docFreq</i>
* (the number of documents in which the term <i>t</i> appears).
* This means rarer terms give higher contribution to the total score.
* The default computation for <i>idf(t)</i> in
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
*
* <br>&nbsp;<br>
* <table cellpadding="2" cellspacing="2" border="0" align="center">
* <tr>
* <td valign="middle" align="right">
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)}&nbsp; = &nbsp;
* </td>
* <td valign="middle" align="center">
* 1 + log <big>(</big>
* </td>
* <td valign="middle" align="center">
* <table>
* <tr><td align="center"><small>numDocs</small></td></tr>
* <tr><td align="center">&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;</td></tr>
* <tr><td align="center"><small>docFreq+1</small></td></tr>
* </table>
* </td>
* <td valign="middle" align="center">
* <big>)</big>
* </td>
* </tr>
* </table>
* <br>&nbsp;<br>
* </li>
*
* <li>
* <A NAME="formula_coord"></A>
* <b>coord(q,d)</b>
* is a score factor based on how many of the query terms are found in the specified document.
* Typically, a document that contains more of the query's terms will receive a higher score
* than another document with fewer query terms.
* This is a search time factor computed in
* {@link #coord(int, int) coord(q,d)}
* by the Similarity in effect at search time.
* <br>&nbsp;<br>
* </li>
*
* <li><b>
* <A NAME="formula_queryNorm"></A>
* queryNorm(q)
* </b>
* is a normalizing factor used to make scores between queries comparable.
* This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
* but rather just attempts to make scores from different queries (or even different indexes) comparable.
* This is a search time factor computed by the Similarity in effect at search time.
*
* The default computation in
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
* is:
* <br>&nbsp;<br>
* <table cellpadding="1" cellspacing="0" border="0" align="center">
* <tr>
* <td valign="middle" align="right" rowspan="1">
* queryNorm(q) &nbsp; = &nbsp;
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
* &nbsp; = &nbsp;
* </td>
* <td valign="middle" align="center" rowspan="1">
* <table>
* <tr><td align="center"><big>1</big></td></tr>
* <tr><td align="center"><big>
* &ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;
* </big></td></tr>
* <tr><td align="center">sumOfSquaredWeights<sup><big>&frac12;</big></sup></td></tr>
* </table>
* </td>
* </tr>
* </table>
* <br>&nbsp;<br>
*
* The sum of squared weights (of the query terms) is
* computed by the query {@link org.apache.lucene.search.Weight} object.
* For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
* computes this value as:
*
* <br>&nbsp;<br>
* <table cellpadding="1" cellspacing="0" border="0"n align="center">
* <tr>
* <td valign="middle" align="right" rowspan="1">
* {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} &nbsp; = &nbsp;
* {@link org.apache.lucene.search.Query#getBoost() q.getBoost()} <sup><big>2</big></sup>
* &nbsp;&middot;&nbsp;
* </td>
* <td valign="bottom" align="center" rowspan="1">
* <big><big><big>&sum;</big></big></big>
* </td>
* <td valign="middle" align="right" rowspan="1">
* <big><big>(</big></big>
* <A HREF="#formula_idf">idf(t)</A> &nbsp;&middot;&nbsp;
* <A HREF="#formula_termBoost">t.getBoost()</A>
* <big><big>) <sup>2</sup> </big></big>
* </td>
* </tr>
* <tr valigh="top">
* <td></td>
* <td align="center"><small>t in q</small></td>
* <td></td>
* </tr>
* </table>
* <br>&nbsp;<br>
*
* </li>
*
* <li>
* <A NAME="formula_termBoost"></A>
* <b>t.getBoost()</b>
* is a search time boost of term <i>t</i> in the query <i>q</i> as
* specified in the query text
* (see <A HREF="../../../../../queryparsersyntax.html#Boosting a Term">query syntax</A>),
* or as set by application calls to
* {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
* Notice that there is really no direct API for accessing a boost of one term in a multi term query,
* but rather multi terms are represented in a query as multi
* {@link org.apache.lucene.search.TermQuery TermQuery} objects,
* and so the boost of a term in the query is accessible by calling the sub-query
* {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
* <br>&nbsp;<br>
* </li>
*
* <li>
* <A NAME="formula_norm"></A>
* <b>norm(t,d)</b> encapsulates a few (indexing time) boost and length factors:
*
* <ul>
* <li><b>Document boost</b> - set by calling
* {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
* before adding the document to the index.
* </li>
* <li><b>Field boost</b> - set by calling
* {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
* before adding the field to a document.
* </li>
* <li>{@link #lengthNorm(String, int) <b>lengthNorm</b>(field)} - computed
* when the document is added to the index in accordance with the number of tokens
* of this field in the document, so that shorter fields contribute more to the score.
* LengthNorm is computed by the Similarity class in effect at indexing.
* </li>
* </ul>
*
* <p>
* When a document is added to the index, all the above factors are multiplied.
* If the document has multiple fields with the same name, all their boosts are multiplied together:
*
* <br>&nbsp;<br>
* <table cellpadding="1" cellspacing="0" border="0"n align="center">
* <tr>
* <td valign="middle" align="right" rowspan="1">
* norm(t,d) &nbsp; = &nbsp;
* {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
* &nbsp;&middot;&nbsp;
* {@link #lengthNorm(String, int) lengthNorm(field)}
* &nbsp;&middot;&nbsp;
* </td>
* <td valign="bottom" align="center" rowspan="1">
* <big><big><big>&prod;</big></big></big>
* </td>
* <td valign="middle" align="right" rowspan="1">
* {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
* </td>
* </tr>
* <tr valigh="top">
* <td></td>
* <td align="center"><small>field <i><b>f</b></i> in <i>d</i> named as <i><b>t</b></i></small></td>
* <td></td>
* </tr>
* </table>
* <br>&nbsp;<br>
* However the resulted <i>norm</i> value is {@link #encodeNorm(float) encoded} as a single byte
* before being stored.
* At search time, the norm byte value is read from the index
* {@link org.apache.lucene.store.Directory directory} and
* {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
* This encoding/decoding, while reducing index size, comes with the price of
* precision loss - it is not guaranteed that decode(encode(x)) = x.
* For instance, decode(encode(0.89)) = 0.75.
* Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by
* using a different {@link Similarity} for search.
* <br>&nbsp;<br>
* </li>
* </ol>
* *
* @see #setDefault(Similarity) * @see #setDefault(Similarity)
* @see IndexWriter#setSimilarity(Similarity) * @see IndexWriter#setSimilarity(Similarity)

View File

@ -61,80 +61,60 @@
on the Fields). on the Fields).
</p> </p>
</subsection> </subsection>
<subsection name="Score Boosting">
<p>Lucene allows influencing search results by "boosting" in more than one level:
<ul>
<li><b>Document level boosting</b>
- while indexing - by calling
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
before a document is added to the index.
</li>
<li><b>Document's Field level boosting</b>
- while indexing - by calling
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
before adding a field to the document (and before adding the document to the index).
</li>
<li><b>Query level boosting</b>
- during search, by setting a boost on a query clause, calling
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
</li>
</ul>
</p>
<p>Indexing time boosts are preprocessed for storage efficiency and written to
the directory (when writing the document) in a single byte (!) as follows:
For each field of a document, all boosts of that field
(i.e. all boosts under the same field name in that doc) are multiplied.
The result is multiplied by the boost of the document,
and also multiplied by a "field length norm" value
that represents the length of that field in that doc
(so shorter fields are automatically boosted up).
The result is decoded as a single byte
(with some precision loss of course) and stored in the directory.
The similarity object in effect at indexing computes the length-norm of the field.
</p>
<p>This composition of 1-byte representation of norms
(that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
is nicely described in
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
</p>
<p>Encoding and decoding of the resulted float norm in a single byte are done by the
static methods of the class Similarity:
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
e.g. decode(encode(0.89)) = 0.75.
At scoring (search) time, this norm is brought into the score of document
as <b>indexBoost</b>, as shown by the formula in
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
</p>
</subsection>
<subsection name="Understanding the Scoring Formula"> <subsection name="Understanding the Scoring Formula">
<p> <p>
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each This scoring formula is described in the
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
relevant document <i>d</i> is to the query <i>q</i>. This is taken from basics of Lucene scoring work, especially the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>: <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
<div class="formula">
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
score(q,d) =
<span class="big" id="summation">
sum </span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
(t in d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
(t)^2 *
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
getBoost
</A>
(t in q) *
getBoost
(t.field in d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
lengthNorm
</A>
(t.field in d) )</span> <span> *
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
coord
</A>
(q,d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
queryNorm
</A>(sumOfSquaredWeights)</span>
</div>
</p>
<p>
where
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
<div id="#sumOfSquares">
sumOfSquaredWeights =
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
idf
</A>
(t) *
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
getBoost
</A>
(t in q) )^2</span>
</div>
</p>
<p>
This scoring formula is mostly implemented in the
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<ol>
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
</ol>
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
for context and are not authoratitive.
</p> </p>
</subsection> </subsection>
<subsection name="The Big Picture"> <subsection name="The Big Picture">