mirror of https://github.com/apache/lucene.git
Updated scoring information to only have one copy of the Scoring Formula. Implemented Doron Cohen's new scoring formula description in the javadoc.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@454767 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
0d4e1b171d
commit
75f561901e
|
@ -164,7 +164,8 @@ Documentation
|
||||||
|
|
||||||
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
|
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
|
||||||
|
|
||||||
2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)
|
 2. Added scoring.xml document into xdocs. Updated Similarity.java scoring formula. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting). Issue 664.
|
||||||
|
|
||||||
|
|
||||||
Release 2.0.0 2006-05-26
|
Release 2.0.0 2006-05-26
|
||||||
|
|
||||||
|
|
|
@ -188,6 +188,63 @@ limitations under the License.
|
||||||
</blockquote>
|
</blockquote>
|
||||||
</td></tr>
|
</td></tr>
|
||||||
<tr><td><br/></td></tr>
|
<tr><td><br/></td></tr>
|
||||||
|
</table>
|
||||||
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||||
|
<tr><td bgcolor="#828DA6">
|
||||||
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||||
|
<a name="Score Boosting"><strong>Score Boosting</strong></a>
|
||||||
|
</font>
|
||||||
|
</td></tr>
|
||||||
|
<tr><td>
|
||||||
|
<blockquote>
|
||||||
|
<p>Lucene allows influencing search results by "boosting" at more than one level:
|
||||||
|
<ul>
|
||||||
|
<li><b>Document level boosting</b>
|
||||||
|
- while indexing - by calling
|
||||||
|
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
|
||||||
|
before a document is added to the index.
|
||||||
|
</li>
|
||||||
|
<li><b>Document's Field level boosting</b>
|
||||||
|
- while indexing - by calling
|
||||||
|
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
|
||||||
|
before adding a field to the document (and before adding the document to the index).
|
||||||
|
</li>
|
||||||
|
<li><b>Query level boosting</b>
|
||||||
|
- during search, by setting a boost on a query clause, calling
|
||||||
|
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</p>
|
||||||
|
<p>Indexing time boosts are preprocessed for storage efficiency and written to
|
||||||
|
the directory (when writing the document) in a single byte (!) as follows:
|
||||||
|
For each field of a document, all boosts of that field
|
||||||
|
(i.e. all boosts under the same field name in that doc) are multiplied.
|
||||||
|
The result is multiplied by the boost of the document,
|
||||||
|
and also multiplied by a "field length norm" value
|
||||||
|
that represents the length of that field in that doc
|
||||||
|
(so shorter fields are automatically boosted up).
|
||||||
|
The result is encoded as a single byte
|
||||||
|
(with some precision loss of course) and stored in the directory.
|
||||||
|
The similarity object in effect at indexing computes the length-norm of the field.
|
||||||
|
</p>
|
||||||
|
<p>This composition of 1-byte representation of norms
|
||||||
|
(that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
|
||||||
|
is nicely described in
|
||||||
|
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
|
||||||
|
</p>
|
||||||
|
<p>Encoding and decoding of the resulting float norm in a single byte are done by the
|
||||||
|
static methods of the class Similarity:
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
|
||||||
|
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
|
||||||
|
e.g. decode(encode(0.89)) = 0.75.
|
||||||
|
At scoring (search) time, this norm is brought into the score of the document
|
||||||
|
as <b>norm(t,d)</b>, as shown by the formula in
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
|
||||||
|
</p>
|
||||||
|
</blockquote>
|
||||||
|
</td></tr>
|
||||||
|
<tr><td><br/></td></tr>
|
||||||
</table>
|
</table>
|
||||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||||
<tr><td bgcolor="#828DA6">
|
<tr><td bgcolor="#828DA6">
|
||||||
|
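The index-time composition described in the Score Boosting section above (field boosts multiplied together, then by the document boost and a field length norm) can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical, no Lucene classes are used, and the `1/sqrt(numTokens)` length norm is an assumption about the default behaviour rather than something stated in this diff.

```java
// Illustrative sketch only: plain Java, not Lucene code.
final class NormCompositionSketch {

    // Assumption: the length norm shrinks as the field gets longer,
    // modeled here as 1 / sqrt(number of tokens in the field).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // norm = docBoost * lengthNorm * product of the boosts of all
    // field instances that share the same field name in the document.
    static float norm(float docBoost, int numTokens, float... fieldBoosts) {
        float product = 1.0f;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTokens) * product;
    }

    public static void main(String[] args) {
        // Two same-named field instances boosted 2.0 and 1.5,
        // document boost 1.0, 4 tokens in the field overall.
        System.out.println(norm(1.0f, 4, 2.0f, 1.5f));
        // A shorter field yields a larger norm than a longer one.
        System.out.println(norm(1.0f, 4) > norm(1.0f, 16));
    }
}
```

The resulting float is what would then be squeezed into a single byte before being stored, which is where the precision loss mentioned above comes from.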
@ -198,78 +255,10 @@ limitations under the License.
|
||||||
<tr><td>
|
<tr><td>
|
||||||
<blockquote>
|
<blockquote>
|
||||||
<p>
|
<p>
|
||||||
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
|
This scoring formula is described in the
|
||||||
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
|
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
|
||||||
relevant document <i>d</i> is to the query <i>q</i>. This is taken from
|
basics of Lucene scoring work, especially the
|
||||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
|
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
|
||||||
|
|
||||||
<div class="formula">
|
|
||||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
|
||||||
score(q,d) =
|
|
||||||
<span class="big" id="summation">
|
|
||||||
sum </span><span class="summation-range">t in q</span><span>(
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
|
|
||||||
(t in d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
|
|
||||||
(t)^2 *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
|
||||||
getBoost
|
|
||||||
</A>
|
|
||||||
(t in q) *
|
|
||||||
getBoost
|
|
||||||
(t.field in d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
|
|
||||||
lengthNorm
|
|
||||||
</A>
|
|
||||||
(t.field in d) )</span> <span> *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
|
|
||||||
coord
|
|
||||||
</A>
|
|
||||||
(q,d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
|
|
||||||
queryNorm
|
|
||||||
</A>(sumOfSquaredWeights)</span>
|
|
||||||
</div>
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
where
|
|
||||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
|
||||||
<div id="#sumOfSquares">
|
|
||||||
sumOfSquaredWeights =
|
|
||||||
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
|
|
||||||
idf
|
|
||||||
</A>
|
|
||||||
(t) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
|
||||||
getBoost
|
|
||||||
</A>
|
|
||||||
(t in q) )^2</span>
|
|
||||||
</div>
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
This scoring formula is mostly implemented in the
|
|
||||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
|
|
||||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
|
|
||||||
<ol>
|
|
||||||
|
|
||||||
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
|
|
||||||
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
|
|
||||||
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
|
|
||||||
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
|
|
||||||
</ol>
|
|
||||||
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
|
|
||||||
for context and are not authoratitive.
|
|
||||||
</p>
|
</p>
|
||||||
</blockquote>
|
</blockquote>
|
||||||
</td></tr>
|
</td></tr>
|
||||||
|
|
|
@ -16,67 +16,271 @@ package org.apache.lucene.search;
|
||||||
* limitations under the License.
|
* limitations under the License.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
import org.apache.lucene.index.IndexReader;
|
|
||||||
import org.apache.lucene.index.IndexWriter;
|
|
||||||
import org.apache.lucene.index.Term;
|
|
||||||
import org.apache.lucene.util.SmallFloat;
|
|
||||||
|
|
||||||
import java.io.IOException;
|
import java.io.IOException;
|
||||||
import java.io.Serializable;
|
import java.io.Serializable;
|
||||||
import java.util.Collection;
|
import java.util.Collection;
|
||||||
import java.util.Iterator;
|
import java.util.Iterator;
|
||||||
|
|
||||||
|
import org.apache.lucene.index.IndexReader;
|
||||||
|
import org.apache.lucene.index.IndexWriter;
|
||||||
|
import org.apache.lucene.index.Term;
|
||||||
|
import org.apache.lucene.util.SmallFloat;
|
||||||
|
|
||||||
/** Expert: Scoring API.
|
/** Expert: Scoring API.
|
||||||
* <p>Subclasses implement search scoring.
|
* <p>Subclasses implement search scoring.
|
||||||
*
|
*
|
||||||
* <p>The score of query <code>q</code> for document <code>d</code> is defined
|
* <p>The score of query <code>q</code> for document <code>d</code> correlates to the
|
||||||
* in terms of these methods as follows:
|
* cosine-distance or dot-product between document and query vectors in a
|
||||||
|
* <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
|
||||||
|
* Vector Space Model (VSM) of Information Retrieval</a>.
|
||||||
|
* A document whose vector is closer to the query vector in that model is scored higher.
|
||||||
*
|
*
|
||||||
* <table cellpadding="0" cellspacing="0" border="0">
|
* The score is computed as follows:
|
||||||
|
*
|
||||||
|
* <P>
|
||||||
|
* <table cellpadding="1" cellspacing="0" border="1" align="center">
|
||||||
|
* <tr><td>
|
||||||
|
* <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||||
* <tr>
|
* <tr>
|
||||||
* <td valign="middle" align="right" rowspan="2">score(q,d) =<br></td>
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
* <td valign="middle" align="center">
|
* score(q,d) =
|
||||||
* <big><big><big><big><big>Σ</big></big></big></big></big></td>
|
* <A HREF="#formula_coord">coord(q,d)</A> ·
|
||||||
* <td valign="middle"><small>
|
* <A HREF="#formula_queryNorm">queryNorm(q)</A> ·
|
||||||
* ( {@link #tf(int) tf}(t in d) *
|
* </td>
|
||||||
* {@link #idf(Term,Searcher) idf}(t)^2 *
|
* <td valign="bottom" align="center" rowspan="1">
|
||||||
* {@link Query#getBoost getBoost}(t in q) *
|
* <big><big><big>∑</big></big></big>
|
||||||
* {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
|
* </td>
|
||||||
* {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
* </small></td>
|
* <big><big>(</big></big>
|
||||||
* <td valign="middle" rowspan="2"> *
|
* <A HREF="#formula_tf">tf(t in d)</A> ·
|
||||||
* {@link #coord(int,int) coord}(q,d) *
|
* <A HREF="#formula_idf">idf(t)</A><sup>2</sup> ·
|
||||||
* {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
|
* <A HREF="#formula_termBoost">t.getBoost()</A> ·
|
||||||
|
* <A HREF="#formula_norm">norm(t,d)</A>
|
||||||
|
* <big><big>)</big></big>
|
||||||
* </td>
|
* </td>
|
||||||
* </tr>
|
* </tr>
|
||||||
* <tr>
|
 * <tr valign="top">
|
||||||
* <td valign="top" align="right">
|
* <td></td>
|
||||||
* <small>t in q</small>
|
* <td align="center"><small>t in q</small></td>
|
||||||
* </td>
|
* <td></td>
|
||||||
* </tr>
|
* </tr>
|
||||||
* </table>
|
* </table>
|
||||||
|
* </td></tr>
|
||||||
|
* </table>
|
||||||
*
|
*
|
||||||
* <p> where
|
* <p> where
|
||||||
|
* <ol>
|
||||||
|
* <li>
|
||||||
|
* <A NAME="formula_tf"></A>
|
||||||
|
* <b>tf(t in d)</b>
|
||||||
|
* correlates to the term's <i>frequency</i>,
|
||||||
|
* defined as the number of times term <i>t</i> appears in the currently scored document <i>d</i>.
|
||||||
|
* Documents that have more occurrences of a given term receive a higher score.
|
||||||
|
* The default computation for <i>tf(t in d)</i> in
|
||||||
|
* {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
|
||||||
*
|
*
|
||||||
* <table cellpadding="0" cellspacing="0" border="0">
|
* <br> <br>
|
||||||
* <tr>
|
* <table cellpadding="2" cellspacing="2" border="0" align="center">
|
||||||
* <td valign="middle" align="right" rowspan="2">sumOfSqaredWeights =<br></td>
|
* <tr>
|
||||||
* <td valign="middle" align="center">
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
* <big><big><big><big><big>Σ</big></big></big></big></big></td>
|
* {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} =
|
||||||
* <td valign="middle"><small>
|
* </td>
|
||||||
* ( {@link #idf(Term,Searcher) idf}(t) *
|
* <td valign="top" align="center" rowspan="1">
|
||||||
* {@link Query#getBoost getBoost}(t in q) )^2
|
* frequency<sup><big>½</big></sup>
|
||||||
* </small></td>
|
* </td>
|
||||||
* </tr>
|
* </tr>
|
||||||
* <tr>
|
* </table>
|
||||||
* <td valign="top" align="right">
|
* <br> <br>
|
||||||
* <small>t in q</small>
|
* </li>
|
||||||
* </td>
|
|
||||||
* </tr>
|
|
||||||
* </table>
|
|
||||||
*
|
*
|
||||||
* <p> Note that the above formula is motivated by the cosine-distance or dot-product
|
* <li>
|
||||||
* between document and query vector, which is implemented by {@link DefaultSimilarity}.
|
* <A NAME="formula_idf"></A>
|
||||||
|
* <b>idf(t)</b> stands for Inverse Document Frequency. This value
|
||||||
|
* correlates to the inverse of <i>docFreq</i>
|
||||||
|
* (the number of documents in which the term <i>t</i> appears).
|
||||||
|
* This means rarer terms give higher contribution to the total score.
|
||||||
|
* The default computation for <i>idf(t)</i> in
|
||||||
|
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
|
||||||
|
*
|
||||||
|
* <br> <br>
|
||||||
|
* <table cellpadding="2" cellspacing="2" border="0" align="center">
|
||||||
|
* <tr>
|
||||||
|
* <td valign="middle" align="right">
|
||||||
|
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)} =
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="center">
|
||||||
|
* 1 + log <big>(</big>
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="center">
|
||||||
|
* <table>
|
||||||
|
* <tr><td align="center"><small>numDocs</small></td></tr>
|
||||||
|
* <tr><td align="center">–––––––––</td></tr>
|
||||||
|
* <tr><td align="center"><small>docFreq+1</small></td></tr>
|
||||||
|
* </table>
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="center">
|
||||||
|
* <big>)</big>
|
||||||
|
* </td>
|
||||||
|
* </tr>
|
||||||
|
* </table>
|
||||||
|
* <br> <br>
|
||||||
|
* </li>
|
||||||
|
*
|
||||||
|
* <li>
|
||||||
|
* <A NAME="formula_coord"></A>
|
||||||
|
* <b>coord(q,d)</b>
|
||||||
|
* is a score factor based on how many of the query terms are found in the specified document.
|
||||||
|
* Typically, a document that contains more of the query's terms will receive a higher score
|
||||||
|
* than another document with fewer query terms.
|
||||||
|
* This is a search time factor computed in
|
||||||
|
* {@link #coord(int, int) coord(q,d)}
|
||||||
|
* by the Similarity in effect at search time.
|
||||||
|
* <br> <br>
|
||||||
|
* </li>
|
||||||
|
*
|
||||||
|
* <li><b>
|
||||||
|
* <A NAME="formula_queryNorm"></A>
|
||||||
|
* queryNorm(q)
|
||||||
|
* </b>
|
||||||
|
* is a normalizing factor used to make scores between queries comparable.
|
||||||
|
* This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
|
||||||
|
* but rather just attempts to make scores from different queries (or even different indexes) comparable.
|
||||||
|
* This is a search time factor computed by the Similarity in effect at search time.
|
||||||
|
*
|
||||||
|
* The default computation in
|
||||||
|
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
|
||||||
|
* is:
|
||||||
|
* <br> <br>
|
||||||
|
* <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||||
|
* <tr>
|
||||||
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
|
* queryNorm(q) =
|
||||||
|
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
|
||||||
|
* =
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="center" rowspan="1">
|
||||||
|
* <table>
|
||||||
|
* <tr><td align="center"><big>1</big></td></tr>
|
||||||
|
* <tr><td align="center"><big>
|
||||||
|
* ––––––––––––––
|
||||||
|
* </big></td></tr>
|
||||||
|
* <tr><td align="center">sumOfSquaredWeights<sup><big>½</big></sup></td></tr>
|
||||||
|
* </table>
|
||||||
|
* </td>
|
||||||
|
* </tr>
|
||||||
|
* </table>
|
||||||
|
* <br> <br>
|
||||||
|
*
|
||||||
|
* The sum of squared weights (of the query terms) is
|
||||||
|
* computed by the query {@link org.apache.lucene.search.Weight} object.
|
||||||
|
* For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
|
||||||
|
* computes this value as:
|
||||||
|
*
|
||||||
|
* <br> <br>
|
||||||
|
 * <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||||
|
* <tr>
|
||||||
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
|
* {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} =
|
||||||
|
* {@link org.apache.lucene.search.Query#getBoost() q.getBoost()} <sup><big>2</big></sup>
|
||||||
|
* ·
|
||||||
|
* </td>
|
||||||
|
* <td valign="bottom" align="center" rowspan="1">
|
||||||
|
* <big><big><big>∑</big></big></big>
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
|
* <big><big>(</big></big>
|
||||||
|
* <A HREF="#formula_idf">idf(t)</A> ·
|
||||||
|
* <A HREF="#formula_termBoost">t.getBoost()</A>
|
||||||
|
* <big><big>) <sup>2</sup> </big></big>
|
||||||
|
* </td>
|
||||||
|
* </tr>
|
||||||
|
 * <tr valign="top">
|
||||||
|
* <td></td>
|
||||||
|
* <td align="center"><small>t in q</small></td>
|
||||||
|
* <td></td>
|
||||||
|
* </tr>
|
||||||
|
* </table>
|
||||||
|
* <br> <br>
|
||||||
|
*
|
||||||
|
* </li>
|
||||||
|
*
|
||||||
|
* <li>
|
||||||
|
* <A NAME="formula_termBoost"></A>
|
||||||
|
* <b>t.getBoost()</b>
|
||||||
|
* is a search time boost of term <i>t</i> in the query <i>q</i> as
|
||||||
|
* specified in the query text
|
||||||
|
* (see <A HREF="../../../../../queryparsersyntax.html#Boosting a Term">query syntax</A>),
|
||||||
|
* or as set by application calls to
|
||||||
|
* {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
|
||||||
|
 * Notice that there is no direct API for accessing the boost of a single term in a multi-term query;
|
||||||
|
 * instead, the terms are represented in the query as multiple
|
||||||
|
* {@link org.apache.lucene.search.TermQuery TermQuery} objects,
|
||||||
|
* and so the boost of a term in the query is accessible by calling the sub-query
|
||||||
|
* {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
|
||||||
|
* <br> <br>
|
||||||
|
* </li>
|
||||||
|
*
|
||||||
|
* <li>
|
||||||
|
* <A NAME="formula_norm"></A>
|
||||||
|
* <b>norm(t,d)</b> encapsulates a few (indexing time) boost and length factors:
|
||||||
|
*
|
||||||
|
* <ul>
|
||||||
|
* <li><b>Document boost</b> - set by calling
|
||||||
|
* {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
|
||||||
|
* before adding the document to the index.
|
||||||
|
* </li>
|
||||||
|
* <li><b>Field boost</b> - set by calling
|
||||||
|
* {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
|
||||||
|
* before adding the field to a document.
|
||||||
|
* </li>
|
||||||
|
* <li>{@link #lengthNorm(String, int) <b>lengthNorm</b>(field)} - computed
|
||||||
|
* when the document is added to the index in accordance with the number of tokens
|
||||||
|
* of this field in the document, so that shorter fields contribute more to the score.
|
||||||
|
* LengthNorm is computed by the Similarity class in effect at indexing.
|
||||||
|
* </li>
|
||||||
|
* </ul>
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* When a document is added to the index, all the above factors are multiplied.
|
||||||
|
* If the document has multiple fields with the same name, all their boosts are multiplied together:
|
||||||
|
*
|
||||||
|
* <br> <br>
|
||||||
|
 * <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||||
|
* <tr>
|
||||||
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
|
* norm(t,d) =
|
||||||
|
* {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
|
||||||
|
* ·
|
||||||
|
* {@link #lengthNorm(String, int) lengthNorm(field)}
|
||||||
|
* ·
|
||||||
|
* </td>
|
||||||
|
* <td valign="bottom" align="center" rowspan="1">
|
||||||
|
* <big><big><big>∏</big></big></big>
|
||||||
|
* </td>
|
||||||
|
* <td valign="middle" align="right" rowspan="1">
|
||||||
|
* {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
|
||||||
|
* </td>
|
||||||
|
* </tr>
|
||||||
|
 * <tr valign="top">
|
||||||
|
* <td></td>
|
||||||
|
* <td align="center"><small>field <i><b>f</b></i> in <i>d</i> named as <i><b>t</b></i></small></td>
|
||||||
|
* <td></td>
|
||||||
|
* </tr>
|
||||||
|
* </table>
|
||||||
|
* <br> <br>
|
||||||
|
 * However, the resulting <i>norm</i> value is {@link #encodeNorm(float) encoded} as a single byte
|
||||||
|
* before being stored.
|
||||||
|
* At search time, the norm byte value is read from the index
|
||||||
|
* {@link org.apache.lucene.store.Directory directory} and
|
||||||
|
* {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
|
||||||
|
 * This encoding/decoding, while reducing index size, comes at the price of
|
||||||
|
* precision loss - it is not guaranteed that decode(encode(x)) = x.
|
||||||
|
* For instance, decode(encode(0.89)) = 0.75.
|
||||||
|
* Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by
|
||||||
|
* using a different {@link Similarity} for search.
|
||||||
|
* <br> <br>
|
||||||
|
* </li>
|
||||||
|
* </ol>
|
||||||
*
|
*
|
||||||
* @see #setDefault(Similarity)
|
* @see #setDefault(Similarity)
|
||||||
* @see IndexWriter#setSimilarity(Similarity)
|
* @see IndexWriter#setSimilarity(Similarity)
|
||||||
|
|
|
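The formula in the new Similarity javadoc above can be followed numerically with a small standalone sketch. It mirrors the default computations shown in the javadoc (square-root tf, `1 + log(numDocs/(docFreq+1))` idf, `1/sqrt(sumOfSquaredWeights)` queryNorm). The class name, the use of the natural logarithm, and the `overlap / maxOverlap` form of coord are assumptions made for illustration; none of this is Lucene's own code.

```java
// Illustrative sketch only: plain Java, not Lucene code.
final class ScoreFormulaSketch {

    // tf(t in d) = frequency^(1/2)
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf(t) = 1 + log(numDocs / (docFreq + 1)); natural log is an assumption.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Assumption: fraction of query terms that the document matches.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // sumOfSquaredWeights = q.getBoost()^2 * sum over t in q of (idf(t) * t.getBoost())^2
    static float sumOfSquaredWeights(float queryBoost, float[] idfs, float[] termBoosts) {
        float sum = 0.0f;
        for (int i = 0; i < idfs.length; i++) {
            float w = idfs[i] * termBoosts[i];
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }

    // score(q,d) = coord * queryNorm * sum over t of tf * idf^2 * t.getBoost() * norm(t,d)
    static float score(float coord, float queryNorm,
                       float[] tfs, float[] idfs, float[] termBoosts, float[] norms) {
        float sum = 0.0f;
        for (int i = 0; i < tfs.length; i++) {
            sum += tfs[i] * idfs[i] * idfs[i] * termBoosts[i] * norms[i];
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // One-term query; the term occurs 4 times in the document,
        // appears in 9 of 1000 documents, and all boosts and norms are 1.0.
        float[] idfs = { idf(9, 1000) };
        float[] boosts = { 1.0f };
        float qn = queryNorm(sumOfSquaredWeights(1.0f, idfs, boosts));
        float s = score(coord(1, 1), qn, new float[] { tf(4) }, idfs, boosts, new float[] { 1.0f });
        System.out.println(s);
    }
}
```

Because queryNorm is the same for every document scored against a given query, dropping it from `score` would change the absolute values but not the ranking, as the javadoc notes.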
@ -61,80 +61,60 @@
|
||||||
on the Fields).
|
on the Fields).
|
||||||
</p>
|
</p>
|
||||||
</subsection>
|
</subsection>
|
||||||
|
<subsection name="Score Boosting">
|
||||||
|
<p>Lucene allows influencing search results by "boosting" at more than one level:
|
||||||
|
<ul>
|
||||||
|
<li><b>Document level boosting</b>
|
||||||
|
- while indexing - by calling
|
||||||
|
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
|
||||||
|
before a document is added to the index.
|
||||||
|
</li>
|
||||||
|
<li><b>Document's Field level boosting</b>
|
||||||
|
- while indexing - by calling
|
||||||
|
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
|
||||||
|
before adding a field to the document (and before adding the document to the index).
|
||||||
|
</li>
|
||||||
|
<li><b>Query level boosting</b>
|
||||||
|
- during search, by setting a boost on a query clause, calling
|
||||||
|
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</p>
|
||||||
|
<p>Indexing time boosts are preprocessed for storage efficiency and written to
|
||||||
|
the directory (when writing the document) in a single byte (!) as follows:
|
||||||
|
For each field of a document, all boosts of that field
|
||||||
|
(i.e. all boosts under the same field name in that doc) are multiplied.
|
||||||
|
The result is multiplied by the boost of the document,
|
||||||
|
and also multiplied by a "field length norm" value
|
||||||
|
that represents the length of that field in that doc
|
||||||
|
(so shorter fields are automatically boosted up).
|
||||||
|
The result is encoded as a single byte
|
||||||
|
(with some precision loss of course) and stored in the directory.
|
||||||
|
The similarity object in effect at indexing computes the length-norm of the field.
|
||||||
|
</p>
|
||||||
|
<p>This composition of 1-byte representation of norms
|
||||||
|
(that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
|
||||||
|
is nicely described in
|
||||||
|
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
|
||||||
|
</p>
|
||||||
|
<p>Encoding and decoding of the resulting float norm in a single byte are done by the
|
||||||
|
static methods of the class Similarity:
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
|
||||||
|
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
|
||||||
|
e.g. decode(encode(0.89)) = 0.75.
|
||||||
|
At scoring (search) time, this norm is brought into the score of the document
|
||||||
|
as <b>norm(t,d)</b>, as shown by the formula in
|
||||||
|
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
|
||||||
|
</p>
|
||||||
|
</subsection>
|
||||||
<subsection name="Understanding the Scoring Formula">
|
<subsection name="Understanding the Scoring Formula">
|
||||||
|
|
||||||
<p>
|
<p>
|
||||||
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
|
This scoring formula is described in the
|
||||||
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
|
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
|
||||||
relevant document <i>d</i> is to the query <i>q</i>. This is taken from
|
basics of Lucene scoring work, especially the
|
||||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
|
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
|
||||||
|
|
||||||
<div class="formula">
|
|
||||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
|
||||||
score(q,d) =
|
|
||||||
<span class="big" id="summation">
|
|
||||||
sum </span><span class="summation-range">t in q</span><span>(
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
|
|
||||||
(t in d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
|
|
||||||
(t)^2 *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
|
||||||
getBoost
|
|
||||||
</A>
|
|
||||||
(t in q) *
|
|
||||||
getBoost
|
|
||||||
(t.field in d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
|
|
||||||
lengthNorm
|
|
||||||
</A>
|
|
||||||
(t.field in d) )</span> <span> *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
|
|
||||||
coord
|
|
||||||
</A>
|
|
||||||
(q,d) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
|
|
||||||
queryNorm
|
|
||||||
</A>(sumOfSquaredWeights)</span>
|
|
||||||
</div>
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
where
|
|
||||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
|
||||||
<div id="#sumOfSquares">
|
|
||||||
sumOfSquaredWeights =
|
|
||||||
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
|
|
||||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
|
|
||||||
idf
|
|
||||||
</A>
|
|
||||||
(t) *
|
|
||||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
|
||||||
getBoost
|
|
||||||
</A>
|
|
||||||
(t in q) )^2</span>
|
|
||||||
</div>
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
This scoring formula is mostly implemented in the
|
|
||||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
|
|
||||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
|
|
||||||
<ol>
|
|
||||||
|
|
||||||
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
|
|
||||||
|
|
||||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
|
|
||||||
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
|
|
||||||
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
|
|
||||||
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
|
|
||||||
</ol>
|
|
||||||
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
|
|
||||||
for context and are not authoratitive.
|
|
||||||
</p>
|
</p>
|
||||||
</subsection>
|
</subsection>
|
||||||
<subsection name="The Big Picture">
|
<subsection name="The Big Picture">
|
||||||
|
|
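The single-byte norm storage described in the Score Boosting subsection of this diff can be illustrated with a small float-to-byte scheme. The sketch below keeps only the top bits of an IEEE 754 float (a few exponent bits and the leading mantissa bits), which is one plausible way to get such an encoding. The class name and the exact bit layout are assumptions for illustration, not Lucene's `encodeNorm`/`decodeNorm` implementation, so the value it produces for a given input need not match the figure quoted in the documentation.

```java
// Illustrative sketch only: a lossy float <-> byte round trip, not Lucene code.
final class NormByteSketch {

    // Assumed layout: drop the low 21 bits of the float, then rebias so
    // that the representable range fits into the 256 values of one byte.
    private static final int ZERO_OFFSET = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21; // sign, exponent, and the top 2 mantissa bits
        if (small <= ZERO_OFFSET) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_OFFSET + 0x100) {
            return (byte) -1; // overflow: clamp to the largest code (0xFF)
        }
        return (byte) (small - ZERO_OFFSET);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24; // undo the rebias applied in encode
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = { 0.5f, 0.89f, 1.0f, 3.3f };
        for (float x : samples) {
            // Values with more mantissa bits than the byte can hold come back truncated.
            System.out.println(x + " -> " + decode(encode(x)));
        }
    }
}
```

Values such as 0.5 and 1.0 survive the round trip because their mantissa bits are all zero; anything needing finer precision is truncated to a nearby representable value.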