mirror of https://github.com/apache/lucene.git
Updated scoring information to only have one copy of the Scoring Formula. Implemented Doron Cohen's new scoring formula description in the javadoc.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@454767 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
0d4e1b171d
commit
75f561901e
|
@ -76,9 +76,9 @@ API Changes
|
|||
SingleInstanceLockFactory (ie, in memory locking) locking with an
|
||||
FSDirectory. Note that now you must call setDisableLocks before
|
||||
the instantiation a FSDirectory if you wish to disable locking
|
||||
for that Directory.
|
||||
for that Directory.
|
||||
(Michael McCandless, Jeff Patterson via Yonik Seeley)
|
||||
|
||||
|
||||
Bug fixes
|
||||
|
||||
1. Fixed the web application demo (built with "ant war-demo") which
|
||||
|
@ -127,7 +127,7 @@ Bug fixes
|
|||
has no value.
|
||||
(Oliver Hutchison via Chris Hostetter)
|
||||
|
||||
|
||||
|
||||
Optimizations
|
||||
|
||||
1. LUCENE-586: TermDocs.skipTo() is now more efficient for multi-segment
|
||||
|
@ -164,7 +164,8 @@ Documentation
|
|||
|
||||
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
|
||||
|
||||
2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)
|
||||
2. Added scoring.xml document into xdocs. Updated Similarity.java scoring formula.(Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting). Issue 664.
|
||||
|
||||
|
||||
Release 2.0.0 2006-05-26
|
||||
|
||||
|
|
|
@ -188,6 +188,63 @@ limitations under the License.
|
|||
</blockquote>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#828DA6">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="Score Boosting"><strong>Score Boosting</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<p>Lucene allows influencing search results by "boosting" in more than one level:
|
||||
<ul>
|
||||
<li><b>Document level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
|
||||
before a document is added to the index.
|
||||
</li>
|
||||
<li><b>Document's Field level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
|
||||
before adding a field to the document (and before adding the document to the index).
|
||||
</li>
|
||||
<li><b>Query level boosting</b>
|
||||
- during search, by setting a boost on a query clause, calling
|
||||
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
|
||||
</li>
|
||||
</ul>
|
||||
</p>
|
||||
<p>Indexing time boosts are preprocessed for storage efficiency and written to
|
||||
the directory (when writing the document) in a single byte (!) as follows:
|
||||
For each field of a document, all boosts of that field
|
||||
(i.e. all boosts under the same field name in that doc) are multiplied.
|
||||
The result is multiplied by the boost of the document,
|
||||
and also multiplied by a "field length norm" value
|
||||
that represents the length of that field in that doc
|
||||
(so shorter fields are automatically boosted up).
|
||||
The result is decoded as a single byte
|
||||
(with some precision loss of course) and stored in the directory.
|
||||
The similarity object in effect at indexing computes the length-norm of the field.
|
||||
</p>
|
||||
<p>This composition of 1-byte representation of norms
|
||||
(that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
|
||||
is nicely described in
|
||||
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
|
||||
</p>
|
||||
<p>Encoding and decoding of the resulted float norm in a single byte are done by the
|
||||
static methods of the class Similarity:
|
||||
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
|
||||
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
|
||||
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
|
||||
e.g. decode(encode(0.89)) = 0.75.
|
||||
At scoring (search) time, this norm is brought into the score of document
|
||||
as <b>indexBoost</b>, as shown by the formula in
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
|
||||
</p>
|
||||
</blockquote>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#828DA6">
|
||||
|
@ -198,78 +255,10 @@ limitations under the License.
|
|||
<tr><td>
|
||||
<blockquote>
|
||||
<p>
|
||||
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
|
||||
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
|
||||
relevant document <i>d</i> is to the query <i>q</i>. This is taken from
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
|
||||
|
||||
<div class="formula">
|
||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
||||
score(q,d) =
|
||||
<span class="big" id="summation">
|
||||
sum </span><span class="summation-range">t in q</span><span>(
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
|
||||
(t in d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
|
||||
(t)^2 *
|
||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
||||
getBoost
|
||||
</A>
|
||||
(t in q) *
|
||||
getBoost
|
||||
(t.field in d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
|
||||
lengthNorm
|
||||
</A>
|
||||
(t.field in d) )</span> <span> *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
|
||||
coord
|
||||
</A>
|
||||
(q,d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
|
||||
queryNorm
|
||||
</A>(sumOfSquaredWeights)</span>
|
||||
</div>
|
||||
</p>
|
||||
<p>
|
||||
where
|
||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
||||
<div id="#sumOfSquares">
|
||||
sumOfSquaredWeights =
|
||||
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
|
||||
idf
|
||||
</A>
|
||||
(t) *
|
||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
||||
getBoost
|
||||
</A>
|
||||
(t in q) )^2</span>
|
||||
</div>
|
||||
</p>
|
||||
<p>
|
||||
This scoring formula is mostly implemented in the
|
||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
|
||||
<ol>
|
||||
|
||||
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
|
||||
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
|
||||
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
|
||||
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
|
||||
</ol>
|
||||
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
|
||||
for context and are not authoratitive.
|
||||
This scoring formula is described in the
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
|
||||
basics of Lucene scoring work, especially the
|
||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
|
||||
</p>
|
||||
</blockquote>
|
||||
</td></tr>
|
||||
|
@ -295,7 +284,7 @@ limitations under the License.
|
|||
These implementations can be combined in a wide variety of ways to provide complex querying
|
||||
capabilities along with
|
||||
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
|
||||
section below
|
||||
section below
|
||||
highlights some of the more important Query classes. For information on the other ones, see the
|
||||
<a href="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
|
||||
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
|
||||
|
|
|
@ -16,67 +16,271 @@ package org.apache.lucene.search;
|
|||
* limitations under the License.
|
||||
*/
|
||||
|
||||
import org.apache.lucene.index.IndexReader;
|
||||
import org.apache.lucene.index.IndexWriter;
|
||||
import org.apache.lucene.index.Term;
|
||||
import org.apache.lucene.util.SmallFloat;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.Serializable;
|
||||
import java.util.Collection;
|
||||
import java.util.Iterator;
|
||||
|
||||
import org.apache.lucene.index.IndexReader;
|
||||
import org.apache.lucene.index.IndexWriter;
|
||||
import org.apache.lucene.index.Term;
|
||||
import org.apache.lucene.util.SmallFloat;
|
||||
|
||||
/** Expert: Scoring API.
|
||||
* <p>Subclasses implement search scoring.
|
||||
*
|
||||
* <p>The score of query <code>q</code> for document <code>d</code> is defined
|
||||
* in terms of these methods as follows:
|
||||
* <p>The score of query <code>q</code> for document <code>d</code> correlates to the
|
||||
* cosine-distance or dot-product between document and query vectors in a
|
||||
* <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
|
||||
* Vector Space Model (VSM) of Information Retrieval</a>.
|
||||
* A document whose vector is closer to the query vector in that model is scored higher.
|
||||
*
|
||||
* <table cellpadding="0" cellspacing="0" border="0">
|
||||
* The score is computed as follows:
|
||||
*
|
||||
* <P>
|
||||
* <table cellpadding="1" cellspacing="0" border="1" align="center">
|
||||
* <tr><td>
|
||||
* <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="2">score(q,d) =<br></td>
|
||||
* <td valign="middle" align="center">
|
||||
* <big><big><big><big><big>Σ</big></big></big></big></big></td>
|
||||
* <td valign="middle"><small>
|
||||
* ( {@link #tf(int) tf}(t in d) *
|
||||
* {@link #idf(Term,Searcher) idf}(t)^2 *
|
||||
* {@link Query#getBoost getBoost}(t in q) *
|
||||
* {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
|
||||
* {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
|
||||
* </small></td>
|
||||
* <td valign="middle" rowspan="2"> *
|
||||
* {@link #coord(int,int) coord}(q,d) *
|
||||
* {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* score(q,d) =
|
||||
* <A HREF="#formula_coord">coord(q,d)</A> ·
|
||||
* <A HREF="#formula_queryNorm">queryNorm(q)</A> ·
|
||||
* </td>
|
||||
* <td valign="bottom" align="center" rowspan="1">
|
||||
* <big><big><big>∑</big></big></big>
|
||||
* </td>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* <big><big>(</big></big>
|
||||
* <A HREF="#formula_tf">tf(t in d)</A> ·
|
||||
* <A HREF="#formula_idf">idf(t)</A><sup>2</sup> ·
|
||||
* <A HREF="#formula_termBoost">t.getBoost()</A> ·
|
||||
* <A HREF="#formula_norm">norm(t,d)</A>
|
||||
* <big><big>)</big></big>
|
||||
* </td>
|
||||
* </tr>
|
||||
* <tr>
|
||||
* <td valign="top" align="right">
|
||||
* <small>t in q</small>
|
||||
* </td>
|
||||
* <tr valigh="top">
|
||||
* <td></td>
|
||||
* <td align="center"><small>t in q</small></td>
|
||||
* <td></td>
|
||||
* </tr>
|
||||
* </table>
|
||||
*
|
||||
* </td></tr>
|
||||
* </table>
|
||||
*
|
||||
* <p> where
|
||||
*
|
||||
* <table cellpadding="0" cellspacing="0" border="0">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="2">sumOfSqaredWeights =<br></td>
|
||||
* <td valign="middle" align="center">
|
||||
* <big><big><big><big><big>Σ</big></big></big></big></big></td>
|
||||
* <td valign="middle"><small>
|
||||
* ( {@link #idf(Term,Searcher) idf}(t) *
|
||||
* {@link Query#getBoost getBoost}(t in q) )^2
|
||||
* </small></td>
|
||||
* </tr>
|
||||
* <tr>
|
||||
* <td valign="top" align="right">
|
||||
* <small>t in q</small>
|
||||
* </td>
|
||||
* </tr>
|
||||
* </table>
|
||||
*
|
||||
* <p> Note that the above formula is motivated by the cosine-distance or dot-product
|
||||
* between document and query vector, which is implemented by {@link DefaultSimilarity}.
|
||||
* <ol>
|
||||
* <li>
|
||||
* <A NAME="formula_tf"></A>
|
||||
* <b>tf(t in d)</b>
|
||||
* correlates to the term's <i>frequency</i>,
|
||||
* defined as the number of times term <i>t</i> appears in the currently scored document <i>d</i>.
|
||||
* Documents that have more occurrences of a given term receive a higher score.
|
||||
* The default computation for <i>tf(t in d)</i> in
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
|
||||
*
|
||||
* <br> <br>
|
||||
* <table cellpadding="2" cellspacing="2" border="0" align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} =
|
||||
* </td>
|
||||
* <td valign="top" align="center" rowspan="1">
|
||||
* frequency<sup><big>½</big></sup>
|
||||
* </td>
|
||||
* </tr>
|
||||
* </table>
|
||||
* <br> <br>
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <A NAME="formula_idf"></A>
|
||||
* <b>idf(t)</b> stands for Inverse Document Frequency. This value
|
||||
* correlates to the inverse of <i>docFreq</i>
|
||||
* (the number of documents in which the term <i>t</i> appears).
|
||||
* This means rarer terms give higher contribution to the total score.
|
||||
* The default computation for <i>idf(t)</i> in
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
|
||||
*
|
||||
* <br> <br>
|
||||
* <table cellpadding="2" cellspacing="2" border="0" align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right">
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)} =
|
||||
* </td>
|
||||
* <td valign="middle" align="center">
|
||||
* 1 + log <big>(</big>
|
||||
* </td>
|
||||
* <td valign="middle" align="center">
|
||||
* <table>
|
||||
* <tr><td align="center"><small>numDocs</small></td></tr>
|
||||
* <tr><td align="center">–––––––––</td></tr>
|
||||
* <tr><td align="center"><small>docFreq+1</small></td></tr>
|
||||
* </table>
|
||||
* </td>
|
||||
* <td valign="middle" align="center">
|
||||
* <big>)</big>
|
||||
* </td>
|
||||
* </tr>
|
||||
* </table>
|
||||
* <br> <br>
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <A NAME="formula_coord"></A>
|
||||
* <b>coord(q,d)</b>
|
||||
* is a score factor based on how many of the query terms are found in the specified document.
|
||||
* Typically, a document that contains more of the query's terms will receive a higher score
|
||||
* than another document with fewer query terms.
|
||||
* This is a search time factor computed in
|
||||
* {@link #coord(int, int) coord(q,d)}
|
||||
* by the Similarity in effect at search time.
|
||||
* <br> <br>
|
||||
* </li>
|
||||
*
|
||||
* <li><b>
|
||||
* <A NAME="formula_queryNorm"></A>
|
||||
* queryNorm(q)
|
||||
* </b>
|
||||
* is a normalizing factor used to make scores between queries comparable.
|
||||
* This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
|
||||
* but rather just attempts to make scores from different queries (or even different indexes) comparable.
|
||||
* This is a search time factor computed by the Similarity in effect at search time.
|
||||
*
|
||||
* The default computation in
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
|
||||
* is:
|
||||
* <br> <br>
|
||||
* <table cellpadding="1" cellspacing="0" border="0" align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* queryNorm(q) =
|
||||
* {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
|
||||
* =
|
||||
* </td>
|
||||
* <td valign="middle" align="center" rowspan="1">
|
||||
* <table>
|
||||
* <tr><td align="center"><big>1</big></td></tr>
|
||||
* <tr><td align="center"><big>
|
||||
* ––––––––––––––
|
||||
* </big></td></tr>
|
||||
* <tr><td align="center">sumOfSquaredWeights<sup><big>½</big></sup></td></tr>
|
||||
* </table>
|
||||
* </td>
|
||||
* </tr>
|
||||
* </table>
|
||||
* <br> <br>
|
||||
*
|
||||
* The sum of squared weights (of the query terms) is
|
||||
* computed by the query {@link org.apache.lucene.search.Weight} object.
|
||||
* For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
|
||||
* computes this value as:
|
||||
*
|
||||
* <br> <br>
|
||||
* <table cellpadding="1" cellspacing="0" border="0"n align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} =
|
||||
* {@link org.apache.lucene.search.Query#getBoost() q.getBoost()} <sup><big>2</big></sup>
|
||||
* ·
|
||||
* </td>
|
||||
* <td valign="bottom" align="center" rowspan="1">
|
||||
* <big><big><big>∑</big></big></big>
|
||||
* </td>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* <big><big>(</big></big>
|
||||
* <A HREF="#formula_idf">idf(t)</A> ·
|
||||
* <A HREF="#formula_termBoost">t.getBoost()</A>
|
||||
* <big><big>) <sup>2</sup> </big></big>
|
||||
* </td>
|
||||
* </tr>
|
||||
* <tr valigh="top">
|
||||
* <td></td>
|
||||
* <td align="center"><small>t in q</small></td>
|
||||
* <td></td>
|
||||
* </tr>
|
||||
* </table>
|
||||
* <br> <br>
|
||||
*
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <A NAME="formula_termBoost"></A>
|
||||
* <b>t.getBoost()</b>
|
||||
* is a search time boost of term <i>t</i> in the query <i>q</i> as
|
||||
* specified in the query text
|
||||
* (see <A HREF="../../../../../queryparsersyntax.html#Boosting a Term">query syntax</A>),
|
||||
* or as set by application calls to
|
||||
* {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
|
||||
* Notice that there is really no direct API for accessing a boost of one term in a multi term query,
|
||||
* but rather multi terms are represented in a query as multi
|
||||
* {@link org.apache.lucene.search.TermQuery TermQuery} objects,
|
||||
* and so the boost of a term in the query is accessible by calling the sub-query
|
||||
* {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
|
||||
* <br> <br>
|
||||
* </li>
|
||||
*
|
||||
* <li>
|
||||
* <A NAME="formula_norm"></A>
|
||||
* <b>norm(t,d)</b> encapsulates a few (indexing time) boost and length factors:
|
||||
*
|
||||
* <ul>
|
||||
* <li><b>Document boost</b> - set by calling
|
||||
* {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
|
||||
* before adding the document to the index.
|
||||
* </li>
|
||||
* <li><b>Field boost</b> - set by calling
|
||||
* {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
|
||||
* before adding the field to a document.
|
||||
* </li>
|
||||
* <li>{@link #lengthNorm(String, int) <b>lengthNorm</b>(field)} - computed
|
||||
* when the document is added to the index in accordance with the number of tokens
|
||||
* of this field in the document, so that shorter fields contribute more to the score.
|
||||
* LengthNorm is computed by the Similarity class in effect at indexing.
|
||||
* </li>
|
||||
* </ul>
|
||||
*
|
||||
* <p>
|
||||
* When a document is added to the index, all the above factors are multiplied.
|
||||
* If the document has multiple fields with the same name, all their boosts are multiplied together:
|
||||
*
|
||||
* <br> <br>
|
||||
* <table cellpadding="1" cellspacing="0" border="0"n align="center">
|
||||
* <tr>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* norm(t,d) =
|
||||
* {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
|
||||
* ·
|
||||
* {@link #lengthNorm(String, int) lengthNorm(field)}
|
||||
* ·
|
||||
* </td>
|
||||
* <td valign="bottom" align="center" rowspan="1">
|
||||
* <big><big><big>∏</big></big></big>
|
||||
* </td>
|
||||
* <td valign="middle" align="right" rowspan="1">
|
||||
* {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
|
||||
* </td>
|
||||
* </tr>
|
||||
* <tr valigh="top">
|
||||
* <td></td>
|
||||
* <td align="center"><small>field <i><b>f</b></i> in <i>d</i> named as <i><b>t</b></i></small></td>
|
||||
* <td></td>
|
||||
* </tr>
|
||||
* </table>
|
||||
* <br> <br>
|
||||
* However the resulted <i>norm</i> value is {@link #encodeNorm(float) encoded} as a single byte
|
||||
* before being stored.
|
||||
* At search time, the norm byte value is read from the index
|
||||
* {@link org.apache.lucene.store.Directory directory} and
|
||||
* {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
|
||||
* This encoding/decoding, while reducing index size, comes with the price of
|
||||
* precision loss - it is not guaranteed that decode(encode(x)) = x.
|
||||
* For instance, decode(encode(0.89)) = 0.75.
|
||||
* Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by
|
||||
* using a different {@link Similarity} for search.
|
||||
* <br> <br>
|
||||
* </li>
|
||||
* </ol>
|
||||
*
|
||||
* @see #setDefault(Similarity)
|
||||
* @see IndexWriter#setSimilarity(Similarity)
|
||||
|
|
|
@ -61,80 +61,60 @@
|
|||
on the Fields).
|
||||
</p>
|
||||
</subsection>
|
||||
<subsection name="Score Boosting">
|
||||
<p>Lucene allows influencing search results by "boosting" in more than one level:
|
||||
<ul>
|
||||
<li><b>Document level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
|
||||
before a document is added to the index.
|
||||
</li>
|
||||
<li><b>Document's Field level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
|
||||
before adding a field to the document (and before adding the document to the index).
|
||||
</li>
|
||||
<li><b>Query level boosting</b>
|
||||
- during search, by setting a boost on a query clause, calling
|
||||
<a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
|
||||
</li>
|
||||
</ul>
|
||||
</p>
|
||||
<p>Indexing time boosts are preprocessed for storage efficiency and written to
|
||||
the directory (when writing the document) in a single byte (!) as follows:
|
||||
For each field of a document, all boosts of that field
|
||||
(i.e. all boosts under the same field name in that doc) are multiplied.
|
||||
The result is multiplied by the boost of the document,
|
||||
and also multiplied by a "field length norm" value
|
||||
that represents the length of that field in that doc
|
||||
(so shorter fields are automatically boosted up).
|
||||
The result is decoded as a single byte
|
||||
(with some precision loss of course) and stored in the directory.
|
||||
The similarity object in effect at indexing computes the length-norm of the field.
|
||||
</p>
|
||||
<p>This composition of 1-byte representation of norms
|
||||
(that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
|
||||
is nicely described in
|
||||
<a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
|
||||
</p>
|
||||
<p>Encoding and decoding of the resulted float norm in a single byte are done by the
|
||||
static methods of the class Similarity:
|
||||
<a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
|
||||
<a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
|
||||
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
|
||||
e.g. decode(encode(0.89)) = 0.75.
|
||||
At scoring (search) time, this norm is brought into the score of document
|
||||
as <b>indexBoost</b>, as shown by the formula in
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
|
||||
</p>
|
||||
</subsection>
|
||||
<subsection name="Understanding the Scoring Formula">
|
||||
|
||||
<p>
|
||||
Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
|
||||
term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
|
||||
relevant document <i>d</i> is to the query <i>q</i>. This is taken from
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
|
||||
|
||||
<div class="formula">
|
||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
||||
score(q,d) =
|
||||
<span class="big" id="summation">
|
||||
sum </span><span class="summation-range">t in q</span><span>(
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
|
||||
(t in d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
|
||||
(t)^2 *
|
||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
||||
getBoost
|
||||
</A>
|
||||
(t in q) *
|
||||
getBoost
|
||||
(t.field in d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
|
||||
lengthNorm
|
||||
</A>
|
||||
(t.field in d) )</span> <span> *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
|
||||
coord
|
||||
</A>
|
||||
(q,d) *
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
|
||||
queryNorm
|
||||
</A>(sumOfSquaredWeights)</span>
|
||||
</div>
|
||||
</p>
|
||||
<p>
|
||||
where
|
||||
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
|
||||
<div id="#sumOfSquares">
|
||||
sumOfSquaredWeights =
|
||||
<span class="big">sum</span><span class="summation-range">t in q</span><span>(
|
||||
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
|
||||
idf
|
||||
</A>
|
||||
(t) *
|
||||
<A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
|
||||
getBoost
|
||||
</A>
|
||||
(t in q) )^2</span>
|
||||
</div>
|
||||
</p>
|
||||
<p>
|
||||
This scoring formula is mostly implemented in the
|
||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
|
||||
<ol>
|
||||
|
||||
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
|
||||
|
||||
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
|
||||
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
|
||||
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
|
||||
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
|
||||
</ol>
|
||||
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
|
||||
for context and are not authoratitive.
|
||||
This scoring formula is described in the
|
||||
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
|
||||
basics of Lucene scoring work, especially the
|
||||
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
|
||||
</p>
|
||||
</subsection>
|
||||
<subsection name="The Big Picture">
|
||||
|
@ -150,7 +130,7 @@
|
|||
These implementations can be combined in a wide variety of ways to provide complex querying
|
||||
capabilities along with
|
||||
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
|
||||
section below
|
||||
section below
|
||||
highlights some of the more important Query classes. For information on the other ones, see the
|
||||
<a href="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
|
||||
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
|
||||
|
|
Loading…
Reference in New Issue