Applied patch from:

http://issues.apache.org/jira/browse/LUCENE-664

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@434091 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Grant Ingersoll 2006-08-23 17:28:04 +00:00
parent e14f9e35ee
commit 9f374d9202
3 changed files with 54 additions and 37 deletions

View File

@ -142,7 +142,7 @@ Documentation
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll) 1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
2. Added draft scoring.xml document into xdocs. Intent is to be the equivalent of fileformats.xml for scoring. It is not linked into project.xml, so it will not show up on the 2. Added draft scoring.xml document into xdocs. Intent is to be the equivalent of fileformats.xml for scoring. It is not linked into project.xml, so it will not show up on the
website yet. (Grant Ingersoll and Steve Rowe) website yet. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)
Release 2.0.0 2006-05-26 Release 2.0.0 2006-05-26

View File

@ -122,7 +122,7 @@ limitations under the License.
help you figure out the what and why of Lucene scoring.</p> help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the <p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
Retrieval</a> and the Boolean model Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
to determine to determine
how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
times a query term appears in a document relative to times a query term appears in a document relative to
@ -181,7 +181,7 @@ limitations under the License.
and the other in one Field will return different scores for the same query due to length normalization and the other in one Field will return different scores for the same query due to length normalization
(assumming the (assumming the
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
on the Fields. on the Fields.)
</p> </p>
</blockquote> </blockquote>
</td></tr> </td></tr>
@ -196,13 +196,15 @@ limitations under the License.
<tr><td> <tr><td>
<blockquote> <blockquote>
<p> <p>
Lucene's scoring formula, taken from Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
is relevant document <i>d</i> is to the query <i>q</i>. This is taken from
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
<div class="formula"> <div class="formula">
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references--> <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
score(q,d) = score(q,d) =
<span class="big" id="summation"> <span class="big" id="summation">
sum </span><span class="summation-range">t in q</span><span>( sum </span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A> <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
(t in d) * (t in d) *
@ -224,15 +226,14 @@ limitations under the License.
(q,d) * (q,d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)"> <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
queryNorm queryNorm
</A>(sumOfSqaredWeights)</span> </A>(sumOfSquaredWeights)</span>
</div> </div>
</p> </p>
<p> <p>
where where
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references--> <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
<div id="#sumOfSquares"> <div id="#sumOfSquares">
sumOfSqaredWeights = sumOfSquaredWeights =
<span class="big">sum</span><span class="summation-range">t in q</span><span>( <span class="big">sum</span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)"> <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
idf idf
@ -244,18 +245,26 @@ limitations under the License.
(t in q) )^2</span> (t in q) )^2</span>
</div> </div>
</p> </p>
<p>This scoring formula is mostly incorporated into the <p>
This scoring formula is mostly implemented in the
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following: <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<ol> <ol>
<li>tf - Term Frequency - The number of times the term <i>t</i> appears in the current document being scored. </li>
<li>idf - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears in.</li> <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li>getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.</li>
<li>lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.</li> <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
<li>coord(q, d) - Score factor based on how many terms the specified document has in common with the query.</li>
<li>queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.</span></li> that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
</ol> </ol>
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
for context and are not authoratitive. for context and are not authoratitive.

View File

@ -17,7 +17,7 @@
help you figure out the what and why of Lucene scoring.</p> help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the <p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
Retrieval</a> and the Boolean model Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
to determine to determine
how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
times a query term appears in a document relative to times a query term appears in a document relative to
@ -58,18 +58,20 @@
and the other in one Field will return different scores for the same query due to length normalization and the other in one Field will return different scores for the same query due to length normalization
(assumming the (assumming the
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
on the Fields. on the Fields.)
</p> </p>
</subsection> </subsection>
<subsection name="Understanding the Scoring Formula"> <subsection name="Understanding the Scoring Formula">
<p> <p>
Lucene's scoring formula, taken from Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
is relevant document <i>d</i> is to the query <i>q</i>. This is taken from
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
<div class="formula"> <div class="formula">
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references--> <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
score(q,d) = score(q,d) =
<span class="big" id="summation"> <span class="big" id="summation">
sum </span><span class="summation-range">t in q</span><span>( sum </span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A> <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
(t in d) * (t in d) *
@ -91,15 +93,14 @@
(q,d) * (q,d) *
<A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)"> <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
queryNorm queryNorm
</A>(sumOfSqaredWeights)</span> </A>(sumOfSquaredWeights)</span>
</div> </div>
</p> </p>
<p> <p>
where where
<!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references--> <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
<div id="#sumOfSquares"> <div id="#sumOfSquares">
sumOfSqaredWeights = sumOfSquaredWeights =
<span class="big">sum</span><span class="summation-range">t in q</span><span>( <span class="big">sum</span><span class="summation-range">t in q</span><span>(
<A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)"> <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
idf idf
@ -111,19 +112,26 @@
(t in q) )^2</span> (t in q) )^2</span>
</div> </div>
</p> </p>
<p>This scoring formula is mostly incorporated into the <p>
This scoring formula is mostly implemented in the
<a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following: <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<ol> <ol>
<li>tf - Term Frequency - The number of times the term <i>t</i> appears in the current document being scored. </li>
<li>idf - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears in.</li> <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li>getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.</li>
<li>lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.</li> <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
<li>coord(q, d) - Score factor based on how many terms the specified document has in common with the query.</li>
<li>queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></li> to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
</ol> </ol>
Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
for context and are not authoratitive. for context and are not authoratitive.