mirror of https://github.com/apache/lucene.git
Applied patch from:
http://issues.apache.org/jira/browse/LUCENE-664

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@434091 13f79535-47bb-0310-9956-ffa450edef68
parent e14f9e35ee
commit 9f374d9202
@@ -142,7 +142,7 @@ Documentation
 1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)

 2. Added draft scoring.xml document into xdocs. Intent is to be the equivalent of fileformats.xml for scoring. It is not linked into project.xml, so it will not show up on the
-website yet. (Grant Ingersoll and Steve Rowe)
+website yet. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)

 Release 2.0.0 2006-05-26

@@ -122,7 +122,7 @@ limitations under the License.
 help you figure out the what and why of Lucene scoring.</p>
 <p>Lucene scoring uses a combination of the
 <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
-Retrieval</a> and the Boolean model
+Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
 to determine
 how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
 times a query term appears in a document relative to
@@ -181,7 +181,7 @@ limitations under the License.
 and the other in one Field will return different scores for the same query due to length normalization
 (assumming the
 <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
-on the Fields.
+on the Fields.)
 </p>
 </blockquote>
 </td></tr>
@@ -196,13 +196,15 @@ limitations under the License.
 <tr><td>
 <blockquote>
 <p>
-Lucene's scoring formula, taken from
-<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
-is
+Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
+term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
+relevant document <i>d</i> is to the query <i>q</i>. This is taken from
+<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
+
 <div class="formula">
 <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
 score(q,d) =
-<span class="big" id="summation">
+<span class="big" id="summation">
 sum </span><span class="summation-range">t in q</span><span>(
 <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
 (t in d) *
@@ -224,15 +226,14 @@ limitations under the License.
 (q,d) *
 <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
 queryNorm
-</A>(sumOfSqaredWeights)</span>
+</A>(sumOfSquaredWeights)</span>
 </div>
-
 </p>
 <p>
 where
 <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
 <div id="#sumOfSquares">
-sumOfSqaredWeights =
+sumOfSquaredWeights =
 <span class="big">sum</span><span class="summation-range">t in q</span><span>(
 <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
 idf
@@ -244,18 +245,26 @@ limitations under the License.
 (t in q) )^2</span>
 </div>
 </p>
-<p>This scoring formula is mostly incorporated into the
+<p>
+This scoring formula is mostly implemented in the
 <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
-<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following:
+<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
 <ol>
-<li>tf - Term Frequency - The number of times the term <i>t</i> appears in the current document being scored. </li>
-<li>idf - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears in.</li>
-<li>getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.</li>
-<li>lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.</li>
-<li>coord(q, d) - Score factor based on how many terms the specified document has in common with the query.</li>
-<li>queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
+
+<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
 <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
-that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.</span></li>
+that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
+to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
 </ol>
 Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
 for context and are not authoratitive.
@@ -17,7 +17,7 @@
 help you figure out the what and why of Lucene scoring.</p>
 <p>Lucene scoring uses a combination of the
 <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
-Retrieval</a> and the Boolean model
+Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
 to determine
 how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
 times a query term appears in a document relative to
@@ -58,18 +58,20 @@
 and the other in one Field will return different scores for the same query due to length normalization
 (assumming the
 <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
-on the Fields.
+on the Fields.)
 </p>
 </subsection>
 <subsection name="Understanding the Scoring Formula">
 <p>
-Lucene's scoring formula, taken from
-<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
-is
+Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
+term <i>t</i> that occurs in q. The score attempts to measure relevance, so the higher the score, the more
+relevant document <i>d</i> is to the query <i>q</i>. This is taken from
+<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
+
 <div class="formula">
 <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
 score(q,d) =
-<span class="big" id="summation">
+<span class="big" id="summation">
 sum </span><span class="summation-range">t in q</span><span>(
 <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
 (t in d) *
@@ -91,15 +93,14 @@
 (q,d) *
 <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
 queryNorm
-</A>(sumOfSqaredWeights)</span>
+</A>(sumOfSquaredWeights)</span>
 </div>
-
 </p>
 <p>
 where
 <!-- Anyone know how to specify sigma in Anakia? It always seems to strip out my numeric character references-->
 <div id="#sumOfSquares">
-sumOfSqaredWeights =
+sumOfSquaredWeights =
 <span class="big">sum</span><span class="summation-range">t in q</span><span>(
 <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
 idf
@@ -111,19 +112,26 @@
 (t in q) )^2</span>
 </div>
 </p>
-<p>This scoring formula is mostly incorporated into the
+<p>
+This scoring formula is mostly implemented in the
 <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
-<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following:
+<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
 <ol>
-<li>tf - Term Frequency - The number of times the term <i>t</i> appears in the current document being scored. </li>
-<li>idf - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears in.</li>
-<li>getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.</li>
-<li>lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.</li>
-<li>coord(q, d) - Score factor based on how many terms the specified document has in common with the query.</li>
-<li>queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
+
+<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
+
+<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
 <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
 that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
-to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></li>
+to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
 </ol>
 Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
 for context and are not authoratitive.
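For readers of this diff, the factors listed in the updated scoring document (tf, idf, getBoost, lengthNorm, coord, queryNorm) can be illustrated with a small stand-alone Java sketch. This is not Lucene code: the class `ScoringSketch`, the `TermStats` holder, and all method signatures are hypothetical. The per-factor formulas (square-root tf, log-damped idf, and so on) are assumptions modelled on how DefaultSimilarity is commonly described, and the way the factors are multiplied together is also an assumption, since the hunks above elide the middle of the formula.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone sketch of the scoring factors; not part of Lucene. */
final class ScoringSketch {

    /** Statistics for one query term against one document (illustrative names only). */
    static final class TermStats {
        final int freq;        // occurrences of the term in the document
        final int docFreq;     // number of documents that contain the term
        final double boost;    // query-time boost for the term (1.0 has no effect)
        final int fieldLength; // number of terms in the field being searched

        TermStats(int freq, int docFreq, double boost, int fieldLength) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.boost = boost;
            this.fieldLength = fieldLength;
        }
    }

    // Assumed factor definitions: more occurrences raise the score, rarer terms
    // contribute more, and longer fields return a smaller value.
    static double tf(int freq) { return Math.sqrt(freq); }
    static double idf(int docFreq, int numDocs) { return Math.log(numDocs / (double) (docFreq + 1)) + 1.0; }
    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }
    static double coord(int overlap, int maxOverlap) { return overlap / (double) maxOverlap; }
    static double queryNorm(double sumOfSquaredWeights) { return 1.0 / Math.sqrt(sumOfSquaredWeights); }

    /** sumOfSquaredWeights = sum over t in q of (idf(t) * getBoost(t in q))^2. */
    static double sumOfSquaredWeights(List<TermStats> queryTerms, int numDocs) {
        double sum = 0.0;
        for (TermStats t : queryTerms) {
            double weight = idf(t.docFreq, numDocs) * t.boost;
            sum += weight * weight;
        }
        return sum;
    }

    /** Combines the factors for one document; the combination itself is an assumption. */
    static double score(List<TermStats> queryTerms, int numDocs) {
        double sum = 0.0;
        int overlap = 0; // how many query terms the document actually contains
        for (TermStats t : queryTerms) {
            if (t.freq == 0) {
                continue; // a term absent from the document contributes nothing
            }
            overlap++;
            sum += tf(t.freq) * idf(t.docFreq, numDocs) * t.boost * lengthNorm(t.fieldLength);
        }
        if (overlap == 0) {
            return 0.0;
        }
        return sum * coord(overlap, queryTerms.size())
                   * queryNorm(sumOfSquaredWeights(queryTerms, numDocs));
    }

    public static void main(String[] args) {
        // Two query terms; the document contains only the first one.
        List<TermStats> query = Arrays.asList(
                new TermStats(4, 9, 1.0, 100),
                new TermStats(0, 99, 1.0, 100));
        System.out.println("score = " + score(query, 1000));
    }
}
```

In this sketch a document that matches only one of two query terms has its sum halved by coord, which mirrors the description that documents containing more of the query's terms receive a higher score.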