<author email="gsingers at apache.org">Grant Ingersoll</author>
<title>Scoring - Apache Lucene</title>
</properties>
<body>
<section name="Introduction">
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work the way one would expect.
Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms.</p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to the <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - A factor based on the number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - A value that decreases as the number of documents containing the term <i>t</i> grows. This means rarer terms give a higher contribution to the total score.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
<p>In this regard, Lucene offers a wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class),
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used by default.
See <a href="http://svn.apache.org/repos/asf/lucene/java/trunk/CHANGES.txt">CHANGES.txt</a> under release 1.9 RC1 for more information on choosing which Scorer to use.
</p>
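<p>As a rough illustration of how the factors listed earlier could combine into a score for a boolean query, here is a minimal standalone sketch. It does not call Lucene. The class and method names are hypothetical, and the formulas are assumptions modeled on what DefaultSimilarity is generally understood to compute (square-root term frequency, logarithmic idf applied twice, inverse-square-root length normalization); consult the Similarity Javadoc for the authoritative definitions.</p>

```java
/**
 * Standalone sketch of Lucene-style scoring arithmetic.
 * Nothing here is a Lucene class; all names are hypothetical and the
 * formulas are assumptions modeled on DefaultSimilarity.
 */
final class ScoringSketch {

    /** Term frequency factor: assumed to be the square root of the raw count. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Inverse document frequency: assumed to be log(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Length normalization: assumed to be 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Coordination factor: fraction of the query's clauses that the document matched. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Query normalization: assumed to be 1 / sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /**
     * Contribution of one matching term clause (assumed: tf * idf^2 * boost * lengthNorm).
     * queryNorm is left out because it is the same for every document in one query.
     */
    static float termScore(int freq, int docFreq, int numDocs, float boost, int fieldLength) {
        float idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(fieldLength);
    }

    /** Sums the scores of the matched clauses and scales the sum by coord(). */
    static float booleanScore(float[] matchedClauseScores, int totalClauses) {
        float sum = 0f;
        for (float s : matchedClauseScores) {
            sum += s;
        }
        return sum * coord(matchedClauseScores.length, totalClauses);
    }

    public static void main(String[] args) {
        int numDocs = 1000;      // made-up index size
        int totalClauses = 6;    // made-up query with six term clauses

        // Document A: matches five very common terms, each once, in a 400-term field.
        float common = termScore(1, 999, numDocs, 1.0f, 400);
        float docA = booleanScore(new float[] {common, common, common, common, common}, totalClauses);

        // Document B: matches a single rare term once in a 4-term field.
        float rare = termScore(1, 1, numDocs, 1.0f, 4);
        float docB = booleanScore(new float[] {rare}, totalClauses);

        System.out.println("docA=" + docA + " docB=" + docB);
    }
}
```

<p>With these made-up numbers, the single rare-term match in a short field outscores the five common-term matches in a long field, which is the kind of surprise described in the introduction.</p>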
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub-scorers and sums the scores
provided by each scorer while factoring in the coord() score.
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->