<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms. </p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<ahref="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
<ahref="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to <ahref="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<li><AHREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li><p><AHREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears. This means rarer terms give higher contribution to the total score.</p></li>
<li><p><AHREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><AHREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><AHREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><AHREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<spanclass="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?</span></p></li>
is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are
the use and interactions between the
<ahref="api/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need.
</p>
<p>In this regard, Lucene offers a wide variety of Query implementations, most of which are in the
org.apache.lucene.search package.
These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with
information about where matches took place in the document collection. The <ahref="#Query Classes">Query</a>
section below will
highlight some of the more important Query classes. For information on the other ones, see the
<ahref="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <ahref="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below.
</p>
<p>Once a Query has been created and submitted to the
<ahref="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <ahref="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the Weight implementation and it's
<ahref="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<ahref="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<ahref="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class),
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used by default.
See <ahref="http://svn.apache.org/repos/asf/lucene/java/trunk/CHANGES.txt">CHANGES.txt</a> under release 1.9 RC1 for more information on choosing which Scorer to use.
</p>
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<ahref="api/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
provided by each scorer while factoring in the coord() score.
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
<ahref="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all your searching needs.
However, in some applications it may be necessary to alter your Similarity. For instance, some applications do not need to
distinguish between shorter documents and longer documents (for example,
see <ahref="http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967">a "fair" similarity</a>)
To change the Similarity, one must do so for both indexing and searching and the changes must take place before
any of these actions are undertaken (although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen).
To make this change, implement your Similarity (you probably want to override
<ahref="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>) and then set the new
class on
<ahref="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity(org.apache.lucene.search.Similarity)</a> for indexing and on
If you are interested in use cases for changing your similarity, see the mailing list at <ahref="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
In summary, here are a few use cases:
<ol>
<li>SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount
and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.</li>
<li>Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these
cases people have overridden Similarity to return 1 from the tf() method.</li>
<li>Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes
to a score. In the DefaultSimilarity, lengthNorm = 1/ (numTerms in field)^0.5, but if one changes this to be
1 / (numTerms in field), all fields will be treated
In general, Chris Hostetter sums it up best in saying (from <ahref="http://www.gossamer-threads.com/lists/lucene/java-user/39125">the mailing list</a>):
<blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just that
it's "text" is a situation where it *might* make sense to to override your
<aname="Changing your Scoring -- Expert Level"><strong>Changing your Scoring -- Expert Level</strong></a>
</font>
</td></tr>
<tr><td>
<blockquote>
<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
you want help.
</p>
<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
<spanclass="highlight-for-editing">three main classes</span>:
<ol>
<li>
<ahref="api/org/apache/lucene/search/Query.html">Query</a> -- The abstract object representation of the user's information need.</li>
<li>
<ahref="api/org/apache/lucene/search/Weight.html">Weight</a> -- The internal interface representation of the user's Query, so that Query objects may be reused.</li>
<li>
<ahref="api/org/apache/lucene/search/Scorer.html">Scorer</a> -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.</li>
</ol>
Details on each of these classes, and their children can be found in the subsections below.
class is where it all begins. Without a Query, there would be
nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
is often responsible
for creating them or coordinating the functionality between them. The
<ahref="api/org/apache/lucene/search/Query.html">Query</a> class has several methods that are important for
derived classes:
<ol>
<li>createWeight(Searcher searcher) -- A
<ahref="api/org/apache/lucene/search/Weight.html">Weight</a> is the internal representation of the Query, so each Query implementation must
provide an implementation of Weight. See the subsection on <ahref="#The Weight Interface">The Weight Interface</a> below for details on implementing the Weight interface.</li>
dependent state should be stored in the Weight implementation,
not in the Query class. The interface defines 6 methods that must be implemented:
<ol>
<li>
<ahref="api/org/apache/lucene/search/Weight.html#getQuery()">Weight#getQuery()</a> -- Pointer to the Query that this Weight represents.</li>
<li>
<ahref="api/org/apache/lucene/search/Weight.html#getValue()">Weight#getValue()</a> -- The weight for this Query. For example, the TermQuery.TermWeight value is
equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li>
abstract class provides common scoring functionality for all Scorer implementations and
is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which
must be implemented:
<ol>
<li>
<ahref="api/org/apache/lucene/search/Scorer.html#next()">Scorer#next()</a> -- Advances to the next document that matches this Query, returning true if and only
if there is another document that matches.</li>
<li>
<ahref="api/org/apache/lucene/search/Scorer.html#doc()">Scorer#doc()</a> -- Returns the id of the
that contains the match. Is not valid until next() has been called at least once.
</li>
<li>
<ahref="api/org/apache/lucene/search/Scorer.html#score()">Scorer#score()</a> -- Return the score of the current document. This value can be determined in any
appropriate way for an application. For instance, the
<ahref="api/org/apache/lucene/search/Scorer.html#skipTo(int)">Scorer#skipTo(int)</a> -- Skip ahead in the document matches to the document whose id is greater than
or equal to the passed in value. In many instances, skipTo can be
implemented more efficiently than simply looping through all the matching documents until
the target document is identified.</li>
<li>
<ahref="api/org/apache/lucene/search/Scorer.html#explain(int)">Scorer#explain(int)</a> -- Provides details on why the score came about.</li>