<author email="gsingers at apache.org">Grant Ingersoll</author>
<title>Scoring - Apache Lucene</title>
</properties>
<body>
<section name="Introduction">
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms. </p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following. Note that the descriptions apply to the <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
<li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored. Documents that have more occurrences of a given term receive a higher score.</li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - A value that shrinks as the number of documents containing the term <i>t</i> grows; DefaultSimilarity computes it as log(numDocs / (docFreq + 1)) + 1. This means rarer terms give a higher contribution to the total score.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
<li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
<span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
<p>In this regard, Lucene offers a wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class),
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used by default.
See <a href="http://svn.apache.org/repos/asf/lucene/java/trunk/CHANGES.txt">CHANGES.txt</a> under release 1.9 RC1 for more information on choosing which Scorer to use.
</p>
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
provided by each scorer while factoring in the coord() score.
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
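The summing described above can be sketched in plain Java. This is only an illustration under stated assumptions: <tt>BooleanSumSketch</tt> is a hypothetical stand-in and not a Lucene class, a negative sub-score is used here to mean "this clause did not match", and coord is modeled as the matched fraction of clauses.

```java
// Hypothetical stand-in, not a Lucene class: models how a boolean scorer
// might combine the scores of its matching sub-clauses.
class BooleanSumSketch {

    // Fraction of the query's clauses that matched, mirroring coord(q, d).
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // subScores[i] is the score produced by clause i for the current document.
    // Assumption of this sketch: a negative value marks a clause that did not match.
    static float score(float[] subScores) {
        float sum = 0.0f;
        int matched = 0;
        for (float s : subScores) {
            if (s >= 0.0f) {
                sum += s;      // sum the scores of the matching sub scorers
                matched++;
            }
        }
        if (matched == 0) {
            return 0.0f;       // nothing matched, nothing to score
        }
        // Factor in coord so documents matching more clauses score higher.
        return sum * coord(matched, subScores.length);
    }

    public static void main(String[] args) {
        // Three of four clauses match: (0.5 + 1.5 + 2.0) * 3/4
        float[] subScores = {0.5f, -1.0f, 1.5f, 2.0f};
        System.out.println(score(subScores)); // prints 3.0
    }
}
```

The real BooleanScorer2 chooses among several internal scorers depending on the clause types, so this shows only the sum-and-coord idea, not the actual control flow.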
is the easiest to understand and the most often used in applications. A <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> matches all the documents that contain the specified
Thus, a <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> identifies and scores all
<a href="api/org/apache/lucene/document/Document.html">Document</a>s that have a <a href="api/org/apache/lucene/document/Field.html">Field</a> with the specified string in it.
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
</pre>In this example, the <a href="api/org/apache/lucene/search/Query.html">Query</a> identifies all <a href="api/org/apache/lucene/document/Document.html">Document</a>s that have the <a href="api/org/apache/lucene/document/Field.html">Field</a> named <tt>"fieldName"</tt> and
<p>Things start to get interesting when one combines multiple
<a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> instances into a <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>.
A <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> contains multiple
that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a href="api/org/apache/lucene/search/Query.html">Query</a> is frequently used to
The <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a> allows an application
to identify all documents with terms that begin with a certain string. The <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> generalizes this by allowing
for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards. Note that the <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> can be quite slow. Also note that a
<a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> should
not start with <tt>*</tt> or <tt>?</tt>, as such queries are extremely slow. For tricks on how to search using a wildcard at
<p>Chances are <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all your searching needs.
However, in some applications it may be necessary to customize your <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> implementation. For instance, some applications do not need to
distinguish between shorter and longer documents (see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p>
<p>To change <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>, one must do so for both indexing and searching, and the changes must happen before
either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
</p>
<p>To make this change, implement your own <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> (likely you'll want to simply subclass
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new
class by calling
<a href="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a> before indexing and
<a href="api/org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a> before searching.
If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at <a href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
<li><p><a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> -- <a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases as the frequency increases a small amount
and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.</p></li>
<li><p>Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these
cases people have overridden Similarity to return 1 from the tf() method.</p></li>
<li><p>Changing Length Normalization -- By overriding <a href="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>, it is possible to discount how the length of a field contributes
to a score. In <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>):
<blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just that
it's "text" is a situation where it *might* make sense to to override your
Similarity method.</blockquote>
</p>
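<p>As a rough illustration of the customizations discussed above, the sketch below places default-style factors next to two overrides: a tf that returns 1 for any match, and a lengthNorm that ignores field length. It is plain Java and deliberately does not use Lucene classes; <tt>SimilaritySketch</tt> and its method names are hypothetical. The lengthNorm formula is the one quoted above, while the square-root tf and logarithmic idf reflect our understanding of DefaultSimilarity and should be checked against the version you run. In a real application one would instead subclass DefaultSimilarity and override the corresponding methods.</p>

```java
// Hypothetical stand-in, not a Lucene class: default-style scoring factors
// and two simple customizations, written as static helper methods.
class SimilaritySketch {

    // Default-style term frequency: grows with the count, but sub-linearly.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Default-style inverse document frequency: rarer terms score higher.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Default-style length normalization: 1 / (numTerms in field)^0.5.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Customization: any occurrence counts the same, however often it repeats.
    static float flatTf(float freq) {
        return freq > 0.0f ? 1.0f : 0.0f;
    }

    // Customization: field length no longer influences the score.
    static float flatLengthNorm(int numTerms) {
        return 1.0f;
    }

    public static void main(String[] args) {
        System.out.println("default tf(4)          = " + tf(4));
        System.out.println("flat tf(4)             = " + flatTf(4));
        System.out.println("default lengthNorm(4)  = " + lengthNorm(4));
        System.out.println("flat lengthNorm(4)     = " + flatLengthNorm(4));
        System.out.println("idf(9 of 1000 docs)    = " + idf(9, 1000));
        System.out.println("idf(99 of 1000 docs)   = " + idf(99, 1000));
    }
}
```

<p>Whatever variant is chosen, the same implementation has to be in place for both indexing and searching, as noted earlier in this section.</p>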
</subsection>
</section>
<section name="Changing your Scoring -- Expert Level">
<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
you want help.
</p>
<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
<span class="highlight-for-editing">three main classes</span>:
<ol>
<li>
<a href="api/org/apache/lucene/search/Query.html">Query</a> -- The abstract object representation of the user's information need.</li>
<li>
<a href="api/org/apache/lucene/search/Weight.html">Weight</a> -- The internal interface representation of the user's Query, so that Query objects may be reused.</li>
<li>
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.</li>
</ol>
Details on each of these classes and their children can be found in the subsections below.
dependent state should be stored in the Weight implementation,
not in the Query class. The interface defines 6 methods that must be implemented:
<ol>
<li>
<a href="api/org/apache/lucene/search/Weight.html#getQuery()">Weight#getQuery()</a> -- Pointer to the Query that this Weight represents.</li>
<li>
<a href="api/org/apache/lucene/search/Weight.html#getValue()">Weight#getValue()</a> -- The weight for this Query. For example, the TermQuery.TermWeight value is
equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li>
abstract class provides common scoring functionality for all Scorer implementations and
is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which
must be implemented:
<ol>
<li>
<a href="api/org/apache/lucene/search/Scorer.html#next()">Scorer#next()</a> -- Advances to the next document that matches this Query, returning true if and only
if there is another document that matches.</li>
<li>
<a href="api/org/apache/lucene/search/Scorer.html#doc()">Scorer#doc()</a> -- Returns the id of the document
that contains the match. It is not valid until next() has been called at least once.
</li>
<li>
<a href="api/org/apache/lucene/search/Scorer.html#score()">Scorer#score()</a> -- Return the score of the current document. This value can be determined in any
appropriate way for an application. For instance, the
<a href="api/org/apache/lucene/search/Scorer.html#skipTo(int)">Scorer#skipTo(int)</a> -- Skip ahead in the document matches to the document whose id is greater than
or equal to the passed in value. In many instances, skipTo can be
implemented more efficiently than simply looping through all the matching documents until
the target document is identified.</li>
<li>
<a href="api/org/apache/lucene/search/Scorer.html#explain(int)">Scorer#explain(int)</a> -- Provides details on why the score came about.</li>
</ol>
</p>
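<p>To make the iteration contract above more concrete, here is a minimal sketch of a scorer that walks a sorted array of matching document ids. It is plain Java and does not extend Lucene's Scorer; <tt>ArrayScorerSketch</tt>, its constructor, and the precomputed score array are all hypothetical. It assumes skipTo always moves past the current document, and it uses a binary search to show how skipTo can avoid looping through every match. The explain method is omitted.</p>

```java
import java.util.Arrays;

// Hypothetical stand-in, not a Lucene class: iterates over precomputed matches.
class ArrayScorerSketch {
    private final int[] docs;      // matching document ids, sorted ascending, no duplicates
    private final float[] scores;  // scores[i] belongs to docs[i]
    private int pos = -1;          // before the first match until next() is called

    ArrayScorerSketch(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    // Advances to the next match; returns true if and only if there is one.
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    // Id of the current match; only valid after next() or skipTo() returned true.
    int doc() {
        return docs[pos];
    }

    // Score of the current match.
    float score() {
        return scores[pos];
    }

    // Moves to the first match after the current one whose id is >= target.
    boolean skipTo(int target) {
        int from = pos + 1;
        if (from >= docs.length) {
            pos = docs.length;     // exhausted
            return false;
        }
        // Binary search over the remaining matches instead of a linear scan.
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        if (idx < 0) {
            idx = -idx - 1;        // insertion point: first id greater than target
        }
        pos = idx;
        return pos < docs.length;
    }

    public static void main(String[] args) {
        ArrayScorerSketch scorer = new ArrayScorerSketch(
                new int[] {2, 5, 9, 14},
                new float[] {0.1f, 0.2f, 0.3f, 0.4f});
        while (scorer.next()) {
            System.out.println("doc " + scorer.doc() + " score " + scorer.score());
        }
    }
}
```

<p>A caller that only needs matches at or beyond some document id can call skipTo once rather than calling next repeatedly, which is where the efficiency gain mentioned above comes from.</p>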
</subsection>
<subsection name="Why would I want to add my own Query?">
<p>In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
aren't appropriate for the
task that you want to do. You might be doing some cutting edge research or you need more information
back
out of Lucene (similar to Doug adding SpanQuery functionality).</p>