<a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level</a>
</li>
<li>
<a href="#Appendix">Appendix</a>
<ul class="minitoc">
<li>
<a href="#Class Diagrams">Class Diagrams</a>
</li>
<li>
<a href="#Sequence Diagrams">Sequence Diagrams</a>
</li>
<li>
<a href="#Algorithm">Algorithm</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N10013"></a><a name="Introduction"></a>
<h2 class="boxed">Introduction</h2>
<div class="section">
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms. </p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
to determine
how relevant a given Document is to a User's query. In general, the idea behind the VSM is that the more
times a query term appears in a document relative to
the number of times the term appears in all the documents in the collection, the more relevant that
document is to the query. Lucene uses the Boolean model to first narrow down the documents that need to
be scored, based on the boolean logic in the Query specification. Lucene also adds some
capabilities and refinements onto this model to support boolean and fuzzy searching, but it
essentially remains a VSM-based system at heart.
For some valuable references on VSM and IR in general, refer to the
<a href="http://wiki.apache.org/jakarta-lucene/InformationRetrieval">Lucene Wiki IR references</a>.
</p>
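<p>The term-weighting intuition above can be sketched in a few lines of plain Java. This is a textbook tf-idf toy, not Lucene's actual formula: the class <code>TfIdfSketch</code>, its method names, and the choice of idf = ln(N / df) are all assumptions made for illustration.</p>

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration of the VSM intuition; hypothetical names, not Lucene code. */
final class TfIdfSketch {

    /** Number of times the term occurs in one tokenized document. */
    static int termFrequency(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Number of documents in the collection that contain the term at least once. */
    static int documentFrequency(String term, List<List<String>> docs) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** tf * idf with idf = ln(N / df); returns 0 when no document contains the term. */
    static double weight(String term, List<String> doc, List<List<String>> docs) {
        int df = documentFrequency(term, docs);
        if (df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) docs.size() / df);
        return termFrequency(term, doc) * idf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "scoring", "lucene"),
            Arrays.asList("lucene", "indexing"),
            Arrays.asList("java", "indexing"));
        List<String> first = docs.get(0);
        System.out.println("scoring: " + weight("scoring", first, docs));
        System.out.println("lucene:  " + weight("lucene", first, docs));
    }
}
```

<p>With the three toy documents in <code>main</code>, the single occurrence of the rare term "scoring" in the first document receives a higher weight than the two occurrences of the more common term "lucene", which is the relative-frequency effect the VSM description refers to.</p>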
<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can
customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
-- Expert Level</a> which gives details on implementing your own
<a href="api/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we
will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
</p>
</div>
<a name="N10045"></a><a name="Scoring"></a>
<h2 class="boxed">Scoring</h2>
<div class="section">
<p>Scoring is very much dependent on the way documents are indexed,
so it is important to understand indexing (see the
<a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
and the Lucene
<a href="fileformats.html">file formats</a>)
before continuing on with this section. It is also assumed that readers know how to use the
<a href="api/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
which can go a long way toward explaining why a given score is returned.
</p>
<a name="N10059"></a><a name="Fields and Documents"></a>
<h3 class="boxed">Fields and Documents</h3>
<p>In Lucene, the objects we are scoring are
<a href="api/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection
of
<a href="api/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how
it is created and stored (e.g. tokenized, untokenized, raw data, compressed). It is important to
note that Lucene scoring works on Fields and then combines the results to return Documents. This is
important because two Documents with the exact same content, but one having the content in two Fields
and the other in one Field, will return different scores for the same query due to length normalization.</p>
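<p>A minimal sketch of why splitting content across Fields can change the score. It assumes a length norm of 1/sqrt(number of terms in the Field) and a tf factor of sqrt(frequency), which is how Lucene's default Similarity is usually described; idf, boosts, and query normalization are left out, and the class and method names are hypothetical.</p>

```java
/** Simplified per-Field scoring under assumed default factors; not Lucene code. */
final class LengthNormSketch {

    /** Assumed length normalization: shorter Fields get a larger norm. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Simplified contribution of one Field: sqrt(tf) * lengthNorm. */
    static double fieldScore(int termFreq, int fieldLength) {
        return Math.sqrt(termFreq) * lengthNorm(fieldLength);
    }

    public static void main(String[] args) {
        // Document A: all 100 tokens in one Field, query term occurs once.
        double oneField = fieldScore(1, 100);
        // Document B: the same 100 tokens split into Fields of 10 and 90 tokens;
        // the query is run against the 10-token Field, which holds the term.
        double splitField = fieldScore(1, 10);
        System.out.println("one Field:   " + oneField);
        System.out.println("split Field: " + splitField);
    }
}
```

<p>Under these assumptions the match in the short Field scores higher than the same match in the single long Field, even though both Documents hold identical content.</p>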
<a name="N100B1"></a><a name="Understanding the Scoring Formula"></a>
<h3 class="boxed">Understanding the Scoring Formula</h3>
<p>
The scoring formula is described in the
<a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how
Lucene computes a score. The formula is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
the use of, and interactions between, the
<a href="api/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need.
</p>
<p>In this regard, Lucene offers a wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the search package.
These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
section below
highlights some of the more important Query classes. For information on the other ones, see the
<a href="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below.
</p>
<p>Once a Query has been created and submitted to the
<a href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the <a href="api/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled either by the
<a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class)
or by the BooleanWeight from the 1.4 version of Lucene (an inner class in the same ViewVC BooleanQuery java code), depending on which Scorer is selected.
See <a href="http://svn.apache.org/repos/asf/lucene/java/trunk/CHANGES.txt">CHANGES.txt</a> under release 1.9 RC1 for more information on choosing which Scorer to use.
</p>
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub-scorers and sums the scores
provided by each scorer while factoring in the coord() score.
</p>
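<p>The summing step described above can be sketched as follows. This is only an illustration of the idea, not the BooleanScorer2 implementation: the class <code>BooleanSumSketch</code> is hypothetical, unmatched clauses are marked with <code>Double.NaN</code> purely for convenience, and the coord factor is assumed to be the fraction of clauses that matched.</p>

```java
/** Toy version of "sum the sub-scorer scores, then factor in coord"; not Lucene code. */
final class BooleanSumSketch {

    /** Assumed coord factor: matched clauses divided by total clauses. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /**
     * Combines per-clause scores for one document.
     * subScores[i] is the score from clause i, or Double.NaN if that clause did not match.
     */
    static double score(double[] subScores) {
        double sum = 0.0;
        int matched = 0;
        for (double s : subScores) {
            if (!Double.isNaN(s)) {
                sum += s;
                matched++;
            }
        }
        if (matched == 0) {
            return 0.0; // no clause matched, so there is nothing to score
        }
        return sum * coord(matched, subScores.length);
    }

    public static void main(String[] args) {
        // Two of four clauses match this document.
        double[] subScores = {0.5, Double.NaN, 0.25, Double.NaN};
        System.out.println(score(subScores));
    }
}
```

<p>Because the coord factor scales the sum by how many clauses matched, a document matching few clauses strongly can still outscore, or be outscored by, one matching many clauses weakly, depending on the individual sub-scores.</p>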
<a name="N1011A"></a><a name="Query Classes"></a>
<h3 class="boxed">Query Classes</h3>
<p>For information on the Query Classes, refer to the