lucene/xdocs/scoring.xml

<?xml version="1.0"?>

<document>
    <properties>
        <author email="gsingers at apache.org">Grant Ingersoll</author>
        <title>Scoring - Apache Lucene</title>
    </properties>

    <body>

        <section name="Introduction">
            <p>Lucene scoring is the heart of why we all love Lucene.  It is blazingly fast and it hides almost all of the complexity from the user.
                In a nutshell, it works.  At least, that is, until it doesn't work, or doesn't work as one would expect it to
            work.  Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
            scores lower than a different document with only one of the query terms. </p>
            <p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
            help you figure out the what and why of Lucene scoring.</p>
            <p>Lucene scoring uses a combination of the
                <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
                    Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
                to determine
                how relevant a given Document is to a User's query.  In general, the idea behind the VSM is that the more
                times a query term appears in a document relative to
                the number of times the term appears in all the documents in the collection, the more relevant that
                document is to the query.  Lucene uses the Boolean model to first narrow down the documents that need to
                be scored based on the use of boolean logic in the Query specification.  Lucene also adds some
                capabilities and refinements onto this model to support boolean and fuzzy searching, but it
                remains essentially a VSM-based system at heart.
                For some valuable references on VSM and IR in general refer to the
                <a href="http://wiki.apache.org/jakarta-lucene/InformationRetrieval">Lucene Wiki IR references</a>.
            </p>
            <p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
                <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.  Next it will cover ways you can
                customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
                -- Expert Level</a> which gives details on implementing your own
                <a href="api/org/apache/lucene/search/Query.html">Query</a> class and related functionality.  Finally, we
                will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
            </p>
        </section>
        <section name="Scoring">
            <p>Scoring is very much dependent on the way documents are indexed,
                so it is important to understand indexing (see
                <a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
                and the Lucene
                <a href="fileformats.html">file formats</a>)
                before continuing on with this section.  It is also assumed that readers know how to use the
                <a href="api/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
                which can go a long way toward explaining why a given score is returned.
            </p>
            <subsection name="Fields and Documents">
                <p>In Lucene, the objects we are scoring are
                    <a href="api/org/apache/lucene/document/Document.html">Documents</a>.  A Document is a collection
                of
                    <a href="api/org/apache/lucene/document/Field.html">Fields</a>.  Each Field has semantics about how
                it is created and stored (e.g. tokenized, untokenized, raw data, compressed, etc.).  It is important to
                    note that Lucene scoring works on Fields and then combines the results to return Documents.  This is
                    important because two Documents with the exact same content, but one having the content in two Fields
                    and the other in one Field, will return different scores for the same query due to length normalization
                    (assuming the
                    <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
                    on the Fields).
                </p>
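                <p>As a rough illustration of the length normalization effect described above, the sketch below
                    assumes that DefaultSimilarity computes lengthNorm as 1 / sqrt(numTerms) and ignores the lossy
                    encoding of norms in the index.  The class and method names are hypothetical and are not part
                    of the Lucene API.</p>
<source><![CDATA[
```java
/** Hypothetical helper for illustration only; not part of the Lucene API. */
final class LengthNormSketch {

    /** Assumption: mirrors DefaultSimilarity's lengthNorm, 1 / sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // The same 100 terms indexed as one Field...
        float oneField = lengthNorm(100);   // 0.1
        // ...or split evenly across two Fields of 50 terms each.
        float splitField = lengthNorm(50);  // about 0.141

        // A match in the shorter Field gets the larger factor, so the two
        // Documents can score differently for the same query.
        System.out.println("one field:  " + oneField);
        System.out.println("two fields: " + splitField);
    }
}
```
]]></source>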
            </subsection>
            <subsection name="Understanding the Scoring Formula">
                <p>
                    Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
                    term <i>t</i> that occurs in q.  The score attempts to measure relevance, so the higher the score, the more
		    relevant document <i>d</i> is to the query <i>q</i>.  This is taken from
		    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:

                    <div class="formula">
                        <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
                        score(q,d) =
			<span class="big" id="summation">
                            sum </span><span class="summation-range">t in q</span><span>(
                        <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
                        (t in d) *
                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
                        (t)^2 *
                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
                        getBoost
                        </A>
                        (t in q) *
                        getBoost
                        (t.field in d) *
                        <A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
                            lengthNorm
                        </A>
                        (t.field in d) )</span> <span> *
                        <A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
                            coord
                        </A>
                        (q,d) *
                        <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
                            queryNorm
                        </A>(sumOfSquaredWeights)</span>
                    </div>
                </p>
                <p>
                    where
                    <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
                    <div id="#sumOfSquares">
                        sumOfSquaredWeights =
                        <span class="big">sum</span><span class="summation-range">t in q</span><span>(
                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
                            idf
                        </A>
                        (t) *
                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
                            getBoost
                        </A>
                        (t in q) )^2</span>
                    </div>
                </p>
                <p>
		This scoring formula is mostly implemented in the
                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following.  Note that the descriptions apply to the <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
                    <ol>

                        <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored.  Documents that have more occurrences of a given term receive a higher score.</li>

                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - A factor that shrinks as the number of documents in which the term <i>t</i> appears grows (DefaultSimilarity computes log(numDocs / (docFreq + 1)) + 1).  This means rarer terms give higher contribution to the total score.</p></li>

                        <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term.  A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance.  A boost of 1.0 (the default boost) has no effect.</p></li>

                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in d)</A> - The factor to apply to account for differing lengths in the fields that are being searched.  Typically longer fields return a smaller value.  This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>

                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query.  Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>

                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable.
                            <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen.  I have always understood (but not 100% sure)
                                that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.  However, I also seem
                            to remember some research on using sum of squares as being somewhat suitable for score comparison.  Anyone have any thoughts here?</span></p></li>
                    </ol>
                    Note that the above definitions are summaries of the javadocs, which can be accessed by clicking the links in the formula; they are provided
                    merely for context and are not authoritative.
                </p>
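                <p>To make the formula above more concrete, here is a minimal sketch that evaluates it for a
                    single-field document.  It is a simplified model and not Lucene code: the class name, the
                    method signatures and the array-based inputs are hypothetical, and the factor definitions are
                    assumed to match DefaultSimilarity (tf = sqrt(freq), idf = log(numDocs / (docFreq + 1)) + 1,
                    lengthNorm = 1 / sqrt(numTerms), coord = overlap / maxOverlap,
                    queryNorm = 1 / sqrt(sumOfSquaredWeights)).</p>
<source><![CDATA[
```java
/** Hypothetical, simplified model of the scoring formula; not Lucene code. */
final class ScoreFormulaSketch {

    // Assumption: these factor definitions match DefaultSimilarity.
    static float tf(float freq) { return (float) Math.sqrt(freq); }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    static float lengthNorm(int numTerms) { return (float) (1.0 / Math.sqrt(numTerms)); }

    static float coord(int overlap, int maxOverlap) { return overlap / (float) maxOverlap; }

    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /**
     * Scores one document with a single field.
     * freqs[i]       frequency of query term i in the document (0 if absent)
     * docFreqs[i]    number of documents containing query term i
     * queryBoosts[i] boost of query term i
     */
    static float score(int[] freqs, int[] docFreqs, float[] queryBoosts,
                       int numDocs, int fieldLength, float fieldBoost) {
        float sum = 0f;
        float sumOfSquaredWeights = 0f;
        int overlap = 0;
        for (int i = 0; i < freqs.length; i++) {
            float idf = idf(docFreqs[i], numDocs);
            float weight = idf * queryBoosts[i];
            sumOfSquaredWeights += weight * weight;   // every query term contributes
            if (freqs[i] > 0) {                       // only matching terms add to the sum
                overlap++;
                sum += tf(freqs[i]) * idf * idf * queryBoosts[i]
                        * fieldBoost * lengthNorm(fieldLength);
            }
        }
        return sum * coord(overlap, freqs.length) * queryNorm(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Two query terms; only the first occurs (4 times) in a 4-term field.
        float s = score(new int[] {4, 0}, new int[] {9, 9}, new float[] {1f, 1f}, 10, 4, 1f);
        System.out.println("score = " + s);
    }
}
```
]]></source>
                <p>In the example call, both terms have an idf of 1, so the sum is tf(4) * lengthNorm(4) = 1,
                    coord is 1/2 and queryNorm is 1 / sqrt(2), giving a score of roughly 0.354.</p>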
            </subsection>
            <subsection name="The Big Picture">
                <p>OK, so the tf-idf formula and the
                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>
                    class are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
                    the use of, and interaction between, the
                    <a href="api/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
                    response to a user's information need.
                </p>
                <p>In this regard, Lucene offers a wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
                    <a href="api/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
                    These implementations can be combined in a wide variety of ways to provide complex querying
                    capabilities along with
                    information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
                    section below 
                    highlights some of the more important Query classes.  For information on the other ones, see the
                    <a href="api/org/apache/lucene/search/package-summary.html">package summary</a>.  For details on implementing
                    your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
                    Expert Level</a> below.
                </p>
                <p>Once a Query has been created and submitted to the
                    <a href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
                begins.  (See the <a
                href="#Appendix">Appendix</a> Algorithm section for more notes on the process.)  After some infrastructure setup,
                control finally passes to the <a href="api/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance.  In the case of any type of
                    <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
                    <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class),
                    unless the static
                    <a href="api/org/apache/lucene/search/BooleanQuery.html#setUseScorer14(boolean)">
                        BooleanQuery#setUseScorer14(boolean)</a> method has been called with true,
                in which case the
                    <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
                    (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used instead.
                    See <a href="http://svn.apache.org/repos/asf/lucene/java/trunk/CHANGES.txt">CHANGES.txt</a> under release 1.9 RC1 for more information on choosing which Scorer to use.
                </p>
                <p>
                    Assuming the use of the BooleanWeight2, a
                    BooleanScorer2 is created by bringing together
                    all of the
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
                    When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type
                    of clauses in the Query.  This internal Scorer essentially loops over the sub scorers and sums the scores
                    provided by each scorer while factoring in the coord() score.
                    <!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
                </p>
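                <p>The combining step described above can be pictured with the following sketch.  It is not the
                    BooleanScorer2 code: the class and method are hypothetical, a null entry stands for a clause that
                    did not match the document, and coord is assumed to be the DefaultSimilarity ratio of matching
                    clauses to total clauses.</p>
<source><![CDATA[
```java
/** Hypothetical illustration of summing sub-scorer scores and applying coord. */
final class CoordSumSketch {

    /**
     * subScores[i] is the score clause i gave the document,
     * or null if clause i did not match it.
     */
    static float combine(Float[] subScores) {
        float sum = 0f;
        int matched = 0;
        for (Float s : subScores) {
            if (s != null) {
                sum += s;
                matched++;
            }
        }
        if (matched == 0) {
            return 0f;  // no clause matched, nothing to score
        }
        // Assumption: coord = matching clauses / total clauses, as in DefaultSimilarity.
        float coord = matched / (float) subScores.length;
        return sum * coord;
    }

    public static void main(String[] args) {
        // Two of four clauses match: (0.5 + 1.5) * 2/4 = 1.0
        System.out.println(combine(new Float[] {0.5f, null, 1.5f, null}));
    }
}
```
]]></source>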
            </subsection>
            <subsection name="Query Classes">
                <p>For information on the Query Classes, refer to the
                    <a href="api/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>.
                </p>
            </subsection>
            <subsection name="Changing Similarity">
                <p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors.  For information on
                how to do this, see the
                    <a href="api/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>.</p>
            </subsection>

        </section>
        <section name="Changing your Scoring -- Expert Level">
            <p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes).  To learn more
                about how to do this, refer to the
                <a href="api/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>.
            </p>
        </section>

        <section name="Appendix">
            <subsection name="Class Diagrams">
                <p>
                    <a href="http://wiki.apache.org/jakarta-lucene/KarlWettin?action=AttachFile&amp;do=view&amp;target=search_uml_1.jpg">
                        Karl Wettin's UML on the Wiki</a>
                </p>
            </subsection>
            <subsection name="Sequence Diagrams">
                <p class="highlight-for-editing">FILL IN HERE. Volunteers?</p>
            </subsection>
            <subsection name="Algorithm" class="highlight-for-editing">
                <p>GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as
                    fertilizer for the earlier sections.</p>
                <p>In the typical search application, a
                    <a href="api/org/apache/lucene/search/Query.html">Query</a>
                    is passed to the
                    <a href="api/org/apache/lucene/search/Searcher.html">Searcher</a>, beginning the scoring process.
                </p>
                <p>Once inside the Searcher, a
                    <a href="api/org/apache/lucene/search/Hits.html">Hits</a>
                    object is constructed, which handles the scoring and caching of the search results.
                    The Hits constructor stores references to three or four important objects:
                    <ol>
                        <li>The
                            <a href="api/org/apache/lucene/search/Weight.html">Weight</a>
                            object of the Query. The Weight object is an internal representation of the Query that
                            allows the Query to be reused by the Searcher.
                        </li>
                        <li>The Searcher that initiated the call.</li>
                        <li>A
                            <a href="api/org/apache/lucene/search/Filter.html">Filter</a>
                            for limiting the result set. Note, the Filter may be null.
                        </li>
                        <li>A
                            <a href="api/org/apache/lucene/search/Sort.html">Sort</a>
                            object for specifying how to sort the results if the standard score based sort method is not
                            desired.
                        </li>
                    </ol>
                </p>
                <p>Now that the Hits object has been initialized, it begins the process of identifying documents that
                    match the query by calling the getMoreDocs method. Assuming we are not sorting (since sorting doesn't
                    affect the raw Lucene score),
                    we call on the "expert" search method of the Searcher, passing in our
                    <a href="api/org/apache/lucene/search/Weight.html">Weight</a>
                    object,
                    <a href="api/org/apache/lucene/search/Filter.html">Filter</a>
                    and the number of results we want. This method
                    returns a
                    <a href="api/org/apache/lucene/search/TopDocs.html">TopDocs</a>
                    object, which is an internal collection of search results.
                    The Searcher creates a
                    <a href="api/org/apache/lucene/search/TopDocCollector.html">TopDocCollector</a>
                    and passes it along with the Weight and Filter to another expert search method (for more on the
                    <a href="api/org/apache/lucene/search/HitCollector.html">HitCollector</a>
                    mechanism, see
                    <a href="api/org/apache/lucene/search/Searcher.html">Searcher</a>).
                    The TopDocCollector uses a
                    <a href="api/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
                    to collect the top results for the search.
                </p>
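                <p>The idea of collecting only the top results with a priority queue can be sketched as follows.
                    This is not the TopDocCollector implementation: it uses java.util.PriorityQueue as a stand-in
                    for Lucene's own PriorityQueue class, and the class, field and method names are hypothetical.</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical top-N collector; a stand-in for the idea, not Lucene's code. */
final class TopNCollectorSketch {

    static final class ScoredDoc {
        final int doc;
        final float score;
        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap on score: the head is always the weakest hit currently kept.
    private final PriorityQueue<ScoredDoc> queue;

    TopNCollectorSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
        this.queue = new PriorityQueue<ScoredDoc>(n, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Called once per matching document, in the spirit of HitCollector.collect(doc, score). */
    void collect(int doc, float score) {
        if (queue.size() < n) {
            queue.add(new ScoredDoc(doc, score));
        } else if (score > queue.peek().score) {
            queue.poll();                          // drop the weakest kept hit
            queue.add(new ScoredDoc(doc, score));
        }
    }

    /** Returns the kept hits ordered from highest to lowest score. */
    List<ScoredDoc> topDocs() {
        List<ScoredDoc> hits = new ArrayList<ScoredDoc>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        TopNCollectorSketch collector = new TopNCollectorSketch(2);
        collector.collect(0, 0.3f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.1f);
        collector.collect(3, 0.5f);
        for (ScoredDoc hit : collector.topDocs()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```
]]></source>
                <p>Keeping the weakest hit at the head of the queue means each new document needs only one
                    comparison to decide whether it belongs in the top results at all.</p>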
                <p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
                    we ask the Weight for
                    a
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>
                    for the
                    <a href="api/org/apache/lucene/index/IndexReader.html">IndexReader</a>
                    of the current searcher and we proceed by
                    calling the score method on the
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>.
                </p>
                <p>At last, we are actually going to score some documents. The score method takes in the HitCollector
                    (most likely the TopDocCollector) and does its business.
                    Of course, here is where things get involved. The
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>
                    that is returned by the
                    <a href="api/org/apache/lucene/search/Weight.html">Weight</a>
                    object depends on what type of Query was submitted. In most real world applications with multiple
                    query terms,
                    the
                    <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a>
                    is going to be a
                    <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
                    (see the section on customizing your scoring for info on changing this).

                </p>
                <p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
                    coord() factor. We then
                    get an internal Scorer based on the required, optional and prohibited parts of the query.
                    Using this internal Scorer, the BooleanScorer2 then proceeds
                    into a while loop based on the Scorer#next() method. The next() method advances to the next document
                    matching the query. This is an
                    abstract method in the Scorer class and is thus overridden by all derived
                    implementations.  <!-- DOUBLE CHECK THIS -->If you have a simple OR query
                    your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores
                    from the sub scorers of the OR'd terms.</p>
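                <p>The next() loop over an OR query can be pictured with the sketch below.  It is only an
                    approximation of the idea and not the DisjunctionSumScorer source, which is more elaborate: the
                    Postings class, its arrays of doc ids and scores, and all method names here are hypothetical.</p>
<source><![CDATA[
```java
/** Hypothetical document-at-a-time OR scorer; illustrates the next() loop only. */
final class DisjunctionSketch {

    /** One OR'd clause: ascending doc ids and the clause's score for each of them. */
    static final class Postings {
        final int[] docs;
        final float[] scores;
        int pos = 0;
        Postings(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }
        boolean exhausted() { return pos >= docs.length; }
        int doc() { return docs[pos]; }
    }

    private final Postings[] subs;
    private int currentDoc = -1;
    private float currentScore = 0f;

    DisjunctionSketch(Postings... subs) {
        this.subs = subs;
    }

    /** Advances to the next doc matched by any clause; returns false when none remain. */
    boolean next() {
        int min = Integer.MAX_VALUE;
        for (Postings p : subs) {
            if (!p.exhausted() && p.doc() < min) {
                min = p.doc();
            }
        }
        if (min == Integer.MAX_VALUE) {
            return false;
        }
        currentDoc = min;
        currentScore = 0f;
        for (Postings p : subs) {
            if (!p.exhausted() && p.doc() == min) {
                currentScore += p.scores[p.pos];  // sum the scores of matching clauses
                p.pos++;                          // consume this posting
            }
        }
        return true;
    }

    int doc() { return currentDoc; }

    float score() { return currentScore; }

    public static void main(String[] args) {
        DisjunctionSketch scorer = new DisjunctionSketch(
                new Postings(new int[] {1, 4}, new float[] {0.5f, 0.25f}),
                new Postings(new int[] {4, 7}, new float[] {1.0f, 2.0f}));
        while (scorer.next()) {
            System.out.println("doc=" + scorer.doc() + " score=" + scorer.score());
        }
    }
}
```
]]></source>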
            </subsection>
        </section>
    </body>
</document>