Scoring - Apache Lucene
Grant Ingersoll

Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.

While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.

Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification. Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. For some valuable references on VSM and IR in general refer to the Lucene Wiki IR references.

The rest of this document will cover Scoring basics and how to change your Similarity. Next it will cover ways you can customize the Lucene internals in Changing your Scoring -- Expert Level which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.

Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing (see Apache Lucene - Getting Started Guide and the Lucene file formats) before continuing on with this section. It is also assumed that readers know how to use the Searcher.explain(Query query, int doc) functionality, which can go a long way toward explaining why a score is returned.

In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This matters because two Documents with exactly the same content, but with one spreading that content across two Fields and the other keeping it in a single Field, will return different scores for the same query due to length normalization (assuming the DefaultSimilarity is used on the Fields).

Lucene's scoring formula, taken from Similarity is

score(q,d) = sum for t in q ( tf(t in d) * idf(t)^2 * getBoost(t in q) * getBoost(t.field in d) * lengthNorm(t.field in d) ) * coord(q,d) * queryNorm(sumOfSquaredWeights)

where

sumOfSquaredWeights = sum for t in q ( idf(t) * getBoost(t in q) )^2
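To make the formula concrete, here is a hypothetical standalone sketch of the computation. The hard-coded numbers are toy values; in Lucene the factors come from the Similarity implementation and the index, not from method parameters like these.

```java
// Toy sketch of the scoring formula above (not Lucene code): all factor
// values are supplied directly instead of being read from an index.
public class ScoreSketch {
    static double score(double[] tf, double[] idf, double[] queryBoost,
                        double[] fieldBoost, double[] lengthNorm,
                        double coord) {
        double sum = 0.0;
        double sumOfSquaredWeights = 0.0;
        for (int t = 0; t < tf.length; t++) {
            // per-term contribution: tf * idf^2 * boosts * lengthNorm
            sum += tf[t] * idf[t] * idf[t] * queryBoost[t]
                 * fieldBoost[t] * lengthNorm[t];
            // sumOfSquaredWeights = sum of (idf(t) * getBoost(t in q))^2
            double w = idf[t] * queryBoost[t];
            sumOfSquaredWeights += w * w;
        }
        // DefaultSimilarity-style queryNorm: 1 / sqrt(sumOfSquaredWeights)
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        return sum * coord * queryNorm;
    }

    public static void main(String[] args) {
        // one-term query: tf=1, idf=2, boosts=1, lengthNorm=0.5, coord=1
        double s = score(new double[]{1.0}, new double[]{2.0},
                         new double[]{1.0}, new double[]{1.0},
                         new double[]{0.5}, 1.0);
        System.out.println(s); // 1*4*1*1*0.5 * 1 * (1/2) = 1.0
    }
}
```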

This scoring formula is mostly incorporated into the TermScorer class, where it makes calls to the Similarity class to retrieve values for the following:

  1. tf - Term Frequency - The number of times the term t appears in the current document being scored.
  2. idf - Inverse Document Frequency - One divided by the number of documents in which the term t appears.
  3. getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.
  4. lengthNorm(t.field in d) - The factor to apply to account for differing lengths in the fields being searched. Longer fields usually return a smaller value.
  5. coord(q, d) - Score factor based on how many terms the specified document has in common with the query.
  6. queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable. GSI: it might be interesting to have a note on why this formula was chosen. I have always understood (but am not 100% sure) that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem to remember some research on the sum of squares being somewhat suitable for score comparison. Anyone have any thoughts here?
Note, the above definitions are summaries of the javadocs, which can be accessed by clicking the links in the formula; they are merely provided for context and are not authoritative.

OK, so the tf-idf formula and the Similarity class are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is the use of, and interaction between, the Query classes, as created by each application in response to a user's information need.

In this regard, Lucene offers a wide variety of Query implementations, most of which are in the org.apache.lucene.search package. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query section below will highlight some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring -- Expert Level below.

Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, control finally passes to the Weight implementation and its Scorer instance. In the case of any type of BooleanQuery, scoring is handled by the BooleanWeight2 (link goes to the ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class), unless the static BooleanQuery#setUseScorer14(boolean) method has been set to true, in which case the BooleanWeight (link goes to the ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used instead. See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use.

Assuming the use of the BooleanWeight2, a BooleanScorer2 is created by bringing together all of the Scorers from the sub-clauses of the BooleanQuery. When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type of clauses in the Query. This internal Scorer essentially loops over the sub scorers, summing the scores provided by each scorer while factoring in the coord() score.
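The disjunction case can be sketched in a few lines. This is a toy model, not the real DisjunctionSumScorer: matching clauses are represented by positive scores, and coord() is taken as the simple ratio of matching clauses to total clauses.

```java
// Toy sketch of disjunction scoring (not Lucene's DisjunctionSumScorer):
// sum the scores of matching sub-scorers, then apply a coord() factor of
// matchCount / totalClauses.
public class DisjunctionSketch {
    static double score(double[] subScores) { // 0.0 means "did not match"
        double sum = 0.0;
        int matches = 0;
        for (double s : subScores) {
            if (s > 0) { sum += s; matches++; }
        }
        return sum * ((double) matches / subScores.length); // coord factor
    }

    public static void main(String[] args) {
        // three optional clauses, two of which match this document
        System.out.println(score(new double[]{0.4, 0.0, 0.6})); // (0.4+0.6) * 2/3, ~0.667
    }
}
```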

TermQuery

Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery is a Query that matches all the documents that contain the specified Term. A Term is a word that occurs in a specific Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as: TermQuery tq = new TermQuery(new Term("fieldName", "term")); In this example, the Query would identify all Documents whose Field named "fieldName" contains the word "term".

BooleanQuery

Things start to get interesting when one combines TermQuerys, which is handled by the BooleanQuery class. A BooleanQuery is a collection of other Query classes along with semantics about how to combine the different subqueries. It currently supports three operators for specifying the logic of the query (see BooleanClause):

  1. SHOULD -- Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then a non-empty result set will have matched at least one of the clauses in the query.
  2. MUST -- Use this operator when a clause is required to occur in the result set.
  3. MUST NOT -- Use this operator when a clause must not occur in the result set.
Boolean queries are constructed by adding two or more BooleanClause instances to the BooleanQuery instance. In some cases, too many clauses may be added to the BooleanQuery, which causes a TooManyClauses exception to be thrown. This most often occurs when using a Query that is rewritten into many TermQuery instances, such as the WildcardQuery. The maximum number of clauses currently defaults to 1024, but it can be overridden via the static BooleanQuery#setMaxClauseCount(int) method.
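The matching semantics of the three operators can be sketched without Lucene at all. In this toy model (names and method are illustrative, not Lucene's), a document is reduced to its set of terms and checked against the MUST, SHOULD and MUST NOT clause lists:

```java
import java.util.Set;

// Toy sketch of BooleanQuery matching semantics (not the real Lucene
// implementation): a document's term set is tested against the three
// clause kinds.
public class BooleanSemantics {
    static boolean matches(Set<String> docTerms, Set<String> must,
                           Set<String> should, Set<String> mustNot) {
        for (String t : mustNot)
            if (docTerms.contains(t)) return false;  // prohibited term present
        for (String t : must)
            if (!docTerms.contains(t)) return false; // required term missing
        if (!must.isEmpty()) return true;            // all requirements met
        for (String t : should)
            if (docTerms.contains(t)) return true;   // at least one optional hit
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("lucene", "scoring");
        System.out.println(matches(doc, Set.of("lucene"), Set.of(), Set.of("solr"))); // true
        System.out.println(matches(doc, Set.of(), Set.of("scoring"), Set.of()));      // true
        System.out.println(matches(doc, Set.of(), Set.of("lucene"), Set.of("scoring"))); // false
    }
}
```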

Phrases

Another common task in search is to identify phrases, which can be handled in two different ways.

  1. PhraseQuery -- Matches a sequence of Terms. The PhraseQuery can specify a slop factor which determines how many positions may occur between any two terms and still be considered a match.
  2. SpanNearQuery -- Matches a sequence of other SpanQuery instances. The SpanNearQuery allows much more complicated phrasal queries to be built, since it is constructed out of other SpanQuery objects rather than just Terms.
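The slop idea for a two-term phrase can be sketched as follows. This is a deliberately simplified, hypothetical model: the real PhraseQuery computes an edit distance over the positions of all terms and allows reordering at a cost.

```java
// Toy two-term slop check (not Lucene's PhraseQuery): with slop 0 the
// second term must sit directly after the first; larger slop values allow
// more positions in between.
public class SlopSketch {
    static boolean phraseMatch(int[] posA, int[] posB, int slop) {
        for (int a : posA)
            for (int b : posB)
                if (Math.abs(b - (a + 1)) <= slop) return true;
        return false;
    }

    public static void main(String[] args) {
        // searching for "quick fox" in "the quick brown fox":
        // quick at position 1, fox at position 3
        System.out.println(phraseMatch(new int[]{1}, new int[]{3}, 0)); // false
        System.out.println(phraseMatch(new int[]{1}, new int[]{3}, 1)); // true
    }
}
```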

RangeQuery

The RangeQuery matches all documents containing terms that fall within the exclusive range between a lower Term and an upper Term. For instance, one could find all documents whose terms begin with the letters a through c. This type of Query is most often used to find documents that fall in a specific date range.
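The exclusive range check itself is just a term comparison. The sketch below uses plain String ordering for illustration; term ordering in the index is what actually governs the match.

```java
// Toy sketch of an exclusive term range check, as a RangeQuery applies it:
// the term must sort strictly between the lower and upper bounds.
public class RangeSketch {
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) > 0 && term.compareTo(upper) < 0; // exclusive
    }

    public static void main(String[] args) {
        System.out.println(inRange("banana", "a", "c")); // true
        System.out.println(inRange("cherry", "a", "c")); // false: sorts after "c"
    }
}
```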

PrefixQuery , WildcardQuery

While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of the * (any character sequence) and ? (any single character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuerys should not start with * or ?, as these are extremely slow. For tricks on how to search using a wildcard at the beginning of a term, see Starts With x and Ends With x Queries from the Lucene archives.

FuzzyQuery

A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using the Levenshtein (edit distance) algorithm. This type of query can be useful when accounting for spelling variations in the collection.
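The Levenshtein distance underlying FuzzyQuery is the minimum number of single-character insertions, deletions and substitutions needed to turn one term into another. A standard dynamic-programming implementation (independent of Lucene's internals) looks like this:

```java
// Levenshtein edit distance via dynamic programming: d[i][j] holds the
// distance between the first i chars of a and the first j chars of b.
public class EditDistance {
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,  // deletion
                                            d[i][j - 1] + 1), // insertion
                                   d[i - 1][j - 1] + cost);   // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucine")); // 1
    }
}
```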

Chances are, the DefaultSimilarity is sufficient for all your searching needs. However, in some applications it may be necessary to alter your Similarity. For instance, some applications do not need to distinguish between shorter and longer documents (for example, see the discussion of a "fair" similarity below). To change the Similarity, one must do so for both indexing and searching, and the change must take place before either of these actions is undertaken (although in theory there is nothing stopping you from changing mid-stream, it just isn't well defined what is going to happen). To make this change, implement your own Similarity (you probably want to extend DefaultSimilarity) and then set the new class via IndexWriter.setSimilarity(org.apache.lucene.search.Similarity) for indexing and via Searcher.setSimilarity(org.apache.lucene.search.Similarity) for searching.

If you are interested in use cases for changing your similarity, see the mailing list at Overriding Similarity. In summary, here are a few use cases:

  1. SweetSpotSimilarity -- SweetSpotSimilarity gives small increases in score as a term's frequency increases by small amounts, and then larger increases once you hit the "sweet spot", i.e. where you think the frequency of terms becomes more significant.
  2. Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
  3. Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes to a score. In the DefaultSimilarity, lengthNorm = 1/ (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
In general, Chris Hostetter sums it up best (from the mailing list):
[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.
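The length-normalization use case above can be illustrated numerically. This toy comparison uses the raw term frequency for simplicity (DefaultSimilarity actually uses the square root of the frequency for tf), but it shows why 1/numTerms is "fair": a document whose content is simply duplicated scores the same as the original.

```java
// Toy comparison of DefaultSimilarity's lengthNorm (1/sqrt(numTerms))
// against a "fair" 1/numTerms variant. Raw tf is used for illustration
// only; DefaultSimilarity actually uses sqrt(freq) for tf.
public class LengthNormSketch {
    static double defaultNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }
    static double fairNorm(int numTerms)    { return 1.0 / numTerms; }

    public static void main(String[] args) {
        // doc A: term once in a 10-term field
        // doc B: same content doubled, so term twice in a 20-term field
        System.out.println(1 * defaultNorm(10)); // ~0.316
        System.out.println(2 * defaultNorm(20)); // ~0.447 -> the longer doc wins
        System.out.println(1 * fairNorm(10));    // 0.1
        System.out.println(2 * fairNorm(20));    // 0.1  -> both treated "fairly"
    }
}
```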

Changing scoring is an expert level task, so tread carefully and be prepared to share your code if you want help.

With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by three main classes:

  1. Query -- The abstract object representation of the user's information need.
  2. Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
  3. Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.
Details on each of these classes, and their children can be found in the subsections below.

In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:

  1. createWeight(Searcher searcher) -- A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
  2. rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, OTHERS????

The Weight interface provides an internal representation of the Query so that it can be reused. Any Searcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines 6 methods that must be implemented:

  1. Weight#getQuery() -- Pointer to the Query that this Weight represents.
  2. Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is equal to idf^2 * boost * queryNorm.
  3. Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * boost)^2.
  4. Weight#normalize(float) -- Determine the query normalization factor. The query normalization may allow for comparing scores between queries.
  5. Weight#scorer(IndexReader) -- Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
  6. Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored the way it was.

The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which must be implemented:

  1. Scorer#next() -- Advances to the next document that matches this Query, returning true if and only if there is another document that matches.
  2. Scorer#doc() -- Returns the id of the Document that contains the match. The value is not valid until next() has been called at least once.
  3. Scorer#score() -- Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer returns the tf * Weight.getValue() * fieldNorm.
  4. Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than or equal to the passed in value. In many instances, skipTo can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
  5. Scorer#explain(int) -- Provides details on why the score came about.
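The payoff of skipTo can be sketched over a sorted posting list. The implementation below is purely illustrative (a binary search over an in-memory array); a real Scorer would typically use the index's skip data rather than looping document by document.

```java
import java.util.Arrays;

// Toy sketch of skipTo semantics over a sorted posting list (not Lucene
// code): jump directly to the first doc id >= target instead of calling
// next() repeatedly.
public class SkipToSketch {
    static int skipTo(int[] postings, int target) {
        int i = Arrays.binarySearch(postings, target);
        if (i < 0) i = -i - 1;                        // insertion point
        return i < postings.length ? postings[i] : -1; // -1: no more matches
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 9, 17, 40};
        System.out.println(skipTo(docs, 9));  // 9  (exact match)
        System.out.println(skipTo(docs, 10)); // 17 (next matching doc)
        System.out.println(skipTo(docs, 50)); // -1 (exhausted)
    }
}
```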

In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing Query classes aren't appropriate for the task you want to do. You might be doing some cutting-edge research, or you might need more information back out of Lucene (similar to Doug adding the SpanQuery functionality).

FILL IN HERE

Karl Wettin's UML on the Wiki

FILL IN HERE. Volunteers?

GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.

In the typical search application, a Query is passed to the Searcher , beginning the scoring process.

Once inside the Searcher, a Hits object is constructed, which handles the scoring and caching of the search results. The Hits constructor stores references to three or four important objects:

  1. The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the Searcher.
  2. The Searcher that initiated the call.
  3. A Filter for limiting the result set. Note, the Filter may be null.
  4. A Sort object for specifying how to sort the results if the standard score based sort method is not desired.

Now that the Hits object has been initialized, it begins the process of identifying documents that match the query by calling the getMoreDocs method. Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call on the "expert" search method of the Searcher, passing in our Weight object, Filter and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The Searcher creates a TopDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the HitCollector mechanism, see Searcher). The TopDocCollector uses a PriorityQueue to collect the top results for the search.
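The priority-queue collection step can be sketched as follows. The names here are illustrative, not Lucene's: a min-heap keeps the weakest retained hit on top, so each new hit either displaces it or is discarded, leaving the k best scores at the end.

```java
import java.util.PriorityQueue;

// Toy sketch of TopDocCollector-style top-k collection (not Lucene code):
// a min-heap bounded at k keeps the k highest scores seen so far.
public class TopKSketch {
    static double[] topK(double[] scores, int k) {
        PriorityQueue<Double> pq = new PriorityQueue<>(); // min-heap: weakest on top
        for (double s : scores) {
            pq.offer(s);
            if (pq.size() > k) pq.poll(); // evict the current weakest hit
        }
        double[] out = new double[pq.size()];
        for (int i = 0; i < out.length; i++) out[i] = pq.poll(); // ascending order
        return out;
    }

    public static void main(String[] args) {
        // four hits, keep the best two
        double[] best = topK(new double[]{0.3, 0.9, 0.1, 0.7}, 2);
        System.out.println(best[0] + " " + best[1]); // 0.7 0.9
    }
}
```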

If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher and we proceed by calling the score method on the Scorer.

At last, we are actually going to score some documents. The score method takes in the HitCollector (most likely the TopDocCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 (see the section on customizing your scoring for info on changing this.)

Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer#next() method. The next() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.