scoring documentation updates from Michael McCandless

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@438127 13f79535-47bb-0310-9956-ffa450edef68
Grant Ingersoll 2006-08-29 17:43:18 +00:00
parent 18c5d5be0e
commit b237052d3a
1 changed files with 95 additions and 89 deletions

@ -181,7 +181,7 @@ limitations under the License.
and the other in one Field will return different scores for the same query due to length normalization and the other in one Field will return different scores for the same query due to length normalization
(assuming the (assuming the
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
on the Fields.) on the Fields).
</p> </p>
</blockquote> </blockquote>
</td></tr> </td></tr>
@ -288,13 +288,13 @@ limitations under the License.
<a href="api/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in <a href="api/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need. response to a user's information need.
</p> </p>
<p>In this regard, Lucene offers a wide variety of Query implementations, most of which are in the <p>In this regard, Lucene offers a wide variety of <a href="api/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
org.apache.lucene.search package. <a href="api/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
These implementations can be combined in a wide variety of ways to provide complex querying These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with capabilities along with
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a> information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
section below will section below
highlight some of the more important Query classes. For information on the other ones, see the highlights some of the more important Query classes. For information on the other ones, see the
<a href="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing <a href="api/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below. Expert Level</a> below.
@ -302,7 +302,7 @@ limitations under the License.
<p>Once a Query has been created and submitted to the <p>Once a Query has been created and submitted to the
<a href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process <a href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup, begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the Weight implementation and it's control finally passes to the <a href="api/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
<a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of <a href="api/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class), <a href="http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a> (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class),
@ -340,68 +340,77 @@ limitations under the License.
<a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
</h4> </h4>
<p>Of the various implementations of <p>Of the various implementations of
Query, the <a href="api/org/apache/lucene/search/Query.html">Query</a>, the
<a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
is the easiest to understand and the most often used in most applications. A TermQuery is a Query is the easiest to understand and the most often used in applications. A <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> matches all the documents that contain the specified
that matches all the documents that contain the specified <a href="api/org/apache/lucene/index/Term.html">Term</a>,
<a href="api/org/apache/lucene/index/Term.html">Term</a> which is a word that occurs in a certain
. A Term is a word that occurs in a specific <a href="api/org/apache/lucene/document/Field.html">Field</a>.
<a href="api/org/apache/lucene/document/Field.html">Field</a> Thus, a <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> identifies and scores all
. Thus, a TermQuery identifies and scores all <a href="api/org/apache/lucene/document/Document.html">Document</a>s that have a <a href="api/org/apache/lucene/document/Field.html">Field</a> with the specified string in it.
<a href="api/org/apache/lucene/document/Document.html">Document</a> Constructing a <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a>
s that have a Field with the specified string in it. is as simple as:
Constructing a TermQuery is as simple as: <pre>
<code>TermQuery tq = new TermQuery(new Term("fieldName", "term"));</code> TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query would identify all Documents that have the Field named "fieldName" that </pre>In this example, the <a href="api/org/apache/lucene/search/Query.html">Query</a> identifies all <a href="api/org/apache/lucene/document/Document.html">Document</a>s that have the <a href="api/org/apache/lucene/document/Field.html">Field</a> named <tt>"fieldName"</tt> and
contain the word "term". contain the word <tt>"term"</tt>.
</p> </p>
<h4> <h4>
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>
</h4> </h4>
<p>Things start to get interesting when one starts to combine TermQuerys, which is handled by the <p>Things start to get interesting when one combines multiple
<a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> instances into a <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>.
class. The BooleanQuery is a collection A <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> contains multiple
of other <a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>s,
<a href="api/org/apache/lucene/search/Query.html">Query</a> where each clause contains a sub-query (<a href="api/org/apache/lucene/search/Query.html">Query</a>
classes along with semantics about how to combine the different subqueries. instance) and an operator (from <a href="api/org/apache/lucene/search/BooleanClause.Occur.html">BooleanClause.Occur</a>)
It currently supports three different operators for specifying the logic of the query (see describing how that sub-query is combined with the other clauses:
<a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>
)
<ol> <ol>
<li>SHOULD -- Use this operator when a clause can occur in the result set, but is not required.
If a query is made up of all SHOULD clauses, then a non-empty result <li><p>SHOULD -- Use this operator when a clause can occur in the result set, but is not required.
set will have matched at least one of the clauses in the query.</li> If a query is made up of all SHOULD clauses, then every document in the result
<li>MUST -- Use this operator when a clause is required to occur in the result set.</li> set matches at least one of these clauses.</p></li>
<li>MUST NOT -- Use this operator when a clause must not occur in the result set.</li>
<li><p>MUST -- Use this operator when a clause is required to occur in the result set. Every
document in the result set will match
all such clauses.</p></li>
<li><p>MUST NOT -- Use this operator when a
clause must not occur in the result set. No
document in the result set will match
any such clauses.</p></li>
</ol> </ol>
Boolean queries are constructed by adding two or more Boolean queries are constructed by adding two or more
<a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a> <a href="api/org/apache/lucene/search/BooleanClause.html">BooleanClause</a>
instances to the BooleanQuery instance. In some cases, instances. If too many clauses are added, a <a href="api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html">TooManyClauses</a>
too many clauses may be added to the BooleanQuery, which will cause a TooManyClauses exception to be exception will be thrown during searching. This most often occurs
thrown. This when a <a href="api/org/apache/lucene/search/Query.html">Query</a>
most often occurs when using a Query that is rewritten into many TermQuery instances, such as the is rewritten into a <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a> with many
<a href="api/org/apache/lucene/search/WildCardQuery.html">WildCardQuery</a> <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> clauses,
. The default for example by <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>.
setting for too many clauses is currently set to 1024, but it can be overridden via the The default setting for the maximum number
<a href="api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)">BooleanQuery#setMaxClauseCount(int)</a> static method on BooleanQuery. of clauses is 1024, but this can be changed via the
static method <a href="api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a>
in <a href="api/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>.
</p> </p>
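The matching rules for the three operators described above can be sketched in plain Java. This is a simplified model, not Lucene's implementation: each clause is reduced to a single word, a document is a set of words, and `BooleanClauseSketch`, `Clause`, and `matches` are hypothetical names. It also assumes that a query containing only MUST NOT clauses matches nothing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of SHOULD / MUST / MUST NOT semantics over word sets.
final class BooleanClauseSketch {
    enum Occur { SHOULD, MUST, MUST_NOT }

    static final class Clause {
        final String word;
        final Occur occur;

        Clause(String word, Occur occur) {
            this.word = word;
            this.occur = occur;
        }
    }

    static boolean matches(Set<String> docWords, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShouldMatched = false;
        for (Clause clause : clauses) {
            boolean present = docWords.contains(clause.word);
            switch (clause.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) {
                        return false; // every MUST clause has to match
                    }
                    break;
                case MUST_NOT:
                    if (present) {
                        return false; // no MUST NOT clause may match
                    }
                    break;
                case SHOULD:
                    anyShouldMatched |= present;
                    break;
            }
        }
        // With no MUST clauses, at least one SHOULD clause has to match.
        return hasMust || anyShouldMatched;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("lucene", "search"));
        List<Clause> query = Arrays.asList(
                new Clause("lucene", Occur.MUST),
                new Clause("solr", Occur.MUST_NOT));
        System.out.println(matches(doc, query)); // prints true
    }
}
```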
<h4>Phrases</h4> <h4>Phrases</h4>
<p>Another common task in search is to identify phrases, which can be handled in two different ways. <p>Another common search is to find documents containing certain phrases. This
is handled in two different ways.
<ol> <ol>
<li> <li>
<a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a> <p><a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a>
-- Matches a sequence of -- Matches a sequence of
<a href="api/org/apache/lucene/index/Term.html">Terms</a> <a href="api/org/apache/lucene/index/Term.html">Terms</a>.
. The PhraseQuery can specify a slop factor which determines <a href="api/org/apache/lucene/search/PhraseQuery.html">PhraseQuery</a> uses a slop factor to determine
how many positions may occur between any two terms and still be considered a match. how many positions may occur between any two terms in the phrase and still be considered a match.</p>
</li> </li>
<li> <li>
<a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a> <p><a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a>
-- Matches a sequence of other -- Matches a sequence of other
<a href="api/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a> <a href="api/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a>
instances. The SpanNearQuery allows for much more instances. <a href="api/org/apache/lucene/search/spans/SpanNearQuery.html">SpanNearQuery</a> allows for much more
complicated phrasal queries to be built since it is constructed out of other SpanQuery complicated phrase queries since it is constructed from other <a href="api/org/apache/lucene/search/spans/SpanQuery.html">SpanQuery</a>
objects, not just Terms. instances, instead of only <a href="api/org/apache/lucene/search/TermQuery.html">TermQuery</a> instances.</p>
</li> </li>
</ol> </ol>
</p> </p>
@ -414,41 +423,39 @@ limitations under the License.
exclusive range of a lower exclusive range of a lower
<a href="api/org/apache/lucene/index/Term.html">Term</a> <a href="api/org/apache/lucene/index/Term.html">Term</a>
and an upper and an upper
<a href="api/org/apache/lucene/index/Term.html">Term</a> <a href="api/org/apache/lucene/index/Term.html">Term</a>.
. For instance, one could find all documents For example, one could find all documents
that have terms beginning with the letters a through c. This type of Query is most often used to that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a href="api/org/apache/lucene/search/Query.html">Query</a> is frequently used to
find find
documents that occur in a specific date range. documents that occur in a specific date range.
</p> </p>
<h4> <h4>
<a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a> <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>,
,
<a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>
</h4> </h4>
<p>While the <p>While the
<a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a> <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a>
has a different implementation, it is essentially a special case of the has a different implementation, it is essentially a special case of the
<a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a>.
. The PrefixQuery allows an application The <a href="api/org/apache/lucene/search/PrefixQuery.html">PrefixQuery</a> allows an application
to identify all documents with terms that begin with a certain string. The WildcardQuery generalize to identify all documents with terms that begin with a certain string. The <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> generalizes this by allowing
this by allowing for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards. Note that the <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> can be quite slow. Also note that
for the use of * and ? wildcards. Note that the WildcardQuery can be quite slow. Also note that <a href="api/org/apache/lucene/search/WildcardQuery.html">WildcardQuery</a> should
WildcardQuerys should not start with <tt>*</tt> and <tt>?</tt>, as these are extremely slow. For tricks on how to search using a wildcard at
not start with * and ?, as these are extremely slow. For tricks on how to search using a wildcard at
the beginning of a term, see the beginning of a term, see
<a href="http://www.gossamer-threads.com/lists/lucene/java-user/13373?search_string=WildcardQuery%20start;#13373"> <a href="http://www.gossamer-threads.com/lists/lucene/java-user/13373#13373">
Starts With x and Ends With x Queries</a> Starts With x and Ends With x Queries</a>
from the Lucene archives. from the Lucene users' mailing list.
</p> </p>
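The wildcard semantics described above (`*` matches zero or more characters, `?` matches exactly one) can be sketched as a small matcher in plain Java. This is an illustration only, with the hypothetical name `WildcardSketch`; it is not the code Lucene uses to enumerate matching terms.

```java
// Hypothetical sketch: tests one term against a pattern with * and ? wildcards.
final class WildcardSketch {
    static boolean matches(String pattern, String term) {
        int p = 0;          // position in pattern
        int t = 0;          // position in term
        int starP = -1;     // position of the most recent '*' in the pattern
        int starT = -1;     // term position that '*' is currently assumed to cover up to
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p;  // remember the star and first try matching zero characters
                starT = t;
                p++;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;        // '?' or a literal character consumes exactly one character
                t++;
            } else if (starP >= 0) {
                starT++;    // backtrack: let the last '*' absorb one more character
                t = starT;
                p = starP + 1;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;            // trailing stars may match the empty string
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("lu*e", "lucene")); // prints true
        System.out.println(matches("te?t", "tet"));    // prints false
    }
}
```

A pattern with a literal prefix, such as `lu*e`, only needs to be tried against terms starting with that prefix. A pattern that begins with a wildcard has no such prefix, so every term in the field is a candidate, which is one way to see why leading wildcards are so slow.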
<h4> <h4>
<a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a> <a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a>
</h4> </h4>
<p>A <p>A
<a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a> <a href="api/org/apache/lucene/search/FuzzyQuery.html">FuzzyQuery</a>
matches documents that contain similar terms to the specified term. Similarity is matches documents that contain terms similar to the specified term. Similarity is
determined using the determined using
<a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein (edit distance) algorithm</a> <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein (edit) distance</a>.
. This type of query can be useful when accounting for spelling variations in the collection. This type of query can be useful when accounting for spelling variations in the collection.
</p> </p>
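For readers unfamiliar with Levenshtein distance, the following is a standard dynamic-programming sketch in plain Java. The class name `EditDistanceSketch` is hypothetical, and the sketch shows only the raw distance; how FuzzyQuery turns a distance into a similarity threshold is not covered here.

```java
// Hypothetical sketch: classic two-row Levenshtein (edit) distance.
final class EditDistanceSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucine")); // prints 1
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```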
</blockquote> </blockquote>
</td></tr> </td></tr>
@ -462,33 +469,32 @@ limitations under the License.
</td></tr> </td></tr>
<tr><td> <tr><td>
<blockquote> <blockquote>
<p>Chances are, the <p>Chances are <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all your searching needs.
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all your searching needs. However, in some applications it may be necessary to customize your <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> implementation. For instance, some applications do not need to
However, in some applications it may be necessary to alter your Similarity. For instance, some applications do not need to distinguish between shorter and longer documents (see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p>
distinguish between shorter documents and longer documents (for example, <p>To change <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>, one must do so for both indexing and searching, and the changes must happen before
see <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967">a "fair" similarity</a>) either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
To change the Similarity, one must do so for both indexing and searching and the changes must take place before </p>
any of these actions are undertaken (although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen). <p>To make this change, implement your own <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> (likely you'll want to simply subclass
To make this change, implement your Similarity (you probably want to override <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new
<a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>) and then set the new class by calling
class on <a href="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a> before indexing and
<a href="api/org/apache/lucene/index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity(org.apache.lucene.search.Similarity)</a> for indexing and on <a href="api/org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a> before searching.
<a href="api/org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity(org.apache.lucene.search.Similarity)</a>.
</p> </p>
<p> <p>
If you are interested in use cases for changing your similarity, see the mailing list at <a href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>. If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at <a href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
In summary, here are a few use cases: In summary, here are a few use cases:
<ol> <ol>
<li>SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount <li><p><a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> -- <a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases as the frequency increases a small amount
and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.</li> and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.</p></li>
<li>Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these <li><p>Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these
cases people have overridden Similarity to return 1 from the tf() method.</li> cases people have overridden Similarity to return 1 from the tf() method.</p></li>
<li>Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes <li><p>Changing Length Normalization -- By overriding <a href="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>, it is possible to discount how the length of a field contributes
to a score. In the DefaultSimilarity, lengthNorm = 1/ (numTerms in field)^0.5, but if one changes this to be to a score. In <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
1 / (numTerms in field), all fields will be treated 1 / (numTerms in field), all fields will be treated
<a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967">"fairly"</a>.</li> <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li>
</ol> </ol>
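The two length-normalization formulas in the last item can be compared numerically with a short plain-Java sketch. The class and method names are hypothetical and no Lucene classes are involved; the sketch assumes `numTerms` is greater than zero and only evaluates the formulas as written above.

```java
// Hypothetical sketch: the two lengthNorm formulas from the text, evaluated directly.
final class LengthNormSketch {
    // lengthNorm = 1 / (numTerms in field)^0.5, the formula given for DefaultSimilarity.
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // lengthNorm = 1 / (numTerms in field), the alternative described above.
    static float alternativeLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {4, 100, 10000}) {
            System.out.println(numTerms + " terms: default="
                    + defaultLengthNorm(numTerms)
                    + " alternative=" + alternativeLengthNorm(numTerms));
        }
    }
}
```

For a field with 4 terms the two formulas give 0.5 and 0.25; the gap widens as the field grows, since the second formula shrinks linearly with length rather than with its square root.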
In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">the mailing list</a>): In general, Chris Hostetter sums it up best in saying (from <a href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>):
<blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just that <blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just that
it's "text" is a situation where it *might* make sense to to override your it's "text" is a situation where it *might* make sense to to override your
Similarity method.</blockquote> Similarity method.</blockquote>