integrate scoring.html into scoring package, fix broken links, and update for 4.0

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1328929 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Robert Muir 2012-04-22 18:41:42 +00:00
parent 4850ac1dc0
commit f13faf3ee9
4 changed files with 189 additions and 355 deletions

View File

@ -27,18 +27,33 @@ Code to search indices.
<ol>
<li><a href="#search">Search Basics</a></li>
<li><a href="#query">The Query Classes</a></li>
<li><a href="#scoring">Changing the Scoring</a></li>
<li><a href="#scoring">Scoring: Introduction</a></li>
<li><a href="#scoringBasics">Scoring: Basics</a></li>
<li><a href="#changingScoring">Changing the Scoring</a></li>
<li><a href="#algorithm">Appendix: Search Algorithm</a></li>
</ol>
</p>
<a name="search"></a>
<h2>Search</h2>
<h2>Search Basics</h2>
<p>
Search over indices.
Applications usually call {@link
Lucene offers a wide variety of {@link org.apache.lucene.search.Query} implementations, most of which are in
this package, its subpackages ({@link org.apache.lucene.search.spans spans}, {@link org.apache.lucene.search.payloads payloads}),
or the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. These implementations can be combined in a wide
variety of ways to provide complex querying capabilities along with information about where matches took place in the document
collection. The <a href="#query">Query Classes</a> section below highlights some of the more important Query classes. For details
on implementing your own Query class, see <a href="#customQueries">Custom Queries -- Expert Level</a> below.
</p>
<p>
To perform a search, applications usually call {@link
org.apache.lucene.search.IndexSearcher#search(Query,int)} or {@link
org.apache.lucene.search.IndexSearcher#search(Query,Filter,int)}.
</p>
<p>
Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
implementation and its {@link org.apache.lucene.search.Scorer Scorer} instances. See the <a href="#algorithm">Algorithm</a>
section for more notes on the process.
</p>
<!-- FILL IN MORE HERE -->
<!-- TODO: this page over-links the same things too many times -->
</p>
@ -211,20 +226,118 @@ org.apache.lucene.search.IndexSearcher#search(Query,Filter,int)}.
This type of query can be useful when accounting for spelling variations in the collection.
</p>
<a name="scoring"></a>
<h2>Scoring &mdash; Introduction</h2>
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides
almost all of the complexity from the user. In a nutshell, it works. At least, that is,
until it doesn't work, or doesn't work as one would expect it to work. Then we are left
digging into Lucene internals or asking for help on
<a href="mailto:java-user@lucene.apache.org">java-user@lucene.apache.org</a> to figure out
why a document with five of our query terms scores lower than a different document with
only one of the query terms.
</p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you
to the places that can help you figure out the <i>what</i> and <i>why</i> of Lucene scoring.
</p>
<p>Lucene scoring supports a number of pluggable information retrieval
<a href="http://en.wikipedia.org/wiki/Information_retrieval#Model_types">models</a>, including:
<ul>
<li><a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM)</a></li>
<li><a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model">Probablistic Models</a> such as
<a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)">Okapi BM25</a> and
<a href="http://en.wikipedia.org/wiki/Divergence-from-randomness_model">DFR</a></li>
<li><a href="http://en.wikipedia.org/wiki/Language_model">Language models</a></li>
</ul>
These models can be plugged in via the {@link org.apache.lucene.search.similarities Similarity API},
and offer extension hooks and parameters for tuning. In general, Lucene first narrows down the documents
that need to be scored based on boolean logic in the Query specification, and then ranks this subset of
documents via the retrieval model. For some valuable references on VSM and IR in general refer to
<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
</p>
<p>The rest of this document will cover <a href="#scoringBasics">Scoring basics</a> and explain how to
change your {@link org.apache.lucene.search.similarities.Similarity Similarity}. Next, it will cover
ways you can customize the lucene internals in
<a href="#customQueriesExpert">Custom Queries -- Expert Level</a>, which gives details on
implementing your own {@link org.apache.lucene.search.Query Query} class and related functionality.
Finally, we will finish up with some reference material in the <a href="#algorithm">Appendix</a>.
</p>
<a name="scoringBasics"></a>
<h2>Scoring &mdash; Basics</h2>
<p>Scoring is very much dependent on the way documents are indexed, so it is important to understand
indexing. (see <a href="@{docRoot}/overview-summary.html">Lucene overview</a> before continuing
on with this section) It is also assumed that readers know how to use the
{@link org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int) IndexSearcher.explain(Query, doc)}
functionality, which can go a long way in informing why a score is returned.
</p>
<h4>Fields and Documents</h4>
<p>In Lucene, the objects we are scoring are {@link org.apache.lucene.document.Document Document}s.
A Document is a collection of {@link org.apache.lucene.document.Field Field}s. Each Field has
{@link org.apache.lucene.document.FieldType semantics} about how it is created and stored
({@link org.apache.lucene.document.FieldType#tokenized() tokenized},
{@link org.apache.lucene.document.FieldType#stored() stored}, etc). It is important to note that
Lucene scoring works on Fields and then combines the results to return Documents. This is
important because two Documents with the exact same content, but one having the content in two
Fields and the other in one Field may return different scores for the same query due to length
normalization.
</p>
<h4>Score Boosting</h4>
<p>Lucene allows influencing search results by "boosting" in more than one level:
<ul>
<li><b>Index-time boost</b> by calling
{@link org.apache.lucene.document.Field#setBoost(float) Field.setBoost()} before a document is
added to the index.</li>
<li><b>Query-time boost</b> by setting a boost on a query clause, calling
{@link org.apache.lucene.search.Query#setBoost(float) Query.setBoost()}.</li>
</ul>
</p>
<p>Indexing time boosts are pre-processed for storage efficiency and written to
storage for a field as follows:
<ul>
<li>All boosts of that field (i.e. all boosts under the same field name in that doc) are
multiplied.</li>
<li>The boost is then encoded into a normalization value by the Similarity
object at index-time: {@link org.apache.lucene.search.similarities.Similarity#computeNorm computeNorm()}.
The actual encoding depends upon the Similarity implementation, but note that most
use a lossy encoding (such as multiplying the boost with document length or similar, packed
into a single byte!).</li>
<li>Decoding of any index-time normalization values and integration into the document's score is also performed
at search time by the Similarity.</li>
</ul>
</p>
<a name="changingScoring"></a>
<h2>Changing Scoring &mdash; Similarity</h2>
<p>
Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
influence scoring, this is done at index-time with
{@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(org.apache.lucene.search.similarities.Similarity)
IndexWriterConfig.setSimilarity(Similarity)} and at query-time with
{@link org.apache.lucene.search.IndexSearcher#setSimilarity(org.apache.lucene.search.similarities.Similarity)
IndexSearcher.setSimilarity(Similarity)}.
</p>
<p>
You can influence scoring by configuring a different built-in Similarity implementation, or by tweaking its
parameters, subclassing it to override behavior. Some implementations also offer a modular API which you can
extend by plugging in a different component (e.g. term frequency normalizer).
</p>
<p>
Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
a custom Similarity can access per-document values via {@link org.apache.lucene.search.FieldCache FieldCache} or
{@link org.apache.lucene.index.DocValues} and integrate them into the score.
</p>
<p>
See the {@link org.apache.lucene.search.similarities} package documentation for information
on the available scoring models and extending or changing Similarity.
on the built-in available scoring models and extending or changing Similarity.
</p>
<a name="customQueriesExpert"></a>
<h2>Custom Queries &mdash; Expert Level</h2>
<h2>Changing Scoring &mdash; Expert Level</h2>
<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
<p>Custom queries are an expert level task, so tread carefully and be prepared to share your code if
you want help.
</p>
<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
<span >three main classes</span>:
when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by
<span>three main classes</span>:
<ol>
<li>
{@link org.apache.lucene.search.Query Query} &mdash; The abstract object representation of the
@ -248,13 +361,13 @@ on the available scoring models and extending or changing Similarity.
{@link org.apache.lucene.search.Query Query} class has several methods that are important for
derived classes:
<ol>
<li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher} &mdash; A
<li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)} &mdash; A
{@link org.apache.lucene.search.Weight Weight} is the internal representation of the
Query, so each Query implementation must
provide an implementation of Weight. See the subsection on <a
href="#weightClass">The Weight Interface</a> below for details on implementing the Weight
interface.</li>
<li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader} &mdash; Rewrites queries into primitive queries. Primitive queries are:
<li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader)} &mdash; Rewrites queries into primitive queries. Primitive queries are:
{@link org.apache.lucene.search.TermQuery TermQuery},
{@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
>and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)}</span></li>
@ -363,5 +476,63 @@ on the available scoring models and extending or changing Similarity.
back
out of Lucene (similar to Doug adding SpanQuery functionality).</p>
<!-- TODO: integrate this better, its better served as an intro than an appendix -->
<a name="algorithm"></a>
<h2>Appendix: Search Algorithm</h2>
<p>This section is mostly notes on stepping through the Scoring process and serves as
fertilizer for the earlier sections.</p>
<p>In the typical search application, a {@link org.apache.lucene.search.Query Query}
is passed to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher},
beginning the scoring process.</p>
<p>Once inside the IndexSearcher, a {@link org.apache.lucene.search.Collector Collector}
is used for the scoring and sorting of the search results.
These important objects are involved in a search:
<ol>
<li>The {@link org.apache.lucene.search.Weight Weight} object of the Query. The
Weight object is an internal representation of the Query that allows the Query
to be reused by the IndexSearcher.</li>
<li>The IndexSearcher that initiated the call.</li>
<li>A {@link org.apache.lucene.search.Filter Filter} for limiting the result set.
Note, the Filter may be null.</li>
<li>A {@link org.apache.lucene.search.Sort Sort} object for specifying how to sort
the results if the standard score-based sort method is not desired.</li>
</ol>
</p>
<p>Assuming we are not sorting (since sorting doesn't affect the raw Lucene score),
we call one of the search methods of the IndexSearcher, passing in the
{@link org.apache.lucene.search.Weight Weight} object created by
{@link org.apache.lucene.search.IndexSearcher#createNormalizedWeight(org.apache.lucene.search.Query)
IndexSearcher.createNormalizedWeight(Query)},
{@link org.apache.lucene.search.Filter Filter} and the number of results we want.
This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
which is an internal collection of search results. The IndexSearcher creates
a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
passes it along with the Weight, Filter to another expert search method (for
more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
top results for the search.
</p>
<p>If a Filter is being used, some initial setup is done to determine which docs to include.
Otherwise, we ask the Weight for a {@link org.apache.lucene.search.Scorer Scorer} for each
{@link org.apache.lucene.index.IndexReader IndexReader} segment and proceed by calling
{@link org.apache.lucene.search.Scorer#score(org.apache.lucene.search.Collector) Scorer.score()}.
</p>
<p>At last, we are actually going to score some documents. The score method takes in the Collector
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business.Of course, here
is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
submitted. In most real world applications with multiple query terms, the
{@link org.apache.lucene.search.Scorer Scorer} is going to be a <code>BooleanScorer2</code> created
from {@link org.apache.lucene.search.BooleanQuery.BooleanWeight BooleanWeight} (see the section on
<a href="#customQueriesExpert">custom queries</a> for info on changing this).
</p>
<p>Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the coord()
factor. We then get a internal Scorer based on the required, optional and prohibited parts of the query.
Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the
{@link org.apache.lucene.search.Scorer#nextDoc Scorer.nextDoc()} method. The nextDoc() method advances
to the next document matching the query. This is an abstract method in the Scorer class and is thus
overridden by all derived implementations. If you have a simple OR query your internal Scorer is most
likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.</p>
</body>
</html>

View File

@ -39,7 +39,8 @@ package.
<h2>Summary of the Ranking Methods</h2>
<p>{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene
scoring function. It is based on a highly optimized Vector Space Model. For more
scoring function. It is based on a highly optimized
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more
information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.</p>
<p>{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized

View File

@ -1,338 +0,0 @@
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Apache Lucene - Scoring</title>
</head>
<body>
<h1>Apache Lucene - Scoring</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Introduction">Introduction</a>
</li>
<li>
<a href="#Scoring">Scoring</a>
<ul class="minitoc">
<li>
<a href="#Fields and Documents">Fields and Documents</a>
</li>
<li>
<a href="#Score Boosting">Score Boosting</a>
</li>
<li>
<a href="#Understanding the Scoring Formula">Understanding the Scoring Formula</a>
</li>
<li>
<a href="#The Big Picture">The Big Picture</a>
</li>
<li>
<a href="#Query Classes">Query Classes</a>
</li>
<li>
<a href="#Changing Similarity">Changing Similarity</a>
</li>
</ul>
</li>
<li>
<a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level</a>
</li>
<li>
<a href="#Appendix">Appendix</a>
<ul class="minitoc">
<li>
<a href="#Algorithm">Algorithm</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N10013"></a><a name="Introduction"></a>
<h2 class="boxed">Introduction</h2>
<div class="section">
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
scores lower than a different document with only one of the query terms. </p>
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
help you figure out the what and why of Lucene scoring.</p>
<p>Lucene scoring uses a combination of the
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
to determine
how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
times a query term appears in a document relative to
the number of times the term appears in all the documents in the collection, the more relevant that
document is to the query. It uses the Boolean model to first narrow down the documents that need to
be scored based on the use of boolean logic in the Query specification. Lucene also adds some
capabilities and refinements onto this model to support boolean and fuzzy searching, but it
essentially remains a VSM based system at the heart.
For some valuable references on VSM and IR in general refer to the
<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
</p>
<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can
customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
-- Expert Level</a> which gives details on implementing your own
<a href="core/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we
will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
</p>
</div>
<a name="N10045"></a><a name="Scoring"></a>
<h2 class="boxed">Scoring</h2>
<div class="section">
<p>Scoring is very much dependent on the way documents are indexed,
so it is important to understand indexing (see
<a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
and the Lucene
<a href="fileformats.html">file formats</a>
before continuing on with this section.) It is also assumed that readers know how to use the
<a href="core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
which can go a long way in informing why a score is returned.
</p>
<a name="N10059"></a><a name="Fields and Documents"></a>
<h3 class="boxed">Fields and Documents</h3>
<p>In Lucene, the objects we are scoring are
<a href="core/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection
of
<a href="core/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how
it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to
note that Lucene scoring works on Fields and then combines the results to return Documents. This is
important because two Documents with the exact same content, but one having the content in two Fields
and the other in one Field will return different scores for the same query due to length normalization
(assumming the
<a href="core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
on the Fields).
</p>
<a name="N1006E"></a><a name="Score Boosting"></a>
<h3 class="boxed">Score Boosting</h3>
<p>Lucene allows influencing search results by "boosting" in more than one level:
<ul>
<li>
<b>Document level boosting</b>
- while indexing - by calling
<a href="core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
before a document is added to the index.
</li>
<li>
<b>Document's Field level boosting</b>
- while indexing - by calling
<a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
before adding a field to the document (and before adding the document to the index).
</li>
<li>
<b>Query level boosting</b>
- during search, by setting a boost on a query clause, calling
<a href="core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
</li>
</ul>
</p>
<p>Indexing time boosts are preprocessed for storage efficiency and written to
the directory (when writing the document) in a single byte (!) as follows:
For each field of a document, all boosts of that field
(i.e. all boosts under the same field name in that doc) are multiplied.
The result is multiplied by the boost of the document,
and also multiplied by a "field length norm" value
that represents the length of that field in that doc
(so shorter fields are automatically boosted up).
The result is decoded as a single byte
(with some precision loss of course) and stored in the directory.
The similarity object in effect at indexing computes the length-norm of the field.
</p>
<p>This composition of 1-byte representation of norms
(that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
is nicely described in
<a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
</p>
<p>Encoding and decoding of the resulted float norm in a single byte are done by the
static methods of the class Similarity:
<a href="core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
<a href="core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
e.g. decode(encode(0.89)) = 0.75.
At scoring (search) time, this norm is brought into the score of document
as <b>norm(t, d)</b>, as shown by the formula in
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>.
</p>
<a name="N100B1"></a><a name="Understanding the Scoring Formula"></a>
<h3 class="boxed">Understanding the Scoring Formula</h3>
<p>
This scoring formula is described in the
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
basics of Lucene scoring work, especially the
<a href="core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
</p>
<a name="N100C2"></a><a name="The Big Picture"></a>
<h3 class="boxed">The Big Picture</h3>
<p>OK, so the tf-idf formula and the
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>
is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are
the use and interactions between the
<a href="core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
response to a user's information need.
</p>
<p>In this regard, Lucene offers a wide variety of <a href="core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
<a href="core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
These implementations can be combined in a wide variety of ways to provide complex querying
capabilities along with
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
section below
highlights some of the more important Query classes. For information on the other ones, see the
<a href="core/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
Expert Level</a> below.
</p>
<p>Once a Query has been created and submitted to the
<a href="core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
control finally passes to the <a href="core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
<a href="core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
(link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
</p>
<p>
Assuming the use of the BooleanWeight2, a
BooleanScorer2 is created by bringing together
all of the
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
provided by each scorer while factoring in the coord() score.
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
</p>
<a name="N10112"></a><a name="Query Classes"></a>
<h3 class="boxed">Query Classes</h3>
<p>For information on the Query Classes, refer to the
<a href="core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>
</p>
<a name="N1011F"></a><a name="Changing Similarity"></a>
<h3 class="boxed">Changing Similarity</h3>
<p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
how to do this, see the
<a href="core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>
</p>
</div>
<a name="N1012C"></a><a name="Changing your Scoring -- Expert Level"></a>
<h2 class="boxed">Changing your Scoring -- Expert Level</h2>
<div class="section">
<p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more
about how to do this, refer to the
<a href="core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>
</p>
</div>
<a name="N10139"></a><a name="Appendix"></a>
<h2 class="boxed">Appendix</h2>
<div class="section">
<a name="N1013E"></a><a name="Algorithm"></a>
<h3 class="boxed">Algorithm</h3>
<p>This section is mostly notes on stepping through the Scoring process and serves as
fertilizer for the earlier sections.</p>
<p>In the typical search application, a
<a href="core/org/apache/lucene/search/Query.html">Query</a>
is passed to the
<a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
, beginning the scoring process.
</p>
<p>Once inside the Searcher, a
<a href="core/org/apache/lucene/search/Collector.html">Collector</a>
is used for the scoring and sorting of the search results.
These important objects are involved in a search:
<ol>
<li>The
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
object of the Query. The Weight object is an internal representation of the Query that
allows the Query to be reused by the Searcher.
</li>
<li>The Searcher that initiated the call.</li>
<li>A
<a href="core/org/apache/lucene/search/Filter.html">Filter</a>
for limiting the result set. Note, the Filter may be null.
</li>
<li>A
<a href="core/org/apache/lucene/search/Sort.html">Sort</a>
object for specifying how to sort the results if the standard score based sort method is not
desired.
</li>
</ol>
</p>
<p> Assuming we are not sorting (since sorting doesn't
effect the raw Lucene score),
we call one of the search methods of the Searcher, passing in the
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
object created by Searcher.createWeight(Query),
<a href="core/org/apache/lucene/search/Filter.html">Filter</a>
and the number of results we want. This method
returns a
<a href="core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
object, which is an internal collection of search results.
The Searcher creates a
<a href="core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
and passes it along with the Weight, Filter to another expert search method (for more on the
<a href="core/org/apache/lucene/search/Collector.html">Collector</a>
mechanism, see
<a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
.) The TopDocCollector uses a
<a href="core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
to collect the top results for the search.
</p>
<p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
we ask the Weight for
a
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
for the
<a href="core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
of the current searcher and we proceed by
calling the score method on the
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
.
</p>
<p>At last, we are actually going to score some documents. The score method takes in the Collector
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
Of course, here is where things get involved. The
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
that is returned by the
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
object depends on what type of Query was submitted. In most real world applications with multiple
query terms,
the
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
is going to be a
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
(see the section on customizing your scoring for info on changing this.)
</p>
<p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
coord() factor. We then
get a internal Scorer based on the required, optional and prohibited parts of the query.
Using this internal Scorer, the BooleanScorer2 then proceeds
into a while loop based on the Scorer#next() method. The next() method advances to the next document
matching the query. This is an
abstract method in the Scorer class and is thus overriden by all derived
implementations. <!-- DOUBLE CHECK THIS -->If you have a simple OR query
your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers
from the sub scorers of the OR'd terms.</p>
</div>
</body>
</html>

View File

@ -61,7 +61,7 @@
<ul>
<li><a href="changes/Changes.html">Changes</a>: List of changes in this release.</li>
<li><a href="fileformats.html">File Formats</a>: Guide to the index format used by Lucene.</li>
<li><a href="scoring.html">Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
<li><a href="core/org/apache/lucene/search/package-summary.html#package_description">Search and Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
<li><a href="core/org/apache/lucene/search/similarities/TFIDFSimilarity.html">Classic Scoring Formula</a>: Formula of Lucene's classic <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space</a> implementation. (look <a href="core/org/apache/lucene/search/similarities/package-summary.html#package_description">here</a> for other models)</li>
<li><a href="queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description">Classic QueryParser Syntax</a>: Overview of the Classic QueryParser's syntax and features.</li>
<li><a href="facet/org/apache/lucene/facet/doc-files/userguide.html">Facet User Guide</a>: User's Guide to implementing <a href="http://en.wikipedia.org/wiki/Faceted_search">Faceted search</a>.</li>