mirror of
https://github.com/apache/lucene.git
synced 2025-03-04 07:19:18 +00:00
integrate scoring.html into scoring package, fix broken links, and update for 4.0
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1328929 13f79535-47bb-0310-9956-ffa450edef68
parent
4850ac1dc0
commit
f13faf3ee9
@ -27,18 +27,33 @@ Code to search indices.
|
||||
<ol>
|
||||
<li><a href="#search">Search Basics</a></li>
|
||||
<li><a href="#query">The Query Classes</a></li>
|
||||
<li><a href="#scoring">Changing the Scoring</a></li>
|
||||
<li><a href="#scoring">Scoring: Introduction</a></li>
|
||||
<li><a href="#scoringBasics">Scoring: Basics</a></li>
|
||||
<li><a href="#changingScoring">Changing the Scoring</a></li>
|
||||
<li><a href="#algorithm">Appendix: Search Algorithm</a></li>
|
||||
</ol>
|
||||
</p>
|
||||
<a name="search"></a>
|
||||
<h2>Search</h2>
|
||||
<h2>Search Basics</h2>
|
||||
<p>
|
||||
Search over indices.
|
||||
|
||||
Applications usually call {@link
|
||||
Lucene offers a wide variety of {@link org.apache.lucene.search.Query} implementations, most of which are in
|
||||
this package, its subpackages ({@link org.apache.lucene.search.spans spans}, {@link org.apache.lucene.search.payloads payloads}),
|
||||
or the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. These implementations can be combined in a wide
|
||||
variety of ways to provide complex querying capabilities along with information about where matches took place in the document
|
||||
collection. The <a href="#query">Query Classes</a> section below highlights some of the more important Query classes. For details
|
||||
on implementing your own Query class, see <a href="#customQueriesExpert">Custom Queries -- Expert Level</a> below.
|
||||
</p>
|
||||
<p>
|
||||
To perform a search, applications usually call {@link
|
||||
org.apache.lucene.search.IndexSearcher#search(Query,int)} or {@link
|
||||
org.apache.lucene.search.IndexSearcher#search(Query,Filter,int)}.
|
||||
|
||||
</p>
|
||||
<p>
|
||||
Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
|
||||
process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
|
||||
implementation and its {@link org.apache.lucene.search.Scorer Scorer} instances. See the <a href="#algorithm">Algorithm</a>
|
||||
section for more notes on the process.
|
||||
</p>
|
||||
<!-- FILL IN MORE HERE -->
|
||||
<!-- TODO: this page over-links the same things too many times -->
|
||||
</p>
|
||||
@ -211,20 +226,118 @@ org.apache.lucene.search.IndexSearcher#search(Query,Filter,int)}.
|
||||
This type of query can be useful when accounting for spelling variations in the collection.
|
||||
</p>
|
||||
<a name="scoring"></a>
|
||||
<h2>Scoring — Introduction</h2>
|
||||
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides
|
||||
almost all of the complexity from the user. In a nutshell, it works. At least, that is,
|
||||
until it doesn't work, or doesn't work as one would expect it to work. Then we are left
|
||||
digging into Lucene internals or asking for help on
|
||||
<a href="mailto:java-user@lucene.apache.org">java-user@lucene.apache.org</a> to figure out
|
||||
why a document with five of our query terms scores lower than a different document with
|
||||
only one of the query terms.
|
||||
</p>
|
||||
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you
|
||||
to the places that can help you figure out the <i>what</i> and <i>why</i> of Lucene scoring.
|
||||
</p>
|
||||
<p>Lucene scoring supports a number of pluggable information retrieval
|
||||
<a href="http://en.wikipedia.org/wiki/Information_retrieval#Model_types">models</a>, including:
|
||||
<ul>
|
||||
<li><a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM)</a></li>
|
||||
<li><a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model">Probabilistic Models</a> such as
|
||||
<a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)">Okapi BM25</a> and
|
||||
<a href="http://en.wikipedia.org/wiki/Divergence-from-randomness_model">DFR</a></li>
|
||||
<li><a href="http://en.wikipedia.org/wiki/Language_model">Language models</a></li>
|
||||
</ul>
|
||||
These models can be plugged in via the {@link org.apache.lucene.search.similarities Similarity API},
|
||||
and offer extension hooks and parameters for tuning. In general, Lucene first narrows down the documents
|
||||
that need to be scored based on boolean logic in the Query specification, and then ranks this subset of
|
||||
documents via the retrieval model. For some valuable references on VSM and IR in general refer to
|
||||
<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
|
||||
</p>
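<p>As a rough illustration of the two phases described above (boolean matching first, then ranking by a
retrieval model), here is a minimal, self-contained sketch. It does not use Lucene's API: the class
<code>MatchThenRank</code>, its methods, and the raw term-count score are hypothetical stand-ins chosen
for brevity.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not Lucene code): filter by boolean logic, then rank the survivors. */
class MatchThenRank {

    /** Phase 1: a document matches only if it contains every required term. */
    static boolean matches(List<String> docTerms, List<String> requiredTerms) {
        return docTerms.containsAll(requiredTerms);
    }

    /** Phase 2: toy retrieval model; the score is the number of query-term occurrences. */
    static int score(List<String> docTerms, List<String> queryTerms) {
        int score = 0;
        for (String term : docTerms) {
            if (queryTerms.contains(term)) {
                score++;
            }
        }
        return score;
    }

    /** Returns the indices of matching documents, best score first. */
    static List<Integer> search(List<List<String>> docs, List<String> queryTerms) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (matches(docs.get(docId), queryTerms)) {
                hits.add(docId);
            }
        }
        // Only the documents that survived phase 1 are ever scored.
        hits.sort((a, b) -> Integer.compare(
            score(docs.get(b), queryTerms), score(docs.get(a), queryTerms)));
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "search"),
            Arrays.asList("lucene", "lucene", "search", "index"),
            Arrays.asList("index", "only"));
        System.out.println(search(docs, Arrays.asList("lucene", "search")));
    }
}
```

<p>In this sketch the third document is never scored at all, because it fails the boolean match.</p>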
|
||||
<p>The rest of this document will cover <a href="#scoringBasics">Scoring basics</a> and explain how to
|
||||
change your {@link org.apache.lucene.search.similarities.Similarity Similarity}. Next, it will cover
|
||||
ways you can customize the Lucene internals in
|
||||
<a href="#customQueriesExpert">Custom Queries -- Expert Level</a>, which gives details on
|
||||
implementing your own {@link org.apache.lucene.search.Query Query} class and related functionality.
|
||||
Finally, we will finish up with some reference material in the <a href="#algorithm">Appendix</a>.
|
||||
</p>
|
||||
<a name="scoringBasics"></a>
|
||||
<h2>Scoring — Basics</h2>
|
||||
<p>Scoring is very much dependent on the way documents are indexed, so it is important to understand
|
||||
indexing. (See <a href="{@docRoot}/overview-summary.html">Lucene overview</a> before continuing
|
||||
on with this section.) It is also assumed that readers know how to use the
|
||||
{@link org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int) IndexSearcher.explain(Query, doc)}
|
||||
functionality, which can go a long way in informing why a score is returned.
|
||||
</p>
|
||||
<h4>Fields and Documents</h4>
|
||||
<p>In Lucene, the objects we are scoring are {@link org.apache.lucene.document.Document Document}s.
|
||||
A Document is a collection of {@link org.apache.lucene.document.Field Field}s. Each Field has
|
||||
{@link org.apache.lucene.document.FieldType semantics} about how it is created and stored
|
||||
({@link org.apache.lucene.document.FieldType#tokenized() tokenized},
|
||||
{@link org.apache.lucene.document.FieldType#stored() stored}, etc). It is important to note that
|
||||
Lucene scoring works on Fields and then combines the results to return Documents. This is
|
||||
important because two Documents with the exact same content, but one having the content in two
|
||||
Fields and the other in one Field may return different scores for the same query due to length
|
||||
normalization.
|
||||
</p>
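<p>To make the length-normalization effect concrete, the following self-contained sketch scores the same
content stored either in one field or split across two. It is not Lucene code: the class
<code>FieldLengthNorm</code> is hypothetical, and the <code>1/sqrt(numTerms)</code> norm is an assumption
made for this illustration; the actual normalization depends on the Similarity in use.</p>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch (not Lucene code): per-field scoring with length normalization. */
class FieldLengthNorm {

    /** Assumed normalization for this sketch: shorter fields weigh more. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Scores one field: term frequency scaled by that field's length norm. */
    static double scoreField(List<String> fieldTerms, String queryTerm) {
        if (fieldTerms.isEmpty()) {
            return 0.0;
        }
        int termFreq = Collections.frequency(fieldTerms, queryTerm);
        return termFreq * lengthNorm(fieldTerms.size());
    }

    /** Scores a document by summing the scores of its fields. */
    static double scoreDocument(List<List<String>> fields, String queryTerm) {
        double total = 0.0;
        for (List<String> field : fields) {
            total += scoreField(field, queryTerm);
        }
        return total;
    }

    public static void main(String[] args) {
        // Same four terms, stored in one field versus split across two fields.
        List<List<String>> oneField = Collections.singletonList(
            Arrays.asList("lucene", "is", "a", "library"));
        List<List<String>> twoFields = Arrays.asList(
            Arrays.asList("lucene"),
            Arrays.asList("is", "a", "library"));
        System.out.println(scoreDocument(oneField, "lucene"));
        System.out.println(scoreDocument(twoFields, "lucene"));
    }
}
```

<p>Under these assumptions the split document scores higher, because the matching term sits in a much
shorter field.</p>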
|
||||
<h4>Score Boosting</h4>
|
||||
<p>Lucene allows influencing search results by "boosting" at more than one level:
|
||||
<ul>
|
||||
<li><b>Index-time boost</b> by calling
|
||||
{@link org.apache.lucene.document.Field#setBoost(float) Field.setBoost()} before a document is
|
||||
added to the index.</li>
|
||||
<li><b>Query-time boost</b> by setting a boost on a query clause, calling
|
||||
{@link org.apache.lucene.search.Query#setBoost(float) Query.setBoost()}.</li>
|
||||
</ul>
|
||||
</p>
|
||||
<p>Index-time boosts are pre-processed for storage efficiency and written to
|
||||
storage for a field as follows:
|
||||
<ul>
|
||||
<li>All boosts of that field (i.e. all boosts under the same field name in that doc) are
|
||||
multiplied.</li>
|
||||
<li>The boost is then encoded into a normalization value by the Similarity
|
||||
object at index-time: {@link org.apache.lucene.search.similarities.Similarity#computeNorm computeNorm()}.
|
||||
The actual encoding depends upon the Similarity implementation, but note that most
|
||||
use a lossy encoding (such as multiplying the boost with document length or similar, packed
|
||||
into a single byte!).</li>
|
||||
<li>Decoding of any index-time normalization values and integration into the document's score is also performed
|
||||
at search time by the Similarity.</li>
|
||||
</ul>
|
||||
</p>
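<p>The steps above (multiply the boosts, encode lossily into one byte, decode at search time) can be
sketched as follows. This is a deliberately simple, hypothetical encoding and not the one any Lucene
Similarity uses: the class <code>LossyNorm</code> and its fixed 1/16 quantization step are assumptions
chosen so the precision loss is easy to see.</p>

```java
/** Hypothetical sketch (not Lucene's actual encoding): index-time boosts folded into one lossy byte. */
class LossyNorm {

    /** Multiplies all boosts given under the same field name within one document. */
    static float combineBoosts(float... boosts) {
        float product = 1.0f;
        for (float boost : boosts) {
            product *= boost;
        }
        return product;
    }

    /** Index time: quantize to steps of 1/16 and clamp to the range of an unsigned byte. */
    static byte encode(float norm) {
        int steps = Math.round(norm * 16f);
        return (byte) Math.max(0, Math.min(255, steps));
    }

    /** Search time: recover the approximate value from the stored byte. */
    static float decode(byte stored) {
        return (stored & 0xFF) / 16f;
    }

    public static void main(String[] args) {
        float combined = combineBoosts(1.5f, 2.0f);
        System.out.println(decode(encode(combined))); // representable exactly in this scheme
        System.out.println(decode(encode(0.89f)));    // not representable, so precision is lost
    }
}
```

<p>Whatever the real encoding, the consequence is the same: small differences between boosts may vanish
once they are stored.</p>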
|
||||
<a name="changingScoring"></a>
|
||||
<h2>Changing Scoring — Similarity</h2>
|
||||
|
||||
<p>
|
||||
Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
|
||||
influence scoring. This is done at index-time with
|
||||
{@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(org.apache.lucene.search.similarities.Similarity)
|
||||
IndexWriterConfig.setSimilarity(Similarity)} and at query-time with
|
||||
{@link org.apache.lucene.search.IndexSearcher#setSimilarity(org.apache.lucene.search.similarities.Similarity)
|
||||
IndexSearcher.setSimilarity(Similarity)}.
|
||||
</p>
|
||||
<p>
|
||||
You can influence scoring by configuring a different built-in Similarity implementation, or by tweaking its
|
||||
parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can
|
||||
extend by plugging in a different component (e.g. term frequency normalizer).
|
||||
</p>
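<p>The idea of swapping a scoring model or tuning its parameters can be sketched without Lucene at all.
In the following hypothetical example, <code>ScoringModel</code>, <code>SqrtTfModel</code> and
<code>Bm25TfModel</code> are invented names, not Lucene classes; the second model uses the well-known
BM25 term-frequency saturation with its usual <code>k1</code> and <code>b</code> parameters and omits
the idf component for brevity.</p>

```java
/** Hypothetical sketch (not Lucene's Similarity API): swapping and tuning a scoring model. */
class PluggableScoring {

    /** Minimal stand-in for a retrieval model. */
    interface ScoringModel {
        double score(int termFreq, int docLength, double avgDocLength);
    }

    /** Toy vector-space-style model: dampened term frequency with length normalization. */
    static class SqrtTfModel implements ScoringModel {
        public double score(int termFreq, int docLength, double avgDocLength) {
            return Math.sqrt(termFreq) / Math.sqrt(docLength);
        }
    }

    /** BM25-style term-frequency saturation (idf omitted); k1 and b are the tuning parameters. */
    static class Bm25TfModel implements ScoringModel {
        private final double k1;
        private final double b;

        Bm25TfModel(double k1, double b) {
            this.k1 = k1;
            this.b = b;
        }

        public double score(int termFreq, int docLength, double avgDocLength) {
            double lengthPenalty = k1 * (1 - b + b * docLength / avgDocLength);
            return termFreq * (k1 + 1) / (termFreq + lengthPenalty);
        }
    }

    /** The caller depends only on the interface, so models can be exchanged freely. */
    static double scoreWith(ScoringModel model, int termFreq, int docLength, double avgDocLength) {
        return model.score(termFreq, docLength, avgDocLength);
    }

    public static void main(String[] args) {
        System.out.println(scoreWith(new SqrtTfModel(), 4, 16, 10.0));
        System.out.println(scoreWith(new Bm25TfModel(1.2, 0.75), 1, 10, 10.0));
        // With b = 0 the document length no longer influences the score.
        System.out.println(scoreWith(new Bm25TfModel(1.2, 0.0), 1, 100, 10.0));
    }
}
```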
|
||||
<p>
|
||||
Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
|
||||
to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
|
||||
a custom Similarity can access per-document values via {@link org.apache.lucene.search.FieldCache FieldCache} or
|
||||
{@link org.apache.lucene.index.DocValues} and integrate them into the score.
|
||||
</p>
|
||||
<p>
|
||||
See the {@link org.apache.lucene.search.similarities} package documentation for information
|
||||
on the available scoring models and extending or changing Similarity.
|
||||
on the built-in available scoring models and extending or changing Similarity.
|
||||
</p>
|
||||
<a name="customQueriesExpert"></a>
|
||||
<h2>Custom Queries — Expert Level</h2>
|
||||
|
||||
<h2>Changing Scoring — Expert Level</h2>
|
||||
|
||||
<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
|
||||
<p>Custom queries are an expert level task, so tread carefully and be prepared to share your code if
|
||||
you want help.
|
||||
</p>
|
||||
|
||||
<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
|
||||
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
|
||||
<span >three main classes</span>:
|
||||
when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by
|
||||
<span>three main classes</span>:
|
||||
<ol>
|
||||
<li>
|
||||
{@link org.apache.lucene.search.Query Query} — The abstract object representation of the
|
||||
@ -248,13 +361,13 @@ on the available scoring models and extending or changing Similarity.
|
||||
{@link org.apache.lucene.search.Query Query} class has several methods that are important for
|
||||
derived classes:
|
||||
<ol>
|
||||
<li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher} — A
|
||||
<li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)} — A
|
||||
{@link org.apache.lucene.search.Weight Weight} is the internal representation of the
|
||||
Query, so each Query implementation must
|
||||
provide an implementation of Weight. See the subsection on <a
|
||||
href="#weightClass">The Weight Interface</a> below for details on implementing the Weight
|
||||
interface.</li>
|
||||
<li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader} — Rewrites queries into primitive queries. Primitive queries are:
|
||||
<li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader)} — Rewrites queries into primitive queries. Primitive queries are:
|
||||
{@link org.apache.lucene.search.TermQuery TermQuery},
|
||||
{@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
|
||||
>and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)}</span></li>
|
||||
@ -363,5 +476,63 @@ on the available scoring models and extending or changing Similarity.
|
||||
back
|
||||
out of Lucene (similar to Doug adding SpanQuery functionality).</p>
|
||||
|
||||
<!-- TODO: integrate this better, its better served as an intro than an appendix -->
|
||||
<a name="algorithm"></a>
|
||||
<h2>Appendix: Search Algorithm</h2>
|
||||
<p>This section is mostly notes on stepping through the Scoring process and serves as
|
||||
fertilizer for the earlier sections.</p>
|
||||
<p>In the typical search application, a {@link org.apache.lucene.search.Query Query}
|
||||
is passed to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher},
|
||||
beginning the scoring process.</p>
|
||||
<p>Once inside the IndexSearcher, a {@link org.apache.lucene.search.Collector Collector}
|
||||
is used for the scoring and sorting of the search results.
|
||||
These important objects are involved in a search:
|
||||
<ol>
|
||||
<li>The {@link org.apache.lucene.search.Weight Weight} object of the Query. The
|
||||
Weight object is an internal representation of the Query that allows the Query
|
||||
to be reused by the IndexSearcher.</li>
|
||||
<li>The IndexSearcher that initiated the call.</li>
|
||||
<li>A {@link org.apache.lucene.search.Filter Filter} for limiting the result set.
|
||||
Note, the Filter may be null.</li>
|
||||
<li>A {@link org.apache.lucene.search.Sort Sort} object for specifying how to sort
|
||||
the results if the standard score-based sort method is not desired.</li>
|
||||
</ol>
|
||||
</p>
|
||||
<p>Assuming we are not sorting (since sorting doesn't affect the raw Lucene score),
|
||||
we call one of the search methods of the IndexSearcher, passing in the
|
||||
{@link org.apache.lucene.search.Weight Weight} object created by
|
||||
{@link org.apache.lucene.search.IndexSearcher#createNormalizedWeight(org.apache.lucene.search.Query)
|
||||
IndexSearcher.createNormalizedWeight(Query)},
|
||||
{@link org.apache.lucene.search.Filter Filter} and the number of results we want.
|
||||
This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
|
||||
which is an internal collection of search results. The IndexSearcher creates
|
||||
a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
|
||||
passes it along with the Weight and Filter to another expert search method (for
|
||||
more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
|
||||
see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
|
||||
uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
|
||||
top results for the search.
|
||||
</p>
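<p>The priority-queue idea mentioned above can be sketched with the JDK's own
<code>java.util.PriorityQueue</code>. This is not Lucene's TopScoreDocCollector: the class
<code>TopNCollector</code> and its <code>Hit</code> type are hypothetical, and the sketch only shows why
a min-heap of size N is enough to keep the N best hits.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch (not Lucene's TopScoreDocCollector): keep the N best hits in a min-heap. */
class TopNCollector {

    /** A (docId, score) pair. */
    static class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap ordered by score, so the weakest retained hit is always at the head.
    private final PriorityQueue<Hit> queue =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopNCollector(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    /** Called once per matching document. */
    void collect(int docId, float score) {
        if (queue.size() < n) {
            queue.add(new Hit(docId, score));
        } else if (score > queue.peek().score) {
            queue.poll();                       // evict the current weakest hit
            queue.add(new Hit(docId, score));
        }
    }

    /** Returns the retained doc ids, best score first. */
    List<Integer> topDocIds() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        List<Integer> ids = new ArrayList<>();
        for (Hit hit : hits) {
            ids.add(hit.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        TopNCollector collector = new TopNCollector(2);
        collector.collect(0, 0.5f);
        collector.collect(1, 2.0f);
        collector.collect(2, 1.0f);
        collector.collect(3, 0.1f);
        System.out.println(collector.topDocIds());
    }
}
```

<p>Each collected document costs at most one heap operation, so memory stays bounded by N no matter how
many documents match.</p>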
|
||||
<p>If a Filter is being used, some initial setup is done to determine which docs to include.
|
||||
Otherwise, we ask the Weight for a {@link org.apache.lucene.search.Scorer Scorer} for each
|
||||
{@link org.apache.lucene.index.IndexReader IndexReader} segment and proceed by calling
|
||||
{@link org.apache.lucene.search.Scorer#score(org.apache.lucene.search.Collector) Scorer.score()}.
|
||||
</p>
|
||||
<p>At last, we are actually going to score some documents. The score method takes in the Collector
|
||||
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here
|
||||
is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
|
||||
by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
|
||||
submitted. In most real world applications with multiple query terms, the
|
||||
{@link org.apache.lucene.search.Scorer Scorer} is going to be a <code>BooleanScorer2</code> created
|
||||
from {@link org.apache.lucene.search.BooleanQuery.BooleanWeight BooleanWeight} (see the section on
|
||||
<a href="#customQueriesExpert">custom queries</a> for info on changing this).
|
||||
</p>
|
||||
<p>Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the coord()
|
||||
factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query.
|
||||
Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the
|
||||
{@link org.apache.lucene.search.Scorer#nextDoc Scorer.nextDoc()} method. The nextDoc() method advances
|
||||
to the next document matching the query. This is an abstract method in the Scorer class and is thus
|
||||
overridden by all derived implementations. If you have a simple OR query your internal Scorer is most
|
||||
likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.</p>
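<p>The nextDoc()-driven loop over an OR query can be sketched as below. This is a simplified,
hypothetical model and not Lucene's DisjunctionSumScorer: <code>DisjunctionSketch</code> and its toy
<code>TermScorer</code> are invented for illustration, the <code>NO_MORE_DOCS</code> sentinel is an
assumption of this sketch, and the coord() factor is left out.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not Lucene's DisjunctionSumScorer): OR over per-term scorers. */
class DisjunctionSketch {

    /** Sentinel meaning "this scorer is exhausted" (an assumption of this sketch). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy per-term scorer over doc ids sorted in increasing order. */
    static class TermScorer {
        private final int[] docIds;
        private final float[] scores;
        private int pos = -1;

        TermScorer(int[] docIds, float[] scores) {
            this.docIds = docIds;
            this.scores = scores;
        }

        /** Current doc id, -1 before the first nextDoc() call, NO_MORE_DOCS when exhausted. */
        int docId() {
            if (pos < 0) {
                return -1;
            }
            return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
        }

        /** Advances to the next matching document and returns its id. */
        int nextDoc() {
            pos++;
            return docId();
        }

        /** Score of the current document for this term. */
        float score() {
            return scores[pos];
        }
    }

    /** Visits every doc matched by at least one sub-scorer and sums the matching sub-scores. */
    static Map<Integer, Float> scoreAll(TermScorer... subScorers) {
        Map<Integer, Float> result = new LinkedHashMap<>();
        for (TermScorer sub : subScorers) {
            sub.nextDoc(); // position each sub-scorer on its first document
        }
        while (true) {
            int doc = NO_MORE_DOCS;
            for (TermScorer sub : subScorers) {
                doc = Math.min(doc, sub.docId()); // smallest current doc id wins
            }
            if (doc == NO_MORE_DOCS) {
                break;
            }
            float sum = 0f;
            for (TermScorer sub : subScorers) {
                if (sub.docId() == doc) {
                    sum += sub.score();
                    sub.nextDoc();
                }
            }
            result.put(doc, sum);
        }
        return result;
    }

    public static void main(String[] args) {
        TermScorer a = new TermScorer(new int[] {1, 3}, new float[] {1.0f, 2.0f});
        TermScorer b = new TermScorer(new int[] {3, 5}, new float[] {0.5f, 4.0f});
        System.out.println(scoreAll(a, b));
    }
}
```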
|
||||
</body>
|
||||
</html>
|
||||
|
@ -39,7 +39,8 @@ package.
|
||||
<h2>Summary of the Ranking Methods</h2>
|
||||
|
||||
<p>{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene
|
||||
scoring function. It is based on a highly optimized Vector Space Model. For more
|
||||
scoring function. It is based on a highly optimized
|
||||
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more
|
||||
information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.</p>
|
||||
|
||||
<p>{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized
|
||||
|
@ -1,338 +0,0 @@
|
||||
<html>
|
||||
<head>
|
||||
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
||||
<title>Apache Lucene - Scoring</title>
|
||||
</head>
|
||||
<body>
|
||||
<h1>Apache Lucene - Scoring</h1>
|
||||
<div id="minitoc-area">
|
||||
<ul class="minitoc">
|
||||
<li>
|
||||
<a href="#Introduction">Introduction</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Scoring">Scoring</a>
|
||||
<ul class="minitoc">
|
||||
<li>
|
||||
<a href="#Fields and Documents">Fields and Documents</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Score Boosting">Score Boosting</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Understanding the Scoring Formula">Understanding the Scoring Formula</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#The Big Picture">The Big Picture</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Query Classes">Query Classes</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Changing Similarity">Changing Similarity</a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level</a>
|
||||
</li>
|
||||
<li>
|
||||
<a href="#Appendix">Appendix</a>
|
||||
<ul class="minitoc">
|
||||
<li>
|
||||
<a href="#Algorithm">Algorithm</a>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
|
||||
<a name="N10013"></a><a name="Introduction"></a>
|
||||
<h2 class="boxed">Introduction</h2>
|
||||
<div class="section">
|
||||
<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
|
||||
In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
|
||||
work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
|
||||
scores lower than a different document with only one of the query terms. </p>
|
||||
<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
|
||||
help you figure out the what and why of Lucene scoring.</p>
|
||||
<p>Lucene scoring uses a combination of the
|
||||
<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
|
||||
Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
|
||||
to determine
|
||||
how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
|
||||
times a query term appears in a document relative to
|
||||
the number of times the term appears in all the documents in the collection, the more relevant that
|
||||
document is to the query. It uses the Boolean model to first narrow down the documents that need to
|
||||
be scored based on the use of boolean logic in the Query specification. Lucene also adds some
|
||||
capabilities and refinements onto this model to support boolean and fuzzy searching, but it
|
||||
essentially remains a VSM based system at the heart.
|
||||
For some valuable references on VSM and IR in general refer to the
|
||||
<a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
|
||||
</p>
|
||||
<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
|
||||
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can
|
||||
customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
|
||||
-- Expert Level</a> which gives details on implementing your own
|
||||
<a href="core/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we
|
||||
will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
|
||||
</p>
|
||||
</div>
|
||||
|
||||
<a name="N10045"></a><a name="Scoring"></a>
|
||||
<h2 class="boxed">Scoring</h2>
|
||||
<div class="section">
|
||||
<p>Scoring is very much dependent on the way documents are indexed,
|
||||
so it is important to understand indexing (see
|
||||
<a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
|
||||
and the Lucene
|
||||
<a href="fileformats.html">file formats</a>
|
||||
before continuing on with this section.) It is also assumed that readers know how to use the
|
||||
<a href="core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
|
||||
which can go a long way in informing why a score is returned.
|
||||
</p>
|
||||
<a name="N10059"></a><a name="Fields and Documents"></a>
|
||||
<h3 class="boxed">Fields and Documents</h3>
|
||||
<p>In Lucene, the objects we are scoring are
|
||||
<a href="core/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection
|
||||
of
|
||||
<a href="core/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how
|
||||
it is created and stored (e.g. tokenized, untokenized, raw data, compressed, etc.). It is important to
|
||||
note that Lucene scoring works on Fields and then combines the results to return Documents. This is
|
||||
important because two Documents with the exact same content, but one having the content in two Fields
|
||||
and the other in one Field will return different scores for the same query due to length normalization
|
||||
(assuming the
|
||||
<a href="core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
|
||||
on the Fields).
|
||||
</p>
|
||||
<a name="N1006E"></a><a name="Score Boosting"></a>
|
||||
<h3 class="boxed">Score Boosting</h3>
|
||||
<p>Lucene allows influencing search results by "boosting" at more than one level:
|
||||
<ul>
|
||||
|
||||
<li>
|
||||
<b>Document level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
|
||||
before a document is added to the index.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<b>Document's Field level boosting</b>
|
||||
- while indexing - by calling
|
||||
<a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
|
||||
before adding a field to the document (and before adding the document to the index).
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<b>Query level boosting</b>
|
||||
- during search, by setting a boost on a query clause, calling
|
||||
<a href="core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</p>
|
||||
<p>Indexing time boosts are preprocessed for storage efficiency and written to
|
||||
the directory (when writing the document) in a single byte (!) as follows:
|
||||
For each field of a document, all boosts of that field
|
||||
(i.e. all boosts under the same field name in that doc) are multiplied.
|
||||
The result is multiplied by the boost of the document,
|
||||
and also multiplied by a "field length norm" value
|
||||
that represents the length of that field in that doc
|
||||
(so shorter fields are automatically boosted up).
|
||||
The result is encoded as a single byte
|
||||
(with some precision loss of course) and stored in the directory.
|
||||
The similarity object in effect at indexing computes the length-norm of the field.
|
||||
</p>
|
||||
<p>This composition of 1-byte representation of norms
|
||||
(that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
|
||||
is nicely described in
|
||||
<a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
|
||||
</p>
|
||||
<p>Encoding and decoding of the resulting float norm in a single byte are done by the
|
||||
static methods of the class Similarity:
|
||||
<a href="core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
|
||||
<a href="core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
|
||||
Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
|
||||
e.g. decode(encode(0.89)) = 0.75.
|
||||
At scoring (search) time, this norm is brought into the score of document
|
||||
as <b>norm(t, d)</b>, as shown by the formula in
|
||||
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>.
|
||||
</p>
|
||||
<a name="N100B1"></a><a name="Understanding the Scoring Formula"></a>
|
||||
<h3 class="boxed">Understanding the Scoring Formula</h3>
|
||||
<p>
|
||||
This scoring formula is described in the
|
||||
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
|
||||
basics of Lucene scoring work, especially the
|
||||
<a href="core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
|
||||
</p>
|
||||
<a name="N100C2"></a><a name="The Big Picture"></a>
|
||||
<h3 class="boxed">The Big Picture</h3>
|
||||
<p>OK, so the tf-idf formula and the
|
||||
<a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>
|
||||
are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are
|
||||
the use and interactions between the
|
||||
<a href="core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
|
||||
response to a user's information need.
|
||||
</p>
|
||||
<p>In this regard, Lucene offers a wide variety of <a href="core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
|
||||
<a href="core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
|
||||
These implementations can be combined in a wide variety of ways to provide complex querying
|
||||
capabilities along with
|
||||
information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
|
||||
section below
|
||||
highlights some of the more important Query classes. For information on the other ones, see the
|
||||
<a href="core/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
|
||||
your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
|
||||
Expert Level</a> below.
|
||||
</p>
|
||||
<p>Once a Query has been created and submitted to the
|
||||
<a href="core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
|
||||
begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
|
||||
control finally passes to the <a href="core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
|
||||
<a href="core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
|
||||
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
|
||||
(link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
|
||||
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
|
||||
(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
|
||||
</p>
|
||||
<p>
|
||||
Assuming the use of the BooleanWeight2, a
|
||||
BooleanScorer2 is created by bringing together
|
||||
all of the
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
|
||||
When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
|
||||
of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
|
||||
provided by each scorer while factoring in the coord() score.
|
||||
<!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
|
||||
</p>
|
||||
<a name="N10112"></a><a name="Query Classes"></a>
|
||||
<h3 class="boxed">Query Classes</h3>
|
||||
<p>For information on the Query Classes, refer to the
|
||||
<a href="core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>
|
||||
|
||||
</p>
|
||||
<a name="N1011F"></a><a name="Changing Similarity"></a>
|
||||
<h3 class="boxed">Changing Similarity</h3>
|
||||
<p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
|
||||
how to do this, see the
|
||||
<a href="core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>
|
||||
</p>
|
||||
</div>
|
||||
|
||||
<a name="N1012C"></a><a name="Changing your Scoring -- Expert Level"></a>
|
||||
<h2 class="boxed">Changing your Scoring -- Expert Level</h2>
|
||||
<div class="section">
|
||||
<p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more
|
||||
about how to do this, refer to the
|
||||
<a href="core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>
|
||||
|
||||
</p>
|
||||
</div>
|
||||
|
||||
|
||||
<a name="N10139"></a><a name="Appendix"></a>
|
||||
<h2 class="boxed">Appendix</h2>
|
||||
<div class="section">
|
||||
<a name="N1013E"></a><a name="Algorithm"></a>
|
||||
<h3 class="boxed">Algorithm</h3>
|
||||
<p>This section is mostly notes on stepping through the Scoring process and serves as
|
||||
fertilizer for the earlier sections.</p>
|
||||
<p>In the typical search application, a
|
||||
<a href="core/org/apache/lucene/search/Query.html">Query</a>
|
||||
is passed to the
|
||||
<a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
|
||||
, beginning the scoring process.
|
||||
</p>
|
||||
<p>Once inside the Searcher, a
|
||||
<a href="core/org/apache/lucene/search/Collector.html">Collector</a>
|
||||
is used for the scoring and sorting of the search results.
|
||||
These important objects are involved in a search:
|
||||
<ol>
|
||||
|
||||
<li>The
|
||||
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
|
||||
object of the Query. The Weight object is an internal representation of the Query that
|
||||
allows the Query to be reused by the Searcher.
|
||||
</li>
|
||||
|
||||
<li>The Searcher that initiated the call.</li>
|
||||
|
||||
<li>A
|
||||
<a href="core/org/apache/lucene/search/Filter.html">Filter</a>
|
||||
for limiting the result set. Note that the Filter may be null.
|
||||
</li>
|
||||
|
||||
<li>A
|
||||
<a href="core/org/apache/lucene/search/Sort.html">Sort</a>
|
||||
object for specifying how to sort the results if the standard score-based sort method is not
|
||||
desired.
|
||||
</li>
|
||||
|
||||
</ol>
|
||||
|
||||
</p>
|
||||
<p> Assuming we are not sorting (since sorting doesn't
|
||||
affect the raw Lucene score),
|
||||
we call one of the search methods of the Searcher, passing in the
|
||||
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
|
||||
object created by Searcher.createWeight(Query),
|
||||
<a href="core/org/apache/lucene/search/Filter.html">Filter</a>
|
||||
and the number of results we want. This method
|
||||
returns a
|
||||
<a href="core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
|
||||
object, which is an internal collection of search results.
|
||||
The Searcher creates a
|
||||
<a href="core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
|
||||
and passes it, along with the Weight and Filter, to another expert search method (for more on the
|
||||
<a href="core/org/apache/lucene/search/Collector.html">Collector</a>
|
||||
mechanism, see
|
||||
<a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
|
||||
.) The TopScoreDocCollector uses a
|
||||
<a href="core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
|
||||
to collect the top results for the search.
|
||||
</p>
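<p>The top-results collection described above can be sketched with a bounded min-heap. This is a simplified stand-in, not Lucene's TopScoreDocCollector: the class <code>TopNSketch</code>, its <code>Hit</code> record and the method names are hypothetical, and it uses <code>java.util.PriorityQueue</code> rather than Lucene's own PriorityQueue. Tie-breaking between equal scores is left out.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative top-N collector; an assumption-laden sketch, not Lucene code. */
class TopNSketch {
    /** Hypothetical hit record: a document id and its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int n;
    // Min-heap ordered by score, so the weakest retained hit is at the head.
    private final PriorityQueue<Hit> queue;

    TopNSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
        Comparator<Hit> byScore = (a, b) -> Float.compare(a.score, b.score);
        this.queue = new PriorityQueue<Hit>(n, byScore);
    }

    /** Called once per matching document, in any order. */
    void collect(int doc, float score) {
        if (queue.size() < n) {
            queue.add(new Hit(doc, score));
        } else if (score > queue.peek().score) {
            // The new hit beats the weakest retained hit: swap it in.
            queue.poll();
            queue.add(new Hit(doc, score));
        }
    }

    /** Returns the retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

<p>Keeping the weakest hit at the head of the heap means each collected document costs at most one comparison unless it actually displaces an existing result, which is why a priority queue suits this job better than sorting all matches.</p>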
|
||||
<p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
|
||||
we ask the Weight for
|
||||
a
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
|
||||
for the
|
||||
<a href="core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
|
||||
of the current searcher and we proceed by
|
||||
calling the score method on the
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
|
||||
.
|
||||
</p>
|
||||
<p>At last, we are actually going to score some documents. The score method takes in the Collector
|
||||
(most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
|
||||
Of course, here is where things get involved. The
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
|
||||
that is returned by the
|
||||
<a href="core/org/apache/lucene/search/Weight.html">Weight</a>
|
||||
object depends on what type of Query was submitted. In most real world applications with multiple
|
||||
query terms,
|
||||
the
|
||||
<a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
|
||||
is going to be a
|
||||
<a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
|
||||
(see the section on customizing your scoring for information on changing this).
|
||||
|
||||
</p>
|
||||
<p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
|
||||
coord() factor. We then
|
||||
get an internal Scorer based on the required, optional and prohibited parts of the query.
|
||||
Using this internal Scorer, the BooleanScorer2 then proceeds
|
||||
into a while loop based on the Scorer#next() method. The next() method advances to the next document
|
||||
matching the query. This is an
|
||||
abstract method in the Scorer class and is thus overridden by all derived
|
||||
implementations. <!-- DOUBLE CHECK THIS -->If you have a simple OR query
|
||||
your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores
|
||||
from the sub scorers of the OR'd terms.</p>
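<p>The next()-driven loop and the way an OR query combines its sub-scorers can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DisjunctionSumScorer: <code>DisjunctionSketch</code> and <code>TermSketch</code> are hypothetical names, each sub-scorer is modelled as a sorted array of document ids with precomputed scores, and the coord() factor is omitted.</p>

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative OR-scorer that sums sub-scorer scores; not Lucene code. */
class DisjunctionSketch {
    /** Hypothetical sub-scorer over one term's postings, sorted by doc id. */
    static final class TermSketch {
        private final int[] docs;
        private final float[] scores;
        private int pos = -1;

        TermSketch(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        /** Advances to the next matching document; false when exhausted. */
        boolean next() {
            return ++pos < docs.length;
        }

        int doc() {
            return docs[pos];
        }

        float score() {
            return scores[pos];
        }
    }

    // Sub-scorers that are still positioned on a document.
    private final List<TermSketch> active = new ArrayList<>();
    private int currentDoc = -1;
    private float currentScore;

    DisjunctionSketch(List<TermSketch> subScorers) {
        for (TermSketch s : subScorers) {
            if (s.next()) {
                active.add(s);
            }
        }
    }

    /** Advances to the next document matched by at least one sub-scorer. */
    boolean next() {
        if (active.isEmpty()) {
            return false;
        }
        int min = Integer.MAX_VALUE;
        for (TermSketch s : active) {
            min = Math.min(min, s.doc());
        }
        float sum = 0f;
        for (Iterator<TermSketch> it = active.iterator(); it.hasNext();) {
            TermSketch s = it.next();
            if (s.doc() == min) {
                // This sub-scorer matches the current document: add its score
                // and move it on, dropping it once it has no more documents.
                sum += s.score();
                if (!s.next()) {
                    it.remove();
                }
            }
        }
        currentDoc = min;
        currentScore = sum;
        return true;
    }

    int doc() {
        return currentDoc;
    }

    float score() {
        return currentScore;
    }
}
```

<p>A caller would drive this with a loop of the form <code>while (scorer.next()) { ... scorer.doc(), scorer.score() ... }</code>, handing each document and score to whatever collector is in use.</p>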
|
||||
</div>
|
||||
|
||||
</body>
|
||||
</html>
|
@ -61,7 +61,7 @@
|
||||
<ul>
|
||||
<li><a href="changes/Changes.html">Changes</a>: List of changes in this release.</li>
|
||||
<li><a href="fileformats.html">File Formats</a>: Guide to the index format used by Lucene.</li>
|
||||
<li><a href="scoring.html">Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
|
||||
<li><a href="core/org/apache/lucene/search/package-summary.html#package_description">Search and Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
|
||||
<li><a href="core/org/apache/lucene/search/similarities/TFIDFSimilarity.html">Classic Scoring Formula</a>: Formula of Lucene's classic <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space</a> implementation. (look <a href="core/org/apache/lucene/search/similarities/package-summary.html#package_description">here</a> for other models)</li>
|
||||
<li><a href="queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description">Classic QueryParser Syntax</a>: Overview of the Classic QueryParser's syntax and features.</li>
|
||||
<li><a href="facet/org/apache/lucene/facet/doc-files/userguide.html">Facet User Guide</a>: User's Guide to implementing <a href="http://en.wikipedia.org/wiki/Faceted_search">Faceted search</a>.</li>
|
||||
|