Update javadocs for Lucene 8.

This fixes a couple of mistakes, puts more emphasis on BM25 compared to Classic and
gives more guidance regarding custom scores without a custom query.
Adrien Grand 2018-09-03 12:21:12 +02:00
parent d93c46ea94
commit a1ec716e10
5 changed files with 102 additions and 89 deletions

View File

@@ -110,8 +110,10 @@
 * inverted index, is composed of "postings." The postings, with their term dictionary, can be
* thought of as a map that provides efficient lookup given a {@link org.apache.lucene.index.Term}
* (roughly, a word or token), to (the ordered list of) {@link org.apache.lucene.document.Document}s
* containing that Term. Codecs may additionally record
* {@link org.apache.lucene.index.ImpactsEnum#getImpacts impacts} alongside postings in order to be
* able to skip over low-scoring documents at search time. Postings do not provide any way of
* retrieving terms given a document, short of scanning the entire index.</p>
*
* <a name="stored-fields"></a>
* <p>Stored fields are essentially the opposite of postings, providing efficient retrieval of field

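The term-to-documents asymmetry described above can be illustrated with a toy in-memory sketch using plain Java collections (an illustration only, not Lucene's actual codec structures):

```java
import java.util.*;

// Toy postings: term -> sorted list of ids of documents containing it.
// Looking up the documents for a term is one map access; recovering the
// terms of a document requires scanning every posting list.
class ToyPostings {
    static Map<String, List<Integer>> index(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // de-duplicate terms within a document before appending the doc id
            for (String term : new TreeSet<>(Arrays.asList(docs.get(docId).split("\\s+")))) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    // Cheap: a single lookup in the term dictionary.
    static List<Integer> docsContaining(Map<String, List<Integer>> postings, String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    // Expensive: must scan the entire index to recover a document's terms.
    static List<String> termsOf(Map<String, List<Integer>> postings, int docId) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) terms.add(e.getKey());
        }
        return terms;
    }
}
```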
View File

@@ -28,6 +28,10 @@ import org.apache.lucene.util.automaton.Automaton;
* <p>This query matches the documents looking for terms that fall into the
* supplied range according to {@link BytesRef#compareTo(BytesRef)}.
*
 * <p><b>NOTE</b>: {@link TermRangeQuery} is significantly slower than
 * {@link PointRangeQuery point-based ranges} as it needs to visit all terms
 * that match the range and merge their matches.
*
* <p>This query uses the {@link
* MultiTermQuery#CONSTANT_SCORE_REWRITE}
* rewrite method.

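The cost the NOTE above describes can be sketched with a toy sorted term dictionary: every term inside the range must be visited and its postings merged (plain Java collections as an illustration, not Lucene's actual term dictionary):

```java
import java.util.*;

// Toy term-range evaluation: visit every dictionary term in [lower, upper)
// (lexicographic order, mirroring BytesRef.compareTo for ASCII terms) and
// union the matching doc ids. Cost grows with the number of terms in range.
class ToyTermRange {
    static SortedSet<Integer> matches(SortedMap<String, List<Integer>> postings,
                                      String lower, String upper) {
        SortedSet<Integer> result = new TreeSet<>();
        for (List<Integer> docs : postings.subMap(lower, upper).values()) {
            result.addAll(docs); // merge the matches of each visited term
        }
        return result;
    }
}
```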
View File

@@ -44,7 +44,7 @@
* <p>
* Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
* process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
* implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScorer}
* instances. See the <a href="#algorithm">Algorithm</a> section for more notes on the process.
* <!-- FILL IN MORE HERE -->
* <!-- TODO: this page over-links the same things too many times -->
@@ -95,9 +95,11 @@
* If a query is made up of all SHOULD clauses, then every document in the result
* set matches at least one of these clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} &mdash; Use this operator when a clause is required to occur in the result set and should
* contribute to the score. Every document in the result set will match all such clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#FILTER FILTER} &mdash; Use this operator when a clause is required to occur in the result set but
* should not contribute to the score. Every document in the result set will match all such clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST_NOT MUST NOT} &mdash; Use this operator when a
* clause must not occur in the result set. No
@@ -113,7 +115,7 @@
* {@link org.apache.lucene.search.TermQuery TermQuery} clauses,
* for example by {@link org.apache.lucene.search.WildcardQuery WildcardQuery}.
* The default setting for the maximum number
* of clauses is 1024, but this can be changed via the
* static method {@link org.apache.lucene.search.BooleanQuery#setMaxClauseCount(int)}.
*
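The four Occur operators described above can be modeled over sets of matching doc ids: MUST and FILTER intersect, SHOULD unions (and, in a pure-SHOULD query, at least one SHOULD clause must match), MUST_NOT subtracts. The toy evaluator below illustrates these assumed semantics and is not Lucene's implementation:

```java
import java.util.*;

// Toy model of BooleanClause.Occur combination semantics over doc-id sets.
class ToyBoolean {
    static Set<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> filter,
                                 List<Set<Integer>> should, List<Set<Integer>> mustNot,
                                 Set<Integer> allDocs) {
        Set<Integer> result = new TreeSet<>(allDocs);
        for (Set<Integer> m : must) result.retainAll(m);   // required, scored
        for (Set<Integer> f : filter) result.retainAll(f); // required, not scored
        if (must.isEmpty() && filter.isEmpty() && !should.isEmpty()) {
            // pure-SHOULD query: every hit matches at least one SHOULD clause
            Set<Integer> union = new TreeSet<>();
            for (Set<Integer> s : should) union.addAll(s);
            result.retainAll(union);
        }
        for (Set<Integer> n : mustNot) result.removeAll(n); // forbidden
        return result;
    }
}
```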
* <h3>Phrases</h3>
@@ -149,23 +151,6 @@
* </ol>
*
* <h3>
* {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery}
* </h3>
*
@@ -274,6 +259,7 @@
*
* <a name="changingScoring"></a>
* <h2>Changing Scoring &mdash; Similarity</h2>
* <h3>Changing the scoring formula</h3>
* <p>
* Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
* influence scoring, this is done at index-time with
@@ -289,14 +275,54 @@
* extend by plugging in a different component (e.g. term frequency normalizer).
* <p>
* Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
* to implement a new retrieval model.
* <p>
* See the {@link org.apache.lucene.search.similarities} package documentation for information
* on the built-in available scoring models and extending or changing Similarity.
*
* <h3>Integrating field values into the score</h3>
 * <p>While similarities help score a document relative to a query, it is also common for documents to hold
 * features that measure the quality of a match. Such features are best integrated into the score by indexing
 * a {@link org.apache.lucene.document.FeatureField FeatureField} with the document at index-time, and then
 * combining the similarity score and the feature score using a linear combination. For instance, the
 * query below matches the same documents as {@code originalQuery} and computes scores as
 * {@code similarityScore + 0.7 * featureScore}:
* <pre class="prettyprint">
* Query originalQuery = new BooleanQuery.Builder()
* .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
* .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
* .build();
* Query featureQuery = FeatureField.newSaturationQuery("features", "pagerank");
* Query query = new BooleanQuery.Builder()
* .add(originalQuery, Occur.MUST)
* .add(new BoostQuery(featureQuery, 0.7f), Occur.SHOULD)
* .build();
* </pre>
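As a sanity check on the combination arithmetic: a saturation feature score has the form x / (x + pivot) and is therefore always in [0, 1), so the 0.7 boost bounds the feature's contribution to the final score. The pivot value below is hypothetical, purely for illustration:

```java
// Plain-arithmetic sketch of the combined score above; not Lucene code.
class CombinedScore {
    // Saturation of a raw feature value x with a pivot k: x / (x + k).
    static double saturation(double x, double k) {
        return x / (x + k);
    }

    // similarityScore + 0.7 * featureScore, as in the BooleanQuery above.
    static double combined(double similarityScore, double featureScore) {
        return similarityScore + 0.7 * featureScore;
    }
}
```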
 *
* <p>A less efficient yet more flexible way of modifying scores is to index scoring features into
* doc-value fields and then combine them with the similarity score using a
* <a href="{@docRoot}/../queries/org/apache/lucene/queries/function/FunctionScoreQuery.html">FunctionScoreQuery</a>
 * from the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. For instance,
 * the example below shows how to compute scores as {@code similarityScore * Math.log(popularity)}
* using the <a href="{@docRoot}/../expressions/overview-summary.html">expressions module</a> and
* assuming that values for the {@code popularity} field have been set in a
* {@link org.apache.lucene.document.NumericDocValuesField NumericDocValuesField} at index time:
* <pre class="prettyprint">
* // compile an expression:
* Expression expr = JavascriptCompiler.compile("_score * ln(popularity)");
*
* // SimpleBindings just maps variables to SortField instances
* SimpleBindings bindings = new SimpleBindings();
* bindings.add(new SortField("_score", SortField.Type.SCORE));
* bindings.add(new SortField("popularity", SortField.Type.INT));
*
* // create a query that matches based on 'originalQuery' but
* // scores using expr
* Query query = new FunctionScoreQuery(
* originalQuery,
* expr.getDoubleValuesSource(bindings));
* </pre>
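The per-document arithmetic of that expression can be checked directly; the method below stands in for what the compiled expression computes per matching document:

```java
// Plain-Java equivalent of the "_score * ln(popularity)" expression above;
// an illustration, not the expressions module itself.
class ExprScore {
    static double score(double similarityScore, long popularity) {
        return similarityScore * Math.log(popularity);
    }
}
```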
*
* <a name="customQueriesExpert"></a>
* <h2>Custom Queries &mdash; Expert Level</h2>
*
@@ -311,15 +337,14 @@
* {@link org.apache.lucene.search.Query Query} &mdash; The abstract object representation of the
* user's information need.</li>
* <li>
 * {@link org.apache.lucene.search.Weight Weight} &mdash; A specialization of a Query for a given
 * index. This typically associates a Query object with index statistics that are later used to
 * compute document scores.</li>
* <li>
* {@link org.apache.lucene.search.Scorer Scorer} &mdash; The core class of the scoring process:
* for a given segment, scorers return {@link org.apache.lucene.search.Scorer#iterator iterators}
* over matches and give a way to compute the {@link org.apache.lucene.search.Scorer#score score}
* of these matches.</li>
* <li>
* {@link org.apache.lucene.search.BulkScorer BulkScorer} &mdash; An abstract class that scores
* a range of documents. A default implementation simply iterates through the hits from
@@ -338,7 +363,7 @@
* {@link org.apache.lucene.search.Query Query} class has several methods that are important for
* derived classes:
* <ol>
* <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)} &mdash; A
* {@link org.apache.lucene.search.Weight Weight} is the internal representation of the
* Query, so each Query implementation must
* provide an implementation of Weight. See the subsection on <a
@@ -347,7 +372,7 @@
* <li>{@link org.apache.lucene.search.Query#rewrite(org.apache.lucene.index.IndexReader) rewrite(IndexReader reader)} &mdash; Rewrites queries into primitive queries. Primitive queries are:
* {@link org.apache.lucene.search.TermQuery TermQuery},
* {@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
 * >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)}</span></li>
* </ol>
* <a name="weightClass"></a>
* <h3>The Weight Interface</h3>
@@ -356,23 +381,15 @@
* interface provides an internal representation of the Query so that it can be reused. Any
* {@link org.apache.lucene.search.IndexSearcher IndexSearcher}
* dependent state should be stored in the Weight implementation,
* not in the Query class. The interface defines four main methods:
* <ol>
* <li>
* {@link org.apache.lucene.search.Weight#scorer scorer()} &mdash;
* Construct a new {@link org.apache.lucene.search.Scorer Scorer} for this Weight. See <a href="#scorerClass">The Scorer Class</a>
* below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents
* given the Query.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#explain(org.apache.lucene.index.LeafReaderContext, int)
* explain(LeafReaderContext context, int doc)} &mdash; Provide a means for explaining why a given document was
* scored the way it was.
@@ -380,6 +397,16 @@
* that scores via a {@link org.apache.lucene.search.similarities.Similarity Similarity} will make use of the Similarity's implementation:
* {@link org.apache.lucene.search.similarities.Similarity.SimScorer#explain(Explanation, long) SimScorer#explain(Explanation freq, long norm)}.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#extractTerms(java.util.Set) extractTerms(Set&lt;Term&gt; terms)} &mdash; Extract terms that
* this query operates on. This is typically used to support distributed search: knowing the terms that a query operates on helps
 * merge index statistics of these terms so that scores are computed over a subset of the data as they would be if all documents
* were in the same index.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#matches matches(LeafReaderContext context, int doc)} &mdash; Give information about positions
 * and offsets of matches. This is typically useful for implementing highlighting.
* </li>
* </ol>
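The statistics merging that extractTerms enables can be illustrated with a toy shard merge: summing per-shard document frequencies yields the global statistics every shard should score against. The shard maps are hypothetical and this is not Lucene's distributed implementation:

```java
import java.util.*;

// Toy merge of per-shard docFreq maps into global term statistics.
class GlobalStats {
    static Map<String, Long> mergeDocFreqs(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new TreeMap<>();
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Long::sum); // sum frequencies per term
            }
        }
        return global;
    }
}
```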
* <a name="scorerClass"></a>
* <h3>The Scorer Class</h3>
@@ -458,17 +485,13 @@
* This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
* which is an internal collection of search results. The IndexSearcher creates
* a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
* passes it along with the Weight to another expert search method (for
* more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
* see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
* uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
* top results for the search.
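The PriorityQueue-based collection can be sketched with toy doc/score pairs: a min-heap of size k keeps the k best hits, evicting the lowest-scoring kept hit first. An illustration of the idea, not TopScoreDocCollector itself:

```java
import java.util.*;

// Toy top-k collector over {docId, score} pairs using a min-heap of size k.
class TopK {
    static List<double[]> collect(int k, double[][] hits) {
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] h) -> h[1]));
        for (double[] hit : hits) {
            heap.offer(hit);
            if (heap.size() > k) heap.poll(); // drop the current lowest score
        }
        List<double[]> top = new ArrayList<>(heap);
        top.sort((a, b) -> Double.compare(b[1], a[1])); // best first
        return top;
    }
}
```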
* <p>At last, we are actually going to score some documents. The score method takes in the Collector
* (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here
* is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
* by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
* submitted. In most real world applications with multiple query terms, the

View File

@@ -73,9 +73,9 @@
* your searching needs.
* However, in some applications it may be necessary to customize your <a
* href="Similarity.html">Similarity</a> implementation. For instance, some
* applications do not need to distinguish between shorter and longer documents
* and could set BM25's {@link org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b}
* parameter to {@code 0}.
*
* <p>To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
* searching, and the changes must happen before
@@ -83,15 +83,27 @@
* just isn't well-defined what is going to happen.
*
* <p>To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely
* you'll want to simply subclass {@link org.apache.lucene.search.similarities.SimilarityBase}), and
* then register the new class by calling
* {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
* before indexing and
* {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
* before searching.
*
* <h3>Tuning {@linkplain org.apache.lucene.search.similarities.BM25Similarity}</h3>
* <p>{@link org.apache.lucene.search.similarities.BM25Similarity} has
* two parameters that may be tuned:
* <ul>
 * <li><tt>k1</tt>, which calibrates term frequency saturation and must be
 * non-negative. A value of {@code 0} causes term frequency to be ignored
 * entirely, so documents are scored based only on the <tt>IDF</tt> of the
 * matched terms. Higher values of <tt>k1</tt> increase the impact of term
 * frequency on the final score. The default value is {@code 1.2}.</li>
 * <li><tt>b</tt>, which controls how much document length normalizes term
 * frequency values and must be in {@code [0, 1]}. A value of {@code 0}
 * disables length normalization completely. The default value is {@code 0.75}.</li>
* </ul>
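The effect of both parameters can be verified directly on BM25's term-frequency factor (the part k1 and b control; idf omitted). This is a minimal arithmetic sketch, not BM25Similarity's code:

```java
// BM25 term-frequency factor:
// tfNorm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
class Bm25Sketch {
    static double tfNorm(double tf, double k1, double b, double docLen, double avgDocLen) {
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }
}
```

With k1 = 0 the factor is 1 for any positive tf (term frequency ignored); with b = 0 the document length drops out of the formula (length normalization disabled).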
*
* <h3>Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}</h3>
* <p>
* The easiest way to quickly implement a new ranking method is to extend
@@ -112,33 +124,5 @@
* subclassing the Similarity, one can simply introduce a new basic model and tell
* {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
*
*/
package org.apache.lucene.search.similarities;

View File

@@ -35,7 +35,7 @@ to check if the results are what we expect):</p>
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
@@ -50,7 +50,7 @@ to check if the results are what we expect):</p>
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {