Update javadocs for Lucene 8.

This fixes a couple of mistakes, puts more emphasis on BM25 compared to Classic and
gives more guidance regarding custom scores without a custom query.
Adrien Grand 2018-09-03 12:21:12 +02:00
parent d93c46ea94
commit a1ec716e10
5 changed files with 102 additions and 89 deletions

View File

@@ -110,8 +110,10 @@
 * inverted index, is composed of "postings." The postings, with their term dictionary, can be
* thought of as a map that provides efficient lookup given a {@link org.apache.lucene.index.Term}
* (roughly, a word or token), to (the ordered list of) {@link org.apache.lucene.document.Document}s
* containing that Term. Codecs may additionally record
* {@link org.apache.lucene.index.ImpactsEnum#getImpacts impacts} alongside postings in order to be
* able to skip over low-scoring documents at search time. Postings do not provide any way of
* retrieving terms given a document, short of scanning the entire index.</p>
*
* <a name="stored-fields"></a>
* <p>Stored fields are essentially the opposite of postings, providing efficient retrieval of field

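The term-to-documents asymmetry described above can be illustrated with a toy in-memory sketch using plain Java collections (an illustration only, not Lucene's actual codec structures):

```java
import java.util.*;

// Toy postings: term -> sorted list of ids of documents containing it.
// Looking up the documents for a term is one map access; recovering the
// terms of a document requires scanning every posting list.
class ToyPostings {
    static Map<String, List<Integer>> index(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // de-duplicate terms within a document before appending the doc id
            for (String term : new TreeSet<>(Arrays.asList(docs.get(docId).split("\\s+")))) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    // Cheap: a single lookup in the term dictionary.
    static List<Integer> docsContaining(Map<String, List<Integer>> postings, String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    // Expensive: must scan the entire index to recover a document's terms.
    static List<String> termsOf(Map<String, List<Integer>> postings, int docId) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) terms.add(e.getKey());
        }
        return terms;
    }
}
```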
View File

@@ -28,6 +28,10 @@ import org.apache.lucene.util.automaton.Automaton;
* <p>This query matches the documents looking for terms that fall into the
* supplied range according to {@link BytesRef#compareTo(BytesRef)}.
*
 * <p><b>NOTE</b>: {@link TermRangeQuery} is significantly slower than
 * {@link PointRangeQuery point-based ranges} as it needs to visit all terms
 * that match the range and merge their matches.
*
* <p>This query uses the {@link
* MultiTermQuery#CONSTANT_SCORE_REWRITE}
* rewrite method.

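The cost the NOTE above describes can be sketched with a toy sorted term dictionary: every term inside the range must be visited and its postings merged (plain Java collections as an illustration, not Lucene's actual term dictionary):

```java
import java.util.*;

// Toy term-range evaluation: visit every dictionary term in [lower, upper)
// (lexicographic order, mirroring BytesRef.compareTo for ASCII terms) and
// union the matching doc ids. Cost grows with the number of terms in range.
class ToyTermRange {
    static SortedSet<Integer> matches(SortedMap<String, List<Integer>> postings,
                                      String lower, String upper) {
        SortedSet<Integer> result = new TreeSet<>();
        for (List<Integer> docs : postings.subMap(lower, upper).values()) {
            result.addAll(docs); // merge the matches of each visited term
        }
        return result;
    }
}
```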
View File

@@ -44,7 +44,7 @@
* <p>
* Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
* process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
* implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScorer}
* instances. See the <a href="#algorithm">Algorithm</a> section for more notes on the process.
* <!-- FILL IN MORE HERE -->
* <!-- TODO: this page over-links the same things too many times -->
@@ -95,9 +95,11 @@
* If a query is made up of all SHOULD clauses, then every document in the result
* set matches at least one of these clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} &mdash; Use this operator when a clause is required to occur in the result set and should
* contribute to the score. Every document in the result set will match all such clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#FILTER FILTER} &mdash; Use this operator when a clause is required to occur in the result set but
* should not contribute to the score. Every document in the result set will match all such clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST_NOT MUST NOT} &mdash; Use this operator when a
* clause must not occur in the result set. No
@@ -113,7 +115,7 @@
* {@link org.apache.lucene.search.TermQuery TermQuery} clauses,
* for example by {@link org.apache.lucene.search.WildcardQuery WildcardQuery}.
* The default setting for the maximum number
* of clauses is 1024, but this can be changed via the
* static method {@link org.apache.lucene.search.BooleanQuery#setMaxClauseCount(int)}.
*
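The four Occur operators described above can be modeled over sets of matching doc ids: MUST and FILTER intersect, SHOULD unions (and, in a pure-SHOULD query, at least one SHOULD clause must match), MUST_NOT subtracts. The toy evaluator below illustrates these assumed semantics and is not Lucene's implementation:

```java
import java.util.*;

// Toy model of BooleanClause.Occur combination semantics over doc-id sets.
class ToyBoolean {
    static Set<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> filter,
                                 List<Set<Integer>> should, List<Set<Integer>> mustNot,
                                 Set<Integer> allDocs) {
        Set<Integer> result = new TreeSet<>(allDocs);
        for (Set<Integer> m : must) result.retainAll(m);   // required, scored
        for (Set<Integer> f : filter) result.retainAll(f); // required, not scored
        if (must.isEmpty() && filter.isEmpty() && !should.isEmpty()) {
            // pure-SHOULD query: every hit matches at least one SHOULD clause
            Set<Integer> union = new TreeSet<>();
            for (Set<Integer> s : should) union.addAll(s);
            result.retainAll(union);
        }
        for (Set<Integer> n : mustNot) result.removeAll(n); // forbidden
        return result;
    }
}
```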
* <h3>Phrases</h3>
@@ -149,23 +151,6 @@
* </ol>
*
* <h3>
* {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery}
* </h3>
*
@@ -274,6 +259,7 @@
*
* <a name="changingScoring"></a>
* <h2>Changing Scoring &mdash; Similarity</h2>
* <h3>Changing the scoring formula</h3>
* <p>
* Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
* influence scoring, this is done at index-time with
@@ -289,14 +275,54 @@
* extend by plugging in a different component (e.g. term frequency normalizer).
* <p>
* Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
* to implement a new retrieval model.
* <p>
* See the {@link org.apache.lucene.search.similarities} package documentation for information
* on the built-in available scoring models and extending or changing Similarity.
*
* <h3>Integrating field values into the score</h3>
 * <p>While similarities help score a document relative to a query, it is also common for documents to hold
 * features that measure the quality of a match. Such features are best integrated into the score by indexing
 * a {@link org.apache.lucene.document.FeatureField FeatureField} with the document at index-time, and then
 * combining the similarity score and the feature score using a linear combination. For instance, the
 * query below matches the same documents as {@code originalQuery} and computes scores as
 * {@code similarityScore + 0.7 * featureScore}:
* <pre class="prettyprint">
* Query originalQuery = new BooleanQuery.Builder()
* .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
* .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
* .build();
* Query featureQuery = FeatureField.newSaturationQuery("features", "pagerank");
* Query query = new BooleanQuery.Builder()
* .add(originalQuery, Occur.MUST)
* .add(new BoostQuery(featureQuery, 0.7f), Occur.SHOULD)
* .build();
* </pre>
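As a sanity check on the combination arithmetic: a saturation feature score has the form x / (x + pivot) and is therefore always in [0, 1), so the 0.7 boost bounds the feature's contribution to the final score. The pivot value below is hypothetical, purely for illustration:

```java
// Plain-arithmetic sketch of the combined score above; not Lucene code.
class CombinedScore {
    // Saturation of a raw feature value x with a pivot k: x / (x + k).
    static double saturation(double x, double k) {
        return x / (x + k);
    }

    // similarityScore + 0.7 * featureScore, as in the BooleanQuery above.
    static double combined(double similarityScore, double featureScore) {
        return similarityScore + 0.7 * featureScore;
    }
}
```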
 *
* <p>A less efficient yet more flexible way of modifying scores is to index scoring features into
* doc-value fields and then combine them with the similarity score using a
* <a href="{@docRoot}/../queries/org/apache/lucene/queries/function/FunctionScoreQuery.html">FunctionScoreQuery</a>
 * from the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. For instance,
 * the example below shows how to compute scores as {@code similarityScore * Math.log(popularity)}
* using the <a href="{@docRoot}/../expressions/overview-summary.html">expressions module</a> and
* assuming that values for the {@code popularity} field have been set in a
* {@link org.apache.lucene.document.NumericDocValuesField NumericDocValuesField} at index time:
* <pre class="prettyprint">
* // compile an expression:
* Expression expr = JavascriptCompiler.compile("_score * ln(popularity)");
*
* // SimpleBindings just maps variables to SortField instances
* SimpleBindings bindings = new SimpleBindings();
* bindings.add(new SortField("_score", SortField.Type.SCORE));
* bindings.add(new SortField("popularity", SortField.Type.INT));
*
* // create a query that matches based on 'originalQuery' but
* // scores using expr
* Query query = new FunctionScoreQuery(
* originalQuery,
* expr.getDoubleValuesSource(bindings));
* </pre>
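The per-document arithmetic of that expression can be checked directly; the method below stands in for what the compiled expression computes per matching document:

```java
// Plain-Java equivalent of the "_score * ln(popularity)" expression above;
// an illustration, not the expressions module itself.
class ExprScore {
    static double score(double similarityScore, long popularity) {
        return similarityScore * Math.log(popularity);
    }
}
```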
*
* <a name="customQueriesExpert"></a>
* <h2>Custom Queries &mdash; Expert Level</h2>
*
@@ -311,15 +337,14 @@
* {@link org.apache.lucene.search.Query Query} &mdash; The abstract object representation of the
* user's information need.</li>
* <li>
 * {@link org.apache.lucene.search.Weight Weight} &mdash; A specialization of a Query for a given
 * index. This typically associates a Query object with index statistics that are later used to
 * compute document scores.</li>
* <li>
* {@link org.apache.lucene.search.Scorer Scorer} &mdash; The core class of the scoring process:
* for a given segment, scorers return {@link org.apache.lucene.search.Scorer#iterator iterators}
* over matches and give a way to compute the {@link org.apache.lucene.search.Scorer#score score}
* of these matches.</li>
* <li>
* {@link org.apache.lucene.search.BulkScorer BulkScorer} &mdash; An abstract class that scores
* a range of documents. A default implementation simply iterates through the hits from
@@ -338,7 +363,7 @@
* {@link org.apache.lucene.search.Query Query} class has several methods that are important for
* derived classes:
* <ol>
* <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)} &mdash; A
* {@link org.apache.lucene.search.Weight Weight} is the internal representation of the
* Query, so each Query implementation must
* provide an implementation of Weight. See the subsection on <a
@@ -347,7 +372,7 @@
* <li>{@link org.apache.lucene.search.Query#rewrite(org.apache.lucene.index.IndexReader) rewrite(IndexReader reader)} &mdash; Rewrites queries into primitive queries. Primitive queries are:
* {@link org.apache.lucene.search.TermQuery TermQuery},
* {@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
 * >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)}</span></li>
* </ol>
* <a name="weightClass"></a>
* <h3>The Weight Interface</h3>
@@ -356,23 +381,15 @@
* interface provides an internal representation of the Query so that it can be reused. Any
* {@link org.apache.lucene.search.IndexSearcher IndexSearcher}
* dependent state should be stored in the Weight implementation,
* not in the Query class. The interface defines four main methods:
* <ol>
* <li>
* {@link org.apache.lucene.search.Weight#scorer scorer()} &mdash;
* Construct a new {@link org.apache.lucene.search.Scorer Scorer} for this Weight. See <a href="#scorerClass">The Scorer Class</a>
* below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents
* given the Query.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#explain(org.apache.lucene.index.LeafReaderContext, int)
* explain(LeafReaderContext context, int doc)} &mdash; Provide a means for explaining why a given document was
* scored the way it was.
@@ -380,6 +397,16 @@
* that scores via a {@link org.apache.lucene.search.similarities.Similarity Similarity} will make use of the Similarity's implementation:
* {@link org.apache.lucene.search.similarities.Similarity.SimScorer#explain(Explanation, long) SimScorer#explain(Explanation freq, long norm)}.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#extractTerms(java.util.Set) extractTerms(Set&lt;Term&gt; terms)} &mdash; Extract terms that
* this query operates on. This is typically used to support distributed search: knowing the terms that a query operates on helps
 * merge index statistics of these terms so that scores are computed over a subset of the data as they would be if all documents
* were in the same index.
* </li>
* <li>
* {@link org.apache.lucene.search.Weight#matches matches(LeafReaderContext context, int doc)} &mdash; Give information about positions
 * and offsets of matches. This is typically useful for implementing highlighting.
* </li>
* </ol>
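The statistics merging that extractTerms enables can be illustrated with a toy shard merge: summing per-shard document frequencies yields the global statistics every shard should score against. The shard maps are hypothetical and this is not Lucene's distributed implementation:

```java
import java.util.*;

// Toy merge of per-shard docFreq maps into global term statistics.
class GlobalStats {
    static Map<String, Long> mergeDocFreqs(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new TreeMap<>();
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Long::sum); // sum frequencies per term
            }
        }
        return global;
    }
}
```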
* <a name="scorerClass"></a>
* <h3>The Scorer Class</h3>
@@ -458,17 +485,13 @@
* This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
* which is an internal collection of search results. The IndexSearcher creates
* a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
* passes it along with the Weight to another expert search method (for
* more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
* see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
* uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
* top results for the search.
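The PriorityQueue-based collection can be sketched with toy doc/score pairs: a min-heap of size k keeps the k best hits, evicting the lowest-scoring kept hit first. An illustration of the idea, not TopScoreDocCollector itself:

```java
import java.util.*;

// Toy top-k collector over {docId, score} pairs using a min-heap of size k.
class TopK {
    static List<double[]> collect(int k, double[][] hits) {
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] h) -> h[1]));
        for (double[] hit : hits) {
            heap.offer(hit);
            if (heap.size() > k) heap.poll(); // drop the current lowest score
        }
        List<double[]> top = new ArrayList<>(heap);
        top.sort((a, b) -> Double.compare(b[1], a[1])); // best first
        return top;
    }
}
```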
* <p>At last, we are actually going to score some documents. The score method takes in the Collector
* (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here
* is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
* by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
* submitted. In most real world applications with multiple query terms, the

View File

@@ -73,9 +73,9 @@
* your searching needs.
* However, in some applications it may be necessary to customize your <a
* href="Similarity.html">Similarity</a> implementation. For instance, some
* applications do not need to distinguish between shorter and longer documents
* and could set BM25's {@link org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b}
* parameter to {@code 0}.
*
* <p>To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
* searching, and the changes must happen before
@@ -83,15 +83,27 @@
* just isn't well-defined what is going to happen.
*
* <p>To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely
* you'll want to simply subclass {@link org.apache.lucene.search.similarities.SimilarityBase}), and
* then register the new class by calling
* {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
* before indexing and
* {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
* before searching.
*
* <h3>Tuning {@linkplain org.apache.lucene.search.similarities.BM25Similarity}</h3>
* <p>{@link org.apache.lucene.search.similarities.BM25Similarity} has
* two parameters that may be tuned:
* <ul>
 * <li><tt>k1</tt>, which calibrates term frequency saturation and must be
 * non-negative. A value of {@code 0} causes term frequency to be ignored
 * entirely, so documents are scored based only on the <tt>IDF</tt> of the
 * matched terms. Higher values of <tt>k1</tt> increase the impact of term
 * frequency on the final score. The default value is {@code 1.2}.</li>
 * <li><tt>b</tt>, which controls how much document length normalizes term
 * frequency values and must be in {@code [0, 1]}. A value of {@code 0}
 * disables length normalization completely. The default value is {@code 0.75}.</li>
* </ul>
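The effect of both parameters can be verified directly on BM25's term-frequency factor (the part k1 and b control; idf omitted). This is a minimal arithmetic sketch, not BM25Similarity's code:

```java
// BM25 term-frequency factor:
// tfNorm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
class Bm25Sketch {
    static double tfNorm(double tf, double k1, double b, double docLen, double avgDocLen) {
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }
}
```

With k1 = 0 the factor is 1 for any positive tf (term frequency ignored); with b = 0 the document length drops out of the formula (length normalization disabled).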
*
* <h3>Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}</h3>
* <p>
* The easiest way to quickly implement a new ranking method is to extend
@@ -112,33 +124,5 @@
* subclassing the Similarity, one can simply introduce a new basic model and tell
* {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
*
*/
package org.apache.lucene.search.similarities;

View File

@@ -35,7 +35,7 @@ to check if the results are what we expect):</p>
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
@@ -50,7 +50,7 @@ to check if the results are what we expect):</p>
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {