mirror of https://github.com/apache/lucene.git
Update javadocs for Lucene 8.
This fixes a couple mistakes, puts more emphasis on BM25 compared to Classic and gives more guidance regarding custom scores without a custom query.
This commit is contained in:
parent
d93c46ea94
commit
a1ec716e10
|
@ -110,8 +110,10 @@
|
|||
* inverted index, is comprised of "postings." The postings, with their term dictionary, can be
|
||||
* thought of as a map that provides efficient lookup given a {@link org.apache.lucene.index.Term}
|
||||
* (roughly, a word or token), to (the ordered list of) {@link org.apache.lucene.document.Document}s
|
||||
* containing that Term. Postings do not provide any way of retrieving terms given a document,
|
||||
* short of scanning the entire index.</p>
|
||||
* containing that Term. Codecs may additionally record
|
||||
* {@link org.apache.lucene.index.ImpactsEnum#getImpacts impacts} alongside postings in order to be
|
||||
* able to skip over low-scoring documents at search time. Postings do not provide any way of
|
||||
* retrieving terms given a document, short of scanning the entire index.</p>
|
||||
*
|
||||
* <a name="stored-fields"></a>
|
||||
* <p>Stored fields are essentially the opposite of postings, providing efficient retrieval of field
|
||||
|
|
|
@ -28,6 +28,10 @@ import org.apache.lucene.util.automaton.Automaton;
|
|||
* <p>This query matches the documents looking for terms that fall into the
|
||||
* supplied range according to {@link BytesRef#compareTo(BytesRef)}.
|
||||
*
|
||||
* <p><b>NOTE</b>: {@link TermRangeQuery} performs significantly slower than
|
||||
* {@link PointRangeQuery point-based ranges} as it needs to visit all terms
|
||||
* that match the range and merges their matches.
|
||||
*
|
||||
* <p>This query uses the {@link
|
||||
* MultiTermQuery#CONSTANT_SCORE_REWRITE}
|
||||
* rewrite method.
|
||||
|
|
|
@ -44,7 +44,7 @@
|
|||
* <p>
|
||||
* Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
|
||||
* process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
|
||||
* implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScore}
|
||||
* implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScorer}
|
||||
* instances. See the <a href="#algorithm">Algorithm</a> section for more notes on the process.
|
||||
* <!-- FILL IN MORE HERE -->
|
||||
* <!-- TODO: this page over-links the same things too many times -->
|
||||
|
@ -95,9 +95,11 @@
|
|||
* If a query is made up of all SHOULD clauses, then every document in the result
|
||||
* set matches at least one of these clauses.</p></li>
|
||||
*
|
||||
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} — Use this operator when a clause is required to occur in the result set. Every
|
||||
* document in the result set will match
|
||||
* all such clauses.</p></li>
|
||||
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} — Use this operator when a clause is required to occur in the result set and should
|
||||
* contribute to the score. Every document in the result set will match all such clauses.</p></li>
|
||||
*
|
||||
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#FILTER FILTER} — Use this operator when a clause is required to occur in the result set but
|
||||
* should not contribute to the score. Every document in the result set will match all such clauses.</p></li>
|
||||
*
|
||||
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST_NOT MUST NOT} — Use this operator when a
|
||||
* clause must not occur in the result set. No
|
||||
|
@ -113,7 +115,7 @@
|
|||
* {@link org.apache.lucene.search.TermQuery TermQuery} clauses,
|
||||
* for example by {@link org.apache.lucene.search.WildcardQuery WildcardQuery}.
|
||||
* The default setting for the maximum number
|
||||
* of clauses 1024, but this can be changed via the
|
||||
* of clauses is 1024, but this can be changed via the
|
||||
* static method {@link org.apache.lucene.search.BooleanQuery#setMaxClauseCount(int)}.
|
||||
*
|
||||
* <h3>Phrases</h3>
|
||||
|
@ -149,23 +151,6 @@
|
|||
* </ol>
|
||||
*
|
||||
* <h3>
|
||||
* {@link org.apache.lucene.search.TermRangeQuery TermRangeQuery}
|
||||
* </h3>
|
||||
*
|
||||
* <p>The
|
||||
* {@link org.apache.lucene.search.TermRangeQuery TermRangeQuery}
|
||||
* matches all documents that occur in the
|
||||
* exclusive range of a lower
|
||||
* {@link org.apache.lucene.index.Term Term}
|
||||
* and an upper
|
||||
* {@link org.apache.lucene.index.Term Term}
|
||||
* according to {@link org.apache.lucene.util.BytesRef#compareTo BytesRef.compareTo()}. It is not intended
|
||||
* for numerical ranges; use {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery} instead.
|
||||
*
|
||||
* For example, one could find all documents
|
||||
* that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>.
|
||||
*
|
||||
* <h3>
|
||||
* {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery}
|
||||
* </h3>
|
||||
*
|
||||
|
@ -274,6 +259,7 @@
|
|||
*
|
||||
* <a name="changingScoring"></a>
|
||||
* <h2>Changing Scoring — Similarity</h2>
|
||||
* <h3>Changing the scoring formula</h3>
|
||||
* <p>
|
||||
* Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
|
||||
* influence scoring, this is done at index-time with
|
||||
|
@ -289,14 +275,54 @@
|
|||
* extend by plugging in a different component (e.g. term frequency normalizer).
|
||||
* <p>
|
||||
* Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
|
||||
* to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
|
||||
* a custom Similarity can access per-document values via {@link org.apache.lucene.index.NumericDocValues} and
|
||||
* integrate them into the score.
|
||||
* to implement a new retrieval model.
|
||||
* <p>
|
||||
* See the {@link org.apache.lucene.search.similarities} package documentation for information
|
||||
* on the built-in available scoring models and extending or changing Similarity.
|
||||
*
|
||||
* <h3>Integrating field values into the score</h3>
|
||||
* <p>While similarities help score a document relatively to a query, it is also common for documents to hold
|
||||
* features that measure the quality of a match. Such features are best integrated into the score by indexing
|
||||
* a {@link org.apache.lucene.document.FeatureField FeatureField} with the document at index-time, and then
|
||||
* combining the similarity score and the feature score using a linear combination. For instance the below
|
||||
* query matches the same documents as {@code originalQuery} and computes scores as
|
||||
* {@code similarityScore + 0.7 * featureScore}:
|
||||
* <pre class="prettyprint">
|
||||
* Query originalQuery = new BooleanQuery.Builder()
|
||||
* .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
|
||||
* .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
|
||||
* .build();
|
||||
* Query featureQuery = FeatureField.newSaturationQuery("features", "pagerank");
|
||||
* Query query = new BooleanQuery.Builder()
|
||||
* .add(originalQuery, Occur.MUST)
|
||||
* .add(new BoostQuery(featureQuery, 0.7f), Occur.SHOULD)
|
||||
* .build();
|
||||
* </pre>
|
||||
*
|
||||
*
|
||||
* <p>A less efficient yet more flexible way of modifying scores is to index scoring features into
|
||||
* doc-value fields and then combine them with the similarity score using a
|
||||
* <a href="{@docRoot}/../queries/org/apache/lucene/queries/function/FunctionScoreQuery.html">FunctionScoreQuery</a>
|
||||
* from the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. For instance
|
||||
* the below example shows how to compute scores as {@code similarityScore * Math.log(popularity)}
|
||||
* using the <a href="{@docRoot}/../expressions/overview-summary.html">expressions module</a> and
|
||||
* assuming that values for the {@code popularity} field have been set in a
|
||||
* {@link org.apache.lucene.document.NumericDocValuesField NumericDocValuesField} at index time:
|
||||
* <pre class="prettyprint">
|
||||
* // compile an expression:
|
||||
* Expression expr = JavascriptCompiler.compile("_score * ln(popularity)");
|
||||
*
|
||||
* // SimpleBindings just maps variables to SortField instances
|
||||
* SimpleBindings bindings = new SimpleBindings();
|
||||
* bindings.add(new SortField("_score", SortField.Type.SCORE));
|
||||
* bindings.add(new SortField("popularity", SortField.Type.INT));
|
||||
*
|
||||
* // create a query that matches based on 'originalQuery' but
|
||||
* // scores using expr
|
||||
* Query query = new FunctionScoreQuery(
|
||||
* originalQuery,
|
||||
* expr.getDoubleValuesSource(bindings));
|
||||
* </pre>
|
||||
*
|
||||
* <a name="customQueriesExpert"></a>
|
||||
* <h2>Custom Queries — Expert Level</h2>
|
||||
*
|
||||
|
@ -311,15 +337,14 @@
|
|||
* {@link org.apache.lucene.search.Query Query} — The abstract object representation of the
|
||||
* user's information need.</li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight Weight} — The internal interface representation of
|
||||
* the user's Query, so that Query objects may be reused.
|
||||
* This is global (across all segments of the index) and
|
||||
* generally will require global statistics (such as docFreq
|
||||
* for a given term across all segments).</li>
|
||||
* {@link org.apache.lucene.search.Weight Weight} — A specialization of a Query for a given
|
||||
* index. This typically associates a Query object with index statistics that are later used to
|
||||
* compute document scores.
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Scorer Scorer} — An abstract class containing common
|
||||
* functionality for scoring. Provides both scoring and
|
||||
* explanation capabilities. This is created per-segment.</li>
|
||||
* {@link org.apache.lucene.search.Scorer Scorer} — The core class of the scoring process:
|
||||
* for a given segment, scorers return {@link org.apache.lucene.search.Scorer#iterator iterators}
|
||||
* over matches and give a way to compute the {@link org.apache.lucene.search.Scorer#score score}
|
||||
* of these matches.</li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.BulkScorer BulkScorer} — An abstract class that scores
|
||||
* a range of documents. A default implementation simply iterates through the hits from
|
||||
|
@ -338,7 +363,7 @@
|
|||
* {@link org.apache.lucene.search.Query Query} class has several methods that are important for
|
||||
* derived classes:
|
||||
* <ol>
|
||||
* <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, boolean needsScores, float boost)} — A
|
||||
* <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)} — A
|
||||
* {@link org.apache.lucene.search.Weight Weight} is the internal representation of the
|
||||
* Query, so each Query implementation must
|
||||
* provide an implementation of Weight. See the subsection on <a
|
||||
|
@ -347,7 +372,7 @@
|
|||
* <li>{@link org.apache.lucene.search.Query#rewrite(org.apache.lucene.index.IndexReader) rewrite(IndexReader reader)} — Rewrites queries into primitive queries. Primitive queries are:
|
||||
* {@link org.apache.lucene.search.TermQuery TermQuery},
|
||||
* {@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
|
||||
* >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher,boolean needsScores, float boost)}</span></li>
|
||||
* >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher,ScoreMode scoreMode, float boost)}</span></li>
|
||||
* </ol>
|
||||
* <a name="weightClass"></a>
|
||||
* <h3>The Weight Interface</h3>
|
||||
|
@ -356,23 +381,15 @@
|
|||
* interface provides an internal representation of the Query so that it can be reused. Any
|
||||
* {@link org.apache.lucene.search.IndexSearcher IndexSearcher}
|
||||
* dependent state should be stored in the Weight implementation,
|
||||
* not in the Query class. The interface defines five methods that must be implemented:
|
||||
* not in the Query class. The interface defines four main methods:
|
||||
* <ol>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#getQuery getQuery()} — Pointer to the
|
||||
* Query that this Weight represents.</li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#scorer scorer()} —
|
||||
* Construct a new {@link org.apache.lucene.search.Scorer Scorer} for this Weight. See <a href="#scorerClass">The Scorer Class</a>
|
||||
* below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents
|
||||
* given the Query.
|
||||
* </li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#bulkScorer bulkScorer()} —
|
||||
* Construct a new {@link org.apache.lucene.search.BulkScorer BulkScorer} for this Weight. See <a href="#bulkScorerClass">The BulkScorer Class</a>
|
||||
* below for help defining a BulkScorer. This is an optional method, and most queries do not implement it.
|
||||
* </li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#explain(org.apache.lucene.index.LeafReaderContext, int)
|
||||
* explain(LeafReaderContext context, int doc)} — Provide a means for explaining why a given document was
|
||||
* scored the way it was.
|
||||
|
@ -380,6 +397,16 @@
|
|||
* that scores via a {@link org.apache.lucene.search.similarities.Similarity Similarity} will make use of the Similarity's implementation:
|
||||
* {@link org.apache.lucene.search.similarities.Similarity.SimScorer#explain(Explanation, long) SimScorer#explain(Explanation freq, long norm)}.
|
||||
* </li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#extractTerms(java.util.Set) extractTerms(Set<Term> terms)} — Extract terms that
|
||||
* this query operates on. This is typically used to support distributed search: knowing the terms that a query operates on helps
|
||||
* merge index statistics of these terms so that scores are computed over a subset of the data like they would if all documents
|
||||
* were in the same index.
|
||||
* </li>
|
||||
* <li>
|
||||
* {@link org.apache.lucene.search.Weight#matches matches(LeafReaderContext context, int doc)} — Give information about positions
|
||||
* and offsets of matches. This is typically useful to implement highlighting.
|
||||
* </li>
|
||||
* </ol>
|
||||
* <a name="scorerClass"></a>
|
||||
* <h3>The Scorer Class</h3>
|
||||
|
@ -458,17 +485,13 @@
|
|||
* This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
|
||||
* which is an internal collection of search results. The IndexSearcher creates
|
||||
* a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
|
||||
* passes it along with the Weight, Filter to another expert search method (for
|
||||
* passes it along with the Weight to another expert search method (for
|
||||
* more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
|
||||
* see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
|
||||
* uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
|
||||
* top results for the search.
|
||||
* <p>If a Filter is being used, some initial setup is done to determine which docs to include.
|
||||
* Otherwise, we ask the Weight for a {@link org.apache.lucene.search.Scorer Scorer} for each
|
||||
* {@link org.apache.lucene.index.IndexReader IndexReader} segment and proceed by calling
|
||||
* {@link org.apache.lucene.search.BulkScorer#score(org.apache.lucene.search.LeafCollector,org.apache.lucene.util.Bits) BulkScorer.score(LeafCollector,Bits)}.
|
||||
* <p>At last, we are actually going to score some documents. The score method takes in the Collector
|
||||
* (most likely the TopScoreDocCollector or TopFieldCollector) and does its business.Of course, here
|
||||
* (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here
|
||||
* is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
|
||||
* by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
|
||||
* submitted. In most real world applications with multiple query terms, the
|
||||
|
|
|
@ -73,9 +73,9 @@
|
|||
* your searching needs.
|
||||
* However, in some applications it may be necessary to customize your <a
|
||||
* href="Similarity.html">Similarity</a> implementation. For instance, some
|
||||
* applications do not need to
|
||||
* distinguish between shorter and longer documents (see <a
|
||||
* href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).
|
||||
* applications do not need to distinguish between shorter and longer documents
|
||||
* and could set BM25's {@link org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b}
|
||||
* parameter to {@code 0}.
|
||||
*
|
||||
* <p>To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
|
||||
* searching, and the changes must happen before
|
||||
|
@ -83,15 +83,27 @@
|
|||
* just isn't well-defined what is going to happen.
|
||||
*
|
||||
* <p>To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely
|
||||
* you'll want to simply subclass an existing method, be it
|
||||
* {@link org.apache.lucene.search.similarities.ClassicSimilarity} or a descendant of
|
||||
* {@link org.apache.lucene.search.similarities.SimilarityBase}), and
|
||||
* you'll want to simply subclass {@link org.apache.lucene.search.similarities.SimilarityBase}), and
|
||||
* then register the new class by calling
|
||||
* {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
|
||||
* before indexing and
|
||||
* {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
|
||||
* before searching.
|
||||
*
|
||||
* <h3>Tuning {@linkplain org.apache.lucene.search.similarities.BM25Similarity}</h3>
|
||||
* <p>{@link org.apache.lucene.search.similarities.BM25Similarity} has
|
||||
* two parameters that may be tuned:
|
||||
* <ul>
|
||||
* <li><tt>k1</tt>, which calibrates term frequency saturation and must be
|
||||
* positive or null. A value of {@code 0} makes term frequency completely
|
||||
* ignored, making documents scored only based on the value of the <tt>IDF</tt>
|
||||
* of the matched terms. Higher values of <tt>k1</tt> increase the impact of
|
||||
* term frequency on the final score. Default value is {@code 1.2}.</li>
|
||||
* <li><tt>b</tt>, which controls how much document length should normalize
|
||||
* term frequency values and must be in {@code [0, 1]}. A value of {@code 0}
|
||||
* disables length normalization completely. Default value is {@code 0.75}.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}</h3>
|
||||
* <p>
|
||||
* The easiest way to quickly implement a new ranking method is to extend
|
||||
|
@ -112,33 +124,5 @@
|
|||
* subclassing the Similarity, one can simply introduce a new basic model and tell
|
||||
* {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
|
||||
*
|
||||
* <h3>Changing {@linkplain org.apache.lucene.search.similarities.ClassicSimilarity}</h3>
|
||||
* <p>
|
||||
* If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at <a
|
||||
* href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">Overriding Similarity</a>.
|
||||
* In summary, here are a few use cases:
|
||||
* <ol>
|
||||
* <li><p>The <code>SweetSpotSimilarity</code> in
|
||||
* <code>org.apache.lucene.misc</code> gives small
|
||||
* increases as the frequency increases a small amount
|
||||
* and then greater increases when you hit the "sweet spot", i.e. where
|
||||
* you think the frequency of terms is more significant.</li>
|
||||
* <li><p>Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a
|
||||
* matching term occurs. In these
|
||||
* cases people have overridden Similarity to return 1 from the tf() method.</li>
|
||||
* <li><p>Changing Length Normalization — By overriding
|
||||
* {@link org.apache.lucene.search.similarities.Similarity#computeNorm(org.apache.lucene.index.FieldInvertState state)},
|
||||
* it is possible to discount how the length of a field contributes
|
||||
* to a score. In {@link org.apache.lucene.search.similarities.ClassicSimilarity},
|
||||
* lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
|
||||
* 1 / (numTerms in field), all fields will be treated
|
||||
* <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</li>
|
||||
* </ol>
|
||||
* In general, Chris Hostetter sums it up best in saying (from <a
|
||||
* href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users's mailing list</a>):
|
||||
* <blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just
|
||||
* that
|
||||
* it's "text" is a situation where it *might* make sense to to override your
|
||||
* Similarity method.</blockquote>
|
||||
*/
|
||||
package org.apache.lucene.search.similarities;
|
||||
|
|
|
@ -35,7 +35,7 @@ to check if the results are what we expect):</p>
|
|||
// Store the index in memory:
|
||||
Directory directory = new RAMDirectory();
|
||||
// To store an index on disk, use this instead:
|
||||
//Directory directory = FSDirectory.open("/tmp/testindex");
|
||||
//Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
|
||||
IndexWriterConfig config = new IndexWriterConfig(analyzer);
|
||||
IndexWriter iwriter = new IndexWriter(directory, config);
|
||||
Document doc = new Document();
|
||||
|
@ -50,7 +50,7 @@ to check if the results are what we expect):</p>
|
|||
// Parse a simple query that searches for "text":
|
||||
QueryParser parser = new QueryParser("fieldname", analyzer);
|
||||
Query query = parser.parse("text");
|
||||
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
|
||||
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
|
||||
assertEquals(1, hits.length);
|
||||
// Iterate through the results:
|
||||
for (int i = 0; i < hits.length; i++) {
|
||||
|
|
Loading…
Reference in New Issue