diff --git a/xdocs/scoring.xml b/xdocs/scoring.xml index 68503cdcfd6..4ac236869b2 100644 --- a/xdocs/scoring.xml +++ b/xdocs/scoring.xml @@ -58,7 +58,7 @@ and the other in one Field will return different scores for the same query due to length normalization (assumming the DefaultSimilarity - on the Fields.) + on the Fields).
In this regard, Lucene offers a wide variety of Query implementations, most of which are in the - org.apache.lucene.search package. +
In this regard, Lucene offers a wide variety of Query implementations, most of which are in the + org.apache.lucene.search package. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query - section below will - highlight some of the more important Query classes. For information on the other ones, see the + section below + highlights some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring -- Expert Level below. @@ -160,7 +160,7 @@ IndexSearcher, the scoring process begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, - control finally passes to the Weight implementation and it's + control finally passes to the Weight implementation and its Scorer instance. In the case of any type of BooleanQuery, scoring is handled by the BooleanWeight2 (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class), @@ -188,68 +188,79 @@ TermQuery
Of the various implementations of
- Query, the
+ Query, the
TermQuery
- is the easiest to understand and the most often used in most applications. A TermQuery is a Query
- that matches all the documents that contain the specified
- Term
- . A Term is a word that occurs in a specific
- Field
- . Thus, a TermQuery identifies and scores all
- Document
- s that have a Field with the specified string in it.
- Constructing a TermQuery is as simple as:
- TermQuery tq = new TermQuery(new Term("fieldName", "term");
- In this example, the Query would identify all Documents that have the Field named "fieldName" that
- contain the word "term".
+ is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified
+ Term,
+ which is a word that occurs in a certain
+ Field.
+ Thus, a TermQuery identifies and scores all
+ Documents that have a Field with the specified string in it.
+ Constructing a TermQuery
+ is as simple as:
+
+ TermQuery tq = new TermQuery(new Term("fieldName", "term")); +In this example, the Query identifies all Documents whose Field named "fieldName" + contains the word "term".
Things start to get interesting when one starts to combine TermQuerys, which is handled by the - BooleanQuery - class. The BooleanQuery is a collection - of other - Query - classes along with semantics about how to combine the different subqueries. - It currently supports three different operators for specifying the logic of the query (see - BooleanClause - ) +
Things start to get interesting when one combines multiple + TermQuery instances into a BooleanQuery. + A BooleanQuery contains multiple + BooleanClauses, + where each clause contains a sub-query (Query + instance) and an operator (from BooleanClause.Occur) + describing how that sub-query is combined with the other clauses:
SHOULD -- Use this operator when a clause can occur in the result set, but is not required. + If a query is made up of all SHOULD clauses, then every document in the result + set matches at least one of these clauses.
MUST -- Use this operator when a clause is required to occur in the result set. Every + document in the result set will match + all such clauses.
MUST NOT -- Use this operator when a + clause must not occur in the result set. No + document in the result set will match + any such clauses.
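The matching rules of the three operators can be sketched in plain Java. This is only an illustration of the semantics described above, not Lucene's BooleanQuery code: `OccurSketch`, `Clause`, and the helper methods are hypothetical names, a sub-query is reduced to a predicate over document ids, and scoring is ignored entirely. The sketch also assumes that a query made up only of MUST NOT clauses matches nothing.

```java
import java.util.function.IntPredicate;

final class OccurSketch {
    enum Occur { SHOULD, MUST, MUST_NOT }

    /** Hypothetical stand-in for a clause: a sub-query (a predicate on doc ids) plus an operator. */
    static final class Clause {
        final IntPredicate subQuery;
        final Occur occur;

        Clause(IntPredicate subQuery, Occur occur) {
            this.subQuery = subQuery;
            this.occur = occur;
        }
    }

    static Clause should(IntPredicate q) { return new Clause(q, Occur.SHOULD); }
    static Clause must(IntPredicate q) { return new Clause(q, Occur.MUST); }
    static Clause mustNot(IntPredicate q) { return new Clause(q, Occur.MUST_NOT); }

    /** Returns true if the document belongs in the result set of the combined clauses. */
    static boolean accepts(int docId, Clause... clauses) {
        boolean sawMust = false;   // at least one MUST clause exists (and matched)
        boolean shouldHit = false; // at least one SHOULD clause matched
        for (Clause c : clauses) {
            boolean hit = c.subQuery.test(docId);
            switch (c.occur) {
                case MUST:
                    if (!hit) return false; // every MUST clause has to match
                    sawMust = true;
                    break;
                case MUST_NOT:
                    if (hit) return false;  // no MUST_NOT clause may match
                    break;
                case SHOULD:
                    shouldHit |= hit;       // optional, but remembered
                    break;
            }
        }
        // Without any MUST clause, at least one SHOULD clause has to match.
        return sawMust || shouldHit;
    }

    public static void main(String[] args) {
        // Documents with an even id, excluding ids below 10.
        System.out.println(accepts(12, must(d -> d % 2 == 0), mustNot(d -> d < 10)));
        System.out.println(accepts(4, must(d -> d % 2 == 0), mustNot(d -> d < 10)));
    }
}
```

In this sketch a SHOULD clause only decides membership when no MUST clause is present; otherwise it would influence only the score, which is outside the scope of the example.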
Another common task in search is to identify phrases, which can be handled in two different ways. +
Another common search is to find documents containing certain phrases. This + is handled in two different ways.
PhraseQuery -- Matches a sequence of - Terms - . The PhraseQuery can specify a slop factor which determines - how many positions may occur between any two terms and still be considered a match. + Terms. + PhraseQuery uses a slop factor to determine + how many positions may occur between any two terms in the phrase and still be considered a match.
SpanNearQuery -- Matches a sequence of other SpanQuery - instances. The SpanNearQuery allows for much more - complicated phrasal queries to be built since it is constructed out of other SpanQuery - objects, not just Terms. + instances. SpanNearQuery allows for much more + complicated phrase queries since it is constructed from other SpanQuery + instances, instead of only TermQuery instances.
While the PrefixQuery has a different implementation, it is essentially a special case of the - WildcardQuery - . The PrefixQuery allows an application - to identify all documents with terms that begin with a certain string. The WildcardQuery generalize - this by allowing - for the use of * and ? wildcards. Note that the WildcardQuery can be quite slow. Also note that - WildcardQuerys should - not start with * and ?, as these are extremely slow. For tricks on how to search using a wildcard at + WildcardQuery. + The PrefixQuery allows an application + to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing + for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that + a WildcardQuery should + not start with * or ?, as such queries are extremely slow. For tricks on how to search using a wildcard at the beginning of a term, see - + Starts With x and Ends With x Queries - from the Lucene archives. + from the Lucene users' mailing list.
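To make the wildcard semantics concrete, here is a minimal sketch of matching a single term against a pattern, where * matches 0 or more characters and ? matches exactly one. It is a generic backtracking matcher written for illustration; `WildcardSketch` and `matches` are hypothetical names, and this is not how WildcardQuery enumerates terms in an index.

```java
final class WildcardSketch {
    /** Returns true if the whole term matches the pattern ('*' = any run of characters, '?' = one character). */
    static boolean matches(String pattern, String term) {
        int p = 0;      // position in pattern
        int t = 0;      // position in term
        int star = -1;  // position of the most recent '*' in the pattern, or -1
        int mark = 0;   // term position where that '*' currently stops consuming
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;   // remember the star; first try to match zero characters
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;          // literal or single-character wildcard matched
                t++;
            } else if (star != -1) {
                p = star + 1; // backtrack: let the last '*' consume one more character
                t = ++mark;
            } else {
                return false;
            }
        }
        // Only trailing '*' may remain in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("lu*", "lucene"));    // a prefix query is the pattern "prefix*"
        System.out.println(matches("lu?ene", "lucene"));
    }
}
```

A pattern such as `lu*` behaves like a prefix search, which is the sense in which PrefixQuery is a special case. A pattern that begins with a wildcard offers no literal prefix to narrow the candidate terms, which is presumably why such queries are slow.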
A FuzzyQuery - matches documents that contain similar terms to the specified term. Similarity is - determined using the - Levenshtein (edit distance) algorithm - . This type of query can be useful when accounting for spelling variations in the collection. + matches documents that contain terms similar to the specified term. Similarity is + determined using + Levenshtein (edit) distance. + This type of query can be useful when accounting for spelling variations in the collection.
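For readers unfamiliar with edit distance, the following is a minimal sketch of the textbook Levenshtein computation: the smallest number of single-character insertions, deletions, and substitutions needed to turn one string into another. `EditDistanceSketch` is a hypothetical name, and how FuzzyQuery turns a distance into a similarity threshold is not shown here.

```java
final class EditDistanceSketch {
    /** Levenshtein distance between a and b, using two rolling rows of the DP table. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev; // reuse the rows instead of allocating new ones
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucine")); // 1: one substitution
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

A small distance relative to the term length indicates a likely spelling variant, which is the case the paragraph above has in mind.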
Chances are, the - DefaultSimilarity is sufficient for all your searching needs. - However, in some applications it may be necessary to alter your Similarity. For instance, some applications do not need to - distinguish between shorter documents and longer documents (for example, - see a "fair" similarity) - To change the Similarity, one must do so for both indexing and searching and the changes must take place before - any of these actions are undertaken (although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen). - To make this change, implement your Similarity (you probably want to override - DefaultSimilarity) and then set the new - class on - IndexWriter.setSimilarity(org.apache.lucene.search.Similarity) for indexing and on - Searcher.setSimilarity(org.apache.lucene.search.Similarity). +
Chances are DefaultSimilarity is sufficient for all your searching needs. + However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to + distinguish between shorter and longer documents (see a "fair" similarity).
+ +To change Similarity, one must do so for both indexing and searching, and the changes must happen before + either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, the resulting behavior is not well-defined. +
+ +To make this change, implement your own Similarity (likely you'll want to simply subclass + DefaultSimilarity) and then use the new + class by calling + IndexWriter.setSimilarity before indexing and + Searcher.setSimilarity before searching.
- If you are interested in use cases for changing your similarity, see the mailing list at Overriding Similarity. + If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. In summary, here are a few use cases:
SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount + and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these + cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes + to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated - "fairly".
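The two length normalization formulas quoted above can be compared numerically with a short sketch. The code below is plain arithmetic, not a Similarity subclass: `LengthNormSketch` and both method names are hypothetical, and only the formulas themselves come from the text.

```java
final class LengthNormSketch {
    /** The formula quoted for DefaultSimilarity: 1 / (numTerms in field)^0.5. */
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** The alternative formula from the text: 1 / (numTerms in field). */
    static double alternativeLengthNorm(int numTerms) {
        return 1.0 / numTerms;
    }

    public static void main(String[] args) {
        // Compare how quickly each factor shrinks as a field gets longer.
        for (int numTerms : new int[] {4, 100, 10000}) {
            System.out.println(numTerms + " terms: default=" + defaultLengthNorm(numTerms)
                    + " alternative=" + alternativeLengthNorm(numTerms));
        }
    }
}
```

For a field of 4 terms the factors are 0.5 and 0.25. The alternative formula shrinks faster as a field grows, so the choice between them controls how strongly field length discounts a match.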
[One would override the Similarity in] ... any situation where you know more about your data then just that it's "text" is a situation where it *might* make sense to to override your Similarity method.