diff --git a/docs/benchmarks.html b/docs/benchmarks.html index a51e2eeac04..9cf3b289242 100644 --- a/docs/benchmarks.html +++ b/docs/benchmarks.html @@ -85,6 +85,8 @@ limitations under the License.
-- TermQuery -
-Of the various implementations of - Query, the - TermQuery - is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified - Term, - which is a word that occurs in a certain - Field. - Thus, a TermQuery identifies and scores all - Documents that have a Field with the specified string in it. - Constructing a TermQuery - is as simple as: -
- TermQuery tq = new TermQuery(new Term("fieldName", "term")); -In this example, the Query identifies all Documents that have the Field named "fieldName" and - contain the word "term". - -- BooleanQuery -
-Things start to get interesting when one combines multiple - TermQuery instances into a BooleanQuery. - A BooleanQuery contains multiple - BooleanClauses, - where each clause contains a sub-query (Query - instance) and an operator (from BooleanClause.Occur) - describing how that sub-query is combined with the other clauses: -
- -
- Boolean queries are constructed by adding two or more - BooleanClause - instances. If too many clauses are added, a TooManyClauses - exception will be thrown during searching. This most often occurs - when a Query - is rewritten into a BooleanQuery with many - TermQuery clauses, - for example by WildcardQuery. - The default setting for the maximum number - of clauses is 1024, but this can be changed via the - static method setMaxClauseCount - in BooleanQuery (see the sketch after the list of operators below). - -- - - 
SHOULD -- Use this operator when a clause can occur in the result set, but is not required. - If a query is made up of all SHOULD clauses, then every document in the result - set matches at least one of these clauses.
- - -
MUST -- Use this operator when a clause is required to occur in the result set. Every - document in the result set will match - all such clauses.
- -
MUST NOT -- Use this operator when a - clause must not occur in the result set. No - document in the result set will match - any such clauses.
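As a rough sketch of putting these operators together (the field and term values here are illustrative), clauses are added one at a time, and the clause limit is only raised when a legitimately large rewritten query requires it:

    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);     // required
    bq.add(new TermQuery(new Term("contents", "scoring")), BooleanClause.Occur.SHOULD);  // optional, improves score
    bq.add(new TermQuery(new Term("contents", "spam")), BooleanClause.Occur.MUST_NOT);   // excluded

    // Only if a rewritten query (a WildcardQuery, for instance) genuinely needs more than 1024 clauses:
    BooleanQuery.setMaxClauseCount(4096);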
Phrases
-Another common search is to find documents containing certain phrases. This - is handled in two different ways. -
-
- -- -
-PhraseQuery - -- Matches a sequence of - Terms. - PhraseQuery uses a slop factor to determine - how many positions may occur between any two terms in the phrase and still be considered a match.
-- -
-SpanNearQuery - -- Matches a sequence of other - SpanQuery - instances. SpanNearQuery allows for much more - complicated phrase queries since it is constructed from other - SpanQuery - instances, instead of only TermQuery instances.
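A brief sketch of both approaches (the field and terms are illustrative; the span classes live in org.apache.lucene.search.spans):

    // "quick ... fox" with at most one extra position between the two terms
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("contents", "quick"));
    pq.add(new Term("contents", "fox"));
    pq.setSlop(1);

    // Roughly the same match expressed with spans, built from other SpanQuery instances
    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "quick")),
        new SpanTermQuery(new Term("contents", "fox"))
    };
    SpanNearQuery snq = new SpanNearQuery(clauses, 1, true); // slop of 1, matches must be in order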
-- RangeQuery -
-The - RangeQuery - matches all documents that occur in the - exclusive range of a lower - Term - and an upper - Term. - For example, one could find all documents - that have terms beginning with the letters a through c. This type of Query is frequently used to - find - documents that occur in a specific date range. -
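A small sketch (the field name and bounds are illustrative); the final constructor argument controls whether the end points themselves are included:

    // Terms from "a" up to "c" in the "title" field
    RangeQuery rq = new RangeQuery(new Term("title", "a"),
                                   new Term("title", "c"),
                                   false); // false = exclusive of the end points, true = inclusive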
-- PrefixQuery, - WildcardQuery -
-While the - PrefixQuery - has a different implementation, it is essentially a special case of the - WildcardQuery. - The PrefixQuery allows an application - to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing - for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that - WildcardQuery should - not start with * or ?, as such leading wildcards are extremely slow. For tricks on how to search using a wildcard at - the beginning of a term, see - - Starts With x and Ends With x Queries - from the Lucene users' mailing list. -
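For example (the field and patterns are illustrative):

    PrefixQuery prefix = new PrefixQuery(new Term("title", "foo"));      // foo, food, football, ...
    WildcardQuery wild = new WildcardQuery(new Term("title", "f*o?d"));  // * = zero or more characters, ? = exactly one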
-- FuzzyQuery -
-A - FuzzyQuery - matches documents that contain terms similar to the specified term. Similarity is - determined using - Levenshtein (edit) distance. - This type of query can be useful when accounting for spelling variations in the collection. +
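For example (the field and term are illustrative):

    // Matches terms within a small edit distance of "lucene", e.g. "lucine" or "lucenes"
    FuzzyQuery fq = new FuzzyQuery(new Term("contents", "lucene"));
    // A minimum similarity can also be supplied to tighten the match
    FuzzyQuery strict = new FuzzyQuery(new Term("contents", "lucene"), 0.7f);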
For information on the Query Classes, refer to the + search package javadocs
-Chances are DefaultSimilarity is sufficient for all your searching needs. - However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to - distinguish between shorter and longer documents (see a "fair" similarity).
-To change Similarity, one must do so for both indexing and searching, and the changes must happen before - either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen. -
-To make this change, implement your own Similarity (likely you'll want to simply subclass - DefaultSimilarity) and then use the new - class by calling - IndexWriter.setSimilarity before indexing and - Searcher.setSimilarity before searching. -
-- If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. - In summary, here are a few use cases: -
-
- In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list): -- - 
SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount - and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
- -
Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these - cases people have overridden Similarity to return 1 from the tf() method.
- -
Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes - to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be - 1 / (numTerms in field), all fields will be treated - "fairly" (a sketch combining this with the tf() override appears after the quote below).
[One would override the Similarity in] ... any situation where you know more about your data than just that - it's "text" is a situation where it *might* make sense to override your - Similarity method.- +One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on + how to do this, see the + search package javadocs
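As a minimal sketch pulling the tf() and lengthNorm() ideas above together (the class name is illustrative, and writer and searcher stand for an existing IndexWriter and Searcher):

    public class FairSimilarity extends DefaultSimilarity {
        // Score a matching term the same no matter how often it occurs in the document
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
        // Treat short and long fields "fairly": 1 / numTerms rather than 1 / sqrt(numTerms)
        public float lengthNorm(String fieldName, int numTerms) {
            return numTerms > 0 ? 1.0f / numTerms : 0.0f;
        }
    }

    // The same Similarity must be in place for indexing and for searching:
    writer.setSimilarity(new FairSimilarity());    // before any documents are added
    searcher.setSimilarity(new FairSimilarity());  // before any searches are run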
-Changing scoring is an expert level task, so tread carefully and be prepared to share your code if - you want help. +
At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more + about how to do this, refer to the + search package javadocs
-With the warning out of the way, it is possible to change a lot more than just the Similarity - when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by - three main classes: -
-
- Details on each of these classes, and their children can be found in the subsections below. - -- - Query -- The abstract object representation of the user's information need.
-- - Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
-- - Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.
--
-- - - The Query Class - - - - --In some sense, the - Query - class is where it all begins. Without a Query, there would be - nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it - is often responsible - for creating them or coordinating the functionality between them. The - Query class has several methods that are important for - derived classes: -
-
- -- createWeight(Searcher searcher) -- A - Weight is the internal representation of the Query, so each Query implementation must - provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
-- rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: - TermQuery, - BooleanQuery, OTHERS????
-- -
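As a heavily simplified sketch (class names are illustrative, and this query simply re-derives a term-style query for illustration; modifier and exception clauses have varied slightly across Lucene versions), a custom Query wires these two methods up as follows. The matching Weight and Scorer are sketched in the next two subsections:

    public class ExampleQuery extends Query {
        private final Term term; // the single term this illustrative query matches

        public ExampleQuery(Term term) { this.term = term; }

        Term getTerm() { return term; }

        // Each Query supplies its own Weight; see The Weight Interface below
        protected Weight createWeight(Searcher searcher) {
            return new ExampleWeight(this, searcher);
        }

        // Already a primitive query, so there is nothing to rewrite
        public Query rewrite(IndexReader reader) {
            return this;
        }

        public String toString(String field) {
            return "example(" + term + ")";
        }
    }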
-- - - The Weight Interface - - - - --The - Weight - interface provides an internal representation of the Query so that it can be reused. Any - Searcher - dependent state should be stored in the Weight implementation, - not in the Query class. The interface defines 6 methods that must be implemented: -
-
- -- - Weight#getQuery() -- Pointer to the Query that this Weight represents.
-- - Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is - equal to the idf^2 * boost * queryNorm
-- - - Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * boost)^2
-- - - Weight#normalize(float) -- Determine the query normalization factor. The query normalization may - allow for comparing scores between queries.
-- - - Weight#scorer(IndexReader) -- Construct a new - Scorer - for this Weight. See - The Scorer Class - below for help defining a Scorer. As the name implies, the - Scorer is responsible for doing the actual scoring of documents given the Query. -
-- - - Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored - the way it was.
-- -
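Continuing the illustrative ExampleQuery sketch from the previous subsection (the names are made up, and a real implementation would fold idf and other factors into the weight), a bare-bones Weight might look like:

    class ExampleWeight implements Weight {
        private final ExampleQuery query;
        private final Searcher searcher;
        private float queryWeight; // value before normalization
        private float value;       // final value, set by normalize()

        ExampleWeight(ExampleQuery query, Searcher searcher) {
            this.query = query;
            this.searcher = searcher;
        }

        public Query getQuery() { return query; }

        public float getValue() { return value; }

        public float sumOfSquaredWeights() throws IOException {
            queryWeight = query.getBoost(); // a real Weight would multiply in idf here
            return queryWeight * queryWeight;
        }

        public void normalize(float queryNorm) {
            value = queryWeight * queryNorm;
        }

        public Scorer scorer(IndexReader reader) throws IOException {
            // Hand the matching-document iterator to the Scorer; see The Scorer Class below
            return new ExampleScorer(this, reader.termDocs(query.getTerm()),
                                     searcher.getSimilarity());
        }

        public Explanation explain(IndexReader reader, int doc) throws IOException {
            return new Explanation(value, "example weight (explanation not fleshed out)");
        }
    }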
-- - - The Scorer Class - - - - --The - Scorer - abstract class provides common scoring functionality for all Scorer implementations and - is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which - must be implemented: -
-
- -- - Scorer#next() -- Advances to the next document that matches this Query, returning true if and only - if there is another document that matches.
-- - Scorer#doc() -- Returns the id of the - Document - that contains the match. Is not valid until next() has been called at least once. -
-- - Scorer#score() -- Return the score of the current document. This value can be determined in any - appropriate way for an application. For instance, the - TermScorer - returns the tf * Weight.getValue() * fieldNorm. -
-- - Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than - or equal to the passed in value. In many instances, skipTo can be - implemented more efficiently than simply looping through all the matching documents until - the target document is identified.
-- - Scorer#explain(int) -- Provides details on why the score came about.
-- -
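And, completing the sketch (again, the names are illustrative and the scoring itself is deliberately trivial), a matching Scorer that simply walks a TermDocs iterator:

    class ExampleScorer extends Scorer {
        private final ExampleWeight weight;
        private final TermDocs termDocs; // iterator over the matching documents
        private int currentDoc = -1;

        ExampleScorer(ExampleWeight weight, TermDocs termDocs, Similarity similarity) {
            super(similarity);
            this.weight = weight;
            this.termDocs = termDocs;
        }

        public boolean next() throws IOException {
            boolean more = termDocs.next();
            if (more) currentDoc = termDocs.doc();
            return more;
        }

        public int doc() { return currentDoc; }

        public float score() throws IOException {
            // A real scorer would combine tf, field norms, etc.; here every match scores the same
            return weight.getValue();
        }

        public boolean skipTo(int target) throws IOException {
            boolean more = termDocs.skipTo(target);
            if (more) currentDoc = termDocs.doc();
            return more;
        }

        public Explanation explain(int doc) throws IOException {
            return new Explanation(weight.getValue(), "example scorer, constant score per match");
        }
    }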
-- - - Why would I want to add my own Query? - - - - --In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing Query implementations - aren't appropriate for the - task that you want to do. You might be doing some cutting-edge research, or you need more information - back - out of Lucene (similar to Doug adding SpanQuery functionality).
-- -
- - - Examples - - - - --FILL IN HERE
--
+
+ + +Search over indices. Applications usually call {@link org.apache.lucene.search.Searcher#search(Query)} or {@link org.apache.lucene.search.Searcher#search(Query,Filter)}. + +
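A minimal sketch of both calls (the index path, field names, and terms are illustrative; QueryFilter is one way to derive a Filter from another query):

    IndexSearcher searcher = new IndexSearcher("index"); // directory containing an existing index
    Query query = new TermQuery(new Term("contents", "lucene"));

    // Unfiltered search
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);   // the i-th matching document
        float score = hits.score(i);  // and its score
    }

    // The same search restricted by a Filter, here one built from a second query
    Filter filter = new QueryFilter(new TermQuery(new Term("type", "article")));
    Hits filtered = searcher.search(query, filter);

    searcher.close();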
+ +Of the various implementations of + Query, the + TermQuery + is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the + specified + Term, + which is a word that occurs in a certain + Field. + Thus, a TermQuery identifies and scores all + Documents that have a Field with the specified string in it. + Constructing a TermQuery + is as simple as: +
+ TermQuery tq = new TermQuery(new Term("fieldName", "term")); +In this example, the Query identifies all Documents that have the Field named "fieldName" and + contain the word "term". + +
Things start to get interesting when one combines multiple + TermQuery instances into a BooleanQuery. + A BooleanQuery contains multiple + BooleanClauses, + where each clause contains a sub-query (Query + instance) and an operator (from BooleanClause.Occur) + describing how that sub-query is combined with the other clauses: +
SHOULD -- Use this operator when a clause can occur in the result set, but is not required. + If a query is made up of all SHOULD clauses, then every document in the result + set matches at least one of these clauses.
MUST -- Use this operator when a clause is required to occur in the result set. Every + document in the result set will match + all such clauses.
MUST NOT -- Use this operator when a + clause must not occur in the result set. No + document in the result set will match + any such clauses.
Another common search is to find documents containing certain phrases. This + is handled in two different ways. +
PhraseQuery + -- Matches a sequence of + Terms. + PhraseQuery uses a slop factor to determine + how many positions may occur between any two terms in the phrase and still be considered a match.
+SpanNearQuery + -- Matches a sequence of other + SpanQuery + instances. SpanNearQuery allows for + much more + complicated phrase queries since it is constructed from other + SpanQuery + instances, instead of only TermQuery + instances.
+The + RangeQuery + matches all documents that occur in the + exclusive range of a lower + Term + and an upper + Term. + For example, one could find all documents + that have terms beginning with the letters a through c. This type of Query is frequently used to + find + documents that occur in a specific date range. +
+While the + PrefixQuery + has a different implementation, it is essentially a special case of the + WildcardQuery. + The PrefixQuery allows an application + to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing + for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. + Note that the WildcardQuery can be quite slow. Also + note that + WildcardQuery should + not start with * or ?, as such leading wildcards are extremely slow. For tricks on how to search using a wildcard + at + the beginning of a term, see + + Starts With x and Ends With x Queries + from the Lucene users' mailing list. +
+A + FuzzyQuery + matches documents that contain terms similar to the specified term. Similarity is + determined using + Levenshtein (edit) distance. + This type of query can be useful when accounting for spelling variations in the collection. +
+ +Chances are DefaultSimilarity is sufficient for all + your searching needs. + However, in some applications it may be necessary to customize your Similarity implementation. For instance, some + applications do not need to + distinguish between shorter and longer documents (see a "fair" similarity).
+ +To change Similarity, one must do so for both indexing and + searching, and the changes must happen before + either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it + just isn't well-defined what is going to happen. +
+ +To make this change, implement your own Similarity (likely + you'll want to simply subclass + DefaultSimilarity) and then use the new + class by calling + IndexWriter.setSimilarity + before indexing and + Searcher.setSimilarity + before searching. +
+ ++ If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. + In summary, here are a few use cases: +
SweetSpotSimilarity -- SweetSpotSimilarity gives small increases + as the frequency increases a small amount + and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is + more significant.
Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a + matching term occurs. In these + cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization -- By overriding lengthNorm, + it is possible to discount how the length of a field contributes + to a score. In DefaultSimilarity, + lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be + 1 / (numTerms in field), all fields will be treated + "fairly".
[One would override the Similarity in] ... any situation where you know more about your data than just + that + it's "text" is a situation where it *might* make sense to override your + Similarity method.+ + +
Changing scoring is an expert level task, so tread carefully and be prepared to share your code if + you want help. +
+ +With the warning out of the way, it is possible to change a lot more than just the Similarity + when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by + three main classes: +
In some sense, the + Query + class is where it all begins. Without a Query, there would be + nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it + is often responsible + for creating them or coordinating the functionality between them. The + Query class has several methods that are important for + derived classes: +
The + Weight + interface provides an internal representation of the Query so that it can be reused. Any + Searcher + dependent state should be stored in the Weight implementation, + not in the Query class. The interface defines 6 methods that must be implemented: +
The + Scorer + abstract class provides common scoring functionality for all Scorer implementations and + is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which + must be implemented: +
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing Query implementations + aren't appropriate for the + task that you want to do. You might be doing some cutting-edge research, or you need more information + back + out of Lucene (similar to Doug adding SpanQuery functionality).
+FILL IN HERE
+