diff --git a/docs/benchmarks.html b/docs/benchmarks.html index a51e2eeac04..9cf3b289242 100644 --- a/docs/benchmarks.html +++ b/docs/benchmarks.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/contributions.html b/docs/contributions.html index 4986488add8..6caaf361ad8 100644 --- a/docs/contributions.html +++ b/docs/contributions.html @@ -89,6 +89,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/demo.html b/docs/demo.html index c087cc3759d..52d31bff5e6 100644 --- a/docs/demo.html +++ b/docs/demo.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/demo2.html b/docs/demo2.html index 530a63080b3..d0a565deb53 100644 --- a/docs/demo2.html +++ b/docs/demo2.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/demo3.html b/docs/demo3.html index 9d044a73d56..99633fb29ef 100644 --- a/docs/demo3.html +++ b/docs/demo3.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/demo4.html b/docs/demo4.html index 04ce3d5480e..98b85d75aca 100644 --- a/docs/demo4.html +++ b/docs/demo4.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/features.html b/docs/features.html index 73dbd7aca2a..14a2a24a344 100644 --- a/docs/features.html +++ b/docs/features.html @@ -83,6 +83,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/fileformats.html b/docs/fileformats.html index 599281128a9..b7772539515 100644 --- a/docs/fileformats.html +++ b/docs/fileformats.html @@ -83,6 +83,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/gettingstarted.html b/docs/gettingstarted.html index 03f61435bd5..4e52ace92be 100644 --- a/docs/gettingstarted.html +++ b/docs/gettingstarted.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/index.html b/docs/index.html index 523209b0ac1..dc18a00e608 100644 --- a/docs/index.html +++ b/docs/index.html @@ -91,6 +91,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/lucene-sandbox/index.html b/docs/lucene-sandbox/index.html index d61517ceb85..772d15d97d2 100644 --- a/docs/lucene-sandbox/index.html +++ b/docs/lucene-sandbox/index.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/mailinglists.html b/docs/mailinglists.html index 3ad0bbe0a0b..bfd5da43863 100644 --- a/docs/mailinglists.html +++ b/docs/mailinglists.html @@ -83,6 +83,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/queryparsersyntax.html b/docs/queryparsersyntax.html index 5e0d8ab7748..d2dc6826098 100644 --- a/docs/queryparsersyntax.html +++ b/docs/queryparsersyntax.html @@ -87,6 +87,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/resources.html b/docs/resources.html index eadc1e3d738..01a9c7fa056 100644 --- a/docs/resources.html +++ b/docs/resources.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/scoring.html b/docs/scoring.html index 506df0deb62..7ddb74b9151 100644 --- a/docs/scoring.html +++ b/docs/scoring.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • @@ -336,126 +338,8 @@ limitations under the License.
    -

    - TermQuery -

    -

    Of the various implementations of - Query, the - TermQuery - is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified - Term, - which is a word that occurs in a certain - Field. - Thus, a TermQuery identifies and scores all - Documents that have a Field with the specified string in it. - Constructing a TermQuery - is as simple as: -

    -      TermQuery tq = new TermQuery(new Term("fieldName", "term");
    -		    
    In this example, the Query identifies all Documents that have the Field named "fieldName" and - contain the word "term". -

    -

    - BooleanQuery -

    -

    Things start to get interesting when one combines multiple - TermQuery instances into a BooleanQuery. - A BooleanQuery contains multiple - BooleanClauses, - where each clause contains a sub-query (Query - instance) and an operator (from BooleanClause.Occur) - describing how that sub-query is combined with the other clauses: -

      - -
    1. SHOULD -- Use this operator when a clause can occur in the result set, but is not required. - If a query is made up of all SHOULD clauses, then every document in the result - set matches at least one of these clauses.

    2. - -
    3. MUST -- Use this operator when a clause is required to occur in the result set. Every - document in the result set will match - all such clauses.

    4. - -
    5. MUST NOT -- Use this operator when a - clause must not occur in the result set. No - document in the result set will match - any such clauses.

    6. -
    - Boolean queries are constructed by adding two or more - BooleanClause - instances. If too many clauses are added, a TooManyClauses - exception will be thrown during searching. This most often occurs - when a Query - is rewritten into a BooleanQuery with many - TermQuery clauses, - for example by WildcardQuery. - The default setting for the maximum number - of clauses 1024, but this can be changed via the - static method setMaxClauseCount - in BooleanQuery. -

    -

    Phrases

    -

    Another common search is to find documents containing certain phrases. This - is handled in two different ways. -

      -
    1. -

      PhraseQuery - -- Matches a sequence of - Terms. - PhraseQuery uses a slop factor to determine - how many positions may occur between any two terms in the phrase and still be considered a match.

      -
    2. -
    3. -

      SpanNearQuery - -- Matches a sequence of other - SpanQuery - instances. SpanNearQuery allows for much more - complicated phrase queries since it is constructed from other to SpanQuery - instances, instead of only TermQuery instances.

      -
    4. -
    -

    -

    - RangeQuery -

    -

    The - RangeQuery - matches all documents that occur in the - exclusive range of a lower - Term - and an upper - Term. - For example, one could find all documents - that have terms beginning with the letters a through c. This type of Query is frequently used to - find - documents that occur in a specific date range. -

    -

    - PrefixQuery, - WildcardQuery -

    -

    While the - PrefixQuery - has a different implementation, it is essentially a special case of the - WildcardQuery. - The PrefixQuery allows an application - to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing - for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that - WildcardQuery should - not start with * and ?, as these are extremely slow. For tricks on how to search using a wildcard at - the beginning of a term, see - - Starts With x and Ends With x Queries - from the Lucene users's mailing list. -

    -

    - FuzzyQuery -

    -

    A - FuzzyQuery - matches documents that contain terms similar to the specified term. Similarity is - determined using - Levenshtein (edit) distance. - This type of query can be useful when accounting for spelling variations in the collection. +

    For information on the Query Classes, refer to the + search package javadocs

    @@ -469,36 +353,9 @@ limitations under the License.
    -

    Chances are DefaultSimilarity is sufficient for all your searching needs. - However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to - distinguish between shorter and longer documents (see a "fair" similarity).

    -

    To change Similarity, one must do so for both indexing and searching, and the changes must happen before - either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen. -

    -

    To make this change, implement your own Similarity (likely you'll want to simply subclass - DefaultSimilarity) and then use the new - class by calling - IndexWriter.setSimilarity before indexing and - Searcher.setSimilarity before searching. -

    -

    - If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at Overriding Similarity. - In summary, here are a few use cases: -

      -
    1. SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount - and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.

    2. -
    3. Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these - cases people have overridden Similarity to return 1 from the tf() method.

    4. -
    5. Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes - to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be - 1 / (numTerms in field), all fields will be treated - "fairly".

    6. -
    - In general, Chris Hostetter sums it up best in saying (from the Lucene users's mailing list): -
    [One would override the Similarity in] ... any situation where you know more about your data then just that - it's "text" is a situation where it *might* make sense to to override your - Similarity method.
    -

    +

    One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on + how to do this, see the + search package javadocs


    @@ -516,169 +373,10 @@ limitations under the License.
    -

    Changing scoring is an expert level task, so tread carefully and be prepared to share your code if - you want help. +

    At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more + about how to do this, refer to the + search package javadocs

    -

    With the warning out of the way, it is possible to change a lot more than just the Similarity - when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by - three main classes: -

      -
    1. - Query -- The abstract object representation of the user's information need.
    2. -
    3. - Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
    4. -
    5. - Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.
    6. -
    - Details on each of these classes, and their children can be found in the subsections below. -

    - - - - -
    - - The Query Class - -
    -
    -

    In some sense, the - Query - class is where it all begins. Without a Query, there would be - nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it - is often responsible - for creating them or coordinating the functionality between them. The - Query class has several methods that are important for - derived classes: -

      -
    1. createWeight(Searcher searcher) -- A - Weight is the internal representation of the Query, so each Query implementation must - provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
    2. -
    3. rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: - TermQuery, - BooleanQuery, OTHERS????
    4. -
    -

    -
    -

    - - - - -
    - - The Weight Interface - -
    -
    -

    The - Weight - interface provides an internal representation of the Query so that it can be reused. Any - Searcher - dependent state should be stored in the Weight implementation, - not in the Query class. The interface defines 6 methods that must be implemented: -

      -
    1. - Weight#getQuery() -- Pointer to the Query that this Weight represents.
    2. -
    3. - Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is - equal to the idf^2 * boost * queryNorm
    4. -
    5. - - Weight#sumOfSquaredWeights() -- The sum of squared weights. Tor TermQuery, this is (idf * - boost)^2
    6. -
    7. - - Weight#normalize(float) -- Determine the query normalization factor. The query normalization may - allow for comparing scores between queries.
    8. -
    9. - - Weight#scorer(IndexReader) -- Construct a new - Scorer - for this Weight. See - The Scorer Class - below for help defining a Scorer. As the name implies, the - Scorer is responsible for doing the actual scoring of documents given the Query. -
    10. -
    11. - - Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored - the way it was.
    12. -
    -

    -
    -

    - - - - -
    - - The Scorer Class - -
    -
    -

    The - Scorer - abstract class provides common scoring functionality for all Scorer implementations and - is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which - must be implemented: -

      -
    1. - Scorer#next() -- Advances to the next document that matches this Query, returning true if and only - if there is another document that matches.
    2. -
    3. - Scorer#doc() -- Returns the id of the - Document - that contains the match. Is not valid until next() has been called at least once. -
    4. -
    5. - Scorer#score() -- Return the score of the current document. This value can be determined in any - appropriate way for an application. For instance, the - TermScorer - returns the tf * Weight.getValue() * fieldNorm. -
    6. -
    7. - Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than - or equal to the passed in value. In many instances, skipTo can be - implemented more efficiently than simply looping through all the matching documents until - the target document is identified.
    8. -
    9. - Scorer#explain(int) -- Provides details on why the score came about.
    10. -
    -

    -
    -

    - - - - -
    - - Why would I want to add my own Query? - -
    -
    -

    In a nutshell, you want to add your own custom Query implementation when you think that Lucene's - aren't appropriate for the - task that you want to do. You might be doing some cutting edge research or you need more information - back - out of Lucene (similar to Doug adding SpanQuery functionality).

    -
    -

    - - - - -
    - - Examples - -
    -
    -

    FILL IN HERE

    -
    -

    diff --git a/docs/systemproperties.html b/docs/systemproperties.html index c5dc7835f02..a8213d1c2c1 100644 --- a/docs/systemproperties.html +++ b/docs/systemproperties.html @@ -85,6 +85,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/docs/whoweare.html b/docs/whoweare.html index 2c4a6fdff08..5f04c1c216a 100644 --- a/docs/whoweare.html +++ b/docs/whoweare.html @@ -87,6 +87,8 @@ limitations under the License.
  • Query Syntax
  • File Formats +
  • +
  • Scoring
  • Javadoc
  • diff --git a/src/java/org/apache/lucene/search/package.html b/src/java/org/apache/lucene/search/package.html index c5827c69725..cdf8dbc619e 100644 --- a/src/java/org/apache/lucene/search/package.html +++ b/src/java/org/apache/lucene/search/package.html @@ -3,13 +3,356 @@ + +

    Table Of Contents

    +

    +

      +
    1. Search Basics
    2. +
    3. The Query Classes
    4. +
    5. Changing the Scoring
    6. +
    +

    + +

    Search

    +

    Search over indices. Applications usually call {@link org.apache.lucene.search.Searcher#search(Query)} or {@link org.apache.lucene.search.Searcher#search(Query,Filter)}. + +

    + +

    Query Classes

    +

    + TermQuery +

    + +

    Of the various implementations of + Query, the + TermQuery + is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the + specified + Term, + which is a word that occurs in a certain + Field. + Thus, a TermQuery identifies and scores all + Documents that have a Field with the specified string in it. + Constructing a TermQuery + is as simple as: +

    +        TermQuery tq = new TermQuery(new Term("fieldName", "term"));
    +    
    In this example, the Query identifies all Documents that have the Field named "fieldName" and + contain the word "term". +

    +

    + BooleanQuery +

    + +

    Things start to get interesting when one combines multiple + TermQuery instances into a BooleanQuery. + A BooleanQuery contains multiple + BooleanClauses, + where each clause contains a sub-query (Query + instance) and an operator (from BooleanClause.Occur) + describing how that sub-query is combined with the other clauses: +

      + +
    1. SHOULD -- Use this operator when a clause can occur in the result set, but is not required. + If a query is made up of all SHOULD clauses, then every document in the result + set matches at least one of these clauses.

    2. + +
    3. MUST -- Use this operator when a clause is required to occur in the result set. Every + document in the result set will match + all such clauses.

    4. + +
    5. MUST NOT -- Use this operator when a + clause must not occur in the result set. No + document in the result set will match + any such clauses.

    6. +
    + Boolean queries are constructed by adding two or more + BooleanClause + instances. If too many clauses are added, a TooManyClauses + exception will be thrown during searching. This most often occurs + when a Query + is rewritten into a BooleanQuery with many + TermQuery clauses, + for example by WildcardQuery. + The default setting for the maximum number + of clauses is 1024, but this can be changed via the + static method setMaxClauseCount + in BooleanQuery. +
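The clause semantics described above can be sketched with a toy model. This is only an illustration: `BooleanClauseSketch`, its `Clause` type, and its `matches` method are hypothetical names rather than Lucene classes, a "document" is reduced to a set of terms, and the sketch assumes that SHOULD clauses become optional once at least one MUST clause is present.

```java
import java.util.List;
import java.util.Set;

// Toy model of the clause semantics described above; these types are
// hypothetical and are not Lucene's BooleanQuery or BooleanClause.
class BooleanClauseSketch {
    enum Occur { SHOULD, MUST, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // A "document" here is just the set of terms it contains.
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean sawMust = false;
        boolean anyShouldMatched = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    sawMust = true;
                    if (!present) return false;   // every MUST clause is required
                    break;
                case MUST_NOT:
                    if (present) return false;    // any MUST_NOT hit excludes the doc
                    break;
                case SHOULD:
                    anyShouldMatched |= present;  // optional, but remembered
                    break;
            }
        }
        // Assumption: with no MUST clause, at least one SHOULD clause has to match.
        return sawMust || anyShouldMatched;
    }
}
```

Under these assumptions, a document containing "lucene" and "search" matches MUST(lucene) together with SHOULD(index), but not MUST(lucene) together with MUST_NOT(search).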

    + +

    Phrases

    + +

    Another common search is to find documents containing certain phrases. This + is handled in two different ways. +

      +
    1. +

      PhraseQuery + -- Matches a sequence of + Terms. + PhraseQuery uses a slop factor to determine + how many positions may occur between any two terms in the phrase and still be considered a match.

      +
    2. +
    3. +

      SpanNearQuery + -- Matches a sequence of other + SpanQuery + instances. SpanNearQuery allows for + much more + complicated phrase queries since it is constructed from other SpanQuery + instances, instead of only TermQuery + instances.

      +
    4. +
    +

    +

    + RangeQuery +

    + +

    The + RangeQuery + matches all documents that occur in the + exclusive range of a lower + Term + and an upper + Term. + For example, one could find all documents + that have terms beginning with the letters a through c. This type of Query is frequently used to + find + documents that occur in a specific date range. +

    +

    + PrefixQuery, + WildcardQuery +

    + +

    While the + PrefixQuery + has a different implementation, it is essentially a special case of the + WildcardQuery. + The PrefixQuery allows an application + to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing + for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. + Note that the WildcardQuery can be quite slow. Also + note that + WildcardQuery should + not start with * or ?, as these are extremely slow. For tricks on how to search using a wildcard + at + the beginning of a term, see + + Starts With x and Ends With x Queries + from the Lucene users' mailing list. +
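To make the two wildcard characters concrete, here is a minimal sketch of matching a single term against a pattern. It is a generic backtracking matcher written for illustration, with the hypothetical names `WildcardSketch` and `matches`; it is not how WildcardQuery enumerates terms in an index.

```java
// Illustrative matcher for '*' (zero or more characters) and '?' (exactly one).
// Hypothetical helper; not Lucene's WildcardQuery implementation.
class WildcardSketch {
    static boolean matches(String pattern, String text) {
        int p = 0;          // position in pattern
        int t = 0;          // position in text
        int star = -1;      // index of the most recent '*' in pattern
        int mark = 0;       // text position where that '*' started matching
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;             // remember the star, try matching zero chars
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (star != -1) {
                p = star + 1;           // backtrack: let the star absorb one more char
                t = ++mark;
            } else {
                return false;
            }
        }
        // Only trailing stars may remain in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }
}
```

A prefix search for "te" corresponds to the pattern `te*` in this sketch, which is the sense in which PrefixQuery is a special case.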

    +

    + FuzzyQuery +

    + +

    A + FuzzyQuery + matches documents that contain terms similar to the specified term. Similarity is + determined using + Levenshtein (edit) distance. + This type of query can be useful when accounting for spelling variations in the collection. +
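For readers unfamiliar with Levenshtein distance, the following is a textbook dynamic-programming sketch. The class and method names are hypothetical, and nothing here is taken from FuzzyQuery itself, which may compute and threshold similarity differently.

```java
// Textbook edit distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
// Hypothetical helper; not Lucene's FuzzyQuery code.
class EditDistanceSketch {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                        // distance from "" to b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                        // distance from a[0..i) to ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;                   // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Spelling variants such as "color" and "colour" are one edit apart, which is the kind of variation the paragraph above has in mind.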

    + +

    Changing Similarity

    + +

    Chances are DefaultSimilarity is sufficient for all + your searching needs. + However, in some applications it may be necessary to customize your Similarity implementation. For instance, some + applications do not need to + distinguish between shorter and longer documents (see a "fair" similarity).

    + +

    To change Similarity, one must do so for both indexing and + searching, and the changes must happen before + either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it + just isn't well-defined what is going to happen. +

    + +

    To make this change, implement your own Similarity (likely + you'll want to simply subclass + DefaultSimilarity) and then use the new + class by calling + IndexWriter.setSimilarity + before indexing and + Searcher.setSimilarity + before searching. +

    + +

    + If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. + In summary, here are a few use cases: +

      +
    1. SweetSpotSimilarity -- SweetSpotSimilarity gives small increases + as the frequency increases a small amount + and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is + more significant.

    2. +
    3. Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a + matching term occurs. In these + cases people have overridden Similarity to return 1 from the tf() method.

    4. +
    5. Changing Length Normalization -- By overriding lengthNorm, + it is possible to discount how the length of a field contributes + to a score. In DefaultSimilarity, + lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be + 1 / (numTerms in field), all fields will be treated + "fairly".

    6. +
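The two length normalization formulas mentioned in the list above can be compared numerically. The sketch below is plain arithmetic with hypothetical method names; it does not subclass Similarity, and it ignores how Lucene encodes norms when storing them in the index.

```java
// Side-by-side arithmetic for the two formulas quoted above.
// Hypothetical helper; a real override would subclass DefaultSimilarity.
class LengthNormSketch {
    // 1 / (numTerms in field)^0.5
    static float sqrtLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // 1 / (numTerms in field)
    static float linearLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }
}
```

For a field with 4 terms the two formulas give 0.5 and 0.25; for 100 terms they give roughly 0.1 and 0.01, so the second formula falls off faster as fields get longer.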
    + In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list): +
    [One would override the Similarity in] ... any situation where you know more about your data then just + that + it's "text" is a situation where it *might* make sense to to override your + Similarity method.
    +

    + +

    Changing Scoring -- Expert Level

    + +

    Changing scoring is an expert level task, so tread carefully and be prepared to share your code if + you want help. +

    + +

    With the warning out of the way, it is possible to change a lot more than just the Similarity + when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by + three main classes: +

      +
    1. + Query -- The abstract object representation of the + user's information need.
    2. +
    3. + Weight -- The internal interface representation of + the user's Query, so that Query objects may be reused.
    4. +
    5. + Scorer -- An abstract class containing common + functionality for scoring. Provides both scoring and explanation capabilities.
    6. +
    + Details on each of these classes and their children can be found in the subsections below. +

    +

    The Query Class

    +

    In some sense, the + Query + class is where it all begins. Without a Query, there would be + nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it + is often responsible + for creating them or coordinating the functionality between them. The + Query class has several methods that are important for + derived classes: +

      +
    1. createWeight(Searcher searcher) -- A + Weight is the internal representation of the + Query, so each Query implementation must + provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight + interface.
    2. +
    3. rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: + TermQuery, + BooleanQuery, OTHERS????
    4. +
    +

    +

    The Weight Interface

    +

    The + Weight + interface provides an internal representation of the Query so that it can be reused. Any + Searcher + dependent state should be stored in the Weight implementation, + not in the Query class. The interface defines 6 methods that must be implemented: +

      +
    1. + Weight#getQuery() -- Pointer to the + Query that this Weight represents.
    2. +
    3. + Weight#getValue() -- The weight for + this Query. For example, the TermQuery.TermWeight value is + equal to the idf^2 * boost * queryNorm
    4. +
    5. + + Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * + boost)^2
    6. +
    7. + + Weight#normalize(float) -- Determine the query normalization factor. The query normalization may + allow for comparing scores between queries.
    8. +
    9. + + Weight#scorer(IndexReader) -- Construct a new + Scorer + for this Weight. See + The Scorer Class + below for help defining a Scorer. As the name implies, the + Scorer is responsible for doing the actual scoring of documents given the Query. +
    10. +
    11. + + Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was + scored + the way it was.
    12. +
    +
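The TermQuery formulas quoted in the list above can be worked through with plain arithmetic. The sketch uses hypothetical method names and does not implement the Weight interface. It also assumes the query normalization factor is 1 / sqrt(sum of squared weights), which the text above does not state.

```java
// Arithmetic behind the TermQuery figures quoted above.
// Hypothetical helper; not an implementation of Lucene's Weight.
class TermWeightSketch {
    // (idf * boost)^2
    static float sumOfSquaredWeights(float idf, float boost) {
        float w = idf * boost;
        return w * w;
    }

    // Assumption: norm = 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // idf^2 * boost * queryNorm
    static float value(float idf, float boost, float queryNorm) {
        return idf * idf * boost * queryNorm;
    }
}
```

With idf = 2 and boost = 1.5, the sum of squared weights is 9, the assumed normalization factor is 1/3, and the resulting value is 2. For a single-term query under this assumption the value reduces to idf.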

    +

    The Scorer Class

    +

    The + Scorer + abstract class provides common scoring functionality for all Scorer implementations and + is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which + must be implemented: +

      +
    1. + Scorer#next() -- Advances to the next + document that matches this Query, returning true if and only + if there is another document that matches.
    2. +
    3. + Scorer#doc() -- Returns the id of the + Document + that contains the match. Is not valid until next() has been called at least once. +
    4. +
    5. + Scorer#score() -- Return the score of the + current document. This value can be determined in any + appropriate way for an application. For instance, the + TermScorer + returns the tf * Weight.getValue() * fieldNorm. +
    6. +
    7. + Scorer#skipTo(int) -- Skip ahead in + the document matches to the document whose id is greater than + or equal to the passed in value. In many instances, skipTo can be + implemented more efficiently than simply looping through all the matching documents until + the target document is identified.
    8. +
    9. + Scorer#explain(int) -- Provides + details on why the score came about.
    10. +
    +
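The iteration contract described in the list above can be illustrated with a toy scorer backed by a sorted array of document ids. `ArrayScorerSketch` is a hypothetical stand-alone class that mirrors the method names from the list but does not extend Lucene's Scorer. It assumes that skipTo always advances past the current document, and it uses a binary search to show why skipTo can beat repeated calls to next().

```java
// Toy scorer over precomputed (doc id, score) pairs, sorted by doc id.
// Hypothetical class; it mirrors the method names above without extending Scorer.
class ArrayScorerSketch {
    private final int[] docs;
    private final float[] scores;
    private int pos = -1;               // before the first match until next() is called

    ArrayScorerSketch(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    // Advances to the next match; false once the matches are exhausted.
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    // Only valid after next() or skipTo() has returned true.
    int doc() {
        return docs[pos];
    }

    float score() {
        return scores[pos];
    }

    // Jumps to the first match after the current one with id >= target,
    // using binary search instead of stepping through every match.
    boolean skipTo(int target) {
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        pos = lo;
        return pos < docs.length;
    }
}
```

A caller would typically loop with `while (scorer.next())`, reading `doc()` and `score()` inside the loop, and call `skipTo` when another clause has already ruled out a range of documents.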

    +

    Why would I want to add my own Query?

    + +

    In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing implementations + aren't appropriate for the + task that you want to do. You might be doing some cutting edge research or you need more information + back + out of Lucene (similar to Doug adding SpanQuery functionality).

    +

    Examples

    +

    FILL IN HERE

    + diff --git a/xdocs/scoring.xml b/xdocs/scoring.xml index 4ac236869b2..6da61ac704c 100644 --- a/xdocs/scoring.xml +++ b/xdocs/scoring.xml @@ -184,281 +184,22 @@

    -

    - TermQuery -

    -

    Of the various implementations of - Query, the - TermQuery - is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified - Term, - which is a word that occurs in a certain - Field. - Thus, a TermQuery identifies and scores all - Documents that have a Field with the specified string in it. - Constructing a TermQuery - is as simple as: -

    -      TermQuery tq = new TermQuery(new Term("fieldName", "term");
    -		    
    In this example, the Query identifies all Documents that have the Field named "fieldName" and - contain the word "term". -

    -

    - BooleanQuery -

    -

    Things start to get interesting when one combines multiple - TermQuery instances into a BooleanQuery. - A BooleanQuery contains multiple - BooleanClauses, - where each clause contains a sub-query (Query - instance) and an operator (from BooleanClause.Occur) - describing how that sub-query is combined with the other clauses: -

      - -
    1. SHOULD -- Use this operator when a clause can occur in the result set, but is not required. - If a query is made up of all SHOULD clauses, then every document in the result - set matches at least one of these clauses.

    2. - -
    3. MUST -- Use this operator when a clause is required to occur in the result set. Every - document in the result set will match - all such clauses.

    4. - -
    5. MUST NOT -- Use this operator when a - clause must not occur in the result set. No - document in the result set will match - any such clauses.

    6. -
    - Boolean queries are constructed by adding two or more - BooleanClause - instances. If too many clauses are added, a TooManyClauses - exception will be thrown during searching. This most often occurs - when a Query - is rewritten into a BooleanQuery with many - TermQuery clauses, - for example by WildcardQuery. - The default setting for the maximum number - of clauses 1024, but this can be changed via the - static method setMaxClauseCount - in BooleanQuery. -

    - -

    Phrases

    -

    Another common search is to find documents containing certain phrases. This - is handled in two different ways. -

      -
    1. -

      PhraseQuery - -- Matches a sequence of - Terms. - PhraseQuery uses a slop factor to determine - how many positions may occur between any two terms in the phrase and still be considered a match.

      -
    2. -
    3. -

      SpanNearQuery - -- Matches a sequence of other - SpanQuery - instances. SpanNearQuery allows for much more - complicated phrase queries since it is constructed from other to SpanQuery - instances, instead of only TermQuery instances.

      -
    4. -
    -

    -

    - RangeQuery -

    -

    The - RangeQuery - matches all documents that occur in the - exclusive range of a lower - Term - and an upper - Term. - For example, one could find all documents - that have terms beginning with the letters a through c. This type of Query is frequently used to - find - documents that occur in a specific date range. -

    -

    - PrefixQuery, - WildcardQuery -

    -

    While the - PrefixQuery - has a different implementation, it is essentially a special case of the - WildcardQuery. - The PrefixQuery allows an application - to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing - for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that - WildcardQuery should - not start with * and ?, as these are extremely slow. For tricks on how to search using a wildcard at - the beginning of a term, see - - Starts With x and Ends With x Queries - from the Lucene users's mailing list. -

    -

    - FuzzyQuery -

    -

    A - FuzzyQuery - matches documents that contain terms similar to the specified term. Similarity is - determined using - Levenshtein (edit) distance. - This type of query can be useful when accounting for spelling variations in the collection. +

    For information on the Query Classes, refer to the + search package javadocs

    -

    Chances are DefaultSimilarity is sufficient for all your searching needs. - However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to - distinguish between shorter and longer documents (see a "fair" similarity).

    - -

    To change Similarity, one must do so for both indexing and searching, and the changes must happen before - either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen. -

    - -

    To make this change, implement your own Similarity (likely you'll want to simply subclass - DefaultSimilarity) and then use the new - class by calling - IndexWriter.setSimilarity before indexing and - Searcher.setSimilarity before searching. -

    -

    - If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. - In summary, here are a few use cases: -

      -
    1. SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount - and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.

    2. -
    3. Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these - cases people have overridden Similarity to return 1 from the tf() method.

    4. -
    5. Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes - to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be - 1 / (numTerms in field), all fields will be treated - "fairly".

    6. -
    - In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list): -
    [One would override the Similarity in] ... any situation where you know more about your data then just that - it's "text" is a situation where it *might* make sense to to override your - Similarity method.
    -
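    The two length-normalization formulas quoted in the use cases above can be compared side by side with a little arithmetic. The sketch below is standalone and does not subclass Similarity; the class and method names are hypothetical and only reproduce the formulas as stated in the text.

```java
// Hypothetical helper for illustration; it mirrors the formulas in the text
// but is not a Lucene Similarity implementation.
final class LengthNormSketch {

    /** The default formula quoted above: 1 / (numTerms in field)^0.5. */
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** The alternative formula quoted above: 1 / (numTerms in field). */
    static float alternativeLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 4, 100}) {
            System.out.println(numTerms + " terms: default="
                    + defaultLengthNorm(numTerms)
                    + " alternative=" + alternativeLengthNorm(numTerms));
        }
    }
}
```

    In an actual application the chosen formula would go into an overridden lengthNorm method on a Similarity subclass, set on both the IndexWriter and the Searcher as described earlier.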

    +

    One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on + how to do this, see the + search package javadocs

    -

    Changing scoring is an expert level task, so tread carefully and be prepared to share your code if - you want help. +

    At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes). To learn more + about how to do this, refer to the + search package javadocs

    -

    With the warning out of the way, it is possible to change a lot more than just the Similarity - when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by - three main classes: -

      -
    1. - Query -- The abstract object representation of the user's information need.
    2. -
    3. - Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
    4. -
    5. - Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.
    6. -
    - Details on each of these classes, and their children can be found in the subsections below. -

    - -

    In some sense, the - Query - class is where it all begins. Without a Query, there would be - nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it - is often responsible - for creating them or coordinating the functionality between them. The - Query class has several methods that are important for - derived classes: -

      -
    1. createWeight(Searcher searcher) -- A - Weight is the internal representation of the Query, so each Query implementation must - provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
    2. -
    3. rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries are: - TermQuery, - BooleanQuery, OTHERS????
    4. -
    -

    -
    - -

    The - Weight - interface provides an internal representation of the Query so that it can be reused. Any - Searcher - dependent state should be stored in the Weight implementation, - not in the Query class. The interface defines 6 methods that must be implemented: -

      -
    1. - Weight#getQuery() -- Pointer to the Query that this Weight represents.
    2. -
    3. - Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is - equal to the idf^2 * boost * queryNorm
    4. -
    5. - - Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * - boost)^2
    6. -
    7. - - Weight#normalize(float) -- Determine the query normalization factor. The query normalization may - allow for comparing scores between queries.
    8. -
    9. - - Weight#scorer(IndexReader) -- Construct a new - Scorer - for this Weight. See - The Scorer Class - below for help defining a Scorer. As the name implies, the - Scorer is responsible for doing the actual scoring of documents given the Query. -
    10. -
    11. - - Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored - the way it was.
    12. -
    -
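    The relationship between sumOfSquaredWeights(), normalize(float) and getValue() for a single term can be worked through numerically. The sketch below is a standalone class with hypothetical names, not Lucene's TermWeight. It follows the formulas given in the list above, and it additionally assumes that the query normalization factor is 1 / sqrt(sumOfSquaredWeights), which is not stated in the text.

```java
// Hypothetical stand-in for a term weight; not Lucene's TermQuery.TermWeight.
final class TermWeightSketch {
    private final float idf;
    private float queryWeight; // idf * boost before normalization
    private float value;       // final weight, valid after normalize()

    TermWeightSketch(float idf, float boost) {
        this.idf = idf;
        this.queryWeight = idf * boost;
    }

    /** (idf * boost)^2, as described in the list above. */
    float sumOfSquaredWeights() {
        return queryWeight * queryWeight;
    }

    /** Applies the query normalization factor and fixes the final value. */
    void normalize(float queryNorm) {
        queryWeight *= queryNorm;    // idf * boost * queryNorm
        value = queryWeight * idf;   // idf^2 * boost * queryNorm
    }

    float getValue() {
        return value;
    }

    /** ASSUMPTION: normalization factor taken as 1 / sqrt(sum of squared weights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

    With idf = 2 and boost = 3, the sum of squared weights is 36, the assumed normalization factor is 1/6, and the resulting value is 2^2 * 3 / 6 = 2.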

    -
    - -

    The - Scorer - abstract class provides common scoring functionality for all Scorer implementations and - is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which - must be implemented: -

      -
    1. - Scorer#next() -- Advances to the next document that matches this Query, returning true if and only - if there is another document that matches.
    2. -
    3. - Scorer#doc() -- Returns the id of the - Document - that contains the match. Is not valid until next() has been called at least once. -
    4. -
    5. - Scorer#score() -- Return the score of the current document. This value can be determined in any - appropriate way for an application. For instance, the - TermScorer - returns the tf * Weight.getValue() * fieldNorm. -
    6. -
    7. - Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than - or equal to the passed in value. In many instances, skipTo can be - implemented more efficiently than simply looping through all the matching documents until - the target document is identified.
    8. -
    9. - Scorer#explain(int) -- Provides details on why the score came about.
    10. -
    -
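    The contract between next(), doc() and skipTo(int) described above can be sketched with a standalone iterator over a sorted array of document ids. The class below is hypothetical and does not extend Scorer. It assumes that skipTo always advances at least one position before looking for the target, and it uses a binary search to show how skipTo can avoid stepping through every match.

```java
// Hypothetical iterator for illustration; it does not extend Lucene's Scorer.
final class DocIdIteratorSketch {
    private final int[] docs; // matching document ids, sorted ascending
    private int pos = -1;     // -1 means next() has not been called yet

    DocIdIteratorSketch(int[] sortedDocIds) {
        this.docs = sortedDocIds;
    }

    /** Advances to the next match; returns true if and only if there is one. */
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    /** Current document id; only valid after next() or skipTo() returned true. */
    int doc() {
        return docs[pos];
    }

    /**
     * Moves to the first match at or beyond target, searching only positions
     * after the current one (ASSUMPTION: skipTo always advances at least once).
     */
    boolean skipTo(int target) {
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        pos = lo;
        return pos < docs.length;
    }
}
```

    A real Scorer would also implement score() and explain(int); they are omitted here because they depend on the Weight and Similarity in use.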

    -
    - -

    In a nutshell, you want to add your own custom Query implementation when you think that Lucene's - aren't appropriate for the - task that you want to do. You might be doing some cutting edge research or you need more information - back - out of Lucene (similar to Doug adding SpanQuery functionality).

    -
    - -

    FILL IN HERE

    -
    diff --git a/xdocs/stylesheets/project.xml b/xdocs/stylesheets/project.xml index 04c9e33fc6d..9b6c741e6ea 100644 --- a/xdocs/stylesheets/project.xml +++ b/xdocs/stylesheets/project.xml @@ -19,6 +19,7 @@ +