diff --git a/CHANGES.txt b/CHANGES.txt
index 2488c68abaa..f40ec6d7300 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -142,7 +142,7 @@ Documentation
1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor. (Grant Ingersoll)
2. Added draft scoring.xml document into xdocs. Intent is to be the equivalent of fileformats.xml for scoring. It is not linked into project.xml, so it will not show up on the
- website yet. (Grant Ingersoll and Steve Rowe)
+ website yet. (Grant Ingersoll and Steve Rowe. Updates from: Michael McCandless)

Release 2.0.0 2006-05-26

diff --git a/docs/scoring.html b/docs/scoring.html
index 6366cc5ccde..0f68311bb54 100644
--- a/docs/scoring.html
+++ b/docs/scoring.html
@@ -122,7 +122,7 @@ limitations under the License.
help you figure out the what and why of Lucene scoring.

Lucene scoring uses a combination of the Vector Space Model (VSM) of Information
- Retrieval and the Boolean model
+ Retrieval and the Boolean model
to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to
@@ -181,7 +181,7 @@ limitations under the License.
and the other in one Field will return different scores for the same query due to length normalization (assumming the DefaultSimilarity
- on the Fields.
+ on the Fields.)

@@ -196,13 +196,15 @@ limitations under the License.

- Lucene's scoring formula, taken from
- Similarity
- is
+ Lucene's scoring formula computes the score of one document d for a given query q across each
+ term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more
+ relevant document d is to the query q. This is taken from
+ Similarity:
+

score(q,d) =
-
+ sum t in q(
tf (t in d) *
@@ -224,15 +226,14 @@ limitations under the License.
(q,d) * queryNorm
- (sumOfSqaredWeights)
+ (sumOfSquaredWeights)
-

where

- sumOfSqaredWeights =
+ sumOfSquaredWeights =
sumt in q( idf
@@ -244,18 +245,26 @@ limitations under the License.
(t in q) )^2

- This scoring formula is mostly incorporated into the
+
+ This scoring formula is mostly implemented in the
TermScorer class, where it makes calls to the
- Similarity class to retrieve values for the following:
+ Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation:

- tf - Term Frequency - The number of times the term t appears in the current document being scored.
- idf - Inverse Document Frequency - One divided by the number of documents in which the term t appears in.
- getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.
- lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.
- coord(q, d) - Score factor based on how many terms the specified document has in common with the query.
- queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
+
+ tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
+
+ idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.
+
+ getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.
+
+ lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.
+
+ coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.
+
+ queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure)
- that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.
+ that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
+ to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided for context and are not authoratitive.
diff --git a/xdocs/scoring.xml b/xdocs/scoring.xml
index 0ed641eb63c..68503cdcfd6 100644
--- a/xdocs/scoring.xml
+++ b/xdocs/scoring.xml
@@ -17,7 +17,7 @@
help you figure out the what and why of Lucene scoring.

Lucene scoring uses a combination of the Vector Space Model (VSM) of Information
- Retrieval and the Boolean model
+ Retrieval and the Boolean model
to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to
@@ -58,18 +58,20 @@
and the other in one Field will return different scores for the same query due to length normalization (assumming the DefaultSimilarity
- on the Fields.
+ on the Fields.)

- Lucene's scoring formula, taken from
- Similarity
- is
+ Lucene's scoring formula computes the score of one document d for a given query q across each
+ term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more
+ relevant document d is to the query q. This is taken from
+ Similarity:
+

score(q,d) =
-
+ sum t in q(
tf (t in d) *
@@ -91,15 +93,14 @@
(q,d) * queryNorm
- (sumOfSqaredWeights)
+ (sumOfSquaredWeights)
-

where

- sumOfSqaredWeights =
+ sumOfSquaredWeights =
sumt in q( idf
@@ -111,19 +112,26 @@
(t in q) )^2

- This scoring formula is mostly incorporated into the
+
+ This scoring formula is mostly implemented in the
TermScorer class, where it makes calls to the
- Similarity class to retrieve values for the following:
+ Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation:

- tf - Term Frequency - The number of times the term t appears in the current document being scored.
- idf - Inverse Document Frequency - One divided by the number of documents in which the term t appears in.
- getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term.
- lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Usually longer fields return a smaller value.
- coord(q, d) - Score factor based on how many terms the specified document has in common with the query.
- queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
+
+ tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
+
+ idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.
+
+ getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.
+
+ lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.
+
+ coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.
+
+ queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable
GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem
- to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?
+ to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided for context and are not authoratitive.
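
Reviewer sketch (not part of the patch): the formula described in the added text can be illustrated with a few lines of Python. This is only a rough model under stated assumptions. The per-factor formulas (square-root tf, logarithmic idf, 1/sqrt length norm, overlap-ratio coord, 1/sqrt query norm) are what DefaultSimilarity is commonly described as using and should be checked against the javadocs; the function and parameter names are hypothetical and do not correspond to Lucene's API; a single field is assumed.

```python
import math

# Assumed DefaultSimilarity-style factors; verify against the Similarity javadocs.
def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def length_norm(num_terms_in_field):
    return 1.0 / math.sqrt(num_terms_in_field)

def coord(overlap, max_overlap):
    return overlap / max_overlap

def query_norm(sum_of_squared_weights):
    return 1.0 / math.sqrt(sum_of_squared_weights)

def score(query_boosts, doc_term_freqs, doc_freqs, num_docs, field_length):
    """score(q,d) = sum over t in q of tf * idf * boost * lengthNorm,
    multiplied by coord(q,d) and queryNorm(sumOfSquaredWeights).

    query_boosts:   {term: boost} for every term t in q (hypothetical input shape)
    doc_term_freqs: {term: occurrences of the term in document d}
    doc_freqs:      {term: number of documents containing the term}
    """
    # sumOfSquaredWeights = sum over t in q of (idf(t) * getBoost(t in q))^2
    sum_sq = sum((idf(doc_freqs[t], num_docs) * boost) ** 2
                 for t, boost in query_boosts.items())
    matched = [t for t in query_boosts if doc_term_freqs.get(t, 0) > 0]
    if not matched:
        return 0.0
    total = sum(tf(doc_term_freqs[t])
                * idf(doc_freqs[t], num_docs)
                * query_boosts[t]
                * length_norm(field_length)
                for t in matched)
    return total * coord(len(matched), len(query_boosts)) * query_norm(sum_sq)
```

With these assumptions, a document matching both query terms scores higher than one matching a single term, and the same match in a shorter field scores higher than in a longer one, which is the behaviour the list items describe.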