Updated scoring information to only have one copy of the Scoring Formula. Implemented Doron Cohen's new scoring formula description in the javadoc.

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@454767 13f79535-47bb-0310-9956-ffa450edef68
2006-10-10 14:57:25 +00:00 · 2006-10-10 14:57:25 +00:00 · 75f561901e
parent 0d4e1b171d
commit 75f561901e
4 changed files with 370 additions and 196 deletions
--- a/CHANGES.txt
+++ b/CHANGES.txt
@ -164,7 +164,8 @@ Documentation

  1. Added style sheet to xdocs named lucene.css and included in the Anakia VSL descriptor.  (Grant Ingersoll)

-  2. Added scoring.xml document into xdocs.(Grant Ingersoll and Steve Rowe.  Updates from: Michael McCandless)
+  2. Added scoring.xml document into xdocs.  Updated Similarity.java scoring formula.(Grant Ingersoll and Steve Rowe.  Updates from: Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting).  Issue 664.
+

 Release 2.0.0 2006-05-26

--- a/docs/scoring.html
+++ b/docs/scoring.html
@ -188,6 +188,63 @@ limitations under the License.
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Score Boosting"><strong>Score Boosting</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>Lucene allows influencing search results by "boosting" in more than one level:
+                  <ul>
+                    <li><b>Document level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
+                    before a document is added to the index.
+                    </li>
+                    <li><b>Document's Field level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
+                    before adding a field to the document (and before adding the document to the index).
+                    </li>
+                    <li><b>Query level boosting</b>
+                     - during search, by setting a boost on a query clause, calling
+                     <a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
+                    </li>
+                  </ul>
+                </p>
+                                                <p>Indexing time boosts are preprocessed for storage efficiency and written to
+                  the directory (when writing the document) in a single byte (!) as follows:
+                  For each field of a document, all boosts of that field
+                  (i.e. all boosts under the same field name in that doc) are multiplied.
+                  The result is multiplied by the boost of the document,
+                  and also multiplied by a "field length norm" value
+                  that represents the length of that field in that doc
+                  (so shorter fields are automatically boosted up).
+                  The result is decoded as a single byte
+                  (with some precision loss of course) and stored in the directory.
+                  The similarity object in effect at indexing computes the length-norm of the field.
+                </p>
+                                                <p>This composition of 1-byte representation of norms
+                (that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
+                is nicely described in
+                <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
+                </p>
+                                                <p>Encoding and decoding of the resulted float norm in a single byte are done by the
+                static methods of the class Similarity:
+                <a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
+                <a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
+                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
+                e.g. decode(encode(0.89)) = 0.75.
+                At scoring (search) time, this norm is brought into the score of document
+                as <b>indexBoost</b>, as shown by the formula in
+                <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
+                </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
    </table>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
@ -198,78 +255,10 @@ limitations under the License.
      <tr><td>
        <blockquote>
                                    <p>
-                    Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
-                    term <i>t</i> that occurs in q.  The score attempts to measure relevance, so the higher the score, the more
-		    relevant document <i>d</i> is to the query <i>q</i>.  This is taken from
-		    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
-
-                    <div class="formula">
-                        <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
-                        score(q,d) =
-			<span class="big" id="summation">
-                            sum </span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
-                        (t in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
-                        (t)^2 *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                        getBoost
-                        </A>
-                        (t in q) *
-                        getBoost
-                        (t.field in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
-                            lengthNorm
-                        </A>
-                        (t.field in d) )</span> <span> *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
-                            coord
-                        </A>
-                        (q,d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
-                            queryNorm
-                        </A>(sumOfSquaredWeights)</span>
-                    </div>
-                </p>
-                                                <p>
-                    where
-                    <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
-                    <div id="#sumOfSquares">
-                        sumOfSquaredWeights =
-                        <span class="big">sum</span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
-                            idf
-                        </A>
-                        (t) *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                            getBoost
-                        </A>
-                        (t in q) )^2</span>
-                    </div>
-                </p>
-                                                <p>
-		This scoring formula is mostly implemented in the
-                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
-                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following.  Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
-                    <ol>
-
-                        <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored.  Documents that have more occurrences of a given term receive a higher score.</li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears.  This means rarer terms give higher contribution to the total score.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term.  A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance.  A boost of 1.0 (the default boost) has no effect.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched.  Typically longer fields return a smaller value.  This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query.  Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
-                            <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen.  I have always understood (but not 100% sure)
-                                that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.  However, I also seem
-                            to remember some research on using sum of squares as being somewhat suitable for score comparison.  Anyone have any thoughts here?</span></p></li>
-                    </ol>
-                    Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
-                    for context and are not authoratitive.
+                This scoring formula is described in the
+                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class.  Please take the time to study this formula, as it contains much of the information about how the
+                    basics of Lucene scoring work, especially the
+                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
                </p>
                            </blockquote>
      </td></tr>
--- a/src/java/org/apache/lucene/search/Similarity.java
+++ b/src/java/org/apache/lucene/search/Similarity.java
@ -16,67 +16,271 @@ package org.apache.lucene.search;
 * limitations under the License.
 */

-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.util.SmallFloat;
-
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.Collection;
 import java.util.Iterator;

+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.util.SmallFloat;
+
 /** Expert: Scoring API.
 * <p>Subclasses implement search scoring.
 *
- * <p>The score of query <code>q</code> for document <code>d</code> is defined
- * in terms of these methods as follows:
+ * <p>The score of query <code>q</code> for document <code>d</code> correlates to the
+ * cosine-distance or dot-product between document and query vectors in a
+ * <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
+ * Vector Space Model (VSM) of Information Retrieval</a>.
+ * A document whose vector is closer to the query vector in that model is scored higher.
 *
- * <table cellpadding="0" cellspacing="0" border="0">
+ * The score is computed as follows:
+ *
+ * <P>
+ * <table cellpadding="1" cellspacing="0" border="1" align="center">
+ * <tr><td>
+ * <table cellpadding="1" cellspacing="0" border="0" align="center">
 *  <tr>
- *    <td valign="middle" align="right" rowspan="2">score(q,d) =<br></td>
- *    <td valign="middle" align="center">
- *    <big><big><big><big><big>&Sigma;</big></big></big></big></big></td>
- *    <td valign="middle"><small>
- *    ( {@link #tf(int) tf}(t in d) *
- *    {@link #idf(Term,Searcher) idf}(t)^2 *
- *    {@link Query#getBoost getBoost}(t in q) *
- *    {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
- *    {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
- *    </small></td>
- *    <td valign="middle" rowspan="2">&nbsp;*
- *    {@link #coord(int,int) coord}(q,d) *
- *    {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
+ *    <td valign="middle" align="right" rowspan="1">
+ *      score(q,d) &nbsp; = &nbsp;
+ *      <A HREF="#formula_coord">coord(q,d)</A> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_queryNorm">queryNorm(q)</A> &nbsp;&middot;&nbsp;
+ *    </td>
+ *    <td valign="bottom" align="center" rowspan="1">
+ *      <big><big><big>&sum;</big></big></big>
+ *    </td>
+ *    <td valign="middle" align="right" rowspan="1">
+ *      <big><big>(</big></big>
+ *      <A HREF="#formula_tf">tf(t in d)</A> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_idf">idf(t)</A><sup>2</sup> &nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_termBoost">t.getBoost()</A>&nbsp;&middot;&nbsp;
+ *      <A HREF="#formula_norm">norm(t,d)</A>
+ *      <big><big>)</big></big>
 *    </td>
 *  </tr>
- *  <tr>
- *   <td valign="top" align="right">
- *    <small>t in q</small>
- *    </td>
+ *  <tr valigh="top">
+ *   <td></td>
+ *   <td align="center"><small>t in q</small></td>
+ *   <td></td>
 *  </tr>
 * </table>
+ * </td></tr>
+ * </table>
 *
 * <p> where
+ * <ol>
+ *    <li>
+ *      <A NAME="formula_tf"></A>
+ *      <b>tf(t in d)</b>
+ *      correlates to the term's <i>frequency</i>,
+ *      defined as the number of times term <i>t</i> appears in the currently scored document <i>d</i>.
+ *      Documents that have more occurrences of a given term receive a higher score.
+ *      The default computation for <i>tf(t in d)</i> in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
 *
- * <table cellpadding="0" cellspacing="0" border="0">
- *  <tr>
- *    <td valign="middle" align="right" rowspan="2">sumOfSqaredWeights =<br></td>
- *    <td valign="middle" align="center">
- *    <big><big><big><big><big>&Sigma;</big></big></big></big></big></td>
- *    <td valign="middle"><small>
- *    ( {@link #idf(Term,Searcher) idf}(t) *
- *    {@link Query#getBoost getBoost}(t in q) )^2
- *    </small></td>
- *  </tr>
- *  <tr>
- *   <td valign="top" align="right">
- *    <small>t in q</small>
- *    </td>
- *  </tr>
- * </table>
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="2" cellspacing="2" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} &nbsp; = &nbsp;
+ *          </td>
+ *          <td valign="top" align="center" rowspan="1">
+ *               frequency<sup><big>&frac12;</big></sup>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *    </li>
 *
- * <p> Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
+ *    <li>
+ *      <A NAME="formula_idf"></A>
+ *      <b>idf(t)</b> stands for Inverse Document Frequency. This value
+ *      correlates to the inverse of <i>docFreq</i>
+ *      (the number of documents in which the term <i>t</i> appears).
+ *      This means rarer terms give higher contribution to the total score.
+ *      The default computation for <i>idf(t)</i> in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="2" cellspacing="2" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right">
+ *            {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)}&nbsp; = &nbsp;
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            1 + log <big>(</big>
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            <table>
+ *               <tr><td align="center"><small>numDocs</small></td></tr>
+ *               <tr><td align="center">&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;</td></tr>
+ *               <tr><td align="center"><small>docFreq+1</small></td></tr>
+ *            </table>
+ *          </td>
+ *          <td valign="middle" align="center">
+ *            <big>)</big>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_coord"></A>
+ *      <b>coord(q,d)</b>
+ *      is a score factor based on how many of the query terms are found in the specified document.
+ *      Typically, a document that contains more of the query's terms will receive a higher score
+ *      than another document with fewer query terms.
+ *      This is a search time factor computed in
+ *      {@link #coord(int, int) coord(q,d)}
+ *      by the Similarity in effect at search time.
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li><b>
+ *      <A NAME="formula_queryNorm"></A>
+ *      queryNorm(q)
+ *      </b>
+ *      is a normalizing factor used to make scores between queries comparable.
+ *      This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
+ *      but rather just attempts to make scores from different queries (or even different indexes) comparable.
+ *      This is a search time factor computed by the Similarity in effect at search time.
+ *
+ *      The default computation in
+ *      {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
+ *      is:
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0" align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            queryNorm(q)  &nbsp; = &nbsp;
+ *            {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
+ *            &nbsp; = &nbsp;
+ *          </td>
+ *          <td valign="middle" align="center" rowspan="1">
+ *            <table>
+ *               <tr><td align="center"><big>1</big></td></tr>
+ *               <tr><td align="center"><big>
+ *                  &ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;&ndash;
+ *               </big></td></tr>
+ *               <tr><td align="center">sumOfSquaredWeights<sup><big>&frac12;</big></sup></td></tr>
+ *            </table>
+ *          </td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *
+ *      The sum of squared weights (of the query terms) is
+ *      computed by the query {@link org.apache.lucene.search.Weight} object.
+ *      For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
+ *      computes this value as:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0"n align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} &nbsp; = &nbsp;
+ *            {@link org.apache.lucene.search.Query#getBoost() q.getBoost()} <sup><big>2</big></sup>
+ *            &nbsp;&middot;&nbsp;
+ *          </td>
+ *          <td valign="bottom" align="center" rowspan="1">
+ *            <big><big><big>&sum;</big></big></big>
+ *          </td>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            <big><big>(</big></big>
+ *            <A HREF="#formula_idf">idf(t)</A> &nbsp;&middot;&nbsp;
+ *            <A HREF="#formula_termBoost">t.getBoost()</A>
+ *            <big><big>) <sup>2</sup> </big></big>
+ *          </td>
+ *        </tr>
+ *        <tr valigh="top">
+ *          <td></td>
+ *          <td align="center"><small>t in q</small></td>
+ *          <td></td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_termBoost"></A>
+ *      <b>t.getBoost()</b>
+ *      is a search time boost of term <i>t</i> in the query <i>q</i> as
+ *      specified in the query text
+ *      (see <A HREF="../../../../../queryparsersyntax.html#Boosting a Term">query syntax</A>),
+ *      or as set by application calls to
+ *      {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
+ *      Notice that there is really no direct API for accessing a boost of one term in a multi term query,
+ *      but rather multi terms are represented in a query as multi
+ *      {@link org.apache.lucene.search.TermQuery TermQuery} objects,
+ *      and so the boost of a term in the query is accessible by calling the sub-query
+ *      {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
+ *      <br>&nbsp;<br>
+ *    </li>
+ *
+ *    <li>
+ *      <A NAME="formula_norm"></A>
+ *      <b>norm(t,d)</b> encapsulates a few (indexing time) boost and length factors:
+ *
+ *      <ul>
+ *        <li><b>Document boost</b> - set by calling
+ *        {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
+ *        before adding the document to the index.
+ *        </li>
+ *        <li><b>Field boost</b> - set by calling
+ *        {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
+ *        before adding the field to a document.
+ *        </li>
+ *        <li>{@link #lengthNorm(String, int) <b>lengthNorm</b>(field)} - computed
+ *        when the document is added to the index in accordance with the number of tokens
+ *        of this field in the document, so that shorter fields contribute more to the score.
+ *        LengthNorm is computed by the Similarity class in effect at indexing.
+ *        </li>
+ *      </ul>
+ *
+ *      <p>
+ *      When a document is added to the index, all the above factors are multiplied.
+ *      If the document has multiple fields with the same name, all their boosts are multiplied together:
+ *
+ *      <br>&nbsp;<br>
+ *      <table cellpadding="1" cellspacing="0" border="0"n align="center">
+ *        <tr>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            norm(t,d) &nbsp; = &nbsp;
+ *            {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
+ *            &nbsp;&middot;&nbsp;
+ *            {@link #lengthNorm(String, int) lengthNorm(field)}
+ *            &nbsp;&middot;&nbsp;
+ *          </td>
+ *          <td valign="bottom" align="center" rowspan="1">
+ *            <big><big><big>&prod;</big></big></big>
+ *          </td>
+ *          <td valign="middle" align="right" rowspan="1">
+ *            {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
+ *          </td>
+ *        </tr>
+ *        <tr valigh="top">
+ *          <td></td>
+ *          <td align="center"><small>field <i><b>f</b></i> in <i>d</i> named as <i><b>t</b></i></small></td>
+ *          <td></td>
+ *        </tr>
+ *      </table>
+ *      <br>&nbsp;<br>
+ *      However the resulted <i>norm</i> value is {@link #encodeNorm(float) encoded} as a single byte
+ *      before being stored.
+ *      At search time, the norm byte value is read from the index
+ *      {@link org.apache.lucene.store.Directory directory} and
+ *      {@link #decodeNorm(byte) decoded} back to a float <i>norm</i> value.
+ *      This encoding/decoding, while reducing index size, comes with the price of
+ *      precision loss - it is not guaranteed that decode(encode(x)) = x.
+ *      For instance, decode(encode(0.89)) = 0.75.
+ *      Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by
+ *      using a different {@link Similarity} for search.
+ *      <br>&nbsp;<br>
+ *    </li>
+ * </ol>
 *
 * @see #setDefault(Similarity)
 * @see IndexWriter#setSimilarity(Similarity)
--- a/xdocs/scoring.xml
+++ b/xdocs/scoring.xml
@ -61,80 +61,60 @@
                    on the Fields).
                </p>
            </subsection>
+            <subsection name="Score Boosting">
+                <p>Lucene allows influencing search results by "boosting" in more than one level:
+                  <ul>
+                    <li><b>Document level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
+                    before a document is added to the index.
+                    </li>
+                    <li><b>Document's Field level boosting</b>
+                    - while indexing - by calling
+                    <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
+                    before adding a field to the document (and before adding the document to the index).
+                    </li>
+                    <li><b>Query level boosting</b>
+                     - during search, by setting a boost on a query clause, calling
+                     <a href="api/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
+                    </li>
+                  </ul>
+                </p>
+                <p>Indexing time boosts are preprocessed for storage efficiency and written to
+                  the directory (when writing the document) in a single byte (!) as follows:
+                  For each field of a document, all boosts of that field
+                  (i.e. all boosts under the same field name in that doc) are multiplied.
+                  The result is multiplied by the boost of the document,
+                  and also multiplied by a "field length norm" value
+                  that represents the length of that field in that doc
+                  (so shorter fields are automatically boosted up).
+                  The result is decoded as a single byte
+                  (with some precision loss of course) and stored in the directory.
+                  The similarity object in effect at indexing computes the length-norm of the field.
+                </p>
+                <p>This composition of 1-byte representation of norms
+                (that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
+                is nicely described in
+                <a href="api/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
+                </p>
+                <p>Encoding and decoding of the resulted float norm in a single byte are done by the
+                static methods of the class Similarity:
+                <a href="api/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
+                <a href="api/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
+                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
+                e.g. decode(encode(0.89)) = 0.75.
+                At scoring (search) time, this norm is brought into the score of document
+                as <b>indexBoost</b>, as shown by the formula in
+                <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>.
+                </p>
+            </subsection>
            <subsection name="Understanding the Scoring Formula">
+
                <p>
-                    Lucene's scoring formula computes the score of one document <i>d</i> for a given query <i>q</i> across each
-                    term <i>t</i> that occurs in q.  The score attempts to measure relevance, so the higher the score, the more
-		    relevant document <i>d</i> is to the query <i>q</i>.  This is taken from
-		    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a>:
-
-                    <div class="formula">
-                        <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
-                        score(q,d) =
-			<span class="big" id="summation">
-                            sum </span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf</A>
-                        (t in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf</A>
-                        (t)^2 *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                        getBoost
-                        </A>
-                        (t in q) *
-                        getBoost
-                        (t.field in d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">
-                            lengthNorm
-                        </A>
-                        (t.field in d) )</span> <span> *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">
-                            coord
-                        </A>
-                        (q,d) *
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">
-                            queryNorm
-                        </A>(sumOfSquaredWeights)</span>
-                    </div>
-                </p>
-                <p>
-                    where
-                    <!-- Anyone know how to specify sigma in Anakia?  It always seems to strip out my numeric character references-->
-                    <div id="#sumOfSquares">
-                        sumOfSquaredWeights =
-                        <span class="big">sum</span><span class="summation-range">t in q</span><span>(
-                        <A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">
-                            idf
-                        </A>
-                        (t) *
-                        <A HREF="api/org/apache/lucene/search/Query.html#getBoost()">
-                            getBoost
-                        </A>
-                        (t in q) )^2</span>
-                    </div>
-                </p>
-                <p>
-		This scoring formula is mostly implemented in the
-                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a> class, where it makes calls to the
-                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class to retrieve values for the following.  Note that the descriptions apply to <a href="api/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a> implementation:
-                    <ol>
-
-                        <li><A HREF="api/org/apache/lucene/search/Similarity.html#tf(int)">tf(t in d)</A> - Term Frequency - The number of times the term <i>t</i> appears in the current document <i>d</i> being scored.  Documents that have more occurrences of a given term receive a higher score.</li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)">idf(t)</A> - Inverse Document Frequency - One divided by the number of documents in which the term <i>t</i> appears.  This means rarer terms give higher contribution to the total score.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Query.html#getBoost()">getBoost(t in q)</A> - The boost, specified in the query by the user, that should be applied to this term.  A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance.  A boost of 1.0 (the default boost) has no effect.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)">lengthNorm(t.field in q)</A> - The factor to apply to account for differing lengths in the fields that are being searched.  Typically longer fields return a smaller value.  This means matches against shorter fields receive a higher score than matches against longer fields.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#coord(int, int)">coord(q, d)</A> - Score factor based on how many terms the specified document has in common with the query.  Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.</p></li>
-
-                        <li><p><A HREF="api/org/apache/lucene/search/Similarity.html#queryNorm(float)">queryNorm(sumOfSquaredWeights)</A> - Factor used to make scores between queries comparable
-                            <span class="highlight-for-editing">GSI: might be interesting to have a note on why this formula was chosen.  I have always understood (but not 100% sure)
-                                that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions.  However, I also seem
-                            to remember some research on using sum of squares as being somewhat suitable for score comparison.  Anyone have any thoughts here?</span></p></li>
-                    </ol>
-                    Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided
-                    for context and are not authoratitive.
+                This scoring formula is described in the
+                    <a href="api/org/apache/lucene/search/Similarity.html">Similarity</a> class.  Please take the time to study this formula, as it contains much of the information about how the
+                    basics of Lucene scoring work, especially the
+                    <a href="api/org/apache/lucene/search/TermScorer.html">TermScorer</a>.
                </p>
            </subsection>
            <subsection name="The Big Picture">