diff --git a/docs/benchmarktemplate.xml b/docs/benchmarktemplate.xml
deleted file mode 100644
index df7601f20f9..00000000000
--- a/docs/benchmarktemplate.xml
+++ /dev/null
@@ -1,61 +0,0 @@
-
- Hardware Environment
-
-
- Software environment
-
- Lucene indexing variables
-
- Figures
-
- Notes
-
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the DefaultSimilarity on the Fields).
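To make the length-normalization point concrete, here is a minimal sketch of DefaultSimilarity's default length norm, 1/sqrt(numTerms), showing how the same ten terms produce different per-field norms depending on whether they sit in one field or two. The class and method names are illustrative only; the real norm is also multiplied by boosts and byte-encoded before being stored.

```java
public class LengthNormDemo {
    // DefaultSimilarity's default length norm: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Ten terms in a single field:
        System.out.println(lengthNorm(10)); // ~0.316
        // The same ten terms split into two fields of five terms each
        // gives each field a larger norm:
        System.out.println(lengthNorm(5));  // ~0.447
    }
}
```

Because the per-field norm multiplies into the score, the split-field document can score differently for the same query, which is exactly the behavior described above.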
This composition of the 1-byte representation of norms (that is, the indexing-time multiplication of field boosts, doc boost, and field-length norm) is nicely described in Fieldable.setBoost().
Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x; e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of the document as norm(t, d), as shown by the formula in Similarity.
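The precision loss can be illustrated with a small self-contained codec in the same spirit: one byte holding a sign, an exponent, and a 3-bit mantissa. This is a sketch, not a byte-for-byte reproduction of any particular release's encodeNorm() table, so the exact quantized value it produces may differ from the 0.75 quoted above.

```java
public class NormCodecDemo {
    // Encode a float into one byte by keeping only the top 3 mantissa bits
    // (a SmallFloat-style scheme; illustrative, not Lucene's exact table).
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;                             // sign + exponent + 3 mantissa bits
        if (small <= (48 << 3)) return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        if (small >= (48 << 3) + 0x100) return -1;          // overflow
        return (byte) (small - (48 << 3));
    }

    static float decode(byte b) {
        if (b == 0) return 0f;
        int bits = ((b & 0xff) << 21) + (48 << 24);         // restore the exponent bias
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float x = 0.89f;
        float y = decode(encode(x));
        // y != x: the dropped mantissa bits are gone for good,
        // but re-encoding y maps back to the same byte (idempotent).
        System.out.println(x + " -> " + y);
    }
}
```

The key property is the one the text warns about: decode(encode(x)) is generally not x, although once quantized a value is stable under further encode/decode round trips.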
This scoring formula is described in the Similarity class. Please take the time to study this formula, as it contains much of the information about how the basics of Lucene scoring work, especially the TermQuery.
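As a concrete illustration, here is a sketch of two factors from that formula as DefaultSimilarity defines them: tf(freq) = sqrt(freq) and idf = ln(numDocs / (docFreq + 1)) + 1. It omits norms, boosts, and coord(), so it shows the shape of the tf-idf part rather than a full scoring implementation.

```java
public class TfIdfDemo {
    // Term frequency factor: square root of the in-field frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger factor.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a field and appearing in 9 of 999 docs:
        float partial = tf(4) * idf(9, 999);
        System.out.println(partial); // tf = 2.0, idf = ln(99.9) + 1
    }
}
```

Note how the two factors pull in opposite directions: tf rewards repeated occurrences within a field, while idf discounts terms that appear in many documents.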
OK, so the tf-idf formula and the Similarity are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are the use of, and interactions between, the Query classes, as created by each application in response to a user's information need.
In this regard, Lucene offers a wide variety of Query implementations, most of which are in the org.apache.lucene.search package. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query section below highlights some of the more important Query classes. For information on the other ones, see the package summary. For details on implementing your own Query class, see Changing your Scoring -- Expert Level below.
Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, control finally passes to the Weight implementation and its Scorer instance. In the case of any type of BooleanQuery, scoring is handled by the BooleanWeight2 (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class), unless the Weight#scoresDocsOutOfOrder() method returns true, in which case the BooleanWeight (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used. See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use.
Assuming the use of the BooleanWeight2, a BooleanScorer2 is created by bringing together all of the Scorers from the sub-clauses of the BooleanQuery. When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type of clauses in the Query. This internal Scorer essentially loops over the sub-scorers and sums the scores provided by each scorer while factoring in the coord() score.
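That loop-and-sum behavior can be modeled with a toy scorer. The interfaces and names below are simplified stand-ins for Lucene's classes, not its real API; coord() here is DefaultSimilarity's overlap / maxOverlap.

```java
import java.util.List;

public class BooleanScoringDemo {
    // Simplified stand-in for a sub-clause's Scorer, fixed on one document.
    interface SubScorer {
        boolean matches();   // does this clause match the current doc?
        float score();       // clause score for the current doc
    }

    // DefaultSimilarity's coord(): fraction of clauses that matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sum the matching clauses' scores, then factor in coord().
    static float score(List<SubScorer> clauses) {
        float sum = 0f;
        int overlap = 0;
        for (SubScorer s : clauses) {
            if (s.matches()) {
                sum += s.score();
                overlap++;
            }
        }
        return sum * coord(overlap, clauses.size());
    }

    static SubScorer sub(boolean m, float sc) {
        return new SubScorer() {
            public boolean matches() { return m; }
            public float score() { return sc; }
        };
    }

    public static void main(String[] args) {
        // Two of three clauses match, scoring 0.4 and 0.2:
        List<SubScorer> clauses =
                List.of(sub(true, 0.4f), sub(true, 0.2f), sub(false, 0f));
        System.out.println(score(clauses)); // (0.4 + 0.2) * 2/3
    }
}
```

The coord() factor is what makes a document matching more of the query's clauses outscore one matching fewer, even when the per-clause scores sum to the same value.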
For information on the Query classes, refer to the search package javadocs.
One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on how to do this, see the search package javadocs.
At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes). To learn more about how to do this, refer to the search package javadocs.
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the Searcher, beginning the scoring process.
Once inside the Searcher, a Collector is used for the scoring and sorting of the search results. These important objects are involved in a search:
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the Searcher, passing in the Weight object created by Searcher.createWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The Searcher creates a TopScoreDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the Collector mechanism, see Searcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
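The top-results collection step can be sketched with java.util.PriorityQueue: keep at most N hits, with the queue's head being the worst hit so far, so each new hit either displaces the current worst or is discarded. The class below is illustrative only, not Lucene's TopScoreDocCollector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNCollectorDemo {
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    // Head of the queue is the lowest-scoring hit collected so far.
    private final PriorityQueue<ScoreDoc> pq =
            new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));

    TopNCollectorDemo(int n) { this.n = n; }

    void collect(int doc, float score) {
        pq.offer(new ScoreDoc(doc, score));
        if (pq.size() > n) {
            pq.poll(); // evict the current worst hit
        }
    }

    List<ScoreDoc> topDocs() { // best first
        List<ScoreDoc> out = new ArrayList<>(pq);
        out.sort(Comparator.comparingDouble((ScoreDoc sd) -> sd.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        TopNCollectorDemo top2 = new TopNCollectorDemo(2);
        top2.collect(1, 0.3f);
        top2.collect(2, 0.9f);
        top2.collect(3, 0.5f);
        for (ScoreDoc sd : top2.topDocs()) {
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
        // docs 2 and 3 survive; doc 1 was evicted
    }
}
```

The design point is the bounded queue: collection stays O(hits log N) regardless of how many documents match, which is why a priority queue rather than a full sort is used here.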
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher, and we proceed by calling the score method on the Scorer.
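The handoff from Scorer to Collector amounts to a loop over the matching documents. The interfaces below are simplified stand-ins assuming a nextDoc()-style iteration; Lucene's actual signatures differ.

```java
public class ScoreLoopDemo {
    interface SimpleScorer {
        int nextDoc();   // advance; next matching doc id, or -1 when exhausted
        float score();   // score of the current doc
    }

    interface SimpleCollector {
        void collect(int doc, float score);
    }

    // The Scorer's score method: walk the matching docs and hand each hit
    // (doc id plus score) to the collector.
    static void score(SimpleScorer scorer, SimpleCollector collector) {
        int doc;
        while ((doc = scorer.nextDoc()) != -1) {
            collector.collect(doc, scorer.score());
        }
    }

    public static void main(String[] args) {
        // A canned scorer that matches docs 4 and 7:
        final int[] docs = {4, 7};
        final float[] scores = {0.8f, 0.5f};
        SimpleScorer scorer = new SimpleScorer() {
            int i = -1;
            public int nextDoc() { return ++i < docs.length ? docs[i] : -1; }
            public float score() { return scores[i]; }
        };
        score(scorer, (doc, sc) -> System.out.println("doc=" + doc + " score=" + sc));
    }
}
```

This separation is what lets the same scoring loop feed different collectors (top-N, sorting, custom ones) without the Scorer knowing anything about them.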
At last, we are actually going to score some documents. The score method takes in the Collector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real-world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 (see the section on customizing your scoring for info on changing this).
diff --git a/docs/scoring.pdf b/docs/scoring.pdf
index 4c202a59105..5e1f0624a06 100644
--- a/docs/scoring.pdf
+++ b/docs/scoring.pdf
@@ -198,10 +198,10 @@ endobj
>>
endobj
38 0 obj
-<< /Length 2333 /Filter [ /ASCII85Decode /FlateDecode ]
+<< /Length 2402 /Filter [ /ASCII85Decode /FlateDecode ]
>>
stream
Open IndexFiles in vi or your editor of choice and let's take a look at it. The IndexFiles class creates a Lucene Index. Let's take a look at how it does this.

The first substantial thing the main function does is instantiate IndexWriter. It passes the string "index" and a new instance of a class called StandardAnalyzer. The "index" string is the name of the filesystem directory where all index information should be stored. Because we're not passing a full path, this will be created as a subdirectory of the current working directory (if it does not already exist). On some platforms, it may be created in other directories (such as the user's home directory).

The IndexWriter is the main class responsible for creating indices. To use it, you must instantiate it with a path that it can write the index into. If this path does not exist, it will first create it. Otherwise, it will refresh the index at that path. You can also create an index using one of the subclasses of Directory. In any case, you must also pass an instance of org.apache.lucene.analysis.Analyzer.

The particular Analyzer we are using, StandardAnalyzer, is little more than a standard Java Tokenizer, converting all strings to lowercase and filtering out stop words and characters from the index. By stop words and characters I mean common language words such as articles (a, an, the, etc.) and other strings that may have less value for searching (e.g. 's). It should be noted that there are different rules for every language, and you should use the proper analyzer for each. Lucene currently provides Analyzers for a number of different languages (see the *Analyzer.java sources under contrib/analyzers/src/java/org/apache/lucene/analysis).

Looking further down in the file, you should see the indexDocs() code. This recursive function simply crawls the directories and uses FileDocument to create Document objects. The Document is simply a data object to represent the content in the file as well as its creation time and location. These instances are added to the indexWriter. Take a look inside FileDocument. It's not particularly complicated; it just adds fields to the Document.

As you can see, there isn't much to creating an index. The devil is in the details. You may also wish to examine the other samples in this directory, particularly the IndexHTML class. It is a bit more complex but builds upon this example.

As we discussed in the previous walk-through, the SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer (which is used in the IndexFiles class as well) and a QueryParser. The query parser is constructed with an analyzer used to interpret your query text in the same way the documents are interpreted: finding the end of words and removing useless words like 'a', 'an' and 'the'. The Query object contains the results from the QueryParser, which is passed to the searcher. Note that it's also possible to programmatically construct a rich Query object without using the query parser. The query parser just enables decoding the Lucene query syntax into the corresponding Query object. Search can be executed in two different ways: with a HitCollector subclass, which simply prints out the document ID and score for each matching document, or with a TopDocCollector, where the search results are printed in pages, sorted by score (i.e. relevance).
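The analysis behavior described in this walkthrough (lowercasing and stop-word removal) can be sketched in a few self-contained lines. This toy analyzer is not StandardAnalyzer; its tokenization rule and stop-word list are illustrative only, and real analyzers differ per language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzerDemo {
    // A small illustrative subset of English stop words, not Lucene's actual list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "of");

    // Lowercase, split into terms, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                terms.add(tok);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Devil is in the Details"));
        // [devil, is, in, details]
    }
}
```

Because the same analyzer is applied to both documents at index time and query text at search time, the terms produced on each side line up, which is exactly why the query parser is constructed with an analyzer.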
The rest of this document will cover Scoring basics and how to change your Similarity. Next it will cover ways you can customize the Lucene internals in Changing your Scoring -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
(Readers should be familiar with the Lucene file formats before continuing on with this section.) It is also assumed that readers know how to use the Searcher.explain(Query query, int doc) functionality, which can go a long way in informing why a score is returned.