- Modified docs.

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149903 13f79535-47bb-0310-9956-ffa450edef68
2002-12-12 06:23:48 +00:00 · 2002-12-12 06:23:48 +00:00 · 9ff9b75780
parent bf5028d9ac
commit 9ff9b75780
19 changed files with 678 additions and 519 deletions
--- a/docs/benchmarks.html
+++ b/docs/benchmarks.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
@@ -121,20 +122,20 @@
      <tr><td>
        <blockquote>
                                    <p>
-      The purpose of these user-submitted performance figures is to 
-give current and potential users of Lucene a sense 
-      of how well Lucene scales. If the requirements for an upcoming 
-project is similar to an existing benchmark, you 
-      will also have something to work with when designing the system 
-architecture for the application.
-      </p>
+                The purpose of these user-submitted performance figures is to
+                give current and potential users of Lucene a sense
+                of how well Lucene scales. If the requirements for an upcoming
+                project are similar to an existing benchmark, you
+                will also have something to work with when designing the system
+                architecture for the application.
+            </p>
                                                <p>
-      If you've conducted performance tests with Lucene, we'd 
-appreciate if you can submit these figures for display 
-      on this page. Post these figures to the lucene-user mailing list 
-using this 
-      <a href="benchmarktemplate.xml">template</a>.
-      </p>
+                If you've conducted performance tests with Lucene, we'd
+                appreciate it if you could submit these figures for display
+                on this page. Post these figures to the lucene-user mailing list
+                using this
+                <a href="benchmarktemplate.xml">template</a>.
+            </p>
                            </blockquote>
        </p>
      </td></tr>
@@ -149,64 +150,64 @@ using this
      <tr><td>
        <blockquote>
                                    <p>
-      <ul>
-      <p>
-      <b>Hardware Environment</b><br />
-      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
-(yes/no)</li>
-      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
-      <li><i>RAM</i>: Self-explanatory</li>
-      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
-RAID-1, RAID-5)</li>
-      </p>
-      <p>
-      <b>Software environment</b><br />
-      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
-</li>
-      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
-      <li><i>OS Version</i>: Self-explanatory</li>
-      <li><i>Location of index</i>: Is the index stored in filesystem 
-or database? Is it on the same server(local) or 
-      over the network?</li>
-      </p>
-      <p>
-      <b>Lucene indexing variables</b><br />
-      <li><i>Number of source documents</i>: Number of documents being 
-indexed</li>
-      <li><i>Total filesize of source documents</i>: 
-Self-explanatory</li>
-      <li><i>Average filesize of source documents</i>: 
-Self-explanatory</li>
-      <li><i>Source documents storage location</i>: Where are the 
-documents being indexed located? 
-        Filesystem, DB, http,etc</li>
-      <li><i>File type of source documents</i>: Types of files being 
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
-      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
-various files for indexing, 
-        e.g. XML parser, HTML parser, etc.</li>
-      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
-      <li><i>Number of fields per document</i>: Number of Fields each 
-Document contains</li>
-      <li><i>Type of fields</i>: Type of each field</li>
-      <li><i>Index persistence</i>: Where the index is stored, e.g. 
-FSDirectory, SqlDirectory, etc</li>
-      </p>
-      <p>
-      <b>Figures</b><br />
-      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
-runs)</i>: Time taken to index all files</li>
-      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
-1000 files</li>
-      <li><i>Memory consumption</i>: Self-explanatory</li>
-      </p>
-      <p>
-      <b>Notes</b><br />
-      <li><i>Notes</i>: Any comments which don't belong in the above, 
-special tuning/strategies, etc</li>
-      </p>
-      </ul>
-      </p>
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br />
+                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
+                            (yes/no)</li>
+                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+                        <li><i>RAM</i>: Self-explanatory</li>
+                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
+                            RAID-1, RAID-5)</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br />
+                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
+                        </li>
+                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+                        <li><i>OS Version</i>: Self-explanatory</li>
+                        <li><i>Location of index</i>: Is the index stored in filesystem
+                            or database? Is it on the same server (local) or
+                            over the network?</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br />
+                        <li><i>Number of source documents</i>: Number of documents being
+                            indexed</li>
+                        <li><i>Total filesize of source documents</i>:
+                            Self-explanatory</li>
+                        <li><i>Average filesize of source documents</i>:
+                            Self-explanatory</li>
+                        <li><i>Source documents storage location</i>: Where are the
+                            documents being indexed located?
+                            Filesystem, DB, HTTP, etc.</li>
+                        <li><i>File type of source documents</i>: Types of files being
+                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
+                            various files for indexing,
+                            e.g. XML parser, HTML parser, etc.</li>
+                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+                        <li><i>Number of fields per document</i>: Number of Fields each
+                            Document contains</li>
+                        <li><i>Type of fields</i>: Type of each field</li>
+                        <li><i>Index persistence</i>: Where the index is stored, e.g.
+                            FSDirectory, SqlDirectory, etc</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br />
+                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
+                                runs)</i>: Time taken to index all files</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
+                            1000 files</li>
+                        <li><i>Memory consumption</i>: Self-explanatory</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br />
+                        <li><i>Notes</i>: Any comments which don't belong in the above,
+                            special tuning/strategies, etc</li>
+                    </p>
+                </ul>
+            </p>
                            </blockquote>
        </p>
      </td></tr>
@@ -221,17 +222,17 @@ special tuning/strategies, etc</li>
      <tr><td>
        <blockquote>
                                    <p>
-      These benchmarks have been kindly submitted by Lucene users for 
-reference purposes. 
-      </p>
-                                                <p><b>We make NO guarantees regarding their accuracy or 
-validity.</b>
-      </p>
-                                                <p>We strongly recommend you conduct your own 
-      performance benchmarks before deciding on a particular 
-hardware/software setup (and hopefully submit 
-      these figures to us).
-      </p>
+                These benchmarks have been kindly submitted by Lucene users for
+                reference purposes.
+            </p>
+                                                <p><b>We make NO guarantees regarding their accuracy or
+                    validity.</b>
+            </p>
+                                                <p>We strongly recommend you conduct your own
+                performance benchmarks before deciding on a particular
+                hardware/software setup (and hopefully submit
+                these figures to us).
+            </p>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
        <font color="#ffffff" face="arial,helvetica,sanserif">
@@ -241,109 +242,109 @@ hardware/software setup (and hopefully submit
      <tr><td>
        <blockquote>
                                    <ul>
-          <p>
-          <b>Hardware Environment</b><br />
-          <li><i>Dedicated machine for indexing</i>: yes</li>
-          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
-          <li><i>RAM</i>: 512 DDR</li>
-          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
-          </p>
-          <p>
-          <b>Software environment</b><br />
-          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
-          <li><i>Java VM</i>: </li>
-          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
-          <li><i>Location of index</i>: local</li>
-          </p>
-          <p>
-          <b>Lucene indexing variables</b><br />
-          <li><i>Number of source documents</i>: Random generator. Set 
-to make 1M documents
-in 2x500,000 batches.</li>
-          <li><i>Total filesize of source documents</i>: &gt; 1GB if 
-stored</li>
-          <li><i>Average filesize of source documents</i>: 1KB</li>
-          <li><i>Source documents storage location</i>: Filesystem</li>
-          <li><i>File type of source documents</i>: Generated</li>
-          <li><i>Parser(s) used, if any</i>: </li>
-          <li><i>Analyzer(s) used</i>: Default</li>
-          <li><i>Number of fields per document</i>: 11</li>
-          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
-          <li><i>Index persistence</i>: FSDirectory</li>
-          </p>
-          <p>
-          <b>Figures</b><br />
-          <li><i>Time taken (in ms/s as an average of at least 3 
-indexing runs)</i>: </li>
-          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
-          <li><i>Memory consumption</i>:</li>
-          </p>
-          <p>
-          <b>Notes</b><br />
-          <li><i>Notes</i>: 
-          <p>
-          A windows client ran a random document generator which 
-created
-          documents based on some arrays of values and an excerpt 
-(approx 1kb)
-          from a text file of the bible (King James version).<br />
-          These were submitted via a socket connection (open throughout
-          indexing process).<br />
-          The index writer was not closed between index calls.<br />
-          This created a 400Mb index in 23 files (after 
-optimization).<br />
-          </p>
-          <p>
-          <u>Query details</u>:<br />
-          </p>
-          <p>
-          Set up a threaded class to start x number of simultaneous 
-threads to
-          search the above created index.
-          </p>
-          <p>
-          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
-(Teaser:goo* Tea
-          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
-          +DisplayStartDate:[mkwsw2jk0
-          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
-          </p>
-          <p>
-          This query counted 34000 documents and I limited the returned 
-documents
-          to 5.
-          </p>
-          <p>
-          This is using Peter Halacsy's IndexSearcherCache slightly 
-modified to
-          be a singleton returned cached searchers for a given 
-directory. This
-          solved an initial problem with too many files open and 
-running out of
-          linux handles for them.
-          </p>
-          <pre>
-          Threads|Avg Time per query (ms)
-          1       1009ms
-          2       2043ms
-          3       3087ms
-          4       4045ms
-          ..        .
-          ..        .
-          10      10091ms
-          </pre>
-          <p>
-          I removed the two date range terms from the query and it made 
-a HUGE
-          difference in performance. With 4 threads the avg time 
-dropped to 900ms!
-          </p>
-          <p>Other query optimizations made little difference.</p></li>
-          </p>
-          </ul>
+                    <p>
+                        <b>Hardware Environment</b><br />
+                        <li><i>Dedicated machine for indexing</i>: yes</li>
+                        <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
+                        <li><i>RAM</i>: 512 DDR</li>
+                        <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br />
+                        <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br />
+                        <li><i>Number of source documents</i>: Random generator. Set
+                            to make 1M documents
+                            in 2x500,000 batches.</li>
+                        <li><i>Total filesize of source documents</i>: &gt; 1GB if
+                            stored</li>
+                        <li><i>Average filesize of source documents</i>: 1KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: Generated</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: Default</li>
+                        <li><i>Number of fields per document</i>: 11</li>
+                        <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br />
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: </li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+                        <li><i>Memory consumption</i>:</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br />
+                        <li><i>Notes</i>:
+                            <p>
+                                A Windows client ran a random document generator which
+                                created
+                                documents based on some arrays of values and an excerpt
+                                (approx. 1 KB)
+                                from a text file of the Bible (King James version).<br />
+                                These were submitted via a socket connection (open throughout
+                                the indexing process).<br />
+                                The index writer was not closed between index calls.<br />
+                                This created a 400 MB index in 23 files (after
+                                optimization).<br />
+                            </p>
+                            <p>
+                                <u>Query details</u>:<br />
+                            </p>
+                            <p>
+                                Set up a threaded class to start x number of simultaneous
+                                threads to
+                                search the above created index.
+                            </p>
+                            <p>
+                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+                                (Teaser:goo* Teaser:plan*)
+                                (Details:goo* Details:plan*)) -Cancel:y)
+                                +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+                                +EndDate:[mq3dj1uq0-ntlxuggw0]
+                            </p>
+                            <p>
+                                This query counted 34000 documents and I limited the returned
+                                documents
+                                to 5.
+                            </p>
+                            <p>
+                                This is using Peter Halacsy's IndexSearcherCache, slightly
+                                modified to
+                                be a singleton that returns cached searchers for a given
+                                directory. This
+                                solved an initial problem with too many files open and
+                                running out of
+                                Linux handles for them.
+                            </p>
+                            <pre>
+                                Threads|Avg Time per query (ms)
+                                1       1009ms
+                                2       2043ms
+                                3       3087ms
+                                4       4045ms
+                                ..        .
+                                ..        .
+                                10      10091ms
+                            </pre>
+                            <p>
+                                I removed the two date range terms from the query and it made
+                                a HUGE
+                                difference in performance. With 4 threads the avg time
+                                dropped to 900ms!
+                            </p>
+                            <p>Other query optimizations made little difference.</p></li>
+                    </p>
+                </ul>
                                                <p>
-          Hamish can be contacted at hamish at catalyst.net.nz.
-          </p>
+                    Hamish can be contacted at hamish at catalyst.net.nz.
+                </p>
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
@@ -357,71 +358,146 @@ dropped to 900ms!
      <tr><td>
        <blockquote>
                                    <ul>
-          <p>
-          <b>Hardware Environment</b><br />
-          <li><i>Dedicated machine for indexing</i>: No, but nominal 
-usage at time of indexing.</li>
-          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
-          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
-          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
-Array</li>
-          </p>
-          <p>
-          <b>Software environment</b><br />
-          <li><i>Java Version</i>: 1.3.1_06</li>
-          <li><i>Java VM</i>: </li>
-          <li><i>OS Version</i>: Winnt 4/Sp6</li>
-          <li><i>Location of index</i>: local</li>
-          </p>
-          <p>
-          <b>Lucene indexing variables</b><br />
-          <li><i>Number of source documents</i>: about 60K</li>
-          <li><i>Total filesize of source documents</i>: 6.5GB</li>
-          <li><i>Average filesize of source documents</i>: 100K 
-(6.5GB/60K documents)</li>
-          <li><i>Source documents storage location</i>: filesystem on 
-NTFS</li>
-          <li><i>File type of source documents</i>: </li>
-          <li><i>Parser(s) used, if any</i>: Currently the only parser 
-used is the Quiotix html
-          parser.</li>
-          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
-          <li><i>Number of fields per document</i>: 8</li>
-          <li><i>Type of fields</i>: All strings, and all are stored 
-and indexed.</li>
-          <li><i>Index persistence</i>: FSDirectory</li>
-          </p>
-          <p>
-          <b>Figures</b><br />
-          <li><i>Time taken (in ms/s as an average of at least 3 
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
-minutes.  Note that the #
-          and size of documents changes daily.</li>
-          <li><i>Time taken / 1000 docs indexed</i>: </li>
-          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
-all.</li>
-          </p>
-          <p>
-          <b>Notes</b><br />
-          <li><i>Notes</i>: 
-          <p>
-          We have 10 threads reading files from the filesystem and 
-parsing and
-          analyzing them and the pushing them onto a queue and a single 
-thread poping
-          them from the queue and indexing.  Note that we are indexing 
-email messages
-          and are storing the entire plaintext in of the message in the 
-index.  If the
-          message contains attachment and we do not have a filter for 
-the attachment
-          (ie. we do not do PDFs yet), we discard the data.
-          </p></li>
-          </p>
-          </ul>
+                    <p>
+                        <b>Hardware Environment</b><br />
+                        <li><i>Dedicated machine for indexing</i>: No, but nominal
+                            usage at time of indexing.</li>
+                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
+                            Array</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br />
+                        <li><i>Java Version</i>: 1.3.1_06</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Winnt 4/Sp6</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br />
+                        <li><i>Number of source documents</i>: about 60K</li>
+                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
+                        <li><i>Average filesize of source documents</i>: 100K
+                            (6.5GB/60K documents)</li>
+                        <li><i>Source documents storage location</i>: filesystem on
+                            NTFS</li>
+                        <li><i>File type of source documents</i>: </li>
+                        <li><i>Parser(s) used, if any</i>: Currently the only parser
+                            used is the Quiotix html
+                            parser.</li>
+                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+                        <li><i>Number of fields per document</i>: 8</li>
+                        <li><i>Type of fields</i>: All strings, and all are stored
+                            and indexed.</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br />
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+                            minutes.  Note that the #
+                            and size of documents changes daily.</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: </li>
+                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
+                            all.</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br />
+                        <li><i>Notes</i>:
+                            <p>
+                                We have 10 threads reading files from the filesystem,
+                                parsing and
+                                analyzing them, and then pushing them onto a queue, and a single
+                                thread popping
+                                them from the queue and indexing.  Note that we are indexing
+                                email messages
+                                and are storing the entire plaintext of the message in the
+                                index.  If the
+                                message contains an attachment and we do not have a filter for
+                                the attachment
+                                (i.e. we do not do PDFs yet), we discard the data.
+                            </p></li>
+                    </p>
+                </ul>
                                                <p>
-          Justin can be contacted at tvxh-lw4x at spamex.com.
-          </p>
+                    Justin can be contacted at tvxh-lw4x at spamex.com.
+                </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>
+                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
+                    nor was the total index built in one shot.  The index was created on several different
+                    machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+                    1 million documents per batch.  Each of these small indexes was then moved to a
+                    much larger drive, where they were all merged together into a big index.
+                    This process was done manually, over the course of several months, as the sources became available.
+                </p>
+                                                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br />
+                        <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
+                            threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>
+                        <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+                        <li><i>RAM</i>: 4 GB Memory</li>
+                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br />
+                        <li><i>Java Version</i>: 1.3.1</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br />
+                        <li><i>Number of source documents</i>: 13,820,517</li>
+                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: XML</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+                        <li><i>Number of fields per document</i>: 1 - 31</li>
+                        <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                        <li><i>Index size</i>: 12.5 GB</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br />
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+                        <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+                            1 GB of memory was allotted to the indexer</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br />
+                        <li><i>Notes</i>:
+                            <p>
+                                The source documents were XML.  The "indexer" opened each document one at a time, ran an
+                                XSL transformation on them, and then proceeded to index the stream.  The indexer optimized
+                                the index every 50,000 documents (on this run) though previously, we optimized every
+                                300,000 documents.  The performance didn't change much either way.  We did no other
+                                tuning (RAM Directories, separate process to pretransform the source material, etc)
+                                to make it index faster.  When all of these individual indexes were built, they were
+                                merged together into the main index.  That process usually took ~ a day.
+                            </p></li>
+                    </p>
+                </ul>
+                                                <p>
+                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+                </p>
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
--- a/docs/contributions.html
+++ b/docs/contributions.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo.html
+++ b/docs/demo.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo2.html
+++ b/docs/demo2.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo3.html
+++ b/docs/demo3.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo4.html
+++ b/docs/demo4.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/fileformats.html
+++ b/docs/fileformats.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/gettingstarted.html
+++ b/docs/gettingstarted.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/index.html
+++ b/docs/index.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/index.html
+++ b/docs/lucene-sandbox/index.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/indyo/tutorial.html
+++ b/docs/lucene-sandbox/indyo/tutorial.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/larm/overview.html
+++ b/docs/lucene-sandbox/larm/overview.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/luceneplan.html
+++ b/docs/luceneplan.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/powered.html
+++ b/docs/powered.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/queryparsersyntax.html
+++ b/docs/queryparsersyntax.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/resources.html
+++ b/docs/resources.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/todo.html
+++ b/docs/todo.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/whoweare.html
+++ b/docs/whoweare.html
@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/xdocs/benchmarks.xml
+++ b/xdocs/benchmarks.xml
@ -1,283 +1,349 @@
 <?xml version="1.0"?>
 <document>
    <properties>
-      <author email="kelvint@apache.org">Kelvin Tan</author>
-      <title>Resources - Performance Benchmarks</title>
+        <author email="kelvint@apache.org">Kelvin Tan</author>
+        <title>Resources - Performance Benchmarks</title>
    </properties>
    <body>

-      <section name="Performance Benchmarks">
-      <p>
-      The purpose of these user-submitted performance figures is to 
-give current and potential users of Lucene a sense 
-      of how well Lucene scales. If the requirements for an upcoming 
-project is similar to an existing benchmark, you 
-      will also have something to work with when designing the system 
-architecture for the application.
-      </p>
-      <p>
-      If you've conducted performance tests with Lucene, we'd 
-appreciate if you can submit these figures for display 
-      on this page. Post these figures to the lucene-user mailing list 
-using this 
-      <a href="benchmarktemplate.xml">template</a>.
-      </p>
-      </section>
-      
-      <section name="Benchmark Variables">
-      <p>
-      <ul>
-      <p>
-      <b>Hardware Environment</b><br/>
-      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
-(yes/no)</li>
-      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
-      <li><i>RAM</i>: Self-explanatory</li>
-      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
-RAID-1, RAID-5)</li>
-      </p>
-      <p>
-      <b>Software environment</b><br/>
-      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
-</li>
-      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
-      <li><i>OS Version</i>: Self-explanatory</li>
-      <li><i>Location of index</i>: Is the index stored in filesystem 
-or database? Is it on the same server(local) or 
-      over the network?</li>
-      </p>
-      <p>
-      <b>Lucene indexing variables</b><br/>
-      <li><i>Number of source documents</i>: Number of documents being 
-indexed</li>
-      <li><i>Total filesize of source documents</i>: 
-Self-explanatory</li>
-      <li><i>Average filesize of source documents</i>: 
-Self-explanatory</li>
-      <li><i>Source documents storage location</i>: Where are the 
-documents being indexed located? 
-        Filesystem, DB, http,etc</li>
-      <li><i>File type of source documents</i>: Types of files being 
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
-      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
-various files for indexing, 
-        e.g. XML parser, HTML parser, etc.</li>
-      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
-      <li><i>Number of fields per document</i>: Number of Fields each 
-Document contains</li>
-      <li><i>Type of fields</i>: Type of each field</li>
-      <li><i>Index persistence</i>: Where the index is stored, e.g. 
-FSDirectory, SqlDirectory, etc</li>
-      </p>
-      <p>
-      <b>Figures</b><br/>
-      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
-runs)</i>: Time taken to index all files</li>
-      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
-1000 files</li>
-      <li><i>Memory consumption</i>: Self-explanatory</li>
-      </p>
-      <p>
-      <b>Notes</b><br/>
-      <li><i>Notes</i>: Any comments which don't belong in the above, 
-special tuning/strategies, etc</li>
-      </p>
-      </ul>
-      </p>
-      </section>
+        <section name="Performance Benchmarks">
+            <p>
+                The purpose of these user-submitted performance figures is to
+                give current and potential users of Lucene a sense
+                of how well Lucene scales. If the requirements for an upcoming
+                project are similar to an existing benchmark, you
+                will also have something to work with when designing the system
+                architecture for the application.
+            </p>
+            <p>
+                If you've conducted performance tests with Lucene, we'd
+                appreciate if you can submit these figures for display
+                on this page. Post these figures to the lucene-user mailing list
+                using this
+                <a href="benchmarktemplate.xml">template</a>.
+            </p>
+        </section>

-      <section name="User-submitted Benchmarks">
-      <p>
-      These benchmarks have been kindly submitted by Lucene users for 
-reference purposes. 
-      </p>
-      <p><b>We make NO guarantees regarding their accuracy or 
-validity.</b>
-      </p>
-      <p>We strongly recommend you conduct your own 
-      performance benchmarks before deciding on a particular 
-hardware/software setup (and hopefully submit 
-      these figures to us).
-      </p>
-      
-        <subsection name="Hamish Carpenter's benchmarks">
-          <ul>
-          <p>
-          <b>Hardware Environment</b><br/>
-          <li><i>Dedicated machine for indexing</i>: yes</li>
-          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
-          <li><i>RAM</i>: 512 DDR</li>
-          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
-          </p>
-          <p>
-          <b>Software environment</b><br/>
-          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
-          <li><i>Java VM</i>: </li>
-          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
-          <li><i>Location of index</i>: local</li>
-          </p>
-          <p>
-          <b>Lucene indexing variables</b><br/>
-          <li><i>Number of source documents</i>: Random generator. Set 
-to make 1M documents
-in 2x500,000 batches.</li>
-          <li><i>Total filesize of source documents</i>: > 1GB if 
-stored</li>
-          <li><i>Average filesize of source documents</i>: 1KB</li>
-          <li><i>Source documents storage location</i>: Filesystem</li>
-          <li><i>File type of source documents</i>: Generated</li>
-          <li><i>Parser(s) used, if any</i>: </li>
-          <li><i>Analyzer(s) used</i>: Default</li>
-          <li><i>Number of fields per document</i>: 11</li>
-          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
-          <li><i>Index persistence</i>: FSDirectory</li>
-          </p>
-          <p>
-          <b>Figures</b><br/>
-          <li><i>Time taken (in ms/s as an average of at least 3 
-indexing runs)</i>: </li>
-          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
-          <li><i>Memory consumption</i>:</li>
-          </p>
-          <p>
-          <b>Notes</b><br/>
-          <li><i>Notes</i>: 
-          <p>
-          A windows client ran a random document generator which 
-created
-          documents based on some arrays of values and an excerpt 
-(approx 1kb)
-          from a text file of the bible (King James version).<br/>
-          These were submitted via a socket connection (open throughout
-          indexing process).<br/>
-          The index writer was not closed between index calls.<br/>
-          This created a 400Mb index in 23 files (after 
-optimization).<br/>
-          </p>
-          <p>
-          <u>Query details</u>:<br/>
-          </p>
-          <p>
-          Set up a threaded class to start x number of simultaneous 
-threads to
-          search the above created index.
-          </p>
-          <p>
-          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
-(Teaser:goo* Tea
-          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
-          +DisplayStartDate:[mkwsw2jk0
-          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
-          </p>
-          <p>
-          This query counted 34000 documents and I limited the returned 
-documents
-          to 5.
-          </p>
-          <p>
-          This is using Peter Halacsy's IndexSearcherCache slightly 
-modified to
-          be a singleton returned cached searchers for a given 
-directory. This
-          solved an initial problem with too many files open and 
-running out of
-          linux handles for them.
-          </p>
-          <pre>
-          Threads|Avg Time per query (ms)
-          1       1009ms
-          2       2043ms
-          3       3087ms
-          4       4045ms
-          ..        .
-          ..        .
-          10      10091ms
-          </pre>
-          <p>
-          I removed the two date range terms from the query and it made 
-a HUGE
-          difference in performance. With 4 threads the avg time 
-dropped to 900ms!
-          </p>
-          <p>Other query optimizations made little difference.</p></li>
-          </p>
-          </ul>
-          <p>
-          Hamish can be contacted at hamish at catalyst.net.nz.
-          </p>
-        </subsection>     
+        <section name="Benchmark Variables">
+            <p>
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br/>
+                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
+                            (yes/no)</li>
+                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+                        <li><i>RAM</i>: Self-explanatory</li>
+                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
+                            RAID-1, RAID-5)</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br/>
+                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
+                        </li>
+                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+                        <li><i>OS Version</i>: Self-explanatory</li>
+                        <li><i>Location of index</i>: Is the index stored in filesystem
+                            or database? Is it on the same server(local) or
+                            over the network?</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br/>
+                        <li><i>Number of source documents</i>: Number of documents being
+                            indexed</li>
+                        <li><i>Total filesize of source documents</i>:
+                            Self-explanatory</li>
+                        <li><i>Average filesize of source documents</i>:
+                            Self-explanatory</li>
+                        <li><i>Source documents storage location</i>: Where are the
+                            documents being indexed located?
+                            Filesystem, DB, http,etc</li>
+                        <li><i>File type of source documents</i>: Types of files being
+                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
+                            various files for indexing,
+                            e.g. XML parser, HTML parser, etc.</li>
+                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+                        <li><i>Number of fields per document</i>: Number of Fields each
+                            Document contains</li>
+                        <li><i>Type of fields</i>: Type of each field</li>
+                        <li><i>Index persistence</i>: Where the index is stored, e.g.
+                            FSDirectory, SqlDirectory, etc</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br/>
+                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
+                                runs)</i>: Time taken to index all files</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
+                            1000 files</li>
+                        <li><i>Memory consumption</i>: Self-explanatory</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br/>
+                        <li><i>Notes</i>: Any comments which don't belong in the above,
+                            special tuning/strategies, etc</li>
+                    </p>
+                </ul>
+            </p>
+        </section>

-        <subsection name="Justin Greene's benchmarks">
-          <ul>
-          <p>
-          <b>Hardware Environment</b><br/>
-          <li><i>Dedicated machine for indexing</i>: No, but nominal 
-usage at time of indexing.</li>
-          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
-          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
-          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
-Array</li>
-          </p>
-          <p>
-          <b>Software environment</b><br/>
-          <li><i>Java Version</i>: 1.3.1_06</li>
-          <li><i>Java VM</i>: </li>
-          <li><i>OS Version</i>: Winnt 4/Sp6</li>
-          <li><i>Location of index</i>: local</li>
-          </p>
-          <p>
-          <b>Lucene indexing variables</b><br/>
-          <li><i>Number of source documents</i>: about 60K</li>
-          <li><i>Total filesize of source documents</i>: 6.5GB</li>
-          <li><i>Average filesize of source documents</i>: 100K 
-(6.5GB/60K documents)</li>
-          <li><i>Source documents storage location</i>: filesystem on 
-NTFS</li>
-          <li><i>File type of source documents</i>: </li>
-          <li><i>Parser(s) used, if any</i>: Currently the only parser 
-used is the Quiotix html
-          parser.</li>
-          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
-          <li><i>Number of fields per document</i>: 8</li>
-          <li><i>Type of fields</i>: All strings, and all are stored 
-and indexed.</li>
-          <li><i>Index persistence</i>: FSDirectory</li>
-          </p>
-          <p>
-          <b>Figures</b><br/>
-          <li><i>Time taken (in ms/s as an average of at least 3 
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
-minutes.  Note that the #
-          and size of documents changes daily.</li>
-          <li><i>Time taken / 1000 docs indexed</i>: </li>
-          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
-all.</li>
-          </p>
-          <p>
-          <b>Notes</b><br/>
-          <li><i>Notes</i>: 
-          <p>
-          We have 10 threads reading files from the filesystem and 
-parsing and
-          analyzing them and the pushing them onto a queue and a single 
-thread poping
-          them from the queue and indexing.  Note that we are indexing 
-email messages
-          and are storing the entire plaintext in of the message in the 
-index.  If the
-          message contains attachment and we do not have a filter for 
-the attachment
-          (ie. we do not do PDFs yet), we discard the data.
-          </p></li>
-          </p>
-          </ul>
-          <p>
-          Justin can be contacted at tvxh-lw4x at spamex.com.
-          </p>
-        </subsection> 
+        <section name="User-submitted Benchmarks">
+            <p>
+                These benchmarks have been kindly submitted by Lucene users for
+                reference purposes.
+            </p>
+            <p><b>We make NO guarantees regarding their accuracy or
+                    validity.</b>
+            </p>
+            <p>We strongly recommend you conduct your own
+                performance benchmarks before deciding on a particular
+                hardware/software setup (and hopefully submit
+                these figures to us).
+            </p>

-      </section>
+            <subsection name="Hamish Carpenter's benchmarks">
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br/>
+                        <li><i>Dedicated machine for indexing</i>: yes</li>
+                        <li><i>CPU</i>: Intel x86 P4 1.5GHz</li>
+                        <li><i>RAM</i>: 512 DDR</li>
+                        <li><i>Drive configuration</i>: IDE 7200rpm RAID-1</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br/>
+                        <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br/>
+                        <li><i>Number of source documents</i>: Random generator. Set
+                            to make 1M documents
+                            in 2x500,000 batches.</li>
+                        <li><i>Total filesize of source documents</i>: > 1GB if
+                            stored</li>
+                        <li><i>Average filesize of source documents</i>: 1KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: Generated</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: Default</li>
+                        <li><i>Number of fields per document</i>: 11</li>
+                        <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br/>
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: </li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+                        <li><i>Memory consumption</i>:</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br/>
+                        <li><i>Notes</i>:
+                            <p>
+                                A Windows client ran a random document generator which
+                                created
+                                documents based on some arrays of values and an excerpt
+                                (approx 1KB)
+                                from a text file of the Bible (King James version).<br/>
+                                These were submitted via a socket connection (open throughout
+                                the indexing process).<br/>
+                                The index writer was not closed between index calls.<br/>
+                                This created a 400MB index in 23 files (after
+                                optimization).<br/>
+                            </p>
+                            <p>
+                                <u>Query details</u>:<br/>
+                            </p>
+                            <p>
+                                Set up a threaded class to start x number of simultaneous
+                                threads to
+                                search the above created index.
+                            </p>
+                            <p>
+                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+                                (Teaser:goo* Teaser:plan*)
+                                (Details:goo* Details:plan*)) -Cancel:y)
+                                +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+                                +EndDate:[mq3dj1uq0-ntlxuggw0]
+                            </p>
+                            <p>
+                                This query counted 34000 documents and I limited the returned
+                                documents
+                                to 5.
+                            </p>
+                            <p>
+                                This is using Peter Halacsy's IndexSearcherCache, slightly
+                                modified to
+                                be a singleton that returns cached searchers for a given
+                                directory. This
+                                solved an initial problem with too many open files and
+                                running out of
+                                Linux file handles for them.
+                            </p>
+                            <pre>
+                                Threads|Avg Time per query (ms)
+                                1       1009ms
+                                2       2043ms
+                                3       3087ms
+                                4       4045ms
+                                ..        .
+                                ..        .
+                                10      10091ms
+                            </pre>
+                            <p>
+                                I removed the two date range terms from the query and it made
+                                a HUGE
+                                difference in performance. With 4 threads the avg time
+                                dropped to 900ms!
+                            </p>
+                            <p>Other query optimizations made little difference.</p></li>
+                    </p>
+                </ul>
+                <p>
+                    Hamish can be contacted at hamish at catalyst.net.nz.
+                </p>
+            </subsection>
+
+            <subsection name="Justin Greene's benchmarks">
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br/>
+                        <li><i>Dedicated machine for indexing</i>: No, but nominal
+                            usage at time of indexing.</li>
+                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
+                            Array</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br/>
+                        <li><i>Java Version</i>: 1.3.1_06</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Windows NT 4.0 SP6</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br/>
+                        <li><i>Number of source documents</i>: about 60K</li>
+                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
+                        <li><i>Average filesize of source documents</i>: 100K
+                            (6.5GB/60K documents)</li>
+                        <li><i>Source documents storage location</i>: filesystem on
+                            NTFS</li>
+                        <li><i>File type of source documents</i>: </li>
+                        <li><i>Parser(s) used, if any</i>: Currently the only parser
+                            used is the Quiotix html
+                            parser.</li>
+                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+                        <li><i>Number of fields per document</i>: 8</li>
+                        <li><i>Type of fields</i>: All strings, and all are stored
+                            and indexed.</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br/>
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+                            minutes.  Note that the number
+                            and size of documents change daily.</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: </li>
+                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
+                            all.</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br/>
+                        <li><i>Notes</i>:
+                            <p>
+                                We have 10 threads reading files from the filesystem,
+                                parsing and
+                                analyzing them, and then pushing them onto a queue, and a single
+                                thread popping
+                                them from the queue and indexing.  Note that we are indexing
+                                email messages
+                                and are storing the entire plaintext of the message in the
+                                index.  If the
+                                message contains an attachment and we do not have a filter for
+                                the attachment
+                                (i.e. we do not do PDFs yet), we discard the data.
+                            </p></li>
+                    </p>
+                </ul>
+                <p>
+                    Justin can be contacted at tvxh-lw4x at spamex.com.
+                </p>
+            </subsection>
+
+
+            <subsection name="Daniel Armbrust's benchmarks">
+                <p>
+                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
+                    nor was the total index built in one shot.  The index was created on several different
+                    machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+                    1 million documents.  Each of these small indexes was then moved to a
+                    much larger drive, where they were all merged together into a big index.
+                    This process was done manually, over the course of several months, as the sources became available.
+                </p>
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br/>
+                        <li><i>Dedicated machine for indexing</i>: No.  The machine had a low to moderate load.  However, the indexing process was
+                            single-threaded, so it used only 1 of the processors.  It usually got 100% of that processor.</li>
+                        <li><i>CPU</i>: Sun Ultra 80, 4 x 64-bit processors</li>
+                        <li><i>RAM</i>: 4 GB Memory</li>
+                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br/>
+                        <li><i>Java Version</i>: 1.3.1</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br/>
+                        <li><i>Number of source documents</i>: 13,820,517</li>
+                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: XML</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: A homegrown analyzer that simply removes stopwords.</li>
+                        <li><i>Number of fields per document</i>: 1 - 31</li>
+                        <li><i>Type of fields</i>: All text, though 2 of them are dates (e.g. 20001205) that we filter on</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                        <li><i>Index size</i>: 12.5 GB</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br/>
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: For 617,271 documents, 209,698 seconds (about 2.4 days)</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 340 seconds</li>
+                        <li><i>Memory consumption</i>: Java was run with java -Xmx1000m -Xss8192k, so
+                            about 1 GB of heap was allotted to the indexer</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br/>
+                        <li><i>Notes</i>:
+                            <p>
+                                The source documents were XML.  The "indexer" opened the documents one at a time, ran an
+                                XSL transformation on each, and then indexed the resulting stream.  The indexer optimized
+                                the index every 50,000 documents (on this run), though previously we optimized every
+                                300,000 documents.  The performance didn't change much either way.  We did no other
+                                tuning (RAM directories, a separate process to pretransform the source material, etc.)
+                                to make it index faster.  Once all of these individual indexes were built, they were
+                                merged together into the main index.  That merge usually took about a day.
+                            </p></li>
+                    </p>
+                </ul>
+                <p>
+                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+                </p>
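+                <p>
+                    As a quick check of the figures above, the per-1000-document time follows directly
+                    from the reported totals (209698 seconds for 617271 documents).  The class and
+                    method names below are hypothetical and exist only for this illustration.
+                </p>

```java
public class IndexingRate {
    /** Seconds needed per 1000 documents, given a total time and a document count. */
    static double secondsPerThousandDocs(long totalSeconds, long docCount) {
        return totalSeconds * 1000.0 / docCount;
    }

    /** Converts a duration in seconds to days (86400 seconds per day). */
    static double days(long totalSeconds) {
        return totalSeconds / 86400.0;
    }

    public static void main(String[] args) {
        long totalSeconds = 209698L; // reported indexing time
        long docCount = 617271L;     // reported number of documents in the run
        System.out.printf("%.1f s per 1000 docs%n", secondsPerThousandDocs(totalSeconds, docCount));
        System.out.printf("%.2f days in total%n", days(totalSeconds));
    }
}
```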
+            </subsection>
+
+        </section>

    </body>
 </document>
-