- Modified docs.

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149903 13f79535-47bb-0310-9956-ffa450edef68
2002-12-12 06:23:48 +00:00
parent bf5028d9ac
commit 9ff9b75780
19 changed files with 678 additions and 519 deletions
--- a/docs/benchmarks.html
+++ b/docs/benchmarks.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
@@ -122,17 +123,17 @@
        <blockquote>
                                    <p>
                The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense 
+                give current and potential users of Lucene a sense
                of how well Lucene scales. If the requirements for an upcoming
-project are similar to an existing benchmark, you 
+                project are similar to an existing benchmark, you
                will also have something to work with when designing the system
-architecture for the application.
+                architecture for the application.
            </p>
                                                <p>
                If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display 
+                appreciate if you can submit these figures for display
                on this page. Post these figures to the lucene-user mailing list
-using this 
+                using this
                <a href="benchmarktemplate.xml">template</a>.
            </p>
                            </blockquote>
@@ -153,57 +154,57 @@ using this
                    <p>
                        <b>Hardware Environment</b><br />
                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
+                            (yes/no)</li>
                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
                        <li><i>RAM</i>: Self-explanatory</li>
                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
+                            RAID-1, RAID-5)</li>
                    </p>
                    <p>
                        <b>Software environment</b><br />
                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
+                        </li>
                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
                        <li><i>OS Version</i>: Self-explanatory</li>
                        <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or 
+                            or database? Is it on the same server(local) or
                            over the network?</li>
                    </p>
                    <p>
                        <b>Lucene indexing variables</b><br />
                        <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
+                            indexed</li>
                        <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
+                            Self-explanatory</li>
                        <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
+                            Self-explanatory</li>
                        <li><i>Source documents storage location</i>: Where are the
-documents being indexed located? 
+                            documents being indexed located?
                            Filesystem, DB, http,etc</li>
                        <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing, 
+                            various files for indexing,
                            e.g. XML parser, HTML parser, etc.</li>
                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
                        <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
+                            Document contains</li>
                        <li><i>Type of fields</i>: Type of each field</li>
                        <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
+                            FSDirectory, SqlDirectory, etc</li>
                    </p>
                    <p>
                        <b>Figures</b><br />
                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
-runs)</i>: Time taken to index all files</li>
+                                runs)</i>: Time taken to index all files</li>
                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
-1000 files</li>
+                            1000 files</li>
                        <li><i>Memory consumption</i>: Self-explanatory</li>
                    </p>
                    <p>
                        <b>Notes</b><br />
                        <li><i>Notes</i>: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
+                            special tuning/strategies, etc</li>
                    </p>
                </ul>
            </p>
@@ -222,14 +223,14 @@ special tuning/strategies, etc</li>
        <blockquote>
                                    <p>
                These benchmarks have been kindly submitted by Lucene users for
-reference purposes. 
+                reference purposes.
            </p>
                                                <p><b>We make NO guarantees regarding their accuracy or
-validity.</b>
+                    validity.</b>
            </p>
                                                <p>We strongly recommend you conduct your own
                performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit 
+                hardware/software setup (and hopefully submit
                these figures to us).
            </p>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
@@ -258,10 +259,10 @@ hardware/software setup (and hopefully submit
                    <p>
                        <b>Lucene indexing variables</b><br />
                        <li><i>Number of source documents</i>: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
+                            to make 1M documents
+                            in 2x500,000 batches.</li>
                        <li><i>Total filesize of source documents</i>: &gt; 1GB if
-stored</li>
+                            stored</li>
                        <li><i>Average filesize of source documents</i>: 1KB</li>
                        <li><i>Source documents storage location</i>: Filesystem</li>
                        <li><i>File type of source documents</i>: Generated</li>
@@ -274,7 +275,7 @@ stored</li>
                    <p>
                        <b>Figures</b><br />
                        <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: </li>
+                                indexing runs)</i>: </li>
                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
                        <li><i>Memory consumption</i>:</li>
                    </p>
@@ -283,43 +284,43 @@ indexing runs)</i>: </li>
                        <li><i>Notes</i>:
                            <p>
                                A windows client ran a random document generator which
-created
+                                created
                                documents based on some arrays of values and an excerpt
-(approx 1kb)
+                                (approx 1kb)
                                from a text file of the bible (King James version).<br />
                                These were submitted via a socket connection (open throughout
                                indexing process).<br />
                                The index writer was not closed between index calls.<br />
                                This created a 400Mb index in 23 files (after
-optimization).<br />
+                                optimization).<br />
                            </p>
                            <p>
                                <u>Query details</u>:<br />
                            </p>
                            <p>
                                Set up a threaded class to start x number of simultaneous
-threads to
+                                threads to
                                search the above created index.
                            </p>
                            <p>
                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
+                                (Teaser:goo* Tea
                                ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
                                +DisplayStartDate:[mkwsw2jk0
                                -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
                            </p>
                            <p>
                                This query counted 34000 documents and I limited the returned
-documents
+                                documents
                                to 5.
                            </p>
                            <p>
                                This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
+                                modified to
                                be a singleton returned cached searchers for a given
-directory. This
+                                directory. This
                                solved an initial problem with too many files open and
-running out of
+                                running out of
                                linux handles for them.
                            </p>
                            <pre>
@@ -334,9 +335,9 @@ running out of
                            </pre>
                            <p>
                                I removed the two date range terms from the query and it made
-a HUGE
+                                a HUGE
                                difference in performance. With 4 threads the avg time
-dropped to 900ms!
+                                dropped to 900ms!
                            </p>
                            <p>Other query optimizations made little difference.</p></li>
                    </p>
@@ -360,11 +361,11 @@ dropped to 900ms!
                    <p>
                        <b>Hardware Environment</b><br />
                        <li><i>Dedicated machine for indexing</i>: No, but nominal
-usage at time of indexing.</li>
+                            usage at time of indexing.</li>
                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
-Array</li>
+                            Array</li>
                    </p>
                    <p>
                        <b>Software environment</b><br />
@@ -378,43 +379,43 @@ Array</li>
                        <li><i>Number of source documents</i>: about 60K</li>
                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
                        <li><i>Average filesize of source documents</i>: 100K
-(6.5GB/60K documents)</li>
+                            (6.5GB/60K documents)</li>
                        <li><i>Source documents storage location</i>: filesystem on
-NTFS</li>
+                            NTFS</li>
                        <li><i>File type of source documents</i>: </li>
                        <li><i>Parser(s) used, if any</i>: Currently the only parser
-used is the Quiotix html
+                            used is the Quiotix html
                            parser.</li>
                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
                        <li><i>Number of fields per document</i>: 8</li>
                        <li><i>Type of fields</i>: All strings, and all are stored
-and indexed.</li>
+                            and indexed.</li>
                        <li><i>Index persistence</i>: FSDirectory</li>
                    </p>
                    <p>
                        <b>Figures</b><br />
                        <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
-minutes.  Note that the #
+                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+                            minutes.  Note that the #
                            and size of documents changes daily.</li>
                        <li><i>Time taken / 1000 docs indexed</i>: </li>
                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
-all.</li>
+                            all.</li>
                    </p>
                    <p>
                        <b>Notes</b><br />
                        <li><i>Notes</i>:
                            <p>
                                We have 10 threads reading files from the filesystem and
-parsing and
+                                parsing and
                                analyzing them and then pushing them onto a queue and a single
-thread popping
+                                thread popping
                                them from the queue and indexing.  Note that we are indexing
-email messages
+                                email messages
                                and are storing the entire plaintext of the message in the
-index.  If the
+                                index.  If the
                                message contains attachment and we do not have a filter for
-the attachment
+                                the attachment
                                (ie. we do not do PDFs yet), we discard the data.
                            </p></li>
                    </p>
@@ -425,6 +426,81 @@ the attachment
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>
+                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
+                    nor was the total index built in one shot.  The index was created on several different
+                    machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+                    1 million documents per batch.  Each of these small indexes was then moved to a
+                    much larger drive, where they were all merged together into a big index.
+                    This process was done manually, over the course of several months, as the sources became available.
+                </p>
+                                                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br />
+                        <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
+                            threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>
+                        <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+                        <li><i>RAM</i>: 4 GB Memory</li>
+                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br />
+                        <li><i>Java Version</i>: 1.3.1</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br />
+                        <li><i>Number of source documents</i>: 13,820,517</li>
+                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: XML</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+                        <li><i>Number of fields per document</i>: 1 - 31</li>
+                        <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                        <li><i>Index size</i>: 12.5 GB</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br />
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+                        <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+                            1 GB of memory was allotted to the indexer</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br />
+                        <li><i>Notes</i>:
+                            <p>
+                                The source documents were XML.  The "indexer" opened each document one at a time, ran an
+                                XSL transformation on them, and then proceeded to index the stream.  The indexer optimized
+                                the index every 50,000 documents (on this run) though previously, we optimized every
+                                300,000 documents.  The performance didn't change much either way.  We did no other
+                                tuning (RAM Directories, separate process to pretransform the source material, etc)
+                                to make it index faster.  When all of these individual indexes were built, they were
+                                merged together into the main index.  That process usually took ~ a day.
+                            </p></li>
+                    </p>
+                </ul>
+                                                <p>
+                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+                </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
    </table>
                            </blockquote>
        </p>
--- a/docs/contributions.html
+++ b/docs/contributions.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo.html
+++ b/docs/demo.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo2.html
+++ b/docs/demo2.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo3.html
+++ b/docs/demo3.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/demo4.html
+++ b/docs/demo4.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/fileformats.html
+++ b/docs/fileformats.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/gettingstarted.html
+++ b/docs/gettingstarted.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/index.html
+++ b/docs/index.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/index.html
+++ b/docs/lucene-sandbox/index.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/indyo/tutorial.html
+++ b/docs/lucene-sandbox/indyo/tutorial.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/lucene-sandbox/larm/overview.html
+++ b/docs/lucene-sandbox/larm/overview.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/luceneplan.html
+++ b/docs/luceneplan.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/powered.html
+++ b/docs/powered.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/queryparsersyntax.html
+++ b/docs/queryparsersyntax.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/resources.html
+++ b/docs/resources.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/todo.html
+++ b/docs/todo.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/docs/whoweare.html
+++ b/docs/whoweare.html
@@ -5,6 +5,7 @@
        
 <!-- start the processing -->
    <!-- ====================================================================== -->
+    <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
    <!-- Main Page Section -->
    <!-- ====================================================================== -->
    <html>
--- a/xdocs/benchmarks.xml
+++ b/xdocs/benchmarks.xml
@@ -9,17 +9,17 @@
        <section name="Performance Benchmarks">
            <p>
                The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense 
+                give current and potential users of Lucene a sense
                of how well Lucene scales. If the requirements for an upcoming
-project are similar to an existing benchmark, you 
+                project are similar to an existing benchmark, you
                will also have something to work with when designing the system
-architecture for the application.
+                architecture for the application.
            </p>
            <p>
                If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display 
+                appreciate if you can submit these figures for display
                on this page. Post these figures to the lucene-user mailing list
-using this 
+                using this
                <a href="benchmarktemplate.xml">template</a>.
            </p>
        </section>
@@ -30,57 +30,57 @@ using this
                    <p>
                        <b>Hardware Environment</b><br/>
                        <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
+                            (yes/no)</li>
                        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
                        <li><i>RAM</i>: Self-explanatory</li>
                        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
+                            RAID-1, RAID-5)</li>
                    </p>
                    <p>
                        <b>Software environment</b><br/>
                        <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
+                        </li>
                        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
                        <li><i>OS Version</i>: Self-explanatory</li>
                        <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or 
+                            or database? Is it on the same server(local) or
                            over the network?</li>
                    </p>
                    <p>
                        <b>Lucene indexing variables</b><br/>
                        <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
+                            indexed</li>
                        <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
+                            Self-explanatory</li>
                        <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
+                            Self-explanatory</li>
                        <li><i>Source documents storage location</i>: Where are the
-documents being indexed located? 
+                            documents being indexed located?
                            Filesystem, DB, http,etc</li>
                        <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+                            indexed, e.g. HTML files, XML files, PDF files, etc.</li>
                        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing, 
+                            various files for indexing,
                            e.g. XML parser, HTML parser, etc.</li>
                        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
                        <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
+                            Document contains</li>
                        <li><i>Type of fields</i>: Type of each field</li>
                        <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
+                            FSDirectory, SqlDirectory, etc</li>
                    </p>
                    <p>
                        <b>Figures</b><br/>
                        <li><i>Time taken (in ms/s as an average of at least 3 indexing
-runs)</i>: Time taken to index all files</li>
+                                runs)</i>: Time taken to index all files</li>
                        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
-1000 files</li>
+                            1000 files</li>
                        <li><i>Memory consumption</i>: Self-explanatory</li>
                    </p>
                    <p>
                        <b>Notes</b><br/>
                        <li><i>Notes</i>: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
+                            special tuning/strategies, etc</li>
                    </p>
                </ul>
            </p>
@@ -89,14 +89,14 @@ special tuning/strategies, etc</li>
        <section name="User-submitted Benchmarks">
            <p>
                These benchmarks have been kindly submitted by Lucene users for
-reference purposes. 
+                reference purposes.
            </p>
            <p><b>We make NO guarantees regarding their accuracy or
-validity.</b>
+                    validity.</b>
            </p>
            <p>We strongly recommend you conduct your own
                performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit 
+                hardware/software setup (and hopefully submit
                these figures to us).
            </p>

@ -119,10 +119,10 @@ hardware/software setup (and hopefully submit
                    <p>
                        <b>Lucene indexing variables</b><br/>
                        <li><i>Number of source documents</i>: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
+                            to make 1M documents
+                            in 2x500,000 batches.</li>
                        <li><i>Total filesize of source documents</i>: > 1GB if
-stored</li>
+                            stored</li>
                        <li><i>Average filesize of source documents</i>: 1KB</li>
                        <li><i>Source documents storage location</i>: Filesystem</li>
                        <li><i>File type of source documents</i>: Generated</li>
@ -135,7 +135,7 @@ stored</li>
                    <p>
                        <b>Figures</b><br/>
                        <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: </li>
+                                indexing runs)</i>: </li>
                        <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
                        <li><i>Memory consumption</i>:</li>
                    </p>
@ -144,43 +144,43 @@ indexing runs)</i>: </li>
                        <li><i>Notes</i>:
                            <p>
                                 A Windows client ran a random document generator which
-created
+                                created
                                documents based on some arrays of values and an excerpt
-(approx 1kb)
+                                (approx 1kb)
                                 from a text file of the Bible (King James Version).<br/>
                                These were submitted via a socket connection (open throughout
                                indexing process).<br/>
                                The index writer was not closed between index calls.<br/>
                                 This created a 400 MB index in 23 files (after
-optimization).<br/>
+                                optimization).<br/>
                            </p>
                            <p>
                                <u>Query details</u>:<br/>
                            </p>
                            <p>
                                Set up a threaded class to start x number of simultaneous
-threads to
+                                threads to
                                search the above created index.
                            </p>
                            <p>
                                Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
+                                (Teaser:goo* Teaser:plan*)
                                 (Details:goo* Details:plan*)) -Cancel:y)
                                 +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
                                 +EndDate:[mq3dj1uq0-ntlxuggw0]
                            </p>
                            <p>
                                This query counted 34000 documents and I limited the returned
-documents
+                                documents
                                to 5.
                            </p>
                            <p>
                                This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
+                                modified to
                                 be a singleton that returns cached searchers for a given
-directory. This
+                                directory. This
                                solved an initial problem with too many files open and
-running out of
+                                running out of
                                 Linux file handles for them.
                            </p>
                            <pre>
@ -195,9 +195,9 @@ running out of
                            </pre>
                            <p>
                                I removed the two date range terms from the query and it made
-a HUGE
+                                a HUGE
                                difference in performance. With 4 threads the avg time
-dropped to 900ms!
+                                dropped to 900ms!
                            </p>
                            <p>Other query optimizations made little difference.</p></li>
                    </p>
@ -212,11 +212,11 @@ dropped to 900ms!
                    <p>
                        <b>Hardware Environment</b><br/>
                        <li><i>Dedicated machine for indexing</i>: No, but nominal
-usage at time of indexing.</li>
+                            usage at time of indexing.</li>
                        <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
                        <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
                        <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
-Array</li>
+                            Array</li>
                    </p>
                    <p>
                        <b>Software environment</b><br/>
@ -230,43 +230,43 @@ Array</li>
                        <li><i>Number of source documents</i>: about 60K</li>
                        <li><i>Total filesize of source documents</i>: 6.5GB</li>
                        <li><i>Average filesize of source documents</i>: 100K
-(6.5GB/60K documents)</li>
+                            (6.5GB/60K documents)</li>
                        <li><i>Source documents storage location</i>: filesystem on
-NTFS</li>
+                            NTFS</li>
                        <li><i>File type of source documents</i>: </li>
                        <li><i>Parser(s) used, if any</i>: Currently the only parser
-used is the Quiotix html
+                            used is the Quiotix html
                            parser.</li>
                        <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
                        <li><i>Number of fields per document</i>: 8</li>
                        <li><i>Type of fields</i>: All strings, and all are stored
-and indexed.</li>
+                            and indexed.</li>
                        <li><i>Index persistence</i>: FSDirectory</li>
                    </p>
                    <p>
                        <b>Figures</b><br/>
                        <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
-minutes.  Note that the #
+                                indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+                            minutes.  Note that the number
                             and size of documents change daily.</li>
                        <li><i>Time taken / 1000 docs indexed</i>: </li>
                        <li><i>Memory consumption</i>: JVM is given 256MB and uses it
-all.</li>
+                            all.</li>
                    </p>
                    <p>
                        <b>Notes</b><br/>
                        <li><i>Notes</i>:
                            <p>
                                We have 10 threads reading files from the filesystem and
-parsing and
+                                parsing and
                                 analyzing them and then pushing them onto a queue and a single
-thread poping
+                                thread popping
                                them from the queue and indexing.  Note that we are indexing
-email messages
+                                email messages
                                 and are storing the entire plaintext of the message in the
-index.  If the
+                                index.  If the
                                 message contains an attachment and we do not have a filter for
-the attachment
+                                the attachment
                                 (i.e. we do not handle PDFs yet), we discard the data.
                            </p></li>
                    </p>
@ -276,8 +276,74 @@ the attachment
                </p>
            </subsection>

+
+            <subsection name="Daniel Armbrust's benchmarks">
+                <p>
+                    My disclaimer is that this is a very poor "Benchmark".  It was not done for raw speed,
+                    nor was the total index built in one shot.  The index was created on several different
+                    machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+                    1 million documents per batch.  Each of these small indexes was then moved to a
+                    much larger drive, where they were all merged together into a big index.
+                    This process was done manually, over the course of several months, as the sources became available.
+                </p>
+                <ul>
+                    <p>
+                        <b>Hardware Environment</b><br/>
+                        <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load.  However, the indexing process was built single
+                            threaded, so it only took advantage of 1 of the processors.  It usually got 100% of this processor.</li>
+                        <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+                        <li><i>RAM</i>: 4 GB Memory</li>
+                        <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+                    </p>
+                    <p>
+                        <b>Software environment</b><br/>
+                        <li><i>Java Version</i>: 1.3.1</li>
+                        <li><i>Java VM</i>: </li>
+                        <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+                        <li><i>Location of index</i>: local</li>
+                    </p>
+                    <p>
+                        <b>Lucene indexing variables</b><br/>
+                        <li><i>Number of source documents</i>: 13,820,517</li>
+                        <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+                        <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+                        <li><i>Source documents storage location</i>: Filesystem</li>
+                        <li><i>File type of source documents</i>: XML</li>
+                        <li><i>Parser(s) used, if any</i>: </li>
+                        <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+                        <li><i>Number of fields per document</i>: 1 - 31</li>
+                        <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+                        <li><i>Index persistence</i>: FSDirectory</li>
+                        <li><i>Index size</i>: 12.5 GB</li>
+                    </p>
+                    <p>
+                        <b>Figures</b><br/>
+                        <li><i>Time taken (in ms/s as an average of at least 3
+                                indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+                        <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+                        <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+                            1 GB of memory was allotted to the indexer</li>
+                    </p>
+                    <p>
+                        <b>Notes</b><br/>
+                        <li><i>Notes</i>:
+                            <p>
+                                The source documents were XML.  The "indexer" opened each document one at a time, ran an
+                                XSL transformation on it, and then proceeded to index the stream.  The indexer optimized
+                                the index every 50,000 documents (on this run) though previously, we optimized every
+                                300,000 documents.  The performance didn't change much either way.  We did no other
+                                tuning (RAM Directories, separate process to pretransform the source material, etc)
+                                to make it index faster.  When all of these individual indexes were built, they were
+                                merged together into the main index.  That process usually took about a day.
+                            </p></li>
+                    </p>
+                </ul>
+                <p>
+                    Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+                </p>
+            </subsection>
+
        </section>

    </body>
 </document>
-