- User-submitted benchmarks and a template.

Submitted by: Kelvin Tan Reviewed by: otis git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149899 13f79535-47bb-0310-9956-ffa450edef68
2002-12-04 05:46:43 +00:00 · 2002-12-04 05:46:43 +00:00 · 35aa4145e8
parent b34fa3c092
commit 35aa4145e8
3 changed files with 807 additions and 0 deletions
--- a/docs/benchmarks.html
+++ b/docs/benchmarks.html
@ -0,0 +1,467 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+
+<!-- Content Stylesheet for Site -->
+
+        
+<!-- start the processing -->
+    <!-- ====================================================================== -->
+    <!-- Main Page Section -->
+    <!-- ====================================================================== -->
+    <html>
+        <head>
+            <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
+
+                                                    <meta name="author" value="Kelvin Tan">
+            <meta name="email" value="kelvint@apache.org">
+            
+           
+                                    
+            <title>Jakarta Lucene - Resources - Performance Benchmarks</title>
+        </head>
+
+        <body bgcolor="#ffffff" text="#000000" link="#525D76">        
+            <table border="0" width="100%" cellspacing="0">
+                <!-- TOP IMAGE -->
+                <tr>
+                    <td align="left">
+<a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a>
+</td>
+<td align="right">
+<a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a>
+</td>
+                </tr>
+            </table>
+            <table border="0" width="100%" cellspacing="4">
+                <tr><td colspan="2">
+                    <hr noshade="" size="1"/>
+                </td></tr>
+                
+                <tr>
+                    <!-- LEFT SIDE NAVIGATION -->
+                    <td width="20%" valign="top" nowrap="true">
+                                <p><strong>About</strong></p>
+        <ul>
+                    <li>    <a href="./index.html">Overview</a>
+</li>
+                    <li>    <a href="./powered.html">Powered by Lucene</a>
+</li>
+                    <li>    <a href="./whoweare.html">Who We Are</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a>
+</li>
+                </ul>
+            <p><strong>Resources</strong></p>
+        <ul>
+                    <li>    <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a>
+</li>
+                    <li>    <a href="http://www.jguru.com/faq/Lucene">jGuru FAQ</a>
+</li>
+                    <li>    <a href="./gettingstarted.html">Getting Started</a>
+</li>
+                    <li>    <a href="./queryparsersyntax.html">Query Syntax</a>
+</li>
+                    <li>    <a href="./fileformats.html">File Formats</a>
+</li>
+                    <li>    <a href="./api/index.html">Javadoc</a>
+</li>
+                    <li>    <a href="./contributions.html">Contributions</a>
+</li>
+                    <li>    <a href="./lucene-sandbox/">Lucene Sandbox</a>
+</li>
+                    <li>    <a href="./resources.html">Articles, etc.</a>
+</li>
+                    <li>    <a href="./todo.html">TODO list</a>
+</li>
+                    <li>    <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27/todo.html">Patches</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a>
+</li>
+                    <li>    <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a>
+</li>
+                    <li>    <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=30">Lucene-user</a>
+</li>
+                    <li>    <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=29">Lucene-dev</a>
+</li>
+                </ul>
+            <p><strong>Plans</strong></p>
+        <ul>
+                    <li>    <a href="./luceneplan.html">Application Extensions</a>
+</li>
+                </ul>
+            <p><strong>Download</strong></p>
+        <ul>
+                    <li>    <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a>
+</li>
+                </ul>
+            <p><strong>Jakarta</strong></p>
+        <ul>
+                    <li>    <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/contact.html">Contact</a>
+</li>
+                    <li>    <a href="http://jakarta.apache.org/site/legal.html">Legal</a>
+</li>
+                </ul>
+                        </td>
+                    <td width="80%" align="left" valign="top">
+                                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#525D76">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Performance Benchmarks"><strong>Performance Benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>
+      The purpose of these user-submitted performance figures is to 
+give current and potential users of Lucene a sense 
+      of how well Lucene scales. If the requirements for an upcoming 
+project is similar to an existing benchmark, you 
+      will also have something to work with when designing the system 
+architecture for the application.
+      </p>
+                                                <p>
+      If you've conducted performance tests with Lucene, we'd 
+appreciate if you can submit these figures for display 
+      on this page. Post these figures to the lucene-user mailing list 
+using this 
+      <a href="benchmarktemplate.xml">template</a>.
+      </p>
+                            </blockquote>
+        </p>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                                <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#525D76">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Benchmark Variables"><strong>Benchmark Variables</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>
+      <ul>
+      <p>
+      <b>Hardware Environment</b><br />
+      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
+(yes/no)</li>
+      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+      <li><i>RAM</i>: Self-explanatory</li>
+      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
+RAID-1, RAID-5)</li>
+      </p>
+      <p>
+      <b>Software environment</b><br />
+      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
+</li>
+      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+      <li><i>OS Version</i>: Self-explanatory</li>
+      <li><i>Location of index</i>: Is the index stored in filesystem 
+or database? Is it on the same server(local) or 
+      over the network?</li>
+      </p>
+      <p>
+      <b>Lucene indexing variables</b><br />
+      <li><i>Number of source documents</i>: Number of documents being 
+indexed</li>
+      <li><i>Total filesize of source documents</i>: 
+Self-explanatory</li>
+      <li><i>Average filesize of source documents</i>: 
+Self-explanatory</li>
+      <li><i>Source documents storage location</i>: Where are the 
+documents being indexed located? 
+        Filesystem, DB, http,etc</li>
+      <li><i>File type of source documents</i>: Types of files being 
+indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
+various files for indexing, 
+        e.g. XML parser, HTML parser, etc.</li>
+      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+      <li><i>Number of fields per document</i>: Number of Fields each 
+Document contains</li>
+      <li><i>Type of fields</i>: Type of each field</li>
+      <li><i>Index persistence</i>: Where the index is stored, e.g. 
+FSDirectory, SqlDirectory, etc</li>
+      </p>
+      <p>
+      <b>Figures</b><br />
+      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
+runs)</i>: Time taken to index all files</li>
+      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
+1000 files</li>
+      <li><i>Memory consumption</i>: Self-explanatory</li>
+      </p>
+      <p>
+      <b>Notes</b><br />
+      <li><i>Notes</i>: Any comments which don't belong in the above, 
+special tuning/strategies, etc</li>
+      </p>
+      </ul>
+      </p>
+                            </blockquote>
+        </p>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                                <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#525D76">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="User-submitted Benchmarks"><strong>User-submitted Benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>
+      These benchmarks have been kindly submitted by Lucene users for 
+reference purposes. 
+      </p>
+                                                <p><b>We make NO guarantees regarding their accuracy or 
+validity.</b>
+      </p>
+                                                <p>We strongly recommend you conduct your own 
+      performance benchmarks before deciding on a particular 
+hardware/software setup (and hopefully submit 
+      these figures to us).
+      </p>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Hamish Carpenter's benchmarks"><strong>Hamish Carpenter's benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <ul>
+          <p>
+          <b>Hardware Environment</b><br />
+          <li><i>Dedicated machine for indexing</i>: yes</li>
+          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
+          <li><i>RAM</i>: 512 DDR</li>
+          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
+          </p>
+          <p>
+          <b>Software environment</b><br />
+          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+          <li><i>Java VM</i>: </li>
+          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+          <li><i>Location of index</i>: local</li>
+          </p>
+          <p>
+          <b>Lucene indexing variables</b><br />
+          <li><i>Number of source documents</i>: Random generator. Set 
+to make 1M documents
+in 2x500,000 batches.</li>
+          <li><i>Total filesize of source documents</i>: &gt; 1GB if 
+stored</li>
+          <li><i>Average filesize of source documents</i>: 1KB</li>
+          <li><i>Source documents storage location</i>: Filesystem</li>
+          <li><i>File type of source documents</i>: Generated</li>
+          <li><i>Parser(s) used, if any</i>: </li>
+          <li><i>Analyzer(s) used</i>: Default</li>
+          <li><i>Number of fields per document</i>: 11</li>
+          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+          <li><i>Index persistence</i>: FSDirectory</li>
+          </p>
+          <p>
+          <b>Figures</b><br />
+          <li><i>Time taken (in ms/s as an average of at least 3 
+indexing runs)</i>: </li>
+          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+          <li><i>Memory consumption</i>:</li>
+          </p>
+          <p>
+          <b>Notes</b><br />
+          <li><i>Notes</i>: 
+          <p>
+          A windows client ran a random document generator which 
+created
+          documents based on some arrays of values and an excerpt 
+(approx 1kb)
+          from a text file of the bible (King James version).<br />
+          These were submitted via a socket connection (open throughout
+          indexing process).<br />
+          The index writer was not closed between index calls.<br />
+          This created a 400Mb index in 23 files (after 
+optimization).<br />
+          </p>
+          <p>
+          <u>Query details</u>:<br />
+          </p>
+          <p>
+          Set up a threaded class to start x number of simultaneous 
+threads to
+          search the above created index.
+          </p>
+          <p>
+          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
+(Teaser:goo* Tea
+          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+          +DisplayStartDate:[mkwsw2jk0
+          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
+          </p>
+          <p>
+          This query counted 34000 documents and I limited the returned 
+documents
+          to 5.
+          </p>
+          <p>
+          This is using Peter Halacsy's IndexSearcherCache slightly 
+modified to
+          be a singleton returned cached searchers for a given 
+directory. This
+          solved an initial problem with too many files open and 
+running out of
+          linux handles for them.
+          </p>
+          <pre>
+          Threads|Avg Time per query (ms)
+          1       1009ms
+          2       2043ms
+          3       3087ms
+          4       4045ms
+          ..        .
+          ..        .
+          10      10091ms
+          </pre>
+          <p>
+          I removed the two date range terms from the query and it made 
+a HUGE
+          difference in performance. With 4 threads the avg time 
+dropped to 900ms!
+          </p>
+          <p>Other query optimizations made little difference.</p></li>
+          </p>
+          </ul>
+                                                <p>
+          Hamish can be contacted at hamish at catalyst.net.nz.
+          </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="Justin Greene's benchmarks"><strong>Justin Greene's benchmarks</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <ul>
+          <p>
+          <b>Hardware Environment</b><br />
+          <li><i>Dedicated machine for indexing</i>: No, but nominal 
+usage at time of indexing.</li>
+          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
+Array</li>
+          </p>
+          <p>
+          <b>Software environment</b><br />
+          <li><i>Java Version</i>: 1.3.1_06</li>
+          <li><i>Java VM</i>: </li>
+          <li><i>OS Version</i>: Winnt 4/Sp6</li>
+          <li><i>Location of index</i>: local</li>
+          </p>
+          <p>
+          <b>Lucene indexing variables</b><br />
+          <li><i>Number of source documents</i>: about 60K</li>
+          <li><i>Total filesize of source documents</i>: 6.5GB</li>
+          <li><i>Average filesize of source documents</i>: 100K 
+(6.5GB/60K documents)</li>
+          <li><i>Source documents storage location</i>: filesystem on 
+NTFS</li>
+          <li><i>File type of source documents</i>: </li>
+          <li><i>Parser(s) used, if any</i>: Currently the only parser 
+used is the Quiotix html
+          parser.</li>
+          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+          <li><i>Number of fields per document</i>: 8</li>
+          <li><i>Type of fields</i>: All strings, and all are stored 
+and indexed.</li>
+          <li><i>Index persistence</i>: FSDirectory</li>
+          </p>
+          <p>
+          <b>Figures</b><br />
+          <li><i>Time taken (in ms/s as an average of at least 3 
+indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
+minutes.  Note that the #
+          and size of documents changes daily.</li>
+          <li><i>Time taken / 1000 docs indexed</i>: </li>
+          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
+all.</li>
+          </p>
+          <p>
+          <b>Notes</b><br />
+          <li><i>Notes</i>: 
+          <p>
+          We have 10 threads reading files from the filesystem and 
+parsing and
+          analyzing them and the pushing them onto a queue and a single 
+thread poping
+          them from the queue and indexing.  Note that we are indexing 
+email messages
+          and are storing the entire plaintext in of the message in the 
+index.  If the
+          message contains attachment and we do not have a filter for 
+the attachment
+          (ie. we do not do PDFs yet), we discard the data.
+          </p></li>
+          </p>
+          </ul>
+                                                <p>
+          Justin can be contacted at tvxh-lw4x at spamex.com.
+          </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                            </blockquote>
+        </p>
+      </td></tr>
+      <tr><td><br/></td></tr>
+    </table>
+                                        </td>
+                </tr>
+
+                <!-- FOOTER -->
+                <tr><td colspan="2">
+                    <hr noshade="" size="1"/>
+                </td></tr>
+                <tr><td colspan="2">
+                    <div align="center"><font color="#525D76" size="-1"><em>
+                    Copyright &#169; 1999-2002, Apache Software Foundation
+                    </em></font></div>
+                </td></tr>
+            </table>
+        </body>
+    </html>
+<!-- end the processing -->
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
--- a/docs/benchmarktemplate.xml
+++ b/docs/benchmarktemplate.xml
@ -0,0 +1,57 @@
+<benchmark>
+  <ul>
+  <p>
+  <b>Hardware Environment</b><br/>
+  <li><i>Dedicated machine for indexing</i>: Self-explanatory 
+(yes/no)</li>
+  <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+  <li><i>RAM</i>: Self-explanatory</li>
+  <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, RAID-1, 
+RAID-5)</li>
+  </p>
+  <p>
+  <b>Software environment</b><br/>
+  <li><i>Java Version</i>: Version of Java SDK/JRE that is run </li>
+  <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+  <li><i>OS Version</i>: Self-explanatory</li>
+  <li><i>Location of index</i>: Is the index stored in filesystem or 
+database? Is it on the same server(local) or 
+  over the network?</li>
+  </p>
+  <p>
+  <b>Lucene indexing variables</b><br/>
+  <li><i>Number of source documents</i>: Number of documents being 
+indexed</li>
+  <li><i>Total filesize of source documents</i>: Self-explanatory</li>
+  <li><i>Average filesize of source documents</i>: 
+Self-explanatory</li>
+  <li><i>Source documents storage location</i>: Where are the documents 
+being indexed located? 
+    Filesystem, DB, http,etc</li>
+  <li><i>File type of source documents</i>: Types of files being 
+indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+  <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
+various files for indexing, 
+    e.g. XML parser, HTML parser, etc.</li>
+  <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+  <li><i>Number of fields per document</i>: Number of Fields each 
+Document contains</li>
+  <li><i>Type of fields</i>: Type of each field</li>
+  <li><i>Index persistence</i>: Where the index is stored, e.g. 
+FSDirectory, SqlDirectory, etc</li>
+  </p>
+  <p>
+  <b>Figures</b><br/>
+  <li><i>Time taken (in ms/s as an average of at least 3 indexing 
+runs)</i>: Time taken to index to index all files</li>
+  <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 1000 
+files</li>
+  <li><i>Memory consumption</i>: Self-explanatory</li>
+  </p>
+  <p>
+  <b>Notes</b><br/>
+  <li><i>Notes</i>: Any comments which don't belong in the above, 
+special tuning/strategies, etc</li>
+  </p>
+  </ul>
+</benchmark>
--- a/xdocs/benchmarks.xml
+++ b/xdocs/benchmarks.xml
@ -0,0 +1,283 @@
+<?xml version="1.0"?>
+<document>
+    <properties>
+      <author email="kelvint@apache.org">Kelvin Tan</author>
+      <title>Resources - Performance Benchmarks</title>
+    </properties>
+    <body>
+
+      <section name="Performance Benchmarks">
+      <p>
+      The purpose of these user-submitted performance figures is to 
+give current and potential users of Lucene a sense 
+      of how well Lucene scales. If the requirements for an upcoming 
+project is similar to an existing benchmark, you 
+      will also have something to work with when designing the system 
+architecture for the application.
+      </p>
+      <p>
+      If you've conducted performance tests with Lucene, we'd 
+appreciate if you can submit these figures for display 
+      on this page. Post these figures to the lucene-user mailing list 
+using this 
+      <a href="benchmarktemplate.xml">template</a>.
+      </p>
+      </section>
+      
+      <section name="Benchmark Variables">
+      <p>
+      <ul>
+      <p>
+      <b>Hardware Environment</b><br/>
+      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
+(yes/no)</li>
+      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+      <li><i>RAM</i>: Self-explanatory</li>
+      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
+RAID-1, RAID-5)</li>
+      </p>
+      <p>
+      <b>Software environment</b><br/>
+      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
+</li>
+      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+      <li><i>OS Version</i>: Self-explanatory</li>
+      <li><i>Location of index</i>: Is the index stored in filesystem 
+or database? Is it on the same server(local) or 
+      over the network?</li>
+      </p>
+      <p>
+      <b>Lucene indexing variables</b><br/>
+      <li><i>Number of source documents</i>: Number of documents being 
+indexed</li>
+      <li><i>Total filesize of source documents</i>: 
+Self-explanatory</li>
+      <li><i>Average filesize of source documents</i>: 
+Self-explanatory</li>
+      <li><i>Source documents storage location</i>: Where are the 
+documents being indexed located? 
+        Filesystem, DB, http,etc</li>
+      <li><i>File type of source documents</i>: Types of files being 
+indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
+various files for indexing, 
+        e.g. XML parser, HTML parser, etc.</li>
+      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+      <li><i>Number of fields per document</i>: Number of Fields each 
+Document contains</li>
+      <li><i>Type of fields</i>: Type of each field</li>
+      <li><i>Index persistence</i>: Where the index is stored, e.g. 
+FSDirectory, SqlDirectory, etc</li>
+      </p>
+      <p>
+      <b>Figures</b><br/>
+      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
+runs)</i>: Time taken to index all files</li>
+      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
+1000 files</li>
+      <li><i>Memory consumption</i>: Self-explanatory</li>
+      </p>
+      <p>
+      <b>Notes</b><br/>
+      <li><i>Notes</i>: Any comments which don't belong in the above, 
+special tuning/strategies, etc</li>
+      </p>
+      </ul>
+      </p>
+      </section>
+
+      <section name="User-submitted Benchmarks">
+      <p>
+      These benchmarks have been kindly submitted by Lucene users for 
+reference purposes. 
+      </p>
+      <p><b>We make NO guarantees regarding their accuracy or 
+validity.</b>
+      </p>
+      <p>We strongly recommend you conduct your own 
+      performance benchmarks before deciding on a particular 
+hardware/software setup (and hopefully submit 
+      these figures to us).
+      </p>
+      
+        <subsection name="Hamish Carpenter's benchmarks">
+          <ul>
+          <p>
+          <b>Hardware Environment</b><br/>
+          <li><i>Dedicated machine for indexing</i>: yes</li>
+          <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
+          <li><i>RAM</i>: 512 DDR</li>
+          <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
+          </p>
+          <p>
+          <b>Software environment</b><br/>
+          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+          <li><i>Java VM</i>: </li>
+          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+          <li><i>Location of index</i>: local</li>
+          </p>
+          <p>
+          <b>Lucene indexing variables</b><br/>
+          <li><i>Number of source documents</i>: Random generator. Set 
+to make 1M documents
+in 2x500,000 batches.</li>
+          <li><i>Total filesize of source documents</i>: > 1GB if 
+stored</li>
+          <li><i>Average filesize of source documents</i>: 1KB</li>
+          <li><i>Source documents storage location</i>: Filesystem</li>
+          <li><i>File type of source documents</i>: Generated</li>
+          <li><i>Parser(s) used, if any</i>: </li>
+          <li><i>Analyzer(s) used</i>: Default</li>
+          <li><i>Number of fields per document</i>: 11</li>
+          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+          <li><i>Index persistence</i>: FSDirectory</li>
+          </p>
+          <p>
+          <b>Figures</b><br/>
+          <li><i>Time taken (in ms/s as an average of at least 3 
+indexing runs)</i>: </li>
+          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+          <li><i>Memory consumption</i>:</li>
+          </p>
+          <p>
+          <b>Notes</b><br/>
+          <li><i>Notes</i>: 
+          <p>
+          A windows client ran a random document generator which 
+created
+          documents based on some arrays of values and an excerpt 
+(approx 1kb)
+          from a text file of the bible (King James version).<br/>
+          These were submitted via a socket connection (open throughout
+          indexing process).<br/>
+          The index writer was not closed between index calls.<br/>
+          This created a 400Mb index in 23 files (after 
+optimization).<br/>
+          </p>
+          <p>
+          <u>Query details</u>:<br/>
+          </p>
+          <p>
+          Set up a threaded class to start x number of simultaneous 
+threads to
+          search the above created index.
+          </p>
+          <p>
+          Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) 
+(Teaser:goo* Tea
+          ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+          +DisplayStartDate:[mkwsw2jk0
+          -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
+          </p>
+          <p>
+          This query counted 34000 documents and I limited the returned 
+documents
+          to 5.
+          </p>
+          <p>
+          This is using Peter Halacsy's IndexSearcherCache slightly 
+modified to
+          be a singleton returned cached searchers for a given 
+directory. This
+          solved an initial problem with too many files open and 
+running out of
+          linux handles for them.
+          </p>
+          <pre>
+          Threads|Avg Time per query (ms)
+          1       1009ms
+          2       2043ms
+          3       3087ms
+          4       4045ms
+          ..        .
+          ..        .
+          10      10091ms
+          </pre>
+          <p>
+          I removed the two date range terms from the query and it made 
+a HUGE
+          difference in performance. With 4 threads the avg time 
+dropped to 900ms!
+          </p>
+          <p>Other query optimizations made little difference.</p></li>
+          </p>
+          </ul>
+          <p>
+          Hamish can be contacted at hamish at catalyst.net.nz.
+          </p>
+        </subsection>     
+
+        <subsection name="Justin Greene's benchmarks">
+          <ul>
+          <p>
+          <b>Hardware Environment</b><br/>
+          <li><i>Dedicated machine for indexing</i>: No, but nominal 
+usage at time of indexing.</li>
+          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
+Array</li>
+          </p>
+          <p>
+          <b>Software environment</b><br/>
+          <li><i>Java Version</i>: 1.3.1_06</li>
+          <li><i>Java VM</i>: </li>
+          <li><i>OS Version</i>: Winnt 4/Sp6</li>
+          <li><i>Location of index</i>: local</li>
+          </p>
+          <p>
+          <b>Lucene indexing variables</b><br/>
+          <li><i>Number of source documents</i>: about 60K</li>
+          <li><i>Total filesize of source documents</i>: 6.5GB</li>
+          <li><i>Average filesize of source documents</i>: 100K 
+(6.5GB/60K documents)</li>
+          <li><i>Source documents storage location</i>: filesystem on 
+NTFS</li>
+          <li><i>File type of source documents</i>: </li>
+          <li><i>Parser(s) used, if any</i>: Currently the only parser 
+used is the Quiotix html
+          parser.</li>
+          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+          <li><i>Number of fields per document</i>: 8</li>
+          <li><i>Type of fields</i>: All strings, and all are stored 
+and indexed.</li>
+          <li><i>Index persistence</i>: FSDirectory</li>
+          </p>
+          <p>
+          <b>Figures</b><br/>
+          <li><i>Time taken (in ms/s as an average of at least 3 
+indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
+minutes.  Note that the #
+          and size of documents changes daily.</li>
+          <li><i>Time taken / 1000 docs indexed</i>: </li>
+          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
+all.</li>
+          </p>
+          <p>
+          <b>Notes</b><br/>
+          <li><i>Notes</i>: 
+          <p>
+          We have 10 threads reading files from the filesystem and 
+parsing and
+          analyzing them and the pushing them onto a queue and a single 
+thread poping
+          them from the queue and indexing.  Note that we are indexing 
+email messages
+          and are storing the entire plaintext in of the message in the 
+index.  If the
+          message contains attachment and we do not have a filter for 
+the attachment
+          (ie. we do not do PDFs yet), we discard the data.
+          </p></li>
+          </p>
+          </ul>
+          <p>
+          Justin can be contacted at tvxh-lw4x at spamex.com.
+          </p>
+        </subsection> 
+
+      </section>
+
+    </body>
+</document>
+