lucene/xdocs/benchmarks.xml

<?xml version="1.0"?>
<document>
    <properties>
      <author email="kelvint@apache.org">Kelvin Tan</author>
      <title>Resources - Performance Benchmarks</title>
    </properties>
    <body>

      <section name="Performance Benchmarks">
      <p>
      The purpose of these user-submitted performance figures is to give
      current and potential users of Lucene a sense of how well Lucene scales.
      If the requirements for an upcoming project are similar to those of an
      existing benchmark, you will also have something to work with when
      designing the system architecture for the application.
      </p>
      <p>
      If you've conducted performance tests with Lucene, we'd appreciate it
      if you would submit the figures for display on this page. Post them to
      the lucene-user mailing list using this
      <a href="benchmarktemplate.xml">template</a>.
      </p>
      </section>
      
      <section name="Benchmark Variables">
      <p>
      <ul>
      <p>
      <b>Hardware Environment</b><br/>
      <li><i>Dedicated machine for indexing</i>: Self-explanatory 
(yes/no)</li>
      <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
      <li><i>RAM</i>: Self-explanatory</li>
      <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
RAID-1, RAID-5)</li>
      </p>
      <p>
      <b>Software environment</b><br/>
      <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
</li>
      <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
      <li><i>OS Version</i>: Self-explanatory</li>
      <li><i>Location of index</i>: Is the index stored in the filesystem
      or in a database? Is it on the same server (local) or
      over the network?</li>
      </p>
      <p>
      <b>Lucene indexing variables</b><br/>
      <li><i>Number of source documents</i>: Number of documents being 
indexed</li>
      <li><i>Total filesize of source documents</i>: 
Self-explanatory</li>
      <li><i>Average filesize of source documents</i>: 
Self-explanatory</li>
      <li><i>Source documents storage location</i>: Where are the
      documents being indexed located?
        Filesystem, DB, HTTP, etc.</li>
      <li><i>File type of source documents</i>: Types of files being 
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
      <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
various files for indexing, 
        e.g. XML parser, HTML parser, etc.</li>
      <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
      <li><i>Number of fields per document</i>: Number of Fields each 
Document contains</li>
      <li><i>Type of fields</i>: Type of each field</li>
      <li><i>Index persistence</i>: Where the index is stored, e.g.
      FSDirectory, SqlDirectory, etc.</li>
      </p>
      <p>
      <b>Figures</b><br/>
      <li><i>Time taken (in ms/s as an average of at least 3 indexing 
runs)</i>: Time taken to index all files</li>
      <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
1000 files</li>
      <li><i>Memory consumption</i>: Self-explanatory</li>
      </p>
      <p>
      <b>Notes</b><br/>
      <li><i>Notes</i>: Any comments that don't belong in the categories above,
      such as special tuning or strategies</li>
      </p>
      </ul>
      </p>
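      <p>
      As an illustration of how the two timing figures relate, the following
      Java sketch averages several indexing runs and normalizes the result to
      1000 documents. The class name, the method names and the sample numbers
      are made up for this example; they are not part of Lucene or of the
      template.
      </p>
      <pre><![CDATA[
```java
final class IndexingFigures {

    /** Mean of several indexing runs, each given in seconds. */
    static double meanSeconds(double... runSeconds) {
        if (runSeconds.length == 0) {
            throw new IllegalArgumentException("at least one run is required");
        }
        double sum = 0.0;
        for (double seconds : runSeconds) {
            sum += seconds;
        }
        return sum / runSeconds.length;
    }

    /** Time in seconds per 1000 documents for a run that indexed docCount documents. */
    static double secondsPer1000Docs(double totalSeconds, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return totalSeconds / docCount * 1000.0;
    }

    public static void main(String[] args) {
        // Made-up sample: three runs of 118 s, 121 s and 124 s over 50,000 documents.
        double mean = meanSeconds(118, 121, 124);
        System.out.printf("mean run: %.1f s, per 1000 docs: %.2f s%n",
                mean, secondsPer1000Docs(mean, 50_000));
    }
}
```
      ]]></pre>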
      </section>

      <section name="User-submitted Benchmarks">
      <p>
      These benchmarks have been kindly submitted by Lucene users for 
reference purposes. 
      </p>
      <p><b>We make NO guarantees regarding their accuracy or 
validity.</b>
      </p>
      <p>We strongly recommend you conduct your own 
      performance benchmarks before deciding on a particular 
hardware/software setup (and hopefully submit 
      these figures to us).
      </p>
      
        <subsection name="Hamish Carpenter's benchmarks">
          <ul>
          <p>
          <b>Hardware Environment</b><br/>
          <li><i>Dedicated machine for indexing</i>: yes</li>
          <li><i>CPU</i>: Intel x86 P4 1.5GHz</li>
          <li><i>RAM</i>: 512 DDR</li>
          <li><i>Drive configuration</i>: IDE 7200rpm RAID-1</li>
          </p>
          <p>
          <b>Software environment</b><br/>
          <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
          <li><i>Java VM</i>: </li>
          <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
          <li><i>Location of index</i>: local</li>
          </p>
          <p>
          <b>Lucene indexing variables</b><br/>
          <li><i>Number of source documents</i>: Random generator. Set 
to make 1M documents
in 2x500,000 batches.</li>
          <li><i>Total filesize of source documents</i>: &gt; 1GB if stored</li>
          <li><i>Average filesize of source documents</i>: 1KB</li>
          <li><i>Source documents storage location</i>: Filesystem</li>
          <li><i>File type of source documents</i>: Generated</li>
          <li><i>Parser(s) used, if any</i>: </li>
          <li><i>Analyzer(s) used</i>: Default</li>
          <li><i>Number of fields per document</i>: 11</li>
          <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
          <li><i>Index persistence</i>: FSDirectory</li>
          </p>
          <p>
          <b>Figures</b><br/>
          <li><i>Time taken (in ms/s as an average of at least 3 
indexing runs)</i>: </li>
          <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
          <li><i>Memory consumption</i>:</li>
          </p>
          <p>
          <b>Notes</b><br/>
          <li><i>Notes</i>: 
          <p>
          A Windows client ran a random document generator, which created
          documents based on some arrays of values and an excerpt (approx. 1KB)
          from a text file of the Bible (King James Version).<br/>
          These were submitted via a socket connection (kept open throughout
          the indexing process).<br/>
          The index writer was not closed between index calls.<br/>
          This created a 400MB index in 23 files (after optimization).<br/>
          </p>
          <p>
          <u>Query details</u>:<br/>
          </p>
          <p>
          I set up a threaded class that starts x simultaneous threads to
          search the index created above.
          </p>
          <p>
          Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
          (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
          +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
          +EndDate:[mq3dj1uq0-ntlxuggw0]
          </p>
          <p>
          This query matched 34,000 documents, and I limited the returned
          documents to 5.
          </p>
          <p>
          This uses Peter Halacsy's IndexSearcherCache, slightly modified to
          be a singleton that returns cached searchers for a given directory.
          This solved an initial problem with too many open files and running
          out of Linux file handles for them.
          </p>
          <pre>
          Threads|Avg Time per query (ms)
          1       1009ms
          2       2043ms
          3       3087ms
          4       4045ms
          ..        .
          ..        .
          10      10091ms
          </pre>
          <p>
          I removed the two date range terms from the query and it made 
a HUGE
          difference in performance. With 4 threads the avg time 
dropped to 900ms!
          </p>
          <p>Other query optimizations made little difference.</p></li>
          </p>
          </ul>
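          <p>
          A multi-threaded query test of this kind could be structured roughly
          as in the following Java sketch. It is not Hamish's actual code: the
          class QueryLoadTest and its method are hypothetical, and the Runnable
          stands in for whatever search call is being measured.
          </p>
          <pre><![CDATA[
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

final class QueryLoadTest {

    /**
     * Runs the given task from threadCount threads at once, queriesPerThread
     * times each, and returns the mean time per call in milliseconds.
     */
    static double averageMillisPerQuery(int threadCount, int queriesPerThread, Runnable query) {
        AtomicLong totalNanos = new AtomicLong();
        CountDownLatch startGate = new CountDownLatch(1);
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    startGate.await(); // hold every thread until all are started
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int q = 0; q < queriesPerThread; q++) {
                    long begin = System.nanoTime();
                    query.run();
                    totalNanos.addAndGet(System.nanoTime() - begin);
                }
            });
            workers[i].start();
        }
        startGate.countDown(); // release all threads together
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long calls = (long) threadCount * queriesPerThread;
        return totalNanos.get() / 1_000_000.0 / calls;
    }

    public static void main(String[] args) {
        // Stand-in for a real search call; replace it with your own query code.
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int threads = 1; threads <= 4; threads++) {
            System.out.printf("%d threads: %.1f ms/query%n",
                    threads, averageMillisPerQuery(threads, 20, fakeQuery));
        }
    }
}
```
          ]]></pre>
          <p>
          Each thread times its own calls, so the result is the mean latency
          seen by a single caller rather than the overall throughput.
          </p>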
          <p>
          Hamish can be contacted at hamish at catalyst.net.nz.
          </p>
        </subsection>     

        <subsection name="Justin Greene's benchmarks">
          <ul>
          <p>
          <b>Hardware Environment</b><br/>
          <li><i>Dedicated machine for indexing</i>: No, but nominal 
usage at time of indexing.</li>
          <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
          <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
          <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
Array</li>
          </p>
          <p>
          <b>Software environment</b><br/>
          <li><i>Java Version</i>: 1.3.1_06</li>
          <li><i>Java VM</i>: </li>
          <li><i>OS Version</i>: Windows NT 4 SP6</li>
          <li><i>Location of index</i>: local</li>
          </p>
          <p>
          <b>Lucene indexing variables</b><br/>
          <li><i>Number of source documents</i>: about 60K</li>
          <li><i>Total filesize of source documents</i>: 6.5GB</li>
          <li><i>Average filesize of source documents</i>: 100K 
(6.5GB/60K documents)</li>
          <li><i>Source documents storage location</i>: filesystem on 
NTFS</li>
          <li><i>File type of source documents</i>: </li>
          <li><i>Parser(s) used, if any</i>: Currently the only parser
          used is the Quiotix HTML parser.</li>
          <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
          <li><i>Number of fields per document</i>: 8</li>
          <li><i>Type of fields</i>: All strings, and all are stored 
and indexed.</li>
          <li><i>Index persistence</i>: FSDirectory</li>
          </p>
          <p>
          <b>Figures</b><br/>
          <li><i>Time taken (in ms/s as an average of at least 3
          indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
          minutes. Note that the number and size of documents change daily.</li>
          <li><i>Time taken / 1000 docs indexed</i>: </li>
          <li><i>Memory consumption</i>: JVM is given 256MB and uses it 
all.</li>
          </p>
          <p>
          <b>Notes</b><br/>
          <li><i>Notes</i>: 
          <p>
          We have 10 threads reading files from the filesystem, parsing and
          analyzing them, and then pushing them onto a queue, and a single
          thread popping them from the queue and indexing them. Note that we
          are indexing email messages and are storing the entire plaintext of
          the message in the index. If the message contains an attachment and
          we do not have a filter for the attachment (i.e. we do not do PDFs
          yet), we discard the data.
          </p></li>
          </p>
          </ul>
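          <p>
          The layout described in the notes, with many parser threads feeding a
          single indexing thread through a queue, might look like the following
          Java sketch. It is an illustration only, not Justin's code:
          IndexingPipeline, parse and the END_OF_INPUT marker are hypothetical
          names, and a plain list stands in for the index.
          </p>
          <pre><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class IndexingPipeline {

    // Unique marker object telling the consumer that no more documents are coming.
    // It is compared by identity, so a document with the same text is not mistaken for it.
    private static final String END_OF_INPUT = new String("END_OF_INPUT");

    /**
     * Parses the sources on parserThreads threads and hands the results to a
     * single consumer thread, the only one that touches the index.
     * The returned list stands in for the index.
     */
    static List<String> run(List<String> sources, int parserThreads) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        List<String> index = new ArrayList<>();

        Thread indexer = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();
                    if (doc == END_OF_INPUT) {
                        return;
                    }
                    index.add(doc); // placeholder for adding the document to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        for (String source : sources) {
            parsers.execute(() -> {
                try {
                    queue.put(parse(source)); // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        parsers.shutdown();
        try {
            parsers.awaitTermination(1, TimeUnit.HOURS);
            queue.put(END_OF_INPUT); // all parsers are done, so stop the indexer
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return index;
    }

    // Placeholder for reading, parsing and analyzing one source file.
    static String parse(String source) {
        return source.trim().toLowerCase();
    }

    public static void main(String[] args) {
        List<String> indexed = run(List.of(" Mail A ", "Mail B", " Mail C"), 10);
        System.out.println(indexed.size() + " documents indexed");
    }
}
```
          ]]></pre>
          <p>
          With a single consumer, only one thread ever writes to the index,
          and the bounded queue keeps the parsers from running far ahead of
          the indexer.
          </p>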
          <p>
          Justin can be contacted at tvxh-lw4x at spamex.com.
          </p>
        </subsection> 

      </section>

    </body>
</document>
- User-submitted benchmarks and a template. Submitted by: Kelvin Tan Reviewed by: otis git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149899 13f79535-47bb-0310-9956-ffa450edef68 2002-12-04 00:46:43 -05:00