mirror of https://github.com/apache/lucene.git
- User-submitted benchmarks and a template.
Submitted by: Kelvin Tan Reviewed by: otis git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149899 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
b34fa3c092
commit
35aa4145e8
|
@ -0,0 +1,467 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
||||
|
||||
<!-- Content Stylesheet for Site -->
|
||||
|
||||
|
||||
<!-- start the processing -->
|
||||
<!-- ====================================================================== -->
|
||||
<!-- Main Page Section -->
|
||||
<!-- ====================================================================== -->
|
||||
<html>
|
||||
<head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
|
||||
|
||||
<meta name="author" value="Kelvin Tan">
|
||||
<meta name="email" value="kelvint@apache.org">
|
||||
|
||||
|
||||
|
||||
<title>Jakarta Lucene - Resources - Performance Benchmarks</title>
|
||||
</head>
|
||||
|
||||
<body bgcolor="#ffffff" text="#000000" link="#525D76">
|
||||
<table border="0" width="100%" cellspacing="0">
|
||||
<!-- TOP IMAGE -->
|
||||
<tr>
|
||||
<td align="left">
|
||||
<a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a>
|
||||
</td>
|
||||
<td align="right">
|
||||
<a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
<table border="0" width="100%" cellspacing="4">
|
||||
<tr><td colspan="2">
|
||||
<hr noshade="" size="1"/>
|
||||
</td></tr>
|
||||
|
||||
<tr>
|
||||
<!-- LEFT SIDE NAVIGATION -->
|
||||
<td width="20%" valign="top" nowrap="true">
|
||||
<p><strong>About</strong></p>
|
||||
<ul>
|
||||
<li> <a href="./index.html">Overview</a>
|
||||
</li>
|
||||
<li> <a href="./powered.html">Powered by Lucene</a>
|
||||
</li>
|
||||
<li> <a href="./whoweare.html">Who We Are</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a>
|
||||
</li>
|
||||
</ul>
|
||||
<p><strong>Resources</strong></p>
|
||||
<ul>
|
||||
<li> <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a>
|
||||
</li>
|
||||
<li> <a href="http://www.jguru.com/faq/Lucene">jGuru FAQ</a>
|
||||
</li>
|
||||
<li> <a href="./gettingstarted.html">Getting Started</a>
|
||||
</li>
|
||||
<li> <a href="./queryparsersyntax.html">Query Syntax</a>
|
||||
</li>
|
||||
<li> <a href="./fileformats.html">File Formats</a>
|
||||
</li>
|
||||
<li> <a href="./api/index.html">Javadoc</a>
|
||||
</li>
|
||||
<li> <a href="./contributions.html">Contributions</a>
|
||||
</li>
|
||||
<li> <a href="./lucene-sandbox/">Lucene Sandbox</a>
|
||||
</li>
|
||||
<li> <a href="./resources.html">Articles, etc.</a>
|
||||
</li>
|
||||
<li> <a href="./todo.html">TODO list</a>
|
||||
</li>
|
||||
<li> <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27/todo.html">Patches</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a>
|
||||
</li>
|
||||
<li> <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a>
|
||||
</li>
|
||||
<li> <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=30">Lucene-user</a>
|
||||
</li>
|
||||
<li> <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=29">Lucene-dev</a>
|
||||
</li>
|
||||
</ul>
|
||||
<p><strong>Plans</strong></p>
|
||||
<ul>
|
||||
<li> <a href="./luceneplan.html">Application Extensions</a>
|
||||
</li>
|
||||
</ul>
|
||||
<p><strong>Download</strong></p>
|
||||
<ul>
|
||||
<li> <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a>
|
||||
</li>
|
||||
</ul>
|
||||
<p><strong>Jakarta</strong></p>
|
||||
<ul>
|
||||
<li> <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/contact.html">Contact</a>
|
||||
</li>
|
||||
<li> <a href="http://jakarta.apache.org/site/legal.html">Legal</a>
|
||||
</li>
|
||||
</ul>
|
||||
</td>
|
||||
<td width="80%" align="left" valign="top">
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#525D76">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="Performance Benchmarks"><strong>Performance Benchmarks</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<p>
|
||||
The purpose of these user-submitted performance figures is to
|
||||
give current and potential users of Lucene a sense
|
||||
of how well Lucene scales. If the requirements for an upcoming
|
||||
project is similar to an existing benchmark, you
|
||||
will also have something to work with when designing the system
|
||||
architecture for the application.
|
||||
</p>
|
||||
<p>
|
||||
If you've conducted performance tests with Lucene, we'd
|
||||
appreciate if you can submit these figures for display
|
||||
on this page. Post these figures to the lucene-user mailing list
|
||||
using this
|
||||
<a href="benchmarktemplate.xml">template</a>.
|
||||
</p>
|
||||
</blockquote>
|
||||
</p>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#525D76">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="Benchmark Variables"><strong>Benchmark Variables</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<p>
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br />
|
||||
<li><i>Dedicated machine for indexing</i>: Self-explanatory
|
||||
(yes/no)</li>
|
||||
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
|
||||
<li><i>RAM</i>: Self-explanatory</li>
|
||||
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
|
||||
RAID-1, RAID-5)</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br />
|
||||
<li><i>Java Version</i>: Version of Java SDK/JRE that is run
|
||||
</li>
|
||||
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
|
||||
<li><i>OS Version</i>: Self-explanatory</li>
|
||||
<li><i>Location of index</i>: Is the index stored in filesystem
|
||||
or database? Is it on the same server(local) or
|
||||
over the network?</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br />
|
||||
<li><i>Number of source documents</i>: Number of documents being
|
||||
indexed</li>
|
||||
<li><i>Total filesize of source documents</i>:
|
||||
Self-explanatory</li>
|
||||
<li><i>Average filesize of source documents</i>:
|
||||
Self-explanatory</li>
|
||||
<li><i>Source documents storage location</i>: Where are the
|
||||
documents being indexed located?
|
||||
Filesystem, DB, http,etc</li>
|
||||
<li><i>File type of source documents</i>: Types of files being
|
||||
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
|
||||
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
|
||||
various files for indexing,
|
||||
e.g. XML parser, HTML parser, etc.</li>
|
||||
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
|
||||
<li><i>Number of fields per document</i>: Number of Fields each
|
||||
Document contains</li>
|
||||
<li><i>Type of fields</i>: Type of each field</li>
|
||||
<li><i>Index persistence</i>: Where the index is stored, e.g.
|
||||
FSDirectory, SqlDirectory, etc</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br />
|
||||
<li><i>Time taken (in ms/s as an average of at least 3 indexing
|
||||
runs)</i>: Time taken to index all files</li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index
|
||||
1000 files</li>
|
||||
<li><i>Memory consumption</i>: Self-explanatory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br />
|
||||
<li><i>Notes</i>: Any comments which don't belong in the above,
|
||||
special tuning/strategies, etc</li>
|
||||
</p>
|
||||
</ul>
|
||||
</p>
|
||||
</blockquote>
|
||||
</p>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#525D76">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="User-submitted Benchmarks"><strong>User-submitted Benchmarks</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<p>
|
||||
These benchmarks have been kindly submitted by Lucene users for
|
||||
reference purposes.
|
||||
</p>
|
||||
<p><b>We make NO guarantees regarding their accuracy or
|
||||
validity.</b>
|
||||
</p>
|
||||
<p>We strongly recommend you conduct your own
|
||||
performance benchmarks before deciding on a particular
|
||||
hardware/software setup (and hopefully submit
|
||||
these figures to us).
|
||||
</p>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#828DA6">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="Hamish Carpenter's benchmarks"><strong>Hamish Carpenter's benchmarks</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br />
|
||||
<li><i>Dedicated machine for indexing</i>: yes</li>
|
||||
<li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
|
||||
<li><i>RAM</i>: 512 DDR</li>
|
||||
<li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br />
|
||||
<li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
|
||||
<li><i>Java VM</i>: </li>
|
||||
<li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
|
||||
<li><i>Location of index</i>: local</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br />
|
||||
<li><i>Number of source documents</i>: Random generator. Set
|
||||
to make 1M documents
|
||||
in 2x500,000 batches.</li>
|
||||
<li><i>Total filesize of source documents</i>: > 1GB if
|
||||
stored</li>
|
||||
<li><i>Average filesize of source documents</i>: 1KB</li>
|
||||
<li><i>Source documents storage location</i>: Filesystem</li>
|
||||
<li><i>File type of source documents</i>: Generated</li>
|
||||
<li><i>Parser(s) used, if any</i>: </li>
|
||||
<li><i>Analyzer(s) used</i>: Default</li>
|
||||
<li><i>Number of fields per document</i>: 11</li>
|
||||
<li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
|
||||
<li><i>Index persistence</i>: FSDirectory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br />
|
||||
<li><i>Time taken (in ms/s as an average of at least 3
|
||||
indexing runs)</i>: </li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
|
||||
<li><i>Memory consumption</i>:</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br />
|
||||
<li><i>Notes</i>:
|
||||
<p>
|
||||
A windows client ran a random document generator which
|
||||
created
|
||||
documents based on some arrays of values and an excerpt
|
||||
(approx 1kb)
|
||||
from a text file of the bible (King James version).<br />
|
||||
These were submitted via a socket connection (open throughout
|
||||
indexing process).<br />
|
||||
The index writer was not closed between index calls.<br />
|
||||
This created a 400Mb index in 23 files (after
|
||||
optimization).<br />
|
||||
</p>
|
||||
<p>
|
||||
<u>Query details</u>:<br />
|
||||
</p>
|
||||
<p>
|
||||
Set up a threaded class to start x number of simultaneous
|
||||
threads to
|
||||
search the above created index.
|
||||
</p>
|
||||
<p>
|
||||
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
|
||||
(Teaser:goo* Tea
|
||||
ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
|
||||
+DisplayStartDate:[mkwsw2jk0
|
||||
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
|
||||
</p>
|
||||
<p>
|
||||
This query counted 34000 documents and I limited the returned
|
||||
documents
|
||||
to 5.
|
||||
</p>
|
||||
<p>
|
||||
This is using Peter Halacsy's IndexSearcherCache slightly
|
||||
modified to
|
||||
be a singleton returned cached searchers for a given
|
||||
directory. This
|
||||
solved an initial problem with too many files open and
|
||||
running out of
|
||||
linux handles for them.
|
||||
</p>
|
||||
<pre>
|
||||
Threads|Avg Time per query (ms)
|
||||
1 1009ms
|
||||
2 2043ms
|
||||
3 3087ms
|
||||
4 4045ms
|
||||
.. .
|
||||
.. .
|
||||
10 10091ms
|
||||
</pre>
|
||||
<p>
|
||||
I removed the two date range terms from the query and it made
|
||||
a HUGE
|
||||
difference in performance. With 4 threads the avg time
|
||||
dropped to 900ms!
|
||||
</p>
|
||||
<p>Other query optimizations made little difference.</p></li>
|
||||
</p>
|
||||
</ul>
|
||||
<p>
|
||||
Hamish can be contacted at hamish at catalyst.net.nz.
|
||||
</p>
|
||||
</blockquote>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
||||
<tr><td bgcolor="#828DA6">
|
||||
<font color="#ffffff" face="arial,helvetica,sanserif">
|
||||
<a name="Justin Greene's benchmarks"><strong>Justin Greene's benchmarks</strong></a>
|
||||
</font>
|
||||
</td></tr>
|
||||
<tr><td>
|
||||
<blockquote>
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br />
|
||||
<li><i>Dedicated machine for indexing</i>: No, but nominal
|
||||
usage at time of indexing.</li>
|
||||
<li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
|
||||
<li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
|
||||
<li><i>Drive configuration</i>: RAID 5 on Fibre Channel
|
||||
Array</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br />
|
||||
<li><i>Java Version</i>: 1.3.1_06</li>
|
||||
<li><i>Java VM</i>: </li>
|
||||
<li><i>OS Version</i>: Winnt 4/Sp6</li>
|
||||
<li><i>Location of index</i>: local</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br />
|
||||
<li><i>Number of source documents</i>: about 60K</li>
|
||||
<li><i>Total filesize of source documents</i>: 6.5GB</li>
|
||||
<li><i>Average filesize of source documents</i>: 100K
|
||||
(6.5GB/60K documents)</li>
|
||||
<li><i>Source documents storage location</i>: filesystem on
|
||||
NTFS</li>
|
||||
<li><i>File type of source documents</i>: </li>
|
||||
<li><i>Parser(s) used, if any</i>: Currently the only parser
|
||||
used is the Quiotix html
|
||||
parser.</li>
|
||||
<li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
|
||||
<li><i>Number of fields per document</i>: 8</li>
|
||||
<li><i>Type of fields</i>: All strings, and all are stored
|
||||
and indexed.</li>
|
||||
<li><i>Index persistence</i>: FSDirectory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br />
|
||||
<li><i>Time taken (in ms/s as an average of at least 3
|
||||
indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
|
||||
minutes. Note that the #
|
||||
and size of documents changes daily.</li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: </li>
|
||||
<li><i>Memory consumption</i>: JVM is given 256MB and uses it
|
||||
all.</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br />
|
||||
<li><i>Notes</i>:
|
||||
<p>
|
||||
We have 10 threads reading files from the filesystem and
|
||||
parsing and
|
||||
analyzing them and the pushing them onto a queue and a single
|
||||
thread poping
|
||||
them from the queue and indexing. Note that we are indexing
|
||||
email messages
|
||||
and are storing the entire plaintext in of the message in the
|
||||
index. If the
|
||||
message contains attachment and we do not have a filter for
|
||||
the attachment
|
||||
(ie. we do not do PDFs yet), we discard the data.
|
||||
</p></li>
|
||||
</p>
|
||||
</ul>
|
||||
<p>
|
||||
Justin can be contacted at tvxh-lw4x at spamex.com.
|
||||
</p>
|
||||
</blockquote>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
</blockquote>
|
||||
</p>
|
||||
</td></tr>
|
||||
<tr><td><br/></td></tr>
|
||||
</table>
|
||||
</td>
|
||||
</tr>
|
||||
|
||||
<!-- FOOTER -->
|
||||
<tr><td colspan="2">
|
||||
<hr noshade="" size="1"/>
|
||||
</td></tr>
|
||||
<tr><td colspan="2">
|
||||
<div align="center"><font color="#525D76" size="-1"><em>
|
||||
Copyright © 1999-2002, Apache Software Foundation
|
||||
</em></font></div>
|
||||
</td></tr>
|
||||
</table>
|
||||
</body>
|
||||
</html>
|
||||
<!-- end the processing -->
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,57 @@
|
|||
<benchmark>
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br/>
|
||||
<li><i>Dedicated machine for indexing</i>: Self-explanatory
|
||||
(yes/no)</li>
|
||||
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
|
||||
<li><i>RAM</i>: Self-explanatory</li>
|
||||
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, RAID-1,
|
||||
RAID-5)</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br/>
|
||||
<li><i>Java Version</i>: Version of Java SDK/JRE that is run </li>
|
||||
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
|
||||
<li><i>OS Version</i>: Self-explanatory</li>
|
||||
<li><i>Location of index</i>: Is the index stored in filesystem or
|
||||
database? Is it on the same server(local) or
|
||||
over the network?</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br/>
|
||||
<li><i>Number of source documents</i>: Number of documents being
|
||||
indexed</li>
|
||||
<li><i>Total filesize of source documents</i>: Self-explanatory</li>
|
||||
<li><i>Average filesize of source documents</i>:
|
||||
Self-explanatory</li>
|
||||
<li><i>Source documents storage location</i>: Where are the documents
|
||||
being indexed located?
|
||||
Filesystem, DB, http,etc</li>
|
||||
<li><i>File type of source documents</i>: Types of files being
|
||||
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
|
||||
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
|
||||
various files for indexing,
|
||||
e.g. XML parser, HTML parser, etc.</li>
|
||||
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
|
||||
<li><i>Number of fields per document</i>: Number of Fields each
|
||||
Document contains</li>
|
||||
<li><i>Type of fields</i>: Type of each field</li>
|
||||
<li><i>Index persistence</i>: Where the index is stored, e.g.
|
||||
FSDirectory, SqlDirectory, etc</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br/>
|
||||
<li><i>Time taken (in ms/s as an average of at least 3 indexing
|
||||
runs)</i>: Time taken to index to index all files</li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index 1000
|
||||
files</li>
|
||||
<li><i>Memory consumption</i>: Self-explanatory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br/>
|
||||
<li><i>Notes</i>: Any comments which don't belong in the above,
|
||||
special tuning/strategies, etc</li>
|
||||
</p>
|
||||
</ul>
|
||||
</benchmark>
|
|
@ -0,0 +1,283 @@
|
|||
<?xml version="1.0"?>
|
||||
<document>
|
||||
<properties>
|
||||
<author email="kelvint@apache.org">Kelvin Tan</author>
|
||||
<title>Resources - Performance Benchmarks</title>
|
||||
</properties>
|
||||
<body>
|
||||
|
||||
<section name="Performance Benchmarks">
|
||||
<p>
|
||||
The purpose of these user-submitted performance figures is to
|
||||
give current and potential users of Lucene a sense
|
||||
of how well Lucene scales. If the requirements for an upcoming
|
||||
project is similar to an existing benchmark, you
|
||||
will also have something to work with when designing the system
|
||||
architecture for the application.
|
||||
</p>
|
||||
<p>
|
||||
If you've conducted performance tests with Lucene, we'd
|
||||
appreciate if you can submit these figures for display
|
||||
on this page. Post these figures to the lucene-user mailing list
|
||||
using this
|
||||
<a href="benchmarktemplate.xml">template</a>.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section name="Benchmark Variables">
|
||||
<p>
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br/>
|
||||
<li><i>Dedicated machine for indexing</i>: Self-explanatory
|
||||
(yes/no)</li>
|
||||
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
|
||||
<li><i>RAM</i>: Self-explanatory</li>
|
||||
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
|
||||
RAID-1, RAID-5)</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br/>
|
||||
<li><i>Java Version</i>: Version of Java SDK/JRE that is run
|
||||
</li>
|
||||
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
|
||||
<li><i>OS Version</i>: Self-explanatory</li>
|
||||
<li><i>Location of index</i>: Is the index stored in filesystem
|
||||
or database? Is it on the same server(local) or
|
||||
over the network?</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br/>
|
||||
<li><i>Number of source documents</i>: Number of documents being
|
||||
indexed</li>
|
||||
<li><i>Total filesize of source documents</i>:
|
||||
Self-explanatory</li>
|
||||
<li><i>Average filesize of source documents</i>:
|
||||
Self-explanatory</li>
|
||||
<li><i>Source documents storage location</i>: Where are the
|
||||
documents being indexed located?
|
||||
Filesystem, DB, http,etc</li>
|
||||
<li><i>File type of source documents</i>: Types of files being
|
||||
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
|
||||
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
|
||||
various files for indexing,
|
||||
e.g. XML parser, HTML parser, etc.</li>
|
||||
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
|
||||
<li><i>Number of fields per document</i>: Number of Fields each
|
||||
Document contains</li>
|
||||
<li><i>Type of fields</i>: Type of each field</li>
|
||||
<li><i>Index persistence</i>: Where the index is stored, e.g.
|
||||
FSDirectory, SqlDirectory, etc</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br/>
|
||||
<li><i>Time taken (in ms/s as an average of at least 3 indexing
|
||||
runs)</i>: Time taken to index all files</li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index
|
||||
1000 files</li>
|
||||
<li><i>Memory consumption</i>: Self-explanatory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br/>
|
||||
<li><i>Notes</i>: Any comments which don't belong in the above,
|
||||
special tuning/strategies, etc</li>
|
||||
</p>
|
||||
</ul>
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section name="User-submitted Benchmarks">
|
||||
<p>
|
||||
These benchmarks have been kindly submitted by Lucene users for
|
||||
reference purposes.
|
||||
</p>
|
||||
<p><b>We make NO guarantees regarding their accuracy or
|
||||
validity.</b>
|
||||
</p>
|
||||
<p>We strongly recommend you conduct your own
|
||||
performance benchmarks before deciding on a particular
|
||||
hardware/software setup (and hopefully submit
|
||||
these figures to us).
|
||||
</p>
|
||||
|
||||
<subsection name="Hamish Carpenter's benchmarks">
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br/>
|
||||
<li><i>Dedicated machine for indexing</i>: yes</li>
|
||||
<li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
|
||||
<li><i>RAM</i>: 512 DDR</li>
|
||||
<li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br/>
|
||||
<li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
|
||||
<li><i>Java VM</i>: </li>
|
||||
<li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
|
||||
<li><i>Location of index</i>: local</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br/>
|
||||
<li><i>Number of source documents</i>: Random generator. Set
|
||||
to make 1M documents
|
||||
in 2x500,000 batches.</li>
|
||||
<li><i>Total filesize of source documents</i>: > 1GB if
|
||||
stored</li>
|
||||
<li><i>Average filesize of source documents</i>: 1KB</li>
|
||||
<li><i>Source documents storage location</i>: Filesystem</li>
|
||||
<li><i>File type of source documents</i>: Generated</li>
|
||||
<li><i>Parser(s) used, if any</i>: </li>
|
||||
<li><i>Analyzer(s) used</i>: Default</li>
|
||||
<li><i>Number of fields per document</i>: 11</li>
|
||||
<li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
|
||||
<li><i>Index persistence</i>: FSDirectory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br/>
|
||||
<li><i>Time taken (in ms/s as an average of at least 3
|
||||
indexing runs)</i>: </li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
|
||||
<li><i>Memory consumption</i>:</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br/>
|
||||
<li><i>Notes</i>:
|
||||
<p>
|
||||
A windows client ran a random document generator which
|
||||
created
|
||||
documents based on some arrays of values and an excerpt
|
||||
(approx 1kb)
|
||||
from a text file of the bible (King James version).<br/>
|
||||
These were submitted via a socket connection (open throughout
|
||||
indexing process).<br/>
|
||||
The index writer was not closed between index calls.<br/>
|
||||
This created a 400Mb index in 23 files (after
|
||||
optimization).<br/>
|
||||
</p>
|
||||
<p>
|
||||
<u>Query details</u>:<br/>
|
||||
</p>
|
||||
<p>
|
||||
Set up a threaded class to start x number of simultaneous
|
||||
threads to
|
||||
search the above created index.
|
||||
</p>
|
||||
<p>
|
||||
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
|
||||
(Teaser:goo* Tea
|
||||
ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
|
||||
+DisplayStartDate:[mkwsw2jk0
|
||||
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
|
||||
</p>
|
||||
<p>
|
||||
This query counted 34000 documents and I limited the returned
|
||||
documents
|
||||
to 5.
|
||||
</p>
|
||||
<p>
|
||||
This is using Peter Halacsy's IndexSearcherCache slightly
|
||||
modified to
|
||||
be a singleton returned cached searchers for a given
|
||||
directory. This
|
||||
solved an initial problem with too many files open and
|
||||
running out of
|
||||
linux handles for them.
|
||||
</p>
|
||||
<pre>
|
||||
Threads|Avg Time per query (ms)
|
||||
1 1009ms
|
||||
2 2043ms
|
||||
3 3087ms
|
||||
4 4045ms
|
||||
.. .
|
||||
.. .
|
||||
10 10091ms
|
||||
</pre>
|
||||
<p>
|
||||
I removed the two date range terms from the query and it made
|
||||
a HUGE
|
||||
difference in performance. With 4 threads the avg time
|
||||
dropped to 900ms!
|
||||
</p>
|
||||
<p>Other query optimizations made little difference.</p></li>
|
||||
</p>
|
||||
</ul>
|
||||
<p>
|
||||
Hamish can be contacted at hamish at catalyst.net.nz.
|
||||
</p>
|
||||
</subsection>
|
||||
|
||||
<subsection name="Justin Greene's benchmarks">
|
||||
<ul>
|
||||
<p>
|
||||
<b>Hardware Environment</b><br/>
|
||||
<li><i>Dedicated machine for indexing</i>: No, but nominal
|
||||
usage at time of indexing.</li>
|
||||
<li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
|
||||
<li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
|
||||
<li><i>Drive configuration</i>: RAID 5 on Fibre Channel
|
||||
Array</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Software environment</b><br/>
|
||||
<li><i>Java Version</i>: 1.3.1_06</li>
|
||||
<li><i>Java VM</i>: </li>
|
||||
<li><i>OS Version</i>: Winnt 4/Sp6</li>
|
||||
<li><i>Location of index</i>: local</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Lucene indexing variables</b><br/>
|
||||
<li><i>Number of source documents</i>: about 60K</li>
|
||||
<li><i>Total filesize of source documents</i>: 6.5GB</li>
|
||||
<li><i>Average filesize of source documents</i>: 100K
|
||||
(6.5GB/60K documents)</li>
|
||||
<li><i>Source documents storage location</i>: filesystem on
|
||||
NTFS</li>
|
||||
<li><i>File type of source documents</i>: </li>
|
||||
<li><i>Parser(s) used, if any</i>: Currently the only parser
|
||||
used is the Quiotix html
|
||||
parser.</li>
|
||||
<li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
|
||||
<li><i>Number of fields per document</i>: 8</li>
|
||||
<li><i>Type of fields</i>: All strings, and all are stored
|
||||
and indexed.</li>
|
||||
<li><i>Index persistence</i>: FSDirectory</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Figures</b><br/>
|
||||
<li><i>Time taken (in ms/s as an average of at least 3
|
||||
indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
|
||||
minutes. Note that the #
|
||||
and size of documents changes daily.</li>
|
||||
<li><i>Time taken / 1000 docs indexed</i>: </li>
|
||||
<li><i>Memory consumption</i>: JVM is given 256MB and uses it
|
||||
all.</li>
|
||||
</p>
|
||||
<p>
|
||||
<b>Notes</b><br/>
|
||||
<li><i>Notes</i>:
|
||||
<p>
|
||||
We have 10 threads reading files from the filesystem and
|
||||
parsing and
|
||||
analyzing them and the pushing them onto a queue and a single
|
||||
thread poping
|
||||
them from the queue and indexing. Note that we are indexing
|
||||
email messages
|
||||
and are storing the entire plaintext in of the message in the
|
||||
index. If the
|
||||
message contains attachment and we do not have a filter for
|
||||
the attachment
|
||||
(ie. we do not do PDFs yet), we discard the data.
|
||||
</p></li>
|
||||
</p>
|
||||
</ul>
|
||||
<p>
|
||||
Justin can be contacted at tvxh-lw4x at spamex.com.
|
||||
</p>
|
||||
</subsection>
|
||||
|
||||
</section>
|
||||
|
||||
</body>
|
||||
</document>
|
||||
|
Loading…
Reference in New Issue