mirror of https://github.com/apache/lucene.git
<?xml version="1.0"?>
<document>
<properties>
<author email="kelvint@apache.org">Kelvin Tan</author>
<title>Resources - Performance Benchmarks</title>
</properties>
<body>

<section name="Performance Benchmarks">
<p>
The purpose of these user-submitted performance figures is to
give current and potential users of Lucene a sense
of how well Lucene scales. If the requirements for an upcoming
project are similar to an existing benchmark, you
will also have something to work with when designing the system
architecture for the application.
</p>
<p>
If you've conducted performance tests with Lucene, we'd
appreciate it if you would submit these figures for display
on this page. Post these figures to the lucene-user mailing list
using this
<a href="benchmarktemplate.xml">template</a>.
</p>
</section>

<section name="Benchmark Variables">
<p>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: Self-explanatory
(yes/no)</li>
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
<li><i>RAM</i>: Self-explanatory</li>
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
RAID-1, RAID-5)</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: Version of Java SDK/JRE that is run
</li>
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockit</li>
<li><i>OS Version</i>: Self-explanatory</li>
<li><i>Location of index</i>: Is the index stored in the filesystem
or in a database? Is it on the same server (local) or
over the network?</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: Number of documents being
indexed</li>
<li><i>Total filesize of source documents</i>:
Self-explanatory</li>
<li><i>Average filesize of source documents</i>:
Self-explanatory</li>
<li><i>Source documents storage location</i>: Where are the
documents being indexed located?
Filesystem, DB, HTTP, etc.</li>
<li><i>File type of source documents</i>: Types of files being
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
various files for indexing,
e.g. XML parser, HTML parser, etc.</li>
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
<li><i>Number of fields per document</i>: Number of Fields each
Document contains</li>
<li><i>Type of fields</i>: Type of each field</li>
<li><i>Index persistence</i>: Where the index is stored, e.g.
FSDirectory, SqlDirectory, etc.</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: Time taken to index all files</li>
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index
1000 files</li>
<li><i>Memory consumption</i>: Self-explanatory</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>: Any comments that don't fit the categories above,
such as special tuning or strategies</li>
</p>
</ul>
</p>
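<p>
As a rough illustration of how the two timing figures above relate, the
sketch below averages several indexing runs and normalizes the result to
1000 documents. The class and method names (BenchmarkFigures,
averageSeconds, secondsPer1000Docs) are hypothetical helpers written for
this page and are not part of Lucene. The sample inputs are the three run
times and the approximate document count from Justin Greene's submission
further down, so the result is only an estimate.
</p>
<pre><![CDATA[
```java
// Hypothetical helper, not part of Lucene: prepares timing figures for a submission.
final class BenchmarkFigures {
    // The template asks for an average over at least 3 indexing runs.
    static final int MIN_RUNS = 3;

    private BenchmarkFigures() {}

    /** Mean of several indexing run times, in seconds. */
    static double averageSeconds(double[] runSeconds) {
        if (runSeconds.length < MIN_RUNS) {
            throw new IllegalArgumentException("need at least " + MIN_RUNS + " runs");
        }
        double total = 0.0;
        for (double seconds : runSeconds) {
            total += seconds;
        }
        return total / runSeconds.length;
    }

    /** Time per 1000 documents, given the time for the whole collection. */
    static double secondsPer1000Docs(double totalSeconds, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return totalSeconds * 1000.0 / docCount;
    }

    public static void main(String[] args) {
        // 1h12m, 1h14m and 1h17m, converted to seconds; about 60,000 documents.
        double[] runs = {72 * 60, 74 * 60, 77 * 60};
        double average = averageSeconds(runs);
        System.out.printf("average run: %.0f s%n", average);
        System.out.printf("per 1000 docs: %.1f s%n", secondsPer1000Docs(average, 60_000));
    }
}
```
]]></pre>
<p>
With those inputs the average run is 4460 seconds, or roughly 74 seconds
per 1000 documents. Because that submission notes that the number and size
of documents change daily, treat the figure as indicative only.
</p>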
</section>

<section name="User-submitted Benchmarks">
<p>
These benchmarks have been kindly submitted by Lucene users for
reference purposes.
</p>
<p><b>We make NO guarantees regarding their accuracy or
validity.</b>
</p>
<p>We strongly recommend you conduct your own
performance benchmarks before deciding on a particular
hardware/software setup (and hopefully submit
these figures to us).
</p>

<subsection name="Hamish Carpenter's benchmarks">
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: yes</li>
<li><i>CPU</i>: Intel x86 P4 1.5 GHz</li>
<li><i>RAM</i>: 512 DDR</li>
<li><i>Drive configuration</i>: IDE 7200 rpm RAID-1</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
<li><i>Java VM</i>: </li>
<li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
<li><i>Location of index</i>: local</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: Random generator, set
to produce 1M documents
in 2 batches of 500,000.</li>
<li><i>Total filesize of source documents</i>: > 1GB if
stored</li>
<li><i>Average filesize of source documents</i>: 1KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: Generated</li>
<li><i>Parser(s) used, if any</i>: </li>
<li><i>Analyzer(s) used</i>: Default</li>
<li><i>Number of fields per document</i>: 11</li>
<li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
<li><i>Index persistence</i>: FSDirectory</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: </li>
<li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
<li><i>Memory consumption</i>:</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<p>
A Windows client ran a random document generator which created
documents based on some arrays of values and an excerpt (approx. 1 KB)
from a text file of the Bible (King James Version).<br/>
These were submitted via a socket connection (open throughout
the indexing process).<br/>
The index writer was not closed between index calls.<br/>
This created a 400 MB index in 23 files (after
optimization).<br/>
</p>
<p>
<u>Query details</u>:<br/>
</p>
<p>
Set up a threaded class to start x number of simultaneous
threads to
search the above created index.
</p>
<p>
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
(Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+DisplayStartDate:[mkwsw2jk0-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
</p>
<p>
This query matched 34,000 documents, and I limited the returned
documents to 5.
</p>
<p>
This is using Peter Halacsy's IndexSearcherCache, slightly modified to
be a singleton that returns cached searchers for a given directory. This
solved an initial problem with too many open files and running out of
Linux file handles for them.
</p>
<pre>
Threads | Avg time per query (ms)
1       | 1009
2       | 2043
3       | 3087
4       | 4045
...     | ...
10      | 10091
</pre>
<p>
I removed the two date range terms from the query and it made a HUGE
difference in performance. With 4 threads the average time
dropped to 900 ms!
</p>
<p>Other query optimizations made little difference.</p></li>
</p>
</ul>
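<p>
The notes above mention a searcher cache modified to hand out one cached
searcher per directory. Peter Halacsy's IndexSearcherCache itself is not
reproduced here; what follows is only a minimal sketch of the idea. The
class name SearcherCache, the type parameter S standing in for the searcher,
and the caller-supplied opener function are all assumptions made for
illustration, and none of it is Lucene API.
</p>
<pre><![CDATA[
```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: one shared searcher per index directory.
// S stands in for the searcher type; nothing here is Lucene API.
final class SearcherCache<S> {
    private final ConcurrentMap<String, S> searchers = new ConcurrentHashMap<>();
    private final Function<String, S> opener;

    /** opener is called at most once per directory to create its searcher. */
    SearcherCache(Function<String, S> opener) {
        this.opener = opener;
    }

    /** Returns the cached searcher for a directory, opening it on first use only. */
    S get(String indexDir) {
        return searchers.computeIfAbsent(indexDir, opener);
    }

    /** Number of directories that currently have an open searcher. */
    int size() {
        return searchers.size();
    }
}
```
]]></pre>
<p>
If the application shares a single SearcherCache instance, every query
thread reuses the same searcher for a given directory instead of opening
its own, which keeps the number of open index files independent of the
number of threads.
</p>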
<p>
Hamish can be contacted at hamish at catalyst.net.nz.
</p>
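<p>
The threaded query test described in the notes for this submission can be
approximated with a small harness like the one below. It is a sketch under
stated assumptions, not the code that produced the table: QueryLoadTest and
averageQueryMillis are hypothetical names, and the query is passed in as a
plain Runnable so that a real index search can be substituted for the
sleeping stand-in used in main.
</p>
<pre><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: times one query per thread, all started together.
final class QueryLoadTest {
    private QueryLoadTest() {}

    /** Runs the query once on each of `threads` workers; returns mean ms per query. */
    static double averageQueryMillis(int threads, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    query.run();
                    return System.nanoTime() - start;
                });
            }
            long totalNanos = 0;
            // invokeAll blocks until every submitted query has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                totalNanos += result.get();
            }
            return totalNanos / 1_000_000.0 / threads;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real index search; replace with the query under test.
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int threads = 1; threads <= 4; threads++) {
            System.out.printf("%d threads: %.1f ms%n", threads,
                    averageQueryMillis(threads, fakeQuery));
        }
    }
}
```
]]></pre>
<p>
In the reported table the average time grows by roughly one second per
added thread, which is consistent with the searches effectively being
served one at a time. A harness along these lines makes that kind of
pattern easy to spot when comparing query variants.
</p>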
</subsection>

<subsection name="Justin Greene's benchmarks">
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: No, but nominal
usage at time of indexing.</li>
<li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
<li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
<li><i>Drive configuration</i>: RAID 5 on Fibre Channel
Array</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: 1.3.1_06</li>
<li><i>Java VM</i>: </li>
<li><i>OS Version</i>: Windows NT 4.0 SP6</li>
<li><i>Location of index</i>: local</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: about 60K</li>
<li><i>Total filesize of source documents</i>: 6.5GB</li>
<li><i>Average filesize of source documents</i>: 100K
(6.5GB/60K documents)</li>
<li><i>Source documents storage location</i>: filesystem on
NTFS</li>
<li><i>File type of source documents</i>: </li>
<li><i>Parser(s) used, if any</i>: Currently the only parser
used is the Quiotix html
parser.</li>
<li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
<li><i>Number of fields per document</i>: 8</li>
<li><i>Type of fields</i>: All strings, and all are stored
and indexed.</li>
<li><i>Index persistence</i>: FSDirectory</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
minutes. Note that the number
and size of documents change daily.</li>
<li><i>Time taken / 1000 docs indexed</i>: </li>
<li><i>Memory consumption</i>: JVM is given 256MB and uses it
all.</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<p>
We have 10 threads reading files from the filesystem, parsing and
analyzing them, and then pushing them onto a queue, and a single
thread popping them from the queue and indexing. Note that we are indexing
email messages and are storing the entire plaintext of the message in the
index. If the message contains an attachment and we do not have a filter for
the attachment (i.e. we do not do PDFs yet), we discard the data.
</p></li>
</p>
</ul>
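<p>
The arrangement described in the notes above, several threads parsing
files and a single thread indexing from a shared queue, might look roughly
like this. It is a simplified sketch, not the submitter's code:
IndexingPipeline, the parser function and the indexer callback are
hypothetical stand-ins for the real parsing and Lucene indexing steps, the
queue capacity of 100 is an arbitrary choice, and error handling is
omitted.
</p>
<pre><![CDATA[
```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: many parser threads feed one indexing thread through a queue.
final class IndexingPipeline {
    // Unique end-of-input marker, compared by identity rather than by content.
    private static final String END = new String("END");

    private IndexingPipeline() {}

    /** Parses every path on a pool of threads, indexes on one thread, returns the count indexed. */
    static int run(List<String> paths, int parserThreads,
                   Function<String, String> parser, Consumer<String> indexer) {
        // Bounded queue, so parsers wait instead of piling up documents in memory.
        BlockingQueue<String> parsed = new LinkedBlockingQueue<>(100);
        ExecutorService parsers = Executors.newFixedThreadPool(parserThreads);
        int[] indexed = {0};

        Thread indexerThread = new Thread(() -> {
            try {
                for (String doc = parsed.take(); doc != END; doc = parsed.take()) {
                    indexer.accept(doc);
                    indexed[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexerThread.start();

        try {
            for (String path : paths) {
                parsers.execute(() -> {
                    try {
                        parsed.put(parser.apply(path));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            parsers.shutdown();
            if (!parsers.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("parsers did not finish in time");
            }
            parsed.put(END);      // all documents are queued; tell the indexer to stop
            indexerThread.join(); // join makes the final count visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return indexed[0];
    }
}
```
]]></pre>
<p>
Keeping a single indexing thread means only one writer touches the index,
while the parsing work, which the submission spreads over 10 threads, runs
in parallel ahead of it.
</p>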
<p>
Justin can be contacted at tvxh-lw4x at spamex.com.
</p>
</subsection>


<subsection name="Daniel Armbrust's benchmarks">
<p>
My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
nor was the total index built in one shot. The index was created on several different
machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
1 million documents. Each of these small indexes was then moved to a
much larger drive, where they were all merged together into a big index.
This process was done manually, over the course of several months, as the sources became available.
</p>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load. However, the indexing process was built single
threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
<li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
<li><i>RAM</i>: 4 GB Memory</li>
<li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: 1.3.1</li>
<li><i>Java VM</i>: </li>
<li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
<li><i>Location of index</i>: local</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: 13,820,517</li>
<li><i>Total filesize of source documents</i>: 87.3 GB</li>
<li><i>Average filesize of source documents</i>: 6.3 KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: XML</li>
<li><i>Parser(s) used, if any</i>: </li>
<li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
<li><i>Number of fields per document</i>: 1 - 31</li>
<li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 12.5 GB</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
<li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
<li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
1 GB of memory was allotted to the indexer</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<p>
The source documents were XML. The "indexer" opened each document one at a time, ran an
XSL transformation on it, and then proceeded to index the stream. The indexer optimized
the index every 50,000 documents (on this run), though previously we optimized every
300,000 documents. The performance didn't change much either way. We did no other
tuning (RAM directories, a separate process to pretransform the source material, etc.)
to make it index faster. When all of these individual indexes were built, they were
merged together into the main index. That process usually took about a day.
</p></li>
</p>
</ul>
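<p>
The derived figures in this submission can be cross-checked from the raw
numbers it reports. The snippet below is a small sketch of that arithmetic.
SubmissionCheck and its methods are hypothetical names, and the average
file size calculation assumes decimal units (1 GB = 1,000,000 KB), which
the submission does not state.
</p>
<pre><![CDATA[
```java
// Hypothetical helper: re-derives figures quoted in the submission above.
final class SubmissionCheck {
    private SubmissionCheck() {}

    /** Seconds needed per 1000 documents. */
    static double secondsPer1000Docs(long totalSeconds, long docCount) {
        return totalSeconds * 1000.0 / docCount;
    }

    /** Converts a duration in seconds to days (86,400 seconds per day). */
    static double days(long seconds) {
        return seconds / 86_400.0;
    }

    /** Average document size in KB, assuming decimal units (1 GB = 1,000,000 KB). */
    static double averageKilobytes(double totalGigabytes, long docCount) {
        return totalGigabytes * 1_000_000.0 / docCount;
    }

    public static void main(String[] args) {
        // 617271 documents indexed in 209698 seconds.
        System.out.printf("%.1f s per 1000 docs%n", secondsPer1000Docs(209_698L, 617_271L));
        System.out.printf("%.2f days%n", days(209_698L));
        // 87.3 GB spread over 13,820,517 source documents.
        System.out.printf("%.1f KB per document%n", averageKilobytes(87.3, 13_820_517L));
    }
}
```
]]></pre>
<p>
The results come out at roughly 340 seconds per 1000 documents and about
6.3 KB per document, matching the reported values, and about 2.4 days for
the run, slightly under the rounded 2.5 days quoted.
</p>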
<p>
Daniel can be contacted at Armbrust.Daniel at mayo.edu.
</p>
</subsection>

</section>

</body>
</document>