- Modified docs.

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149903 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Otis Gospodnetic 2002-12-12 06:23:48 +00:00
parent bf5028d9ac
commit 9ff9b75780
19 changed files with 678 additions and 519 deletions

View File

@ -5,6 +5,7 @@
<!-- start the processing -->
<!-- ====================================================================== -->
<!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>
@ -122,17 +123,17 @@
<blockquote>
<p>
The purpose of these user-submitted performance figures is to
give current and potential users of Lucene a sense
of how well Lucene scales. If the requirements for an upcoming
project are similar to an existing benchmark, you
will also have something to work with when designing the system
architecture for the application.
</p>
<p>
If you've conducted performance tests with Lucene, we'd
appreciate it if you could submit these figures for display
on this page. Post these figures to the lucene-user mailing list
using this
<a href="benchmarktemplate.xml">template</a>.
</p>
</blockquote>
@ -153,57 +154,57 @@ using this
<p>
<b>Hardware Environment</b><br />
<li><i>Dedicated machine for indexing</i>: Self-explanatory
(yes/no)</li>
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
<li><i>RAM</i>: Self-explanatory</li>
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
RAID-1, RAID-5)</li>
</p>
<p>
<b>Software environment</b><br />
<li><i>Java Version</i>: Version of the Java SDK/JRE that is run
</li>
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockit</li>
<li><i>OS Version</i>: Self-explanatory</li>
<li><i>Location of index</i>: Is the index stored in the filesystem
or in a database? Is it on the same server (local) or
over the network?</li>
</p>
<p>
<b>Lucene indexing variables</b><br />
<li><i>Number of source documents</i>: Number of documents being
indexed</li>
<li><i>Total filesize of source documents</i>:
Self-explanatory</li>
<li><i>Average filesize of source documents</i>:
Self-explanatory</li>
<li><i>Source documents storage location</i>: Where are the
documents being indexed located?
Filesystem, DB, HTTP, etc.</li>
<li><i>File type of source documents</i>: Types of files being
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
various files for indexing,
e.g. XML parser, HTML parser, etc.</li>
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
<li><i>Number of fields per document</i>: Number of Fields each
Document contains</li>
<li><i>Type of fields</i>: Type of each field</li>
<li><i>Index persistence</i>: Where the index is stored, e.g.
FSDirectory, SqlDirectory, etc.</li>
</p>
<p>
<b>Figures</b><br />
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: Time taken to index all files</li>
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index
1000 files</li>
<li><i>Memory consumption</i>: Self-explanatory</li>
</p>
<p>
<b>Notes</b><br />
<li><i>Notes</i>: Any comments which don't belong in the above,
special tuning/strategies, etc.</li>
</p>
</ul>
</p>
@ -222,14 +223,14 @@ special tuning/strategies, etc</li>
<blockquote>
<p>
These benchmarks have been kindly submitted by Lucene users for
reference purposes.
</p>
<p><b>We make NO guarantees regarding their accuracy or
validity.</b>
</p>
<p>We strongly recommend you conduct your own
performance benchmarks before deciding on a particular
hardware/software setup (and hopefully submit
these figures to us).
</p>
<table border="0" cellspacing="0" cellpadding="2" width="100%">
@ -258,10 +259,10 @@ hardware/software setup (and hopefully submit
<p>
<b>Lucene indexing variables</b><br />
<li><i>Number of source documents</i>: Random generator. Set
to make 1M documents
in 2x500,000 batches.</li>
<li><i>Total filesize of source documents</i>: &gt; 1GB if
stored</li>
<li><i>Average filesize of source documents</i>: 1KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: Generated</li>
@ -274,7 +275,7 @@ stored</li>
<p>
<b>Figures</b><br />
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: </li>
<li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
<li><i>Memory consumption</i>:</li>
</p>
@ -283,43 +284,43 @@ indexing runs)</i>: </li>
<li><i>Notes</i>:
<p>
A Windows client ran a random document generator which
created
documents based on some arrays of values and an excerpt
(approx. 1KB)
from a text file of the Bible (King James Version).<br />
These were submitted via a socket connection (open throughout
the indexing process).<br />
The index writer was not closed between index calls.<br />
This created a 400MB index in 23 files (after
optimization).<br />
</p>
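<p>
A minimal sketch of this "writer left open" pattern, assuming the
Lucene 1.x API of the period; the class name, index path, and the
nextExcerpt() generator stand-in are invented for illustration:
</p>
<pre>
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class GeneratedDocIndexer {
  public static void main(String[] args) throws Exception {
    // Open the writer once; it stays open for the whole run,
    // exactly as in the benchmark above.
    IndexWriter writer = new IndexWriter("/data/index", new SimpleAnalyzer(), true);
    for (int i = 0; i < 1000000; i++) {
      Document doc = new Document();
      doc.add(Field.Text("contents", nextExcerpt(i))); // generated text
      writer.addDocument(doc);   // no close() between index calls
    }
    writer.optimize();           // single optimize at the end
    writer.close();
  }

  // Hypothetical stand-in for the random document generator.
  static String nextExcerpt(int i) { return "excerpt " + i; }
}
</pre>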
<p>
<u>Query details</u>:<br />
</p>
<p>
Set up a threaded class to start x number of simultaneous
threads to
search the index created above.
</p>
<p>
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
(Teaser:goo*
Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+DisplayStartDate:[mkwsw2jk0
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
</p>
<p>
This query counted 34,000 documents and I limited the returned
documents
to 5.
</p>
<p>
This is using Peter Halacsy's IndexSearcherCache, slightly
modified to
be a singleton that returned cached searchers for a given
directory. This
solved an initial problem with too many files open and
running out of
Linux file handles for them (see the sketch below).
</p>
<pre>
@ -334,9 +335,9 @@ running out of
</pre>
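<p>
Peter Halacsy's IndexSearcherCache itself is not reproduced here; the
following is only a minimal sketch of the caching idea it describes (one
cached IndexSearcher handed out per directory), assuming the Lucene 1.x
API, with invented names:
</p>
<pre>
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;

// Illustration only -- not Halacsy's actual class. One searcher is
// opened per directory and shared by all threads, so the index files
// are opened once instead of once per query.
public class CachedSearchers {
  private static final Map BY_DIRECTORY = new HashMap();

  public static synchronized IndexSearcher get(String dir) throws Exception {
    IndexSearcher searcher = (IndexSearcher) BY_DIRECTORY.get(dir);
    if (searcher == null) {
      searcher = new IndexSearcher(dir);  // opens the index files
      BY_DIRECTORY.put(dir, searcher);
    }
    return searcher;
  }
}
</pre>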
<p>
I removed the two date range terms from the query and it made
a HUGE
difference in performance. With 4 threads the avg time
dropped to 900ms!
</p>
<p>Other query optimizations made little difference.</p></li>
</p>
@ -360,11 +361,11 @@ dropped to 900ms!
<p>
<b>Hardware Environment</b><br />
<li><i>Dedicated machine for indexing</i>: No, but nominal
usage at time of indexing.</li>
<li><i>CPU</i>: Compaq ProLiant 1850R/600, 2 x PIII 600</li>
<li><i>RAM</i>: 1GB, 256MB allocated to the JVM.</li>
<li><i>Drive configuration</i>: RAID 5 on Fibre Channel
Array</li>
</p>
<p>
<b>Software environment</b><br />
@ -378,43 +379,43 @@ Array</li>
<li><i>Number of source documents</i>: about 60K</li>
<li><i>Total filesize of source documents</i>: 6.5GB</li>
<li><i>Average filesize of source documents</i>: 100K
(6.5GB/60K documents)</li>
<li><i>Source documents storage location</i>: filesystem on
NTFS</li>
<li><i>File type of source documents</i>: </li>
<li><i>Parser(s) used, if any</i>: Currently the only parser
used is the Quiotix HTML
parser.</li>
<li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
<li><i>Number of fields per document</i>: 8</li>
<li><i>Type of fields</i>: All strings, and all are stored
and indexed.</li>
<li><i>Index persistence</i>: FSDirectory</li>
</p>
<p>
<b>Figures</b><br />
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
minutes. Note that the number
and size of documents change daily.</li>
<li><i>Time taken / 1000 docs indexed</i>: </li>
<li><i>Memory consumption</i>: The JVM is given 256MB and uses it
all.</li>
</p>
<p>
<b>Notes</b><br />
<li><i>Notes</i>:
<p>
We have 10 threads reading files from the filesystem and
parsing and
analyzing them, then pushing them onto a queue, and a single
thread popping
them from the queue and indexing (see the sketch below). Note that we are indexing
email messages
and are storing the entire plaintext of the message in the
index. If the
message contains an attachment and we do not have a filter for
the attachment
(i.e. we do not do PDFs yet), we discard the data.
</p></li>
</p>
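<p>
A minimal sketch of such a queue, assuming the Lucene 1.x API and
pre-1.4 Java (no java.util.concurrent); the class and method names are
invented:
</p>
<pre>
import java.util.LinkedList;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Parser threads call put(); the single indexing thread runs indexLoop(),
// so only one thread ever touches the IndexWriter.
public class IndexingQueue {
  private final LinkedList queue = new LinkedList();
  private final IndexWriter writer;

  public IndexingQueue(IndexWriter writer) { this.writer = writer; }

  public synchronized void put(Document doc) {
    queue.addLast(doc);
    notify();  // wake the indexing thread
  }

  public void indexLoop() throws Exception {
    while (true) {
      Document doc;
      synchronized (this) {
        while (queue.isEmpty()) wait();
        doc = (Document) queue.removeFirst();
      }
      writer.addDocument(doc);  // done outside the lock
    }
  }
}
</pre>
<p>
Keeping the IndexWriter on a single thread sidesteps any need to
synchronize on the writer itself.
</p>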
@ -425,6 +426,81 @@ the attachment
</blockquote>
</td></tr>
<tr><td><br/></td></tr>
</table>
<table border="0" cellspacing="0" cellpadding="2" width="100%">
<tr><td bgcolor="#828DA6">
<font color="#ffffff" face="arial,helvetica,sanserif">
<a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
</font>
</td></tr>
<tr><td>
<blockquote>
<p>
My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
nor was the total index built in one shot. The index was created on several different
machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
1 million documents. Each of these small indexes was then moved to a
much larger drive, where they were all merged together into a big index.
This process was done manually, over the course of several months, as the sources became available.
</p>
<ul>
<p>
<b>Hardware Environment</b><br />
<li><i>Dedicated machine for indexing</i>: No - the machine had moderate to low load. However, the indexing process was built
single-threaded, so it only took advantage of one of the processors. It usually got 100% of this processor.</li>
<li><i>CPU</i>: Sun Ultra 80, 4 x 64-bit processors</li>
<li><i>RAM</i>: 4 GB Memory</li>
<li><i>Drive configuration</i>: Ultra-SCSI Wide 10,000 RPM 36GB drive</li>
</p>
<p>
<b>Software environment</b><br />
<li><i>Java Version</i>: 1.3.1</li>
<li><i>Java VM</i>: </li>
<li><i>OS Version</i>: SunOS 5.8 (64-bit)</li>
<li><i>Location of index</i>: local</li>
</p>
<p>
<b>Lucene indexing variables</b><br />
<li><i>Number of source documents</i>: 13,820,517</li>
<li><i>Total filesize of source documents</i>: 87.3 GB</li>
<li><i>Average filesize of source documents</i>: 6.3 KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: XML</li>
<li><i>Parser(s) used, if any</i>: </li>
<li><i>Analyzer(s) used</i>: A home-grown analyzer that simply removes stopwords.</li>
<li><i>Number of fields per document</i>: 1-31</li>
<li><i>Type of fields</i>: All text, though 2 of them are dates (e.g. 20001205) that we filter on</li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 12.5 GB</li>
</p>
<p>
<b>Figures</b><br />
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: For 617,271 documents, 209,698 seconds (~2.5 days)</li>
<li><i>Time taken / 1000 docs indexed</i>: 340 seconds</li>
<li><i>Memory consumption</i>: Java was executed with java -Xmx1000m -Xss8192k, so
1 GB of memory was allotted to the indexer</li>
</p>
<p>
<b>Notes</b><br />
<li><i>Notes</i>:
<p>
The source documents were XML. The "indexer" opened each document one at a time, ran an
XSL transformation on it, and then proceeded to index the stream. The indexer optimized
the index every 50,000 documents (on this run), though previously we optimized every
300,000 documents. The performance didn't change much either way. We did no other
tuning (RAMDirectories, a separate process to pre-transform the source material, etc.)
to make it index faster. When all of these individual indexes were built, they were
merged together into the main index; that process usually took about a day (see the sketch below).
</p></li>
</p>
</ul>
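<p>
A minimal sketch of the two steps described above (periodic optimize,
then merging batch indexes into the main index), assuming the Lucene 1.x
API; the paths and the docs[] source are invented:
</p>
<pre>
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BatchIndexing {
  // Optimize the working index every 50,000 documents, as in the notes.
  static void indexBatch(IndexWriter writer, Document[] docs) throws Exception {
    for (int i = 0; i < docs.length; i++) {
      writer.addDocument(docs[i]);
      if ((i + 1) % 50000 == 0) writer.optimize();
    }
  }

  // Fold the finished batch indexes into the main index.
  static void mergeBatches(IndexWriter main, String[] batchPaths) throws Exception {
    Directory[] dirs = new Directory[batchPaths.length];
    for (int i = 0; i < batchPaths.length; i++) {
      dirs[i] = FSDirectory.getDirectory(batchPaths[i], false);
    }
    main.addIndexes(dirs);  // the merge step that took about a day
  }
}
</pre>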
<p>
Daniel can be contacted at Armbrust.Daniel at mayo.edu.
</p>
</blockquote>
</td></tr>
<tr><td><br/></td></tr>
</table>
</blockquote>
</p>

View File

@ -5,6 +5,7 @@
<!-- start the processing -->
<!-- ====================================================================== -->
<!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>

View File

@ -9,17 +9,17 @@
<section name="Performance Benchmarks">
<p>
The purpose of these user-submitted performance figures is to
give current and potential users of Lucene a sense
of how well Lucene scales. If the requirements for an upcoming
project are similar to an existing benchmark, you
will also have something to work with when designing the system
architecture for the application.
</p>
<p>
If you've conducted performance tests with Lucene, we'd
appreciate it if you could submit these figures for display
on this page. Post these figures to the lucene-user mailing list
using this
<a href="benchmarktemplate.xml">template</a>.
</p>
</section>
@ -30,57 +30,57 @@ using this
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: Self-explanatory
(yes/no)</li>
<li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
<li><i>RAM</i>: Self-explanatory</li>
<li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
RAID-1, RAID-5)</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: Version of the Java SDK/JRE that is run
</li>
<li><i>Java VM</i>: Server/client VM, Sun VM/JRockit</li>
<li><i>OS Version</i>: Self-explanatory</li>
<li><i>Location of index</i>: Is the index stored in the filesystem
or in a database? Is it on the same server (local) or
over the network?</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: Number of documents being
indexed</li>
<li><i>Total filesize of source documents</i>:
Self-explanatory</li>
<li><i>Average filesize of source documents</i>:
Self-explanatory</li>
<li><i>Source documents storage location</i>: Where are the
documents being indexed located?
Filesystem, DB, HTTP, etc.</li>
<li><i>File type of source documents</i>: Types of files being
indexed, e.g. HTML files, XML files, PDF files, etc.</li>
<li><i>Parser(s) used, if any</i>: Parsers used for parsing the
various files for indexing,
e.g. XML parser, HTML parser, etc.</li>
<li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
<li><i>Number of fields per document</i>: Number of Fields each
Document contains</li>
<li><i>Type of fields</i>: Type of each field</li>
<li><i>Index persistence</i>: Where the index is stored, e.g.
FSDirectory, SqlDirectory, etc.</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: Time taken to index all files</li>
<li><i>Time taken / 1000 docs indexed</i>: Time taken to index
1000 files</li>
<li><i>Memory consumption</i>: Self-explanatory</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>: Any comments which don't belong in the above,
special tuning/strategies, etc.</li>
</p>
</ul>
</p>
@ -89,14 +89,14 @@ special tuning/strategies, etc</li>
<section name="User-submitted Benchmarks">
<p>
These benchmarks have been kindly submitted by Lucene users for
reference purposes.
</p>
<p><b>We make NO guarantees regarding their accuracy or
validity.</b>
</p>
<p>We strongly recommend you conduct your own
performance benchmarks before deciding on a particular
hardware/software setup (and hopefully submit
these figures to us).
</p>
@ -119,10 +119,10 @@ hardware/software setup (and hopefully submit
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: Random generator. Set
to make 1M documents
in 2x500,000 batches.</li>
<li><i>Total filesize of source documents</i>: > 1GB if
stored</li>
<li><i>Average filesize of source documents</i>: 1KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: Generated</li>
@ -135,7 +135,7 @@ stored</li>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: </li>
<li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
<li><i>Memory consumption</i>:</li>
</p>
@ -144,43 +144,43 @@ indexing runs)</i>: </li>
<li><i>Notes</i>:
<p>
A Windows client ran a random document generator which
created
documents based on some arrays of values and an excerpt
(approx. 1KB)
from a text file of the Bible (King James Version).<br/>
These were submitted via a socket connection (open throughout
the indexing process).<br/>
The index writer was not closed between index calls.<br/>
This created a 400MB index in 23 files (after
optimization).<br/>
</p>
<p>
<u>Query details</u>:<br/>
</p>
<p>
Set up a threaded class to start x number of simultaneous
threads to
search the index created above.
</p>
<p>
Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
(Teaser:goo*
Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+DisplayStartDate:[mkwsw2jk0
-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
</p>
<p>
This query counted 34,000 documents and I limited the returned
documents
to 5.
</p>
<p>
This is using Peter Halacsy's IndexSearcherCache, slightly
modified to
be a singleton that returned cached searchers for a given
directory. This
solved an initial problem with too many files open and
running out of
Linux file handles for them.
</p>
<pre>
@ -195,9 +195,9 @@ running out of
</pre>
<p>
I removed the two date range terms from the query and it made
a HUGE
difference in performance. With 4 threads the avg time
dropped to 900ms!
</p>
<p>Other query optimizations made little difference.</p></li>
</p>
@ -212,11 +212,11 @@ dropped to 900ms!
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: No, but nominal
usage at time of indexing.</li>
<li><i>CPU</i>: Compaq ProLiant 1850R/600, 2 x PIII 600</li>
<li><i>RAM</i>: 1GB, 256MB allocated to the JVM.</li>
<li><i>Drive configuration</i>: RAID 5 on Fibre Channel
Array</li>
</p>
<p>
<b>Software environment</b><br/>
@ -230,43 +230,43 @@ Array</li>
<li><i>Number of source documents</i>: about 60K</li>
<li><i>Total filesize of source documents</i>: 6.5GB</li>
<li><i>Average filesize of source documents</i>: 100K
(6.5GB/60K documents)</li>
<li><i>Source documents storage location</i>: filesystem on
NTFS</li>
<li><i>File type of source documents</i>: </li>
<li><i>Parser(s) used, if any</i>: Currently the only parser
used is the Quiotix HTML
parser.</li>
<li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
<li><i>Number of fields per document</i>: 8</li>
<li><i>Type of fields</i>: All strings, and all are stored
and indexed.</li>
<li><i>Index persistence</i>: FSDirectory</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
minutes. Note that the number
and size of documents change daily.</li>
<li><i>Time taken / 1000 docs indexed</i>: </li>
<li><i>Memory consumption</i>: The JVM is given 256MB and uses it
all.</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<p>
We have 10 threads reading files from the filesystem and
parsing and
analyzing them, then pushing them onto a queue, and a single
thread popping
them from the queue and indexing. Note that we are indexing
email messages
and are storing the entire plaintext of the message in the
index. If the
message contains an attachment and we do not have a filter for
the attachment
(i.e. we do not do PDFs yet), we discard the data.
</p></li>
</p>
@ -276,8 +276,74 @@ the attachment
</p>
</subsection>
<subsection name="Daniel Armbrust's benchmarks">
<p>
My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
nor was the total index built in one shot. The index was created on several different
machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
1 million documents. Each of these small indexes was then moved to a
much larger drive, where they were all merged together into a big index.
This process was done manually, over the course of several months, as the sources became available.
</p>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: No - the machine had moderate to low load. However, the indexing process was built
single-threaded, so it only took advantage of one of the processors. It usually got 100% of this processor.</li>
<li><i>CPU</i>: Sun Ultra 80, 4 x 64-bit processors</li>
<li><i>RAM</i>: 4 GB Memory</li>
<li><i>Drive configuration</i>: Ultra-SCSI Wide 10,000 RPM 36GB drive</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Java Version</i>: 1.3.1</li>
<li><i>Java VM</i>: </li>
<li><i>OS Version</i>: SunOS 5.8 (64-bit)</li>
<li><i>Location of index</i>: local</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: 13,820,517</li>
<li><i>Total filesize of source documents</i>: 87.3 GB</li>
<li><i>Average filesize of source documents</i>: 6.3 KB</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: XML</li>
<li><i>Parser(s) used, if any</i>: </li>
<li><i>Analyzer(s) used</i>: A home-grown analyzer that simply removes stopwords.</li>
<li><i>Number of fields per document</i>: 1-31</li>
<li><i>Type of fields</i>: All text, though 2 of them are dates (e.g. 20001205) that we filter on</li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 12.5 GB</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3
indexing runs)</i>: For 617,271 documents, 209,698 seconds (~2.5 days)</li>
<li><i>Time taken / 1000 docs indexed</i>: 340 seconds</li>
<li><i>Memory consumption</i>: Java was executed with java -Xmx1000m -Xss8192k, so
1 GB of memory was allotted to the indexer</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<p>
The source documents were XML. The "indexer" opened each document one at a time, ran an
XSL transformation on it, and then proceeded to index the stream. The indexer optimized
the index every 50,000 documents (on this run), though previously we optimized every
300,000 documents. The performance didn't change much either way. We did no other
tuning (RAMDirectories, a separate process to pre-transform the source material, etc.)
to make it index faster. When all of these individual indexes were built, they were
merged together into the main index; that process usually took about a day.
</p></li>
</p>
</ul>
<p>
Daniel can be contacted at Armbrust.Daniel at mayo.edu.
</p>
</subsection>
</section>
</body>
</document>