Kelvin Tan Resources - Performance Benchmarks

The purpose of these user-submitted performance figures is to give current and potential users of Lucene a sense of how well Lucene scales. If the requirements for an upcoming project are similar to an existing benchmark, you will also have a starting point when designing the system architecture for the application.

If you've conducted performance tests with Lucene, we'd appreciate it if you could submit those figures for display on this page. Post them to the lucene-user mailing list using this template.

These benchmarks have been kindly submitted by Lucene users for reference purposes.

We make NO guarantees regarding their accuracy or validity.

We strongly recommend you conduct your own performance benchmarks before deciding on a particular hardware/software setup (and hopefully submit these figures to us).

    Hardware Environment

  • Dedicated machine for indexing: yes
  • CPU: Intel x86 P4 1.5GHz
  • RAM: 512MB DDR
  • Drive configuration: IDE 7200rpm RAID-1
    Software Environment

  • Java Version: 1.3.1 IBM JITC Enabled
  • Java VM:
  • OS Version: Debian Linux 2.4.18-686
  • Location of index: local
    Lucene Indexing Variables

  • Number of source documents: randomly generated; 1,000,000 documents in two batches of 500,000
  • Total filesize of source documents: > 1GB if stored
  • Average filesize of source documents: 1KB
  • Source documents storage location: Filesystem
  • File type of source documents: Generated
  • Parser(s) used, if any:
  • Analyzer(s) used: Default
  • Number of fields per document: 11
  • Type of fields: 1 date, 1 id, 9 text
  • Index persistence: FSDirectory
    Figures

  • Time taken (in ms/s as an average of at least 3 indexing runs):
  • Time taken / 1000 docs indexed: 49 seconds
  • Memory consumption:
    Notes

    A Windows client ran a random document generator which created documents based on some arrays of values and an excerpt (approx. 1KB) from a text file of the Bible (King James Version).
    These were submitted via a socket connection (kept open throughout the indexing process).
    The index writer was not closed between index calls.
    This created a 400MB index in 23 files (after optimization).
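A generator of this kind can be sketched with the standard library alone. This is a hypothetical reconstruction, not Hamish's actual code: the field names, value arrays, and excerpt size are assumptions based on the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the random document generator: each document gets
// values picked from fixed arrays plus a ~1KB excerpt of a larger source text.
public class DocGenerator {
    private static final String[] DOMAINS = {"sos", "abc", "xyz"}; // assumed values
    private final Random rnd = new Random();
    private final String text; // e.g. the full text of the source book

    public DocGenerator(String text) { this.text = text; }

    public Map<String, String> nextDoc(int id) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", Integer.toString(id));
        doc.put("Domain", DOMAINS[rnd.nextInt(DOMAINS.length)]);
        // Pick a random ~1KB excerpt from the source text.
        int start = rnd.nextInt(Math.max(1, text.length() - 1024));
        doc.put("Details", text.substring(start, Math.min(text.length(), start + 1024)));
        return doc;
    }
}
```

Each generated map would then be turned into a Lucene Document and sent over the socket to the indexing machine.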

    Query details:

    Set up a threaded class to start x number of simultaneous threads to search the above created index.

    Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y) +DisplayStartDate:[mkwsw2jk0 - mq3dj1uq0] +EndDate:[mq3dj1uq0 - ntlxuggw0]

    This query matched 34,000 documents; I limited the returned documents to 5.

    This uses Peter Halacsy's IndexSearcherCache, slightly modified into a singleton that returns cached searchers for a given directory. This solved an initial problem with too many open files exhausting Linux file handles.
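The idea behind that cache, one shared searcher per index directory instead of a new one (and new file handles) per query, can be sketched as follows. This is an illustrative singleton, not the actual IndexSearcherCache code; `Object` stands in for Lucene's IndexSearcher type.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a singleton searcher cache keyed by index directory: every query
// against the same directory reuses one open searcher, bounding the number
// of open file handles.
public class SearcherCache {
    private static final SearcherCache INSTANCE = new SearcherCache();
    // Object is a stand-in for an open IndexSearcher.
    private final Map<String, Object> searchers = new HashMap<>();

    private SearcherCache() {}

    public static SearcherCache getInstance() { return INSTANCE; }

    public synchronized Object getSearcher(String dir) {
        // Open (here: create) a searcher only on first request for a directory.
        return searchers.computeIfAbsent(dir, d -> new Object());
    }
}
```

A real version would also need a policy for reopening searchers when the index changes.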

    Threads | Avg time per query (ms)
    1       | 1009
    2       | 2043
    3       | 3087
    4       | 4045
    ...     | ...
    10      | 10091

    I removed the two date-range terms from the query and it made a HUGE difference in performance: with 4 threads the average time dropped to 900ms!
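The encoded date terms above (e.g. mkwsw2jk0) look like the base-36 millisecond timestamps used by Lucene's old DateField class; a range query over such a field had to enumerate every distinct indexed term between the two bounds, which would explain why the date ranges dominated query time. A minimal sketch of that style of encoding (an assumption about the format, not code from the post):

```java
// Sketch of a DateField-style term encoding: a millisecond timestamp rendered
// in base 36 and left-padded to a fixed width, so that lexicographic term
// order matches chronological order.
public class DateTerm {
    static final int WIDTH = 9; // fixed width so string order == numeric order

    static String encode(long millis) {
        String s = Long.toString(millis, 36);
        StringBuilder b = new StringBuilder();
        for (int i = s.length(); i < WIDTH; i++) b.append('0'); // left-pad with '0'
        return b.append(s).toString();
    }

    static long decode(String term) {
        return Long.parseLong(term, 36);
    }
}
```

Because every distinct timestamp produces a distinct term, a wide date range can expand into a huge term set; coarser date granularity (e.g. days instead of milliseconds) is the usual remedy.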

    Other query optimizations made little difference.

Hamish can be contacted at hamish at catalyst.net.nz.

    Hardware Environment

  • Dedicated machine for indexing: No, but nominal usage at time of indexing.
  • CPU: Compaq ProLiant 1850R/600, 2 × PIII 600MHz
  • RAM: 1GB, 256MB allocated to JVM.
  • Drive configuration: RAID 5 on Fibre Channel Array
    Software Environment

  • Java Version: 1.3.1_06
  • Java VM:
  • OS Version: Windows NT 4 / SP6
  • Location of index: local
    Lucene Indexing Variables

  • Number of source documents: about 60K
  • Total filesize of source documents: 6.5GB
  • Average filesize of source documents: 100K (6.5GB/60K documents)
  • Source documents storage location: filesystem on NTFS
  • File type of source documents:
  • Parser(s) used, if any: Currently the only parser used is the Quiotix html parser.
  • Analyzer(s) used: SimpleAnalyzer
  • Number of fields per document: 8
  • Type of fields: All strings, and all are stored and indexed.
  • Index persistence: FSDirectory
    Figures

  • Time taken (in ms/s as an average of at least 3 indexing runs): 1 hour 12 minutes, 1 hour 14 minutes, and 1 hour 17 minutes. Note that the number and size of documents change daily.
  • Time taken / 1000 docs indexed:
  • Memory consumption: JVM is given 256MB and uses it all.
    Notes

    We have 10 threads reading files from the filesystem, parsing and analyzing them, and then pushing them onto a queue, with a single thread popping them from the queue and indexing. Note that we are indexing email messages and are storing the entire plaintext of each message in the index. If a message contains an attachment for which we do not have a filter (e.g., we do not handle PDFs yet), we discard that data.
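The pipeline described above, many parser threads feeding one indexing thread through a queue, can be sketched with the modern standard library (java.util.concurrent did not exist on the 1.3.1 JVM actually used; this is a present-day sketch of the same pattern, with placeholder bodies standing in for parsing and IndexWriter.addDocument):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: N producer threads "parse" documents and push them onto a bounded
// queue; one consumer thread pops them and "indexes" them.
public class IndexPipeline {
    static final String POISON = "__END__"; // per-producer end-of-stream marker

    public static int run(int producers, int docsPerProducer) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger indexed = new AtomicInteger();

        Thread consumer = new Thread(() -> {
            int ended = 0;
            try {
                while (ended < producers) {
                    String doc = queue.take();
                    if (doc.equals(POISON)) { ended++; continue; }
                    indexed.incrementAndGet(); // stand-in for IndexWriter.addDocument
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        consumer.start();

        Thread[] workers = new Thread[producers];
        for (int i = 0; i < producers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (int d = 0; d < docsPerProducer; d++)
                        queue.put("doc"); // stand-in for read + parse + analyze
                    queue.put(POISON);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
            consumer.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return indexed.get();
    }
}
```

Keeping a single indexing thread avoids contending on the IndexWriter while still parallelizing the I/O- and parse-heavy work.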

Justin can be contacted at tvxh-lw4x at spamex.com.

My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, nor was the total index built in one shot. The index was created on several different machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to 1 million documents per batch. Each of these small indexes was then moved to a much larger drive, where they were all merged together into a big index. This process was done manually, over the course of several months, as the sources became available.

    Hardware Environment

  • Dedicated machine for indexing: no. The machine had moderate to low load. However, the indexing process was single-threaded, so it only took advantage of 1 of the processors; it usually got 100% of that processor.
  • CPU: Sun Ultra 80 4 x 64 bit processors
  • RAM: 4 GB Memory
  • Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
    Software Environment

  • Java Version: 1.3.1
  • Java VM:
  • OS Version: Solaris 8 (SunOS 5.8, 64-bit)
  • Location of index: local
    Lucene Indexing Variables

  • Number of source documents: 13,820,517
  • Total filesize of source documents: 87.3 GB
  • Average filesize of source documents: 6.3 KB
  • Source documents storage location: Filesystem
  • File type of source documents: XML
  • Parser(s) used, if any:
  • Analyzer(s) used: A home-grown analyzer that simply removes stopwords.
  • Number of fields per document: 1 - 31
  • Type of fields: All text, though 2 of them are dates (formatted like 20001205) that we filter on
  • Index persistence: FSDirectory
  • Index size: 12.5 GB
    Figures

  • Time taken (in ms/s as an average of at least 3 indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)
  • Time taken / 1000 docs indexed: 340 Seconds
  • Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so 1 GB of memory was allotted to the indexer
    Notes

    The source documents were XML. The "indexer" opened each document one at a time, ran an XSL transformation on it, and then indexed the resulting stream. The indexer optimized the index every 50,000 documents (on this run), though previously we optimized every 300,000 documents; the performance didn't change much either way. We did no other tuning (RAM directories, a separate process to pre-transform the source material, etc.) to make indexing faster. When all of these individual indexes were built, they were merged together into the main index. That process usually took about a day.
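In Lucene the final merge would likely have been done via IndexWriter.addIndexes; conceptually, merging segment term dictionaries is a k-way merge of sorted term lists. A minimal stdlib sketch of that core operation (an illustration, not the actual Lucene merge code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: merge k sorted term lists (one per sub-index) into a single sorted,
// de-duplicated list, using a min-heap over the heads of the lists.
public class TermMerge {
    public static List<String> merge(List<List<String>> sortedLists) {
        // Heap entries: {listIndex, positionInList}, ordered by the term they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> sortedLists.get(a[0]).get(a[1])
                        .compareTo(sortedLists.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedLists.size(); i++)
            if (!sortedLists.get(i).isEmpty()) heap.add(new int[]{i, 0});

        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            String term = sortedLists.get(top[0]).get(top[1]);
            if (out.isEmpty() || !out.get(out.size() - 1).equals(term))
                out.add(term); // de-duplicate: a real merge concatenates postings here
            if (top[1] + 1 < sortedLists.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }
}
```

Because each list is consumed sequentially, the merge streams through the inputs with memory proportional to k, not to the index size, which is why merging 12.5 GB of segments on one machine was feasible.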

Daniel can be contacted at Armbrust.Daniel at mayo.edu.