diff --git a/docs/benchmarks.html b/docs/benchmarks.html index c401a850fc9..b1cd6c5dadb 100644 --- a/docs/benchmarks.html +++ b/docs/benchmarks.html @@ -5,6 +5,7 @@ + @@ -121,20 +122,20 @@
- The purpose of these user-submitted performance figures is to -give current and potential users of Lucene a sense - of how well Lucene scales. If the requirements for an upcoming -project is similar to an existing benchmark, you - will also have something to work with when designing the system -architecture for the application. -
+ The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project are similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. +- If you've conducted performance tests with Lucene, we'd -appreciate if you can submit these figures for display - on this page. Post these figures to the lucene-user mailing list -using this - template. -
+ If you've conducted performance tests with Lucene, we'd + appreciate it if you could submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + template. +
-
-
- +- Hardware Environment
-- Dedicated machine for indexing: Self-explanatory -(yes/no)
-- CPU: Self-explanatory (Type, Speed and Quantity)
-- RAM: Self-explanatory
-- Drive configuration: Self-explanatory (IDE, SCSI, -RAID-1, RAID-5)
- -- Software environment
-- Java Version: Version of Java SDK/JRE that is run -
-- Java VM: Server/client VM, Sun VM/JRockIt
-- OS Version: Self-explanatory
-- Location of index: Is the index stored in filesystem -or database? Is it on the same server(local) or - over the network?
- -- Lucene indexing variables
-- Number of source documents: Number of documents being -indexed
-- Total filesize of source documents: -Self-explanatory
-- Average filesize of source documents: -Self-explanatory
-- Source documents storage location: Where are the -documents being indexed located? - Filesystem, DB, http,etc
-- File type of source documents: Types of files being -indexed, e.g. HTML files, XML files, PDF files, etc.
-- Parser(s) used, if any: Parsers used for parsing the -various files for indexing, - e.g. XML parser, HTML parser, etc.
-- Analyzer(s) used: Type of Lucene analyzer used
-- Number of fields per document: Number of Fields each -Document contains
-- Type of fields: Type of each field
-- Index persistence: Where the index is stored, e.g. -FSDirectory, SqlDirectory, etc
- -- Figures
-- Time taken (in ms/s as an average of at least 3 indexing -runs): Time taken to index all files
-- Time taken / 1000 docs indexed: Time taken to index -1000 files
-- Memory consumption: Self-explanatory
- -- Notes
-- Notes: Any comments which don't belong in the above, -special tuning/strategies, etc
- -+
++ Hardware Environment
+- Dedicated machine for indexing: Self-explanatory + (yes/no)
+- CPU: Self-explanatory (Type, Speed and Quantity)
+- RAM: Self-explanatory
+- Drive configuration: Self-explanatory (IDE, SCSI, + RAID-1, RAID-5)
+ ++ Software environment
+- Java Version: Version of Java SDK/JRE that is run +
+- Java VM: Server/client VM, Sun VM/JRockit
+- OS Version: Self-explanatory
+- Location of index: Is the index stored in the filesystem + or in a database? Is it on the same server (local) or + over the network?
+ ++ Lucene indexing variables
+- Number of source documents: Number of documents being + indexed
+- Total filesize of source documents: + Self-explanatory
+- Average filesize of source documents: + Self-explanatory
+- Source documents storage location: Where are the + documents being indexed located? + Filesystem, DB, HTTP, etc.
+- File type of source documents: Types of files being + indexed, e.g. HTML files, XML files, PDF files, etc.
+- Parser(s) used, if any: Parsers used for parsing the + various files for indexing, + e.g. XML parser, HTML parser, etc.
+- Analyzer(s) used: Type of Lucene analyzer used
+- Number of fields per document: Number of Fields each + Document contains
+- Type of fields: Type of each field
+- Index persistence: Where the index is stored, e.g. + FSDirectory, SqlDirectory, etc
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 indexing + runs): Time taken to index all files
+- Time taken / 1000 docs indexed: Time taken to index + 1000 files
+- Memory consumption: Self-explanatory
+ ++ Notes
+- Notes: Any comments which don't belong in the above, + special tuning/strategies, etc
+ +
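To make the template's indexing variables concrete, here is a rough, hypothetical sketch against the Lucene 1.x-era API (the index path and field names are invented for illustration); it shows where the analyzer, the per-document fields, the FSDirectory-based index persistence, and the per-1000-documents timing from the template come from:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TemplateIndexer {
    public static void main(String[] args) throws Exception {
        // "Analyzer(s) used" and "Index persistence" (an FSDirectory index at the given path)
        IndexWriter writer = new IndexWriter("/tmp/benchmark-index",
                                             new StandardAnalyzer(), true);
        long start = System.currentTimeMillis();
        // "Number of source documents" / "Number of fields per document" / "Type of fields"
        for (int i = 0; i < 1000; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", String.valueOf(i)));          // stored, not tokenized
            doc.add(Field.Text("contents", "body text of doc " + i)); // stored and tokenized
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        // "Time taken / 1000 docs indexed"
        System.out.println((System.currentTimeMillis() - start) + " ms for 1000 docs");
    }
}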
- These benchmarks have been kindly submitted by Lucene users for -reference purposes. -
-We make NO guarantees regarding their accuracy or -validity. -
-We strongly recommend you conduct your own - performance benchmarks before deciding on a particular -hardware/software setup (and hopefully submit - these figures to us). -
+ These benchmarks have been kindly submitted by Lucene users for + reference purposes. + +We make NO guarantees regarding their accuracy or + validity. +
+We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). +
+
@@ -241,109 +242,109 @@ hardware/software setup (and hopefully submit -
+- Hardware Environment
-- Dedicated machine for indexing: yes
-- CPU: Intel x86 P4 1.5Ghz
-- RAM: 512 DDR
-- Drive configuration: IDE 7200rpm Raid-1
- -- Software environment
-- Java Version: 1.3.1 IBM JITC Enabled
-- Java VM:
-- OS Version: Debian Linux 2.4.18-686
-- Location of index: local
- -- Lucene indexing variables
-- Number of source documents: Random generator. Set -to make 1M documents -in 2x500,000 batches.
-- Total filesize of source documents: > 1GB if -stored
-- Average filesize of source documents: 1KB
-- Source documents storage location: Filesystem
-- File type of source documents: Generated
-- Parser(s) used, if any:
-- Analyzer(s) used: Default
-- Number of fields per document: 11
-- Type of fields: 1 date, 1 id, 9 text
-- Index persistence: FSDirectory
- -- Figures
-- Time taken (in ms/s as an average of at least 3 -indexing runs):
-- Time taken / 1000 docs indexed: 49 seconds
-- Memory consumption:
- -- Notes
-- Notes: -
- -- A windows client ran a random document generator which -created - documents based on some arrays of values and an excerpt -(approx 1kb) - from a text file of the bible (King James version).
-
- These were submitted via a socket connection (open throughout - indexing process).
- The index writer was not closed between index calls.
- This created a 400Mb index in 23 files (after -optimization).
-- Query details:
-
-- Set up a threaded class to start x number of simultaneous -threads to - search the above created index. -
-- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) -(Teaser:goo* Tea - ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) - +DisplayStartDate:[mkwsw2jk0 - -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] -
-- This query counted 34000 documents and I limited the returned -documents - to 5. -
-- This is using Peter Halacsy's IndexSearcherCache slightly -modified to - be a singleton returned cached searchers for a given -directory. This - solved an initial problem with too many files open and -running out of - linux handles for them. -
-- Threads|Avg Time per query (ms) - 1 1009ms - 2 2043ms - 3 3087ms - 4 4045ms - .. . - .. . - 10 10091ms --- I removed the two date range terms from the query and it made -a HUGE - difference in performance. With 4 threads the avg time -dropped to 900ms! -
-Other query optimizations made little difference.
+ Hardware Environment
+Dedicated machine for indexing: yes +CPU: Intel x86 P4 1.5Ghz +RAM: 512 DDR +Drive configuration: IDE 7200rpm Raid-1 + ++ Software environment
+Java Version: 1.3.1 IBM JITC Enabled +Java VM: +OS Version: Debian Linux 2.4.18-686 +Location of index: local + ++ Lucene indexing variables
+Number of source documents: Random generator. Set + to make 1M documents + in 2x500,000 batches. +Total filesize of source documents: > 1GB if + stored +Average filesize of source documents: 1KB +Source documents storage location: Filesystem +File type of source documents: Generated +Parser(s) used, if any: +Analyzer(s) used: Default +Number of fields per document: 11 +Type of fields: 1 date, 1 id, 9 text +Index persistence: FSDirectory + ++ Figures
+Time taken (in ms/s as an average of at least 3 + indexing runs): +Time taken / 1000 docs indexed: 49 seconds +Memory consumption: + ++ Notes
+Notes: + + ++ A Windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1KB) + from a text file of the Bible (King James Version).
+
+ These were submitted via a socket connection (open throughout + the indexing process).
+ The index writer was not closed between index calls.
+ This created a 400MB index in 23 files (after + optimization).
++ Query details:
+
++ Set up a threaded class to start x number of simultaneous + threads to + search the index created above. +
++ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) + (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y) + +DisplayStartDate:[mkwsw2jk0 + -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] +
++ This query counted 34000 documents and I limited the returned + documents + to 5. +
++ This is using Peter Halacsy's IndexSearcherCache, slightly + modified to + be a singleton that returns cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + Linux file handles for them. +
++ Threads|Avg Time per query (ms) + 1 1009ms + 2 2043ms + 3 3087ms + 4 4045ms + .. . + .. . + 10 10091ms +++ I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! +
+Other query optimizations made little difference.
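For readers who want to set up a similar query test, below is a minimal, hypothetical sketch of sharing one IndexSearcher per index directory so that many query threads do not each open their own set of index files. This is not Peter Halacsy's actual IndexSearcherCache; the index path, the simplified query, and the field names are assumptions, and the calls are from the Lucene 1.x-era API.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SharedSearcher {
    private static final Map searchers = new HashMap();

    // One searcher per index directory; a searcher can serve concurrent query threads,
    // so reusing it avoids opening a new set of index files for every thread.
    public static synchronized IndexSearcher get(String dir) throws Exception {
        IndexSearcher s = (IndexSearcher) searchers.get(dir);
        if (s == null) {
            s = new IndexSearcher(dir);
            searchers.put(dir, s);
        }
        return s;
    }

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = get("/tmp/benchmark-index");
        // A simplified version of the benchmark query above (no date range terms).
        Query q = QueryParser.parse("+Domain:sos +Name:goo*", "Details",
                                    new StandardAnalyzer());
        Hits hits = searcher.search(q);
        // Only look at the first 5 hits, as in the benchmark above.
        int n = Math.min(5, hits.length());
        for (int i = 0; i < n; i++) {
            System.out.println(hits.doc(i).get("Name"));
        }
    }
}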
- Hamish can be contacted at hamish at catalyst.net.nz. -
+ Hamish can be contacted at hamish at catalyst.net.nz. +@@ -357,71 +358,146 @@ dropped to 900ms! + +-
+- Hardware Environment
-- Dedicated machine for indexing: No, but nominal -usage at time of indexing.
-- CPU: Compaq Proliant 1850R/600 2 X pIII 600
-- RAM: 1GB, 256MB allocated to JVM.
-- Drive configuration: RAID 5 on Fibre Channel -Array
- -- Software environment
-- Java Version: 1.3.1_06
-- Java VM:
-- OS Version: Winnt 4/Sp6
-- Location of index: local
- -- Lucene indexing variables
-- Number of source documents: about 60K
-- Total filesize of source documents: 6.5GB
-- Average filesize of source documents: 100K -(6.5GB/60K documents)
-- Source documents storage location: filesystem on -NTFS
-- File type of source documents:
-- Parser(s) used, if any: Currently the only parser -used is the Quiotix html - parser.
-- Analyzer(s) used: SimpleAnalyzer
-- Number of fields per document: 8
-- Type of fields: All strings, and all are stored -and indexed.
-- Index persistence: FSDirectory
- -- Figures
-- Time taken (in ms/s as an average of at least 3 -indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 -minutes. Note that the # - and size of documents changes daily.
-- Time taken / 1000 docs indexed:
-- Memory consumption: JVM is given 256MB and uses it -all.
- -- Notes
-- Notes: -
- -- We have 10 threads reading files from the filesystem and -parsing and - analyzing them and the pushing them onto a queue and a single -thread poping - them from the queue and indexing. Note that we are indexing -email messages - and are storing the entire plaintext in of the message in the -index. If the - message contains attachment and we do not have a filter for -the attachment - (ie. we do not do PDFs yet), we discard the data. -
+ Hardware Environment
+Dedicated machine for indexing: No, but nominal + usage at time of indexing. +CPU: Compaq Proliant 1850R/600 2 X pIII 600 +RAM: 1GB, 256MB allocated to JVM. +Drive configuration: RAID 5 on Fibre Channel + Array + ++ Software environment
+Java Version: 1.3.1_06 +Java VM: +OS Version: Winnt 4/Sp6 +Location of index: local + ++ Lucene indexing variables
+Number of source documents: about 60K +Total filesize of source documents: 6.5GB +Average filesize of source documents: 100K + (6.5GB/60K documents) +Source documents storage location: filesystem on + NTFS +File type of source documents: +Parser(s) used, if any: Currently the only parser + used is the Quiotix html + parser. +Analyzer(s) used: SimpleAnalyzer +Number of fields per document: 8 +Type of fields: All strings, and all are stored + and indexed. +Index persistence: FSDirectory + ++ Figures
+Time taken (in ms/s as an average of at least 3 + indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. Note that the # + and size of documents changes daily. +Time taken / 1000 docs indexed: +Memory consumption: JVM is given 256MB and uses it + all. + ++ Notes
+Notes: + + ++ We have 10 threads reading files from the filesystem and + parsing and + analyzing them and then pushing them onto a queue, and a single + thread popping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext of the message in the + index. If the + message contains an attachment and we do not have a filter for + the attachment + (i.e. we do not do PDFs yet), we discard the data. +
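As a rough, hypothetical sketch of the producer/consumer arrangement described above (the paths and field names are invented, the parsing step is simulated, and the BlockingQueue is a modern convenience; the original Java 1.3-era setup would have used a synchronized collection instead):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class QueueIndexer {
    public static void main(String[] args) throws Exception {
        final BlockingQueue queue = new ArrayBlockingQueue(1000);
        final IndexWriter writer = new IndexWriter("/tmp/mail-index",
                                                   new SimpleAnalyzer(), true);

        // Producers: parse/analyze source files and queue the resulting documents.
        for (int t = 0; t < 10; t++) {
            final int id = t;
            new Thread() {
                public void run() {
                    try {
                        for (int i = 0; i < 100; i++) {
                            Document doc = new Document();
                            doc.add(Field.Text("body", "parsed message " + id + "-" + i));
                            queue.put(doc);
                        }
                    } catch (InterruptedException ignored) {}
                }
            }.start();
        }

        // Single consumer: the only thread that touches the IndexWriter.
        for (int i = 0; i < 10 * 100; i++) {
            writer.addDocument((Document) queue.take());
        }
        writer.optimize();
        writer.close();
    }
}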
- Justin can be contacted at tvxh-lw4x at spamex.com. -
+ Justin can be contacted at tvxh-lw4x at spamex.com. + ++ +
+ + + Daniel Armbrust's benchmarks + + + ++ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, + nor was the total index built in one shot. The index was created on several different + machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to + 1 million documents per batch. Each of these small indexes was then moved to a + much larger drive, where they were all merged together into a big index. + This process was done manually, over the course of several months, as the sources became available. +
++
++ Hardware Environment
+- Dedicated machine for indexing: no - The machine had moderate to low load. However, the indexing process was single-threaded, so it only took advantage of one of the processors. It usually got 100% of this processor.
+- CPU: Sun Ultra 80 4 x 64 bit processors
+- RAM: 4 GB Memory
+- Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
+ ++ Software environment
+- Java Version: 1.3.1
+- Java VM:
+- OS Version: Sun 5.8 (64 bit)
+- Location of index: local
+ ++ Lucene indexing variables
+- Number of source documents: 13,820,517
+- Total filesize of source documents: 87.3 GB
+- Average filesize of source documents: 6.3 KB
+- Source documents storage location: Filesystem
+- File type of source documents: XML
+- Parser(s) used, if any:
+- Analyzer(s) used: A home-grown analyzer that simply removes stopwords.
+- Number of fields per document: 1 - 31
+- Type of fields: All text, though 2 of them are dates (20001205) that we filter on
+- Index persistence: FSDirectory
+- Index size: 12.5 GB
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 + indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)
+- Time taken / 1000 docs indexed: 340 Seconds
+- Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so + 1 GB of memory was allotted to the indexer
+ ++ Notes
+- Notes: +
+ ++ The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on it, and then proceeded to index the resulting stream. The indexer optimized + the index every 50,000 documents (on this run), though previously we optimized every + 300,000 documents. The performance didn't change much either way. We did no other + tuning (RAM directories, a separate process to pretransform the source material, etc.) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took about a day. +
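A hypothetical sketch of that final merge step (the paths are invented and this is not Daniel's actual code), using the Lucene 1.x-era IndexWriter.addIndexes call to fold the per-batch indexes into the main index:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeBatches {
    public static void main(String[] args) throws Exception {
        // The big target index on the large drive.
        IndexWriter writer = new IndexWriter("/bigdrive/main-index",
                                             new StandardAnalyzer(), true);
        // One small index per 500,000 - 1,000,000 document batch.
        Directory[] batches = new Directory[] {
            FSDirectory.getDirectory("/bigdrive/batch1", false),
            FSDirectory.getDirectory("/bigdrive/batch2", false),
            FSDirectory.getDirectory("/bigdrive/batch3", false)
        };
        // Merges the batch indexes into the target index; this is the slow, day-scale step.
        writer.addIndexes(batches);
        writer.close();
    }
}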
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu. +
diff --git a/docs/contributions.html b/docs/contributions.html index 37e4930d1d2..b0cd5b7e656 100644 --- a/docs/contributions.html +++ b/docs/contributions.html @@ -5,6 +5,7 @@ + diff --git a/docs/demo.html b/docs/demo.html index f29ffce9fa5..a83b4dd202c 100644 --- a/docs/demo.html +++ b/docs/demo.html @@ -5,6 +5,7 @@ + diff --git a/docs/demo2.html b/docs/demo2.html index cf2f5769e39..b17c4537fbb 100644 --- a/docs/demo2.html +++ b/docs/demo2.html @@ -5,6 +5,7 @@ + diff --git a/docs/demo3.html b/docs/demo3.html index 1e032520e34..b3fdb092708 100644 --- a/docs/demo3.html +++ b/docs/demo3.html @@ -5,6 +5,7 @@ + diff --git a/docs/demo4.html b/docs/demo4.html index 454966bd36d..ad8e3723e2f 100644 --- a/docs/demo4.html +++ b/docs/demo4.html @@ -5,6 +5,7 @@ + diff --git a/docs/fileformats.html b/docs/fileformats.html index c6b28549c59..b930a80366c 100644 --- a/docs/fileformats.html +++ b/docs/fileformats.html @@ -5,6 +5,7 @@ + diff --git a/docs/gettingstarted.html b/docs/gettingstarted.html index 3f8e208f0f5..a7df33dec5f 100644 --- a/docs/gettingstarted.html +++ b/docs/gettingstarted.html @@ -5,6 +5,7 @@ + diff --git a/docs/index.html b/docs/index.html index 4e6c9b10957..6d40b5c74f8 100644 --- a/docs/index.html +++ b/docs/index.html @@ -5,6 +5,7 @@ + diff --git a/docs/lucene-sandbox/index.html b/docs/lucene-sandbox/index.html index f3d0cde7e2d..278220a847d 100644 --- a/docs/lucene-sandbox/index.html +++ b/docs/lucene-sandbox/index.html @@ -5,6 +5,7 @@ + diff --git a/docs/lucene-sandbox/indyo/tutorial.html b/docs/lucene-sandbox/indyo/tutorial.html index d76440220a5..667ba1dcd6a 100644 --- a/docs/lucene-sandbox/indyo/tutorial.html +++ b/docs/lucene-sandbox/indyo/tutorial.html @@ -5,6 +5,7 @@ + diff --git a/docs/lucene-sandbox/larm/overview.html b/docs/lucene-sandbox/larm/overview.html index a10dbaa528b..c98a75d55aa 100644 --- a/docs/lucene-sandbox/larm/overview.html +++ b/docs/lucene-sandbox/larm/overview.html @@ -5,6 +5,7 @@ + diff --git a/docs/luceneplan.html b/docs/luceneplan.html index 2abef4c51a4..5490662d3fc 100644 --- a/docs/luceneplan.html +++ b/docs/luceneplan.html @@ -5,6 +5,7 @@ + diff --git a/docs/powered.html b/docs/powered.html index d9617af6d63..809cb87354e 100644 --- a/docs/powered.html +++ b/docs/powered.html @@ -5,6 +5,7 @@ + diff --git a/docs/queryparsersyntax.html b/docs/queryparsersyntax.html index 35d1473a2ab..8d8bf8b7345 100644 --- a/docs/queryparsersyntax.html +++ b/docs/queryparsersyntax.html @@ -5,6 +5,7 @@ + diff --git a/docs/resources.html b/docs/resources.html index c7e6f456d26..65286569ff7 100644 --- a/docs/resources.html +++ b/docs/resources.html @@ -5,6 +5,7 @@ + diff --git a/docs/todo.html b/docs/todo.html index a77a8824674..886946f4018 100644 --- a/docs/todo.html +++ b/docs/todo.html @@ -5,6 +5,7 @@ + diff --git a/docs/whoweare.html b/docs/whoweare.html index f5d0d13d2d9..1e74dbce364 100644 --- a/docs/whoweare.html +++ b/docs/whoweare.html @@ -5,6 +5,7 @@ + diff --git a/xdocs/benchmarks.xml b/xdocs/benchmarks.xml index eec7fc74167..863f96904ec 100644 --- a/xdocs/benchmarks.xml +++ b/xdocs/benchmarks.xml @@ -1,283 +1,349 @@ - - -Kelvin Tan -Resources - Performance Benchmarks +Kelvin Tan +Resources - Performance Benchmarks - - -- The purpose of these user-submitted performance figures is to -give current and potential users of Lucene a sense - of how well Lucene scales. If the requirements for an upcoming -project is similar to an existing benchmark, you - will also have something to work with when designing the system -architecture for the application. -
-- If you've conducted performance tests with Lucene, we'd -appreciate if you can submit these figures for display - on this page. Post these figures to the lucene-user mailing list -using this - template. -
-- +-
-
- -- Hardware Environment
-- Dedicated machine for indexing: Self-explanatory -(yes/no)
-- CPU: Self-explanatory (Type, Speed and Quantity)
-- RAM: Self-explanatory
-- Drive configuration: Self-explanatory (IDE, SCSI, -RAID-1, RAID-5)
- -- Software environment
-- Java Version: Version of Java SDK/JRE that is run -
-- Java VM: Server/client VM, Sun VM/JRockIt
-- OS Version: Self-explanatory
-- Location of index: Is the index stored in filesystem -or database? Is it on the same server(local) or - over the network?
- -- Lucene indexing variables
-- Number of source documents: Number of documents being -indexed
-- Total filesize of source documents: -Self-explanatory
-- Average filesize of source documents: -Self-explanatory
-- Source documents storage location: Where are the -documents being indexed located? - Filesystem, DB, http,etc
-- File type of source documents: Types of files being -indexed, e.g. HTML files, XML files, PDF files, etc.
-- Parser(s) used, if any: Parsers used for parsing the -various files for indexing, - e.g. XML parser, HTML parser, etc.
-- Analyzer(s) used: Type of Lucene analyzer used
-- Number of fields per document: Number of Fields each -Document contains
-- Type of fields: Type of each field
-- Index persistence: Where the index is stored, e.g. -FSDirectory, SqlDirectory, etc
- -- Figures
-- Time taken (in ms/s as an average of at least 3 indexing -runs): Time taken to index all files
-- Time taken / 1000 docs indexed: Time taken to index -1000 files
-- Memory consumption: Self-explanatory
- -- Notes
-- Notes: Any comments which don't belong in the above, -special tuning/strategies, etc
- -+ -+ The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project is similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. +
++ If you've conducted performance tests with Lucene, we'd + appreciate if you can submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + template. +
+- - These benchmarks have been kindly submitted by Lucene users for -reference purposes. -
-We make NO guarantees regarding their accuracy or -validity. -
-We strongly recommend you conduct your own - performance benchmarks before deciding on a particular -hardware/software setup (and hopefully submit - these figures to us). -
- -- +-
-- Hardware Environment
-- Dedicated machine for indexing: yes
-- CPU: Intel x86 P4 1.5Ghz
-- RAM: 512 DDR
-- Drive configuration: IDE 7200rpm Raid-1
- -- Software environment
-- Java Version: 1.3.1 IBM JITC Enabled
-- Java VM:
-- OS Version: Debian Linux 2.4.18-686
-- Location of index: local
- -- Lucene indexing variables
-- Number of source documents: Random generator. Set -to make 1M documents -in 2x500,000 batches.
-- Total filesize of source documents: > 1GB if -stored
-- Average filesize of source documents: 1KB
-- Source documents storage location: Filesystem
-- File type of source documents: Generated
-- Parser(s) used, if any:
-- Analyzer(s) used: Default
-- Number of fields per document: 11
-- Type of fields: 1 date, 1 id, 9 text
-- Index persistence: FSDirectory
- -- Figures
-- Time taken (in ms/s as an average of at least 3 -indexing runs):
-- Time taken / 1000 docs indexed: 49 seconds
-- Memory consumption:
- -- Notes
-- Notes: -
- -- A windows client ran a random document generator which -created - documents based on some arrays of values and an excerpt -(approx 1kb) - from a text file of the bible (King James version).
-
- These were submitted via a socket connection (open throughout - indexing process).
- The index writer was not closed between index calls.
- This created a 400Mb index in 23 files (after -optimization).
-- Query details:
-
-- Set up a threaded class to start x number of simultaneous -threads to - search the above created index. -
-- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) -(Teaser:goo* Tea - ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) - +DisplayStartDate:[mkwsw2jk0 - -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] -
-- This query counted 34000 documents and I limited the returned -documents - to 5. -
-- This is using Peter Halacsy's IndexSearcherCache slightly -modified to - be a singleton returned cached searchers for a given -directory. This - solved an initial problem with too many files open and -running out of - linux handles for them. -
-- Threads|Avg Time per query (ms) - 1 1009ms - 2 2043ms - 3 3087ms - 4 4045ms - .. . - .. . - 10 10091ms --- I removed the two date range terms from the query and it made -a HUGE - difference in performance. With 4 threads the avg time -dropped to 900ms! -
-Other query optimizations made little difference.
- Hamish can be contacted at hamish at catalyst.net.nz. -
-+ -+
+
+ ++ Hardware Environment
+- Dedicated machine for indexing: Self-explanatory + (yes/no)
+- CPU: Self-explanatory (Type, Speed and Quantity)
+- RAM: Self-explanatory
+- Drive configuration: Self-explanatory (IDE, SCSI, + RAID-1, RAID-5)
+ ++ Software environment
+- Java Version: Version of Java SDK/JRE that is run +
+- Java VM: Server/client VM, Sun VM/JRockIt
+- OS Version: Self-explanatory
+- Location of index: Is the index stored in filesystem + or database? Is it on the same server(local) or + over the network?
+ ++ Lucene indexing variables
+- Number of source documents: Number of documents being + indexed
+- Total filesize of source documents: + Self-explanatory
+- Average filesize of source documents: + Self-explanatory
+- Source documents storage location: Where are the + documents being indexed located? + Filesystem, DB, http,etc
+- File type of source documents: Types of files being + indexed, e.g. HTML files, XML files, PDF files, etc.
+- Parser(s) used, if any: Parsers used for parsing the + various files for indexing, + e.g. XML parser, HTML parser, etc.
+- Analyzer(s) used: Type of Lucene analyzer used
+- Number of fields per document: Number of Fields each + Document contains
+- Type of fields: Type of each field
+- Index persistence: Where the index is stored, e.g. + FSDirectory, SqlDirectory, etc
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 indexing + runs): Time taken to index all files
+- Time taken / 1000 docs indexed: Time taken to index + 1000 files
+- Memory consumption: Self-explanatory
+ ++ Notes
+- Notes: Any comments which don't belong in the above, + special tuning/strategies, etc
+ +- +-
-- Hardware Environment
-- Dedicated machine for indexing: No, but nominal -usage at time of indexing.
-- CPU: Compaq Proliant 1850R/600 2 X pIII 600
-- RAM: 1GB, 256MB allocated to JVM.
-- Drive configuration: RAID 5 on Fibre Channel -Array
- -- Software environment
-- Java Version: 1.3.1_06
-- Java VM:
-- OS Version: Winnt 4/Sp6
-- Location of index: local
- -- Lucene indexing variables
-- Number of source documents: about 60K
-- Total filesize of source documents: 6.5GB
-- Average filesize of source documents: 100K -(6.5GB/60K documents)
-- Source documents storage location: filesystem on -NTFS
-- File type of source documents:
-- Parser(s) used, if any: Currently the only parser -used is the Quiotix html - parser.
-- Analyzer(s) used: SimpleAnalyzer
-- Number of fields per document: 8
-- Type of fields: All strings, and all are stored -and indexed.
-- Index persistence: FSDirectory
- -- Figures
-- Time taken (in ms/s as an average of at least 3 -indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 -minutes. Note that the # - and size of documents changes daily.
-- Time taken / 1000 docs indexed:
-- Memory consumption: JVM is given 256MB and uses it -all.
- -- Notes
-- Notes: -
- -- We have 10 threads reading files from the filesystem and -parsing and - analyzing them and the pushing them onto a queue and a single -thread poping - them from the queue and indexing. Note that we are indexing -email messages - and are storing the entire plaintext in of the message in the -index. If the - message contains attachment and we do not have a filter for -the attachment - (ie. we do not do PDFs yet), we discard the data. -
- Justin can be contacted at tvxh-lw4x at spamex.com. -
-+ ++ These benchmarks have been kindly submitted by Lucene users for + reference purposes. +
+We make NO guarantees regarding their accuracy or + validity. +
+We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). +
-+ + ++
++ Hardware Environment
+- Dedicated machine for indexing: yes
+- CPU: Intel x86 P4 1.5Ghz
+- RAM: 512 DDR
+- Drive configuration: IDE 7200rpm Raid-1
+ ++ Software environment
+- Java Version: 1.3.1 IBM JITC Enabled
+- Java VM:
+- OS Version: Debian Linux 2.4.18-686
+- Location of index: local
+ ++ Lucene indexing variables
+- Number of source documents: Random generator. Set + to make 1M documents + in 2x500,000 batches.
+- Total filesize of source documents: > 1GB if + stored
+- Average filesize of source documents: 1KB
+- Source documents storage location: Filesystem
+- File type of source documents: Generated
+- Parser(s) used, if any:
+- Analyzer(s) used: Default
+- Number of fields per document: 11
+- Type of fields: 1 date, 1 id, 9 text
+- Index persistence: FSDirectory
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 + indexing runs):
+- Time taken / 1000 docs indexed: 49 seconds
+- Memory consumption:
+ ++ Notes
+- Notes: +
+ ++ A windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1kb) + from a text file of the bible (King James version).
+ ++ A Windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1KB) + from a text file of the Bible (King James Version).
+ These were submitted via a socket connection (open throughout + indexing process).
+ The index writer was not closed between index calls.
+ This created a 400Mb index in 23 files (after + optimization).
++ Query details:
+
++ Set up a threaded class to start x number of simultaneous + threads to + search the above created index. +
++ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) + (Teaser:goo* Tea + ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) + +DisplayStartDate:[mkwsw2jk0 + -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] +
++ This query counted 34000 documents and I limited the returned + documents + to 5. +
++ This is using Peter Halacsy's IndexSearcherCache slightly + modified to + be a singleton returned cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + linux handles for them. +
++ Threads|Avg Time per query (ms) + 1 1009ms + 2 2043ms + 3 3087ms + 4 4045ms + .. . + .. . + 10 10091ms +++ I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! +
+Other query optimizations made little difference.
+ Hamish can be contacted at hamish at catalyst.net.nz. +
++ + + ++
++ Hardware Environment
+- Dedicated machine for indexing: No, but nominal + usage at time of indexing.
+- CPU: Compaq Proliant 1850R/600 2 X pIII 600
+- RAM: 1GB, 256MB allocated to JVM.
+- Drive configuration: RAID 5 on Fibre Channel + Array
+ ++ Software environment
+- Java Version: 1.3.1_06
+- Java VM:
+- OS Version: Winnt 4/Sp6
+- Location of index: local
+ ++ Lucene indexing variables
+- Number of source documents: about 60K
+- Total filesize of source documents: 6.5GB
+- Average filesize of source documents: 100K + (6.5GB/60K documents)
+- Source documents storage location: filesystem on + NTFS
+- File type of source documents:
+- Parser(s) used, if any: Currently the only parser + used is the Quiotix html + parser.
+- Analyzer(s) used: SimpleAnalyzer
+- Number of fields per document: 8
+- Type of fields: All strings, and all are stored + and indexed.
+- Index persistence: FSDirectory
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 + indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. Note that the # + and size of documents changes daily.
+- Time taken / 1000 docs indexed:
+- Memory consumption: JVM is given 256MB and uses it + all.
+ ++ Notes
+- Notes: +
+ ++ We have 10 threads reading files from the filesystem and + parsing and + analyzing them and the pushing them onto a queue and a single + thread poping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext in of the message in the + index. If the + message contains attachment and we do not have a filter for + the attachment + (ie. we do not do PDFs yet), we discard the data. +
+ ++ We have 10 threads reading files from the filesystem and + parsing and + analyzing them and then pushing them onto a queue, and a single + thread popping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext of the message in the + index. If the + message contains an attachment and we do not have a filter for + the attachment + (i.e. we do not do PDFs yet), we discard the data. +
++ + ++ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, + nor was the total index built in one shot. The index was created on several different + machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to + 1 million documents per batch. Each of these small indexes was then moved to a + much larger drive, where they were all merged together into a big index. + This process was done manually, over the course of several months, as the sources became available. +
++
++ Hardware Environment
+- Dedicated machine for indexing: no - The machine had moderate to low load. However, the indexing process was built single + threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.
+- CPU: Sun Ultra 80 4 x 64 bit processors
+- RAM: 4 GB Memory
+- Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
+ ++ Software environment
+- Java Version: 1.3.1
+- Java VM:
+- OS Version: Sun 5.8 (64 bit)
+- Location of index: local
+ ++ Lucene indexing variables
+- Number of source documents: 13,820,517
+- Total filesize of source documents: 87.3 GB
+- Average filesize of source documents: 6.3 KB
+- Source documents storage location: Filesystem
+- File type of source documents: XML
+- Parser(s) used, if any:
+- Analyzer(s) used: A home grown analyzer that simply removes stopwords.
+- Number of fields per document: 1 - 31
+- Type of fields: All text, though 2 of them are dates (20001205) that we filter on
+- Index persistence: FSDirectory
+- Index size: 12.5 GB
+ ++ Figures
+- Time taken (in ms/s as an average of at least 3 + indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)
+- Time taken / 1000 docs indexed: 340 Seconds
+- Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so + 1 GB of memory was allotted to the indexer
+ ++ Notes
+- Notes: +
+ ++ The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on them, and then proceeded to index the stream. The indexer optimized + the index every 50,000 documents (on this run) though previously, we optimized every + 300,000 documents. The performance didn't change much either way. We did no other + tuning (RAM Directories, separate process to pretransform the source material, etc) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took ~ a day. +
+ ++ The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on it, and then proceeded to index the resulting stream. The indexer optimized + the index every 50,000 documents (on this run), though previously we optimized every + 300,000 documents. The performance didn't change much either way. We did no other + tuning (RAM directories, a separate process to pretransform the source material, etc.) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took about a day. +
+