diff --git a/docs/benchmarks.html b/docs/benchmarks.html
index c401a850fc9..b1cd6c5dadb 100644
--- a/docs/benchmarks.html
+++ b/docs/benchmarks.html
@@ -5,6 +5,7 @@
+
@@ -121,20 +122,20 @@

- The purpose of these user-submitted performance figures is to -give current and potential users of Lucene a sense - of how well Lucene scales. If the requirements for an upcoming -project is similar to an existing benchmark, you - will also have something to work with when designing the system -architecture for the application. -

+ The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project is similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. +

- If you've conducted performance tests with Lucene, we'd -appreciate if you can submit these figures for display - on this page. Post these figures to the lucene-user mailing list -using this - template. -

+ If you've conducted performance tests with Lucene, we'd + appreciate if you can submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + template. +

@@ -149,64 +150,64 @@ using this

-

-

+ +

@@ -221,17 +222,17 @@ special tuning/strategies, etc

- These benchmarks have been kindly submitted by Lucene users for -reference purposes. -

-

We make NO guarantees regarding their accuracy or -validity. -

-

We strongly recommend you conduct your own - performance benchmarks before deciding on a particular -hardware/software setup (and hopefully submit - these figures to us). -

+ These benchmarks have been kindly submitted by Lucene users for + reference purposes. +

+

We make NO guarantees regarding their accuracy or + validity. +

+

We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). +

@@ -357,71 +358,146 @@ dropped to 900ms! + +
@@ -241,109 +242,109 @@ hardware/software setup (and hopefully submit
    -

    - Hardware Environment
    -

  • Dedicated machine for indexing: yes
  • -
  • CPU: Intel x86 P4 1.5Ghz
  • -
  • RAM: 512 DDR
  • -
  • Drive configuration: IDE 7200rpm Raid-1
  • -

    -

    - Software environment
    -

  • Java Version: 1.3.1 IBM JITC Enabled
  • -
  • Java VM:
  • -
  • OS Version: Debian Linux 2.4.18-686
  • -
  • Location of index: local
  • -

    -

    - Lucene indexing variables
    -

  • Number of source documents: Random generator. Set -to make 1M documents -in 2x500,000 batches.
  • -
  • Total filesize of source documents: > 1GB if -stored
  • -
  • Average filesize of source documents: 1KB
  • -
  • Source documents storage location: Filesystem
  • -
  • File type of source documents: Generated
  • -
  • Parser(s) used, if any:
  • -
  • Analyzer(s) used: Default
  • -
  • Number of fields per document: 11
  • -
  • Type of fields: 1 date, 1 id, 9 text
  • -
  • Index persistence: FSDirectory
  • -

    -

    - Figures
    -

  • Time taken (in ms/s as an average of at least 3 -indexing runs):
  • -
  • Time taken / 1000 docs indexed: 49 seconds
  • -
  • Memory consumption:
  • -

    -

    - Notes
    -

  • Notes: -

    - A windows client ran a random document generator which -created - documents based on some arrays of values and an excerpt -(approx 1kb) - from a text file of the bible (King James version).
    - These were submitted via a socket connection (open throughout - indexing process).
    - The index writer was not closed between index calls.
    - This created a 400Mb index in 23 files (after -optimization).
    -

    -

    - Query details:
    -

    -

    - Set up a threaded class to start x number of simultaneous -threads to - search the above created index. -

    -

    - Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) -(Teaser:goo* Tea - ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) - +DisplayStartDate:[mkwsw2jk0 - -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] -

    -

    - This query counted 34000 documents and I limited the returned -documents - to 5. -

    -

    - This is using Peter Halacsy's IndexSearcherCache slightly -modified to - be a singleton returned cached searchers for a given -directory. This - solved an initial problem with too many files open and -running out of - linux handles for them. -

    -
    -          Threads|Avg Time per query (ms)
    -          1       1009ms
    -          2       2043ms
    -          3       3087ms
    -          4       4045ms
    -          ..        .
    -          ..        .
    -          10      10091ms
    -          
    -

    - I removed the two date range terms from the query and it made -a HUGE - difference in performance. With 4 threads the avg time -dropped to 900ms! -

    -

    Other query optimizations made little difference.

  • -

    -
+

+ Hardware Environment
+

  • Dedicated machine for indexing: yes
  • +
  • CPU: Intel x86 P4 1.5Ghz
  • +
  • RAM: 512 DDR
  • +
  • Drive configuration: IDE 7200rpm Raid-1
  • +

    +

    + Software environment
    +

  • Java Version: 1.3.1 IBM JITC Enabled
  • +
  • Java VM:
  • +
  • OS Version: Debian Linux 2.4.18-686
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: Random generator. Set + to make 1M documents + in 2x500,000 batches.
  • +
  • Total filesize of source documents: > 1GB if + stored
  • +
  • Average filesize of source documents: 1KB
  • +
  • Source documents storage location: Filesystem
  • +
  • File type of source documents: Generated
  • +
  • Parser(s) used, if any:
  • +
  • Analyzer(s) used: Default
  • +
  • Number of fields per document: 11
  • +
  • Type of fields: 1 date, 1 id, 9 text
  • +
  • Index persistence: FSDirectory
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs):
  • +
  • Time taken / 1000 docs indexed: 49 seconds
  • +
  • Memory consumption:
  • +

    +

    + Notes
    +

  • Notes: +

    + A windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1kb) + from a text file of the bible (King James version).
    + These were submitted via a socket connection (open throughout + indexing process).
    + The index writer was not closed between index calls.
    + This created a 400Mb index in 23 files (after + optimization).
    +

    +

    + Query details:
    +

    +

    + Set up a threaded class to start x number of simultaneous + threads to + search the above created index. +

    +

    + Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) + (Teaser:goo* Tea + ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) + +DisplayStartDate:[mkwsw2jk0 + -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] +

    +

    + This query counted 34000 documents and I limited the returned + documents + to 5. +

    +

    + This is using Peter Halacsy's IndexSearcherCache slightly + modified to + be a singleton returned cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + linux handles for them. +

    +
    +                                Threads|Avg Time per query (ms)
    +                                1       1009ms
    +                                2       2043ms
    +                                3       3087ms
    +                                4       4045ms
    +                                ..        .
    +                                ..        .
    +                                10      10091ms
    +                            
    +

    + I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! +

    +

    Other query optimizations made little difference.

  • +

    +

    - Hamish can be contacted at hamish at catalyst.net.nz. -

    + Hamish can be contacted at hamish at catalyst.net.nz. +


      -

      - Hardware Environment
      -

    • Dedicated machine for indexing: No, but nominal -usage at time of indexing.
    • -
    • CPU: Compaq Proliant 1850R/600 2 X pIII 600
    • -
    • RAM: 1GB, 256MB allocated to JVM.
    • -
    • Drive configuration: RAID 5 on Fibre Channel -Array
    • -

      -

      - Software environment
      -

    • Java Version: 1.3.1_06
    • -
    • Java VM:
    • -
    • OS Version: Winnt 4/Sp6
    • -
    • Location of index: local
    • -

      -

      - Lucene indexing variables
      -

    • Number of source documents: about 60K
    • -
    • Total filesize of source documents: 6.5GB
    • -
    • Average filesize of source documents: 100K -(6.5GB/60K documents)
    • -
    • Source documents storage location: filesystem on -NTFS
    • -
    • File type of source documents:
    • -
    • Parser(s) used, if any: Currently the only parser -used is the Quiotix html - parser.
    • -
    • Analyzer(s) used: SimpleAnalyzer
    • -
    • Number of fields per document: 8
    • -
    • Type of fields: All strings, and all are stored -and indexed.
    • -
    • Index persistence: FSDirectory
    • -

      -

      - Figures
      -

    • Time taken (in ms/s as an average of at least 3 -indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 -minutes. Note that the # - and size of documents changes daily.
    • -
    • Time taken / 1000 docs indexed:
    • -
    • Memory consumption: JVM is given 256MB and uses it -all.
    • -

      -

      - Notes
      -

    • Notes: -

      - We have 10 threads reading files from the filesystem and -parsing and - analyzing them and the pushing them onto a queue and a single -thread poping - them from the queue and indexing. Note that we are indexing -email messages - and are storing the entire plaintext in of the message in the -index. If the - message contains attachment and we do not have a filter for -the attachment - (ie. we do not do PDFs yet), we discard the data. -

    • -

      -
    +

    + Hardware Environment
    +

  • Dedicated machine for indexing: No, but nominal + usage at time of indexing.
  • +
  • CPU: Compaq Proliant 1850R/600 2 X pIII 600
  • +
  • RAM: 1GB, 256MB allocated to JVM.
  • +
  • Drive configuration: RAID 5 on Fibre Channel + Array
  • +

    +

    + Software environment
    +

  • Java Version: 1.3.1_06
  • +
  • Java VM:
  • +
  • OS Version: Winnt 4/Sp6
  • +
  • Location of index: local
  • +

    +

    + Lucene indexing variables
    +

  • Number of source documents: about 60K
  • +
  • Total filesize of source documents: 6.5GB
  • +
  • Average filesize of source documents: 100K + (6.5GB/60K documents)
  • +
  • Source documents storage location: filesystem on + NTFS
  • +
  • File type of source documents:
  • +
  • Parser(s) used, if any: Currently the only parser + used is the Quiotix html + parser.
  • +
  • Analyzer(s) used: SimpleAnalyzer
  • +
  • Number of fields per document: 8
  • +
  • Type of fields: All strings, and all are stored + and indexed.
  • +
  • Index persistence: FSDirectory
  • +

    +

    + Figures
    +

  • Time taken (in ms/s as an average of at least 3 + indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. Note that the # + and size of documents changes daily.
  • +
  • Time taken / 1000 docs indexed:
  • +
  • Memory consumption: JVM is given 256MB and uses it + all.
  • +

    +

    + Notes
    +

  • Notes: +

    + We have 10 threads reading files from the filesystem and + parsing and + analyzing them and the pushing them onto a queue and a single + thread poping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext in of the message in the + index. If the + message contains attachment and we do not have a filter for + the attachment + (ie. we do not do PDFs yet), we discard the data. +

  • +

    +

    - Justin can be contacted at tvxh-lw4x at spamex.com. -

    + Justin can be contacted at tvxh-lw4x at spamex.com. +

    +
    +

    + + +
diff --git a/docs/contributions.html b/docs/contributions.html
index 37e4930d1d2..b0cd5b7e656 100644
--- a/docs/contributions.html
+++ b/docs/contributions.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/demo.html b/docs/demo.html
index f29ffce9fa5..a83b4dd202c 100644
--- a/docs/demo.html
+++ b/docs/demo.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/demo2.html b/docs/demo2.html
index cf2f5769e39..b17c4537fbb 100644
--- a/docs/demo2.html
+++ b/docs/demo2.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/demo3.html b/docs/demo3.html
index 1e032520e34..b3fdb092708 100644
--- a/docs/demo3.html
+++ b/docs/demo3.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/demo4.html b/docs/demo4.html
index 454966bd36d..ad8e3723e2f 100644
--- a/docs/demo4.html
+++ b/docs/demo4.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/fileformats.html b/docs/fileformats.html
index c6b28549c59..b930a80366c 100644
--- a/docs/fileformats.html
+++ b/docs/fileformats.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/gettingstarted.html b/docs/gettingstarted.html
index 3f8e208f0f5..a7df33dec5f 100644
--- a/docs/gettingstarted.html
+++ b/docs/gettingstarted.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/index.html b/docs/index.html
index 4e6c9b10957..6d40b5c74f8 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/lucene-sandbox/index.html b/docs/lucene-sandbox/index.html
index f3d0cde7e2d..278220a847d 100644
--- a/docs/lucene-sandbox/index.html
+++ b/docs/lucene-sandbox/index.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/lucene-sandbox/indyo/tutorial.html b/docs/lucene-sandbox/indyo/tutorial.html
index d76440220a5..667ba1dcd6a 100644
--- a/docs/lucene-sandbox/indyo/tutorial.html
+++ b/docs/lucene-sandbox/indyo/tutorial.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/lucene-sandbox/larm/overview.html b/docs/lucene-sandbox/larm/overview.html
index a10dbaa528b..c98a75d55aa 100644
--- a/docs/lucene-sandbox/larm/overview.html
+++ b/docs/lucene-sandbox/larm/overview.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/luceneplan.html b/docs/luceneplan.html
index 2abef4c51a4..5490662d3fc 100644
--- a/docs/luceneplan.html
+++ b/docs/luceneplan.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/powered.html b/docs/powered.html
index d9617af6d63..809cb87354e 100644
--- a/docs/powered.html
+++ b/docs/powered.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/queryparsersyntax.html b/docs/queryparsersyntax.html
index 35d1473a2ab..8d8bf8b7345 100644
--- a/docs/queryparsersyntax.html
+++ b/docs/queryparsersyntax.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/resources.html b/docs/resources.html
index c7e6f456d26..65286569ff7 100644
--- a/docs/resources.html
+++ b/docs/resources.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/todo.html b/docs/todo.html
index a77a8824674..886946f4018 100644
--- a/docs/todo.html
+++ b/docs/todo.html
@@ -5,6 +5,7 @@
+
diff --git a/docs/whoweare.html b/docs/whoweare.html
index f5d0d13d2d9..1e74dbce364 100644
--- a/docs/whoweare.html
+++ b/docs/whoweare.html
@@ -5,6 +5,7 @@
+
diff --git a/xdocs/benchmarks.xml b/xdocs/benchmarks.xml
index eec7fc74167..863f96904ec 100644
--- a/xdocs/benchmarks.xml
+++ b/xdocs/benchmarks.xml
@@ -1,283 +1,349 @@
- Kelvin Tan
- Resources - Performance Benchmarks
+ Kelvin Tan
+ Resources - Performance Benchmarks
-
    -

    - The purpose of these user-submitted performance figures is to -give current and potential users of Lucene a sense - of how well Lucene scales. If the requirements for an upcoming -project is similar to an existing benchmark, you - will also have something to work with when designing the system -architecture for the application. -

    -

    - If you've conducted performance tests with Lucene, we'd -appreciate if you can submit these figures for display - on this page. Post these figures to the lucene-user mailing list -using this - template. -

    -
    - -
    -

    -

      -

      - Hardware Environment
      -

    • Dedicated machine for indexing: Self-explanatory -(yes/no)
    • -
    • CPU: Self-explanatory (Type, Speed and Quantity)
    • -
    • RAM: Self-explanatory
    • -
    • Drive configuration: Self-explanatory (IDE, SCSI, -RAID-1, RAID-5)
    • -

      -

      - Software environment
      -

    • Java Version: Version of Java SDK/JRE that is run -
    • -
    • Java VM: Server/client VM, Sun VM/JRockIt
    • -
    • OS Version: Self-explanatory
    • -
    • Location of index: Is the index stored in filesystem -or database? Is it on the same server(local) or - over the network?
    • -

      -

      - Lucene indexing variables
      -

    • Number of source documents: Number of documents being -indexed
    • -
    • Total filesize of source documents: -Self-explanatory
    • -
    • Average filesize of source documents: -Self-explanatory
    • -
    • Source documents storage location: Where are the -documents being indexed located? - Filesystem, DB, http,etc
    • -
    • File type of source documents: Types of files being -indexed, e.g. HTML files, XML files, PDF files, etc.
    • -
    • Parser(s) used, if any: Parsers used for parsing the -various files for indexing, - e.g. XML parser, HTML parser, etc.
    • -
    • Analyzer(s) used: Type of Lucene analyzer used
    • -
    • Number of fields per document: Number of Fields each -Document contains
    • -
    • Type of fields: Type of each field
    • -
    • Index persistence: Where the index is stored, e.g. -FSDirectory, SqlDirectory, etc
    • -

      -

      - Figures
      -

    • Time taken (in ms/s as an average of at least 3 indexing -runs): Time taken to index all files
    • -
    • Time taken / 1000 docs indexed: Time taken to index -1000 files
    • -
    • Memory consumption: Self-explanatory
    • -

      -

      - Notes
      -

    • Notes: Any comments which don't belong in the above, -special tuning/strategies, etc
    • -

      -
    -

    -
    +
    +

    + The purpose of these user-submitted performance figures is to + give current and potential users of Lucene a sense + of how well Lucene scales. If the requirements for an upcoming + project is similar to an existing benchmark, you + will also have something to work with when designing the system + architecture for the application. +

    +

    + If you've conducted performance tests with Lucene, we'd + appreciate if you can submit these figures for display + on this page. Post these figures to the lucene-user mailing list + using this + template. +

    +
    -
    -

    - These benchmarks have been kindly submitted by Lucene users for -reference purposes. -

    -

    We make NO guarantees regarding their accuracy or -validity. -

    -

    We strongly recommend you conduct your own - performance benchmarks before deciding on a particular -hardware/software setup (and hopefully submit - these figures to us). -

    - - -
      -

      - Hardware Environment
      -

    • Dedicated machine for indexing: yes
    • -
    • CPU: Intel x86 P4 1.5Ghz
    • -
    • RAM: 512 DDR
    • -
    • Drive configuration: IDE 7200rpm Raid-1
    • -

      -

      - Software environment
      -

    • Java Version: 1.3.1 IBM JITC Enabled
    • -
    • Java VM:
    • -
    • OS Version: Debian Linux 2.4.18-686
    • -
    • Location of index: local
    • -

      -

      - Lucene indexing variables
      -

    • Number of source documents: Random generator. Set -to make 1M documents -in 2x500,000 batches.
    • -
    • Total filesize of source documents: > 1GB if -stored
    • -
    • Average filesize of source documents: 1KB
    • -
    • Source documents storage location: Filesystem
    • -
    • File type of source documents: Generated
    • -
    • Parser(s) used, if any:
    • -
    • Analyzer(s) used: Default
    • -
    • Number of fields per document: 11
    • -
    • Type of fields: 1 date, 1 id, 9 text
    • -
    • Index persistence: FSDirectory
    • -

      -

      - Figures
      -

    • Time taken (in ms/s as an average of at least 3 -indexing runs):
    • -
    • Time taken / 1000 docs indexed: 49 seconds
    • -
    • Memory consumption:
    • -

      -

      - Notes
      -

    • Notes: -

      - A windows client ran a random document generator which -created - documents based on some arrays of values and an excerpt -(approx 1kb) - from a text file of the bible (King James version).
      - These were submitted via a socket connection (open throughout - indexing process).
      - The index writer was not closed between index calls.
      - This created a 400Mb index in 23 files (after -optimization).
      -

      -

      - Query details:
      -

      -

      - Set up a threaded class to start x number of simultaneous -threads to - search the above created index. -

      -

      - Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) -(Teaser:goo* Tea - ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) - +DisplayStartDate:[mkwsw2jk0 - -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] -

      -

      - This query counted 34000 documents and I limited the returned -documents - to 5. -

      -

      - This is using Peter Halacsy's IndexSearcherCache slightly -modified to - be a singleton returned cached searchers for a given -directory. This - solved an initial problem with too many files open and -running out of - linux handles for them. -

      -
      -          Threads|Avg Time per query (ms)
      -          1       1009ms
      -          2       2043ms
      -          3       3087ms
      -          4       4045ms
      -          ..        .
      -          ..        .
      -          10      10091ms
      -          
      -

      - I removed the two date range terms from the query and it made -a HUGE - difference in performance. With 4 threads the avg time -dropped to 900ms! -

      -

      Other query optimizations made little difference.

    • -

      -
    -

    - Hamish can be contacted at hamish at catalyst.net.nz. -

    -
    +
    +

    +

      +

      + Hardware Environment
      +

    • Dedicated machine for indexing: Self-explanatory + (yes/no)
    • +
    • CPU: Self-explanatory (Type, Speed and Quantity)
    • +
    • RAM: Self-explanatory
    • +
    • Drive configuration: Self-explanatory (IDE, SCSI, + RAID-1, RAID-5)
    • +

      +

      + Software environment
      +

    • Java Version: Version of Java SDK/JRE that is run +
    • +
    • Java VM: Server/client VM, Sun VM/JRockIt
    • +
    • OS Version: Self-explanatory
    • +
    • Location of index: Is the index stored in filesystem + or database? Is it on the same server(local) or + over the network?
    • +

      +

      + Lucene indexing variables
      +

    • Number of source documents: Number of documents being + indexed
    • +
    • Total filesize of source documents: + Self-explanatory
    • +
    • Average filesize of source documents: + Self-explanatory
    • +
    • Source documents storage location: Where are the + documents being indexed located? + Filesystem, DB, http,etc
    • +
    • File type of source documents: Types of files being + indexed, e.g. HTML files, XML files, PDF files, etc.
    • +
    • Parser(s) used, if any: Parsers used for parsing the + various files for indexing, + e.g. XML parser, HTML parser, etc.
    • +
    • Analyzer(s) used: Type of Lucene analyzer used
    • +
    • Number of fields per document: Number of Fields each + Document contains
    • +
    • Type of fields: Type of each field
    • +
    • Index persistence: Where the index is stored, e.g. + FSDirectory, SqlDirectory, etc
    • +

      +

      + Figures
      +

    • Time taken (in ms/s as an average of at least 3 indexing + runs): Time taken to index all files
    • +
    • Time taken / 1000 docs indexed: Time taken to index + 1000 files
    • +
    • Memory consumption: Self-explanatory
    • +

      +

      + Notes
      +

    • Notes: Any comments which don't belong in the above, + special tuning/strategies, etc
    • +

      +
    +

    +
    - -
      -

      - Hardware Environment
      -

    • Dedicated machine for indexing: No, but nominal -usage at time of indexing.
    • -
    • CPU: Compaq Proliant 1850R/600 2 X pIII 600
    • -
    • RAM: 1GB, 256MB allocated to JVM.
    • -
    • Drive configuration: RAID 5 on Fibre Channel -Array
    • -

      -

      - Software environment
      -

    • Java Version: 1.3.1_06
    • -
    • Java VM:
    • -
    • OS Version: Winnt 4/Sp6
    • -
    • Location of index: local
    • -

      -

      - Lucene indexing variables
      -

    • Number of source documents: about 60K
    • -
    • Total filesize of source documents: 6.5GB
    • -
    • Average filesize of source documents: 100K -(6.5GB/60K documents)
    • -
    • Source documents storage location: filesystem on -NTFS
    • -
    • File type of source documents:
    • -
    • Parser(s) used, if any: Currently the only parser -used is the Quiotix html - parser.
    • -
    • Analyzer(s) used: SimpleAnalyzer
    • -
    • Number of fields per document: 8
    • -
    • Type of fields: All strings, and all are stored -and indexed.
    • -
    • Index persistence: FSDirectory
    • -

      -

      - Figures
      -

    • Time taken (in ms/s as an average of at least 3 -indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 -minutes. Note that the # - and size of documents changes daily.
    • -
    • Time taken / 1000 docs indexed:
    • -
    • Memory consumption: JVM is given 256MB and uses it -all.
    • -

      -

      - Notes
      -

    • Notes: -

      - We have 10 threads reading files from the filesystem and -parsing and - analyzing them and the pushing them onto a queue and a single -thread poping - them from the queue and indexing. Note that we are indexing -email messages - and are storing the entire plaintext in of the message in the -index. If the - message contains attachment and we do not have a filter for -the attachment - (ie. we do not do PDFs yet), we discard the data. -

    • -

      -
    -

    - Justin can be contacted at tvxh-lw4x at spamex.com. -

    -
    +
    +

    + These benchmarks have been kindly submitted by Lucene users for + reference purposes. +

    +

    We make NO guarantees regarding their accuracy or + validity. +

    +

    We strongly recommend you conduct your own + performance benchmarks before deciding on a particular + hardware/software setup (and hopefully submit + these figures to us). +

    -
    + +
      +

      + Hardware Environment
      +

    • Dedicated machine for indexing: yes
    • +
    • CPU: Intel x86 P4 1.5Ghz
    • +
    • RAM: 512 DDR
    • +
    • Drive configuration: IDE 7200rpm Raid-1
    • +

      +

      + Software environment
      +

    • Java Version: 1.3.1 IBM JITC Enabled
    • +
    • Java VM:
    • +
    • OS Version: Debian Linux 2.4.18-686
    • +
    • Location of index: local
    • +

      +

      + Lucene indexing variables
      +

    • Number of source documents: Random generator. Set + to make 1M documents + in 2x500,000 batches.
    • +
    • Total filesize of source documents: > 1GB if + stored
    • +
    • Average filesize of source documents: 1KB
    • +
    • Source documents storage location: Filesystem
    • +
    • File type of source documents: Generated
    • +
    • Parser(s) used, if any:
    • +
    • Analyzer(s) used: Default
    • +
    • Number of fields per document: 11
    • +
    • Type of fields: 1 date, 1 id, 9 text
    • +
    • Index persistence: FSDirectory
    • +

      +

      + Figures
      +

    • Time taken (in ms/s as an average of at least 3 + indexing runs):
    • +
    • Time taken / 1000 docs indexed: 49 seconds
    • +
    • Memory consumption:
    • +

      +

      + Notes
      +

    • Notes: +

      + A windows client ran a random document generator which + created + documents based on some arrays of values and an excerpt + (approx 1kb) + from a text file of the bible (King James version).
      + These were submitted via a socket connection (open throughout + indexing process).
      + The index writer was not closed between index calls.
      + This created a 400Mb index in 23 files (after + optimization).
      +

      +

      + Query details:
      +

      +

      + Set up a threaded class to start x number of simultaneous + threads to + search the above created index. +

      +

      + Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) + (Teaser:goo* Tea + ser:plan*) (Details:goo* Details:plan*)) -Cancel:y) + +DisplayStartDate:[mkwsw2jk0 + -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0] +

      +

      + This query counted 34000 documents and I limited the returned + documents + to 5. +

      +

      + This is using Peter Halacsy's IndexSearcherCache slightly + modified to + be a singleton returned cached searchers for a given + directory. This + solved an initial problem with too many files open and + running out of + linux handles for them. +

      +
      +                                Threads|Avg Time per query (ms)
      +                                1       1009ms
      +                                2       2043ms
      +                                3       3087ms
      +                                4       4045ms
      +                                ..        .
      +                                ..        .
      +                                10      10091ms
      +                            
      +

      + I removed the two date range terms from the query and it made + a HUGE + difference in performance. With 4 threads the avg time + dropped to 900ms! +

      +

      Other query optimizations made little difference.

    • +

      +
    +

    + Hamish can be contacted at hamish at catalyst.net.nz. +
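
Reviewer note: the thread table in the notes above may be easier to interpret as throughput. If each thread issues its next query as soon as the previous one returns (an assumption; the submission does not describe the test loop), throughput is the thread count divided by the average latency. A minimal Python sketch of that arithmetic, using the figures from the table and an illustrative helper name that is not part of Lucene:

```python
# Figures copied from the benchmark table: threads -> average time per query (ms).
avg_ms_by_threads = {1: 1009, 2: 2043, 3: 3087, 4: 4045, 10: 10091}


def throughput_qps(threads, avg_ms):
    """Queries per second, ASSUMING every thread issues queries back-to-back."""
    return threads / (avg_ms / 1000.0)


for threads, avg_ms in sorted(avg_ms_by_threads.items()):
    print(f"{threads:>2} threads: {throughput_qps(threads, avg_ms):.2f} queries/s")
```

Under that assumption every row works out to roughly one query per second, which would suggest the searches were effectively serialized and that extra threads added waiting time, not capacity.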

    +
    + + +
      +

      + Hardware Environment
      +

    • Dedicated machine for indexing: No, but nominal + usage at time of indexing.
    • +
    • CPU: Compaq Proliant 1850R/600 2 X pIII 600
    • +
    • RAM: 1GB, 256MB allocated to JVM.
    • +
    • Drive configuration: RAID 5 on Fibre Channel + Array
    • +

      +

      + Software environment
      +

    • Java Version: 1.3.1_06
    • +
    • Java VM:
    • +
    • OS Version: Winnt 4/Sp6
    • +
    • Location of index: local
    • +

      +

      + Lucene indexing variables
      +

    • Number of source documents: about 60K
    • +
    • Total filesize of source documents: 6.5GB
    • +
    • Average filesize of source documents: 100K + (6.5GB/60K documents)
    • +
    • Source documents storage location: filesystem on + NTFS
    • +
    • File type of source documents:
    • +
    • Parser(s) used, if any: Currently the only parser + used is the Quiotix html + parser.
    • +
    • Analyzer(s) used: SimpleAnalyzer
    • +
    • Number of fields per document: 8
    • +
    • Type of fields: All strings, and all are stored + and indexed.
    • +
    • Index persistence: FSDirectory
    • +

      +

      + Figures
      +

    • Time taken (in ms/s as an average of at least 3 + indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 + minutes. Note that the # + and size of documents changes daily.
    • +
    • Time taken / 1000 docs indexed:
    • +
    • Memory consumption: JVM is given 256MB and uses it + all.
    • +

      +

      + Notes
      +

    • Notes: +

      + We have 10 threads reading files from the filesystem and + parsing and + analyzing them and the pushing them onto a queue and a single + thread poping + them from the queue and indexing. Note that we are indexing + email messages + and are storing the entire plaintext in of the message in the + index. If the + message contains attachment and we do not have a filter for + the attachment + (ie. we do not do PDFs yet), we discard the data. +

    • +

      +
    +
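The threading arrangement described in the notes above (several reader threads feeding a queue, one thread draining it) can be sketched roughly as follows. This is only an illustration, not the submitter's code: the class `IndexPipelineSketch`, the `ParsedDoc` type and the `run` method are hypothetical names, the "parse and analyze" step is replaced by a trivial trim/lowercase, and the "index" step just collects the results in a list instead of calling Lucene. It also uses `java.util.concurrent`, which did not exist in the Java 1.3.1 runtime listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class IndexPipelineSketch {
    // Hypothetical stand-in for a parsed and analyzed message.
    static final class ParsedDoc {
        final String body;
        ParsedDoc(String body) { this.body = body; }
    }

    // Sentinel telling the single consumer that no more documents will arrive.
    private static final ParsedDoc POISON = new ParsedDoc(null);

    /** Runs readerCount producer threads and one consumer; returns what the consumer "indexed". */
    static List<String> run(List<String> sources, int readerCount) {
        // Bounded queue so fast readers cannot outrun the indexer without limit.
        BlockingQueue<ParsedDoc> queue = new LinkedBlockingQueue<>(1000);
        List<String> indexed = new ArrayList<>(); // touched only by the consumer thread

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    ParsedDoc doc = queue.take();
                    if (doc == POISON) break;
                    indexed.add(doc.body); // a real indexer would add the document to the index here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        List<Thread> readers = new ArrayList<>();
        for (int r = 0; r < readerCount; r++) {
            final int offset = r;
            Thread reader = new Thread(() -> {
                try {
                    // Each reader takes every readerCount-th source, starting at its own offset.
                    for (int i = offset; i < sources.size(); i += readerCount) {
                        String parsed = sources.get(i).trim().toLowerCase(Locale.ROOT);
                        queue.put(new ParsedDoc(parsed));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            readers.add(reader);
            reader.start();
        }

        try {
            for (Thread t : readers) t.join(); // wait until every reader has finished
            queue.put(POISON);                 // then signal the consumer to stop
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return indexed;
    }
}
```

Calling `IndexPipelineSketch.run(sources, 10)` would mirror the 10-reader setup mentioned in the notes. The order of the returned list depends on thread scheduling, so callers should not rely on it.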

    + Justin can be contacted at tvxh-lw4x at spamex.com. +

    +
    + + + +

    +
    + +
    -
    + + Daniel Armbrust's benchmarks + +
    +
    +

    + My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed, + nor was the total index built in one shot. The index was created on several different + machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to + 1 million documents. Each of these small indexes was then moved to a + much larger drive, where they were all merged together into a big index. + This process was done manually, over the course of several months, as the sources became available. +

    +
      +

      + Hardware Environment
      +

    • Dedicated machine for indexing: No. The machine had moderate to low load. However, the indexing process was built single-threaded, + so it took advantage of only one of the processors. It usually got 100% of that processor.
    • +
    • CPU: Sun Ultra 80 4 x 64 bit processors
    • +
    • RAM: 4 GB Memory
    • +
    • Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive
    • +

      +

      + Software environment
      +

    • Java Version: 1.3.1
    • +
    • Java VM:
    • +
    • OS Version: SunOS 5.8 (64 bit)
    • +
    • Location of index: local
    • +

      +

      + Lucene indexing variables
      +

    • Number of source documents: 13,820,517
    • +
    • Total filesize of source documents: 87.3 GB
    • +
    • Average filesize of source documents: 6.3 KB
    • +
    • Source documents storage location: Filesystem
    • +
    • File type of source documents: XML
    • +
    • Parser(s) used, if any:
    • +
    • Analyzer(s) used: A home-grown analyzer that simply removes stopwords.
    • +
    • Number of fields per document: 1 - 31
    • +
    • Type of fields: All text, though 2 of them are dates (20001205) that we filter on
    • +
    • Index persistence: FSDirectory
    • +
    • Index size: 12.5 GB
    • +

      +

      + Figures
      +

    • Time taken (in ms/s as an average of at least 3 + indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)
    • +
    • Time taken / 1000 docs indexed: 340 Seconds
    • +
    • Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so + 1 GB of memory was allotted to the indexer
    • +

      +
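The per-1000-document figure above follows directly from the reported totals. A minimal sketch of the arithmetic, using a hypothetical helper class `IndexingRate` that is not part of Lucene or of the original submission:

```java
final class IndexingRate {
    /** Seconds needed per 1000 documents, given a total run time and a document count. */
    static double secondsPerThousandDocs(long totalSeconds, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return totalSeconds * 1000.0 / docCount;
    }

    /** Converts a duration in seconds to days (86,400 seconds per day). */
    static double days(long totalSeconds) {
        return totalSeconds / 86400.0;
    }
}
```

With the reported values, `IndexingRate.secondsPerThousandDocs(209698, 617271)` comes to roughly 339.7, which rounds to the 340 seconds quoted, and `IndexingRate.days(209698)` comes to roughly 2.43 days, in line with the approximate figure given.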

      + Notes
      +

    • Notes: +

      + The source documents were XML. The "indexer" opened each document one at a time, ran an + XSL transformation on it, and then proceeded to index the stream. The indexer optimized + the index every 50,000 documents (on this run), though previously we optimized every + 300,000 documents. The performance didn't change much either way. We did no other + tuning (RAM Directories, separate process to pretransform the source material, etc.) + to make it index faster. When all of these individual indexes were built, they were + merged together into the main index. That process usually took about a day. +

    • +

      +
    +

    + Daniel can be contacted at Armbrust.Daniel at mayo.edu. +