lucene/modules/benchmark/CHANGES.txt

433 lines
19 KiB
Plaintext

Lucene Benchmark Contrib Change Log
The Benchmark contrib package contains code for benchmarking Lucene in a variety of ways.
03/31/2011
Updated ReadTask to the new method for obtaining a top-level deleted docs
bitset. Also checking the bitset for null, when there are no deleted docs.
(Steve Rowe, Mike McCandless)
Updated NewAnalyzerTask and NewShingleAnalyzerTask to handle analyzers
in the new org.apache.lucene.analysis.core package (KeywordAnalyzer,
SimpleAnalyzer, etc.) (Steve Rowe, Robert Muir)
Updated ReadTokensTask to converts tokens to their indexed forms
(char[]->byte[]), just as the indexer does. This allows measurement
of the conversion process, which is important for analysis components
that customize it, e.g. (ICU)CollationKeyFilter. As a result,
benchmarks that incorporate this task will no longer be directly
comparable between 3.X and 4.0. (Robert Muir, Steve Rowe)
03/24/2011
LUCENE-2977: WriteLineDocTask now automatically detects how to write -
GZip or BZip2 or Plain-text - according to the output file extension.
Property bzip.compression of WriteLineDocTask was canceled. (Doron Cohen)
03/23/2011
LUCENE-2980: Benchmark's ContentSource no more requires lower case file suffixes
for detecting file type (gzip/bzip2/text). As part of this fix worked around an
issue with gzip input streams which were remaining open (See COMPRESS-127).
(Doron Cohen)
03/22/2011
LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1 as
the move of gzip decompression in LUCENE-1540 from Java's GZipInputStream
to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down
is observed. (Doron Cohen)
03/21/2011
LUCENE-2958: WriteLineDocTask improvements - allow to emit line docs also for empty
docs, and be flexible about which fields are added to the line file. For this, a header
line was added to the line file. That header is examined by LineDocSource. Old line
files which have no header line are handled as before, imposing the default header.
(Doron Cohen, Shai Erera, Mike McCandless)
03/21/2011
LUCENE-2964: Allow benchmark tasks from alternative packages,
specified through a new property "alt.tasks.packages".
(Doron Cohen, Shai Erera)
03/20/2011
LUCENE-2963: Easier way to run benchmark, by calling Benmchmark.exec(alg-file).
(Doron Cohen)
03/10/2011
LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the
JAXP 1.3 interface classes it provides.
02/05/2011
LUCENE-1540: Improvements to contrib.benchmark for TREC collections.
ContentSource can now process plain text files, gzip files, and bzip2 files.
TREC doc parsing now handles the TREC gov2 collection and TREC disks 4&5-CR
collection (both used by many TREC tasks). (Shai Erera, Doron Cohen)
01/31/2011
LUCENE-1591: Rollback to xerces-2.9.1-patched-XERCESJ-1257.jar to workaround
XERCESJ-1257, which we hit on current Wikipedia XML export
(ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar. (Mike McCandless)
01/26/2011
LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames. That
way, if a previous extract attempt failed, "ant extract-reuters" will still
extract the files. (Shai Erera, Doron Cohen, Grant Ingersoll)
01/24/2011
LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()).
(Mike McCandless)
10/10/2010
The locally built patched version of the Xerces-J jar introduced
as part of LUCENE-1591 is no longer required, because Xerces
2.10.0, which contains a fix for XERCESJ-1257 (see
http://svn.apache.org/viewvc?view=revision&revision=554069),
was released earlier this year. Upgraded
xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar
to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe)
8/2/2010
LUCENE-2582: You can now specify the default codec to use for
writing new segments by adding default.codec = Pulsing (for
example), in the alg file. (Mike McCandless)
4/27/2010: WriteLineDocTask now supports multi-threading. Also,
StringBufferReader was renamed to StringBuilderReader and works on
StringBuilder now. In addition, LongToEnglishContentSource starts from 0
(instead of Long.MIN_VAL+10) and wraps around to MIN_VAL (if you ever hit
Long.MAX_VAL). (Shai Erera)
4/07/2010
LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by
CreateIndexTask. (Shai Erera)
3/28/2010
LUCENE-2353: Fixed bug in Config where Windows absolute path property values
were incorrectly handled (Shai Erera)
3/24/2010
LUCENE-2343: Added support for benchmarking collectors. (Grant Ingersoll, Shai Erera)
2/21/2010
LUCENE-2254: Add support to the quality package for running
experiments with any combination of Title, Description, and Narrative.
(Robert Muir)
1/28/2010
LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any
analyzer with ShingleAnalyzerWrapper and specify shingle parameters
with the NewShingleAnalyzer task. (Steven Rowe via Robert Muir)
1/14/2010
LUCENE-2210: TrecTopicsReader now properly reads descriptions and
narratives from trec topics files. (Robert Muir)
1/11/2010
LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask,
which sets a Locale in the run data for collation to use, and can be
used in the future for benchmarking localized range queries and sorts.
Also add NewCollationAnalyzerTask, which works with both JDK and ICU
Collator implementations. Fix ReadTokensTask to not tokenize fields
unless they should be tokenized according to DocMaker config. The
easiest way to run the benchmark is to run 'ant collation'
(Steven Rowe via Robert Muir)
12/22/2009
LUCENE-2178: Allow multiple locations to add to the class path with
-Dbenchmark.ext.classpath=... when running "ant run-task" (Steven
Rowe via Mike McCandless)
12/17/2009
LUCENE-2168: Allow negative relative thread priority for BG tasks
(Mike McCandless)
12/07/2009
LUCENE-2106: ReadTask does not close its Reader when
OpenReader/CloseReader are not used. (Mark Miller)
11/17/2009
LUCENE-2079: Allow specifying delta thread priority after the "&";
added log.time.step.msec to print per-time-period counts; fixed
NearRealTimeTask to print reopen times (in msec) of each reopen, at
the end. (Mike McCandless)
11/13/2009
LUCENE-2050: Added ability to run tasks within a serial sequence in
the background, by appending "&". The tasks are stopped & joined at
the end of the sequence. Also added Wait and RollbackIndex tasks.
Genericized NearRealTimeReaderTask to only reopen the reader
(previously it spawned its own thread, and also did searching).
Also changed the API of PerfRunData.getIndexReader: it now returns a
reference, and it's your job to decRef the reader when you're done
using it. (Mike McCandless)
11/12/2009
LUCENE-2059: allow TrecContentSource not to change the docname.
Previously, it would always append the iteration # to the docname.
With the new option content.source.excludeIteration, you can disable this.
The resulting index can then be used with the quality package to measure
relevance. (Robert Muir)
11/12/2009
LUCENE-2058: specify trec_eval submission output from the command line.
Previously, 4 arguments were required, but the third was unused. The
third argument is now the desired location of submission.txt (Robert Muir)
11/08/2009
LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance
used by DeleteByPercentTask. (Mike McCandless)
11/07/2009
LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader
changes (Mike McCandless)
11/07/2009
LUCENE-2042: Added print.hits.field, to print each hit from the
Search* tasks. (Mike McCandless)
11/04/2009
LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each
falls back to the non-body variant as its default. (Mike McCandless)
10/28/2009
LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker
when doc.reuse.fields is false. Also made docs.reuse.fields=true
thread safe. (Mark Miller, Shai Erera, Mike McCandless)
8/4/2009
LUCENE-1770: Add EnwikiQueryMaker (Mark Miller)
8/04/2009
LUCENE-1773: Add FastVectorHighlighter tasks. This change is a
non-backwards compatible change in how subclasses of ReadTask define
a highlighter. The methods doHighlight, isMergeContiguousFragments,
maxNumFragments and getHighlighter are no longer used and have been
mark deprecated and package protected private so there's a compile
time error. Instead, the new getBenchmarkHighlighter method should
return an appropriate highlighter for the task. The configuration of
the highlighter tasks (maxFrags, mergeContiguous, etc.) is now
accepted as params to the task. (Koji Sekiguchi via Mike McCandless)
8/03/2009
LUCENE-1778: Add support for log.step setting per task type. Perviously, if
you included a log.step line in the .alg file, it had been applied to all
tasks. Now, you can include a log.step.AddDoc, or log.step.DeleteDoc (for
example) to control logging for just these tasks. If you want to ommit logging
for any other task, include log.step=-1. The syntax is "log.step." together
with the Task's 'short' name (i.e., without the 'Task' part).
(Shai Erera via Mark Miller)
7/24/2009
LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of
using DocMaker directly, with content.source = LineDocSource or
EnwikiContentSource. NOTE: with this change, the "id" field from
the Wikipedia XML export is now indexed as the "docname" field
(previously it was indexed as "docid"). Additionaly, the
SearchWithSort task now accepts all types that SortField can accept
and no longer falls back to SortField.AUTO, which has been
deprecated. (Mike McCandless)
7/20/2009
LUCENE-1755: Fix WriteLineDocTask to output a document if it contains either
a title or body (or both). (Shai Erera via Mark Miller)
7/14/2009
LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and no longer works
with Benchmark. Benchmark will now throw an exception if you specify sort fields without
a type. The example sort algorithm is now typed. (Mark Miller)
7/6/2009
LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
unless a different encoding is specified. Additionally, ContentSource now supports
a content.source.encoding parameter in the configuration file.
(Shai Erera via Mark Miller)
6/26/2009
LUCENE-1716: Added the following support:
doc.tokenized.norms: specifies whether to store norms
doc.body.tokenized.norms: special attribute for the body field
doc.index.props: specifies whether DocMaker should index the properties set on
DocData
writer.info.stream: specifies the info stream to set on IndexWriter (supported
values are: SystemOut, SystemErr and a file name). (Shai Erera via Mike McCandless)
6/23/09
LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing only
occurrences of "\t" with a space. It now replaces "\r\n" in addition to that,
so that LineDocMaker won't fail. (Shai Erera via Michael McCandless)
6/17/09
LUCENE-1595: This issue breaks previous external algorithms. DocMaker has been
replaced with a concrete class which accepts a ContentSource for iterating over
a content source's documents. Most of the old DocMakers were changed to a
ContentSource implementation, and DocMaker is now a default document creation impl
that provides an easy way for reusing fields. When [doc.maker] is not defined in
an algorithm, the new DocMaker is the default. If you have .alg files which
specify a DocMaker (like ReutersDocMaker), you should change the [doc.maker] line to:
[content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource]
i.e.
doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
becomes
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
becomes
content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource
Also, PerfTask now logs a message in tearDown() rather than each Task doing its
own logging. A new setting called [log.step] is consulted to determine how often
to log. [doc.add.log.step] is no longer a valid setting. For easy migration of
current .alg files, rename [doc.add.log.step] to [log.step] and [doc.delete.log.step]
to [delete.log.step].
Additionally, [doc.maker.forever] should be changed to [content.source.forever].
(Shai Erera via Mark Miller)
6/12/09
LUCENE-1539: Added DeleteByPercentTask which enables deleting a
percentage of documents and searching on them. Changed CommitIndex
to optionally accept a label (recorded as userData=<label> in the
commit point). Added FlushReaderTask, and modified OpenReaderTask
to also optionally take a label referencing a commit point to open.
Also changed default autoCommit (when IndexWriter is opened) to
true. (Jason Rutherglen via Mike McCandless)
12/20/08
LUCENE-1495: Allow task sequence to run for specfied number of seconds by adding ": 2.7s" (for example).
12/16/08
LUCENE-1493: Stop using deprecated Hits API for searching; add new
param search.num.hits to set top N docs to collect.
12/16/08
LUCENE-1492: Added optional readOnly param (default true) to OpenReader task.
9/9/08
LUCENE-1243: Added new sorting benchmark capabilities. Also Reopen and commit tasks. (Mark Miller via Grant Ingersoll)
5/10/08
LUCENE-1090: remove relative paths assumptions from benchmark code.
Only build.xml was modified: work-dir definition must remain so
benchmark tests can run from both trunk-home and benchmark-home.
3/9/08
LUCENE-1209: Fixed DocMaker settings by round. Prior to this fix, DocMaker settings of
first round were used in all rounds. (E.g. term vectors.)
(Mark Miller via Doron Cohen)
1/30/08
LUCENE-1156: Fixed redirect problem in EnwikiDocMaker. Refactored ExtractWikipedia to use EnwikiDocMaker. Added property to EnwikiDocMaker to allow
for skipping image only documents.
1/24/2008
LUCENE-1136: add ability to not count sub-task doLogic increment
1/23/2008
LUCENE-1129: ReadTask properly uses the traversalSize value
LUCENE-1128: Added support for benchmarking the highlighter
01/20/08
LUCENE-1139: various fixes
- add merge.scheduler, merge.policy config properties
- refactor Open/CreateIndexTask to share setting config on IndexWriter
- added doc.reuse.fields=true|false for LineDocMaker
- OptimizeTask now takes int param to call optimize(int maxNumSegments)
- CloseIndexTask now takes bool param to call close(false) (abort running merges)
01/03/08
LUCENE-1116: quality package improvements:
- add MRR computation;
- allow control of max #queries to run;
- verify log & report are flushed.
- add TREC query reader for the 1MQ track.
12/31/07
LUCENE-1102: EnwikiDocMaker now indexes the docid field, so results might not be comparable with results prior to this change, although
it is doubted that this one small field makes much difference.
12/13/07
LUCENE-1086: DocMakers setup for the "docs.dir" property
fixed to properly handle absolute paths. (Shai Erera via Doron Cohen)
9/18/07
LUCENE-941: infinite loop for alg: {[AddDoc(4000)]: 4} : *
ResetInputsTask fixed to work also after exhaustion.
All Reset Tasks now subclas ResetInputsTask.
8/9/07
LUCENE-971: Change enwiki tasks to a doc maker (extending
LineDocMaker) that directly processes the Wikipedia XML and produces
documents. Intermediate files (one per document) are no longer
created.
8/1/07
LUCENE-967: Add "ReadTokensTask" to allow for benchmarking just tokenization.
7/27/07
LUCENE-836: Add support for search quality benchmarking, running
a set of queries against a searcher, and, optionally produce a submission
report, and, if query judgements are available, compute quality measures:
recall, precision_at_N, average_precision, MAP. TREC specific Judge (based
on TREC QRels) and TREC Topics reader are included in o.a.l.benchmark.quality.trec
but any other format of queries and judgements can be implemented and used.
7/24/07
LUCENE-947: Add support for creating and index "one document per
line" from a large text file, which reduces per-document overhead of
opening a single file for each document.
6/30/07
LUCENE-848: Added support for Wikipedia benchmarking.
6/25/07
- LUCENE-940: Multi-threaded issues fixed: SimpleDateFormat; logging for addDoc/deleteDoc tasks.
- LUCENE-945: tests fail to find data dirs. Added sys-prop benchmark.work.dir and cfg-prop work.dir.
(Doron Cohen)
4/17/07
- LUCENE-863: Deprecated StandardBenchmarker in favour of byTask code.
(Otis Gospodnetic)
4/13/07
Better error handling and javadocs around "exhaustive" doc making.
3/25/07
LUCENE-849:
1. which HTML Parser is used is configurable with html.parser property.
2. External classes added to classpath with -Dbenchmark.ext.classpath=path.
3. '*' as repeating number now means "exhaust doc maker - no repetitions".
3/22/07
-Moved withRetrieve() call out of the loop in ReadTask
-Added SearchTravRetLoadFieldSelectorTask to help benchmark some of the FieldSelector capabilities
-Added options to store content bytes on the Reuters Doc (and others, but Reuters is the only one w/ it enabled)
3/21/07
Tests (for benchmarking code correctness) were added - LUCENE-840.
To be invoked by "ant test" from contrib/benchmark. (Doron Cohen)
3/19/07
1. Introduced an AbstractQueryMaker to hold common QueryMaker code. (GSI)
2. Added traversalSize parameter to SearchTravRetTask and SearchTravTask. Changed SearchTravRetTask to extend SearchTravTask. (GSI)
3. Added FileBasedQueryMaker to run queries from a File or resource. (GSI)
4. Modified query-maker generation for read related tasks to make further read tasks addition simpler and safer. (DC)
5. Changed Taks' setParams() to throw UnsupportedOperationException if that task does not suppot command line param. (DC)
6. Improved javadoc to specify all properties command line params currently supported. (DC)
7. Refactored ReportTasks so that it is easy/possible now to create new report tasks. (DC)
01/09/07
1. Committed Doron Cohen's benchmarking contribution, which provides an easily expandable task based approach to benchmarking. See the javadocs for information. (Doron Cohen via Grant Ingersoll)
2. Added this file.
3. 2/11/07: LUCENE-790 and 788: Fixed Locale issue with date formatter. Fixed some minor issues with benchmarking by task. Added a dependency
on the Lucene demo to the build classpath. (Doron Cohen, Grant Ingersoll)
4. 2/13/07: LUCENE-801: build.xml now builds Lucene core and Demo first and has classpath dependencies on the output of that build. (Doron Cohen, Grant Ingersoll)