Support exists for downloading, parsing, and loading the English
version of Wikipedia (enwiki).

The build file can automatically try to download the most current
enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
http://download.wikimedia.org/enwiki/latest/. However, this file
doesn't always exist, depending on where Wikipedia is in the dump
process and whether prior dumps have succeeded. If this file doesn't
exist, you can sometimes find an older or in-progress version by
looking in the dated directories under
http://download.wikimedia.org/enwiki/. For example, as of this
writing, there is a page file in
http://download.wikimedia.org/enwiki/20070402/. You can download this
file manually and put it in the temp directory. Note that the file
you download will probably have the date in the name, e.g.,
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.

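For example, to fetch the dated dump mentioned above by hand (wget is
just one option here; any HTTP downloader will do):

  wget -P temp http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2
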
If you use EnwikiContentSource, the data will be decompressed on the
fly during the benchmark. If you want to benchmark indexing, you
should probably decompress it beforehand using the "enwiki" Ant
target, which produces work/enwiki.txt; you can then use
LineDocSource in your benchmark.

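As a rough sketch of the two approaches in an algorithm file (the
class names are real; treat the docs.file paths as illustrative and
see the .alg files under conf/ for complete examples):

  # stream the compressed dump, decompressing on the fly:
  content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
  docs.file=temp/enwiki-20070402-pages-articles.xml.bz2

  # or, once "ant enwiki" has produced the pre-extracted line file:
  content.source=org.apache.lucene.benchmark.byTask.feeds.LineDocSource
  docs.file=work/enwiki.txt
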
After that, ant enwiki should process the data set and run a load
test. The enwiki Ant target will download, decompress, and extract
the dataset (to individual files in work/enwiki).

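Putting it together, a typical session might look like the following
(the run-task invocation and the .alg path are illustrative; point
task.alg at whichever algorithm file you actually use):

  ant enwiki
  ant run-task -Dtask.alg=conf/wikipedia.alg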