Support exists for downloading, parsing, and loading the English
version of Wikipedia (enwiki).

The build file can automatically try to download the most current
enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
http://download.wikimedia.org/enwiki/latest/. However, this file
doesn't always exist, depending on where Wikipedia is in the dump
process and whether prior dumps have succeeded. If the file doesn't
exist, you can sometimes find an older or in-progress version by
looking in the dated directories under
http://download.wikimedia.org/enwiki/. For example, as of this
writing, there is a pages-articles file in
http://download.wikimedia.org/enwiki/20070402/. You can download this
file manually and put it in temp. Note that the file you download will
probably have the date in its name, e.g.,
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
When you put it in temp, rename it to
enwiki-latest-pages-articles.xml.bz2.

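The naming convention described above can be sketched in a few lines of
Java. This is only an illustration: the class EnwikiDumpNames and its
methods are hypothetical and not part of the build; the base URL, the
dated file name pattern, and the "latest" file name are taken from the
text above, and the 8-digit date check is an assumption based on the
20070402 example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of the Lucene build. */
final class EnwikiDumpNames {
    // Values copied from the README text above.
    static final String BASE_URL = "http://download.wikimedia.org/enwiki/";
    static final String LATEST_NAME = "enwiki-latest-pages-articles.xml.bz2";

    /** File name of a dated dump, e.g. enwiki-20070402-pages-articles.xml.bz2. */
    static String datedName(String date) {
        // Assumption: dated directories are named yyyyMMdd, as in 20070402.
        if (!date.matches("\\d{8}")) {
            throw new IllegalArgumentException("expected yyyyMMdd, got: " + date);
        }
        return "enwiki-" + date + "-pages-articles.xml.bz2";
    }

    /** Full URL of the dated dump file. */
    static String datedUrl(String date) {
        return BASE_URL + date + "/" + datedName(date);
    }

    /** The name the downloaded file should have once it is placed in temp. */
    static Path expectedLocation(Path tempDir) {
        return tempDir.resolve(LATEST_NAME);
    }

    public static void main(String[] args) {
        String date = args.length > 0 ? args[0] : "20070402";
        System.out.println("download: " + datedUrl(date));
        System.out.println("save as:  " + expectedLocation(Paths.get("temp")));
    }
}
```

Run with a dump date as the only argument to print the URL to fetch and
the path to save it under; the program does not download anything itself.
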
Once the file is in place, "ant enwiki" should process the dataset and
run a load test. The Ant targets get-enwiki, expand-enwiki, and
extract-enwiki can also be run individually to download the dataset,
decompress it, and extract it to individual files in work/enwiki,
respectively.

NOTE: This bug in Xerces:

  https://issues.apache.org/jira/browse/XERCESJ-1257

which is still present as of 2.9.1, causes an exception like this when
processing Wikipedia's XML:

  Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
          at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
          at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
          at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
          at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
          at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
          at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
          at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
          at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
          at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
          at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
          at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
          ... 1 more

The original poster of the Xerces bug provided this patch:

--- UTF8Reader.java 2006-11-23 00:36:53.000000000 +0100
+++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java 2008-04-04 00:40:58.000000000 +0200
@@ -534,6 +534,16 @@
                     invalidByte(4, 4, b2);
                 }
 
+                // check if output buffer is large enough to hold 2 surrogate chars
+                if( out + 1 >= offset + length ){
+                    fBuffer[0] = (byte)b0;
+                    fBuffer[1] = (byte)b1;
+                    fBuffer[2] = (byte)b2;
+                    fBuffer[3] = (byte)b3;
+                    fOffset = 4;
+                    return out - offset;
+                }
+
                 // decode bytes into surrogate characters
                 int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
                 if (uuuuu > 0x10) {

I've applied this patch to the Xerces 2.9.1 sources and committed the
result as lib/xerces-2.9.1-patched-XERCESJ-1257.jar. Once XERCESJ-1257
is fixed, we can upgrade to a standard Xerces release.
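
As background for the patch's comment about "2 surrogate chars": a
4-byte UTF-8 sequence encodes a code point above U+FFFF, which Java
represents as two UTF-16 chars (a surrogate pair). The sketch below uses
only the JDK, not Xerces, to show this. The class FourByteUtf8Demo and
the method hasRoomForSurrogatePair are hypothetical names; the method
simply restates the negation of the condition in the patch and is not
Xerces code. The sample code point U+10400 is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; uses only the JDK, not Xerces. */
final class FourByteUtf8Demo {

    /** Decodes UTF-8 bytes with the JDK decoder. */
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /**
     * Negation of the patch's check (out + 1 >= offset + length):
     * true when both chars of a surrogate pair fit in the output
     * region [offset, offset + length) starting at index out.
     */
    static boolean hasRoomForSurrogatePair(int out, int offset, int length) {
        return out + 1 < offset + length;
    }

    public static void main(String[] args) {
        // U+10400 encoded as the 4-byte UTF-8 sequence F0 90 90 80.
        byte[] fourBytes = {(byte) 0xF0, (byte) 0x90, (byte) 0x90, (byte) 0x80};
        String s = decode(fourBytes);

        System.out.println("UTF-16 chars: " + s.length());
        System.out.println("code point:   U+"
                + Integer.toHexString(s.codePointAt(0)).toUpperCase());
        System.out.println("high surrogate first: "
                + Character.isHighSurrogate(s.charAt(0)));

        // With a 2-char output region, a pair fits at index 0 but not at index 1.
        System.out.println("room at 0: " + hasRoomForSurrogatePair(0, 0, 2));
        System.out.println("room at 1: " + hasRoomForSurrogatePair(1, 0, 2));
    }
}
```

When only one char slot remains, the patch stores the four bytes in
fBuffer, sets fOffset to 4, and returns the number of chars decoded so
far instead of writing the pair.
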