lucene/sandbox/contributions/webcrawler-LARM/CHANGES.txt

$Id$
2002-06-18 (cmarschner)
* added an experimental version of Lucene storage; see FetcherMain.java for details on how to use it.
LuceneStorage simply saves all fields as specified in WebDocument. To do preprocessing, add a converter
to the storage pipeline before LuceneStorage
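
Note: the idea of putting a converter in front of the final storage can be sketched as below.
This is only an illustration under stated assumptions: DocStage, SketchPipeline, TitleTrimmer and
RecordingStorage are hypothetical names, not LARM classes, and a plain field map stands in for
WebDocument. See FetcherMain.java for the real setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document storage stage; not LARM's actual interface.
interface DocStage {
    // Returns the (possibly modified) document, or null to drop it.
    Map<String, String> store(Map<String, String> doc);
}

// Hypothetical pipeline: passes each document through its stages in order.
class SketchPipeline implements DocStage {
    private final List<DocStage> stages = new ArrayList<>();

    SketchPipeline add(DocStage stage) {
        stages.add(stage);
        return this;
    }

    @Override
    public Map<String, String> store(Map<String, String> doc) {
        Map<String, String> current = doc;
        for (DocStage stage : stages) {
            if (current == null) {
                return null; // an earlier stage dropped the document
            }
            current = stage.store(current);
        }
        return current;
    }
}

// Hypothetical converter: trims the "title" field before the document is stored.
class TitleTrimmer implements DocStage {
    @Override
    public Map<String, String> store(Map<String, String> doc) {
        Map<String, String> copy = new LinkedHashMap<>(doc);
        copy.computeIfPresent("title", (key, value) -> value.trim());
        return copy;
    }
}

// Hypothetical final stage standing in for LuceneStorage: keeps all fields as given.
class RecordingStorage implements DocStage {
    final List<Map<String, String>> saved = new ArrayList<>();

    @Override
    public Map<String, String> store(Map<String, String> doc) {
        saved.add(doc);
        return doc;
    }
}
```

Because the converter runs first, the final stage only ever sees preprocessed fields and can stay
a simple "save everything" component.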
2002-06-17 (cmarschner)
* moved HostInfo and HostManager to larm.net package
* included URLNormalizer (todo: source code docs)
* changed filters to use normalized URLs when appropriate;
logs contain normalized version of referer and URL now
(todo: change description of log format in technical_overview.rtf)
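
Note: this file does not list the rules URLNormalizer applies. As a rough illustration of what
URL normalization commonly involves, the sketch below lower-cases scheme and host, drops the
default HTTP port and the fragment, resolves "." and ".." path segments, and uses "/" for an
empty path. The class UrlNormalizationSketch and its rules are assumptions built on java.net.URI;
the real URLNormalizer may behave differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; not the URLNormalizer shipped with LARM.
class UrlNormalizationSketch {
    static String normalize(String url) {
        try {
            // URI.normalize() resolves "." and ".." segments in the path.
            URI u = new URI(url).normalize();
            String scheme = u.getScheme() == null ? null : u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost() == null ? null : u.getHost().toLowerCase(Locale.ROOT);
            int port = u.getPort();
            if ("http".equals(scheme) && port == 80) {
                port = -1; // drop the default HTTP port
            }
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            // Passing null as the last argument drops the fragment.
            return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid URL: " + url, e);
        }
    }
}
```

With rules like these, two spellings of the same address compare equal in the filters and appear
identically in the logs.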
2002-06-01 (cmarschner)
* divided Storage into LinkStorage and DocumentStorage
* introduced StoragePipeline, made MessageHandler a LinkStorage. Fetcher now stores everything in storages
* removed a couple of unused classes
now everything's prepared for a LuceneStorage
* added build.xml by Mehran Mehr
2002-05-23 (cmarschner)
* removed 0x0d0d from the source files (Otis?)
* added the Apache License to all source files in the de.lanlab.larm.* directories
* added anchor text deparsing to the Tokenizer
* split store.log into two files:
- store.log contains the page file index: <referer> <URL> <ResultCode> <MimeType> <Size> <Title> <PageFileNo> <PageFileOffset>
- links.log contains link information: <referer> <URL> <isFrame> <AnchorText>
* changed lib to libs in the startup scripts
* added .bat files for Windows
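
Note: a minimal sketch of reading one links.log line in the layout given above
(<referer> <URL> <isFrame> <AnchorText>). The field separator and the encoding of isFrame are not
stated in this file; the sketch assumes tab-separated fields and "true"/"false" for isFrame.
LinkLogEntry is a hypothetical class, not part of LARM.

```java
// Hypothetical value class for one links.log line; not part of LARM.
class LinkLogEntry {
    final String referer;
    final String url;
    final boolean isFrame;
    final String anchorText;

    LinkLogEntry(String referer, String url, boolean isFrame, String anchorText) {
        this.referer = referer;
        this.url = url;
        this.isFrame = isFrame;
        this.anchorText = anchorText;
    }

    // Assumes tab-separated fields. The limit of 4 keeps everything after the third
    // tab, including an empty anchor text, in the last field.
    static LinkLogEntry parse(String line) {
        String[] fields = line.split("\t", 4);
        if (fields.length < 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        // Assumes isFrame is written as "true" or "false".
        return new LinkLogEntry(fields[0], fields[1], Boolean.parseBoolean(fields[2]), fields[3]);
    }
}
```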