lucene/sandbox/contributions/webcrawler-LARM/TODO.txt

TODOs for 1.0 (not yet ordered by decreasing priority)
$id: $
* Bugs
- on very fast LAN connections (100 Mbit/s), sockets are not freed as fast as they are allocated
- some relative URLs are not resolved correctly against their base URL, leading to wrong and ever-growing URLs
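
A minimal sketch of how relative links could be resolved so that URLs do not grow, assuming the standard java.net.URI class is acceptable for this. The class and method names are hypothetical and not part of LARM:

```java
import java.net.URI;

// Hypothetical helper, not part of the LARM code base.
class RelativeUrlResolverSketch {

    // Resolves a link found on a page against the URL of that page.
    // URI.resolve() merges the paths and normalize() removes "." and
    // ".." segments, so repeated resolution does not make the URL grow.
    static String resolve(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        return base.resolve(link.trim()).normalize().toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.lmu.de/a/b/index.html", "../c/page.html"));
    }
}
```

Naive string concatenation would turn "http://www.lmu.de/a/" plus "../a/" into a longer URL on every visit; resolving as above gives back "http://www.lmu.de/a/".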
* Build
- build.xml has been added, but build.bat and build.sh still work without Ant; change them to use Ant
* LuceneStorage
- define a configurable interface that saves fetched pages into a Lucene index
* Configuration
- move all configuration settings into a meaningful properties file
* URLs:
- include a URLNormalizer
  * lowercase host names
  * avoid ambiguities like '%20' / '+'
  * make sure http://host URLs end with "/"
  * avoid host name aliases
    - two host names / one IP address can point to the same web site: www.lmu.de / www.uni-muenchen.de
    - two host names / one IP address can point to different web sites (then other URLs / pages must differ):
      suche.lmu.de / interesse.lmu.de
  * cater for 301/302 redirect result codes
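
A rough sketch of what part of such a URLNormalizer might look like, using only java.net.URI. The class name is hypothetical. Rewriting '+' to '%20' in the query string only, dropping the fragment, and dropping the default port 80 are assumptions made for illustration. Host name aliases and 301/302 handling are not covered here:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch, not the actual LARM URLNormalizer.
class UrlNormalizerSketch {

    static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url).normalize();   // also removes "." and ".." segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with host expected: " + url);
        }
        // lowercase scheme and host name
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // assumption: drop the default HTTP port
        int port = uri.getPort();
        if (port == 80 && scheme.equals("http")) {
            port = -1;
        }
        // make sure http://host URLs end with "/"
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // assumption: treat '+' in the query as an encoded space and
        // always write it as %20 to avoid two spellings of one URL
        String query = uri.getRawQuery();
        if (query != null) {
            query = query.replace("+", "%20");
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();                 // the fragment is dropped on purpose
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("HTTP://WWW.LMU.DE"));
    }
}
```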
* Repository
- optionally use a database as repository (caches, queues, logs)
- if done so, use URL reordering to speed things up
* Tests
- Put all tests into a JUnit test suite
* Distribution
- optionally send messages through a JMS topic
- create an executable that installs a source (like JMS, page files) and a storage pipeline
- partition the URL space for distributed Fetchers
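
One possible way to partition the URL space is by host name, so that all URLs of a host go to the same Fetcher. The sketch below assumes a fixed number of fetchers and simple hash partitioning; the class and method names are hypothetical:

```java
import java.util.Locale;

// Hypothetical sketch of host-based partitioning, not part of LARM.
class UrlSpacePartitionerSketch {

    // Returns the index (0 .. fetcherCount-1) of the fetcher responsible
    // for the given host. Lowercasing first keeps "WWW.LMU.DE" and
    // "www.lmu.de" in the same partition; floorMod avoids negative
    // results for negative hash codes.
    static int partitionFor(String host, int fetcherCount) {
        if (fetcherCount <= 0) {
            throw new IllegalArgumentException("fetcherCount must be positive");
        }
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), fetcherCount);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("www.lmu.de", 4));
    }
}
```

Keeping a host on one fetcher also makes per-host politeness limits easier to enforce, at the cost of uneven load when a few hosts dominate the crawl.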
* Speed
- avoid synchronization delays by putting several URLMessages into one FetcherTask
* Services
- clean up ThreadMonitor
- incorporate a cron-like service that enables timed GC'ing, batched data transfer, and monitoring
* Politeness
- add the option to restrict the number of host accesses per hour/minute
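
A minimal sketch of such a limit, assuming a sliding time window per host. All names are hypothetical. The current time is passed in as a parameter so that the behaviour can be tested without waiting:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of LARM.
class HostAccessLimiterSketch {
    private final int maxAccesses;      // allowed accesses per window
    private final long windowMillis;    // e.g. 60_000 for "per minute"
    private final Map<String, Deque<Long>> accessTimes = new HashMap<>();

    HostAccessLimiterSketch(int maxAccesses, long windowMillis) {
        this.maxAccesses = maxAccesses;
        this.windowMillis = windowMillis;
    }

    // Returns true and records the access if the host may be contacted
    // at time nowMillis; returns false if the limit is already reached.
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Deque<Long> times = accessTimes.computeIfAbsent(
                host.toLowerCase(Locale.ROOT), h -> new ArrayDeque<>());
        // forget accesses that have left the window
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        if (times.size() >= maxAccesses) {
            return false;
        }
        times.addLast(nowMillis);
        return true;
    }

    public static void main(String[] args) {
        HostAccessLimiterSketch limiter = new HostAccessLimiterSketch(2, 60_000L);
        System.out.println(limiter.tryAcquire("www.lmu.de", System.currentTimeMillis()));
    }
}
```

A fetcher that gets false back would typically requeue the URL rather than block a thread.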
* Anchor text extraction
- read until a meaningful end tag, not just the first one encountered
- remove entities
- optionally remove tags, but keep the text of the ALT attribute
- remove redundant whitespace
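
The cleanup steps above (except finding the right end tag) could be sketched with regular expressions as follows. This is an illustration only: the class name is hypothetical, entities are replaced by a space rather than decoded, and only double-quoted ALT attributes are recognized:

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the LARM anchor text extractor.
class AnchorTextCleanerSketch {
    // a tag that carries a double-quoted ALT attribute; group 1 is its text
    private static final Pattern ALT_TAG = Pattern.compile(
            "<[^>]*\\balt\\s*=\\s*\"([^\"]*)\"[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String clean(String html) {
        String s = ALT_TAG.matcher(html).replaceAll(" $1 "); // keep ALT text
        s = TAG.matcher(s).replaceAll(" ");                  // remove remaining tags
        s = ENTITY.matcher(s).replaceAll(" ");               // remove entities
        s = SPACES.matcher(s).replaceAll(" ");               // collapse whitespace
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("<b>Faculty</b>&nbsp;of   <img src=\"x.gif\" alt=\"Physics\">"));
    }
}
```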
Nice-to-have:
* Stop and Continue (probably with database repository)
* "Hot Configure" from outside
* Web Interface
Next topic:
* Incremental crawling