Todos for 1.0 (not yet ordered in decreasing priority)
$Id$

-----------------------------------------------------------------------------------------------
solved:
-----------------------------------------------------------------------------------------------

Bugs:
- some relative URLs are not appended appropriately, leading to wrong and growing URLs
- 301/302 URLs were not updated: the docs were saved under the old URL, which led to
  wrong relative URLs
  (cmarschner, 2002-06-17)

URLs:
- include a URLNormalizer
  * lowercase host names
  * avoid ambiguities like '%20' / '+'
  * make sure http://host URLs end with "/"
  * avoid host name aliases
    - two host names / one IP address can point to the same web site:
      www.lmu.de / www.uni-muenchen.de
    - two host names / one IP address can point to different web sites
      (then other URLs / pages must differ):
      suche.lmu.de / interesse.lmu.de
  * cater for 301/302 result codes
  STATUS: seems to be solved, except that URL parameters can occur in different orders,
  which is NOT resolved. Host names are resolved by hand, via a synonym in HostManager.
  (cmarschner, 2002-06-17)
  problem: URLMessage size doubles

-----------------------------------------------------------------------------------------------
remaining:
-----------------------------------------------------------------------------------------------

* Bugs
  - on very fast LAN connections (100MBit), sockets are not freed as fast as they are
    allocated. This will probably be solved by changing from HTTPClient.* to the
    Jakarta HTTP client and reusing sockets
* Build
  - build.xml has been added, but build.bat and build.sh still build without Ant.
    Change that.
* LuceneStorage
  - define a configurable interface that saves fetched pages into a Lucene index
* Configuration
  - move all configuration stuff into a meaningful properties file
* Repository
  - optionally use a database as repository (caches, queues, logs)
  - if so, use URL reordering to speed things up
* Tests
  - put all tests into a JUnit test suite
* Distribution
  - optionally send messages through a JMS topic
  - create an executable that installs a source (like JMS, page files) and a storage pipeline
  - partition the URL space for distributed Fetchers
* Speed
  - avoid synchronization delays by putting several URLMessages into one FetcherTask
* Services
  - clean up ThreadMonitor
  - incorporate a CRON-like service that enables timed GC'ing, batched data transfer,
    and monitoring
* Politeness
  - add the option to restrict the number of host accesses per hour/minute
* URL Extraction
  - URLs can be encoded in different encoding styles
  - see http://www.unicode.org/unicode/faq/unicode_web.html
* I18N, HTML encoding
  - determine document encoding style in content-type, meta tag (http-equiv), or
    Doctype-tag; adapt URLs to encoding style
* Anchor text extraction
  - read until a meaningful end tag, not just the first encountered
  - remove entities
  - optionally remove tags, leave ALT attribute
  - remove redundant spaces
* URLNormalizer
  - add possibility to add synonyms to top level domains,
    e.g. "d1.com = d2.com" --> "sub1.d1.com = sub1.d2.com"
  - add possibility to detect synonyms automatically, e.g. by comparing IP addresses
    or file checksums

Nice-to-have:
* Stop and Continue (probably with database repository)
* "Hot Configure" from outside
* Web Interface

Next topic:
* Incremental crawling
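
Sketch for the open URLNormalizer items:
The notes above leave two normalization points open: URL parameters occurring in
different orders, and synonyms on domain level ("d1.com = d2.com"). The following is
a minimal sketch of how both might be handled, together with the already solved steps
(lowercase host names, trailing "/"). The class UrlNormalizerSketch and its methods
are hypothetical names used for illustration only; this is not the existing
URLNormalizer or HostManager code. It assumes absolute URLs, and it does not address
the '%20' / '+' ambiguity or automatic synonym detection.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the crawler's actual URLNormalizer. */
public class UrlNormalizerSketch {

    // Hand-maintained synonyms: alias domain -> canonical domain.
    private final Map<String, String> domainSynonyms = new LinkedHashMap<>();

    /** Registers "alias = canonical"; sub-domains of the alias are mapped as well. */
    public void addDomainSynonym(String alias, String canonical) {
        domainSynonyms.put(alias.toLowerCase(Locale.ROOT),
                           canonical.toLowerCase(Locale.ROOT));
    }

    /** Lowercases the host and applies the first matching domain synonym. */
    String canonicalHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : domainSynonyms.entrySet()) {
            String alias = e.getKey();
            if (h.equals(alias)) {
                return e.getValue();
            }
            if (h.endsWith("." + alias)) {
                // Keep the sub-domain prefix ("sub1."), swap only the domain part.
                return h.substring(0, h.length() - alias.length()) + e.getValue();
            }
        }
        return h;
    }

    /** Normalizes an absolute URL; fragment and user info are dropped. */
    public String normalize(String url) {
        URI u = URI.create(url).normalize(); // also collapses "." and ".." segments
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL expected: " + url);
        }
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // http://host -> http://host/
        }
        String query = u.getRawQuery();
        if (query != null) {
            // Sort parameters so that "b=2&a=1" and "a=1&b=2" compare equal.
            String[] params = query.split("&");
            Arrays.sort(params);
            query = String.join("&", params);
        }
        StringBuilder sb = new StringBuilder(u.getScheme().toLowerCase(Locale.ROOT));
        sb.append("://").append(canonicalHost(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }
}
```

With the synonym "uni-muenchen.de = lmu.de" registered via addDomainSynonym,
normalize("http://WWW.Uni-Muenchen.de/a?b=2&a=1") would yield
"http://www.lmu.de/a?a=1&b=2". Sorting parameters is only safe where a site does not
depend on parameter order, so it should probably stay optional.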