2002-06-01 14:55:16 -04:00
Todos for 1.0 (not yet ordered in decreasing priority)
2002-06-18 07:39:51 -04:00
$Id$
|
|
|
|
|
|
|
|
-----------------------------------------------------------------------------------------------
|
|
|
|
solved:
|
|
|
|
-----------------------------------------------------------------------------------------------

Bugs:
- some relative URLs are not appended appropriately, leading to wrong and growing URLs
- 301/302 URLs were not updated: the docs were saved under the old URL, which led to
  wrong relative URLs (cmarschner, 2002-06-17)
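
The 301/302 problem above can be illustrated with a minimal sketch. It assumes that relative links are resolved against the URL a page was finally served from, not the URL that was requested. The class and method names are hypothetical and not taken from the crawler's code; only java.net.URI is used.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the crawler's code base.
class RedirectBaseSketch {

    /** Resolves a link found in a page against the URL the page was finally served from. */
    static URI resolveLink(URI requestedUrl, String locationHeader, String href) {
        // A 301/302 Location header may itself be relative to the requested URL.
        URI finalUrl = (locationHeader == null)
                ? requestedUrl
                : requestedUrl.resolve(locationHeader);
        return finalUrl.resolve(href);
    }

    public static void main(String[] args) {
        URI requested = URI.create("http://www.example.com/docs");
        // The server redirects /docs to /docs/, so the link belongs below /docs/.
        System.out.println(resolveLink(requested, "/docs/", "page.html"));
        // prints http://www.example.com/docs/page.html
        // Ignoring the redirect resolves the same link one level too high.
        System.out.println(resolveLink(requested, null, "page.html"));
        // prints http://www.example.com/page.html
    }
}
```

Saving the document under the redirect target keeps the base URL and the stored URL identical, which is what prevents the wrong relative URLs.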

URLs:
- include a URLNormalizer
  * lowercase host names
  * avoid ambiguities like '%20' / '+'
  * make sure http://host URLs end with "/"
  * avoid host name aliases
    - two host names / one IP address can point to the same web site: www.lmu.de / www.uni-muenchen.de
    - two host names / one IP address can point to different web sites (then other URLs / pages must differ):
      suche.lmu.de / interesse.lmu.de
  * cater for 301/302 result codes
  STATUS: seems to be solved, except that URL parameters can occur in different orders, which is NOT resolved
  host names are resolved by hand, via a synonym in HostManager. (cmarschner, 2002-06-17)
  problem: URLMessage size doubles
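
A minimal sketch of the normalization steps listed above, plus one possible answer to the unresolved parameter-order problem (sorting the query parameters). This is not the existing URLNormalizer: the class name, the synonym map, and the decision to drop fragments are assumptions. It expects absolute http URLs and leaves the '%20' / '+' ambiguity untouched.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; expects absolute URLs with a host (for example http://host/path).
class UrlNormalizerSketch {

    // Hand-maintained host synonyms (alias -> canonical host); the entry is only an example.
    static final Map<String, String> HOST_SYNONYMS = Map.of("www.uni-muenchen.de", "www.lmu.de");

    static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        host = HOST_SYNONYMS.getOrDefault(host, host);

        // http://host and http://host/ name the same page.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Sort the parameters so that ?b=2&a=1 and ?a=1&b=2 compare equal.
        String query = uri.getRawQuery();
        if (query != null) {
            String[] params = query.split("&");
            Arrays.sort(params);
            query = String.join("&", params);
        }

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString(); // the fragment is dropped on purpose
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://WWW.Uni-Muenchen.DE"));
        System.out.println(normalize("http://www.lmu.de/search?b=2&a=1"));
    }
}
```

Sorting parameters is only safe if the server does not depend on their order, so it would probably have to be an option of the normalizer.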
|
|
|
|
|
|
|
|
-----------------------------------------------------------------------------------------------
|
|
|
|
remaining:
|
|
|
|
-----------------------------------------------------------------------------------------------

* Bugs
  - on very fast LAN connections (100 MBit), sockets are not freed as fast as they are allocated
    probably this will be solved by switching from HTTPClient.* to the Jakarta HTTP client and reusing sockets

* Build
  - added build.xml, but build.bat and build.sh still work without Ant. Change that.

* LuceneStorage
  - define a configurable interface that saves fetched pages into a Lucene index

* Configuration
  - move all configuration stuff into a meaningful properties file
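
One way this could look, as a sketch only: settings are read through java.util.Properties into a small immutable object. The keys fetcher.threads and crawl.startUrl and the class name are invented for illustration; the real set of options would have to be collected from the code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical configuration holder; the property keys are made up for this sketch.
class CrawlerConfigSketch {
    final int fetcherThreads;
    final String startUrl;

    CrawlerConfigSketch(Properties p) {
        // Fall back to a default when a key is missing from the file.
        fetcherThreads = Integer.parseInt(p.getProperty("fetcher.threads", "10"));
        startUrl = p.getProperty("crawl.startUrl");
    }

    /** Reads key=value lines in the standard java.util.Properties format. */
    static CrawlerConfigSketch load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new CrawlerConfigSketch(p);
    }
}
```

A file for this sketch would contain lines such as `crawl.startUrl=http://www.lmu.de/` and `fetcher.threads=4`.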

* Repository
  - optionally use a database as repository (caches, queues, logs)
  - if done so, use URL reordering to speed things up

* Tests
  - Put all tests into a JUnit test suite

* distribution
  - optionally send messages through a JMS topic.
  - create an executable that installs a source (like JMS, page files) and a storage pipeline
  - partition the URL space for distributed Fetchers
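
A possible partitioning scheme, sketched under the assumption that URLs are assigned to Fetchers by host name, so that all pages of one host end up at the same Fetcher. The class name and the simple modulo hashing are illustrative, not a decided design.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch: assigns every absolute URL to one of fetcherCount partitions.
class UrlPartitionSketch {

    static int partitionFor(String url, int fetcherCount) {
        // Hash only the lowercased host, so all URLs of one host share a partition.
        String host = URI.create(url).getHost().toLowerCase(Locale.ROOT);
        // floorMod keeps the result in [0, fetcherCount) even for negative hash codes.
        return Math.floorMod(host.hashCode(), fetcherCount);
    }
}
```

Partitioning by host rather than by full URL means per-host bookkeeping can stay local to a single Fetcher. Changing the number of Fetchers reshuffles most hosts with this scheme.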

* Speed
  - avoid synchronization delays by putting several URLMessages into one FetcherTask

* Services
  - clean up ThreadMonitor
  - incorporate a CRON-like service that enables timed GC'ing, batched data transfer, and
    monitoring

* Politeness
  - add the option to restrict the number of host accesses per hour/minute
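
The restriction could be sketched as a sliding-window counter per host. The class name and the explicit time parameter (which keeps the sketch testable) are assumptions; how it would hook into the existing fetcher threads is left open.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: allows at most maxAccesses per host within any window of windowMillis.
class HostAccessLimiterSketch {
    private final int maxAccesses;
    private final long windowMillis;
    private final Map<String, Deque<Long>> accessTimes = new HashMap<>();

    HostAccessLimiterSketch(int maxAccesses, long windowMillis) {
        this.maxAccesses = maxAccesses;
        this.windowMillis = windowMillis;
    }

    /** Returns true and records the access if the host may be contacted at nowMillis. */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Deque<Long> times = accessTimes.computeIfAbsent(host, h -> new ArrayDeque<>());
        // Forget accesses that have left the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        if (times.size() >= maxAccesses) {
            return false; // the caller should requeue the URL and try again later
        }
        times.addLast(nowMillis);
        return true;
    }
}
```

For a per-minute limit, a caller would construct it as `new HostAccessLimiterSketch(30, 60_000)` and pass `System.currentTimeMillis()`; a per-hour limit only changes the window length.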

* URL Extraction
  - URLs can be encoded in different ways; see http://www.unicode.org/unicode/faq/unicode_web.html

* I18N, HTML encoding
  - determine the document encoding from the Content-Type header, the meta tag (http-equiv), or the
    Doctype tag; adapt URLs to that encoding
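
A minimal sketch of the first two detection steps, assuming the HTTP header takes precedence over the meta tag and a caller-supplied fallback is used otherwise. The regular expressions are deliberately simple and the class name is hypothetical; the Doctype case is not covered.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds a charset name in the HTTP header or a meta http-equiv tag.
class CharsetSniffSketch {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static String detect(String contentTypeHeader, String htmlHead, String fallback) {
        // 1. The Content-Type response header, e.g. "text/html; charset=utf-8".
        String charset = find(contentTypeHeader);
        // 2. Otherwise a <meta http-equiv="Content-Type" content="..."> tag in the page.
        if (charset == null && htmlHead != null) {
            Matcher meta = META_CONTENT_TYPE.matcher(htmlHead);
            if (meta.find()) {
                charset = find(meta.group());
            }
        }
        return charset != null ? charset : fallback;
    }

    private static String find(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(s);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }
}
```

The returned name could then be passed to `java.nio.charset.Charset.forName` before decoding the page and its URLs; unknown names would need their own fallback.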

* Anchor text extraction
  * read until a meaningful end tag, not just the first encountered
  * remove entities
  * optionally remove tags, but keep the text of ALT attributes
  * remove redundant spaces
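
The last three points could be sketched as follows. The sketch assumes the inner HTML of the anchor has already been cut out (finding the right end tag is not covered), and it decodes only a handful of common entities rather than removing them. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: turns the inner HTML of an <a> element into plain anchor text.
class AnchorTextSketch {
    // An <img> tag with a quoted ALT attribute; group 2 holds the ALT text.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*?\\balt\\s*=\\s*([\"'])(.*?)\\1[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String clean(String anchorHtml) {
        // Keep the ALT text of images, then drop every remaining tag.
        String text = IMG_ALT.matcher(anchorHtml).replaceAll(" $2 ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; comes last to avoid double decoding.
        text = text.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&amp;", "&");
        // Collapse runs of whitespace.
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("<b>Home</b>  <img src='h.gif' alt='start page'> &amp;  more"));
        // prints Home start page & more
    }
}
```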

* URLNormalizer
  * add possibility to add synonyms to top level domains, e.g. "d1.com = d2.com" --> "sub1.d1.com = sub1.d2.com"
  * add possibility to detect synonyms automatically, e.g. by comparing IP addresses or file checksums
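
The first point might work like the sketch below: a domain-level synonym is applied to the domain itself and to every host below it. The map layout (alias domain -> canonical domain) and the class name are assumptions; automatic detection is not covered.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: extends a domain synonym such as d2.com = d1.com to all subdomains.
class DomainSynonymSketch {

    static String canonicalHost(String host, Map<String, String> domainSynonyms) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : domainSynonyms.entrySet()) {
            String alias = e.getKey();
            if (h.equals(alias)) {
                return e.getValue();
            }
            // Match whole labels only: "sub1.d2.com" matches, "bad2.com" does not.
            if (h.endsWith("." + alias)) {
                return h.substring(0, h.length() - alias.length()) + e.getValue();
            }
        }
        return h;
    }
}
```

With the synonym map `Map.of("d2.com", "d1.com")`, the host sub1.d2.com is rewritten to sub1.d1.com, while hosts outside d2.com are only lowercased.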

Nice-to-have:

* Stop and Continue (probably with database repository)
* "Hot Configure" from outside
* Web Interface

Next topic:
* Incremental crawling