<?xml version="1.0" encoding="ISO-8859-1"?>
|
|
|
|
|
<document>
|
|
|
|
|
|
|
|
|
|
<properties>
|
|
|
|
|
<author email="cmarschner at apache dot org">Clemens Marschner</author>
|
|
|
|
|
<title>LARM Webcrawler - Technical Overview</title>
|
|
|
|
|
</properties>
|
|
|
|
|
|
|
|
|
|
<meta name="keyword" content="jakarta, java, LARM, WebCrawler,
|
|
|
|
|
Crawler, Fetcher"/>
|
|
|
|
|
|
|
|
|
|
<body>
|
|
|
|
|
|
|
|
|
|
<section name="The LARM Web Crawler - Technical Overview">
|
|
|
|
|
|
|
|
|
|
<p align="center">Author: Clemens Marschner</p>
|
|
|
|
|
|
|
|
|
|
<p align="center">Revised: Oct. 28, 2002</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
This document describes the configuration parameters and the inner
|
|
|
|
|
workings of the LARM web crawler.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<subsection name="Purpose and Intended Audience">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
This document was made for Lucene developers, not necessarily with any
|
|
|
|
|
background knowledge on crawlers, to understand the inner workings of
|
|
|
|
|
the LARM crawler, the current problems and some directions for future
|
|
|
|
|
development. The aim is to keep the entry costs low for people who have
|
|
|
|
|
an interest in developing this piece of software further.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
It may also serve for actual users of the Lucene engine, but beware,
|
|
|
|
|
since there is a lot that will change in the near future, especially the
|
|
|
|
|
configuration.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Quick Start">
|
|
|
|
|
|
|
|
|
|
<p>The crawler is only accessible via anonymous CVS at this time. See
|
|
|
|
|
the <a href="http://jakarta.apache.org/site/cvsindex.html">Jakarta CVS
|
|
|
|
|
instructions</a> for details if you're not familiar with CVS.</p>
|
|
|
|
|
|
|
|
|
|
<p>Too long? The following will give you a quick start: create a new
|
|
|
|
|
directory, say, "jakarta", make it your current directory, and type</p>
|
|
|
|
|
|
|
|
|
|
<source><![CDATA[cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic
|
|
|
|
|
login
|
|
|
|
|
password: anoncvs
|
|
|
|
|
cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic checkout
|
|
|
|
|
jakarta-lucene-sandbox]]></source>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The crawler will then be in
|
|
|
|
|
jakarta-lucene-sandbox/contributions/webcrawler-LARM. To compile it you will also need
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
|
|
|
|
|
<li>the <a
href="http://www.innovation.ch/java/HTTPClient/">HTTPClient</a> library. Put the .zip file into the libs/ directory</li>
|
|
|
|
|
|
|
|
|
|
<li>a working installation of <a
|
|
|
|
|
href="http://jakarta.apache.org/ant">ANT</a> (1.5 or above recommended). ant.sh/.bat should be in your
|
|
|
|
|
PATH</li>
|
|
|
|
|
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Change to the webcrawler-LARM directory and type
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<source>ant</source>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
You should then have a working copy of LARM in
|
|
|
|
|
build/webcrawler-LARM-0.5.jar. See the section <a href="#syntax">Syntax</a> below on how to
|
|
|
|
|
start the crawler.<br/>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section name="Introduction">
|
|
|
|
|
|
|
|
|
|
<subsection name="Why web crawlers at all?">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Web crawlers became necessary because the web standard protocols didn't
|
|
|
|
|
contain any mechanisms to inform search engines that the data on a web
|
|
|
|
|
server had been changed. If this were possible, a search engine could
|
|
|
|
|
be notified in a "push" fashion, which would simplify the total process
|
|
|
|
|
and would make indexes as current as possible.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Imagine a web server that notifies another web server that a link was
|
|
|
|
|
created from one of its pages to the other server. That other server
|
|
|
|
|
could then send a message back if the page was removed.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
On the other hand, handling this system would be a lot more
|
|
|
|
|
complicated. Keeping distributed information up to date is an error-prone task. Even
|
|
|
|
|
in a single relational database it is often complicated to define and
|
|
|
|
|
handle dependencies between relations. Should it be possible to allow
|
|
|
|
|
inconsistencies for a short period of time? Should dependent data be
|
|
|
|
|
deleted if a record is removed? Handling relationships between clusters of
information introduces a whole new level of complexity.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
In order to keep the software (web servers and browsers) simple, the
|
|
|
|
|
inventors of the web concentrated on just a few core elements - URLs for
|
|
|
|
|
(more or less) uniquely identifying distributed information, HTTP for
|
|
|
|
|
handling the information, and HTML for structuring it. That system was
|
|
|
|
|
so simple that one could understand it in a very short time. This is
|
|
|
|
|
probably one of the main reasons why the WWW became so popular. Well,
|
|
|
|
|
another one would probably be coloured, moving graphics of naked people.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
But the WWW has some major disadvantages: There is no single index of
|
|
|
|
|
all available pages. Information can change without notice. URLs can
|
|
|
|
|
point to pages that no longer exist. There is no mechanism to get "all"
|
|
|
|
|
pages from a web server. The whole system is in a constant process of
|
|
|
|
|
change. And after all, the whole thing is growing at phenomenal rates.
|
|
|
|
|
Building a search engine on top of that is not something you can do on a
|
|
|
|
|
Saturday afternoon. Given the sheer size, it would take months to search
|
|
|
|
|
through all the pages in order to answer a single query, even if we had
|
|
|
|
|
a means to get from server to server, get the pages from there, and
|
|
|
|
|
search them. But we don't even know how to do that, since we don't know
|
|
|
|
|
all the web servers.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
That first problem was addressed by bookmark collections, which soon
|
|
|
|
|
became very popular. The most popular was probably Yahoo, which evolved
into one of the most popular sites on the web just a year after it emerged
|
|
|
|
|
from a college dorm room.
|
|
|
|
|
The second problem was how to get the information from all those pages
|
|
|
|
|
lying around. This is where a web crawler comes in.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Ok, those engineers said, we are not able to get a list of all the
|
|
|
|
|
pages. But almost every page contains links to other pages. We can save a
|
|
|
|
|
page, extract all the links, and load all of the pages these links
|
|
|
|
|
point to. If we start at a popular location which contains a lot of links,
|
|
|
|
|
like Yahoo for example, chances are good that we can get "all" pages
|
|
|
|
|
on the web.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
A little more formally, the web can be seen as a directed graph, with
|
|
|
|
|
pages as nodes and links as edges between them. A web crawler, also
|
|
|
|
|
called "spider" or "fetcher", uses the graph structure of the web to get
|
|
|
|
|
documents in order to be able to index them. Since there is no "push"
|
|
|
|
|
mechanism for updating our index, we need to "pull" the information on
|
|
|
|
|
our own, by repeatedly crawling the web.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Implementation - the first attempt">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
"Easy", you may think now, "just implement what he said in the
|
|
|
|
|
paragraph before." So you start fetching a page, extracting the links, and following
the links to all the pages you have not already visited. In Perl that can be done in
|
|
|
|
|
a few lines of code.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
But then, very soon (I can tell you), you run into a lot of problems:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>a server doesn't respond. Your program always waits for it to time
|
|
|
|
|
out
|
|
|
|
|
</li><li>you get OutOfMemory errors soon after the beginning
|
|
|
|
|
</li><li>your hard drive fills up
|
|
|
|
|
</li><li>You notice that one page is loaded again and again,
because its URL changed a little
|
|
|
|
|
</li><li>Some servers will behave very strangely. They will respond after
|
|
|
|
|
30 seconds, sometimes they time out, sometimes they are not accessible
|
|
|
|
|
at all
|
|
|
|
|
</li><li>some URLs will get longer and longer. Suddenly you will get
|
|
|
|
|
URLs with a length of thousands of characters.
|
|
|
|
|
</li><li>But the main problem will be: you notice that your network
|
|
|
|
|
interface card (NIC) is waiting, and your CPU is waiting. What's going on?
|
|
|
|
|
The overall process will take days
|
|
|
|
|
</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<subsection name="Features of the LARM crawler">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The LARM web crawler is the result of experience with the problems
mentioned above, combined with a lot of monitoring to get the maximum out
of the given system resources. It was designed with several different
|
|
|
|
|
aspects in mind:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
|
|
|
|
|
<li>Speed. This involves balancing the resources to prevent
|
|
|
|
|
bottlenecks. The crawler is multithreaded. A lot of work went into avoiding
synchronization between threads, e.g. by rewriting or replacing standard
Java classes, since synchronization slows down multithreaded programs a lot
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li>Simplicity. The underlying scheme is quite modular and
|
|
|
|
|
comprehensible. See the description of the pipeline below
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Power. The modular design and the ease of the Java language make
|
|
|
|
|
customisation simple
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li>Scalability. The crawler was supposed to be able to crawl <i>large
|
|
|
|
|
intranets</i> with hundreds of servers and hundreds of thousands of
|
|
|
|
|
documents within a reasonable amount of time. It was not meant to be
|
|
|
|
|
scalable to the whole Internet.</li>
|
|
|
|
|
|
|
|
|
|
<li>Java. Although there were many crawlers around at the time when I
started to think about it (in summer 2000), I couldn't find a good
implementation available in Java. If this crawler were to be integrated
into a Java search engine, a homogeneous system would be an advantage. And
|
|
|
|
|
after all, I wanted to see if a fast implementation could be done in
|
|
|
|
|
this language.
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="What the crawler can do for you, and what it cannot
|
|
|
|
|
(yet)">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
What it can do for you:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>Crawl a distinct subset of the web, restricted only by a given regular
expression that all pages have to match. The pages are saved into page files
|
|
|
|
|
of max. 50 MB (look at FetcherMain.java for details) and an index file
|
|
|
|
|
that contains the mapping between each URL and its position in the page
|
|
|
|
|
file. Links are logged as well. This is part of the standard LogStorage.
|
|
|
|
|
Other storages exist as well (see below)
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Crawling is done breadth first. Hosts are accessed in a round-robin
|
|
|
|
|
manner, to prevent the situation that all threads access one host at
|
|
|
|
|
once. However, at the moment there is no means to throttle access to a
|
|
|
|
|
server - the crawler works as fast as it can. There are also some
|
|
|
|
|
problems with this technique, as will be described below.
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>The main part of the crawler is implemented as a pool of concurrent
|
|
|
|
|
threads, which speeds up I/O access
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>The HTML link extractor has been optimised for speed. It was made
|
|
|
|
|
10 x faster than a generic SAX parser implementation
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>A lot of logging and monitoring is done, to be able to track down
what is going on inside
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Many parts of the crawler have already been optimised to
consume no more memory than needed. Many of the internal queues are cached
|
|
|
|
|
on hard drive, for example. Only the HashMap of already crawled pages
|
|
|
|
|
and the HostInfo structures still completely remain in memory, thus
|
|
|
|
|
limiting the number of crawled hosts and the number of crawled pages. At
|
|
|
|
|
the moment, OutOfMemory errors are not prevented, so beware.
|
|
|
|
|
|
|
|
|
|
</li><li>URLs are passed through a pipeline of filters that limit, for
|
|
|
|
|
example, the length of a URL, load robots.txt the first time a host is
|
|
|
|
|
accessed, etc. This pipeline can be extended easily by adding a Java
|
|
|
|
|
class to the pipeline.
|
|
|
|
|
</li><li>The storage mechanism is also pluggable. One of the next
|
|
|
|
|
issues would be to include this storage mechanism into the pipeline, to
|
|
|
|
|
allow a separation of logging, processing, and storage
|
|
|
|
|
</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
On the other hand, at the time of this writing, the crawler has not yet
|
|
|
|
|
evolved into a production release. The reason is: until now, it just
|
|
|
|
|
served me alone.</p>
|
|
|
|
|
<p>
|
|
|
|
|
|
|
|
|
|
These issues remain:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>
|
|
|
|
|
Still, there's a relatively high memory overhead <i>per server</i> and
|
|
|
|
|
also some overhead <i>per page</i>, especially since a map of already
|
|
|
|
|
crawled pages is held in memory at this time. Although some of the
|
|
|
|
|
in-memory structures are already put on disc, memory consumption is still
|
|
|
|
|
linear in the number of pages crawled. We have lots of ideas on how to
move that out of main memory but, until then, with 500 MB RAM for example, the crawler
scales up to some 100,000 files on a few hundred hosts.
|
|
|
|
|
|
|
|
|
|
</li><li>It's not polite. It hammers the servers as fast as it can, which can cause
Denial of Service (DoS) problems. We have plans to restrict concurrent
|
|
|
|
|
server accesses to only one, but that's not yet implemented.
|
|
|
|
|
|
|
|
|
|
</li><li>Only some of the configuration can be done with command line
|
|
|
|
|
parameters. The pipeline is put together in the startup procedure.
|
|
|
|
|
Configuration will be done from within an XML file in the future. We are
|
|
|
|
|
planning to use the Avalon framework for that.
|
|
|
|
|
|
|
|
|
|
</li><li>The ThreadMonitor is very experimental. It has evolved from a
|
|
|
|
|
pure monitoring mechanism to a central part of the whole crawler. It
|
|
|
|
|
should probably be refactored.
|
|
|
|
|
|
|
|
|
|
</li><li>Speed could still be optimized. Synchronization takes place
|
|
|
|
|
too often. At the moment URLs and documents are passed one at a
time. This should be changed so that groups of messages are passed in one
turn.
|
|
|
|
|
|
|
|
|
|
</li><li>The Lucene integration is pretty experimental, and also pretty
|
|
|
|
|
slow. It forms a major bottleneck if you use it.
|
|
|
|
|
|
|
|
|
|
</li><li>No processing whatsoever is done on the documents (except
|
|
|
|
|
extracting the links). It should be decided how much of this is supposed to
|
|
|
|
|
be done within the crawler, and what should be done in a post
|
|
|
|
|
processing step.
|
|
|
|
|
|
|
|
|
|
</li><li>Unix is the favoured operating system. I used a SUSE Linux
|
|
|
|
|
with 2.2 kernel. I remember that I ran into problems with the I/O routines
|
|
|
|
|
on Windows machines. I haven't tried it for a long time now, though.
|
|
|
|
|
|
|
|
|
|
</li><li>Only HTTP is supported; there is no file server crawling with recursive
directory traversal, etc.
|
|
|
|
|
|
|
|
|
|
</li><li>
|
|
|
|
|
I noticed on high-bandwidth (100 MBit/s) systems that system sockets were
slowly used up, although the Java code seemed to be ok.
|
|
|
|
|
|
|
|
|
|
</li>
|
|
|
|
|
</ul>
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Syntax and runtime behaviour">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<a name="syntax"></a>
|
|
|
|
|
|
|
|
|
|
The command line options are very simple:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<source><![CDATA[
|
|
|
|
|
java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
|
|
|
|
|
de.lanlab.larm.fetcher.FetcherMain
|
|
|
|
|
[-start STARTURL | @STARTURLFILE]+
|
|
|
|
|
-restrictto REGEX
|
|
|
|
|
[-threads[=10]]
|
|
|
|
|
-hostresolver HOSTRES.PROPERTIES
|
|
|
|
|
]]></source>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
|
|
|
|
|
<li><b>-start</b> a start URL or, alternatively, a file with start URLs
|
|
|
|
|
(one in each line). In the latter case the file name must be preceded
|
|
|
|
|
by '@'. It must be a valid http-URL, including the http prefix </li>
|
|
|
|
|
<li><b>-restrictto</b> a (Perl5) regular expression that all URLs have
to match. The expression is applied to the normalized version of a URL
|
|
|
|
|
(see below). If you are not familiar with regular expressions, see
|
|
|
|
|
<a href="http://www.perldoc.com/perl5.6.1/pod/perlre.html">The Perl Man
|
|
|
|
|
Pages</a>.</li>
|
|
|
|
|
|
|
|
|
|
<li><b>-threads</b> the number of concurrent threads that crawl the
|
|
|
|
|
pages. At this time, more than 25 threads don't provide any advantages
|
|
|
|
|
because synchronization effects and (probably) the overhead of the
|
|
|
|
|
scheduler slow the system down</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>Java runtime options:</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li><b>-server</b> starts the HotSpot VM in server mode, which starts
|
|
|
|
|
up a little slower, but is faster during the run. Recommended</li>
|
|
|
|
|
<li><b>-Xmx<ZZ>mb</b> sets the maximum size of the heap to
|
|
|
|
|
<ZZ> mb. Should be a lot. Set it to what you have.</li>
|
|
|
|
|
</ul>
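<p>
As an example, a crawl might be started like this (the start URL, the
regular expression, the heap size, and the jar/library names are
placeholders; adapt them to your setup):
</p>

<source><![CDATA[
java -server -Xmx512m \
     -classpath build/webcrawler-LARM-0.5.jar:libs/HTTPClient.zip \
     de.lanlab.larm.fetcher.FetcherMain \
     -start http://myhost.mydomain.com/ \
     -restrictto http://.*\.mydomain\.com/.* \
     -threads=10 \
     -hostresolver hostResolver.properties
]]></source>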
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
You may also want to have a look at the source code, because some
|
|
|
|
|
options cannot be dealt with from the outside at this time.<br/><br/>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>Other options</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Unfortunately, a lot of the options are still not configurable from the
|
|
|
|
|
outside. Most of them are configured from within FetcherMain.java.
|
|
|
|
|
However, others are still spread over some of the other classes. At this
|
|
|
|
|
time, we tried to put a "FIXME" comment around all these options, so
|
|
|
|
|
check out the source code. </p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>What happens now?</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ol>
|
|
|
|
|
|
|
|
|
|
<li>The filter pipeline is built. The ScopeFilter is initialised with
|
|
|
|
|
the expression given by restrictto</li>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li>The URL is put into the pipeline</li>
|
|
|
|
|
|
|
|
|
|
<li>The documents are fetched. If the mime type is text/html, links are
|
|
|
|
|
extracted and put back into the queue. The documents and URLs are
|
|
|
|
|
forwarded to the storage, which saves them</li>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<li>Meanwhile, every 5 seconds, the ThreadMonitor gathers statistics,
|
|
|
|
|
flushes log files, starts the garbage collection, and stops the fetcher
|
|
|
|
|
when everything seems to be done: all threads are idle, and nothing is
|
|
|
|
|
remaining in the queues</li>
|
|
|
|
|
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<subsection name="Normalized URLs">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
URLs are only supposed to make a web resource accessible. Unfortunately
|
|
|
|
|
there can be more than one representation of a URL, which can cause a
|
|
|
|
|
web crawler to save one file twice under different URLs. Two mechanisms
|
|
|
|
|
are used to reduce this error, while at the same time trying to keep
the opposite error low (two URLs are regarded as pointing to the same file, but
are in fact two different ones):</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>Common URL patterns are reduced to a known "canonical" form. For
example, the URLs <i>http://host</i> and <i>http://host/</i> point to the same
|
|
|
|
|
file. The latter is defined as the canonical form of a root URL. For a
|
|
|
|
|
complete listing of these actions, see below</li>
|
|
|
|
|
<li>Host names can be edited through a set of rules manually passed
|
|
|
|
|
to the <i>HostResolver</i>, which can be configured in a configuration
|
|
|
|
|
file.</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The result is used like a <i>stemming</i> function in IR systems: The
|
|
|
|
|
normalized form of a URL is used internally for comparisons, but to the
|
|
|
|
|
outside (i.e. for accessing the file), the original form is applied.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="URL patterns">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
These are the patterns applied by the <i>URLNormalizer</i>:
|
|
|
|
|
</p>
|
|
|
|
|
<ul>
|
|
|
|
|
<li>Host and path names are changed to lowercase. We know that on
|
|
|
|
|
Unix it is possible to have two files in one directory that differ
only in case, but we don't think this will be used very often. On
the other hand, on Windows systems case doesn't matter, so a lot of
links use URLs with mixed case</li>
|
|
|
|
|
<li>All characters except the ones marked 'safe' in the URL standard
|
|
|
|
|
are escaped. Spaces are always escaped as "%20" and not "+"</li>
|
|
|
|
|
<li>If Unicode characters are encountered, they are written as
|
|
|
|
|
escaped UTF-8 characters.</li>
|
|
|
|
|
<li>consecutive slashes '//' are collapsed into a single '/'; '/./' is reduced to '/'</li>
|
|
|
|
|
<li>index.* and default.* file names are removed</li>
|
|
|
|
|
</ul>
|
|
|
|
|
<p>Todo: '/../' is not handled yet</p>
|
|
|
|
|
<p>
|
|
|
|
|
Todo: Examples</p>
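<p>
Until real examples are added, here is a rough sketch of what the normalizer
would produce for some hypothetical URLs, derived from the rules above:
</p>

<source><![CDATA[
http://WWW.Example.COM              -> http://www.example.com/
http://www.example.com//a/./b.html  -> http://www.example.com/a/b.html
http://www.example.com/a/index.html -> http://www.example.com/a/
http://www.example.com/a b.html     -> http://www.example.com/a%20b.html
]]></source>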
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="The Host Resolver">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The host resolver is also applied when the URL normalization takes
|
|
|
|
|
place. It knows three different types of rules:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>startsWith: replace the start of a host name with something else if a
certain pattern is found at the start</li>
|
|
|
|
|
<li>endsWith: same with the end</li>
|
|
|
|
|
<li>synonym: treat a host name as a synonym to another</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
These three rules are applied to each URL in that order. For example, you
can tell the host resolver to always remove the "www." of each host,
thereby regarding "cnn.com" and "www.cnn.com" as equal, by defining the
|
|
|
|
|
rule <i>setStartsWith("www.","")</i>
|
|
|
|
|
</p>
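<p>
In code, such a rule set could be built up roughly as follows. This is only
a sketch: apart from <i>setStartsWith()</i>, which is mentioned above, the
method names are assumed to follow the same pattern and may differ in the
actual class.
</p>

<source><![CDATA[
// sketch: configuring a HostResolver programmatically (method names partly assumed)
HostResolver resolver = new HostResolver();
resolver.setStartsWith("www.", "");          // treat "www.cnn.com" and "cnn.com" as equal
resolver.setEndsWith(".somehost.com", ".somehost2.com");
resolver.setSynonym("daedalus.apache.org", "apache.org");
]]></source>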
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>Configuring HostResolver in a config file</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The HostResolver was a test of how config files could look if
they were implemented using standard Java mechanisms. We used the
|
|
|
|
|
Jakarta BeanUtils for these matters (see
|
|
|
|
|
<a href="http://jakarta.apache.org/commons/beanutils.html">The BeanUtils
|
|
|
|
|
website</a> for details) and implemented the HostResolver as a JavaBean. The
|
|
|
|
|
rules can then be stated as "mapped properties" in a
|
|
|
|
|
hostResolver.properties file (see the start syntax above). The only difference between
|
|
|
|
|
normal properties and the mapped properties as supported by BeanUtils is
|
|
|
|
|
that a second argument can be passed to the latter.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
An example of such a properties file would look like this:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<source><![CDATA[
|
|
|
|
|
#hostResolver.properties
|
|
|
|
|
#format:
|
|
|
|
|
# startsWith(part1) = part2
|
|
|
|
|
# if host starts with part1, this part will be replaced by part2
|
|
|
|
|
# endsWith(part1) = part2
|
|
|
|
|
# if host ends with part1, this part will be replaced by part2.
|
|
|
|
|
# This is done after startsWith was processed
|
|
|
|
|
# synonym(host1) = host2
|
|
|
|
|
# the keywords startsWith, endsWith and synonym are case sensitive
|
|
|
|
|
# host1 will be replaced with host2. this is done _after_ startsWith
|
|
|
|
|
# and endsWith was processed
|
|
|
|
|
|
|
|
|
|
startsWith(www.)=
|
|
|
|
|
startsWith(www2.)=
|
|
|
|
|
startsWith(w3.)=
|
|
|
|
|
endsWith(.somehost.com)=.somehost2.com
|
|
|
|
|
synonym(daedalus.apache.org)=apache.org
|
|
|
|
|
|
|
|
|
|
]]></source>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
As you can see, the file format itself is like the standard Java
|
|
|
|
|
<a href="http://java.sun.com/j2se/1.3/docs/api/java/util/Properties.html">Properties</a>
|
|
|
|
|
format (comments etc.). Keywords are case sensitive.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
At the time when the class was written, BeanUtils still had a bug
|
|
|
|
|
that dots "." were not supported in mapped property indexes. In
the new version (1.5 at the time of this writing) this is supposed to be
fixed, but I have not tried it yet. Therefore, the comma "," was made a
synonym for dots. Since "," is not allowed in domain names, you can
|
|
|
|
|
still use (and even mix) them if you want, or if you only have an older
|
|
|
|
|
BeanUtils version available.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<section name="Lucene Integration">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
LARM currently provides a very simple LuceneStorage that allows for
|
|
|
|
|
integrating the crawler with Lucene. It's meant to be a working example of
how this can be accomplished, not a final implementation. If you would like
to volunteer for that part, contributions are welcome.</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The current storage simply takes the input that comes from the crawler
|
|
|
|
|
(a <i>WebDocument</i> which mainly consists of name/value pairs with
|
|
|
|
|
the document's contents) and puts it into a Lucene index. Each name/value
|
|
|
|
|
pair is put into one field. There's currently no incremental or update
|
|
|
|
|
operation, or document caching via a RAMDirectory. Therefore the
|
|
|
|
|
LuceneStorage becomes a bottleneck even with slow network connections.</p>
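<p>
The core of what the LuceneStorage does can be sketched as follows, using
the Lucene API of that time. The <i>WebDocument</i> accessors and the field
names are assumptions; see the actual source for details.
</p>

<source><![CDATA[
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// sketch only: one Lucene field per name/value pair of the crawled document
class LuceneStorageSketch {
    void store(WebDocument webDoc) throws IOException {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("url", webDoc.getUrl()));           // assumed accessor
        doc.add(Field.Text("contents", webDoc.getContents())); // assumed accessor
        writer.addDocument(doc);
        writer.close();
    }
}
]]></source>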
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
See storage/LuceneStorage.java and fetcher/FetcherMain.java for
|
|
|
|
|
details</p>
|
|
|
|
|
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section name="Architecture">
|
|
|
|
|
|
|
|
|
|
<p>I studied the <a href="http://research.compaq.com/SRC/mercator/research.html">Mercator</a> web
|
|
|
|
|
crawler but decided to implement a somewhat different architecture.
|
|
|
|
|
Here is a high level overview of the default configuration:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<img src="/images/larm_architecture.jpg" width="568" height="460"
|
|
|
|
|
border="1" alt="Architecture Overview"/>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The message handler is an implementation of a simple <i>chain of
|
|
|
|
|
responsibility</i>. Implementations of <i>Message</i> are passed down a
|
|
|
|
|
filter chain. Each of the filters can decide whether to send the message
|
|
|
|
|
along, change it, or even delete it. In this case, Messages of type
|
|
|
|
|
URLMessage are used. The message handler runs in its own thread. Thus, a call
|
|
|
|
|
of <i>putMessage()</i> or <i>putMessages()</i> involves a
|
|
|
|
|
<i>producer-consumer</i>-like message transfer. The filters themselves run
|
|
|
|
|
within the message handler thread.
|
|
|
|
|
At the end of the pipeline the Fetcher distributes the incoming
|
|
|
|
|
messages to its worker threads. They are implemented as a <i>thread pool</i>:
|
|
|
|
|
Several <i>ServerThreads</i> are running concurrently and wait for
|
|
|
|
|
<i>Tasks</i> which include the procedure to be executed. If more tasks are
|
|
|
|
|
to be done than threads are available, they are kept in a queue, which
|
|
|
|
|
will be read whenever a task is finished.</p>
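<p>
A stripped-down sketch of this chain-of-responsibility idea is shown below.
The interface and method names are simplified assumptions, not LARM's actual
API, and the producer-consumer hand-off into the message handler's own
thread is left out.
</p>

<source><![CDATA[
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

interface Message { }

interface MessageListener {
    /** return the (possibly changed) message, or null to delete it */
    Message handleRequest(Message m);
}

class MessageHandler {
    private final List filters = new ArrayList();

    void addListener(MessageListener l) { filters.add(l); }

    void putMessage(Message m) {
        for (Iterator it = filters.iterator(); it.hasNext(); ) {
            m = ((MessageListener) it.next()).handleRequest(m);
            if (m == null) {
                return;   // a filter consumed or deleted the message
            }
        }
    }
}
]]></source>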
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
At this point the pipeline pattern is left. The <i>FetcherTask</i>
|
|
|
|
|
itself is still quite monolithic. It gets the document, parses it if
|
|
|
|
|
possible, and stores it into a storage. In the future one might think of
|
|
|
|
|
additional configurable processing steps within another processing
|
|
|
|
|
pipeline. I thought about incorporating it into the filter pipeline, but since
|
|
|
|
|
the filters are passive components and the <i>FetcherThreads</i> are
|
|
|
|
|
active, this didn't work.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<subsection name="Performance">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The performance was improved about 10-15 times compared to the first
|
|
|
|
|
naive attempts with a pre-built parser and Sun's network classes. And
|
|
|
|
|
there is still room left. On a network with about 150 web servers, which
|
|
|
|
|
the crawler server was connected to by a 100 MBit FDDS connection, I was
|
|
|
|
|
able to crawl an average of 60 documents per second, or 3.7 MB per second, after
a 10-minute startup period. In this first period, crawling is
|
|
|
|
|
slower because the number of servers is small, so the server output limits
|
|
|
|
|
crawling. There may also be servers that don't respond. They are
|
|
|
|
|
excluded from the crawl after a few attempts.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Overall, performance is affected by a lot of factors: The operating
|
|
|
|
|
system, the native interface, the Java libraries, the web servers, the
|
|
|
|
|
number of threads, whether dynamic pages are included in the crawl, etc.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
From a development side, the speed is affected by the balance between
|
|
|
|
|
I/O and CPU usage. Both have to be kept at 100%, otherwise one of them
|
|
|
|
|
becomes the bottleneck. Managing these resources is the central part of a
|
|
|
|
|
crawler.
|
|
|
|
|
Imagine that only one thread is crawling. This is the worst case, as
|
|
|
|
|
can be seen very fast:</p>
|
|
|
|
|
|
|
|
|
|
<br/>
|
|
|
|
|
<br/>
|
|
|
|
|
<img src="/images/larm_crawling-process.jpg" width="600" height="306"
|
|
|
|
|
border="1" alt="Crawling Process"/>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The diagram to the right resembles a UML sequence diagram, except that
|
|
|
|
|
it stresses the time that a message needs to traverse the network.</p>
|
|
|
|
|
|
|
|
|
|
<ol>
|
|
|
|
|
<li>The URL is processed somehow. That's the filter part as stated
|
|
|
|
|
above</li>
|
|
|
|
|
<li>The request is sent. It goes through the different network layers
|
|
|
|
|
of the crawler server. A TCP/IP connection is established. Several
|
|
|
|
|
packets are sent back and forth. Then the crawler waits until the web server
|
|
|
|
|
processes the request, looks up the file or renders the page (which can
|
|
|
|
|
take several seconds or even minutes), then sends the file to the
|
|
|
|
|
crawler.</li>
|
|
|
|
|
<li>The crawler receives packet after packet and combines them into a file.
|
|
|
|
|
Probably it is copied through several buffers until it is complete.
|
|
|
|
|
This will take some CPU time, but mostly it will wait for the next packet
|
|
|
|
|
to arrive. The network transfer by itself is also affected by a lot of
|
|
|
|
|
factors, i.e. the speed of the web server, acknowledgement messages,
|
|
|
|
|
resent packets, etc., so 100% network utilization will almost never be
|
|
|
|
|
reached.</li>
|
|
|
|
|
<li>The document is processed, which will take up the whole CPU. The
|
|
|
|
|
network will be idle at that time.</li>
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The storage process, which by itself uses CPU and disk I/O resources,
|
|
|
|
|
was left out here. That process will be very similar, although the
|
|
|
|
|
traversal will be faster.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
As you can see, both CPU and I/O are not used most of the time, and
|
|
|
|
|
wait for the other one (or the network) to complete. This is the reason
|
|
|
|
|
why single threaded web crawlers tend to be very slow (wget for example).
|
|
|
|
|
The slowest component always becomes the bottleneck.</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Two strategies can be followed to improve this situation:</p>
|
|
|
|
|
|
|
|
|
|
<ol>
|
|
|
|
|
<li>use asynchronous I/O</li>
|
|
|
|
|
<li>use several threads</li>
|
|
|
|
|
</ol>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Asynchronous I/O means that I/O requests are sent, but the crawler then
continues to process documents it has already crawled.</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Actually I haven't seen an implementation of this technique.
|
|
|
|
|
Asynchronous I/O was not available in Java until version 1.4. An advantage would
|
|
|
|
|
be that thread handling is also an expensive process in terms of CPU
|
|
|
|
|
and memory usage. Threads are resources and, thus, limited. I heard that
|
|
|
|
|
application server developers wanted asynchronous I/O, to be able to
|
|
|
|
|
cope with hundreds of simultaneous requests without spawning extra
|
|
|
|
|
threads for each of them. Probably this can be a solution in the future. But
|
|
|
|
|
from what I know about it today, it will not be necessary.</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The way this problem is solved usually in Java is with the use of
|
|
|
|
|
several threads. If many threads are used, chances are good that at any
|
|
|
|
|
given moment, at least one thread is in one of the states above, which
|
|
|
|
|
means both CPU and I/O will be at a maximum.</p>
|
|
|
|
|
<p>
|
|
|
|
|
The problem with this is that multithreaded programming is considered
|
|
|
|
|
to be one of the most difficult areas in computer science. But given
|
|
|
|
|
the simple linear structure of web crawlers, it is not very hard to avoid
|
|
|
|
|
race conditions or deadlock problems. You always get into problems
|
|
|
|
|
when threads are supposed to access shared resources, though. Don't touch
|
|
|
|
|
this until you have read the standard literature and have made at least
|
|
|
|
|
10 mistakes (and solved them!).</p>
|
|
|
|
|
<p>
|
|
|
|
|
Multithreading doesn't come without a cost, however. First, there is
|
|
|
|
|
the cost of thread scheduling. I don't have numbers for that in Java, but
|
|
|
|
|
I suppose that this should not be very expensive. MutExes can affect
|
|
|
|
|
the whole program a lot. I noticed that they should be avoided like
|
|
|
|
|
hell. In a crawler, a MutEx is used, for example, when a new URL is passed
|
|
|
|
|
to the thread, or when the fetched documents are supposed to be stored
|
|
|
|
|
linearly, one after the other.</p>
|
|
|
|
|
<p>
|
|
|
|
|
For example, the tasks used to insert a new URL into the global message
handler each time a new URL was found in a document. I was able
|
|
|
|
|
to speed it up considerably when I changed this so that the URLs are
|
|
|
|
|
collected locally and then inserted only once per document. Probably this
|
|
|
|
|
can be improved even further if each task comprises several
|
|
|
|
|
documents which are fetched one after the other and then stored
|
|
|
|
|
together.</p>
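<p>
As a sketch, the difference looks roughly like this (<i>putMessage()</i> and
<i>putMessages()</i> are the calls mentioned earlier; the parser loop and the
URLMessage constructor are made up for illustration):
</p>

<source><![CDATA[
// before: one synchronized hand-off to the message handler per extracted link
while (parser.hasMoreLinks()) {
    messageHandler.putMessage(new URLMessage(parser.nextLink()));
}

// after: collect the links locally, hand them over once per document
java.util.List found = new java.util.ArrayList();
while (parser.hasMoreLinks()) {
    found.add(new URLMessage(parser.nextLink()));
}
messageHandler.putMessages(found);
]]></source>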
|
|
|
|
|
<p>
|
|
|
|
|
Nonetheless, keeping the right balance between the two resources is a
|
|
|
|
|
big concern. At the moment, the number of threads and the number of
|
|
|
|
|
processing steps is static, and is only optimised by trial and error. Few
|
|
|
|
|
hosts, slow network -> few threads; slow CPU -> few processing steps;
many hosts, fast network -> many threads. Probably those heuristics will
|
|
|
|
|
do well, but I wonder if these figures could also be fine-tuned
|
|
|
|
|
dynamically during runtime.</p>
|
|
|
|
|
<p>
|
|
|
|
|
Another issue that was optimised was very fine-grained method calls.
|
|
|
|
|
For example, the original implementation of the HTML parser used to call
|
|
|
|
|
the read() method for each character. This call probably had to
traverse several Decorators until it got to a synchronized call. That's why
|
|
|
|
|
the CharArrayReader was replaced by a SimpleCharArrayReader, because
|
|
|
|
|
only one thread works on a document at a time.</p>
|
|
|
|
|
<p>
|
|
|
|
|
These issues can only be traced down with special tools, i.e.
|
|
|
|
|
profilers. The work is worth it, because it allows one to work on the 20% of the
|
|
|
|
|
code that costs 80% of the time.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Memory Usage">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
One "web crawler law" could be defined as:</p>
|
|
|
|
|
|
|
|
|
|
<source>
|
|
|
|
|
What can get infinite, will get infinite.
|
|
|
|
|
What can go wrong, will go wrong.
|
|
|
|
|
Eventually.
|
|
|
|
|
Very soon.
|
|
|
|
|
</source>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
A major task during the development was to get memory usage low. But a
|
|
|
|
|
lot of work still needs to be done here. Most of the optimizations
|
|
|
|
|
incorporated now move the problem from main memory to the hard disk, which
|
|
|
|
|
doesn't solve the problem.<br/>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Here are some means that were used:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
|
|
|
|
|
<li>CachingQueues: The message queue, the Fetcher queue, the robot
|
|
|
|
|
exclusion queue (see below) - a lot of queues can fill up the whole main
|
|
|
|
|
memory in a very short period of time. But since queues are only accessed
|
|
|
|
|
at their ends, a very simple mechanism was implemented to keep memory
|
|
|
|
|
usage low: The queue was divided into blocks of fixed size. Only the two
|
|
|
|
|
blocks at its ends are kept in RAM. The rest is serialized on disk. In
the end, only a list of block references has to be managed (a rough
sketch of this idea follows after this list)
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Define a maximum value for everything, and keep an eye on it.
|
|
|
|
|
Downloaded files can get "infinitely" large. URLs can get infinitely long.
|
|
|
|
|
Servers may contain an infinite set of documents. A lot of these checks
|
|
|
|
|
had to be included even in the university network mentioned. A special
|
|
|
|
|
case was the URLs. Some .shtml pages on a web server pointed to a
|
|
|
|
|
subdirectory that didn't exist but revealed the same page. If these errors
|
|
|
|
|
are introduced at will, they are called crawler traps: An infinite URL
|
|
|
|
|
space. The only way of dealing with this is manually excluding the
|
|
|
|
|
hosts.
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Optimized HTML parser. Current parsers tend to create a huge number
of very small objects. Most of that work is unnecessary for the task to
|
|
|
|
|
be done. This can only be optimised by stripping down the parser to do
|
|
|
|
|
only what it is supposed to do in that special task.
|
|
|
|
|
However, there still remains a problem: The HashMap of already visited
|
|
|
|
|
URLs needs to be accessed randomly while reading and writing. I can
|
|
|
|
|
imagine only two ways to overcome this:
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Limiting, in some way, the number of URLs in RAM. If the crawler
|
|
|
|
|
were distributed, this could be done by assigning only a certain number
|
|
|
|
|
of hosts to each crawler node, while at the same time limiting the
|
|
|
|
|
number of pages read from one host. In the end this will only limit the
|
|
|
|
|
number of hosts that can be crawled by the number of crawler nodes
|
|
|
|
|
available. Another solution would be to store complete hosts on drive, together
|
|
|
|
|
with the list of unresolved URLs. Again, this shifts the problem only
|
|
|
|
|
from RAM to the hard drive
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
<li>Something worthwhile would be to compress the URLs. Many parts
of URLs are the same across hundreds of URLs (e.g. the host name). And
since only a limited number of characters are allowed in URLs, Huffman
compression will lead to a good compression rate. A first attempt
|
|
|
|
|
would be to incorporate the visited URLs hash into the HostInfo structure.
|
|
|
|
|
After all, the VisitedFilter hash map turned out to be the data
|
|
|
|
|
structure that will take up most of the RAM after some time.
|
|
|
|
|
</li>
|
|
|
|
|
|
|
|
|
|
</ul>
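<p>
A very rough sketch of the block-caching queue mentioned in the first point
above (the class name, the use of plain Java serialization, and the temporary
files are assumptions, not LARM's actual code):
</p>

<source><![CDATA[
import java.io.*;
import java.util.LinkedList;

/** FIFO queue that keeps only the two blocks at its ends in RAM. */
class CachingQueueSketch {
    private final int blockSize;
    private final LinkedList blockFiles = new LinkedList(); // serialized blocks, oldest first
    private LinkedList head = new LinkedList();             // insertion end, kept in RAM
    private LinkedList tail = new LinkedList();             // removal end, kept in RAM

    CachingQueueSketch(int blockSize) { this.blockSize = blockSize; }

    synchronized void insert(Object o) throws IOException {
        head.addLast(o);
        if (head.size() >= blockSize) {              // head block full: move it to disk
            File f = File.createTempFile("queue", ".blk");
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f));
            out.writeObject(head);
            out.close();
            blockFiles.addLast(f);
            head = new LinkedList();
        }
    }

    synchronized Object remove() throws IOException, ClassNotFoundException {
        if (tail.isEmpty()) {
            if (!blockFiles.isEmpty()) {             // reload the oldest block from disk
                File f = (File) blockFiles.removeFirst();
                ObjectInputStream in = new ObjectInputStream(new FileInputStream(f));
                tail = (LinkedList) in.readObject();
                in.close();
                f.delete();
            } else {                                 // nothing on disk: drain the head block
                tail = head;
                head = new LinkedList();
            }
        }
        return tail.isEmpty() ? null : tail.removeFirst();
    }
}
]]></source>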
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="The Filters">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Most of the functionality of the different filters has already been
|
|
|
|
|
described. Here's another, more detailed view:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>RobotExclusionFilter</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The first implementation of this filter just kept a list of hosts, and
|
|
|
|
|
every time a new URLMessage with an unknown host came by, it attempted
|
|
|
|
|
to read the robots.txt file first to determine whether the URL should
|
|
|
|
|
be filtered.
|
|
|
|
|
A major drawback of that was that when the server was somehow not accessible,
the whole crawler was stalled until the connection timed out (and
with Sun's classes not even that happened, causing the whole program
to die).
|
|
|
|
|
The second implementation has its own little ThreadPool, and keeps a
|
|
|
|
|
state machine of each host in the HostInfo structure.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
If the host manager doesn't contain a HostInfo structure at all, the
|
|
|
|
|
filter creates it and creates a task to get the robots.txt file. During
|
|
|
|
|
this time, the host state is set to "isLoadingRobotsTxt", which means
|
|
|
|
|
further requests to that host are put into a queue. When loading is
|
|
|
|
|
finished, these URLs (and all subsequent ones) are put back to the beginning
|
|
|
|
|
of the queue.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
After this initial step, every URL that enters the filter is compared
against the set of disallow rules (if present), and is filtered if necessary.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Since the URLs are put back to the beginning of the queue, the filter
|
|
|
|
|
has to be put in front of the VisitedFilter.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
In the host info structure, which is also used by the FetcherTasks,
|
|
|
|
|
some information about the health of the hosts is stored as well. If the
|
|
|
|
|
server is in a bad state several times, it is excluded from the crawl.
|
|
|
|
|
Note that it is possible that a server will be accessed more than the
|
|
|
|
|
(predefined) 5 times that it can time out, since a FetcherThread may
|
|
|
|
|
already have started to get a document when another one marks it as bad.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>URLLengthFilter</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
This very simple filter just filters a URL if a certain (total) length
|
|
|
|
|
is exceeded
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>KnownPathsFilter</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
This one filters some very common URLs (e.g. different views of an
Apache directory index) or hosts known to cause problems. It should be made more
configurable from the outside in the future.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>URLScopeFilter</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The scope filter filters a URL if it doesn't match a given regular
|
|
|
|
|
expression.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>URLVisitedFilter</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
This filter keeps a HashMap of already visited URLs, and filters out
|
|
|
|
|
what it already knows
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>Fetcher</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The fetcher itself is also a filter - one that filters out all URLs, since they
are passed along to the storage as WebDocuments instead of further down the chain. It
|
|
|
|
|
contains a ThreadPool that runs in its own thread of control, which takes
|
|
|
|
|
tasks from the queue and distributes them to the different
|
|
|
|
|
FetcherThreads.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
In the first implementation the fetcher would simply distribute the
|
|
|
|
|
incoming URLs to the threads. The thread pool would use a simple queue to
|
|
|
|
|
store the remaining tasks. But this can lead to a very "impolite"
distribution of the tasks: Since a large share of the links in a page point to the same
|
|
|
|
|
server, and all links of a page are added to the message handler at
|
|
|
|
|
once, groups of successive tasks would all try to access the same server,
|
|
|
|
|
probably causing denial of service, while other hosts present in the
|
|
|
|
|
queue are not accessed.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
To overcome this, the queue is divided into different parts, one for
|
|
|
|
|
each host. Each host gets its own (caching) queue. But the methods
used to pull tasks from the "end" of this queue cycle through the hosts
and always get a URL from a different host (see the sketch below).
|
|
|
|
|
</p>
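<p>
In a simplified form, this host-cycling queue looks roughly like the sketch
below (the names are made up; the real implementation additionally uses the
caching queues described in the Memory Usage section):
</p>

<source><![CDATA[
import java.util.*;

/** one queue per host, read in round-robin order */
class HostQueuesSketch {
    private final Map queues = new HashMap();    // host name -> LinkedList of tasks
    private final List hosts = new ArrayList();  // round-robin order
    private int cursor = 0;

    synchronized void put(String host, Object task) {
        LinkedList q = (LinkedList) queues.get(host);
        if (q == null) {
            q = new LinkedList();
            queues.put(host, q);
            hosts.add(host);
        }
        q.addLast(task);
    }

    /** returns the next task, taken from a different host each time if possible */
    synchronized Object next() {
        for (int i = 0; i < hosts.size(); i++) {
            cursor = (cursor + 1) % hosts.size();
            LinkedList q = (LinkedList) queues.get(hosts.get(cursor));
            if (!q.isEmpty()) {
                return q.removeFirst();
            }
        }
        return null;  // all host queues are empty
    }
}
]]></source>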
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
One major problem still remains with this technique: If one host is
|
|
|
|
|
very slow, it can still slow down everything. Since with n hosts every nth
task will access this host, it can tie up one thread after the
other if loading one of its documents takes longer than loading from the (n-1)
|
|
|
|
|
other servers. Then two concurrent requests will end up on the same
|
|
|
|
|
server, which slows down the response times even more, and so on. In
|
|
|
|
|
reality, this will clog up the queue very fast. A little more work has to be
|
|
|
|
|
done to avoid these situations, i.e. by limiting the number of threads
|
|
|
|
|
that access one host at a time.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
<b>A Note on DNS</b>
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The Mercator crawler paper puts a lot of stress on resolving host names.
Because of that, a DNSResolver filter was implemented very early
on. Two reasons led to it no longer being used:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
|
|
|
|
|
<li>newer versions of the JDK than the one Mercator used resolve the IP
|
|
|
|
|
address of a host the first time it is accessed, and keep a cache of
|
|
|
|
|
already resolved host names.</li>
|
|
|
|
|
|
|
|
|
|
<li>the crawler itself was designed to crawl large local networks, and
|
|
|
|
|
not the internet. Thus, the number of hosts is very limited.</li>
|
|
|
|
|
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<section name="Future Enhancements">
|
|
|
|
|
|
|
|
|
|
<subsection name="Politeness">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
A crawler should not cause a Denial of Service attack. So this has to
|
|
|
|
|
be addressed.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="The processing pipeline">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The FetcherTask, as already stated, is very monolithic at this time.
|
|
|
|
|
Probably some more processing should be done at this step (the problem
|
|
|
|
|
with balanced CPU/IO usage taken into account). At least different
|
|
|
|
|
handlers for different mime types should be provided, e.g. to extract links
|
|
|
|
|
from PDF documents. The Storage should also be broken up. I only used
|
|
|
|
|
the LogStorage within the last months, which now not only writes to
log files, but also stores the files on disk. This should probably be
|
|
|
|
|
replaced by a storage chain where different stores could be appended.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<subsection name="LARM as a real server">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The only way to start a crawl today is starting the crawler from the
|
|
|
|
|
shell. But it could also remain idle and wait for commands from an RMI
|
|
|
|
|
connection or expose a web service. Monitoring could be done by a simple
built-in web server that provides current statistics via HTML.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li><b>Recommended Reading: </b> We plan on using the
|
|
|
|
|
<a href="http://jakarta.apache.org/avalon">Avalon</a> Framework for state
management and configuration. A very early proposal can
be found
<a href="http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=399399">here</a>.
We found out that Avalon provides a lot of the functionality described
as necessary in that posting.</li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Distribution">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
Distribution is a big issue. Some people say "Distribute your program
|
|
|
|
|
late. And then later." But as others have implemented distributed
|
|
|
|
|
crawlers, this should not be very hard.<br/>
|
|
|
|
|
I see two possible architectures for that:
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li>Write a single dispatcher (a star network) that contains the whole
|
|
|
|
|
MessageHandler except the Fetcher itself. The crawlers are run as
|
|
|
|
|
servers (see above), and are configured with a URL source that gets their
|
|
|
|
|
input from the dispatcher and a MessageHandler that stores URLs back to
|
|
|
|
|
the dispatcher. The main drawback is that this can become a
|
|
|
|
|
bottleneck.
|
|
|
|
|
</li>
|
|
|
|
|
<li>Partition the domain to be crawled into several parts. This could
|
|
|
|
|
be done for example by dividing up different intervals of the hash value
|
|
|
|
|
of the host names. Then plugging in another crawler could be done
|
|
|
|
|
dynamically, even within a peer-to-peer network. Each node knows which node
is responsible for which interval, and sends all URLs to the right
node. This could even be implemented as a filter (see the sketch after this list).
|
|
|
|
|
</li>
|
|
|
|
|
</ul>
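<p>
A minimal sketch of the hash partitioning mentioned in the second point
(made up for illustration, not existing LARM code):
</p>

<source><![CDATA[
// each crawler node is responsible for one interval of host hash values
int responsibleNode(String host, int numberOfNodes) {
    return (host.hashCode() & 0x7fffffff) % numberOfNodes;
}

// a distributing filter would forward each URLMessage to
// responsibleNode(host of the URL, numberOfNodes) instead of handling it locally
]]></source>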
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
One thing to keep in mind is that URLs should be transferred to
other nodes in batches that are as large as possible.</p>
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
The next thing to be distributed is the storage mechanism. Here, the
|
|
|
|
|
number of pure crawling nodes and the number of storing (post-processing)
nodes could well differ. An issue here is that whole documents
|
|
|
|
|
have to be transferred over the net. </p>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li><b>Recommended Reading: </b> <i>Junghoo Cho and Hector
|
|
|
|
|
Garcia-Molina (2002).
|
|
|
|
|
<a href="http://citeseer.nj.nec.com/cho02parallel.html">Parallel crawlers</a>.
|
|
|
|
|
In Proc. of the 11th International World-Wide Web Conference, 2002</i></li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="URL Reordering">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
One paper discussed different ways of reordering URLs while crawling.
One of the most promising attempts was to take the calculated PageRank
into account. Crawling pages with higher PageRank first seemed to get
to important pages earlier. This is not rocket science; the research was
already done years ago.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<ul>
|
|
|
|
|
<li><b>Recommended Reading: </b> <i>J. Cho, H. Garcia-Molina & L.
|
|
|
|
|
Page (1998),
|
|
|
|
|
<a
|
|
|
|
|
href="http://citeseer.nj.nec.com/30730.html">Efficient Crawling Through URL Ordering</a>,
|
|
|
|
|
Seventh International Web Conference, Brisbane, Australia, 14-18 April</i></li>
|
|
|
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
<subsection name="Recovery">
|
|
|
|
|
|
|
|
|
|
<p>
|
|
|
|
|
At the moment there is no way of stopping and restarting a crawl. There
|
|
|
|
|
should be a mechanism to move the current state of the crawler to disk,
|
|
|
|
|
and, in case of a failure, to recover and continue from the last saved
|
|
|
|
|
state.
|
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
</body>
|
|
|
|
|
|
|
|
|
|
</document>
|