Author: Clemens Marschner
Revised: Oct. 28, 2002
This document describes the configuration parameters and the inner workings of the LARM web crawler.
This document is intended to help Lucene developers, who do not necessarily have any background knowledge of crawlers, understand the inner workings of the LARM crawler, the current problems, and some directions for future development. The aim is to keep the entry costs low for people who have an interest in developing this piece of software further.
It may also be useful for actual users of the Lucene engine, but beware: a lot will change in the near future, especially the configuration.
The crawler is only accessible via anonymous CVS at this time. See the Jakarta CVS instructions for details if you're not familiar with CVS.
Too long? The following will give you a quick start: create a new directory, say, "jakarta", make it your current directory, and type
The crawler will then be in jakarta-lucene-sandbox/contributions/webcrawler-LARM. To compile it you will also need
Change to the webcrawler-LARM directory and type
You should then have a working copy of LARM in
build/webcrawler-LARM-0.5.jar. See the section Syntax below on how to
start the crawler.
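For orientation, the whole quick start looks something like this. The CVS commands follow the usual Jakarta anonymous CVS instructions of the time and the build step assumes the standard Ant setup, so treat this as an illustration rather than as the authoritative syntax:

    mkdir jakarta
    cd jakarta
    cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic login
        (password: anoncvs)
    cvs -d :pserver:anoncvs@cvs.apache.org:/home/cvspublic checkout jakarta-lucene-sandbox
    cd jakarta-lucene-sandbox/contributions/webcrawler-LARM
    ant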
Web crawlers became necessary because the web standard protocols didn't contain any mechanisms to inform search engines that the data on a web server had been changed. If this were possible, a search engine could be notified in a "push" fashion, which would simplify the total process and would make indexes as current as possible.
Imagine a web server that notifies another web server that a link was created from one of its pages to the other server. That other server could then send a message back if the page was removed.
On the other hand, handling such a system would be a lot more complicated. Keeping distributed information up to date is an error-prone task. Even in a single relational database it is often complicated to define and handle dependencies between relations. Should it be possible to allow inconsistencies for a short period of time? Should dependent data be deleted if a record is removed? Handling relationships between whole clusters of information well introduces a new level of complexity.
In order to keep the software (web servers and browsers) simple, the inventors of the web concentrated on just a few core elements - URLs for (more or less) uniquely identifying distributed information, HTTP for handling the information, and HTML for structuring it. That system was so simple that one could understand it in a very short time. This is probably one of the main reasons why the WWW became so popular. Well, another one would probably be coloured, moving graphics of naked people.
But the WWW has some major disadvantages: There is no single index of all available pages. Information can change without notice. URLs can point to pages that no longer exist. There is no mechanism to get "all" pages from a web server. The whole system is in a constant process of change. And on top of all that, it is growing at a phenomenal rate. Building a search engine on top of that is not something you can do on a Saturday afternoon. Given the sheer size, it would take months to search through all the pages in order to answer a single query, even if we had a means to get from server to server, get the pages from there, and search them. But we don't even know how to do that, since we don't know all the web servers.
That first problem was addressed by bookmark collections, which soon became very popular. The most popular was probably Yahoo, which evolved into one of the most popular pages on the web just a year after it emerged from a college dorm room. The second problem was how to get the information from all those pages lying around. This is where a web crawler comes in.
Ok, those engineers said, we are not able to get a list of all the pages. But almost every page contains links to other pages. We can save a page, extract all the links, and load all the pages these links point to. If we start at a popular location that contains a lot of links, like Yahoo for example, chances are that we can reach "all" pages on the web.
A little more formally, the web can be seen as a directed graph, with pages as nodes and links as edges between them. A web crawler, also called "spider" or "fetcher", uses the graph structure of the web to get documents in order to be able to index them. Since there is no "push" mechanism for updating our index, we need to "pull" the information on our own, by repeatedly crawling the web.
"Easy", you may think now, "just implement what he said in the paragraph before." So you start getting a page, extracting the links, following all the pages you have not already visited. In Perl that can be done in a few lines of code.
But then, very soon (I can tell you), you run into a lot of problems:
The LARM web crawler is the result of experience with the problems mentioned above, combined with a lot of monitoring to get the maximum out of the given system resources. It was designed with several different aspects in mind:
What it can do for you:
On the other hand, at the time of this writing, the crawler has not yet evolved into a production release. The reason is that, until now, it has only had to serve me.
These issues remain:
The command line options are very simple:
Java runtime options:
You may also want to have a look at the source code, because some options cannot be configured from the outside at this time.
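To give an idea of the syntax, an invocation looks roughly like the following. The package name and the option names (-start, -restrictto, -threads) are assumptions based on my reading of FetcherMain.java, so double-check them against the source before relying on them:

    java -server -Xmx256m -classpath build/webcrawler-LARM-0.5.jar:<required libraries> \
         de.lanlab.larm.fetcher.FetcherMain \
         -start http://www.example.org/ \
         -restrictto "http://[^/]*\.example\.org/.*" \
         -threads 10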
Other options
Unfortunately, a lot of the options are still not configurable from the outside. Most of them are set from within FetcherMain.java, but others are still spread over some of the other classes. For now, we have tried to put a "FIXME" comment around all of these options, so check out the source code.
What happens now?
URLs are only supposed to make a web resource accessible. Unfortunately, there can be more than one representation of a URL, which can cause a web crawler to save the same file twice under different URLs. Two mechanisms are used to reduce this error, while at the same time trying to keep the opposite error low (two URLs that are regarded as pointing to the same file but in fact point to two different ones):
The result is used like a stemming function in IR systems: The normalized form of a URL is used internally for comparisons, but to the outside (i.e. for accessing the file), the original form is applied.
These are the patterns applied by the URLNormalizer:
Todo: '/../' is not handled yet
Todo: Examples
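Until proper examples are added here, the following sketch shows the kind of normalizations that are meant. It is illustrative only, not the actual URLNormalizer code, and the exact set of patterns LARM applies may differ:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class NormalizerSketch {

        /** returns a canonical form used only for comparisons; the original URL is kept for fetching */
        public static String normalize(String urlString) throws MalformedURLException {
            URL url = new URL(urlString);
            String host = url.getHost().toLowerCase();    // host names are case insensitive
            String path = url.getPath();
            if (path.length() == 0) {
                path = "/";                               // "http://host" equals "http://host/"
            }
            while (path.indexOf("/./") != -1) {
                path = path.replaceAll("/\\./", "/");     // collapse redundant "./" segments ("/../" is harder)
            }
            StringBuffer buf = new StringBuffer(url.getProtocol());
            buf.append("://").append(host);
            if (url.getPort() != -1 && url.getPort() != 80) {
                buf.append(':').append(url.getPort());    // drop the default port, keep all others
            }
            buf.append(path);
            if (url.getQuery() != null) {
                buf.append('?').append(url.getQuery());
            }
            return buf.toString();                        // note: the "#fragment" part is dropped
        }
    }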
The host resolver is also applied when the URL normalization takes place. It knows three different types of rules:
These three rules are applied to each URL in that order. For example, you can tell the host resolver to always remove the "www." of each host name, thereby regarding "cnn.com" and "www.cnn.com" as equal, by defining the rule setStartsWith("www.","").
Configuring HostResolver in a config file
The HostResolver was a test of what config files could look like if they were implemented using standard Java mechanisms. We used the Jakarta BeanUtils for this (see the BeanUtils website for details) and implemented the HostResolver as a JavaBean. The rules can then be stated as "mapped properties" in a hostResolver.properties file (see the start syntax above). The only difference between normal properties and the mapped properties supported by BeanUtils is that a second argument can be passed to the latter.
An example of such a properties file would look like this:
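For illustration only: the startsWith keyword follows from the example above, while the endsWith and synonym keywords are assumed names for the other two rule types, so check the HostResolver source for the real ones. The commas are explained below.

    # hostResolver.properties -- illustrative example
    # strip a leading "www." from every host name ("," stands for ".")
    startsWith(www,)=
    # rewrite a host name suffix (keyword assumed)
    endsWith(,example,org)=,example,net
    # treat two host names as the very same server (keyword assumed)
    synonym(web42,example,com)=www,example,com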
As you can see, the file format itself is like the standard Java Properties format (comments etc.). Keywords are case sensitive.
At the time the class was written, BeanUtils still had a bug whereby dots (".") were not supported in the keys of mapped properties. In the new version (1.5 at the time of this writing) this is supposed to be fixed, but I have not tried it yet. Therefore, the comma "," was made a synonym for the dot. Since "," is not allowed in domain names, you can still use (and even mix) them if you want, or if you only have an older BeanUtils version available.
LARM currently provides a very simple LuceneStorage that allows for integrating the crawler with Lucene. It's meant to be a working example of how this can be accomplished, not a final implementation. If you'd like to volunteer on that part, contributions are welcome.
The current storage simply takes the input that comes from the crawler (a WebDocument which mainly consists of name/value pairs with the document's contents) and puts it into a Lucene index. Each name/value pair is put into one field. There's currently no incremental or update operation, or document caching via a RAMDirectory. Therefore the LuceneStorage becomes a bottleneck even with slow network connections.
See storage/LuceneStorage.java and fetcher/FetcherMain.java for details.
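To illustrate what the storage does at its core, here is a stripped-down sketch. It uses the plain Lucene API; the way the WebDocument's name/value pairs are accessed (a java.util.Map here) is a simplifying assumption, so the real classes remain authoritative:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class LuceneStorageSketch {

        private IndexWriter writer;

        public void open(String indexDir) throws IOException {
            // "true" creates a new index; there is no incremental update yet
            writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        }

        /** stores one crawled document; each name/value pair becomes one Lucene field */
        public void store(Map fields) throws IOException {
            Document doc = new Document();
            for (Iterator it = fields.entrySet().iterator(); it.hasNext();) {
                Map.Entry entry = (Map.Entry) it.next();
                doc.add(Field.Text((String) entry.getKey(), (String) entry.getValue()));
            }
            // every fetcher thread ends up here, one document after the other,
            // which is why this storage currently becomes the bottleneck
            writer.addDocument(doc);
        }

        public void close() throws IOException {
            writer.close();
        }
    }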
I studied the Mercator web crawler but decided to implement a somewhat different architecture. Here is a high level overview of the default configuration:
The message handler is an implementation of a simple chain of responsibility. Implementations of Message are passed down a filter chain. Each of the filters can decide whether to send the message along, change it, or even delete it. In this case, messages of type URLMessage are used. The message handler runs in its own thread; thus, a call to putMessage() or putMessages() involves a producer-consumer-style message transfer. The filters themselves run within the message handler thread. At the end of the pipeline, the Fetcher distributes the incoming messages to its worker threads. These are implemented as a thread pool: several ServerThreads run concurrently and wait for Tasks, which encapsulate the procedure to be executed. If more tasks are to be done than threads are available, they are kept in a queue, which is read whenever a task is finished.
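Stripped of all details, the pattern looks like the following sketch. The interface and method names are simplified here and do not match the LARM sources one to one:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    interface Message { }

    interface MessageFilter {
        /** returns the (possibly modified) message to pass it on, or null to discard it */
        Message handleMessage(Message m);
    }

    class MessagePipeline {

        private List filters = new ArrayList();

        void addFilter(MessageFilter filter) {
            filters.add(filter);
        }

        /** called from the message handler thread for every message taken off the queue */
        void process(Message m) {
            for (Iterator it = filters.iterator(); m != null && it.hasNext();) {
                m = ((MessageFilter) it.next()).handleMessage(m);
            }
            // in LARM the Fetcher is simply the last stage of this chain; a message
            // that survives up to there is turned into a task for the worker threads
        }
    }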
At this point the pipeline pattern is left. The FetcherTask itself is still quite monolithic: it gets the document, parses it if possible, and stores it into a storage. In the future one might think of additional configurable processing steps within another processing pipeline. I thought about incorporating it into the filter pipeline, but since the filters are passive components and the FetcherThreads are active, this didn't work.
Performance was improved by a factor of about 10-15 compared to the first naive attempts with a pre-built parser and Sun's network classes, and there is still room left. On a network with about 150 web servers, to which the crawler machine was connected by a 100 MBit FDDI link, I was able to crawl an average of 60 documents, or 3.7 MB, per second after a 10 minute startup period. In this first period, crawling is slower because the number of known servers is still small, so the servers' output limits crawling. There may also be servers that don't respond; they are excluded from the crawl after a few attempts.
Overall, performance is affected by a lot of factors: The operating system, the native interface, the Java libraries, the web servers, the number of threads, whether dynamic pages are included in the crawl, etc.
From a development point of view, speed is affected by the balance between I/O and CPU usage. Both have to be kept at 100%, otherwise one of them becomes the bottleneck. Managing these resources is the central part of a crawler. Imagine that only one thread is crawling. This is the worst case, as becomes apparent very quickly:
The diagram to the right resembles a UML sequence diagram, except that it stresses the time that a message needs to traverse the network.
The storage process, which by itself uses CPU and disk I/O resources, was left out here. That process will be very similar, although the traversal will be faster.
As you can see, both CPU and I/O are not used most of the time, and wait for the other one (or the network) to complete. This is the reason why single threaded web crawlers tend to be very slow (wget for example). The slowest component always becomes the bottleneck.
Two strategies can be followed to improve this situation:
Asynchronous I/O means that I/O requests are issued, and the crawler then continues processing documents it has already crawled while those requests complete.
Actually, I haven't seen an implementation of this technique. Asynchronous I/O was not available in Java before version 1.4. An advantage would be that thread handling, which is itself expensive in terms of CPU and memory usage, could be avoided; threads are resources and, thus, limited. I have heard that application server developers wanted asynchronous I/O in order to cope with hundreds of simultaneous requests without spawning extra threads for each of them. Probably this can be a solution in the future, but from what I know about it today, it will not be necessary.
The way this problem is usually solved in Java is with several threads. If many threads are used, chances are good that at any given moment at least one thread is in one of the states above, which means both CPU and I/O are kept at their maximum.
The problem with this is that multithreaded programming is considered to be one of the most difficult areas in computer science. But given the simple, linear structure of web crawlers, it is not very hard to avoid race conditions or deadlock problems. You always get into trouble when threads are supposed to access shared resources, though. Don't touch this until you have read the standard literature and have made at least 10 mistakes (and solved them!).
Multithreading doesn't come without a cost, however. First, there is the cost of thread scheduling. I don't have numbers for that in Java, but I suppose it should not be very expensive. Mutexes, on the other hand, can affect the whole program a lot; I noticed that they should be avoided like hell. In a crawler, a mutex is used, for example, when a new URL is passed to a thread, or when the fetched documents are supposed to be stored linearly, one after the other.
For example, the tasks used to insert a new URL into the global message handler each time a new URL was found in a document. I was able to speed this up considerably by changing it so that the URLs are collected locally and then inserted only once per document. Probably this can be improved even further if each task comprises several documents which are fetched one after the other and then stored together.
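In code, the change boils down to something like this. The surrounding task and parser details are made up for the sketch; putMessage()/putMessages() are the calls mentioned above:

    // inside the task, after one document has been parsed:
    java.util.ArrayList found = new java.util.ArrayList();
    for (java.util.Iterator it = extractedLinks.iterator(); it.hasNext();) {
        found.add(new URLMessage((java.net.URL) it.next()));  // constructor arguments assumed
    }
    // one pass through the synchronized message handler per document,
    // instead of one putMessage() call per link:
    messageHandler.putMessages(found);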
Nonetheless, keeping the right balance between the two resources is a big concern. At the moment, the number of threads and the number of processing steps is static, and is only optimised by trial and error: few hosts or a slow network -> few threads; a slow CPU -> few processing steps; many hosts and a fast network -> many threads. Probably those heuristics will do well, but I wonder whether these figures could also be fine-tuned dynamically at runtime.
Another issue that was optimised was very fine-grained method calls. For example, the original implementation of the HTML parser used to call the read() method for every single character. That call probably had to traverse several Decorators until it got to a synchronized call. That's why the CharArrayReader was replaced by a SimpleCharArrayReader, since only one thread works on a document at a time.
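The idea behind it is simply a reader without synchronization, roughly like this sketch (not the actual SimpleCharArrayReader):

    import java.io.Reader;

    /** like java.io.CharArrayReader, but without synchronization --
        acceptable because exactly one thread parses a given document */
    public class UnsyncCharArrayReader extends Reader {

        private char[] buf;
        private int pos = 0;

        public UnsyncCharArrayReader(char[] buf) {
            this.buf = buf;
        }

        public int read() {
            return (pos < buf.length) ? buf[pos++] : -1;
        }

        public int read(char[] cbuf, int off, int len) {
            if (pos >= buf.length) {
                return -1;
            }
            int n = Math.min(len, buf.length - pos);
            System.arraycopy(buf, pos, cbuf, off, n);
            pos += n;
            return n;
        }

        public void close() {
            // nothing to release
        }
    }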
These issues can only be tracked down with special tools, i.e. profilers. The work is worth it, because it allows one to concentrate on the 20% of the code that costs 80% of the time.
One "web crawler law" could be defined as:
A major task during the development was to get memory usage low. But a
lot of work still needs to be done here. Most of the optimizations
incorporated now move the problem from main memory to the hard disk, which
doesn't solve the problem.
Here are some means that were used:
Most of the functionality of the different filters has already been described. Here's another, more detailed view:
RobotExclusionFilter
The first implementation of this filter just kept a list of hosts, and every time a new URLMessage with an unknown host came by, it attempted to read the robots.txt file first to determine whether the URL should be filtered. A major drawback was that when the server was somehow inaccessible, the whole crawler stalled until the connection timed out (with Sun's classes that didn't even happen, which caused the whole program to die). The second implementation has its own little ThreadPool and keeps a state machine for each host in the HostInfo structure.
If the host manager doesn't contain a HostInfo structure at all, the filter creates it and creates a task to get the robots.txt file. During this time, the host state is set to "isLoadingRobotsTxt", which means further requests to that host are put into a queue. When loading is finished, these URLs (and all subsequent ones) are put back to the beginning of the queue.
After this initial step, every URL that enters the filter is compared to the disallow rules set (if present), and is filtered if necessary.
Since the URLs are put back to the beginning of the queue, the filter has to be put in front of the VisitedFilter.
In the host info structure, which is also used by the FetcherTasks, some information about the health of the hosts is stored as well. If the server is in a bad state several times, it is excluded from the crawl. Note that it is possible that a server will be accessed more than the (predefined) 5 times that it can time out, since a FetcherThread may already have started to get a document when another one marks it as bad.
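Schematically, the filter's logic looks like this. All names below are simplified stand-ins for the real RobotExclusionFilter, HostManager and HostInfo members:

    // sketch of the per-host state machine (names simplified)
    Message handleMessage(URLMessage m) {
        String hostName = m.getUrl().getHost();
        HostInfo host = hostManager.getHostInfo(hostName);
        if (host == null) {
            host = hostManager.createHostInfo(hostName);
            host.setLoadingRobotsTxt(true);
            threadPool.doTask(new LoadRobotsTxtTask(host));   // fetched in the background
        }
        if (host.isLoadingRobotsTxt()) {
            host.queueUrl(m);   // re-inserted at the front of the pipeline once robots.txt is there
            return null;        // nothing is passed on for now
        }
        if (host.isBad() || host.isDisallowed(m.getUrl().getPath())) {
            return null;        // excluded host, or path forbidden by its robots.txt
        }
        return m;               // passed on to the next filter
    }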
URLLengthFilter
This very simple filter just filters out a URL if a certain (total) length is exceeded.
KnownPathsFilter
This one filters out some very common URLs (e.g. different views of an Apache directory index) and hosts known to cause problems. It should be made more configurable from the outside in the future.
URLScopeFilter
The scope filter filters a URL if it doesn't match a given regular expression.
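For example, to restrict a crawl to a single domain, the expression could look like this (illustrative; the exact syntax depends on the regular expression library used):

    http://([^/]*\.)?example\.org/.*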
URLVisitedFilter
This filter keeps a HashMap of already visited URLs and filters out what it has already seen.
Fetcher
The fetcher itself is also a filter that filters all URLs, since they are passed along to the storage as WebDocuments, i.e. in a different form. It contains a ThreadPool that runs in its own thread of control and takes tasks from the queue and distributes them to the different FetcherThreads.
In the first implementation the fetcher would simply distribute the incoming URLs to the threads, and the thread pool would use a simple queue to store the remaining tasks. But this can lead to a very "impolite" distribution of the tasks: since about ¾ of the links in a page point to the same server, and all links of a page are added to the message handler at once, groups of successive tasks would all try to access the same server, probably causing a denial of service, while other hosts present in the queue are not accessed at all.
To overcome this, the queue is divided into different parts, one for each host. Each host gets its own (caching) queue, but the method used to pull tasks from the "end" of this queue cycles through the hosts and always returns a URL from a different host.
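The data structure behind this is roughly the following, a simplified sketch of the idea rather than the actual caching queue classes:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.LinkedList;

    /** sketch: one FIFO per host, pulled from in round-robin order over the hosts */
    class HostQueuesSketch {

        private HashMap queues = new HashMap();     // host name -> LinkedList of tasks
        private ArrayList hosts = new ArrayList();  // defines the cycling order
        private int next = 0;

        synchronized void put(String host, Object task) {
            LinkedList queue = (LinkedList) queues.get(host);
            if (queue == null) {
                queue = new LinkedList();
                queues.put(host, queue);
                hosts.add(host);
            }
            queue.addLast(task);
        }

        /** returns a task from a different host on each call, if one is available */
        synchronized Object pull() {
            for (int i = 0; i < hosts.size(); i++) {
                String host = (String) hosts.get(next);
                next = (next + 1) % hosts.size();
                LinkedList queue = (LinkedList) queues.get(host);
                if (!queue.isEmpty()) {
                    return queue.removeFirst();
                }
            }
            return null;    // all queues are empty (or no host is known yet)
        }
    }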
One major problem still remains with this technique: if one host is very slow, it can still slow down everything. Since with n hosts every nth task will access that host, it can eat up one thread after the other if loading a document from it takes longer than loading from the (n-1) other servers. Then two concurrent requests will end up on the same server, which slows down its response times even more, and so on. In practice, this clogs up the queue very fast. A little more work has to be done to avoid these situations, e.g. by limiting the number of threads that can access one host at a time.
A Note on DNS
The Mercator paper puts a lot of stress on resolving host names. Because of that, a DNSResolver filter was implemented very early on. Two reasons have since kept it from being used any more:
A crawler should not cause a Denial of Service attack. So this has to be addressed.
The FetcherTask, as already stated, is very monolithic at this time. Probably some more processing should be done in this step (taking the problem of balanced CPU/I/O usage into account). At the very least, different handlers for different mime types should be provided, e.g. to extract links from PDF documents. The Storage should also be broken up. I have only used the LogStorage during the last months, and by now it doesn't just write to log files but also stores the files on disk. This should probably be replaced by a storage chain to which different storages could be appended.
The only way to start a crawl today is to start the crawler from the shell. But it could also remain idle and wait for commands over an RMI connection, or expose a Web Service. Monitoring could be done by a simple embedded web server that provides current statistics via HTML.
Distribution is a big issue. Some people say "Distribute your program late. And then later." But since others have already implemented distributed crawlers, this should not be very hard.
I see two possible architectures for that:
One thing to keep in mind is that the number of URLs transferred to other nodes should be as large as possible.
The next thing to be distributed is the storage mechanism. Here, the number of pure crawling nodes and the number of storing (post processing) nodes could possibly diverge. An issue here is that the whole documents have to be transferred over the net.
One paper discussed different ways of reordering URLs while crawling. One of the most promising approaches was to take the calculated PageRank into account: crawling pages with higher PageRank first seemed to find important pages earlier. This is not rocket science; the research was already done years ago.
At the moment there is no way of stopping and restarting a crawl. There should be a mechanism to move the current state of the crawler to disk, and, in case of a failure, to recover and continue from the last saved state.