mirror of https://github.com/apache/lucene.git
- Updated.
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149880 13f79535-47bb-0310-9956-ffa450edef68
parent d43c477e9e
commit d80a2f359c
@@ -361,15 +361,15 @@
 <p>
 The LARM web crawler is a result of experiences with the errors as
 mentioned above, connected with a lot of monitoring to get the maximum out
-of the given system ressources. It was designed with several different
+of the given system resources. It was designed with several different
 aspects in mind:
 </p>
 <ul>

 <li>Speed. This involves balancing the resources to prevent
-bottlenecks. The crawler is multithreaded. A lot of work went in avoiding
+bottlenecks. The crawler is multi-threaded. A lot of work went in avoiding
 synchronization between threads, i.e. by rewriting or replacing the standard
-Java classes, which slows down multithreaded programs a lot
+Java classes, which slows down multi-threaded programs a lot
 </li>


@@ -384,13 +384,13 @@

 <li>Scalability. The crawler was supposed to be able to crawl <i>large
 intranets</i> with hundreds of servers and hundreds of thousands of
-documents within a reasonable amount of time. It was not ment to be
+documents within a reasonable amount of time. It was not meant to be
 scalable to the whole Internet.</li>

 <li>Java. Although there are many crawlers around at the time when I
 started to think about it (in Summer 2000), I couldn't find a good
 available implementation in Java. If this crawler would have to be integrated
-in a Java search engine, a homogenous system would be an advantage. And
+in a Java search engine, a homogeneous system would be an advantage. And
 after all, I wanted to see if a fast implementation could be done in
 this language.
 </li>
@@ -452,7 +452,7 @@
 class to the pipeline.
 </li><li>The storage mechanism is also pluggable. One of the next
 issues would be to include this storage mechanism into the pipeline, to
-allow a seperation of logging, processing, and storage
+allow a separation of logging, processing, and storage
 </li>
 </ul>
 <p>
@@ -468,7 +468,7 @@
 Still, there's a relatively high memory overhead <i>per server</i> and
 also some overhead <i>per page</i>, especially since a map of already
 crawled pages is held in memory at this time. Although some of the
-in-memory structures are already put on disc, memory consumption is still
+in-memory structures are already put on disk, memory consumption is still
 linear to the number of pages crawled. We have lots of ideas on how to
 move that out, but since then, as an example with 500 MB RAM, the crawler
 scales up to some 100.000 files on some 100s of hosts.
@@ -772,7 +772,7 @@
 that dots "." were not supported in mapped properties indexes. As with
 the new version (1.5 at the time of this writing) this is supposed to be
 removed, but I have not tried yet. Therefore, the comma "," was made a
-synonymon for dots. Since "," is not allowed in domain names, you can
+synonym for dots. Since "," is not allowed in domain names, you can
 still use (and even mix) them if you want, or if you only have an older
 BeanUtils version available.
 </p>
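The comma-for-dot workaround described in the hunk above can be pictured with a small sketch. It is only an illustration under stated assumptions: `HostKey`, `toHostName`, and `toKey` are hypothetical names, not part of LARM or BeanUtils, and the sketch merely shows how a key written with commas could be mapped back to a host name with dots.

```java
// Hypothetical helper, NOT part of LARM or BeanUtils: illustrates treating
// "," as a synonym for "." in host names used as mapped-property keys.
final class HostKey {
    private HostKey() {}

    // Turn a key such as "www,example,com" into "www.example.com".
    // Commas and dots may be mixed, because "," cannot occur in a domain name.
    static String toHostName(String key) {
        return key.replace(',', '.');
    }

    // Opposite direction: write a host name without dots so it can be used
    // where dots in a key are not supported.
    static String toKey(String hostName) {
        return hostName.replace('.', ',');
    }
}
```

Because a comma never appears in a valid domain name, the mapping is unambiguous in both directions, which is presumably why mixing both separators is harmless.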
@@ -784,7 +784,7 @@

 <p>
 LARM currently provides a very simple LuceneStorage that allows for
-integrating the crawler with Lucene. It's ment to be a working example on
+integrating the crawler with Lucene. It's meant to be a working example on
 how this can be accomplished, not a final implementation. If you like
 to volunteer on that part, contributions are welcome.</p>

@@ -893,7 +893,7 @@
 Probably it is copied through several buffers until it is complete.
 This will take some CPU time, but mostly it will wait for the next packet
 to arrive. The network transfer by itself is also affected by a lot of
-factors, i.e. the speed of the web server, acknowledgement messages,
+factors, i.e. the speed of the web server, acknowledgment messages,
 resent packages etc. so 100% network utilization will almost never be
 reached.</li>
 <li>The document is processed, which will take up the whole CPU. The
@@ -941,7 +941,7 @@
 this until you have read the standard literature and have made at least
 10 mistakes (and solved them!).</p>
 <p>
-Multithreading doesn't come without a cost, however. First, there is
+Multi-threading doesn't come without a cost, however. First, there is
 the cost of thread scheduling. I don't have numbers for that in Java, but
 I suppose that this should not be very expensive. MutExes can affect
 the whole program a lot . I noticed that they should be avoided like
@@ -1171,7 +1171,7 @@
 <p>
 In the first implementation the fetcher would simply distribute the
 incoming URLs to the threads. The thread pool would use a simple queue to
-store the remaining tasks. But this can lead to a very "unpolite"
+store the remaining tasks. But this can lead to a very "impolite"
 distribution of the tasks: Since ¾ of the links in a page point to the same
 server, and all links of a page are added to the message handler at
 once, groups of successive tasks would all try to access the same server,
@@ -1210,7 +1210,7 @@
 already resolved host names.</li>

 <li>the crawler itself was designed to crawl large local networks, and
-not the internet. Thus, the number of hosts is very limited.</li>
+not the Internet. Thus, the number of hosts is very limited.</li>

 </ul>
 </blockquote>
@@ -1326,8 +1326,9 @@
 </li>
 </ul>
 <p>
-One thing to keep in mind is that the number of URLs transferred to
-other nodes should be as large as possible. </p>
+One thing to keep in mind is that the transfer of URLs to
+other nodes should be done in batches with hundreds or
+thousands or more URLs per batch.</p>
 <p>
 The next thing to be distributed is the storage mechanism. Here, the
 number of pure crawling nodes and the number of storing (post processing)
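The batching advice added in the last hunk (hand URLs to other nodes in groups of hundreds or thousands rather than one at a time) could look roughly like the following. This is a minimal sketch, not LARM code: `UrlBatcher` and its methods are hypothetical names, and the "transfer" is simulated by appending each batch to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT LARM code: collects URLs destined for another
// node and hands them over only once a full batch has accumulated.
final class UrlBatcher {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    // Stand-in for the network: every "sent" batch is recorded here.
    private final List<List<String>> sent = new ArrayList<>();

    UrlBatcher(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    // Queue one URL; a transfer happens only when the batch is full.
    void add(String url) {
        pending.add(url);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is queued, e.g. at the end of a crawl.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sent.add(new ArrayList<>(pending)); // copy, then reuse the buffer
        pending.clear();
    }

    List<List<String>> sentBatches() {
        return sent;
    }
}
```

In a real setup the batch size would be in the hundreds or thousands, as the text suggests, so that the per-transfer overhead is spread over many URLs.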