- Updated to reflect new build procedure.

Contributed by: Clemens Marschner

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149973 13f79535-47bb-0310-9956-ffa450edef68
2003-04-12 00:43:04 +00:00
parent d3835471fa
commit 9e4c0f3925
2 changed files with 131 additions and 17 deletions
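
Reader's note: the links.log format documented in this commit (one tab-delimited record per line, with the fields from-page, to-page, to-page-normalized, link-type and anchor-text) can be post-processed with a few lines of Java. The following is a minimal sketch and not part of LARM. The names LinkLogRecord, LinkType and countByType are hypothetical helpers introduced only for illustration, and the sketch assumes that the anchor-text field may be empty or missing for FRAME and LINK links.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** One links.log record (hypothetical helper, not a LARM class). */
final class LinkLogRecord {
    /** Link types in the order the overview lists them: 0, 1, 2. */
    enum LinkType { ORDINARY, FRAME, REDIRECT }

    final String fromPage;
    final String toPage;
    final String toPageNormalized;
    final LinkType linkType;
    final String anchorText; // empty when the link carries no anchor text

    private LinkLogRecord(String[] f, LinkType type) {
        fromPage = f[0];
        toPage = f[1];
        toPageNormalized = f[2];
        linkType = type;
        anchorText = f.length > 4 ? f[4] : "";
    }

    /** Parses one tab-delimited line; rejects lines with too few fields. */
    static LinkLogRecord parse(String line) {
        String[] f = line.split("\t", -1); // -1 keeps empty trailing fields
        if (f.length < 4) {
            throw new IllegalArgumentException("expected at least 4 fields, got " + f.length);
        }
        int code = Integer.parseInt(f[3]);
        LinkType[] types = LinkType.values();
        if (code < 0 || code >= types.length) {
            throw new IllegalArgumentException("unknown link-type " + code);
        }
        return new LinkLogRecord(f, types[code]);
    }

    /** Counts how many records of each link type a list of lines contains. */
    static Map<LinkType, Integer> countByType(List<String> lines) {
        Map<LinkType, Integer> counts = new EnumMap<>(LinkType.class);
        for (String line : lines) {
            counts.merge(parse(line).linkType, 1, Integer::sum);
        }
        return counts;
    }
}
```

Feeding countByType the lines of logs/links.log (for example via java.nio.file.Files.readAllLines) gives a quick overview of how many ordinary links, frames and redirects a crawl encountered.
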
--- a/docs/lucene-sandbox/larm/overview.html
+++ b/docs/lucene-sandbox/larm/overview.html
@@ -119,10 +119,19 @@
      <tr><td>
        <blockquote>
                                    <p align="center">Author: Clemens Marschner</p>
-                                                <p align="center">Revised: Oct. 28, 2002</p>
+                                                <p align="center">Revised: Apr. 11, 2003</p>
                                                <p>
                This document describes the configuration parameters and the inner
-                workings of the LARM web crawler.
+                workings of the LARM web crawler contribution.
            </p>
                                                <p>
               <b><i>Note: There have been discussions about what the future of LARM could look like.
               In this paper, which describes the original architecture of LARM, you can see that it
               still has a lot of shortcomings. The discussions have resulted in an effort to
               expand the LARM crawler into a complete search engine. The project is still in
               its infancy: Contributions are very welcome. Please see
               <a href="http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages">the LARM pages</a>
               in the Apache Wiki for details.</i></b>
            </p>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
@@ -190,14 +199,25 @@
                </p>
                                                <ul>
-                    <li>this <a href="http://www.innovation.ch/java/HTTPClient/">HTTPClient</a>. Put the .zip file into the libs/ directory</li>
+                    <li>a copy of a current lucene-X.jar. You can get it from Jakarta's <a href="http://jakarta.apache.org/builds/jakarta-lucene/release/">download pages</a>.
                    </li>
                    <li>a working installation of <a href="http://jakarta.apache.org/ant">ANT</a> (1.5 or above recommended). ant.sh/.bat should be in your
                        PATH</li>
                </ul>
                                                <p>
-                    Change to the webcrawler-LARM directory and type
+                    After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file.
                    The easiest way to write one is to copy the build.properties.sample file in LARM's root directory and adapt the path.
                </p>
                                                <p>
                    LARM needs a couple of other libraries which have been included in the libs/ directory. You shouldn't have to care about that.
                    Some fixes had to be applied to the underlying HTTP library, the HTTPClient from <a href="http://www.innovation.ch">Roland Tschalär</a>.
                    The patched jar was added to the libraries in the libs/ directory now. See the README file for details.<br />
                </p>
                                                <p>
                    Compiling should work simply by typing
                </p>
                                                    <div align="left">
    <table cellspacing="4" cellpadding="0" border="0">
@@ -391,8 +411,8 @@
                    <li>Scalability. The crawler was supposed to be able to crawl <i>large
                            intranets</i> with hundreds of servers and hundreds of thousands of
-                        documents within a reasonable amount of time. It was not meant to be
+                        documents within a reasonable amount of time. <i>It was not meant to be
-                        scalable to the whole Internet.</li>
+                        scalable to the whole Internet</i>.</li>
                    <li>Java. Although there are many crawlers around at the time when I
                        started to think about it (in Summer 2000), I couldn't find a good
@@ -546,7 +566,8 @@
    <tr>
      <td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
      <td bgcolor="#ffffff"><pre>
-                    java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
+                    java [-server] [-Xmx[ZZ]mb]
                    -classpath &lt;path-to-LARM.jar&gt;:&lt;paths-to-libs/*.jar&gt;:&lt;path-to-lucene&gt;
                    de.lanlab.larm.fetcher.FetcherMain
                    [-start STARTURL | @STARTURLFILE]+
                    -restrictto REGEX
@@ -596,7 +617,8 @@
                </p>
                                                <p>
                    Unfortunately, a lot of the options are still not configurable from the
-                    outside. Most of them are configured from within FetcherMain.java.
+                    outside. Most of them are configured from within FetcherMain.java. <i>You
                    will have to edit this file if you want to change LARM's behavior</i>.
                    However, others are still spread over some of the other classes. At this
                    time, we tried to put a "FIXME" comment around all these options, so
                    check out the source code. </p>
@@ -625,6 +647,46 @@
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
    </table>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
        <font color="#ffffff" face="arial,helvetica,sanserif">
          <a name="LARM's output files"><strong>LARM's output files</strong></a>
        </font>
      </td></tr>
      <tr><td>
        <blockquote>
                                    <p>LARM is by default configured such that it outputs a bunch of files into the logs/ directory.
               During the run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory
               can be deleted if the crawler should be stopped before its operation has ended. Logfiles are flushed every time
               the ThreadMonitor is called, usually every 5 seconds.
               <p />
               <p>The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and
               the output from different filter classes (see below).
               Namely, in the default configuration, the directory will contain the following files after a crawl:</p>
               <ul>
               <li>links.log - contains the list of links. One record is a tab-delimited line. Format is
               [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link),
               1 (frame) or 2 (redirect). The anchor text is the text between &lt;a&gt; and &lt;/a&gt; tags, or the ALT attribute in case of
               IMG or AREA links. FRAME or LINK links don't contain anchor text.
               </li>
               <li>pagefile_x.pfl - contains the contents of the downloaded files. Page files are segments of at most 50 MB. Offsets
               are included in store.log. Files are saved as-is.</li>
               <li>store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url]
               [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes
               [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using
               the [page file nr.], [offset] and [file size] attributes.</li>
               <li>thread(\n+)[|_errors].log - contains the output of each crawling thread.</li>
               <li>.*Filter.log - contains status messages of the different filters.</li>
               <li>ThreadMonitor.log - contains info from the ThreadMonitor. An explanation of the format is included in the first line of this file.</li>
               </ul>
              </p>
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
    </table>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
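
Reader's note: the hunk above says that files can be extracted from the page files using the [page file nr.], [offset] and [file size] attributes of store.log. A minimal Java sketch of that lookup follows. It is not taken from LARM: StoreLogRecord and readContent are hypothetical names, and the sketch assumes that the "x" in pagefile_x.pfl is the page file number, that offsets and sizes are byte counts, and that every record has the ten tab-delimited fields in the order the overview lists them.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

/** One store.log record (hypothetical helper, not a LARM class). */
final class StoreLogRecord {
    final String fromPage;
    final String url;
    final String urlNormalized;
    final int linkType;      // 0 = ordinary link, 1 = frame, 2 = redirect
    final int responseCode;  // HTTP response code
    final String mimeType;
    final int fileSize;      // assumed to be a byte count
    final String title;
    final int pageFileNr;
    final long offset;       // assumed to be a byte offset into the page file

    private StoreLogRecord(String[] f) {
        fromPage = f[0];
        url = f[1];
        urlNormalized = f[2];
        linkType = Integer.parseInt(f[3]);
        responseCode = Integer.parseInt(f[4]);
        mimeType = f[5];
        fileSize = Integer.parseInt(f[6]);
        title = f[7];
        pageFileNr = Integer.parseInt(f[8]);
        offset = Long.parseLong(f[9]);
    }

    /** Parses one tab-delimited line; rejects lines with too few fields. */
    static StoreLogRecord parse(String line) {
        String[] f = line.split("\t", -1); // -1 keeps empty fields such as a missing title
        if (f.length < 10) {
            throw new IllegalArgumentException("expected 10 fields, got " + f.length);
        }
        return new StoreLogRecord(f);
    }

    /** Reads this record's bytes from pagefile_<nr>.pfl inside the given logs directory. */
    byte[] readContent(Path logsDir) throws IOException {
        Path pageFile = logsDir.resolve("pagefile_" + pageFileNr + ".pfl");
        try (RandomAccessFile raf = new RandomAccessFile(pageFile.toFile(), "r")) {
            raf.seek(offset);
            byte[] content = new byte[fileSize];
            raf.readFully(content); // throws EOFException if the page file is too short
            return content;
        }
    }
}
```

A caller would read logs/store.log line by line, call StoreLogRecord.parse on each line and then readContent(Paths.get("logs")) for the records of interest. If the real files use a different naming scheme or unit, only readContent needs to change.
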
--- a/xdocs/lucene-sandbox/larm/overview.xml
+++ b/xdocs/lucene-sandbox/larm/overview.xml
@@ -15,13 +15,22 @@
            <p align="center">Author: Clemens Marschner</p>
-            <p align="center">Revised: Oct. 28, 2002</p>
+            <p align="center">Revised: Apr. 11, 2003</p>
            <p>
                This document describes the configuration parameters and the inner
-                workings of the LARM web crawler.
+                workings of the LARM web crawler contribution.
            </p>
            <p>
               <b><i>Note: There have been discussions about what the future of LARM could look like.
               In this paper, which describes the original architecture of LARM, you can see that it
               still has a lot of shortcomings. The discussions have resulted in an effort to
               expand the LARM crawler into a complete search engine. The project is still in
               its infancy: Contributions are very welcome. Please see
               <a href="http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages">the LARM pages</a>
               in the Apache Wiki for details.</i></b>
            </p>
            <subsection name="Purpose and Intended Audience">
@@ -63,17 +72,27 @@
                <ul>
-                    <li>this <a
+                    <li>a copy of a current lucene-X.jar. You can get it from Jakarta's <a href="http://jakarta.apache.org/builds/jakarta-lucene/release/">download pages</a>.
-                                href="http://www.innovation.ch/java/HTTPClient/">HTTPClient</a>. Put the .zip file into the libs/ directory</li>
+                    </li>
                    <li>a working installation of <a
                                                     href="http://jakarta.apache.org/ant">ANT</a> (1.5 or above recommended). ant.sh/.bat should be in your
                        PATH</li>
                </ul>
                <p>
-                    Change to the webcrawler-LARM directory and type
+                    After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file.
                    The easiest way to write one is to copy the build.properties.sample file in LARM's root directory and adapt the path.
                </p>
                <p>
                    LARM needs a couple of other libraries which have been included in the libs/ directory. You shouldn't have to care about that.
                    Some fixes had to be applied to the underlying HTTP library, the HTTPClient from <a href="http://www.innovation.ch">Roland Tschalär</a>.
                    The patched jar was added to the libraries in the libs/ directory now. See the README file for details.<br/>
                </p>
                <p>
                    Compiling should work simply by typing
                </p>
                <source>ant</source>
@@ -233,8 +252,8 @@
                    <li>Scalability. The crawler was supposed to be able to crawl <i>large
                            intranets</i> with hundreds of servers and hundreds of thousands of
-                        documents within a reasonable amount of time. It was not meant to be
+                        documents within a reasonable amount of time. <i>It was not meant to be
-                        scalable to the whole Internet.</li>
+                        scalable to the whole Internet</i>.</li>
                    <li>Java. Although there are many crawlers around at the time when I
                        started to think about it (in Summer 2000), I couldn't find a good
@@ -369,7 +388,8 @@
                </p>
                <source><![CDATA[
-                    java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
+                    java [-server] [-Xmx[ZZ]mb]
                    -classpath <path-to-LARM.jar>:<paths-to-libs/*.jar>:<path-to-lucene>
                    de.lanlab.larm.fetcher.FetcherMain
                    [-start STARTURL | @STARTURLFILE]+
                    -restrictto REGEX
@@ -416,7 +436,8 @@
                <p>
                    Unfortunately, a lot of the options are still not configurable from the
-                    outside. Most of them are configured from within FetcherMain.java.
+                    outside. Most of them are configured from within FetcherMain.java. <i>You
                    will have to edit this file if you want to change LARM's behavior</i>.
                    However, others are still spread over some of the other classes. At this
                    time, we tried to put a "FIXME" comment around all these options, so
                    check out the source code. </p>
@@ -448,6 +469,37 @@
            </subsection>
            <!--zz !! -->
            <subsection name="LARM's output files">
               <p>LARM is by default configured such that it outputs a bunch of files into the logs/ directory.
               During the run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory
               can be deleted if the crawler should be stopped before its operation has ended. Logfiles are flushed every time
               the ThreadMonitor is called, usually every 5 seconds.
               <p/>
               <p>The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and
               the output from different filter classes (see below).
               Namely, in the default configuration, the directory will contain the following files after a crawl:</p>
               <ul>
               <li>links.log - contains the list of links. One record is a tab-delimited line. Format is
               [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link),
               1 (frame) or 2 (redirect). The anchor text is the text between &lt;a&gt; and &lt;/a&gt; tags, or the ALT attribute in case of
               IMG or AREA links. FRAME or LINK links don't contain anchor text.
               </li>
               <li>pagefile_x.pfl - contains the contents of the downloaded files. Page files are segments of at most 50 MB. Offsets
               are included in store.log. Files are saved as-is.</li>
               <li>store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url]
               [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes
               [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using
               the [page file nr.], [offset] and [file size] attributes.</li>
               <li>thread(\n+)[|_errors].log - contains the output of each crawling thread.</li>
               <li>.*Filter.log - contains status messages of the different filters.</li>
               <li>ThreadMonitor.log - contains info from the ThreadMonitor. An explanation of the format is included in the first line of this file.</li>
               </ul>
              </p>
            </subsection>
            <subsection name="Normalized URLs">
                <p>