- Updated to reflect new build procedure.

Contributed by: Clemens Marschner git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149973 13f79535-47bb-0310-9956-ffa450edef68
2003-04-12 00:43:04 +00:00 · 2003-04-12 00:43:04 +00:00 · 9e4c0f3925
parent d3835471fa
commit 9e4c0f3925
2 changed files with 131 additions and 17 deletions
--- a/docs/lucene-sandbox/larm/overview.html
+++ b/docs/lucene-sandbox/larm/overview.html
@ -119,10 +119,19 @@
      <tr><td>
        <blockquote>
                                    <p align="center">Author: Clemens Marschner</p>
-                                                <p align="center">Revised: Oct. 28, 2002</p>
+                                                <p align="center">Revised: Apr. 11, 2003</p>
                                                <p>
                This document describes the configuration parameters and the inner
-                workings of the LARM web crawler.
+                workings of the LARM web crawler contribution.
+            </p>
+                                                <p>
+               <b><i>Note: There have been discussions about how the future of LARM could be.
+               In this paper, which describes the original architecture or LARM, you can see it
+               still has a lot of the shortcomings. The discussions have resulted in an effort to
+               expand the LARM-crawler into a complete search engine. The project is still in
+               its infancies: Contributions are very welcome. Please see
+               <a href="http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages">the LARM pages</a>
+               in the Apache Wiki for details.</i></b>
            </p>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
@ -190,14 +199,25 @@
                </p>
                                                <ul>

-                    <li>this <a href="http://www.innovation.ch/java/HTTPClient/">HTTPClient</a>. Put the .zip file into the libs/ directory</li>
+                    <li>a copy of a current lucene-X.jar. You can get it from Jakarta's <a href="http://jakarta.apache.org/builds/jakarta-lucene/release/">download pages</a>.
+                    </li>

                    <li>a working installation of <a href="http://jakarta.apache.org/ant">ANT</a> (1.5 or above recommended). ant.sh/.bat should be in your
                        PATH</li>

+
                </ul>
                                                <p>
-                    Change to the webcrawler-LARM directory and type
+                    After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file.
+                    The easiest way to write one is to copy the build.properties.sample file in LARM's root directory and adapt the path.
+                </p>
+                                                <p>
+                    LARM needs a couple of other libraries which have been included in the libs/ directory. You shouldn't have to care about that.
+                    Some fixes had to be applied to the underlying HTTP library, the HTTPClient from <a href="http://www.innovation.ch">Roland Tschalär</a>.
+                    The patched jar was added to the libraries in the libs/ directory now. See the README file for details.<br />
+                </p>
+                                                <p>
+                    Compiling should work simply by typing
                </p>
                                                    <div align="left">
    <table cellspacing="4" cellpadding="0" border="0">
@ -391,8 +411,8 @@

                    <li>Scalability. The crawler was supposed to be able to crawl <i>large
                            intranets</i> with hundreds of servers and hundreds of thousands of
-                        documents within a reasonable amount of time. It was not meant to be
-                        scalable to the whole Internet.</li>
+                        documents within a reasonable amount of time. <i>It was not meant to be
+                        scalable to the whole Internet</i>.</li>

                    <li>Java. Although there are many crawlers around at the time when I
                        started to think about it (in Summer 2000), I couldn't find a good
@ -546,7 +566,8 @@
    <tr>
      <td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
      <td bgcolor="#ffffff"><pre>
-                    java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
+                    java [-server] [-Xmx[ZZ]mb]
+                    -classpath &lt;path-to-LARM.jar&gt;:&lt;paths-to-libs/*.jar&gt;:&lt;path-to-lucene&gt;
                    de.lanlab.larm.fetcher.FetcherMain
                    [-start STARTURL | @STARTURLFILE]+
                    -restrictto REGEX
@ -596,7 +617,8 @@
                </p>
                                                <p>
                    Unfortunately, a lot of the options are still not configurable from the
-                    outside. Most of them are configured from within FetcherMain.java.
+                    outside. Most of them are configured from within FetcherMain.java. <i>You
+                    will have to edit this file if you want to change LARM's behavior</i>.
                    However, others are still spread over some of the other classes. At this
                    time, we tried to put a "FIXME" comment around all these options, so
                    check out the source code. </p>
@ -625,6 +647,46 @@
                            </blockquote>
      </td></tr>
      <tr><td><br/></td></tr>
+    </table>
+                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
+      <tr><td bgcolor="#828DA6">
+        <font color="#ffffff" face="arial,helvetica,sanserif">
+          <a name="LARM's output files"><strong>LARM's output files</strong></a>
+        </font>
+      </td></tr>
+      <tr><td>
+        <blockquote>
+                                    <p>LARM is by default configured such that it outputs a bunch of files into the logs/ directory.
+               During the run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory
+               can be deleted if the crawler should be stopped before its operation has ended. Logfiles are flushed every time
+               the ThreadMonitor is called, usually every 5 seconds.
+               <p />
+
+               <p>The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and
+               the output from different filter classes (see below).
+               Namely, in the default configuration, the directory will contain the following files after a crawl:</p>
+               <ul>
+               <li>links.log - contains the list of links. One record is a tab-delimited line. Format is
+               [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link),
+               1 (frame) or 2 (redirect). anchor text is the text between &lt;a&gt; and &lt;/a&gt; tags or the ALT-Tag in case of
+               IMG or AREA links. FRAME or LINK links don't contain anchor texts.
+               </li>
+
+               <li>pagefile_x.pfl - contains the contents of the downloaded files. Pagefiles are segments of max. 50 MB. Offsets
+               are included in store.log. files are saved as-is.</li>
+               <li>store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url]
+               [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes
+               [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using
+               the [page file nr.], [offset] and [file size] attributes.</li>
+
+               <li>thread(\n+)[|_errors].log - contain output of each crawling thread</li>
+               <li>.*Filter.log - contain status messages of the different filters.</li>
+               <li>ThreadMonitor.log - contains info from the ThreadMonitor. self-explanation is included in the first line of this file.</li>
+               </ul>
+              </p>
+                            </blockquote>
+      </td></tr>
+      <tr><td><br/></td></tr>
    </table>
                                                    <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#828DA6">
--- a/xdocs/lucene-sandbox/larm/overview.xml
+++ b/xdocs/lucene-sandbox/larm/overview.xml
@ -15,13 +15,22 @@

            <p align="center">Author: Clemens Marschner</p>

-            <p align="center">Revised: Oct. 28, 2002</p>
+            <p align="center">Revised: Apr. 11, 2003</p>

            <p>
                This document describes the configuration parameters and the inner
-                workings of the LARM web crawler.
+                workings of the LARM web crawler contribution.
            </p>

+            <p>
+               <b><i>Note: There have been discussions about how the future of LARM could be.
+               In this paper, which describes the original architecture or LARM, you can see it
+               still has a lot of the shortcomings. The discussions have resulted in an effort to
+               expand the LARM-crawler into a complete search engine. The project is still in
+               its infancies: Contributions are very welcome. Please see
+               <a href="http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages">the LARM pages</a>
+               in the Apache Wiki for details.</i></b>
+            </p>

            <subsection name="Purpose and Intended Audience">

@ -63,17 +72,27 @@

                <ul>

-                    <li>this <a
-                                href="http://www.innovation.ch/java/HTTPClient/">HTTPClient</a>. Put the .zip file into the libs/ directory</li>
+                    <li>a copy of a current lucene-X.jar. You can get it from Jakarta's <a href="http://jakarta.apache.org/builds/jakarta-lucene/release/">download pages</a>.
+                    </li>

                    <li>a working installation of <a
                                                     href="http://jakarta.apache.org/ant">ANT</a> (1.5 or above recommended). ant.sh/.bat should be in your
                        PATH</li>

+
                </ul>

                <p>
-                    Change to the webcrawler-LARM directory and type
+                    After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file.
+                    The easiest way to write one is to copy the build.properties.sample file in LARM's root directory and adapt the path.
+                </p>
+                <p>
+                    LARM needs a couple of other libraries which have been included in the libs/ directory. You shouldn't have to care about that.
+                    Some fixes had to be applied to the underlying HTTP library, the HTTPClient from <a href="http://www.innovation.ch">Roland Tschalär</a>.
+                    The patched jar was added to the libraries in the libs/ directory now. See the README file for details.<br/>
+                </p>
+                <p>
+                    Compiling should work simply by typing
                </p>

                <source>ant</source>
@ -233,8 +252,8 @@

                    <li>Scalability. The crawler was supposed to be able to crawl <i>large
                            intranets</i> with hundreds of servers and hundreds of thousands of
-                        documents within a reasonable amount of time. It was not meant to be
-                        scalable to the whole Internet.</li>
+                        documents within a reasonable amount of time. <i>It was not meant to be
+                        scalable to the whole Internet</i>.</li>

                    <li>Java. Although there are many crawlers around at the time when I
                        started to think about it (in Summer 2000), I couldn't find a good
@ -369,7 +388,8 @@
                </p>

                <source><![CDATA[
-                    java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
+                    java [-server] [-Xmx[ZZ]mb]
+                    -classpath <path-to-LARM.jar>:<paths-to-libs/*.jar>:<path-to-lucene>
                    de.lanlab.larm.fetcher.FetcherMain
                    [-start STARTURL | @STARTURLFILE]+
                    -restrictto REGEX
@ -416,7 +436,8 @@

                <p>
                    Unfortunately, a lot of the options are still not configurable from the
-                    outside. Most of them are configured from within FetcherMain.java.
+                    outside. Most of them are configured from within FetcherMain.java. <i>You
+                    will have to edit this file if you want to change LARM's behavior</i>.
                    However, others are still spread over some of the other classes. At this
                    time, we tried to put a "FIXME" comment around all these options, so
                    check out the source code. </p>
@ -448,6 +469,37 @@
            </subsection>
            <!--zz !! -->

+            <subsection name="LARM's output files">
+               <p>LARM is by default configured such that it outputs a bunch of files into the logs/ directory.
+               During the run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory
+               can be deleted if the crawler should be stopped before its operation has ended. Logfiles are flushed every time
+               the ThreadMonitor is called, usually every 5 seconds.
+               <p/>
+
+               <p>The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and
+               the output from different filter classes (see below).
+               Namely, in the default configuration, the directory will contain the following files after a crawl:</p>
+               <ul>
+               <li>links.log - contains the list of links. One record is a tab-delimited line. Format is
+               [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link),
+               1 (frame) or 2 (redirect). anchor text is the text between &lt;a&gt; and &lt;/a&gt; tags or the ALT-Tag in case of
+               IMG or AREA links. FRAME or LINK links don't contain anchor texts.
+               </li>
+
+               <li>pagefile_x.pfl - contains the contents of the downloaded files. Pagefiles are segments of max. 50 MB. Offsets
+               are included in store.log. files are saved as-is.</li>
+               <li>store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url]
+               [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes
+               [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using
+               the [page file nr.], [offset] and [file size] attributes.</li>
+
+               <li>thread(\n+)[|_errors].log - contain output of each crawling thread</li>
+               <li>.*Filter.log - contain status messages of the different filters.</li>
+               <li>ThreadMonitor.log - contains info from the ThreadMonitor. self-explanation is included in the first line of this file.</li>
+               </ul>
+              </p>
+            </subsection>
+
            <subsection name="Normalized URLs">

                <p>