diff --git a/docs/lucene-sandbox/larm/overview.html b/docs/lucene-sandbox/larm/overview.html index a3dfe85cdff..c4a387a03f0 100644 --- a/docs/lucene-sandbox/larm/overview.html +++ b/docs/lucene-sandbox/larm/overview.html @@ -119,10 +119,19 @@

Author: Clemens Marschner

-

Revised: Oct. 28, 2002

+

Revised: Apr. 11, 2003

This document describes the configuration parameters and the inner - workings of the LARM web crawler. + workings of the LARM web crawler contribution. +

+

+ Note: There have been discussions about what the future of LARM could look like. + This paper, which describes the original architecture of LARM, shows that it + still has a number of shortcomings. These discussions have resulted in an effort to + expand the LARM crawler into a complete search engine. The project is still in + its infancy: contributions are very welcome. Please see + the LARM pages + in the Apache Wiki for details.

@@ -190,14 +199,25 @@

    -
  • this HTTPClient. Put the .zip file into the libs/ directory
  • +
  • a copy of a current lucene-X.jar. You can get it from Jakarta's download pages. +
  • a working installation of ANT (1.5 or above recommended). ant.sh/.bat should be in your PATH
  • +

- Change to the webcrawler-LARM directory and type + After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file. + The easiest way to create one is to copy the build.properties.sample file in LARM's root directory and adapt the path. +
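The resulting file might look along these lines (the property name here is an assumption; copy build.properties.sample and keep whatever key it actually uses):

```properties
# build.properties - local build settings, not kept under version control.
# Hypothetical key: take the real name from build.properties.sample.
lucene.jar=/usr/local/java/lucene-1.3/lucene-1.3.jar
```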

+

+ LARM needs a couple of other libraries, which have been included in the libs/ directory; you shouldn't have to care about those. + Some fixes had to be applied to the underlying HTTP library, the HTTPClient from Ronald Tschalär. + The patched jar has been added to the libs/ directory. See the README file for details.
+

+

+ Compiling should work simply by typing

@@ -391,8 +411,8 @@
  • Scalability. The crawler was supposed to be able to crawl large intranets with hundreds of servers and hundreds of thousands of - documents within a reasonable amount of time. It was not meant to be - scalable to the whole Internet.
  • + documents within a reasonable amount of time. It was not meant to be + scalable to the whole Internet.
• Java. Although there were many crawlers around at the time when I started to think about it (in Summer 2000), I couldn't find a good @@ -546,7 +566,8 @@
  • +
    -                    java [-server] [-Xmx[ZZ]mb] -classpath fetcher.jar
    +                    java [-server] [-Xmx[ZZ]mb]
    +                    -classpath <path-to-LARM.jar>:<paths-to-libs/*.jar>:<path-to-lucene>
                         de.lanlab.larm.fetcher.FetcherMain
                         [-start STARTURL | @STARTURLFILE]+
                         -restrictto REGEX
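A concrete invocation, restricted to a single host, might look like this (the paths, heap size, and start URL are placeholders; adjust them to your installation):

```
java -server -Xmx256m \
    -classpath LARM.jar:libs/HTTPClient.jar:lucene-1.3.jar \
    de.lanlab.larm.fetcher.FetcherMain \
    -start http://www.example.com/ \
    -restrictto "http://www\.example\.com/.*"
```

Note that -restrictto takes a regular expression, so dots in the host name should be escaped as shown.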
    @@ -596,7 +617,8 @@
                     

    Unfortunately, a lot of the options are still not configurable from the - outside. Most of them are configured from within FetcherMain.java. + outside. Most of them are configured from within FetcherMain.java. You + will have to edit this file if you want to change LARM's behavior. However, others are still spread over some of the other classes. At this time, we tried to put a "FIXME" comment around all these options, so check out the source code.

    @@ -625,6 +647,46 @@

    + + + +
    + + LARM's output files + +
    +
    +

LARM is by default configured such that it outputs a number of files into the logs/ directory. + During a run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory + can be deleted if the crawler is stopped before its operation has ended. Log files are flushed every time + the ThreadMonitor is called, usually every 5 seconds. +

    + +

The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and + the output from different filter classes (see below). + Specifically, in the default configuration, the directory will contain the following files after a crawl:

    +
      +
• links.log - contains the list of links. One record is a tab-delimited line. Format is + [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link), + 1 (frame) or 2 (redirect). The anchor text is the text between the <a> and </a> tags, or the ALT attribute in the case of + IMG or AREA links. FRAME and LINK links don't contain anchor text. +
    • + +
• pagefile_x.pfl - contains the contents of the downloaded files. Page files are segments of max. 50 MB. Offsets + are included in store.log. Files are saved as-is.
    • +
    • store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url] + [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes + [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using + the [page file nr.], [offset] and [file size] attributes.
    • + +
• thread(\d+)[_errors].log - contain the output of each crawling thread
    • +
    • .*Filter.log - contain status messages of the different filters.
    • +
• ThreadMonitor.log - contains info from the ThreadMonitor. A self-explanatory description is included in the first line of this file.
    • +
    +
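Given these formats, a store.log record can be parsed and the corresponding document sliced out of a page file roughly like this (a sketch with made-up sample data; the field order follows the description above, and the page file is simulated with an in-memory buffer):

```python
import io

# A hypothetical store.log record (tab-delimited), field order as described above:
# [from-page] [url] [url-normalized] [link-type] [HTTP-response-code] [MIME type]
# [file size] [HTML-title] [page file nr.] [offset in page file]
record = "http://a/\thttp://a/b\thttp://a/b\t0\t200\ttext/html\t5\tHello\t0\t10"

fields = record.split("\t")
size = int(fields[6])          # [file size]
pagefile_nr = int(fields[8])   # [page file nr.]
offset = int(fields[9])        # [offset in page file]

# Extract the document body from the corresponding page file
# (in a real run this would be open("pagefile_%d.pfl" % pagefile_nr, "rb")):
pagefile = io.BytesIO(b"-" * 10 + b"<html" + b"-" * 5)
pagefile.seek(offset)
body = pagefile.read(size)
print(body)  # b'<html'
```

Since the files are stored as-is at the recorded offset, this seek-and-read is all that is needed to recover a page from a crawl.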

    +
    +

    diff --git a/xdocs/lucene-sandbox/larm/overview.xml b/xdocs/lucene-sandbox/larm/overview.xml index 409a33c46eb..7780182ef75 100644 --- a/xdocs/lucene-sandbox/larm/overview.xml +++ b/xdocs/lucene-sandbox/larm/overview.xml @@ -15,13 +15,22 @@

    Author: Clemens Marschner

    -

    Revised: Oct. 28, 2002

    +

    Revised: Apr. 11, 2003

    This document describes the configuration parameters and the inner - workings of the LARM web crawler. + workings of the LARM web crawler contribution.

    +

+ Note: There have been discussions about what the future of LARM could look like. + This paper, which describes the original architecture of LARM, shows that it + still has a number of shortcomings. These discussions have resulted in an effort to + expand the LARM crawler into a complete search engine. The project is still in + its infancy: contributions are very welcome. Please see + the LARM pages + in the Apache Wiki for details. +

    @@ -63,17 +72,27 @@
      -
    • this HTTPClient. Put the .zip file into the libs/ directory
    • +
    • a copy of a current lucene-X.jar. You can get it from Jakarta's download pages. +
    • a working installation of ANT (1.5 or above recommended). ant.sh/.bat should be in your PATH
    • +

- Change to the webcrawler-LARM directory and type + After that you will need to tell ANT where the lucene.jar is located. This is done in the build.properties file. + The easiest way to create one is to copy the build.properties.sample file in LARM's root directory and adapt the path. +

    +

+ LARM needs a couple of other libraries, which have been included in the libs/ directory; you shouldn't have to care about those. + Some fixes had to be applied to the underlying HTTP library, the HTTPClient from Ronald Tschalär. + The patched jar has been added to the libs/ directory. See the README file for details.
    +

    +

    + Compiling should work simply by typing

    ant @@ -233,8 +252,8 @@
  • Scalability. The crawler was supposed to be able to crawl large intranets with hundreds of servers and hundreds of thousands of - documents within a reasonable amount of time. It was not meant to be - scalable to the whole Internet.
  • + documents within a reasonable amount of time. It was not meant to be + scalable to the whole Internet.
• Java. Although there were many crawlers around at the time when I started to think about it (in Summer 2000), I couldn't find a good @@ -369,7 +388,8 @@

    :: de.lanlab.larm.fetcher.FetcherMain [-start STARTURL | @STARTURLFILE]+ -restrictto REGEX @@ -416,7 +436,8 @@

    Unfortunately, a lot of the options are still not configurable from the - outside. Most of them are configured from within FetcherMain.java. + outside. Most of them are configured from within FetcherMain.java. You + will have to edit this file if you want to change LARM's behavior. However, others are still spread over some of the other classes. At this time, we tried to put a "FIXME" comment around all these options, so check out the source code.

    @@ -448,6 +469,37 @@ + +

LARM is by default configured such that it outputs a number of files into the logs/ directory. + During a run it also uses a cachingqueue/ directory that holds temporary internal queues. This directory + can be deleted if the crawler is stopped before its operation has ended. Log files are flushed every time + the ThreadMonitor is called, usually every 5 seconds. +

    + +

The logs/ directory keeps output from the LogStorage, which is a pretty verbose storage class, and + the output from different filter classes (see below). + Specifically, in the default configuration, the directory will contain the following files after a crawl:

    +
      +
• links.log - contains the list of links. One record is a tab-delimited line. Format is + [from-page] [to-page] [to-page-normalized] [link-type] [anchor-text]. link-type can be 0 (ordinary link), + 1 (frame) or 2 (redirect). The anchor text is the text between the <a> and </a> tags, or the ALT attribute in the case of + IMG or AREA links. FRAME and LINK links don't contain anchor text. +
    • + +
• pagefile_x.pfl - contains the contents of the downloaded files. Page files are segments of max. 50 MB. Offsets + are included in store.log. Files are saved as-is.
    • +
    • store.log - contains the list of downloaded files. One record is a tab-delimited line. Format is [from-page] [url] + [url-normalized] [link-type] [HTTP-response-code] [MIME type] [file size] [HTML-title] [page file nr.] [offset in page file]. The attributes + [from-page] and [link-type] refer to the first link found to this file. You can extract the files from the page files by using + the [page file nr.], [offset] and [file size] attributes.
    • + +
• thread(\d+)[_errors].log - contain the output of each crawling thread
    • +
    • .*Filter.log - contain status messages of the different filters.
    • +
• ThreadMonitor.log - contains info from the ThreadMonitor. A self-explanatory description is included in the first line of this file.
    • +
    +

    +
    +