Obtained from:
Submitted by:
Reviewed by:
Implemented suggestions by Marc Tucker


git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149703 13f79535-47bb-0310-9956-ffa450edef68
Andrew C. Oliver 2002-02-24 15:58:41 +00:00
parent 0aa1ebe281
commit 3929335807
2 changed files with 78 additions and 20 deletions

File 1 of 2

@@ -199,24 +199,24 @@
 <table border="0" cellspacing="0" cellpadding="2" width="100%">
 <tr><td bgcolor="#525D76">
 <font color="#ffffff" face="arial,helvetica,sanserif">
-<a name="Indexers"><strong>Indexers</strong></a>
+<a name="Crawlers"><strong>Crawlers</strong></a>
 </font>
 </td></tr>
 <tr><td>
 <blockquote>
 <p>
-Indexers are standard crawlers. They go crawl a file
+Crawlers are data source executable code. They crawl a file
 system, ftp site, web site, etc. to create the index.
-These standard indexers may not make ALL of Lucene's
+These standard crawlers may not make ALL of Lucene's
 functionality available, though they should be able to
 make most of it available through configuration.
 </p>
 <p>
-<b> Abstract Indexer </b>
+<b> Abstract Crawler </b>
 </p>
 <p>
-The Abstract indexer is basically the parent for all
-Indexer classes. It provides implementation for the
+The AbstractCrawler is basically the parent for all
+Crawler classes. It provides implementation for the
 following functions/properties:
 </p>
 <ul>
@@ -263,6 +263,35 @@
 be turned on or this is ignored. Range:
 0 - Long.MAX_VALUE.
 </li>
+<li>
+SleeptimeBetweenCalls - can be used to
+avoid flooding a machine with too many
+requests
+</li>
+<li>
+RequestTimeout - kill the crawler
+request after the specified period of
+inactivity.
+</li>
+<li>
+IncludeFilter - include only items
+matching filter. (can occur mulitple
+times)
+</li>
+<li>
+ExcludeFilter - exclude only items
+matching filter. (can occur multiple
+times)
+</li>
+<li>
+MaxItems - stops indexing after x
+documents have been indexed.
+</li>
+<li>
+MaxMegs - stops indexing after x megs
+have been indexed.. (should this be in
+specific crawlers?)
+</li>
 <li>
 properties - in addition to the settings
 (probably from the command line) read
@@ -275,18 +304,18 @@
 </li>
 </ul>
 <p>
-<b>FileSystemIndexer</b>
+<b>FileSystemCrawler</b>
 </p>
 <p>
-This should extend the AbstractIndexer and
+This should extend the AbstractCrawler and
 support any addtional options required for a
 filesystem index.
 </p>
 <p>
-<b>HTTP Indexer </b>
+<b>HTTP Crawler </b>
 </p>
 <p>
-Supports the AbstractIndexer options as well as:
+Supports the AbstractCrawler options as well as:
 </p>
 <ul>
 <li>
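The options added by this commit read like a configuration contract, so here is a minimal sketch of how they might map onto a Java base class. This is purely illustrative: the documents above are a design proposal, and every identifier below (AbstractCrawler, configure(), the field names) is an assumption rather than actual Lucene API.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    // Hypothetical sketch, not Lucene code: the proposal above never names
    // concrete signatures, so every identifier here is an assumption.
    public abstract class AbstractCrawler {

        // SleeptimeBetweenCalls - avoid flooding a machine with too many
        // requests (milliseconds between calls).
        protected long sleeptimeBetweenCalls = 0L;

        // RequestTimeout - kill a crawl request after this period of
        // inactivity (milliseconds).
        protected long requestTimeout = 60000L;

        // IncludeFilter / ExcludeFilter - the doc says these "can occur
        // multiple times", hence lists of regular-expression patterns.
        protected final List<String> includeFilters = new ArrayList<>();
        protected final List<String> excludeFilters = new ArrayList<>();

        // MaxItems / MaxMegs - stop indexing after x documents / x megs.
        protected long maxItems = Long.MAX_VALUE;
        protected long maxMegs = Long.MAX_VALUE;

        // "properties - in addition to the settings (probably from the
        // command line) read": load overrides from a properties file.
        public void configure(String propertiesFile) throws IOException {
            Properties p = new Properties();
            try (FileInputStream in = new FileInputStream(propertiesFile)) {
                p.load(in);
            }
            sleeptimeBetweenCalls = Long.parseLong(
                    p.getProperty("SleeptimeBetweenCalls", "0"));
            requestTimeout = Long.parseLong(
                    p.getProperty("RequestTimeout", "60000"));
            maxItems = Long.parseLong(
                    p.getProperty("MaxItems", Long.toString(Long.MAX_VALUE)));
            maxMegs = Long.parseLong(
                    p.getProperty("MaxMegs", Long.toString(Long.MAX_VALUE)));
        }

        // Each concrete crawler (file system, FTP site, web site...) supplies
        // the actual walk over its data source.
        public abstract void crawl() throws IOException;
    }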

File 2 of 2

@@ -91,21 +91,21 @@
 </li>
 </ul>
 </section>
-<section name="Indexers">
+<section name="Crawlers">
 <p>
-Indexers are standard crawlers. They go crawl a file
+Crawlers are data source executable code. They crawl a file
 system, ftp site, web site, etc. to create the index.
-These standard indexers may not make ALL of Lucene's
+These standard crawlers may not make ALL of Lucene's
 functionality available, though they should be able to
 make most of it available through configuration.
 </p>
 <!--<section name="AbstractIndexer">-->
 <p>
-<b> Abstract Indexer </b>
+<b> Abstract Crawler </b>
 </p>
 <p>
-The Abstract indexer is basically the parent for all
-Indexer classes. It provides implementation for the
+The AbstractCrawler is basically the parent for all
+Crawler classes. It provides implementation for the
 following functions/properties:
 </p>
 <ul>
@@ -152,6 +152,35 @@
 be turned on or this is ignored. Range:
 0 - Long.MAX_VALUE.
 </li>
+<li>
+SleeptimeBetweenCalls - can be used to
+avoid flooding a machine with too many
+requests
+</li>
+<li>
+RequestTimeout - kill the crawler
+request after the specified period of
+inactivity.
+</li>
+<li>
+IncludeFilter - include only items
+matching filter. (can occur mulitple
+times)
+</li>
+<li>
+ExcludeFilter - exclude only items
+matching filter. (can occur multiple
+times)
+</li>
+<li>
+MaxItems - stops indexing after x
+documents have been indexed.
+</li>
+<li>
+MaxMegs - stops indexing after x megs
+have been indexed.. (should this be in
+specific crawlers?)
+</li>
 <li>
 properties - in addition to the settings
 (probably from the command line) read
@@ -166,20 +195,20 @@
 <!--</section>-->
 <!--<s2 title="FileSystemIndexer">-->
 <p>
-<b>FileSystemIndexer</b>
+<b>FileSystemCrawler</b>
 </p>
 <p>
-This should extend the AbstractIndexer and
+This should extend the AbstractCrawler and
 support any addtional options required for a
 filesystem index.
 </p>
 <!--</s2>-->
 <!--<s2 title="HTTPIndexer">-->
 <p>
-<b>HTTP Indexer </b>
+<b>HTTP Crawler </b>
 </p>
 <p>
-Supports the AbstractIndexer options as well as:
+Supports the AbstractCrawler options as well as:
 </p>
 <ul>
 <li>
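In the same hypothetical vein, a FileSystemCrawler extending the sketch above shows how IncludeFilter, ExcludeFilter, and MaxItems could behave when walking a directory tree. Again, none of these names come from the Lucene codebase; the main() at the end doubles as a usage example.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Equally hypothetical: an illustration of the proposal, not the class
    // Lucene actually shipped.
    public class FileSystemCrawler extends AbstractCrawler {

        private final Path root;
        private long indexedCount = 0;

        public FileSystemCrawler(Path root) {
            this.root = root;
        }

        @Override
        public void crawl() throws IOException {
            try (Stream<Path> files = Files.walk(root)) {
                files.filter(Files::isRegularFile)
                     .filter(f -> passesFilters(f.toString()))
                     .forEach(this::index);
            }
        }

        // IncludeFilter: keep only matching items (when any filter is set);
        // ExcludeFilter: drop items matching any exclude pattern.
        private boolean passesFilters(String name) {
            boolean included = includeFilters.isEmpty()
                    || includeFilters.stream().anyMatch(name::matches);
            boolean excluded =
                    excludeFilters.stream().anyMatch(name::matches);
            return included && !excluded;
        }

        private void index(Path file) {
            // MaxItems: stop indexing after x documents have been indexed.
            if (indexedCount >= maxItems) {
                return;
            }
            indexedCount++;
            // A real crawler would hand the file to a Lucene IndexWriter here.
            System.out.println("indexing " + file);
        }

        public static void main(String[] args) throws IOException {
            FileSystemCrawler crawler = new FileSystemCrawler(Paths.get("docs"));
            crawler.includeFilters.add(".*\\.html"); // IncludeFilter, repeatable
            crawler.crawl();
        }
    }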