mirror of https://github.com/apache/lucene.git
PR:
Obtained from:
Submitted by:
Reviewed by:
implemented suggestions by Marc Tucker

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149703 13f79535-47bb-0310-9956-ffa450edef68
parent 0aa1ebe281
commit 3929335807
@@ -199,24 +199,24 @@
 <table border="0" cellspacing="0" cellpadding="2" width="100%">
 <tr><td bgcolor="#525D76">
 <font color="#ffffff" face="arial,helvetica,sanserif">
-<a name="Indexers"><strong>Indexers</strong></a>
+<a name="Crawlers"><strong>Crawlers</strong></a>
 </font>
 </td></tr>
 <tr><td>
 <blockquote>
 <p>
-Indexers are standard crawlers. They go crawl a file
+Crawlers are data source executable code. They crawl a file
 system, ftp site, web site, etc. to create the index.
-These standard indexers may not make ALL of Lucene's
+These standard crawlers may not make ALL of Lucene's
 functionality available, though they should be able to
 make most of it available through configuration.
 </p>
 <p>
-<b> Abstract Indexer </b>
+<b> Abstract Crawler </b>
 </p>
 <p>
-The Abstract indexer is basically the parent for all
-Indexer classes. It provides implementation for the
+The AbstractCrawler is basically the parent for all
+Crawler classes. It provides implementation for the
 following functions/properties:
 </p>
 <ul>
@@ -263,6 +263,35 @@
 be turned on or this is ignored. Range:
 0 - Long.MAX_VALUE.
 </li>
+<li>
+SleeptimeBetweenCalls - can be used to
+avoid flooding a machine with too many
+requests
+</li>
+<li>
+RequestTimeout - kill the crawler
+request after the specified period of
+inactivity.
+</li>
+<li>
+IncludeFilter - include only items
+matching filter. (can occur mulitple
+times)
+</li>
+<li>
+ExcludeFilter - exclude only items
+matching filter. (can occur multiple
+times)
+</li>
+<li>
+MaxItems - stops indexing after x
+documents have been indexed.
+</li>
+<li>
+MaxMegs - stops indexing after x megs
+have been indexed.. (should this be in
+specific crawlers?)
+</li>
 <li>
 properties - in addition to the settings
 (probably from the command line) read
@@ -275,18 +304,18 @@
 </li>
 </ul>
 <p>
-<b>FileSystemIndexer</b>
+<b>FileSystemCrawler</b>
 </p>
 <p>
-This should extend the AbstractIndexer and
+This should extend the AbstractCrawler and
 support any addtional options required for a
 filesystem index.
 </p>
 <p>
-<b>HTTP Indexer </b>
+<b>HTTP Crawler </b>
 </p>
 <p>
-Supports the AbstractIndexer options as well as:
+Supports the AbstractCrawler options as well as:
 </p>
 <ul>
 <li>

@@ -91,21 +91,21 @@
 </li>
 </ul>
 </section>
-<section name="Indexers">
+<section name="Crawlers">
 <p>
-Indexers are standard crawlers. They go crawl a file
+Crawlers are data source executable code. They crawl a file
 system, ftp site, web site, etc. to create the index.
-These standard indexers may not make ALL of Lucene's
+These standard crawlers may not make ALL of Lucene's
 functionality available, though they should be able to
 make most of it available through configuration.
 </p>
 <!--<section name="AbstractIndexer">-->
 <p>
-<b> Abstract Indexer </b>
+<b> Abstract Crawler </b>
 </p>
 <p>
-The Abstract indexer is basically the parent for all
-Indexer classes. It provides implementation for the
+The AbstractCrawler is basically the parent for all
+Crawler classes. It provides implementation for the
 following functions/properties:
 </p>
 <ul>
@@ -152,6 +152,35 @@
 be turned on or this is ignored. Range:
 0 - Long.MAX_VALUE.
 </li>
+<li>
+SleeptimeBetweenCalls - can be used to
+avoid flooding a machine with too many
+requests
+</li>
+<li>
+RequestTimeout - kill the crawler
+request after the specified period of
+inactivity.
+</li>
+<li>
+IncludeFilter - include only items
+matching filter. (can occur mulitple
+times)
+</li>
+<li>
+ExcludeFilter - exclude only items
+matching filter. (can occur multiple
+times)
+</li>
+<li>
+MaxItems - stops indexing after x
+documents have been indexed.
+</li>
+<li>
+MaxMegs - stops indexing after x megs
+have been indexed.. (should this be in
+specific crawlers?)
+</li>
 <li>
 properties - in addition to the settings
 (probably from the command line) read
@@ -166,20 +195,20 @@
 <!--</section>-->
 <!--<s2 title="FileSystemIndexer">-->
 <p>
-<b>FileSystemIndexer</b>
+<b>FileSystemCrawler</b>
 </p>
 <p>
-This should extend the AbstractIndexer and
+This should extend the AbstractCrawler and
 support any addtional options required for a
 filesystem index.
 </p>
 <!--</s2>-->
 <!--<s2 title="HTTPIndexer">-->
 <p>
-<b>HTTP Indexer </b>
+<b>HTTP Crawler </b>
 </p>
 <p>
-Supports the AbstractIndexer options as well as:
+Supports the AbstractCrawler options as well as:
 </p>
 <ul>
 <li>
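
Note on the options added in this commit: the design text lists SleeptimeBetweenCalls, RequestTimeout, IncludeFilter, ExcludeFilter, MaxItems and MaxMegs as AbstractCrawler properties but gives no types or signatures. The following is only a rough sketch of how such options might fit together, not code from the Lucene tree. Every class, field and method name below (CrawlerOptions, AbstractCrawlerSketch, ListCrawler, discover, index) is hypothetical. Treating the filters as regular expressions, letting exclude filters win over include filters, and measuring sleep time in milliseconds are all assumptions. RequestTimeout is carried as a setting but not enforced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for the options named in the design text.
// Types and units are assumptions; the source does not specify them.
class CrawlerOptions {
    long sleeptimeBetweenCallsMillis = 0;   // pause between items to avoid flooding a host
    long requestTimeoutMillis = 0;          // carried only; not enforced in this sketch
    long maxItems = Long.MAX_VALUE;         // stop after this many indexed items
    long maxMegs = Long.MAX_VALUE;          // stop after this many megabytes indexed
    final List<Pattern> includeFilters = new ArrayList<>(); // may hold several filters
    final List<Pattern> excludeFilters = new ArrayList<>(); // may hold several filters

    // Assumed semantics: any exclude match rejects the item; if include
    // filters exist, at least one must match; otherwise everything passes.
    boolean accepts(String item) {
        for (Pattern p : excludeFilters) {
            if (p.matcher(item).find()) return false;
        }
        if (includeFilters.isEmpty()) return true;
        for (Pattern p : includeFilters) {
            if (p.matcher(item).find()) return true;
        }
        return false;
    }
}

// Hypothetical parent class: applies the shared options, while subclasses
// supply the data source (file system, HTTP, ...).
abstract class AbstractCrawlerSketch {
    protected final CrawlerOptions options;
    private long itemsIndexed = 0;
    private long bytesIndexed = 0;

    AbstractCrawlerSketch(CrawlerOptions options) {
        this.options = options;
    }

    protected abstract Iterable<String> discover();   // items found in the data source
    protected abstract long index(String item);       // indexes one item, returns its size in bytes

    // Runs the crawl and returns the number of items indexed.
    public long crawl() {
        for (String item : discover()) {
            if (itemsIndexed >= options.maxItems) break;
            if (bytesIndexed / (1024L * 1024L) >= options.maxMegs) break;
            if (!options.accepts(item)) continue;
            bytesIndexed += index(item);
            itemsIndexed++;
            if (options.sleeptimeBetweenCallsMillis > 0) {
                try {
                    Thread.sleep(options.sleeptimeBetweenCallsMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status and stop
                    break;
                }
            }
        }
        return itemsIndexed;
    }
}

// Toy subclass over an in-memory list, standing in for a real data source.
class ListCrawler extends AbstractCrawlerSketch {
    private final List<String> items;
    final List<String> indexed = new ArrayList<>();

    ListCrawler(CrawlerOptions options, List<String> items) {
        super(options);
        this.items = items;
    }

    @Override
    protected Iterable<String> discover() {
        return items;
    }

    @Override
    protected long index(String item) {
        indexed.add(item);
        return item.length(); // stand-in for the real document size
    }
}

class CrawlerSketchDemo {
    public static void main(String[] args) {
        CrawlerOptions opts = new CrawlerOptions();
        opts.includeFilters.add(Pattern.compile("\\.html$"));
        opts.excludeFilters.add(Pattern.compile("^private/"));
        opts.maxItems = 2;
        ListCrawler crawler = new ListCrawler(opts,
                Arrays.asList("a.html", "private/b.html", "c.txt", "d.html", "e.html"));
        System.out.println(crawler.crawl() + " items: " + crawler.indexed);
    }
}
```

Keeping the limits and filters in the parent class mirrors the design text's intent that FileSystemCrawler and HTTP Crawler inherit the common options and add only their own.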