Implemented suggestions by Marc Tucker

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149703 13f79535-47bb-0310-9956-ffa450edef68
2002-02-24 15:58:41 +00:00 · 2002-02-24 15:58:41 +00:00 · 3929335807
parent 0aa1ebe281
commit 3929335807
2 changed files with 78 additions and 20 deletions
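
Review note: the options this change adds to the AbstractCrawler plan (SleeptimeBetweenCalls, RequestTimeout, IncludeFilter, ExcludeFilter, MaxItems, MaxMegs) could be modelled roughly as below. This is only a sketch under stated assumptions, not code from the Lucene tree: the class name `CrawlerOptions`, its method names, the millisecond units, the use of regular expressions for filters, and the rule that an exclude match wins over an include match are all hypothetical, since the plan does not specify them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for the crawler options listed in the plan.
// Option names follow the plan; types, units, and defaults are assumptions.
class CrawlerOptions {
    long sleeptimeBetweenCalls = 0L;   // assumed: milliseconds to wait between requests
    long requestTimeout = 0L;          // assumed: milliseconds of inactivity, 0 = no timeout
    long maxItems = Long.MAX_VALUE;    // stop after this many documents
    long maxMegs = Long.MAX_VALUE;     // stop after this many megabytes

    // Both filters "can occur multiple times", so each is kept as a list.
    private final List<Pattern> includeFilters = new ArrayList<>();
    private final List<Pattern> excludeFilters = new ArrayList<>();

    void addIncludeFilter(String regex) { includeFilters.add(Pattern.compile(regex)); }
    void addExcludeFilter(String regex) { excludeFilters.add(Pattern.compile(regex)); }

    // Assumed semantics: any exclude match rejects the item; otherwise the item
    // must match at least one include filter, unless no include filter is set.
    boolean accepts(String item) {
        for (Pattern p : excludeFilters) {
            if (p.matcher(item).find()) return false;
        }
        if (includeFilters.isEmpty()) return true;
        for (Pattern p : includeFilters) {
            if (p.matcher(item).find()) return true;
        }
        return false;
    }

    // True once either the MaxItems or the MaxMegs limit has been reached.
    boolean limitReached(long itemsIndexed, long bytesIndexed) {
        long megsIndexed = bytesIndexed / (1024L * 1024L);
        return itemsIndexed >= maxItems || megsIndexed >= maxMegs;
    }
}
```

A crawler loop would presumably call `accepts` before fetching each item and `limitReached` after indexing it. Whether MaxMegs belongs here or in specific crawlers is left open by the plan itself.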
--- a/docs/luceneplan.html
+++ b/docs/luceneplan.html
@@ -199,24 +199,24 @@
                                                <table border="0" cellspacing="0" cellpadding="2" width="100%">
      <tr><td bgcolor="#525D76">
        <font color="#ffffff" face="arial,helvetica,sanserif">
-          <a name="Indexers"><strong>Indexers</strong></a>
+          <a name="Crawlers"><strong>Crawlers</strong></a>
        </font>
      </td></tr>
      <tr><td>
        <blockquote>
                                    <p>
-                        Indexers are standard crawlers.  They go crawl a file
+                        Crawlers are the executable code for a data source.  They crawl a file
                        system, ftp site, web site, etc. to create the index.
-                        These standard indexers may not make ALL of Lucene's
+                        These standard crawlers may not make ALL of Lucene's
                        functionality available, though they should be able to
                        make most of it available through configuration.
                </p>
                                                <p>
-                        <b> Abstract Indexer </b>
+                        <b> Abstract Crawler </b>
                </p>
                                                <p>
-                                The Abstract indexer is basically the parent for all
-                                Indexer classes.  It provides implementation for the
+                                The AbstractCrawler is basically the parent for all
+                                Crawler classes.  It provides implementation for the
                                following functions/properties:
                        </p>
                                                <ul>
@@ -263,6 +263,35 @@
                                        be turned on or this is ignored.  Range:
                                        0 - Long.MAX_VALUE.
                                </li>
+                                <li>
+                                        SleeptimeBetweenCalls - can be used to 
+                                        avoid flooding a machine with too many 
+                                        requests
+                                </li>
+                                <li>
+                                        RequestTimeout - kill the crawler
+                                        request after the specified period of
+                                        inactivity.
+                                </li>
+                                <li>
+                                        IncludeFilter - include only items 
+                                        matching filter.  (can occur multiple
+                                        times)
+                                </li>
+                                <li>
+                                        ExcludeFilter - exclude items
+                                        matching filter.  (can occur multiple
+                                        times)
+                                </li>
+                                <li>
+                                        MaxItems - stops indexing after x
+                                        documents have been indexed.
+                                </li>
+                                <li>
+                                        MaxMegs - stops indexing after x megs
+                                        have been indexed.  (should this be in
+                                        specific crawlers?)
+                                </li>
                                <li>
                                        properties - in addition to the settings
                                        (probably from the command line) read
@@ -275,18 +304,18 @@
                                </li>
                        </ul>
                                                <p>
-                              <b>FileSystemIndexer</b>
+                              <b>FileSystemCrawler</b>
                        </p>
                                                <p>
-                                This should extend the AbstractIndexer and
+                                This should extend the AbstractCrawler and
                                 support any additional options required for a
                                filesystem index.
                        </p>
                                                <p>
-			      <b>HTTP Indexer </b>
+			      <b>HTTP Crawler </b>
                        </p>
                                                <p>
-                                Supports the AbstractIndexer options as well as:                                
+                                Supports the AbstractCrawler options as well as:                                
                        </p>
                                                <ul>
                                <li>
--- a/xdocs/luceneplan.xml
+++ b/xdocs/luceneplan.xml
@@ -91,21 +91,21 @@
                        </li>
                </ul>
        </section>
-        <section name="Indexers">
+        <section name="Crawlers">
                <p>
-                        Indexers are standard crawlers.  They go crawl a file
+                        Crawlers are the executable code for a data source.  They crawl a file
                        system, ftp site, web site, etc. to create the index.
-                        These standard indexers may not make ALL of Lucene's
+                        These standard crawlers may not make ALL of Lucene's
                        functionality available, though they should be able to
                        make most of it available through configuration.
                </p>
                <!--<section name="AbstractIndexer">-->
                <p>
-                        <b> Abstract Indexer </b>
+                        <b> Abstract Crawler </b>
                </p>
                        <p>
-                                The Abstract indexer is basically the parent for all
-                                Indexer classes.  It provides implementation for the
+                                The AbstractCrawler is basically the parent for all
+                                Crawler classes.  It provides implementation for the
                                following functions/properties:
                        </p>
                        <ul>
@@ -152,6 +152,35 @@
                                        be turned on or this is ignored.  Range:
                                        0 - Long.MAX_VALUE.
                                </li>
+                                <li>
+                                        SleeptimeBetweenCalls - can be used to 
+                                        avoid flooding a machine with too many 
+                                        requests
+                                </li>
+                                <li>
+                                        RequestTimeout - kill the crawler
+                                        request after the specified period of
+                                        inactivity.
+                                </li>
+                                <li>
+                                        IncludeFilter - include only items 
+                                        matching filter.  (can occur multiple
+                                        times)
+                                </li>
+                                <li>
+                                        ExcludeFilter - exclude items
+                                        matching filter.  (can occur multiple
+                                        times)
+                                </li>
+                                <li>
+                                        MaxItems - stops indexing after x
+                                        documents have been indexed.
+                                </li>
+                                <li>
+                                        MaxMegs - stops indexing after x megs
+                                        have been indexed.  (should this be in
+                                        specific crawlers?)
+                                </li>
                                <li>
                                        properties - in addition to the settings
                                        (probably from the command line) read
@@ -166,20 +195,20 @@
                <!--</section>-->
                <!--<s2 title="FileSystemIndexer">-->
                        <p>
-                              <b>FileSystemIndexer</b>
+                              <b>FileSystemCrawler</b>
                        </p>
                        <p>
-                                This should extend the AbstractIndexer and
+                                This should extend the AbstractCrawler and
                                 support any additional options required for a
                                filesystem index.
                        </p>
                <!--</s2>-->
                <!--<s2 title="HTTPIndexer">-->
                        <p>
-			      <b>HTTP Indexer </b>
+			      <b>HTTP Crawler </b>
                        </p>
                        <p>
-                                Supports the AbstractIndexer options as well as:                                
+                                Supports the AbstractCrawler options as well as:                                
                        </p>
                        <ul>
                                <li>