added application extensions plan and myself as a comitter

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149700 13f79535-47bb-0310-9956-ffa450edef68
2002-02-23 22:02:55 +00:00 · 2002-02-23 22:02:55 +00:00 · 04ebb07469
parent 84bd5a1565
commit 04ebb07469
3 changed files with 346 additions and 0 deletions
--- a/xdocs/luceneplan.xml
+++ b/xdocs/luceneplan.xml
@ -0,0 +1,341 @@
+<?xml version="1.0" encoding="UTF-8"?>
+     
+<document>
+  <properties>
+   <title>Plan for enhancements to Lucene</title>
+   <authors>
+    <person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/>
+   </authors>
+  </properties>
+  <body>
+  
+        <section name="Purpose">
+                <p>
+                        The purpose of this document is to outline plans for
+                        making <a href="http://jakarta.apache.org/lucene">
+                        Jakarta Lucene</a> work as a more general drop-in
+                        component.  It makes the assumption that this is an
+                        objective for the Lucene user and development community.
+                </p>
+                <p>
+                        The best reference is <a href="http://www.htdig.org">
+                        htDig</a>, though it is not quite as sophisticated as
+                        Lucene, it has a number of features that make it
+                        desireable.  It however is a traditional c-compiled app
+                        which makes it somewhat unpleasent to install on some
+                        platforms (like Solaris!).
+                </p>
+                <p>
+                        This plan is being submitted to the Lucene developer
+                        community for an initial reaction, advice, feedback and
+                        consent.  Following this it will be submitted to the
+                        Lucene user community for support.  Although, I'm (Andy
+                        Oliver) capable of providing these enhancements by 
+                        myself, I'd of course prefer to work on them in concert 
+                        with others.
+                </p>
+                <p>
+                        While I'm outlaying a fairly large featureset, these can
+                        be implemented incrementally of course (and are probably
+                        best if done that way).
+                </p>
+        </section>
+  
+        <section name="Goal and Objectives">
+                <p>
+                        The goal is to provide features to Lucene that allow it
+                        to be used as a dropin search engine.  It should provide
+                        many of the features of projects like <a
+                        href="http://www.htdig.org">htDig</a> while surpassing
+                        them with unique Lucene features and capabillities such as
+                        easy installation on and java-supporting platform,
+                        and support for document fields and field searches.  And 
+                        of course, <a href="http://apache.org/LICENSE">
+                        a pragmatic software license</a>.
+                </p>
+                <p>
+                        To reach this goal we'll implement code to support the
+                        following objectives that augment but do not replace
+                        the current Lucene featureset.  
+                </p>
+                <ul>
+                        <li>
+                                Document Location Independance - meaning mapping
+                                real contexts to runtime contexts.
+                                Essentially, if the document is at
+                                /var/www/htdocs/mydoc.html, I probably want it
+                                indexed as
+                                http://www.bigevilmegacorp.com/mydoc.html.                                
+                        </li>
+                        <li>
+                                Standard methods of creating central indicies -
+                                file system indexing is probably less useful in
+                                many environments than is *remote* indexing (for
+                                instance http).  I would suggest that most folks
+                                would prefer that general functionality be
+                                suppored by Lucene instead of having to write
+                                code for every indexing project.  Obviously, if
+                                what they are doing is *special* they'll have to
+                                code, but general document indexing accross
+                                webservers would not qualify.
+                        </li>
+                        <li>
+                                Document interperatation abstraction - currently
+                                one must handle document object construction via
+                                custom code.  A standard interface for plugging
+                                in format handlers should be supported.  
+                        </li>
+                        <li>
+                                Mime and file-extension to document
+                                interperatation mapping.                                  
+                        </li>
+                </ul>
+        </section>
+        <section name="Indexers">
+                <p>
+                        Indexers are standard crawlers.  They go crawl a file
+                        system, ftp site, web site, etc. to create the index.
+                        These standard indexers may not make ALL of Lucene's
+                        functionality available, though they should be able to
+                        make most of it available through configuration.
+                </p>
+                <!--<section name="AbstractIndexer">-->
+                <p>
+                        <b> Abstract Indexer </b>
+                </p>
+                        <p>
+                                The Abstract indexer is basically the parent for all
+                                Indexer classes.  It provides implementation for the
+                                following functions/properties:
+                        </p>
+                        <ul>
+                                <li>
+                                        index path - where to write the index.
+                                </li>
+                                <li>
+                                        cui - create or update the index
+                                </li>
+                                <li>
+                                        root context - the start of the pathname
+                                        that should be replaced by the
+                                        replace with property or dropped
+                                        entirely.  Example: /opt/tomcat/webapps
+                                </li>
+                                <li>
+                                        replace with - when specified replaces
+                                        the root context.  Example:
+                                        http://jakarta.apache.org.
+                                </li>
+                                <li>
+                                        replacement type - the type of
+                                        replacewith path:  relative, url or
+                                        path.
+                                </li>
+                                <li>
+                                        location - the location to start
+                                        indexing at.
+                                </li>
+                                <li>
+                                        doctypes - only index documents with
+                                        these doctypes.  If not specified all
+                                        registered mime-types are used.
+                                        Example: "xml,doc,html"
+                                </li>
+                                <li>
+                                        recursive - if not specified is turned
+                                        off.
+                                </li>
+                                <li>
+                                        level - optional level of directory or
+                                        links to traverse.  By default is
+                                        assumed to be infinite.  Recursive must
+                                        be turned on or this is ignored.  Range:
+                                        0 - Long.MAX_VALUE.
+                                </li>
+                                <li>
+                                        properties - in addition to the settings
+                                        (probably from the command line) read
+                                        this properties file and get them from
+                                        it.  Command line options override
+                                        the properties file in the case of 
+                                        duplicates.  There should also be an
+                                        enivironment variable or VM parameter to
+                                        set this.
+                                </li>
+                        </ul>
+                <!--</section>-->
+                <!--<s2 title="FileSystemIndexer">-->
+                        <p>
+                              <b>FileSystemIndexer</b>
+                        </p>
+                        <p>
+                                This should extend the AbstractIndexer and
+                                support any addtional options required for a
+                                filesystem index.
+                        </p>
+                <!--</s2>-->
+                <!--<s2 title="HTTPIndexer">-->
+                        <p>
+			      <b>HTTP Indexer </b>
+                        </p>
+                        <p>
+                                Supports the AbstractIndexer options as well as:                                
+                        </p>
+                        <ul>
+                                <li>
+                                        span hosts - Wheter to span hosts or not,
+                                        by default this should be no.                                        
+                                </li>
+                                <li>
+                                        restrict domains - (ignored if span
+                                        hosts is not enabled).  Whether all
+                                        spanned hosts must be in the same domain
+                                        (default is off).
+                                </li>
+                                <li>
+                                        try directories - Whether to attempt
+                                        directory listings or not (so if you
+                                        recurse and go to
+                                        /nextcontext/index.html this option says
+                                        to also try /nextcontext to get the dir
+                                        lsiting)
+                                </li>
+                                <li>
+                                        map extensions -
+                                        (always/default/never/fallback).  Wether
+                                        to always use extension mapping, by
+                                        default (fallback to mime type), NEVER
+                                        or fallback if mime is not available
+                                        (default).
+                                </li>
+                                <li>
+                                        ignore robots - ignore robots.txt, on or
+                                        off (default - off)
+                                </li>
+                        </ul>
+        <!--        </s2> -->
+        </section>
+        
+        <section name="MIMEMap">
+                <p>
+                        A configurable registry of document types, their
+                        description, an identifyer, mime-type and file
+                        extension.  This should map both MIME -> factory 
+                        and extension -> factory.
+                </p>
+                <p>
+                        This might be configured at compile time or by a
+                        properties file, etc.  For example:
+                </p>
+                        <table>
+                                <tr>
+                                        <td>Description</td>
+                                        <td>Identifier</td>
+                                        <td>Extensions</td>
+                                        <td>MimeType</td>
+                                        <td>DocumentFactory</td>
+                                </tr>
+                                <tr>
+                                        <td>"Word Document"</td>
+                                        <td>"doc"</td>
+                                        <td>"doc"</td>
+                                        <td>"vnd.application/ms-word"</td>
+                                        <td>POIWordDocumentFactory</td>
+                                </tr>
+                                <tr>
+                                        <td>"HTML Document"</td>
+                                        <td>"html"</td>
+                                        <td>"html,htm"</td>
+                                        <td></td>
+                                        <td>HTMLDocumentFactory</td>
+                                </tr>                                
+                        </table>
+        </section>
+        <section name="DocumentFactory">
+                <p>
+                        An interface for classes which create document objects
+                        for particular file types.  Examples:
+                        HTMLDocumentFactory, DOCDocumentFactory,
+                        XLSDocumentFactory, XML DocumentFactory.
+                </p>
+        </section>
+        <section name="FieldMapping classes">
+                <p>
+                        A class taht maps standard fields from the
+                        DocumentFactories into *fields* in the Document objects
+                        they create.  I suggest that a regular expression system
+                        or xpath might be the most universal way to do this.
+                        For instance if perhaps I had an XML factory that
+                        represented XML elements as fields, I could map content
+                        from particular fields to ther fields or supress them
+                        entirely.  We could even make this configurable.
+                </p>
+                <p>
+                
+                        for example:
+                </p>
+                <ul>
+                        <li>
+                                htmldoc.properties
+                        </li>
+                        <li>
+                        suppress=*
+                        </li>
+                        <li>
+                        author=content:g/author\:\ ........................................./
+                        </li>
+                        <li>
+                        author.suppress=false
+                        </li>
+                        <li>
+                        title=content:g/title\:\ ........................................./
+                        </li>
+                        <li>
+                        title.suppress=false
+                        </li>
+                </ul>
+                <p>                
+                        In this example we map html documents such that all 
+                        fields are suppressed but author and title.  We map 
+                        author and title to anything in the content matching 
+                        author: (and x characters).  Okay my regular expresions 
+                        suck but hopefully you get the idea.
+                </p>
+        </section>
+        <section name="Final Thoughts">
+                <p>
+                        We might also consider eliminating the DocumentFactory 
+                        entirely by making an AbstractDocument from which the 
+                        current document object would inherit from.  I 
+                        experimented with this locally, and it was a relatively 
+                        minor code change and there was of course no difference 
+                        in performance.  The Document Factory classes would 
+                        instead be instances of various subclasses of 
+                        AbstractDocument.
+                </p>
+                <p>
+                        My inspiration for this is HTDig (http://www.htdig.org/).  
+                        While this goes slightly beyond what HTDig provides by 
+                        providing field mapping (where HTDIG is just interested 
+                        in Strings/numbers wherever they are found), it provides 
+                        at least what I would need to use this as a dropin for 
+                        most places I contract at (with the obvious exception of 
+                        a default set of content handlers which would of course 
+                        develop naturally over time).
+                </p>
+                <p>
+                        I am able to certainly contribute to this effort if the 
+                        development community is open to it.  I'd suggest we do 
+                        it iteratively in stages and not aim for all of this at 
+                        once (for instance leave out the field mapping at first).
+                </p>
+                <p>
+                
+                        Anyhow, please give me some feedback, counter 
+                        suggestions, let me know if I'm way off base or out of 
+                        line, etc. -Andy
+                </p>
+        </section>
+                
+  </body>
+</document>
--- a/xdocs/stylesheets/project.xml
+++ b/xdocs/stylesheets/project.xml
@ -23,6 +23,10 @@
        <item name="Contributions"     href="/contributions.html"/>
    </menu>
    
+    <menu name="Plans">
+        <item name="Application Extensions"           href="/luceneplan.html"/>
+    </menu>
+
    <menu name="Download">
        <item name="Binaries"           href="/site/binindex.html"/>
        <item name="Source Code"        href="/site/sourceindex.html"/>
--- a/xdocs/whoweare.xml
+++ b/xdocs/whoweare.xml
@ -38,6 +38,7 @@ the <a href="http://jakarta.apache.org/site/mail.html">Jakarta-Lucene mailing li
 <li><b>Dave Kor</b> (davekor at apache.org)</li>
 <li><b>Jon Stevens</b> (jon at latchkey.com)</li>
 <li><b>Tal Dayan</b> (zapta at apache.org)</li>
+<li><a href="http://www.trilug.org/~acoliver">Andrew C. Oliver</a> (acoliver at apache dot org)</li>
 </ul>
 </section>