lucene/xdocs/luceneplan.xml

<?xml version="1.0" encoding="UTF-8"?>

<document>
  <properties>
   <title>Plan for enhancements to Lucene</title>
   <authors>
    <person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/>
   </authors>
  </properties>
  <body>

        <section name="Purpose">
                <p>
                        The purpose of this document is to outline plans for
                        making <a href="http://jakarta.apache.org/lucene">
                        Jakarta Lucene</a> work as a more general drop-in
                        component.  It makes the assumption that this is an
                        objective for the Lucene user and development community.
                </p>
                <p>
                        The best reference is <a href="http://www.htdig.org">
                        htDig</a>.  Though it is not quite as sophisticated as
                        Lucene, it has a number of features that make it
                        desirable.  It is, however, a traditional compiled C
                        application, which makes it somewhat unpleasant to
                        install on some platforms (like Solaris!).
                </p>
                <p>
                        This plan is being submitted to the Lucene developer
                        community for an initial reaction, advice, feedback and
                        consent.  Following this it will be submitted to the
                        Lucene user community for support.  Although I (Andy
                        Oliver) am capable of providing these enhancements by
                        myself, I'd of course prefer to work on them in concert
                        with others.
                </p>
                <p>
                        While I'm laying out a fairly large feature set, these
                        features can of course be implemented incrementally
                        (and are probably best done that way).
                </p>
        </section>

        <section name="Goal and Objectives">
                <p>
                        The goal is to provide features to Lucene that allow it
                        to be used as a drop-in search engine.  It should provide
                        many of the features of projects like <a
                        href="http://www.htdig.org">htDig</a> while surpassing
                        them with unique Lucene features and capabilities such as
                        easy installation on any Java-supporting platform,
                        and support for document fields and field searches.  And
                        of course, <a href="http://apache.org/LICENSE">
                        a pragmatic software license</a>.
                </p>
                <p>
                        To reach this goal we'll implement code to support the
                        following objectives that augment but do not replace
                        the current Lucene featureset.
                </p>
                <ul>
                        <li>
                                Document Location Independence - meaning mapping
                                real contexts to runtime contexts.
                                Essentially, if the document is at
                                /var/www/htdocs/mydoc.html, I probably want it
                                indexed as
                                http://www.bigevilmegacorp.com/mydoc.html.
                        </li>
                        <li>
                                Standard methods of creating central indices -
                                file system indexing is probably less useful in
                                many environments than *remote* indexing (for
                                instance over HTTP).  I would suggest that most
                                folks would prefer that general functionality be
                                supported by Lucene instead of having to write
                                code for every indexing project.  Obviously, if
                                what they are doing is *special* they'll have to
                                code, but general document indexing across
                                web servers would not qualify.
                        </li>
                        <li>
                                Document interpretation abstraction - currently
                                one must handle document object construction via
                                custom code.  A standard interface for plugging
                                in format handlers should be supported.
                        </li>
                        <li>
                                MIME type and file-extension to document
                                interpretation mapping.
                        </li>
                </ul>
        </section>
        <section name="Crawlers">
                <p>
                        Crawlers are the executable code for a data source.  They
                        crawl a file system, FTP site, web site, etc. to create
                        the index.  These standard crawlers may not make ALL of
                        Lucene's functionality available, though they should be
                        able to make most of it available through configuration.
                </p>
                <!--<section name="AbstractIndexer">-->
                <p>
                        <b> Abstract Crawler </b>
                </p>
                        <p>
                                The AbstractCrawler is basically the parent for all
                                Crawler classes.  It provides implementation for the
                                following functions/properties:
                        </p>
                        <ul>
                                <li>
                                        index path - where to write the index.
                                </li>
                                <li>
                                        cui - create or update the index
                                </li>
                                <li>
                                        root context - the start of the pathname
                                        that should be replaced by the
                                        replace with property or dropped
                                        entirely.  Example: /opt/tomcat/webapps
                                </li>
                                <li>
                                        replace with - when specified replaces
                                        the root context.  Example:
                                        http://jakarta.apache.org.
                                </li>
                                <li>
                                        replacement type - the type of
                                        replace with path:  relative, url or
                                        path.
                                </li>
                                <li>
                                        location - the location to start
                                        indexing at.
                                </li>
                                <li>
                                        doctypes - only index documents with
                                        these doctypes.  If not specified all
                                        registered mime-types are used.
                                        Example: "xml,doc,html"
                                </li>
                                <li>
                                        recursive - if not specified is turned
                                        off.
                                </li>
                                <li>
                                        level - optional level of directory or
                                        links to traverse.  By default is
                                        assumed to be infinite.  Recursive must
                                        be turned on or this is ignored.  Range:
                                        0 - Long.MAX_VALUE.
                                </li>
                                <li>
                                        SleeptimeBetweenCalls - can be used to
                                        avoid flooding a machine with too many
                                        requests
                                </li>
                                <li>
                                        RequestTimeout - kill the crawler
                                        request after the specified period of
                                        inactivity.
                                </li>
                                <li>
                                        IncludeFilter - include only items
                                        matching filter.  (can occur multiple
                                        times)
                                </li>
                                <li>
                                        ExcludeFilter - exclude items
                                        matching filter.  (can occur multiple
                                        times)
                                </li>
                                <li>
                                        ExpandOnly - use but do not index items
                                        that match this pattern (regex?) (can
                                        occur multiple times)
                                </li>
                                <li>
                                        NoExpand - Index but do not follow the
                                        links in items that match this pattern
                                        (regex?) (can occur multiple times)
                                </li>
                                <li>
                                        MaxItems - stops indexing after x
                                        documents have been indexed.
                                </li>
                                <li>
                                        MaxMegs - stops indexing after x
                                        megabytes have been indexed.  (should
                                        this be in specific crawlers?)
                                </li>
                                <li>
                                        properties - in addition to the settings
                                        given directly (probably on the command
                                        line), read further settings from this
                                        properties file.  Command line options
                                        override the properties file in the case
                                        of duplicates.  There should also be an
                                        environment variable or VM parameter to
                                        set this.
                                </li>
                        </ul>
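                        <p>
                                To make the root context and replace with
                                properties a little more concrete, here is a
                                minimal sketch of the substitution they imply.
                                The ContextMapper class and its map method are
                                hypothetical names, not an existing Lucene API,
                                and the sketch assumes plain prefix replacement
                                and ignores the replacement type property.
                        </p>
<source><![CDATA[
```java
/** Hypothetical helper: maps a real location to the name it is indexed under. */
class ContextMapper {
    private final String rootContext;  // e.g. /opt/tomcat/webapps
    private final String replaceWith;  // e.g. http://jakarta.apache.org; empty drops the root

    ContextMapper(String rootContext, String replaceWith) {
        this.rootContext = stripTrailingSlash(rootContext);
        this.replaceWith = replaceWith == null ? "" : stripTrailingSlash(replaceWith);
    }

    /** Replaces the root context prefix; paths outside the root are returned unchanged. */
    String map(String realPath) {
        boolean underRoot = realPath.equals(rootContext)
                || realPath.startsWith(rootContext + "/");
        if (!underRoot) {
            return realPath;
        }
        return replaceWith + realPath.substring(rootContext.length());
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }
}
```
]]></source>
                        <p>
                                With a root context of /var/www/htdocs and a
                                replace with value of
                                http://www.bigevilmegacorp.com, the document
                                /var/www/htdocs/mydoc.html would be indexed as
                                http://www.bigevilmegacorp.com/mydoc.html.
                        </p>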
                <!--</section>-->
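                        <p>
                                The override rule for the properties option
                                (command line wins over the properties file)
                                could be handled with the defaults mechanism of
                                java.util.Properties.  This is only a sketch:
                                the CrawlerSettings class, the VM parameter name
                                and the environment variable name are all
                                assumptions made for illustration.
                        </p>
<source><![CDATA[
```java
import java.util.Properties;

/** Hypothetical holder for the crawler settings described above. */
class CrawlerSettings {

    /** Layers command line options over the values read from a properties file. */
    static Properties merge(Properties fromFile, Properties fromCommandLine) {
        // Values from the file act as defaults and are consulted only
        // when a key was not given on the command line.
        Properties merged = new Properties(fromFile);
        for (String key : fromCommandLine.stringPropertyNames()) {
            merged.setProperty(key, fromCommandLine.getProperty(key));
        }
        return merged;
    }

    /** Finds the properties file: VM parameter first, then environment variable. */
    static String propertiesPath() {
        String path = System.getProperty("lucene.crawler.properties");  // assumed name
        return path != null ? path : System.getenv("LUCENE_CRAWLER_PROPERTIES");  // assumed name
    }
}
```
]]></source>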
                <!--<s2 title="FileSystemIndexer">-->
                        <p>
                              <b>FileSystemCrawler</b>
                        </p>
                        <p>
                                This should extend the AbstractCrawler and
                                support any additional options required for a
                                filesystem index.
                        </p>
                <!--</s2>-->
                <!--<s2 title="HTTPIndexer">-->
                        <p>
			      <b>HTTP Crawler </b>
                        </p>
                        <p>
                                Supports the AbstractCrawler options as well as:
                        </p>
                        <ul>
                                <li>
                                        span hosts - Whether to span hosts or
                                        not.  By default this should be no.
                                </li>
                                <li>
                                        restrict domains - (ignored if span
                                        hosts is not enabled).  Whether all
                                        spanned hosts must be in the same domain
                                        (default is off).
                                </li>
                                <li>
                                        try directories - Whether to attempt
                                        directory listings or not (so if you
                                        recurse and go to
                                        /nextcontext/index.html this option says
                                        to also try /nextcontext to get the
                                        directory listing)
                                </li>
                                <li>
                                        map extensions -
                                        (always/default/never/fallback).  Whether
                                        to always use extension mapping, to use
                                        it by default (falling back to the MIME
                                        type), to never use it, or to fall back
                                        to it only if the MIME type is not
                                        available (fallback is the default
                                        setting).
                                </li>
                                <li>
                                        ignore robots - ignore robots.txt, on or
                                        off (default - off)
                                </li>
                        </ul>
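                        <p>
                                How span hosts and restrict domains interact can
                                be sketched as a single check on each link the
                                crawler finds.  The HostPolicy class is a
                                hypothetical name, and treating the last two
                                labels of a host name as its domain is a
                                simplifying assumption that is wrong for
                                suffixes such as co.uk.
                        </p>
<source><![CDATA[
```java
import java.net.URI;

/** Hypothetical check for whether the HTTP crawler may follow a link. */
class HostPolicy {
    private final boolean spanHosts;        // default in the plan: off
    private final boolean restrictDomains;  // ignored unless spanHosts is on

    HostPolicy(boolean spanHosts, boolean restrictDomains) {
        this.spanHosts = spanHosts;
        this.restrictDomains = restrictDomains;
    }

    boolean allows(URI start, URI link) {
        String startHost = start.getHost();
        String linkHost = link.getHost();
        if (startHost == null || linkHost == null) {
            return false;  // not an absolute URL with a host
        }
        if (startHost.equalsIgnoreCase(linkHost)) {
            return true;   // staying on the same host is always fine
        }
        if (!spanHosts) {
            return false;
        }
        if (!restrictDomains) {
            return true;
        }
        return domainOf(startHost).equalsIgnoreCase(domainOf(linkHost));
    }

    /** Naive assumption: the domain is the last two labels of the host name. */
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }
}
```
]]></source>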
        <!--        </s2> -->
        </section>

        <section name="MIMEMap">
                <p>
                        A configurable registry of document types, with their
                        description, an identifier, MIME type and file
                        extension.  This should map both MIME -> factory
                        and extension -> factory.
                </p>
                <p>
                        This might be configured at compile time or by a
                        properties file, etc.  For example:
                </p>
                        <table>
                                <tr>
                                        <td>Description</td>
                                        <td>Identifier</td>
                                        <td>Extensions</td>
                                        <td>MimeType</td>
                                        <td>DocumentFactory</td>
                                </tr>
                                <tr>
                                        <td>"Word Document"</td>
                                        <td>"doc"</td>
                                        <td>"doc"</td>
                                        <td>"application/msword"</td>
                                        <td>POIWordDocumentFactory</td>
                                </tr>
                                <tr>
                                        <td>"HTML Document"</td>
                                        <td>"html"</td>
                                        <td>"html,htm"</td>
                                        <td>"text/html"</td>
                                        <td>HTMLDocumentFactory</td>
                                </tr>
                        </table>
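                <p>
                        A registry along these lines could be as small as two
                        maps.  The following is a sketch only: the MimeMap
                        class and its method names are hypothetical, factories
                        are referred to by name to keep the example
                        self-contained, and the MIME type strings used with it
                        are assumptions.
                </p>
<source><![CDATA[
```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry mapping MIME types and extensions to factory names. */
class MimeMap {
    private final Map<String, String> byMimeType = new HashMap<>();
    private final Map<String, String> byExtension = new HashMap<>();

    void register(String factoryName, String mimeType, String... extensions) {
        if (mimeType != null && !mimeType.isEmpty()) {
            byMimeType.put(lower(mimeType), factoryName);
        }
        for (String extension : extensions) {
            byExtension.put(lower(extension), factoryName);
        }
    }

    String forMimeType(String mimeType) {
        return mimeType == null ? null : byMimeType.get(lower(mimeType));
    }

    String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? null : byExtension.get(lower(fileName.substring(dot + 1)));
    }

    /** Prefers the MIME type; uses the extension only when that finds nothing. */
    String resolve(String mimeType, String fileName) {
        String factory = forMimeType(mimeType);
        return factory != null ? factory : forFileName(fileName);
    }

    private static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```
]]></source>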
        </section>
        <section name="DocumentFactory">
                <p>
                        An interface for classes which create document objects
                        for particular file types.  Examples:
                        HTMLDocumentFactory, DOCDocumentFactory,
                        XLSDocumentFactory, XMLDocumentFactory.
                </p>
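                <p>
                        One possible shape for such an interface is sketched
                        below.  Everything here is an assumption: the method
                        signature is hypothetical, PlainTextDocumentFactory is
                        an invented example handler, and a plain Map of field
                        names to values stands in for Lucene's Document class
                        so that the sketch has no dependencies.
                </p>
<source><![CDATA[
```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical interface; the Map stands in for a Lucene Document. */
interface DocumentFactory {
    Map<String, String> createDocument(String location, Reader content);
}

/** Hypothetical handler for plain text: the whole body becomes one field. */
class PlainTextDocumentFactory implements DocumentFactory {
    @Override
    public Map<String, String> createDocument(String location, Reader content) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int read;
            while ((read = content.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the interface free of checked exceptions.
            throw new UncheckedIOException(e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("location", location);
        fields.put("content", text.toString());
        return fields;
    }
}
```
]]></source>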
        </section>
        <section name="FieldMapping classes">
                <p>
                        A class that maps standard fields from the
                        DocumentFactories into *fields* in the Document objects
                        they create.  I suggest that a regular expression system
                        or XPath might be the most universal way to do this.
                        For instance, if I had an XML factory that
                        represented XML elements as fields, I could map content
                        from particular fields to other fields or suppress them
                        entirely.  We could even make this configurable.
                </p>
                <p>
                        For example:
                </p>
                <ul>
                        <li>
                                htmldoc.properties
                        </li>
                        <li>
                        suppress=*
                        </li>
                        <li>
                        author=content:g/author\:\ ........................................./
                        </li>
                        <li>
                        author.suppress=false
                        </li>
                        <li>
                        title=content:g/title\:\ ........................................./
                        </li>
                        <li>
                        title.suppress=false
                        </li>
                </ul>
                <p>
                        In this example we map HTML documents such that all
                        fields are suppressed except author and title.  We map
                        author and title to anything in the content matching
                        author: or title: respectively (and x characters).
                        Okay, my regular expressions suck, but hopefully you get
                        the idea.
                </p>
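                <p>
                        The same idea can be sketched in Java with
                        java.util.regex.  The FieldMapper class is a
                        hypothetical name, fields are again represented as a
                        plain Map rather than a Lucene Document, and the sketch
                        assumes that every field without a rule is suppressed,
                        which corresponds to the suppress=* line above.
                </p>
<source><![CDATA[
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical mapper: keeps only fields that a rule extracts. */
class FieldMapper {
    private static class Rule {
        final String sourceField;
        final Pattern pattern;

        Rule(String sourceField, String regex) {
            this.sourceField = sourceField;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final Map<String, Rule> rules = new LinkedHashMap<>();

    /** Fills targetField from the first match of regex in sourceField. */
    FieldMapper map(String targetField, String sourceField, String regex) {
        rules.put(targetField, new Rule(sourceField, regex));
        return this;
    }

    Map<String, String> apply(Map<String, String> fields) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> entry : rules.entrySet()) {
            Rule rule = entry.getValue();
            String source = fields.get(rule.sourceField);
            if (source == null) {
                continue;
            }
            Matcher matcher = rule.pattern.matcher(source);
            if (matcher.find()) {
                // Use the first capturing group if the pattern has one,
                // otherwise the whole match.
                String value = matcher.groupCount() >= 1 ? matcher.group(1) : matcher.group();
                if (value != null) {
                    mapped.put(entry.getKey(), value.trim());
                }
            }
        }
        return mapped;
    }
}
```
]]></source>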
        </section>
        <section name="Final Thoughts">
                <p>
                        We might also consider eliminating the DocumentFactory
                        entirely by making an AbstractDocument from which the
                        current document object would inherit.  I
                        experimented with this locally, and it was a relatively
                        minor code change and there was of course no difference
                        in performance.  The DocumentFactory classes would
                        instead become various subclasses of
                        AbstractDocument.
                </p>
                <p>
                        My inspiration for this is htDig (http://www.htdig.org/).
                        While this goes slightly beyond what htDig provides by
                        providing field mapping (where htDig is just interested
                        in strings/numbers wherever they are found), it provides
                        at least what I would need to use this as a drop-in for
                        most places I contract at (with the obvious exception of
                        a default set of content handlers, which would of course
                        develop naturally over time).
                </p>
                <p>
                        I am certainly able to contribute to this effort if the
                        development community is open to it.  I'd suggest we do
                        it iteratively in stages and not aim for all of this at
                        once (for instance leave out the field mapping at first).
                </p>
                <p>
                        Anyhow, please give me some feedback or
                        counter-suggestions, let me know if I'm way off base or
                        out of line, etc. -Andy
                </p>
        </section>

  </body>
</document>