Plan for enhancements to Lucene

The purpose of this document is to outline plans for making Jakarta Lucene work as a more general drop-in component. It makes the assumption that this is an objective for the Lucene user and development community.

The best reference is htDig. Though it is not quite as sophisticated as Lucene, it has a number of features that make it desirable. It is, however, a traditional compiled C application, which makes it somewhat unpleasant to install on some platforms (like Solaris!).

This plan is being submitted to the Lucene developer community for an initial reaction, advice, feedback and consent. Following this, it will be submitted to the Lucene user community for support. Although I (Andy Oliver) am capable of providing these enhancements by myself, I'd of course prefer to work on them in concert with others.

While I'm laying out a fairly large feature set, these features can of course be implemented incrementally (and are probably best done that way).

The goal is to provide features to Lucene that allow it to be used as a drop-in search engine. It should provide many of the features of projects like htDig while surpassing them with unique Lucene features and capabilities, such as easy installation on any Java-supporting platform and support for document fields and field searches. And, of course, a pragmatic software license.

To reach this goal we'll implement code to support the following objectives that augment but do not replace the current Lucene feature set.

Crawlers are the executable code for a data source. They crawl a file system, FTP site, web site, etc. to create the index. These standard crawlers may not make ALL of Lucene's functionality available, though they should be able to make most of it available through configuration.

Abstract Crawler

The AbstractCrawler is basically the parent for all Crawler classes. It provides implementation for the following functions/properties:

FileSystemCrawler

This should extend the AbstractCrawler and support any additional options required for a file system index.
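
As a rough illustration of this parent/child arrangement, here is a minimal Java sketch. Everything in it is an assumption made for the example: the ResourceHandler callback, the discover and crawl method names, and the choice to hand back file paths as strings are all hypothetical, and a real crawler would build and index Lucene Document objects rather than just report locations.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical callback: a real version would build and index a Lucene Document here.
interface ResourceHandler {
    void handle(String location);
}

// Hypothetical base class; the method names are illustrative only.
abstract class AbstractCrawler {
    // Subclasses list every resource reachable from the given root.
    abstract List<String> discover(String root) throws IOException;

    // Shared behaviour: pass each discovered resource to the handler, return the count.
    final int crawl(String root, ResourceHandler handler) throws IOException {
        int count = 0;
        for (String location : discover(root)) {
            handler.handle(location);
            count++;
        }
        return count;
    }
}

class FileSystemCrawler extends AbstractCrawler {
    @Override
    List<String> discover(String root) throws IOException {
        List<String> found = new ArrayList<>();
        // Files.walk descends into subdirectories; the stream must be closed.
        try (Stream<Path> paths = Files.walk(Paths.get(root))) {
            paths.filter(Files::isRegularFile)
                 .map(Path::toString)
                 .sorted()
                 .forEach(found::add);
        }
        return found;
    }
}
```

With this shape, something like new FileSystemCrawler().crawl("/var/docs", handler) would visit every regular file under the root, and an HTTP crawler would only need to supply its own discover method.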

HTTP Crawler

Supports the AbstractCrawler options as well as:

A configurable registry of document types, each with a description, an identifier, a MIME type and one or more file extensions. This should map both MIME type -> factory and extension -> factory.
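
The two lookups could be sketched in Java roughly as follows. This is only a sketch under assumptions: the class name DocumentTypeRegistry, its method names, and the comma-separated extension string are all hypothetical, and the factory type is left as a generic parameter F because the factory interface itself is not pinned down here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry; F is whatever type ends up representing a document factory.
class DocumentTypeRegistry<F> {
    private final Map<String, F> byMimeType = new HashMap<>();
    private final Map<String, F> byExtension = new HashMap<>();
    private final Map<String, String> descriptions = new HashMap<>();

    // Registers one document type. mimeType may be null when none is known;
    // extensions is a comma-separated list such as "html,htm".
    void register(String description, String identifier, String extensions,
                  String mimeType, F factory) {
        descriptions.put(identifier, description);
        if (mimeType != null) {
            byMimeType.put(normalize(mimeType), factory);
        }
        for (String ext : extensions.split(",")) {
            byExtension.put(normalize(ext), factory);
        }
    }

    // MIME type -> factory, or null if the type is not registered.
    F forMimeType(String mimeType) {
        return byMimeType.get(normalize(mimeType));
    }

    // Extension -> factory, using the text after the last dot in the file name.
    F forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return byExtension.get(normalize(fileName.substring(dot + 1)));
    }

    String describe(String identifier) {
        return descriptions.get(identifier);
    }

    // Lookups are case-insensitive so "HTM" and "htm" resolve to the same factory.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

Keeping two separate maps means an HTTP crawler can resolve a factory from a Content-Type header while a file system crawler falls back to the file extension.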

This might be configured at compile time or by a properties file, etc. For example:

Description      Identifier  Extensions  MimeType                   DocumentFactory
"Word Document"  "doc"       "doc"       "vnd.application/ms-word"  POIWordDocumentFactory
"HTML Document"  "html"      "html,htm"                             HTMLDocumentFactory

An interface for classes which create document objects for particular file types. Examples: HTMLDocumentFactory, DOCDocumentFactory, XLSDocumentFactory, XMLDocumentFactory.
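
One possible shape for such an interface, sketched in Java. The method name createDocument and its parameters are assumptions, PlainTextDocumentFactory is a made-up example implementation, and a plain Map of field name to value stands in for Lucene's Document class so that the sketch has no dependencies.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface; Map<String, String> is a stand-in for a Lucene Document.
interface DocumentFactory {
    // Reads the raw content and returns the fields to be indexed.
    Map<String, String> createDocument(String location, Reader content) throws IOException;
}

// Hypothetical simplest case: the whole file becomes a single "contents" field.
class PlainTextDocumentFactory implements DocumentFactory {
    @Override
    public Map<String, String> createDocument(String location, Reader content)
            throws IOException {
        StringBuilder body = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        // Drain the reader in chunks until end of stream.
        while ((read = content.read(buffer)) != -1) {
            body.append(buffer, 0, read);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("location", location);
        fields.put("contents", body.toString());
        return fields;
    }
}
```

An HTML or Word factory would implement the same method but parse the format first and emit more fields (title, author, and so on).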

A class that maps standard fields from the DocumentFactories into *fields* in the Document objects they create. I suggest that a regular expression system or XPath might be the most universal way to do this. For instance, if I had an XML factory that represented XML elements as fields, I could map content from particular source fields to fields in the Document, or suppress them entirely. We could even make this configurable.

for example:

In this example we map HTML documents such that all fields are suppressed except author and title. We map author and title to anything in the content matching author: (and x characters). Okay, my regular expressions suck, but hopefully you get the idea.
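
A minimal Java sketch of the regular-expression flavour of this idea, under stated assumptions: FieldMapper and its methods are hypothetical names, the patterns are illustrative only, each pattern is assumed to contain one capturing group that supplies the field value, and a Map of field name to value again stands in for a Lucene Document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mapper: extracts named fields from one source field with regexes
// and suppresses every field that has no mapping rule.
class FieldMapper {
    private final String sourceField;
    // Target field name -> pattern whose first capturing group is the value.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    FieldMapper(String sourceField) {
        this.sourceField = sourceField;
    }

    // Adds a rule; returns this so rules can be chained.
    FieldMapper map(String targetField, String regex) {
        rules.put(targetField, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        return this;
    }

    // Returns only the mapped fields; unmatched rules produce no field at all.
    Map<String, String> apply(Map<String, String> rawFields) {
        Map<String, String> mapped = new LinkedHashMap<>();
        String source = rawFields.get(sourceField);
        if (source == null) {
            return mapped;
        }
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher matcher = rule.getValue().matcher(source);
            if (matcher.find()) {
                mapped.put(rule.getKey(), matcher.group(1).trim());
            }
        }
        return mapped;
    }
}
```

For instance, new FieldMapper("contents").map("author", "author:\\s*([^\\n]+)").map("title", "title:\\s*([^\\n]+)") would keep only an author and a title field, each taken from the rest of the line following the matching label. The same rules could just as well be loaded from a properties file to make the mapping configurable.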

We might also consider eliminating the DocumentFactory entirely by making an AbstractDocument from which the current Document object would inherit. I experimented with this locally; it was a relatively minor code change and there was, of course, no difference in performance. The DocumentFactory classes would instead become various subclasses of AbstractDocument.

My inspiration for this is htDig (http://www.htdig.org/). This goes slightly beyond what htDig provides by adding field mapping (where htDig is just interested in strings/numbers wherever they are found), and it provides at least what I would need to use this as a drop-in for most places where I contract (with the obvious exception of a default set of content handlers, which would of course develop naturally over time).

I am certainly able to contribute to this effort if the development community is open to it. I'd suggest we do it iteratively in stages and not aim for all of this at once (for instance, leave out the field mapping at first).

Anyhow, please give me some feedback or counter-suggestions, and let me know if I'm way off base or out of line. -Andy