added application extensions plan and myself as a comitter

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149700 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Andrew C. Oliver 2002-02-23 22:02:55 +00:00
parent 84bd5a1565
commit 04ebb07469
3 changed files with 346 additions and 0 deletions

341
xdocs/luceneplan.xml Normal file
View File

@ -0,0 +1,341 @@
<?xml version="1.0" encoding="UTF-8"?>
<document>
<properties>
<title>Plan for enhancements to Lucene</title>
<authors>
<person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/>
</authors>
</properties>
<body>
<section name="Purpose">
<p>
The purpose of this document is to outline plans for
making <a href="http://jakarta.apache.org/lucene">
Jakarta Lucene</a> work as a more general drop-in
component. It makes the assumption that this is an
objective for the Lucene user and development community.
</p>
<p>
The best reference is <a href="http://www.htdig.org">
htDig</a>, though it is not quite as sophisticated as
Lucene, it has a number of features that make it
desireable. It however is a traditional c-compiled app
which makes it somewhat unpleasent to install on some
platforms (like Solaris!).
</p>
<p>
This plan is being submitted to the Lucene developer
community for an initial reaction, advice, feedback and
consent. Following this it will be submitted to the
Lucene user community for support. Although, I'm (Andy
Oliver) capable of providing these enhancements by
myself, I'd of course prefer to work on them in concert
with others.
</p>
<p>
While I'm outlaying a fairly large featureset, these can
be implemented incrementally of course (and are probably
best if done that way).
</p>
</section>
<section name="Goal and Objectives">
<p>
The goal is to provide features to Lucene that allow it
to be used as a dropin search engine. It should provide
many of the features of projects like <a
href="http://www.htdig.org">htDig</a> while surpassing
them with unique Lucene features and capabillities such as
easy installation on and java-supporting platform,
and support for document fields and field searches. And
of course, <a href="http://apache.org/LICENSE">
a pragmatic software license</a>.
</p>
<p>
To reach this goal we'll implement code to support the
following objectives that augment but do not replace
the current Lucene featureset.
</p>
<ul>
<li>
Document Location Independance - meaning mapping
real contexts to runtime contexts.
Essentially, if the document is at
/var/www/htdocs/mydoc.html, I probably want it
indexed as
http://www.bigevilmegacorp.com/mydoc.html.
</li>
<li>
Standard methods of creating central indicies -
file system indexing is probably less useful in
many environments than is *remote* indexing (for
instance http). I would suggest that most folks
would prefer that general functionality be
suppored by Lucene instead of having to write
code for every indexing project. Obviously, if
what they are doing is *special* they'll have to
code, but general document indexing accross
webservers would not qualify.
</li>
<li>
Document interperatation abstraction - currently
one must handle document object construction via
custom code. A standard interface for plugging
in format handlers should be supported.
</li>
<li>
Mime and file-extension to document
interperatation mapping.
</li>
</ul>
</section>
<section name="Indexers">
<p>
Indexers are standard crawlers. They go crawl a file
system, ftp site, web site, etc. to create the index.
These standard indexers may not make ALL of Lucene's
functionality available, though they should be able to
make most of it available through configuration.
</p>
<!--<section name="AbstractIndexer">-->
<p>
<b> Abstract Indexer </b>
</p>
<p>
The Abstract indexer is basically the parent for all
Indexer classes. It provides implementation for the
following functions/properties:
</p>
<ul>
<li>
index path - where to write the index.
</li>
<li>
cui - create or update the index
</li>
<li>
root context - the start of the pathname
that should be replaced by the
replace with property or dropped
entirely. Example: /opt/tomcat/webapps
</li>
<li>
replace with - when specified replaces
the root context. Example:
http://jakarta.apache.org.
</li>
<li>
replacement type - the type of
replacewith path: relative, url or
path.
</li>
<li>
location - the location to start
indexing at.
</li>
<li>
doctypes - only index documents with
these doctypes. If not specified all
registered mime-types are used.
Example: "xml,doc,html"
</li>
<li>
recursive - if not specified is turned
off.
</li>
<li>
level - optional level of directory or
links to traverse. By default is
assumed to be infinite. Recursive must
be turned on or this is ignored. Range:
0 - Long.MAX_VALUE.
</li>
<li>
properties - in addition to the settings
(probably from the command line) read
this properties file and get them from
it. Command line options override
the properties file in the case of
duplicates. There should also be an
enivironment variable or VM parameter to
set this.
</li>
</ul>
<!--</section>-->
<!--<s2 title="FileSystemIndexer">-->
<p>
<b>FileSystemIndexer</b>
</p>
<p>
This should extend the AbstractIndexer and
support any addtional options required for a
filesystem index.
</p>
<!--</s2>-->
<!--<s2 title="HTTPIndexer">-->
<p>
<b>HTTP Indexer </b>
</p>
<p>
Supports the AbstractIndexer options as well as:
</p>
<ul>
<li>
span hosts - Wheter to span hosts or not,
by default this should be no.
</li>
<li>
restrict domains - (ignored if span
hosts is not enabled). Whether all
spanned hosts must be in the same domain
(default is off).
</li>
<li>
try directories - Whether to attempt
directory listings or not (so if you
recurse and go to
/nextcontext/index.html this option says
to also try /nextcontext to get the dir
lsiting)
</li>
<li>
map extensions -
(always/default/never/fallback). Wether
to always use extension mapping, by
default (fallback to mime type), NEVER
or fallback if mime is not available
(default).
</li>
<li>
ignore robots - ignore robots.txt, on or
off (default - off)
</li>
</ul>
<!-- </s2> -->
</section>
<section name="MIMEMap">
<p>
A configurable registry of document types, their
description, an identifyer, mime-type and file
extension. This should map both MIME -> factory
and extension -> factory.
</p>
<p>
This might be configured at compile time or by a
properties file, etc. For example:
</p>
<table>
<tr>
<td>Description</td>
<td>Identifier</td>
<td>Extensions</td>
<td>MimeType</td>
<td>DocumentFactory</td>
</tr>
<tr>
<td>"Word Document"</td>
<td>"doc"</td>
<td>"doc"</td>
<td>"vnd.application/ms-word"</td>
<td>POIWordDocumentFactory</td>
</tr>
<tr>
<td>"HTML Document"</td>
<td>"html"</td>
<td>"html,htm"</td>
<td></td>
<td>HTMLDocumentFactory</td>
</tr>
</table>
</section>
<section name="DocumentFactory">
<p>
An interface for classes which create document objects
for particular file types. Examples:
HTMLDocumentFactory, DOCDocumentFactory,
XLSDocumentFactory, XML DocumentFactory.
</p>
</section>
<section name="FieldMapping classes">
<p>
A class taht maps standard fields from the
DocumentFactories into *fields* in the Document objects
they create. I suggest that a regular expression system
or xpath might be the most universal way to do this.
For instance if perhaps I had an XML factory that
represented XML elements as fields, I could map content
from particular fields to ther fields or supress them
entirely. We could even make this configurable.
</p>
<p>
for example:
</p>
<ul>
<li>
htmldoc.properties
</li>
<li>
suppress=*
</li>
<li>
author=content:g/author\:\ ........................................./
</li>
<li>
author.suppress=false
</li>
<li>
title=content:g/title\:\ ........................................./
</li>
<li>
title.suppress=false
</li>
</ul>
<p>
In this example we map html documents such that all
fields are suppressed but author and title. We map
author and title to anything in the content matching
author: (and x characters). Okay my regular expresions
suck but hopefully you get the idea.
</p>
</section>
<section name="Final Thoughts">
<p>
We might also consider eliminating the DocumentFactory
entirely by making an AbstractDocument from which the
current document object would inherit from. I
experimented with this locally, and it was a relatively
minor code change and there was of course no difference
in performance. The Document Factory classes would
instead be instances of various subclasses of
AbstractDocument.
</p>
<p>
My inspiration for this is HTDig (http://www.htdig.org/).
While this goes slightly beyond what HTDig provides by
providing field mapping (where HTDIG is just interested
in Strings/numbers wherever they are found), it provides
at least what I would need to use this as a dropin for
most places I contract at (with the obvious exception of
a default set of content handlers which would of course
develop naturally over time).
</p>
<p>
I am able to certainly contribute to this effort if the
development community is open to it. I'd suggest we do
it iteratively in stages and not aim for all of this at
once (for instance leave out the field mapping at first).
</p>
<p>
Anyhow, please give me some feedback, counter
suggestions, let me know if I'm way off base or out of
line, etc. -Andy
</p>
</section>
</body>
</document>

View File

@ -23,6 +23,10 @@
<item name="Contributions" href="/contributions.html"/> <item name="Contributions" href="/contributions.html"/>
</menu> </menu>
<menu name="Plans">
<item name="Application Extensions" href="/luceneplan.html"/>
</menu>
<menu name="Download"> <menu name="Download">
<item name="Binaries" href="/site/binindex.html"/> <item name="Binaries" href="/site/binindex.html"/>
<item name="Source Code" href="/site/sourceindex.html"/> <item name="Source Code" href="/site/sourceindex.html"/>

View File

@ -38,6 +38,7 @@ the <a href="http://jakarta.apache.org/site/mail.html">Jakarta-Lucene mailing li
<li><b>Dave Kor</b> (davekor at apache.org)</li> <li><b>Dave Kor</b> (davekor at apache.org)</li>
<li><b>Jon Stevens</b> (jon at latchkey.com)</li> <li><b>Jon Stevens</b> (jon at latchkey.com)</li>
<li><b>Tal Dayan</b> (zapta at apache.org)</li> <li><b>Tal Dayan</b> (zapta at apache.org)</li>
<li><a href="http://www.trilug.org/~acoliver">Andrew C. Oliver</a> (acoliver at apache dot org)</li>
</ul> </ul>
</section> </section>