mirror of https://github.com/apache/lucene.git
added application extensions plan and myself as a comitter
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149700 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
84bd5a1565
commit
04ebb07469
|
@ -0,0 +1,341 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
|
||||||
|
<document>
|
||||||
|
<properties>
|
||||||
|
<title>Plan for enhancements to Lucene</title>
|
||||||
|
<authors>
|
||||||
|
<person email="acoliver@apache.org" name="Andrew C. Oliver" id="AO"/>
|
||||||
|
</authors>
|
||||||
|
</properties>
|
||||||
|
<body>
|
||||||
|
|
||||||
|
<section name="Purpose">
|
||||||
|
<p>
|
||||||
|
The purpose of this document is to outline plans for
|
||||||
|
making <a href="http://jakarta.apache.org/lucene">
|
||||||
|
Jakarta Lucene</a> work as a more general drop-in
|
||||||
|
component. It makes the assumption that this is an
|
||||||
|
objective for the Lucene user and development community.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
The best reference is <a href="http://www.htdig.org">
|
||||||
|
htDig</a>, though it is not quite as sophisticated as
|
||||||
|
Lucene, it has a number of features that make it
|
||||||
|
desireable. It however is a traditional c-compiled app
|
||||||
|
which makes it somewhat unpleasent to install on some
|
||||||
|
platforms (like Solaris!).
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This plan is being submitted to the Lucene developer
|
||||||
|
community for an initial reaction, advice, feedback and
|
||||||
|
consent. Following this it will be submitted to the
|
||||||
|
Lucene user community for support. Although, I'm (Andy
|
||||||
|
Oliver) capable of providing these enhancements by
|
||||||
|
myself, I'd of course prefer to work on them in concert
|
||||||
|
with others.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
While I'm outlaying a fairly large featureset, these can
|
||||||
|
be implemented incrementally of course (and are probably
|
||||||
|
best if done that way).
|
||||||
|
</p>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section name="Goal and Objectives">
|
||||||
|
<p>
|
||||||
|
The goal is to provide features to Lucene that allow it
|
||||||
|
to be used as a dropin search engine. It should provide
|
||||||
|
many of the features of projects like <a
|
||||||
|
href="http://www.htdig.org">htDig</a> while surpassing
|
||||||
|
them with unique Lucene features and capabillities such as
|
||||||
|
easy installation on and java-supporting platform,
|
||||||
|
and support for document fields and field searches. And
|
||||||
|
of course, <a href="http://apache.org/LICENSE">
|
||||||
|
a pragmatic software license</a>.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
To reach this goal we'll implement code to support the
|
||||||
|
following objectives that augment but do not replace
|
||||||
|
the current Lucene featureset.
|
||||||
|
</p>
|
||||||
|
<ul>
|
||||||
|
<li>
|
||||||
|
Document Location Independance - meaning mapping
|
||||||
|
real contexts to runtime contexts.
|
||||||
|
Essentially, if the document is at
|
||||||
|
/var/www/htdocs/mydoc.html, I probably want it
|
||||||
|
indexed as
|
||||||
|
http://www.bigevilmegacorp.com/mydoc.html.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
Standard methods of creating central indicies -
|
||||||
|
file system indexing is probably less useful in
|
||||||
|
many environments than is *remote* indexing (for
|
||||||
|
instance http). I would suggest that most folks
|
||||||
|
would prefer that general functionality be
|
||||||
|
suppored by Lucene instead of having to write
|
||||||
|
code for every indexing project. Obviously, if
|
||||||
|
what they are doing is *special* they'll have to
|
||||||
|
code, but general document indexing accross
|
||||||
|
webservers would not qualify.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
Document interperatation abstraction - currently
|
||||||
|
one must handle document object construction via
|
||||||
|
custom code. A standard interface for plugging
|
||||||
|
in format handlers should be supported.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
Mime and file-extension to document
|
||||||
|
interperatation mapping.
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</section>
|
||||||
|
<section name="Indexers">
|
||||||
|
<p>
|
||||||
|
Indexers are standard crawlers. They go crawl a file
|
||||||
|
system, ftp site, web site, etc. to create the index.
|
||||||
|
These standard indexers may not make ALL of Lucene's
|
||||||
|
functionality available, though they should be able to
|
||||||
|
make most of it available through configuration.
|
||||||
|
</p>
|
||||||
|
<!--<section name="AbstractIndexer">-->
|
||||||
|
<p>
|
||||||
|
<b> Abstract Indexer </b>
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
The Abstract indexer is basically the parent for all
|
||||||
|
Indexer classes. It provides implementation for the
|
||||||
|
following functions/properties:
|
||||||
|
</p>
|
||||||
|
<ul>
|
||||||
|
<li>
|
||||||
|
index path - where to write the index.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
cui - create or update the index
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
root context - the start of the pathname
|
||||||
|
that should be replaced by the
|
||||||
|
replace with property or dropped
|
||||||
|
entirely. Example: /opt/tomcat/webapps
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
replace with - when specified replaces
|
||||||
|
the root context. Example:
|
||||||
|
http://jakarta.apache.org.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
replacement type - the type of
|
||||||
|
replacewith path: relative, url or
|
||||||
|
path.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
location - the location to start
|
||||||
|
indexing at.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
doctypes - only index documents with
|
||||||
|
these doctypes. If not specified all
|
||||||
|
registered mime-types are used.
|
||||||
|
Example: "xml,doc,html"
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
recursive - if not specified is turned
|
||||||
|
off.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
level - optional level of directory or
|
||||||
|
links to traverse. By default is
|
||||||
|
assumed to be infinite. Recursive must
|
||||||
|
be turned on or this is ignored. Range:
|
||||||
|
0 - Long.MAX_VALUE.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
properties - in addition to the settings
|
||||||
|
(probably from the command line) read
|
||||||
|
this properties file and get them from
|
||||||
|
it. Command line options override
|
||||||
|
the properties file in the case of
|
||||||
|
duplicates. There should also be an
|
||||||
|
enivironment variable or VM parameter to
|
||||||
|
set this.
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<!--</section>-->
|
||||||
|
<!--<s2 title="FileSystemIndexer">-->
|
||||||
|
<p>
|
||||||
|
<b>FileSystemIndexer</b>
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This should extend the AbstractIndexer and
|
||||||
|
support any addtional options required for a
|
||||||
|
filesystem index.
|
||||||
|
</p>
|
||||||
|
<!--</s2>-->
|
||||||
|
<!--<s2 title="HTTPIndexer">-->
|
||||||
|
<p>
|
||||||
|
<b>HTTP Indexer </b>
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
Supports the AbstractIndexer options as well as:
|
||||||
|
</p>
|
||||||
|
<ul>
|
||||||
|
<li>
|
||||||
|
span hosts - Wheter to span hosts or not,
|
||||||
|
by default this should be no.
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
restrict domains - (ignored if span
|
||||||
|
hosts is not enabled). Whether all
|
||||||
|
spanned hosts must be in the same domain
|
||||||
|
(default is off).
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
try directories - Whether to attempt
|
||||||
|
directory listings or not (so if you
|
||||||
|
recurse and go to
|
||||||
|
/nextcontext/index.html this option says
|
||||||
|
to also try /nextcontext to get the dir
|
||||||
|
lsiting)
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
map extensions -
|
||||||
|
(always/default/never/fallback). Wether
|
||||||
|
to always use extension mapping, by
|
||||||
|
default (fallback to mime type), NEVER
|
||||||
|
or fallback if mime is not available
|
||||||
|
(default).
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
ignore robots - ignore robots.txt, on or
|
||||||
|
off (default - off)
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<!-- </s2> -->
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section name="MIMEMap">
|
||||||
|
<p>
|
||||||
|
A configurable registry of document types, their
|
||||||
|
description, an identifyer, mime-type and file
|
||||||
|
extension. This should map both MIME -> factory
|
||||||
|
and extension -> factory.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
This might be configured at compile time or by a
|
||||||
|
properties file, etc. For example:
|
||||||
|
</p>
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<td>Description</td>
|
||||||
|
<td>Identifier</td>
|
||||||
|
<td>Extensions</td>
|
||||||
|
<td>MimeType</td>
|
||||||
|
<td>DocumentFactory</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>"Word Document"</td>
|
||||||
|
<td>"doc"</td>
|
||||||
|
<td>"doc"</td>
|
||||||
|
<td>"vnd.application/ms-word"</td>
|
||||||
|
<td>POIWordDocumentFactory</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>"HTML Document"</td>
|
||||||
|
<td>"html"</td>
|
||||||
|
<td>"html,htm"</td>
|
||||||
|
<td></td>
|
||||||
|
<td>HTMLDocumentFactory</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</section>
|
||||||
|
<section name="DocumentFactory">
|
||||||
|
<p>
|
||||||
|
An interface for classes which create document objects
|
||||||
|
for particular file types. Examples:
|
||||||
|
HTMLDocumentFactory, DOCDocumentFactory,
|
||||||
|
XLSDocumentFactory, XML DocumentFactory.
|
||||||
|
</p>
|
||||||
|
</section>
|
||||||
|
<section name="FieldMapping classes">
|
||||||
|
<p>
|
||||||
|
A class taht maps standard fields from the
|
||||||
|
DocumentFactories into *fields* in the Document objects
|
||||||
|
they create. I suggest that a regular expression system
|
||||||
|
or xpath might be the most universal way to do this.
|
||||||
|
For instance if perhaps I had an XML factory that
|
||||||
|
represented XML elements as fields, I could map content
|
||||||
|
from particular fields to ther fields or supress them
|
||||||
|
entirely. We could even make this configurable.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
|
||||||
|
for example:
|
||||||
|
</p>
|
||||||
|
<ul>
|
||||||
|
<li>
|
||||||
|
htmldoc.properties
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
suppress=*
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
author=content:g/author\:\ ........................................./
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
author.suppress=false
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
title=content:g/title\:\ ........................................./
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
title.suppress=false
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<p>
|
||||||
|
In this example we map html documents such that all
|
||||||
|
fields are suppressed but author and title. We map
|
||||||
|
author and title to anything in the content matching
|
||||||
|
author: (and x characters). Okay my regular expresions
|
||||||
|
suck but hopefully you get the idea.
|
||||||
|
</p>
|
||||||
|
</section>
|
||||||
|
<section name="Final Thoughts">
|
||||||
|
<p>
|
||||||
|
We might also consider eliminating the DocumentFactory
|
||||||
|
entirely by making an AbstractDocument from which the
|
||||||
|
current document object would inherit from. I
|
||||||
|
experimented with this locally, and it was a relatively
|
||||||
|
minor code change and there was of course no difference
|
||||||
|
in performance. The Document Factory classes would
|
||||||
|
instead be instances of various subclasses of
|
||||||
|
AbstractDocument.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
My inspiration for this is HTDig (http://www.htdig.org/).
|
||||||
|
While this goes slightly beyond what HTDig provides by
|
||||||
|
providing field mapping (where HTDIG is just interested
|
||||||
|
in Strings/numbers wherever they are found), it provides
|
||||||
|
at least what I would need to use this as a dropin for
|
||||||
|
most places I contract at (with the obvious exception of
|
||||||
|
a default set of content handlers which would of course
|
||||||
|
develop naturally over time).
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
I am able to certainly contribute to this effort if the
|
||||||
|
development community is open to it. I'd suggest we do
|
||||||
|
it iteratively in stages and not aim for all of this at
|
||||||
|
once (for instance leave out the field mapping at first).
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
|
||||||
|
Anyhow, please give me some feedback, counter
|
||||||
|
suggestions, let me know if I'm way off base or out of
|
||||||
|
line, etc. -Andy
|
||||||
|
</p>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
</body>
|
||||||
|
</document>
|
|
@ -23,6 +23,10 @@
|
||||||
<item name="Contributions" href="/contributions.html"/>
|
<item name="Contributions" href="/contributions.html"/>
|
||||||
</menu>
|
</menu>
|
||||||
|
|
||||||
|
<menu name="Plans">
|
||||||
|
<item name="Application Extensions" href="/luceneplan.html"/>
|
||||||
|
</menu>
|
||||||
|
|
||||||
<menu name="Download">
|
<menu name="Download">
|
||||||
<item name="Binaries" href="/site/binindex.html"/>
|
<item name="Binaries" href="/site/binindex.html"/>
|
||||||
<item name="Source Code" href="/site/sourceindex.html"/>
|
<item name="Source Code" href="/site/sourceindex.html"/>
|
||||||
|
|
|
@ -38,6 +38,7 @@ the <a href="http://jakarta.apache.org/site/mail.html">Jakarta-Lucene mailing li
|
||||||
<li><b>Dave Kor</b> (davekor at apache.org)</li>
|
<li><b>Dave Kor</b> (davekor at apache.org)</li>
|
||||||
<li><b>Jon Stevens</b> (jon at latchkey.com)</li>
|
<li><b>Jon Stevens</b> (jon at latchkey.com)</li>
|
||||||
<li><b>Tal Dayan</b> (zapta at apache.org)</li>
|
<li><b>Tal Dayan</b> (zapta at apache.org)</li>
|
||||||
|
<li><a href="http://www.trilug.org/~acoliver">Andrew C. Oliver</a> (acoliver at apache dot org)</li>
|
||||||
</ul>
|
</ul>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
|
|
Loading…
Reference in New Issue