mirror of https://github.com/apache/lucene.git
- Fixed spelling a bit.
- Nukes trailing blank spaces. git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@149766 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
241f32309d
commit
abefb1b48e
|
@ -1,5 +1,5 @@
|
|||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
|
||||
|
||||
<document>
|
||||
<properties>
|
||||
<title>Plan for enhancements to Lucene</title>
|
||||
|
@ -8,7 +8,7 @@
|
|||
</authors>
|
||||
</properties>
|
||||
<body>
|
||||
|
||||
|
||||
<section name="Purpose">
|
||||
<p>
|
||||
The purpose of this document is to outline plans for
|
||||
|
@ -21,8 +21,8 @@
|
|||
The best reference is <a href="http://www.htdig.org">
|
||||
htDig</a>, though it is not quite as sophisticated as
|
||||
Lucene, it has a number of features that make it
|
||||
desireable. It however is a traditional c-compiled app
|
||||
which makes it somewhat unpleasent to install on some
|
||||
desirable. It however is a traditional c-compiled app
|
||||
which makes it somewhat unpleasant to install on some
|
||||
platforms (like Solaris!).
|
||||
</p>
|
||||
<p>
|
||||
|
@ -30,42 +30,42 @@
|
|||
community for an initial reaction, advice, feedback and
|
||||
consent. Following this it will be submitted to the
|
||||
Lucene user community for support. Although, I'm (Andy
|
||||
Oliver) capable of providing these enhancements by
|
||||
myself, I'd of course prefer to work on them in concert
|
||||
Oliver) capable of providing these enhancements by
|
||||
myself, I'd of course prefer to work on them in concert
|
||||
with others.
|
||||
</p>
|
||||
<p>
|
||||
While I'm outlaying a fairly large featureset, these can
|
||||
While I'm outlaying a fairly large feature set, these can
|
||||
be implemented incrementally of course (and are probably
|
||||
best if done that way).
|
||||
</p>
|
||||
</section>
|
||||
|
||||
|
||||
<section name="Goal and Objectives">
|
||||
<p>
|
||||
The goal is to provide features to Lucene that allow it
|
||||
to be used as a dropin search engine. It should provide
|
||||
to be used as a drop-in search engine. It should provide
|
||||
many of the features of projects like <a
|
||||
href="http://www.htdig.org">htDig</a> while surpassing
|
||||
them with unique Lucene features and capabillities such as
|
||||
them with unique Lucene features and capabilities such as
|
||||
easy installation on and java-supporting platform,
|
||||
and support for document fields and field searches. And
|
||||
and support for document fields and field searches. And
|
||||
of course, <a href="http://apache.org/LICENSE">
|
||||
a pragmatic software license</a>.
|
||||
</p>
|
||||
<p>
|
||||
To reach this goal we'll implement code to support the
|
||||
following objectives that augment but do not replace
|
||||
the current Lucene featureset.
|
||||
the current Lucene feature set.
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
Document Location Independance - meaning mapping
|
||||
Document Location Independence - meaning mapping
|
||||
real contexts to runtime contexts.
|
||||
Essentially, if the document is at
|
||||
/var/www/htdocs/mydoc.html, I probably want it
|
||||
indexed as
|
||||
http://www.bigevilmegacorp.com/mydoc.html.
|
||||
http://www.bigevilmegacorp.com/mydoc.html.
|
||||
</li>
|
||||
<li>
|
||||
Standard methods of creating central indicies -
|
||||
|
@ -73,21 +73,21 @@
|
|||
many environments than is *remote* indexing (for
|
||||
instance http). I would suggest that most folks
|
||||
would prefer that general functionality be
|
||||
suppored by Lucene instead of having to write
|
||||
supported by Lucene instead of having to write
|
||||
code for every indexing project. Obviously, if
|
||||
what they are doing is *special* they'll have to
|
||||
code, but general document indexing accross
|
||||
webservers would not qualify.
|
||||
code, but general document indexing across
|
||||
web servers would not qualify.
|
||||
</li>
|
||||
<li>
|
||||
Document interperatation abstraction - currently
|
||||
Document interpretation abstraction - currently
|
||||
one must handle document object construction via
|
||||
custom code. A standard interface for plugging
|
||||
in format handlers should be supported.
|
||||
in format handlers should be supported.
|
||||
</li>
|
||||
<li>
|
||||
Mime and file-extension to document
|
||||
interperatation mapping.
|
||||
interpretation mapping.
|
||||
</li>
|
||||
</ul>
|
||||
</section>
|
||||
|
@ -128,7 +128,7 @@
|
|||
</li>
|
||||
<li>
|
||||
replacement type - the type of
|
||||
replacewith path: relative, url or
|
||||
replace with path: relative, URL or
|
||||
path.
|
||||
</li>
|
||||
<li>
|
||||
|
@ -153,8 +153,8 @@
|
|||
0 - Long.MAX_VALUE.
|
||||
</li>
|
||||
<li>
|
||||
SleeptimeBetweenCalls - can be used to
|
||||
avoid flooding a machine with too many
|
||||
SleeptimeBetweenCalls - can be used to
|
||||
avoid flooding a machine with too many
|
||||
requests
|
||||
</li>
|
||||
<li>
|
||||
|
@ -163,12 +163,12 @@
|
|||
inactivity.
|
||||
</li>
|
||||
<li>
|
||||
IncludeFilter - include only items
|
||||
matching filter. (can occur mulitple
|
||||
IncludeFilter - include only items
|
||||
matching filter. (can occur multiple
|
||||
times)
|
||||
</li>
|
||||
<li>
|
||||
ExcludeFilter - exclude only items
|
||||
ExcludeFilter - exclude only items
|
||||
matching filter. (can occur multiple
|
||||
times)
|
||||
</li>
|
||||
|
@ -196,9 +196,9 @@
|
|||
(probably from the command line) read
|
||||
this properties file and get them from
|
||||
it. Command line options override
|
||||
the properties file in the case of
|
||||
the properties file in the case of
|
||||
duplicates. There should also be an
|
||||
enivironment variable or VM parameter to
|
||||
environment variable or VM parameter to
|
||||
set this.
|
||||
</li>
|
||||
</ul>
|
||||
|
@ -209,8 +209,8 @@
|
|||
</p>
|
||||
<p>
|
||||
This should extend the AbstractCrawler and
|
||||
support any addtional options required for a
|
||||
filesystem index.
|
||||
support any additional options required for a
|
||||
file system index.
|
||||
</p>
|
||||
<!--</s2>-->
|
||||
<!--<s2 title="HTTPIndexer">-->
|
||||
|
@ -218,12 +218,12 @@
|
|||
<b>HTTP Crawler </b>
|
||||
</p>
|
||||
<p>
|
||||
Supports the AbstractCrawler options as well as:
|
||||
Supports the AbstractCrawler options as well as:
|
||||
</p>
|
||||
<ul>
|
||||
<li>
|
||||
span hosts - Wheter to span hosts or not,
|
||||
by default this should be no.
|
||||
span hosts - Whether to span hosts or not,
|
||||
by default this should be no.
|
||||
</li>
|
||||
<li>
|
||||
restrict domains - (ignored if span
|
||||
|
@ -237,11 +237,11 @@
|
|||
recurse and go to
|
||||
/nextcontext/index.html this option says
|
||||
to also try /nextcontext to get the dir
|
||||
lsiting)
|
||||
listing)
|
||||
</li>
|
||||
<li>
|
||||
map extensions -
|
||||
(always/default/never/fallback). Wether
|
||||
(always/default/never/fallback). Whether
|
||||
to always use extension mapping, by
|
||||
default (fallback to mime type), NEVER
|
||||
or fallback if mime is not available
|
||||
|
@ -254,12 +254,12 @@
|
|||
</ul>
|
||||
<!-- </s2> -->
|
||||
</section>
|
||||
|
||||
|
||||
<section name="MIMEMap">
|
||||
<p>
|
||||
A configurable registry of document types, their
|
||||
description, an identifyer, mime-type and file
|
||||
extension. This should map both MIME -> factory
|
||||
description, an identifier, mime-type and file
|
||||
extension. This should map both MIME -> factory
|
||||
and extension -> factory.
|
||||
</p>
|
||||
<p>
|
||||
|
@ -287,7 +287,7 @@
|
|||
<td>"html,htm"</td>
|
||||
<td></td>
|
||||
<td>HTMLDocumentFactory</td>
|
||||
</tr>
|
||||
</tr>
|
||||
</table>
|
||||
</section>
|
||||
<section name="DocumentFactory">
|
||||
|
@ -300,17 +300,17 @@
|
|||
</section>
|
||||
<section name="FieldMapping classes">
|
||||
<p>
|
||||
A class taht maps standard fields from the
|
||||
A class that maps standard fields from the
|
||||
DocumentFactories into *fields* in the Document objects
|
||||
they create. I suggest that a regular expression system
|
||||
or xpath might be the most universal way to do this.
|
||||
For instance if perhaps I had an XML factory that
|
||||
represented XML elements as fields, I could map content
|
||||
from particular fields to ther fields or supress them
|
||||
from particular fields to their fields or suppress them
|
||||
entirely. We could even make this configurable.
|
||||
</p>
|
||||
<p>
|
||||
|
||||
|
||||
for example:
|
||||
</p>
|
||||
<ul>
|
||||
|
@ -333,48 +333,48 @@
|
|||
title.suppress=false
|
||||
</li>
|
||||
</ul>
|
||||
<p>
|
||||
In this example we map html documents such that all
|
||||
fields are suppressed but author and title. We map
|
||||
author and title to anything in the content matching
|
||||
author: (and x characters). Okay my regular expresions
|
||||
<p>
|
||||
In this example we map html documents such that all
|
||||
fields are suppressed but author and title. We map
|
||||
author and title to anything in the content matching
|
||||
author: (and x characters). Okay my regular expresions
|
||||
suck but hopefully you get the idea.
|
||||
</p>
|
||||
</section>
|
||||
<section name="Final Thoughts">
|
||||
<p>
|
||||
We might also consider eliminating the DocumentFactory
|
||||
entirely by making an AbstractDocument from which the
|
||||
current document object would inherit from. I
|
||||
experimented with this locally, and it was a relatively
|
||||
minor code change and there was of course no difference
|
||||
in performance. The Document Factory classes would
|
||||
instead be instances of various subclasses of
|
||||
We might also consider eliminating the DocumentFactory
|
||||
entirely by making an AbstractDocument from which the
|
||||
current document object would inherit from. I
|
||||
experimented with this locally, and it was a relatively
|
||||
minor code change and there was of course no difference
|
||||
in performance. The Document Factory classes would
|
||||
instead be instances of various subclasses of
|
||||
AbstractDocument.
|
||||
</p>
|
||||
<p>
|
||||
My inspiration for this is HTDig (http://www.htdig.org/).
|
||||
While this goes slightly beyond what HTDig provides by
|
||||
providing field mapping (where HTDIG is just interested
|
||||
in Strings/numbers wherever they are found), it provides
|
||||
at least what I would need to use this as a dropin for
|
||||
most places I contract at (with the obvious exception of
|
||||
a default set of content handlers which would of course
|
||||
My inspiration for this is HTDig (http://www.htdig.org/).
|
||||
While this goes slightly beyond what HTDig provides by
|
||||
providing field mapping (where HTDIG is just interested
|
||||
in Strings/numbers wherever they are found), it provides
|
||||
at least what I would need to use this as a drop-in for
|
||||
most places I contract at (with the obvious exception of
|
||||
a default set of content handlers which would of course
|
||||
develop naturally over time).
|
||||
</p>
|
||||
<p>
|
||||
I am able to certainly contribute to this effort if the
|
||||
development community is open to it. I'd suggest we do
|
||||
it iteratively in stages and not aim for all of this at
|
||||
I am able to certainly contribute to this effort if the
|
||||
development community is open to it. I'd suggest we do
|
||||
it iteratively in stages and not aim for all of this at
|
||||
once (for instance leave out the field mapping at first).
|
||||
</p>
|
||||
<p>
|
||||
|
||||
Anyhow, please give me some feedback, counter
|
||||
suggestions, let me know if I'm way off base or out of
|
||||
|
||||
Anyhow, please give me some feedback, counter
|
||||
suggestions, let me know if I'm way off base or out of
|
||||
line, etc. -Andy
|
||||
</p>
|
||||
</section>
|
||||
|
||||
|
||||
</body>
|
||||
</document>
|
||||
|
|
Loading…
Reference in New Issue