From abefb1b48e9b05d722962c97a235e901efe4956d Mon Sep 17 00:00:00 2001
From: Otis Gospodnetic
The purpose of this document is to outline plans for
@@ -21,8 +21,8 @@
The best reference is
htDig, though it is not quite as sophisticated as
Lucene, it has a number of features that make it
- desireable. It however is a traditional c-compiled app
- which makes it somewhat unpleasent to install on some
+ desirable. It however is a traditional c-compiled app
+ which makes it somewhat unpleasant to install on some
platforms (like Solaris!).
@@ -30,42 +30,42 @@
community for an initial reaction, advice, feedback and
consent. Following this it will be submitted to the
Lucene user community for support. Although, I'm (Andy
- Oliver) capable of providing these enhancements by
- myself, I'd of course prefer to work on them in concert
+ Oliver) capable of providing these enhancements by
+ myself, I'd of course prefer to work on them in concert
with others.
- While I'm outlaying a fairly large featureset, these can
+ While I'm outlining a fairly large feature set, these can
be implemented incrementally of course (and are probably
best if done that way).
The goal is to provide features to Lucene that allow it
- to be used as a dropin search engine. It should provide
+ to be used as a drop-in search engine. It should provide
many of the features of projects like htDig while surpassing
- them with unique Lucene features and capabillities such as
+ them with unique Lucene features and capabilities such as
easy installation on and java-supporting platform,
- and support for document fields and field searches. And
+ and support for document fields and field searches. And
of course,
a pragmatic software license.
To reach this goal we'll implement code to support the
following objectives that augment but do not replace
- the current Lucene featureset.
+ the current Lucene feature set.
This should extend the AbstractCrawler and
- support any addtional options required for a
- filesystem index.
+ support any additional options required for a
+ file system index.
@@ -218,12 +218,12 @@ HTTP Crawler
- Supports the AbstractCrawler options as well as:
+ Supports the AbstractCrawler options as well as:
A configurable registry of document types, their
- description, an identifyer, mime-type and file
- extension. This should map both MIME -> factory
+ description, an identifier, mime-type and file
+ extension. This should map both MIME -> factory
and extension -> factory.
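
As a rough illustration of the registry described above, here is a minimal sketch that maps both MIME type and file extension to a factory. The names `DocumentFactory`, `DocumentTypeRegistry`, `register`, `forMime` and `forFileName` are hypothetical and are not part of Lucene or of the proposal; the case-insensitive lookup is also an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the proposal's document factories; not a Lucene API.
interface DocumentFactory {
    String description();
}

// Hypothetical registry that resolves a factory by MIME type or by file extension.
class DocumentTypeRegistry {
    private final Map<String, DocumentFactory> byMime = new HashMap<>();
    private final Map<String, DocumentFactory> byExtension = new HashMap<>();

    // Registers one factory under both keys (assumption: keys are case-insensitive).
    void register(String mimeType, String extension, DocumentFactory factory) {
        byMime.put(mimeType.toLowerCase(Locale.ROOT), factory);
        byExtension.put(extension.toLowerCase(Locale.ROOT), factory);
    }

    // MIME -> factory
    Optional<DocumentFactory> forMime(String mimeType) {
        return Optional.ofNullable(byMime.get(mimeType.toLowerCase(Locale.ROOT)));
    }

    // extension -> factory, using the text after the last dot of the file name
    Optional<DocumentFactory> forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(byExtension.get(extension));
    }

    public static void main(String[] args) {
        DocumentTypeRegistry registry = new DocumentTypeRegistry();
        registry.register("text/html", "html", () -> "HTML documents");
        System.out.println(registry.forMime("text/html")
                .map(DocumentFactory::description).orElse("unknown"));
        System.out.println(registry.forFileName("index.HTML")
                .map(DocumentFactory::description).orElse("unknown"));
    }
}
```

Keeping two separate maps lets an HTTP crawler resolve by the reported MIME type while a file system crawler falls back on the extension.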
@@ -287,7 +287,7 @@
- A class taht maps standard fields from the
+ A class that maps standard fields from the
DocumentFactories into *fields* in the Document objects
they create. I suggest that a regular expression system
or xpath might be the most universal way to do this.
For instance if perhaps I had an XML factory that
represented XML elements as fields, I could map content
- from particular fields to ther fields or supress them
+ from particular fields to their fields or suppress them
entirely. We could even make this configurable.
- + for example:
- In this example we map html documents such that all
- fields are suppressed but author and title. We map
- author and title to anything in the content matching
- author: (and x characters). Okay my regular expresions
+
+ In this example we map html documents such that all
+ fields are suppressed but author and title. We map
+ author and title to anything in the content matching
+ author: (and x characters). Okay my regular expressions
suck but hopefully you get the idea.
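
To make the field-mapping idea a little more concrete, here is a minimal sketch using `java.util.regex`. The `FieldMapper` class, its `map` and `apply` methods, the `label: value` content layout and the 40-character cap are all assumptions made for illustration; the proposal does not fix any of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mapper: each rule fills one target field from the first regex match.
class FieldMapper {
    // Target field name -> pattern whose first capture group supplies the value.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    FieldMapper map(String field, String regex) {
        rules.put(field, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        return this;
    }

    // Returns only the mapped fields; anything without a rule is suppressed.
    Map<String, String> apply(String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher matcher = rule.getValue().matcher(content);
            if (matcher.find()) {
                fields.put(rule.getKey(), matcher.group(1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumption: values follow a "label:" prefix and run for at most 40 characters.
        FieldMapper mapper = new FieldMapper()
                .map("author", "author:\\s*([^\\n<]{1,40})")
                .map("title", "title:\\s*([^\\n<]{1,40})");
        String content = "title: Lucene Crawler\nauthor: Andy Oliver\nbody: ignored";
        System.out.println(mapper.apply(content));
    }
}
```

Because only fields with a rule are emitted, suppression is the default and a configuration file would just need to list the fields to keep.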
- We might also consider eliminating the DocumentFactory
- entirely by making an AbstractDocument from which the
- current document object would inherit from. I
- experimented with this locally, and it was a relatively
- minor code change and there was of course no difference
- in performance. The Document Factory classes would
- instead be instances of various subclasses of
+ We might also consider eliminating the DocumentFactory
+ entirely by making an AbstractDocument from which the
+ current document object would inherit from. I
+ experimented with this locally, and it was a relatively
+ minor code change and there was of course no difference
+ in performance. The Document Factory classes would
+ instead be instances of various subclasses of
AbstractDocument.
- My inspiration for this is HTDig (http://www.htdig.org/).
- While this goes slightly beyond what HTDig provides by
- providing field mapping (where HTDIG is just interested
- in Strings/numbers wherever they are found), it provides
- at least what I would need to use this as a dropin for
- most places I contract at (with the obvious exception of
- a default set of content handlers which would of course
+ My inspiration for this is HTDig (http://www.htdig.org/).
+ While this goes slightly beyond what HTDig provides by
+ providing field mapping (where HTDIG is just interested
+ in Strings/numbers wherever they are found), it provides
+ at least what I would need to use this as a drop-in for
+ most places I contract at (with the obvious exception of
+ a default set of content handlers which would of course
develop naturally over time).
- I am able to certainly contribute to this effort if the
- development community is open to it. I'd suggest we do
- it iteratively in stages and not aim for all of this at
+ I am able to certainly contribute to this effort if the
+ development community is open to it. I'd suggest we do
+ it iteratively in stages and not aim for all of this at
once (for instance leave out the field mapping at first).
-
- Anyhow, please give me some feedback, counter
- suggestions, let me know if I'm way off base or out of
+
+ Anyhow, please give me some feedback, counter
+ suggestions, let me know if I'm way off base or out of
line, etc. -Andy