diff --git a/xdocs/luceneplan.xml b/xdocs/luceneplan.xml
index 6992784f312..825bf563a98 100644
--- a/xdocs/luceneplan.xml
+++ b/xdocs/luceneplan.xml
@@ -1,5 +1,5 @@
-
+
 Plan for enhancements to Lucene
@@ -8,7 +8,7 @@
-
+

 The purpose of this document is to outline plans for
@@ -21,8 +21,8 @@
 The best reference is htDig, though it is not quite as sophisticated as Lucene, it has a number of features that make it
- desireable. It however is a traditional c-compiled app
- which makes it somewhat unpleasent to install on some
+ desirable. It however is a traditional c-compiled app
+ which makes it somewhat unpleasant to install on some
 platforms (like Solaris!).

@@ -30,42 +30,42 @@
 community for an initial reaction, advice, feedback and consent. Following this it will be submitted to the Lucene user community for support. Although, I'm (Andy
- Oliver) capable of providing these enhancements by
- myself, I'd of course prefer to work on them in concert
+ Oliver) capable of providing these enhancements by
+ myself, I'd of course prefer to work on them in concert
 with others.

- While I'm outlaying a fairly large featureset, these can
+ While I'm outlining a fairly large feature set, these can
 be implemented incrementally of course (and are probably best if done that way).

- +

 The goal is to provide features to Lucene that allow it
- to be used as a dropin search engine. It should provide
+ to be used as a drop-in search engine. It should provide
 many of the features of projects like htDig while surpassing
- them with unique Lucene features and capabillities such as
+ them with unique Lucene features and capabilities such as
 easy installation on and java-supporting platform,
- and support for document fields and field searches. And
+ and support for document fields and field searches. And
 of course, a pragmatic software license.

 To reach this goal we'll implement code to support the following objectives that augment but do not replace
- the current Lucene featureset.
+ the current Lucene feature set.

@@ -128,7 +128,7 @@
  • replacement type - the type of - replacewith path: relative, url or + replace with path: relative, URL or path.
  • @@ -153,8 +153,8 @@ 0 - Long.MAX_VALUE.
  • - SleeptimeBetweenCalls - can be used to - avoid flooding a machine with too many + SleeptimeBetweenCalls - can be used to + avoid flooding a machine with too many requests
  • @@ -163,12 +163,12 @@ inactivity.
  • - IncludeFilter - include only items - matching filter. (can occur mulitple + IncludeFilter - include only items + matching filter. (can occur multiple times)
  • - ExcludeFilter - exclude only items + ExcludeFilter - exclude only items matching filter. (can occur multiple times)
  • @@ -196,9 +196,9 @@ (probably from the command line) read this properties file and get them from it. Command line options override - the properties file in the case of + the properties file in the case of duplicates. There should also be an - enivironment variable or VM parameter to + environment variable or VM parameter to set this. @@ -209,8 +209,8 @@

    This should extend the AbstractCrawler and - support any addtional options required for a - filesystem index. + support any additional options required for a + file system index.

    @@ -218,12 +218,12 @@ HTTP Crawler

    - Supports the AbstractCrawler options as well as: + Supports the AbstractCrawler options as well as:

    - +

    A configurable registry of document types, their - description, an identifyer, mime-type and file - extension. This should map both MIME -> factory + description, an identifier, mime-type and file + extension. This should map both MIME -> factory and extension -> factory.
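As an illustration of the registry described in this hunk, here is a minimal Java sketch of a lookup that maps both MIME type -> factory and extension -> factory. The names `DocumentTypeRegistry`, `register`, `forMimeType`, and `forFileName` are hypothetical and not part of the plan or of Lucene; `DocumentFactory` is reduced to an empty placeholder interface, and the `text/html` MIME type is assumed for the "html,htm" entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Placeholder for the plan's document factories (assumption: real ones would build Documents). */
interface DocumentFactory {}

/** Hypothetical factory named after the "html,htm" entry in the plan. */
final class HTMLDocumentFactory implements DocumentFactory {}

/** Sketch of a registry that resolves a factory by MIME type or by file extension. */
final class DocumentTypeRegistry {
    private final Map<String, DocumentFactory> byMimeType = new HashMap<>();
    private final Map<String, DocumentFactory> byExtension = new HashMap<>();

    /** Registers one factory under a MIME type and a comma-separated extension list. */
    void register(String mimeType, String extensions, DocumentFactory factory) {
        byMimeType.put(mimeType.toLowerCase(Locale.ROOT), factory);
        for (String ext : extensions.split(",")) {
            byExtension.put(ext.trim().toLowerCase(Locale.ROOT), factory);
        }
    }

    /** Returns the factory for a MIME type, or null if none is registered. */
    DocumentFactory forMimeType(String mimeType) {
        return byMimeType.get(mimeType.toLowerCase(Locale.ROOT));
    }

    /** Returns the factory for a file name based on its extension, or null if unknown. */
    DocumentFactory forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension to look up
        }
        return byExtension.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

A crawler could call `forMimeType` when the HTTP response carries a content type and fall back to `forFileName` for file system indexing, though the plan does not prescribe that order.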

    @@ -287,7 +287,7 @@ "html,htm" HTMLDocumentFactory - +

    @@ -300,17 +300,17 @@

    - A class taht maps standard fields from the + A class that maps standard fields from the DocumentFactories into *fields* in the Document objects they create. I suggest that a regular expression system or xpath might be the most universal way to do this. For instance if perhaps I had an XML factory that represented XML elements as fields, I could map content - from particular fields to ther fields or supress them + from particular fields to their fields or suppress them entirely. We could even make this configurable.

    - + for example:

    -

    - In this example we map html documents such that all - fields are suppressed but author and title. We map - author and title to anything in the content matching - author: (and x characters). Okay my regular expresions +

    + In this example we map html documents such that all + fields are suppressed but author and title. We map + author and title to anything in the content matching + author: (and x characters). Okay my regular expressions suck but hopefully you get the idea.
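One possible reading of the mapping described in this hunk, sketched in Java with `java.util.regex`: only explicitly mapped fields (here author and title) are kept, and each is filled from the first content match of its pattern. The class `RegexFieldMapper`, its methods, and the sample patterns are assumptions made for illustration; the plan does not fix a configuration format or an API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical field mapper: keeps only mapped fields, filled from regex matches. */
final class RegexFieldMapper {
    // Field name -> pattern; each pattern must contain one capturing group.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    /** Maps a target field to a regular expression whose group 1 is the field value. */
    void map(String fieldName, String regex) {
        rules.put(fieldName, Pattern.compile(regex));
    }

    /** Extracts mapped fields from raw content; anything unmapped is suppressed. */
    Map<String, String> apply(String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(content);
            if (m.find()) {
                fields.put(rule.getKey(), m.group(1).trim());
            }
        }
        return fields;
    }
}
```

For example, mapping `author` to `author:\s*(.+)` and `title` to `title:\s*(.+)` would capture the rest of the line after each label, since `.` does not match line terminators by default.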

    - We might also consider eliminating the DocumentFactory - entirely by making an AbstractDocument from which the - current document object would inherit from. I - experimented with this locally, and it was a relatively - minor code change and there was of course no difference - in performance. The Document Factory classes would - instead be instances of various subclasses of + We might also consider eliminating the DocumentFactory + entirely by making an AbstractDocument from which the + current document object would inherit. I + experimented with this locally, and it was a relatively + minor code change and there was of course no difference + in performance. The Document Factory classes would + instead be instances of various subclasses of AbstractDocument.

    - My inspiration for this is HTDig (http://www.htdig.org/). - While this goes slightly beyond what HTDig provides by - providing field mapping (where HTDIG is just interested - in Strings/numbers wherever they are found), it provides - at least what I would need to use this as a dropin for - most places I contract at (with the obvious exception of - a default set of content handlers which would of course + My inspiration for this is HTDig (http://www.htdig.org/). + While this goes slightly beyond what HTDig provides by + providing field mapping (where HTDIG is just interested + in Strings/numbers wherever they are found), it provides + at least what I would need to use this as a drop-in for + most places I contract at (with the obvious exception of + a default set of content handlers which would of course develop naturally over time).

    - I am able to certainly contribute to this effort if the - development community is open to it. I'd suggest we do - it iteratively in stages and not aim for all of this at + I am able to certainly contribute to this effort if the + development community is open to it. I'd suggest we do + it iteratively in stages and not aim for all of this at once (for instance leave out the field mapping at first).

    - - Anyhow, please give me some feedback, counter - suggestions, let me know if I'm way off base or out of + + Anyhow, please give me some feedback, counter + suggestions, let me know if I'm way off base or out of line, etc. -Andy

    - +