From 672827d4a423e41cafccbb03a5b3132682192725 Mon Sep 17 00:00:00 2001
From: Otis Gospodnetic
Date: Fri, 22 Apr 2005 04:30:42 +0000
Subject: [PATCH] - Not referenced from anywhere, and not really needed

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@164169 13f79535-47bb-0310-9956-ffa450edef68
---
 docs/luceneplan.html | 648 -------------------------------------------
 xdocs/luceneplan.xml | 380 --------------------------
 2 files changed, 1028 deletions(-)
 delete mode 100644 docs/luceneplan.html
 delete mode 100644 xdocs/luceneplan.xml

diff --git a/docs/luceneplan.html b/docs/luceneplan.html
deleted file mode 100644
index 5fb0b0b614d..00000000000
--- a/docs/luceneplan.html
+++ /dev/null
@@ -1,648 +0,0 @@

Apache Lucene - Plan for enhancements to Lucene

Purpose

The purpose of this document is to outline plans for making Apache Lucene work as a more general drop-in component. It assumes that this is an objective for the Lucene user and development community.

The best reference is htDig. Though it is not quite as sophisticated as Lucene, it has a number of features that make it desirable. It is, however, a traditional C-compiled app, which makes it somewhat unpleasant to install on some platforms (like Solaris!).

This plan is being submitted to the Lucene developer community for an initial reaction, advice, feedback and consent. Following this it will be submitted to the Lucene user community for support. Although I (Andy Oliver) am capable of providing these enhancements by myself, I'd of course prefer to work on them in concert with others.

While I'm laying out a fairly large feature set, these can of course be implemented incrementally (and are probably best done that way).

Goal and Objectives

The goal is to provide features for Lucene that allow it to be used as a drop-in search engine. It should provide many of the features of projects like htDig while surpassing them with unique Lucene features and capabilities, such as easy installation on any Java-supporting platform, support for document fields and field searches, and, of course, a pragmatic software license.

To reach this goal we'll implement code to support the following objectives, which augment but do not replace the current Lucene feature set.

  • Document Location Independence - mapping real contexts to runtime contexts. Essentially, if the document is at /var/www/htdocs/mydoc.html, I probably want it indexed as http://www.bigevilmegacorp.com/mydoc.html (see the sketch after this list).

  • Standard methods of creating central indices - file system indexing is probably less useful in many environments than *remote* indexing (for instance over HTTP). I would suggest that most folks would prefer that general functionality be supported by Lucene instead of having to write code for every indexing project. Obviously, if what they are doing is *special* they'll have to code, but general document indexing across web servers would not qualify.

  • Document interpretation abstraction - currently one must handle document object construction via custom code. A standard interface for plugging in format handlers should be supported.

  • MIME and file-extension to document interpretation mapping.

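A minimal sketch of that first objective, assuming a hypothetical helper class (nothing here is existing Lucene code):

  // Maps a document's real (filesystem) context to the runtime context
  // it should be indexed under; names are illustrative only.
  public final class ContextMapper {
      private final String rootContext;  // e.g. "/var/www/htdocs"
      private final String replaceWith;  // e.g. "http://www.bigevilmegacorp.com"

      public ContextMapper(String rootContext, String replaceWith) {
          this.rootContext = rootContext;
          this.replaceWith = replaceWith;
      }

      public String toRuntimeContext(String realPath) {
          if (!realPath.startsWith(rootContext)) {
              return realPath; // outside the root context: leave it untouched
          }
          return replaceWith + realPath.substring(rootContext.length());
      }
  }

With the values above, "/var/www/htdocs/mydoc.html" maps to "http://www.bigevilmegacorp.com/mydoc.html".
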
Crawlers

Crawlers are the executable code for a data source. They crawl a file system, FTP site, web site, etc. to create the index. These standard crawlers may not make ALL of Lucene's functionality available, though they should be able to make most of it available through configuration.

Abstract Crawler

The AbstractCrawler is basically the parent of all Crawler classes. It provides implementations of the following functions/properties (a skeleton sketch follows the list):

  • index path - where to write the index.

  • cui - create or update the index.

  • root context - the start of the pathname that should be replaced by the replace with property, or dropped entirely. Example: /opt/tomcat/webapps

  • replace with - when specified, replaces the root context. Example: http://jakarta.apache.org

  • replacement type - the type of replace with path: relative, URL or path.

  • location - the location to start indexing at.

  • doctypes - only index documents with these doctypes. If not specified, all registered mime-types are used. Example: "xml,doc,html"

  • recursive - off unless specified.

  • level - optional level of directories or links to traverse. By default assumed to be infinite. Recursive must be turned on or this is ignored. Range: 0 - Long.MAX_VALUE.

  • SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests.

  • RequestTimeout - kill the crawler request after the specified period of inactivity.

  • IncludeFilter - include only items matching the filter (can occur multiple times).

  • ExcludeFilter - exclude items matching the filter (can occur multiple times).

  • ExpandOnly - use but do not index items that match this pattern (regex?) (can occur multiple times).

  • NoExpand - index but do not follow the links in items that match this pattern (regex?) (can occur multiple times).

  • MaxItems - stop indexing after x documents have been indexed.

  • MaxMegs - stop indexing after x megabytes have been indexed (should this be in specific crawlers?).

  • properties - in addition to the settings (probably from the command line), read this properties file and get them from it. Command-line options override the properties file in the case of duplicates. There should also be an environment variable or VM parameter to set this.

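A skeleton of what this could look like, assuming configuration arrives as java.util.Properties; every name here is hypothetical rather than existing Lucene API:

  import java.util.Properties;

  // Hypothetical AbstractCrawler skeleton mirroring the properties above.
  public abstract class AbstractCrawler {
      protected String indexPath;               // index path: where to write the index
      protected boolean create = true;          // cui: create (true) or update (false)
      protected String rootContext;             // e.g. /opt/tomcat/webapps
      protected String replaceWith;             // e.g. http://jakarta.apache.org
      protected String location;                // where to start indexing
      protected boolean recursive = false;      // off unless specified
      protected long level = Long.MAX_VALUE;    // ignored unless recursive is on
      protected long sleeptimeBetweenCalls;     // millis to wait between requests
      protected long requestTimeout;            // millis of inactivity before giving up
      protected long maxItems = Long.MAX_VALUE; // stop after this many documents

      // Command-line options override the properties file on duplicates.
      public void configure(Properties fromFile, Properties fromCommandLine) {
          Properties merged = new Properties(fromFile); // file values as defaults
          merged.putAll(fromCommandLine);               // command line wins
          indexPath = merged.getProperty("index.path", indexPath);
          location  = merged.getProperty("location", location);
          recursive = "true".equals(merged.getProperty("recursive", "false"));
          // ... remaining properties would be read the same way
      }

      // Each concrete crawler (file system, HTTP, ...) supplies the traversal.
      public abstract void crawl() throws Exception;
  }
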
FileSystemCrawler

This should extend the AbstractCrawler and support any additional options required for a file system index.

HTTP Crawler

Supports the AbstractCrawler options as well as the following (a subclass sketch follows this list):

  • span hosts - whether to span hosts or not; by default this should be no.

  • restrict domains - (ignored if span hosts is not enabled) whether all spanned hosts must be in the same domain (default is off).

  • try directories - whether to attempt directory listings or not (so if you recurse and go to /nextcontext/index.html, this option says to also try /nextcontext to get the directory listing).

  • map extensions - one of always/default/never/fallback: always use extension mapping; use it by default (falling back to the mime type); never use it; or use it only as a fallback when no mime type is available (the default).

  • ignore robots - ignore robots.txt, on or off (default: off).

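Continuing the hypothetical AbstractCrawler skeleton sketched earlier, the HTTP-specific options might hang off a subclass like this (again, not existing Lucene code):

  // Hypothetical HTTPCrawler extending the AbstractCrawler sketch above.
  public class HTTPCrawler extends AbstractCrawler {
      private boolean spanHosts = false;         // proposed default: no
      private boolean restrictDomains = false;   // only consulted when spanHosts is on
      private boolean tryDirectories = false;    // also fetch /nextcontext for a listing
      private String mapExtensions = "fallback"; // always | default | never | fallback
      private boolean ignoreRobots = false;      // proposed default: honor robots.txt

      public void crawl() throws Exception {
          // Fetch location, honor robots.txt unless ignoreRobots is set,
          // follow links while recursive/level/filters allow, and hand each
          // response to the DocumentFactory selected through the MIMEMap below.
      }
  }
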
MIMEMap

A configurable registry of document types: each entry carries a description, an identifier, a mime-type and file extensions. This should map both MIME -> factory and extension -> factory.

This might be configured at compile time or by a properties file, etc. For example:

  Description      Identifier  Extensions  MimeType                   DocumentFactory
  "Word Document"  "doc"       "doc"       "vnd.application/ms-word"  POIWordDocumentFactory
  "HTML Document"  "html"      "html,htm"  (none)                     HTMLDocumentFactory

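A sketch of such a registry, assuming the DocumentFactory interface proposed in the next section (all names hypothetical; description and identifier omitted for brevity):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical MIMEMap: maps both mime-type -> factory and
  // extension -> factory, as described above.
  public class MIMEMap {
      private final Map byMimeType  = new HashMap(); // mime-type -> DocumentFactory
      private final Map byExtension = new HashMap(); // extension -> DocumentFactory

      public void register(String mimeType, String[] extensions, DocumentFactory factory) {
          if (mimeType != null) {
              byMimeType.put(mimeType, factory);
          }
          for (int i = 0; i < extensions.length; i++) {
              byExtension.put(extensions[i], factory);
          }
      }

      // Mime lookup first, falling back to the file extension.
      public DocumentFactory lookup(String mimeType, String extension) {
          DocumentFactory factory = (DocumentFactory) byMimeType.get(mimeType);
          return factory != null ? factory : (DocumentFactory) byExtension.get(extension);
      }
  }
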
DocumentFactory

An interface for classes which create document objects for particular file types. Examples: HTMLDocumentFactory, DOCDocumentFactory, XLSDocumentFactory, XMLDocumentFactory.

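One possible shape for that interface; only org.apache.lucene.document.Document is real Lucene API, the rest is an assumption:

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.lucene.document.Document;

  // Hypothetical DocumentFactory contract: turn one source file's raw
  // bytes into a Lucene Document.
  public interface DocumentFactory {
      Document createDocument(String runtimeContext, InputStream content)
              throws IOException;
  }
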
FieldMapping classes

A class that maps standard fields from the DocumentFactories into *fields* in the Document objects they create. I suggest that a regular expression system or XPath might be the most universal way to do this. For instance, if I had an XML factory that represented XML elements as fields, I could map content from particular elements to particular fields or suppress them entirely. We could even make this configurable.

For example:

  htmldoc.properties:

    suppress=*
    author=content:g/author\:\ ........................................./
    author.suppress=false
    title=content:g/title\:\ ........................................./
    title.suppress=false

In this example we map HTML documents such that all fields are suppressed except author and title. We map author and title to anything in the content matching "author: " (and some fixed number of characters). Okay, my regular expressions suck, but hopefully you get the idea; a small sketch of applying one such rule follows.

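A sketch of applying one such mapping rule with java.util.regex; the class name, rule format and capture-group convention are all assumptions:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Hypothetical FieldMapper: extracts one mapped field's value from
  // a document's content using a configured regular expression.
  public class FieldMapper {
      private final String fieldName; // e.g. "author"
      private final Pattern pattern;  // must contain one capture group

      public FieldMapper(String fieldName, String regex) {
          this.fieldName = fieldName;
          this.pattern = Pattern.compile(regex);
      }

      public String getFieldName() {
          return fieldName;
      }

      // Returns the mapped value, or null when the content has no match.
      public String extract(String content) {
          Matcher m = pattern.matcher(content);
          return m.find() ? m.group(1) : null;
      }
  }

So new FieldMapper("author", "author:\\s*(.{1,40})").extract(pageText) would populate the author field, while suppress=* keeps everything else out of the index.
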
Final Thoughts

We might also consider eliminating the DocumentFactory entirely by making an AbstractDocument from which the current Document class would inherit. I experimented with this locally; it was a relatively minor code change, and there was of course no difference in performance. The DocumentFactory classes would instead be replaced by instances of various subclasses of AbstractDocument.

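Roughly, that alternative could look like the following; Lucene's Document has no such parent today, so this is purely illustrative:

  import java.io.IOException;
  import java.io.InputStream;

  // Hypothetical AbstractDocument: the existing Document class would
  // inherit from this, and format handlers would subclass it instead
  // of implementing DocumentFactory.
  public abstract class AbstractDocument {
      // Format-specific subclasses parse raw content and add their fields.
      public abstract void load(InputStream content) throws IOException;
  }

  // e.g. an HTMLDocument subclass would parse HTML in load() and add
  // title/author/content fields to itself.
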
My inspiration for this is HTDig (http://www.htdig.org/). While this goes slightly beyond what HTDig provides by adding field mapping (HTDig is just interested in strings/numbers wherever they are found), it provides at least what I would need to use this as a drop-in at most places I contract at (with the obvious exception of a default set of content handlers, which would of course develop naturally over time).

I can certainly contribute to this effort if the development community is open to it. I'd suggest we do it iteratively, in stages, and not aim for all of this at once (for instance, leave out the field mapping at first).

Anyhow, please give me some feedback or counter-suggestions, and let me know if I'm way off base or out of line, etc. -Andy

diff --git a/xdocs/luceneplan.xml b/xdocs/luceneplan.xml
deleted file mode 100644
index 5ca0ec7af3e..00000000000
--- a/xdocs/luceneplan.xml
+++ /dev/null