Jakarta Lucene

Lucene Sandbox

The Lucene project also contains a workspace, the Lucene Sandbox, that is open to all Lucene committers as well as a few other developers. The Sandbox hosts third-party contributions and serves as a place to try out new ideas and prepare them for inclusion in the core Lucene distribution.
Users are free to experiment with the components developed in the Sandbox, but Sandbox components will not necessarily be maintained, particularly in their current state.

You can access the Lucene Sandbox CVS repository at http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/.

Indyo

Indyo is a datasource-independent Lucene indexing framework.

A tutorial for using Indyo can be found here.


LARM

LARM is a web crawler optimized for large intranets with up to a couple of hundred hosts.

Technical Overview

Snowball Stemmers for Lucene

This project provides pre-compiled versions of the Snowball stemmers for Lucene.

More information can be found here.

Background information is available on Snowball, a language for writing stemmers developed by Martin Porter.


Ant

The Ant contribution is an Ant task that creates a Lucene index from an Ant fileset. It also contains an example HTML parser that uses JTidy.

The CVS repository for the Ant contribution.


SearchBean

SearchBean is a UI component that can be used to browse through the results of a Lucene search. The SearchBean searches the index for a given query string, retrieves the hits, and then brings them into the HitsIterator class, which can be used for paging and sorting through search results.
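
The paging idea described above can be sketched in plain Java. This is only an illustration: PagedResults is a hypothetical class invented for this example, not the Sandbox HitsIterator, and it pages over an in-memory list instead of a real Lucene Hits object. Sorting is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not the Sandbox HitsIterator): pages over an
// in-memory list that stands in for the hits of a search.
class PagedResults<T> {
    private final List<T> hits;
    private final int pageSize;

    PagedResults(List<T> hits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.hits = new ArrayList<>(hits);
        this.pageSize = pageSize;
    }

    // Number of pages needed to show every hit (0 when there are no hits).
    int pageCount() {
        return (hits.size() + pageSize - 1) / pageSize;
    }

    // Returns the hits on the given 1-based page, or an empty list
    // when the page number is out of range.
    List<T> page(int pageNumber) {
        if (pageNumber < 1 || pageNumber > pageCount()) {
            return Collections.emptyList();
        }
        int from = (pageNumber - 1) * pageSize;
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        PagedResults<String> results = new PagedResults<>(
            Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5"), 2);
        for (int p = 1; p <= results.pageCount(); p++) {
            System.out.println("page " + p + ": " + results.page(p));
        }
    }
}
```

A UI layer would typically keep the current page number in the session or request and ask for one page at a time, so only the visible hits need to be rendered.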

The CVS repository for the SearchBean contribution.


Lucene Service for Fulcrum

Lucene can be run as a service inside Fulcrum, which is the services framework from the Turbine project.

The implementation consists of a SearchService interface, a LuceneSearchSearchService implementation, and a SearchResults object that gets an array of Document objects from a Hits object. Calls to the search methods on the service return the SearchResults object.

The service supports querying, but does not support indexing.

CVS repository for the Fulcrum Service.


WordNet/Synonyms

The Lucene WordNet code consists of a single class that parses a Prolog file from the WordNet site containing a list of English words and their synonyms. The class builds a Lucene index from the synonyms file. Your querying code can then search this index to build up a set of synonyms for the terms in the search query.
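
As a rough sketch of the synonym lookup idea, the following plain Java groups words by synset and returns the other words that share a synset with a query term. It is not the Sandbox class: SynonymTable is a hypothetical name, an in-memory map stands in for the Lucene index, and the line shape s(synset_id,w_num,'word',...) is an assumption about the Prolog synonyms file. The sample lines in main are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a synonym index: two in-memory maps
// instead of a Lucene index.
class SynonymTable {
    // Assumed line shape: s(synset_id,w_num,'word',<other fields>).
    private static final Pattern LINE =
        Pattern.compile("^s\\((\\d+),\\d+,'(.*)',.*\\)\\.$");

    private final Map<String, Set<String>> wordsBySynset = new HashMap<>();
    private final Map<String, Set<String>> synsetsByWord = new HashMap<>();

    // Parses one line; lines that do not match the assumed shape are skipped.
    void addLine(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return;
        }
        String synset = m.group(1);
        // Assumption: a doubled quote inside a word encodes one apostrophe.
        String word = m.group(2).replace("''", "'").toLowerCase();
        wordsBySynset.computeIfAbsent(synset, k -> new HashSet<>()).add(word);
        synsetsByWord.computeIfAbsent(word, k -> new HashSet<>()).add(synset);
    }

    // Returns every other word that shares at least one synset with the
    // given word, in alphabetical order.
    Set<String> synonyms(String word) {
        String key = word.toLowerCase();
        Set<String> result = new TreeSet<>();
        for (String synset : synsetsByWord.getOrDefault(key, new HashSet<>())) {
            result.addAll(wordsBySynset.get(synset));
        }
        result.remove(key);
        return result;
    }

    public static void main(String[] args) {
        SynonymTable table = new SynonymTable();
        // Made-up sample lines, not taken from the real WordNet file.
        table.addLine("s(100000001,1,'car',n,1,0).");
        table.addLine("s(100000001,2,'auto',n,1,0).");
        table.addLine("s(100000002,1,'car',n,2,0).");
        table.addLine("s(100000002,2,'railcar',n,1,0).");
        System.out.println(table.synonyms("car"));
    }
}
```

Query code could expand each search term with the returned set, for example by OR-ing the synonyms into the query.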

More information on the Lucene WordNet package. WordNet is an online database of English language words that contains synonyms, definitions, and various relationships between synonym sets.

CVS for the WordNet module.


SAX/DOM XML Indexing demo

This contribution is sample code that demonstrates adding simple XML documents to an index. It creates a new Document object for each file, and then populates the Document with a Field for each XML element, recursively. There are examples included for both SAX and DOM.
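
The recursive element-to-field walk can be sketched with the JDK's DOM parser alone. This is not the Sandbox demo code: XmlFieldCollector is a hypothetical class, and a map from element name to text values stands in for a Lucene Document and its Fields. Only elements that directly contain text produce an entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: collects (element name -> text values) pairs,
// standing in for the Fields of a Lucene Document.
class XmlFieldCollector {

    // Parses the XML string and returns one entry per element name
    // that directly contains non-blank text.
    static Map<String, List<String>> collect(String xml) {
        try {
            org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> fields = new LinkedHashMap<>();
            walk(dom.getDocumentElement(), fields);
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Records this element's own text, then recurses into child elements.
    private static void walk(Element element, Map<String, List<String>> fields) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            fields.computeIfAbsent(element.getTagName(), k -> new ArrayList<>()).add(value);
        }
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) child, fields);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<book><title>Lucene</title>"
            + "<author>Doug</author><author>Otis</author></book>";
        System.out.println(collect(xml));
    }
}
```

In an indexing program, each collected pair would become one field on the document for that file before the document is added to the index.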

CVS for the XML Indexing Demo.


High Frequency Terms

The miscellaneous package is for classes that don't fit anywhere else. The only class in it right now determines which terms occur most often in a Lucene index. This can be useful for deciding which terms to add to a custom stop word list for better search results.
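
To illustrate the idea of ranking terms by frequency, here is a minimal plain-Java sketch. It does not read a Lucene index: HighFreqTermsSketch is a hypothetical class that counts, over a list of in-memory strings, how many documents contain each term, and it uses naive lowercase splitting on non-word characters as an assumed stand-in for an analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: ranks terms by the number of documents they
// appear in, using in-memory strings instead of a Lucene index.
class HighFreqTermsSketch {

    // Returns up to n (term, document count) entries, highest count first;
    // ties are broken alphabetically by term.
    static List<Map.Entry<String, Integer>> top(List<String> docs, int n) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                // Count each term at most once per document.
                if (!token.isEmpty() && seen.add(token)) {
                    docFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the quick fox", "the lazy dog", "the fox and the dog");
        for (Map.Entry<String, Integer> e : top(docs, 3)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Terms that appear in nearly every document, such as "the" in the sample data, are the natural candidates for a stop word list.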

CVS for miscellaneous classes.




Copyright © 1999-2003, Apache Software Foundation