Jakarta Lucene

Purpose

This document describes the tasks on the plate of the Lucene development team. Each task is assigned to one of two categories: core or non-core.


About Core vs. Non-Core Development

Currently the Lucene development team is working on categorizing change requests into core and non-core changes.

Core changes would entail a change to the search engine core itself. From Doug Cutting:

"Examples include: file locking to make things multi-process safe; adding an API for boosting individual documents and field values; making the scoring API extensible and public; etc."

Non-core changes would not affect the search engine itself, but would consist instead of projects or components that would make useful additions to the core framework. Again, from Doug Cutting:

"[Examples] include: support for more languages; query parsers; database storage; crawlers, etc. Whether these belong in the base distribution is a matter of debate (sometimes hot). My rule of thumb for including them is their generality: if they are likely to be useful to a large proportion of Lucene users then they should probably go in the base distribution. Language support in particular is tricky. Perhaps we should migrate to a model where the base distribution includes no analyzers, and supply separate language packages."

Change requests will be classified by the development team (the committers) as core or non-core, and a committer will be assigned responsibility for coordinating development of each change request. All change requests should be submitted to one of the Lucene mailing lists or through the Apache Bugzilla database.


Core Development Changes

No change requests classified as core yet!


Non-Core Development Changes

No change requests classified as non-core yet!


Unclassified Changes

- Term vector support.
- Support for search-term highlighting.
- Better support for hits sorted by criteria other than score. An easy, efficient case is sorting results by the order in which documents were added to the index. Somewhat harder and less efficient is sorting results by an arbitrary field.
- Add some requested methods:
    String[] Document.getValues(String fieldName);
    String[] IndexReader.getIndexedFields();
    void Token.setPositionIncrement(int increment);
- Add a lastModified() method to Directory, FSDirectory and RAMDirectory, so the value can be cached in an IndexWriter/Searcher manager.
- Support for adding more than one term at the same position. N.B. I think the contributor from Finland has already implemented this; it required modifying some pieces of Lucene. (OG)
- The ability to retrieve the number of occurrences not only of a term but also of a phrase.
- Code for handling Finnish, submitted by a contributor from Finland.
- Dutch stemmer, analyzer, etc.
- French stemmer, analyzer, etc.
- Che Dong's CJKTokenizer for Chinese, Japanese, and Korean.
- Selecting a language-specific analyzer according to a locale. Currently, parts of the Lucene code must be rewritten in order to use a different analyzer; it would be useful to be able to select an analyzer without touching code.
- Adding an "-encoding" option and encoding-sensitive methods to the tools. The current tools need minor changes to work in a Japanese (or other non-ASCII) environment: adding an "-encoding" option and argument, using Reader/Writer classes instead of InputStream/OutputStream classes, etc.
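Two of the items above are related: the proposed Token.setPositionIncrement(int) method and support for adding more than one term at the same position. A minimal sketch of how position increments could allow that, using a simplified stand-in for the Token class rather than Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Lucene's Token class (the real one lives in
// org.apache.lucene.analysis); this only models position increments.
class Token {
    private final String text;
    private int positionIncrement = 1; // default: advance one position

    Token(String text) { this.text = text; }

    // The requested method: an increment of 0 stacks this token on the
    // previous token's position, which is how a synonym or alternate
    // spelling can share a position with the original term.
    void setPositionIncrement(int increment) { this.positionIncrement = increment; }

    int getPositionIncrement() { return positionIncrement; }
}

public class PositionDemo {
    // Turn a token stream's increments into absolute positions.
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.getPositionIncrement();
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("fast"));
        Token synonym = new Token("quick");
        synonym.setPositionIncrement(0); // occupies the same position as "fast"
        tokens.add(synonym);
        tokens.add(new Token("fox"));
        System.out.println(positions(tokens)); // prints [0, 0, 1]
    }
}
```

With "quick" at the same position as "fast", a phrase query for either "fast fox" or "quick fox" would match the same document.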
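The last item, on encoding-sensitive tools, comes down to wrapping byte streams in Reader/Writer classes constructed with an explicit charset name rather than relying on the platform default. A hedged sketch of that pattern (the method and flag names here are hypothetical; only the Reader-with-explicit-encoding idiom is the point):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class EncodingDemo {
    // Decode bytes using an explicitly named encoding instead of the
    // platform default; a "-encoding" command-line option would supply
    // the encoding name.
    static String decode(byte[] bytes, String encoding) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] sjis = "\u65e5\u672c\u8a9e".getBytes("Shift_JIS"); // "Japanese" in Japanese
        // Decoding with the matching encoding recovers the original text;
        // decoding with the platform default generally would not.
        System.out.println(decode(sjis, "Shift_JIS").length()); // prints 3
    }
}
```

The same idea applies on output: wrap the OutputStream in an OutputStreamWriter constructed with the requested encoding.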



Copyright © 1999-2002, Apache Software Foundation