diff --git a/CHANGES.txt b/CHANGES.txt
index e56843d256b..ffba790b643 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -276,6 +276,8 @@ Documentation
  4. LUCENE-740: Added SNOWBALL-LICENSE.txt to the snowball package and a remark
     about the license to NOTICE.TXT. (Steven Parkes via Michael Busch)

+ 5. LUCENE-925: Added analysis package javadocs. (Grant Ingersoll and Doron Cohen)
+
 Build

  1. LUCENE-802: Added LICENSE.TXT and NOTICE.TXT to Lucene jars.

diff --git a/src/java/org/apache/lucene/analysis/package.html b/src/java/org/apache/lucene/analysis/package.html
index 6b8ebf93b31..93e36764e62 100644
--- a/src/java/org/apache/lucene/analysis/package.html
+++ b/src/java/org/apache/lucene/analysis/package.html
@@ -5,6 +5,90 @@
-API and code to convert text into indexable tokens.

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.


Parsing? Tokenization? Analysis!


Lucene, an indexing and search library, accepts only plain text input.


Parsing


Applications that build their search capabilities upon Lucene may support documents in various formats - HTML, XML, PDF, Word - just to name a few. Lucene does not care about the parsing of these and other document formats; it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.


Tokenization


Plain text passed to Lucene for indexing goes through a process generally called tokenization - namely, breaking the input text into small indexing elements - tokens. The way the input text is broken into tokens largely dictates the search capabilities of the index into which that text is added. Sentence beginnings and endings, for example, can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).
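
As a small illustration (a minimal sketch against the Lucene 2.x-era API; the sample text is arbitrary), the {@link org.apache.lucene.analysis.WhitespaceTokenizer} in this package simply breaks text into tokens at whitespace:

<PRE>
// Tokenizing "The quick brown fox" on whitespace
// yields the tokens: [The] [quick] [brown] [fox]
Tokenizer tokenizer = new WhitespaceTokenizer(
    new StringReader("The quick brown fox"));
</PRE>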

In some cases simply breaking the input text into tokens is not enough - a deeper analysis may be needed, providing for several functions, including (but not limited to):

  * Stemming - replacing words with their stems, so that, for example, a search for "train" can also match "trains" and "training" (see the sketch below).
  * Stop-word filtering - dropping extremely common words ("a", "an", "the") that contribute little to most searches.
  * Text normalization - stripping accents and other character markings so that variant spellings of a word match.
  * Synonym expansion - adding synonyms of a token so that a search on a synonym also finds the original text.

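The first two of these functions are directly available in this package. For instance, here is a minimal sketch (the sample text is arbitrary) of layering stemming onto tokenization with the {@link org.apache.lucene.analysis.PorterStemFilter}:

<PRE>
// "trains", "training" and "train" all reduce to the stem "train",
// so a search for any one form can match the others.
TokenStream stemmed = new PorterStemFilter(
    new LowerCaseTokenizer(new StringReader("trains training train")));
</PRE>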


Core Analysis

The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are three main classes in the package from which all analysis processes are derived. These are:

  * {@link org.apache.lucene.analysis.Analyzer} - an Analyzer is responsible for building a {@link org.apache.lucene.analysis.TokenStream} that can be consumed by the indexing and searching processes.
  * {@link org.apache.lucene.analysis.Tokenizer} - a Tokenizer is a TokenStream that breaks up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
  * {@link org.apache.lucene.analysis.TokenFilter} - a TokenFilter is also a TokenStream, and is responsible for modifying tokens that have been created by the Tokenizer.

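A minimal sketch of the three classes in action (assuming the Lucene 2.x-era TokenStream.next() API; the field name "body" and the sample text are arbitrary, and the caller must handle the IOException that next() declares):

<PRE>
Analyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.tokenStream("body",
    new StringReader("The quick brown fox jumped over the lazy dogs"));
// StandardAnalyzer chains a Tokenizer and several TokenFilters, so the
// printed tokens come out lower-cased, with stop words removed.
for (Token token = stream.next(); token != null; token = stream.next()) {
  System.out.println(token.termText());
}
</PRE>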


Hints, Tips and Traps

The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} is sometimes confusing. To ease this confusion, some clarifications:

  * The Analyzer is responsible for the entire task of creating tokens out of the input text, while the Tokenizer is only responsible for breaking the input text into tokens. Very likely, tokens created by the Tokenizer will be modified or even omitted by the Analyzer (via one or more TokenFilters) before being returned, as sketched below.
  * Tokenizer is a TokenStream, but Analyzer is not.
  * Analyzer is "field aware", but Tokenizer is not.

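A sketch of that typical composition (one common way to write an Analyzer's tokenStream method, not the only one); this mirrors what StandardAnalyzer does:

<PRE>
public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream result = new StandardTokenizer(reader);  // the Tokenizer creates raw tokens
  result = new StandardFilter(result);                 // TokenFilters then modify...
  result = new LowerCaseFilter(result);
  result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);  // ...or omit tokens
  return result;
}
</PRE>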

Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

  1. {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} -- Most Analyzers perform the same operation on all {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different {@link org.apache.lucene.document.Field}s (see the sketch after this list).
  2. The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
  3. The contrib/snowball library located at the root of the Lucene distribution has Analyzer and TokenFilter implementations for a variety of Snowball stemmers. See http://snowball.tartarus.org for more information.
  4. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
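
As referenced in the first item, here is a minimal sketch of PerFieldAnalyzerWrapper (the field name "tags" and the choice of {@link org.apache.lucene.analysis.KeywordAnalyzer} are illustrative assumptions, not recommendations):

<PRE>
// Use StandardAnalyzer for every field except "tags",
// which is kept as a single untokenized token.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("tags", new KeywordAnalyzer());
// The wrapper is then handed to an IndexWriter like any other Analyzer.
</PRE>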

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze, the slower the indexing (in most cases). Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a {@link org.apache.lucene.analysis.StopFilter}, as sketched below.

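A minimal sketch of that bare-bones combination (the anonymous-subclass style is just one way to wire it up):

<PRE>
Analyzer analyzer = new Analyzer() {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Split on whitespace only, then drop common English stop words.
    return new StopFilter(new WhitespaceTokenizer(reader),
                          StopAnalyzer.ENGLISH_STOP_WORDS);
  }
};
</PRE>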

Implementing your own Analyzer

Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at the source code of any one of the many samples located in this package.
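
To give a flavor of the TokenFilter half of that advice, here is a hypothetical sketch of a custom filter that omits tokens shorter than a given length, written against the Lucene 2.x-era Token/TokenStream API (the class name MinLengthFilter is invented for illustration; it is not part of Lucene):

<PRE>
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MinLengthFilter extends TokenFilter {
  private final int minLength;

  public MinLengthFilter(TokenStream input, int minLength) {
    super(input);  // stores the wrapped stream in the protected field "input"
    this.minLength = minLength;
  }

  public Token next() throws IOException {
    // Pull tokens from the wrapped stream, skipping any that are too short.
    for (Token token = input.next(); token != null; token = input.next()) {
      if (token.termText().length() >= minLength) {
        return token;
      }
    }
    return null;  // no more tokens
  }
}
</PRE>

Such a filter chains like any other, e.g. new MinLengthFilter(new WhitespaceTokenizer(reader), 3).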