diff --git a/CHANGES.txt b/CHANGES.txt index e56843d256b..ffba790b643 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -276,6 +276,8 @@ Documentation 4. LUCENE-740: Added SNOWBALL-LICENSE.txt to the snowball package and a remark about the license to NOTICE.TXT. (Steven Parkes via Michael Busch) + 5. LUCENE-925: Added analysis package javadocs. (Grant Ingersoll and Doron Cohen) + Build 1. LUCENE-802: Added LICENSE.TXT and NOTICE.TXT to Lucene jars. diff --git a/src/java/org/apache/lucene/analysis/package.html b/src/java/org/apache/lucene/analysis/package.html index 6b8ebf93b31..93e36764e62 100644 --- a/src/java/org/apache/lucene/analysis/package.html +++ b/src/java/org/apache/lucene/analysis/package.html @@ -5,6 +5,90 @@
-API and code to convert text into indexable tokens. +API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.
++Lucene, an indexing and search library, accepts only plain text input. +
+
+Applications that build their search capabilities upon Lucene may support documents in various formats - HTML, XML, PDF, Word - just to name a few. +Lucene does not care about the parsing of these and other document formats; it is the responsibility of the +application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene. +
+
+Plain text passed to Lucene for indexing goes through a process generally called tokenization - namely, breaking the +input text into small indexing elements - tokens. The way the input text is broken into tokens very +much dictates the further search capabilities of the index into which that text was added. Sentence +beginnings and endings can be identified to provide for more accurate phrase and proximity searches +(though sentence identification is not provided by Lucene). +
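The tokenization step described above can be sketched in plain Java. This is a toy illustration only, not the actual Lucene API; the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of tokenization (not real Lucene classes): break plain text
// into small indexing elements. How this split is done determines what
// phrase and proximity searches can later match against the index.
public class SimpleTokenization {

    // Split on any non-letter/non-digit character, emitting each run of
    // letters and digits as one token and discarding punctuation.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

A different choice of split rule (for example, keeping hyphenated words whole) would produce a different token stream and therefore different search behavior.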
+In some cases simply breaking the input text into tokens is not enough - deeper analysis may be needed, +providing for several functions, including (but not limited to): +
+
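Two common examples of such deeper analysis are lowercasing and stop-word removal. The following is a minimal sketch in plain Java, assuming an already-tokenized stream; it is an illustration, not Lucene's own filter classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy sketch (not real Lucene classes) of two deeper-analysis steps:
// case normalization and stop-word removal. Both change what the index
// will match: "Quick" and "quick" become the same term, and very common
// words are dropped entirely.
public class SimpleAnalysis {

    // A tiny illustrative stop list; real stop lists are much larger.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    public static List<String> analyze(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT); // normalize case
            if (!STOP_WORDS.contains(lower)) {             // drop common words
                out.add(lower);
            }
        }
        return out;
    }
}
```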
+ The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There + are three main classes in the package from which all analysis processes are derived. These are: +
+ The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} + is sometimes confusing. To ease this confusion, some clarifications: +
Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link + org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more + than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning: +
Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze, the slower the indexing (in most cases). + Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a + {@link org.apache.lucene.analysis.StopFilter}.
+Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, +or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile +to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. +If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at +the source code of any one of the many samples located in this package.
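The wrapping pattern described above - a tokenizer producing tokens, filters transforming the stream, and an analyzer wiring them into one reusable chain - can be sketched in plain Java. The interfaces and names below are invented for the example and deliberately simplified; they are not Lucene's actual Analyzer/Tokenizer/TokenFilter classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch (not real Lucene classes) of composing an analyzer
// from a tokenizer and a chain of token filters.
public class CustomAnalyzerSketch {

    interface Tokenizer { List<String> tokenize(String text); }
    interface TokenFilter { List<String> filter(List<String> tokens); }

    // A whitespace tokenizer: splits the input on runs of whitespace.
    static final Tokenizer WHITESPACE = text ->
        new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));

    // A lowercasing filter: normalizes the case of every token.
    static final TokenFilter LOWERCASE = tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    // The "analyzer": wraps one tokenizer and a fixed chain of filters,
    // so the same analysis pipeline can be reused for every document.
    static class Analyzer {
        private final Tokenizer tokenizer;
        private final List<TokenFilter> filters;

        Analyzer(Tokenizer tokenizer, TokenFilter... filters) {
            this.tokenizer = tokenizer;
            this.filters = Arrays.asList(filters);
        }

        List<String> analyze(String text) {
            List<String> tokens = tokenizer.tokenize(text);
            for (TokenFilter f : filters) {
                tokens = f.filter(tokens);
            }
            return tokens;
        }
    }
}
```

Defining a new analysis chain then amounts to instantiating the wrapper with a different tokenizer or filter list, which mirrors how a new Analyzer typically wraps an existing Tokenizer and TokenFilters.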