diff --git a/lucene/src/java/org/apache/lucene/analysis/package.html b/lucene/src/java/org/apache/lucene/analysis/package.html
index 7200f4f6417..9e573b35af4 100644
--- a/lucene/src/java/org/apache/lucene/analysis/package.html
+++ b/lucene/src/java/org/apache/lucene/analysis/package.html
@@ -23,7 +23,7 @@

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

Parsing? Tokenization? Analysis!

-Lucene, indexing and search library, accepts only plain text input.
+Lucene, an indexing and search library, accepts only plain text input.

Parsing

@@ -39,12 +39,23 @@ The way input text is broken into tokens heavily influences how people will then
 For instance, sentences beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

-In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed.
-There are many post tokenization steps that can be done, including (but not limited to):
+ In some cases simply breaking the input text into tokens is not enough
+ – a deeper Analysis may be needed. Lucene includes both
+ pre- and post-tokenization analysis facilities.
+

+

+ Pre-tokenization analysis can include (but is not limited to) stripping
+ HTML markup, and transforming or removing text matching arbitrary patterns
+ or sets of fixed strings.
+

+

+ There are many post-tokenization steps that can be done, including
+ (but not limited to):
+