From c9361a507defa9f96b0e1dc1b2535d07a35c6733 Mon Sep 17 00:00:00 2001
From: Steven Rowe
Date: Wed, 18 Jan 2012 14:36:58 +0000
Subject: [PATCH] LUCENE-3666: Update org.apache.lucene.analysis package summary

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1232909 13f79535-47bb-0310-9956-ffa450edef68
---
 .../org/apache/lucene/analysis/package.html | 656 +++++++++++-------
 1 file changed, 414 insertions(+), 242 deletions(-)

diff --git a/lucene/src/java/org/apache/lucene/analysis/package.html b/lucene/src/java/org/apache/lucene/analysis/package.html
index 7200f4f6417..9e573b35af4 100644
--- a/lucene/src/java/org/apache/lucene/analysis/package.html
+++ b/lucene/src/java/org/apache/lucene/analysis/package.html
@@ -23,7 +23,7 @@

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

Parsing? Tokenization? Analysis!

-Lucene, indexing and search library, accepts only plain text input.
+Lucene, an indexing and search library, accepts only plain text input.

Parsing

@@ -39,12 +39,23 @@ The way input text is broken into tokens heavily influences how people will then
For instance, sentences beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

-In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed.
-There are many post tokenization steps that can be done, including (but not limited to):
+ In some cases simply breaking the input text into tokens is not enough
+ – a deeper Analysis may be needed. Lucene includes both
+ pre- and post-tokenization analysis facilities.
+

+

+ Pre-tokenization analysis can include (but is not limited to) stripping
+ HTML markup, and transforming or removing text matching arbitrary patterns
+ or sets of fixed strings.
+

+

+ There are many post-tokenization steps that can be done, including
+ (but not limited to):
+