From 52eea1618e5e702eaa2c9508bc209b363eb8e510 Mon Sep 17 00:00:00 2001
From: Mark Robert Miller
Date: Mon, 24 Aug 2009 13:29:57 +0000
Subject: [PATCH] this whole bit is somewhat rough - some quick improvements here, but needs more

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@807207 13f79535-47bb-0310-9956-ffa450edef68
---
 src/java/org/apache/lucene/analysis/package.html | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/java/org/apache/lucene/analysis/package.html b/src/java/org/apache/lucene/analysis/package.html
index 1ce29adb848..b4356e7eef9 100644
--- a/src/java/org/apache/lucene/analysis/package.html
+++ b/src/java/org/apache/lucene/analysis/package.html
@@ -33,15 +33,14 @@ application using Lucene to use an appropriate Parser to convert the orig

Tokenization

-Plain text passed to Lucene for indexing goes through a process generally called tokenization – namely breaking of the
-input text into small indexing elements – tokens.
-The way input text is broken into tokens very
-much dictates further capabilities of search upon that text.
+Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
+of breaking input text into small indexing elements – tokens.
+The way input text is broken into tokens heavily influences how people will then be able to search for that text.
For instance, sentences beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).
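
As a rough illustration of the tokenization step described in the paragraph above: a minimal sketch assuming the Lucene 2.9 API (WhitespaceTokenizer plus the attribute-based TokenStream accessors); the TokenizationDemo class name and the sample text are invented for the example, not part of this patch.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical demo class, not part of Lucene or of this patch.
public class TokenizationDemo {
  public static void main(String[] args) throws Exception {
    // Tokenization: break the input text into small indexing elements (tokens).
    TokenStream ts = new WhitespaceTokenizer(
        new StringReader("Plain text passed to Lucene for indexing"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {        // advance to the next token
      System.out.println(term.term());   // one whitespace-delimited token per line
    }
    ts.end();
    ts.close();
  }
}

Run against the sample string, this sketch would simply print each whitespace-delimited word as its own token; real analyzers typically do more, as the following paragraph notes.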

-In some cases simply breaking the input text into tokens is not enough – a deeper Analysis is needed,
-providing for several functions, including (but not limited to):
+In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed.
+There are many post tokenization steps that can be done, including (but not limited to):

- Since Lucene 2.9 the TokenStream API was changed. Please see section "New TokenStream API" below for details.
+ Since Lucene 2.9 the TokenStream API has changed. Please see section "New TokenStream API" below for details.
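
As a rough illustration of the attribute-based consumption pattern that the "New TokenStream API" section describes: a sketch assuming Lucene 2.9 class names such as StandardAnalyzer, TermAttribute and OffsetAttribute; the field name "content" and the sample text are invented for the example, and StandardAnalyzer is used here only because its lowercasing and stop-word removal also happen to illustrate the post tokenization steps mentioned above.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical demo class, not part of Lucene or of this patch.
public class NewTokenStreamApiDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
    TokenStream ts = analyzer.tokenStream("content", new StringReader("The Quick Brown Fox"));

    // With the 2.9 API, token values are read through attributes registered on the
    // stream rather than from Token objects returned by the deprecated next() methods.
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

    while (ts.incrementToken()) {
      System.out.println(term.term() + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
    }
    ts.end();    // finalize offset state after the last token
    ts.close();
  }
}

For the sample text this would print the lowercased terms quick, brown and fox with their character offsets; "The" is dropped by StandardAnalyzer's stop filter.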

Hints, Tips and Traps