this whole bit is somewhat rough - some quick improvements here, but needs more

git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@807207 13f79535-47bb-0310-9956-ffa450edef68
Mark Robert Miller 2009-08-24 13:29:57 +00:00
parent ae88817d01
commit 52eea1618e
1 changed file with 6 additions and 7 deletions


@@ -33,15 +33,14 @@ application using Lucene to use an appropriate <i>Parser</i> to convert the orig
 <p>
 <h2>Tokenization</h2>
 <p>
-Plain text passed to Lucene for indexing goes through a process generally called tokenization &ndash; namely breaking of the
-input text into small indexing elements &ndash; tokens.
-The way input text is broken into tokens very
-much dictates further capabilities of search upon that text.
+Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
+of breaking input text into small indexing elements &ndash; tokens.
+The way input text is broken into tokens heavily influences how people will then be able to search for that text.
 For instance, sentences beginnings and endings can be identified to provide for more accurate phrase
 and proximity searches (though sentence identification is not provided by Lucene).
 <p>
-In some cases simply breaking the input text into tokens is not enough &ndash; a deeper <i>Analysis</i> is needed,
-providing for several functions, including (but not limited to):
+In some cases simply breaking the input text into tokens is not enough &ndash; a deeper <i>Analysis</i> may be needed.
+There are many post tokenization steps that can be done, including (but not limited to):
 <ul>
 <li><a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> &ndash;
 Replacing of words by their stems.
@@ -76,7 +75,7 @@ providing for several functions, including (but not limited to):
 for modifying tokenss that have been created by the Tokenizer. Common modifications performed by a
 TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters</li>
 </ul>
-<b>Since Lucene 2.9 the TokenStream API was changed. Please see section "New TokenStream API" below for details.</b>
+<b>Since Lucene 2.9 the TokenStream API has changed. Please see section "New TokenStream API" below for details.</b>
 </p>
 <h2>Hints, Tips and Traps</h2>
 <p>
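
For readers of this diff: the bolded note refers to the attribute-based TokenStream API introduced in Lucene 2.9, where a consumer registers attributes once and then calls incrementToken() to step through the tokens, rather than pulling Token objects. Below is a minimal, hypothetical sketch of the Tokenizer-plus-TokenFilter pipeline the text describes, consumed through that API. It is not part of this commit; the WhitespaceTokenizer/LowerCaseFilter/PorterStemFilter chain, the "content" field name, and the sample text are all illustrative choices.

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical example: a Tokenizer followed by two TokenFilters
// (down casing, then stemming), matching the pipeline described above.
public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Each TokenFilter wraps and modifies the stream produced beneath it.
    return new PorterStemFilter(new LowerCaseFilter(new WhitespaceTokenizer(reader)));
  }

  public static void main(String[] args) throws Exception {
    TokenStream stream = new StemmingAnalyzer().tokenStream("content",
        new StringReader("Tokenization breaks Text into Tokens"));
    // Post-2.9 consumption: register the attribute once; incrementToken()
    // then updates it in place for every token in the stream.
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(term.term());
    }
    stream.close();
  }
}

Running main prints each down-cased, stemmed term on its own line; swapping or adding filters in tokenStream() is how deletion, synonym injection, and the other post tokenization steps listed in the diff would be layered in.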