mirror of https://github.com/apache/lucene.git
this whole bit is somewhat rough - some quick improvements here, but needs more
git-svn-id: https://svn.apache.org/repos/asf/lucene/java/trunk@807207 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent ae88817d01
commit 52eea1618e
@@ -33,15 +33,14 @@ application using Lucene to use an appropriate <i>Parser</i> to convert the orig
<p>
<h2>Tokenization</h2>
<p>
Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
of breaking input text into small indexing elements – tokens.
The way input text is broken into tokens heavily influences how people will then be able to search for that text.
For instance, sentence beginnings and endings can be identified to provide for more accurate phrase
and proximity searches (though sentence identification is not provided by Lucene).
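The idea of breaking text into tokens can be sketched in plain Java. This is an illustrative whitespace-and-punctuation tokenizer, not Lucene's actual implementation (Lucene's tokenizers also track character offsets, token types, and so on):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a naive tokenizer that breaks plain text into
// small indexing elements (tokens), roughly what a Lucene Tokenizer does.
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of non-letter, non-digit characters.
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Plain text, passed to Lucene!"));
        // [Plain, text, passed, to, Lucene]
    }
}
```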
<p>
In some cases simply breaking the input text into tokens is not enough – a deeper <i>Analysis</i> may be needed.
There are many post tokenization steps that can be done, including (but not limited to):
<ul>
<li><a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> –
Replacing words by their stems.
@@ -76,7 +75,7 @@ providing for several functions, including (but not limited to):
for modifying tokens that have been created by the Tokenizer. Common modifications performed by a
TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.</li>
</ul>
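The Tokenizer-then-TokenFilter composition described above can be sketched in plain Java. The class and method names below are hypothetical, chosen for illustration; Lucene's real Tokenizer, TokenFilter, and Analyzer classes operate on token streams rather than lists, but the chaining idea is the same:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a Tokenizer -> TokenFilter chain: tokenize the raw
// text, then pass the tokens through filters (down casing, deletion).
public class AnalysisChainSketch {
    // "Tokenizer" step: break raw text into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "TokenFilter" example 1: down casing.
    static List<String> lowerCaseFilter(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // "TokenFilter" example 2: deletion of stop words.
    static List<String> stopFilter(List<String> in, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stopWords.contains(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "an"));
        // Filters are applied in order, each consuming the previous output.
        List<String> result = stopFilter(lowerCaseFilter(tokenize("The Quick Fox")), stop);
        System.out.println(result); // [quick, fox]
    }
}
```

Note that filter order matters: here the stop filter sees already down-cased tokens, so "The" is removed; reversing the two filters would let it through.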
<b>Since Lucene 2.9 the TokenStream API has changed. Please see section "New TokenStream API" below for details.</b>
</p>
<h2>Hints, Tips and Traps</h2>
<p>