Mirror of https://github.com/apache/lucene.git

LUCENE-5384: Add some analysis api tips to the package.html (closes #12)

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1555907 13f79535-47bb-0310-9956-ffa450edef68

parent a92ca0f717
commit e45755d9b1
@@ -141,6 +141,12 @@ Changes in Runtime Behavior
   AlreadyClosedException if the refCount in incremented but
   is less that 1. (Simon Willnauer)
 
+Documentation
+
+* LUCENE-5384: Add some tips for making tokenfilters and tokenizers
+  to the analysis package overview.
+  (Benson Margulies via Robert Muir - pull request #12)
+
 ======================= Lucene 4.6.0 =======================
 
 New Features
@@ -386,7 +386,15 @@ and proximity searches (though sentence identification is not provided by Lucene
 <li>The first position increment must be > 0.</li>
 <li>Positions must not go backward.</li>
 <li>Tokens that have the same start position must have the same start offset.</li>
-<li>Tokens that have the same end position (taking into account the position length) must have the same end offset.</li>
+<li>Tokens that have the same end position (taking into account the
+position length) must have the same end offset.</li>
+<li>Tokenizers must call {@link
+org.apache.lucene.util.AttributeSource#clearAttributes()} in
+incrementToken().</li>
+<li>Tokenizers must override {@link
+org.apache.lucene.analysis.TokenStream#end()}, and pass the final
+offset (the total number of input characters processed) to both
+parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.</li>
 </ul>
 <p>
 Although these rules might seem easy to follow, problems can quickly happen when chaining
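The invariants listed in this hunk can be modeled without Lucene at all. The sketch below is a hypothetical, dependency-free checker — `Token` and `validate` are inventions for illustration, not Lucene's API — that encodes each rule from the list over a sequence of (increment, length, offsets) tuples:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenStreamInvariants {
    /** One emitted token; field names are illustrative, not Lucene attribute names. */
    record Token(int posIncrement, int posLength, int startOffset, int endOffset) {}

    /** Returns true if the token sequence obeys the rules listed above. */
    static boolean validate(List<Token> tokens) {
        int position = -1;        // absolute position of the current token
        int startAtPosition = -1; // start offset seen at the current position
        Map<Integer, Integer> endOffsetByEndPosition = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            // Rule: the first position increment must be > 0.
            if (i == 0 && t.posIncrement() <= 0) return false;
            // Rule: positions must not go backward.
            if (t.posIncrement() < 0) return false;
            if (t.posIncrement() > 0) {
                position += t.posIncrement();
                startAtPosition = t.startOffset();
            } else if (t.startOffset() != startAtPosition) {
                // Rule: tokens at the same start position must share a start offset.
                return false;
            }
            // Rule: tokens with the same end position (start position plus
            // position length) must share an end offset.
            int endPosition = position + t.posLength();
            Integer seen = endOffsetByEndPosition.putIfAbsent(endPosition, t.endOffset());
            if (seen != null && seen != t.endOffset()) return false;
        }
        return true;
    }
}
```

A stacked synonym, for instance, would be a second token with increment 0 that repeats the first token's offsets; the checker accepts that but rejects a stream whose first token has increment 0.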
@@ -395,7 +403,8 @@ and proximity searches (though sentence identification is not provided by Lucene
 </p>
 <ul>
 <li>Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.</li>
-<li>Token filters should not insert positions. If a filter needs to add tokens, then they shoud all have a position increment of 0.</li>
+<li>Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.</li>
+<li>When they add tokens, token filters should call {@link org.apache.lucene.util.AttributeSource#clearAttributes()} first.</li>
 <li>When they remove tokens, token filters should increment the position increment of the following token.</li>
 <li>Token filters should preserve position lengths.</li>
 </ul>
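The removed-token rule in this hunk — fold the removed token's increment into the next surviving token — can be sketched with a toy stopword filter over (term, increment) pairs. Everything here is an invented illustration, not Lucene's TokenFilter API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopwordIncrementModel {
    /** A minimal stand-in for a token; not a Lucene type. */
    record Token(String term, int posIncrement) {}

    /**
     * Drops stopwords, adding each removed token's position increment to the
     * next surviving token so that kept tokens stay at their original positions.
     */
    static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        int pending = 0; // increments accumulated from removed tokens
        for (Token t : in) {
            if (stopwords.contains(t.term())) {
                pending += t.posIncrement();
            } else {
                out.add(new Token(t.term(), t.posIncrement() + pending));
                pending = 0;
            }
        }
        return out;
    }
}
```

Removing "the" from "the quick fox" leaves "quick" with increment 2, so a phrase query still sees the gap the stopword occupied.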
@@ -467,6 +476,14 @@ and proximity searches (though sentence identification is not provided by Lucene
 </td>
 </tr>
 </table>
+
+<h3>Testing Your Analysis Component</h3>
+<p>
+The lucene-test-framework component defines
+<a href="{@docRoot}/../test-framework/org/apache/lucene/analysis/BaseTokenStreamTestCase.html">BaseTokenStreamTestCase</a>. By extending
+this class, you can create JUnit tests that validate that your
+Analyzer and/or analysis components correctly implement the
+protocol. The checkRandomData methods of that class are particularly effective in flushing out errors.
+</p>
 <h3>Using the TokenStream API</h3>
 There are a few important things to know in order to use the new API efficiently which are summarized here. You may want
 to walk through the example below first and come back to this section afterwards.
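BaseTokenStreamTestCase itself requires the lucene-test-framework jar, so as a dependency-free miniature of the checkRandomData idea the hunk above describes, here is a randomized check of a toy lowercasing step. All names are invented for the sketch; a real test would extend BaseTokenStreamTestCase instead:

```java
import java.util.Locale;
import java.util.Random;

class MiniRandomCheck {
    /** A minimal stand-in for a token; not a Lucene type. */
    record Token(String term, int startOffset, int endOffset) {}

    /** The toy "filter" under test: lowercases the term, touching nothing else. */
    static Token lowercase(Token t) {
        return new Token(t.term().toLowerCase(Locale.ROOT), t.startOffset(), t.endOffset());
    }

    /**
     * Feeds `iters` random terms through the filter and verifies the
     * properties a well-behaved token filter must keep: offsets are
     * unchanged and the term differs only in case.
     */
    static boolean checkRandom(long seed, int iters) {
        Random random = new Random(seed); // fixed seed keeps failures reproducible
        for (int i = 0; i < iters; i++) {
            int len = 1 + random.nextInt(10);
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < len; j++) sb.append((char) ('A' + random.nextInt(26)));
            Token in = new Token(sb.toString(), 0, len);
            Token out = lowercase(in);
            if (out.startOffset() != in.startOffset()) return false;
            if (out.endOffset() != in.endOffset()) return false;
            if (!out.term().equalsIgnoreCase(in.term())) return false;
        }
        return true;
    }
}
```

The value of this style of test is volume: hundreds of random inputs flush out edge cases (empty-ish terms, repeated characters) that a handful of hand-picked examples miss.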