mirror of https://github.com/apache/lucene.git
LUCENE-5384: Add some analysis api tips to the package.html (closes #12)
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1555907 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
a92ca0f717
commit
e45755d9b1
|
@ -141,6 +141,12 @@ Changes in Runtime Behavior
|
|||
AlreadyClosedException if the refCount in incremented but
|
||||
is less that 1. (Simon Willnauer)
|
||||
|
||||
Documentation
|
||||
|
||||
* LUCENE-5384: Add some tips for making tokenfilters and tokenizers
|
||||
to the analysis package overview.
|
||||
(Benson Margulies via Robert Muir - pull request #12)
|
||||
|
||||
======================= Lucene 4.6.0 =======================
|
||||
|
||||
New Features
|
||||
|
|
|
@ -386,7 +386,15 @@ and proximity searches (though sentence identification is not provided by Lucene
|
|||
<li>The first position increment must be > 0.</li>
|
||||
<li>Positions must not go backward.</li>
|
||||
<li>Tokens that have the same start position must have the same start offset.</li>
|
||||
<li>Tokens that have the same end position (taking into account the position length) must have the same end offset.</li>
|
||||
<li>Tokens that have the same end position (taking into account the
|
||||
position length) must have the same end offset.</li>
|
||||
<li>Tokenizers must call {@link
|
||||
org.apache.lucene.util.AttributeSource#clearAttributes()} in
|
||||
incrementToken().</li>
|
||||
<li>Tokenizers must override {@link
|
||||
org.apache.lucene.analysis.TokenStream#end()}, and pass the final
|
||||
offset (the total number of input characters processed) to both
|
||||
parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.</li>
|
||||
</ul>
|
||||
<p>
|
||||
Although these rules might seem easy to follow, problems can quickly happen when chaining
|
||||
|
@ -395,7 +403,8 @@ and proximity searches (though sentence identification is not provided by Lucene
|
|||
</p>
|
||||
<ul>
|
||||
<li>Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.</li>
|
||||
<li>Token filters should not insert positions. If a filter needs to add tokens, then they shoud all have a position increment of 0.</li>
|
||||
<li>Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.</li>
|
||||
<li>When they add tokens, token filters should call {@link org.apache.lucene.util.AttributeSource#clearAttributes()} first.</li>
|
||||
<li>When they remove tokens, token filters should increment the position increment of the following token.</li>
|
||||
<li>Token filters should preserve position lengths.</li>
|
||||
</ul>
|
||||
|
@ -467,6 +476,14 @@ and proximity searches (though sentence identification is not provided by Lucene
|
|||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
<h3>Testing Your Analysis Component</h3>
|
||||
<p>
|
||||
The lucene-test-framework component defines
|
||||
<a href="{@docRoot}/../test-framework/org/apache/lucene/analysis/BaseTokenStreamTestCase.html">BaseTokenStreamTestCase</a>. By extending
|
||||
this class, you can create JUnit tests that validate that your
|
||||
Analyzer and/or analysis components correctly implement the
|
||||
protocol. The checkRandomData methods of that class are particularly effective in flushing out errors.
|
||||
</p>
|
||||
<h3>Using the TokenStream API</h3>
|
||||
There are a few important things to know in order to use the new API efficiently which are summarized here. You may want
|
||||
to walk through the example below first and come back to this section afterwards.
|
||||
|
|
Loading…
Reference in New Issue