From e45755d9b160c9b1cfbcf470a7cbe80c650b95c7 Mon Sep 17 00:00:00 2001
From: Robert Muir
Date: Mon, 6 Jan 2014 16:44:14 +0000
Subject: [PATCH] LUCENE-5384: Add some analysis api tips to the package.html
 (closes #12)

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1555907 13f79535-47bb-0310-9956-ffa450edef68
---
 lucene/CHANGES.txt                          |  6 ++++++
 .../org/apache/lucene/analysis/package.html | 21 +++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/lucene/CHANGES.txt b/lucene/CHANGES.txt
index 2e6113ac79c..c56cd31636c 100644
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -141,6 +141,12 @@ Changes in Runtime Behavior
   AlreadyClosedException if the refCount in incremented but is less that 1.
   (Simon Willnauer)
 
+Documentation
+
+* LUCENE-5384: Add some tips for making tokenfilters and tokenizers
+  to the analysis package overview.
+  (Benson Margulies via Robert Muir - pull request #12)
+
 ======================= Lucene 4.6.0 =======================
 
 New Features

diff --git a/lucene/core/src/java/org/apache/lucene/analysis/package.html b/lucene/core/src/java/org/apache/lucene/analysis/package.html
index c997eb6aef6..5d5b65aa347 100644
--- a/lucene/core/src/java/org/apache/lucene/analysis/package.html
+++ b/lucene/core/src/java/org/apache/lucene/analysis/package.html
@@ -386,7 +395,15 @@ and proximity searches (though sentence identification is not provided by Lucene
 <li>The first position increment must be &gt; 0.</li>
 <li>Positions must not go backward.</li>
 <li>Tokens that have the same start position must have the same start offset.</li>
-<li>Tokens that have the same end position (taking into account the position length) must have the same end offset.</li>
+<li>Tokens that have the same end position (taking into account the
+  position length) must have the same end offset.</li>
+<li>Tokenizers must call {@link
+  org.apache.lucene.util.AttributeSource#clearAttributes()} in
+  incrementToken().</li>
+<li>Tokenizers must override {@link
+  org.apache.lucene.analysis.TokenStream#end()}, and pass the final
+  offset (the total number of input characters processed) to both
+  parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.</li>
 </ul>
 Although these rules might seem easy to follow, problems can quickly happen when chaining
@@ -395,7 +403,8 @@ and proximity searches (though sentence identification is not provided by Lucene
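The token-graph rules in the list above can be made concrete with a small validator. The sketch below is illustrative only — the `Token` record and `check` method are hypothetical names, not Lucene API — but it encodes the four rules exactly as stated: first increment positive, no backward positions, equal start offsets at equal start positions, and equal end offsets at equal end positions (where the end position is the token's position plus its position length).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenGraphRules {

  /** Minimal stand-in for one token of a token stream; not Lucene's API. */
  public record Token(int posInc, int posLen, int startOffset, int endOffset) {}

  /** Returns null if the tokens obey the rules, else a description of the first violation. */
  public static String check(List<Token> tokens) {
    int pos = -1; // "position before the first token"
    Map<Integer, Integer> startOffsetAt = new HashMap<>(); // start position -> start offset
    Map<Integer, Integer> endOffsetAt = new HashMap<>();   // end position   -> end offset
    boolean first = true;
    for (Token t : tokens) {
      if (first && t.posInc() <= 0) return "first position increment must be > 0";
      if (t.posInc() < 0) return "positions must not go backward";
      first = false;
      pos += t.posInc();
      Integer seenStart = startOffsetAt.putIfAbsent(pos, t.startOffset());
      if (seenStart != null && seenStart != t.startOffset())
        return "tokens at position " + pos + " have different start offsets";
      int endPos = pos + t.posLen(); // end position accounts for position length
      Integer seenEnd = endOffsetAt.putIfAbsent(endPos, t.endOffset());
      if (seenEnd != null && seenEnd != t.endOffset())
        return "tokens ending at position " + endPos + " have different end offsets";
    }
    return null; // all rules hold
  }

  public static void main(String[] args) {
    // "new york" plus an injected synonym "nyc" spanning both positions (posLen = 2):
    List<Token> ok = List.of(
        new Token(1, 1, 0, 3),   // "new"  at position 0
        new Token(0, 2, 0, 8),   // "nyc"  also at position 0, spanning 2 positions
        new Token(1, 1, 4, 8));  // "york" at position 1, same end offset as "nyc"
    System.out.println(check(ok)); // prints: null
    System.out.println(check(List.of(new Token(0, 1, 0, 3)))); // prints: first position increment must be > 0
  }
}
```

The synonym example in `main` shows why the end-position rule mentions position length: "nyc" spans two positions, so it ends where "york" ends and must share its end offset.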

@@ -467,6 +476,14 @@ and proximity searches (though sentence identification is not provided by Lucene
 
+
+<h3>Testing Your Analysis Component</h3>
+
+The lucene-test-framework component defines
+BaseTokenStreamTestCase. By extending this class, you can create
+JUnit tests that validate that your Analyzer and/or analysis
+components correctly implement the protocol. The checkRandomData
+methods of that class are particularly effective in flushing out errors.
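BaseTokenStreamTestCase and checkRandomData live in the Lucene test framework, so a dependency-free illustration has to improvise. The sketch below (all names hypothetical, not Lucene's API) shows the idea behind random-data testing: hammer a toy component with many random inputs and assert its offset invariants on every iteration, which tends to surface edge cases hand-written inputs miss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomDataCheck {

  /** Toy component under test: splits on spaces and reports [start, end) offsets. */
  public static int[][] tokenizeOffsets(String s) {
    List<int[]> out = new ArrayList<>();
    int i = 0;
    while (i < s.length()) {
      while (i < s.length() && s.charAt(i) == ' ') i++; // skip delimiters
      if (i >= s.length()) break;
      int start = i;
      while (i < s.length() && s.charAt(i) != ' ') i++; // consume one token
      out.add(new int[] {start, i});
    }
    return out.toArray(new int[0][]);
  }

  /** Run `iterations` random inputs through the component, validating offset invariants. */
  public static void checkRandom(long seed, int iterations) {
    Random random = new Random(seed);
    for (int iter = 0; iter < iterations; iter++) {
      StringBuilder sb = new StringBuilder();
      int len = random.nextInt(40);
      for (int i = 0; i < len; i++) {
        sb.append(random.nextBoolean() ? ' ' : (char) ('a' + random.nextInt(26)));
      }
      String input = sb.toString();
      int prevEnd = 0;
      for (int[] off : tokenizeOffsets(input)) {
        // offsets must be in bounds, non-empty, and must not go backward
        if (off[0] < prevEnd || off[1] <= off[0] || off[1] > input.length()) {
          throw new AssertionError("bad offsets on input: \"" + input + "\"");
        }
        // the reported offsets must point back at the token's actual text
        if (input.substring(off[0], off[1]).indexOf(' ') >= 0) {
          throw new AssertionError("token spans a delimiter: \"" + input + "\"");
        }
        prevEnd = off[1];
      }
    }
  }

  public static void main(String[] args) {
    checkRandom(42L, 1000); // throws AssertionError if any invariant is violated
    System.out.println("ok");
  }
}
```

In a real project the equivalent of `checkRandom` is provided for you: extend BaseTokenStreamTestCase and call its checkRandomData methods against your Analyzer.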

 <h3>Using the TokenStream API</h3>
 There are a few important things to know in order to use the new API
 efficiently which are summarized here. You may want to walk through the
 example below first and come back to this section afterwards.
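The per-token contract the patch documents — clear stale per-token state at the top of incrementToken(), and have end() report the final offset (total characters read) in both offset slots — can be sketched without depending on Lucene itself. The class below is a hypothetical plain-Java analog, not Lucene's Tokenizer; its public fields stand in for Lucene's attributes.

```java
public class WhitespaceTokenizerSketch {
  private final String input;
  private int pos = 0;

  // per-token state: the analog of Lucene's attributes
  public String term;
  public int startOffset = -1, endOffset = -1;

  public WhitespaceTokenizerSketch(String input) { this.input = input; }

  /** Advance to the next token; returns false when the stream is exhausted. */
  public boolean incrementToken() {
    // the analog of clearAttributes(): wipe stale per-token state first
    term = null;
    startOffset = endOffset = -1;
    while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
    if (pos >= input.length()) return false;
    int start = pos;
    while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) pos++;
    term = input.substring(start, pos);
    startOffset = start;
    endOffset = pos;
    return true;
  }

  /** The analog of end(): both offsets become the total number of characters read. */
  public void end() {
    startOffset = endOffset = input.length();
  }

  public static void main(String[] args) {
    WhitespaceTokenizerSketch t = new WhitespaceTokenizerSketch("please divide this");
    while (t.incrementToken()) {
      System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
    }
    t.end();
    System.out.println("final offset: " + t.endOffset); // prints: final offset: 18
  }
}
```

In real Lucene code the same shape appears as a Tokenizer subclass that calls clearAttributes() first in incrementToken() and passes the final offset to both parameters of OffsetAttribute#setOffset in end(), as the rules above require.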