LUCENE-2074: Refactor StandardTokenizer to have separate JFlex files per matchVersion. Also reset zzBuffer to initial size on reset(Reader) and clean up reset methods.

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@932163 13f79535-47bb-0310-9956-ffa450edef68
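
The buffer change described above can be illustrated with a minimal sketch. It is not the generated scanner: the class name `BufferResetSketch`, the tiny `DEFAULT_SIZE`, and the `readAll`/`capacity` helpers are hypothetical and only mirror the idea that `zzBuffer` doubles while a long input is read and is reallocated at its initial size on `reset(Reader)`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative stand-in for the scanner's lookahead buffer handling.
class BufferResetSketch {
    // Tiny stand-in for ZZ_BUFFERSIZE (16384 in the generated scanner).
    static final int DEFAULT_SIZE = 16;

    private char[] buffer = new char[DEFAULT_SIZE];
    private Reader reader;

    BufferResetSketch(Reader reader) {
        this.reader = reader;
    }

    /** Reads the whole input, doubling the buffer whenever it is full. */
    int readAll() {
        try {
            int end = 0;
            while (true) {
                if (end == buffer.length) {
                    char[] bigger = new char[buffer.length * 2];
                    System.arraycopy(buffer, 0, bigger, 0, end);
                    buffer = bigger;
                }
                int n = reader.read(buffer, end, buffer.length - end);
                if (n < 0) {
                    return end;
                }
                end += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Switches to a new Reader and drops a buffer that has grown. */
    void reset(Reader r) {
        if (buffer.length > DEFAULT_SIZE) {
            buffer = new char[DEFAULT_SIZE];
        }
        reader = r;
    }

    int capacity() {
        return buffer.length;
    }

    public static void main(String[] args) {
        BufferResetSketch s = new BufferResetSketch(new StringReader(new String(new char[100])));
        s.readAll();
        System.out.println("capacity after long input: " + s.capacity());
        s.reset(new StringReader("short"));
        System.out.println("capacity after reset: " + s.capacity());
    }
}
```

Without the size check in `reset`, a single very large document would leave every reused tokenizer holding an oversized buffer for the rest of its lifetime.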
Uwe Schindler 2010-04-08 22:56:23 +00:00
parent 81c21e163b
commit be7da179ff
12 changed files with 1119 additions and 139 deletions

@ -237,6 +237,9 @@ Bug fixes
which caused corruption for sep codec. Also fixed several tests to
test all 4 core codecs. (Renaud Delbru via Mike McCandless)
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
(Ruben Laguna, Shai Erera via Uwe Schindler)
New features
* LUCENE-2128: Parallelized fetching document frequencies during weight
@ -347,6 +350,9 @@ New features
can be used to prevent commits from ever getting deleted from the index.
(Shai Erera)
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
Optimizations
* LUCENE-2075: Terms dict cache is now shared across threads instead
@ -433,9 +439,12 @@ Build
into core, and moved the ICU-based collation support into contrib/icu.
(Robert Muir)
* LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards branch
is now included in the svn repository using "svn copy" after release.
(Uwe Schindler)
* LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards
branch is now included in the svn repository using "svn copy"
after release. (Uwe Schindler)
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
Test Cases

@ -698,11 +698,14 @@ The source distribution does not contain sources of the previous Lucene Java ver
<target name="jflex" depends="clean-jflex,jflex-StandardAnalyzer" />
<target name="jflex-StandardAnalyzer" depends="init,jflex-check" if="jflex.present">
<taskdef classname="JFlex.anttask.JFlexTask" name="jflex">
<classpath location="${jflex.home}/lib/JFlex.jar" />
<taskdef classname="jflex.anttask.JFlexTask" name="jflex">
<classpath refid="jflex.classpath"/>
</taskdef>
<jflex file="src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex"
<jflex file="src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.jflex"
outdir="src/java/org/apache/lucene/analysis/standard"
nobak="on" />
<jflex file="src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.jflex"
outdir="src/java/org/apache/lucene/analysis/standard"
nobak="on" />
</target>

@ -92,6 +92,13 @@
<property name="javacc.home" location="${common.dir}"/>
<property name="jflex.home" location="${common.dir}"/>
<path id="jflex.classpath">
<fileset dir="${jflex.home}/">
<include name="**/target/*.jar"/>
<include name="**/lib/*.jar"/>
</fileset>
</path>
<property name="backwards.dir" location="backwards"/>
<property name="build.dir.backwards" location="${build.dir}/backwards"/>
@ -151,11 +158,9 @@
classpath="${javacc.home}/bin/lib/javacc.jar"
/>
<available
property="jflex.present"
classname="JFlex.anttask.JFlexTask"
classpath="${jflex.home}/lib/JFlex.jar"
/>
<available property="jflex.present" classname="jflex.anttask.JFlexTask">
<classpath refid="jflex.classpath"/>
</available>
<available
property="maven.ant.tasks.present"
@ -237,17 +242,17 @@
JFlex not found.
JFlex Home: ${jflex.home}
Please download and install JFlex from:
Please install the JFlex 1.5 version (currently not released)
from its SVN repository:
&lt;http://jflex.de/download.html&gt;
svn co http://jflex.svn.sourceforge.net/svnroot/jflex/trunk jflex
cd jflex
mvn install
Then, create a build.properties file either in your home
directory, or within the Lucene directory and set the jflex.home
property to the path where JFlex is installed. For example,
if you installed JFlex in /usr/local/java/jflex-1.4.1, then set the
jflex.home property to:
jflex.home=/usr/local/java/jflex-1.4.1
property to the path where the JFlex trunk checkout is located
(in the above example, it's the directory called "jflex").
##################################################################
</fail>

@ -16,10 +16,5 @@
*/
WARNING: if you change StandardTokenizerImpl.jflex and need to regenerate
the tokenizer, only use Java 1.4 !!!
This grammar currently uses constructs (eg :digit:, :letter:) whose
meaning can vary according to the JRE used to run jflex. See
https://issues.apache.org/jira/browse/LUCENE-1126 for details.
For current backwards compatibility it is needed to support
only Java 1.4 - this will change in Lucene 3.1.
WARNING: if you change StandardTokenizerImpl*.jflex and need to regenerate
the tokenizer, only use the trunk version of JFlex 1.5 at the moment!

@ -34,12 +34,12 @@ public final class StandardFilter extends TokenFilter {
typeAtt = addAttribute(TypeAttribute.class);
}
private static final String APOSTROPHE_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.APOSTROPHE];
private static final String ACRONYM_TYPE = StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM];
private static final String APOSTROPHE_TYPE = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.APOSTROPHE];
private static final String ACRONYM_TYPE = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ACRONYM];
// this filter uses attribute type
private TypeAttribute typeAtt;
private TermAttribute termAtt;
private final TypeAttribute typeAtt;
private final TermAttribute termAtt;
/** Returns the next token in the stream, or null at EOS.
* <p>Removes <tt>'s</tt> from the end of words.

@ -23,7 +23,7 @@ import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.util.Version;
@ -55,7 +55,7 @@ import org.apache.lucene.util.Version;
public final class StandardTokenizer extends Tokenizer {
/** A private instance of the JFlex-constructed scanner */
private final StandardTokenizerImpl scanner;
private StandardTokenizerInterface scanner;
public static final int ALPHANUM = 0;
public static final int APOSTROPHE = 1;
@ -111,7 +111,6 @@ public final class StandardTokenizer extends Tokenizer {
*/
public StandardTokenizer(Version matchVersion, Reader input) {
super();
this.scanner = new StandardTokenizerImpl(input);
init(input, matchVersion);
}
@ -120,7 +119,6 @@ public final class StandardTokenizer extends Tokenizer {
*/
public StandardTokenizer(Version matchVersion, AttributeSource source, Reader input) {
super(source);
this.scanner = new StandardTokenizerImpl(input);
init(input, matchVersion);
}
@ -129,29 +127,26 @@ public final class StandardTokenizer extends Tokenizer {
*/
public StandardTokenizer(Version matchVersion, AttributeFactory factory, Reader input) {
super(factory);
this.scanner = new StandardTokenizerImpl(input);
init(input, matchVersion);
}
private void init(Reader input, Version matchVersion) {
private final void init(Reader input, Version matchVersion) {
this.scanner = matchVersion.onOrAfter(Version.LUCENE_31) ?
new StandardTokenizerImpl31(input) : new StandardTokenizerImplOrig(input);
if (matchVersion.onOrAfter(Version.LUCENE_24)) {
replaceInvalidAcronym = true;
} else {
replaceInvalidAcronym = false;
}
this.input = input;
termAtt = addAttribute(TermAttribute.class);
offsetAtt = addAttribute(OffsetAttribute.class);
posIncrAtt = addAttribute(PositionIncrementAttribute.class);
typeAtt = addAttribute(TypeAttribute.class);
}
// this tokenizer generates three attributes:
// offset, positionIncrement and type
private TermAttribute termAtt;
private OffsetAttribute offsetAtt;
private PositionIncrementAttribute posIncrAtt;
private TypeAttribute typeAtt;
// term offset, positionIncrement and type
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
/*
* (non-Javadoc)
@ -166,7 +161,7 @@ public final class StandardTokenizer extends Tokenizer {
while(true) {
int tokenType = scanner.getNextToken();
if (tokenType == StandardTokenizerImpl.YYEOF) {
if (tokenType == StandardTokenizerInterface.YYEOF) {
return false;
}
@ -174,19 +169,19 @@ public final class StandardTokenizer extends Tokenizer {
posIncrAtt.setPositionIncrement(posIncr);
scanner.getText(termAtt);
final int start = scanner.yychar();
offsetAtt.setOffset(correctOffset(start), correctOffset(start+termAtt.termLength()));
offsetAtt.setOffset(correctOffset(start), correctOffset(start+termAtt.length()));
// This 'if' should be removed in the next release. For now, it converts
// invalid acronyms to HOST. When removed, only the 'else' part should
// remain.
if (tokenType == StandardTokenizerImpl.ACRONYM_DEP) {
if (tokenType == StandardTokenizer.ACRONYM_DEP) {
if (replaceInvalidAcronym) {
typeAtt.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.HOST]);
termAtt.setTermLength(termAtt.termLength() - 1); // remove extra '.'
typeAtt.setType(StandardTokenizer.TOKEN_TYPES[StandardTokenizer.HOST]);
termAtt.setLength(termAtt.length() - 1); // remove extra '.'
} else {
typeAtt.setType(StandardTokenizerImpl.TOKEN_TYPES[StandardTokenizerImpl.ACRONYM]);
typeAtt.setType(StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ACRONYM]);
}
} else {
typeAtt.setType(StandardTokenizerImpl.TOKEN_TYPES[tokenType]);
typeAtt.setType(StandardTokenizer.TOKEN_TYPES[tokenType]);
}
return true;
} else
@ -203,21 +198,10 @@ public final class StandardTokenizer extends Tokenizer {
offsetAtt.setOffset(finalOffset, finalOffset);
}
/*
* (non-Javadoc)
*
* @see org.apache.lucene.analysis.TokenStream#reset()
*/
@Override
public void reset() throws IOException {
super.reset();
scanner.yyreset(input);
}
@Override
public void reset(Reader reader) throws IOException {
super.reset(reader);
reset();
scanner.reset(reader);
}
/**
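
The version switch in `init(Reader, Version)` above can be sketched in isolation. This is only an illustration under stated assumptions: `MatchVersion` and `Scanner` are hypothetical stand-ins for `org.apache.lucene.util.Version` and `StandardTokenizerInterface`, and the returned names merely label which generated scanner would be chosen.

```java
// Illustrative sketch of picking a scanner implementation by match version.
class VersionedScannerSketch {
    /** Hypothetical stand-in for Lucene's Version enum. */
    enum MatchVersion {
        V_24, V_30, V_31;

        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Hypothetical stand-in for StandardTokenizerInterface. */
    interface Scanner {
        String name();
    }

    /** Newer match versions get the Unicode 4.0 grammar, older ones keep the original. */
    static Scanner select(MatchVersion v) {
        return v.onOrAfter(MatchVersion.V_31)
            ? () -> "StandardTokenizerImpl31"
            : () -> "StandardTokenizerImplOrig";
    }

    public static void main(String[] args) {
        for (MatchVersion v : MatchVersion.values()) {
            System.out.println(v + " -> " + select(v).name());
        }
    }
}
```

Hiding both generated classes behind one interface keeps the tokenizer's token loop identical for every version, so an index built with an older match version continues to see the original tokenization.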

@ -0,0 +1,747 @@
/* The following code was generated by JFlex 1.5.0-SNAPSHOT on 09.04.10 00:10 */
package org.apache.lucene.analysis.standard;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
WARNING: if you change StandardTokenizerImpl*.jflex and need to regenerate
the tokenizer, only use the trunk version of JFlex 1.5 at the moment!
*/
import java.io.Reader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
/**
* This class is a scanner generated by
* <a href="http://www.jflex.de/">JFlex</a> 1.5.0-SNAPSHOT
* on 09.04.10 00:10 from the specification file
* <tt>C:/Users/Uwe Schindler/Projects/lucene/trunk-full1/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.jflex</tt>
*/
class StandardTokenizerImpl31 implements StandardTokenizerInterface {
/** This character denotes the end of file */
public static final int YYEOF = -1;
/** initial size of the lookahead buffer */
private static final int ZZ_BUFFERSIZE = 16384;
/** lexical states */
public static final int YYINITIAL = 0;
/**
* ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
* ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
* at the beginning of a line
* l is of the form l = 2*k, k a non negative integer
*/
private static final int ZZ_LEXSTATE[] = {
0, 0
};
/**
* Translates characters to character classes
*/
private static final String ZZ_CMAP_PACKED =
"\11\0\1\0\1\15\1\0\1\0\1\14\22\0\1\0\5\0\1\5"+
"\1\3\4\0\1\11\1\7\1\4\1\11\12\2\6\0\1\6\32\12"+
"\4\0\1\10\1\0\32\12\57\0\1\12\12\0\1\12\4\0\1\12"+
"\5\0\27\12\1\0\37\12\1\0\u013f\12\31\0\162\12\4\0\14\12"+
"\16\0\5\12\11\0\1\12\213\0\1\12\13\0\1\12\1\0\3\12"+
"\1\0\1\12\1\0\24\12\1\0\54\12\1\0\46\12\1\0\5\12"+
"\4\0\202\12\10\0\105\12\1\0\46\12\2\0\2\12\6\0\20\12"+
"\41\0\46\12\2\0\1\12\7\0\47\12\110\0\33\12\5\0\3\12"+
"\56\0\32\12\5\0\13\12\25\0\12\2\4\0\2\12\1\0\143\12"+
"\1\0\1\12\17\0\2\12\7\0\2\12\12\2\3\12\2\0\1\12"+
"\20\0\1\12\1\0\36\12\35\0\3\12\60\0\46\12\13\0\1\12"+
"\u0152\0\66\12\3\0\1\12\22\0\1\12\7\0\12\12\4\0\12\2"+
"\25\0\10\12\2\0\2\12\2\0\26\12\1\0\7\12\1\0\1\12"+
"\3\0\4\12\3\0\1\12\36\0\2\12\1\0\3\12\4\0\12\2"+
"\2\12\23\0\6\12\4\0\2\12\2\0\26\12\1\0\7\12\1\0"+
"\2\12\1\0\2\12\1\0\2\12\37\0\4\12\1\0\1\12\7\0"+
"\12\2\2\0\3\12\20\0\11\12\1\0\3\12\1\0\26\12\1\0"+
"\7\12\1\0\2\12\1\0\5\12\3\0\1\12\22\0\1\12\17\0"+
"\2\12\4\0\12\2\25\0\10\12\2\0\2\12\2\0\26\12\1\0"+
"\7\12\1\0\2\12\1\0\5\12\3\0\1\12\36\0\2\12\1\0"+
"\3\12\4\0\12\2\1\0\1\12\21\0\1\12\1\0\6\12\3\0"+
"\3\12\1\0\4\12\3\0\2\12\1\0\1\12\1\0\2\12\3\0"+
"\2\12\3\0\3\12\3\0\10\12\1\0\3\12\55\0\11\2\25\0"+
"\10\12\1\0\3\12\1\0\27\12\1\0\12\12\1\0\5\12\46\0"+
"\2\12\4\0\12\2\25\0\10\12\1\0\3\12\1\0\27\12\1\0"+
"\12\12\1\0\5\12\3\0\1\12\40\0\1\12\1\0\2\12\4\0"+
"\12\2\25\0\10\12\1\0\3\12\1\0\27\12\1\0\20\12\46\0"+
"\2\12\4\0\12\2\25\0\22\12\3\0\30\12\1\0\11\12\1\0"+
"\1\12\2\0\7\12\71\0\1\1\60\12\1\1\2\12\14\1\7\12"+
"\11\1\12\2\47\0\2\12\1\0\1\12\2\0\2\12\1\0\1\12"+
"\2\0\1\12\6\0\4\12\1\0\7\12\1\0\3\12\1\0\1\12"+
"\1\0\1\12\2\0\2\12\1\0\4\12\1\0\2\12\11\0\1\12"+
"\2\0\5\12\1\0\1\12\11\0\12\2\2\0\2\12\42\0\1\12"+
"\37\0\12\2\26\0\10\12\1\0\42\12\35\0\4\12\164\0\42\12"+
"\1\0\5\12\1\0\2\12\25\0\12\2\6\0\6\12\112\0\46\12"+
"\12\0\51\12\7\0\132\12\5\0\104\12\5\0\122\12\6\0\7\12"+
"\1\0\77\12\1\0\1\12\1\0\4\12\2\0\7\12\1\0\1\12"+
"\1\0\4\12\2\0\47\12\1\0\1\12\1\0\4\12\2\0\37\12"+
"\1\0\1\12\1\0\4\12\2\0\7\12\1\0\1\12\1\0\4\12"+
"\2\0\7\12\1\0\7\12\1\0\27\12\1\0\37\12\1\0\1\12"+
"\1\0\4\12\2\0\7\12\1\0\47\12\1\0\23\12\16\0\11\2"+
"\56\0\125\12\14\0\u026c\12\2\0\10\12\12\0\32\12\5\0\113\12"+
"\25\0\15\12\1\0\4\12\16\0\22\12\16\0\22\12\16\0\15\12"+
"\1\0\3\12\17\0\64\12\43\0\1\12\4\0\1\12\3\0\12\2"+
"\46\0\12\2\6\0\130\12\10\0\51\12\127\0\35\12\51\0\12\2"+
"\36\12\2\0\5\12\u038b\0\154\12\224\0\234\12\4\0\132\12\6\0"+
"\26\12\2\0\6\12\2\0\46\12\2\0\6\12\2\0\10\12\1\0"+
"\1\12\1\0\1\12\1\0\1\12\1\0\37\12\2\0\65\12\1\0"+
"\7\12\1\0\1\12\3\0\3\12\1\0\7\12\3\0\4\12\2\0"+
"\6\12\4\0\15\12\5\0\3\12\1\0\7\12\164\0\1\12\15\0"+
"\1\12\202\0\1\12\4\0\1\12\2\0\12\12\1\0\1\12\3\0"+
"\5\12\6\0\1\12\1\0\1\12\1\0\1\12\1\0\4\12\1\0"+
"\3\12\1\0\7\12\3\0\3\12\5\0\5\12\u0ebb\0\2\12\52\0"+
"\5\12\5\0\2\12\3\0\1\13\126\13\6\13\3\13\1\13\132\13"+
"\1\13\4\13\5\13\50\13\3\13\1\0\136\12\21\0\30\12\70\0"+
"\20\13\u0100\0\200\13\200\0\u19b6\13\12\13\100\0\u51a6\13\132\13\u048d\12"+
"\u0773\0\u2ba4\12\u215c\0\u012e\13\2\13\73\13\225\13\7\12\14\0\5\12"+
"\5\0\1\12\1\0\12\12\1\0\15\12\1\0\5\12\1\0\1\12"+
"\1\0\2\12\1\0\2\12\1\0\154\12\41\0\u016b\12\22\0\100\12"+
"\2\0\66\12\50\0\14\12\164\0\5\12\1\0\207\12\23\0\12\2"+
"\7\0\32\12\6\0\32\12\12\0\1\13\72\13\37\12\3\0\6\12"+
"\2\0\6\12\2\0\6\12\2\0\3\12\43\0";
/**
* Translates characters to character classes
*/
private static final char [] ZZ_CMAP = zzUnpackCMap(ZZ_CMAP_PACKED);
/**
* Translates DFA states to action switch labels.
*/
private static final int [] ZZ_ACTION = zzUnpackAction();
private static final String ZZ_ACTION_PACKED_0 =
"\1\0\1\1\3\2\1\3\1\1\13\0\1\2\3\4"+
"\2\0\1\5\1\0\1\5\3\4\6\5\1\6\1\4"+
"\2\7\1\10\1\0\1\10\3\0\2\10\1\11\1\12"+
"\1\4";
private static int [] zzUnpackAction() {
int [] result = new int[51];
int offset = 0;
offset = zzUnpackAction(ZZ_ACTION_PACKED_0, offset, result);
return result;
}
private static int zzUnpackAction(String packed, int offset, int [] result) {
int i = 0; /* index in packed string */
int j = offset; /* index in unpacked array */
int l = packed.length();
while (i < l) {
int count = packed.charAt(i++);
int value = packed.charAt(i++);
do result[j++] = value; while (--count > 0);
}
return j;
}
/**
* Translates a state to a row index in the transition table
*/
private static final int [] ZZ_ROWMAP = zzUnpackRowMap();
private static final String ZZ_ROWMAP_PACKED_0 =
"\0\0\0\16\0\34\0\52\0\70\0\16\0\106\0\124"+
"\0\142\0\160\0\176\0\214\0\232\0\250\0\266\0\304"+
"\0\322\0\340\0\356\0\374\0\u010a\0\u0118\0\u0126\0\u0134"+
"\0\u0142\0\u0150\0\u015e\0\u016c\0\u017a\0\u0188\0\u0196\0\u01a4"+
"\0\u01b2\0\u01c0\0\u01ce\0\u01dc\0\u01ea\0\u01f8\0\322\0\u0206"+
"\0\u0214\0\u0222\0\u0230\0\u023e\0\u024c\0\u025a\0\124\0\214"+
"\0\u0268\0\u0276\0\u0284";
private static int [] zzUnpackRowMap() {
int [] result = new int[51];
int offset = 0;
offset = zzUnpackRowMap(ZZ_ROWMAP_PACKED_0, offset, result);
return result;
}
private static int zzUnpackRowMap(String packed, int offset, int [] result) {
int i = 0; /* index in packed string */
int j = offset; /* index in unpacked array */
int l = packed.length();
while (i < l) {
int high = packed.charAt(i++) << 16;
result[j++] = high | packed.charAt(i++);
}
return j;
}
/**
* The transition table of the DFA
*/
private static final int [] ZZ_TRANS = zzUnpackTrans();
private static final String ZZ_TRANS_PACKED_0 =
"\1\2\1\3\1\4\7\2\1\5\1\6\1\7\1\2"+
"\17\0\2\3\1\0\1\10\1\0\1\11\2\12\1\13"+
"\1\3\4\0\1\3\1\4\1\0\1\14\1\0\1\11"+
"\2\15\1\16\1\4\4\0\1\3\1\4\1\17\1\20"+
"\1\21\1\22\2\12\1\13\1\23\20\0\1\2\1\0"+
"\1\24\1\25\7\0\1\26\4\0\2\27\7\0\1\27"+
"\4\0\1\30\1\31\7\0\1\32\5\0\1\33\7\0"+
"\1\13\4\0\1\34\1\35\7\0\1\36\4\0\1\37"+
"\1\40\7\0\1\41\4\0\1\42\1\43\7\0\1\44"+
"\15\0\1\45\4\0\1\24\1\25\7\0\1\46\15\0"+
"\1\47\4\0\2\27\7\0\1\50\4\0\1\3\1\4"+
"\1\17\1\10\1\21\1\22\2\12\1\13\1\23\4\0"+
"\2\24\1\0\1\51\1\0\1\11\2\52\1\0\1\24"+
"\4\0\1\24\1\25\1\0\1\53\1\0\1\11\2\54"+
"\1\55\1\25\4\0\1\24\1\25\1\0\1\51\1\0"+
"\1\11\2\52\1\0\1\26\4\0\2\27\1\0\1\56"+
"\2\0\1\56\2\0\1\27\4\0\2\30\1\0\1\52"+
"\1\0\1\11\2\52\1\0\1\30\4\0\1\30\1\31"+
"\1\0\1\54\1\0\1\11\2\54\1\55\1\31\4\0"+
"\1\30\1\31\1\0\1\52\1\0\1\11\2\52\1\0"+
"\1\32\5\0\1\33\1\0\1\55\2\0\3\55\1\33"+
"\4\0\2\34\1\0\1\57\1\0\1\11\2\12\1\13"+
"\1\34\4\0\1\34\1\35\1\0\1\60\1\0\1\11"+
"\2\15\1\16\1\35\4\0\1\34\1\35\1\0\1\57"+
"\1\0\1\11\2\12\1\13\1\36\4\0\2\37\1\0"+
"\1\12\1\0\1\11\2\12\1\13\1\37\4\0\1\37"+
"\1\40\1\0\1\15\1\0\1\11\2\15\1\16\1\40"+
"\4\0\1\37\1\40\1\0\1\12\1\0\1\11\2\12"+
"\1\13\1\41\4\0\2\42\1\0\1\13\2\0\3\13"+
"\1\42\4\0\1\42\1\43\1\0\1\16\2\0\3\16"+
"\1\43\4\0\1\42\1\43\1\0\1\13\2\0\3\13"+
"\1\44\6\0\1\17\6\0\1\45\4\0\1\24\1\25"+
"\1\0\1\61\1\0\1\11\2\52\1\0\1\26\4\0"+
"\2\27\1\0\1\56\2\0\1\56\2\0\1\50\4\0"+
"\2\24\7\0\1\24\4\0\2\30\7\0\1\30\4\0"+
"\2\34\7\0\1\34\4\0\2\37\7\0\1\37\4\0"+
"\2\42\7\0\1\42\4\0\2\62\7\0\1\62\4\0"+
"\2\24\7\0\1\63\4\0\2\62\1\0\1\56\2\0"+
"\1\56\2\0\1\62\4\0\2\24\1\0\1\61\1\0"+
"\1\11\2\52\1\0\1\24\3\0";
private static int [] zzUnpackTrans() {
int [] result = new int[658];
int offset = 0;
offset = zzUnpackTrans(ZZ_TRANS_PACKED_0, offset, result);
return result;
}
private static int zzUnpackTrans(String packed, int offset, int [] result) {
int i = 0; /* index in packed string */
int j = offset; /* index in unpacked array */
int l = packed.length();
while (i < l) {
int count = packed.charAt(i++);
int value = packed.charAt(i++);
value--;
do result[j++] = value; while (--count > 0);
}
return j;
}
/* error codes */
private static final int ZZ_UNKNOWN_ERROR = 0;
private static final int ZZ_NO_MATCH = 1;
private static final int ZZ_PUSHBACK_2BIG = 2;
/* error messages for the codes above */
private static final String ZZ_ERROR_MSG[] = {
"Unkown internal scanner error",
"Error: could not match input",
"Error: pushback value was too large"
};
/**
* ZZ_ATTRIBUTE[aState] contains the attributes of state <code>aState</code>
*/
private static final int [] ZZ_ATTRIBUTE = zzUnpackAttribute();
private static final String ZZ_ATTRIBUTE_PACKED_0 =
"\1\0\1\11\3\1\1\11\1\1\13\0\4\1\2\0"+
"\1\1\1\0\17\1\1\0\1\1\3\0\5\1";
private static int [] zzUnpackAttribute() {
int [] result = new int[51];
int offset = 0;
offset = zzUnpackAttribute(ZZ_ATTRIBUTE_PACKED_0, offset, result);
return result;
}
private static int zzUnpackAttribute(String packed, int offset, int [] result) {
int i = 0; /* index in packed string */
int j = offset; /* index in unpacked array */
int l = packed.length();
while (i < l) {
int count = packed.charAt(i++);
int value = packed.charAt(i++);
do result[j++] = value; while (--count > 0);
}
return j;
}
/** the input device */
private java.io.Reader zzReader;
/** the current state of the DFA */
private int zzState;
/** the current lexical state */
private int zzLexicalState = YYINITIAL;
/** this buffer contains the current text to be matched and is
the source of the yytext() string */
private char zzBuffer[] = new char[ZZ_BUFFERSIZE];
/** the textposition at the last accepting state */
private int zzMarkedPos;
/** the current text position in the buffer */
private int zzCurrentPos;
/** startRead marks the beginning of the yytext() string in the buffer */
private int zzStartRead;
/** endRead marks the last character in the buffer, that has been read
from input */
private int zzEndRead;
/** number of newlines encountered up to the start of the matched text */
private int yyline;
/** the number of characters up to the start of the matched text */
private int yychar;
/**
* the number of characters from the last newline up to the start of the
* matched text
*/
private int yycolumn;
/**
* zzAtBOL == true <=> the scanner is currently at the beginning of a line
*/
private boolean zzAtBOL = true;
/** zzAtEOF == true <=> the scanner is at the EOF */
private boolean zzAtEOF;
/** denotes if the user-EOF-code has already been executed */
private boolean zzEOFDone;
/* user code: */
public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
public static final int APOSTROPHE = StandardTokenizer.APOSTROPHE;
public static final int ACRONYM = StandardTokenizer.ACRONYM;
public static final int COMPANY = StandardTokenizer.COMPANY;
public static final int EMAIL = StandardTokenizer.EMAIL;
public static final int HOST = StandardTokenizer.HOST;
public static final int NUM = StandardTokenizer.NUM;
public static final int CJ = StandardTokenizer.CJ;
/**
* @deprecated this solves a bug where HOSTs that end with '.' are identified
* as ACRONYMs.
*/
public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
public final int yychar()
{
return yychar;
}
/**
* Fills CharTermAttribute with the current token text.
*/
public final void getText(CharTermAttribute t) {
t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
/**
* Resets the Tokenizer to a new Reader.
*/
public final void reset(Reader r) {
// reset to default buffer size, if buffer has grown
if (zzBuffer.length > ZZ_BUFFERSIZE) {
zzBuffer = new char[ZZ_BUFFERSIZE];
}
yyreset(r);
}
/**
* Creates a new scanner
* There is also a java.io.InputStream version of this constructor.
*
* @param in the java.io.Reader to read input from.
*/
StandardTokenizerImpl31(java.io.Reader in) {
this.zzReader = in;
}
/**
* Creates a new scanner.
* There is also a java.io.Reader version of this constructor.
*
* @param in the java.io.InputStream to read input from.
*/
StandardTokenizerImpl31(java.io.InputStream in) {
this(new java.io.InputStreamReader(in));
}
/**
* Unpacks the compressed character translation table.
*
* @param packed the packed character translation table
* @return the unpacked character translation table
*/
private static char [] zzUnpackCMap(String packed) {
char [] map = new char[0x10000];
int i = 0; /* index in packed string */
int j = 0; /* index in unpacked array */
while (i < 1234) {
int count = packed.charAt(i++);
char value = packed.charAt(i++);
do map[j++] = value; while (--count > 0);
}
return map;
}
/**
* Refills the input buffer.
*
* @return <code>false</code>, iff there was new input.
*
* @exception java.io.IOException if any I/O-Error occurs
*/
private boolean zzRefill() throws java.io.IOException {
/* first: make room (if you can) */
if (zzStartRead > 0) {
System.arraycopy(zzBuffer, zzStartRead,
zzBuffer, 0,
zzEndRead-zzStartRead);
/* translate stored positions */
zzEndRead-= zzStartRead;
zzCurrentPos-= zzStartRead;
zzMarkedPos-= zzStartRead;
zzStartRead = 0;
}
/* is the buffer big enough? */
if (zzCurrentPos >= zzBuffer.length) {
/* if not: blow it up */
char newBuffer[] = new char[zzCurrentPos*2];
System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
zzBuffer = newBuffer;
}
/* finally: fill the buffer with new input */
int numRead = zzReader.read(zzBuffer, zzEndRead,
zzBuffer.length-zzEndRead);
if (numRead > 0) {
zzEndRead+= numRead;
return false;
}
// unlikely but not impossible: read 0 characters, but not at end of stream
if (numRead == 0) {
int c = zzReader.read();
if (c == -1) {
return true;
} else {
zzBuffer[zzEndRead++] = (char) c;
return false;
}
}
// numRead < 0
return true;
}
/**
* Closes the input stream.
*/
public final void yyclose() throws java.io.IOException {
zzAtEOF = true; /* indicate end of file */
zzEndRead = zzStartRead; /* invalidate buffer */
if (zzReader != null)
zzReader.close();
}
/**
* Resets the scanner to read from a new input stream.
* Does not close the old reader.
*
* All internal variables are reset, the old input stream
* <b>cannot</b> be reused (internal buffer is discarded and lost).
* Lexical state is set to <tt>YYINITIAL</tt>.
*
* @param reader the new input stream
*/
public final void yyreset(java.io.Reader reader) {
zzReader = reader;
zzAtBOL = true;
zzAtEOF = false;
zzEOFDone = false;
zzEndRead = zzStartRead = 0;
zzCurrentPos = zzMarkedPos = 0;
yyline = yychar = yycolumn = 0;
zzLexicalState = YYINITIAL;
}
/**
* Returns the current lexical state.
*/
public final int yystate() {
return zzLexicalState;
}
/**
* Enters a new lexical state
*
* @param newState the new lexical state
*/
public final void yybegin(int newState) {
zzLexicalState = newState;
}
/**
* Returns the text matched by the current regular expression.
*/
public final String yytext() {
return new String( zzBuffer, zzStartRead, zzMarkedPos-zzStartRead );
}
/**
* Returns the character at position <tt>pos</tt> from the
* matched text.
*
* It is equivalent to yytext().charAt(pos), but faster
*
* @param pos the position of the character to fetch.
* A value from 0 to yylength()-1.
*
* @return the character at position pos
*/
public final char yycharat(int pos) {
return zzBuffer[zzStartRead+pos];
}
/**
* Returns the length of the matched text region.
*/
public final int yylength() {
return zzMarkedPos-zzStartRead;
}
/**
* Reports an error that occurred while scanning.
*
* In a wellformed scanner (no or only correct usage of
* yypushback(int) and a match-all fallback rule) this method
* will only be called with things that "Can't Possibly Happen".
* If this method is called, something is seriously wrong
* (e.g. a JFlex bug producing a faulty scanner etc.).
*
* Usual syntax/scanner level error handling should be done
* in error fallback rules.
*
* @param errorCode the code of the errormessage to display
*/
private void zzScanError(int errorCode) {
String message;
try {
message = ZZ_ERROR_MSG[errorCode];
}
catch (ArrayIndexOutOfBoundsException e) {
message = ZZ_ERROR_MSG[ZZ_UNKNOWN_ERROR];
}
throw new Error(message);
}
/**
* Pushes the specified amount of characters back into the input stream.
*
* They will be read again by the next call of the scanning method
*
* @param number the number of characters to be read again.
* This number must not be greater than yylength()!
*/
public void yypushback(int number) {
if ( number > yylength() )
zzScanError(ZZ_PUSHBACK_2BIG);
zzMarkedPos -= number;
}
/**
* Resumes scanning until the next regular expression is matched,
* the end of input is encountered or an I/O-Error occurs.
*
* @return the next token
* @exception java.io.IOException if any I/O-Error occurs
*/
public int getNextToken() throws java.io.IOException {
int zzInput;
int zzAction;
// cached fields:
int zzCurrentPosL;
int zzMarkedPosL;
int zzEndReadL = zzEndRead;
char [] zzBufferL = zzBuffer;
char [] zzCMapL = ZZ_CMAP;
int [] zzTransL = ZZ_TRANS;
int [] zzRowMapL = ZZ_ROWMAP;
int [] zzAttrL = ZZ_ATTRIBUTE;
while (true) {
zzMarkedPosL = zzMarkedPos;
yychar+= zzMarkedPosL-zzStartRead;
zzAction = -1;
zzCurrentPosL = zzCurrentPos = zzStartRead = zzMarkedPosL;
zzState = ZZ_LEXSTATE[zzLexicalState];
zzForAction: {
while (true) {
if (zzCurrentPosL < zzEndReadL)
zzInput = zzBufferL[zzCurrentPosL++];
else if (zzAtEOF) {
zzInput = YYEOF;
break zzForAction;
}
else {
// store back cached positions
zzCurrentPos = zzCurrentPosL;
zzMarkedPos = zzMarkedPosL;
boolean eof = zzRefill();
// get translated positions and possibly new buffer
zzCurrentPosL = zzCurrentPos;
zzMarkedPosL = zzMarkedPos;
zzBufferL = zzBuffer;
zzEndReadL = zzEndRead;
if (eof) {
zzInput = YYEOF;
break zzForAction;
}
else {
zzInput = zzBufferL[zzCurrentPosL++];
}
}
int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
if (zzNext == -1) break zzForAction;
zzState = zzNext;
int zzAttributes = zzAttrL[zzState];
if ( (zzAttributes & 1) == 1 ) {
zzAction = zzState;
zzMarkedPosL = zzCurrentPosL;
if ( (zzAttributes & 8) == 8 ) break zzForAction;
}
}
}
// store back cached position
zzMarkedPos = zzMarkedPosL;
switch (zzAction < 0 ? zzAction : ZZ_ACTION[zzAction]) {
case 5:
{ return NUM;
}
case 11: break;
case 9:
{ return ACRONYM;
}
case 12: break;
case 7:
{ return COMPANY;
}
case 13: break;
case 10:
{ return EMAIL;
}
case 14: break;
case 1:
{ /* ignore */
}
case 15: break;
case 6:
{ return APOSTROPHE;
}
case 16: break;
case 3:
{ return CJ;
}
case 17: break;
case 8:
{ return ACRONYM_DEP;
}
case 18: break;
case 2:
{ return ALPHANUM;
}
case 19: break;
case 4:
{ return HOST;
}
case 20: break;
default:
if (zzInput == YYEOF && zzStartRead == zzCurrentPos) {
zzAtEOF = true;
return YYEOF;
}
else {
zzScanError(ZZ_NO_MATCH);
}
}
}
}
}
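
The packed string tables in the generated class above are run-length encoded: each pair of characters is a repeat count followed by a value. The sketch below mirrors the loop in `zzUnpackAction` on a short input; the class name `PackedTableSketch` and the `size` parameter are hypothetical, and the sample string is the first eight characters of `ZZ_ACTION_PACKED_0`.

```java
import java.util.Arrays;

// Illustrative decoder for (count, value) pairs stored as characters of a String.
class PackedTableSketch {
    /** Expands (count, value) character pairs into an int array of the given size. */
    static int[] unpack(String packed, int size) {
        int[] result = new int[size];
        int i = 0; // index in packed string
        int j = 0; // index in unpacked array
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            // do/while: every pair emits its value at least once
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return result;
    }

    public static void main(String[] args) {
        // Octal escapes: one 0, one 1, three 2s, one 3.
        int[] table = unpack("\1\0\1\1\3\2\1\3", 6);
        System.out.println(Arrays.toString(table));
    }
}
```

`zzUnpackTrans` uses the same scheme but subtracts one from each value, which lets the table store -1 (no transition) in an unsigned `char`.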

@ -19,23 +19,19 @@ package org.apache.lucene.analysis.standard;
/*
WARNING: if you change StandardTokenizerImpl.jflex and need to regenerate
the tokenizer, only use Java 1.4 !!!
This grammar currently uses constructs (eg :digit:, :letter:) whose
meaning can vary according to the JRE used to run jflex. See
https://issues.apache.org/jira/browse/LUCENE-1126 for details.
For current backwards compatibility it is needed to support
only Java 1.4 - this will change in Lucene 3.1.
WARNING: if you change StandardTokenizerImpl*.jflex and need to regenerate
the tokenizer, only use the trunk version of JFlex 1.5 at the moment!
*/
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import java.io.Reader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
%%
%class StandardTokenizerImpl
%unicode
%class StandardTokenizerImpl31
%implements StandardTokenizerInterface
%unicode 4.0
%integer
%function getNextToken
%pack
@ -55,7 +51,6 @@ public static final int CJ = StandardTokenizer.CJ;
* @deprecated this solves a bug where HOSTs that end with '.' are identified
* as ACRONYMs.
*/
@Deprecated
public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
@ -66,17 +61,21 @@ public final int yychar()
}
/**
* Fills Lucene token with the current token text.
* Fills CharTermAttribute with the current token text.
*/
final void getText(Token t) {
t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
public final void getText(CharTermAttribute t) {
t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
/**
* Fills TermAttribute with the current token text.
* Resets the Tokenizer to a new Reader.
*/
final void getText(TermAttribute t) {
t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
public final void reset(Reader r) {
// reset to default buffer size, if buffer has grown
if (zzBuffer.length > ZZ_BUFFERSIZE) {
zzBuffer = new char[ZZ_BUFFERSIZE];
}
yyreset(r);
}
%}
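
The reset(Reader) change above can be sketched in isolation. This is a minimal illustration, not the generated scanner: the class name, DEFAULT_SIZE (standing in for ZZ_BUFFERSIZE) and ensureCapacity are assumptions made for the example.

```java
import java.io.Reader;
import java.util.Arrays;

/** Hypothetical stand-in for a scanner that owns a growable char buffer. */
public class BufferResetSketch {
  /** Stands in for ZZ_BUFFERSIZE; the value is chosen only for the example. */
  public static final int DEFAULT_SIZE = 16;

  private char[] buffer = new char[DEFAULT_SIZE];
  private Reader reader;

  /** Simulates the scanner growing its buffer while matching a long token. */
  public void ensureCapacity(int needed) {
    if (needed > buffer.length) {
      buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
    }
  }

  /** Switches to a new reader and drops an oversized buffer, as in the diff. */
  public void reset(Reader r) {
    if (buffer.length > DEFAULT_SIZE) {
      buffer = new char[DEFAULT_SIZE];
    }
    reader = r;
  }

  public int capacity() {
    return buffer.length;
  }
}
```

The point of the design is that one unusually long token no longer pins a large array for the lifetime of a reused tokenizer; the cost is one small allocation on the next reset.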

View File

@ -1,4 +1,4 @@
/* The following code was generated by JFlex 1.4.1 on 9/4/08 6:49 PM */
/* The following code was generated by JFlex 1.5.0-SNAPSHOT on 09.04.10 00:10 */
package org.apache.lucene.analysis.standard;
@ -21,27 +21,22 @@ package org.apache.lucene.analysis.standard;
/*
WARNING: if you change StandardTokenizerImpl.jflex and need to regenerate
the tokenizer, only use Java 1.4 !!!
This grammar currently uses constructs (eg :digit:, :letter:) whose
meaning can vary according to the JRE used to run jflex. See
https://issues.apache.org/jira/browse/LUCENE-1126 for details.
For current backwards compatibility it is needed to support
only Java 1.4 - this will change in Lucene 3.1.
WARNING: if you change StandardTokenizerImpl*.jflex and need to regenerate
the tokenizer, only use the trunk version of JFlex 1.5 at the moment!
*/
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import java.io.Reader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
/**
* This class is a scanner generated by
* <a href="http://www.jflex.de/">JFlex</a> 1.4.1
* on 9/4/08 6:49 PM from the specification file
* <tt>/tango/mike/src/lucene.standarddigit/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex</tt>
* <a href="http://www.jflex.de/">JFlex</a> 1.5.0-SNAPSHOT
* on 09.04.10 00:10 from the specification file
* <tt>C:/Users/Uwe Schindler/Projects/lucene/trunk-full1/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.jflex</tt>
*/
class StandardTokenizerImpl {
class StandardTokenizerImplOrig implements StandardTokenizerInterface {
/** This character denotes the end of file */
public static final int YYEOF = -1;
@ -52,6 +47,16 @@ class StandardTokenizerImpl {
/** lexical states */
public static final int YYINITIAL = 0;
/**
* ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l
* ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l
* at the beginning of a line
* l is of the form l = 2*k, k a non negative integer
*/
private static final int ZZ_LEXSTATE[] = {
0, 0
};
/**
* Translates characters to character classes
*/
@ -307,9 +312,6 @@ class StandardTokenizerImpl {
/** the textposition at the last accepting state */
private int zzMarkedPos;
/** the textposition at the last state to be included in yytext */
private int zzPushbackPos;
/** the current text position in the buffer */
private int zzCurrentPos;
@ -340,6 +342,9 @@ class StandardTokenizerImpl {
/** zzAtEOF == true <=> the scanner is at the EOF */
private boolean zzAtEOF;
/** denotes if the user-EOF-code has already been executed */
private boolean zzEOFDone;
/* user code: */
public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
@ -354,7 +359,6 @@ public static final int CJ = StandardTokenizer.CJ;
* @deprecated this solves a bug where HOSTs that end with '.' are identified
* as ACRONYMs.
*/
@Deprecated
public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
@ -365,27 +369,32 @@ public final int yychar()
}
/**
* Fills Lucene token with the current token text.
* Fills CharTermAttribute with the current token text.
*/
final void getText(Token t) {
t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
public final void getText(CharTermAttribute t) {
t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
/**
* Fills TermAttribute with the current token text.
* Resets the Tokenizer to a new Reader.
*/
final void getText(TermAttribute t) {
t.setTermBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
public final void reset(Reader r) {
// reset to default buffer size, if buffer has grown
if (zzBuffer.length > ZZ_BUFFERSIZE) {
zzBuffer = new char[ZZ_BUFFERSIZE];
}
yyreset(r);
}
/**
* Creates a new scanner
* There is also a java.io.InputStream version of this constructor.
*
* @param in the java.io.Reader to read input from.
*/
StandardTokenizerImpl(java.io.Reader in) {
StandardTokenizerImplOrig(java.io.Reader in) {
this.zzReader = in;
}
@ -395,7 +404,7 @@ final void getText(TermAttribute t) {
*
* @param in the java.io.InputStream to read input from.
*/
StandardTokenizerImpl(java.io.InputStream in) {
StandardTokenizerImplOrig(java.io.InputStream in) {
this(new java.io.InputStreamReader(in));
}
@ -437,7 +446,6 @@ final void getText(TermAttribute t) {
zzEndRead-= zzStartRead;
zzCurrentPos-= zzStartRead;
zzMarkedPos-= zzStartRead;
zzPushbackPos-= zzStartRead;
zzStartRead = 0;
}
@ -453,13 +461,23 @@ final void getText(TermAttribute t) {
int numRead = zzReader.read(zzBuffer, zzEndRead,
zzBuffer.length-zzEndRead);
if (numRead < 0) {
return true;
}
else {
if (numRead > 0) {
zzEndRead+= numRead;
return false;
}
// unlikely but not impossible: read 0 characters, but not at end of stream
if (numRead == 0) {
int c = zzReader.read();
if (c == -1) {
return true;
} else {
zzBuffer[zzEndRead++] = (char) c;
return false;
}
}
// numRead < 0
return true;
}
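
The new refill logic in the hunk above handles a bulk read that returns 0 characters without being at end of stream. A minimal sketch of that control flow follows, assuming the buffer always has free space; Buf and StallingReader are hypothetical helpers written for the example and are not part of the generated scanner.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class RefillSketch {
  /** Hypothetical holder for a buffer and its fill level. */
  public static final class Buf {
    public final char[] chars = new char[16];
    public int end = 0;
  }

  /** Returns true at end of stream, false if at least one char was added. */
  public static boolean refill(Reader in, Buf b) throws IOException {
    int numRead = in.read(b.chars, b.end, b.chars.length - b.end);
    if (numRead > 0) {
      b.end += numRead;
      return false;
    }
    if (numRead == 0) {
      // Bulk read made no progress: fall back to a single-char read,
      // which returns -1 only at end of stream.
      int c = in.read();
      if (c == -1) {
        return true;
      }
      b.chars[b.end++] = (char) c;
      return false;
    }
    return true; // numRead < 0: end of stream
  }

  /** Hypothetical reader whose first bulk read reports zero characters. */
  public static final class StallingReader extends Reader {
    private final Reader delegate;
    private boolean stalled = false;

    public StallingReader(String s) {
      delegate = new StringReader(s);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
      if (!stalled) {
        stalled = true;
        return 0;
      }
      return delegate.read(cbuf, off, len);
    }

    @Override
    public int read() throws IOException {
      return delegate.read();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}
```

With the old code a zero-length read was reported as "not at EOF" with nothing added, so the caller could spin on the same position; the fallback guarantees either progress or a definite end of stream.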
@ -489,8 +507,9 @@ final void getText(TermAttribute t) {
zzReader = reader;
zzAtBOL = true;
zzAtEOF = false;
zzEOFDone = false;
zzEndRead = zzStartRead = 0;
zzCurrentPos = zzMarkedPos = zzPushbackPos = 0;
zzCurrentPos = zzMarkedPos = 0;
yyline = yychar = yycolumn = 0;
zzLexicalState = YYINITIAL;
}
@ -620,7 +639,7 @@ final void getText(TermAttribute t) {
zzCurrentPosL = zzCurrentPos = zzStartRead = zzMarkedPosL;
zzState = zzLexicalState;
zzState = ZZ_LEXSTATE[zzLexicalState];
zzForAction: {
@ -668,44 +687,44 @@ final void getText(TermAttribute t) {
zzMarkedPos = zzMarkedPosL;
switch (zzAction < 0 ? zzAction : ZZ_ACTION[zzAction]) {
case 4:
{ return HOST;
case 5:
{ return NUM;
}
case 11: break;
case 9:
{ return ACRONYM;
}
case 12: break;
case 8:
{ return ACRONYM_DEP;
}
case 13: break;
case 1:
{ /* ignore */
}
case 14: break;
case 5:
{ return NUM;
}
case 15: break;
case 3:
{ return CJ;
}
case 16: break;
case 2:
{ return ALPHANUM;
}
case 17: break;
case 7:
{ return COMPANY;
}
case 18: break;
case 13: break;
case 10:
{ return EMAIL;
}
case 14: break;
case 1:
{ /* ignore */
}
case 15: break;
case 6:
{ return APOSTROPHE;
}
case 16: break;
case 3:
{ return CJ;
}
case 17: break;
case 8:
{ return ACRONYM_DEP;
}
case 18: break;
case 2:
{ return ALPHANUM;
}
case 19: break;
case 10:
{ return EMAIL;
case 4:
{ return HOST;
}
case 20: break;
default:

View File

@ -0,0 +1,145 @@
package org.apache.lucene.analysis.standard;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
WARNING: if you change StandardTokenizerImpl*.jflex and need to regenerate
the tokenizer, only use the trunk version of JFlex 1.5 at the moment!
*/
import java.io.Reader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
%%
%class StandardTokenizerImplOrig
%implements StandardTokenizerInterface
%unicode 3.0
%integer
%function getNextToken
%pack
%char
%{
public static final int ALPHANUM = StandardTokenizer.ALPHANUM;
public static final int APOSTROPHE = StandardTokenizer.APOSTROPHE;
public static final int ACRONYM = StandardTokenizer.ACRONYM;
public static final int COMPANY = StandardTokenizer.COMPANY;
public static final int EMAIL = StandardTokenizer.EMAIL;
public static final int HOST = StandardTokenizer.HOST;
public static final int NUM = StandardTokenizer.NUM;
public static final int CJ = StandardTokenizer.CJ;
/**
* @deprecated this solves a bug where HOSTs that end with '.' are identified
* as ACRONYMs.
*/
public static final int ACRONYM_DEP = StandardTokenizer.ACRONYM_DEP;
public static final String [] TOKEN_TYPES = StandardTokenizer.TOKEN_TYPES;
public final int yychar()
{
return yychar;
}
/**
* Fills CharTermAttribute with the current token text.
*/
public final void getText(CharTermAttribute t) {
t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos-zzStartRead);
}
/**
* Resets the Tokenizer to a new Reader.
*/
public final void reset(Reader r) {
// reset to default buffer size, if buffer has grown
if (zzBuffer.length > ZZ_BUFFERSIZE) {
zzBuffer = new char[ZZ_BUFFERSIZE];
}
yyreset(r);
}
%}
THAI = [\u0E00-\u0E59]
// basic word: a sequence of digits & letters (includes Thai to enable ThaiAnalyzer to function)
ALPHANUM = ({LETTER}|{THAI}|[:digit:])+
// internal apostrophes: O'Reilly, you're, O'Reilly's
// use a post-filter to remove possessives
APOSTROPHE = {ALPHA} ("'" {ALPHA})+
// acronyms: U.S.A., I.B.M., etc.
// use a post-filter to remove dots
ACRONYM = {LETTER} "." ({LETTER} ".")+
ACRONYM_DEP = {ALPHANUM} "." ({ALPHANUM} ".")+
// company names like AT&T and Excite@Home.
COMPANY = {ALPHA} ("&"|"@") {ALPHA}
// email addresses
EMAIL = {ALPHANUM} (("."|"-"|"_") {ALPHANUM})* "@" {ALPHANUM} (("."|"-") {ALPHANUM})+
// hostname
HOST = {ALPHANUM} ((".") {ALPHANUM})+
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
| {HAS_DIGIT} {P} {ALPHANUM}
| {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
| {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
// punctuation
P = ("_"|"-"|"/"|"."|",")
// at least one digit
HAS_DIGIT = ({LETTER}|[:digit:])* [:digit:] ({LETTER}|[:digit:])*
ALPHA = ({LETTER})+
// From the JFlex manual: "the expression that matches everything of <a> not matched by <b> is !(!<a>|<b>)"
LETTER = !(![:letter:]|{CJ})
// Chinese and Japanese (but NOT Korean, which is included in [:letter:])
CJ = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]
WHITESPACE = \r\n | [ \r\n\t\f]
%%
{ALPHANUM} { return ALPHANUM; }
{APOSTROPHE} { return APOSTROPHE; }
{ACRONYM} { return ACRONYM; }
{COMPANY} { return COMPANY; }
{EMAIL} { return EMAIL; }
{HOST} { return HOST; }
{NUM} { return NUM; }
{CJ} { return CJ; }
{ACRONYM_DEP} { return ACRONYM_DEP; }
/** Ignore the rest */
. | {WHITESPACE} { /* ignore */ }

View File

@ -0,0 +1,66 @@
package org.apache.lucene.analysis.standard;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.Reader;
import java.io.IOException;
interface StandardTokenizerInterface {
/** This character denotes the end of file */
public static final int YYEOF = -1;
/**
* Copies the matched text into the CharTermAttribute
*/
void getText(CharTermAttribute t);
/**
* Returns the current position.
*/
int yychar();
/**
* Resets the scanner to read from a new input stream.
* Does not close the old reader.
*
* All internal variables are reset, the old input stream
* <b>cannot</b> be reused (internal buffer is discarded and lost).
* Lexical state is set to <tt>YYINITIAL</tt>.
*
* @param reader the new input stream
*/
void reset(Reader reader);
/**
* Returns the length of the matched text region.
*/
int yylength();
/**
* Resumes scanning until the next regular expression is matched,
* the end of input is encountered or an I/O-Error occurs.
*
* @return the next token, {@link #YYEOF} on end of stream
* @exception IOException if any I/O-Error occurs
*/
int getNextToken() throws IOException;
}
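
The interface above lets a caller hold either generated scanner behind one type and pick the grammar by matchVersion. A rough sketch of that selection follows; every name in it (MatchVersion, Scanner, ScannerOrig, Scanner31, forVersion) is hypothetical and only mirrors the shape described in the commit message, not the actual StandardTokenizer code.

```java
public class ScannerSelectionSketch {
  /** Hypothetical stand-in for Lucene's Version constants. */
  public enum MatchVersion { V30, V31 }

  /** Hypothetical stand-in for StandardTokenizerInterface. */
  public interface Scanner {
    String grammar();
  }

  /** Stands in for the scanner generated with %unicode 3.0. */
  public static final class ScannerOrig implements Scanner {
    public String grammar() {
      return "unicode 3.0";
    }
  }

  /** Stands in for the scanner generated with %unicode 4.0. */
  public static final class Scanner31 implements Scanner {
    public String grammar() {
      return "unicode 4.0";
    }
  }

  /** Picks the newer grammar for V31 and later, the original one otherwise. */
  public static Scanner forVersion(MatchVersion v) {
    return v.compareTo(MatchVersion.V31) >= 0 ? new Scanner31() : new ScannerOrig();
  }
}
```

Keeping the old grammar as its own generated class means indexes built with the earlier matchVersion continue to tokenize the same way.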

View File

@ -224,4 +224,12 @@ public class TestStandardAnalyzer extends BaseTokenStreamTestCase {
"<ALPHANUM>", "<NUM>", "<HOST>", "<NUM>", "<ALPHANUM>",
"<ALPHANUM>", "<HOST>"});
}
public void testJava14BWCompatibility() throws Exception {
StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_30);
assertAnalyzesTo(sa, "test\u02C6test", new String[] { "test", "test" });
sa = new StandardAnalyzer(Version.LUCENE_31);
assertAnalyzesTo(sa, "test\u02C6test", new String[] { "test\u02C6test" });
}
}
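
The test above turns on how U+02C6 (MODIFIER LETTER CIRCUMFLEX ACCENT) is classified: the LUCENE_30 grammar splits on it, the LUCENE_31 grammar keeps it inside the token. As a quick, hedged check of the classification in the JRE at hand, the snippet below asks java.lang.Character directly; the class name is made up for the example, and the answer reflects the Unicode version of the running JRE, not the tables baked into the generated scanners.

```java
public class ModifierLetterCheck {
  /** The code point used in testJava14BWCompatibility. */
  public static final char MODIFIER_CIRCUMFLEX = '\u02C6';

  public static boolean isLetter(char c) {
    return Character.isLetter(c);
  }

  public static boolean isModifierLetter(char c) {
    return Character.getType(c) == Character.MODIFIER_LETTER;
  }

  public static void main(String[] args) {
    System.out.println(isLetter(MODIFIER_CIRCUMFLEX));
    System.out.println(isModifierLetter(MODIFIER_CIRCUMFLEX));
  }
}
```

This is why the grammar pins an explicit %unicode version: the generated tables no longer depend on which JRE happened to run JFlex.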