Analysis Module Change Log For more information on past and future Lucene versions, please see: http://s.apache.org/luceneversions ======================= Trunk (not yet released) ======================= API Changes * LUCENE-3820: Deprecated constructors accepting pattern matching bounds. The input is buffered and matched in one pass. (Dawid Weiss) * LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir) * LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir) * LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles can be generated. (Chris Harris via Steven Rowe) * LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to use pure byte keys when Version >= 4.0. This cuts sort key size approximately in half. (Robert Muir) * LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male) * LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated methods since they prevented reuse. Stopwords are now generated at instantiation through the Analyzer's constructors. (Chris Male) * LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and PerFieldAnalyzerWrapper.addAnalyzer since they prevent reuse. Both Analyzers should be configured at instantiation. (Chris Male) * LUCENE-3765: Stopset ctors that previously took Set or Map now take CharArraySet and CharArrayMap respectively. Previously the behavior was confusing, and sometimes different depending on the type of set, and ultimately a CharArraySet or CharArrayMap was always used anyway. (Robert Muir) Bug fixes * LUCENE-3820: PatternReplaceCharFilter could return invalid token positions. (Dawid Weiss) New Features * LUCENE-2341: A new analyzer/ filter: Morfologik - a dictionary-driven lemmatizer (accurate stemmer) for Polish (includes morphosyntactic annotations). (Michał Dybizbański, Dawid Weiss) * LUCENE-2413: Consolidated Lucene/Solr analysis components into common. New features from Solr now available to Lucene users include: - o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring terms and phrases. - o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips HTML constructs. - o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits words into subwords and performs optional transformations on subword groups. - o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter which filters out Tokens at the same position and Term text as the previous token. - o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing whitespace from Tokens in the stream. - o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only keeps tokens with text contained in the required words (inverse of StopFilter). - o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that puts hyphenated words broken into two lines back together. - o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that applies capitalization rules to tokens. - o.a.l.analysis.pattern: Package for pattern-based analysis, containing a CharFilter, Tokenizer, and Tokenfilter for transforming text with regexes. - o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports multi-word synonyms. - o.a.l.analysis.phonetic: Package for phonetic search, containing various phonetic encoders such as Double Metaphone. Some existing analysis components changed packages: - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter - o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter - o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer - o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter - o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter - o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter - o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter - o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper - o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter - o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter - o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter - o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap - o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase - o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader - o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer - o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils All analyzers in contrib/analyzers and contrib/icu were moved to the analysis module. The 'smartcn' and 'stempel' components now depend on 'common'. (Chris Male, Robert Muir)