Analysis Module Change Log

For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions

======================= Trunk (not yet released) =======================

API Changes

 * LUCENE-3820: Deprecated constructors accepting pattern matching bounds.
   The input is buffered and matched in one pass. (Dawid Weiss)

 * LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor
   of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir)

 * LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir)

 * LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles
   can be generated. (Chris Harris via Steven Rowe)

 * LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to
   use pure byte keys when Version >= 4.0. This cuts sort key size
   approximately in half. (Robert Muir)

 * LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male)

 * LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated
   methods since they prevented reuse. Stopwords are now generated at
   instantiation through the Analyzer's constructors. (Chris Male)

 * LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and
   PerFieldAnalyzerWrapper.addAnalyzer since they prevent reuse. Both
   Analyzers should be configured at instantiation. (Chris Male)

 * LUCENE-3765: Stopset ctors that previously took Set or Map now take
   CharArraySet and CharArrayMap respectively. Previously the behavior was
   confusing, and sometimes different depending on the type of set, and
   ultimately a CharArraySet or CharArrayMap was always used anyway.
   (Robert Muir)

Bug fixes

 * LUCENE-3971: MappingCharFilter could return invalid final token position.
   (Dawid Weiss, Robert Muir)

 * LUCENE-3820: PatternReplaceCharFilter could return invalid token positions.
   (Dawid Weiss)

 * LUCENE-3969: Throw IAE on bad arguments that could cause confusing errors
   in CompoundWordTokenFilterBase, PatternTokenizer, PositionFilter,
   SnowballFilter, PathHierarchyTokenizer, ReversePathHierarchyTokenizer,
   WikipediaTokenizer, and KeywordTokenizer. ShingleFilter and
   CommonGramsFilter now populate PositionLengthAttribute. Fixed
   PathHierarchyTokenizer to reset() all state. Protect against AIOOBE in
   ReversePathHierarchyTokenizer if skip is large. Fixed wrong final offset
   calculation in PathHierarchyTokenizer.
   (Mike McCandless, Uwe Schindler, Robert Muir)

New Features

 * LUCENE-2341: A new analyzer/filter: Morfologik - a dictionary-driven
   lemmatizer (accurate stemmer) for Polish (includes morphosyntactic
   annotations). (Michał Dybizbański, Dawid Weiss)

 * LUCENE-2413: Consolidated Lucene/Solr analysis components into common.
   New features from Solr now available to Lucene users include:
   - o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring
     terms and phrases.
   - o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips
     HTML constructs.
   - o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits
     words into subwords and performs optional transformations on subword
     groups.
   - o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter
     which filters out Tokens at the same position and Term text as the
     previous token.
   - o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing
     whitespace from Tokens in the stream.
   - o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only
     keeps tokens with text contained in the required words (inverse of
     StopFilter).
   - o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that
     puts hyphenated words broken into two lines back together.
   - o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that
     applies capitalization rules to tokens.
   - o.a.l.analysis.pattern: Package for pattern-based analysis, containing a
     CharFilter, Tokenizer, and TokenFilter for transforming text with
     regexes.
   - o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports
     multi-word synonyms.
   - o.a.l.analysis.phonetic: Package for phonetic search, containing various
     phonetic encoders such as Double Metaphone.

   Some existing analysis components changed packages:
   - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
   - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
   - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
   - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
   - o.a.l.analysis.LowerCaseTokenizer ->
     o.a.l.analysis.core.LowerCaseTokenizer
   - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
   - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
   - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
   - o.a.l.analysis.WhitespaceAnalyzer ->
     o.a.l.analysis.core.WhitespaceAnalyzer
   - o.a.l.analysis.WhitespaceTokenizer ->
     o.a.l.analysis.core.WhitespaceTokenizer
   - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
   - o.a.l.analysis.ASCIIFoldingFilter ->
     o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
   - o.a.l.analysis.ISOLatin1AccentFilter ->
     o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
   - o.a.l.analysis.KeywordMarkerFilter ->
     o.a.l.analysis.miscellaneous.KeywordMarkerFilter
   - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
   - o.a.l.analysis.PerFieldAnalyzerWrapper ->
     o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
   - o.a.l.analysis.TeeSinkTokenFilter ->
     o.a.l.analysis.sinks.TeeSinkTokenFilter
   - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
   - o.a.l.analysis.BaseCharFilter ->
     o.a.l.analysis.charfilter.BaseCharFilter
   - o.a.l.analysis.MappingCharFilter ->
     o.a.l.analysis.charfilter.MappingCharFilter
   - o.a.l.analysis.NormalizeCharMap ->
     o.a.l.analysis.charfilter.NormalizeCharMap
   - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
   - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
   - o.a.l.analysis.ReusableAnalyzerBase ->
     o.a.l.analysis.util.ReusableAnalyzerBase
   - o.a.l.analysis.StopwordAnalyzerBase ->
     o.a.l.analysis.util.StopwordAnalyzerBase
   - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
   - o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
   - o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils

   All analyzers in contrib/analyzers and contrib/icu were moved to the
   analysis module. The 'smartcn' and 'stempel' components now depend on
   'common'. (Chris Male, Robert Muir)

 * SOLR-2764: Create a NorwegianLightStemmer and NorwegianMinimalStemmer
   (janhoy)
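
Migration note (illustrative only): the package moves listed under
LUCENE-2413 above are a one-to-one rename of fully qualified class names, so
import statements can be updated mechanically. The sketch below assumes that
"o.a.l" abbreviates "org.apache.lucene". ImportMigrator and its rewrite()
method are hypothetical names that are not part of Lucene, and the table
holds only a few of the moves for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): rewrites pre-move import lines. */
public class ImportMigrator {

    // Old fully qualified name -> new fully qualified name.
    // Only a subset of the moves listed in the change log is included here.
    private static final Map<String, String> MOVES = new LinkedHashMap<>();
    static {
        MOVES.put("org.apache.lucene.analysis.StopFilter",
                  "org.apache.lucene.analysis.core.StopFilter");
        MOVES.put("org.apache.lucene.analysis.PorterStemFilter",
                  "org.apache.lucene.analysis.en.PorterStemFilter");
        MOVES.put("org.apache.lucene.analysis.CharArraySet",
                  "org.apache.lucene.analysis.util.CharArraySet");
        MOVES.put("org.apache.lucene.util.CharacterUtils",
                  "org.apache.lucene.analysis.util.CharacterUtils");
    }

    /**
     * Returns the import line with a moved class name replaced.
     * Lines that are not plain imports of a listed class are returned unchanged.
     */
    public static String rewrite(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("import ") || !trimmed.endsWith(";")) {
            return line;
        }
        // Extract the class name between "import " and the trailing ';'.
        String name = trimmed.substring("import ".length(), trimmed.length() - 1).trim();
        String moved = MOVES.get(name);
        return moved == null ? line : "import " + moved + ";";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("import org.apache.lucene.analysis.StopFilter;"));
    }
}
```

Matching on the exact class name, rather than on a package prefix, keeps
classes that did not move (and classes already in their new package) untouched.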