lucene/modules/analysis/CHANGES.txt

Analysis Module Change Log
For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions
======================= Trunk (not yet released) =======================
API Changes
* LUCENE-3820: Deprecated PatternReplaceCharFilter constructors that accept pattern
matching bounds. The input is now buffered and matched in one pass. (Dawid Weiss)
* LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor
of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir)
* LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir)
* LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles
can be generated. (Chris Harris via Steven Rowe)
* LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to
use pure byte keys when Version >= 4.0. This cuts sort key size approximately
in half. (Robert Muir)
* LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male)
* LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated methods
since they prevented reuse. Stopwords are now generated at instantiation through
the Analyzer's constructors. (Chris Male)
* LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and PerFieldAnalyzerWrapper.addAnalyzer
since they prevent reuse. Both Analyzers should be configured at instantiation.
(Chris Male)
* LUCENE-3765: Stopword-set constructors that previously took Set<?> or Map<?,String>
now take CharArraySet and CharArrayMap respectively. Previously the behavior was
confusing and sometimes differed depending on the type of set, and ultimately a
CharArraySet or CharArrayMap was always used anyway. (Robert Muir)
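As an illustration of the LUCENE-1370 entry above, here is a minimal sketch of the
unigram fallback in plain Java. It is not the ShingleFilter implementation: the class
name, method name, and parameters are hypothetical, and it assumes a single fixed
shingle size with a space as the token separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lucene code: fixed-size word shingles with an
// optional fallback to unigrams when the input is too short to shingle.
final class ShingleSketch {

    static List<String> shingles(List<String> tokens, int size, boolean unigramsIfNoShingles) {
        if (size < 2) {
            throw new IllegalArgumentException("size must be >= 2, got " + size);
        }
        List<String> out = new ArrayList<>();
        // Emit every run of `size` adjacent tokens joined by a space.
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        // Fallback: no shingle could be built, so pass the single tokens through.
        if (out.isEmpty() && unigramsIfNoShingles) {
            out.addAll(tokens);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("please", "divide", "this"), 2, true)); // [please divide, divide this]
        System.out.println(shingles(Arrays.asList("please"), 2, true));                   // [please]
        System.out.println(shingles(Arrays.asList("please"), 2, false));                  // []
    }
}
```

Without the fallback, a one-token input produces no output at all, which is the case
the option is meant to cover.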
Bug fixes
* LUCENE-3971: MappingCharFilter could return invalid final token position.
(Dawid Weiss, Robert Muir)
* LUCENE-3820: PatternReplaceCharFilter could return invalid token positions.
(Dawid Weiss)
* LUCENE-3969: Throw IAE on bad arguments that could cause confusing errors in
CompoundWordTokenFilterBase, PatternTokenizer, PositionFilter,
SnowballFilter, PathHierarchyTokenizer, ReversePathHierarchyTokenizer,
WikipediaTokenizer, and KeywordTokenizer. ShingleFilter and
CommonGramsFilter now populate PositionLengthAttribute. Fixed
PathHierarchyTokenizer to reset() all state. Protect against AIOOBE in
ReversePathHierarchyTokenizer if skip is large. Fixed wrong final
offset calculation in PathHierarchyTokenizer.
(Mike McCandless, Uwe Schindler, Robert Muir)
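The LUCENE-3969 fixes above follow a fail-fast pattern: reject bad arguments with an
IllegalArgumentException up front, and bound loops so that a large skip cannot index
past the end of an array. A minimal sketch of that pattern in plain Java follows. It is
not the PathHierarchyTokenizer code; the class, the method, and the exact skip
semantics are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Lucene code: emit cumulative path prefixes,
// validating arguments first and tolerating an oversized skip.
final class PathPrefixSketch {

    static List<String> prefixes(String path, char delimiter, int skip) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        if (skip < 0) {
            throw new IllegalArgumentException("skip must be >= 0, got " + skip);
        }
        // Split on the literal delimiter and drop empty parts (e.g. a leading '/').
        List<String> parts = new ArrayList<>();
        for (String p : path.split(Pattern.quote(String.valueOf(delimiter)))) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        List<String> out = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        // Starting at `skip` and testing against size() means a skip larger
        // than the number of parts yields an empty result, not an exception.
        for (int i = skip; i < parts.size(); i++) {
            prefix.append(delimiter).append(parts.get(i));
            out.add(prefix.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("/a/b/c", '/', 0));  // [/a, /a/b, /a/b/c]
        System.out.println(prefixes("/a/b/c", '/', 1));  // [/b, /b/c]
        System.out.println(prefixes("/a/b/c", '/', 10)); // []
    }
}
```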
New Features
* LUCENE-2341: A new analyzer/filter: Morfologik - a dictionary-driven lemmatizer
(accurate stemmer) for Polish (includes morphosyntactic annotations).
(Michał Dybizbański, Dawid Weiss)
* LUCENE-2413: Consolidated Lucene/Solr analysis components into common.
New features from Solr now available to Lucene users include:
- o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring terms
and phrases.
- o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips HTML
constructs.
- o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits words
into subwords and performs optional transformations on subword groups.
- o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter which
filters out Tokens that have the same position and Term text as the previous token.
- o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing whitespace
from Tokens in the stream.
- o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only keeps tokens
with text contained in the required words (inverse of StopFilter).
- o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that rejoins
hyphenated words that were broken across two lines.
- o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that applies
capitalization rules to tokens.
- o.a.l.analysis.pattern: Package for pattern-based analysis, containing a
CharFilter, Tokenizer, and TokenFilter for transforming text with regexes.
- o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports multi-word
synonyms.
- o.a.l.analysis.phonetic: Package for phonetic search, containing various
phonetic encoders such as Double Metaphone.
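To make the RemoveDuplicatesTokenFilter description in the list above concrete, here
is a minimal sketch of position-based de-duplication in plain Java. It is not the
Lucene class: Tok and removeDuplicates are hypothetical names, and the sketch assumes
that a position increment of 0 means "same position as the preceding token". It
compares each token against every token already seen at that position, which is a
slightly broader reading of "the previous token".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene code: drop tokens that repeat a term
// already emitted at the same position.
final class DedupSketch {

    // Minimal stand-in for a token: its text and its position increment.
    static final class Tok {
        final String text;
        final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    static List<Tok> removeDuplicates(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                // The stream advanced to a new position: forget earlier terms.
                seenAtPosition.clear();
            }
            // add() returns false when the term was already seen here.
            if (seenAtPosition.add(t.text)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = new ArrayList<>();
        in.add(new Tok("tv", 1));
        in.add(new Tok("television", 0)); // synonym stacked on "tv"
        in.add(new Tok("tv", 0));         // duplicate at the same position
        in.add(new Tok("set", 1));
        for (Tok t : removeDuplicates(in)) {
            System.out.println(t.text + " (+" + t.posInc + ")");
        }
    }
}
```

Only tokens with a position increment of 0 can be dropped, so removing a duplicate
never shifts the positions of the tokens that remain.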
Some existing analysis components changed packages:
- o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
- o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
- o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
- o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
- o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
- o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
- o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
- o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
- o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
- o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
- o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
- o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
- o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
- o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
- o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
- o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
- o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
- o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
- o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
- o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
- o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
- o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
- o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
- o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
- o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
- o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
- o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
- o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils
All analyzers in contrib/analyzers and contrib/icu were moved to the
analysis module. The 'smartcn' and 'stempel' components now depend on 'common'.
(Chris Male, Robert Muir)
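For code affected by the package moves listed under LUCENE-2413, a migration can be
sketched as a lookup from old to new fully qualified names applied to import
statements. The helper below is a hypothetical illustration in plain Java, not a tool
shipped with Lucene. It covers only four of the entries above, expands o.a.l to
org.apache.lucene, and assumes one single-class import per statement (no wildcard or
static imports).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rewrite imports for a few of the relocated classes.
final class ImportMigrationSketch {

    // Old name -> new name, taken from the list of package changes above.
    static final Map<String, String> MOVED = new LinkedHashMap<>();
    static {
        MOVED.put("org.apache.lucene.analysis.StopFilter",
                  "org.apache.lucene.analysis.core.StopFilter");
        MOVED.put("org.apache.lucene.analysis.PorterStemFilter",
                  "org.apache.lucene.analysis.en.PorterStemFilter");
        MOVED.put("org.apache.lucene.analysis.CharArraySet",
                  "org.apache.lucene.analysis.util.CharArraySet");
        MOVED.put("org.apache.lucene.util.CharacterUtils",
                  "org.apache.lucene.analysis.util.CharacterUtils");
    }

    static String migrate(String source) {
        String result = source;
        for (Map.Entry<String, String> e : MOVED.entrySet()) {
            // Matching the whole statement, semicolon included, avoids touching
            // longer class names and imports that were already migrated.
            result = result.replace("import " + e.getKey() + ";",
                                    "import " + e.getValue() + ";");
        }
        return result;
    }

    public static void main(String[] args) {
        String before = "import org.apache.lucene.analysis.StopFilter;\n"
                      + "import org.apache.lucene.util.CharacterUtils;\n";
        System.out.print(migrate(before));
    }
}
```

Because each replacement targets the complete old statement, running migrate twice
on the same source gives the same result as running it once.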
* SOLR-2764: Added NorwegianLightStemmer and NorwegianMinimalStemmer. (janhoy)