Analysis Module Change Log

For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions

======================= Trunk (not yet released) =======================

API Changes

* LUCENE-3820: Deprecated constructors accepting pattern matching bounds. The input
  is buffered and matched in one pass. (Dawid Weiss)

* LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor
  of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir)

* LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir)

* LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles
  can be generated. (Chris Harris via Steven Rowe)
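
  The LUCENE-1370 behavior can be pictured with a minimal plain-Java sketch.
  It does not use ShingleFilter itself; the class name, method name, and
  parameters below are hypothetical and chosen only for illustration. The
  idea shown: when the input is too short to form any shingle, the single
  tokens are emitted instead of nothing, but only if the fallback is enabled.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; this is not Lucene's ShingleFilter. */
final class ShingleSketch {

    /**
     * Joins every run of {@code size} adjacent tokens into one shingle.
     * If no shingle could be formed and the fallback flag is set,
     * the original single tokens (unigrams) are returned instead.
     */
    static List<String> shingles(List<String> tokens, int size,
                                 boolean unigramsIfNoShingles) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        if (out.isEmpty() && unigramsIfNoShingles) {
            out.addAll(tokens); // fallback: too few tokens for any shingle
        }
        return out;
    }

    public static void main(String[] args) {
        // Enough tokens: bigram shingles are produced.
        System.out.println(shingles(List.of("please", "divide", "this"), 2, true));
        // Too few tokens, fallback off: nothing is produced.
        System.out.println(shingles(List.of("please"), 2, false));
        // Too few tokens, fallback on: the lone unigram is produced.
        System.out.println(shingles(List.of("please"), 2, true));
    }
}
```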

* LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to
  use pure byte keys when Version >= 4.0. This cuts sort key size approximately
  in half. (Robert Muir)
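
  As a rough illustration of what a raw byte sort key is, the sketch below
  uses only the JDK's java.text.Collator, not the Lucene analyzers. The
  class and method names are hypothetical. It shows that comparing the key
  bytes as unsigned values gives the same ordering as the collator, which
  is the property a byte-keyed index term relies on.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/** JDK-only illustration; this is not Lucene's CollationKeyAnalyzer. */
final class CollationKeyBytesSketch {

    /** Returns the collation key for the text as raw bytes. */
    static byte[] sortKey(Collator collator, String text) {
        return collator.getCollationKey(text).toByteArray();
    }

    /** Compares two keys byte by byte, treating each byte as unsigned. */
    static int compareKeys(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] apple = sortKey(collator, "apple");
        byte[] banana = sortKey(collator, "Banana");
        // The byte-level order should agree with the collator's own order.
        boolean agrees = Integer.signum(compareKeys(apple, banana))
                == Integer.signum(collator.compare("apple", "Banana"));
        System.out.println("byte order agrees with collator: " + agrees);
    }
}
```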

* LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male)

* LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated methods
  since they prevented reuse. Stopwords are now generated at instantiation through
  the Analyzer's constructors. (Chris Male)

* LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and PerFieldAnalyzerWrapper.addAnalyzer
  since they prevent reuse. Both Analyzers should be configured at instantiation.
  (Chris Male)
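
  The "configure at instantiation" pattern behind LUCENE-3431 and LUCENE-3434
  can be sketched in plain Java. This is an assumption-laden illustration,
  not the Lucene classes: PerFieldConfigSketch is a hypothetical name, and
  plain strings stand in for analyzers. All configuration arrives through
  the constructor and is copied, so the object cannot change after it has
  been handed out for reuse.

```java
import java.util.Map;

/** Hypothetical stand-in for a per-field wrapper; strings play the analyzers. */
final class PerFieldConfigSketch {
    private final String defaultAnalyzerName;
    private final Map<String, String> perField;

    /** Everything is supplied up front; there are no setters or add methods. */
    PerFieldConfigSketch(String defaultAnalyzerName, Map<String, String> perField) {
        this.defaultAnalyzerName = defaultAnalyzerName;
        // Unmodifiable copy: later changes to the caller's map have no effect.
        this.perField = Map.copyOf(perField);
    }

    /** Returns the field-specific choice, or the default when none was given. */
    String analyzerFor(String field) {
        return perField.getOrDefault(field, defaultAnalyzerName);
    }

    public static void main(String[] args) {
        PerFieldConfigSketch config =
                new PerFieldConfigSketch("standard", Map.of("id", "keyword"));
        System.out.println(config.analyzerFor("id"));   // keyword
        System.out.println(config.analyzerFor("body")); // standard
    }
}
```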

* LUCENE-3765: Stopset ctors that previously took Set<?> or Map<?,String> now take
  CharArraySet and CharArrayMap respectively. Previously the behavior was confusing,
  and sometimes different depending on the type of set, and ultimately a CharArraySet
  or CharArrayMap was always used anyway. (Robert Muir)

Bug fixes

* LUCENE-3971: MappingCharFilter could return invalid final token position.
  (Dawid Weiss, Robert Muir)

* LUCENE-3820: PatternReplaceCharFilter could return invalid token positions.
  (Dawid Weiss)

* LUCENE-3969: Throw IAE on bad arguments that could cause confusing errors in
  CompoundWordTokenFilterBase, PatternTokenizer, PositionFilter,
  SnowballFilter, PathHierarchyTokenizer, ReversePathHierarchyTokenizer,
  WikipediaTokenizer, and KeywordTokenizer. ShingleFilter and
  CommonGramsFilter now populate PositionLengthAttribute. Fixed
  PathHierarchyTokenizer to reset() all state. Protect against AIOOBE in
  ReversePathHierarchyTokenizer if skip is large. Fixed wrong final
  offset calculation in PathHierarchyTokenizer.
  (Mike McCandless, Uwe Schindler, Robert Muir)
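
  The fail-fast argument checking described in LUCENE-3969 might look roughly
  like the sketch below. The class, its two parameters, and the messages are
  hypothetical and are not taken from the Lucene sources; the point is only
  that a bad value is rejected with IllegalArgumentException in the
  constructor rather than surfacing later as a confusing error such as an
  ArrayIndexOutOfBoundsException.

```java
/** Hypothetical tokenizer settings; not an actual Lucene class. */
final class TokenizerArgsSketch {
    final int bufferSize;
    final int skip;

    TokenizerArgsSketch(int bufferSize, int skip) {
        // Reject bad values immediately, with a message naming the argument.
        if (bufferSize <= 0) {
            throw new IllegalArgumentException(
                    "bufferSize must be positive, got " + bufferSize);
        }
        if (skip < 0) {
            throw new IllegalArgumentException(
                    "skip must not be negative, got " + skip);
        }
        this.bufferSize = bufferSize;
        this.skip = skip;
    }

    public static void main(String[] args) {
        System.out.println(new TokenizerArgsSketch(1024, 0).bufferSize);
        try {
            new TokenizerArgsSketch(1024, -1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```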

New Features

* LUCENE-2341: A new analyzer/filter: Morfologik - a dictionary-driven lemmatizer
  (accurate stemmer) for Polish (includes morphosyntactic annotations).
  (Michał Dybizbański, Dawid Weiss)

* LUCENE-2413: Consolidated Lucene/Solr analysis components into common.
  New features from Solr now available to Lucene users include:
  - o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring terms
    and phrases.
  - o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips HTML
    constructs.
  - o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits words
    into subwords and performs optional transformations on subword groups.
  - o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter which
    filters out Tokens with the same position and Term text as the previous token.
  - o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing whitespace
    from Tokens in the stream.
  - o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only keeps tokens
    with text contained in the required words (inverse of StopFilter).
  - o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that puts
    hyphenated words broken into two lines back together.
  - o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that applies
    capitalization rules to tokens.
  - o.a.l.analysis.pattern: Package for pattern-based analysis, containing a
    CharFilter, Tokenizer, and TokenFilter for transforming text with regexes.
  - o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports multi-word
    synonyms.
  - o.a.l.analysis.phonetic: Package for phonetic search, containing various
    phonetic encoders such as Double Metaphone.

  Some existing analysis components changed packages:
  - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
  - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
  - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
  - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
  - o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
  - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
  - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
  - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
  - o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
  - o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
  - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
  - o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
  - o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
  - o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
  - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
  - o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
  - o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
  - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
  - o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
  - o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
  - o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
  - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
  - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
  - o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
  - o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
  - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
  - o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
  - o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils

  All analyzers in contrib/analyzers and contrib/icu were moved to the
  analysis module. The 'smartcn' and 'stempel' components now depend on 'common'.
  (Chris Male, Robert Muir)
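
  The LUCENE-2413 entry describes KeepWordFilter as the inverse of StopFilter.
  A minimal plain-Java sketch of that relationship follows. It works on
  lists of strings rather than a Lucene TokenStream, and the class and
  method names are hypothetical: one method keeps only listed words, the
  other drops them.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Plain-Java illustration; these are not the Lucene filters. */
final class KeepWordSketch {

    /** Keeps only the tokens whose text is in the given word set. */
    static List<String> keepWords(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(words::contains)
                .collect(Collectors.toList());
    }

    /** Inverse operation: drops the tokens whose text is in the word set. */
    static List<String> dropWords(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(t -> !words.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "fox");
        Set<String> words = Set.of("quick", "fox");
        System.out.println(keepWords(tokens, words)); // [quick, fox]
        System.out.println(dropWords(tokens, words)); // [the]
    }
}
```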

* SOLR-2764: Create a NorwegianLightStemmer and NorwegianMinimalStemmer (janhoy)