mirror of https://github.com/apache/lucene.git
490 lines
20 KiB
Plaintext
490 lines
20 KiB
Plaintext
Lucene contrib change Log
|
|
|
|
======================= Trunk (not yet released) =======================
|
|
|
|
Changes in backwards compatibility policy
|
|
|
|
* LUCENE-2100: All Analyzers in Lucene-contrib have been marked as final.
|
|
Analyzers should be only act as a composition of TokenStreams, users should
|
|
compose their own analyzers instead of subclassing existing ones.
|
|
(Simon Willnauer)
|
|
|
|
* LUCENE-2194, LUCENE-2201: Snowball APIs were upgraded to snowball revision
|
|
502 (with some local modifications for improved performance).
|
|
Index backwards compatibility and binary backwards compatibility is
|
|
preserved, but some protected/public member variables changed type. This
|
|
does NOT affect java code/class files produced by the snowball compiler,
|
|
but technically is a backwards compatibility break. (Robert Muir)
|
|
|
|
* LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers.
|
|
Be sure to remove any old obselete lucene-snowball jar files from your
|
|
classpath! (Robert Muir)
|
|
|
|
Changes in runtime behavior
|
|
|
|
* LUCENE-2117: SnowballAnalyzer uses TurkishLowerCaseFilter instead of
|
|
LowercaseFilter to correctly handle the unique Turkish casing behavior if
|
|
used with Version > 3.0 and the TurkishStemmer.
|
|
(Robert Muir via Simon Willnauer)
|
|
|
|
Bug fixes
|
|
|
|
* LUCENE-2199: ShingleFilter skipped over tri-gram shingles if outputUnigram
|
|
was set to false. (Simon Willnauer)
|
|
|
|
* LUCENE-2068: Fixed ReverseStringFilter which was not aware of supplementary
|
|
characters. During reverse the filter created unpaired surrogates, which
|
|
will be replaced by U+FFFD by the indexer, but not at query time. The filter
|
|
now reverses supplementary characters correctly if used with Version > 3.0.
|
|
(Simon Willnauer, Robert Muir)
|
|
|
|
* LUCENE-2144: Fix InstantiatedIndex to handle termDocs(null)
|
|
correctly (enumerate all non-deleted docs). (Karl Wettin via Mike
|
|
McCandless)
|
|
|
|
* LUCENE-2035: TokenSources.getTokenStream() does not assign positionIncrement.
|
|
(Christopher Morris via Mark Miller)
|
|
|
|
* LUCENE-2211: Fix missing clearAttributes() calls in contrib:
|
|
ShingleMatrix, PrefixAware, compounds, NGramTokenFilter,
|
|
EdgeNGramTokenFilter, Highlighter, and MemoryIndex.
|
|
(Uwe Schindler, Robert Muir)
|
|
|
|
* LUCENE-2207, LUCENE-2219: Fix incorrect offset calculations in end() for
|
|
CJKTokenizer, ChineseTokenizer, SmartChinese SentenceTokenizer,
|
|
and WikipediaTokenizer. (Koji Sekiguchi, Robert Muir)
|
|
|
|
API Changes
|
|
|
|
* LUCENE-2108: Add SpellChecker.close, to close the underlying
|
|
reader. (Eirik Bjørsnøs via Mike McCandless)
|
|
|
|
* LUCENE-2147: Spatial GeoHashUtils now always decode GeoHash strings
|
|
with full precision. GeoHash#decode_exactly(String) was merged into
|
|
GeoHash#decode(String). (Chris Male, Simon Willnauer)
|
|
|
|
* LUCENE-2165: Add a constructor to SnowballAnalyzer that takes a Set of
|
|
stopwords, and deprecate the String[] one. (Nick Burch via Robert Muir)
|
|
|
|
* LUCENE-2204: Change some package private classes/members to publicly accessible to implement
|
|
custom FragmentsBuilders. (Koji Sekiguchi)
|
|
|
|
New features
|
|
|
|
* LUCENE-2102: Add a Turkish LowerCase Filter. TurkishLowerCaseFilter handles
|
|
Turkish and Azeri unique casing behavior correctly.
|
|
(Ahmet Arslan, Robert Muir via Simon Willnauer)
|
|
|
|
* LUCENE-2039: Add a extensible query parser to contrib/misc.
|
|
ExtendableQueryParser enables arbitrary parser extensions based on a
|
|
customizable field naming scheme.
|
|
(Simon Willnauer)
|
|
|
|
* LUCENE-2108: Spellchecker now safely supports concurrent modifications to
|
|
the spell-index. Threads can safely obtain term suggestions while the spell-
|
|
index is rebuild, cleared or reset. Internal IndexSearcher instances remain
|
|
open until the last thread accessing them releases the reference.
|
|
(Simon Willnauer)
|
|
|
|
* LUCENE-2067: Add a Czech light stemmer. CzechAnalyzer will now stem words
|
|
when Version is set to 3.1 or higher. (Robert Muir)
|
|
|
|
* LUCENE-2062: Add a Bulgarian analyzer. (Robert Muir, Simon Willnauer)
|
|
|
|
* LUCENE-2206: Add Snowball's stopword lists for Danish, Dutch, English,
|
|
Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish,
|
|
and Swedish. These can be loaded with WordListLoader.getSnowballWordSet.
|
|
(Robert Muir, Simon Willnauer)
|
|
|
|
* LUCENE-2243: Add DisjunctionMaxQuery support for FastVectorHighlighter.
|
|
(Koji Sekiguchi)
|
|
|
|
* LUCENE-2218: ShingleFilter supports minimum shingle size, and the separator
|
|
character is now configurable. Its also up to 20% faster.
|
|
(Steven Rowe via Robert Muir)
|
|
|
|
* LUCENE-2234: Add a Hindi analyzer. (Robert Muir)
|
|
|
|
Build
|
|
|
|
* LUCENE-2124: Moved the JDK-based collation support from contrib/collation
|
|
into core, and moved the ICU-based collation support into contrib/icu.
|
|
(Steven Rowe, Robert Muir)
|
|
|
|
Optimizations
|
|
|
|
* LUCENE-2157: DelimitedPayloadTokenFilter no longer copies the buffer
|
|
over itsself. Instead it sets only the length. This patch also optimizes
|
|
the logic of the filter and uses NIO for IdentityEncoder. (Uwe Schindler)
|
|
|
|
* LUCENE-2084: Change IndexableBinaryStringTools to work on byte[] and char[]
|
|
directly, instead of Byte/CharBuffers, and modify ICUCollationKeyFilter to
|
|
take advantage of this for faster performance.
|
|
(Steven Rowe, Uwe Schindler, Robert Muir)
|
|
|
|
Test Cases
|
|
|
|
* LUCENE-2115: Cutover contrib tests to use Java5 generics. (Kay Kay
|
|
via Mike McCandless)
|
|
|
|
Other
|
|
|
|
* LUCENE-1845: Updated bdb-je jar from version 3.3.69 to 3.3.93.
|
|
(Simon Willnauer via Mike McCandless)
|
|
|
|
======================= Release 3.0.0 2009-11-25 =======================
|
|
|
|
Changes in backwards compatibility policy
|
|
|
|
* LUCENE-1257: Change some occurences of StringBuffer in public/protected
|
|
APIs of contrib/surround to StringBuilder.
|
|
(Paul Elschot via Uwe Schindler)
|
|
|
|
Changes in runtime behavior
|
|
|
|
* LUCENE-1966: Modified and cleaned the default Arabic stopwords list used
|
|
by ArabicAnalyzer. You'll need to fully re-index any previously created
|
|
indexes. (Basem Narmok via Robert Muir)
|
|
|
|
API Changes
|
|
|
|
* LUCENE-1936: Deprecated RussianLowerCaseFilter, because it transforms
|
|
text exactly the same as LowerCaseFilter. Please use LowerCaseFilter
|
|
instead, which has the same functionality. (Robert Muir)
|
|
|
|
* LUCENE-2051: Contrib Analyzer setters were deprecated and replaced
|
|
with ctor arguments / Version number. Also stop word lists
|
|
were unified. (Simon Willnauer)
|
|
|
|
Bug fixes
|
|
|
|
* LUCENE-1781: Fixed various issues with the lat/lng bounding box
|
|
distance filter created for radius search in contrib/spatial.
|
|
(Bill Bell via Mike McCandless)
|
|
|
|
* LUCENE-1939: IndexOutOfBoundsException at ShingleMatrixFilter's
|
|
Iterator#hasNext method on exhausted streams.
|
|
(Patrick Jungermann via Karl Wettin)
|
|
|
|
* LUCENE-1359: French analyzer did not support null field names.
|
|
(Andrew Lynch via Robert Muir)
|
|
|
|
New features
|
|
|
|
* LUCENE-1924: Added BalancedSegmentMergePolicy to contrib/misc,
|
|
which is a merge policy that tries to avoid doing very large
|
|
segment merges to give better search performance in a mixed
|
|
indexing/searching environment. (John Wang via Mike McCandless)
|
|
|
|
* LUCENE-1959: Add index splitting tools. The IndexSplitter tool works
|
|
on multi-segment (non optimized) indexes and it can copy specific
|
|
segments out of the index into a new index. It can also list the
|
|
segments in the index, and delete specified segments. (Jason Rutherglen via
|
|
Mike McCandless). MultiPassIndexSplitter can split any index into
|
|
any number of output parts, at the cost of doing multiple passes over
|
|
the input index. (Andrzej Bialecki)
|
|
|
|
* LUCENE-1993: Add maxDocFreq setting to MoreLikeThis, to exclude
|
|
from consideration terms that match more than the specified number
|
|
of documents. (Christian Steinert via Mike McCandless)
|
|
|
|
Optimizations
|
|
|
|
* LUCENE-1965, LUCENE-1962: Arabic-, Persian- and SmartChineseAnalyzer
|
|
loads default stopwords only once if accessed for the first time.
|
|
Previous versions were loading the stopword files each time a new
|
|
instance was created. This might improve performance for applications
|
|
creating lots of instances of these Analyzers. (Simon Willnauer)
|
|
|
|
Documentation
|
|
|
|
* LUCENE-1916: Translated documentation in the smartcn hhmm package.
|
|
(Patricia Peng via Robert Muir)
|
|
|
|
Build
|
|
|
|
* LUCENE-1904: Moved wordnet-based synonym support from contrib/memory
|
|
into contrib/wordnet. (Robert Muir)
|
|
|
|
* LUCENE-2031: Moved PatternAnalyzer from contrib/memory into
|
|
contrib/analyzers/common, under miscellaneous. (Robert Muir)
|
|
|
|
======================= Release 2.9.1 2009-11-06 =======================
|
|
|
|
Changes in backwards compatibility policy
|
|
|
|
* LUCENE-2002: Add required Version matchVersion argument when
|
|
constructing ComplexPhraseQueryParser and default (as of 2.9)
|
|
enablePositionIncrements to true to match StandardAnalyzer's
|
|
default. Also added required matchVersion to most of the analyzers
|
|
(Uwe Schindler, Mike McCandless)
|
|
|
|
Changes in runtime behavior
|
|
|
|
* LUCENE-1963: ArabicAnalyzer now lowercases before checking the stopword
|
|
list. This has no effect on Arabic text, but if you are using a custom
|
|
stopword list that contains some non-Arabic words, you'll need to fully
|
|
reindex. (DM Smith via Robert Muir)
|
|
|
|
Bug fixes
|
|
|
|
* LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause
|
|
StringIndexOutOfBoundsException. (Koji Sekiguchi)
|
|
|
|
* LUCENE-1929: Highlighter throws exception on NumericRangeQuery and does not
|
|
support deprecated RangeQuery. (Mark Miller)
|
|
|
|
* LUCENE-2001: Wordnet Syns2Index incorrectly parses synonyms that
|
|
contain a single quote. (Parag H. Dave via Robert Muir)
|
|
|
|
* LUCENE-2003: Highlighter doesn't respect position increments other than 1 with
|
|
PhraseQuerys. (Uwe Schindler, Mark Miller)
|
|
|
|
* LUCENE-1954: InstantiatedIndexWriter: Fixed ClassCastException with
|
|
NumericField because of incorrect unchecked cast: Document.getFields()
|
|
returns List<Fieldable>. (Bernd Fondermann via Uwe Schindler)
|
|
|
|
* LUCENE-2014: SmartChineseAnalyzer did not properly clear attributes
|
|
in WordTokenFilter. If enablePositionIncrements is set for StopFilter,
|
|
then this could create invalid position increments, causing IndexWriter
|
|
to crash. (Robert Muir, Uwe Schindler)
|
|
|
|
* LUCENE-2013: SpanRegexQuery does not work with QueryScorer.
|
|
(Benjamin Keil via Mark Miller)
|
|
|
|
======================= Release 2.9.0 2009-09-23 =======================
|
|
|
|
Changes in runtime behavior
|
|
|
|
* LUCENE-1505: Local lucene now uses org.apache.lucene.util.NumericUtils for all
|
|
number conversion. You'll need to fully re-index any previously created indexes.
|
|
This isn't a break in back-compatibility because local Lucene has not yet
|
|
been released. (Mike McCandless)
|
|
|
|
* LUCENE-1758: ArabicAnalyzer now uses the light10 algorithm, has a refined
|
|
default stopword list, and lowercases non-Arabic text.
|
|
You'll need to fully re-index any previously created indexes. This isn't a
|
|
break in back-compatibility because ArabicAnalyzer has not yet been
|
|
released. (Robert Muir)
|
|
|
|
|
|
API Changes
|
|
|
|
* LUCENE-1695: Update the Highlighter to use the new TokenStream API. This issue breaks backwards
|
|
compatibility with some public classes. If you have implemented custom Fragmenters or Scorers,
|
|
you will need to adjust them to work with the new TokenStream API. Rather than getting passed a
|
|
Token at a time, you will be given a TokenStream to init your impl with - store the Attributes
|
|
you are interested in locally and access them on each call to the method that used to pass a new
|
|
Token. Look at the included updated impls for examples. (Mark Miller)
|
|
|
|
* LUCENE-1460: Change contrib TokenStreams/Filters to use the new
|
|
TokenStream API. (Robert Muir, Michael Busch)
|
|
|
|
* LUCENE-1775, LUCENE-1903: Change remaining TokenFilters (shingle, prefix-suffix)
|
|
to use the new TokenStream API. ShingleFilter is much more efficient now,
|
|
it clones much less often and computes the tokens mostly on the fly now.
|
|
Also added more tests. (Robert Muir, Michael Busch, Uwe Schindler, Chris Harris)
|
|
|
|
* LUCENE-1685: The position aware SpanScorer has become the default scorer
|
|
for Highlighting. The SpanScorer implementation has replaced QueryScorer
|
|
and the old term highlighting QueryScorer has been renamed to
|
|
QueryTermScorer. Multi-term queries are also now expanded by default. If
|
|
you were previously rewriting the query for multi-term query highlighting,
|
|
you should no longer do that (unless you switch to using QueryTermScorer).
|
|
The SpanScorer API (now QueryScorer) has also been improved to more closely
|
|
match the API of the previous QueryScorer implementation. (Mark Miller)
|
|
|
|
* LUCENE-1793: Deprecate the custom encoding support in the Greek and Russian
|
|
Analyzers. If you need to index text in these encodings, please use Java's
|
|
character set conversion facilities (InputStreamReader, etc) during I/O,
|
|
so that Lucene can analyze this text as Unicode instead. (Robert Muir)
|
|
|
|
Bug fixes
|
|
|
|
* LUCENE-1423: InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBounds on empty index.
|
|
(Karl Wettin)
|
|
|
|
* LUCENE-1462: InstantiatedIndexWriter did not reset pre analyzed TokenStreams the
|
|
same way IndexWriter does. Parts of InstantiatedIndex was not Serializable.
|
|
(Karl Wettin)
|
|
|
|
* LUCENE-1510: InstantiatedIndexReader#norms methods throws NullPointerException on empty index.
|
|
(Karl Wettin, Robert Newson)
|
|
|
|
* LUCENE-1514: ShingleMatrixFilter#next(Token) easily throws a StackOverflowException
|
|
due to recursive invocation. (Karl Wettin)
|
|
|
|
* LUCENE-1548: Fix distance normalization in LevenshteinDistance to
|
|
not produce negative distances (Thomas Morton via Mike McCandless)
|
|
|
|
* LUCENE-1490: Fix latin1 conversion of HALFWIDTH_AND_FULLWIDTH_FORMS
|
|
characters to only apply to the correct subset (Daniel Cheng via
|
|
Mike McCandless)
|
|
|
|
* LUCENE-1576: Fix BrazilianAnalyzer to downcase tokens after
|
|
StandardTokenizer so that stop words with mixed case are filtered
|
|
out. (Rafael Cunha de Almeida, Douglas Campos via Mike McCandless)
|
|
|
|
* LUCENE-1491: EdgeNGramTokenFilter no longer stops on tokens shorter than minimum n-gram size.
|
|
(Todd Teak via Otis Gospodnetic)
|
|
|
|
* LUCENE-1683: Fixed JavaUtilRegexCapabilities (an impl used by
|
|
RegexQuery) to use Matcher.matches() not Matcher.lookingAt() so
|
|
that the regexp must match the entire string, not just a prefix.
|
|
(Trejkaz via Mike McCandless)
|
|
|
|
* LUCENE-1792: Fix new query parser to set rewrite method for
|
|
multi-term queries. (Luis Alves, Mike McCandless via Michael Busch)
|
|
|
|
* LUCENE-1828: Fix memory index to call TokenStream.reset() and
|
|
TokenStream.end(). (Tim Smith via Michael Busch)
|
|
|
|
* LUCENE-1912: Fix fast-vector-highlighter issue when two or more
|
|
terms are concatenated (Koji Sekiguchi via Mike McCandless)
|
|
|
|
New features
|
|
|
|
* LUCENE-1531: Added support for BoostingTermQuery to XML query parser. (Karl Wettin)
|
|
|
|
* LUCENE-1435: Added contrib/collation, a CollationKeyFilter
|
|
allowing you to convert tokens into CollationKeys encoded using
|
|
IndexableBinaryStringTools. This allows for faster RangeQuery when
|
|
a field needs to use a custom Collator. (Steven Rowe via Mike
|
|
McCandless)
|
|
|
|
* LUCENE-1591: EnWikiDocMaker, LineDocMaker, WriteLineDoc can now
|
|
read/write bz2 using Apache commons compress library. This means
|
|
you can download the .bz2 export from http://wikipedia.org and
|
|
immediately index it. (Shai Erera via Mike McCandless)
|
|
|
|
* LUCENE-1629: Add SmartChineseAnalyzer to contrib/analyzers. It
|
|
improves on CJKAnalyzer and ChineseAnalyzer by handling Chinese
|
|
sentences properly. SmartChineseAnalyzer uses a Hidden Markov
|
|
Model to tokenize Chinese words in a more intelligent way.
|
|
(Xiaoping Gao via Mike McCandless)
|
|
|
|
* LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
|
|
|
|
* LUCENE-1578: Support for loading unoptimized readers to the
|
|
constructor of InstantiatedIndex. (Karl Wettin)
|
|
|
|
* LUCENE-1704: Allow specifying the Tidy configuration file when
|
|
parsing HTML docs with contrib/ant. (Keith Sprochi via Mike
|
|
McCandless)
|
|
|
|
* LUCENE-1522: Added contrib/fast-vector-highlighter, a new alternative
|
|
highlighter. (Koji Sekiguchi via Mike McCandless)
|
|
|
|
* LUCENE-1740: Added "analyzer" command to Lucli, enabling changing
|
|
the analyzer from the default StandardAnalyzer. (Bernd Fondermann
|
|
via Mike McCandless)
|
|
|
|
* LUCENE-1272: Add get/setBoost to MoreLikeThis. (Jonathan
|
|
Leibiusky via Mike McCandless)
|
|
|
|
* LUCENE-1745: Added constructors to JakartaRegexpCapabilities and
|
|
JavaUtilRegexCapabilities as well as static flags to support
|
|
configuring a RegexCapabilities implementation with the
|
|
implementation-specific modifier flags. Allows for callers to
|
|
customize the RegexQuery using the implementation-specific options
|
|
and fine tune how regular expressions are compiled and
|
|
matched. (Marc Zampetti zampettim@aim.com via Mike McCandless)
|
|
|
|
* LUCENE-1567: Added a new QueryParser framework, that allows
|
|
implementing a new query syntax in a flexible and efficient way.
|
|
This new QueryParser will be moved to Lucene's core in release
|
|
3.0 and will then replace the current core QueryParser, which
|
|
has been deprecated with this patch.
|
|
(Luis Alves and Adriano Campos via Michael Busch)
|
|
|
|
* LUCENE-1486: Added ComplexPhraseQueryParser, an extension of QueryParser
|
|
that allows a subset of the Lucene query language to be embedded in
|
|
PhraseQuerys. Wildcard, Range, and Fuzzy queries, as well as limited
|
|
boolean logic, can be used within quote operators with this parser, ie:
|
|
"(jo* -john) smyth~". (Mark Harwood via Mark Miller)
|
|
|
|
* Added web-based demo of functionality in contrib's XML Query Parser
|
|
packaged as War file (Mark Harwood)
|
|
|
|
* LUCENE-1406: Added Arabic analyzer. (Robert Muir via Grant Ingersoll)
|
|
|
|
* LUCENE-1628: Added Persian analyzer. (Robert Muir)
|
|
|
|
* LUCENE-1813: Add option to ReverseStringFilter to mark reversed tokens.
|
|
(Andrzej Bialecki via Robert Muir)
|
|
|
|
Optimizations
|
|
|
|
* LUCENE-1643: Re-use the collation key (RawCollationKey) for
|
|
better performance, in ICUCollationKeyFilter. (Robert Muir via
|
|
Mike McCandless)
|
|
|
|
* LUCENE-1794: Implement TokenStream reuse for contrib Analyzers,
|
|
and implement reset() for TokenStreams to support reuse. (Robert Muir)
|
|
|
|
Documentation
|
|
|
|
* LUCENE-1876: added missing package level documentation for numerous
|
|
contrib packages.
|
|
(Steven Rowe & Robert Muir)
|
|
|
|
Build
|
|
|
|
* LUCENE-1728: Split contrib/analyzers into common and smartcn modules.
|
|
Contrib/analyzers now builds an additional lucene-smartcn Jar file. All
|
|
smartcn classes are not included in the lucene-analyzers JAR file.
|
|
(Robert Muir via Simon Willnauer)
|
|
|
|
* LUCENE-1829: Fix contrib query parser to properly create javacc files.
|
|
(Jan-Pascal and Luis Alves via Michael Busch)
|
|
|
|
Test Cases
|
|
|
|
|
|
======================= Release 2.4.0 2008-10-06 =======================
|
|
|
|
Changes in runtime behavior
|
|
|
|
(None)
|
|
|
|
API Changes
|
|
|
|
1.
|
|
|
|
(None)
|
|
|
|
Bug fixes
|
|
|
|
1. LUCENE-1312: Added full support for InstantiatedIndexReader#getFieldNames()
|
|
and tests that assert that deleted documents behaves as they should (they did).
|
|
(Jason Rutherglen, Karl Wettin)
|
|
|
|
2. LUCENE-1318: InstantiatedIndexReader.norms(String, b[], int) didn't treat
|
|
the array offset right. (Jason Rutherglen via Karl Wettin)
|
|
|
|
New features
|
|
|
|
1. LUCENE-1320: ShingleMatrixFilter, multidimensional shingle token filter. (Karl Wettin)
|
|
|
|
2. LUCENE-1142: Updated Snowball package, org.tartarus distribution revision 500.
|
|
Introducing Hungarian, Turkish and Romanian support, updated older stemmers
|
|
and optimized (reflectionless) SnowballFilter.
|
|
IMPORTANT NOTICE ON BACKWARDS COMPATIBILITY: an index created using the 2.3.2 (or older)
|
|
might not be compatible with these updated classes as some algorithms have changed.
|
|
(Karl Wettin)
|
|
|
|
3. LUCENE-1016: TermVectorAccessor, transparent vector space access via stored vectors
|
|
or by resolving the inverted index. (Karl Wettin)
|
|
|
|
Documentation
|
|
|
|
(None)
|
|
|
|
Build
|
|
|
|
(None)
|
|
|
|
Test Cases
|
|
|
|
(None)
|