# Apache Lucene Migration Guide
## TermsEnum is now fully abstract (LUCENE-8292) ##

TermsEnum is now fully abstract, so every non-abstract subclass must implement
all of its methods. TermsEnums that are not performance-critical can extend
BaseTermsEnum instead. The change was motivated by several performance issues
with FilterTermsEnum, which caused significant slowdowns and massive memory
consumption because it did not delegate all TermsEnum methods. See LUCENE-8292
and LUCENE-8662.

## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed ##

The RAM-based directory implementations have been removed (LUCENE-8474).
ByteBuffersDirectory can be used as a RAM-resident replacement, although it
is discouraged in favor of the default memory-mapped directory.

## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##

SpanQuery and PhraseQuery now always calculate their slop factor as
(1.0 / (1.0 + distance)). Payload factor calculation is performed by
PayloadDecoder in the queries module.

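The fixed slop formula above is simple enough to evaluate by hand. The sketch below only reproduces the arithmetic; the class and method names are hypothetical and are not part of Lucene.

```java
// Illustrative only: a stand-alone version of the formula quoted above.
// SlopFactorSketch and slopFactor are hypothetical names, not Lucene API.
class SlopFactorSketch {

    /** Returns 1.0 / (1.0 + distance): 1.0 for an exact match, smaller as the distance grows. */
    public static double slopFactor(int distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        // Print the factor for a few distances to show how quickly it decays.
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println("distance " + distance + " -> " + slopFactor(distance));
        }
    }
}
```

Code that previously customised this value through the removed computeXXXFactor methods can no longer do so, so any score expectations in tests may need to be recomputed with this formula.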
## Scorer must produce positive scores (LUCENE-7996) ##

Scorers are no longer allowed to produce negative scores. If you have custom
query implementations, make sure their score formulas can never produce
negative scores.

As a side-effect of this change, negative boosts are now rejected and
FunctionScoreQuery maps negative values to 0.

## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##

Instead use FunctionScoreQuery and a DoubleValuesSource implementation. BoostedQuery
and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
FunctionScoreQuery.boostByQuery(). To replace more complex calculations in
CustomScoreQuery, use the lucene-expressions module:

    // Bind each variable name used in the expression to a value source.
    SimpleBindings bindings = new SimpleBindings();
    bindings.add("score", DoubleValuesSource.SCORES);
    bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
    bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
    // Compile the scoring formula and wrap the input query with it.
    Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
    FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));

## Index options can no longer be changed dynamically (LUCENE-8134) ##

Changing index options on the fly now results in an
IllegalArgumentException. If a field is indexed
(FieldType.indexOptions() != IndexOptions.NONE) then all documents must have
the same index options for that field.

## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##

Use IndexSearcher.createWeight() instead: rewrite the query first, then pass
the rewritten query with a boost of 1f.

## Memory codecs removed (LUCENE-8267) ##

Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).

## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##

Caching everything is discouraged as it disables the ability to skip non-interesting documents.
ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.

## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE-7444) ##

To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
to the constructor.

## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##

English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
analysis-common module.

## TopDocs.maxScore removed ##

TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
an option to compute the maximum score when sorting by field. If you need to
know the maximum score for a query, the recommended approach is to run a
separate query:

    TopDocs topHits = searcher.search(query, 1);
    float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;

Thanks to other optimizations that were added to Lucene 8, this query will be
able to efficiently select the top-scoring document without having to visit
all matches.

## TopFieldCollector always assumes fillFields=true ##

Because filling sort values doesn't have a significant overhead, the fillFields
option has been removed from TopFieldCollector factory methods. Everything
behaves as if it were set to true.

## TopFieldCollector no longer takes a trackDocScores option ##

Computing scores at collection time is less efficient than running a second
request in order to only compute scores for documents that made it to the top
hits. As a consequence, the trackDocScores option has been removed and can be
replaced with the new TopFieldCollector#populateScores helper method.

## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##

Lucene 8 received optimizations for collection of top-k matches by not visiting
all matches. However, these optimizations won't help if all matches still need
to be visited in order to compute the total number of hits. As a consequence,
IndexSearcher's search and searchAfter methods were changed to only count hits
accurately up to 1,000, and TopDocs.totalHits was changed from a long to an
object that says whether the hit count is accurate or a lower bound of the
actual hit count.

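To make the "accurate or lower bound" distinction concrete, here is a small stand-alone model of the idea. It is a sketch under assumptions, not Lucene code: the class TotalHitsSketch and its count method are hypothetical, the field and enum names are merely modeled on Lucene's TotalHits, and the model assumes counting stops as soon as the threshold is passed.

```java
// Illustrative model only; TotalHitsSketch is a hypothetical stand-in, not Lucene's class.
class TotalHitsSketch {

    /** Whether the reported value is exact or only a lower bound of the real count. */
    public enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    public final long value;
    public final Relation relation;

    TotalHitsSketch(long value, Relation relation) {
        this.value = value;
        this.relation = relation;
    }

    /**
     * Simplified counting rule (an assumption of this sketch): counts are exact up to
     * the threshold; beyond it, counting stops and the value becomes a lower bound.
     */
    public static TotalHitsSketch count(long actualMatches, long threshold) {
        if (actualMatches <= threshold) {
            return new TotalHitsSketch(actualMatches, Relation.EQUAL_TO);
        }
        return new TotalHitsSketch(threshold + 1, Relation.GREATER_THAN_OR_EQUAL_TO);
    }

    public static void main(String[] args) {
        TotalHitsSketch small = count(42, 1000);
        TotalHitsSketch large = count(5000, 1000);
        System.out.println(small.value + " " + small.relation);
        System.out.println(large.value + " " + large.relation);
    }
}
```

The practical consequence for callers is that code which used to read totalHits as a plain number should now check the relation before displaying or comparing the value, for example by showing "more than N hits" when the value is only a lower bound.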
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##

This RAM-based directory implementation is an old piece of code that uses inefficient
thread synchronization primitives and can be mistaken for being "faster" than the
NIO-based MMapDirectory. It is deprecated and scheduled for removal in future
versions of Lucene. (LUCENE-8467, LUCENE-8438)

## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##

Scorer has a number of methods that should never be called from Collectors, for example
those that advance the underlying iterators. To hide these, LeafCollector.setScorer()
now takes a Scorable, an abstract class that Scorers can extend, with methods
docID() and score() (LUCENE-6228).

## Scorers must have non-null Weights ##

If a custom Scorer implementation does not have an associated Weight, it can probably
be replaced with a Scorable instead.

## Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time ##

Most code should just require recompilation, though possibly requiring some added casts.

## TokenStreamComponents is now final ##

Instead of overriding TokenStreamComponents#setReader() to customise analyzer
initialisation, you should now pass a Consumer<Reader> instance to the
TokenStreamComponents constructor.

## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##

LowerCaseTokenizer combined tokenization and filtering in a way that broke token
normalization, so it and its factory have been removed. Instead, use a
LetterTokenizer followed by a LowerCaseFilter.

## CharTokenizer no longer takes a normalizer function ##

CharTokenizer now only performs tokenization. To perform any type of filtering,
use a TokenFilter chain as you would with any other Tokenizer.

## Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery ##

Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or
FieldQuery, respectively, in order to support ToParent/ToChildBlockJoinQuery.

## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##

Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
TokenFilterFactory#normalize() returning a TokenFilter.

## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##

Scores computed by the BM25 similarity are lower than before because the k1+1
constant factor was removed from the numerator of the scoring formula.
Ordering of results is preserved unless scores are computed from multiple
fields using different similarities. The previous behaviour is now exposed
by the LegacyBM25Similarity class, which can be found in the lucene-misc jar.

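The effect of dropping the constant can be checked with a few lines of arithmetic. The sketch below evaluates only the term-frequency part of the textbook BM25 formula, with and without the k1+1 factor; it leaves out the IDF and boost terms and is not Lucene's implementation. The class name, method names, and the sample parameter values (k1 = 1.2, b = 0.75) are assumptions made for illustration.

```java
// Illustrative arithmetic only; Bm25FactorSketch is a hypothetical name, not Lucene API.
class Bm25FactorSketch {

    /** Term-frequency part of BM25 without the (k1 + 1) factor in the numerator. */
    public static double tfNorm(double freq, double k1, double b, double docLen, double avgDocLen) {
        return freq / (freq + k1 * (1 - b + b * docLen / avgDocLen));
    }

    /** The same quantity with the (k1 + 1) factor that the previous formula included. */
    public static double legacyTfNorm(double freq, double k1, double b, double docLen, double avgDocLen) {
        return (k1 + 1) * tfNorm(freq, k1, b, docLen, avgDocLen);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75, docLen = 100, avgDocLen = 100;
        // For each term frequency, print the value without and with the constant factor.
        for (double freq : new double[] {1, 5}) {
            System.out.println("freq " + freq
                + ": current=" + tfNorm(freq, k1, b, docLen, avgDocLen)
                + " legacy=" + legacyTfNorm(freq, k1, b, docLen, avgDocLen));
        }
    }
}
```

Because the two values differ only by the constant k1+1, ranking within a single similarity is unchanged. The ordering can shift only when scores from fields that use different similarities (and therefore different constants) are combined.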
## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##

Use IndexWriter#getDocStats() instead of #maxDoc() / #numDocs(); it offers a
consistent view of the document stats. Previously, calling two methods in order
to get point-in-time stats was subject to concurrent changes between the calls.