lucene/lucene/MIGRATE.txt

617 lines
25 KiB
Plaintext

LUCENE-2380: FieldCache.getStrings/Index --> FieldCache.getDocTerms/Index
* The field values returned when sorting by SortField.STRING are now
BytesRef. You can call value.utf8ToString() to convert back to
string, if necessary.
* In FieldCache, getStrings (returning String[]) has been replaced
with getTerms (returning a FieldCache.DocTerms instance).
DocTerms provides a getTerm method, taking a docID and a BytesRef
to fill (which must not be null), and it fills it in with the
reference to the bytes for that term.
If you had code like this before:
String[] values = FieldCache.DEFAULT.getStrings(reader, field);
...
String aValue = values[docID];
you can do this instead:
DocTerms values = FieldCache.DEFAULT.getTerms(reader, field);
...
BytesRef term = new BytesRef();
String aValue = values.getTerm(docID, term).utf8ToString();
Note however that it can be costly to convert to String, so it's
better to work directly with the BytesRef.
* Similarly, in FieldCache, getStringIndex (returning a StringIndex
instance, with direct arrays int[] order and String[] lookup) has
been replaced with getTermsIndex (returning a
FieldCache.DocTermsIndex instance). DocTermsIndex provides the
getOrd(int docID) method to lookup the int order for a document,
lookup(int ord, BytesRef reuse) to lookup the term from a given
order, and the sugar method getTerm(int docID, BytesRef reuse)
which internally calls getOrd and then lookup.
If you had code like this before:
StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, field);
...
int ord = idx.order[docID];
String aValue = idx.lookup[ord];
you can do this instead:
DocTermsIndex idx = FieldCache.DEFAULT.getTermsIndex(reader, field);
...
int ord = idx.getOrd(docID);
BytesRef term = new BytesRef();
String aValue = idx.lookup(ord, term).utf8ToString();
Note however that it can be costly to convert to String, so it's
better to work directly with the BytesRef.
DocTermsIndex also has a getTermsEnum() method, which returns an
iterator (TermsEnum) over the term values in the index (ie,
iterates ord = 0..numOrd()-1).
* StringComparatorLocale is now more CPU costly than it was before
(it was already very CPU costly since it does not compare using
indexed collation keys; use CollationKeyFilter for better
performance), since it converts BytesRef -> String on the fly.
Also, the field values returned when sorting by SortField.STRING
are now BytesRef.
* FieldComparator.StringOrdValComparator has been renamed to
TermOrdValComparator, and now uses BytesRef for its values.
Likewise for StringValComparator, renamed to TermValComparator.
This means when sorting by SortField.STRING or
SortField.STRING_VAL (or directly invoking these comparators) the
values returned in the FieldDoc.fields array will be BytesRef not
String. You can call the .utf8ToString() method on the BytesRef
instances, if necessary.
LUCENE-1458, LUCENE-2111: Flexible Indexing
Flexible indexing changed the low level fields/terms/docs/positions
enumeration APIs. Here are the major changes:
* Terms are now binary in nature (arbitrary byte[]), represented
by the BytesRef class (which provides an offset + length "slice"
into an existing byte[]).
* Fields are separately enumerated (FieldsEnum) from the terms
within each field (TermEnum). So instead of this:
TermEnum termsEnum = ...;
while(termsEnum.next()) {
Term t = termsEnum.term();
System.out.println("field=" + t.field() + "; text=" + t.text());
}
Do this:
FieldsEnum fieldsEnum = ...;
String field;
while((field = fieldsEnum.next()) != null) {
TermsEnum termsEnum = fieldsEnum.terms();
BytesRef text;
while((text = termsEnum.next()) != null) {
System.out.println("field=" + field + "; text=" + text.utf8ToString());
}
* TermDocs is renamed to DocsEnum. Instead of this:
while(td.next()) {
int doc = td.doc();
...
}
do this:
int doc;
while((doc = td.next()) != DocsEnum.NO_MORE_DOCS) {
...
}
Instead of this:
if (td.skipTo(target)) {
int doc = td.doc();
...
}
do this:
if ((doc=td.advance(target)) != DocsEnum.NO_MORE_DOCS) {
...
}
The bulk read API has also changed. Instead of this:
int[] docs = new int[256];
int[] freqs = new int[256];
while(true) {
int count = td.read(docs, freqs)
if (count == 0) {
break;
}
// use docs[i], freqs[i]
}
do this:
DocsEnum.BulkReadResult bulk = td.getBulkResult();
while(true) {
int count = td.read();
if (count == 0) {
break;
}
// use bulk.docs.ints[i] and bulk.freqs.ints[i]
}
* TermPositions is renamed to DocsAndPositionsEnum, and no longer
extends the docs only enumerator (DocsEnum).
* Deleted docs are no longer implicitly filtered from
docs/positions enums. Instead, you pass a Bits
skipDocs (set bits are skipped) when obtaining the enums. Also,
you can now ask a reader for its deleted docs.
* The docs/positions enums cannot seek to a term. Instead,
TermsEnum is able to seek, and then you request the
docs/positions enum from that TermsEnum.
* TermsEnum's seek method returns more information. So instead of
this:
Term t;
TermEnum termEnum = reader.terms(t);
if (t.equals(termEnum.term())) {
...
}
do this:
TermsEnum termsEnum = ...;
BytesRef text;
if (termsEnum.seek(text) == TermsEnum.SeekStatus.FOUND) {
...
}
SeekStatus also contains END (enumerator is done) and NOT_FOUND
(term was not found but enumerator is now positioned to the next
term).
* TermsEnum has an ord() method, returning the long numeric
ordinal (ie, first term is 0, next is 1, and so on) for the term
it's not positioned to. There is also a corresponding seek(long
ord) method. Note that these methods are optional; in
particular the MultiFields TermsEnum does not implement them.
How you obtain the enums has changed. The primary entry point is
the Fields class. If you know your reader is a single segment
reader, do this:
Fields fields = reader.Fields();
if (fields != null) {
...
}
If the reader might be multi-segment, you must do this:
Fields fields = MultiFields.getFields(reader);
if (fields != null) {
...
}
The fields may be null (eg if the reader has no fields).
Note that the MultiFields approach entails a performance hit on
MultiReaders, as it must merge terms/docs/positions on the fly. It's
generally better to instead get the sequential readers (use
oal.util.ReaderUtil) and then step through those readers yourself,
if you can (this is how Lucene drives searches).
If you pass a SegmentReader to MultiFields.fiels it will simply
return reader.fields(), so there is no performance hit in that
case.
Once you have a non-null Fields you can do this:
Terms terms = fields.terms("field");
if (terms != null) {
...
}
The terms may be null (eg if the field does not exist).
Once you have a non-null terms you can get an enum like this:
TermsEnum termsEnum = terms.iterator();
The returned TermsEnum will not be null.
You can then .next() through the TermsEnum, or seek. If you want a
DocsEnum, do this:
Bits liveDocs = reader.getLiveDocs();
DocsEnum docsEnum = null;
docsEnum = termsEnum.docs(liveDocs, docsEnum);
You can pass in a prior DocsEnum and it will be reused if possible.
Likewise for DocsAndPositionsEnum.
IndexReader has several sugar methods (which just go through the
above steps, under the hood). Instead of:
Term t;
TermDocs termDocs = reader.termDocs();
termDocs.seek(t);
do this:
String field;
BytesRef text;
DocsEnum docsEnum = reader.termDocsEnum(reader.getLiveDocs(), field, text);
Likewise for DocsAndPositionsEnum.
* LUCENE-2600: remove IndexReader.isDeleted
Instead of IndexReader.isDeleted, do this:
import org.apache.lucene.util.Bits;
import org.apache.lucene.index.MultiFields;
Bits liveDocs = MultiFields.getLiveDocs(indexReader);
if (!liveDocs.get(docID)) {
// document is deleted...
}
* LUCENE-2858, LUCENE-3733: The abstract class IndexReader has been
refactored to expose only essential methods to access stored fields
during display of search results. It is no longer possible to retrieve
terms or postings data from the underlying index, not even deletions are
visible anymore. You can still pass IndexReader as constructor parameter
to IndexSearcher and execute your searches; Lucene will automatically
delegate procedures like query rewriting and document collection atomic
subreaders.
If you want to dive deeper into the index and want to write own queries,
take a closer look at the new abstract sub-classes AtomicReader and
CompositeReader:
AtomicReader instances are now the only source of Terms, Postings,
DocValues and FieldCache. Queries are forced to execute on a Atomic
reader on a per-segment basis and FieldCaches are keyed by
AtomicReaders.
Its counterpart CompositeReader exposes a utility method to retrieve
its composites. But watch out, composites are not necessarily atomic.
Next to the added type-safety we also removed the notion of
index-commits and version numbers from the abstract IndexReader, the
associations with IndexWriter were pulled into a specialized
DirectoryReader. To open Directory-based indexes use
DirectoryReader.open(), the corresponding method in IndexReader is now
deprecated for easier migration. Only DirectoryReader supports commits,
versions, and reopening with openIfChanged(). Terms, postings,
docvalues, and norms can from now on only be retrieved using
AtomicReader; DirectoryReader and MultiReader extend CompositeReader,
only offering stored fields and access to the sub-readers (which may be
composite or atomic).
If you have more advanced code dealing with custom Filters, you might
have noticed another new class hierarchy in Lucene (see LUCENE-2831):
IndexReaderContext with corresponding Atomic-/CompositeReaderContext.
The move towards per-segment search Lucene 2.9 exposed lots of custom
Queries and Filters that couldn't handle it. For example, some Filter
implementations expected the IndexReader passed in is identical to the
IndexReader passed to IndexSearcher with all its advantages like
absolute document IDs etc. Obviously this "paradigm-shift" broke lots of
applications and especially those that utilized cross-segment data
structures (like Apache Solr).
In Lucene 4.0, we introduce IndexReaderContexts "searcher-private"
reader hierarchy. During Query or Filter execution Lucene no longer
passes raw readers down Queries, Filters or Collectors; instead
components are provided an AtomicReaderContext (essentially a hierarchy
leaf) holding relative properties like the document-basis in relation to
the top-level reader. This allows Queries & Filter to build up logic
based on document IDs, albeit the per-segment orientation.
There are still valid use-cases where top-level readers ie. "atomic
views" on the index are desirable. Let say you want to iterate all terms
of a complete index for auto-completion or facetting, Lucene provides
utility wrappers like SlowCompositeReaderWrapper (LUCENE-2597) emulating
an AtomicReader. Note: using "atomicity emulators" can cause serious
slowdowns due to the need to merge terms, postings, DocValues, and
FieldCache, use them with care!
* LUCENE-2674: A new idfExplain method was added to Similarity, that
accepts an incoming docFreq. If you subclass Similarity, make sure
you also override this method on upgrade, otherwise your
customizations won't run for certain MultiTermQuerys.
* LUCENE-2413: Lucene's core and contrib analyzers, along with Solr's analyzers,
were consolidated into modules/analysis. During the refactoring some
package names have changed:
- o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
- o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
- o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
- o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
- o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
- o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
- o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
- o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
- o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
- o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
- o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
- o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
- o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
- o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
- o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
- o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
- o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
- o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
- o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
- o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
- o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
- o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
- o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
- o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
- o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
- o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
- o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
- o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils
* LUCENE-2514: The option to use a Collator's order (instead of binary order) for
sorting and range queries has been moved to contrib/queries.
The Collated TermRangeQuery/Filter has been moved to SlowCollatedTermRangeQuery/Filter,
and the collated sorting has been moved to SlowCollatedStringComparator.
Note: this functionality isn't very scalable and if you are using it, consider
indexing collation keys with the collation support in the analysis module instead.
To perform collated range queries, use a suitable collating analyzer: CollationKeyAnalyzer
or ICUCollationKeyAnalyzer, and set qp.setAnalyzeRangeTerms(true).
TermRangeQuery and TermRangeFilter now work purely on bytes. Both have helper factory methods
(newStringRange) similar to the NumericRange API, to easily perform range queries on Strings.
* LUCENE-2691: The near-real-time API has moved from IndexWriter to
IndexReader. Instead of IndexWriter.getReader(), call
IndexReader.open(IndexWriter) or IndexReader.reopen(IndexWriter).
* LUCENE-2690: MultiTermQuery boolean rewrites per segment.
Also MultiTermQuery.getTermsEnum() now takes an AttributeSource. FuzzyTermsEnum
is both consumer and producer of attributes: MTQ.BoostAttribute is
added to the FuzzyTermsEnum and MTQ's rewrite mode consumes it.
The other way round MTQ.TopTermsBooleanQueryRewrite supplys a
global AttributeSource to each segments TermsEnum. The TermsEnum is consumer
and gets the current minimum competitive boosts (MTQ.MaxNonCompetitiveBoostAttribute).
* LUCENE-2374: The backwards layer in AttributeImpl was removed. To support correct
reflection of AttributeImpl instances, where the reflection was done using deprecated
toString() parsing, you have to now override reflectWith() to customize output.
toString() is no longer implemented by AttributeImpl, so if you have overridden
toString(), port your customization over to reflectWith(). reflectAsString() would
then return what toString() did before.
* LUCENE-2236, LUCENE-2912: DefaultSimilarity can no longer be set statically
(and dangerously) for the entire JVM.
Similarity can now be configured on a per-field basis (via PerFieldSimilarityWrapper)
Similarity has a lower-level API, if you want the higher-level vector-space API
like in previous Lucene releases, then look at TFIDFSimilarity.
* LUCENE-1076: TieredMergePolicy is now the default merge policy.
It's able to merge non-contiguous segments; this may cause problems
for applications that rely on Lucene's internal document ID
assigment. If so, you should instead use LogByteSize/DocMergePolicy
during indexing.
* LUCENE-2883: Lucene's o.a.l.search.function ValueSource based functionality, was consolidated
into module/queries along with Solr's similar functionality. The following classes were moved:
- o.a.l.search.function.CustomScoreQuery -> o.a.l.queries.CustomScoreQuery
- o.a.l.search.function.CustomScoreProvider -> o.a.l.queries.CustomScoreProvider
- o.a.l.search.function.NumericIndexDocValueSource -> o.a.l.queries.function.valuesource.NumericIndexDocValueSource
The following lists the replacement classes for those removed:
- o.a.l.search.function.ByteFieldSource -> o.a.l.queries.function.valuesource.ByteFieldSource
- o.a.l.search.function.DocValues -> o.a.l.queries.function.DocValues
- o.a.l.search.function.FieldCacheSource -> o.a.l.queries.function.valuesource.FieldCacheSource
- o.a.l.search.function.FieldScoreQuery ->o.a.l.queries.function.FunctionQuery
- o.a.l.search.function.FloatFieldSource -> o.a.l.queries.function.valuesource.FloatFieldSource
- o.a.l.search.function.IntFieldSource -> o.a.l.queries.function.valuesource.IntFieldSource
- o.a.l.search.function.OrdFieldSource -> o.a.l.queries.function.valuesource.OrdFieldSource
- o.a.l.search.function.ReverseOrdFieldSource -> o.a.l.queries.function.valuesource.ReverseOrdFieldSource
- o.a.l.search.function.ShortFieldSource -> o.a.l.queries.function.valuesource.ShortFieldSource
- o.a.l.search.function.ValueSource -> o.a.l.queries.function.ValueSource
- o.a.l.search.function.ValueSourceQuery -> o.a.l.queries.function.FunctionQuery
DocValues are now named FunctionValues, to not confuse with Lucene's per-document values.
* LUCENE-2392: Enable flexible scoring:
The existing "Similarity" api is now TFIDFSimilarity, if you were extending
Similarity before, you should likely extend this instead.
Weight.normalize no longer takes a norm value that incorporates the top-level
boost from outer queries such as BooleanQuery, instead it takes 2 parameters,
the outer boost (topLevelBoost) and the norm. Weight.sumOfSquaredWeights has
been renamed to Weight.getValueForNormalization().
The scorePayload method now takes a BytesRef. It is never null.
* LUCENE-3722: Similarity methods and collection/term statistics now take
long instead of int (to enable distributed scoring of > 2B docs).
For example, in TFIDFSimilarity idf(int, int) is now idf(long, long).
* LUCENE-3559: The methods "docFreq" and "maxDoc" on IndexSearcher were removed,
as these are no longer used by the scoring system.
If you were using these casually in your code for reasons unrelated to scoring,
call them on the IndexSearcher's reader instead: getIndexReader().
If you were subclassing IndexSearcher and overriding these methods to alter
scoring, override IndexSearcher's termStatistics() and collectionStatistics()
methods instead.
* LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser,
where other QueryParsers from the codebase will also be placed. The following classes were moved:
- o.a.l.queryParser.CharStream -> o.a.l.queryparser.classic.CharStream
- o.a.l.queryParser.FastCharStream -> o.a.l.queryparser.classic.FastCharStream
- o.a.l.queryParser.MultiFieldQueryParser -> o.a.l.queryparser.classic.MultiFieldQueryParser
- o.a.l.queryParser.ParseException -> o.a.l.queryparser.classic.ParseException
- o.a.l.queryParser.QueryParser -> o.a.l.queryparser.classic.QueryParser
- o.a.l.queryParser.QueryParserBase -> o.a.l.queryparser.classic.QueryParserBase
- o.a.l.queryParser.QueryParserConstants -> o.a.l.queryparser.classic.QueryParserConstants
- o.a.l.queryParser.QueryParserTokenManager -> o.a.l.queryparser.classic.QueryParserTokenManager
- o.a.l.queryParser.QueryParserToken -> o.a.l.queryparser.classic.Token
- o.a.l.queryParser.QueryParserTokenMgrError -> o.a.l.queryparser.classic.TokenMgrError
* LUCENE-2308,LUCENE-3453: Separate IndexableFieldType from Field instances
With this change, the indexing details (indexed, tokenized, norms,
indexOptions, stored, etc.) are moved into a separate FieldType
instance (rather than being stored directly on the Field).
This means you can create the FieldType instance once, up front,
for a given field, and then re-use that instance whenever you instantiate
the Field.
Certain field types are pre-defined since they are common cases:
* StringField: indexes a String value as a single token (ie, does
not tokenize). This field turns off norms and indexes only doc
IDS (does not index term frequency nor positions). This field
does not store its value, but exposes TYPE_STORED as well.
* TextField: indexes and tokenizes a String, Reader or TokenStream
value, without term vectors. This field does not store its value,
but exposes TYPE_STORED as well.
* StoredField: field that stores its value
* DocValuesField: indexes the value as a DocValues field
* NumericField: indexes the numeric value so that NumericRangeQuery
can be used at search-time.
If your usage fits one of those common cases you can simply
instantiate the above class. If you need to store the value, you can
add a separate StoredField to the document, or you can use
TYPE_STORED for the field:
Field f = new Field("field", "value", StringField.TYPE_STORED);
Alternatively, if an existing type is close to what you want but you
need to make a few changes, you can copy that type and make changes:
FieldType bodyType = new FieldType(TextField.TYPE_STORED);
bodyType.setStoreTermVectors(true);
You can of course also create your own FieldType from scratch:
FieldType t = new FieldType();
t.setIndexed(true);
t.setStored(true);
t.setOmitNorms(true);
t.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
t.freeze();
FieldType has a freeze() method to prevent further changes.
There is also a deprecated transition API, providing the same Index,
Store, TermVector enums from 3.x, and Field constructors taking these
enums.
When migrating from the 3.x API, if you did this before:
new Field("field", "value", Field.Store.NO, Field.Indexed.NOT_ANALYZED_NO_NORMS)
you can now do this:
new StringField("field", "value")
(though note that StringField indexes DOCS_ONLY).
If instead the value was stored:
new Field("field", "value", Field.Store.YES, Field.Indexed.NOT_ANALYZED_NO_NORMS)
you can now do this:
new Field("field", "value", StringField.TYPE_STORED)
If you didn't omit norms:
new Field("field", "value", Field.Store.YES, Field.Indexed.NOT_ANALYZED)
you can now do this:
FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setOmitNorms(false);
new Field("field", "value", ft)
If you did this before (value can be String or Reader):
new Field("field", value, Field.Store.NO, Field.Indexed.ANALYZED)
you can now do this:
new TextField("field", value)
If instead the value was stored:
new Field("field", value, Field.Store.YES, Field.Indexed.ANALYZED)
you can now do this:
new Field("field", value, TextField.TYPE_STORED)
If in addition you omit norms:
new Field("field", value, Field.Store.YES, Field.Indexed.ANALYZED_NO_NORMS)
you can now do this:
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(true);
new Field("field", value, ft)
If you did this before (bytes is a byte[]):
new Field("field", bytes)
you can now do this:
new StoredField("field", bytes)
* LUCENE-3396: Analyzer.tokenStream() and .reusableTokenStream() have been made final.
It is now necessary to use Analyzer.TokenStreamComponents to define an analysis process.
Analyzer also has its own way of managing the reuse of TokenStreamComponents (either
globally, or per-field). To define another Strategy, implement Analyzer.ReuseStrategy.
* LUCENE-3464: IndexReader.reopen has been renamed to
IndexReader.openIfChanged (a static method), and now returns null
(instead of the old reader) if there are no changes to the index, to
prevent the common pitfall of accidentally closing the old reader.
* LUCENE-3687: Similarity#computeNorm() now expects a Norm object to set the computed
norm value instead of returning a fixed single byte value. Custom similarities can now
set integer, float and byte values if a single byte is not sufficient.