LUCENE-2380: FieldCache.getStrings/Index --> FieldCache.getDocTerms/Index

* The field values returned when sorting by SortField.STRING are now
  BytesRef. You can call value.utf8ToString() to convert back to
  string, if necessary.

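  For example, a minimal sketch of reading such a sort value (the
  searcher, query, and "title" field are assumptions for illustration):

    Sort sort = new Sort(new SortField("title", SortField.STRING));
    TopFieldDocs hits = searcher.search(query, null, 10, sort);
    for (ScoreDoc sd : hits.scoreDocs) {
      // sort values now arrive as BytesRef, not String
      BytesRef value = (BytesRef) ((FieldDoc) sd).fields[0];
      String s = value.utf8ToString();  // convert only if you must
    }
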
* In FieldCache, getStrings (returning String[]) has been replaced
  with getTerms (returning a FieldCache.DocTerms instance).
  DocTerms provides a getTerm method, taking a docID and a BytesRef
  to fill (which must not be null), and it fills it in with the
  reference to the bytes for that term.

  If you had code like this before:

    String[] values = FieldCache.DEFAULT.getStrings(reader, field);
    ...
    String aValue = values[docID];

  you can do this instead:

    DocTerms values = FieldCache.DEFAULT.getTerms(reader, field);
    ...
    BytesRef term = new BytesRef();
    String aValue = values.getTerm(docID, term).utf8ToString();

  Note however that it can be costly to convert to String, so it's
  better to work directly with the BytesRef.

* Similarly, in FieldCache, getStringIndex (returning a StringIndex
  instance, with direct arrays int[] order and String[] lookup) has
  been replaced with getTermsIndex (returning a
  FieldCache.DocTermsIndex instance). DocTermsIndex provides the
  getOrd(int docID) method to look up the int order for a document,
  lookup(int ord, BytesRef reuse) to look up the term from a given
  order, and the sugar method getTerm(int docID, BytesRef reuse)
  which internally calls getOrd and then lookup.

  If you had code like this before:

    StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, field);
    ...
    int ord = idx.order[docID];
    String aValue = idx.lookup[ord];

  you can do this instead:

    DocTermsIndex idx = FieldCache.DEFAULT.getTermsIndex(reader, field);
    ...
    int ord = idx.getOrd(docID);
    BytesRef term = new BytesRef();
    String aValue = idx.lookup(ord, term).utf8ToString();

  Note however that it can be costly to convert to String, so it's
  better to work directly with the BytesRef.

  DocTermsIndex also has a getTermsEnum() method, which returns an
  iterator (TermsEnum) over the term values in the index (ie,
  iterates ord = 0..numOrd()-1).

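  For example, a minimal sketch (assuming idx from the snippet above)
  that prints every unique term value in the field:

    TermsEnum te = idx.getTermsEnum();
    BytesRef text;
    while ((text = te.next()) != null) {
      System.out.println(text.utf8ToString());
    }
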
* StringComparatorLocale is now more CPU costly than it was before
  (it was already very CPU costly since it does not compare using
  indexed collation keys; use CollationKeyFilter for better
  performance), since it converts BytesRef -> String on the fly.
  Also, the field values returned when sorting by SortField.STRING
  are now BytesRef.

* FieldComparator.StringOrdValComparator has been renamed to
  TermOrdValComparator, and now uses BytesRef for its values.
  Likewise for StringValComparator, renamed to TermValComparator.
  This means when sorting by SortField.STRING or
  SortField.STRING_VAL (or directly invoking these comparators) the
  values returned in the FieldDoc.fields array will be BytesRef not
  String. You can call the .utf8ToString() method on the BytesRef
  instances, if necessary.


LUCENE-1458, LUCENE-2111: Flexible Indexing

Flexible indexing changed the low-level fields/terms/docs/positions
enumeration APIs. Here are the major changes:

* Terms are now binary in nature (arbitrary byte[]), represented
  by the BytesRef class (which provides an offset + length "slice"
  into an existing byte[]).

* Fields are separately enumerated (FieldsEnum) from the terms
  within each field (TermsEnum). So instead of this:

    TermEnum termsEnum = ...;
    while(termsEnum.next()) {
      Term t = termsEnum.term();
      System.out.println("field=" + t.field() + "; text=" + t.text());
    }

  Do this:

    FieldsEnum fieldsEnum = ...;
    String field;
    while((field = fieldsEnum.next()) != null) {
      TermsEnum termsEnum = fieldsEnum.terms();
      BytesRef text;
      while((text = termsEnum.next()) != null) {
        System.out.println("field=" + field + "; text=" + text.utf8ToString());
      }
    }

* TermDocs is renamed to DocsEnum. Instead of this:

    while(td.next()) {
      int doc = td.doc();
      ...
    }

  do this:

    int doc;
    while((doc = td.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
      ...
    }

  Instead of this:

    if (td.skipTo(target)) {
      int doc = td.doc();
      ...
    }

  do this:

    if ((doc = td.advance(target)) != DocsEnum.NO_MORE_DOCS) {
      ...
    }

  The bulk read API has also changed. Instead of this:

    int[] docs = new int[256];
    int[] freqs = new int[256];

    while(true) {
      int count = td.read(docs, freqs);
      if (count == 0) {
        break;
      }
      // use docs[i], freqs[i]
    }

  do this:

    DocsEnum.BulkReadResult bulk = td.getBulkResult();
    while(true) {
      int count = td.read();
      if (count == 0) {
        break;
      }
      // use bulk.docs.ints[i] and bulk.freqs.ints[i]
    }

* TermPositions is renamed to DocsAndPositionsEnum, and no longer
  extends the docs-only enumerator (DocsEnum).

* Deleted docs are no longer implicitly filtered from
  docs/positions enums. Instead, you pass a Bits
  skipDocs (set bits are skipped) when obtaining the enums. Also,
  you can now ask a reader for its deleted docs.

* The docs/positions enums cannot seek to a term. Instead,
  TermsEnum is able to seek, and then you request the
  docs/positions enum from that TermsEnum.

* TermsEnum's seek method returns more information. So instead of
  this:

    Term t;
    TermEnum termEnum = reader.terms(t);
    if (t.equals(termEnum.term())) {
      ...
    }

  do this:

    TermsEnum termsEnum = ...;
    BytesRef text;
    if (termsEnum.seek(text) == TermsEnum.SeekStatus.FOUND) {
      ...
    }

  SeekStatus also contains END (enumerator is done) and NOT_FOUND
  (term was not found but enumerator is now positioned to the next
  term).

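  For example, a hedged sketch that exploits NOT_FOUND to scan all
  terms starting with a made-up prefix (BytesRef.startsWith is assumed
  available here):

    BytesRef prefix = new BytesRef("app");  // hypothetical prefix
    if (termsEnum.seek(prefix) != TermsEnum.SeekStatus.END) {
      // FOUND or NOT_FOUND: either way the enum is positioned at the
      // first term at or after the prefix, so scan forward from there
      BytesRef term = termsEnum.term();
      while (term != null && term.startsWith(prefix)) {
        System.out.println(term.utf8ToString());
        term = termsEnum.next();
      }
    }
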
* TermsEnum has an ord() method, returning the long numeric
  ordinal (ie, first term is 0, next is 1, and so on) for the term
  it's positioned on. There is also a corresponding seek(long
  ord) method. Note that these methods are optional; in
  particular the MultiFields TermsEnum does not implement them.

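  For example, a minimal sketch (only valid on a TermsEnum whose
  implementation supports ords):

    long ord = termsEnum.ord();  // ordinal of the currently positioned term
    ...
    termsEnum.seek(ord);         // later, jump straight back to that term

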
How you obtain the enums has changed. The primary entry point is
the Fields class. If you know your reader is a single segment
reader, do this:

    Fields fields = reader.fields();
    if (fields != null) {
      ...
    }

If the reader might be multi-segment, you must do this:

    Fields fields = MultiFields.getFields(reader);
    if (fields != null) {
      ...
    }

The fields may be null (eg if the reader has no fields).

Note that the MultiFields approach entails a performance hit on
MultiReaders, as it must merge terms/docs/positions on the fly. It's
generally better to instead get the sequential readers (use
oal.util.ReaderUtil) and then step through those readers yourself,
if you can (this is how Lucene drives searches).

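For example, a hedged sketch of stepping through the sub-readers
yourself (ReaderUtil.gatherSubReaders collecting the sequential
readers is an assumption based on this era of the API):

    List<IndexReader> subReaders = new ArrayList<IndexReader>();
    ReaderUtil.gatherSubReaders(subReaders, reader);
    for (IndexReader sub : subReaders) {
      Fields fields = sub.fields();  // per-segment, no merging cost
      if (fields != null) {
        ...
      }
    }
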
If you pass a SegmentReader to MultiFields.getFields it will simply
return reader.fields(), so there is no performance hit in that
case.

Once you have a non-null Fields you can do this:

    Terms terms = fields.terms("field");
    if (terms != null) {
      ...
    }

The terms may be null (eg if the field does not exist).

Once you have a non-null Terms you can get an enum like this:

    TermsEnum termsEnum = terms.iterator();

The returned TermsEnum will not be null.

You can then .next() through the TermsEnum, or seek. If you want a
DocsEnum, do this:

    Bits skipDocs = MultiFields.getDeletedDocs(reader);
    DocsEnum docsEnum = null;

    docsEnum = termsEnum.docs(skipDocs, docsEnum);

You can pass in a prior DocsEnum and it will be reused if possible.

Likewise for DocsAndPositionsEnum.

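For example, a hedged sketch of the positions case (a docsAndPositions
method mirroring docs is an assumption; it is taken to return null if
the field was indexed without positions):

    DocsAndPositionsEnum postings = termsEnum.docsAndPositions(skipDocs, null);
    if (postings != null) {
      while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < postings.freq(); i++) {
          int pos = postings.nextPosition();
          ...
        }
      }
    }
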
IndexReader has several sugar methods (which just go through the
above steps, under the hood). Instead of:

    Term t;
    TermDocs termDocs = reader.termDocs();
    termDocs.seek(t);

do this:

    String field;
    BytesRef text;
    DocsEnum docsEnum = reader.termDocsEnum(reader.getDeletedDocs(), field, text);

Likewise for DocsAndPositionsEnum.

* LUCENE-2600: remove IndexReader.isDeleted

  Instead of IndexReader.isDeleted, do this:

    import org.apache.lucene.util.Bits;
    import org.apache.lucene.index.MultiFields;

    Bits delDocs = MultiFields.getDeletedDocs(indexReader);
    if (delDocs.get(docID)) {
      // document is deleted...
    }

* LUCENE-2674: A new idfExplain method was added to Similarity that
  accepts an incoming docFreq. If you subclass Similarity, make sure
  you also override this method on upgrade; otherwise your
  customizations won't run for certain MultiTermQuerys.

* LUCENE-2413: Lucene's core and contrib analyzers, along with Solr's analyzers,
  were consolidated into modules/analysis. During the refactoring some
  package names have changed:
    - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
    - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
    - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
    - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
    - o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
    - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
    - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
    - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
    - o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
    - o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
    - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
    - o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
    - o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
    - o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
    - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
    - o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
    - o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
    - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
    - o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
    - o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
    - o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
    - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
    - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
    - o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
    - o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
    - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
    - o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
    - o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils

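  In source form this is just an import change, for example:

    // before:
    //   import org.apache.lucene.analysis.WhitespaceAnalyzer;
    // after:
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
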
* LUCENE-2514: The option to use a Collator's order (instead of binary order) for
  sorting and range queries has been moved to contrib/queries.

  The collated TermRangeQuery/Filter has been moved to SlowCollatedTermRangeQuery/Filter,
  and the collated sorting has been moved to SlowCollatedStringComparator.

  Note: this functionality isn't very scalable, and if you are using it, consider
  indexing collation keys with the collation support in the analysis module instead.

  To perform collated range queries, use a suitable collating analyzer (CollationKeyAnalyzer
  or ICUCollationKeyAnalyzer) and set qp.setAnalyzeRangeTerms(true).

  TermRangeQuery and TermRangeFilter now work purely on bytes. Both have helper factory methods
  (newStringRange), similar to the NumericRange API, to easily perform range queries on Strings.

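  For example, a minimal sketch of the factory method (field name and
  bounds are made up; an inclusive-bounds parameter pair is assumed):

    TermRangeQuery q = TermRangeQuery.newStringRange(
        "title", "apple", "banana", true, true);  // [apple, banana]
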
* LUCENE-2691: The near-real-time API has moved from IndexWriter to
  IndexReader. Instead of IndexWriter.getReader(), call
  IndexReader.open(IndexWriter) or IndexReader.reopen(IndexWriter).

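  For example, a hedged sketch of the refresh pattern (writer setup
  omitted; reopening and then closing the old reader is the usual
  idiom, assumed here rather than stated by this entry):

    IndexReader reader = IndexReader.open(writer);
    ...
    IndexReader newReader = reader.reopen(writer);
    if (newReader != reader) {
      reader.close();       // release the old reader
      reader = newReader;   // switch to the refreshed one
    }
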
* LUCENE-2690: MultiTermQuery boolean rewrites per segment.
  Also, MultiTermQuery.getTermsEnum() now takes an AttributeSource. FuzzyTermsEnum
  is both consumer and producer of attributes: MTQ.BoostAttribute is
  added to the FuzzyTermsEnum and MTQ's rewrite mode consumes it.
  Conversely, MTQ.TopTermsBooleanQueryRewrite supplies a
  global AttributeSource to each segment's TermsEnum; the TermsEnum consumes it
  to get the current minimum competitive boost (MTQ.MaxNonCompetitiveBoostAttribute).

* LUCENE-2374: The backwards layer in AttributeImpl was removed. To support correct
  reflection of AttributeImpl instances, where reflection was previously done by
  parsing the deprecated toString() output, you now have to override reflectWith()
  to customize the output. toString() is no longer implemented by AttributeImpl, so
  if you have overridden toString(), port your customization over to reflectWith().
  reflectAsString() then returns what toString() did before.

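  For example, a hedged sketch for a hypothetical attribute (MyAttribute
  and MyAttributeImpl are made up; the reflect(Class, String, Object)
  callback style is assumed from this era of the API):

    public class MyAttributeImpl extends AttributeImpl implements MyAttribute {
      private int value;

      @Override
      public void reflectWith(AttributeReflector reflector) {
        // exposes what a custom toString() used to print
        reflector.reflect(MyAttribute.class, "value", value);
      }

      @Override
      public void clear() { value = 0; }

      @Override
      public void copyTo(AttributeImpl target) {
        ((MyAttributeImpl) target).value = value;
      }
    }
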
* LUCENE-2236, LUCENE-2912: DefaultSimilarity can no longer be set statically
  (and dangerously) for the entire JVM.
  Instead, IndexWriterConfig and IndexSearcher now take a SimilarityProvider, and
  Similarity can now be configured on a per-field basis.
  Similarity retains only the field-specific relevance methods such as tf() and idf().
  Previously some (but not all) of these methods, such as computeNorm and scorePayload,
  took the field as a parameter; this parameter has been removed because the entire
  Similarity (all methods) can now be configured per-field.
  Methods that apply to the entire query, such as coord() and queryNorm(), exist in
  SimilarityProvider.

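  For example, a hedged sketch of a provider (the get/coord/queryNorm
  method set is assumed from the description above; the per-field
  choices are made up):

    public class MySimilarityProvider implements SimilarityProvider {
      private final Similarity defaultSim = new DefaultSimilarity();
      private final Similarity bodySim = new DefaultSimilarity() {
        @Override
        public float tf(float freq) {
          return 1.0f;  // hypothetical: flat tf for the "body" field
        }
      };

      public Similarity get(String field) {
        return "body".equals(field) ? bodySim : defaultSim;
      }

      // query-wide methods live on the provider, per the note above
      public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
      }

      public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
      }
    }
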
* LUCENE-1076: TieredMergePolicy is now the default merge policy.
  It's able to merge non-contiguous segments; this may cause problems
  for applications that rely on Lucene's internal document ID
  assignment. If so, you should instead use LogByteSize/DocMergePolicy
  during indexing.
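
  For example, a minimal sketch of opting back in to the old in-order
  behavior (the directory, analyzer, and matchVersion are assumed to be
  set up already):

    IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, analyzer);
    iwc.setMergePolicy(new LogByteSizeMergePolicy());  // contiguous merges only
    IndexWriter writer = new IndexWriter(dir, iwc);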