Lucene Change Log
For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions
======================= Lucene 5.0.0 =======================
Changes in backwards compatibility policy
* LUCENE-4535: oal.util.FilterIterator is now an internal API.
(Adrien Grand)
* LUCENE-3312: The API of oal.document was restructured to
differentiate between stored documents and indexed documents.
IndexReader.document(int) now returns StoredDocument
instead of Document. In most cases a simple replacement
of the return type is enough to upgrade (see MIGRATE.txt).
(Nikola Tanković, Uwe Schindler, Chris Male, Mike McCandless,
Robert Muir)
* LUCENE-4924: DocIdSetIterator.docID() must now return -1 when the iterator is
not positioned. This change affects all classes that inherit from
DocIdSetIterator, including DocsEnum and DocsAndPositionsEnum. (Adrien Grand)
* LUCENE-5127: Reduce RAM usage of FixedGapTermsIndex. Remove
IndexWriterConfig.setTermIndexInterval, IndexWriterConfig.setReaderTermsIndexDivisor,
and termsIndexDivisor from StandardDirectoryReader. These options have been no-ops
with the default codec since Lucene 4.0. If you want to configure the interval for
this term index, pass it directly in your codec, where it can also be configured
per-field. (Robert Muir)
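The LUCENE-4924 contract above can be illustrated with a minimal, self-contained
sketch. ArrayDocIdIterator is a hypothetical class, not part of Lucene, and it
does not extend DocIdSetIterator; it only mirrors the docID()/nextDoc() behavior
described in the entry. It assumes Integer.MAX_VALUE as the "exhausted" sentinel.

```java
/** Hypothetical stand-in for a doc id iterator; not a Lucene class. */
final class ArrayDocIdIterator {
  /** Sentinel returned once the iterator is exhausted (assumed value). */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // sorted, distinct doc ids
  private int index = -1;   // -1 means "not positioned yet"

  ArrayDocIdIterator(int... sortedDocs) {
    this.docs = sortedDocs.clone();
  }

  /** Returns -1 before the first nextDoc() call, as LUCENE-4924 requires. */
  int docID() {
    if (index < 0) {
      return -1;
    }
    return index < docs.length ? docs[index] : NO_MORE_DOCS;
  }

  /** Advances to the next doc id, or NO_MORE_DOCS when exhausted. */
  int nextDoc() {
    if (index < docs.length) {
      index++;
    }
    return docID();
  }
}
```

Callers that previously relied on an unspecified value before the first
nextDoc() call can now test for -1 instead.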
New Features
* LUCENE-4747: Move to Java 7 as minimum Java version.
(Robert Muir, Uwe Schindler)
* SOLR-3359: Added analyzer attribute/property to SynonymFilterFactory.
(Ryo Onodera via Koji Sekiguchi)
Optimizations
* LUCENE-4848: Use Java 7 NIO2-FileChannel instead of RandomAccessFile
for NIOFSDirectory and MMapDirectory. This allows open files to be deleted
on Windows if NIOFSDirectory is used; mmapped files are still locked.
(Michael Poindexter, Robert Muir, Uwe Schindler)
======================= Lucene 4.6.0 =======================
New Features
* LUCENE-4906: PostingsHighlighter can now render to custom Object,
for advanced use cases where String is too restrictive (Luca
Cavanna, Robert Muir, Mike McCandless)
* LUCENE-5133: Changed AnalyzingInfixSuggester.highlight to return
Object instead of String, to allow for advanced use cases where
String is too restrictive (Robert Muir, Shai Erera, Mike
McCandless)
Changes in backwards compatibility policy
* LUCENE-5204: Directory doesn't have default implementations for
LockFactory-related methods, which have been moved to BaseDirectory. If you
had a custom Directory implementation that extended Directory, you need to
extend BaseDirectory instead. (Adrien Grand)
======================= Lucene 4.5.0 =======================
New features
* LUCENE-5084: Added new Elias-Fano encoder, decoder and DocIdSet
implementations. (Paul Elschot via Adrien Grand)
* LUCENE-5081: Added WAH8DocIdSet, an in-memory doc id set implementation based
on word-aligned hybrid encoding. (Adrien Grand)
* LUCENE-5098: New broadword utility methods in oal.util.BroadWord.
(Paul Elschot via Adrien Grand, Dawid Weiss)
* LUCENE-5030: FuzzySuggester now supports optional unicodeAware
(default is false). If true then edits are measured in Unicode code
points instead of UTF8 bytes. (Artem Lukanin via Mike McCandless)
* LUCENE-5118: SpatialStrategy.makeDistanceValueSource() now has an optional
multiplier for scaling degrees to another unit. (David Smiley)
* LUCENE-5091: SpanNotQuery can now be configured with pre and post slop to act
as a hypothetical SpanNotNearQuery. (Tim Allison via David Smiley)
* LUCENE-4985: FacetsAccumulator.create() is now able to create a
MultiFacetsAccumulator over a mixed set of facet requests. MultiFacetsAccumulator
allows wrapping multiple FacetsAccumulators, making it easy to mix
existing and custom ones. TaxonomyFacetsAccumulator supports any
FacetRequest which implements createFacetsAggregator and was indexed
using the taxonomy index. (Shai Erera)
* LUCENE-5153: AnalyzerWrapper.wrapReader allows wrapping the Reader given to
inputReader. (Shai Erera)
* LUCENE-5155: FacetRequest.getValueOf and .getFacetArraysSource replaced by
FacetsAggregator.createOrdinalValueResolver. This gives better options for
resolving an ordinal's value by FacetAggregators. (Shai Erera)
* LUCENE-5165: Add SuggestStopFilter, to be used with analyzing
suggesters, so that a stop word at the very end of the lookup query,
and without any trailing token characters, will be preserved. This
enables query "a" to suggest apple; see
http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html
for details.
* LUCENE-5178: Added support for missing values to DocValues fields.
AtomicReader.getDocsWithField returns a Bits of documents with a value,
and FieldCache.getDocsWithField forwards to that for DocValues fields. Things like
SortField.setMissingValue, FunctionValues.exists, and FieldValueFilter now
work with DocValues fields. (Robert Muir)
* LUCENE-5124: Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues,
supporting missing values and with most datastructures residing off-heap.
Added "Memory" docvalues format that works entirely in heap, and "Disk"
loads no datastructures into RAM. Both of these also support missing values.
Added DiskNormsFormat (in case you want norms entirely on disk). (Robert Muir)
* LUCENE-2750: Added PForDeltaDocIdSet, an in-memory doc id set implementation
based on the PFOR encoding. (Adrien Grand)
* LUCENE-5186: Added CachingWrapperFilter.getFilter in order to be able to get
the wrapped filter. (Trejkaz via Adrien Grand)
* LUCENE-5197: Added SegmentReader.ramBytesUsed to return approximate heap RAM
used by index datastructures. (Areek Zillur via Robert Muir)
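The difference between the two units mentioned in LUCENE-5030 above can be
seen with plain JDK calls. The sketch below does not use FuzzySuggester; the
class EditUnits and its method names are hypothetical. It only shows that the
same string has a different length in UTF-8 bytes than in Unicode code points,
which is why a single changed non-ASCII character may count as more than one
edit when edits are measured in bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper comparing the two units an edit can be measured in. */
final class EditUnits {
  /** Length of the string when encoded as UTF-8 bytes. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** Length of the string in Unicode code points. */
  static int codePointLength(String s) {
    return s.codePointCount(0, s.length());
  }
}
```

For example, "caf\u00e9" is 4 code points but 5 UTF-8 bytes, and the single
supplementary character "\uD83D\uDE00" is 1 code point but 4 UTF-8 bytes.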
Bug Fixes
* LUCENE-5116: IndexWriter.addIndexes(IndexReader...) should drop empty (or all
deleted) segments. (Robert Muir, Shai Erera)
* LUCENE-5132: Spatial RecursivePrefixTree Contains predicate will throw an NPE
when there's no indexed data and maybe in other circumstances too. (David Smiley)
* LUCENE-5146: AnalyzingSuggester's sort comparator read part of the input key as
the weight. As a result the sorter never sorted by weight first: the weight is only
considered when the inputs are equal, in which case the malformed weights are
identical as well.
(Simon Willnauer)
* LUCENE-5151: Associations FacetsAggregators could enter an infinite loop when
some result documents were missing category associations. (Shai Erera)
* LUCENE-5152: Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum
seek/lookup which can cause side effects if done on a cached FST root arc.
(Simon Willnauer)
* LUCENE-5160: Handle the case where reading from a file or FileChannel returns -1,
which could happen in rare cases where something happens to the file between the
time we start the read loop (where we check the length) and when we actually do
the read. (gsingers, yonik, Robert Muir, Uwe Schindler)
* LUCENE-5166: PostingsHighlighter would throw IOOBE if a term spanned the maxLength
boundary, made it into the top-N and went to the formatter.
(Manuel Amoabeng, Michael McCandless, Robert Muir)
* LUCENE-4583: Indexing core no longer enforces a limit on maximum
length binary doc values fields, but individual codecs (including
the default one) have their own limits (David Smiley, Robert Muir,
Mike McCandless)
* LUCENE-3849: TokenStreams now set the position increment in end(),
so we can handle trailing holes. If you have a custom TokenStream
implementing end() then be sure it calls super.end(). (Robert Muir,
Mike McCandless)
* LUCENE-5192: IndexWriter could allow adding same field name with different
DocValueTypes under some circumstances. (Shai Erera)
* LUCENE-5191: SimpleHTMLEncoder in Highlighter module broke Unicode
outside BMP because it encoded UTF-16 chars instead of codepoints.
The escaping of codepoints > 127 was removed (not needed for valid HTML)
and missing escaping for ' and / was added. (Uwe Schindler)
* LUCENE-5201: Fixed compression bug in LZ4.compressHC when the input is highly
compressible and the start offset of the array to compress is > 0.
(Adrien Grand)
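The code point issue behind LUCENE-5191 above can be sketched without the
Highlighter module. HtmlEscapeSketch is a hypothetical class, not Lucene's
SimpleHTMLEncoder, and the entity strings it emits are assumptions chosen for
illustration. The point is that iterating by code point keeps characters
outside the BMP intact, while only the markup-significant ASCII characters
(including ' and /) are escaped.

```java
/** Hypothetical HTML escaper that walks code points; not Lucene's encoder. */
final class HtmlEscapeSketch {
  static String escape(String text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);   // read a whole code point, not a UTF-16 char
      i += Character.charCount(cp);   // advance by 1 or 2 chars accordingly
      switch (cp) {
        case '"':  out.append("&quot;"); break;
        case '&':  out.append("&amp;");  break;
        case '<':  out.append("&lt;");   break;
        case '>':  out.append("&gt;");   break;
        case '\'': out.append("&#x27;"); break; // assumed entity form
        case '/':  out.append("&#x2F;"); break; // assumed entity form
        default:   out.appendCodePoint(cp);     // code points > 127 pass through
      }
    }
    return out.toString();
  }
}
```

With this approach a surrogate pair such as "\uD83D\uDE00" is copied through
unchanged instead of being split into two separately encoded halves.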
API Changes
* LUCENE-5094: Add ramBytesUsed() to MultiDocValues.OrdinalMap.
(Robert Muir)
* LUCENE-5114: Remove unused boolean useCache parameter from
TermsEnum.seekCeil and .seekExact (Mike McCandless)
* LUCENE-5128: IndexSearcher.searchAfter throws IllegalArgumentException if
searchAfter exceeds the number of documents in the reader.
(Crocket via Shai Erera)
* LUCENE-5129: CategoryAssociationsContainer no longer supports null
association values for categories. If you want to index categories without
associations, you should add them using FacetFields. (Shai Erera)
* LUCENE-4876: IndexWriter no longer clones the given IndexWriterConfig. If you
need to use the same config more than once, e.g. when sharing between multiple
writers, make sure to clone it before passing to each writer.
(Shai Erera, Mike McCandless)
* LUCENE-5144: StandardFacetsAccumulator renamed to OldFacetsAccumulator, and all
associated classes were moved under o.a.l.facet.old. The intention is to remove it
one day, once the features it covers (complements, partitions, sampling) have been
migrated to the new FacetsAggregator and FacetsAccumulator API. Also,
FacetRequest.createAggregator was replaced by OldFacetsAccumulator.createAggregator.
(Shai Erera)
* LUCENE-5149: CommonTermsQuery now allows setting the minimum number of terms that
should match for both its high- and low-frequency sub-queries. Previously this was
only supported on the low-frequency terms query. (Simon Willnauer)
* LUCENE-5156: CompressingTermVectors TermsEnum no longer supports ord().
(Robert Muir)
* LUCENE-5161, LUCENE-5164: Fix default chunk sizes in FSDirectory to not be
unnecessarily large (now 8192 bytes); also use chunking when writing to index
files. FSDirectory#setReadChunkSize() is now deprecated and will be removed
in Lucene 5.0. (Uwe Schindler, Robert Muir, gsingers)
* LUCENE-5170: Analyzer.ReuseStrategy instances are now stateless and can
be reused in other Analyzer instances, which was not possible before.
Lucene now ships with stateless singletons for per field and global reuse.
Legacy code can still instantiate the deprecated implementation classes,
but new code should use the constants. Implementors of custom strategies
have to take care of new method signatures. AnalyzerWrapper can now be
configured to use a custom strategy, too, ideally the one from the wrapped
Analyzer. Analyzer adds a getter to retrieve the strategy for this use-case.
(Uwe Schindler, Robert Muir, Shay Banon)
* LUCENE-5173: Lucene never writes segments with 0 documents anymore.
(Shai Erera, Uwe Schindler, Robert Muir)
* LUCENE-5178: SortedDocValues always returns -1 ord when a document is missing
a value for the field. Previously it only did this if the SortedDocValues
was produced by uninversion on the FieldCache. (Robert Muir)
* LUCENE-5183: remove BinaryDocValues.MISSING. In order to determine a document
is missing a field, use getDocsWithField instead. (Robert Muir)
Changes in Runtime Behavior
* LUCENE-5178: DocValues codec consumer APIs (iterables) return null values
when the document has no value for the field. (Robert Muir)
* LUCENE-5200: The HighFreqTerms command-line tool returns the true top-N
by totalTermFreq when using the -t option; it uses the term statistics (faster)
and now always shows totalTermFreq in the output. (Robert Muir)
Optimizations
* LUCENE-5088: Added TermFilter to filter docs by a specific term.
(Martijn van Groningen)
* LUCENE-5119: DiskDV keeps the document-to-ordinal mapping on disk for
SortedDocValues. (Robert Muir)
* LUCENE-5145: New AppendingPackedLongBuffer, a new variant of the former
AppendingLongBuffer which assumes values are 0-based.
(Boaz Leskes via Adrien Grand)
* LUCENE-5145: All Appending*Buffer now support bulk get.
(Boaz Leskes via Adrien Grand)
* LUCENE-5140: Fixed a performance regression of span queries caused by
LUCENE-4946. (Alan Woodward, Adrien Grand)
* LUCENE-5150: Make WAH8DocIdSet able to invert its encoding in order to
compress dense sets efficiently as well. (Adrien Grand)
* LUCENE-5159: Prefix-code the sorted/sortedset value dictionaries in DiskDV.
(Robert Muir)
* LUCENE-5170: Fixed several wrapper analyzers to inherit the reuse strategy
of the wrapped Analyzer. (Uwe Schindler, Robert Muir, Shay Banon)
* LUCENE-5006: Simplified DocumentsWriter and DocumentsWriterPerThread
synchronization and concurrent interaction with IndexWriter. DWPT is now
only set up once and has no reset logic. All segment publishing and state
transition from DWPT into IndexWriter is now done via an Event-Queue
processed from within the IndexWriter in order to prevent situations
where DWPT or DW calling into IW causes deadlocks. (Simon Willnauer)
* LUCENE-5182: Terminate phrase searches early if max phrase window is
exceeded in FastVectorHighlighter to prevent very long-running phrase
extraction when phrase terms are highly frequent. (Simon Willnauer)
* LUCENE-5188: CompressingStoredFieldsFormat now slices chunks containing big
documents into fixed-size blocks so that requesting a single field does not
necessarily force decompression of the whole chunk. (Adrien Grand)
* LUCENE-5101: CachingWrapperFilter makes it easier to plug in a custom cacheable
DocIdSet implementation and uses WAH8DocIdSet by default, which should be
more memory efficient than FixedBitSet on average as well as faster on small
sets. (Robert Muir)
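The prefix coding mentioned in LUCENE-5159 above can be illustrated on a sorted
list of strings. This is a simplified sketch under stated assumptions: it works
on Java chars rather than bytes, it is not the DiskDV on-disk format, and
PrefixCodingSketch and Entry are hypothetical names. Each term is stored as the
number of leading characters it shares with the previous term plus the
remaining suffix, which is compact when neighboring terms are similar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical prefix (front) coding of a sorted term dictionary. */
final class PrefixCodingSketch {
  /** One encoded term: shared prefix length with the previous term, plus suffix. */
  static final class Entry {
    final int sharedPrefix;
    final String suffix;

    Entry(int sharedPrefix, String suffix) {
      this.sharedPrefix = sharedPrefix;
      this.suffix = suffix;
    }
  }

  /** Encodes terms that are already in sorted order. */
  static List<Entry> encode(List<String> sortedTerms) {
    List<Entry> out = new ArrayList<>();
    String previous = "";
    for (String term : sortedTerms) {
      int max = Math.min(previous.length(), term.length());
      int shared = 0;
      while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
        shared++;
      }
      out.add(new Entry(shared, term.substring(shared)));
      previous = term;
    }
    return out;
  }

  /** Rebuilds the original terms by replaying the entries in order. */
  static List<String> decode(List<Entry> entries) {
    List<String> out = new ArrayList<>();
    String previous = "";
    for (Entry e : entries) {
      String term = previous.substring(0, e.sharedPrefix) + e.suffix;
      out.add(term);
      previous = term;
    }
    return out;
  }
}
```

For the sorted terms "lucene", "lucid", "lucky" the second and third entries
each reuse the three characters "luc" and store only "id" and "ky".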
Documentation
* LUCENE-4894: remove facet userguide as it was outdated. Partially absorbed into
the package's documentation and class javadocs. (Shai Erera)
Changes in backwards compatibility policy
* LUCENE-5141: CheckIndex.fixIndex(Status,Codec) is now
CheckIndex.fixIndex(Status). If you used to pass a codec to this method, just
remove it from the arguments. (Adrien Grand)
* LUCENE-5089, SOLR-5126: Update to Morfologik 1.7.1. MorfologikAnalyzer and MorfologikFilter
no longer support multiple "dictionaries" as there is only one dictionary available.
(Dawid Weiss)
* LUCENE-5170: Changed method signatures of Analyzer.ReuseStrategy to take
Analyzer. Closeable interface was removed because the class was changed to
be stateless. (Uwe Schindler, Robert Muir, Shay Banon)
* LUCENE-5187: SlowCompositeReaderWrapper constructor is now private,
SlowCompositeReaderWrapper.wrap should be used instead. (Adrien Grand)
* LUCENE-5101: CachingWrapperFilter doesn't always return FixedBitSet instances
anymore. Users of the join module can use
oal.search.join.FixedBitSetCachingWrapperFilter instead. (Adrien Grand)
Build
* SOLR-5159: Manifest includes non-parsed maven variables.
(Artem Karpenko via Steve Rowe)
* LUCENE-5193: Add jar-src as top-level target to generate all Lucene and Solr
*-src.jar. (Steve Rowe, Shai Erera)
======================= Lucene 4.4.0 =======================
Changes in backwards compatibility policy
* LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
(Dawid Weiss, Grzegorz Sobczyk)
* LUCENE-4955: NGramTokenFilter now emits all n-grams for the same token at the
same position and preserves the position length and the offsets of the
original token. (Simon Willnauer, Adrien Grand)
* LUCENE-4955: NGramTokenizer now emits n-grams in a different order
(a, ab, b, bc, c) instead of (a, b, c, ab, bc) and doesn't trim trailing
whitespaces. (Adrien Grand)
* LUCENE-5042: The n-gram and edge n-gram tokenizers and filters now correctly
handle supplementary characters, and the tokenizers have the ability to
pre-tokenize the input stream similarly to CharTokenizer. (Adrien Grand)
* LUCENE-4967: NRTManager is replaced by
ControlledRealTimeReopenThread, for controlling which requests must
see which indexing changes, so that it can work with any
ReferenceManager (Mike McCandless)
* LUCENE-4973: SnapshotDeletionPolicy no longer requires a unique
String id (Mike McCandless, Shai Erera)
* LUCENE-4946: The internal sorting API (SorterTemplate, now Sorter) has been
completely refactored to allow for a better implementation of TimSort.
(Adrien Grand, Uwe Schindler, Dawid Weiss)
* LUCENE-4963: Some TokenFilter options that generate broken TokenStreams have
been deprecated: updateOffsets=true on TrimFilter and
enablePositionIncrements=false on all classes that inherit from
FilteringTokenFilter: JapanesePartOfSpeechStopFilter, KeepWordFilter,
LengthFilter, StopFilter and TypeTokenFilter. (Adrien Grand)
* LUCENE-4963: In order not to take position increments into account in
suggesters, you now need to call setPreservePositionIncrements(false) instead
of configuring the token filters to not increment positions. (Adrien Grand)
* LUCENE-3907: EdgeNGramTokenizer now supports maxGramSize > 1024, doesn't trim
the input, sets position increment = 1 for all tokens and doesn't support
backward grams anymore. (Adrien Grand)
* LUCENE-3907: EdgeNGramTokenFilter does not support backward grams and does
not update offsets anymore. (Adrien Grand)
* LUCENE-4981: PositionFilter is now deprecated as it can corrupt token stream
graphs. Since its main use-case was to make query parsers generate boolean
queries instead of phrase queries, it is now advised to use
QueryParser.setAutoGeneratePhraseQueries(false) (for simple cases) or to
override QueryParser.newFieldQuery. (Adrien Grand, Steve Rowe)
* LUCENE-5018: CompoundWordTokenFilterBase and its children
DictionaryCompoundWordTokenFilter and HyphenationCompoundWordTokenFilter don't
update offsets anymore. (Adrien Grand)
* LUCENE-5015: SamplingAccumulator no longer corrects the counts of the sampled
categories. You should set TakmiSampleFixer on SamplingParams if required (but
notice that this means slower search). (Rob Audenaerde, Gilad Barkai, Shai Erera)
* LUCENE-4933: Replace ExactSimScorer/SloppySimScorer with just SimScorer. Previously
there were 2 implementations as a performance hack to support tableization of
sqrt(), but this caching is removed, as sqrt is implemented in hardware with modern
JVMs and it's faster not to cache. (Robert Muir)
* LUCENE-5038: MergePolicy now has a default implementation for useCompoundFile based
on segment size and noCFSRatio. The default implementation was pulled up from
TieredMergePolicy. (Simon Willnauer)
* LUCENE-5063: FieldCache.get(Bytes|Shorts), SortField.Type.(BYTE|SHORT) and
FieldCache.DEFAULT_(BYTE|SHORT|INT|LONG|FLOAT|DOUBLE)_PARSER are now
deprecated. These methods/types assume that data is stored as strings although
Lucene has much better support for numeric data through (Int|Long)Field,
NumericRangeQuery and FieldCache.get(Int|Long)s. (Adrien Grand)
* LUCENE-5078: TFIDFSimilarity lets you encode the norm value as any arbitrary long.
As a result, encode/decodeNormValue were made abstract with their signatures changed.
The default implementation was moved to DefaultSimilarity, which encodes the norm as
a single-byte value. (Shai Erera)
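The ordering change described in LUCENE-4955 above (NGramTokenizer) can be
reproduced with two plain loops. NGramOrderSketch is a hypothetical class that
works on Java chars and ignores offsets, positions and supplementary
characters; it only shows the order in which the grams come out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the old and new n-gram emission order. */
final class NGramOrderSketch {
  /** Old order: all grams of the smallest size first, then the next size. */
  static List<String> bySize(String s, int minGram, int maxGram) {
    List<String> out = new ArrayList<>();
    for (int n = minGram; n <= maxGram; n++) {
      for (int start = 0; start + n <= s.length(); start++) {
        out.add(s.substring(start, start + n));
      }
    }
    return out;
  }

  /** New order: grams grouped by start offset, shortest first at each offset. */
  static List<String> byOffset(String s, int minGram, int maxGram) {
    List<String> out = new ArrayList<>();
    for (int start = 0; start < s.length(); start++) {
      for (int n = minGram; n <= maxGram && start + n <= s.length(); n++) {
        out.add(s.substring(start, start + n));
      }
    }
    return out;
  }
}
```

For the input "abc" with minGram=1 and maxGram=2, bySize yields
(a, b, c, ab, bc) and byOffset yields (a, ab, b, bc, c), matching the two
orders quoted in the entry.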
Bug Fixes
* LUCENE-4890: QueryTreeBuilder.getBuilder() only finds interfaces on the
most derived class. (Adriano Crestani)
* LUCENE-4997: Internal test framework's tests are sensitive to previous
test failures and tests.failfast. (Dawid Weiss, Shai Erera)
* LUCENE-4955: NGramTokenizer now supports inputs larger than 1024 chars.
(Adrien Grand)
* LUCENE-4959: Fix incorrect return value in
SimpleNaiveBayesClassifier.assignClass. (Alexey Kutin via Adrien Grand)
* LUCENE-4972: DirectoryTaxonomyWriter created empty commits even if no changes
were made. (Shai Erera, Michael McCandless)
* LUCENE-949: AnalyzingQueryParser can't work with leading wildcards.
(Tim Allison, Robert Muir, Steve Rowe)
* LUCENE-4980: Fix issues preventing mixing of RangeFacetRequest and
non-RangeFacetRequest when using DrillSideways. (Mike McCandless,
Shai Erera)
* LUCENE-4996: Ensure DocInverterPerField always includes field name
in exception messages. (Markus Jelsma via Robert Muir)
* LUCENE-4992: Fix constructor of CustomScoreQuery to take FunctionQuery
for scoringQueries. Instead use QueryValueSource to safely wrap arbitrary
queries and use them with CustomScoreQuery. (John Wang, Robert Muir)
* LUCENE-5016: SamplingAccumulator returned inconsistent label if asked to
aggregate a non-existing category. Also fixed a bug in RangeAccumulator if
some readers did not have the requested numeric DV field.
(Rob Audenaerde, Shai Erera)
* LUCENE-5028: Remove pointless and confusing doShare option in FST's
PositiveIntOutputs (Han Jiang via Mike McCandless)
* LUCENE-5032: Fix IndexOutOfBoundsExc in PostingsHighlighter when
multi-valued fields exceed maxLength (Tomás Fernández Löbbe
via Mike McCandless)
* LUCENE-4933: SweetSpotSimilarity didn't apply its tf function to some
queries (SloppyPhraseQuery, SpanQueries). (Robert Muir)
* LUCENE-5033: SlowFuzzyQuery was accepting too many terms (documents) when
provided minSimilarity is an int > 1 (Tim Allison via Mike McCandless)
* LUCENE-5045: DrillSideways.search did not work on an empty index. (Shai Erera)
* LUCENE-4995: CompressingStoredFieldsReader now only reuses an internal buffer
when there is no more than 32kb to decompress. This prevents out-of-memory
errors when working with large stored fields.
(Adrien Grand)
* LUCENE-5048: CategoryPath with a long path could result in hitting
NegativeArraySizeException, categories being added multiple times to the
taxonomy or drill-down terms silently discarded by the indexer. CategoryPath
is now limited to MAX_CATEGORY_PATH_LENGTH characters.
(Colton Jamieson, Mike McCandless, Shai Erera)
* LUCENE-5062: If the spatial data for a document was comprised of multiple
overlapping or adjacent parts then a CONTAINS predicate query might not match
when the sum of those shapes contains the query shape but none do individually.
A flag was added to use the original faster algorithm. (David Smiley)
* LUCENE-4971: Fixed NPE in AnalyzingSuggester when there are too many
graph expansions. (Alexey Kudinov via Mike McCandless)
* LUCENE-5080: Combined setMaxMergeCount and setMaxThreadCount into one
setter in ConcurrentMergeScheduler: setMaxMergesAndThreads. Previously these
setters would not work unless you invoked them very carefully.
(Robert Muir, Shai Erera)
* LUCENE-5068: QueryParserUtil.escape() does not escape forward slash.
(Matias Holte via Steve Rowe)
* LUCENE-5103: A join on a single-valued field with deleted docs scored too few
docs. (David Smiley)
* LUCENE-5090: Detect mismatched readers passed to
SortedSetDocValuesReaderState and SortedSetDocValuesAccumulator.
(Robert Muir, Mike McCandless)
* LUCENE-5120: AnalyzingSuggester modified its FST's cached root arc if payloads
are used and the entire output resided on the root arc on the first access. This
caused subsequent suggest calls to fail. (Simon Willnauer)
Optimizations
* LUCENE-4936: Improve numeric doc values compression in case all values share
a common divisor. In particular, this improves the compression ratio of dates
without time when they are encoded as milliseconds since Epoch. Also support
TABLE compressed numerics in the Disk codec. (Robert Muir, Adrien Grand)
* LUCENE-4951: DrillSideways uses the new Scorer.cost() method to make
better decisions about which scorer to use internally. (Mike McCandless)
* LUCENE-4976: PersistentSnapshotDeletionPolicy writes its state to a
single snapshots_N file, and no longer requires closing (Mike
McCandless, Shai Erera)
* LUCENE-5035: Compress addresses in FieldCacheImpl.SortedDocValuesImpl more
efficiently. (Adrien Grand, Robert Muir)
* LUCENE-4941: Sort "from" terms only once when using JoinUtil.
(Martijn van Groningen)
* LUCENE-5050: Close the stored fields and term vectors index files as soon as
the index has been loaded into memory to save file descriptors. (Adrien Grand)
* LUCENE-5086: RamUsageEstimator now uses official Java 7 API or a proprietary
Oracle Java 6 API to get Hotspot MX bean, preventing AWT classes from being
loaded on MacOSX. (Shay Banon, Dawid Weiss, Uwe Schindler)
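The common-divisor idea from LUCENE-4936 above can be sketched in a few lines.
GcdCompressionSketch is a hypothetical class, not the codec implementation, and
it assumes non-negative values and ignores any minimum-value offset. It
computes the greatest common divisor of all values and shows how many bits per
value are needed before and after dividing by it.

```java
/** Hypothetical illustration of common-divisor compression for numerics. */
final class GcdCompressionSketch {
  /** Euclid's algorithm; assumes non-negative inputs. */
  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** Greatest common divisor of all values (0 for an empty array). */
  static long commonDivisor(long[] values) {
    long g = 0;
    for (long v : values) {
      g = gcd(g, v);
    }
    return g;
  }

  /** Number of bits needed to represent a non-negative value (at least 1). */
  static int bitsRequired(long value) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
  }
}
```

Dates without a time component, encoded as milliseconds since Epoch, are all
multiples of 86,400,000. Day 15030 needs 41 bits as raw milliseconds but only
14 bits once divided by the common divisor.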
New Features
* LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
(Dawid Weiss, Grzegorz Sobczyk)
* LUCENE-5064: Added PagedMutable (internal), a paged extension of
PackedInts.Mutable which allows for storing more than 2B values. (Adrien Grand)
* LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to
emit multiple tokens, one for each capture group in one or more patterns.
(Simon Willnauer, Clinton Gormley)
* LUCENE-4952: Expose control (protected method) in DrillSideways to
force all sub-scorers to be on the same document being collected.
This is necessary when using collectors like
ToParentBlockJoinCollector with DrillSideways. (Mike McCandless)
* SOLR-4761: Add SimpleMergedSegmentWarmer, which just initializes terms,
norms, docvalues, and so on. (Mark Miller, Mike McCandless, Robert Muir)
* LUCENE-4964: Allow arbitrary Query for per-dimension drill-down to
DrillDownQuery and DrillSideways, to support future dynamic faceting
methods (Mike McCandless)
* LUCENE-4966: Add CachingWrapperFilter.sizeInBytes() (Mike McCandless)
* LUCENE-4965: Add dynamic (no taxonomy index used) numeric range
faceting to Lucene's facet module (Mike McCandless, Shai Erera)
* LUCENE-4979: LiveFieldValues can work with any ReferenceManager, not
just ReferenceManager<IndexSearcher> (Mike McCandless).
* LUCENE-4975: Added a new Replicator module which can replicate index
revisions between server and client. (Shai Erera, Mike McCandless)
* LUCENE-5022: Added FacetResult.mergeHierarchies to merge multiple
FacetResult of the same dimension into a single one with the reconstructed
hierarchy. (Shai Erera)
* LUCENE-5026: Added PagedGrowableWriter, a new internal packed-ints structure
that grows the number of bits per value on demand, can store more than 2B
values and supports random write and read access. (Adrien Grand)
* LUCENE-5025: FST's Builder can now handle more than 2.1 billion
"tail nodes" while building a minimal FST. (Aaron Binns, Adrien
Grand, Mike McCandless)
* LUCENE-5063: FieldCache.DEFAULT.get(Ints|Longs) now uses bit-packing to save
memory. (Adrien Grand)
* LUCENE-5079: IndexWriter.hasUncommittedChanges() returns true if there are
changes that have not been committed. (yonik, Mike McCandless, Uwe Schindler)
* SOLR-4565: Extend NorwegianLightStemFilter and NorwegianMinimalStemFilter
to handle "nynorsk" (Erlend Garåsen, janhoy via Robert Muir)
* LUCENE-5087: Add getMultiValuedSeparator to PostingsHighlighter, for cases
where you want a different logical separator between field values. This can
be set to e.g. U+2029 PARAGRAPH SEPARATOR if you never want passes to span
values. (Mike McCandless, Robert Muir)
* LUCENE-5013: Added ScandinavianFoldingFilterFactory and
ScandinavianNormalizationFilterFactory (Karl Wettin via janhoy)
* LUCENE-4845: AnalyzingInfixSuggester finds suggestions based on
matches to any tokens in the suggestion, not just based on pure
prefix matching. (Mike McCandless, Robert Muir)
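The capture-group behavior described in LUCENE-4766 above can be approximated
with java.util.regex alone. CaptureGroupSketch is a hypothetical class, not the
PatternCaptureGroupTokenFilter itself; it ignores offsets, positions and any
option to keep the original token, and simply collects one string per
non-empty capture group across one or more patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of emitting one token per regex capture group. */
final class CaptureGroupSketch {
  static List<String> captureGroups(String token, Pattern... patterns) {
    List<String> out = new ArrayList<>();
    for (Pattern pattern : patterns) {
      Matcher m = pattern.matcher(token);
      while (m.find()) {
        // Group 0 is the whole match; capture groups start at 1.
        for (int g = 1; g <= m.groupCount(); g++) {
          String text = m.group(g);
          if (text != null && !text.isEmpty()) {
            out.add(text);
          }
        }
      }
    }
    return out;
  }
}
```

For the token "john.doe@example.com" and the pattern "([^@]+)@(.+)" this
collects "john.doe" and "example.com".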
API Changes
* LUCENE-5077: Make it easier to use compressed norms. Lucene42NormsFormat takes
an overhead parameter, so you can easily pass a different value other than
PackedInts.FASTEST from your own codec. (Robert Muir)
* LUCENE-5097: Analyzer now has an additional tokenStream(String fieldName,
String text) method, so wrapping by StringReader for common use is no
longer needed. This method uses an internal reusable reader, which was
previously only used by the Field class. (Uwe Schindler, Robert Muir)
* LUCENE-4542: HunspellStemFilter's maximum recursion level is now configurable.
(Piotr, Rafał Kuć via Adrien Grand)
Build
* LUCENE-4987: Upgrade randomized testing to version 2.0.10:
Test framework may fail internally due to overly aggressive J9 optimizations.
(Dawid Weiss, Shai Erera)
* LUCENE-5043: The eclipse target now uses the containing directory for the
project name. This also enforces UTF-8 encoding when files are copied with
filtering.
* LUCENE-5055: "rat-sources" target now also checks build.xml, ivy.xml,
forbidden-api signatures, and parts of resources folders. (Ryan Ernst,
Uwe Schindler)
* LUCENE-5072: Automatically patch javadocs generated by JDK versions
before 7u25 to work around the frame injection vulnerability (CVE-2013-1571,
VU#225657). (Uwe Schindler)
Tests
* LUCENE-4901: TestIndexWriterOnJRECrash should work on any
JRE vendor via Runtime.halt().
(Mike McCandless, Robert Muir, Uwe Schindler, Rodrigo Trujillo, Dawid Weiss)
Changes in runtime behavior
* LUCENE-5038: New segments written by IndexWriter are now wrapped into CFS
by default. DocumentsWriterPerThread doesn't consult MergePolicy anymore
to decide if a CFS must be written, instead IndexWriterConfig now has a
property to enable / disable CFS for newly created segments. (Simon Willnauer)
* LUCENE-5107: Properties files by Lucene are now written in UTF-8 encoding,
Unicode is no longer escaped. Reading of legacy properties files with
\u escapes is still possible. (Uwe Schindler, Robert Muir)
======================= Lucene 4.3.1 =======================
Bug Fixes
* SOLR-4813: Fix SynonymFilterFactory to allow init parameters for
tokenizer factory used when parsing synonyms file. (Shingo Sasaki, hossman)
* LUCENE-4935: CustomScoreQuery wrongly applied its query boost twice
(boost^2). (Robert Muir)
* LUCENE-4948: Fixed ArrayIndexOutOfBoundsException in PostingsHighlighter
if you had a 64-bit JVM without compressed OOPS: IBM J9, or Oracle with
large heap/explicitly disabled. (Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-4953: Fixed ParallelCompositeReader to inform ReaderClosedListeners of
its synthetic subreaders. FieldCaches keyed on the atomic children will be purged
earlier and FC insanity prevented. In addition, ParallelCompositeReader's
toString() was changed to better reflect the reader structure.
(Mike McCandless, Uwe Schindler)
* LUCENE-4968: Fixed ToParentBlockJoinQuery/Collector: correctly handle parent
hits that had no child matches, don't throw IllegalArgumentEx when
the child query has no hits, more aggressively catch cases where childQuery
incorrectly matches parent documents (Mike McCandless)
* LUCENE-4970: Fix boost value of rewritten NGramPhraseQuery.
(Shingo Sasaki via Adrien Grand)
* LUCENE-4974: CommitIndexTask was broken if no params were set. (Shai Erera)
* LUCENE-4986: Fixed case where a newly opened near-real-time reader
fails to reflect a delete from IndexWriter.tryDeleteDocument (Reg,
Mike McCandless)
* LUCENE-4994: Fix PatternKeywordMarkerFilter to have public constructor.
(Uwe Schindler)
* LUCENE-4993: Fix BeiderMorseFilter to preserve custom attributes when
inserting tokens with position increment 0. (Uwe Schindler)
* LUCENE-4991: Fix handling of synonyms in classic QueryParser.getFieldQuery for
terms not separated by whitespace. PositionIncrementAttribute was ignored, so with
default AND synonyms wrongly became mandatory clauses, and with OR, the
coordination factor was wrong. (李威, Robert Muir)
* LUCENE-5002: IndexWriter#deleteAll() caused a deadlock in DWPT / DWSC if a
DWPT was flushing concurrently while deleteAll() aborted all DWPT. The IW
should never wait on DWPT via the flush control while holding on to the IW
Lock. (Simon Willnauer)
Optimizations
* LUCENE-4938: Don't use an unnecessarily large priority queue in IndexSearcher
methods that take top-N. (Uwe Schindler, Mike McCandless, Robert Muir)
======================= Lucene 4.3.0 =======================
Changes in backwards compatibility policy
* LUCENE-4810: EdgeNGramTokenFilter no longer increments position for
multiple ngrams derived from the same input token. (Walter Underwood
via Mike McCandless)
* LUCENE-4822: KeywordTokenFilter is now an abstract class. Subclasses
need to implement #isKeyword() in order to mark terms as keywords.
The existing functionality has been factored out into a new
SetKeywordTokenFilter class. (Simon Willnauer, Uwe Schindler)
* LUCENE-4642: Remove Tokenizer's and subclasses' ctors taking
AttributeSource. (Renaud Delbru, Uwe Schindler, Steve Rowe)
* LUCENE-4833: IndexWriterConfig used to use LogByteSizeMergePolicy when
calling setMergePolicy(null) although the default merge policy is
TieredMergePolicy. IndexWriterConfig setters now throw an exception when
passed null if null is not a valid value. (Adrien Grand)
* LUCENE-4849: Made ParallelTaxonomyArrays abstract with a concrete
implementation for DirectoryTaxonomyWriter/Reader. Also moved it under
o.a.l.facet.taxonomy. (Shai Erera)
* LUCENE-4876: IndexDeletionPolicy is now an abstract class instead of an
interface. IndexDeletionPolicy, MergeScheduler and InfoStream now implement
Cloneable. (Adrien Grand)
* LUCENE-4874: FilterAtomicReader and related classes (FilterTerms,
FilterDocsEnum, ...) don't forward anymore to the filtered instance when the
method has a default implementation through other abstract methods.
(Adrien Grand, Robert Muir)
* LUCENE-4642, LUCENE-4877: Implementors of TokenizerFactory, TokenFilterFactory,
and CharFilterFactory now need to provide at least one constructor taking
Map<String,String> to be able to be loaded by the SPI framework (e.g., from Solr).
In addition, TokenizerFactory needs to implement the abstract
create(AttributeFactory,Reader) method. (Renaud Delbru, Uwe Schindler,
Steve Rowe, Robert Muir)
API Changes
* LUCENE-4896: Made PassageFormatter abstract in PostingsHighlighter, made
members of DefaultPassageFormatter protected. (Luca Cavanna via Robert Muir)
* LUCENE-4844: removed TaxonomyReader.getParent(), you should use
TaxonomyReader.getParallelArrays().parents() instead. (Shai Erera)
* LUCENE-4742: Renamed spatial 'Node' to 'Cell', along with any method names
and variables using this terminology. (David Smiley)
New Features
* LUCENE-4815: DrillSideways now allows more than one FacetRequest per
dimension (Mike McCandless)
* LUCENE-3918: IndexSorter has been ported to 4.3 API and now supports
sorting documents by a numeric DocValues field, or reversing the order of
the documents in the index. Additionally, apps can implement their own
sort criteria. (Anat Hashavit, Shai Erera)
* LUCENE-4817: Added KeywordRepeatFilter that emits each token twice,
once as a keyword and once as an ordinary token, allowing stemmers to emit
a stemmed version along with the un-stemmed version. (Simon Willnauer)
* LUCENE-4822: PatternKeywordTokenFilter can mark tokens as keywords based
on regular expressions. (Simon Willnauer, Uwe Schindler)
* LUCENE-4821: AnalyzingSuggester now uses the ending offset to
determine whether the last token was finished or not, so that a
query "i " will no longer suggest "Isla de Muerta" for example.
(Mike McCandless)
* LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and
subclasses with ctors taking AttributeFactory.
(Renaud Delbru, Uwe Schindler, Steve Rowe)
* LUCENE-4820: Add payloads to Analyzing/FuzzySuggester, to record an
arbitrary byte[] per suggestion (Mike McCandless)
* LUCENE-4816: Add WholeBreakIterator to PostingsHighlighter
for treating the entire content as a single Passage. (Robert
Muir, Mike McCandless)
* LUCENE-4827: Add additional ctor to PostingsHighlighter PassageScorer
to provide bm25 k1,b,avgdl parameters. (Robert Muir)
* LUCENE-4607: Add DocIdSetIterator.cost() and Spans.cost() for optimizing
scoring. (Simon Willnauer, Robert Muir)
* LUCENE-4795: Add SortedSetDocValuesFacetFields and
SortedSetDocValuesAccumulator, to compute topK facet counts from a
field's SortedSetDocValues. This method only supports flat
(dim/label) facets, is a bit (~25%) slower, has added cost
per-IndexReader-open to compute its ordinal map, but it requires no
taxonomy index and it tie-breaks facet labels in an understandable
(by Unicode sort order) way. (Robert Muir, Mike McCandless)
* LUCENE-4843: Add LimitTokenPositionFilter: don't emit tokens with
positions that exceed the configured limit. (Steve Rowe)
* LUCENE-4832: Add ToParentBlockJoinCollector.getTopGroupsWithAllChildDocs, to retrieve
all children in each group. (Aleksey Aleev via Mike McCandless)
* LUCENE-4846: PostingsHighlighter subclasses can override where the
String values come from (it still defaults to pulling from stored
fields). (Robert Muir, Mike McCandless)
* LUCENE-4853: Add PostingsHighlighter.highlightFields method that
takes int[] docIDs instead of TopDocs. (Robert Muir, Mike
McCandless)
* LUCENE-4856: If there are no matches for a given field, return the
first maxPassages sentences (Robert Muir, Mike McCandless)
* LUCENE-4859: IndexReader now exposes Terms statistics: getDocCount,
getSumDocFreq, getSumTotalTermFreq. (Shai Erera)
* LUCENE-4862: It is now possible to terminate collection of a single
IndexReader leaf by throwing a CollectionTerminatedException in
Collector.collect. (Adrien Grand, Shai Erera)
* LUCENE-4752: New SortingMergePolicy (in lucene/misc) that sorts documents
before merging segments. (Adrien Grand, Shai Erera, David Smiley)
* LUCENE-4860: Customize scoring and formatting per-field in
PostingsHighlighter by subclassing and overriding the getFormatter
and/or getScorer methods. This also changes Passage.getMatchTerms()
to return BytesRef[] instead of Term[]. (Robert Muir, Mike
McCandless)
* LUCENE-4839: Added SorterTemplate.timSort, an O(n log n) stable sort algorithm
that performs well on partially sorted data. (Adrien Grand)
* LUCENE-4644: Added support for the "IsWithin" spatial predicate for
RecursivePrefixTreeStrategy. It's for matching non-point indexed shapes; if
you only have points (1/doc) then "Intersects" is equivalent and faster.
See the javadocs. (David Smiley)
* LUCENE-4861: Make BreakIterator per-field in PostingsHighlighter. This means
you can override getBreakIterator(String field) to use different mechanisms
for e.g. title vs. body fields. (Mike McCandless, Robert Muir)
* LUCENE-4645: Added support for the "Contains" spatial predicate for
RecursivePrefixTreeStrategy. (David Smiley)
* LUCENE-4898: DirectoryReader.openIfChanged now allows opening a reader
on an IndexCommit starting from a near-real-time reader (previously
this would throw IllegalArgumentException). (Mike McCandless)
* LUCENE-4905: Made the maxPassages parameter per-field in PostingsHighlighter.
(Robert Muir)
* LUCENE-4897: Added TaxonomyReader.getChildren for traversing a category's
children. (Shai Erera)
* LUCENE-4902: Added FilterDirectoryReader to allow easy filtering of a
DirectoryReader's subreaders. (Alan Woodward, Adrien Grand, Uwe Schindler)
* LUCENE-4858: Added EarlyTerminatingSortingCollector to be used in conjunction
with SortingMergePolicy, which allows early termination of queries on sorted
indexes, when the sort order matches the index order. (Adrien Grand, Shai Erera)
* LUCENE-4904: Added descending sort order to NumericDocValuesSorter. (Shai Erera)
* LUCENE-3786: Added SearcherTaxonomyManager, to manage access to both
IndexSearcher and DirectoryTaxonomyReader for near-real-time
faceting. (Shai Erera, Mike McCandless)
* LUCENE-4915: DrillSideways now allows drilling down on fields that
are not faceted. (Mike McCandless)
* LUCENE-4895: Added support for the "IsDisjointTo" spatial predicate for
RecursivePrefixTreeStrategy. (David Smiley)
* LUCENE-4774: Added FieldComparator that allows sorting parent documents based on
fields on the child / nested document level. (Martijn van Groningen)
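The per-leaf termination described in LUCENE-4862 above can be illustrated with a minimal, library-free sketch. All names here (LeafTerminationSketch, LeafTerminated, CappedCollector, search) are hypothetical stand-ins rather than Lucene classes; the sketch only assumes that the search driver catches the exception for the current leaf and then continues with the next one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
final class LeafTerminationSketch {

  /** Stand-in for CollectionTerminatedException: "skip the rest of this leaf". */
  static final class LeafTerminated extends RuntimeException {}

  /** Collects at most maxPerLeaf doc IDs from each leaf. */
  static final class CappedCollector {
    final int maxPerLeaf;
    final List<Integer> collected = new ArrayList<>();
    int seenInLeaf;

    CappedCollector(int maxPerLeaf) {
      this.maxPerLeaf = maxPerLeaf;
    }

    /** Called by the driver before each leaf; resets the per-leaf counter. */
    void setNextLeaf() {
      seenInLeaf = 0;
    }

    void collect(int doc) {
      if (seenInLeaf == maxPerLeaf) {
        throw new LeafTerminated(); // nothing more is wanted from this leaf
      }
      collected.add(doc);
      seenInLeaf++;
    }
  }

  /** Driver: the exception ends the current leaf only, not the whole search. */
  static List<Integer> search(int[][] leaves, CappedCollector collector) {
    for (int[] leaf : leaves) {
      collector.setNextLeaf();
      try {
        for (int doc : leaf) {
          collector.collect(doc);
        }
      } catch (LeafTerminated e) {
        // Expected control flow: continue with the next leaf.
      }
    }
    return collector.collected;
  }
}
```

With leaves {1,2,3}, {4} and {5,6,7} and a cap of 2 per leaf, the sketch collects 1, 2, 4, 5, 6.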
Optimizations
* LUCENE-4839: SorterTemplate.merge can now be overridden in order to replace
the default implementation which merges in-place by a faster implementation
that could require fewer swaps at the expense of some extra memory.
ArrayUtil and CollectionUtil override it so that their mergeSort and timSort
methods are faster but only require up to 1% of extra memory. (Adrien Grand)
* LUCENE-4571: Speed up BooleanQuerys with minNrShouldMatch to use
skipping. (Stefan Pohl via Robert Muir)
* LUCENE-4863: StemmerOverrideFilter now uses a FST to represent its overrides
in memory. (Simon Willnauer)
* LUCENE-4889: UnicodeUtil.codePointCount implementation replaced with a
non-array-lookup version. (Dawid Weiss)
* LUCENE-4923: Speed up BooleanQuerys processing of in-order disjunctions.
(Robert Muir)
* LUCENE-4926: Speed up DisjunctionMaxQuery. (Robert Muir)
* LUCENE-4930: Reduce contention in older/buggy JVMs when using
AttributeSource#addAttribute() because java.lang.ref.ReferenceQueue#poll()
is implemented using synchronization. (Christian Ziech, Karl Wright,
Uwe Schindler)
Bug Fixes
* LUCENE-4868: SumScoreFacetsAggregator used an incorrect index into
the scores array. (Shai Erera)
* LUCENE-4882: FacetsAccumulator did not allow counting the ROOT category (i.e.
count dimensions). (Shai Erera)
* LUCENE-4876: IndexWriterConfig.clone() now clones its MergeScheduler,
IndexDeletionPolicy and InfoStream in order to make an IndexWriterConfig and
its clone fully independent. (Adrien Grand)
* LUCENE-4893: Facet counts were multiplied as many times as
FacetsCollector.getFacetResults() is called. (Shai Erera)
* LUCENE-4888: Fixed SloppyPhraseScorer, MultiDocs(AndPositions)Enum and
MultiSpansWrapper which happened to sometimes call DocIdSetIterator.advance
with target<=current (in this case the behavior of advance is undefined).
(Adrien Grand)
* LUCENE-4899: FastVectorHighlighter failed with StringIndexOutOfBoundsException
if a single highlight phrase or term was greater than the fragCharSize producing
negative string offsets. (Simon Willnauer)
* LUCENE-4877: Throw exception for invalid arguments in analysis factories.
(Steve Rowe, Uwe Schindler, Robert Muir)
* LUCENE-4914: SpatialPrefixTree's Node/Cell.reset() forgot to reset the 'leaf'
flag. It affects SpatialRecursivePrefixTreeStrategy on non-point indexed
shapes, as of Lucene 4.2. (David Smiley)
* LUCENE-4913: FacetResultNode.ordinal was always 0 when all children
are returned. (Mike McCandless)
* LUCENE-4918: Highlighter closes the given IndexReader if QueryScorer
is used with an external IndexReader. (Simon Willnauer, Sirvan Yahyaei)
* LUCENE-4880: Fix MemoryIndex to consume empty terms from the tokenstream consistent
with IndexWriter. Previously it discarded them. (Timothy Allison via Robert Muir)
* LUCENE-4885: FacetsAccumulator did not set the correct value for
FacetResult.numValidDescendants. (Mike McCandless, Shai Erera)
* LUCENE-4925: Fixed IndexSearcher.search when the argument list contains a Sort
and one of the sort fields is the relevance score. Only IndexSearchers created
with an ExecutorService are concerned. (Adrien Grand)
* LUCENE-4738, LUCENE-2727, LUCENE-2812: Simplified
DirectoryReader.indexExists so that it's more robust to transient
IOExceptions (e.g. due to issues like file descriptor exhaustion),
but this will also cause it to err towards returning true for
example if the directory contains a corrupted index or an incomplete
initial commit. In addition, IndexWriter with OpenMode.CREATE will
now succeed even if the directory contains a corrupted index (Billow
Gao, Robert Muir, Mike McCandless)
* LUCENE-4928: Stored fields and term vectors could become super slow in case
of tiny documents (a few bytes). This is especially problematic when switching
codecs since bulk-merge strategies can't be applied and the same chunk of
documents can end up being decompressed thousands of times. A hard limit on
the number of documents per chunk has been added to fix this issue.
(Robert Muir, Adrien Grand)
* LUCENE-4934: Fix minor equals/hashcode problems in facet/DrillDownQuery,
BoostingQuery, MoreLikeThisQuery, FuzzyLikeThisQuery, and block join queries.
(Robert Muir, Uwe Schindler)
* LUCENE-4504: Fix broken sort comparator in ValueSource.getSortField,
used when sorting by a function query. (Tom Shally via Robert Muir)
* LUCENE-4937: Fix incorrect sorting of float/double values (+/-0, NaN).
(Robert Muir, Uwe Schindler)
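To show why +/-0 and NaN are troublesome when sorting floating-point values (LUCENE-4937 above), here is a small sketch contrasting a comparison built from relational operators with the total order of Double.compare. It is not taken from the Lucene patch; the class and method names are hypothetical.

```java
// Hypothetical illustration; not the code changed by LUCENE-4937.
final class FloatOrderSketch {

  /** Naive comparison using relational operators only. */
  static int naiveCompare(double a, double b) {
    if (a < b) return -1;
    if (a > b) return 1;
    return 0; // also reached for -0.0 vs 0.0 and whenever either side is NaN
  }

  /** Total order defined by the JDK: -0.0 sorts before 0.0, NaN sorts last. */
  static int totalCompare(double a, double b) {
    return Double.compare(a, b);
  }
}
```

The naive version reports NaN as "equal" to every value, so it is not a consistent ordering and a sort based on it can produce arbitrary results. Double.compare (and Float.compare) avoid this.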
Documentation
* LUCENE-4841: Added example SimpleSortedSetFacetsExample to show how
to use the new SortedSetDocValues backed facet implementation.
(Shai Erera, Mike McCandless)
Build
* LUCENE-4879: Upgrade randomized testing to version 2.0.9:
Filter stack traces on console output. (Dawid Weiss, Robert Muir)
======================= Lucene 4.2.1 =======================
Bug Fixes
* LUCENE-4713: The SPI components used to load custom codecs or analysis
components were fixed to also scan the Lucene ClassLoader in addition
to the context ClassLoader, so Lucene is always able to find its own
codecs. The special case of a null context ClassLoader is now also
supported. (Christian Kohlschütter, Uwe Schindler)
* LUCENE-4819: seekExact(BytesRef, boolean) did not work correctly with
Sorted[Set]DocValuesTermsEnum. (Robert Muir)
* LUCENE-4826: PostingsHighlighter was not returning the top N best
scoring passages. (Robert Muir, Mike McCandless)
* LUCENE-4854: Fix DocTermOrds.getOrdTermsEnum() to not return negative
ord on initial next(). (Robert Muir)
* LUCENE-4836: Fix SimpleRateLimiter#pause to return the actual time spent
sleeping instead of the wakeup timestamp in nanoseconds. (Simon Willnauer)
* LUCENE-4828: BooleanQuery no longer extracts terms from its MUST_NOT
clauses. (Mike McCandless)
* SOLR-4589: Fixed CPU spikes and poor performance in lazy field loading
of multivalued fields. (hossman)
* LUCENE-4870: Fix bug where an entire index might be deleted by the IndexWriter
due to false detection of whether an index exists in the directory when
OpenMode.CREATE_OR_APPEND is used. This might also affect applications that set
the open mode manually using DirectoryReader#indexExists. (Simon Willnauer)
* LUCENE-4878: Override getRegexpQuery in MultiFieldQueryParser to prevent
NullPointerException when regular expression syntax is used with
MultiFieldQueryParser. (Simon Willnauer, Adam Rauch)
Optimizations
* LUCENE-4819: Added Sorted[Set]DocValues.termsEnum(), and optimized the
default codec for improved enumeration performance. (Robert Muir)
* LUCENE-4854: Speed up TermsEnum of FieldCache.getDocTermOrds.
(Robert Muir)
* LUCENE-4857: Don't unnecessarily copy stem override map in
StemmerOverrideFilter. (Simon Willnauer)
======================= Lucene 4.2.0 =======================
Changes in backwards compatibility policy
* LUCENE-4602: FacetFields now stores facet ordinals in a DocValues field,
rather than a payload. This requires rebuilding existing indexes or doing a
one-time migration using FacetsPayloadMigratingReader. Since DocValues
support in-memory caching, CategoryListCache was removed too.
(Shai Erera, Michael McCandless)
* LUCENE-4697: FacetResultNode is now a concrete class with public members
(instead of getter methods). (Shai Erera)
* LUCENE-4600: FacetsCollector is now an abstract class with two
implementations: StandardFacetsCollector (the old version of
FacetsCollector) and CountingFacetsCollector. FacetsCollector.create()
returns the most optimized collector for the given parameters.
(Shai Erera, Michael McCandless)
* LUCENE-4700: OrdinalPolicy is now per CategoryListParams, and is no longer
an interface, but rather an enum with values NO_PARENTS and ALL_PARENTS.
PathPolicy was removed, you should extend FacetFields and DrillDownStream
to control which categories are added as drill-down terms. (Shai Erera)
* LUCENE-4547: DocValues improvements:
- Simplified codec API: codecs are now only responsible for encoding and
decoding docvalues, they do not need to do buffering or RAM accounting.
- Per-Field support: added PerFieldDocValuesFormat, which allows you to
use a different DocValuesFormat per field (like postings).
- Unified with FieldCache api: DocValues can be accessed via FieldCache API,
so it works automatically with grouping/join/sort/function queries, etc.
- Simplified types: There are only 3 types (NUMERIC, BINARY, SORTED), so it's
not necessary to specify for example that all of your binary values have
the same length. Instead it's easy for the Codec API to optimize encoding
based on any properties of the content.
(Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
* LUCENE-4757: Cleanup and refactoring of FacetsAccumulator, FacetRequest,
FacetsAggregator and FacetResultsHandler API. If your application did
FacetsCollector.create(), you should not be affected, but if you wrote
an Aggregator, then you should migrate it to the per-segment
FacetsAggregator. You can still use StandardFacetsAccumulator, which works
with the old API (for now). (Shai Erera)
* LUCENE-4761: Facet packages reorganized. Should be easy to fix your import
statements, if you use an IDE such as Eclipse. (Shai Erera)
* LUCENE-4750: Convert DrillDown to DrillDownQuery, so you can initialize it
and add drill-down categories to it. (Michael McCandless, Shai Erera)
* LUCENE-4759: remove FacetRequest.SortBy; result categories are always
sorted by value, while ties are broken by category ordinal. (Shai Erera)
* LUCENE-4772: Facet associations moved to new FacetsAggregator API. You
should override FacetsAccumulator and return the relevant aggregator,
for aggregating the association values. (Shai Erera)
* LUCENE-4748: A FacetRequest on a non-existent field now returns an
empty FacetResult instead of skipping it. (Shai Erera, Mike McCandless)
* LUCENE-4806: The default category delimiter character was changed
from U+F749 to U+001F, since the latter uses 1 byte vs 3 bytes for
the former. Existing facet indices must be reindexed. (Robert
Muir, Shai Erera, Mike McCandless)
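The byte counts quoted in LUCENE-4806 above follow from UTF-8 encoding rules and can be checked with a short sketch. The class and method names are hypothetical; only JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the facet module.
final class DelimiterBytesSketch {

  /** Returns the number of bytes the given character occupies in UTF-8. */
  static int utf8Length(char c) {
    return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    System.out.println(utf8Length((char) 0xF749)); // old delimiter, U+F749
    System.out.println(utf8Length((char) 0x001F)); // new delimiter, U+001F
  }
}
```

Code points below U+0080 take a single byte in UTF-8, while those in the range U+0800 to U+FFFF take three, which is where the 1 byte vs 3 bytes difference comes from.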
Optimizations
* LUCENE-4687: BloomFilterPostingsFormat now lazily initializes delegate
TermsEnum only if needed to do a seek or get a DocsEnum. (Simon Willnauer)
* LUCENE-4677, LUCENE-4682: unpacked FSTs now use vInt to encode the node target,
to reduce their size (Mike McCandless)
* LUCENE-4678: FST now uses a paged byte[] structure instead of a
single byte[] internally, to avoid large memory spikes during
building (James Dyer, Mike McCandless)
* LUCENE-3298: FST can now be larger than 2.1 GB / 2.1 B nodes.
(James Dyer, Mike McCandless)
* LUCENE-4690: Performance improvements and non-hashing versions
of NumericUtils.*ToPrefixCoded() (yonik)
* LUCENE-4715: CategoryListParams.getOrdinalPolicy now allows to return a
different OrdinalPolicy per dimension, to better tune how you index
facets. Also added OrdinalPolicy.ALL_BUT_DIMENSION.
(Shai Erera, Michael McCandless)
* LUCENE-4740: Don't track clones of MMapIndexInput if unmapping
is disabled. This reduces GC overhead. (Kristofer Karlsson, Uwe Schindler)
* LUCENE-4733: The default Lucene 4.2 codec now uses a more compact
TermVectorsFormat (Lucene42TermVectorsFormat) based on
CompressingTermVectorsFormat. (Adrien Grand)
* LUCENE-3729: The default Lucene 4.2 codec now uses a more compact
DocValuesFormat (Lucene42DocValuesFormat). Sorted values are stored in an
FST, Numerics and Ordinals use a number of strategies (delta-compression,
table-compression, etc), and memory addresses use MonotonicBlockPackedWriter.
(Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
* LUCENE-4792: Reduction of the memory required to build the doc ID maps used
when merging segments. (Adrien Grand)
* LUCENE-4794: Spatial RecursivePrefixTreeStrategy's search filter: Skip calls
to termsEnum.seek() when the next term is known to follow the current cell.
(David Smiley)
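As background for the vInt encoding mentioned in LUCENE-4677 and LUCENE-4682 above, the following sketch shows a generic variable-length integer scheme: seven data bits per byte, low-order group first, with the high bit set while more bytes follow. It is a standalone illustration with hypothetical names and makes no claim about how the FST code lays out its node targets.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical standalone sketch of a variable-length int encoding.
final class VIntSketch {

  /** Encodes value using 7 data bits per byte; the high bit marks "more bytes follow". */
  static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation bit
      value >>>= 7;
    }
    out.write(value); // final byte, continuation bit clear
    return out.toByteArray();
  }

  /** Decodes a byte sequence produced by encode. */
  static int decode(byte[] bytes) {
    int value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value;
  }
}
```

Small values are where the size reduction comes from: anything below 128 fits in one byte and anything below 16384 in two, compared with four bytes for a fixed-width int.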
New Features
* LUCENE-4686: New specialized DGapVInt8IntEncoder for facets (now the
default). (Shai Erera)
* LUCENE-4703: Add simple PrintTaxonomyStats tool to see summary
information about the facets taxonomy index. (Mike McCandless)
* LUCENE-4599: New oal.codecs.compressing.CompressingTermVectorsFormat which
compresses term vectors into chunks of documents similarly to
CompressingStoredFieldsFormat. (Adrien Grand)
* LUCENE-4695: Added LiveFieldValues utility class, for getting the
current (live, real-time) value for any indexed doc/field. The
class buffers recently indexed doc/field values until a new
near-real-time reader is opened that contains those changes.
(Robert Muir, Mike McCandless)
* LUCENE-4723: Add AnalyzerFactoryTask to benchmark, and enable analyzer
creation via the resulting factories using NewAnalyzerTask. (Steve Rowe)
* LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten
against the highlighting IndexReader to obtain primitive queries before
discarding the query entirely. WeightedSpanTermExtractor now builds a
MemoryIndex only once even if multiple fields are highlighted.
(Simon Willnauer)
* LUCENE-4035: Added ICUCollationDocValuesField, more efficient
support for Locale-sensitive sort and range queries for
single-valued fields. (Robert Muir)
* LUCENE-4547: Added MonotonicBlockPacked(Reader/Writer), which provide
efficient random access to large amounts of monotonically increasing
positive values (e.g. file offsets). Each block stores the minimum value
and the average gap, and values are encoded as signed deviations from
the expected value. (Adrien Grand)
* LUCENE-4547: Added AppendingLongBuffer, an append-only buffer that packs
signed long values in memory and provides an efficient iterator API.
(Adrien Grand)
* LUCENE-4540: It is now possible for a codec to represent norms with
less than 8 bits per value. For performance reasons this is not done
by default, but you can customize your codec (e.g. pass PackedInts.DEFAULT
to Lucene42DocValuesConsumer) if you want to make this tradeoff.
(Adrien Grand, Robert Muir)
* LUCENE-4764: A new Facet42Codec and Facet42DocValuesFormat provide
faster but more RAM-consuming facet performance. (Shai Erera, Mike
McCandless)
* LUCENE-4769: Added OrdinalsCache and CachedOrdsCountingFacetsAggregator
which uses the cache to obtain a document's ordinals. This aggregator
is faster than others, however consumes much more RAM.
(Michael McCandless, Shai Erera)
* LUCENE-4778: Add a getter for the delegate in RateLimitedDirectoryWrapper.
(Mark Miller)
* LUCENE-4765: Add a multi-valued docvalues type (SORTED_SET). This is equivalent
to building a FieldCache.getDocTermOrds at index-time. (Robert Muir)
* LUCENE-4780: Add MonotonicAppendingLongBuffer: an append-only buffer for
monotonically increasing values. (Adrien Grand)
* LUCENE-4748: Added DrillSideways utility class for computing both
drill-down and drill-sideways counts for a DrillDownQuery. (Mike
McCandless)
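The block encoding described for MonotonicBlockPacked(Reader/Writer) in LUCENE-4547 above (minimum value, average gap, signed deviations from the expected value) can be sketched as follows. This is a simplified illustration with hypothetical names: it keeps the deviations in a plain long[] instead of bit-packing them, and it assumes a non-empty, non-decreasing block whose first element is therefore the minimum.

```java
// Hypothetical, simplified sketch; not the actual Lucene implementation or file format.
final class MonotonicBlockSketch {
  final long min;          // minimum (first) value of the block
  final float averageGap;  // average difference between consecutive values
  final long[] deviations; // signed differences from the expected values

  /** Encodes one block; values is assumed non-empty and non-decreasing. */
  MonotonicBlockSketch(long[] values) {
    min = values[0];
    averageGap = values.length == 1
        ? 0f
        : (float) (values[values.length - 1] - min) / (values.length - 1);
    deviations = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      deviations[i] = values[i] - expected(i);
    }
  }

  /** Value predicted for index i from the minimum and the average gap alone. */
  long expected(int i) {
    return min + (long) (averageGap * i);
  }

  /** Random access: expected value plus the stored deviation. */
  long get(int i) {
    return expected(i) + deviations[i];
  }
}
```

For the block {100, 110, 121, 129, 140} the average gap is 10 and the deviations are {0, 0, 1, -1, 0}. Deviations stay small when values grow at a roughly steady rate, so they need far fewer bits than the raw values.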
API Changes
* LUCENE-4709: FacetResultNode no longer has a residue field. (Shai Erera)
* LUCENE-4716: DrillDown.query now takes Occur, allowing to specify if
categories should be OR'ed or AND'ed. (Shai Erera)
* LUCENE-4695: ReferenceManager.RefreshListener.afterRefresh now takes
a boolean indicating whether a new reference was in fact opened, and
a new beforeRefresh method notifies you when a refresh attempt is
starting. (Robert Muir, Mike McCandless)
* LUCENE-4794: Spatial RecursivePrefixTreeFilter replaced by
IntersectsPrefixTreeFilter and some extensible base classes. (David Smiley)
Bug Fixes
* LUCENE-4705: Pass on FilterStrategy in FilteredQuery if the filtered query is
rewritten. (Simon Willnauer)
* LUCENE-4712: MemoryIndex#normValues() throws NPE if field doesn't exist.
(Simon Willnauer, Ricky Pritchett)
* LUCENE-4550: Shapes wider than 180 degrees would use too much accuracy for the
PrefixTree based SpatialStrategy. For a pathological case of nearly 360
degrees and barely any height, it would generate so many indexed terms
(> 500k) that it could even cause an OutOfMemoryError. Fixed. (David Smiley)
* LUCENE-4704: Make join queries override hashcode and equals methods.
(Martijn van Groningen)
* LUCENE-4724: Fix bug in CategoryPath which allowed passing null or empty
string components. This is forbidden now (throws an exception). Note that if
you have a taxonomy index created with such strings, you should rebuild it.
(Michael McCandless, Shai Erera)
* LUCENE-4732: Fixed TermsEnum.seekCeil/seekExact on term vectors.
(Adrien Grand, Robert Muir)
* LUCENE-4739: Fixed bugs that prevented FSTs more than ~1.1GB from
being saved and loaded (Adrien Grand, Mike McCandless)
* LUCENE-4717: Fixed bug where Lucene40DocValuesFormat would sometimes write
an extra unused ordinal for sorted types. The bug is detected and corrected
on-the-fly for old indexes. (Robert Muir)
* LUCENE-4547: Fixed bug where Lucene40DocValuesFormat was unable to encode
segments that would exceed 2GB total data. This could happen in some surprising
cases, for example if you had an index with more than 260M documents and a
VAR_INT field. (Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
* LUCENE-4775: Remove SegmentInfo.sizeInBytes() and make
MergePolicy.OneMerge.totalBytesSize thread safe (Josh Bronson via
Robert Muir, Mike McCandless)
* LUCENE-4770: If spatial's TermQueryPrefixTreeStrategy was used to search
indexed non-point shapes, then there was an edge case where a query should
find a shape but it didn't. The fix is the removal of an optimization that
simplifies some leaf cells into a parent. The index data for such a field is
now ~20% larger. This optimization is still done for the query shape, and for
indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization
is enhanced to roll up beyond the bottom cell level. (David Smiley,
Florian Schilling)
* LUCENE-4790: Fix FieldCacheImpl.getDocTermOrds to not bake deletes into the
cached datastructure. Otherwise this can cause inconsistencies with readers
at different points in time. (Robert Muir)
* LUCENE-4791: A conjunction of terms (ConjunctionTermScorer) scanned on
the lowest frequency term instead of skipping, leading to potentially
large performance impacts for many non-random or non-uniform
term distributions. (John Wang, yonik)
* LUCENE-4798: PostingsHighlighter's formatter sometimes didn't highlight
matched terms. (Robert Muir)
* LUCENE-4796, SOLR-4373: Fix concurrency issue in NamedSPILoader and
AnalysisSPILoader when doing reload (e.g. from Solr).
(Uwe Schindler, Hossman)
* LUCENE-4802: Don't compute norms for drill-down facet fields. (Mike McCandless)
* LUCENE-4804: PostingsHighlighter sometimes applied terms to the wrong passage,
if they started exactly on a passage boundary. (Robert Muir)
Documentation
* LUCENE-4718: Fixed documentation of oal.queryparser.classic.
(Hayden Muhl via Adrien Grand)
* LUCENE-4784, LUCENE-4785, LUCENE-4786: Fixed references to deprecated classes
SinkTokenizer, ValueSourceQuery and RangeQuery. (Hao Zhong via Adrien Grand)
Build
* LUCENE-4654: Test duration statistics from multiple test runs should be
reused. (Dawid Weiss)
* LUCENE-4636: Upgrade ivy to 2.3.0 (Shawn Heisey via Robert Muir)
* LUCENE-4570: Use the Policeman Forbidden API checker, released separately
from Lucene and downloaded via Ivy. (Uwe Schindler, Robert Muir)
* LUCENE-4758: 'ant jar', 'ant compile', and 'ant compile-test' should
recurse. (Steve Rowe)
======================= Lucene 4.1.0 =======================
Changes in backwards compatibility policy
* LUCENE-4514: Scorer's freq() method returns an integer value indicating
the number of times the scorer matches the current document. Previously
this was only sometimes the case; in some cases it returned a (meaningless)
floating point value. Scorer now extends DocsEnum so it has attributes().
(Robert Muir)
* LUCENE-4543: TFIDFSimilarity's index-time computeNorm is now final to
match the fact that its query-time norm usage requires a FIXED_8 encoding.
Override lengthNorm and/or encode/decodeNormValue to change the specifics,
like Lucene 3.x. (Robert Muir)
* LUCENE-3441: The facet module now supports NRT. As a result, the following
changes were made:
- DirectoryTaxonomyReader has a new constructor which takes a
DirectoryTaxonomyWriter. You should use that constructor in order to get
the NRT support (or the old one for non-NRT).
- TaxonomyReader.refresh() removed in exchange for TaxonomyReader.openIfChanged
static method. Similar to DirectoryReader, the method either returns null
if no changes were made to the taxonomy, or a new TR instance otherwise.
Instead of calling refresh(), you should write similar code to how you reopen
a regular DirectoryReader.
- TaxonomyReader.openIfChanged (previously refresh()) no longer throws
InconsistentTaxonomyException, and supports recreate. InconsistentTaxoEx
was removed.
- ChildrenArrays was pulled out of TaxonomyReader into a top-level class.
- TaxonomyReader was made an abstract class (instead of an interface), with
methods such as close() and reference counting management pulled from
DirectoryTaxonomyReader, and made final. The rest of the methods remained
abstract.
(Shai Erera, Gilad Barkai)
* LUCENE-4576: Remove CachingWrapperFilter(Filter, boolean). This recacheDeletes
option gave less than 1% speedup at the expense of cache churn (filters were
invalidated on reopen if even a single delete was posted against the segment).
(Robert Muir)
* LUCENE-4575: Replace IndexWriter's commit/prepareCommit versions that take
commitData with setCommitData(). That allows committing changes to IndexWriter
even if the commitData is the only thing that changes.
(Shai Erera, Michael McCandless)
* LUCENE-4565: TaxonomyReader.getParentArray and .getChildrenArrays consolidated
into one getParallelTaxonomyArrays(). You can obtain the 3 arrays that the
previous two methods returned by calling parents(), children() or siblings()
on the returned ParallelTaxonomyArrays. (Shai Erera)
* LUCENE-4585: Spatial PrefixTree based Strategies (either TermQuery or
RecursivePrefix based) MAY want to re-index if used for point data. If a
re-index is not done, then an indexed point is ~1/2 the smallest grid cell
larger and as such is slightly more likely to match a query shape.
(David Smiley)
* LUCENE-4604: DefaultOrdinalPolicy removed in favor of OrdinalPolicy.ALL_PARENTS.
Same for DefaultPathPolicy (now PathPolicy.ALL_CATEGORIES). In addition, you
can use OrdinalPolicy.NO_PARENTS to never write any parent category ordinal
to the fulltree posting payload (but note that you need a special
FacetsAccumulator - see javadocs). (Shai Erera)
* LUCENE-4594: Spatial PrefixTreeStrategy no longer indexes center points of
non-point shapes. If you want to call makeDistanceValueSource() based on
shape centers, you need to do this yourself in another spatial field.
(David Smiley)
* LUCENE-4615: Replace IntArrayAllocator and FloatArrayAllocator by ArraysPool.
FacetArrays no longer takes those allocators; if you need to reuse the arrays,
you should use ReusingFacetArrays. (Shai Erera, Gilad Barkai)
* LUCENE-4621: FacetIndexingParams is now a concrete class (instead of DefaultFIP).
Also, the entire IndexingParams chain is now immutable. If you need to override
a setting, you should extend the relevant class.
Additionally, FacetSearchParams is now immutable, and requires all FacetRequests
to be specified at initialization time. (Shai Erera)
* LUCENE-4647: CategoryDocumentBuilder and EnhancementsDocumentBuilder are replaced
by FacetFields and AssociationsFacetFields respectively. CategoryEnhancement and
AssociationEnhancement were removed in favor of a simplified CategoryAssociation
interface, with CategoryIntAssociation and CategoryFloatAssociation
implementations.
NOTE: indexes that contain category enhancements/associations are not supported
by the new code and should be recreated. (Shai Erera)
* LUCENE-4659: Massive cleanup to CategoryPath API. Additionally, CategoryPath is
now immutable, so you don't need to clone() it. (Shai Erera)
* LUCENE-4670: StoredFieldsWriter and TermVectorsWriter have new finish* callbacks
which are called after a doc/field/term has been completely added.
(Adrien Grand, Robert Muir)
* LUCENE-4620: IntEncoder/Decoder were changed to do bulk encoding/decoding. As a
result, a few other classes such as Aggregator and CategoryListIterator were
changed to handle bulk category ordinals. (Shai Erera)
* LUCENE-4683: CategoryListIterator and Aggregator are now per-segment. As such
their implementations no longer take a top-level IndexReader in the constructor
but rather implement a setNextReader. (Shai Erera)
New Features
* LUCENE-4226: New experimental StoredFieldsFormat that compresses chunks of
documents together in order to improve the compression ratio. (Adrien Grand)
* LUCENE-4426: New ValueSource implementations (in lucene/queries) for
DocValues fields. (Adrien Grand)
* LUCENE-4410: FilteredQuery now exposes a FilterStrategy that exposes
how filters are applied during query execution. (Simon Willnauer)
* LUCENE-4404: New ListOfOutputs (in lucene/misc) for FSTs wraps
another Outputs implementation, allowing you to store more than one
output for a single input. UpToTwoPositiveIntsOutputs was moved
from lucene/core to lucene/misc. (Mike McCandless)
* LUCENE-3842: New AnalyzingSuggester, for doing auto-suggest
using an analyzer. This can create powerful suggesters: if the analyzer
removes stop words then "ghost chr..." could suggest "The Ghost of
Christmas Past"; if SynonymFilter is used to map wifi and wireless
network to hotspot, then "wirele..." could suggest "wifi router";
token normalization like stemmers, accent removal, etc. would allow
the suggester to ignore such variations. (Robert Muir, Sudarshan
Gaikaiwari, Mike McCandless)
* LUCENE-4446: Lucene 4.1 has a new default index format (Lucene41Codec)
that incorporates the previously experimental "Block" postings format
for better search performance.
(Han Jiang, Adrien Grand, Robert Muir, Mike McCandless)
* LUCENE-3846: New FuzzySuggester, like AnalyzingSuggester except it
also finds completions allowing for fuzzy edits in the input string.
(Robert Muir, Simon Willnauer, Mike McCandless)
* LUCENE-4515: MemoryIndex now supports adding the same field multiple
times. (Simon Willnauer)
* LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter
(hossman, Robert Muir)
* LUCENE-4566: Add NRT/SearcherManager.RefreshListener/addListener to
be notified whenever a new searcher was opened. (selckin via Shai
Erera, Mike McCandless)
* SOLR-4123: Add per-script customizability to ICUTokenizerFactory via
rule files in the ICU RuleBasedBreakIterator format.
(Shawn Heisey, Robert Muir, Steve Rowe)
* LUCENE-4590: Added WriteEnwikiLineDocTask - a benchmark task for writing
Wikipedia category pages and non-category pages into separate line files.
extractWikipedia.alg was changed to use this task, so now it creates two
files. (Doron Cohen)
* LUCENE-4290: Added PostingsHighlighter to the highlighter module. It uses
offsets from the postings lists to highlight documents. (Robert Muir)
* LUCENE-4628: Added CommonTermsQuery that executes high-frequency terms
in an optional sub-query to prevent slow queries due to "common" terms
like stopwords. (Simon Willnauer)
API Changes
* LUCENE-4399: Deprecated AppendingCodec. Lucene's term dictionaries
no longer seek when writing. (Adrien Grand, Robert Muir)
* LUCENE-4479: Rename TokenStream.getTokenStream(IndexReader, int, String)
to TokenStream.getTokenStreamWithOffsets, and return null on failure
rather than throwing IllegalArgumentException. (Alan Woodward)
* LUCENE-4472: MergePolicy now accepts a MergeTrigger that provides
information about the trigger of the merge, i.e. whether it was triggered by
a segment merge, a full flush, etc. (Simon Willnauer)
* LUCENE-4415: TermsFilter is now immutable. All terms need to be provided
as constructor arguments. (Simon Willnauer)
* LUCENE-4520: ValueSource.getSortField no longer throws IOExceptions
(Alan Woodward)
* LUCENE-4537: RateLimiter is now separated from FSDirectory and exposed via
RateLimitingDirectoryWrapper. Any Directory can now be rate-limited.
(Simon Willnauer)
* LUCENE-4591: CompressingStoredFields{Writer,Reader} now accept a segment
suffix as a constructor parameter. (Renaud Delbru via Adrien Grand)
* LUCENE-4605: Added DocsEnum.FLAG_NONE which can be passed instead of 0 as
the flag to .docs() and .docsAndPositions(). (Shai Erera)
* LUCENE-4617: Remove FST.pack() method. Previously to make a packed FST,
you had to make a Builder with willPackFST=true (telling it you will later pack it),
create your fst with finish(), and then call pack() to get another FST.
Instead just pass true for doPackFST to Builder and finish() returns a packed FST.
(Robert Muir)
* LUCENE-4663: Deprecate IndexSearcher.document(int, Set). This was not intended
to be final, nor named document(). Use IndexSearcher.doc(int, Set) instead.
(Robert Muir)
* LUCENE-4684: Made DirectSpellChecker extendable.
(Martijn van Groningen)
Bug Fixes
* LUCENE-1822: BaseFragListBuilder hard-coded 6 char margin is too naive.
(Alex Vigdor, Arcadius Ahouansou, Koji Sekiguchi)
* LUCENE-4468: Fix rareish integer overflows in Lucene41 postings
format. (Robert Muir)
* LUCENE-4486: Add support for ConstantScoreQuery in Highlighter.
(Simon Willnauer)
* LUCENE-4485: When CheckIndex reports terms, terms/docs pairs and tokens,
these counts now all exclude deleted documents. (Mike McCandless)
* LUCENE-4479: Highlighter works correctly for fields with term vector
positions, but no offsets. (Alan Woodward)
* SOLR-3906: JapaneseReadingFormFilter in romaji mode will return
romaji even for out-of-vocabulary kana cases (e.g. half-width forms).
(Robert Muir)
* LUCENE-4511: TermsFilter might return wrong results if a field is not
indexed or doesn't exist in the index. (Simon Willnauer)
* LUCENE-4521: IndexWriter.tryDeleteDocument could return true
(successfully deleting the document) but then on IndexWriter
close/commit fail to write the new deletions, if no other changes
happened in the IndexWriter instance. (Ivan Vasilev via Mike
McCandless)
* LUCENE-4513: Fixed that deleted nested docs are scored into the
parent doc when using ToParentBlockJoinQuery. (Martijn van Groningen)
* LUCENE-4534: Fixed WFSTCompletionLookup and Analyzing/FuzzySuggester
to allow 0 byte values in the lookup keys. (Mike McCandless)
* LUCENE-4532: DirectoryTaxonomyWriter used a timestamp to denote taxonomy
index re-creation, which could cause a bug in case machine clocks were
not synced. Instead, it now tracks an 'epoch' version, which is incremented
whenever the taxonomy is re-created, or replaced. (Shai Erera)
* LUCENE-4544: Fixed off-by-1 in ConcurrentMergeScheduler that would
allow 1+maxMergeCount merge threads to be created, instead of just
maxMergeCount (Radim Kolar, Mike McCandless)
* LUCENE-4567: Fixed NullPointerException in analyzing, fuzzy, and
WFST suggesters when no suggestions were added (selckin via Mike
McCandless)
* LUCENE-4568: Fixed integer overflow in
PagedBytes.PagedBytesData{In,Out}put.getPosition. (Adrien Grand)
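The LUCENE-4568 entry does not say which expression overflowed. As a general
illustration of this class of bug, the sketch below computes a position from a
block index, a block size and an offset, first in 32-bit int arithmetic and then
in 64-bit long arithmetic. The class, method and parameter names are
hypothetical and are not taken from PagedBytes.

```java
// Hypothetical sketch of an int overflow in a position computation;
// not the actual PagedBytes source.
final class PositionOverflowSketch {

    // Buggy variant: the multiplication happens in int and can wrap
    // before the result is widened to long.
    static long positionInt(int blockIndex, int blockSize, int offsetInBlock) {
        return blockIndex * blockSize + offsetInBlock;
    }

    // Fixed variant: widen to long before multiplying.
    static long positionLong(int blockIndex, int blockSize, int offsetInBlock) {
        return (long) blockIndex * blockSize + offsetInBlock;
    }

    public static void main(String[] args) {
        int blockSize = 1 << 15;   // assumed block size, for illustration only
        int blockIndex = 70000;    // large enough that the product exceeds Integer.MAX_VALUE
        System.out.println("int arithmetic:  " + positionInt(blockIndex, blockSize, 10));
        System.out.println("long arithmetic: " + positionLong(blockIndex, blockSize, 10));
    }
}
```

With these inputs the int variant yields a negative position, while the long
variant returns the expected value above 2^31.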
* LUCENE-4581: GroupingSearch.setAllGroups(true) was failing to
actually compute allMatchingGroups (dizh@neusoft.com via Mike
McCandless)
* LUCENE-4009: Improve TermsFilter.toString (Tim Costermans via Chris
Male, Mike McCandless)
* LUCENE-4588: Benchmark's EnwikiContentSource was discarding last wiki
document and had leaking threads in 'forever' mode. (Doron Cohen)
* LUCENE-4585: Spatial RecursivePrefixTreeFilter had some bugs that only
occurred when shapes were indexed. In what appears to be rare circumstances,
documents with shapes near a query shape were erroneously considered a match.
In addition, it wasn't possible to index a shape representing the entire
globe.
* LUCENE-4595: EnwikiContentSource had a thread safety problem (NPE) in
'forever' mode (Doron Cohen)
* LUCENE-4587: fix WordBreakSpellChecker to not throw AIOOBE when presented
with 2-char codepoints, and to correctly break/combine terms containing
non-latin characters. (James Dyer, Andreas Hubold)
* LUCENE-4596: fix a concurrency bug in DirectoryTaxonomyWriter.
(Shai Erera)
* LUCENE-4594: Spatial PrefixTreeStrategy would index center-points in addition
to the shape to index if it was non-point, in the same field. But sometimes
the center-point isn't actually in the shape (consider a LineString), and for
highly precise shapes it could cause makeDistanceValueSource's cache to load
parts of the shape's boundary erroneously too. So center points aren't
indexed any more; you should use another spatial field. (David Smiley)
* LUCENE-4629: IndexWriter failed to delete documents if a document block was
indexed and the Iterator threw an exception. Documents were only rolled back
if the actual indexing process failed. (Simon Willnauer)
* LUCENE-4608: Handle large number of requested fragments better.
(Martijn van Groningen)
* LUCENE-4633: DirectoryTaxonomyWriter.replaceTaxonomy did not refresh its
internal reader, which could cause an existing category to be added twice.
(Shai Erera)
* LUCENE-4461: If you added the same FacetRequest more than once, you would get
inconsistent results. (Gilad Barkai via Shai Erera)
* LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams
that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute),
e.g., oal.analysis.miscellaneous.EmptyTokenStream.
(Uwe Schindler, Adrien Grand, Robert Muir)
* LUCENE-4660: ConcurrentMergeScheduler was taking too long to
un-pause incoming threads it had paused when too many merges were
queued up. (Mike McCandless)
* LUCENE-4662: Add missing elided articles and prepositions to FrenchAnalyzer's
DEFAULT_ARTICLES list passed to ElisionFilter. (David Leunen via Steve Rowe)
* LUCENE-4671: Fix CharsRef.subSequence method. (Tim Smith via Robert Muir)
* LUCENE-4465: Let ConstantScoreQuery's Scorer return its child scorer.
(selckin via Uwe Schindler)
Changes in Runtime Behavior
* LUCENE-4586: Change default ResultMode of FacetRequest to PER_NODE_IN_TREE.
This only affects requests with depth>1. If you execute such requests and
rely on the facet results being returned flat (i.e. no hierarchy), you should
set the ResultMode to GLOBAL_FLAT. (Shai Erera, Gilad Barkai)
* LUCENE-1822: Improves the text window selection by recalculating the starting margin
once all phrases in the fragment have been identified in FastVectorHighlighter. This
way if a single word is matched in a fragment, it will appear in the middle of the highlight,
instead of 6 characters from the beginning. This way one can also guarantee that
the entirety of short texts is represented in a fragment by specifying a large
enough fragCharSize.
Optimizations
* LUCENE-2221: oal.util.BitUtil was modified to use Long.bitCount and
Long.numberOfTrailingZeros (which are intrinsics since Java 6u18) instead of
pure java bit twiddling routines in order to improve performance on modern
JVMs/hardware. (Dawid Weiss, Adrien Grand)
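As a rough illustration of LUCENE-2221, the sketch below places a hand-written
bit-twiddling population count and trailing-zero count next to the JDK methods
the entry names. The class and helper names (BitCountSketch, twiddlePopCount,
twiddleNtz) are hypothetical and are not BitUtil's code; only Long.bitCount and
Long.numberOfTrailingZeros are real JDK methods.

```java
// Hypothetical sketch; not the actual oal.util.BitUtil source.
final class BitCountSketch {

    // Classic pure-Java population count (SWAR style), shown for comparison only.
    static int twiddlePopCount(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;
        return (int) ((x * 0x0101010101010101L) >>> 56);
    }

    // Naive trailing-zero count; returns 64 for zero, like the JDK method.
    static int twiddleNtz(long x) {
        if (x == 0) {
            return 64;
        }
        int n = 0;
        while ((x & 1L) == 0) {
            x >>>= 1;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0x8000000000000000L, 0xF0F0L, -1L};
        for (long v : samples) {
            // Both approaches agree; the JDK methods may be compiled to
            // single hardware instructions by the JIT.
            System.out.println(Long.toHexString(v)
                + " bitCount=" + Long.bitCount(v) + "/" + twiddlePopCount(v)
                + " ntz=" + Long.numberOfTrailingZeros(v) + "/" + twiddleNtz(v));
        }
    }
}
```

The two approaches return the same values; the difference the entry refers to
is performance on JVMs that treat the JDK methods as intrinsics.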
* LUCENE-4509: Enable stored fields compression by default in the Lucene 4.1
default codec. (Adrien Grand)
* LUCENE-4536: PackedInts on-disk format is now byte-aligned (it used to be
long-aligned), saving up to 7 bytes per array of values.
(Adrien Grand, Mike McCandless)
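The "up to 7 bytes" figure in LUCENE-4536 follows from the padding arithmetic.
The sketch below is a simplified model that ignores any header and only counts
the packed payload; the class and method names are hypothetical and are not the
PackedInts API.

```java
// Hypothetical size model for packed values; not the PackedInts source.
final class PackedAlignmentSketch {

    // Payload bytes when the packed bits are padded to whole 64-bit longs.
    static long longAlignedBytes(long valueCount, int bitsPerValue) {
        long bits = valueCount * bitsPerValue;
        return 8 * ((bits + 63) / 64);
    }

    // Payload bytes when the packed bits are padded to whole bytes.
    static long byteAlignedBytes(long valueCount, int bitsPerValue) {
        long bits = valueCount * bitsPerValue;
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        long[][] cases = {{1, 1}, {10, 3}, {64, 1}, {100, 5}};
        for (long[] c : cases) {
            long aligned = longAlignedBytes(c[0], (int) c[1]);
            long packed = byteAlignedBytes(c[0], (int) c[1]);
            System.out.println(c[0] + " values x " + c[1] + " bits: long-aligned="
                + aligned + " byte-aligned=" + packed + " saved=" + (aligned - packed));
        }
    }
}
```

Under this model the saving per array ranges from 0 bytes (the bit count is a
multiple of 64) to 7 bytes (a single 1-bit value, for example).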
* LUCENE-4512: Additional memory savings for CompressingStoredFieldsFormat.
(Adrien Grand, Robert Muir)
* LUCENE-4443: Lucene41PostingsFormat no longer writes unnecessary offsets
into the skipdata. (Robert Muir)
* LUCENE-4459: Improve WeakIdentityMap.keyIterator() to remove GCed keys
from backing map early instead of waiting for reap(). This makes test
failures in TestWeakIdentityMap disappear, too.
(Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-4473: Lucene41PostingsFormat encodes offsets more efficiently
for low frequency terms (< 128 occurrences). (Robert Muir)
* LUCENE-4462: DocumentsWriter now flushes deletes, segment infos and builds
CFS files if necessary during segment flush and not during publishing. The latter
was a single threaded process while now all IO and CPU heavy computation is done
concurrently in DocumentsWriterPerThread. (Simon Willnauer)
* LUCENE-4496: Optimize Lucene41PostingsFormat when requesting a subset of
the postings data (via flags to TermsEnum.docs/docsAndPositions) to use
ForUtil.skipBlock. (Robert Muir)
* LUCENE-4497: Don't write PosVIntCount to the positions file in
Lucene41PostingsFormat, as it's always totalTermFreq % BLOCK_SIZE. (Robert Muir)
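The reasoning behind LUCENE-4497 can be sketched as follows, assuming that
positions are written as full packed blocks of BLOCK_SIZE entries with the
remaining tail written as vInts, and assuming BLOCK_SIZE is 128 (the block size
mentioned for the block postings format in LUCENE-3892). The class and method
names are hypothetical.

```java
// Hypothetical sketch; not the Lucene41PostingsFormat source.
final class PosVIntCountSketch {

    // Assumed block size, taken from the "size 128" blocks described in LUCENE-3892.
    static final int BLOCK_SIZE = 128;

    // Number of positions that fill complete packed blocks' worth of blocks.
    static long fullBlocks(long totalTermFreq) {
        return totalTermFreq / BLOCK_SIZE;
    }

    // Number of positions left over for the vInt-encoded tail.
    // Since it is derivable from totalTermFreq, it need not be stored.
    static int posVIntCount(long totalTermFreq) {
        return (int) (totalTermFreq % BLOCK_SIZE);
    }

    public static void main(String[] args) {
        long[] freqs = {5, 128, 300};
        for (long ttf : freqs) {
            System.out.println("totalTermFreq=" + ttf
                + " fullBlocks=" + fullBlocks(ttf)
                + " posVIntCount=" + posVIntCount(ttf));
        }
    }
}
```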
* LUCENE-4498: In Lucene41PostingsFormat, when a term appears in only one document,
instead of writing a file pointer to a VIntBlock containing the doc id, just
write the doc id. (Mike McCandless, Robert Muir)
* LUCENE-4515: MemoryIndex now uses Byte/IntBlockPool internally to hold terms and
posting lists. All index data is represented as consecutive byte/int arrays to
reduce GC cost and memory overhead. (Simon Willnauer)
* LUCENE-4538: DocValues now caches direct sources in a ThreadLocal exposed via SourceCache.
Users of this API can now simply obtain an instance via DocValues#getDirectSource per thread.
(Simon Willnauer)
* LUCENE-4580: DrillDown.query variants return a ConstantScoreQuery with boost set to 0.0f
so that document scores are not affected by running a drill-down query. (Shai Erera)
* LUCENE-4598: PayloadIterator no longer uses top-level IndexReader to iterate on the
posting's payload. (Shai Erera, Michael McCandless)
* LUCENE-4661: Drop default maxThreadCount to 1 and maxMergeCount to 2
in ConcurrentMergeScheduler, for faster merge performance on
spinning-magnet drives (Mike McCandless)
Documentation
* LUCENE-4483: Refer to BytesRef.deepCopyOf in Term's constructor that takes BytesRef.
(Paul Elschot via Robert Muir)
Build
* LUCENE-4650: Upgrade randomized testing to version 2.0.8: make the
test framework more robust under low memory conditions. (Dawid Weiss)
* LUCENE-4603: Upgrade randomized testing to version 2.0.5: print forked
JVM PIDs on heartbeat from hung tests (Dawid Weiss)
* Upgrade randomized testing to version 2.0.4: avoid hangs on shutdown
hooks hanging forever by calling Runtime.halt() in addition to
Runtime.exit() after a short delay to allow graceful shutdown (Dawid Weiss)
* LUCENE-4451: Memory leak per unique thread caused by
RandomizedContext.contexts static map. Upgrade randomized testing
to version 2.0.2 (Mike McCandless, Dawid Weiss)
* LUCENE-4589: Upgraded benchmark module's Nekohtml dependency to version
1.9.17, removing the workaround in Lucene's HTML parser for the
Turkish locale. (Uwe Schindler)
* LUCENE-4601: Fix ivy availability check to use typefound, so it works
if called from another build file. (Ryan Ernst via Robert Muir)
======================= Lucene 4.0.0 =======================
Changes in backwards compatibility policy
* LUCENE-4392: Class org.apache.lucene.util.SortedVIntList has been removed.
(Adrien Grand)
* LUCENE-4393: RollingCharBuffer has been moved to the o.a.l.analysis.util
package of lucene-analysis-common. (Adrien Grand)
New Features
* LUCENE-1888: Added the option to store payloads in the term
vectors (IndexableFieldType.storeTermVectorPayloads()). Note
that you must store term vector positions to store payloads.
(Robert Muir)
* LUCENE-3892: Add a new BlockPostingsFormat that bulk-encodes docs,
freqs and positions in large (size 128) packed-int blocks for faster
search performance. This was from Han Jiang's 2012 Google Summer of
Code project (Han Jiang, Adrien Grand, Robert Muir, Mike McCandless)
* LUCENE-4323: Added support for an absolute maximum CFS segment size
(in MiB) to LogMergePolicy and TieredMergePolicy.
(Alexey Lef via Uwe Schindler)
* LUCENE-4339: Allow deletes against 3.x segments for easier upgrading.
Lucene3x Codec is still otherwise read-only; you should not set it
as the default Codec on IndexWriter, because it cannot write new segments.
(Mike McCandless, Robert Muir)
* SOLR-3441: ElisionFilterFactory is now MultiTermAware
(Jack Krupansky via hossman)
API Changes
* LUCENE-4391, LUCENE-4440: All methods of Lucene40Codec but
getPostingsFormatForField are now final. To reuse functionality
of Lucene40, you should extend FilterCodec and delegate to Lucene40
instead of extending Lucene40Codec. (Adrien Grand, Shai Erera,
Robert Muir, Uwe Schindler)
* LUCENE-4299: Added Terms.hasPositions() and Terms.hasOffsets().
Previously you had no real way to know that a term vector field
had positions or offsets, since this can be configured on a
per-field-per-document basis. (Robert Muir)
* Removed DocsAndPositionsEnum.hasPayload() and simplified the
contract of getPayload(). It returns null if there is no payload,
otherwise returns the current payload. You can now call it multiple
times per position if you want. (Robert Muir)
* Removed FieldsEnum. Fields API instead implements Iterable<String>
and exposes Iterator, so you can iterate over field names with
for (String field : fields) instead. (Robert Muir)
* LUCENE-4152: added IndexReader.leaves(), which lets you enumerate
the leaf atomic reader contexts for all readers in the tree.
(Uwe Schindler, Robert Muir)
* LUCENE-4304: removed PayloadProcessorProvider. If you want to change
payloads (or other things) when merging indexes, it's recommended
to just use a FilterAtomicReader + IndexWriter.addIndexes. See the
OrdinalMappingAtomicReader and TaxonomyMergeUtils in the facets
module if you want an example of this.
(Mike McCandless, Uwe Schindler, Shai Erera, Robert Muir)
* LUCENE-4304: Make CompositeReader.getSequentialSubReaders()
protected. To get atomic leaves of any IndexReader use the new method
leaves() (LUCENE-4152), which lists AtomicReaderContexts including
the doc base of each leaf. (Uwe Schindler, Robert Muir)
* LUCENE-4307: Renamed IndexReader.getTopReaderContext to
IndexReader.getContext. (Robert Muir)
* LUCENE-4316: Deprecate Fields.getUniqueTermCount and remove it from
AtomicReader. If you really want the unique term count across all
fields, just sum up Terms.size() across those fields. This method
only exists so that this statistic can be accessed for Lucene 3.x
segments, which don't support Terms.size(). (Uwe Schindler, Robert Muir)
* LUCENE-4321: Change CharFilter to extend Reader directly, as FilterReader
overdelegates (read(), read(char[], int, int), skip, etc). This made it
hard to implement CharFilters that were correct. Instead only close() is
delegated by default: read(char[], int, int) and correct(int) are abstract
so that it's obvious which methods you should implement. The protected
inner Reader is 'input' like CharFilter in the 3.x series, instead of 'in'.
(Dawid Weiss, Uwe Schindler, Robert Muir)
* LUCENE-3309: The expert FieldSelector API, used to load only certain
fields in a stored document, has been replaced with the simpler
StoredFieldVisitor API. (Mike McCandless)
* LUCENE-4343: Made Tokenizer.setReader final. This is a setter that should
not be overridden by subclasses: per-stream initialization should happen
in reset(). (Robert Muir)
* LUCENE-4377: Remove IndexInput.copyBytes(IndexOutput, long).
Use DataOutput.copyBytes(DataInput, long) instead.
(Mike McCandless, Robert Muir)
* LUCENE-4355: Simplify AtomicReader's sugar methods such as termDocsEnum,
termPositionsEnum, docFreq, and totalTermFreq to only take Term as a
parameter. If you want to do expert things such as pass a different
Bits as liveDocs, then use the flex apis (fields(), terms(), etc) directly.
(Mike McCandless, Robert Muir)
* LUCENE-4425: clarify documentation of StoredFieldVisitor.binaryValue
and simplify the api to binaryField(FieldInfo, byte[]).
(Adrien Grand, Robert Muir)
Bug Fixes
* LUCENE-4423: DocumentStoredFieldVisitor.binaryField ignored offset and
length. (Adrien Grand)
* LUCENE-4297: BooleanScorer2 would multiply the coord() factor
twice for conjunctions: for most users this is no problem, but
if you had a customized Similarity that returned something other
than 1 when overlap == maxOverlap (always the case for conjunctions),
then the score would be incorrect. (Pascal Chollet, Robert Muir)
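The effect described in LUCENE-4297 can be shown with plain arithmetic. The
sketch below uses two stand-in coord functions: one that returns 1 when
overlap == maxOverlap, and a hypothetical custom one that does not. None of
these names come from Lucene's Similarity or BooleanScorer2 classes.

```java
// Hypothetical sketch of applying a coord factor once versus twice;
// not the BooleanScorer2 source.
final class CoordSketch {

    // Default-like coord: equals 1 whenever overlap == maxOverlap.
    static float defaultCoord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    // Hypothetical custom coord that is not 1 when overlap == maxOverlap.
    static float customCoord(int overlap, int maxOverlap) {
        return (float) overlap / (maxOverlap + 1);
    }

    // Intended behavior: the factor is applied once.
    static float applyOnce(float rawScore, float coord) {
        return rawScore * coord;
    }

    // Buggy behavior: the factor is applied twice.
    static float applyTwice(float rawScore, float coord) {
        return rawScore * coord * coord;
    }

    public static void main(String[] args) {
        float raw = 3.0f;
        float d = defaultCoord(2, 2); // conjunction: overlap == maxOverlap
        float c = customCoord(2, 2);
        System.out.println("default: once=" + applyOnce(raw, d) + " twice=" + applyTwice(raw, d));
        System.out.println("custom:  once=" + applyOnce(raw, c) + " twice=" + applyTwice(raw, c));
    }
}
```

With a factor of 1 the double multiplication is invisible, which is why most
users were unaffected; with any other factor the score is scaled once too often.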
* LUCENE-4298: MultiFields.getTermDocsEnum(IndexReader, Bits, String, BytesRef)
did not work at all: it would infinitely recurse.
(Alberto Paro via Robert Muir)
* LUCENE-4300: BooleanQuery's rewrite was not always safe: if you
had a custom Similarity where coord(1,1) != 1F, then the rewritten
query would be scored differently. (Robert Muir)
* Don't allow negatives in the positions file. If you have an index
from 2.4.0 or earlier with such negative positions, and you already
upgraded to 3.x, then to Lucene 4.0-ALPHA or -BETA, you should run
CheckIndex. If it fails, then you need to upgrade again to 4.0 (Robert Muir)
* LUCENE-4303: PhoneticFilterFactory and SnowballPorterFilterFactory load their
encoders / stemmers via the ResourceLoader now instead of Class.forName().
Solr users should no longer have to embed these in their war. (David Smiley)
* SOLR-3737: StempelPolishStemFilterFactory loaded its stemmer table incorrectly.
Also, ensure immutability and use only one instance of this table in RAM (lazy
loaded) since it's quite large. (sausarkar, Steven Rowe, Robert Muir)
* LUCENE-4310: MappingCharFilter was failing to match input strings
containing non-BMP Unicode characters. (Dawid Weiss, Robert Muir,
Mike McCandless)
* LUCENE-4224: Add in-order scorer to query time joining and the
out-of-order scorer throws an UOE. (Martijn van Groningen, Robert Muir)
* LUCENE-4333: Fixed NPE in TermGroupFacetCollector when faceting on mv fields.
(Jesse MacVicar, Martijn van Groningen)
* LUCENE-4218: Document.get(String) and Field.stringValue() again return
values for numeric fields, like Lucene 3.x and consistent with the documentation.
(Jamie, Uwe Schindler, Robert Muir)
* NRTCachingDirectory was always caching a newly flushed segment in
RAM, instead of checking the estimated size of the segment
to decide whether to cache it. (Mike McCandless)
* LUCENE-3720: fix memory-consumption issues with BeiderMorseFilter.
(Thomas Neidhart via Robert Muir)
* LUCENE-4401: Fix bug where DisjunctionSumScorer would sometimes call score()
on a subscorer that had already returned NO_MORE_DOCS. (Liu Chao, Robert Muir)
* LUCENE-4411: when sampling is enabled for a FacetRequest, its depth
parameter is reset to the default (1), even if set otherwise.
(Gilad Barkai via Shai Erera)
* LUCENE-4455: Fix bug in SegmentInfoPerCommit.sizeInBytes() that was
returning 2X the true size, inefficiently. Also fixed bug in
CheckIndex that would report no deletions when a segment has
deletions, and vice versa. (Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-4456: Fixed double-counting sizeInBytes for a segment
(affects how merge policies pick merges); fixed CheckIndex's
incorrect reporting of whether a segment has deletions; fixed case
where on abort Lucene could remove files it didn't create; fixed
many cases where IndexWriter could leave leftover files (on
exception in various places, on reuse of a segment name after crash
and recovery). (Uwe Schindler, Robert Muir, Mike McCandless)
Optimizations
* LUCENE-4322: Decrease lucene-core JAR size. The core JAR size had increased a
lot because of generated code introduced in LUCENE-4161 and LUCENE-3892.
(Adrien Grand)
* LUCENE-4317: Improve reuse of internal TokenStreams and StringReader
in oal.document.Field. (Uwe Schindler, Chris Male, Robert Muir)
* LUCENE-4327: Support out-of-order scoring in FilteredQuery for higher
performance. (Mike McCandless, Robert Muir)
* LUCENE-4364: Optimize MMapDirectory to not make a mapping per-cfs-slice,
instead one map per .cfs file. This reduces the total number of maps.
Additionally factor out a (package-private) generic
ByteBufferIndexInput from MMapDirectory. (Uwe Schindler, Robert Muir)
Build
* LUCENE-4406, LUCENE-4407: Upgrade to randomizedtesting 2.0.1.
Workaround for broken test output XMLs due to non-XML text unicode
chars in strings. Added printing of failed tests at the end of a
test run (Dawid Weiss)
* LUCENE-4252: Detect/Fail tests when they leak RAM in static fields
(Robert Muir, Dawid Weiss)
* LUCENE-4360: Support running the same test suite multiple times in
parallel (Dawid Weiss)
* LUCENE-3985: Upgrade to randomizedtesting 2.0.0. Added support for
thread leak detection. Added support for suite timeouts. (Dawid Weiss)
* LUCENE-4354: Corrected maven dependencies to be consistent with
the licenses/ folder and the binary release. Some had different
versions or additional unnecessary dependencies. (selckin via Robert Muir)
* LUCENE-4340: Move all non-default codec, postings format and terms
dictionary implementations to lucene/codecs. (Adrien Grand)
Documentation
* LUCENE-4302: Fix facet userguide to have HTML loose doctype like
all other javadocs. (Karl Nicholas via Uwe Schindler)
======================= Lucene 4.0.0-BETA =======================
New Features
* LUCENE-4249: Changed the explanation of the PayloadTermWeight to use the
underlying PayloadFunction's explanation as the explanation
for the payload score. (Scott Smerchek via Robert Muir)
* LUCENE-4069: Added BloomFilteringPostingsFormat for use with low-frequency terms
such as primary keys (Mark Harwood, Mike McCandless)
* LUCENE-4201: Added JapaneseIterationMarkCharFilter to normalize Japanese
iteration marks. (Robert Muir, Christian Moen)
* LUCENE-3832: Added BasicAutomata.makeStringUnion method to efficiently
create automata from a fixed collection of UTF-8 encoded BytesRef
(Dawid Weiss, Robert Muir)
* LUCENE-4153: Added option to fast vector highlighting via BaseFragmentsBuilder to
respect field boundaries in the case of highlighting for multivalued fields.
(Martijn van Groningen)
* LUCENE-4227: Added DirectPostingsFormat, to hold all postings in
memory as uncompressed simple arrays. This uses a tremendous amount
of RAM but gives good search performance gains. (Mike McCandless)
* LUCENE-2510, LUCENE-4044: Migrated Solr's Tokenizer-, TokenFilter-, and
CharFilterFactories to the lucene-analysis module. The API is still
experimental. (Chris Male, Robert Muir, Uwe Schindler)
* LUCENE-4230: When pulling a DocsAndPositionsEnum you can now
specify whether or not you require payloads (in addition to
offsets); turning one or both off may allow some codec
implementations to optimize the enum implementation. (Robert Muir,
Mike McCandless)
* LUCENE-4203: Add IndexWriter.tryDeleteDocument(AtomicReader reader,
int docID), to attempt deletion by docID as long as the provided
reader is an NRT reader, and the segment has not yet been merged
away (Mike McCandless).
* LUCENE-4286: Added option to CJKBigramFilter to always also output
unigrams. This can be used for a unigram+bigram approach, or at
index-time only for better support of short queries.
(Tom Burton-West, Robert Muir)
API Changes
* LUCENE-4138: update of morfologik (Polish morphological analyzer) to 1.5.3.
The tag attribute class has been renamed to MorphosyntacticTagsAttribute and
has a different API (carries a list of tags instead of a compound tag). Upgrade
of embedded morfologik dictionaries to version 1.9. (Dawid Weiss)
* LUCENE-4178: set 'tokenized' to true on FieldType by default, so that if you
make a custom FieldType and set indexed = true, it's analyzed by the analyzer.
(Robert Muir)
* LUCENE-4220: Removed the buggy JavaCC-based HTML parser in the benchmark
module and replaced by NekoHTML. HTMLParser interface was cleaned up while
changing method signatures. (Uwe Schindler, Robert Muir)
* LUCENE-2191: Rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader).
The purpose of this method was always to set a new Reader on the Tokenizer,
reusing the object. But the name was often confused with TokenStream.reset().
(Robert Muir)
* LUCENE-4228: Refactored CharFilter to extend java.io.FilterReader. CharFilters
filter another reader and you override correct() for offset correction.
(Robert Muir)
* LUCENE-4240: Analyzer api now just takes fieldName for getOffsetGap. If the
field is not analyzed (e.g. StringField), then the analyzer is not invoked
at all. If you want to tweak things like positionIncrementGap and offsetGap,
analyze the field with KeywordTokenizer instead. (Grant Ingersoll, Robert Muir)
* LUCENE-4250: Pass fieldName to the PayloadFunction explain method, so it
parallels with docScore and the default implementation is correct.
(Robert Muir)
* LUCENE-3747: Support Unicode 6.1.0. (Steve Rowe)
* LUCENE-3884: Moved ElisionFilter out of org.apache.lucene.analysis.fr
package into org.apache.lucene.analysis.util. (Robert Muir)
* LUCENE-4230: When pulling a DocsAndPositionsEnum you now pass an int
flags instead of the previous boolean needOffsets. Currently
recognized flags are DocsAndPositionsEnum.FLAG_PAYLOADS and
DocsAndPositionsEnum.FLAG_OFFSETS (Robert Muir, Mike McCandless)
* LUCENE-4273: When pulling a DocsEnum, you can pass an int flags
instead of the previous boolean needsFreqs; consistent with the changes
for DocsAndPositionsEnum in LUCENE-4230. Currently the only flag
is DocsEnum.FLAG_FREQS. (Robert Muir, Mike McCandless)
* LUCENE-3616: TextField(String, Reader, Store) was reduced to TextField(String, Reader),
as the Store parameter didn't make sense: if you supplied Store.YES, you would only
receive an exception anyway. (Robert Muir)
Optimizations
* LUCENE-4171: Performance improvements to Packed64.
(Toke Eskildsen via Adrien Grand)
* LUCENE-4184: Performance improvements to the aligned packed bits impl.
(Toke Eskildsen, Adrien Grand)
* LUCENE-4235: Remove enforcing of Filter rewrite for NRQ queries.
(Uwe Schindler)
* LUCENE-4279: Regenerated snowball Stemmers from snowball r554,
making them substantially more lightweight. Behavior is unchanged.
(Robert Muir)
* LUCENE-4291: Reduced internal buffer size for Jflex-based tokenizers
such as StandardTokenizer from 32kb to 8kb.
(Raintung Li, Steven Rowe, Robert Muir)
Bug Fixes
* LUCENE-4109: BooleanQueries are not parsed correctly with the
flexible query parser. (Karsten Rauch via Robert Muir)
* LUCENE-4176: Fix AnalyzingQueryParser to analyze range endpoints as bytes,
so that it works correctly with Analyzers that produce binary non-UTF-8 terms
such as CollationAnalyzer. (Nattapong Sirilappanich via Robert Muir)
* LUCENE-4209: Fix FSTCompletionLookup to close its sorter, so that it won't
leave temp files behind in /tmp. Fix SortedTermFreqIteratorWrapper to not
leave temp files behind in /tmp on Windows. Fix Sort to not leave
temp files behind when /tmp is a separate volume. (Uwe Schindler, Robert Muir)
* LUCENE-4221: Fix overeager CheckIndex validation for term vector offsets.
(Robert Muir)
* LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the
size in bytes not MB (Chris Fuller via Mike McCandless)
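The unit mix-up fixed in LUCENE-4222 can be sketched as below, assuming the
value is stored internally in bytes and exposed in megabytes. The class, field
and the "Buggy" getter are hypothetical stand-ins, not the TieredMergePolicy
source.

```java
// Hypothetical sketch of a bytes-versus-MB getter bug;
// not the actual TieredMergePolicy source.
final class FloorSegmentSketch {

    // Assumption: the setting is stored internally in bytes.
    private long floorSegmentBytes;

    void setFloorSegmentMB(double mb) {
        floorSegmentBytes = (long) (mb * 1024 * 1024);
    }

    // Buggy variant: returns the raw byte count from a getter documented in MB.
    double getFloorSegmentMBBuggy() {
        return floorSegmentBytes;
    }

    // Fixed variant: converts bytes back to MB.
    double getFloorSegmentMB() {
        return floorSegmentBytes / (1024 * 1024.);
    }

    public static void main(String[] args) {
        FloorSegmentSketch policy = new FloorSegmentSketch();
        policy.setFloorSegmentMB(2.0);
        System.out.println("buggy getter: " + policy.getFloorSegmentMBBuggy());
        System.out.println("fixed getter: " + policy.getFloorSegmentMB());
    }
}
```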
* LUCENE-3505: Fix bug (Lucene 4.0alpha only) where boolean conjunctions
were sometimes scored incorrectly. Conjunctions of only termqueries where
at least one term omitted term frequencies (IndexOptions.DOCS_ONLY) would
be scored as if all terms omitted term frequencies. (Robert Muir)
* LUCENE-2686, LUCENE-3505: Fixed BooleanQuery scorers to return correct
freq(). Added support for scorer navigation API (Scorer.getChildren) to
all queries. Made Scorer.freq() abstract.
(Koji Sekiguchi, Mike McCandless, Robert Muir)
* LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest,
and the number of matching documents is too large. (Gilad Barkai via Shai Erera)
* LUCENE-4245: Make IndexWriter#close() and MergeScheduler#close()
non-interruptible. (Mark Miller, Uwe Schindler)
* LUCENE-4190: restrict allowed filenames that a codec may create to
the patterns recognized by IndexFileNames. This also fixes
IndexWriter to only delete files matching this pattern from an index
directory, to reduce risk when the wrong index path is accidentally
passed to IndexWriter (Robert Muir, Mike McCandless)
* LUCENE-4277: Fix IndexWriter deadlock during rollback if flushable DWPT
instances are already checked out and queued up but not yet flushed.
(Simon Willnauer)
* LUCENE-4282: Automaton FuzzyQuery didn't always deliver all results.
(Johannes Christen, Uwe Schindler, Robert Muir)
* LUCENE-4289: Fix minor idf inconsistencies/inefficiencies in highlighter.
(Robert Muir)
Changes in Runtime Behavior
* LUCENE-4109: Enable position increments in the flexible queryparser by default.
(Karsten Rauch via Robert Muir)
* LUCENE-3616: Field throws exception if you try to set a boost on an
unindexed field or one that omits norms. (Robert Muir)
Build
* LUCENE-4094: Support overriding file.encoding on forked test JVMs
(force via -Drandomized.file.encoding=XXX). (Dawid Weiss)
* LUCENE-4189: Test output should include timestamps (start/end for each
test/ suite). Added -Dtests.timestamps=[off by default]. (Dawid Weiss)
* LUCENE-4110: Report long periods of forked jvm inactivity (hung tests/ suites).
Added -Dtests.heartbeat=[seconds] with the default of 60 seconds.
(Dawid Weiss)
* LUCENE-4160: Added a property to quit the tests after a given
number of failures has occurred. This is useful in combination
with -Dtests.iters=N (you can start N iterations and wait for M
failures, in particular M = 1). -Dtests.maxfailures=M. Alternatively,
specify -Dtests.failfast=true to skip all tests after the first failure.
(Dawid Weiss)
* LUCENE-4115: JAR resolution/ cleanup should be done automatically for ant
clean/ eclipse/ resolve (Dawid Weiss)
* LUCENE-4199, LUCENE-4202, LUCENE-4206: Add a new target "check-forbidden-apis"
that parses all generated .class files for use of APIs that use default
charset, default locale, or default timezone and fails the build if violations
are found. This ensures that Lucene / Solr is independent of local configuration
options. (Uwe Schindler, Robert Muir, Dawid Weiss)
* LUCENE-4217: Add the possibility to run tests with Atlassian Clover
loaded from IVY. A development License solely for Apache code was added in
the tools/ folder, but is not included in releases. (Uwe Schindler)
Documentation
* LUCENE-4195: Added package documentation and examples for
org.apache.lucene.codecs (Alan Woodward via Robert Muir)
======================= Lucene 4.0.0-ALPHA =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene4.0
For "contrib" changes prior to 4.0, please see:
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_6_0/lucene/contrib/CHANGES.txt
Changes in backwards compatibility policy
* LUCENE-1458, LUCENE-2111, LUCENE-2354: Changes from flexible indexing:
- On upgrading to 4.0, if you do not fully reindex your documents,
Lucene will emulate the new flex API on top of the old index,
incurring some performance cost (up to ~10% slowdown, typically).
To prevent this slowdown, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
Mixed flex/pre-flex indexes are perfectly fine -- the two
emulation layers (flex API on pre-flex index, and pre-flex API on
flex index) will remap the access as required. So on upgrading to
4.0 you can start indexing new documents into an existing index.
To get optimal performance, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
- The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum)
have been removed in favor of the new flexible
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
DocsEnum, DocsAndPositionsEnum). One big difference is that field
and terms are now enumerated separately: a TermsEnum provides a
BytesRef (wraps a byte[]) per term within a single field, not a
Term. Another is that when asking for a Docs/AndPositionsEnum, you
now specify the skipDocs explicitly (typically this will be the
deleted docs, but in general you can provide any Bits).
- The term vectors APIs (TermFreqVector, TermPositionVector,
TermVectorMapper) have been removed in favor of the above
flexible indexing APIs, presenting a single-document inverted
index of the document from the term vectors.
- MultiReader ctor now throws IOException
- Directory.copy/Directory.copyTo now copies all files (not just
index files), since what is and isn't an index file is now
dependent on the codecs used.
- UnicodeUtil now uses BytesRef for UTF-8 output, and some method
signatures have changed to CharSequence. These are internal APIs
and subject to change suddenly.
- Positional queries (PhraseQuery, *SpanQuery) will now throw an
exception if you use them on a field that omits positions during
indexing (previously they silently returned no results).
- FieldCache.{Byte,Short,Int,Long,Float,Double}Parser's API has
changed -- each parse method now takes a BytesRef instead of a
String. If you have an existing Parser, a simple way to fix it is to
invoke BytesRef.utf8ToString, and pass that String to your
existing parser. This will work, but performance would be better
if you could fix your parser to instead operate directly on the
byte[] in the BytesRef.
- The internal (experimental) API of NumericUtils changed completely
from String to BytesRef. Client code should never use this class,
so the change would normally not affect you. If you used some of
the methods to inspect terms or create TermQueries out of
prefix encoded terms, change to use BytesRef. Please note:
Do not use TermQueries to search for single numeric terms.
The recommended way is to create a corresponding NumericRangeQuery
with upper and lower bound equal and included. TermQueries do not
score correctly, so the constant score mode of NRQ is the only
correct way to handle single value queries.
- NumericTokenStream now works directly on byte[] terms. If you
plug a TokenFilter on top of this stream, you will likely get
an IllegalArgumentException, because the NTS does not support
TermAttribute/CharTermAttribute. If you want to further filter
or attach Payloads to NTS, use the new NumericTermAttribute.
(Mike McCandless, Robert Muir, Uwe Schindler, Mark Miller, Michael Busch)
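The FieldCache parser note above distinguishes a quick fix (decode to a String, then reuse the old parser) from a faster one (work on the bytes directly). A minimal sketch of both, in plain Java with no Lucene dependency: the (bytes, offset, length) triple merely stands in for a BytesRef, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a (bytes, offset, length) slice stands in for a BytesRef.
final class IntParseSketch {

  // Quick fix: decode the UTF-8 slice to a String, then reuse a String-based parser.
  static int parseViaString(byte[] bytes, int offset, int length) {
    String s = new String(bytes, offset, length, StandardCharsets.UTF_8);
    return Integer.parseInt(s);
  }

  // Faster variant: read ASCII digits straight from the slice, with no String allocation.
  // Assumes an optional leading '-' followed by decimal digits that fit in an int.
  static int parseFromBytes(byte[] bytes, int offset, int length) {
    if (length == 0) {
      throw new NumberFormatException("empty term");
    }
    int i = offset;
    int end = offset + length;
    boolean negative = bytes[i] == '-';
    if (negative) {
      i++;
    }
    if (i == end) {
      throw new NumberFormatException("no digits");
    }
    int value = 0;
    for (; i < end; i++) {
      int digit = bytes[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException("not a digit");
      }
      value = value * 10 + digit;
    }
    return negative ? -value : value;
  }
}
```

Both methods return the same value for well-formed input; the second simply avoids the intermediate String.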
* LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract
AtomicReader, CompositeReader, and DirectoryReader. To open Directory-
based indexes use DirectoryReader.open(), the corresponding method in
IndexReader is now deprecated for easier migration. Only DirectoryReader
supports commits, versions, and reopening with openIfChanged(). Terms,
postings, docvalues, and norms can from now on only be retrieved using
AtomicReader; DirectoryReader and MultiReader extend CompositeReader,
only offering stored fields and access to the sub-readers (which may be
composite or atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be
used to emulate atomic readers on top of composites.
Please review MIGRATE.txt for information on how to migrate old code.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
not Unicode code units. For example, a wildcard "?" represents any Unicode
character. Furthermore, the rest of the automaton package and RegexpQuery use
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
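The difference between code points and UTF-16 code units only shows up for characters outside the Basic Multilingual Plane. A small sketch in plain Java (the class and method names are hypothetical, chosen for illustration):

```java
// Illustrative only: why matching per code point differs from matching per UTF-16 code unit.
final class CodePointSketch {

  // Number of UTF-16 code units (what String.length() reports).
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points (the unit a single "?" wildcard now matches).
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    // U+1D11E (musical symbol G clef) lies outside the BMP and needs a surrogate pair.
    String clef = new String(Character.toChars(0x1D11E));
    System.out.println(codeUnits(clef));  // 2
    System.out.println(codePoints(clef)); // 1
  }
}
```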
* LUCENE-2380: The String-based FieldCache methods (getStrings,
getStringIndex) have been replaced with BytesRef-based equivalents
(getTerms, getTermsIndex). Also, the sort values (returned in
FieldDoc.fields) when sorting by SortField.STRING or
SortField.STRING_VAL are now BytesRef instances. See MIGRATE.txt
for more details. (yonik, Mike McCandless)
* LUCENE-2480: Though not a change in backwards compatibility policy, pre-3.0
indexes are no longer supported. You should upgrade to 3.x first, then run
optimize(), or reindex. (Shai Erera, Earwin Burrfoot)
* LUCENE-2484: Removed deprecated TermAttribute. Use CharTermAttribute
and TermToBytesRefAttribute instead. (Uwe Schindler)
* LUCENE-2600: Remove IndexReader.isDeleted in favor of
AtomicReader.getDeletedDocs(). (Mike McCandless)
* LUCENE-2667: FuzzyQuery's defaults have changed for more performant
behavior: the minimum similarity is 2 edit distances from the word,
and the priority queue size is 50. To support this, FuzzyQuery now allows
specifying unscaled edit distances (foobar~2). If your application depends
upon the old defaults of 0.5 (scaled) minimum similarity and Integer.MAX_VALUE
priority queue size, you can use FuzzyQuery(Term, float, int, int) to specify
those explicitly.
* LUCENE-2674: MultiTermQuery.TermCollector.collect now accepts the
TermsEnum as well. (Robert Muir, Mike McCandless)
* LUCENE-588: WildcardQuery and QueryParser now allow escaping with
the '\' character. Previously this was impossible (you could not escape */?,
for example). If your code somehow depends on the old behavior, you will
need to change it (e.g. using "\\" to escape '\' itself).
(Sunil Kamath, Terry Yang via Robert Muir)
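When building wildcard terms in Java source, remember that the backslash must also be doubled inside a string literal. A minimal sketch of an escaping helper follows; it is hypothetical, not part of Lucene, and only covers the three characters mentioned above.

```java
// Illustrative only: a hypothetical helper, not part of Lucene.
final class WildcardEscapeSketch {

  // Prefixes each wildcard metacharacter ('*', '?') and the escape character itself with '\'.
  static String escapeWildcards(String text) {
    StringBuilder sb = new StringBuilder(text.length() + 4);
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '*' || c == '?' || c == '\\') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(escapeWildcards("what?")); // what\?
    System.out.println(escapeWildcards("a\\b*")); // a\\b\*
  }
}
```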
* LUCENE-2837: Collapsed Searcher, Searchable into IndexSearcher;
removed contrib/remote and MultiSearcher (Mike McCandless); absorbed
ParallelMultiSearcher into IndexSearcher as an optional
ExecutorService passed to its ctor. (Mike McCandless)
* LUCENE-2908, LUCENE-4037: Removed serialization code from lucene classes.
It is recommended that you serialize user search needs at a higher level
in your application.
(Robert Muir, Benson Margulies)
* LUCENE-2831: Changed Weight#scorer, Weight#explain & Filter#getDocIdSet to
operate on an AtomicReaderContext instead of directly on IndexReader to enable
searches to be aware of IndexSearcher's context. (Simon Willnauer)
* LUCENE-2839: Scorer#score(Collector,int,int) is now public because it is
called from other classes and part of public API. (Uwe Schindler)
* LUCENE-2865: Weight#scorer(AtomicReaderContext, boolean, boolean) now accepts
a ScorerContext struct instead of booleans. (Simon Willnauer)
* LUCENE-2882: Cut over SpanQuery#getSpans to AtomicReaderContext to enforce
per segment semantics on SpanQuery & Spans. (Simon Willnauer)
* LUCENE-2236: Similarity can now be configured on a per-field basis. See the
migration notes in MIGRATE.txt for more details. (Robert Muir, Doron Cohen)
* LUCENE-2315: AttributeSource's methods for accessing attributes are now final,
else it's easy to corrupt the internal states. (Uwe Schindler)
* LUCENE-2814: The IndexWriter.flush method no longer takes "boolean
flushDocStores" argument, as we now always flush doc stores (index
files holding stored fields and term vectors) while flushing a
segment. (Mike McCandless)
* LUCENE-2548: Field names (eg in Term, FieldInfo) are no longer
interned. (Mike McCandless)
* LUCENE-2883: The contents of o.a.l.search.function have been consolidated into
the queries module and can be found at o.a.l.queries.function. See
MIGRATE.txt for more information (Chris Male)
* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.
(David Nemeskey, Simon Willnauer, Mike McCandless, Robert Muir)
* LUCENE-3330: The expert visitor API in Scorer has been simplified and
extended to support arbitrary relationships. To navigate to a scorer's
children, call Scorer.getChildren(). (Robert Muir)
* LUCENE-2308: Field is now instantiated with an instance of IndexableFieldType,
of which there is a core implementation FieldType. Most properties
describing a Field have been moved to IndexableFieldType. See MIGRATE.txt
for more details. (Nikola Tankovic, Mike McCandless, Chris Male)
* LUCENE-3396: ReusableAnalyzerBase.TokenStreamComponents.reset(Reader) now
returns void instead of boolean. If a Component cannot be reset, it should
throw an Exception. (Chris Male)
* LUCENE-3396: ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer
implementations must now use Analyzer.TokenStreamComponents, rather than
overriding .tokenStream() and .reusableTokenStream() (which are now final).
(Chris Male)
* LUCENE-3346: Analyzer.reusableTokenStream() has been renamed to tokenStream()
with the old tokenStream() method removed. Consequently it is now mandatory
for all Analyzers to support reusability. (Chris Male)
* LUCENE-3473: AtomicReader.getUniqueTermCount() no longer throws UOE when
it cannot be easily determined. Instead, it returns -1 to be consistent with
this behavior across other index statistics.
(Robert Muir)
* LUCENE-1536: The abstract FilteredDocIdSet.match() method is no longer
allowed to throw IOException. This change was required to make it conform
to the Bits interface. This method should never do I/O for performance reasons.
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
* LUCENE-3559: The methods "docFreq" and "maxDoc" on IndexSearcher were removed,
as these are no longer used by the scoring system. See MIGRATE.txt for more
details. (Robert Muir)
* LUCENE-3533: Removed SpanFilters, they created large lists of objects and
did not scale. (Robert Muir)
* LUCENE-3606: IndexReader and subclasses were made read-only. It is no longer
possible to delete or undelete documents using IndexReader; you have to use
IndexWriter now. As deleting by internal Lucene docID is no longer possible,
this requires adding a unique identifier field to your index. Deleting/
relying upon Lucene docIDs is not recommended anyway, because they can
change. Consequently commit() was removed and DirectoryReader.open(),
openIfChanged() no longer take readOnly booleans or IndexDeletionPolicy
instances. Furthermore, IndexReader.setNorm() was removed. If you need
customized norm values, the recommended way to do this is by modifying
Similarity to use an external byte[] or one of the new DocValues
fields (LUCENE-3108). Alternatively, to dynamically change norms (boost
*and* length norm) at query time, wrap your AtomicReader using
FilterAtomicReader, overriding FilterAtomicReader.norms(). To persist the
changes on disk, copy the FilteredIndexReader to a new index using
IndexWriter.addIndexes(). (Uwe Schindler, Robert Muir)
* LUCENE-3640: Removed IndexSearcher.close(), because IndexSearcher no longer
takes a Directory and no longer "manages" IndexReaders, it is a no-op.
(Robert Muir)
* LUCENE-3684: Add offsets into DocsAndPositionsEnum, and a new
FieldInfo.IndexOption: DOCS_AND_POSITIONS_AND_OFFSETS. (Robert
Muir, Mike McCandless)
* LUCENE-2858, LUCENE-3770: FilterIndexReader was renamed to
FilterAtomicReader and now extends AtomicReader. If you want to filter
composite readers like DirectoryReader or MultiReader, filter their
atomic leaves and build a new CompositeReader (e.g. MultiReader) around
them. (Uwe Schindler, Robert Muir)
* LUCENE-3736: ParallelReader was split into ParallelAtomicReader
and ParallelCompositeReader. Lucene 3.x's ParallelReader is now
ParallelAtomicReader; but the new composite variant has improved performance
as it works on the atomic subreaders. It requires that all parallel
composite readers have the same subreader structure. If you cannot provide this,
you can use SlowCompositeReaderWrapper to make all parallel readers atomic
and use ParallelAtomicReader. (Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-2000: clone() now returns covariant types where possible. (ryan)
* LUCENE-3970: Rename Fields.getUniqueFieldCount -> .size() and
Terms.getUniqueTermCount -> .size(). (Iulius Curt via Mike McCandless)
* LUCENE-3514: IndexSearcher.setDefaultFieldSortScoring was removed
and replaced with per-search control via new expert search methods
that take two booleans indicating whether hit scores and max
score should be computed. (Mike McCandless)
* LUCENE-4055: You can't put foreign files into the index dir anymore.
* LUCENE-3866: CompositeReader.getSequentialSubReaders() now returns
unmodifiable List<? extends IndexReader>. ReaderUtil.Gather was
removed, as IndexReaderContext.leaves() is now the preferred way
to access sub-readers. (Uwe Schindler)
* LUCENE-4155: oal.util.ReaderUtil, TwoPhaseCommit, TwoPhaseCommitTool
classes were moved to oal.index package. oal.util.CodecUtil class was moved
to oal.codecs package. oal.util.DummyConcurrentLock was removed
(no longer used in Lucene 4.0). (Uwe Schindler)
Changes in Runtime Behavior
* LUCENE-2846: omitNorms now behaves like omitTermFrequencyAndPositions, if you
omitNorms(true) for field "a" for 1000 documents, but then add a document with
omitNorms(false) for field "a", all documents for field "a" will have no
norms. Previously, Lucene would fill the first 1000 documents with
"fake norms" from Similarity.getDefault(). (Robert Muir, Mike McCandless)
* LUCENE-2846: When some documents contain field "a", and others do not, the
documents that don't have the field get a norm byte value of 0. Previously,
Lucene would populate "fake norms" with Similarity.getDefault() for these
documents. (Robert Muir, Mike McCandless)
* LUCENE-2720: IndexWriter throws IndexFormatTooOldException on open, rather
than later when e.g. a merge starts.
(Shai Erera, Mike McCandless, Uwe Schindler)
* LUCENE-2881: FieldInfos is now tracked per segment. Before it was tracked
per IndexWriter session, which resulted in FieldInfos that had the FieldInfo
properties from all previous segments combined. Field numbers are now tracked
globally across IndexWriter sessions and persisted into a _X.fnx file on
successful commit. The corresponding file format changes are backwards-
compatible. (Michael Busch, Simon Willnauer)
* LUCENE-2956, LUCENE-2573, LUCENE-2324, LUCENE-2555: Changes from
DocumentsWriterPerThread:
- IndexWriter now uses a DocumentsWriter per thread when indexing documents.
Each DocumentsWriterPerThread indexes documents in its own private segment,
and the in memory segments are no longer merged on flush. Instead, each
segment is separately flushed to disk and subsequently merged with normal
segment merging.
- DocumentsWriterPerThread (DWPT) is now flushed concurrently based on a
FlushPolicy. When a DWPT is flushed, a fresh DWPT is swapped in so that
indexing may continue concurrently with flushing. The selected
DWPT flushes all its RAM resident documents to disk. Note: Segment flushes
don't flush all RAM resident documents but only the documents private to
the DWPT selected for flushing.
- Flushing is now controlled by FlushPolicy that is called for every add,
update or delete on IndexWriter. By default DWPTs are flushed either on
maxBufferedDocs per DWPT or the global active used memory. Once the active
memory exceeds ramBufferSizeMB only the largest DWPT is selected for
flushing and the memory used by this DWPT is subtracted from the active
memory and added to a flushing memory pool, which can lead to temporarily
higher memory usage due to ongoing indexing.
- IndexWriter now can utilize ramBufferSize > 2048 MB. Each DWPT can address
up to 2048 MB memory such that the ramBufferSize is now bounded by the max
number of DWPT available in the used DocumentsWriterPerThreadPool.
IndexWriter's net memory consumption can grow far beyond the 2048 MB limit if
the application can use all available DWPTs. To prevent a DWPT from
exhausting its address space IndexWriter will forcefully flush a DWPT if its
hard memory limit is exceeded. The RAMPerThreadHardLimitMB can be controlled
via IndexWriterConfig and defaults to 1945 MB.
Since IndexWriter flushes DWPT concurrently not all memory is released
immediately. Applications should still use a ramBufferSize significantly
lower than the JVM's available heap memory since under high load multiple
flushing DWPT can consume substantial transient memory when IO performance
is slow relative to indexing rate.
- IndexWriter#commit now doesn't block concurrent indexing while flushing all
'currently' RAM resident documents to disk. Yet, flushes that occur while
a full flush is running are queued and will happen after all DWPT involved
in the full flush are done flushing. Applications that use multiple threads
during indexing and trigger a full flush (eg call commit() or open a new
NRT reader) can use significantly more transient memory.
- IndexWriter#addDocument and IndexWriter.updateDocument can block indexing
threads if the number of active + number of flushing DWPT exceed a
safety limit. By default this happens if 2 * max number available thread
states (DWPTPool) is exceeded. This safety limit prevents applications from
exhausting their available memory if flushing can't keep up with
concurrently indexing threads.
- IndexWriter only applies and flushes deletes if the maxBufferedDelTerms
limit is reached during indexing. No segment flushes will be triggered
due to this setting.
- IndexWriter#flush(boolean, boolean) doesn't synchronize on IndexWriter
anymore. A dedicated flushLock has been introduced to prevent multiple full-
flushes happening concurrently.
- DocumentsWriter doesn't write shared doc stores anymore.
(Mike McCandless, Michael Busch, Simon Willnauer)
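The flush-by-RAM rule described above (once active memory exceeds ramBufferSizeMB, only the largest DWPT is selected and its memory moves to a flushing pool) can be sketched as a toy model. This is an illustration in plain Java, not Lucene's FlushPolicy; the class, fields, and method names are all hypothetical, and sizes are plain byte counts.

```java
// Illustrative only: a toy model of the flush-by-RAM rule, not Lucene's FlushPolicy.
final class FlushSelectionSketch {
  private final long[] bufferBytes;  // bytes held by each active per-thread buffer
  private final long ramBufferBytes; // limit on total active memory
  long activeBytes;                  // memory in buffers still accepting documents
  long flushingBytes;                // memory in buffers that are being flushed

  FlushSelectionSketch(int threadStates, long ramBufferBytes) {
    this.bufferBytes = new long[threadStates];
    this.ramBufferBytes = ramBufferBytes;
  }

  // Accounts for one indexed document. Returns the index of the buffer
  // selected for flushing, or -1 if active memory is still within the limit.
  int onDocument(int threadState, long docBytes) {
    bufferBytes[threadState] += docBytes;
    activeBytes += docBytes;
    if (activeBytes <= ramBufferBytes) {
      return -1;
    }
    int largest = 0;
    for (int i = 1; i < bufferBytes.length; i++) {
      if (bufferBytes[i] > bufferBytes[largest]) {
        largest = i;
      }
    }
    // Move the selected buffer's memory from the active pool to the flushing pool;
    // a fresh, empty buffer takes its place so indexing can continue.
    activeBytes -= bufferBytes[largest];
    flushingBytes += bufferBytes[largest];
    bufferBytes[largest] = 0;
    return largest;
  }

  // Releases the memory of a buffer whose flush has completed.
  void onFlushDone(long bytes) {
    flushingBytes -= bytes;
  }
}
```

In this model the sum of activeBytes and flushingBytes can exceed the configured limit until onFlushDone is called, which mirrors the temporarily higher memory usage noted above.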
* LUCENE-3309: Stored fields no longer record whether they were
tokenized or not. In general you should not rely on stored fields
to record any "metadata" from indexing (tokenized, omitNorms,
IndexOptions, boost, etc.) (Mike McCandless)
* LUCENE-3309: Fast vector highlighter now inserts the
MultiValuedSeparator for NOT_ANALYZED fields (in addition to
ANALYZED fields). To ensure your offsets are correct you should
provide an analyzer that returns 1 from the offsetGap method.
(Mike McCandless)
* LUCENE-2621: Removed contrib/instantiated. (Robert Muir)
* LUCENE-1768: StandardQueryTreeBuilder no longer uses RangeQueryNodeBuilder
for RangeQueryNodes, since these two classes were removed;
TermRangeQueryNodeProcessor now creates TermRangeQueryNode,
instead of RangeQueryNode; the same applies for numeric nodes;
(Vinicius Barros via Uwe Schindler)
* LUCENE-3455: QueryParserBase.newFieldQuery() will throw a ParseException if
any of the calls to the Analyzer throw an IOException. QueryParserBase.analyzeRangePart()
will throw a RuntimeException if an IOException is thrown by the Analyzer.
* LUCENE-4127: IndexWriter will now throw IllegalArgumentException if
the first token of an indexed field has 0 positionIncrement
(previously it silently corrected it to 1, possibly masking bugs).
OffsetAttributeImpl will throw IllegalArgumentException if endOffset
is less than startOffset, or if offsets are negative.
(Robert Muir, Mike McCandless)
API Changes
* LUCENE-2302, LUCENE-1458, LUCENE-2111, LUCENE-2514: Terms are no longer
required to be character based. Lucene views a term as an arbitrary byte[]:
during analysis, character-based terms are converted to UTF8 byte[],
but analyzers are free to directly create terms as byte[]
(NumericField does this, for example). The term data is buffered as
byte[] during indexing, written as byte[] into the terms dictionary,
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
searching.
* LUCENE-1458, LUCENE-2111: AtomicReader now directly exposes its
deleted docs (getDeletedDocs), providing a new Bits interface to
directly query by doc ID.
* LUCENE-2691: IndexWriter.getReader() has been made package local and is now
exposed via open and reopen methods on DirectoryReader. The semantics of the
call is the same as it was prior to the API change.
(Grant Ingersoll, Mike McCandless)
* LUCENE-2566: QueryParser: Unary operators +,-,! will not be treated as
operators if they are followed by whitespace. (yonik)
* LUCENE-2831: Weight#scorer, Weight#explain, Filter#getDocIdSet,
Collector#setNextReader & FieldComparator#setNextReader now expect an
AtomicReaderContext instead of an IndexReader. (Simon Willnauer)
* LUCENE-2892: Add QueryParser.newFieldQuery (called by getFieldQuery by
default) which takes Analyzer as a parameter, for easier customization by
subclasses. (Robert Muir)
* LUCENE-2953: In addition to changes in 3.x, PriorityQueue#initialize(int)
function was moved into the ctor. (Uwe Schindler, Yonik Seeley)
* LUCENE-3219: SortField type properties have been moved to an enum
SortField.Type. To be consistent, CachedArrayCreator.getSortTypeID() has
been changed to CachedArrayCreator.getSortType(). (Chris Male)
* LUCENE-3225: Add TermsEnum.seekExact for faster seeking when you
don't need the ceiling term; renamed existing seek methods to either
seekCeil or seekExact; changed seekExact(ord) to return no value.
Fixed MemoryCodec and SimpleTextCodec to optimize the seekExact
case, and fixed places in Lucene to use seekExact when possible.
(Mike McCandless)
* LUCENE-1536: Filter.getDocIdSet() now takes an acceptDocs Bits interface (like
Scorer) limiting the documents that can appear in the returned DocIdSet.
Filters are now required to respect these acceptDocs, otherwise deleted documents
may get returned by searches. Most filters will pass these Bits down to DocsEnum,
but those, e.g. working on FieldCache, may need to use BitsFilteredDocIdSet.wrap()
to exclude them.
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
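The acceptDocs contract amounts to intersecting a filter's own matches with the set of documents the caller allows. A minimal sketch using java.util.BitSet as a stand-in for both the Bits interface and the DocIdSet; the helper is hypothetical, and treating a null acceptDocs as "all documents accepted" is an assumption of this sketch.

```java
import java.util.BitSet;

// Illustrative only: BitSet stands in for Lucene's Bits and DocIdSet.
final class AcceptDocsSketch {

  // Returns the filter's matches restricted to acceptDocs.
  // Assumption of this sketch: a null acceptDocs means every document is accepted.
  static BitSet restrict(BitSet filterMatches, BitSet acceptDocs) {
    BitSet result = (BitSet) filterMatches.clone();
    if (acceptDocs != null) {
      result.and(acceptDocs);
    }
    return result;
  }
}
```

A filter that skips this step would return documents that are absent from acceptDocs, such as deleted ones.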
* LUCENE-3722: Similarity methods and collection/term statistics now take
long instead of int (to enable distributed scoring of > 2B docs).
(Yonik Seeley, Andrzej Bialecki, Robert Muir)
* LUCENE-3761: Generalize SearcherManager into an abstract ReferenceManager.
SearcherManager remains a concrete class, but due to the refactoring, the
method maybeReopen has been deprecated in favor of maybeRefresh().
(Shai Erera, Mike McCandless, Simon Willnauer)
* LUCENE-3859: AtomicReader.hasNorms(field) is deprecated, instead you
can inspect the FieldInfo yourself to see if norms are present, which
also allows you to get the type. (Robert Muir)
* LUCENE-2606: Changed RegexCapabilities interface to fix thread
safety, serialization, and performance problems. If you have
written a custom RegexCapabilities it will need to be updated
to the new API. (Robert Muir, Uwe Schindler)
* LUCENE-2638: Make HighFreqTerms.TermStats public to make it more useful
for API use. (Andrzej Bialecki)
* LUCENE-2912: The field-specific hashmaps in SweetSpotSimilarity were removed.
Instead, use PerFieldSimilarityWrapper to return different SweetSpotSimilaritys
for different fields, this way all parameters (such as TF factors) can be
customized on a per-field basis. (Robert Muir)
* LUCENE-3308: DuplicateFilter keepMode and processingMode have been converted to
enums DuplicateFilter.KeepMode and DuplicateFilter.ProcessingMode respectively.
* LUCENE-3483: Move Function grouping collectors from Solr to grouping module.
(Martijn van Groningen)
* LUCENE-3606: FieldNormModifier was deprecated, because IndexReader's
setNorm() was deprecated. Furthermore, this class is broken, as it does
not take position overlaps into account while recalculating norms.
(Uwe Schindler, Robert Muir)
* LUCENE-3936: Renamed StringIndexDocValues to DocTermsIndexDocValues.
(Martijn van Groningen)
* LUCENE-1768: Deprecated Parametric(Range)QueryNode, RangeQueryNode(Builder),
ParametricRangeQueryNodeProcessor were removed. (Vinicius Barros via Uwe Schindler)
* LUCENE-3820: Deprecated constructors accepting pattern matching bounds. The input
is buffered and matched in one pass. (Dawid Weiss)
* LUCENE-2413: Deprecated PatternAnalyzer in common/miscellaneous, in favor
of the pattern package (CharFilter, Tokenizer, TokenFilter). (Robert Muir)
* LUCENE-2413: Removed the AnalyzerUtil in common/miscellaneous. (Robert Muir)
* LUCENE-1370: Added ShingleFilter option to output unigrams if no shingles
can be generated. (Chris Harris via Steven Rowe)
* LUCENE-2514, LUCENE-2551: JDK and ICU CollationKeyAnalyzers were changed to
use pure byte keys when Version >= 4.0. This cuts sort key size approximately
in half. (Robert Muir)
* LUCENE-3400: Removed DutchAnalyzer.setStemDictionary (Chris Male)
* LUCENE-3431: Removed QueryAutoStopWordAnalyzer.addStopWords* deprecated methods
since they prevented reuse. Stopwords are now generated at instantiation through
the Analyzer's constructors. (Chris Male)
* LUCENE-3434: Removed ShingleAnalyzerWrapper.set* and PerFieldAnalyzerWrapper.addAnalyzer
since they prevent reuse. Both Analyzers should be configured at instantiation.
(Chris Male)
* LUCENE-3765: Stopset ctors that previously took Set<?> or Map<?,String> now take
CharArraySet and CharArrayMap respectively. Previously the behavior was confusing,
and sometimes different depending on the type of set, and ultimately a CharArraySet
or CharArrayMap was always used anyway. (Robert Muir)
* LUCENE-3830: Switched to NormalizeCharMap.Builder to create
immutable instances of NormalizeCharMap. (Dawid Weiss, Mike
McCandless)
* LUCENE-4063: FrenchLightStemmer no longer deletes repeated digits.
(Tanguy Moal via Steve Rowe)
* LUCENE-4122: Replace Payload with BytesRef. (Andrzej Bialecki)
* LUCENE-4132: IndexWriter.getConfig() now returns a LiveIndexWriterConfig object
which can be used to change the IndexWriter's live settings. IndexWriterConfig
is used only for initializing the IndexWriter. (Shai Erera)
* LUCENE-3866: IndexReaderContext.leaves() is now the preferred way to access
atomic sub-readers of any kind of IndexReader (for AtomicReaders it returns
itself as only leaf with docBase=0). (Uwe Schindler)
New features
* LUCENE-2604: Added RegexpQuery support to QueryParser. Regular expressions
are directly supported by the standard queryparser via
fieldName:/expression/ OR /expression against default field/
Users who wish to search for literal "/" characters are advised to
backslash-escape or quote those characters as needed.
(Simon Willnauer, Robert Muir)
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
matches terms against a finite-state machine. Implement WildcardQuery
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
* LUCENE-3662: Add support for levenshtein distance with transpositions
to LevenshteinAutomata, FuzzyTermsEnum, and DirectSpellChecker.
(Jean-Philippe Barrette-LaPierre, Robert Muir)
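To see what counting a transposition as a single edit changes, here is a small dynamic-programming sketch of one common formulation (optimal string alignment). It is not the automaton-based implementation used by LevenshteinAutomata, the class name is hypothetical, and it compares Java chars rather than code points for brevity.

```java
// Illustrative only: table-based edit distance, optionally counting an adjacent
// transposition as one edit (optimal string alignment variant).
final class EditDistanceSketch {

  static int distance(String a, String b, boolean transpositions) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) {
      d[i][0] = i; // delete all of a's first i chars
    }
    for (int j = 0; j <= b.length(); j++) {
      d[0][j] = j; // insert all of b's first j chars
    }
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + cost);
        // Two adjacent characters swapped: count the swap as a single edit.
        if (transpositions && i > 1 && j > 1
            && a.charAt(i - 1) == b.charAt(j - 2)
            && a.charAt(i - 2) == b.charAt(j - 1)) {
          best = Math.min(best, d[i - 2][j - 2] + 1);
        }
        d[i][j] = best;
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    System.out.println(distance("lucene", "lucnee", false)); // 2
    System.out.println(distance("lucene", "lucnee", true));  // 1
  }
}
```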
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
representation for the in-memory terms dict index. (Mike
McCandless)
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
and DataOutput. IndexInput and IndexOutput extend these new classes.
(Michael Busch)
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
for an application to create its own postings codec, to alter how
fields, terms, docs and positions are encoded into the index. The
standard codec is the default codec. IndexWriter accepts a Codec
class to obtain codecs for newly written segments.
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
for flexible indexing, including pulsing codec (inlines
low-frequency terms directly into the terms dict, avoiding seeking
for some queries), sep codec (stores docs, freqs, positions, skip
data and payloads in 5 separate files instead of the 2 used by
standard codec), and int block (really a "base" for using
block-based compressors like PForDelta for storing postings data).
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
codec is more RAM efficient: terms data is stored as block byte
arrays and packed integers. Net RAM reduction for indexes that have
many unique terms should be substantial, and initial open time for
IndexReaders should be faster. These gains only apply for newly
written segments after upgrading.
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
byte[] during indexing, which uses half the RAM for ascii terms (and
also numeric fields). This can improve indexing throughput for
applications that have many unique terms, since it reduces how often
a new segment must be flushed given a fixed RAM buffer size.
* LUCENE-2489: Added PerFieldCodecWrapper (in oal.index.codecs) which
lets you set the Codec per field (Mike McCandless)
* LUCENE-2373: Extend Codec to use SegmentInfosWriter and
SegmentInfosReader to allow customization of SegmentInfos data.
(Andrzej Bialecki)
* LUCENE-2504: FieldComparator.setNextReader now returns a
FieldComparator instance. You can "return this", to just reuse the
same instance, or you can return a comparator optimized to the new
segment. (yonik, Mike McCandless)
* LUCENE-2648: PackedInts.Iterator now supports advancing by more than a
single ordinal. (Simon Willnauer)
* LUCENE-2649: Objects in the FieldCache can optionally store Bits
that mark which docs have real values in the native[] (ryan)
* LUCENE-2664: Add SimpleText codec, which stores all terms/postings
data in a single text file for transparency (at the expense of poor
performance). (Sahin Buyrukbilen via Mike McCandless)
* LUCENE-2589: Add a VariableSizedIntIndexInput, which, when used w/
Sep*, makes it simple to take any variable sized int block coders
(like Simple9/16) and use them in a codec. (Mike McCandless)
* LUCENE-2597: Add oal.index.SlowCompositeReaderWrapper, to wrap a
composite reader (eg MultiReader or DirectoryReader), making it
pretend it's an atomic reader. This is a convenience class (you can
use MultiFields static methods directly, instead) if you need to use
the flex APIs directly on a composite reader. (Mike McCandless)
* LUCENE-2690: MultiTermQuery boolean rewrites per segment.
(Uwe Schindler, Robert Muir, Mike McCandless, Simon Willnauer)
* LUCENE-996: The QueryParser now accepts mixed inclusive and exclusive
bounds for range queries. Example: "{3 TO 5]"
QueryParser subclasses that overrode getRangeQuery will need to be changed
to use the new getRangeQuery method. (Andrew Schurman, Mark Miller, yonik)
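As an illustration of the mixed-bound semantics only, the sketch below evaluates a range such as "{3 TO 5]" against integer values. It is not QueryParser code: the class MixedRangeSketch and its contains method are hypothetical names, and the integer comparison is an assumption made for readability.

```java
/** Hypothetical sketch of mixed inclusive/exclusive range semantics; not Lucene code. */
final class MixedRangeSketch {

    /**
     * Parses a range such as "{3 TO 5]" and tests whether value lies inside it.
     * '[' and ']' mark inclusive bounds, '{' and '}' mark exclusive bounds.
     */
    static boolean contains(String range, int value) {
        String r = range.trim();
        boolean lowerInclusive = r.charAt(0) == '[';
        boolean upperInclusive = r.charAt(r.length() - 1) == ']';
        // Strip the bracket characters, then split on the TO keyword.
        String[] parts = r.substring(1, r.length() - 1).split(" TO ");
        int lower = Integer.parseInt(parts[0].trim());
        int upper = Integer.parseInt(parts[1].trim());
        boolean aboveLower = lowerInclusive ? value >= lower : value > lower;
        boolean belowUpper = upperInclusive ? value <= upper : value < upper;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        for (int v = 2; v <= 6; v++) {
            System.out.println(v + " in {3 TO 5] ? " + contains("{3 TO 5]", v));
        }
    }
}
```

With "{3 TO 5]" the lower bound 3 is excluded and the upper bound 5 is included.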
* LUCENE-2742: Add native per-field postings format support. Codec now lets you
register a postings format for each field, which is in turn recorded
into the index. Postings formats are maintained on a per-segment basis and can be
resolved without knowing the actual postings format used for writing the segment.
(Simon Willnauer)
* LUCENE-2741: Add support for multiple codecs that use the same file
extensions within the same segment. Codecs now use their per-segment codec
ID in the file names. (Simon Willnauer)
* LUCENE-2843: Added a new terms index impl,
VariableGapTermsIndexWriter/Reader, that accepts a pluggable
IndexTermSelector for picking which terms should be indexed in the
terms dict. This impl stores the indexed terms in an FST, which is
much more RAM efficient than FixedGapTermsIndex. (Mike McCandless)
* LUCENE-2862: Added TermsEnum.totalTermFreq() and
Terms.getSumTotalTermFreq(). (Mike McCandless, Robert Muir)
* LUCENE-3290: Added Terms.getSumDocFreq() (Mike McCandless, Robert Muir)
* LUCENE-3003: Added new expert class oal.index.DocTermOrds,
refactored from Solr's UnInvertedField, for accessing term ords for
multi-valued fields, per document. This is similar to FieldCache in
that it inverts the index to compute the ords, but differs in that
it's able to handle multi-valued fields and does not hold the term
bytes in RAM. (Mike McCandless)
* LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231: Changes from
DocValues (ColumnStrideFields):
- IndexWriter now supports typesafe dense per-document values stored in
column-like storage. DocValues are stored on a per-document
basis where each document's field can hold exactly one value of a given
type. DocValues are provided via Fieldable and can be used in
conjunction with stored and indexed values.
- DocValues provides an entirely RAM resident document id to value
mapping per field as well as a DocIdSetIterator based disk-resident
sequential access API relying on filesystem-caches.
- Both APIs are exposed via IndexReader and the Codec / Flex API allowing
expert users to integrate customized DocValues reader and writer
implementations by extending existing Codecs.
- DocValues provides implementations for primitive datatypes like int,
long, float, double and arrays of byte. Byte based implementations further
provide storage variants like straight or dereferenced stored bytes, fixed
and variable length bytes as well as index-time sorted variants based on
user-provided comparators.
(Mike McCandless, Simon Willnauer)
* LUCENE-3209: Added MemoryCodec, which stores all terms & postings in
RAM as an FST; this is good for primary-key fields if you frequently
need to lookup by that field or perform deletions against it, for
example in a near-real-time setting. (Mike McCandless)
* SOLR-2533: Added support for rewriting Sort and SortFields using an
IndexSearcher. SortFields can have SortField.REWRITEABLE type which
requires they are rewritten before they are used. (Chris Male)
* LUCENE-3203: FSDirectory can now limit the max allowed write rate
(MB/sec) of all running merges, to reduce impact ongoing merging has
on searching, NRT reopen time, etc. (Mike McCandless)
* LUCENE-2793: Directory#createOutput & Directory#openInput now accept an
IOContext instead of a buffer size to allow low level optimizations for
different use cases like merging, flushing and reading.
(Simon Willnauer, Mike McCandless, Varun Thacker)
* LUCENE-3354: FieldCache can cache DocTermOrds. (Martijn van Groningen)
* LUCENE-3376: ReusableAnalyzerBase has been moved from modules/analysis/common
into lucene/src/java/org/apache/lucene/analysis (Chris Male)
* LUCENE-3423: add Terms.getDocCount(), which returns the number of documents
that have at least one term for a field. (Yonik Seeley, Robert Muir)
* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.
- Added Okapi BM25, Language Models, Divergence from Randomness, and
Information-Based Models. The models are pluggable, support all of lucene's
features (boosts, slops, explanations, etc) and queries (spans, etc).
- All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try these out/switch back and
forth/run experiments and comparisons without reindexing. Note: most of
the models do rely upon index statistics that are new in Lucene 4.0, so
for existing 3.x indexes it's a good idea to upgrade your index to the
new format with IndexUpgrader first.
- Added a new subclass SimilarityBase which provides a simplified API
for plugging in new ranking algorithms without dealing with all of the
nuances and implementation details of Lucene.
- For example, to use BM25 for all fields:
searcher.setSimilarity(new BM25Similarity());
If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
(David Mark Nemeskey via Robert Muir)
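For readers unfamiliar with BM25, here is a minimal sketch of the textbook Okapi BM25 term score. It is an assumption-laden illustration, not the BM25Similarity implementation: the class Bm25Sketch, its method names, the idf variant, and the parameter values k1 = 1.2 and b = 0.75 are chosen for the example, and Lucene's own norm encoding and statistics handling may differ.

```java
/** Hypothetical sketch of the textbook BM25 formula; not Lucene's BM25Similarity. */
final class Bm25Sketch {

    /** One common idf variant; kept non-negative by the "1 +" inside the log. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score of one term in one document.
     * k1 controls term-frequency saturation, b controls length normalization.
     */
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75; // assumed example parameters
        System.out.println("short doc: " + score(3, 50, 100, 1000, 10, k1, b));
        System.out.println("long doc:  " + score(3, 400, 100, 1000, 10, k1, b));
    }
}
```

In this formulation a longer-than-average document scores lower for the same term frequency, and repeated occurrences of a term give diminishing returns.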
* LUCENE-3396: ReusableAnalyzerBase now provides a ReuseStrategy abstraction
which controls how TokenStreamComponents are reused per request. Two
implementations are provided - GlobalReuseStrategy which implements the
current behavior of sharing components between all fields, and
PerFieldReuseStrategy which shares per field. (Chris Male)
* LUCENE-2309: Added IndexableField.tokenStream(Analyzer) which is now
responsible for creating the TokenStreams for Fields when they are to
be indexed. (Chris Male)
* LUCENE-3433: Added random access for non RAM resident IndexDocValues. RAM
resident and disk resident IndexDocValues are now exposed via the Source
interface. ValuesEnum has been removed in favour of Source. (Simon Willnauer)
* LUCENE-1536: Filters can now be applied down-low, if their DocIdSet implements
a new bits() method, returning all documents in a random access way. If the
DocIdSet is not too sparse, it will be passed as acceptDocs down to the Scorer
as replacement for AtomicReader's live docs.
In addition, FilteredQuery now backs IndexSearcher's filtering search methods.
Using FilteredQuery you can chain Filters in a very performant way
[new FilteredQuery(new FilteredQuery(query, filter1), filter2)], which was not
possible with IndexSearcher's methods. FilteredQuery also allows overriding
the heuristics used to decide if filtering should be done random access or
using a conjunction on DocIdSet's iterator().
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
* LUCENE-3638: Added sugar methods to IndexReader and IndexSearcher to
load only certain fields when loading a document. (Peter Chang via
Mike McCandless)
* LUCENE-3628: Norms are represented as DocValues. AtomicReader exposes
a #normValues(String) method to obtain norms per field. (Simon Willnauer)
* LUCENE-3687: Similarity#computeNorm(FieldInvertState, Norm) allows computing
norm values of arbitrary precision. Instead of returning a fixed single byte
value, custom similarities can now set an integer, float or byte value on the
given Norm object. (Simon Willnauer)
* LUCENE-2604, LUCENE-4103: Added RegexpQuery support to contrib/queryparser.
(Simon Willnauer, Robert Muir, Daniel Truemper)
* LUCENE-2373: Added a Codec implementation that works with append-only
filesystems (e.g. Hadoop DFS). SegmentInfos writing/reading
code is refactored to support append-only FS, and to allow for future
customization of per-segment information. (Andrzej Bialecki)
* LUCENE-2479: Added ability to provide a sort comparator for spelling suggestions along
with two implementations. The existing comparator (score, then frequency) is the default (Grant Ingersoll)
* LUCENE-2608: Added the ability to specify the accuracy per method call in the SpellChecker. The class-level
setting is also still available. (Grant Ingersoll)
* LUCENE-2507: Added DirectSpellChecker, which retrieves correction candidates directly
from the term dictionary using levenshtein automata. (Robert Muir)
* LUCENE-3527: Add LuceneLevenshteinDistance, which computes string distance in a way
compatible with DirectSpellChecker. This can be used to merge top-N results from more than one
SpellChecker. (James Dyer via Robert Muir)
* LUCENE-3496: Support grouping by DocValues. (Martijn van Groningen)
* LUCENE-2795: Generified DirectIOLinuxDirectory to work across any
unix supporting the O_DIRECT flag when opening a file (tested on
Linux and OS X but likely other Unixes will work), and improved it
so it can be used for indexing and searching. The directory uses
direct IO when doing large merges to avoid unnecessarily evicting
cached IO pages due to large merges. (Varun Thacker, Mike
McCandless)
* LUCENE-3827: DocsAndPositionsEnum from MemoryIndex implements
start/endOffset, if offsets are indexed. (Alan Woodward via Mike
McCandless)
* LUCENE-3802, LUCENE-3856: Support for grouped faceting. (Martijn van Groningen)
* LUCENE-3444: Added a second pass grouping collector that keeps track of distinct
values for a specified field for the top N groups. (Martijn van Groningen)
* LUCENE-3778: Added a grouping utility class that makes it easier to use result
grouping for pure Lucene apps. (Martijn van Groningen)
* LUCENE-2341: A new analysis/ filter: Morfologik - a dictionary-driven lemmatizer
(accurate stemmer) for Polish (includes morphosyntactic annotations).
(Michał Dybizbański, Dawid Weiss)
* LUCENE-2413: Consolidated Lucene/Solr analysis components into analysis/common.
New features from Solr now available to Lucene users include:
- o.a.l.analysis.commongrams: Constructs n-grams for frequently occurring terms
and phrases.
- o.a.l.analysis.charfilter.HTMLStripCharFilter: CharFilter that strips HTML
constructs.
- o.a.l.analysis.miscellaneous.WordDelimiterFilter: TokenFilter that splits words
into subwords and performs optional transformations on subword groups.
- o.a.l.analysis.miscellaneous.RemoveDuplicatesTokenFilter: TokenFilter which
filters out Tokens at the same position and Term text as the previous token.
- o.a.l.analysis.miscellaneous.TrimFilter: Trims leading and trailing whitespace
from Tokens in the stream.
- o.a.l.analysis.miscellaneous.KeepWordFilter: A TokenFilter that only keeps tokens
with text contained in the required words (inverse of StopFilter).
- o.a.l.analysis.miscellaneous.HyphenatedWordsFilter: A TokenFilter that puts
hyphenated words broken into two lines back together.
- o.a.l.analysis.miscellaneous.CapitalizationFilter: A TokenFilter that applies
capitalization rules to tokens.
- o.a.l.analysis.pattern: Package for pattern-based analysis, containing a
CharFilter, Tokenizer, and TokenFilter for transforming text with regexes.
- o.a.l.analysis.synonym.SynonymFilter: A synonym filter that supports multi-word
synonyms.
- o.a.l.analysis.phonetic: Package for phonetic search, containing various
phonetic encoders such as Double Metaphone.
Some existing analysis components changed packages:
- o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer
- o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer
- o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer
- o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter
- o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer
- o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer
- o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer
- o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter
- o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer
- o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer
- o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter
- o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter
- o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter
- o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter
- o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter
- o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper
- o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter
- o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter
- o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter
- o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter
- o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap
- o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet
- o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap
- o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase
- o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase
- o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader
- o.a.l.analysis.CharTokenizer -> o.a.l.analysis.util.CharTokenizer
- o.a.l.util.CharacterUtils -> o.a.l.analysis.util.CharacterUtils
All analyzers in contrib/analyzers and contrib/icu were moved to the
analysis/ module. The 'smartcn' and 'stempel' components now depend on 'common'.
(Chris Male, Robert Muir)
* LUCENE-4004: Add DisjunctionMaxQuery support to the xml query parser.
(Benson Margulies via Robert Muir)
* LUCENE-4025: Add maybeRefreshBlocking to ReferenceManager, to let a caller
block until the refresh logic has been executed. (Shai Erera, Mike McCandless)
* LUCENE-4039: Add AddIndexesTask to benchmark, which uses IW.addIndexes.
(Shai Erera)
* LUCENE-3514: Added IndexSearcher.searchAfter when Sort is used,
returning results after a specified FieldDoc for deep
paging. (Mike McCandless)
* LUCENE-4043: Added scoring support via score mode for query time joining.
(Martijn van Groningen, Mike McCandless)
* LUCENE-3523: Added oal.search.spell.WordBreakSpellChecker, which
generates suggestions by combining two or more terms and/or
breaking terms into multiple words. See Javadocs for usage. (James Dyer)
* LUCENE-4019: Added improved parsing of Hunspell Dictionaries so that
rules missing the required number of parameters are either ignored or
cause a ParseException (depending on whether strict parsing is enabled).
(Luca Cavanna via Chris Male)
* LUCENE-3440: Add ordered fragments feature with IDF-weighted terms for FVH.
(Sebastian Lutze via Koji Sekiguchi)
* LUCENE-4082: Added explain to ToParentBlockJoinQuery.
(Christoph Kaser, Martijn van Groningen)
* LUCENE-4108: add replaceTaxonomy to DirectoryTaxonomyWriter, which replaces
the taxonomy in place with the given one. (Shai Erera)
* LUCENE-3030: new BlockTree terms dictionary (used by the default
Lucene40 postings format) uses less RAM (for the terms index) and
disk space (for all terms and metadata) and gives sizable
performance gains for terms dictionary intensive operations like
FuzzyQuery, direct spell checker and primary-key lookup (Mike
McCandless).
Optimizations
* LUCENE-2588: Don't store unnecessary suffixes when writing the terms
index, saving RAM in IndexReader; change default terms index
interval from 128 to 32, because the terms index now requires much
less RAM. (Robert Muir, Mike McCandless)
* LUCENE-2669: Optimize NumericRangeQuery.NumericRangeTermsEnum to
not seek backwards when a sub-range has no terms. It now only seeks
when the current term is less than the next sub-range's lower end.
(Uwe Schindler, Mike McCandless)
* LUCENE-2694: Optimize MultiTermQuery to be single pass for Term lookups.
MultiTermQuery now stores TermState per leaf reader during rewrite to re-
seek the term dictionary in TermQuery / TermWeight.
(Simon Willnauer, Mike McCandless, Robert Muir)
* LUCENE-3292: IndexWriter no longer shares the same SegmentReader
instance for merging and NRT readers, which enables directory impls
to separately tune IO flags for each. (Varun Thacker, Simon
Willnauer, Mike McCandless)
* LUCENE-3328: BooleanQuery now uses a specialized ConjunctionScorer if all
boolean clauses are required and instances of TermQuery.
(Simon Willnauer, Robert Muir)
* LUCENE-3643: FilteredQuery and IndexSearcher.search(Query, Filter,...)
now optimize the special case query instanceof MatchAllDocsQuery to
execute as ConstantScoreQuery. (Uwe Schindler)
* LUCENE-3509: Added fasterButMoreRam option for docvalues. This option controls whether the space for packed ints
should be rounded up for better performance. This option only applies for docvalues types bytes fixed sorted
and bytes var sorted. (Simon Willnauer, Martijn van Groningen)
* LUCENE-3795: Replace contrib/spatial with modules/spatial. This includes
a basic spatial strategy interface. (David Smiley, Chris Male, ryan)
* LUCENE-3932: Lucene3x codec loads terms index faster, by
pre-allocating the packed ints array based on the .tii file size
(Sean Bridges via Mike McCandless)
* LUCENE-3468: Replaced last() and remove() with pollLast() in
FirstPassGroupingCollector (Martijn van Groningen)
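The difference between the two approaches can be sketched with a plain java.util.TreeSet. This is only an illustration of the general technique: the class PollLastSketch and the Integer element type are assumptions for the example, not the collector's actual data structure.

```java
import java.util.TreeSet;

/** Hypothetical sketch comparing last()+remove() with pollLast() on a TreeSet. */
final class PollLastSketch {

    /** Two tree operations: find the greatest element, then search for it again to remove it. */
    static Integer removeLastTwoStep(TreeSet<Integer> set) {
        Integer last = set.last(); // throws NoSuchElementException if the set is empty
        set.remove(last);
        return last;
    }

    /** One tree operation: retrieve and remove the greatest element together. */
    static Integer removeLastOneStep(TreeSet<Integer> set) {
        return set.pollLast(); // returns null if the set is empty
    }

    public static void main(String[] args) {
        TreeSet<Integer> a = new TreeSet<>();
        TreeSet<Integer> b = new TreeSet<>();
        for (int i = 1; i <= 5; i++) {
            a.add(i);
            b.add(i);
        }
        System.out.println(removeLastTwoStep(a) + " " + a);
        System.out.println(removeLastOneStep(b) + " " + b);
    }
}
```

Both variants leave the set in the same state; pollLast() simply avoids the second lookup.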
* LUCENE-3830: Changed MappingCharFilter/NormalizeCharMap to use an
FST under the hood, which requires less RAM. NormalizeCharMap no
longer accepts empty string match (it did previously, but ignored
it). (Dawid Weiss, Mike McCandless)
* LUCENE-4061: improve synchronization in DirectoryTaxonomyWriter.addCategory
and a few general improvements to DirectoryTaxonomyWriter.
(Shai Erera, Gilad Barkai)
* LUCENE-4062: Add new aligned packed bits impls for faster lookup
performance; add float acceptableOverheadRatio to getWriter and
getMutable API to give packed ints freedom to pick faster
implementations (Adrien Grand via Mike McCandless)
* LUCENE-2357: Reduce transient RAM usage when merging segments in
IndexWriter. (Adrien Grand)
* LUCENE-4098: Add bulk get/set methods to PackedInts (Adrien Grand
via Mike McCandless)
* LUCENE-4156: DirectoryTaxonomyWriter.getSize is no longer synchronized.
(Shai Erera, Sivan Yogev)
* LUCENE-4163: Improve concurrency of MMapIndexInput.clone() by using
the new WeakIdentityMap on top of a ConcurrentHashMap to manage
the cloned instances. WeakIdentityMap was extended to support
iterating over its keys. (Uwe Schindler)
Bug fixes
* LUCENE-2803: The FieldCache can miss values if an entry for a reader
with more document deletions is requested before a reader with fewer
deletions, provided they share some segments. (yonik)
* LUCENE-2645: Fix false assertion error when same token was added one
after another with 0 posIncr. (David Smiley, Kurosaka Teruhiko via Mike
McCandless)
* LUCENE-3348: Fix thread safety hazards in IndexWriter that could
rarely cause deletions to be incorrectly applied. (Yonik Seeley,
Simon Willnauer, Mike McCandless)
* LUCENE-3515: Fix terrible merge performance versus 3.x, especially
when the directory isn't MMapDirectory, due to failing to reuse
DocsAndPositionsEnum while merging (Marc Sturlese, Erick Erickson,
Robert Muir, Simon Willnauer, Mike McCandless)
* LUCENE-3589: BytesRef copy(short) didn't set length.
(Peter Chang via Robert Muir)
* LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was
not lowercasing the key before checking for the tag (Adriano Crestani)
* LUCENE-3890: Fixed NPE for grouped faceting on multi-valued fields.
(Michael McCandless, Martijn van Groningen)
* LUCENE-2945: Fix hashCode/equals for surround query parser generated queries.
(Paul Elschot, Simon Rosenthal, gsingers via ehatcher)
* LUCENE-3971: MappingCharFilter could return invalid final token position.
(Dawid Weiss, Robert Muir)
* LUCENE-3820: PatternReplaceCharFilter could return invalid token positions.
(Dawid Weiss)
* LUCENE-3969: Throw IAE on bad arguments that could cause confusing errors in
CompoundWordTokenFilterBase, PatternTokenizer, PositionFilter,
SnowballFilter, PathHierarchyTokenizer, ReversePathHierarchyTokenizer,
WikipediaTokenizer, and KeywordTokenizer. ShingleFilter and
CommonGramsFilter now populate PositionLengthAttribute. Fixed
PathHierarchyTokenizer to reset() all state. Protect against AIOOBE in
ReversePathHierarchyTokenizer if skip is large. Fixed wrong final
offset calculation in PathHierarchyTokenizer.
(Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-4060: Fix a synchronization bug in
DirectoryTaxonomyWriter.addTaxonomies(). Also, the method has been renamed to
addTaxonomy and now takes only one Directory and one OrdinalMap.
(Shai Erera, Gilad Barkai)
* LUCENE-3590: Fix AIOOBE in BytesRef/CharsRef copyBytes/copyChars when
offset is nonzero, fix off-by-one in CharsRef.subSequence, and fix
CharsRef's CharSequence methods to throw exceptions in boundary cases
to properly meet the specification. (Robert Muir)
* LUCENE-4084: Attempting to reuse a single IndexWriterConfig instance
across more than one IndexWriter resulted in a cryptic exception.
This is now fixed, but requires that certain members of
IndexWriterConfig (MergePolicy, FlushPolicy,
DocumentsWriterThreadPool) implement clone. (Robert Muir, Simon
Willnauer, Mike McCandless)
* LUCENE-4079: Fixed loading of Hunspell dictionaries that use aliasing (AF rules)
(Ludovic Boutros via Chris Male)
* LUCENE-4077: Expose the max score and per-group scores from
ToParentBlockJoinCollector (Christoph Kaser, Mike McCandless)
* LUCENE-4114: Fix int overflow bugs in BYTES_FIXED_STRAIGHT and
BYTES_FIXED_DEREF doc values implementations (Walt Elder via Mike McCandless).
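This entry does not spell out the faulty code, so the following is only a sketch of the general class of bug: multiplying a document id by a fixed value size in int arithmetic. The class FixedBytesOffsetSketch and its method names are hypothetical.

```java
/** Hypothetical sketch of an int overflow when computing byte offsets; not Lucene code. */
final class FixedBytesOffsetSketch {

    /** Buggy variant: the product is computed in 32-bit arithmetic and can wrap around. */
    static int offsetInt(int docID, int fixedSize) {
        return docID * fixedSize;
    }

    /** Safe variant: widen one operand to long before multiplying. */
    static long offsetLong(int docID, int fixedSize) {
        return (long) docID * fixedSize;
    }

    public static void main(String[] args) {
        int docID = 100_000_000;
        int fixedSize = 32; // bytes per value, an assumed example size
        System.out.println("int offset:  " + offsetInt(docID, fixedSize));
        System.out.println("long offset: " + offsetLong(docID, fixedSize));
    }
}
```

Once docID * fixedSize exceeds Integer.MAX_VALUE, the int variant silently produces a wrong (here negative) offset, while the long variant stays correct.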
* LUCENE-4147: Fixed thread safety issues when rollback() and commit()
are called simultaneously. (Simon Willnauer, Mike McCandless)
* LUCENE-4165: Removed closing of the Reader used to read the affix file in
HunspellDictionary. Consumers are now responsible for closing all InputStreams
once the Dictionary has been instantiated. (Torsten Krah, Uwe Schindler, Chris Male)
Documentation
* LUCENE-3958: Javadocs corrections for IndexWriter.
(Iulius Curt via Robert Muir)
Build
* LUCENE-4047: Cleanup of LuceneTestCase: moved blocks of initialization/cleanup
code into JUnit instance and class rules. (Dawid Weiss)
* LUCENE-4016: Require ANT 1.8.2+ for the build.
* LUCENE-3808: Refactoring of testing infrastructure to use randomizedtesting
package: http://labs.carrotsearch.com/randomizedtesting.html (Dawid Weiss)
* LUCENE-3964: Added target stage-maven-artifacts, which stages
Maven release artifacts to a Maven staging repository in preparation
for release. (Steve Rowe)
* LUCENE-2845: Moved contrib/benchmark to lucene/benchmark.
* LUCENE-2995: Moved contrib/spellchecker into lucene/suggest.
* LUCENE-3285: Moved contrib/queryparser into lucene/queryparser
* LUCENE-3285: Moved contrib/xml-query-parser's demo into lucene/demo
* LUCENE-3271: Moved contrib/queries BooleanFilter, BoostingQuery,
ChainedFilter, FilterClause and TermsFilter into lucene/queries
* LUCENE-3381: Moved contrib/queries regex.*, DuplicateFilter,
FuzzyLikeThisQuery and SlowCollated* into lucene/sandbox.
Removed contrib/queries.
* LUCENE-3286: Moved remainder of contrib/xml-query-parser to lucene/queryparser.
Classes now found at org.apache.lucene.queryparser.xml.*
* LUCENE-4059: Improve ANT task prepare-webpages (used by documentation
tasks) to correctly encode build file names as URIs for later processing by
XSL. (Greg Bowyer, Uwe Schindler)
======================= Lucene 3.6.2 =======================
Bug Fixes
* LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest,
and the number of matching documents is too large. (Gilad Barkai via Shai Erera)
* LUCENE-2686, LUCENE-3505, LUCENE-4401: Fix BooleanQuery scorers to
return correct freq().
(Koji Sekiguchi, Mike McCandless, Liu Chao, Robert Muir)
* LUCENE-2501: Fixed rare thread-safety issue that could cause
ArrayIndexOutOfBoundsException inside ByteBlockPool (Robert Muir,
Mike McCandless)
* LUCENE-4297: BooleanScorer2 would multiply the coord() factor
twice for conjunctions: for most users this is no problem, but
if you had a customized Similarity that returned something other
than 1 when overlap == maxOverlap (always the case for conjunctions),
then the score would be incorrect. (Pascal Chollet, Robert Muir)
* LUCENE-4300: BooleanQuery's rewrite was not always safe: if you
had a custom Similarity where coord(1,1) != 1F, then the rewritten
query would be scored differently. (Robert Muir)
* LUCENE-4398: If you index many different field names in your
documents, then due to a bug in how IndexWriter measures its RAM
usage, it would flush each segment too early, eventually
reaching the point where it flushes after every doc. (Tim Smith via
Mike McCandless)
* LUCENE-4411: when sampling is enabled for a FacetRequest, its depth
parameter is reset to the default (1), even if set otherwise.
(Gilad Barkai via Shai Erera)
* LUCENE-4635: Fixed ArrayIndexOutOfBoundsException when in-memory
terms index requires more than 2.1 GB RAM (indices with billions of
terms). (Tom Burton-West via Mike McCandless)
Documentation
* LUCENE-4302: Fix facet userguide to have HTML loose doctype like
all other javadocs. (Karl Nicholas via Uwe Schindler)
======================= Lucene 3.6.1 =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene3.6.1
Bug Fixes
* LUCENE-3969: Throw IAE on bad arguments that could cause confusing
errors in KeywordTokenizer.
(Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-3971: MappingCharFilter could return invalid final token position.
(Dawid Weiss, Robert Muir)
* LUCENE-4023: DisjunctionMaxScorer now implements visitSubScorers().
(Uwe Schindler)
* LUCENE-2566: + - operators allow any amount of whitespace (yonik, janhoy)
* LUCENE-3590: Fix AIOOBE in BytesRef/CharsRef copyBytes/copyChars when
offset is nonzero, fix off-by-one in CharsRef.subSequence, and fix
CharsRef's CharSequence methods to throw exceptions in boundary cases
to properly meet the specification. (Robert Muir)
* LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the
size in bytes not MB (Chris Fuller via Mike McCandless)
API Changes
* LUCENE-4023: Changed the visibility of Scorer#visitSubScorers() to
public, otherwise it's impossible to implement Scorers outside
the Lucene package. (Uwe Schindler)
Optimizations
* LUCENE-4163: Improve concurrency of MMapIndexInput.clone() by using
the new WeakIdentityMap on top of a ConcurrentHashMap to manage
the cloned instances. WeakIdentityMap was extended to support
iterating over its keys. (Uwe Schindler)
Tests
* LUCENE-3873: add MockGraphTokenFilter, testing analyzers with
random graph tokens. (Mike McCandless)
* LUCENE-3968: factor out LookaheadTokenFilter from
MockGraphTokenFilter (Mike McCandless)
======================= Lucene 3.6.0 =======================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene3.6
Changes in backwards compatibility policy
* LUCENE-3594: The protected inner class (never intended to be visible)
FieldCacheTermsFilter.FieldCacheTermsFilterDocIdSet was removed and
replaced by another internal implementation. (Uwe Schindler)
* LUCENE-3620: FilterIndexReader now overrides all methods of IndexReader that
it should (note that some are still not overridden, as they should be
overridden by sub-classes only). In the process, some methods of IndexReader
were made final. This is not expected to affect many apps, since these methods
already delegate to abstract methods, which you had to already override
anyway. (Shai Erera)
* LUCENE-3636: Added SearcherFactory, used by SearcherManager and NRTManager
to create new IndexSearchers. You can provide your own implementation to
warm new searchers, set an ExecutorService, set a custom Similarity, or
even return your own subclass of IndexSearcher. The SearcherWarmer and
ExecutorService parameters on these classes were removed, as they are
subsumed by SearcherFactory. (Shai Erera, Mike McCandless, Robert Muir)
* LUCENE-3644: The expert ReaderFinishedListener api suffered problems (propagated
down to subreaders, but was not called on SegmentReaders, unless they were
the owner of the reader core, and other ambiguities). The API is revised:
You can set ReaderClosedListeners on any IndexReader, and onClose is called
when that reader is closed. SegmentReader has CoreClosedListeners that you
can register to know when a shared reader core is closed.
(Uwe Schindler, Mike McCandless, Robert Muir)
* LUCENE-3652: The package org.apache.lucene.messages was moved to
contrib/queryparser. If you have used those classes in your code
just add the lucene-queryparser.jar file to your classpath.
(Uwe Schindler)
* LUCENE-3681: FST now stores labels for BYTE2 input type as 2 bytes
instead of vInt; this can make FSTs smaller and faster, but it is a
break in the binary format so if you had built and saved any FSTs
then you need to rebuild them. (Robert Muir, Mike McCandless)
* LUCENE-3679: The expert IndexReader.getFieldNames(FieldOption) API
has been removed and replaced with the experimental getFieldInfos
API. All IndexReader subclasses must implement getFieldInfos.
(Mike McCandless)
* LUCENE-3695: Move confusing add(X) methods out of FST.Builder into
FST.Util. (Robert Muir, Mike McCandless)
* LUCENE-3701: Added an additional argument to the expert FST.Builder
ctor to take FreezeTail, which you can use to (very-expertly) customize
the FST construction process. Pass null if you want the default
behavior. Added seekExact() to FSTEnum, and added FST.save/read
from a File. (Mike McCandless, Dawid Weiss, Robert Muir)
* LUCENE-3712: Removed unused and untested ReaderUtil#subReader methods.
(Uwe Schindler)
* LUCENE-3672: Deprecate Directory.fileModified,
IndexCommit.getTimestamp and .getVersion and
IndexReader.lastModified and getCurrentVersion (Andrzej Bialecki,
Robert Muir, Mike McCandless)
* LUCENE-3760: In IndexReader/DirectoryReader, deprecate static
methods getCurrentVersion and getCommitUserData, and non-static
method getCommitUserData (use getIndexCommit().getUserData()
instead). (Ryan McKinley, Robert Muir, Mike McCandless)
* LUCENE-3867: Deprecate instance creation of RamUsageEstimator; instead,
the new static method sizeOf(Object) should be used. As the algorithm
is now using Hotspot(TM) internals (reference size, header sizes,
object alignment), the abstract o.a.l.util.MemoryModel class was
completely removed (without replacement). The new static methods
no longer support String intern-ness checking; interned strings
now count toward memory usage like any other Java object.
(Dawid Weiss, Uwe Schindler, Shai Erera)
* LUCENE-3738: All readXxx methods in BufferedIndexInput were made
final. Subclasses should only override protected readInternal /
seekInternal. (Uwe Schindler)
* LUCENE-2599: Deprecated the spatial contrib module, which was buggy and not
well maintained. Lucene 4 includes a new spatial module that replaces this.
(David Smiley, Ryan McKinley, Chris Male)
Changes in Runtime Behavior
* LUCENE-3796, SOLR-3241: Throw an exception if you try to set an index-time
boost on a field that omits norms. Because the index-time boost
is multiplied into the norm, previously your boost would be
silently discarded. (Tomás Fernández Löbbe, Hoss Man, Robert Muir)
* LUCENE-3848: Fix tokenstreams to not produce a stream with an initial
position increment of 0, which is out of bounds (overlapping with a
non-existent previous term). Consumers such as IndexWriter and QueryParser
still check for and silently correct this situation today, but at some point
in the future they may throw an exception. (Mike McCandless, Robert Muir)
* LUCENE-3738: DataInput/DataOutput no longer allow negative vLongs. Negative
vInts are still supported (for index backwards compatibility), but
should not be used in new code. The read method for negative vLongs
was already broken since Lucene 3.1.
(Uwe Schindler, Mike McCandless, Robert Muir)
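As a rough illustration of the variable-length format touched by LUCENE-3738,
here is a minimal sketch, not Lucene's DataInput/DataOutput code. It assumes the
usual layout of seven payload bits per byte with the high bit as a continuation
flag; the names VLongSketch, writeVLong and readVLong are hypothetical. In this
layout a non-negative long needs at most nine bytes, while a negative value
(top bit set) would need a tenth, so the sketch rejects it up front.

```java
final class VLongSketch {
    // Encodes a non-negative long as 7-bit groups, low-order group first.
    // The high bit of each byte signals that another byte follows.
    static byte[] writeVLong(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative vLong not supported: " + value);
        }
        byte[] buf = new byte[9]; // 63 value bits fit in nine 7-bit groups
        int len = 0;
        while ((value & ~0x7FL) != 0) {
            buf[len++] = (byte) ((value & 0x7FL) | 0x80L);
            value >>>= 7;
        }
        buf[len++] = (byte) value;
        return java.util.Arrays.copyOf(buf, len);
    }

    // Decodes a value produced by writeVLong.
    static long readVLong(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7FL) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: last byte
            }
        }
        return value;
    }
}
```

Small values stay small on disk (one byte up to 127), which is the point of
the format; the explicit check turns a negative input into an immediate error
instead of a silently unreadable value.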
Security fixes
* LUCENE-3588: Try harder to prevent SIGSEGV on cloned MMapIndexInputs:
Previous versions of Lucene could SIGSEGV the JVM if you try to access
the clone of an IndexInput retrieved from MMapDirectory. This security fix
prevents this as best as it can by throwing AlreadyClosedException
also on clones. (Uwe Schindler, Robert Muir)
API Changes
* LUCENE-3606: IndexReader will be made read-only in Lucene 4.0, so all
methods that delete or undelete documents using IndexReader were
deprecated; you should use IndexWriter now. Consequently
IndexReader.commit() and all open(), openIfChanged(), clone() methods
taking readOnly booleans (or IndexDeletionPolicy instances) were
deprecated. IndexReader.setNorm() is superfluous and was deprecated.
If you have to change per-document boost use CustomScoreQuery.
If you want to dynamically change norms (boost *and* length norm) at
query time, wrap your IndexReader using FilterIndexReader, overriding
FilterIndexReader.norms(). To persist the changes on disk, copy the
FilterIndexReader to a new index using IndexWriter.addIndexes().
In Lucene 4.0, SimilarityProvider will allow you to customize scoring
using external norms, too. (Uwe Schindler, Robert Muir)
* LUCENE-3735: PayloadProcessorProvider was changed to return a
ReaderPayloadProcessor instead of DirPayloadProcessor. The selection
of the provider to return for the factory is now based on the IndexReader
to be merged. To mimic the old behaviour, just use IndexReader.directory()
for choosing the provider by Directory. (Uwe Schindler)
* LUCENE-3765: Deprecated StopFilter ctor that took ignoreCase, because
in some cases (if the set is a CharArraySet), the argument is ignored.
Deprecated StandardAnalyzer and ClassicAnalyzer ctors that take File,
please use the Reader ctor instead. (Robert Muir)
* LUCENE-3766: Deprecate no-arg ctors of Tokenizer. Tokenizers are
TokenStreams with Readers: tokenizers with null Readers will not be
supported in Lucene 4.0, just use a TokenStream.
(Mike McCandless, Robert Muir)
* LUCENE-3769: Simplified NRTManager by requiring applyDeletes to be
passed to ctor only; if an app needs to mix and match it's free to
create two NRTManagers (one always applying deletes and the other
never applying deletes). (MJB, Shai Erera, Mike McCandless)
* LUCENE-3761: Generalize SearcherManager into an abstract ReferenceManager.
SearcherManager remains a concrete class, but due to the refactoring, the
method maybeReopen has been deprecated in favor of maybeRefresh().
(Shai Erera, Mike McCandless, Simon Willnauer)
* LUCENE-3776: You now acquire/release the IndexSearcher directly from
NRTManager. (Mike McCandless)
New Features
* LUCENE-3593: Added a FieldValueFilter that accepts all documents that either
have at least one or no value at all in a specific field. (Simon Willnauer,
Uwe Schindler, Robert Muir)
* LUCENE-3586: CheckIndex and IndexUpgrader allow you to specify the
specific FSDirectory implementation to use (with the new -dir-impl
command-line option). (Luca Cavanna via Mike McCandless)
* LUCENE-3634: IndexReader's static main method was moved to a new
tool, CompoundFileExtractor, in contrib/misc. (Robert Muir, Mike
McCandless)
* LUCENE-995: The QueryParser now interprets * as an open end for range
queries. Literal asterisks may be represented by quoting or escaping
(i.e. \* or "*"). Custom QueryParser subclasses overriding getRangeQuery()
will be passed null for any open endpoint. (Ingo Renner, Adriano
Crestani, yonik, Mike McCandless)
* LUCENE-3121: Add sugar reverse lookup (given an output, find the
input mapping to it) for FSTs that have strictly monotonic long
outputs (such as an ord). (Mike McCandless)
* LUCENE-3671: Add TypeTokenFilter that filters tokens based on
their TypeAttribute. (Tommaso Teofili via Uwe Schindler)
* LUCENE-3690,LUCENE-3913: Added HTMLStripCharFilter, a CharFilter that strips
HTML markup. (Steve Rowe)
* LUCENE-3725: Added optional packing to FST building; this uses extra
RAM during building but results in a smaller FST. (Mike McCandless)
* LUCENE-3714: Add top N shortest cost paths search for FST.
(Robert Muir, Dawid Weiss, Mike McCandless)
* LUCENE-3789: Expose MTQ TermsEnum via RewriteMethod for non package private
access (Simon Willnauer)
* LUCENE-3881: Added UAX29URLEmailAnalyzer: a standard analyzer that recognizes
URLs and emails. (Steve Rowe)
Bug fixes
* LUCENE-3595: Fixed FieldCacheRangeFilter and FieldCacheTermsFilter
to correctly respect deletions on reopened SegmentReaders. Factored out
FieldCacheDocIdSet to be a top-level class. (Uwe Schindler, Simon Willnauer)
* LUCENE-3627: Don't let an errant 0-byte segments_N file corrupt the index.
(Ken McCracken via Mike McCandless)
* LUCENE-3630: The internal method MultiReader.doOpenIfChanged(boolean doClone)
was overriding IndexReader.doOpenIfChanged(boolean readOnly), so changing the
contract of the overridden method. This method was renamed and made private.
In ParallelReader the bug did not exist, but the implementation method
was also made private. (Uwe Schindler)
* LUCENE-3641: Fixed MultiReader to correctly propagate readerFinishedListeners
to clones/reopened readers. (Uwe Schindler)
* LUCENE-3642, SOLR-2891, LUCENE-3717: Fixed bugs in CharTokenizer, n-gram tokenizers/filters,
compound token filters, thai word filter, icutokenizer, pattern analyzer,
wikipediatokenizer, and smart chinese where they would create invalid offsets in
some situations, leading to problems in highlighting.
(Max Beutel, Edwin Steiner via Robert Muir)
* LUCENE-3639: TopDocs.merge was incorrectly setting TopDocs.maxScore to
Float.MIN_VALUE when it should be Float.NaN, when there were 0
hits. Improved age calculation in SearcherLifetimeManager, to have
double precision and to compute age to be how long ago the searcher
was replaced with a new searcher (Mike McCandless)
* LUCENE-3658: Corrected potential concurrency issues with
NRTCachingDir, fixed createOutput to overwrite any previous file,
and removed invalid asserts (Robert Muir, Mike McCandless)
* LUCENE-3605: don't sleep in a retry loop when trying to locate the
segments_N file (Robert Muir, Mike McCandless)
* LUCENE-3711: SentinelIntSet with a small initial size can go into
an infinite loop when expanded. This can affect grouping using
TermAllGroupsCollector or TermAllGroupHeadsCollector if instantiated with a
non default small size. (Martijn van Groningen, yonik)
* LUCENE-3727: When writing stored fields and term vectors, Lucene
checks file sizes to detect a bug in some Sun JREs (LUCENE-1282),
however, on some NFS filesystems File.length() could be stale,
resulting in false errors like "fdx size mismatch while indexing".
These checks now use getFilePointer instead to avoid this.
(Jamir Shaikh, Mike McCandless, Robert Muir)
* LUCENE-3816: Fixed problem in FilteredDocIdSet, if null was returned
from the delegate DocIdSet.iterator(), which is allowed to return
null by DocIdSet specification when no documents match.
(Shay Banon via Uwe Schindler)
* LUCENE-3821: SloppyPhraseScorer missed documents that ExactPhraseScorer finds.
When a phrase query had repeating terms (e.g. "yes no yes"), the
sloppy query missed documents that the exact query matched.
Fixed, except for repeating multiterms (e.g. "yes no yes|no").
(Robert Muir, Doron Cohen)
* LUCENE-3841: Fix CloseableThreadLocal to also purge stale entries on
get(); this fixes certain cases where we were holding onto objects
for dead threads for too long (Matthew Bellew, Mike McCandless)
* LUCENE-3872: IndexWriter.close() now throws IllegalStateException if
you call it after calling prepareCommit() without calling commit()
first. (Tim Bogaert via Mike McCandless)
* LUCENE-3874: Throw IllegalArgumentException from IndexWriter (rather
than producing a corrupt index), if a positionIncrement would cause
integer overflow. This can happen, for example when using a buggy
TokenStream that forgets to call clearAttributes() in combination
with a StopFilter. (Robert Muir)
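The kind of guard described in LUCENE-3874 can be sketched as follows. This is
an illustrative check only, not IndexWriter's actual code; the class and method
names (PositionSketch, advance) are hypothetical. It tests for overflow before
adding, so a runaway increment is reported instead of wrapping to a negative
position.

```java
final class PositionSketch {
    // Returns position + positionIncrement, or throws if the sum would not
    // fit in an int. The comparison is done before the addition so that no
    // wrapped (negative) value is ever produced.
    static int advance(int position, int positionIncrement) {
        if (positionIncrement < 0) {
            throw new IllegalArgumentException(
                "position increment must be >= 0, got " + positionIncrement);
        }
        if (position > Integer.MAX_VALUE - positionIncrement) {
            throw new IllegalArgumentException(
                "position overflow: " + position + " + " + positionIncrement);
        }
        return position + positionIncrement;
    }
}
```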
* LUCENE-3876: Fix bug where positions for a document exceeding
Integer.MAX_VALUE/2 would produce a corrupt index.
(Simon Willnauer, Mike McCandless, Robert Muir)
* LUCENE-3880: UAX29URLEmailTokenizer now recognizes emails when the mailto:
scheme is prepended. (Kai Gülzau, Steve Rowe)
Optimizations
* LUCENE-3653: Improve concurrency in VirtualMethod and AttributeSource by
using a WeakIdentityMap based on a ConcurrentHashMap. (Uwe Schindler,
Gerrit Jansen van Vuuren)
Documentation
* LUCENE-3597: Fixed incorrect grouping documentation. (Martijn van Groningen,
Robert Muir)
* LUCENE-3926: Improve documentation of RAMDirectory, because this
class is not intended to work with huge indexes. Everything beyond
several hundred megabytes will waste resources (GC cycles), because
it uses an internal buffer size of 1024 bytes, producing millions of
byte[1024] arrays. This class is optimized for small memory-resident
indexes. It also has bad concurrency on multithreaded environments.
It is recommended to materialize large indexes on disk and use
MMapDirectory, which is a high-performance directory implementation
working directly on the file system cache of the operating system,
so copying data to Java heap space is not useful. (Uwe Schindler,
Mike McCandless, Robert Muir)
Build
* LUCENE-3857: exceptions from other threads in beforeclass/etc do not fail
the test (Dawid Weiss)
* LUCENE-3847: LuceneTestCase will now check for modifications of System
properties before and after each test (and suite). If changes are detected,
the test will fail. A rule can be used to reset system properties to
before-scope state (and this has been used to make Solr tests pass).
(Dawid Weiss, Uwe Schindler)
* LUCENE-3228: Stop downloading external javadoc package-list files:
- Added package-list files for Oracle Java javadocs and JUnit javadocs to
Lucene/Solr subversion.
- The Oracle Java javadocs package-list file is excluded from Lucene and
Solr source release packages.
- Regardless of network connectivity, javadocs built from a subversion
checkout contain links to Oracle & JUnit javadocs.
- Building javadocs from a source release package will download the Oracle
Java package-list file if it isn't already present.
- When the Oracle Java package-list file is not present and download fails,
the javadocs targets will not fail the build, though an error will appear
in the build log. In this case, the built javadocs will not contain links
to Oracle Java javadocs.
- Links from Solr javadocs to Lucene's javadocs are enabled. When building
a X.Y.Z-SNAPSHOT version, the links are to the most recently built nightly
Jenkins javadocs. When building a release version, links are to the
Lucene release javadocs for the same version.
(Steve Rowe, hossman)
* LUCENE-3753: Restructure the Lucene build system:
- Created a new Lucene-internal module named "core" by moving the java/
and test/ directories from lucene/src/ to lucene/core/src/.
- Eliminated lucene/src/ by moving all its directories up one level.
- Each internal module (core/, test-framework/, and tools/) now has its own
build.xml, from which it is possible to run module-specific targets.
lucene/build.xml delegates all build tasks (via
<ant dir="internal-module-dir"> calls) to these modules' build.xml files.
(Steve Rowe)
* LUCENE-3774: Optimized and streamlined license and notice file validation
by refactoring the build task into an ANT task and modifying build scripts
to perform top-level checks. (Dawid Weiss, Steve Rowe, Robert Muir)
* LUCENE-3762: Upgrade JUnit to 4.10, refactor state-machine of detecting
setUp/tearDown call chaining in LuceneTestCase. (Dawid Weiss, Robert Muir)
* LUCENE-3944: Make the 'generate-maven-artifacts' target use filtered POMs
placed under lucene/build/poms/, rather than in each module's base
directory. The 'clean' target now removes them.
(Steve Rowe, Robert Muir)
* LUCENE-3930: Changed build system to use Apache Ivy for retrieval of 3rd
party JAR files. Please review BUILD.txt for instructions.
(Robert Muir, Chris Male, Uwe Schindler, Steven Rowe, Hossman)
======================= Lucene 3.5.0 =======================
Changes in backwards compatibility policy
* LUCENE-3390: The first approach in Lucene 3.4.0 for missing values
support for sorting had a design problem that made the missing value
be populated directly into the FieldCache arrays during sorting,
leading to concurrency issues. To fix this behaviour, the method
signatures had to be changed:
- FieldCache.getUnValuedDocs() was renamed to FieldCache.getDocsWithField()
returning a Bits interface (backported from Lucene 4.0).
- FieldComparator.setMissingValue() was removed; the missing value is now
passed to the constructor.
As this is expert API, most code will not be affected.
(Uwe Schindler, Doron Cohen, Mike McCandless)
* LUCENE-3541: Remove IndexInput's protected copyBuf. If you want to
keep a buffer in your IndexInput, do this yourself in your implementation,
and be sure to do the right thing on clone()! (Robert Muir)
* LUCENE-2822: TimeLimitingCollector now expects a counter clock instead of
relying on a private daemon thread. The global time limiting clock thread
has been exposed and is now lazily loaded and fully optional.
TimeLimitingCollector now supports setting the clock baseline manually to include
the prelude of a search. Previous versions set the baseline at construction time;
now the baseline is set once the first IndexReader is passed to the collector,
unless set before. (Simon Willnauer)
Changes in runtime behavior
* LUCENE-3520: IndexReader.openIfChanged, when passed a near-real-time
reader, will now return null if there are no changes. The API has
always reserved the right to do this; it's just that in the past for
near-real-time readers it never did. (Mike McCandless)
Bug fixes
* LUCENE-3412: SloppyPhraseScorer was returning non-deterministic results
for queries with many repeats (Doron Cohen)
* LUCENE-3421: PayloadTermQuery's explain was wrong when includeSpanScore=false.
(Edward Drapkin via Robert Muir)
* LUCENE-3432: IndexWriter.expungeDeletes with TieredMergePolicy
should ignore the maxMergedSegmentMB setting (v.sevel via Mike
McCandless)
* LUCENE-3442: TermQuery.TermWeight.scorer() returns null for non-atomic
IndexReaders (optimization bug, introduced by LUCENE-2829), preventing
QueryWrapperFilter and similar classes from getting a top-level DocIdSet.
(Dan C., Uwe Schindler)
* LUCENE-3390: Corrected handling of missing values when two parallel searches
use different missing values for sorting: the missing value was populated
directly into the FieldCache arrays during sorting, leading to concurrency
issues. (Uwe Schindler, Doron Cohen, Mike McCandless)
* LUCENE-3439: Closing an NRT reader after the writer was closed was
incorrectly invoking the DeletionPolicy and (then possibly deleting
files) on the closed IndexWriter (Robert Muir, Mike McCandless)
* LUCENE-3215: SloppyPhraseScorer sometimes computed Infinite freq
(Robert Muir, Doron Cohen)
* LUCENE-3503: DisjunctionSumScorer would give slightly different scores
for a document depending on whether you used nextDoc() versus advance().
(Mike McCandless, Robert Muir)
* LUCENE-3529: Properly support indexing an empty field with empty term text.
Previously, if you had assertions enabled you would receive an error during
flush, if you didn't, you would get an invalid index.
(Mike McCandless, Robert Muir)
* LUCENE-2633: PackedInts Packed32 and Packed64 did not support internal
structures larger than 256MB (Toke Eskildsen via Mike McCandless)
* LUCENE-3540: LUCENE-3255 dropped support for pre-1.9 indexes, but the
error message in IndexFormatTooOldException was incorrect. (Uwe Schindler,
Mike McCandless)
* LUCENE-3541: IndexInput's default copyBytes() implementation was not safe
across multiple threads, because all clones shared the same buffer.
(Robert Muir)
* LUCENE-3548: Fix CharsRef#append to extend length of the existing char[]
and preserve existing chars. (Simon Willnauer)
* LUCENE-3582: Normalize NaN values in NumericUtils.floatToSortableInt() /
NumericUtils.doubleToSortableLong(), so this is consistent with stored
fields. Also fix NumericRangeQuery to not falsely hit NaNs on half-open
ranges (one bound is null). Because of normalization, NumericRangeQuery
can now be used to hit NaN values by creating a query with
upper == lower == NaN (inclusive). (Dawid Weiss, Uwe Schindler)
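The effect of NaN normalization in LUCENE-3582 can be sketched like this. It is
a simplified illustration, not the NumericUtils source: it assumes the
conversion is built on Float.floatToIntBits, which collapses every NaN bit
pattern into one canonical value, followed by a sign fix-up so that the ints
sort in the same order as the floats. The class name SortableFloatSketch is
hypothetical.

```java
final class SortableFloatSketch {
    // Maps a float to an int whose signed ordering matches the float ordering.
    // floatToIntBits (unlike floatToRawIntBits) normalizes all NaNs to a
    // single bit pattern, so every NaN maps to the same sortable int.
    static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        if (bits < 0) {
            // Negative floats: flip the magnitude bits so that larger
            // magnitudes sort lower.
            bits ^= 0x7fffffff;
        }
        return bits;
    }
}
```

Because all NaNs share one sortable value, a range whose lower and upper bound
are both that value (inclusive) matches exactly the NaN entries, which is the
behavior the entry above describes for NumericRangeQuery.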
API Changes
* LUCENE-3454: Rename IndexWriter.optimize to forceMerge to discourage
use of this method since it is horribly costly and rarely justified
anymore. MergePolicy.findMergesForOptimize was renamed to
findForcedMerges. IndexReader.isOptimized was
deprecated. IndexCommit.isOptimized was replaced with
getSegmentCount. (Robert Muir, Mike McCandless)
* LUCENE-3205: Deprecated MultiTermQuery.getTotalNumberOfTerms() [and
related methods], as the numbers returned are not useful
for multi-segment indexes. They were only needed for tests of
NumericRangeQuery. (Mike McCandless, Uwe Schindler)
* LUCENE-3574: Deprecate outdated constants in org.apache.lucene.util.Constants
and add new ones for Java 6 and Java 7. (Uwe Schindler)
* LUCENE-3571: Deprecate IndexSearcher(Directory). Use the constructors
that take IndexReader instead. (Robert Muir)
* LUCENE-3577: Rename IndexWriter.expungeDeletes to forceMergeDeletes,
and revamped the javadocs, to discourage
use of this method since it is horribly costly and rarely
justified. MergePolicy.findMergesToExpungeDeletes was renamed to
findForcedDeletesMerges. (Robert Muir, Mike McCandless)
* LUCENE-3464: IndexReader.reopen has been renamed to
IndexReader.openIfChanged (a static method), and now returns null
(instead of the old reader) if there are no changes in the index, to
prevent the common pitfall of accidentally closing the old reader.
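The caller-side pattern implied by the null return in LUCENE-3464 can be
sketched without Lucene on the classpath. Everything here is a hypothetical
stand-in: FakeReader, latestVersion and refresh are invented for illustration,
and only the contract (null means "nothing changed") is taken from the entry
above.

```java
final class ReopenSketch {
    // Hypothetical stand-in for an index reader; NOT the Lucene class.
    static final class FakeReader {
        final long version;
        boolean closed;

        FakeReader(long version) {
            this.version = version;
        }

        void close() {
            closed = true;
        }
    }

    // Hypothetical stand-in for the index: the latest committed version.
    static long latestVersion = 1;

    // Mirrors the described contract: returns null when nothing changed.
    static FakeReader openIfChanged(FakeReader old) {
        return old.version == latestVersion ? null : new FakeReader(latestVersion);
    }

    // Caller-side pattern: swap and close only when a new reader came back.
    static FakeReader refresh(FakeReader current) {
        FakeReader newer = openIfChanged(current);
        if (newer == null) {
            return current; // unchanged: keep using the old reader, do not close it
        }
        current.close();
        return newer;
    }
}
```

With a method that returned the old reader when nothing changed, code that
closed the old reader unconditionally would end up closing the reader it was
still using; the explicit null check makes that mistake harder to write.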
New Features
* LUCENE-3448: Added FixedBitSet.and(other/DISI), andNot(other/DISI).
(Uwe Schindler)
* LUCENE-2215: Added IndexSearcher.searchAfter which returns results after a
specified ScoreDoc (e.g. last document on the previous page) to support deep
paging use cases. (Aaron McCurry, Grant Ingersoll, Robert Muir)
* LUCENE-1990: Adds internal packed ints implementation, to be used
for more efficient storage of int arrays when the values are
bounded, for example for storing the terms dict index (Toke
Eskildsen via Mike McCandless)
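The idea behind packed ints (LUCENE-1990) can be sketched as below. This is a
simplified illustration and not the PackedInts implementation: it assumes
non-negative values that fit in bitsPerValue bits, with bitsPerValue between 1
and 31, and the names PackedIntsSketch, bitsRequired, pack and get are
hypothetical.

```java
final class PackedIntsSketch {
    // Number of bits needed to represent values in [0, maxValue].
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Packs the values back to back into 64-bit blocks.
    static long[] pack(int[] values, int bitsPerValue) {
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i];
            blocks[block] |= v << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two blocks: spill the high bits.
                blocks[block + 1] |= v >>> (64 - shift);
            }
        }
        return blocks;
    }

    // Reads the value at the given index back out of the blocks.
    static int get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        return (int) (v & ((1L << bitsPerValue) - 1));
    }
}
```

For example, 100 values bounded by 7 need 3 bits each, so they fit in five
longs (40 bytes) instead of the 400 bytes an int[100] would use.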
* LUCENE-3558: Moved SearcherManager, NRTManager & SearcherLifetimeManager into
core. All classes are contained in o.a.l.search. (Simon Willnauer)
Optimizations
* LUCENE-3426: Add NGramPhraseQuery which extends PhraseQuery and tries to
reduce the number of terms of the query during rewrite(), in order to improve
performance. (Robert Muir, Koji Sekiguchi)
* LUCENE-3494: Optimize FilteredQuery to remove a multiply in score()
(Uwe Schindler, Robert Muir)
* LUCENE-3534: Remove filter logic from IndexSearcher and delegate to
FilteredQuery's Scorer. This is a partial backport of a cleanup in
FilteredQuery/IndexSearcher added by LUCENE-1536 to Lucene 4.0.
(Uwe Schindler)
* LUCENE-2205: Very substantial (3-5X) RAM reduction required to hold
the terms index on opening an IndexReader (Aaron McCurry via Mike McCandless)
* LUCENE-3443: FieldCache can now set docsWithField, and create an
array, in a single pass. This results in faster init time for apps
that need both (such as sorting by a field with a missing value).
(Mike McCandless)
Test Cases
* LUCENE-3420: Disable the finalness checks in TokenStream and Analyzer
for implementing subclasses in different packages, where assertions are not
enabled. (Uwe Schindler)
* LUCENE-3506: tests relying on assertions being enabled were no-op because
they ignored AssertionError. With this fix now entire test framework
(every test) fails if assertions are disabled, unless
-Dtests.asserts.gracious=true is specified. (Doron Cohen)
Build
* SOLR-2849: Fix dependencies in Maven POMs. (David Smiley via Steve Rowe)
* LUCENE-3561: Fix maven xxx-src.jar files that were missing resources.
(Uwe Schindler)
======================= Lucene 3.4.0 =======================
Bug fixes
* LUCENE-3251: Directory#copy failed to close target output if opening the
source stream failed. (Simon Willnauer)
* LUCENE-3255: If segments_N file is all zeros (due to file
corruption), don't read that to mean the index is empty. (Gregory
Tarr, Mark Harwood, Simon Willnauer, Mike McCandless)
* LUCENE-3254: Fixed minor bug in how deletes were written to disk,
causing the file to sometimes be larger than it needed to be. (Mike
McCandless)
* LUCENE-3224: Fixed a bug where CheckIndex would incorrectly report a
corrupt index if a term with docfreq >= 16 was indexed more than once
at the same position. (Robert Muir)
* LUCENE-3339: Fixed deadlock case when multiple threads use the new
block-add (IndexWriter.add/updateDocuments) methods. (Robert Muir,
Mike McCandless)
* LUCENE-3340: Fixed case where IndexWriter was not flushing at
exactly maxBufferedDeleteTerms (Mike McCandless)
* LUCENE-3358, LUCENE-3361: StandardTokenizer and UAX29URLEmailTokenizer
wrongly discarded combining marks attached to Han or Hiragana characters.
This is fixed if you supply Version >= 3.4. If you supply a previous
Lucene version, you get the old buggy behavior for backwards compatibility.
(Trejkaz, Robert Muir)
* LUCENE-3368: IndexWriter commits segments without applying their buffered
deletes when flushing concurrently. (Simon Willnauer, Mike McCandless)
* LUCENE-3365: Create or Append mode determined before obtaining write lock
can cause IndexWriter to overwrite an existing index.
(Geoff Cooney via Simon Willnauer)
* LUCENE-3380: Fixed a bug where FileSwitchDirectory's listAll() would wrongly
throw NoSuchDirectoryException when all files written so far have been
written to one directory, but the other still has not yet been created on the
filesystem. (Robert Muir)
* LUCENE-3409: IndexWriter.deleteAll was failing to close pooled NRT
SegmentReaders, leading to unused files accumulating in the
Directory. (tal steier via Mike McCandless)
* LUCENE-3418: Lucene was failing to fsync index files on commit,
meaning an operating system or hardware crash, or power loss, could
easily corrupt the index. (Mark Miller, Robert Muir, Mike
McCandless)
New Features
* LUCENE-3290: Added FieldInvertState.numUniqueTerms
(Mike McCandless, Robert Muir)
* LUCENE-3280: Add FixedBitSet, like OpenBitSet but not elastic (it does
not grow on demand if you set/get/clear too-large indices). (Mike
McCandless)
* LUCENE-2048: Added the ability to omit positions but still index
term frequencies, you can now control what is indexed into
the postings via AbstractField.setIndexOptions:
- DOCS_ONLY: only documents are indexed: term frequencies and positions are omitted
- DOCS_AND_FREQS: only documents and term frequencies are indexed: positions are omitted
- DOCS_AND_FREQS_AND_POSITIONS: full postings: documents, frequencies, and positions
AbstractField.setOmitTermFrequenciesAndPositions is deprecated,
you should use DOCS_ONLY instead. (Robert Muir)
* LUCENE-3097: Added a new grouping collector that can be used to retrieve all most relevant
documents per group. This can be useful in situations when one wants to compute grouping
based facets / statistics on the complete query result. (Martijn van Groningen)
* LUCENE-3334: If Java7 is detected, IOUtils.closeSafely() will log
suppressed exceptions in the original exception, so stack trace
will contain them. (Uwe Schindler)
Optimizations
* LUCENE-3201, LUCENE-3218: CompoundFileSystem code has been consolidated
into a Directory implementation. Reading is optimized for MMapDirectory,
NIOFSDirectory and SimpleFSDirectory to only map requested parts of the
CFS into an IndexInput. Writing to a CFS now tries to append to the CF
directly if possible and merges separately written files on the fly instead
of during close. (Simon Willnauer, Robert Muir)
* LUCENE-3289: When building an FST you can now tune how aggressively
the FST should try to share common suffixes. Typically you can
greatly reduce RAM required during building, and CPU consumed, at
the cost of a somewhat larger FST. (Mike McCandless)
Test Cases
* LUCENE-3327: Fix AIOOBE when TestFSTs is run with
-Dtests.verbose=true (James Dyer via Mike McCandless)
Build
* LUCENE-3406: Add ant target 'package-local-src-tgz' to Lucene and Solr
to package sources from the local working copy.
(Seung-Yeoul Yang via Steve Rowe)
======================= Lucene 3.3.0 =======================
Changes in backwards compatibility policy
* LUCENE-3140: IndexOutput.copyBytes now takes a DataInput (superclass
of IndexInput) as its first argument. (Robert Muir, Dawid Weiss,
Mike McCandless)
* LUCENE-3191: FieldComparator.value now returns an Object not
Comparable; FieldDoc.fields also changed from Comparable[] to
Object[] (Uwe Schindler, Mike McCandless)
* LUCENE-3208: Made deprecated methods Query.weight(Searcher) and
Searcher.createWeight() final to prevent override. If you have
overridden one of these methods, cut over to the non-deprecated
implementation. (Uwe Schindler, Robert Muir, Yonik Seeley)
* LUCENE-3238: Made MultiTermQuery.rewrite() final, to prevent
problems (such as not properly setting rewrite methods, or
not working correctly with things like SpanMultiTermQueryWrapper).
To rewrite to a simpler form, instead return a simpler enum
from getEnum(IndexReader). For example, to rewrite to a single term,
return a SingleTermEnum. (ludovic Boutros, Uwe Schindler, Robert Muir)
Changes in runtime behavior
* LUCENE-2834: the hash used to compute the lock file name when the
lock file is not stored in the index has changed. This means you
will see a different lucene-XXX-write.lock in your lock directory.
(Robert Muir, Uwe Schindler, Mike McCandless)
* LUCENE-3146: IndexReader.setNorm throws IllegalStateException if the field
does not store norms. (Shai Erera, Mike McCandless)
* LUCENE-3198: On Linux, if the JRE is 64 bit and supports unmapping,
FSDirectory.open now defaults to MMapDirectory instead of
NIOFSDirectory since MMapDirectory gives better performance. (Mike
McCandless)
* LUCENE-3200: MMapDirectory now uses chunk sizes that are powers of 2.
When setting the chunk size, it is rounded down to the next possible
value. The new default value for 64 bit platforms is 2^30 (1 GiB),
for 32 bit platforms it stays unchanged at 2^28 (256 MiB).
Internally, MMapDirectory now only uses one dedicated final IndexInput
implementation supporting multiple chunks, which makes Hotspot's life
easier. (Uwe Schindler, Robert Muir, Mike McCandless)
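Why power-of-2 chunk sizes help (LUCENE-3200) can be sketched as follows. This
is an illustration of the arithmetic only, not MMapDirectory's code; the class
and method names (ChunkSizeSketch, chunkSizePower, chunkIndex, chunkOffset) are
hypothetical. Rounding the requested size down to a power of 2 lets a file
position be split into a chunk number and an offset with a shift and a mask
instead of a division and a modulo.

```java
final class ChunkSizeSketch {
    // Exponent of the largest power of 2 that is <= requestedSize,
    // i.e. the requested chunk size rounded down.
    static int chunkSizePower(int requestedSize) {
        if (requestedSize <= 0) {
            throw new IllegalArgumentException("chunk size must be positive");
        }
        return 31 - Integer.numberOfLeadingZeros(requestedSize);
    }

    // Index of the chunk that contains the given file position.
    static int chunkIndex(long position, int power) {
        return (int) (position >>> power);
    }

    // Offset of the given file position within its chunk.
    static long chunkOffset(long position, int power) {
        return position & ((1L << power) - 1);
    }
}
```

With the defaults quoted above, a requested size of 2^30 yields a power of 30,
and position 2^30 + 7 falls in chunk 1 at offset 7.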
Bug fixes
* LUCENE-3147,LUCENE-3152: Fixed open file handles leaks in many places in the
code. Now MockDirectoryWrapper (in test-framework) tracks all open files,
including locks, and fails if the test fails to release all of them.
(Mike McCandless, Robert Muir, Shai Erera, Simon Willnauer)
* LUCENE-3102: CachingCollector.replay was failing to call setScorer
per-segment (Martijn van Groningen via Mike McCandless)
* LUCENE-3183: Fix rare corner case where seeking to empty term
(field="", term="") with terms index interval 1 could hit
ArrayIndexOutOfBoundsException (selckin, Robert Muir, Mike
McCandless)
* LUCENE-3208: IndexSearcher had its own private similarity field
and corresponding get/setter overriding Searcher's implementation. If you
set a different Similarity instance on IndexSearcher, methods implemented
in the superclass Searcher were not using it, leading to strange bugs.
(Uwe Schindler, Robert Muir)
* LUCENE-3197: Fix core merge policies to not over-merge during
background optimize when documents are still being deleted
concurrently with the optimize (Mike McCandless)
* LUCENE-3222: The RAM accounting for buffered delete terms was
failing to measure the space required to hold the term's field and
text character data. (Mike McCandless)
* LUCENE-3238: Fixed bug where using WildcardQuery("prefix*") inside
of a SpanMultiTermQueryWrapper rewrote incorrectly and returned
an error instead. (ludovic Boutros, Uwe Schindler, Robert Muir)
API Changes
* LUCENE-3208: Renamed protected IndexSearcher.createWeight() to expert
public method IndexSearcher.createNormalizedWeight() as this better describes
what this method does. The old method is still there for backwards
compatibility. Query.weight() was deprecated and simply delegates to
IndexSearcher. Both deprecated methods will be removed in Lucene 4.0.
(Uwe Schindler, Robert Muir, Yonik Seeley)
* LUCENE-3197: MergePolicy.findMergesForOptimize now takes
Map<SegmentInfo,Boolean> instead of Set<SegmentInfo> as the second
argument, so the merge policy knows which segments were originally
present vs produced by an optimizing merge (Mike McCandless)
Optimizations
* LUCENE-1736: DateTools.java general improvements.
(David Smiley via Steve Rowe)
New Features
* LUCENE-3140: Added experimental FST implementation to Lucene.
(Robert Muir, Dawid Weiss, Mike McCandless)
* LUCENE-3193: A new TwoPhaseCommitTool allows running a 2-phase commit
algorithm over objects that implement the new TwoPhaseCommit interface (such
as IndexWriter). (Shai Erera)
* LUCENE-3191: Added TopDocs.merge, to facilitate merging results from
different shards (Uwe Schindler, Mike McCandless)
* LUCENE-3179: Added OpenBitSet.prevSetBit (Paul Elschot via Mike McCandless)
* LUCENE-3210: Made TieredMergePolicy more aggressive in reclaiming
segments with deletions; added new methods
set/getReclaimDeletesWeight to control this. (Mike McCandless)
Build
* LUCENE-1344: Create OSGi bundle using dev-tools/maven.
(Nicolas Lalevée, Luca Stancapiano via ryan)
* LUCENE-3204: The maven-ant-tasks jar is now included in the source tree;
users of the generate-maven-artifacts target no longer have to manually
place this jar in the Ant classpath. NOTE: when Ant looks for the
maven-ant-tasks jar, it looks first in its pre-existing classpath, so
any copies it finds will be used instead of the copy included in the
Lucene/Solr source tree. For this reason, it is recommended to remove
any copies of the maven-ant-tasks jar in the Ant classpath, e.g. under
~/.ant/lib/ or under the Ant installation's lib/ directory. (Steve Rowe)
======================= Lucene 3.2.0 =======================
Changes in backwards compatibility policy
* LUCENE-2953: PriorityQueue's internal heap was made private, as subclassing
with generics can lead to ClassCastException. For advanced use (e.g. in Solr)
a method getHeapArray() was added to retrieve the internal heap array as a
non-generic Object[]. (Uwe Schindler, Yonik Seeley)
* LUCENE-1076: IndexWriter.setInfoStream now throws IOException
(Mike McCandless, Shai Erera)
* LUCENE-3084: MergePolicy.OneMerge.segments was changed from
SegmentInfos to a List<SegmentInfo>. SegmentInfos itself was changed
to no longer extend Vector<SegmentInfo> (to update code that is using
Vector-API, use the new asList() and asSet() methods returning unmodifiable
collections; modifying SegmentInfos is now only possible through
the explicitly declared methods). IndexWriter.segString() now takes
Iterable<SegmentInfo> instead of List<SegmentInfo>. A simple recompile
should fix this. MergePolicy and SegmentInfos are internal/experimental
APIs not covered by the strict backwards compatibility policy.
(Uwe Schindler, Mike McCandless)
Changes in runtime behavior
* LUCENE-3065: When a NumericField is retrieved from a Document loaded
from IndexReader (or IndexSearcher), it will now come back as
NumericField not as a Field with a string-ified version of the
numeric value you had indexed. Note that this only applies for
newly-indexed Documents; older indices will still return Field
with the string-ified numeric value. If you call Document.get(),
the value still comes back as String, but Document.getFieldable()
returns NumericField instances. (Uwe Schindler, Ryan McKinley,
Mike McCandless)
* LUCENE-1076: Changed the default merge policy from
LogByteSizeMergePolicy to TieredMergePolicy, as of Version.LUCENE_32
(passed to IndexWriterConfig), which is able to merge non-contiguous
segments. This means docIDs no longer necessarily stay "in order"
during indexing. If this is a problem then you can use either of
the LogMergePolicy impls. (Mike McCandless)
New features
* LUCENE-3082: Added index upgrade tool oal.index.IndexUpgrader
that allows upgrading all segments to the most recent supported index
format without fully optimizing. (Uwe Schindler, Mike McCandless)
* LUCENE-1076: Added TieredMergePolicy which is able to merge non-contiguous
segments, which means docIDs no longer necessarily stay "in order".
(Mike McCandless, Shai Erera)
* LUCENE-3071: Added ReversePathHierarchyTokenizer and a skip parameter to
PathHierarchyTokenizer (Olivier Favre via ryan)
* LUCENE-1421, LUCENE-3102: Added CachingCollector, which allows you to cache
document IDs and scores encountered during the search, and "replay" them to
another Collector. (Mike McCandless, Shai Erera)
* LUCENE-3112: Added experimental IndexWriter.add/updateDocuments,
enabling a block of documents to be indexed, atomically, with
guaranteed sequential docIDs. (Mike McCandless)
API Changes
* LUCENE-3061: IndexWriter's getNextMerge() and merge(OneMerge) are now public
(though @lucene.experimental), allowing for custom MergeScheduler
implementations. (Shai Erera)
* LUCENE-3065: Document.getField() was deprecated, as it throws
ClassCastException when loading lazy fields or NumericFields.
(Uwe Schindler, Ryan McKinley, Mike McCandless)
* LUCENE-2027: Directory.touchFile is deprecated and will be removed
in 4.0. (Mike McCandless)
Optimizations
* LUCENE-2990: ArrayUtil/CollectionUtil.*Sort() methods now exit early
on empty or one-element lists/arrays. (Uwe Schindler)
* LUCENE-2897: Apply deleted terms while flushing a segment. We still
buffer deleted terms to later apply to past segments. (Mike McCandless)
* LUCENE-3126: IndexWriter.addIndexes copies incoming segments into CFS if they
aren't already and MergePolicy allows that. (Shai Erera)
Bug fixes
* LUCENE-2996: addIndexes(IndexReader) did not flush before adding the new
indexes, causing existing deletions to be applied on the incoming indexes as
well. (Shai Erera, Mike McCandless)
* LUCENE-3024: Index with more than 2.1B terms was hitting AIOOBE when
seeking TermEnum (e.g. used by Solr's faceting) (Tom Burton-West, Mike
McCandless)
* LUCENE-3042: When a filter or consumer added Attributes to a TokenStream
chain after it was already (partly) consumed [or clearAttributes(),
captureState(), cloneAttributes(),... was called by the Tokenizer],
the Tokenizer calling clearAttributes() or capturing state after addition
may not do this on the newly added Attribute. This bug affected only
very special use cases of the TokenStream-API, most users would not
have recognized it. (Uwe Schindler, Robert Muir)
* LUCENE-3054: PhraseQuery can in some cases stack overflow in
SorterTemplate.quickSort(). This fix also adds an optimization to
PhraseQuery, as a term with lower doc freq will also have fewer positions.
(Uwe Schindler, Robert Muir, Otis Gospodnetic)
* LUCENE-3068: sloppy phrase query failed to match valid documents when multiple
query terms had same position in the query. (Doron Cohen)
* LUCENE-3012: Lucene writes the header now for separate norm files (*.sNNN)
(Robert Muir)
Build
* LUCENE-3006: Building javadocs will fail on warnings by default.
Override with -Dfailonjavadocwarning=false (sarowe, gsingers)
* LUCENE-3128: "ant eclipse" creates a .project file for easier Eclipse
integration (unless one already exists). (Daniel Serodio via Shai Erera)
Test Cases
* LUCENE-3002: Added 'tests.iter.min' to control 'tests.iter' by allowing
iteration to stop once at least 'tests.iter.min' ran and a failure occurred.
(Shai Erera, Chris Hostetter)
======================= Lucene 3.1.0 =======================
Changes in backwards compatibility policy
* LUCENE-2719: Changed API of internal utility class
org.apache.lucene.util.SorterTemplate to support faster quickSort using
pivot values and also merge sort and insertion sort. If you have used
this class, you have to implement two more methods for handling pivots.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-1923: Renamed SegmentInfo & SegmentInfos segString method to
toString. These are advanced APIs and subject to change suddenly.
(Tim Smith via Mike McCandless)
* LUCENE-2190: Removed deprecated customScore() and customExplain()
methods from experimental CustomScoreQuery. (Uwe Schindler)
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
This means that terms with a position increment gap of zero do not
affect the norms calculation by default. (Robert Muir)
* LUCENE-2320: MergePolicy.writer is now of type SetOnce, which allows setting
the IndexWriter for a MergePolicy exactly once. You can change references to
'writer' from <code>writer.doXYZ()</code> to <code>writer.get().doXYZ()</code>
(it is also advisable to add an <code>assert writer != null;</code> before you
access the wrapped IndexWriter.)
In addition, MergePolicy only exposes a default constructor, and the one that
took IndexWriter as argument has been removed from all MergePolicy extensions.
(Shai Erera via Mike McCandless)
* LUCENE-2328: SimpleFSDirectory.SimpleFSIndexInput is moved to
FSDirectory.FSIndexInput. Anyone extending this class will have to
fix their code on upgrading. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2302: The new interface for term attributes, CharTermAttribute,
now implements CharSequence. This requires the toString() methods of
CharTermAttribute, deprecated TermAttribute, and Token to return only
the term text and no other attribute contents. LUCENE-2374 implements
an attribute reflection API to no longer rely on toString() for attribute
inspection. (Uwe Schindler, Robert Muir)
* LUCENE-2372, LUCENE-2389: StandardAnalyzer, KeywordAnalyzer,
PerFieldAnalyzerWrapper, WhitespaceTokenizer are now final. Also removed
the now obsolete and deprecated Analyzer.setOverridesTokenStreamMethod().
Analyzer and TokenStream base classes now have an assertion in their ctor
that checks that subclasses are final or at least have final implementations
of incrementToken(), tokenStream(), and reusableTokenStream().
(Uwe Schindler, Robert Muir)
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
actual file's length if the file exists, and throws FileNotFoundException
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
you relied on that, make sure to catch the exception. (Shai Erera)
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
creation. Previously, if you passed an empty Directory and set OpenMode to
CREATE*, IndexWriter would make a first empty commit. If you need that
behavior you can call writer.commit()/close() immediately after you create it.
(Shai Erera, Mike McCandless)
* LUCENE-2733: Removed public constructors of utility classes with only static
methods to prevent instantiation. (Uwe Schindler)
* LUCENE-2602: The default (LogByteSizeMergePolicy) merge policy now
takes deletions into account by default. You can disable this by
calling setCalibrateSizeByDeletes(false) on the merge policy. (Mike
McCandless)
* LUCENE-2529, LUCENE-2668: Position increment gap and offset gap of empty
values in multi-valued fields have been changed for some cases in the index.
If you index empty fields and use position/offset information on those
fields, reindexing is recommended. (David Smiley, Koji Sekiguchi)
* LUCENE-2804: Directory.setLockFactory now declares throwing an IOException.
(Shai Erera, Robert Muir)
* LUCENE-2837: Added deprecations noting that in 4.0, Searcher and
Searchable are collapsed into IndexSearcher; contrib/remote and
MultiSearcher have been removed. (Mike McCandless)
* LUCENE-2854: Deprecated SimilarityDelegator and
Similarity.lengthNorm; the latter is now final, forcing any custom
Similarity impls to cut over to the more general computeNorm (Robert
Muir, Mike McCandless)
* LUCENE-2869: Deprecated Query.getSimilarity: instead of using
"runtime" subclassing/delegation, subclass the Weight instead.
(Robert Muir)
* LUCENE-2674: A new idfExplain method was added to Similarity, that
accepts an incoming docFreq. If you subclass Similarity, make sure
you also override this method on upgrade. (Robert Muir, Mike
McCandless)
Changes in runtime behavior
* LUCENE-1923: Made IndexReader.toString() produce something
meaningful (Tim Smith via Mike McCandless)
* LUCENE-2179: CharArraySet.clear() is now functional.
(Robert Muir, Uwe Schindler)
* LUCENE-2455: IndexWriter.addIndexes no longer optimizes the target index
before it adds the new ones. Also, the existing segments are not merged and so
the index will not end up with a single segment (unless it was empty before).
In addition, addIndexesNoOptimize was renamed to addIndexes and no longer
invokes a merge on the incoming and target segments, but instead copies the
segments to the target index. You can call maybeMerge or optimize after this
method completes, if you need to.
In addition, Directory.copyTo* were removed in favor of copy which takes the
target Directory, source and target files as arguments, and copies the source
file to the target Directory under the target file name. (Shai Erera)
* LUCENE-2663: IndexWriter no longer forcefully clears any existing
locks when create=true. This was a holdover from when
SimpleFSLockFactory was the default locking implementation, and,
even then it was dangerous since it could mask bugs in IndexWriter's
usage, allowing applications to accidentally open two writers on the
same directory. (Mike McCandless)
* LUCENE-2701: maxMergeMBForOptimize and maxMergeDocs constraints set on
LogMergePolicy now affect optimize() as well (as opposed to only regular
merges). This means that you can run optimize() and segments that are too large won't
be merged. (Shai Erera)
* LUCENE-2753: IndexReader and DirectoryReader .listCommits() now return a List,
guaranteeing the commits are sorted from oldest to latest. (Shai Erera)
* LUCENE-2785: TopScoreDocCollector, TopFieldCollector and
the IndexSearcher search methods that take an int nDocs will now
throw IllegalArgumentException if nDocs is 0. Instead, you should
use the newly added TotalHitCountCollector. (Mike McCandless)
* LUCENE-2790: LogMergePolicy.useCompoundFile's logic now factors in noCFSRatio
to determine whether the passed in segment should be compound.
(Shai Erera, Earwin Burrfoot)
* LUCENE-2805: IndexWriter now increments the index version on every change to
the index instead of for every commit. Committing or closing the IndexWriter
without any changes to the index will not cause any index version increment.
(Simon Willnauer, Mike McCandless)
* LUCENE-2650, LUCENE-2825: The behavior of FSDirectory.open has changed. On 64-bit
Windows and Solaris systems that support unmapping, FSDirectory.open returns
MMapDirectory. Additionally the behavior of MMapDirectory has been
changed to enable unmapping by default if supported by the JRE.
(Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-2829: Improve the performance of "primary key" lookup use
case (running a TermQuery that matches one document) on a
multi-segment index. (Robert Muir, Mike McCandless)
* LUCENE-2010: Segments with 100% deleted documents are now removed on
IndexReader or IndexWriter commit. (Uwe Schindler, Mike McCandless)
* LUCENE-2960: Allow some changes to IndexWriterConfig to take effect
"live" (after an IW is instantiated), via
IndexWriter.getConfig().setXXX(...) (Shay Banon, Mike McCandless)
API Changes
* LUCENE-2076: Rename FSDirectory.getFile -> getDirectory. (George
Aroush via Mike McCandless)
* LUCENE-1260: Change norm encode (float->byte) and decode
(byte->float) to be instance methods not static methods. This way a
custom Similarity can alter how norms are encoded, though they must
still be encoded as a single byte (Johan Kindgren via Mike
McCandless)
* LUCENE-2103: NoLockFactory should have a private constructor;
until Lucene 4.0 the default one will be deprecated.
(Shai Erera via Uwe Schindler)
* LUCENE-2177: Deprecate the Field ctors that take byte[] and Store.
Since the removal of compressed fields, Store can only be YES, so
it's not necessary to specify. (Erik Hatcher via Mike McCandless)
* LUCENE-2200: Several final classes had non-overriding protected
members. These were converted to private and unused protected
constructors removed. (Steven Rowe via Robert Muir)
* LUCENE-2240: SimpleAnalyzer and WhitespaceAnalyzer now have
Version ctors. (Simon Willnauer via Uwe Schindler)
* LUCENE-2259: Add IndexWriter.deleteUnusedFiles, to attempt removing
unused files. This is only useful on Windows, which prevents
deletion of open files. IndexWriter will eventually remove these
files itself; this method just lets you do so when you know the
files are no longer open by IndexReaders. (luocanrao via Mike
McCandless)
* LUCENE-2282: IndexFileNames is exposed as a public class allowing for easier
use by external code. In addition it offers a matchExtension method which
callers can use to query whether a certain file matches a certain extension.
(Shai Erera via Mike McCandless)
* LUCENE-124: Add a TopTermsBoostOnlyBooleanQueryRewrite to MultiTermQuery.
This rewrite method is similar to TopTermsScoringBooleanQueryRewrite, but
only scores terms by their boost values. For example, this can be used
with FuzzyQuery to ensure that exact matches are always scored higher,
because only the boost will be used in scoring. (Robert Muir)
* LUCENE-2015: Add a static method foldToASCII to ASCIIFoldingFilter to
expose its folding logic. (Cédrik Lime via Robert Muir)
* LUCENE-2294: IndexWriter constructors have been deprecated in favor of a
single ctor which accepts IndexWriterConfig and a Directory. You can set all
the parameters related to IndexWriter on IndexWriterConfig. The different
setter/getter methods were deprecated as well. One should call
writer.getConfig().getXYZ() to query for a parameter XYZ.
Additionally, the setter/getter related to MergePolicy were deprecated as
well. One should interact with the MergePolicy directly.
(Shai Erera via Mike McCandless)
* LUCENE-2320: IndexWriter's MergePolicy configuration was moved to
IndexWriterConfig and the respective methods on IndexWriter were deprecated.
(Shai Erera via Mike McCandless)
* LUCENE-2328: Directory now keeps track itself of the files that are written
but not yet fsynced. The old Directory.sync(String file) method is deprecated
and replaced with Directory.sync(Collection<String> files). Take a look at
FSDirectory to see a sample of what such tracking might look like, if needed
in your custom Directories. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2302: Deprecated TermAttribute and replaced by a new
CharTermAttribute. The change is backwards compatible, so
mixed new/old TokenStreams all work on the same char[] buffer
independent of which interface they use. CharTermAttribute
has shorter method names and implements CharSequence and
Appendable. This allows usage like Java's StringBuilder in
addition to direct char[] access. Also terms can directly be
used in places where CharSequence is allowed (e.g. regular
expressions).
(Uwe Schindler, Robert Muir)
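To illustrate why implementing CharSequence and Appendable is convenient, here is a minimal sketch in plain Java. TermBufferSketch and its methods are hypothetical stand-ins written for this example, not the Lucene CharTermAttribute API; the sketch only shows that such a buffer can be filled like a StringBuilder and handed directly to a regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical term buffer (not a Lucene class): it can be appended to like a
// StringBuilder and read anywhere a CharSequence is accepted.
final class TermBufferSketch implements CharSequence, Appendable {
    private static final Pattern WORD_THEN_DIGIT = Pattern.compile("[a-z]+\\d");
    private final StringBuilder chars = new StringBuilder();

    @Override public TermBufferSketch append(CharSequence csq) {
        chars.append(csq);
        return this;
    }

    @Override public TermBufferSketch append(CharSequence csq, int start, int end) {
        chars.append(csq, start, end);
        return this;
    }

    @Override public TermBufferSketch append(char c) {
        chars.append(c);
        return this;
    }

    /** Clears the buffer so the same instance can be reused for the next term. */
    TermBufferSketch setEmpty() {
        chars.setLength(0);
        return this;
    }

    @Override public int length() { return chars.length(); }
    @Override public char charAt(int index) { return chars.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return chars.subSequence(start, end);
    }

    /** Returns only the term text, as the CharSequence contract requires. */
    @Override public String toString() { return chars.toString(); }

    /** The buffer is passed to the regex engine without copying it to a String. */
    static boolean isWordThenDigit(CharSequence term) {
        return WORD_THEN_DIGIT.matcher(term).matches();
    }
}
```

Because the matcher reads the buffer through charAt() and length(), no intermediate String is needed for the check.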
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
points too. If you use an IndexDeletionPolicy which holds onto index commits
(such as SnapshotDeletionPolicy), you can call this method to remove those
commit points when they are not needed anymore (instead of waiting for the
next commit). (Shai Erera)
* LUCENE-2481: SnapshotDeletionPolicy.snapshot() and release() were replaced
with equivalent ones that take a String (id) as argument. You can pass
whatever ID you want, as long as you use the same one when calling both.
(Shai Erera)
* LUCENE-2356: Add IndexWriterConfig.set/getReaderTermIndexDivisor, to
set what IndexWriter passes for termsIndexDivisor to the readers it
opens internally when applying deletions or creating a near-real-time
reader. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2167,LUCENE-2699,LUCENE-2763,LUCENE-2847: StandardTokenizer/Analyzer
in common/standard/ now implement the Word Break rules from the Unicode 6.0.0
Text Segmentation algorithm (UAX#29), covering the full range of Unicode code
points, including values from U+FFFF to U+10FFFF.
ClassicTokenizer/Analyzer retains the old (pre-Lucene 3.1) StandardTokenizer/
Analyzer implementation and behavior. Only the Unicode Basic Multilingual
Plane (code points from U+0000 to U+FFFF) is covered.
UAX29URLEmailTokenizer tokenizes URLs and E-mail addresses according to the
relevant RFCs, in addition to implementing the UAX#29 Word Break rules.
(Steven Rowe, Robert Muir, Uwe Schindler)
* LUCENE-2778: RAMDirectory now exposes newRAMFile(), which can be overridden
to return a different RAMFile implementation. (Shai Erera)
* LUCENE-2785: Added TotalHitCountCollector whose sole purpose is to
count the number of hits matching the query. (Mike McCandless)
* LUCENE-2846: Deprecated IndexReader.setNorm(int, String, float). This method
is only syntactic sugar for setNorm(int, String, byte), but using the global
Similarity.getDefault().encodeNormValue(). Use the byte-based method instead
to ensure that the norm is encoded with your Similarity.
(Robert Muir, Mike McCandless)
* LUCENE-2374: Added Attribute reflection API: It's now possible to inspect the
contents of AttributeImpl and AttributeSource using a well-defined API.
This is e.g. used by Solr's AnalysisRequestHandlers to display all attributes
in a structured way.
There are also some backwards incompatible changes in toString() output,
as LUCENE-2302 introduced the CharSequence interface to CharTermAttribute
leading to changed toString() return values. The new API allows getting a
string representation in a well-defined way using a new method
reflectAsString(). For backwards compatibility reasons, when toString()
was implemented by implementation subclasses, the default implementation of
AttributeImpl.reflectWith() uses toString()'s output instead to report the
Attribute's properties. Otherwise, reflectWith() uses Java's reflection
(like toString() did before) to get the attribute properties.
In addition, the mandatory equals() and hashCode() are no longer required
for AttributeImpls, but can still be provided (if needed).
(Uwe Schindler)
* LUCENE-2691: Deprecate IndexWriter.getReader in favor of
IndexReader.open(IndexWriter) (Grant Ingersoll, Mike McCandless)
* LUCENE-2876: Deprecated Scorer.getSimilarity(). If your Scorer uses a Similarity,
it should keep it itself. Fixed Scorers to pass their parent Weight, so that
Scorer.visitSubScorers (LUCENE-2590) will work correctly.
(Robert Muir, Doron Cohen)
* LUCENE-2900: When opening a near-real-time (NRT) reader
(IndexReader.re/open(IndexWriter)) you can now specify whether
deletes should be applied. Applying deletes can be costly, and some
expert use cases can handle seeing deleted documents returned. The
deletes remain buffered so that the next time you open an NRT reader
and pass true, all deletes will be applied. (Mike McCandless)
* LUCENE-1253: LengthFilter (and Solr's KeepWordTokenFilter) now
require up front specification of enablePositionIncrement. Together with
StopFilter they have a common base class (FilteringTokenFilter) that handles
the position increments automatically. Implementors only need to override an
accept() method that filters tokens. (Uwe Schindler, Robert Muir)
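The division of labor described above can be sketched in plain Java. All names below (FilteringSketch, Token, Filtering, LengthFiltering) are hypothetical and do not reproduce the Lucene classes; the sketch assumes that, when position increments are enabled, the increments of dropped tokens are added to the next kept token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a filtering base class; not the Lucene API.
final class FilteringSketch {
    /** Minimal token: text plus position increment relative to the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Base class: handles position increments; subclasses only implement accept(). */
    abstract static class Filtering {
        private final boolean enablePositionIncrements;

        Filtering(boolean enablePositionIncrements) {
            this.enablePositionIncrements = enablePositionIncrements;
        }

        protected abstract boolean accept(Token token);

        final List<Token> filter(List<Token> input) {
            List<Token> out = new ArrayList<>();
            int skipped = 0; // increments of tokens dropped since the last kept token
            for (Token t : input) {
                if (accept(t)) {
                    int inc = enablePositionIncrements
                            ? t.positionIncrement + skipped
                            : t.positionIncrement;
                    out.add(new Token(t.text, inc));
                    skipped = 0;
                } else {
                    skipped += t.positionIncrement;
                }
            }
            return out;
        }
    }

    /** Example subclass: keeps tokens whose length lies within [min, max]. */
    static final class LengthFiltering extends Filtering {
        private final int min;
        private final int max;

        LengthFiltering(boolean enablePositionIncrements, int min, int max) {
            super(enablePositionIncrements);
            this.min = min;
            this.max = max;
        }

        @Override protected boolean accept(Token token) {
            int len = token.text.length();
            return len >= min && len <= max;
        }
    }
}
```

With increments enabled, filtering "a quick fox" for lengths 2 to 10 keeps "quick" with an increment of 2, so phrase positions still reflect the removed token.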
Bug fixes
* LUCENE-2249: ParallelMultiSearcher should shut down thread pool on
close. (Martin Traverso via Uwe Schindler)
* LUCENE-2273: FieldCacheImpl.getCacheEntries() used WeakHashMap
incorrectly and led to ConcurrentModificationException.
(Uwe Schindler, Robert Muir)
* LUCENE-2328: Index files fsync tracking moved from
IndexWriter/IndexReader to Directory, and it no longer leaks memory.
(Earwin Burrfoot via Mike McCandless)
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
(Ruben Laguna, Shai Erera via Uwe Schindler)
* LUCENE-2496: Don't throw NPE if IndexWriter is opened with CREATE on
a prior (corrupt) index missing its segments_N file. (Mike
McCandless)
* LUCENE-2458: QueryParser no longer automatically forms phrase queries,
assuming whitespace tokenization. Previously all CJK queries, for example,
would be turned into phrase queries. The old behavior is preserved with
the matchVersion parameter for previous versions. Additionally, you can
explicitly enable the old behavior with setAutoGeneratePhraseQueries(true)
(Robert Muir)
* LUCENE-2537: FSDirectory.copy() implementation was unsafe and could result in
OOM if a large file was copied. (Shai Erera)
* LUCENE-2580: MultiPhraseQuery throws AIOOBE if number of positions
exceeds number of terms at one position (Jayendra Patil via Mike McCandless)
* LUCENE-2617: Optional clauses of a BooleanQuery were not factored
into coord if the scorer for that segment returned null. This
can cause the same document to score differently depending on
what segment it resides in. (yonik)
* LUCENE-2272: Fix explain in PayloadNearQuery and also fix scoring issue
(Peter Keegan via Grant Ingersoll)
* LUCENE-2732: Fix charset problems in XML loading in
HyphenationCompoundWordTokenFilter. (Uwe Schindler)
* LUCENE-2802: NRT DirectoryReader returned incorrect values from
getVersion, isOptimized, getCommitUserData, getIndexCommit and isCurrent due
to a mutable reference to the IndexWriter's SegmentInfos.
(Simon Willnauer, Earwin Burrfoot)
* LUCENE-2852: Fixed corner case in RAMInputStream that would hit a
false EOF after seeking to EOF then seeking back to same block you
were just in and then calling readBytes (Robert Muir, Mike McCandless)
* LUCENE-2860: Fixed SegmentInfo.sizeInBytes to factor includeDocStores when it
decides whether to return the cached computed size or not. (Shai Erera)
* LUCENE-2584: SegmentInfo.files() could hit ConcurrentModificationException if
called by multiple threads. (Alexander Kanarsky via Shai Erera)
* LUCENE-2809: Fixed IndexWriter.numDocs to take into account
applied but not yet flushed deletes. (Mike McCandless)
* LUCENE-2879: MultiPhraseQuery previously calculated its phrase IDF by summing
internally, it now calls Similarity.idfExplain(Collection, IndexSearcher).
(Robert Muir)
* LUCENE-2693: RAM used by IndexWriter was slightly incorrectly computed.
(Jason Rutherglen via Shai Erera)
* LUCENE-1846: DateTools now uses the US locale everywhere, so DateTools.round()
is safe also in strange locales. (Uwe Schindler)
* LUCENE-2891: IndexWriterConfig did not accept -1 in setReaderTermIndexDivisor,
which can be used to prevent loading the terms index into memory. (Shai Erera)
* LUCENE-2937: Encoding a float into a byte (e.g. encoding field norms during
indexing) had an underflow detection bug that caused floatToByte(f)==0 where
f was greater than 0, but slightly less than byteToFloat(1). This meant that
certain very small field norms (index_boost * length_norm) could have
been rounded down to 0 instead of being rounded up to the smallest
positive number. (yonik)
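The underflow case above can be reproduced with a small self-contained sketch. SmallFloatSketch is a hypothetical class written for this example; it assumes an encoding with 3 mantissa bits and a zero exponent of 15, and the "buggy" comparison is one plausible shape of the defect, not a quotation of the original code.

```java
// Hypothetical float -> byte encoder (3 mantissa bits, zero exponent 15).
// Both assumptions are made for illustration and are not stated in the entry.
final class SmallFloatSketch {
    private static final int FZERO = (63 - 15) << 3; // code for the smallest bucket

    /** Buggy variant: positive values just below byteToFloat(1) encode as 0. */
    static byte floatToByteBuggy(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small < FZERO) { // misses the case small == FZERO
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return -1; // overflow: largest code
        }
        return (byte) (small - FZERO);
    }

    /** Fixed variant: every positive underflow is rounded up to code 1. */
    static byte floatToByteFixed(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= FZERO) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return -1; // overflow: largest code
        }
        return (byte) (small - FZERO);
    }

    /** Decodes a code back to the lower bound of its bucket. */
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

For a value such as byteToFloat((byte) 1) * 0.9f, the buggy variant yields 0 while the fixed variant yields 1, which matches the behavior the entry describes.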
* LUCENE-2936: PhraseQuery score explanations were not correctly
identifying matches vs non-matches. (hossman)
* LUCENE-2975: A hotspot bug corrupts IndexInput#readVInt()/readVLong() if
the underlying readByte() is inlined (which happens e.g. in MMapDirectory).
The loop was unrolled, which makes the hotspot bug disappear.
(Uwe Schindler, Robert Muir, Mike McCandless)
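As a rough illustration of what replacing the loop with straight-line code looks like, the sketch below decodes a variable-length int both ways. VIntSketch is a hypothetical class reading from a byte array, not the IndexInput implementation; it assumes 7 data bits per byte, a set high bit on every byte except the last, and the low-order group first.

```java
// Hypothetical vInt decoder for illustration; reads from the start of an array.
final class VIntSketch {
    /** Loop form: keeps reading while the continuation bit is set. */
    static int readVIntLoop(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** Unrolled form: one explicit step per possible byte (at most five). */
    static int readVIntUnrolled(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        if ((b & 0x80) == 0) return i;
        b = buf[pos++];
        i |= (b & 0x7F) << 7;
        if ((b & 0x80) == 0) return i;
        b = buf[pos++];
        i |= (b & 0x7F) << 14;
        if ((b & 0x80) == 0) return i;
        b = buf[pos++];
        i |= (b & 0x7F) << 21;
        if ((b & 0x80) == 0) return i;
        b = buf[pos++];
        i |= (b & 0x0F) << 28; // only 4 bits are left in a 32-bit int
        if ((b & 0xF0) == 0) return i;
        throw new IllegalArgumentException("invalid vInt: more than 32 bits");
    }
}
```

Both forms return the same value for well-formed input; the unrolled form simply has no loop for the JIT compiler to transform.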
New features
* LUCENE-2128: Parallelized fetching document frequencies during weight
creation. (Israel Tsadok, Simon Willnauer via Uwe Schindler)
* LUCENE-2069: Added Unicode 4 support to CharArraySet. Due to the switch
to Java 5, supplementary characters are now lowercased correctly if the
set is created as case insensitive.
CharArraySet now requires a Version argument to preserve
backwards compatibility. If Version < 3.1 is passed to the constructor,
CharArraySet yields the old behavior. (Simon Willnauer)
* LUCENE-2069: Added Unicode 4 support to LowerCaseFilter. Due to the switch
to Java 5, supplementary characters are now lowercased correctly.
LowerCaseFilter now requires a Version argument to preserve
backwards compatibility. If Version < 3.1 is passed to the constructor,
LowerCaseFilter yields the old behavior. (Simon Willnauer, Robert Muir)
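The difference between char-based and code-point-based lowercasing can be seen with the standard java.lang.Character methods alone. The class and method names below are hypothetical and written for this example; they do not reproduce the CharArraySet or LowerCaseFilter code.

```java
// Hypothetical helper contrasting the two lowercasing strategies.
final class CodePointLowerCaseSketch {
    /** char-at-a-time: the two surrogate halves pass through unchanged. */
    static String lowerByChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    /** code-point-at-a-time: supplementary characters are lowercased as a unit. */
    static String lowerByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }
}
```

For example, DESERET CAPITAL LETTER LONG I (U+10400) is left untouched by the char-based loop but mapped to U+10428 by the code-point loop.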
* LUCENE-2034: Added ReusableAnalyzerBase, an abstract subclass of Analyzer
that makes it easier to reuse TokenStreams correctly. This issue also added
StopwordAnalyzerBase, which improves consistency of all Analyzers that use
stopwords, and implemented many analyzers in contrib with it.
(Simon Willnauer via Robert Muir)
* LUCENE-2198, LUCENE-2901: Support protected words in stemming TokenFilters using a
new KeywordAttribute. (Simon Willnauer, Drew Farris via Uwe Schindler)
* LUCENE-2183, LUCENE-2240, LUCENE-2241: Added Unicode 4 support
to CharTokenizer and its subclasses. CharTokenizer now has new
int-API which is conditionally preferred to the old char-API depending
on the provided Version. Version < 3.1 will use the char-API.
(Simon Willnauer via Uwe Schindler)
* LUCENE-2247: Added a CharArrayMap<V> for performance improvements
in some stemmers and synonym filters. (Uwe Schindler)
* LUCENE-2320: Added SetOnce which wraps an object and allows it to be set
exactly once. (Shai Erera via Mike McCandless)
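The set-exactly-once contract can be sketched in a few lines of plain Java. SetOnceSketch is a hypothetical class written for this example, and the choice of IllegalStateException for a second set() call is an assumption; the Lucene class may signal the error differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical set-once holder; not the Lucene class itself.
final class SetOnceSketch<T> {
    private volatile T value;
    private final AtomicBoolean set = new AtomicBoolean(false);

    /** Stores the value; a second call fails instead of overwriting it. */
    void set(T newValue) {
        if (!set.compareAndSet(false, true)) {
            throw new IllegalStateException("value was already set");
        }
        value = newValue;
    }

    /** Returns the stored value, or null if set() has not been called yet. */
    T get() {
        return value;
    }
}
```

Callers that previously used a plain field would go through get(), which is why a null check before use is worthwhile when the value may not have been set yet.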
* LUCENE-2314: Added AttributeSource.copyTo(AttributeSource) that
allows using cloneAttributes() and this method as a replacement
for captureState()/restoreState(), if the state itself
needs to be inspected/modified. (Uwe Schindler)
* LUCENE-2293: Expose control over max number of threads that
IndexWriter will allow to run concurrently while indexing
documents (previously this was hardwired to 5), using
IndexWriterConfig.setMaxThreadStates. (Mike McCandless)
* LUCENE-2297: Enable turning on reader pooling inside IndexWriter
even when getReader (near-real-time reader) is not in use, through
IndexWriterConfig.enable/disableReaderPooling. (Mike McCandless)
* LUCENE-2331: Add NoMergePolicy which never returns any merges to execute. In
addition, add NoMergeScheduler which never executes any merges. These two are
convenient classes in case you want to disable segment merges by IndexWriter
without tweaking a particular MergePolicy's parameters, such as mergeFactor.
MergeScheduler's methods are now public. (Shai Erera via Mike McCandless)
* LUCENE-2339: Deprecate static method Directory.copy in favor of
Directory.copyTo, and use nio's FileChannel.transferTo when copying
files between FSDirectory instances. (Earwin Burrfoot via Mike
McCandless).
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
can be used to prevent commits from ever getting deleted from the index.
(Shai Erera)
* LUCENE-1585: IndexWriter now accepts a PayloadProcessorProvider which can
return a DirPayloadProcessor for a given Directory, which returns a
PayloadProcessor for a given Term. The PayloadProcessor will be used to
process the payloads of the segments as they are merged (e.g. if one wants to
rewrite payloads of external indexes as they are added, or of local ones).
(Shai Erera, Michael Busch, Mike McCandless)
* LUCENE-2440: Add support for custom ExecutorService in
ParallelMultiSearcher (Edward Drapkin via Mike McCandless)
* LUCENE-2295: Added a LimitTokenCountAnalyzer / LimitTokenCountFilter
to wrap any other Analyzer and provide the same functionality as
MaxFieldLength provided on IndexWriter. This patch also fixes a bug
in the offset calculation in CharTokenizer. (Uwe Schindler, Shai Erera)
* LUCENE-2526: Don't throw NPE from MultiPhraseQuery.toString when
it's empty. (Ross Woolf via Mike McCandless)
* LUCENE-2559: Added SegmentReader.reopen methods (John Wang via Mike
McCandless)
* LUCENE-2590: Added Scorer.visitSubScorers, and Scorer.freq. Along
with a custom Collector these experimental methods make it possible
to gather the hit-count per sub-clause and per document while a
search is running. (Simon Willnauer, Mike McCandless)
* LUCENE-2636: Added MultiCollector which allows running the search with several
Collectors. (Shai Erera)
* LUCENE-2754, LUCENE-2757: Added a wrapper around MultiTermQueries
to add span support: SpanMultiTermQueryWrapper<Q extends MultiTermQuery>.
Using this wrapper it's easy to add fuzzy/wildcard to e.g. a SpanNearQuery.
(Robert Muir, Uwe Schindler)
* LUCENE-2838: ConstantScoreQuery now directly supports wrapping a Query
instance for stripping off scores. The use of a QueryWrapperFilter
is no longer needed and discouraged for that use case. Directly wrapping
Query improves performance, as out-of-order collection is now supported.
(Uwe Schindler)
* LUCENE-2864: Add getMaxTermFrequency (maximum within-document TF) to
FieldInvertState so that it can be used in Similarity.computeNorm.
(Robert Muir)
* LUCENE-2720: Segments now record the code version which created them.
(Shai Erera, Mike McCandless, Uwe Schindler)
* LUCENE-2474: Added expert ReaderFinishedListener API to
IndexReader, to allow apps that maintain external per-segment caches
to evict entries when a segment is finished. (Shay Banon, Yonik
Seeley, Mike McCandless)
* LUCENE-2911: The new StandardTokenizer, UAX29URLEmailTokenizer, and
the ICUTokenizer in contrib now all tag types with a consistent set
of token types (defined in StandardTokenizer). Tokens in the major
CJK types are explicitly marked to allow for custom downstream handling:
<IDEOGRAPHIC>, <HANGUL>, <KATAKANA>, and <HIRAGANA>.
(Robert Muir, Steven Rowe)
* LUCENE-2913: Add missing getters to Numeric* classes. (Uwe Schindler)
* LUCENE-1810: Added FieldSelectorResult.LATENT to not cache lazy loaded fields
(Tim Smith, Grant Ingersoll)
* LUCENE-2692: Added several new SpanQuery classes for positional checking
(match is in a range, payload is a specific value) (Grant Ingersoll)
Optimizations
* LUCENE-2494: Use CompletionService in ParallelMultiSearcher instead of
simple polling for results. (Edward Drapkin, Simon Willnauer)
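The general pattern of collecting results in completion order, rather than polling each task, can be sketched with java.util.concurrent alone. CompletionServiceSketch and sumAll are hypothetical names for this example, and the integer tasks stand in for per-index searches; this is not the ParallelMultiSearcher code.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper showing the CompletionService pattern.
final class CompletionServiceSketch {
    /** Runs all tasks on a fixed pool and sums their results as they finish. */
    static int sumAll(List<Callable<Integer>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<Integer> service = new ExecutorCompletionService<>(pool);
            for (Callable<Integer> task : tasks) {
                service.submit(task);
            }
            int total = 0;
            for (int i = 0; i < tasks.size(); i++) {
                // take() blocks until the next task completes, whichever it is.
                total += service.take().get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Since take() hands back whichever task finishes first, a slow task does not delay the processing of results that are already available.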
* LUCENE-2075: Terms dict cache is now shared across threads instead
of being stored separately in thread local storage. Also fixed
terms dict so that the cache is used when seeking the thread local
term enum, which will be important for MultiTermQuery impls that do
lots of seeking (Mike McCandless, Uwe Schindler, Robert Muir, Yonik
Seeley)
* LUCENE-2136: If the multi reader (DirectoryReader or MultiReader)
only has a single sub-reader, delegate all enum requests to it.
This avoids the overhead of using a PQ unnecessarily. (Mike
McCandless)
* LUCENE-2137: Switch to AtomicInteger for some ref counting (Earwin
Burrfoot via Mike McCandless)
* LUCENE-2123, LUCENE-2261: Move FuzzyQuery rewrite to separate RewriteMode
into MultiTermQuery. The number of fuzzy expansions can be specified with
the maxExpansions parameter to FuzzyQuery.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2164: ConcurrentMergeScheduler has more control over merge
threads. First, it gives smaller merges higher thread priority than
larger ones. Second, a new set/getMaxMergeCount setting will pause
the larger merges to allow smaller ones to finish. The defaults for
these settings are now dynamic, depending on the number of CPU cores as
reported by Runtime.getRuntime().availableProcessors() (Mike
McCandless)
* LUCENE-2169: Improved CharArraySet.copy(), if source set is
also a CharArraySet. (Simon Willnauer via Uwe Schindler)
* LUCENE-2084: Change IndexableBinaryStringTools to work on byte[] and char[]
directly, instead of Byte/CharBuffers, and modify CollationKeyFilter to
take advantage of this for faster performance.
(Steven Rowe, Uwe Schindler, Robert Muir)
* LUCENE-2188: Add a utility class for tracking deprecated overridden
methods in non-final subclasses.
(Uwe Schindler, Robert Muir)
* LUCENE-2195: Speedup CharArraySet if set is empty.
(Simon Willnauer via Robert Muir)
* LUCENE-2285: Code cleanup. (Shai Erera via Uwe Schindler)
* LUCENE-2303: Remove code duplication in Token class by subclassing
TermAttributeImpl, move DEFAULT_TYPE constant to TypeInterface, improve
null-handling for TypeAttribute. (Uwe Schindler)
* LUCENE-2329: Switch TermsHash* from using a PostingList object per unique
term to parallel arrays, indexed by termID. This reduces garbage collection
overhead significantly, which results in great indexing performance wins
when the available JVM heap space is low. This will become even more
important when the DocumentsWriter RAM buffer is searchable in the future,
because then it will make sense to make the RAM buffers as large as
possible. (Mike McCandless, Michael Busch)
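The parallel-array layout described in LUCENE-2329 can be sketched as follows. This is a simplified, hypothetical structure (the class, field, and method names are assumptions, not the TermsHash* code): instead of one object per unique term, per-term state lives in int arrays that share the termID as index, so the garbage collector sees a few large arrays rather than many small objects.

```java
import java.util.Arrays;

/** Hypothetical postings store keeping per-term state in parallel arrays indexed by termID. */
final class ParallelPostings {
    private int[] docFreqs = new int[4];
    private int[] lastDocIDs = new int[4];
    private int numTerms = 0;

    /** Allocates the next termID, growing all arrays together when they are full. */
    int newTerm() {
        if (numTerms == docFreqs.length) {
            int newSize = docFreqs.length * 2;
            docFreqs = Arrays.copyOf(docFreqs, newSize);
            lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
        }
        lastDocIDs[numTerms] = -1; // no document seen yet for this term
        return numTerms++;
    }

    /** Records an occurrence of the term in docID; each document is counted once. */
    void addOccurrence(int termID, int docID) {
        if (lastDocIDs[termID] != docID) {
            docFreqs[termID]++;
            lastDocIDs[termID] = docID;
        }
    }

    int docFreq(int termID) {
        return docFreqs[termID];
    }
}
```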
* LUCENE-2380: The terms field cache methods (getTerms,
getTermsIndex), which replace the older String equivalents
(getStrings, getStringIndex), consume quite a bit less RAM in most
cases. (Mike McCandless)
* LUCENE-2410: ~20% speedup on exact (slop=0) PhraseQuery matching.
(Mike McCandless)
* LUCENE-2531: Fix issue when sorting by a String field that was
causing too many fallbacks to compare-by-value (instead of by-ord).
(Mike McCandless)
* LUCENE-2574: IndexInput exposes copyBytes(IndexOutput, long) to allow for
efficient copying by sub-classes. Optimized copy is implemented for RAM and FS
streams. (Shai Erera)
* LUCENE-2719: Improved TermsHashPerField's sorting to use a better
quick sort algorithm that does not dereference the pivot element on
every compare call. Also replaced lots of sorting code in Lucene
by the improved SorterTemplate class.
(Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2760: Optimize SpanFirstQuery and SpanPositionRangeQuery.
(Robert Muir)
* LUCENE-2770: Make SegmentMerger always work on atomic subreaders,
even when IndexWriter.addIndexes(IndexReader...) is used with
DirectoryReaders or other MultiReaders. This saves lots of memory
during merge of norms. (Uwe Schindler, Mike McCandless)
* LUCENE-2824: Optimize BufferedIndexInput to do less bounds checks.
(Robert Muir)
* LUCENE-2010: Segments with 100% deleted documents are now removed on
IndexReader or IndexWriter commit. (Uwe Schindler, Mike McCandless)
* LUCENE-1472: Removed synchronization from static DateTools methods
by using a ThreadLocal. Also converted DateTools.Resolution to a
Java 5 enum (this should not break backwards compatibility). (Uwe Schindler)
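The ThreadLocal approach mentioned in LUCENE-1472 can be illustrated with a small, hypothetical helper. It is not the DateTools code; the class name, the "yyyyMMdd" pattern, and the GMT time zone are assumptions chosen for the example. Because SimpleDateFormat is not thread-safe, each thread keeps its own instance and no synchronization is needed.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical date helper; illustrates the ThreadLocal idea only. */
final class ThreadLocalDateFormat {
    // One formatter per thread, created lazily on first use.
    private static final ThreadLocal<SimpleDateFormat> DAY_FORMAT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override
            protected SimpleDateFormat initialValue() {
                SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
                format.setTimeZone(TimeZone.getTimeZone("GMT"));
                return format;
            }
        };

    /** Formats a date with day resolution without any synchronized block. */
    static String dayString(Date date) {
        return DAY_FORMAT.get().format(date);
    }
}
```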
Build
* LUCENE-2124: Moved the JDK-based collation support from contrib/collation
into core, and moved the ICU-based collation support into contrib/icu.
(Robert Muir)
* LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards
branch is now included in the svn repository using "svn copy"
after release. (Uwe Schindler)
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
* LUCENE-1709: Tests are now parallelized by default (except for benchmark). You
can force them to run sequentially by passing -Drunsequential=1 on the command
line. The number of threads that are spawned per CPU defaults to '1'. If you
wish to change that, you can run the tests with -DthreadsPerProcessor=[num].
(Robert Muir, Shai Erera, Peter Kofler)
* LUCENE-2516: Backwards tests are now compiled against released lucene-core.jar
from tarball of previous version. Backwards tests are now packaged together
with src distribution. (Uwe Schindler)
* LUCENE-2611: Added Ant target to install IntelliJ IDEA configuration:
"ant idea". See http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
(Steven Rowe)
* LUCENE-2657: Switch from using Maven POM templates to full POMs when
generating Maven artifacts (Steven Rowe)
* LUCENE-2609: Added jar-test-framework Ant target which packages Lucene's
tests' framework classes. (Drew Farris, Grant Ingersoll, Shai Erera,
Steven Rowe)
Test Cases
* LUCENE-2037: Allow JUnit4 tests in our environment (Erick Erickson
via Mike McCandless)
* LUCENE-1844: Speed up the unit tests (Mark Miller, Erick Erickson,
Mike McCandless)
* LUCENE-2065: Use Java 5 generics throughout our unit tests. (Kay
Kay via Mike McCandless)
* LUCENE-2155: Fix time and zone dependent localization test failures
in queryparser tests. (Uwe Schindler, Chris Male, Robert Muir)
* LUCENE-2170: Fix thread starvation problems. (Uwe Schindler)
* LUCENE-2248, LUCENE-2251, LUCENE-2285: Refactor tests to not use
Version.LUCENE_CURRENT, but instead use a global static value
from LuceneTestCase(J4), that contains the release version.
(Uwe Schindler, Simon Willnauer, Shai Erera)
* LUCENE-2313, LUCENE-2322: Add VERBOSE to LuceneTestCase(J4) to control
verbosity of tests. If VERBOSE==false (default) tests should not print
anything other than errors to System.(out|err). The setting can be
changed with -Dtests.verbose=true on test invocation.
(Shai Erera, Paul Elschot, Uwe Schindler)
* LUCENE-2318: Remove inconsistent system property code for retrieving
temp and data directories inside test cases. It is now centralized in
LuceneTestCase(J4). Also changed lots of tests to use
getClass().getResourceAsStream() to retrieve test data. Tests needing
access to "real" files from the test folder itself, can use
LuceneTestCase(J4).getDataFile(). (Uwe Schindler)
* LUCENE-2398, LUCENE-2611: Improve tests to work better from IDEs such
as Eclipse and IntelliJ.
(Paolo Castagna, Steven Rowe via Robert Muir)
* LUCENE-2804: add newFSDirectory to LuceneTestCase to create a FSDirectory at
random. (Shai Erera, Robert Muir)
Documentation
* LUCENE-2579: Fix oal.search's package.html description of abstract
methods. (Santiago M. Mola via Mike McCandless)
* LUCENE-2625: Add a note to IndexReader.termDocs() with additional verbiage
that the TermEnum must be seeked since it is unpositioned.
(Adriano Crestani via Robert Muir)
* LUCENE-2894: Use google-code-prettify for syntax highlighting in javadoc.
(Shinichiro Abe, Koji Sekiguchi)
================== Release 2.9.4 / 3.0.3 ====================
Changes in runtime behavior
* LUCENE-2689: NativeFSLockFactory no longer attempts to acquire a
test lock just before the real lock is acquired. (Surinder Pal
Singh Bindra via Mike McCandless)
* LUCENE-2762: Fixed bug in IndexWriter causing it to hold open file
handles against deleted files when compound-file was enabled (the
default) and readers are pooled. As a result of this the peak
worst-case free disk space required during optimize is now 3X the
index size, when compound file is enabled (else 2X). (Mike
McCandless)
* LUCENE-2773: LogMergePolicy accepts a double noCFSRatio (default =
0.1), which means any time a merged segment is greater than 10% of
the index size, it will be left in non-compound format even if
compound format is on. This change was made to reduce peak
transient disk usage during optimize which increased due to
LUCENE-2762. (Mike McCandless)
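The noCFSRatio rule from LUCENE-2773 can be read as a simple size comparison. The helper below is a hypothetical sketch of that rule as stated above, not LogMergePolicy's actual code; the method name and parameters are assumptions.

```java
/** Hypothetical helper mirroring the rule described above; not LogMergePolicy's code. */
final class NoCfsRatioSketch {
    /** Returns true if the merged segment should be written in compound format. */
    static boolean useCompoundFile(boolean compoundEnabled, long mergedSegmentBytes,
                                   long totalIndexBytes, double noCFSRatio) {
        if (!compoundEnabled) {
            return false;
        }
        // Segments larger than noCFSRatio of the whole index stay in non-compound format.
        return mergedSegmentBytes <= noCFSRatio * totalIndexBytes;
    }
}
```

With the default ratio of 0.1, a 200-byte merged segment in a 1000-byte index would stay non-compound, while a 50-byte segment would still be written as a compound file.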
Bug fixes
* LUCENE-2142 (correct fix): FieldCacheImpl.getStringIndex no longer
throws an exception when term count exceeds doc count.
(Mike McCandless, Uwe Schindler)
* LUCENE-2513: when opening writable IndexReader on a not-current
commit, do not overwrite "future" commits. (Mike McCandless)
* LUCENE-2536: IndexWriter.rollback was failing to properly rollback
buffered deletions against segments that were flushed (Mark Harwood
via Mike McCandless)
* LUCENE-2541: Fixed NumericRangeQuery that returned incorrect results
with endpoints near Long.MIN_VALUE and Long.MAX_VALUE:
NumericUtils.splitRange() overflowed, if
- the range contained a LOWER bound
that was greater than (Long.MAX_VALUE - (1L << precisionStep))
- the range contained an UPPER bound
that was less than (Long.MIN_VALUE + (1L << precisionStep))
With standard precision steps around 4, this had no effect on
most queries, only those that met the above conditions.
Queries with large precision steps failed more easily. Queries with
precision step >=64 were not affected. Also 32 bit data types int
and float were not affected.
(Yonik Seeley, Uwe Schindler)
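The two overflow conditions in LUCENE-2541 can be checked with plain long arithmetic. The sketch below is worked arithmetic only and is not the NumericUtils.splitRange() code; it assumes a precisionStep between 1 and 62 and interprets the conditions as "adding or subtracting one block of size (1L << precisionStep) wraps around".

```java
/** Worked arithmetic only; this is not the NumericUtils.splitRange code. */
final class SplitRangeOverflowSketch {
    /** True if adding one precision-step block to the lower bound would wrap around. */
    static boolean lowerBoundOverflows(long lower, int precisionStep) {
        return lower > Long.MAX_VALUE - (1L << precisionStep);
    }

    /** True if subtracting one precision-step block from the upper bound would wrap around. */
    static boolean upperBoundOverflows(long upper, int precisionStep) {
        return upper < Long.MIN_VALUE + (1L << precisionStep);
    }

    /** Shows the wrap-around itself: the sum becomes smaller than the input. */
    static boolean additionWraps(long lower, int precisionStep) {
        long sum = lower + (1L << precisionStep); // silently overflows in Java
        return sum < lower;
    }
}
```

For example, with precisionStep 4 a lower bound of Long.MAX_VALUE - 3 meets the first condition, whereas a lower bound of 0 does not.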
* LUCENE-2593: Fixed certain rare cases where a disk full could lead
to a corrupted index (Robert Muir, Mike McCandless)
* LUCENE-2620: Fixed a bug in WildcardQuery where too many asterisks
would result in unbearably slow performance. (Nick Barkas via Robert Muir)
* LUCENE-2627: Fixed bug in MMapDirectory chunking when a file is an
exact multiple of the chunk size. (Robert Muir)
* LUCENE-2634: isCurrent on an NRT reader was failing to return false
if the writer had just committed (Nikolay Zamosenchuk via Mike McCandless)
* LUCENE-2650: Added extra safety to MMapIndexInput clones to prevent accessing
an unmapped buffer if the input is closed (Mike McCandless, Uwe Schindler, Robert Muir)
* LUCENE-2384: Reset zzBuffer in StandardTokenizerImpl when lexer is reset.
(Ruben Laguna via Uwe Schindler, sub-issue of LUCENE-2074)
* LUCENE-2658: Exceptions while processing term vectors enabled for multiple
fields could lead to invalid ArrayIndexOutOfBoundsExceptions.
(Robert Muir, Mike McCandless)
* LUCENE-2235: Implement missing PerFieldAnalyzerWrapper.getOffsetGap().
(Javier Godoy via Uwe Schindler)
* LUCENE-2328: Fixed memory leak in how IndexWriter/Reader tracked
already sync'd files. (Earwin Burrfoot via Mike McCandless)
* LUCENE-2549: Fix TimeLimitingCollector#TimeExceededException to record
the absolute docid. (Uwe Schindler)
* LUCENE-2533: fix FileSwitchDirectory.listAll to not return dups when
primary & secondary dirs share the same underlying directory.
(Michael McCandless)
* LUCENE-2365: IndexWriter.newestSegment (used normally for testing)
is fixed to return null if there are no segments. (Karthick
Sankarachary via Mike McCandless)
* LUCENE-2730: Fix two rare deadlock cases in IndexWriter (Mike McCandless)
* LUCENE-2744: CheckIndex was stating total number of fields,
not the number that have norms enabled, on the "test: field
norms..." output. (Mark Kristensson via Mike McCandless)
* LUCENE-2759: Fixed two near-real-time cases where doc store files
may be opened for read even though they are still open for write.
(Mike McCandless)
* LUCENE-2618: Fix rare thread safety issue whereby
IndexWriter.optimize could sometimes return even though the index
wasn't fully optimized (Mike McCandless)
* LUCENE-2767: Fix thread safety issue in addIndexes(IndexReader[])
that could potentially result in index corruption. (Mike
McCandless)
* LUCENE-2762: Fixed bug in IndexWriter causing it to hold open file
handles against deleted files when compound-file was enabled (the
default) and readers are pooled. As a result of this the peak
worst-case free disk space required during optimize is now 3X the
index size, when compound file is enabled (else 2X). (Mike
McCandless)
* LUCENE-2216: OpenBitSet.hashCode returned different hash codes for
sets that only differed by trailing zeros. (Dawid Weiss, yonik)
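One way to make a bit-set hash independent of trailing zero words, as LUCENE-2216 requires, is to fold the words from last to first starting from zero. The sketch below is a hypothetical illustration of that technique and should not be taken as the exact OpenBitSet.hashCode implementation.

```java
/** Hypothetical bit-set hashing sketch; illustration only. */
final class TrailingZeroHash {
    /** Hashes the words from last to first so that trailing zero words contribute nothing. */
    static int hash(long[] words) {
        long h = 0;
        for (int i = words.length - 1; i >= 0; i--) {
            h ^= words[i];
            h = (h << 1) | (h >>> 63); // rotate left by one bit; zero stays zero
        }
        return (int) ((h >> 32) ^ h);  // fold the upper half into the lower half
    }
}
```

Since the accumulator is still zero while only trailing zero words have been visited, a set backed by {5} and one backed by {5, 0, 0} hash to the same value.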
* LUCENE-2782: Fix rare potential thread hazard with
IndexWriter.commit (Mike McCandless)
API Changes
* LUCENE-2773: LogMergePolicy accepts a double noCFSRatio (default =
0.1), which means any time a merged segment is greater than 10% of
the index size, it will be left in non-compound format even if
compound format is on. This change was made to reduce peak
transient disk usage during optimize which increased due to
LUCENE-2762. (Mike McCandless)
Optimizations
* LUCENE-2556: Improve memory usage after cloning TermAttribute.
(Adriano Crestani via Uwe Schindler)
* LUCENE-2098: Improve the performance of BaseCharFilter, especially for
large documents. (Robin Wojciki, Koji Sekiguchi, Robert Muir)
New features
* LUCENE-2675 (2.9.4 only): Add support for Lucene 3.0 stored field files
also in 2.9. The file format did not change, only the version number was
upgraded to mark segments that have no compression. FieldsWriter still only
writes 2.9 segments as they could contain compressed fields. This cross-version
index format compatibility is provided here solely because Lucene 2.9 and 3.0
have the same bugfix level, features, and the same index format with this slight
compression difference. In general, Lucene does not support reading newer
indexes with older library versions. (Uwe Schindler)
Documentation
* LUCENE-2239: Documented limitations in NIOFSDirectory and MMapDirectory due to
Java NIO behavior when a Thread is interrupted while blocking on IO.
(Simon Willnauer, Robert Muir)
================== Release 2.9.3 / 3.0.2 ====================
Changes in backwards compatibility policy
* LUCENE-2135: Added FieldCache.purge(IndexReader) method to the
interface. Anyone implementing FieldCache externally will need to
fix their code to implement this, on upgrading. (Mike McCandless)
Changes in runtime behavior
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
it cannot delete the lock file, since obtaining the lock does not fail if the
file is there. (Shai Erera)
* LUCENE-2060 (2.9.3 only): Changed ConcurrentMergeScheduler's default for
maxNumThreads from 3 to 1, because in practice we get the most gains
from running a single merge in the background. More than one
concurrent merge causes a lot of thrashing (though it's possible on
SSD storage that there would be net gains). (Jason Rutherglen, Mike
McCandless)
Bug fixes
* LUCENE-2046 (2.9.3 only): IndexReader should not see the index as changed, after
IndexWriter.prepareCommit has been called but before
IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
* LUCENE-2119: Don't throw NegativeArraySizeException if you pass
Integer.MAX_VALUE as nDocs to IndexSearcher search methods. (Paul
Taylor via Mike McCandless)
* LUCENE-2142: FieldCacheImpl.getStringIndex no longer throws an
exception when term count exceeds doc count. (Mike McCandless)
* LUCENE-2104: NativeFSLock.release() would silently fail if the lock is held by
another thread/process. (Shai Erera via Uwe Schindler)
* LUCENE-2283: Use shared memory pool for term vector and stored
fields buffers. This memory will be reclaimed if needed according to
the configured RAM Buffer Size for the IndexWriter. This also fixes
potentially excessive memory usage when many threads are indexing a
mix of small and large documents. (Tim Smith via Mike McCandless)
* LUCENE-2300: If IndexWriter is pooling readers (because an NRT reader
has been obtained), and addIndexes* is run, do not pool the
readers from the external directory. This is harmless (NRT reader is
correct), but a waste of resources. (Mike McCandless)
* LUCENE-2422: Don't reuse byte[] in IndexInput/Output -- it gains
little performance, and ties up possibly large amounts of memory
for apps that index large docs. (Ross Woolf via Mike McCandless)
* LUCENE-2387: Don't hang onto Fieldables from the last doc indexed,
in IndexWriter, nor the Reader in Tokenizer after close is
called. (Ruben Laguna, Uwe Schindler, Mike McCandless)
* LUCENE-2417: IndexCommit did not implement hashCode() and equals()
consistently. Now they both take Directory and version into consideration. In
addition, all of IndexCommit's methods which threw
UnsupportedOperationException are now abstract. (Shai Erera)
* LUCENE-2467: Fixed memory leaks in IndexWriter when large documents
are indexed. (Mike McCandless)
* LUCENE-2473: Clicking on the "More Results" link in the luceneweb.war
demo resulted in ArrayIndexOutOfBoundsException.
(Sami Siren via Robert Muir)
* LUCENE-2476: If any exception is hit init'ing IW, release the write
lock (previously we only released on IOException). (Tamas Cservenak
via Mike McCandless)
* LUCENE-2478: Fix CachingWrapperFilter to not throw NPE when
Filter.getDocIdSet() returns null. (Uwe Schindler, Daniel Noll)
* LUCENE-2468: Allow specifying how new deletions should be handled in
CachingWrapperFilter and CachingSpanFilter. By default, new
deletions are ignored in CachingWrapperFilter, since typically this
filter is AND'd with a query that correctly takes new deletions into
account. This should be a performance gain (higher cache hit rate)
in apps that reopen readers, or use near-real-time reader
(IndexWriter.getReader()), but may introduce invalid search results
(allowing deleted docs to be returned) for certain cases, so a new
expert ctor was added to CachingWrapperFilter to enforce deletions
at a performance cost. CachingSpanFilter by default recaches if
there are new deletions (Shay Banon via Mike McCandless)
* LUCENE-2299: If you open an NRT reader while addIndexes* is running,
it may miss some segments (Earwin Burrfoot via Mike McCandless)
* LUCENE-2397: Don't throw NPE from SnapshotDeletionPolicy.snapshot if
there are no commits yet (Shai Erera)
* LUCENE-2424: Fix FieldDoc.toString to actually return its fields
(Stephen Green via Mike McCandless)
* LUCENE-2311: Always pass a "fully loaded" (terms index & doc stores)
SegmentReader to IndexWriter's mergedSegmentWarmer (if set), so
that warming is free to do whatever it needs to. (Earwin Burrfoot
via Mike McCandless)
* LUCENE-3029: Fix corner case when MultiPhraseQuery is used with zero
position-increment tokens that would sometimes assign different
scores to identical docs. (Mike McCandless)
* LUCENE-2486: Fixed intermittent FileNotFoundException on doc store
files when a mergedSegmentWarmer is set on IndexWriter. (Mike
McCandless)
* LUCENE-2130: Fix performance issue when FuzzyQuery runs on a
multi-segment index (Michael McCandless)
API Changes
* LUCENE-2281: added doBeforeFlush to IndexWriter to allow extensions to perform
operations before flush starts. Also exposed doAfterFlush as protected instead
of package-private. (Shai Erera via Mike McCandless)
* LUCENE-2356: Add IndexWriter.set/getReaderTermsIndexDivisor, to set
what IndexWriter passes for termsIndexDivisor to the readers it
opens internally when applying deletions or creating a
near-real-time reader. (Earwin Burrfoot via Mike McCandless)
Optimizations
* LUCENE-2494 (3.0.2 only): Use CompletionService in ParallelMultiSearcher
instead of simple polling for results. (Edward Drapkin, Simon Willnauer)
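The difference between polling and a CompletionService, as referenced in LUCENE-2494, can be sketched with the JDK classes alone. The example below is not the ParallelMultiSearcher code; the class name, the pool size of 4, and the integer results are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: collect results in completion order instead of polling futures. */
final class CompletionServiceSketch {
    static int sumResults(List<Callable<Integer>> tasks)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            CompletionService<Integer> service =
                new ExecutorCompletionService<Integer>(executor);
            for (Callable<Integer> task : tasks) {
                service.submit(task);
            }
            int total = 0;
            for (int i = 0; i < tasks.size(); i++) {
                // take() blocks until the next task has finished, whichever it is.
                total += service.take().get();
            }
            return total;
        } finally {
            executor.shutdown();
        }
    }
}
```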
* LUCENE-2135: On IndexReader.close, forcefully evict any entries from
the FieldCache rather than waiting for the WeakHashMap to release
the reference (Mike McCandless)
* LUCENE-2161: Improve concurrency of IndexReader, especially in the
context of near real-time readers. (Mike McCandless)
* LUCENE-2360: Small speedup to recycling of reused per-doc RAM in
IndexWriter (Robert Muir, Mike McCandless)
Build
* LUCENE-2488 (2.9.3 only): Support build with JDK 1.4 and exclude Java 1.5
contrib modules on request (pass '-Dforce.jdk14.build=true') when
compiling/testing/packaging. This marks the benchmark contrib also
as Java 1.5, as it depends on fast-vector-highlighter. (Uwe Schindler)
================== Release 2.9.2 / 3.0.1 ====================
Changes in backwards compatibility policy
* LUCENE-2123 (3.0.1 only): Removed the protected inner class ScoreTerm
from FuzzyQuery. The change was needed because the comparator of this
class had to be changed in an incompatible way. The class was never
intended to be public. (Uwe Schindler, Mike McCandless)
Bug fixes
* LUCENE-2092: BooleanQuery was ignoring disableCoord in its hashCode
and equals methods, causing bad things to happen when caching
BooleanQueries. (Chris Hostetter, Mike McCandless)
* LUCENE-2095: Fixes: when two threads call IndexWriter.commit() at
the same time, it's possible for commit to return control back to
one of the threads before all changes are actually committed.
(Sanne Grinovero via Mike McCandless)
* LUCENE-2132 (3.0.1 only): Fix the demo result.jsp to use QueryParser
with a Version argument. (Brian Li via Robert Muir)
* LUCENE-2166: Don't incorrectly keep warning about the same immense
term, when IndexWriter.infoStream is on. (Mike McCandless)
* LUCENE-2158: At high indexing rates, NRT reader could temporarily
lose deletions. (Mike McCandless)
* LUCENE-2182: DEFAULT_ATTRIBUTE_FACTORY was failing to load
implementation class when interface was loaded by a different
class loader. (Uwe Schindler, reported on java-user by Ahmed El-dawy)
* LUCENE-2257: Increase max number of unique terms in one segment to
termIndexInterval (default 128) * ~2.1 billion = ~274 billion.
(Tom Burton-West via Mike McCandless)
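The figure quoted in LUCENE-2257 can be reproduced with a one-line calculation, assuming that "~2.1 billion" refers to Integer.MAX_VALUE (an assumption; the entry does not say so explicitly). The multiplication must be done in long arithmetic, since the product does not fit in an int.

```java
/** Worked arithmetic for the limit quoted above; the helper name is hypothetical. */
final class MaxTermsArithmetic {
    /** Multiplies in long arithmetic; int arithmetic would overflow here. */
    static long maxUniqueTerms(int termIndexInterval) {
        return (long) termIndexInterval * Integer.MAX_VALUE;
    }
}
```

For the default interval of 128 this gives 274,877,906,816, which matches the "~274 billion" stated above.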
* LUCENE-2260: Fixed AttributeSource to not hold a strong
reference to the Attribute/AttributeImpl classes which prevents
unloading of custom attributes loaded by other classloaders
(e.g. in Solr plugins). (Uwe Schindler)
* LUCENE-1941: Fix Min/MaxPayloadFunction returning 0 when
only one payload is present. (Erik Hatcher, Mike McCandless
via Uwe Schindler)
* LUCENE-2270: Queries consisting of all zero-boost clauses
(for example, text:foo^0) sorted incorrectly and produced
invalid docids. (yonik)
API Changes
* LUCENE-1609 (3.0.1 only): Restore IndexReader.getTermInfosIndexDivisor
(it was accidentally removed in 3.0.0) (Mike McCandless)
* LUCENE-1972 (3.0.1 only): Restore SortField.getComparatorSource
(it was accidentally removed in 3.0.0) (John Wang via Uwe Schindler)
* LUCENE-2190: Added a new class CustomScoreProvider to function package
that can be subclassed to provide custom scoring to CustomScoreQuery.
The methods in CustomScoreQuery that did this before were deprecated
and replaced by a method getCustomScoreProvider(IndexReader) that
returns a custom score implementation using the above class. The change
is necessary with per-segment searching, as CustomScoreQuery is
a stateless class (like all other Queries) and does not know about
the currently searched segment. This API works similarly to Filter's
getDocIdSet(IndexReader). (Paul chez Jamespot via Mike McCandless,
Uwe Schindler)
* LUCENE-2080: Deprecate Version.LUCENE_CURRENT, as using this constant
will cause backwards compatibility problems when upgrading Lucene. See
the Version javadocs for additional information.
(Robert Muir)
Optimizations
* LUCENE-2086: When resolving deleted terms, do so in term sort order
for better performance (Bogdan Ghidireac via Mike McCandless)
* LUCENE-2123 (partly, 3.0.1 only): Fixes a slowdown / memory issue
added by LUCENE-504. (Uwe Schindler, Robert Muir, Mike McCandless)
* LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum.
(Uwe Schindler, Robert Muir)
Test Cases
* LUCENE-2114: Change TestFilteredSearch to test on multi-segment
index as well. (Simon Willnauer via Mike McCandless)
* LUCENE-2211: Improves BaseTokenStreamTestCase to use a fake attribute
that checks if clearAttributes() was called correctly.
(Uwe Schindler, Robert Muir)
* LUCENE-2207, LUCENE-2219: Improve BaseTokenStreamTestCase to check if
end() is implemented correctly. (Koji Sekiguchi, Robert Muir)
Documentation
* LUCENE-2114: Improve javadocs of Filter to call out that the
provided reader is per-segment (Simon Willnauer via Mike
McCandless)
======================= Release 3.0.0 =======================
Changes in backwards compatibility policy
* LUCENE-1979: Change return type of SnapshotDeletionPolicy#snapshot()
from IndexCommitPoint to IndexCommit. Code that uses this method
needs to be recompiled against Lucene 3.0 in order to work. The
previously deprecated IndexCommitPoint is also removed.
(Michael Busch)
* o.a.l.Lock.isLocked() is now allowed to throw an IOException.
(Mike McCandless)
* LUCENE-2030: CachingWrapperFilter and CachingSpanFilter now hide
the internal cache implementation for thread safety; previously it was
declared protected. (Peter Lenahan, Uwe Schindler, Simon Willnauer)
* LUCENE-2053: If you call Thread.interrupt() on a thread inside
Lucene, Lucene will do its best to interrupt the thread. However,
instead of throwing InterruptedException (which is a checked
exception), you'll get an oal.util.ThreadInterruptedException (an
unchecked exception, subclassing RuntimeException). The interrupt
status on the thread is cleared when this exception is thrown.
(Mike McCandless)
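Because the interrupt status is cleared when the unchecked exception from LUCENE-2053 is thrown, callers that rely on the flag may want to restore it. The sketch below uses a stand-in exception class so that it compiles without Lucene; the class names and the simulated blocking call are hypothetical.

```java
/** Stand-in for oal.util.ThreadInterruptedException so the sketch compiles without Lucene. */
final class SketchThreadInterruptedException extends RuntimeException {
    SketchThreadInterruptedException(InterruptedException cause) {
        super(cause);
    }
}

/** Hypothetical caller showing how the interrupt status can be restored. */
final class InterruptSketch {
    /** Simulates a blocking library call that converts the checked exception. */
    static void blockingCall() {
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {
            // The interrupt status is already cleared at this point.
            throw new SketchThreadInterruptedException(ie);
        }
    }

    /** Calls the library and re-sets the interrupt flag; returns true if interrupted. */
    static boolean callAndRestoreInterrupt() {
        try {
            blockingCall();
            return false;
        } catch (SketchThreadInterruptedException e) {
            Thread.currentThread().interrupt(); // restore the status for outer code
            return true;
        }
    }
}
```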
* LUCENE-2052: Some methods in Lucene core were changed to accept
Java 5 varargs. This is not a backwards compatibility problem as
long as you do not try to override such a method. We left common
overridden methods unchanged and added varargs to constructors,
static, or final methods (MultiSearcher,...). (Uwe Schindler)
* LUCENE-1558: IndexReader.open(Directory) now opens a readOnly=true
reader, and new IndexSearcher(Directory) does the same. Note that
this is a change in the default from 2.9, when these methods were
previously deprecated. (Mike McCandless)
* LUCENE-1753: Make not yet final TokenStreams final to enforce
decorator pattern. (Uwe Schindler)
Changes in runtime behavior
* LUCENE-1677: Remove the system property to set SegmentReader class
implementation. (Uwe Schindler)
* LUCENE-1960: As a consequence of the removal of Field.Store.COMPRESS,
support for this type of fields was removed. Lucene 3.0 is still able
to read indexes with compressed fields, but as soon as merges occur
or the index is optimized, all compressed fields are decompressed
and converted to Field.Store.YES. Because of this, indexes with
compressed fields can suddenly get larger. Also, the first merge with
decompression cannot be done in raw mode and is therefore slower.
This change has no effect for code that uses such old indexes,
they behave as before (fields are automatically decompressed
during read). Indexes converted to Lucene 3.0 format cannot be read
anymore with previous versions.
It is recommended to optimize your indexes after upgrading to convert
to the new format and decompress all fields.
If you want compressed fields, you can use CompressionTools, which
creates compressed byte[] to be added as binary stored field. This
cannot be done automatically, as you also have to decompress such
fields when reading. You have to reindex to do that.
(Michael Busch, Uwe Schindler)
* LUCENE-2060: Changed ConcurrentMergeScheduler's default for
maxNumThreads from 3 to 1, because in practice we get the most
gains from running a single merge in the background. More than one
concurrent merge causes a lot of thrashing (though it's possible on
SSD storage that there would be net gains). (Jason Rutherglen,
Mike McCandless)
API Changes
* LUCENE-1257, LUCENE-1984, LUCENE-1985, LUCENE-2057, LUCENE-1833, LUCENE-2012,
LUCENE-1998: Port to Java 1.5:
- Add generics to public and internal APIs (see below).
- Replace new Integer(int), new Double(double),... by static valueOf() calls.
- Replace for-loops with Iterator by foreach loops.
- Replace StringBuffer with StringBuilder.
- Replace o.a.l.util.Parameter by Java 5 enums (see below).
- Add @Override annotations.
(Uwe Schindler, Robert Muir, Karl Wettin, Paul Elschot, Kay Kay, Shai Erera,
DM Smith)
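Several of the mechanical Java 1.5 rewrites listed in the entry above can be seen side by side in a few lines. The method below is a hypothetical example written for this note, not code taken from Lucene.

```java
/** Hypothetical example of the Java 5 idioms named above. */
final class Java5IdiomsSketch {
    /** Joins the values with commas using varargs, foreach, valueOf and StringBuilder. */
    static String join(int... values) {
        StringBuilder sb = new StringBuilder();      // instead of StringBuffer
        for (int value : values) {                   // foreach instead of an explicit Iterator
            Integer boxed = Integer.valueOf(value);  // instead of new Integer(value)
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(boxed);
        }
        return sb.toString();
    }
}
```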
* Generify Lucene API:
- TokenStream/AttributeSource: Now addAttribute()/getAttribute() return an
instance of the requested attribute interface and no cast is needed anymore
(LUCENE-1855).
- NumericRangeQuery, NumericRangeFilter, and FieldCacheRangeFilter
now have Integer, Long, Float, Double as type param (LUCENE-1857).
- Document.getFields() returns List<Fieldable>.
- Query.extractTerms(Set<Term>)
- CharArraySet and stop word sets in core/contrib
- PriorityQueue (LUCENE-1935)
- TopDocCollector
- DisjunctionMaxQuery (LUCENE-1984)
- MultiTermQueryWrapperFilter
- CloseableThreadLocal
- MapOfSets
- o.a.l.util.cache package
- lots of internal APIs of IndexWriter
(Uwe Schindler, Michael Busch, Kay Kay, Robert Muir, Adriano Crestani)
* LUCENE-1944, LUCENE-1856, LUCENE-1957, LUCENE-1960, LUCENE-1961,
LUCENE-1968, LUCENE-1970, LUCENE-1946, LUCENE-1971, LUCENE-1975,
LUCENE-1972, LUCENE-1978, LUCENE-944, LUCENE-1979, LUCENE-1973, LUCENE-2011:
Remove deprecated methods/constructors/classes:
- Remove all String/File directory paths in IndexReader /
IndexSearcher / IndexWriter.
- Remove FSDirectory.getDirectory()
- Make FSDirectory abstract.
- Remove Field.Store.COMPRESS (see above).
- Remove Filter.bits(IndexReader) method and make
Filter.getDocIdSet(IndexReader) abstract.
- Remove old DocIdSetIterator methods and make the new ones abstract.
- Remove some methods in PriorityQueue.
- Remove old TokenStream API and backwards compatibility layer.
- Remove RangeQuery, RangeFilter and ConstantScoreRangeQuery.
- Remove SpanQuery.getTerms().
- Remove ExtendedFieldCache, custom and auto caches, SortField.AUTO.
- Remove old-style custom sort.
- Remove legacy search setting in SortField.
- Remove Hits and all references from core and contrib.
- Remove HitCollector and its TopDocs support implementations.
- Remove term field and accessors in MultiTermQuery
(and fix Highlighter).
- Remove deprecated methods in BooleanQuery.
- Remove deprecated methods in Similarity.
- Remove BoostingTermQuery.
- Remove MultiValueSource.
- Remove Scorer.explain(int).
...and some other minor ones (Uwe Schindler, Michael Busch, Mark Miller)
* LUCENE-1925: Make IndexSearcher's subReaders and docStarts members
protected; add expert ctor to directly specify reader, subReaders
and docStarts. (John Wang, Tim Smith via Mike McCandless)
* LUCENE-1945: All public classes that have a close() method now
also implement java.io.Closeable (IndexReader, IndexWriter, Directory,...).
(Uwe Schindler)
* LUCENE-1998: Change all Parameter instances to Java 5 enums. This
is no backwards-break, only a change of the super class. Parameter
was deprecated and will be removed in a later version.
(DM Smith, Uwe Schindler)
Bug fixes
* LUCENE-1951: When the text provided to WildcardQuery has no wildcard
characters (ie matches a single term), don't lose the boost and
rewrite method settings. Also, rewrite to PrefixQuery if the
wildcard is of the form "foo*", for slightly faster performance. (Robert
Muir via Mike McCandless)
* LUCENE-2013: SpanRegexQuery does not work with QueryScorer.
(Benjamin Keil via Mark Miller)
* LUCENE-2088: addAttribute() should only accept interfaces that
extend Attribute. (Shai Erera, Uwe Schindler)
* LUCENE-2045: Fix silly FileNotFoundException hit if you enable
infoStream on IndexWriter and then add an empty document and commit
(Shai Erera via Mike McCandless)
* LUCENE-2046: IndexReader should not see the index as changed, after
IndexWriter.prepareCommit has been called but before
IndexWriter.commit is called. (Peter Keegan via Mike McCandless)
New features
* LUCENE-1933: Provide a convenience AttributeFactory that creates a
Token instance for all basic attributes. (Uwe Schindler)
* LUCENE-2041: Parallelize the rest of ParallelMultiSearcher. Lots of
code refactoring and Java 5 concurrent support in MultiSearcher.
(Joey Surls, Simon Willnauer via Uwe Schindler)
* LUCENE-2051: Add CharArraySet.copy() as a simple method to copy
any Set<?> to a CharArraySet that is optimized, if Set<?> is already
a CharArraySet. (Simon Willnauer)
Optimizations
* LUCENE-1183: Optimize Levenshtein Distance computation in
FuzzyQuery. (Cédrik Lime via Mike McCandless)
* LUCENE-2006: Optimization of FieldDocSortedHitQueue to always
use Comparable<?> interface. (Uwe Schindler, Mark Miller)
* LUCENE-2087: Remove recursion in NumericRangeTermEnum.
(Uwe Schindler)
Build
* LUCENE-486: Remove test->demo dependencies. (Michael Busch)
* LUCENE-2024: Raise build requirements to Java 1.5 and ANT 1.7.0
(Uwe Schindler, Mike McCandless)
======================= Release 2.9.1 =======================
Changes in backwards compatibility policy
* LUCENE-2002: Add required Version matchVersion argument when
constructing QueryParser or MultiFieldQueryParser and default (as
of 2.9) enablePositionIncrements to true to match
StandardAnalyzer's 2.9 default (Uwe Schindler, Mike McCandless)
Bug fixes
* LUCENE-1974: Fixed nasty bug in BooleanQuery (when it used
BooleanScorer for scoring), whereby some matching documents fail to
be collected. (Fulin Tang via Mike McCandless)
* LUCENE-1124: Make sure FuzzyQuery always matches the precise term.
(stefatwork@gmail.com via Mike McCandless)
* LUCENE-1976: Fix IndexReader.isCurrent() to return the right thing
when the reader is a near real-time reader. (Jake Mannix via Mike
McCandless)
* LUCENE-1986: Fix NPE when scoring PayloadNearQuery (Peter Keegan,
Mark Miller via Mike McCandless)
* LUCENE-1992: Fix thread hazard if a merge is committing just as an
exception occurs during sync (Uwe Schindler, Mike McCandless)
* LUCENE-1995: Note in javadocs that IndexWriter.setRAMBufferSizeMB
cannot exceed 2048 MB, and throw IllegalArgumentException if it
does. (Aaron McKee, Yonik Seeley, Mike McCandless)
* LUCENE-2004: Fix Constants.LUCENE_MAIN_VERSION to not be inlined
by client code. (Uwe Schindler)
* LUCENE-2016: Replace illegal U+FFFF character with the replacement
char (U+FFFD) during indexing, to prevent silent index corruption.
(Peter Keegan, Mike McCandless)
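A minimal sketch of the LUCENE-2016 replacement, assuming it is applied to a whole String. IllegalCharSanitizer and sanitize are hypothetical names for this example; the actual fix lives inside Lucene's indexing code and is not shown here.

```java
// Hypothetical helper, not a Lucene class.
final class IllegalCharSanitizer {
    static final char ILLEGAL = '\uFFFF';      // the illegal character U+FFFF
    static final char REPLACEMENT = '\uFFFD';  // the Unicode replacement character U+FFFD

    // Returns a copy of the term with every U+FFFF replaced by U+FFFD.
    static String sanitize(String term) {
        return term.replace(ILLEGAL, REPLACEMENT);
    }
}
```

Terms that do not contain U+FFFF come back unchanged.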
API Changes
* Un-deprecate search(Weight weight, Filter filter, int n) from
Searchable interface (deprecated by accident). (Uwe Schindler)
* Un-deprecate o.a.l.util.Version constants. (Mike McCandless)
* LUCENE-1987: Un-deprecate some ctors of Token, as they will not
be removed in 3.0 and are still useful. Also add some missing
o.a.l.util.Version constants for enabling invalid acronym
settings in StandardAnalyzer to be compatible with the coming
Lucene 3.0. (Uwe Schindler)
* LUCENE-1973: Un-deprecate IndexSearcher.setDefaultFieldSortScoring,
to allow controlling per-IndexSearcher whether scores are computed
when sorting by field. (Uwe Schindler, Mike McCandless)
* LUCENE-2043: Make IndexReader.commit(Map<String,String>) public.
(Mike McCandless)
Documentation
* LUCENE-1955: Fix Hits deprecation notice to point users in right
direction. (Mike McCandless, Mark Miller)
* Fix javadoc about score tracking done by search methods in Searcher
and IndexSearcher. (Mike McCandless)
* LUCENE-2008: Javadoc improvements for TokenStream/Tokenizer/Token
(Luke Nezda via Mike McCandless)
======================= Release 2.9.0 =======================
Changes in backwards compatibility policy
* LUCENE-1575: Searchable.search(Weight, Filter, int, Sort) no
longer computes a document score for each hit by default. If
document score tracking is still needed, you can call
IndexSearcher.setDefaultFieldSortScoring(true, true) to enable
both per-hit and maxScore tracking; however, this is deprecated
and will be removed in 3.0.
Alternatively, use Searchable.search(Weight, Filter, Collector)
and pass in a TopFieldCollector instance, using the following code
sample:
<code>
TopFieldCollector tfc = TopFieldCollector.create(sort, numHits, fillFields,
true /* trackDocScores */,
true /* trackMaxScore */,
false /* docsInOrder */);
searcher.search(query, tfc);
TopDocs results = tfc.topDocs();
</code>
Note that your Sort object cannot use SortField.AUTO when you
directly instantiate TopFieldCollector.
Also, the method search(Weight, Filter, Collector) was added to
the Searchable interface and the Searcher abstract class to
replace the deprecated HitCollector versions. If you either
implement Searchable or extend Searcher, you should change your
code to implement this method. If you already extend
IndexSearcher, no further changes are needed to use Collector.
Finally, the values Float.NaN and Float.NEGATIVE_INFINITY are not
valid scores. Lucene uses these values internally in certain
places, so if you have hits with such scores, it will cause
problems. (Shai Erera via Mike McCandless)
* LUCENE-1687: All methods and parsers from the interface ExtendedFieldCache
have been moved into FieldCache. ExtendedFieldCache is now deprecated and
contains only a few declarations for binary backwards compatibility.
ExtendedFieldCache will be removed in version 3.0. Users of FieldCache and
ExtendedFieldCache will be able to plug in Lucene 2.9 without recompilation.
The auto cache (FieldCache.getAuto) is now deprecated. Due to the merge of
ExtendedFieldCache and FieldCache, FieldCache can now return long[] and
double[] arrays in addition to int[] and float[] and StringIndex.
The interface changes only matter to users who implement the interfaces,
which is unlikely because there is no way to replace
Lucene's FieldCache implementation. (Grant Ingersoll, Uwe Schindler)
* LUCENE-1630, LUCENE-1771: Weight, previously an interface, is now an abstract
class. Some of the method signatures have changed, but it should be fairly
easy to see what adjustments must be made to existing code to sync up
with the new API. You can find more detail in the API Changes section.
Going forward Searchable will be kept for convenience only and may
be changed between minor releases without any deprecation
process. It is not recommended that you implement it, but rather extend
Searcher.
(Shai Erera, Chris Hostetter, Martin Ruckli, Mark Miller via Mike McCandless)
* LUCENE-1422, LUCENE-1693: The new Attribute based TokenStream API (see below)
has some backwards breaks in rare cases. We did our best to make the
transition as easy as possible and you are not likely to run into any problems.
If your tokenizers still implement next(Token) or next(), the calls are
automatically wrapped. The indexer and query parser use the new API
(e.g. via incrementToken() calls). All core TokenStreams are implemented using
the new API. You can mix old and new API style TokenFilters/TokenStream.
Problems only occur when you have done the following:
You have overridden next(Token) or next() in one of the non-abstract core
TokenStreams/-Filters. These classes should normally be final, but some
of them are not. In this case, next(Token)/next() would never be called.
To fail early with a hard compile/runtime error, the next(Token)/next()
methods in these TokenStreams/-Filters were made final in this release.
(Michael Busch, Uwe Schindler)
* LUCENE-1763: MergePolicy now requires an IndexWriter instance to
be passed upon instantiation. As a result, IndexWriter was removed
as a method argument from all MergePolicy methods. (Shai Erera via
Mike McCandless)
* LUCENE-1748: LUCENE-1001 introduced PayloadSpans, but this was a back
compat break and caused custom SpanQuery implementations to fail at runtime
in a variety of ways. This issue attempts to remedy things by causing
a compile time break on custom SpanQuery implementations and removing
the PayloadSpans class, with its functionality now moved to Spans. To
help in alleviating future back compat pain, Spans has been changed from
an interface to an abstract class.
(Hugh Cayless, Mark Miller)
* LUCENE-1808: Query.createWeight has been changed from protected to
public. This will be a back compat break if you have overridden this
method - but you are likely already affected by the LUCENE-1693 (make Weight
abstract rather than an interface) back compat break if you have overridden
Query.createWeight, so we have taken the opportunity to make this change.
(Tim Smith, Shai Erera via Mark Miller)
* LUCENE-1708: IndexReader.document() no longer checks if the document is
deleted. You can call IndexReader.isDeleted(n) prior to calling document(n).
(Shai Erera via Mike McCandless)
Changes in runtime behavior
* LUCENE-1424: QueryParser now by default uses constant score auto
rewriting when it generates a WildcardQuery and PrefixQuery (it
already does so for TermRangeQuery, as well). Call
setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
to revert to slower BooleanQuery rewriting method. (Mark Miller via Mike
McCandless)
* LUCENE-1575: As of 2.9, the core collectors as well as
IndexSearcher's search methods that return top N results, no
longer filter documents with scores <= 0.0. If you rely on this
functionality you can use PositiveScoresOnlyCollector like this:
<code>
TopDocsCollector tdc = TopScoreDocCollector.create(10, true /* docsScoredInOrder */);
Collector c = new PositiveScoresOnlyCollector(tdc);
searcher.search(query, c);
TopDocs hits = tdc.topDocs();
...
</code>
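The filtering idea behind PositiveScoresOnlyCollector can be sketched without Lucene on the classpath. ScoreSink and PositiveScoresOnlySink are hypothetical, simplified stand-ins and do not mirror the real Collector API; only the "forward hits with score > 0" rule comes from the entry above.

```java
// Hypothetical, simplified stand-in for a collector; not Lucene's Collector API.
interface ScoreSink {
    void collect(int doc, float score);
}

// Wraps another sink and forwards only hits whose score is strictly positive.
final class PositiveScoresOnlySink implements ScoreSink {
    private final ScoreSink delegate;

    PositiveScoresOnlySink(ScoreSink delegate) {
        this.delegate = delegate;
    }

    @Override
    public void collect(int doc, float score) {
        if (score > 0.0f) {          // drops scores <= 0.0 (and NaN)
            delegate.collect(doc, score);
        }
    }
}
```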
* LUCENE-1604: IndexReader.norms(String field) is now allowed to
return null if the field has no norms, as long as you've
previously called IndexReader.setDisableFakeNorms(true). This
setting now defaults to false (to preserve the fake norms back
compatible behavior) but in 3.0 will be hardwired to true. (Shon
Vella via Mike McCandless).
* LUCENE-1624: If you open IndexWriter with create=true and
autoCommit=false on an existing index, IndexWriter no longer
writes an empty commit when it's created. (Paul Taylor via Mike
McCandless)
* LUCENE-1593: When you call Sort() or Sort.setSort(String field,
boolean reverse), the resulting SortField array no longer ends
with SortField.FIELD_DOC (it was unnecessary as Lucene breaks ties
internally by docID). (Shai Erera via Michael McCandless)
* LUCENE-1542: When the first token(s) have 0 position increment,
IndexWriter used to incorrectly record the position as -1, if no
payload is present, or Integer.MAX_VALUE if a payload is present.
This causes positional queries to fail to match. The bug is now
fixed, but if your app relies on the buggy behavior then you must
call IndexWriter.setAllowMinus1Position(). That API is deprecated
so you must fix your application, and rebuild your index, to not
rely on this behavior by the 3.0 release of Lucene. (Jonathan
Mamou, Mark Miller via Mike McCandless)
* LUCENE-1715: Finalizers have been removed from the 4 core classes
that still had them, since they will cause GC to take longer, thus
tying up memory for longer, and at best they mask buggy app code.
DirectoryReader (returned from IndexReader.open) & IndexWriter
previously released the write lock during finalize.
SimpleFSDirectory.FSIndexInput closed the descriptor in its
finalizer, and NativeFSLock released the lock. It's possible
applications will be affected by this, but only if the application
is failing to close reader/writers. (Brian Groose via Mike
McCandless)
* LUCENE-1717: Fixed IndexWriter to account for RAM usage of
buffered deletions. (Mike McCandless)
* LUCENE-1727: Ensure that fields are stored & retrieved in the
exact order in which they were added to the document. This was
true in all Lucene releases before 2.3, but was broken in 2.3 and
2.4, and is now fixed in 2.9. (Mike McCandless)
* LUCENE-1678: The addition of Analyzer.reusableTokenStream
accidentally broke back compatibility of external analyzers that
subclassed core analyzers that implemented tokenStream but not
reusableTokenStream. This is now fixed, such that if
reusableTokenStream is invoked on such a subclass, that method
will forcibly fall back to tokenStream. (Mike McCandless)
* LUCENE-1801: Token.clear() and Token.clearNoTermBuffer() now also clear
startOffset, endOffset and type. This is not likely to affect any
Tokenizer chains, as Tokenizers normally always set these three values.
This change was made to conform to the new AttributeImpl.clear() and
AttributeSource.clearAttributes(), so that clearing works identically whether Token
is used as the single AttributeImpl for all attributes or the 6 separate
AttributeImpls are used. (Uwe Schindler, Michael Busch)
* LUCENE-1483: When searching over multiple segments, a new Scorer is now created
for each segment. Searching has been telescoped out a level and IndexSearcher now
operates much like MultiSearcher does. The Weight is created only once for the top
level Searcher, but each Scorer is passed a per-segment IndexReader. This will
result in doc ids in the Scorer being internal to the per-segment IndexReader. It
has always been outside of the API to count on a given IndexReader to contain every
doc id in the index - and if you have been ignoring MultiSearcher in your custom code
and counting on this fact, you will find your code no longer works correctly. If a
custom Scorer implementation uses any caches/filters that rely on being based on the
top level IndexReader, it will need to be updated to correctly use contextless
caches/filters; e.g., you can't count on the IndexReader to contain any given doc id or
all of the doc ids. (Mark Miller, Mike McCandless)
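To illustrate the per-segment doc ids from LUCENE-1483, here is a small sketch of how a segment-local id relates to a top-level id. It assumes each segment is given a starting offset (called docBase here) equal to the total document count of the segments before it; SegmentDocIds is a hypothetical helper, not a Lucene class.

```java
// Hypothetical helper: relates segment-local doc ids to top-level doc ids.
final class SegmentDocIds {
    // bases[i] is the number of documents in all segments before segment i.
    static int[] docBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = base;
            base += segmentMaxDocs[i];
        }
        return bases;
    }

    // A per-segment Scorer reports segmentDoc; adding docBase gives the top-level id.
    static int toTopLevel(int docBase, int segmentDoc) {
        return docBase + segmentDoc;
    }
}
```

A cache keyed on top-level ids would have to be looked up with the translated id, or rebuilt per segment.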
* LUCENE-1846: DateTools now uses the US locale to format the numbers in its
date/time strings instead of the default locale. For most locales there will
be no change in the index format, as DateFormatSymbols is using ASCII digits.
The usage of the US locale is important to guarantee correct ordering of
generated terms. (Uwe Schindler)
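A sketch of locale-independent date terms in the spirit of LUCENE-1846. The pattern "yyyyMMddHHmmss", the GMT time zone and the class name UsLocaleDateTerms are assumptions made for this example; it is not DateTools itself.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not Lucene's DateTools.
final class UsLocaleDateTerms {
    // Formats a timestamp with Locale.US so the digits are always ASCII,
    // which keeps the generated strings in chronological order when sorted.
    static String toTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }
}
```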
* LUCENE-1860: MultiTermQuery now defaults to
CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method (previously it
was SCORING_BOOLEAN_QUERY_REWRITE). This means that PrefixQuery
and WildcardQuery will now produce constant score for all matching
docs, equal to the boost of the query. (Mike McCandless)
API Changes
* LUCENE-1419: Add expert API to set custom indexing chain. This API is
package-protected for now, so we don't have to officially support it.
Yet, it will give us the possibility to try out different consumers
in the chain. (Michael Busch)
* LUCENE-1427: DocIdSet.iterator() is now allowed to throw
IOException. (Paul Elschot, Mike McCandless)
* LUCENE-1422, LUCENE-1693: New TokenStream API that uses a new class called
AttributeSource instead of the Token class, which is now a utility class that
holds common Token attributes. All attributes that the Token class had have
been moved into separate classes: TermAttribute, OffsetAttribute,
PositionIncrementAttribute, PayloadAttribute, TypeAttribute and FlagsAttribute.
The new API is much more flexible; it allows combining the Attributes
arbitrarily and also defining custom Attributes. The new API has the same
performance as the old next(Token) approach. For conformance with this new
API Tee-/SinkTokenizer was deprecated and replaced by a new TeeSinkTokenFilter.
(Michael Busch, Uwe Schindler; additional contributions and bug fixes by
Daniel Shane, Doron Cohen)
* LUCENE-1467: Add nextDoc() and next(int) methods to OpenBitSetIterator.
These methods can be used to avoid additional calls to doc().
(Michael Busch)
* LUCENE-1468: Deprecate Directory.list(), which sometimes (in
FSDirectory) filters out files that don't look like index files, in
favor of new Directory.listAll(), which does no filtering. Also,
listAll() will never return null; instead, it throws an IOException
(or subclass). Specifically, FSDirectory.listAll() will throw the
newly added NoSuchDirectoryException if the directory does not
exist. (Marcel Reutegger, Mike McCandless)
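The "never return null" contract of listAll() can be sketched on top of java.io.File. DirectoryListing is a hypothetical helper; java.io.FileNotFoundException is used here as a stand-in for Lucene's NoSuchDirectoryException, which is an assumption of this example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical helper, not Lucene's FSDirectory.
final class DirectoryListing {
    // Lists all file names without filtering; throws instead of returning null.
    static String[] listAll(File dir) throws IOException {
        if (!dir.exists()) {
            // stand-in for NoSuchDirectoryException
            throw new FileNotFoundException("directory '" + dir + "' does not exist");
        }
        String[] names = dir.list();   // File.list() returns null on failure
        if (names == null) {
            throw new IOException("cannot list directory '" + dir + "'");
        }
        return names;
    }
}
```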
* LUCENE-1546: Add IndexReader.flush(Map commitUserData), allowing
you to record an opaque commitUserData (maps String -> String) into
the commit written by IndexReader. This matches IndexWriter's
commit methods. (Jason Rutherglen via Mike McCandless)
* LUCENE-652: Added org.apache.lucene.document.CompressionTools, to
enable compressing & decompressing binary content, external to
Lucene's indexing. Deprecated Field.Store.COMPRESS.
* LUCENE-1561: Renamed Field.omitTf to Field.omitTermFreqAndPositions
(Otis Gospodnetic via Mike McCandless)
* LUCENE-1500: Added new InvalidTokenOffsetsException to Highlighter methods
to denote issues when offsets in TokenStream tokens exceed the length of the
provided text. (Mark Harwood)
* LUCENE-1575, LUCENE-1483: HitCollector is now deprecated in favor of
a new Collector abstract class. For easy migration, people can use
HitCollectorWrapper which translates (wraps) HitCollector into
Collector. Note that this class is also deprecated and will be
removed when HitCollector is removed. Also TimeLimitedCollector
is deprecated in favor of the new TimeLimitingCollector which
extends Collector. (Shai Erera, Mark Miller, Mike McCandless)
* LUCENE-1592: The method TermEnum.skipTo() was deprecated, because
it is used nowhere in core/contrib and there is only a very inefficient
default implementation available. If you want to position a TermEnum
to another Term, create a new one using IndexReader.terms(Term).
(Uwe Schindler)
* LUCENE-1621: MultiTermQuery.getTerm() has been deprecated as it does
not make sense for all subclasses of MultiTermQuery. Check individual
subclasses to see if they support getTerm(). (Mark Miller)
* LUCENE-1636: Make TokenFilter.input final so it's set only
once. (Wouter Heijke, Uwe Schindler via Mike McCandless).
* LUCENE-1658, LUCENE-1451: Renamed FSDirectory to SimpleFSDirectory
(but left an FSDirectory base class). Added an FSDirectory.open
static method to pick a good default FSDirectory implementation
given the OS. FSDirectories should now be instantiated using
FSDirectory.open or with public constructors rather than
FSDirectory.getDirectory(), which has been deprecated.
(Michael McCandless, Uwe Schindler, yonik)
* LUCENE-1665: Deprecate SortField.AUTO, to be removed in 3.0.
Instead, when sorting by field, the application should explicitly
state the type of the field. (Mike McCandless)
* LUCENE-1660: StopFilter, StandardAnalyzer, StopAnalyzer now
require up front specification of enablePositionIncrement (Mike
McCandless)
* LUCENE-1614: DocIdSetIterator's next() and skipTo() were deprecated in favor
of the new nextDoc() and advance(). The new methods return the doc Id they
landed on, saving an extra call to doc() in most cases.
For easy migration of the code, you can change the calls to next() to
nextDoc() != DocIdSetIterator.NO_MORE_DOCS and similarly for skipTo().
However it is advised that you take advantage of the returned doc ID and not
call doc() following those two.
Also, doc() was deprecated in favor of docID(). docID() should return -1 or
NO_MORE_DOCS if nextDoc/advance have not been called yet, or NO_MORE_DOCS if the
iterator is exhausted. Otherwise it should return the current doc ID.
(Shai Erera via Mike McCandless)
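The nextDoc()/advance()/docID() contract described in LUCENE-1614 can be sketched with an iterator over a sorted int[]. SortedArrayDocIterator is a hypothetical class, and the sentinel value Integer.MAX_VALUE for NO_MORE_DOCS is an assumption of this sketch.

```java
// Hypothetical iterator over a sorted int[]; not Lucene's DocIdSetIterator.
final class SortedArrayDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;  // assumed sentinel value

    private final int[] docs;
    private int idx = -1;
    private int doc = -1;   // -1 until nextDoc()/advance() is first called

    SortedArrayDocIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    int docID() {
        return doc;
    }

    // Moves to the next doc and returns it, or NO_MORE_DOCS when exhausted.
    int nextDoc() {
        idx++;
        doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        return doc;
    }

    // Moves to the first doc >= target (beyond the current one) and returns it.
    int advance(int target) {
        int d;
        do {
            d = nextDoc();
        } while (d < target);
        return d;
    }
}
```

A migrated loop then reads: while ((doc = it.nextDoc()) != SortedArrayDocIterator.NO_MORE_DOCS) { ... }, using the returned id directly instead of a separate doc() call.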
* LUCENE-1672: All ctors/opens and other methods using String/File to
specify the directory in IndexReader, IndexWriter, and IndexSearcher
were deprecated. You should instantiate the Directory manually before
and pass it to these classes (LUCENE-1451, LUCENE-1658).
(Uwe Schindler)
* LUCENE-1407: Move RemoteSearchable, RemoteCachingWrapperFilter out
of Lucene's core into new contrib/remote package. Searchable no
longer extends java.rmi.Remote (Simon Willnauer via Mike
McCandless)
* LUCENE-1677: The global property
org.apache.lucene.SegmentReader.class, and
ReadOnlySegmentReader.class are now deprecated, to be removed in
3.0. src/gcj/* has been removed. (Earwin Burrfoot via Mike
McCandless)
* LUCENE-1673: Deprecated NumberTools in favour of the new
NumericRangeQuery and its new indexing format for numeric or
date values. (Uwe Schindler)
* LUCENE-1630, LUCENE-1771: Weight is now an abstract class, and adds
a scorer(IndexReader, boolean /* scoreDocsInOrder */, boolean /*
topScorer */) method instead of scorer(IndexReader). IndexSearcher uses
this method to obtain a scorer matching the capabilities of the Collector
wrt orderedness of docIDs. Some Scorers (like BooleanScorer) are much more
efficient if out-of-order documents scoring is allowed by a Collector.
Collector must now implement acceptsDocsOutOfOrder. If you write a
Collector which does not care about doc ID ordering, it is recommended
that you return true. Weight has a scoresDocsOutOfOrder method, which by
default returns false. If you create a Weight which will score documents
out of order if requested, you should override that method to return true.
BooleanQuery's setAllowDocsOutOfOrder and getAllowDocsOutOfOrder have been
deprecated as they are not needed anymore. BooleanQuery will now score docs
out of order when used with a Collector that can accept docs out of order.
Finally, Weight#explain now takes a sub-reader and sub-docID, rather than
a top level reader and docID.
(Shai Erera, Chris Hostetter, Martin Ruckli, Mark Miller via Mike McCandless)
* LUCENE-1466, LUCENE-1906: Added CharFilter and MappingCharFilter, which allow
chaining & mapping of characters before tokenizers run. CharStream (subclass of
Reader) is the base class for custom java.io.Readers that support offset
correction. Tokenizers got an additional method correctOffset() that is passed
down to the underlying CharStream if input is a subclass of CharStream/-Filter.
(Koji Sekiguchi via Mike McCandless, Uwe Schindler)
* LUCENE-1703: Add IndexWriter.waitForMerges. (Tim Smith via Mike
McCandless)
* LUCENE-1625: CheckIndex's programmatic API now returns separate
classes detailing the status of each component in the index, and
includes more detailed status than previously. (Tim Smith via
Mike McCandless)
* LUCENE-1713: Deprecated RangeQuery and RangeFilter and renamed to
TermRangeQuery and TermRangeFilter. TermRangeQuery is in constant
score auto rewrite mode by default. The new classes also have new
ctors taking field and term ranges as Strings (see also
LUCENE-1424). (Uwe Schindler)
* LUCENE-1609: The termInfosIndexDivisor must now be specified
up-front when opening the IndexReader. Attempts to call
IndexReader.setTermInfosIndexDivisor will hit an
UnsupportedOperationException. This was done to enable removal of
all synchronization in TermInfosReader, which previously could
cause threads to pile up in certain cases. (Dan Rosher via Mike
McCandless)
* LUCENE-1688: Deprecate the static final String stop word array in
StopAnalyzer and replace it with an immutable implementation of
CharArraySet. (Simon Willnauer via Mark Miller)
* LUCENE-1742: SegmentInfos, SegmentInfo and SegmentReader have been
made public as expert, experimental APIs. These APIs may suddenly
change from release to release (Jason Rutherglen via Mike
McCandless).
* LUCENE-1754: QueryWeight.scorer() can return null if no documents
are going to be matched by the query. Similarly,
Filter.getDocIdSet() can return null if no documents are going to
be accepted by the Filter. Note that these methods may return null, but
they are not required to: they may instead return a Scorer that matches
no documents or a DocIdSet that rejects all documents. This is already the
behavior of some QueryWeight/Filter implementations, and is
documented here just for emphasis. (Shai Erera via Mike
McCandless)
* LUCENE-1705: Added IndexWriter.deleteAllDocuments. (Tim Smith via
Mike McCandless)
* LUCENE-1460: Changed TokenStreams/TokenFilters in contrib to
use the new TokenStream API. (Robert Muir, Michael Busch)
* LUCENE-1748: LUCENE-1001 introduced PayloadSpans, but this was a back
compat break and caused custom SpanQuery implementations to fail at runtime
in a variety of ways. This issue attempts to remedy things by causing
a compile time break on custom SpanQuery implementations and removing
the PayloadSpans class, with its functionality now moved to Spans. To
help in alleviating future back compat pain, Spans has been changed from
an interface to an abstract class.
(Hugh Cayless, Mark Miller)
* LUCENE-1808: Query.createWeight has been changed from protected to
public. (Tim Smith, Shai Erera via Mark Miller)
* LUCENE-1826: Add constructors that take AttributeSource and
AttributeFactory to all Tokenizer implementations.
(Michael Busch)
* LUCENE-1847: Similarity#idf for both a Term and Term Collection have
been deprecated. New versions that return an IDFExplanation have been
added. (Yasoja Seneviratne, Mike McCandless, Mark Miller)
* LUCENE-1877: Made NativeFSLockFactory the default for
the new FSDirectory API (open(), FSDirectory subclass ctors).
All FSDirectory system properties were deprecated and all lock
implementations use no lock prefix if the locks are stored inside
the index directory. Because the deprecated String/File ctors of
IndexWriter and IndexReader (LUCENE-1672) and FSDirectory.getDirectory()
still use the old SimpleFSLockFactory, while the new API uses
NativeFSLockFactory, we strongly recommend not mixing the deprecated
and new APIs. (Uwe Schindler, Mike McCandless)
* LUCENE-1911: Added a new method isCacheable() to DocIdSet. This method
should return true, if the underlying implementation does not use disk
I/O and is fast enough to be directly cached by CachingWrapperFilter.
OpenBitSet, SortedVIntList, and DocIdBitSet are such candidates.
The default implementation of the abstract DocIdSet class returns false.
In this case, CachingWrapperFilter copies the DocIdSetIterator into an
OpenBitSet for caching. (Uwe Schindler, Thomas Becker)
Bug fixes
* LUCENE-1415: MultiPhraseQuery has incorrect hashCode() and equals()
implementation - Leads to Solr Cache misses.
(Todd Feak, Mark Miller via yonik)
* LUCENE-1327: Fix TermSpans#skipTo() to behave as specified in javadocs
of Terms#skipTo(). (Michael Busch)
* LUCENE-1573: Do not ignore InterruptedException (caused by
Thread.interrupt()) nor enter deadlock/spin loop. Now, an interrupt
will cause a RuntimeException to be thrown. In 3.0 we will change
public APIs to throw InterruptedException. (Jeremy Volkman via
Mike McCandless)
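A sketch of the LUCENE-1573 behavior, in which an interrupt surfaces as a RuntimeException rather than being swallowed. InterruptSketch and pause are hypothetical names; restoring the thread's interrupt flag before rethrowing is a choice made in this example and is not claimed to be what Lucene does.

```java
// Hypothetical helper, not Lucene code.
final class InterruptSketch {
    // Sleeps for the given time; an interrupt is rethrown as a RuntimeException.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Assumption of this sketch: keep the interrupt visible to callers.
            Thread.currentThread().interrupt();
            throw new RuntimeException(ie);
        }
    }
}
```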
* LUCENE-1590: Fixed stored-only Field instances so that they do not change the
value of omitNorms, omitTermFreqAndPositions in FieldInfo; when you
retrieve such fields they will now have omitNorms=true and
omitTermFreqAndPositions=false (though these values are unused).
(Uwe Schindler via Mike McCandless)
* LUCENE-1587: RangeQuery#equals() could consider a RangeQuery
without a collator equal to one with a collator.
(Mark Platvoet via Mark Miller)
* LUCENE-1600: Don't call String.intern unnecessarily in some cases
when loading documents from the index. (P Eger via Mike
McCandless)
* LUCENE-1611: Fix case where OutOfMemoryError in IndexWriter
could cause "infinite merging" to happen. (Christiaan Fluit via
Mike McCandless)
* LUCENE-1623: Properly handle back-compatibility of 2.3.x indexes that
contain field names with non-ascii characters. (Mike Streeton via
Mike McCandless)
* LUCENE-1593: MultiSearcher and ParallelMultiSearcher did not break ties (in
sort) by doc Id in a consistent manner (i.e., if Sort.FIELD_DOC was used vs.
when it wasn't). (Shai Erera via Michael McCandless)
* LUCENE-1647: Fix case where IndexReader.undeleteAll would cause
the segment's deletion count to be incorrect. (Mike McCandless)
* LUCENE-1542: When the first token(s) have 0 position increment,
IndexWriter used to incorrectly record the position as -1, if no
payload is present, or Integer.MAX_VALUE if a payload is present.
This causes positional queries to fail to match. The bug is now
fixed, but if your app relies on the buggy behavior then you must
call IndexWriter.setAllowMinus1Position(). That API is deprecated
so you must fix your application, and rebuild your index, to not
rely on this behavior by the 3.0 release of Lucene. (Jonathan
Mamou, Mark Miller via Mike McCandless)
* LUCENE-1658: Fixed MMapDirectory to correctly throw IOExceptions
on EOF, removed numeric overflow possibilities and added support
for a hack to unmap the buffers on closing IndexInput.
(Uwe Schindler)
* LUCENE-1681: Fix infinite loop caused by a call to DocValues methods
getMinValue, getMaxValue, getAverageValue. (Simon Willnauer via Mark Miller)
* LUCENE-1599: Add clone support for SpanQuerys. SpanRegexQuery counts
on this functionality and does not work correctly without it.
(Billow Gao, Mark Miller)
* LUCENE-1718: Fix termInfosIndexDivisor to carry over to reopened
readers (Mike McCandless)
* LUCENE-1583: SpanOrQuery skipTo() doesn't always move forwards as Spans
documentation indicates it should. (Moti Nisenson via Mark Miller)
* LUCENE-1566: Sun JVM Bug
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 causes
invalid OutOfMemoryError when reading too many bytes at once from
a file on 32bit JVMs that have a large maximum heap size. This
fix adds set/getReadChunkSize to FSDirectory so that large reads
are broken into chunks, to work around this JVM bug. On 32bit
JVMs the default chunk size is 100 MB; on 64bit JVMs, which don't
show the bug, the default is Integer.MAX_VALUE. (Simon Willnauer
via Mike McCandless)
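The chunked-read workaround from LUCENE-1566 can be sketched as a loop that never asks for more than chunkSize bytes in one call. ChunkedReader is a hypothetical helper that reads from a plain InputStream, which is an assumption of this example; it does not reproduce FSDirectory's own code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not Lucene's FSDirectory.
final class ChunkedReader {
    // Fills dest completely, issuing reads of at most chunkSize bytes each.
    static void readFully(InputStream in, byte[] dest, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be > 0");
        }
        int offset = 0;
        while (offset < dest.length) {
            int toRead = Math.min(chunkSize, dest.length - offset);
            int n = in.read(dest, offset, toRead);
            if (n < 0) {
                throw new EOFException("read past EOF");
            }
            offset += n;
        }
    }
}
```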
* LUCENE-1448: Added TokenStream.end() to perform end-of-stream
operations (ie to return the end offset of the tokenization).
This is important when multiple fields with the same name are added
to a document, to ensure offsets recorded in term vectors for all
of the instances are correct.
(Mike McCandless, Mark Miller, Michael Busch)
* LUCENE-1805: CloseableThreadLocal did not allow a null Object in get(),
although it does allow it in set(Object). Fix get() to not assert the object
is not null. (Shai Erera via Mike McCandless)
* LUCENE-1801: Changed all Tokenizers or TokenStreams (in core/contrib)
that are the source of Tokens to always call
AttributeSource.clearAttributes() first. (Uwe Schindler)
* LUCENE-1819: MatchAllDocsQuery.toString(field) should produce output
that is parsable by the QueryParser. (John Wang, Mark Miller)
* LUCENE-1836: Fix localization bug in the new query parser and add
new LocalizedTestCase as base class for localization junit tests.
(Robert Muir, Uwe Schindler via Michael Busch)
* LUCENE-1847: PhraseQuery/TermQuery/SpanQuery use IndexReader specific stats
in their Weight#explain methods - these stats should be corpus wide.
(Yasoja Seneviratne, Mike McCandless, Mark Miller)
* LUCENE-1885: Fix the bug that NativeFSLock.isLocked() did not work,
if the lock was obtained by another NativeFSLock(Factory) instance.
Because of this IndexReader.isLocked() and IndexWriter.isLocked() did
not work correctly. (Uwe Schindler)
* LUCENE-1899: Fix O(N^2) CPU cost when setting docIDs in order in an
OpenBitSet, due to an inefficiency in how the underlying storage is
reallocated. (Nadav Har'El via Mike McCandless)
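To show why the reallocation strategy matters for LUCENE-1899, the sketch below grows its backing array geometrically (about 1.5x per step), so setting bits in increasing order copies the array only a logarithmic number of times. GrowableBits and its growth factor are assumptions of this example and not OpenBitSet's actual code.

```java
import java.util.Arrays;

// Hypothetical growable bit set, not Lucene's OpenBitSet.
final class GrowableBits {
    private long[] words = new long[1];
    int reallocations = 0;   // counts array copies, for illustration only

    private void ensureCapacity(int numWords) {
        if (numWords <= words.length) {
            return;
        }
        // Grow by about 1.5x instead of to the exact size needed.
        int newLen = Math.max(numWords, words.length + (words.length >> 1) + 1);
        words = Arrays.copyOf(words, newLen);
        reallocations++;
    }

    void set(int bit) {
        int word = bit >>> 6;
        ensureCapacity(word + 1);
        words[word] |= 1L << (bit & 63);
    }

    boolean get(int bit) {
        int word = bit >>> 6;
        return word < words.length && (words[word] & (1L << (bit & 63))) != 0;
    }
}
```

Growing to exactly the required size on every call would instead copy the array once per new word, which is the quadratic pattern the entry describes.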
* LUCENE-1918: Fixed cases where a ParallelReader would
generate exceptions on being passed to
IndexWriter.addIndexes(IndexReader[]). First case was when the
ParallelReader was empty. Second case was when the ParallelReader
used to contain documents with TermVectors, but all such documents
have been deleted. (Christian Kohlschütter via Mike McCandless)
New features
* LUCENE-1411: Added expert API to open an IndexWriter on a prior
commit, obtained from IndexReader.listCommits. This makes it
possible to rollback changes to an index even after you've closed
the IndexWriter that made the changes, assuming you are using an
IndexDeletionPolicy that keeps past commits around. This is useful
when building transactional support on top of Lucene. (Mike
McCandless)
* LUCENE-1382: Add an optional arbitrary Map (String -> String)
"commitUserData" to IndexWriter.commit(), which is stored in the
segments file and is then retrievable via
IndexReader.getCommitUserData instance and static methods.
(Shalin Shekhar Mangar via Mike McCandless)
* LUCENE-1420: Similarity now has a computeNorm method that allows
custom Similarity classes to override how norm is computed. It's
provided a FieldInvertState instance that contains details from
inverting the field. The default impl is boost *
lengthNorm(numTerms), to be backwards compatible. Also added
{set/get}DiscountOverlaps to DefaultSimilarity, to control whether
overlapping tokens (tokens with 0 position increment) should be
counted in lengthNorm. (Andrzej Bialecki via Mike McCandless)
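A sketch of the default norm computation described in LUCENE-1420: boost * lengthNorm(numTerms), with overlapping tokens optionally excluded from numTerms. The lengthNorm formula 1/sqrt(numTerms) and the plain int parameters (instead of a FieldInvertState) are assumptions of this example; NormSketch is a hypothetical class.

```java
// Hypothetical helper, not Lucene's Similarity.
final class NormSketch {
    // Assumed length norm: shorter fields get a larger norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // length: number of tokens in the field; numOverlap: tokens with 0 position increment.
    static float computeNorm(float boost, int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return boost * lengthNorm(numTerms);
    }
}
```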
* LUCENE-1424: Moved constant score query rewrite capability into
MultiTermQuery, allowing TermRangeQuery, PrefixQuery and WildcardQuery
to switch between constant-score rewriting or BooleanQuery
expansion rewriting via a new setRewriteMethod method.
Deprecated ConstantScoreRangeQuery (Mark Miller via Mike
McCandless)
* LUCENE-1461: Added FieldCacheRangeFilter, a RangeFilter for
single-term fields that uses FieldCache to compute the filter. If
your documents all have a single term for a given field, and you
need to create many RangeFilters with varying lower/upper bounds,
then this is likely a much faster way to create the filters than
RangeFilter. FieldCacheRangeFilter allows ranges on all data types that
FieldCache supports (term ranges, byte, short, int, long, float, double).
However, it comes at the expense of added RAM consumption and slower
first-time usage due to populating the FieldCache. It also does not
support collation (Tim Sturge, Matt Ericson via Mike McCandless and
Uwe Schindler)
* LUCENE-1296: Add protected method CachingWrapperFilter.docIdSetToCache
to allow subclasses to choose which DocIdSet implementation to use
(Paul Elschot via Mike McCandless)
* LUCENE-1390: Added ASCIIFoldingFilter, a Filter that converts
alphabetic, numeric, and symbolic Unicode characters which are not in
the first 127 ASCII characters (the "Basic Latin" Unicode block) into
their ASCII equivalents, if one exists. ISOLatin1AccentFilter, which
handles a subset of this filter, has been deprecated.
(Andi Vajda, Steven Rowe via Mark Miller)
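The general idea of folding accented characters to ASCII can be approximated with the JDK alone. Note that this sketch swaps in a different technique: it uses java.text.Normalizer decomposition and strips combining marks, whereas ASCIIFoldingFilter's own mapping is not reproduced here. Characters without a canonical decomposition are left unchanged. AccentFoldingSketch is a hypothetical class.

```java
import java.text.Normalizer;

// Hypothetical helper; a rough approximation, not ASCIIFoldingFilter.
final class AccentFoldingSketch {
    // Decomposes characters (NFD) and removes the combining marks that result.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```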
* LUCENE-1478: Added new SortField constructor allowing you to
specify a custom FieldCache parser to generate numeric values from
terms for a field. (Uwe Schindler via Mike McCandless)
* LUCENE-1528: Add support for Ideographic Space to the queryparser.
(Luis Alves via Michael Busch)
* LUCENE-1487: Added FieldCacheTermsFilter, to filter by multiple
terms on single-valued fields. The filter loads the FieldCache
for the field the first time it's called, and subsequent usage of
that field, even with different Terms in the filter, is fast.
(Tim Sturge, Shalin Shekhar Mangar via Mike McCandless).
* LUCENE-1314: Add clone(), clone(boolean readOnly) and
reopen(boolean readOnly) to IndexReader. Cloning an IndexReader
gives you a new reader which you can make changes to (deletions,
norms) without affecting the original reader. Now, with clone or
reopen you can change the readOnly of the original reader. (Jason
Rutherglen, Mike McCandless)
* LUCENE-1506: Added FilteredDocIdSet, an abstract class which you
subclass to implement the "match" method to accept or reject each
docID. Unlike ChainedFilter (under contrib/misc),
FilteredDocIdSet never requires you to materialize the full
bitset. Instead, match() is called on demand per docID. (John
Wang via Mike McCandless)
* LUCENE-1398: Add ReverseStringFilter to contrib/analyzers, a filter
to reverse the characters in each token. (Koji Sekiguchi via yonik)
* LUCENE-1551: Add expert IndexReader.reopen(IndexCommit) to allow
efficiently opening a new reader on a specific commit, sharing
resources with the original reader. (Torin Danil via Mike
McCandless)
* LUCENE-1434: Added org.apache.lucene.util.IndexableBinaryStringTools,
to encode byte[] as String values that are valid terms, and
maintain sort order of the original byte[] when the bytes are
interpreted as unsigned. (Steven Rowe via Mike McCandless)
* LUCENE-1543: Allow MatchAllDocsQuery to optionally use norms from
a specific field to set the score for a document. (Karl Wettin
via Mike McCandless)
* LUCENE-1586: Add IndexReader.getUniqueTermCount(). (Mike
McCandless via Derek)
* LUCENE-1516: Added "near real-time search" to IndexWriter, via a
new expert getReader() method. This method returns a reader that
searches the full index, including any uncommitted changes in the
current IndexWriter session. This should result in a faster
turnaround than the normal approach of committing the changes and
then reopening a reader. (Jason Rutherglen via Mike McCandless)
* LUCENE-1603: Added new MultiTermQueryWrapperFilter, to wrap any
MultiTermQuery as a Filter. Also made some improvements to
MultiTermQuery: return DocIdSet.EMPTY_DOCIDSET if there are no
terms in the enum; track the total number of terms it visited
during rewrite (getTotalNumberOfTerms). FilteredTermEnum is also
more friendly to subclassing. (Uwe Schindler via Mike McCandless)
* LUCENE-1605: Added BitVector.subset(). (Jeremy Volkman via Mike
McCandless)
* LUCENE-1618: Added FileSwitchDirectory that enables files with
specified extensions to be stored in a primary directory and the
rest of the files to be stored in the secondary directory. For
example, this can be useful for the large doc-store (stored
fields, term vectors) files in FSDirectory and the rest of the
index files in a RAMDirectory. (Jason Rutherglen via Mike
McCandless)
* LUCENE-1494: Added FieldMaskingSpanQuery which can be used to
cross-correlate Spans from different fields.
(Paul Cowan and Chris Hostetter)
* LUCENE-1634: Add calibrateSizeByDeletes to LogMergePolicy, to take
deletions into account when considering merges. (Yasuhiro Matsuda
via Mike McCandless)
* LUCENE-1550: Added new n-gram based String distance measure for spell checking.
See the Javadocs for NGramDistance.java for a reference paper on why
this is helpful (Tom Morton via Grant Ingersoll)
* LUCENE-1470, LUCENE-1582, LUCENE-1602, LUCENE-1673, LUCENE-1701, LUCENE-1712:
Added NumericRangeQuery and NumericRangeFilter, a fast alternative to
RangeQuery/RangeFilter for numeric searches. They depend on a specific
structure of terms in the index that can be created by indexing
using the new NumericField or NumericTokenStream classes. NumericField
can only be used for indexing and optionally stores the values as
string representation in the doc store. Documents returned from
IndexReader/IndexSearcher will return only the String value using
the standard Fieldable interface. NumericFields can be sorted on
and loaded into the FieldCache. (Uwe Schindler, Yonik Seeley,
Mike McCandless)
* LUCENE-1405: Added support for Ant resource collections in contrib/ant
<index> task. (Przemyslaw Sztoch via Erik Hatcher)
* LUCENE-1699: Allow setting a TokenStream on Field/Fieldable for indexing
in conjunction with any other ways to specify stored field values,
currently binary or string values. (yonik)
* LUCENE-1701: Made the standard FieldCache.Parsers public and added
parsers for fields generated using NumericField/NumericTokenStream.
All standard parsers now also implement Serializable and enforce
their singleton status. (Uwe Schindler, Mike McCandless)
* LUCENE-1741: User configurable maximum chunk size in MMapDirectory.
On 32 bit platforms, the address space can be very fragmented, so
one big ByteBuffer for the whole file may not fit into address space.
(Eks Dev via Uwe Schindler)
* LUCENE-1644: Enable 4 rewrite modes for queries deriving from
MultiTermQuery (WildcardQuery, PrefixQuery, TermRangeQuery,
NumericRangeQuery): CONSTANT_SCORE_FILTER_REWRITE first creates a
filter and then assigns constant score (boost) to docs;
CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE creates a BooleanQuery but
uses a constant score (boost); SCORING_BOOLEAN_QUERY_REWRITE also
creates a BooleanQuery but keeps the BooleanQuery's scores;
CONSTANT_SCORE_AUTO_REWRITE tries to pick the most performant
constant-score rewrite method. (Mike McCandless)
* LUCENE-1448: Added TokenStream.end(), to perform end-of-stream
operations. This is currently used to fix offset problems when
multiple fields with the same name are added to a document.
(Mike McCandless, Mark Miller, Michael Busch)
* LUCENE-1776: Add an option to not collect payloads for an ordered
SpanNearQuery. Payloads were not lazily loaded in this case as
the javadocs implied. If you have payloads and want to use an ordered
SpanNearQuery that does not need to use the payloads, you can
disable loading them with a new constructor switch. (Mark Miller)
* LUCENE-1341: Added PayloadNearQuery to enable SpanNearQuery functionality
with payloads (Peter Keegan, Grant Ingersoll, Mark Miller)
* LUCENE-1790: Added PayloadTermQuery to enable scoring of payloads
based on the maximum payload seen for a document.
Slight refactoring of Similarity and other payload queries (Grant Ingersoll, Mark Miller)
* LUCENE-1749: Addition of FieldCacheSanityChecker utility, and
hooks to use it in all existing Lucene Tests. This class can
be used by any application to inspect the FieldCache and provide
diagnostic information about the possibility of inconsistent
FieldCache usage. Namely: FieldCache entries for the same field
with different datatypes or parsers; and FieldCache entries for
the same field in both a reader, and one of its (descendant) sub
readers.
(Chris Hostetter, Mark Miller)
* LUCENE-1789: Added utility class
oal.search.function.MultiValueSource to ease the transition to
segment based searching for any apps that directly call
oal.search.function.* APIs. This class wraps any other
ValueSource, but takes care when composite (multi-segment) readers are
passed to not double RAM usage in the FieldCache. (Chris
Hostetter, Mark Miller, Mike McCandless)
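The on-demand matching described for FilteredDocIdSet (LUCENE-1506) above can be illustrated with a minimal sketch. It is not Lucene's class: LazyDocIdFilter and its use of java.util.PrimitiveIterator are assumptions made for illustration only. It shows the idea of calling match() per docID while iterating, instead of materializing a full bitset first.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

/** Hypothetical stand-in for the FilteredDocIdSet idea; not Lucene's API. */
abstract class LazyDocIdFilter implements PrimitiveIterator.OfInt {
    private final PrimitiveIterator.OfInt inner; // underlying docID stream
    private int next;                            // buffered match, if any
    private boolean ready;                       // true when `next` is unconsumed

    LazyDocIdFilter(PrimitiveIterator.OfInt inner) {
        this.inner = inner;
    }

    /** Accept or reject one docID; invoked only as iteration reaches it. */
    protected abstract boolean match(int docId);

    @Override
    public boolean hasNext() {
        // Pull from the inner iterator until a docID is accepted or input ends.
        while (!ready && inner.hasNext()) {
            int candidate = inner.nextInt();
            if (match(candidate)) {
                next = candidate;
                ready = true;
            }
        }
        return ready;
    }

    @Override
    public int nextInt() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        return next;
    }
}
```

A caller subclasses it, implements match(), and iterates; no intermediate set of accepted docIDs is ever built.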
Optimizations
* LUCENE-1427: Fixed QueryWrapperFilter to not waste time computing
scores of the query, since they are just discarded. Also, made it
more efficient (single pass) by not creating & populating an
intermediate OpenBitSet (Paul Elschot, Mike McCandless)
* LUCENE-1443: Performance improvement for OpenBitSetDISI.inPlaceAnd()
(Paul Elschot via yonik)
* LUCENE-1484: Remove synchronization of IndexReader.document() by
using CloseableThreadLocal internally. (Jason Rutherglen via Mike
McCandless).
* LUCENE-1124: Short circuit FuzzyQuery.rewrite when input token length
is small compared to minSimilarity. (Timo Nentwig, Mark Miller)
* LUCENE-1316: MatchAllDocsQuery now avoids the synchronized
IndexReader.isDeleted() call per document, by directly accessing
the underlying deletedDocs BitVector. This improves performance
with non-readOnly readers, especially in a multi-threaded
environment. (Todd Feak, Yonik Seeley, Jason Rutherglen via Mike
McCandless)
* LUCENE-1483: When searching over multiple segments we now visit
each sub-reader one at a time. This speeds up warming, since
FieldCache entries (if required) can be shared across reopens for
those segments that did not change, and also speeds up searches
that sort by relevance or by field values. (Mark Miller, Mike
McCandless)
* LUCENE-1575: The new Collector class decouples collect() from
score computation. Collector.setScorer is called to establish the
current Scorer in-use per segment. Collectors that require the
score should then call Scorer.score() per hit inside
collect(). (Shai Erera via Mike McCandless)
* LUCENE-1596: MultiTermDocs speedup when set with
MultiTermDocs.seek(MultiTermEnum) (yonik)
* LUCENE-1653: Avoid creating a Calendar in every call to
DateTools#dateToString, DateTools#timeToString and
DateTools#round. (Shai Erera via Mark Miller)
* LUCENE-1688: Deprecate static final String stop word array and
replace it with an immutable implementation of CharArraySet.
Removes conversions between Set and array.
(Simon Willnauer via Mark Miller)
* LUCENE-1754: BooleanQuery.queryWeight.scorer() will return null if
it won't match any documents (e.g. if there are no required and
optional scorers, or not enough optional scorers to satisfy
minShouldMatch). (Shai Erera via Mike McCandless)
* LUCENE-1607: To speed up string interning for commonly used
strings, the StringHelper.intern() interface was added with a
default implementation that uses a lockless cache.
(Earwin Burrfoot, yonik)
* LUCENE-1800: QueryParser should use reusable TokenStreams. (yonik)
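The decoupling of collect() from score computation described in LUCENE-1575 above can be sketched as follows. The interface and class names here (SketchCollector, CountingCollector, MaxScoreCollector, CollectorDemo) are hypothetical and much simpler than Lucene's real Collector; the sketch only shows that a collector which never asks for the score causes no scoring work.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical, simplified collector contract; not Lucene's Collector. */
interface SketchCollector {
    void setScorer(DoubleSupplier scorer); // called once per segment
    void collect(int doc);                 // called once per hit
}

/** Counts hits and never asks for a score. */
final class CountingCollector implements SketchCollector {
    int hits;

    @Override public void setScorer(DoubleSupplier scorer) { /* score not needed */ }

    @Override public void collect(int doc) { hits++; }
}

/** Tracks the best score, so it pulls the score inside collect(). */
final class MaxScoreCollector implements SketchCollector {
    private DoubleSupplier scorer;
    double maxScore = Double.NEGATIVE_INFINITY;

    @Override public void setScorer(DoubleSupplier scorer) { this.scorer = scorer; }

    @Override public void collect(int doc) {
        maxScore = Math.max(maxScore, scorer.getAsDouble());
    }
}

final class CollectorDemo {
    /** Feeds docs 0..n-1 to the collector; returns how often a score was computed. */
    static int run(SketchCollector collector, int n) {
        int[] scoreCalls = {0};
        int[] currentDoc = {0};
        // Made-up scoring function: score = doc * 0.5, counted on every call.
        collector.setScorer(() -> { scoreCalls[0]++; return currentDoc[0] * 0.5; });
        for (int doc = 0; doc < n; doc++) {
            currentDoc[0] = doc;
            collector.collect(doc);
        }
        return scoreCalls[0];
    }
}
```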
Documentation
* LUCENE-1908: Scoring documentation improvements in Similarity javadocs.
(Mark Miller, Shai Erera, Ted Dunning, Jiri Kuhn, Marvin Humphrey, Doron Cohen)
* LUCENE-1872: NumericField javadoc improvements
(Michael McCandless, Uwe Schindler)
* LUCENE-1875: Make TokenStream.end javadoc less confusing.
(Uwe Schindler)
* LUCENE-1862: Rectified duplicate package level javadocs for
o.a.l.queryParser and o.a.l.analysis.cn.
(Chris Hostetter)
* LUCENE-1886: Improved hyperlinking in key Analysis javadocs
(Bernd Fondermann via Chris Hostetter)
* LUCENE-1884: massive javadoc and comment cleanup, primarily dealing with
typos.
(Robert Muir via Chris Hostetter)
* LUCENE-1898: Switch changes to use bullets rather than numbers and
update changes-to-html script to handle the new format.
(Steven Rowe, Mark Miller)
* LUCENE-1900: Improve Searchable Javadoc.
(Nadav Har'El, Doron Cohen, Marvin Humphrey, Mark Miller)
* LUCENE-1896: Improve Similarity#queryNorm javadocs.
(Jiri Kuhn, Mark Miller)
Build
* LUCENE-1440: Add new targets to build.xml that allow downloading
and executing the junit testcases from an older release for
backwards-compatibility testing. (Michael Busch)
* LUCENE-1446: Add compatibility tag to common-build.xml and run
backwards-compatibility tests in the nightly build. (Michael Busch)
* LUCENE-1529: Properly test "drop-in" replacement of jar with
backwards-compatibility tests. (Mike McCandless, Michael Busch)
* LUCENE-1851: Change 'javacc' and 'clean-javacc' targets to build
and clean contrib/surround files. (Luis Alves via Michael Busch)
* LUCENE-1854: tar task should use longfile="gnu" to avoid false file
name length warnings. (Mark Miller)
Test Cases
* LUCENE-1791: Enhancements to the QueryUtils and CheckHits utility
classes to wrap IndexReaders and Searchers in MultiReaders or
MultiSearcher when possible to help exercise more edge cases.
(Chris Hostetter, Mark Miller)
* LUCENE-1852: Fix localization test failures.
(Robert Muir via Michael Busch)
* LUCENE-1843: Refactored all tests that use assertAnalyzesTo() & others
in core and contrib to use a new BaseTokenStreamTestCase
base class. Also rewrote some tests to use these general analysis assert
functions instead of their own (e.g. TestMappingCharFilter).
The new base class also tests tokenization with the TokenStream.next()
backwards layer enabled (using Token/TokenWrapper as attribute
implementation) and disabled (default for Lucene 3.0)
(Uwe Schindler, Robert Muir)
* LUCENE-1836: Added a new LocalizedTestCase as base class for localization
junit tests. (Robert Muir, Uwe Schindler via Michael Busch)
======================= Release 2.4.1 =======================
API Changes
1. LUCENE-1186: Add Analyzer.close() to free internal ThreadLocal
resources. (Christian Kohlschütter via Mike McCandless)
Bug fixes
1. LUCENE-1452: Fixed silent data-loss case whereby binary fields are
truncated to 0 bytes during merging if the segments being merged
are non-congruent (same field name maps to different field
numbers). This bug was introduced with LUCENE-1219. (Andrzej
Bialecki via Mike McCandless).
2. LUCENE-1429: Don't throw incorrect IllegalStateException from
IndexWriter.close() if you've hit an OOM when autoCommit is true.
(Mike McCandless)
3. LUCENE-1474: If IndexReader.flush() is called twice when there were
pending deletions, it could lead to later false AssertionError
during IndexReader.open. (Mike McCandless)
4. LUCENE-1430: Fix false AlreadyClosedException from IndexReader.open
(masking an actual IOException) that takes String or File path.
(Mike McCandless)
5. LUCENE-1442: Multiple-valued NOT_ANALYZED fields can double-count
token offsets. (Mike McCandless)
6. LUCENE-1453: Ensure IndexReader.reopen()/clone() does not result in
incorrectly closing the shared FSDirectory. This bug would only
happen if you use IndexReader.open() with a File or String argument.
The returned readers are wrapped by a FilterIndexReader that
correctly handles closing of directory after reopen()/clone().
(Mark Miller, Uwe Schindler, Mike McCandless)
7. LUCENE-1457: Fix possible overflow bugs during binary
searches. (Mark Miller via Mike McCandless)
8. LUCENE-1459: Fix CachingWrapperFilter to not throw exception if
both bits() and getDocIdSet() methods are called. (Matt Jones via
Mike McCandless)
9. LUCENE-1519: Fix int overflow bug during segment merging. (Deepak
via Mike McCandless)
10. LUCENE-1521: Fix int overflow bug when flushing segment.
(Shon Vella via Mike McCandless).
11. LUCENE-1544: Fix deadlock in IndexWriter.addIndexes(IndexReader[]).
(Mike McCandless via Doug Sale)
12. LUCENE-1547: Fix rare thread safety issue if two threads call
IndexWriter commit() at the same time. (Mike McCandless)
13. LUCENE-1465: NearSpansOrdered returns payloads from first possible match
rather than the correct, shortest match; Payloads could be returned even
if the max slop was exceeded; The wrong payload could be returned in
certain situations. (Jonathan Mamou, Greg Shackles, Mark Miller)
14. LUCENE-1186: Add Analyzer.close() to free internal ThreadLocal
resources. (Christian Kohlschütter via Mike McCandless)
15. LUCENE-1552: Fix IndexWriter.addIndexes(IndexReader[]) to properly
rollback IndexWriter's internal state on hitting an
exception. (Scott Garland via Mike McCandless)
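The class of bug named in LUCENE-1457 above (overflow during binary search) is commonly caused by computing the midpoint as (low + high) / 2. The entry does not say how Lucene fixed it; the sketch below shows one widely used remedy, an unsigned right shift, with the hypothetical class name SafeMidpoint.

```java
final class SafeMidpoint {
    /** Overflow-prone: low + high can exceed Integer.MAX_VALUE and wrap negative. */
    static int naiveMid(int low, int high) {
        return (low + high) / 2;
    }

    /** Safe for non-negative inputs: the unsigned shift reads the wrapped sum as unsigned. */
    static int safeMid(int low, int high) {
        return (low + high) >>> 1;
    }

    /** Plain binary search over a sorted array, using the safe midpoint. */
    static int binarySearch(int[] sorted, int key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = safeMid(low, high);
            if (sorted[mid] < key) {
                low = mid + 1;
            } else if (sorted[mid] > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1; // not found
    }
}
```

The problem only shows up when low + high exceeds Integer.MAX_VALUE, which is why such bugs tend to surface only on very large arrays or files.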
======================= Release 2.4.0 =======================
Changes in backwards compatibility policy
1. LUCENE-1340: In a minor change to Lucene's backward compatibility
policy, we are now allowing the Fieldable interface to have
changes, within reason, and made on a case-by-case basis. If an
application implements its own Fieldable, please be aware of
this. Otherwise, no need to be concerned. This is in effect for
all 2.X releases, starting with 2.4. Also note, that in all
likelihood, Fieldable will be changed in 3.0.
Changes in runtime behavior
1. LUCENE-1151: Fix StandardAnalyzer to not mis-identify host names
(eg lucene.apache.org) as an ACRONYM. To get back to the pre-2.4
backwards compatible, but buggy, behavior, you can either call
StandardAnalyzer.setDefaultReplaceInvalidAcronym(false) (static
method), or, set system property
org.apache.lucene.analysis.standard.StandardAnalyzer.replaceInvalidAcronym
to "false" on JVM startup. All StandardAnalyzer instances created
after that will then show the pre-2.4 behavior. Alternatively,
you can call setReplaceInvalidAcronym(false) to change the
behavior per instance of StandardAnalyzer. This backwards
compatibility will be removed in 3.0 (hardwiring the value to
true). (Mike McCandless)
2. LUCENE-1044: IndexWriter with autoCommit=true now commits (such
that a reader can see the changes) far less often than it used to.
Previously, every flush was also a commit. You can always force a
commit by calling IndexWriter.commit(). Furthermore, in 3.0,
autoCommit will be hardwired to false (IndexWriter constructors
that take an autoCommit argument have been deprecated) (Mike
McCandless)
3. LUCENE-1335: IndexWriter.addIndexes(Directory[]) and
addIndexesNoOptimize no longer allow the same Directory instance
to be passed in more than once. Internally, IndexWriter uses
Directory and segment name to uniquely identify segments, so
adding the same Directory more than once was causing duplicates
which led to problems (Mike McCandless)
4. LUCENE-1396: Improve PhraseQuery.toString() so that gaps in the
positions are indicated with a ? and multiple terms at the same
position are joined with a |. (Andrzej Bialecki via Mike
McCandless)
API Changes
1. LUCENE-1084: Changed all IndexWriter constructors to take an
explicit parameter for maximum field size. Deprecated all the
pre-existing constructors; these will be removed in release 3.0.
NOTE: these new constructors set autoCommit to false. (Steven
Rowe via Mike McCandless)
2. LUCENE-584: Changed Filter API to return a DocIdSet instead of a
java.util.BitSet. This allows using more efficient data structures
for Filters and makes them more flexible. This deprecates
Filter.bits(), so all filters that implement this outside
the Lucene code base will need to be adapted. See also the javadocs
of the Filter class. (Paul Elschot, Michael Busch)
3. LUCENE-1044: Added IndexWriter.commit() which flushes any buffered
adds/deletes and then commits a new segments file so readers will
see the changes. Deprecate IndexWriter.flush() in favor of
IndexWriter.commit(). (Mike McCandless)
4. LUCENE-325: Added IndexWriter.expungeDeletes methods, which
consult the MergePolicy to find merges necessary to merge away all
deletes from the index. This should be a somewhat lower cost
operation than optimize. (John Wang via Mike McCandless)
5. LUCENE-1233: Return empty array instead of null when no fields
match the specified name in these methods in Document:
getFieldables, getFields, getValues, getBinaryValues. (Stefan
Trcek via Mike McCandless)
6. LUCENE-1234: Make BoostingSpanScorer protected. (Andi Vajda via Grant Ingersoll)
7. LUCENE-510: The index now stores strings as true UTF-8 bytes
(previously it was Java's modified UTF-8). If any text, either
stored fields or a token, has illegal UTF-16 surrogate characters,
these characters are now silently replaced with the Unicode
replacement character U+FFFD. This is a change to the index file
format. (Marvin Humphrey via Mike McCandless)
8. LUCENE-852: Let the SpellChecker caller specify IndexWriter mergeFactor
and RAM buffer size. (Otis Gospodnetic)
9. LUCENE-1290: Deprecate org.apache.lucene.search.Hits, Hit and HitIterator
and remove all references to these classes from the core. Also update demos
and tutorials. (Michael Busch)
10. LUCENE-1288: Add getVersion() and getGeneration() to IndexCommit.
getVersion() returns the same value that IndexReader.getVersion()
returns when the reader is opened on the same commit. (Jason
Rutherglen via Mike McCandless)
11. LUCENE-1311: Added IndexReader.listCommits(Directory) static
method to list all commits in a Directory, plus IndexReader.open
methods that accept an IndexCommit and open the index as of that
commit. These methods are only useful if you implement a custom
DeletionPolicy that keeps more than the last commit around.
(Jason Rutherglen via Mike McCandless)
12. LUCENE-1325: Added IndexCommit.isOptimized(). (Shalin Shekhar
Mangar via Mike McCandless)
13. LUCENE-1324: Added TokenFilter.reset(). (Shai Erera via Mike
McCandless)
14. LUCENE-1340: Added Fieldable.omitTf() method to skip indexing term
frequency, positions and payloads. This saves index space, and
indexing/searching time. (Eks Dev via Mike McCandless)
15. LUCENE-1219: Add basic reuse API to Fieldable for binary fields:
getBinaryValue/Offset/Length(); currently only lazy fields reuse
the provided byte[] result to getBinaryValue. (Eks Dev via Mike
McCandless)
16. LUCENE-1334: Add new constructor for Term: Term(String fieldName)
which defaults term text to "". (DM Smith via Mike McCandless)
17. LUCENE-1333: Added Token.reinit(*) APIs to re-initialize (reuse) a
Token. Also added term() method to return a String, with a
performance penalty clearly documented. Also implemented
hashCode() and equals() in Token, and fixed all core and contrib
analyzers to use the re-use APIs. (DM Smith via Mike McCandless)
18. LUCENE-1329: Add optional readOnly boolean when opening an
IndexReader. A readOnly reader is not allowed to make changes
(deletions, norms) to the index; in exchange, the isDeleted
method, often a bottleneck when searching with many threads, is
not synchronized. The default for readOnly is still false, but in
3.0 the default will become true. (Jason Rutherglen via Mike
McCandless)
19. LUCENE-1367: Add IndexCommit.isDeleted(). (Shalin Shekhar Mangar
via Mike McCandless)
20. LUCENE-1061: Factored out all "new XXXQuery(...)" in
QueryParser.java into protected methods newXXXQuery(...) so that
subclasses can create their own subclasses of each Query type.
(John Wang via Mike McCandless)
21. LUCENE-753: Added new Directory implementation
org.apache.lucene.store.NIOFSDirectory, which uses java.nio's
FileChannel to do file reads. On most non-Windows platforms, with
many threads sharing a single searcher, this may yield sizable
improvement to query throughput when compared to FSDirectory,
which only allows a single thread to read from an open file at a
time. (Jason Rutherglen via Mike McCandless)
22. LUCENE-1371: Added convenience method TopDocs Searcher.search(Query query, int n).
(Mike McCandless)
23. LUCENE-1356: Allow easy extensions of TopDocCollector by turning
constructor and fields from package to protected. (Shai Erera
via Doron Cohen)
24. LUCENE-1375: Added convenience method IndexCommit.getTimestamp,
which is equivalent to
getDirectory().fileModified(getSegmentsFileName()). (Mike McCandless)
25. LUCENE-1366: Rename Field.Index options to be more accurate:
TOKENIZED becomes ANALYZED; UN_TOKENIZED becomes NOT_ANALYZED;
NO_NORMS becomes NOT_ANALYZED_NO_NORMS and a new ANALYZED_NO_NORMS
is added. (Mike McCandless)
26. LUCENE-1131: Added numDeletedDocs method to IndexReader (Otis Gospodnetic)
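The surrogate handling described in LUCENE-510 above can be illustrated with a small sketch. It is not Lucene's encoder: the class SurrogateSanitizer and its methods are hypothetical, and the standard java.nio charset support does the actual byte encoding. It replaces each unpaired UTF-16 surrogate with U+FFFD and then encodes as true UTF-8, where a supplementary character takes 4 bytes (Java's modified UTF-8 would instead encode it as two 3-byte surrogates).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating the behavior; not Lucene's implementation. */
final class SurrogateSanitizer {
    private static final char REPLACEMENT = (char) 0xFFFD; // Unicode replacement character

    /** Keeps valid surrogate pairs, replaces every unpaired surrogate with U+FFFD. */
    static String replaceUnpairedSurrogates(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean validPair = Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1));
            if (validPair) {
                out.append(c).append(text.charAt(i + 1));
                i++; // skip the low surrogate just copied
            } else if (Character.isSurrogate(c)) {
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Sanitizes the text, then encodes it as standard UTF-8. */
    static byte[] toUtf8(String text) {
        return replaceUnpairedSurrogates(text).getBytes(StandardCharsets.UTF_8);
    }
}
```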
Bug fixes
1. LUCENE-1134: Fixed BooleanQuery.rewrite to only optimize a single
clause query if minNumShouldMatch<=0. (Shai Erera via Michael Busch)
2. LUCENE-1169: Fixed bug in IndexSearcher.search(): searching with
a filter might miss some hits because scorer.skipTo() is called
without checking if the scorer is already at the right position.
scorer.skipTo(scorer.doc()) is not a NOOP, it behaves as
scorer.next(). (Eks Dev, Michael Busch)
3. LUCENE-1182: Added scorePayload to SimilarityDelegator (Andi Vajda via Grant Ingersoll)
4. LUCENE-1213: MultiFieldQueryParser was ignoring slop in case
of a single field phrase. (Trejkaz via Doron Cohen)
5. LUCENE-1228: IndexWriter.commit() was not updating the index version and as
result IndexReader.reopen() failed to sense index changes. (Doron Cohen)
6. LUCENE-1267: Added numDocs() and maxDoc() to IndexWriter;
deprecated docCount(). (Mike McCandless)
7. LUCENE-1274: Added new prepareCommit() method to IndexWriter,
which does phase 1 of a 2-phase commit (commit() does phase 2).
This is needed when you want to update an index as part of a
transaction involving external resources (eg a database). Also
deprecated abort(), renaming it to rollback(). (Mike McCandless)
8. LUCENE-1003: Stop RussianAnalyzer from removing numbers.
(TUSUR OpenTeam, Dmitry Lihachev via Otis Gospodnetic)
9. LUCENE-1152: SpellChecker fix around clearIndex and indexDictionary
methods, plus removal of IndexReader reference.
(Naveen Belkale via Otis Gospodnetic)
10. LUCENE-1046: Removed dead code in SpellChecker
(Daniel Naber via Otis Gospodnetic)
11. LUCENE-1189: Fixed the QueryParser to handle escaped characters within
quoted terms correctly. (Tomer Gabel via Michael Busch)
12. LUCENE-1299: Fixed NPE in SpellChecker when IndexReader is not null and field is null (Grant Ingersoll)
13. LUCENE-1303: Fixed BoostingTermQuery's explanation to be marked as a Match
depending only upon the non-payload score part, regardless of the effect of
the payload on the score. Prior to this, score of a query containing a BTQ
differed from its explanation. (Doron Cohen)
14. LUCENE-1310: Fixed SloppyPhraseScorer to work also for terms repeating more
than twice in the query. (Doron Cohen)
15. LUCENE-1351: ISOLatin1AccentFilter now cleans additional ligatures (Cedrik Lime via Grant Ingersoll)
16. LUCENE-1383: Workaround a nasty "leak" in Java's builtin
ThreadLocal, to prevent Lucene from causing unexpected
OutOfMemoryError in certain situations (notably J2EE
applications). (Chris Lu via Mike McCandless)
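The two-phase commit flow that LUCENE-1274 above enables (prepareCommit() as phase 1, commit() as phase 2, rollback() on failure) can be sketched with a tiny coordinator. Everything here is hypothetical: TwoPhaseResource, RecordingResource and TwoPhaseCoordinator are illustration names, not Lucene classes, and the sketch assumes phase 2 does not fail.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical participant in a two-phase commit; not a Lucene interface. */
interface TwoPhaseResource {
    void prepareCommit() throws Exception; // phase 1: do the risky work, publish nothing
    void commit();                         // phase 2: make the prepared state visible
    void rollback();                       // discard anything prepared
}

/** Stand-in resource that records calls and can be told to fail in phase 1. */
final class RecordingResource implements TwoPhaseResource {
    final List<String> calls = new ArrayList<>();
    private final boolean failOnPrepare;

    RecordingResource(boolean failOnPrepare) {
        this.failOnPrepare = failOnPrepare;
    }

    @Override public void prepareCommit() throws Exception {
        calls.add("prepare");
        if (failOnPrepare) {
            throw new Exception("prepare failed");
        }
    }

    @Override public void commit() { calls.add("commit"); }

    @Override public void rollback() { calls.add("rollback"); }
}

final class TwoPhaseCoordinator {
    /** Commits all resources only if every one of them prepares successfully. */
    static boolean commitAll(List<? extends TwoPhaseResource> resources) {
        try {
            for (TwoPhaseResource r : resources) {
                r.prepareCommit();
            }
        } catch (Exception e) {
            // Any phase-1 failure aborts the whole transaction.
            for (TwoPhaseResource r : resources) {
                r.rollback();
            }
            return false;
        }
        for (TwoPhaseResource r : resources) {
            r.commit();
        }
        return true;
    }
}
```

In an application, one participant would wrap the index writer and another the external resource (for example a database transaction), so that either both become visible or neither does.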
New features
1. LUCENE-1137: Added Token.set/getFlags() accessors for passing more information about a Token through the analysis
process. The flag is not indexed/stored and is thus only used by analysis.
2. LUCENE-1147: Add -segment option to CheckIndex tool so you can
check only a specific segment or segments in your index. (Mike
McCandless)
3. LUCENE-1045: Reopened this issue to add support for short and bytes.
4. LUCENE-584: Added new data structures to o.a.l.util, such as
OpenBitSet and SortedVIntList. These extend DocIdSet and can
directly be used for Filters with the new Filter API. Also changed
the core Filters to use OpenBitSet instead of java.util.BitSet.
(Paul Elschot, Michael Busch)
5. LUCENE-494: Added QueryAutoStopWordAnalyzer to allow for the automatic removal, from a query, of frequently occurring terms.
This Analyzer is not intended for use during indexing. (Mark Harwood via Grant Ingersoll)
6. LUCENE-1044: Change Lucene to properly "sync" files after
committing, to ensure on a machine or OS crash or power cut, even
with cached writes, the index remains consistent. Also added
explicit commit() method to IndexWriter to force a commit without
having to close. (Mike McCandless)
7. LUCENE-997: Add search timeout (partial) support.
A TimeLimitedCollector was added to allow limiting search time.
It is a partial solution since timeout is checked only when
collecting a hit, and therefore a search for rare words in a
huge index might not stop within the specified time.
(Sean Timm via Doron Cohen)
8. LUCENE-1184: Allow SnapshotDeletionPolicy to be re-used across
close/re-open of IndexWriter while still protecting an open
snapshot (Tim Brennan via Mike McCandless)
9. LUCENE-1194: Added IndexWriter.deleteDocuments(Query) to delete
documents matching the specified query. Also added static unlock
and isLocked methods (deprecating the ones in IndexReader). (Mike
McCandless)
10. LUCENE-1201: Add IndexReader.getIndexCommit() method. (Tim Brennan
via Mike McCandless)
11. LUCENE-550: Added InstantiatedIndex implementation. Experimental
Index store similar to MemoryIndex but allows for multiple documents
in memory. (Karl Wettin via Grant Ingersoll)
12. LUCENE-400: Added word based n-gram filter (in contrib/analyzers) called ShingleFilter and an Analyzer wrapper
that wraps another Analyzer's token stream with a ShingleFilter (Sebastian Kirsch, Steve Rowe via Grant Ingersoll)
13. LUCENE-1166: Decomposition tokenfilter for languages like German and Swedish (Thomas Peuss via Grant Ingersoll)
14. LUCENE-1187: ChainedFilter and BooleanFilter now work with new Filter API
and DocIdSetIterator-based filters. Backwards-compatibility with old
BitSet-based filters is ensured. (Paul Elschot via Michael Busch)
15. LUCENE-1295: Added new method to MoreLikeThis for retrieving interesting terms and made retrieveTerms(int) public. (Grant Ingersoll)
16. LUCENE-1298: MoreLikeThis can now accept a custom Similarity (Grant Ingersoll)
17. LUCENE-1297: Allow other string distance measures for the SpellChecker
(Thomas Morton via Otis Gospodnetic)
18. LUCENE-1001: Provide access to Payloads via Spans. All existing Span Query implementations in Lucene implement this. (Mark Miller, Grant Ingersoll)
19. LUCENE-1354: Provide programmatic access to CheckIndex (Grant Ingersoll, Mike McCandless)
20. LUCENE-1279: Add support for Collators to RangeFilter/Query and Query Parser. (Steve Rowe via Grant Ingersoll)
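The "word based n-gram" idea behind ShingleFilter (LUCENE-400 above) can be shown with a minimal sketch. The class ShingleSketch is hypothetical and operates on a plain list of strings; the real filter works on a token stream, and its options and exact output are not described in this entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of word n-grams (shingles); not Lucene's ShingleFilter. */
final class ShingleSketch {
    /** Returns every run of `size` adjacent tokens, joined by a single space. */
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            out.add(String.join(" ", tokens.subList(start, start + size)));
        }
        return out;
    }
}
```

For the tokens "please", "divide", "this", "sentence" and size 2, this yields "please divide", "divide this" and "this sentence".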
Optimizations
1. LUCENE-705: When building a compound file, use
RandomAccessFile.setLength() to tell the OS/filesystem to
pre-allocate space for the file. This may improve fragmentation
in how the CFS file is stored, and allows us to detect an upcoming
disk full situation before actually filling up the disk. (Mike
McCandless)
2. LUCENE-1120: Speed up merging of term vectors by bulk-copying the
raw bytes for each contiguous range of non-deleted documents.
(Mike McCandless)
3. LUCENE-1185: Avoid checking if the TermBuffer 'scratch' in
SegmentTermEnum is null for every call of scanTo().
(Christian Kohlschuetter via Michael Busch)
4. LUCENE-1217: Internal to Field.java, use isBinary instead of
runtime type checking for possible speedup of binaryValue().
(Eks Dev via Mike McCandless)
5. LUCENE-1183: Optimized TRStringDistance class (in contrib/spell) that uses
less memory than the previous version. (Cédrik LIME via Otis Gospodnetic)
6. LUCENE-1195: Improve term lookup performance by adding a LRU cache to the
TermInfosReader. In performance experiments the speedup was about 25% on
average on mid-size indexes with ~500,000 documents for queries with 3
terms and about 7% on larger indexes with ~4.3M documents. (Michael Busch)
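The LRU cache mentioned in LUCENE-1195 above can be illustrated with the standard java.util.LinkedHashMap in access order. This is only a sketch of the caching policy: the class LruCache is hypothetical, the entry does not say how TermInfosReader implements its cache, and this version is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical least-recently-used cache built on LinkedHashMap; not Lucene's code. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCache(int capacity) {
        // accessOrder=true makes get() move an entry to the most-recent position.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry when full.
        return size() > capacity;
    }
}
```

Repeated lookups of the same key then hit the map instead of the slower underlying lookup, while rarely used keys are evicted first.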
Documentation
1. LUCENE-1236: Added some clarifying remarks to EdgeNGram*.java (Hiroaki Kawai via Grant Ingersoll)
2. LUCENE-1157 and LUCENE-1256: HTML changes log, created automatically
from CHANGES.txt. This HTML file is currently visible only via developers page.
(Steven Rowe via Doron Cohen)
3. LUCENE-1349: Fieldable can now be changed without breaking backward compatibility rules (within reason. See the note at
the top of this file and also on Fieldable.java). (Grant Ingersoll)
4. LUCENE-1873: Update documentation to reflect current Contrib area status.
(Steven Rowe, Mark Miller)
Build
1. LUCENE-1153: Added JUnit JAR to new lib directory. Updated build to rely on local JUnit instead of ANT/lib.
2. LUCENE-1202: Small fixes to the way Clover is used to work better
with contribs. Of particular note: a single clover db is used
regardless of whether tests are run globally or in the specific
contrib directories.
3. LUCENE-1353: Javacc target in contrib/miscellaneous for
generating the precedence query parser.
Test Cases
1. LUCENE-1238: Fixed intermittent failures of TestTimeLimitedCollector.testTimeoutMultiThreaded.
Within this fix, "greedy" flag was added to TimeLimitedCollector, to allow the wrapped
collector to collect also the last doc, after the allowed time has passed. (Doron Cohen)
2. LUCENE-1348: relax TestTimeLimitedCollector to not fail due to
timeout exceeded (just because test machine is very busy).
======================= Release 2.3.2 =======================
Bug fixes
1. LUCENE-1191: On hitting OutOfMemoryError in any index-modifying
methods in IndexWriter, do not commit any further changes to the
index to prevent risk of possible corruption. (Mike McCandless)
2. LUCENE-1197: Fixed issue whereby IndexWriter would flush by RAM
too early when TermVectors were in use. (Mike McCandless)
3. LUCENE-1198: Don't corrupt index if an exception happens inside
DocumentsWriter.init (Mike McCandless)
4. LUCENE-1199: Added defensive check for null indexReader before
calling close in IndexModifier.close() (Mike McCandless)
5. LUCENE-1200: Fix rare deadlock case in addIndexes* when
ConcurrentMergeScheduler is in use (Mike McCandless)
6. LUCENE-1208: Fix deadlock case on hitting an exception while
processing a document that had triggered a flush (Mike McCandless)
7. LUCENE-1210: Fix deadlock case on hitting an exception while
starting a merge when using ConcurrentMergeScheduler (Mike McCandless)
8. LUCENE-1222: Fix IndexWriter.doAfterFlush to always be called on
flush (Mark Ferguson via Mike McCandless)
9. LUCENE-1226: Fixed IndexWriter.addIndexes(IndexReader[]) to commit
successfully created compound files. (Michael Busch)
10. LUCENE-1150: Re-expose StandardTokenizer's constants publicly;
this was accidentally lost with LUCENE-966. (Nicolas Lalevée via
Mike McCandless)
11. LUCENE-1262: Fixed bug in BufferedIndexInput.refill whereby on
hitting an exception in readInternal, the buffer is incorrectly
filled with stale bytes such that subsequent calls to readByte()
return incorrect results. (Trejkaz via Mike McCandless)
12. LUCENE-1270: Fixed intermittent case where IndexWriter.close()
would hang after IndexWriter.addIndexesNoOptimize had been
called. (Stu Hood via Mike McCandless)
Build
1. LUCENE-1230: Include *pom.xml* in source release files. (Michael Busch)
======================= Release 2.3.1 =======================
Bug fixes
1. LUCENE-1168: Fixed corruption cases when autoCommit=false and
documents have mixed term vectors (Suresh Guvvala via Mike
McCandless).
2. LUCENE-1171: Fixed some cases where OOM errors could cause
deadlock in IndexWriter (Mike McCandless).
3. LUCENE-1173: Fixed corruption case when autoCommit=false and bulk
merging of stored fields is used (Yonik via Mike McCandless).
4. LUCENE-1163: Fixed bug in CharArraySet.contains(char[] buffer, int
offset, int len) that was ignoring offset and thus giving the
wrong answer. (Thomas Peuss via Mike McCandless)
5. LUCENE-1177: Fix rare case where IndexWriter.optimize might do too
many merges at the end. (Mike McCandless)
6. LUCENE-1176: Fix corruption case when documents with no term
vector fields are added before documents with term vector fields.
(Mike McCandless)
7. LUCENE-1179: Fixed assert statement that was incorrectly
preventing Fields with empty-string field name from working.
(Sergey Kabashnyuk via Mike McCandless)
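Note on LUCENE-1163 (illustrative sketch, not Lucene source): the class of
bug described in item 4 arises when a method accepts (buffer, offset, len)
but indexes the buffer from 0. The sketch below shows a comparison that
honors the offset. The names SliceMatch and equalsSlice are hypothetical
and do not appear in Lucene.

```java
final class SliceMatch {
    /** Returns true if buffer[offset .. offset+len) spells exactly the given word. */
    static boolean equalsSlice(char[] buffer, int offset, int len, String word) {
        if (len != word.length()) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            // The offset must be applied on every access. Reading buffer[i]
            // instead would compare against the start of the buffer.
            if (buffer[offset + i] != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```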
======================= Release 2.3.0 =======================
Changes in runtime behavior
1. LUCENE-994: Defaults for IndexWriter have been changed to maximize
out-of-the-box indexing speed. First, IndexWriter now flushes by
RAM usage (16 MB by default) instead of a fixed doc count (call
IndexWriter.setMaxBufferedDocs to get backwards compatible
behavior). Second, ConcurrentMergeScheduler is used to run merges
using background threads (call IndexWriter.setMergeScheduler(new
SerialMergeScheduler()) to get backwards compatible behavior).
Third, merges are chosen based on size in bytes of each segment
rather than document count of each segment (call
IndexWriter.setMergePolicy(new LogDocMergePolicy()) to get
backwards compatible behavior).
NOTE: users of ParallelReader must change back all of these
defaults in order to ensure the docIDs "align" across all parallel
indices.
(Mike McCandless)
2. LUCENE-1045: SortField.AUTO didn't work with long. When detecting
the field type for sorting automatically, numbers used to be
interpreted as int, then as float, if parsing the number as an int
failed. Now the detection checks for int, then for long,
then for float. (Daniel Naber)
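Note on LUCENE-1045 (illustrative sketch, not Lucene source): the detection
order described in item 2 can be pictured as a chain of parse attempts. The
names AutoSortType, Kind and detect are hypothetical, and the STRING
fallback for non-numeric values is an assumption that the entry above does
not spell out.

```java
final class AutoSortType {
    enum Kind { INT, LONG, FLOAT, STRING }

    /** Tries int first, then long, then float; anything else is treated as a string. */
    static Kind detect(String value) {
        try {
            Integer.parseInt(value);
            return Kind.INT;
        } catch (NumberFormatException e) {
            // not an int, fall through to long
        }
        try {
            Long.parseLong(value);
            return Kind.LONG;
        } catch (NumberFormatException e) {
            // not a long, fall through to float
        }
        try {
            Float.parseFloat(value);
            return Kind.FLOAT;
        } catch (NumberFormatException e) {
            return Kind.STRING;  // assumed fallback, see note above
        }
    }
}
```

With the older int-then-float order, a value such as 3000000000 would have
been classified as a float; with the long check in between it is
classified as a long.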
API Changes
1. LUCENE-843: Added IndexWriter.setRAMBufferSizeMB(...) to have
IndexWriter flush whenever the buffered documents are using more
than the specified amount of RAM. Also added new APIs to Token
that allow one to set a char[] plus offset and length to specify a
token (to avoid creating a new String() for each Token). (Mike
McCandless)
2. LUCENE-963: Add setters to Field to allow for re-using a single
Field instance during indexing. This is a sizable performance
gain, especially for small documents. (Mike McCandless)
3. LUCENE-969: Add new APIs to Token, TokenStream and Analyzer to
permit re-using of Token and TokenStream instances during
indexing. Changed Token to use a char[] as the store for the
termText instead of String. This gives faster tokenization
performance (~10-15%). (Mike McCandless)
4. LUCENE-847: Factored MergePolicy, which determines which merges
should take place and when, as well as MergeScheduler, which
determines when the selected merges should actually run, out of
IndexWriter. The default merge policy is now
LogByteSizeMergePolicy (see LUCENE-845) and the default merge
scheduler is now ConcurrentMergeScheduler (see
LUCENE-870). (Steven Parkes via Mike McCandless)
5. LUCENE-1052: Add IndexReader.setTermInfosIndexDivisor(int) method
that allows you to reduce memory usage of the termInfos by further
sub-sampling (over the termIndexInterval that was used during
indexing) which terms are loaded into memory. (Chuck Williams,
Doug Cutting via Mike McCandless)
6. LUCENE-743: Add IndexReader.reopen() method that re-opens an
existing IndexReader (see New features -> 8.) (Michael Busch)
7. LUCENE-1062: Add setData(byte[] data),
setData(byte[] data, int offset, int length), getData(), getOffset()
and clone() methods to o.a.l.index.Payload. Also add the field name
as arg to Similarity.scorePayload(). (Michael Busch)
8. LUCENE-982: Add IndexWriter.optimize(int maxNumSegments) method to
"partially optimize" an index down to maxNumSegments segments.
(Mike McCandless)
9. LUCENE-1080: Changed Token.DEFAULT_TYPE to be public.
10. LUCENE-1064: Changed TopDocs constructor to be public.
(Shai Erera via Michael Busch)
11. LUCENE-1079: DocValues cleanup: constructor now has no params,
and getInnerArray() now throws UnsupportedOperationException (Doron Cohen)
12. LUCENE-1089: Added PriorityQueue.insertWithOverflow, which returns
the Object (if any) that was bumped from the queue to allow
re-use. (Shai Erera via Mike McCandless)
13. LUCENE-1101: Token reuse 'contract' (defined LUCENE-969)
modified so it is token producer's responsibility
to call Token.clear(). (Doron Cohen)
14. LUCENE-1118: Changed StandardAnalyzer to skip too-long (default >
255 characters) tokens. You can increase this limit by calling
StandardAnalyzer.setMaxTokenLength(...). (Michael McCandless)
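Note on LUCENE-1089 (illustrative sketch, not Lucene source): the idea
behind insertWithOverflow in item 12 is that a size-bounded queue hands
back whichever element did not fit, so the caller can re-use that object.
The sketch below builds this on java.util.PriorityQueue. The class name
BoundedQueue is hypothetical, and returning the rejected element itself
when it does not beat the current least element is an assumption about the
contract.

```java
import java.util.PriorityQueue;

final class BoundedQueue<T extends Comparable<T>> {
    private final int maxSize;
    // Min-heap: the least element is on top and is the first to be bumped.
    private final PriorityQueue<T> heap = new PriorityQueue<>();

    BoundedQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    /**
     * Adds the element if there is room or if it beats the current least
     * element. Returns null if nothing was dropped, otherwise the element
     * that is not in the queue after the call.
     */
    T insertWithOverflow(T element) {
        if (heap.size() < maxSize) {
            heap.add(element);
            return null;
        }
        T least = heap.peek();
        if (least != null && element.compareTo(least) > 0) {
            heap.poll();
            heap.add(element);
            return least;      // bumped from the queue, available for re-use
        }
        return element;        // never entered the queue
    }

    int size() {
        return heap.size();
    }
}
```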
Bug fixes
1. LUCENE-933: QueryParser fixed to not produce empty sub
BooleanQueries "()" even if the Analyzer produced no
tokens for input. (Doron Cohen)
2. LUCENE-955: Fixed SegmentTermPositions to work correctly with the
first term in the dictionary. (Michael Busch)
3. LUCENE-951: Fixed NullPointerException in MultiLevelSkipListReader
that was thrown after a call of TermPositions.seek().
(Rich Johnson via Michael Busch)
4. LUCENE-938: Fixed cases where an unhandled exception in
IndexWriter's methods could cause deletes to be lost.
(Steven Parkes via Mike McCandless)
5. LUCENE-962: Fixed case where an unhandled exception in
IndexWriter.addDocument or IndexWriter.updateDocument could cause
unreferenced files in the index to not be deleted
(Steven Parkes via Mike McCandless)
6. LUCENE-957: RAMDirectory fixed to properly handle directories
larger than Integer.MAX_VALUE. (Doron Cohen)
7. LUCENE-781: MultiReader fixed to not throw NPE if isCurrent(),
isOptimized() or getVersion() is called. Separated MultiReader
into two classes: MultiSegmentReader extends IndexReader, is
package-protected and is created automatically by IndexReader.open()
in case the index has multiple segments. The public MultiReader
now extends MultiSegmentReader and is intended to be used by users
who want to add their own subreaders. (Daniel Naber, Michael Busch)
8. LUCENE-970: FilterIndexReader now implements isOptimized(). Previously,
a call to isOptimized() would throw an NPE. (Michael Busch)
9. LUCENE-832: ParallelReader fixed to not throw NPE if isCurrent(),
isOptimized() or getVersion() is called. (Michael Busch)
10. LUCENE-948: Fix FNFE exception caused by stale NFS client
directory listing caches when writers on different machines are
sharing an index over NFS and using a custom deletion policy (Mike
McCandless)
11. LUCENE-978: Ensure TermInfosReader, FieldsReader, and FieldsReader
close any streams they had opened if an exception is hit in the
constructor. (Ning Li via Mike McCandless)
12. LUCENE-985: If an extremely long term is in a doc (> 16383 chars),
we now throw an IllegalArgumentException saying the term is too
long, instead of cryptic ArrayIndexOutOfBoundsException. (Karl
Wettin via Mike McCandless)
13. LUCENE-991: The explain() method of BoostingTermQuery had errors
when no payloads were present on a document. (Peter Keegan via
Grant Ingersoll)
14. LUCENE-992: Fixed IndexWriter.updateDocument to be atomic again
(this was broken by LUCENE-843). (Ning Li via Mike McCandless)
15. LUCENE-1008: Fixed corruption case when document with no term
vector fields is added after documents with term vector fields.
This bug was introduced with LUCENE-843. (Grant Ingersoll via
Mike McCandless)
16. LUCENE-1006: Fixed QueryParser to accept a "" field value (zero
length quoted string.) (yonik)
17. LUCENE-1010: Fixed corruption case when document with no term
vector fields is added after documents with term vector fields.
This case is hit during merge and would cause an EOFException.
This bug was introduced with LUCENE-984. (Andi Vajda via Mike
McCandless)
19. LUCENE-1009: Fix merge slowdown with LogByteSizeMergePolicy when
autoCommit=false and documents are using stored fields and/or term
vectors. (Mark Miller via Mike McCandless)
20. LUCENE-1011: Fixed corruption case when two or more machines,
sharing an index over NFS, can be writers in quick succession.
(Patrick Kimber via Mike McCandless)
21. LUCENE-1028: Fixed Weight serialization for a few queries:
DisjunctionMaxQuery, ValueSourceQuery, CustomScoreQuery.
Serialization check added for all queries.
(Kyle Maxwell via Doron Cohen)
22. LUCENE-1048: Fixed incorrect behavior in Lock.obtain(...) when the
timeout argument is very large (eg Long.MAX_VALUE). Also added
Lock.LOCK_OBTAIN_WAIT_FOREVER constant to never timeout. (Nikolay
Diakov via Mike McCandless)
23. LUCENE-1050: Throw LockReleaseFailedException in
Simple/NativeFSLockFactory if we fail to delete the lock file when
releasing the lock. (Nikolay Diakov via Mike McCandless)
24. LUCENE-1071: Fixed SegmentMerger to correctly set payload bit in
the merged segment. (Michael Busch)
25. LUCENE-1042: Remove throwing of IOException in getTermFreqVector(int, String, TermVectorMapper) to be consistent
with other getTermFreqVector calls. Also removed the throwing of the other IOException in that method to be consistent. (Karl Wettin via Grant Ingersoll)
26. LUCENE-1096: Fixed Hits behavior when hits' docs are deleted
along with iterating the hits. Deleting docs already retrieved
now works seamlessly. If docs not yet retrieved are deleted
(e.g. from another thread), and then, relying on the initial
Hits.length(), an application attempts to retrieve more hits
than actually exist, a ConcurrentModificationException
is thrown. (Doron Cohen)
27. LUCENE-1068: Changed StandardTokenizer to fix an issue with it marking
the type of some tokens incorrectly. This is done by adding a new flag named
replaceInvalidAcronym which defaults to false, the current, incorrect behavior. Setting
this flag to true fixes the problem. This flag is a temporary fix and is already
marked as being deprecated. 3.x will implement the correct approach. (Shai Erera via Grant Ingersoll)
 LUCENE-1140: Fixed NPE caused by LUCENE-1068 (Alexei Dets via Grant Ingersoll)
28. LUCENE-749: ChainedFilter behavior fixed when logic of
first filter is ANDNOT. (Antonio Bruno via Doron Cohen)
29. LUCENE-508: Make sure SegmentTermEnum.prev() is accurate (= last
term) after next() returns false. (Steven Tamm via Mike
McCandless)
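Note on LUCENE-1048 (illustrative sketch, not Lucene source): item 22 does
not state the root cause, but one plausible way a very large timeout can
misbehave in a polling loop is when the number of polls is narrowed to an
int. The sketch below keeps the count as a long and treats a sentinel as
"never time out". The names PollingLock, Attempt, WAIT_FOREVER and
POLL_INTERVAL_MS, and their values, are hypothetical.

```java
final class PollingLock {
    static final long WAIT_FOREVER = -1L;     // hypothetical sentinel value
    static final long POLL_INTERVAL_MS = 5L;  // hypothetical poll interval

    /** A single non-blocking attempt to take the lock. */
    interface Attempt {
        boolean tryAcquire();
    }

    /** Polls until the attempt succeeds or the timeout elapses. */
    static boolean obtain(Attempt attempt, long timeoutMs) {
        if (timeoutMs < 0 && timeoutMs != WAIT_FOREVER) {
            throw new IllegalArgumentException("timeoutMs must be >= 0 or WAIT_FOREVER");
        }
        // Kept as a long: casting a huge quotient to int would yield a
        // meaningless (possibly negative) poll count.
        long maxPolls = timeoutMs / POLL_INTERVAL_MS;
        long polls = 0;
        while (!attempt.tryAcquire()) {
            if (timeoutMs != WAIT_FOREVER && polls++ >= maxPolls) {
                return false;
            }
            try {
                Thread.sleep(POLL_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt
                return false;
            }
        }
        return true;
    }
}
```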
New features
1. LUCENE-906: Elision filter for French.
(Mathieu Lecarme via Otis Gospodnetic)
2. LUCENE-960: Added a SpanQueryFilter and related classes to allow for
not only filtering, but knowing where in a Document a Filter matches
(Grant Ingersoll)
3. LUCENE-868: Added new Term Vector access features. New callback
mechanism allows application to define how and where to read Term
Vectors from disk. This implementation contains several extensions
of the new abstract TermVectorMapper class. The new API should be
back-compatible. No changes in the actual storage of Term Vectors
has taken place.
3.1 LUCENE-1038: Added setDocumentNumber() method to TermVectorMapper
to provide information about what document is being accessed.
(Karl Wettin via Grant Ingersoll)
4. LUCENE-975: Added PositionBasedTermVectorMapper that allows for
position based lookup of term vector information.
See item #3 above (LUCENE-868).
5. LUCENE-1011: Added simple tools (all in org.apache.lucene.store)
to verify that locking is working properly. LockVerifyServer runs
a separate server to verify locks. LockStressTest runs a simple
tool that rapidly obtains and releases locks.
VerifyingLockFactory is a LockFactory that wraps any other
LockFactory and consults the LockVerifyServer whenever a lock is
obtained or released, throwing an exception if an illegal lock
obtain occurred. (Patrick Kimber via Mike McCandless)
6. LUCENE-1015: Added FieldCache extension (ExtendedFieldCache) to
support doubles and longs. Added support into SortField for sorting
on doubles and longs as well. (Grant Ingersoll)
7. LUCENE-1020: Created basic index checking & repair tool
(o.a.l.index.CheckIndex). When run without -fix it does a
detailed test of all segments in the index and reports summary
information and any errors it hit. With -fix it will remove
segments that had errors. (Mike McCandless)
8. LUCENE-743: Add IndexReader.reopen() method that re-opens an
existing IndexReader by only loading those portions of an index
that have changed since the reader was (re)opened. reopen() can
be significantly faster than open(), depending on the amount of
index changes. SegmentReader, MultiSegmentReader, MultiReader,
and ParallelReader implement reopen(). (Michael Busch)
9. LUCENE-1040: CharArraySet useful for efficiently checking
set membership of text specified by char[]. (yonik)
10. LUCENE-1073: Created SnapshotDeletionPolicy to facilitate taking a
live backup of an index without pausing indexing. (Mike
McCandless)
11. LUCENE-1019: CustomScoreQuery enhanced to support multiple
ValueSource queries. (Kyle Maxwell via Doron Cohen)
12. LUCENE-1095: Added an option to StopFilter to increase
positionIncrement of the token succeeding a stopped token.
Disabled by default. Similar option added to QueryParser
to consider token positions when creating PhraseQuery
and MultiPhraseQuery. Disabled by default (so by default
the query parser ignores position increments).
(Doron Cohen)
13. LUCENE-1380: Added TokenFilter for setting position increment in special cases related to the ShingleFilter (Mck SembWever, Steve Rowe, Karl Wettin via Grant Ingersoll)
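Note on LUCENE-1095 (illustrative sketch, not Lucene source): the option in
item 12 changes what position increment the token after a removed stop
word carries. The sketch below computes the increments for the surviving
tokens with the option on and off. The names StopPositions and increments
are hypothetical, and tokens are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class StopPositions {
    /**
     * Returns the position increment of each token that survives stop-word
     * removal. When enabled, each removed token adds 1 to the increment of
     * the next surviving token; when disabled, every increment is 1.
     */
    static List<Integer> increments(List<String> tokens, Set<String> stopWords, boolean enabled) {
        List<Integer> result = new ArrayList<>();
        int skipped = 0;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;          // remember the gap left by the stop word
                continue;
            }
            result.add(enabled ? 1 + skipped : 1);
            skipped = 0;
        }
        return result;
    }
}
```

With the option disabled the gaps vanish, which is why a phrase query can
match across removed stop words by default.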
Optimizations
1. LUCENE-937: CachingTokenFilter now uses an iterator to access the
Tokens that are cached in the LinkedList. This increases performance
significantly, especially when the number of Tokens is large.
(Mark Miller via Michael Busch)
2. LUCENE-843: Substantial optimizations to improve how IndexWriter
uses RAM for buffering documents and to speed up indexing (2X-8X
faster). A single shared hash table now records the in-memory
postings per unique term and is directly flushed into a single
segment. (Mike McCandless)
3. LUCENE-892: Fixed extra "buffer to buffer copy" that sometimes
takes place when using compound files. (Mike McCandless)
4. LUCENE-959: Remove synchronization in Document (yonik)
5. LUCENE-963: Add setters to Field to allow for re-using a single
Field instance during indexing. This is a sizable performance
gain, especially for small documents. (Mike McCandless)
6. LUCENE-939: Check explicitly for boundary conditions in FieldInfos
and don't rely on exceptions. (Michael Busch)
7. LUCENE-966: Very substantial speedups (~6X faster) for
StandardTokenizer (StandardAnalyzer) by using JFlex instead of
JavaCC to generate the tokenizer.
(Stanislaw Osinski via Mike McCandless)
8. LUCENE-969: Changed core tokenizers & filters to re-use Token and
TokenStream instances when possible to improve tokenization
performance (~10-15%). (Mike McCandless)
9. LUCENE-871: Speedup ISOLatin1AccentFilter (Ian Boston via Mike
McCandless)
10. LUCENE-986: Refactored SegmentInfos from IndexReader into the new
subclass DirectoryIndexReader. SegmentReader and MultiSegmentReader
now extend DirectoryIndexReader and are the only IndexReader
implementations that use SegmentInfos to access an index and
acquire a write lock for index modifications. (Michael Busch)
11. LUCENE-1007: Allow flushing in IndexWriter to be triggered by
either RAM usage or document count or both (whichever comes
first), by adding symbolic constant DISABLE_AUTO_FLUSH to disable
one of the flush triggers. (Ning Li via Mike McCandless)
12. LUCENE-1043: Speed up merging of stored fields by bulk-copying the
raw bytes for each contiguous range of non-deleted documents.
(Robert Engels via Mike McCandless)
13. LUCENE-693: Speed up nested conjunctions (~2x) that match many
documents, and a slight performance increase for top level
conjunctions. (yonik)
14. LUCENE-1098: Make inner class StandardAnalyzer.SavedStreams static
and final. (Nathan Beyer via Michael Busch)
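Note on LUCENE-1043 (illustrative sketch, not Lucene source): bulk copying
as described in item 12 first needs the contiguous runs of non-deleted
documents. The sketch below derives those runs from a bit set of deleted
docIDs; each run could then be copied in one operation. The names
LiveRanges and liveRanges are hypothetical, and java.util.BitSet stands in
for whatever structure tracks deletions.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class LiveRanges {
    /** Returns {start, endExclusive} pairs, one per maximal run of non-deleted docs. */
    static List<int[]> liveRanges(BitSet deleted, int maxDoc) {
        List<int[]> ranges = new ArrayList<>();
        int doc = deleted.nextClearBit(0);          // first live doc
        while (doc < maxDoc) {
            int end = deleted.nextSetBit(doc);      // next deleted doc, or -1
            if (end < 0 || end > maxDoc) {
                end = maxDoc;                       // run extends to the end
            }
            ranges.add(new int[] {doc, end});
            doc = deleted.nextClearBit(end);        // skip over the deleted run
        }
        return ranges;
    }
}
```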
Documentation
1. LUCENE-1051: Generate separate javadocs for core, demo and contrib
classes, as well as a unified view. Also add an appropriate menu
structure to the website. (Michael Busch)
2. LUCENE-746: Fix error message in AnalyzingQueryParser.getPrefixQuery.
(Ronnie Kolehmainen via Michael Busch)
Build
1. LUCENE-908: Improvements and simplifications for how the MANIFEST
file and the META-INF dir are created. (Michael Busch)
2. LUCENE-935: Various improvements for the maven artifacts. Now the
artifacts also include the sources as .jar files. (Michael Busch)
3. Added apply-patch target to top-level build. Defaults to looking for
a patch in ${basedir}/../patches with name specified by -Dpatch.name.
Can also specify any location by -Dpatch.file property on the command
line. This should be helpful for easy application of patches, but it
is also a step towards integrating automatic patch application with
JIRA and Hudson, and is thus subject to change. (Grant Ingersoll)
4. LUCENE-935: Defined property "m2.repository.url" to allow setting
the url to a maven remote repository to deploy to. (Michael Busch)
5. LUCENE-1051: Include javadocs in the maven artifacts. (Michael Busch)
6. LUCENE-1055: Remove gdata-server from build files and its sources
from trunk. (Michael Busch)
7. LUCENE-935: Allow to deploy maven artifacts to a remote m2 repository
via scp and ssh authentication. (Michael Busch)
8. LUCENE-1123: Allow overriding the specification version for
MANIFEST.MF (Michael Busch)
Test Cases
1. LUCENE-766: Test adding two fields with the same name but different
term vector setting. (Nicolas Lalevée via Doron Cohen)
======================= Release 2.2.0 =======================
Changes in runtime behavior
API Changes
1. LUCENE-793: created new exceptions and added them to throws clause
for many methods (all subclasses of IOException for backwards
compatibility): index.StaleReaderException,
index.CorruptIndexException, store.LockObtainFailedException.
This was done to better call out the possible root causes of an
IOException from these methods. (Mike McCandless)
2. LUCENE-811: make SegmentInfos class, plus a few methods from related
classes, package-private again (they were unnecessarily made public
as part of LUCENE-701). (Mike McCandless)
3. LUCENE-710: added optional autoCommit boolean to IndexWriter
constructors. When this is false, index changes are not committed
until the writer is closed. This gives explicit control over when
a reader will see the changes. Also added optional custom
deletion policy to explicitly control when prior commits are
removed from the index. This is intended to allow applications to
share an index over NFS by customizing when prior commits are
deleted. (Mike McCandless)
4. LUCENE-818: changed most public methods of IndexWriter,
IndexReader (and its subclasses), FieldsReader and RAMDirectory to
throw AlreadyClosedException if they are accessed after being
closed. (Mike McCandless)
5. LUCENE-834: Changed some access levels for certain Span classes to allow them
to be overridden. They have been marked expert only and not for public
consumption. (Grant Ingersoll)
6. LUCENE-796: Removed calls to super.* from various get*Query methods in
MultiFieldQueryParser, in order to allow sub-classes to override them.
(Steven Parkes via Otis Gospodnetic)
7. LUCENE-857: Removed caching from QueryFilter and deprecated QueryFilter
in favour of QueryWrapperFilter or QueryWrapperFilter + CachingWrapperFilter
combination when caching is desired.
(Chris Hostetter, Otis Gospodnetic)
8. LUCENE-869: Changed FSIndexInput and FSIndexOutput to inner classes of FSDirectory
to enable extensibility of these classes. (Michael Busch)
9. LUCENE-580: Added the public method reset() to TokenStream. This method does
nothing by default, but may be overridden by subclasses to support consuming
the TokenStream more than once. (Michael Busch)
10. LUCENE-580: Added a new constructor to Field that takes a TokenStream as
argument, available as tokenStreamValue(). This is useful to avoid the need of
"dummy analyzers" for pre-analyzed fields. (Karl Wettin, Michael Busch)
11. LUCENE-730: Added the new methods setAllowDocsOutOfOrder() and
getAllowDocsOutOfOrder() to BooleanQuery. Deprecated the methods setUseScorer14() and
getUseScorer14(). The optimization patch LUCENE-730 (see Optimizations->3.)
improves performance for certain queries but results in scoring out of docid
order. This patch reverses that change, so by default hit docs are now scored
in docid order unless setAllowDocsOutOfOrder(true) is explicitly called.
This patch also enables the tests in QueryUtils again that check for docid
order. (Paul Elschot, Doron Cohen, Michael Busch)
12. LUCENE-888: Added Directory.openInput(File path, int bufferSize)
to optionally specify the size of the read buffer. Also added
BufferedIndexInput.setBufferSize(int) to change the buffer size.
(Mike McCandless)
13. LUCENE-923: Make SegmentTermPositionVector package-private. It does not need
to be public because it implements the public interface TermPositionVector.
(Michael Busch)
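Note on LUCENE-580 (illustrative sketch, not Lucene source): item 9
describes a reset() hook that is a no-op in the base class and that
subclasses may override to allow a second pass over the tokens. The sketch
below shows that pattern with plain strings as tokens. The names
SimpleStream and ReplayableStream are hypothetical and only mimic the
shape of TokenStream.

```java
import java.util.ArrayList;
import java.util.List;

abstract class SimpleStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();

    /** Does nothing by default; subclasses that can rewind override this. */
    void reset() {
    }
}

final class ReplayableStream extends SimpleStream {
    private final List<String> tokens;
    private int pos = 0;

    ReplayableStream(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);  // defensive copy
    }

    @Override
    String next() {
        return pos < tokens.size() ? tokens.get(pos++) : null;
    }

    @Override
    void reset() {
        pos = 0;  // rewind so the same tokens can be consumed again
    }
}
```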
Bug fixes
1. LUCENE-804: Fixed build.xml to pack a fully compilable src dist. (Doron Cohen)
2. LUCENE-813: Leading wildcard fixed to work with trailing wildcard.
Query parser modified to create a prefix query only for the case
that there is a single trailing wildcard (and no additional wildcard
or '?' in the query text). (Doron Cohen)
3. LUCENE-812: Add no-argument constructors to NativeFSLockFactory
and SimpleFSLockFactory. This enables all 4 builtin LockFactory
implementations to be specified via the System property
org.apache.lucene.store.FSDirectoryLockFactoryClass. (Mike McCandless)
4. LUCENE-821: The new single-norm-file introduced by LUCENE-756
failed to reduce the number of open descriptors since it was still
opened once per field with norms. (yonik)
5. LUCENE-823: Make sure internal file handles are closed when
hitting an exception (eg disk full) while flushing deletes in
IndexWriter's mergeSegments, and also during
IndexWriter.addIndexes. (Mike McCandless)
6. LUCENE-825: If directory is removed after
FSDirectory.getDirectory() but before IndexReader.open you now get
a FileNotFoundException like Lucene pre-2.1 (before this fix you
got an NPE). (Mike McCandless)
7. LUCENE-800: Removed backslash from the TERM_CHAR list in the queryparser,
because the backslash is the escape character. Also changed the ESCAPED_CHAR
list to contain all possible characters, because every character that
follows a backslash should be considered as escaped. (Michael Busch)
8. LUCENE-372: QueryParser.parse() now ensures that the entire input string
is consumed. Now a ParseException is thrown if a query contains too many
closing parentheses. (Andreas Neumann via Michael Busch)
9. LUCENE-814: javacc build targets now fix line-end-style of generated files.
Now also deleting all javacc generated files before calling javacc.
(Steven Parkes, Doron Cohen)
10. LUCENE-829: close readers in contrib/benchmark. (Karl Wettin, Doron Cohen)
11. LUCENE-828: Minor fix for Term.equals().
(Paul Cowan via Otis Gospodnetic)
12. LUCENE-846: Fixed: if IndexWriter is opened with autoCommit=false,
and you call addIndexes, and hit an exception (eg disk full) then
when IndexWriter rolls back its internal state this could corrupt
the instance of IndexWriter (but, not the index itself) by
referencing already deleted segments. This bug was only present
in 2.2 (trunk), ie was never released. (Mike McCandless)
13. LUCENE-736: Sloppy phrase query with repeating terms matches wrong docs.
For example query "B C B"~2 matches the doc "A B C D E". (Doron Cohen)
14. LUCENE-789: Fixed: custom similarity is ignored when using MultiSearcher (problem reported
by Alexey Lef). Now the similarity applied by MultiSearcher.setSimilarity(sim) is being used.
Note that as before this fix, creating a MultiSearcher from Searchers for whom custom similarity
was set has no effect - it is masked by the similarity of the MultiSearcher. This is as
designed, because MultiSearcher operates on Searchables (not Searchers). (Doron Cohen)
15. LUCENE-880: Fixed DocumentWriter to close the TokenStreams after it
has written the postings. Then the resources associated with the
TokenStreams can safely be released. (Michael Busch)
16. LUCENE-883: consecutive calls to Spellchecker.indexDictionary()
won't insert terms twice anymore. (Daniel Naber)
17. LUCENE-881: QueryParser.escape() now also escapes the characters
'|' and '&' which are part of the queryparser syntax. (Michael Busch)
18. LUCENE-886: Spellchecker clean up: exceptions are no longer printed to
STDERR and ignored; they are re-thrown instead. Some javadoc improvements.
(Daniel Naber)
19. LUCENE-698: FilteredQuery now takes the query boost into account for
scoring. (Michael Busch)
20. LUCENE-763: Spellchecker: LuceneDictionary used to skip first word in
enumeration. (Christian Mallwitz via Daniel Naber)
21. LUCENE-903: FilteredQuery explanation inaccuracy with boost.
Explanation tests now "deep" check the explanation details.
(Chris Hostetter, Doron Cohen)
22. LUCENE-912: DisjunctionMaxScorer first skipTo(target) call ignores the
skip target param and ends up at the first match.
(Sudaakeran B. via Chris Hostetter & Doron Cohen)
23. LUCENE-913: Two consecutive score() calls return different
scores for Boolean Queries. (Michael Busch, Doron Cohen)
24. LUCENE-1013: Fix IndexWriter.setMaxMergeDocs to work "out of the
box", again, by moving set/getMaxMergeDocs up from
LogDocMergePolicy into LogMergePolicy. This fixes the API
breakage (non backwards compatible change) caused by LUCENE-994.
(Yonik Seeley via Mike McCandless)
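Note on LUCENE-813 (illustrative sketch, not Lucene source): the rule in
item 2 for when a term becomes a prefix query can be written as a small
predicate. The names WildcardShape and isPrefixTerm are hypothetical, and
the sketch ignores backslash escaping, which the real query parser has to
handle.

```java
final class WildcardShape {
    /**
     * True only if the term ends in a single '*' and contains no other
     * '*' or '?' anywhere before it.
     */
    static boolean isPrefixTerm(String term) {
        int n = term.length();
        if (n < 2 || term.charAt(n - 1) != '*') {
            return false;  // no trailing wildcard, or nothing in front of it
        }
        String stem = term.substring(0, n - 1);
        return stem.indexOf('*') < 0 && stem.indexOf('?') < 0;
    }
}
```

Under this rule a term with both a leading and a trailing wildcard is not
a prefix term and is left to the general wildcard handling.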
New features
1. LUCENE-759: Added two n-gram-producing TokenFilters.
(Otis Gospodnetic)
2. LUCENE-822: Added FieldSelector capabilities to Searchable for use with
RemoteSearcher, and other Searchable implementations. (Mark Miller, Grant Ingersoll)
3. LUCENE-755: Added the ability to store arbitrary binary metadata in the posting list.
These metadata are called Payloads. For every position of a Token one Payload in the form
of a variable length byte array can be stored in the prox file.
Remark: The APIs introduced with this feature are in experimental state and thus
contain appropriate warnings in the javadocs.
(Michael Busch)
4. LUCENE-834: Added BoostingTermQuery which can boost scores based on the
values of a payload (see #3 above.) (Grant Ingersoll)
5. LUCENE-834: Similarity has a new method for scoring payloads called
scorePayload that can be overridden to take advantage of payload
storage (see #3 above)
6. LUCENE-834: Added isPayloadAvailable() onto TermPositions interface and
implemented it in the appropriate places (Grant Ingersoll)
7. LUCENE-853: Added RemoteCachingWrapperFilter to enable caching of Filters
on the remote side of the RMI connection.
(Matt Ericson via Otis Gospodnetic)
8. LUCENE-446: Added Solr's search.function for scores based on field
values, plus CustomScoreQuery for simple score (post) customization.
(Yonik Seeley, Doron Cohen)
9. LUCENE-1058: Added new TeeTokenFilter (like the UNIX 'tee' command) and SinkTokenizer which can be used to share tokens between two or more
Fields such that the other Fields do not have to go through the whole Analysis process over again. For instance, if you have two
Fields that share all the same analysis steps except one lowercases tokens and the other does not, you can coordinate the operations
between the two using the TeeTokenFilter and the SinkTokenizer. See TeeSinkTokenTest.java for examples.
(Grant Ingersoll, Michael Busch, Yonik Seeley)
Optimizations
1. LUCENE-761: The proxStream is now cloned lazily in SegmentTermPositions
when nextPosition() is called for the first time. This allows using instances
of SegmentTermPositions instead of SegmentTermDocs without additional costs.
(Michael Busch)
2. LUCENE-431: RAMInputStream and RAMOutputStream extend IndexInput and
IndexOutput directly now. This avoids further buffering and thus avoids
unnecessary array copies. (Michael Busch)
3. LUCENE-730: Updated BooleanScorer2 to make use of BooleanScorer in some
cases and possibly improve scoring performance. Documents can now be
delivered out-of-order as they are scored (e.g. to HitCollector).
N.B. A bit of code had to be disabled in QueryUtils in order for
TestBoolean2 test to keep passing.
(Paul Elschot via Otis Gospodnetic)
4. LUCENE-882: Spellchecker doesn't store the ngrams anymore but only indexes
them to keep the spell index small. (Daniel Naber)
5. LUCENE-430: Delay allocation of the buffer after a clone of BufferedIndexInput.
Together with LUCENE-888 this allows the buffer size to be adjusted
dynamically. (Paul Elschot, Michael Busch)
6. LUCENE-888: Increase buffer sizes inside CompoundFileWriter and
BufferedIndexOutput. Also increase buffer size in
BufferedIndexInput, but only when used during merging. Together,
these increases yield 10-18% overall performance gain vs the
previous 1K defaults. (Mike McCandless)
7. LUCENE-866: Adds multi-level skip lists to the posting lists. This speeds
up most queries that use skipTo(), especially on big indexes with large posting
lists. For average AND queries the speedup is about 20%, for queries that
contain both very frequent and very rare terms the speedup can be over 80%.
(Michael Busch)
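Note on LUCENE-866 (illustrative sketch, not Lucene source): the benefit of
multi-level skipping in item 7 can be seen on a sorted in-memory array of
docIDs. The sketch treats every interval-th entry as a level-0 skip point,
every (interval^2)-th entry as a level-1 skip point, and so on, then walks
from the highest level down before a short linear scan. The names
SkipPostings and skipTo are hypothetical, and the on-disk layout used by
Lucene is not modelled here.

```java
final class SkipPostings {
    /**
     * Returns the index of the first doc >= target at or after 'from',
     * or -1 if there is none. 'docs' must be sorted ascending.
     */
    static int skipTo(int[] docs, int from, int target, int interval, int levels) {
        int pos = from;
        for (int level = levels - 1; level >= 0; level--) {
            int step = 1;
            for (int i = 0; i <= level; i++) {
                step *= interval;                 // spacing of skip points on this level
            }
            int next = (pos / step + 1) * step;   // next skip point after pos
            while (next < docs.length && docs[next] < target) {
                pos = next;                       // safe to jump: still below target
                next += step;
            }
        }
        while (pos < docs.length && docs[pos] < target) {
            pos++;                                // final short linear scan
        }
        return pos < docs.length ? pos : -1;
    }
}
```

Higher levels cover long posting lists in a few large jumps, which is
consistent with the entry's remark that the gain is largest on big indexes
with large posting lists.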
Documentation
1. LUCENE-791 and INFRA-1173: Infrastructure moved the Wiki to
http://wiki.apache.org/lucene-java/ Updated the links in the docs and
wherever else I found references. (Grant Ingersoll, Joe Schaefer)
2. LUCENE-807: Fixed the javadoc for ScoreDocComparator.compare() to be
consistent with java.util.Comparator.compare(): Any integer is allowed to
be returned instead of only -1/0/1.
(Paul Cowan via Michael Busch)
3. LUCENE-875: Solved javadoc warnings & errors under jdk1.4.
Solved javadoc errors under jdk5 (jars in path for gdata).
Made "javadocs" target depend on "build-contrib" for first downloading
contrib jars configured for dynamic download. (Note: when running
behind firewall, a firewall prompt might pop up) (Doron Cohen)
4. LUCENE-740: Added SNOWBALL-LICENSE.txt to the snowball package and a
remark about the license to NOTICE.TXT. (Steven Parkes via Michael Busch)
5. LUCENE-925: Added analysis package javadocs. (Grant Ingersoll and Doron Cohen)
6. LUCENE-926: Added document package javadocs. (Grant Ingersoll)
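Note on LUCENE-807 (illustrative sketch, not Lucene source): the
java.util.Comparator contract referred to in item 2 only gives meaning to
the sign of the returned value, so a comparator may return any integer.
The sketch below returns a raw length difference and still sorts
correctly. The names LengthOrder, BY_LENGTH and sorted are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LengthOrder {
    // Returns the raw length difference, which may be any integer and is
    // not limited to -1, 0 or 1. String lengths are non-negative, so the
    // subtraction cannot overflow.
    static final Comparator<String> BY_LENGTH = (a, b) -> a.length() - b.length();

    /** Returns a copy of the input sorted by ascending length. */
    static List<String> sorted(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(BY_LENGTH);
        return copy;
    }
}
```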
Build
1. LUCENE-802: Added LICENSE.TXT and NOTICE.TXT to Lucene jars.
(Steven Parkes via Michael Busch)
2. LUCENE-885: "ant test" now includes all contrib tests. The new
"ant test-core" target can be used to run only the Core (non
contrib) tests.
(Chris Hostetter)
3. LUCENE-900: "ant test" now enables Java assertions (in Lucene packages).
(Doron Cohen)
4. LUCENE-894: Add custom build file for binary distributions that includes
targets to build the demos. (Chris Hostetter, Michael Busch)
5. LUCENE-904: The "package" targets in build.xml now also generate .md5
checksum files. (Chris Hostetter, Michael Busch)
6. LUCENE-907: Include LICENSE.TXT and NOTICE.TXT in the META-INF dirs of
demo war, demo jar, and the contrib jars. (Michael Busch)
7. LUCENE-909: Demo targets for running the demo. (Doron Cohen)
8. LUCENE-908: Improves content of MANIFEST file and makes it customizable
for the contribs. Adds SNOWBALL-LICENSE.txt to META-INF of the snowball
jar and makes sure that the lucli jar contains LICENSE.txt and NOTICE.txt.
(Chris Hostetter, Michael Busch)
9. LUCENE-930: Various contrib building improvements to ensure contrib
dependencies are met, and test compilation errors fail the build.
(Steven Parkes, Chris Hostetter)
10. LUCENE-622: Add ant target and pom.xml files for building maven artifacts
of the Lucene core and the contrib modules.
(Sami Siren, Karl Wettin, Michael Busch)
======================= Release 2.1.0 =======================
Changes in runtime behavior
1. 's' and 't' have been removed from the list of default stopwords
in StopAnalyzer (also used by StandardAnalyzer). Having e.g. 's'
as a stopword meant that 's-class' led to the same results as 'class'.
Note that this problem still exists for 'a', e.g. in 'a-class' as
'a' continues to be a stopword.
(Daniel Naber)
2. LUCENE-478: Updated the list of Unicode code point ranges for CJK
(now split into CJ and K) in StandardAnalyzer. (John Wang and
Steven Rowe via Otis Gospodnetic)
3. Modified some CJK Unicode code point ranges in StandardTokenizer.jj,
and added a few more of them to increase CJK character coverage.
Also documented some of the ranges.
(Otis Gospodnetic)
4. LUCENE-489: Add support for leading wildcard characters (*, ?) to
QueryParser. Default is to disallow them, as before.
(Steven Parkes via Otis Gospodnetic)
5. LUCENE-703: QueryParser changed to default to use of ConstantScoreRangeQuery
for range queries. Added useOldRangeQuery property to QueryParser to allow
selection of old RangeQuery class if required.
(Mark Harwood)
6. LUCENE-543: WildcardQuery now performs a TermQuery if the provided term
does not contain a wildcard character (? or *), when previously a
StringIndexOutOfBoundsException was thrown.
(Michael Busch via Erik Hatcher)
7. LUCENE-726: Removed the use of deprecated doc.fields() method and
Enumeration.
(Michael Busch via Otis Gospodnetic)
8. LUCENE-436: Removed finalize() in TermInfosReader and SegmentReader,
and added a call to enumerators.remove() in TermInfosReader.close().
The finalize() overrides were added to help with a pre-1.4.2 JVM bug
that has since been fixed, plus we no longer support pre-1.4.2 JVMs.
(Otis Gospodnetic)
9. LUCENE-771: The default location of the write lock is now the
index directory, and is named simply "write.lock" (without a big
digest prefix). Neither the system property "org.apache.lucene.lockDir"
nor "java.io.tmpdir" is used any longer as the global directory
for storing lock files, and the LOCK_DIR field of FSDirectory is
now deprecated. (Mike McCandless)
New features
1. LUCENE-503: New ThaiAnalyzer and ThaiWordFilter in contrib/analyzers
(Samphan Raruenrom via Chris Hostetter)
2. LUCENE-545: New FieldSelector API and associated changes to
IndexReader and implementations. New Fieldable interface for use
with the lazy field loading mechanism. (Grant Ingersoll and Chuck
Williams via Grant Ingersoll)
3. LUCENE-676: Move Solr's PrefixFilter to Lucene core. (Yura
Smolsky, Yonik Seeley)
4. LUCENE-678: Added NativeFSLockFactory, which implements locking
using OS native locking (via java.nio.*). (Michael McCandless via
Yonik Seeley)
5. LUCENE-544: Added the ability to specify different boosts for
different fields when using MultiFieldQueryParser (Matt Ericson
via Otis Gospodnetic)
6. LUCENE-528: New IndexWriter.addIndexesNoOptimize() that doesn't
optimize the index when adding new segments, only performing
merges as needed. (Ning Li via Yonik Seeley)
7. LUCENE-573: QueryParser now allows backslash escaping in
quoted terms and phrases. (Michael Busch via Yonik Seeley)
8. LUCENE-716: QueryParser now allows specification of Unicode
characters in terms via a unicode escape of the form \uXXXX
(Michael Busch via Yonik Seeley)
9. LUCENE-709: Added RAMDirectory.sizeInBytes(), IndexWriter.ramSizeInBytes()
and IndexWriter.flushRamSegments(), allowing applications to
control the amount of memory used to buffer documents.
(Chuck Williams via Yonik Seeley)
10. LUCENE-723: QueryParser now parses *:* as MatchAllDocsQuery
(Yonik Seeley)
11. LUCENE-741: Command-line utility for modifying or removing norms
on fields in an existing index. This is mostly based on LUCENE-496
and lives in contrib/miscellaneous.
(Chris Hostetter, Otis Gospodnetic)
12. LUCENE-759: Added NGramTokenizer and EdgeNGramTokenizer classes and
their passing unit tests.
(Otis Gospodnetic)
13. LUCENE-565: Added methods to IndexWriter to more efficiently
handle updating documents (the "delete then add" use case). This
is intended to be an eventual replacement for the existing
IndexModifier. Added IndexWriter.flush() (renamed from
flushRamSegments()) to flush all pending updates (held in RAM), to
the Directory. (Ning Li via Mike McCandless)
14. LUCENE-762: Added SIZE and SIZE_AND_BREAK FieldSelectorResult options
which allow one to retrieve the size of a field without retrieving the
actual field. (Chuck Williams via Grant Ingersoll)
15. LUCENE-799: Properly handle lazy, compressed fields.
(Mike Klaas via Grant Ingersoll)
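The escaping rules from LUCENE-573 and LUCENE-716 can be pictured with a small stand-alone routine. This is a sketch under assumptions, not QueryParser's implementation: the class EscapeSketch and its unescape() method are hypothetical, and the sketch treats every backslash followed by 'u' as the start of a four-digit hexadecimal escape.

```java
/** Hypothetical stand-alone model of backslash and unicode unescaping. */
class EscapeSketch {
    /**
     * Replaces a backslash followed by 'u' and four hex digits with the
     * corresponding character, and any other backslash-escaped character
     * with the character itself.
     */
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
            } else if (i + 1 >= input.length()) {
                throw new IllegalArgumentException("Term cannot end with a lone backslash");
            } else if (input.charAt(i + 1) == 'u') {
                // The escape spans six characters: backslash, 'u', four hex digits.
                if (i + 6 > input.length()) {
                    throw new IllegalArgumentException("Truncated unicode escape");
                }
                int code = 0;
                for (int k = i + 2; k < i + 6; k++) {
                    int digit = Character.digit(input.charAt(k), 16);
                    if (digit < 0) {
                        throw new IllegalArgumentException("Non-hex digit in unicode escape");
                    }
                    code = code * 16 + digit;
                }
                out.append((char) code);
                i += 6;
            } else {
                // Plain escape: keep the escaped character literally.
                out.append(input.charAt(i + 1));
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("caf\\u00e9"));
        System.out.println(unescape("a\\:b"));
    }
}
```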
API Changes
1. LUCENE-438: Remove "final" from Token, implement Cloneable, allow
changing of termText via setTermText(). (Yonik Seeley)
2. org.apache.lucene.analysis.nl.WordlistLoader has been deprecated
and is supposed to be replaced with the WordlistLoader class in
package org.apache.lucene.analysis (Daniel Naber)
3. LUCENE-609: Revert return type of Document.getField(s) to Field
for backward compatibility, added new Document.getFieldable(s)
for access to new lazy loaded fields. (Yonik Seeley)
4. LUCENE-608: Document.fields() has been deprecated and a new method
Document.getFields() has been added that returns a List instead of
an Enumeration (Daniel Naber)
5. LUCENE-605: New Explanation.isMatch() method and new ComplexExplanation
subclass allows explain methods to produce Explanations which model
"matching" independent of having a positive value.
(Chris Hostetter)
6. LUCENE-621: New static methods IndexWriter.setDefaultWriteLockTimeout
and IndexWriter.setDefaultCommitLockTimeout for overriding default
timeout values for all future instances of IndexWriter (as well
as for any other classes that may reference the static values,
ie: IndexReader).
(Michael McCandless via Chris Hostetter)
7. LUCENE-638: FSDirectory.list() now only returns the directory's
Lucene-related files. Thanks to this change one can now construct
a RAMDirectory from a file system directory that contains files
not related to Lucene.
(Simon Willnauer via Daniel Naber)
8. LUCENE-635: Decoupling locking implementation from Directory
implementation. Added set/getLockFactory to Directory and moved
all locking code into subclasses of abstract class LockFactory.
FSDirectory and RAMDirectory still default to their prior locking
implementations, but now you can mix & match, for example using
SingleInstanceLockFactory (ie, in-memory locking) with an
FSDirectory. Note that now you must call setDisableLocks before
instantiating an FSDirectory if you wish to disable locking
for that Directory.
(Michael McCandless, Jeff Patterson via Yonik Seeley)
9. LUCENE-657: Made FuzzyQuery non-final and inner ScoreTerm protected.
(Steven Parkes via Otis Gospodnetic)
10. LUCENE-701: Lockless commits: a commit lock is no longer required
when a writer commits and a reader opens the index. This includes
a change to the index file format (see docs/fileformats.html for
details). It also removes all APIs associated with the commit
lock & its timeout. Readers are now truly read-only and do not
block one another on startup. This is the first step to getting
Lucene to work correctly over NFS (second step is
LUCENE-710). (Mike McCandless)
11. LUCENE-722: DEFAULT_MIN_DOC_FREQ was misspelled DEFALT_MIN_DOC_FREQ
in Similarity's MoreLikeThis class. The misspelling has been
replaced by the correct spelling.
(Andi Vajda via Daniel Naber)
12. LUCENE-738: Reduce the size of the file that keeps track of which
documents are deleted when the number of deleted documents is
small. This changes the index file format; the new format cannot be
read by previous versions of Lucene. (Doron Cohen via Yonik Seeley)
13. LUCENE-756: Maintain all norms in a single .nrm file to reduce the
number of open files and file descriptors for the non-compound index
format. This changes the index file format, but maintains the
ability to read and update older indices. The first segment merge
on an older format index will create a single .nrm file for the new
segment. (Doron Cohen via Yonik Seeley)
14. LUCENE-732: DateTools support has been added to QueryParser, with
setters for both the default Resolution, and per-field Resolution.
For backwards compatibility, DateField is still used if no Resolutions
are specified. (Michael Busch via Chris Hostetter)
15. Added isOptimized() method to IndexReader.
(Otis Gospodnetic)
16. LUCENE-773: Deprecate the FSDirectory.getDirectory(*) methods that
take a boolean "create" argument. Instead you should use
IndexWriter's "create" argument to create a new index.
(Mike McCandless)
17. LUCENE-780: Add a static Directory.copy() method to copy files
from one Directory to another. (Jiri Kuhn via Mike McCandless)
18. LUCENE-773: Added Directory.clearLock(String name) to forcefully
remove an old lock. The default implementation is to ask the
lockFactory (if non null) to clear the lock. (Mike McCandless)
19. LUCENE-795: Directory.renameFile() has been deprecated as it is
not used anymore inside Lucene. (Daniel Naber)
Bug fixes
1. Fixed the web application demo (built with "ant war-demo") which
didn't work because it used a QueryParser method that had
been removed (Daniel Naber)
2. LUCENE-583: ISOLatin1AccentFilter fails to preserve positionIncrement
(Yonik Seeley)
3. LUCENE-575: SpellChecker min score is incorrectly changed by suggestSimilar
(Karl Wettin via Yonik Seeley)
4. LUCENE-587: Explanation.toHtml was producing malformed HTML
(Chris Hostetter)
5. Fix to allow MatchAllDocsQuery to be used with RemoteSearcher (Yonik Seeley)
6. LUCENE-601: RAMDirectory and RAMFile made Serializable
(Karl Wettin via Otis Gospodnetic)
7. LUCENE-557: Fixes to BooleanQuery and FilteredQuery so that the score
Explanations match up with the real scores.
(Chris Hostetter)
8. LUCENE-607: ParallelReader's TermEnum fails to advance properly to
new fields (Chuck Williams, Christian Kohlschuetter via Yonik Seeley)
9. LUCENE-610,LUCENE-611: Simple syntax changes to allow compilation with ecj:
disambiguate inner class scorer's use of doc() in BooleanScorer2,
other test code changes. (DM Smith via Yonik Seeley)
10. LUCENE-451: All core query types now use ComplexExplanations so that
boosts of zero don't confuse the BooleanWeight explain method.
(Chris Hostetter)
11. LUCENE-593: Fixed LuceneDictionary's inner Iterator
(Kåre Fiedler Christiansen via Otis Gospodnetic)
12. LUCENE-641: fixed an off-by-one bug with IndexWriter.setMaxFieldLength()
(Daniel Naber)
13. LUCENE-659: Make PerFieldAnalyzerWrapper delegate getPositionIncrementGap()
to the correct analyzer for the field. (Chuck Williams via Yonik Seeley)
14. LUCENE-650: Fixed NPE in Locale specific String Sort when Document
has no value.
(Oliver Hutchison via Chris Hostetter)
15. LUCENE-683: Fixed data corruption when reading lazy loaded fields.
(Yonik Seeley)
16. LUCENE-678: Fixed bug in NativeFSLockFactory which caused the same
lock to be shared between different directories.
(Michael McCandless via Yonik Seeley)
17. LUCENE-690: Fixed thread unsafe use of IndexInput by lazy loaded fields.
(Yonik Seeley)
18. LUCENE-696: Fix bug when scorer for DisjunctionMaxQuery has skipTo()
called on it before next(). (Yonik Seeley)
19. LUCENE-569: Fixed SpanNearQuery bug, for 'inOrder' queries it would fail
to recognize ordered spans if they overlapped with unordered spans.
(Paul Elschot via Chris Hostetter)
20. LUCENE-706: Updated fileformats.xml|html concerning the docdelta value
in the frequency file. (Johan Stuyts, Doron Cohen via Grant Ingersoll)
21. LUCENE-715: Fixed private constructor in IndexWriter.java to
properly release the acquired write lock if there is an
IOException after acquiring the write lock but before finishing
instantiation. (Matthew Bogosian via Mike McCandless)
22. LUCENE-651: Multiple different threads requesting the same
FieldCache entry (often for Sorting by a field) at the same
time caused multiple generations of that entry, which was
detrimental to performance and memory use.
(Oliver Hutchison via Otis Gospodnetic)
23. LUCENE-717: Fixed build.xml not to fail when there is no lib dir.
(Doron Cohen via Otis Gospodnetic)
24. LUCENE-728: Removed duplicate/old MoreLikeThis and SimilarityQueries
classes from contrib/similarity, as their new home is under
contrib/queries.
(Otis Gospodnetic)
25. LUCENE-669: Do not double-close the RandomAccessFile in
FSIndexInput/Output during finalize(). Besides sending an
IOException up to the GC, this may also be the cause of intermittent
"The handle is invalid" IOExceptions on Windows when trying to
close readers or writers. (Michael Busch via Mike McCandless)
26. LUCENE-702: Fix IndexWriter.addIndexes(*) to not corrupt the index
on any exceptions (eg disk full). The semantics of these methods
are now transactional: either all indices are merged or none are.
Also fixed IndexWriter.mergeSegments (called outside of
addIndexes(*) by addDocument, optimize, flushRamSegments) and
IndexReader.commit() (called by close) to clean up and keep the
instance state consistent with what's actually in the index. (Mike
McCandless)
27. LUCENE-129: Change finalizers to do "try {...} finally
{super.finalize();}" to make sure we don't miss finalizers in
classes above us. (Esmond Pitt via Mike McCandless)
28. LUCENE-754: Fix a problem introduced by LUCENE-651, causing
IndexReaders to hang around forever, in addition to not
fixing the original FieldCache performance problem.
(Chris Hostetter, Yonik Seeley)
29. LUCENE-140: Fix IndexReader.deleteDocument(int docNum) to
correctly raise ArrayIndexOutOfBoundsException when docNum is too
large. Previously, if docNum was only slightly too large (within
the same multiple of 8, ie, up to 7 ints beyond maxDoc), no
exception would be raised and instead the index would become
silently corrupted. The corruption then only appears much later,
in mergeSegments, when the corrupted segment is merged with
segment(s) after it. (Mike McCandless)
30. LUCENE-768: Fix case where an Exception during deleteDocument,
undeleteAll or setNorm in IndexReader could leave the reader in a
state where close() fails to release the write lock.
(Mike McCandless)
31. Remove "tvp" from known index file extensions because it is
never used. (Nicolas Lalevée via Bernhard Messer)
32. LUCENE-767: Change how SegmentReader.maxDoc() is computed to not
rely on file length check and instead use the SegmentInfo's
docCount that's already stored explicitly in the index. This is a
defensive bug fix (ie, there is no known problem seen "in real
life" due to this, just a possible future problem). (Chuck
Williams via Mike McCandless)
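The failure mode described in LUCENE-140 can be reproduced with a small byte-backed bit set. The sketch below is hypothetical and self-contained (BitVectorBoundsSketch is not a Lucene class); it assumes the bits are stored eight per byte in an array of (size >> 3) + 1 bytes, which is why an index slightly past the end lands in allocated memory instead of raising an exception.

```java
/** Hypothetical byte-backed bit set showing why an explicit bounds check is needed. */
class BitVectorBoundsSketch {
    private final byte[] bits;
    private final int size;

    BitVectorBoundsSketch(int size) {
        this.size = size;
        this.bits = new byte[(size >> 3) + 1]; // eight bits per byte
    }

    /** Unchecked: only fails once the index falls outside the byte array. */
    void setUnchecked(int bit) {
        bits[bit >> 3] |= 1 << (bit & 7);
    }

    /** Checked: rejects every index outside [0, size). */
    void setChecked(int bit) {
        if (bit < 0 || bit >= size) {
            throw new ArrayIndexOutOfBoundsException(bit);
        }
        bits[bit >> 3] |= 1 << (bit & 7);
    }

    /** Reads a bit straight from the backing array, without a size check. */
    boolean getRaw(int bit) {
        return (bits[bit >> 3] & (1 << (bit & 7))) != 0;
    }

    public static void main(String[] args) {
        BitVectorBoundsSketch deleted = new BitVectorBoundsSketch(10);
        deleted.setUnchecked(12); // accepted silently although 12 >= size
        System.out.println(deleted.getRaw(12));
    }
}
```

With size 10 the backing array has two bytes, so the unchecked variant accepts indexes 10 to 15 and only throws from 16 onwards; the checked variant rejects all of them.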
Optimizations
1. LUCENE-586: TermDocs.skipTo() is now more efficient for
multi-segment indexes. This will improve the performance of many
types of queries against a non-optimized index. (Andrew Hudson
via Yonik Seeley)
2. LUCENE-623: RAMDirectory.close now nulls out its reference to all
internal "files", allowing them to be GCed even if references to the
RAMDirectory itself still exist. (Nadav Har'El via Chris Hostetter)
3. LUCENE-629: Compressed fields are no longer uncompressed and
recompressed during segment merges (e.g. during indexing or
optimizing), thus improving performance. (Michael Busch via Otis
Gospodnetic)
4. LUCENE-388: Improve indexing performance when maxBufferedDocs is
large by keeping a count of buffered documents rather than
counting after each document addition. (Doron Cohen, Paul Smith,
Yonik Seeley)
5. Modified TermScorer.explain to use TermDocs.skipTo() instead of
looping through docs. (Grant Ingersoll)
6. LUCENE-672: New indexing segment merge policy flushes all
buffered docs to their own segment and delays a merge until
mergeFactor segments of a certain level have been accumulated.
This increases indexing performance in the presence of deleted
docs or partially full segments as well as enabling future
optimizations.
NOTE: this also fixes an "under-merging" bug whereby it is
possible to get far too many segments in your index (which will
drastically slow down search, risks exhausting file descriptor
limit, etc.). This can happen when the number of buffered docs
at close, plus the number of docs in the last non-ram segment is
greater than mergeFactor. (Ning Li, Yonik Seeley)
7. Lazy loaded fields unnecessarily retained an extra copy of loaded
String data. (Yonik Seeley)
8. LUCENE-443: ConjunctionScorer performance increase. Speed up
any BooleanQuery with more than one mandatory clause.
(Abdul Chaudhry, Paul Elschot via Yonik Seeley)
9. LUCENE-365: DisjunctionSumScorer performance increase of
~30%. Speeds up queries with optional clauses. (Paul Elschot via
Yonik Seeley)
10. LUCENE-695: Optimized BufferedIndexInput.readBytes() for medium
size buffers, which will speed up merging and retrieving binary
and compressed fields. (Nadav Har'El via Yonik Seeley)
11. LUCENE-687: Lazy skipping on proximity file speeds up most
queries involving term positions, including phrase queries.
(Michael Busch via Yonik Seeley)
12. LUCENE-714: Replaced 2 cases of manual for-loop array copying
with calls to System.arraycopy instead, in DocumentWriter.java.
(Nicolas Lalevee via Mike McCandless)
13. LUCENE-729: Non-recursive skipTo and next implementation of
TermDocs for a MultiReader. The old implementation could
recurse up to the number of segments in the index. (Yonik Seeley)
14. LUCENE-739: Improve segment merging performance by reusing
the norm array across different fields and doing bulk writes
of norms of segments with no deleted docs.
(Michael Busch via Yonik Seeley)
15. LUCENE-745: Add BooleanQuery.clauses(), allowing direct access
to the List of clauses and replaced the internal synchronized Vector
with an unsynchronized List. (Yonik Seeley)
16. LUCENE-750: Remove finalizers from FSIndexOutput and move the
FSIndexInput finalizer to the actual file so all clones don't
register a new finalizer. (Yonik Seeley)
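The kind of change described in LUCENE-714 looks roughly as follows. This is an illustrative sketch with made-up names, not the DocumentWriter code: both methods produce the same result, but the second delegates the copy to System.arraycopy.

```java
import java.util.Arrays;

/** Hypothetical before/after comparison of manual copying and System.arraycopy. */
class ArrayCopySketch {
    /** Copies length bytes starting at offset with a manual for-loop. */
    static byte[] copyWithLoop(byte[] src, int offset, int length) {
        byte[] dest = new byte[length];
        for (int i = 0; i < length; i++) {
            dest[i] = src[offset + i];
        }
        return dest;
    }

    /** Copies the same range with a single bulk call. */
    static byte[] copyWithArraycopy(byte[] src, int offset, int length) {
        byte[] dest = new byte[length];
        System.arraycopy(src, offset, dest, 0, length);
        return dest;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5};
        System.out.println(Arrays.equals(copyWithLoop(src, 1, 3), copyWithArraycopy(src, 1, 3)));
    }
}
```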
Test Cases
1. Added TestTermScorer.java (Grant Ingersoll)
2. Added TestWindowsMMap.java (Benson Margulies via Mike McCandless)
3. LUCENE-744: Append the user.name property onto the temporary directory
that is created so it doesn't interfere with other users. (Grant Ingersoll)
Documentation
1. Added style sheet to xdocs named lucene.css and included in the
Anakia VSL descriptor. (Grant Ingersoll)
2. Added scoring.xml document into xdocs. Updated Similarity.java
scoring formula. (Grant Ingersoll and Steve Rowe. Updates from:
Michael McCandless, Doron Cohen, Chris Hostetter, Doug Cutting).
Issue 664.
3. Added javadocs for FieldSelectorResult.java. (Grant Ingersoll)
4. Moved xdocs directory to src/site/src/documentation/content/xdocs per
Issue 707. Site now builds using Forrest, just like the other Lucene
siblings. See http://wiki.apache.org/jakarta-lucene/HowToUpdateTheWebsite
for info on updating the website. (Grant Ingersoll with help from Steve Rowe,
Chris Hostetter, Doug Cutting, Otis Gospodnetic, Yonik Seeley)
5. Added in Developer and System Requirements sections under Resources (Grant Ingersoll)
6. LUCENE-713: Updated the Term Vector section of File Formats to include
documentation on how Offset and Position info are stored in the TVF file.
(Grant Ingersoll, Samir Abdou)
7. Added a link to Clover Test Code Coverage Reports under the Develop
section in Resources (Grant Ingersoll)
8. LUCENE-748: Added details for semantics of IndexWriter.close on
hitting an Exception. (Jed Wesley-Smith via Mike McCandless)
9. Added some text about what is contained in releases.
(Eric Haszlakiewicz via Grant Ingersoll)
10. LUCENE-758: Fix javadoc to clarify that RAMDirectory(Directory)
makes a full copy of the starting Directory. (Mike McCandless)
11. LUCENE-764: Fix javadocs to detail temporary space requirements
for IndexWriter's optimize(), addIndexes(*) and addDocument(...)
methods. (Mike McCandless)
Build
1. Added in clover test code coverage per http://issues.apache.org/jira/browse/LUCENE-721
To enable clover code coverage, you must have clover.jar in the ANT
classpath and specify -Drun.clover=true on the command line.
(Michael Busch and Grant Ingersoll)
2. Added a sysproperty in common-build.xml per LUCENE-752 to map java.io.tmpdir to
${build.dir}/test just like the tempDir sysproperty.
3. LUCENE-757: Added new target named init-dist that does setup for
distribution of both binary and source distributions. Called by package
and package-*-src.
======================= Release 2.0.0 =======================
API Changes
1. All deprecated methods and fields have been removed, except
DateField, which will still be supported for some time
so Lucene can read its date fields from old indexes
(Yonik Seeley & Grant Ingersoll)
2. DisjunctionSumScorer is no longer public.
(Paul Elschot via Otis Gospodnetic)
3. Creating a Field with both an empty name and an empty value
now throws an IllegalArgumentException
(Daniel Naber)
4. LUCENE-301: Added new IndexWriter({String,File,Directory},
Analyzer) constructors that do not take a boolean "create"
argument. These new constructors will create a new index if
necessary, else append to the existing one. (Dan Armbrust via
Mike McCandless)
New features
1. LUCENE-496: Command line tool for modifying the field norms of an
existing index; added to contrib/miscellaneous. (Chris Hostetter)
2. LUCENE-577: SweetSpotSimilarity added to contrib/miscellaneous.
(Chris Hostetter)
Bug fixes
1. LUCENE-330: Fix issue of FilteredQuery not working properly within
BooleanQuery. (Paul Elschot via Erik Hatcher)
2. LUCENE-515: Make ConstantScoreRangeQuery and ConstantScoreQuery work
with RemoteSearchable. (Philippe Laflamme via Yonik Seeley)
3. Added methods to get/set writeLockTimeout and commitLockTimeout in
IndexWriter. These could be set in Lucene 1.4 using a system property.
This feature had been removed without adding the corresponding
getter/setter methods. (Daniel Naber)
4. LUCENE-413: Fixed ArrayIndexOutOfBoundsExceptions
when using SpanQueries. (Paul Elschot via Yonik Seeley)
5. Implemented FilterIndexReader.getVersion() and isCurrent()
(Yonik Seeley)
6. LUCENE-540: Fixed a bug with IndexWriter.addIndexes(Directory[])
that sometimes caused the index order of documents to change.
(Yonik Seeley)
7. LUCENE-526: Fixed a bug in FieldSortedHitQueue that caused
subsequent String sorts with different locales to sort identically.
(Paul Cowan via Yonik Seeley)
8. LUCENE-541: Add missing extractTerms() to DisjunctionMaxQuery
(Stefan Will via Yonik Seeley)
9. LUCENE-514: Added getTermArrays() and extractTerms() to
MultiPhraseQuery (Eric Jain & Yonik Seeley)
10. LUCENE-512: Fixed ClassCastException in ParallelReader.getTermFreqVectors
(frederic via Yonik)
11. LUCENE-352: Fixed bug in SpanNotQuery that manifested as
NullPointerException when "exclude" query was not a SpanTermQuery.
(Chris Hostetter)
12. LUCENE-572: Fixed bug in SpanNotQuery hashCode, was ignoring exclude clause
(Chris Hostetter)
13. LUCENE-561: Fixed some ParallelReader bugs: a NullPointerException if the reader
didn't know about the field yet, the reader not keeping track of whether it had
deletions, and deleteDocument calls that could circumvent synchronization on the
subreaders. (Chuck Williams via Yonik Seeley)
14. LUCENE-556: Added empty extractTerms() implementation to MatchAllDocsQuery and
ConstantScoreQuery in order to allow their use with a MultiSearcher.
(Yonik Seeley)
15. LUCENE-546: Removed 2GB file size limitations for RAMDirectory.
(Peter Royal, Michael Chan, Yonik Seeley)
16. LUCENE-485: Don't hold commit lock while removing obsolete index
files. (Luc Vanlerberghe via cutting)
1.9.1
Bug fixes
1. LUCENE-511: Fix a bug in the BufferedIndexOutput optimization
introduced in 1.9-final. (Shay Banon & Steven Tamm via cutting)
1.9 final
Note that this release is mostly but not 100% source compatible with
the previous release of Lucene (1.4.3). In other words, you should
make sure your application compiles with this version of Lucene before
you replace the old Lucene JAR with the new one. Many methods have
been deprecated in anticipation of release 2.0, so deprecation
warnings are to be expected when upgrading from 1.4.3 to 1.9.
Bug fixes
1. The fix that made IndexWriter.setMaxBufferedDocs(1) work had negative
effects on indexing performance and has thus been reverted. The
argument for setMaxBufferedDocs(int) must now be at least 2, otherwise
an exception is thrown. (Daniel Naber)
Optimizations
1. Optimized BufferedIndexOutput.writeBytes() to use
System.arraycopy() in more cases, rather than copying byte-by-byte.
(Lukas Zapletal via Cutting)
1.9 RC1
Requirements
1. To compile and use Lucene you now need Java 1.4 or later.
Changes in runtime behavior
1. FuzzyQuery can no longer throw a TooManyClauses exception. If a
FuzzyQuery expands to more than BooleanQuery.maxClauseCount
terms, only the BooleanQuery.maxClauseCount most similar terms
go into the rewritten query and thus the exception is avoided.
(Christoph)
2. Changed system property from "org.apache.lucene.lockdir" to
"org.apache.lucene.lockDir", so that its casing follows the existing
pattern used in other Lucene system properties. (Bernhard)
3. The terms of RangeQueries and FuzzyQueries are now converted to
lowercase by default (as has been the case for PrefixQueries
and WildcardQueries before). Use setLowercaseExpandedTerms(false)
to disable that behavior but note that this also affects
PrefixQueries and WildcardQueries. (Daniel Naber)
4. Document frequency that is computed when MultiSearcher is used is now
computed correctly and "globally" across subsearchers and indices, while
before it used to be computed locally to each index, which caused
ranking across multiple indices not to be equivalent.
(Chuck Williams, Wolf Siberski via Otis, bug #31841)
5. When opening an IndexWriter with create=true, Lucene now only deletes
its own files from the index directory (looking at the file name suffixes
to decide if a file belongs to Lucene). The old behavior was to delete
all files. (Daniel Naber and Bernhard Messer, bug #34695)
6. The version of an IndexReader, as returned by getCurrentVersion()
and getVersion() doesn't start at 0 anymore for new indexes. Instead, it
is now initialized by the system time in milliseconds.
(Bernhard Messer via Daniel Naber)
7. Several default values cannot be set via system properties anymore, as
this has been considered inappropriate for a library like Lucene. For
most properties there are set/get methods available in IndexWriter which
you should use instead. This affects the following properties:
See IndexWriter for getter/setter methods:
org.apache.lucene.writeLockTimeout, org.apache.lucene.commitLockTimeout,
org.apache.lucene.minMergeDocs, org.apache.lucene.maxMergeDocs,
org.apache.lucene.maxFieldLength, org.apache.lucene.termIndexInterval,
org.apache.lucene.mergeFactor,
See BooleanQuery for getter/setter methods:
org.apache.lucene.maxClauseCount
See FSDirectory for getter/setter methods:
disableLuceneLocks
(Daniel Naber)
8. Fixed FieldCacheImpl to use user-provided IntParser and FloatParser,
instead of using Integer and Float classes for parsing.
(Yonik Seeley via Otis Gospodnetic)
9. Expert level search routines returning TopDocs and TopFieldDocs
no longer normalize scores. This also fixes bugs related to
MultiSearchers and score sorting/normalization.
(Luc Vanlerberghe via Yonik Seeley, LUCENE-469)
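The FuzzyQuery behavior in item 1 above amounts to keeping only the best-scoring candidates. A minimal sketch, assuming the candidate terms and their similarities are already available in a map (the class TopSimilarTermsSketch and its method are hypothetical and do not reflect how FuzzyQuery is implemented), uses a bounded min-heap:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical selection of the most similar terms with a bounded min-heap. */
class TopSimilarTermsSketch {
    /** Returns at most maxClauseCount terms, ordered from most to least similar. */
    static List<String> mostSimilar(Map<String, Float> similarityByTerm, int maxClauseCount) {
        Comparator<Map.Entry<String, Float>> bySimilarity =
            (a, b) -> Float.compare(a.getValue(), b.getValue());
        // Min-heap: the least similar retained term sits at the head and is evicted first.
        PriorityQueue<Map.Entry<String, Float>> queue = new PriorityQueue<>(bySimilarity);
        for (Map.Entry<String, Float> entry : similarityByTerm.entrySet()) {
            queue.add(entry);
            if (queue.size() > maxClauseCount) {
                queue.poll();
            }
        }
        List<String> terms = new ArrayList<>();
        while (!queue.isEmpty()) {
            terms.add(queue.poll().getKey());
        }
        Collections.reverse(terms); // the heap drains least similar first
        return terms;
    }
}
```

Because the heap never holds more than maxClauseCount + 1 entries, the rewritten query stays within the clause limit no matter how many terms the expansion produces.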
New features
1. Added support for stored compressed fields (patch #31149)
(Bernhard Messer via Christoph)
2. Added support for binary stored fields (patch #29370)
(Drew Farris and Bernhard Messer via Christoph)
3. Added support for position and offset information in term vectors
(patch #18927). (Grant Ingersoll & Christoph)
4. A new class DateTools has been added. It allows you to format dates
in a readable format adequate for indexing. Unlike the existing
DateField class DateTools can cope with dates before 1970 and it
forces you to specify the desired date resolution (e.g. month, day,
second, ...) which can make RangeQuerys on those fields more efficient.
(Daniel Naber)
5. QueryParser now correctly works with Analyzers that can return more
than one token per position. For example, a query "+fast +car"
would be parsed as "+fast +(car automobile)" if the Analyzer
returns "car" and "automobile" at the same position whenever it
finds "car" (Patch #23307).
(Pierrick Brihaye, Daniel Naber)
6. Permit unbuffered Directory implementations (e.g., using mmap).
InputStream is replaced by the new classes IndexInput and
BufferedIndexInput. OutputStream is replaced by the new classes
IndexOutput and BufferedIndexOutput. InputStream and OutputStream
are now deprecated and FSDirectory is now subclassable. (cutting)
7. Add native Directory and TermDocs implementations that work under
GCJ. These require GCC 3.4.0 or later and have only been tested
on Linux. Use 'ant gcj' to build demo applications. (cutting)
8. Add MMapDirectory, which uses nio to mmap input files. This is
still somewhat slower than FSDirectory. However it uses less
memory per query term, since a new buffer is not allocated per
term, which may help applications which use, e.g., wildcard
queries. It may also someday be faster. (cutting & Paul Elschot)
9. Added javadocs-internal to build.xml - bug #30360
(Paul Elschot via Otis)
10. Added RangeFilter, a more generically useful filter than DateFilter.
(Chris M Hostetter via Erik)
11. Added NumberTools, a utility class for indexing numeric fields.
(adapted from code contributed by Matt Quail; committed by Erik)
12. Added public static IndexReader.main(String[] args) method.
IndexReader can now be used directly at command line level
to list and optionally extract the individual files from an existing
compound index file.
(adapted from code contributed by Garrett Rooney; committed by Bernhard)
13. Add IndexWriter.setTermIndexInterval() method. See javadocs.
(Doug Cutting)
14. Added LucenePackage, whose static get() method returns java.util.Package,
which lets the caller get the Lucene version information specified in
the Lucene Jar.
(Doug Cutting via Otis)
15. Added Hits.iterator() method and corresponding HitIterator and Hit objects.
This provides standard java.util.Iterator iteration over Hits.
Each call to the iterator's next() method returns a Hit object.
(Jeremy Rayner via Erik)
16. Add ParallelReader, an IndexReader that combines separate indexes
over different fields into a single virtual index. (Doug Cutting)
17. Add IntParser and FloatParser interfaces to FieldCache, so that
fields in arbitrary formats can be cached as ints and floats.
(Doug Cutting)
18. Added class org.apache.lucene.index.IndexModifier which combines
IndexWriter and IndexReader, so you can add and delete documents without
worrying about synchronization/locking issues.
(Daniel Naber)
19. Lucene can now be used inside an unsigned applet, as Lucene's access
to system properties will not cause a SecurityException anymore.
(Jon Schuster via Daniel Naber, bug #34359)
20. Added a new class MatchAllDocsQuery that matches all documents.
(John Wang via Daniel Naber, bug #34946)
21. Added ability to omit norms on a per field basis to decrease
index size and memory consumption when there are many indexed fields.
See Field.setOmitNorms()
(Yonik Seeley, LUCENE-448)
22. Added NullFragmenter to contrib/highlighter, which is useful for
highlighting entire documents or fields.
(Erik Hatcher)
23. Added regular expression queries, RegexQuery and SpanRegexQuery.
Note the same term enumeration caveats apply with these queries as
apply to WildcardQuery and other term expanding queries.
These two new queries are not currently supported via QueryParser.
(Erik Hatcher)
24. Added ConstantScoreQuery which wraps a filter and produces a score
equal to the query boost for every matching document.
(Yonik Seeley, LUCENE-383)
25. Added ConstantScoreRangeQuery which produces a constant score for
every document in the range. One advantage over a normal RangeQuery
is that it doesn't expand to a BooleanQuery and thus doesn't have a maximum
number of terms the range can cover. Both endpoints may also be open.
(Yonik Seeley, LUCENE-383)
26. Added ability to specify a minimum number of optional clauses that
must match in a BooleanQuery. See BooleanQuery.setMinimumNumberShouldMatch().
(Paul Elschot, Chris Hostetter via Yonik Seeley, LUCENE-395)
27. Added DisjunctionMaxQuery which provides the maximum score across its clauses.
It's very useful for searching across multiple fields.
(Chuck Williams via Yonik Seeley, LUCENE-323)
28. New class ISOLatin1AccentFilter that replaces accented characters in the ISO
Latin 1 character set with their unaccented equivalents.
(Sven Duzont via Erik Hatcher)
29. New class KeywordAnalyzer. "Tokenizes" the entire stream as a single token.
This is useful for data like zip codes, ids, and some product names.
(Erik Hatcher)
30. Copied LengthFilter from contrib area to core. Removes words that are too
long or too short from the stream.
(David Spencer via Otis and Daniel)
31. Added getPositionIncrementGap(String fieldName) to Analyzer. This allows
custom analyzers to put gaps between Field instances with the same field
name, preventing phrase or span queries crossing these boundaries. The
default implementation issues a gap of 0, allowing the default token
position increment of 1 to put the next field's first token into a
successive position.
(Erik Hatcher, with advice from Yonik)
32. StopFilter can now ignore case when checking for stop words.
(Grant Ingersoll via Yonik, LUCENE-248)
33. Add TopDocCollector and TopFieldDocCollector. These simplify the
implementation of hit collectors that collect only the
top-scoring or top-sorting hits.
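As an illustration of item 31 above, the following is a minimal sketch of how
a position increment gap separates several instances of the same field. It is
plain Java and does not use Lucene; PositionGapSketch and positions() are
hypothetical names, and the position bookkeeping is an assumption based only
on the description in that item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not a Lucene class: assigns token positions across field instances. */
final class PositionGapSketch {

    /**
     * Returns the position of every token when several instances of the same
     * field are indexed one after another, adding `gap` between instances.
     */
    static List<Integer> positions(List<List<String>> fieldInstances, int gap) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        boolean first = true;
        for (List<String> tokens : fieldInstances) {
            if (!first) {
                position += gap; // extra distance inserted between field instances
            }
            first = false;
            for (int i = 0; i < tokens.size(); i++) {
                result.add(position);
                position++; // default token position increment of 1
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> author = Arrays.asList(
                Arrays.asList("john", "smith"),
                Arrays.asList("mary", "jones"));
        System.out.println(positions(author, 0));   // [0, 1, 2, 3]
        System.out.println(positions(author, 100)); // [0, 1, 102, 103]
    }
}
```

With a gap of 0, "smith" and "mary" end up in adjacent positions, so a phrase
query for "smith mary" could match across the two instances. A larger gap
keeps them apart.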
API Changes
1. Several methods and fields have been deprecated. The API documentation
contains information about the recommended replacements. It is planned
that most of the deprecated methods and fields will be removed in
Lucene 2.0. (Daniel Naber)
2. The Russian and the German analyzers have been moved to contrib/analyzers.
Also, the WordlistLoader class has been moved one level up in the
hierarchy and is now org.apache.lucene.analysis.WordlistLoader
(Daniel Naber)
3. The API contained methods that were declared to throw an IOException
but never did so. These declarations have been removed. If
your code tries to catch these exceptions you might need to remove
those catch clauses to avoid compile errors. (Daniel Naber)
4. Add a serializable Parameter Class to standardize parameter enum
classes in BooleanClause and Field. (Christoph)
5. Added rewrite methods to all SpanQuery subclasses that nest other SpanQuerys.
This allows custom SpanQuery subclasses that rewrite (for term expansion, for
example) to nest within the built-in SpanQuery classes successfully.
Bug fixes
1. The JSP demo page (src/jsp/results.jsp) now properly closes the
IndexSearcher it opens. (Daniel Naber)
2. Fixed a bug in IndexWriter.addIndexes(IndexReader[] readers) that
prevented deletion of obsolete segments. (Christoph Goller)
3. Fix in FieldInfos to avoid the return of an extra blank field in
IndexReader.getFieldNames() (Patch #19058). (Mark Harwood via Bernhard)
4. Some combinations of BooleanQuery and MultiPhraseQuery (formerly
PhrasePrefixQuery) could provoke UnsupportedOperationException
(bug #33161). (Rhett Sutphin via Daniel Naber)
5. Fixed a small bug in ConjunctionScorer.skipTo() that caused a NullPointerException
if skipTo() was called without a prior call to next(). (Christoph)
6. Disable Similarity.coord() in the scoring of most automatically
generated boolean queries. The coord() score factor is
appropriate when clauses are independently specified by a user,
but is usually not appropriate when clauses are generated
automatically, e.g., by a fuzzy, wildcard or range query. Matches
on such automatically generated queries are no longer penalized
for not matching all terms. (Doug Cutting, Patch #33472)
7. Getting a lock file with Lock.obtain(long) was supposed to wait for
a given number of milliseconds, but this didn't work.
(John Wang via Daniel Naber, Bug #33799)
8. Fix FSDirectory.createOutput() to always create new files.
Previously, existing files were overwritten, and an index could be
corrupted when the old version of a file was longer than the new.
Now any existing file is first removed. (Doug Cutting)
9. Fix BooleanQuery containing nested SpanTermQuery's, which previously
could return an incorrect number of hits.
(Reece Wilton via Erik Hatcher, Bug #35157)
10. Fix NullPointerException that could occur with a MultiPhraseQuery
inside a BooleanQuery.
(Hans Hjelm and Scotty Allen via Daniel Naber, Bug #35626)
11. Fixed SnowballFilter to pass through the position increment from
the original token.
(Yonik Seeley via Erik Hatcher, LUCENE-437)
12. Added Unicode range of Korean characters to StandardTokenizer,
grouping contiguous characters into a token rather than one token
per character. This change also changes the token type to "<CJ>"
for Chinese and Japanese character tokens (previously it was "<CJK>").
(Cheolgoo Kang via Otis and Erik, LUCENE-444 and LUCENE-461)
13. FieldsReader now looks at FieldInfo.storeOffsetWithTermVector and
FieldInfo.storePositionWithTermVector and creates the Field with
correct TermVector parameter.
(Frank Steinmann via Bernhard, LUCENE-455)
14. Fixed WildcardQuery to prevent "cat" matching "ca??".
(Xiaozheng Ma via Bernhard, LUCENE-306)
15. Fixed a bug where MultiSearcher and ParallelMultiSearcher could
change the sort order when sorting by string for documents without
a value for the sort field.
(Luc Vanlerberghe via Yonik, LUCENE-453)
16. Fixed a sorting problem with MultiSearchers that can lead to
missing or duplicate docs due to equal docs sorting in an arbitrary order.
(Yonik Seeley, LUCENE-456)
17. A single hit using the expert level sorted search methods
resulted in the score not being normalized.
(Yonik Seeley, LUCENE-462)
18. Fixed inefficient memory usage when loading an index into RAMDirectory.
(Volodymyr Bychkoviak via Bernhard, LUCENE-475)
19. Corrected term offsets returned by ChineseTokenizer.
(Ray Tsang via Erik Hatcher, LUCENE-324)
20. Fixed MultiReader.undeleteAll() to correctly update numDocs.
(Robert Kirchgessner via Doug Cutting, LUCENE-479)
21. Race condition in IndexReader.getCurrentVersion() and isCurrent()
fixed by acquiring the commit lock.
(Luc Vanlerberghe via Yonik Seeley, LUCENE-481)
22. IndexWriter.setMaxBufferedDocs(1) didn't have the expected effect;
this has now been fixed. (Daniel Naber)
23. Fixed QueryParser when called with a date in local form like
"[1/16/2000 TO 1/18/2000]". This query did not include the documents
of 1/18/2000, i.e. the last day was not included. (Daniel Naber)
24. Removed sorting constraint that threw an exception if there were
not yet any values for the sort field (Yonik Seeley, LUCENE-374)
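The failure mode described in bug fix 8 above (an old, longer file leaving
stale bytes behind a shorter rewrite) can be reproduced with plain
java.io.RandomAccessFile. This is only a sketch of the general problem:
CreateOutputSketch and its methods are hypothetical, and the changelog does
not say which I/O classes FSDirectory used.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo, not Lucene code: in-place overwrite versus delete-then-create. */
final class CreateOutputSketch {

    /** Overwrites in place: bytes beyond the new content survive from the old file. */
    static long overwriteInPlace(File file, byte[] data) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(file, "rw")) {
            out.write(data);
        }
        return file.length();
    }

    /** Removes any existing file first, so the result holds only the new bytes. */
    static long createFresh(File file, byte[] data) throws IOException {
        if (file.exists() && !file.delete()) {
            throw new IOException("Cannot remove existing file: " + file);
        }
        try (RandomAccessFile out = new RandomAccessFile(file, "rw")) {
            out.write(data);
        }
        return file.length();
    }

    /** Returns {length after in-place overwrite, length after fresh create}. */
    static long[] demo() {
        try {
            File f = File.createTempFile("segment", ".tmp");
            try {
                byte[] oldData = "a long old version".getBytes(StandardCharsets.UTF_8);
                byte[] newData = "short".getBytes(StandardCharsets.UTF_8);
                createFresh(f, oldData);
                long stale = overwriteInPlace(f, newData); // old tail is still there
                createFresh(f, oldData);
                long clean = createFresh(f, newData);      // only the new bytes remain
                return new long[] {stale, clean};
            } finally {
                f.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] lengths = demo();
        System.out.println("in place: " + lengths[0] + " bytes, fresh: " + lengths[1] + " bytes");
    }
}
```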
Optimizations
1. Disk usage (peak requirements during indexing and optimization)
in case of compound file format has been improved.
(Bernhard, Dmitry, and Christoph)
2. Optimize the performance of certain uses of BooleanScorer,
TermScorer and IndexSearcher. In particular, a BooleanQuery
composed of TermQuery, with not all terms required, that returns a
TopDocs (e.g., through a Hits with no Sort specified) runs much
faster. (cutting)
3. Removed synchronization from reading of term vectors with an
IndexReader (Patch #30736). (Bernhard Messer via Christoph)
4. Optimize term-dictionary lookup to allocate far fewer terms when
scanning for the matching term. This speeds searches involving
low-frequency terms, where the cost of dictionary lookup can be
significant. (cutting)
5. Optimize fuzzy queries so the standard fuzzy queries with a prefix
of 0 now run 20-50% faster (Patch #31882).
(Jonathan Hager via Daniel Naber)
6. Added a version of BooleanScorer (BooleanScorer2) that delivers
documents in increasing order and implements skipTo. For queries
with required or forbidden clauses it may be faster than the old
BooleanScorer; for BooleanQueries consisting only of optional
clauses it is probably slower. The new BooleanScorer is now the
default. (Patch #31785 by Paul Elschot via Christoph)
7. Use uncached access to norms when merging to reduce RAM usage.
(Bug #32847). (Doug Cutting)
8. Don't read term index when random-access is not required. This
reduces time to open IndexReaders and they use less memory when
random access is not required, e.g., when merging segments. The
term index is now read into memory lazily at the first
random-access. (Doug Cutting)
9. Optimize IndexWriter.addIndexes(Directory[]) when the number of
added indexes is larger than mergeFactor. Previously this could
result in quadratic performance. Now performance is n log(n).
(Doug Cutting)
10. Speed up the creation of TermEnum for indices with multiple
segments and deleted documents, and thus speed up PrefixQuery,
RangeQuery, WildcardQuery, FuzzyQuery, RangeFilter, DateFilter,
and sorting the first time on a field.
(Yonik Seeley, LUCENE-454)
11. Optimized and generalized 32 bit floating point to byte
(custom 8 bit floating point) conversions. Increased the speed of
Similarity.encodeNorm() anywhere from 10% to 250%, depending on the JVM.
(Yonik Seeley, LUCENE-467)
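To make the "custom 8 bit floating point" of optimization 11 above more
concrete, here is a minimal sketch of one possible encoding with 3 mantissa
bits and 5 exponent bits. The bit layout, the class name SmallFloatSketch and
the method names are assumptions chosen for illustration; the entry itself
does not specify the format.

```java
/** Illustrative 8-bit float codec (3 mantissa bits, 5 exponent bits); names are hypothetical. */
final class SmallFloatSketch {
    private static final int MANTISSA_BITS = 3;
    // Shifted bit pattern that maps to byte 0; anything at or below it underflows.
    private static final int ZERO_POINT = (63 - 15) << MANTISSA_BITS;

    /** Packs a float into one byte by keeping only its top exponent and mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= ZERO_POINT) {
            // Zero and negative values map to 0; tiny positive values map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // overflow: clamp to the largest code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands a byte produced by encode() back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float original = 0.3f;
        float roundTripped = decode(encode(original));
        // The round trip is lossy: only 3 mantissa bits survive.
        System.out.println(original + " -> " + roundTripped);
    }
}
```

Because encoding is a shift and a subtraction on the raw bits, it avoids
floating point arithmetic entirely, which is one plausible reason such a
conversion can be fast.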
Infrastructure
1. Lucene's source code repository has converted from CVS to
Subversion. The new repository is at
http://svn.apache.org/repos/asf/lucene/java/trunk
2. Lucene's issue tracker has migrated from Bugzilla to JIRA.
Lucene's JIRA is at http://issues.apache.org/jira/browse/LUCENE
The old issues are still available at
http://issues.apache.org/bugzilla/show_bug.cgi?id=xxxx
(use the bug number instead of xxxx)
1.4.3
1. The JSP demo page (src/jsp/results.jsp) now properly escapes error
messages which might contain user input (e.g. error messages about
query parsing). If you used that page as a starting point for your
own code please make sure your code also properly escapes HTML
characters from user input in order to avoid so-called cross site
scripting attacks. (Daniel Naber)
2. QueryParser changes in 1.4.2 broke the QueryParser API. Now the old
API is supported again. (Christoph)
1.4.2
1. Fixed bug #31241: Sorting could lead to incorrect results (documents
missing, others duplicated) if the sort keys were not unique and there
were more than 100 matches. (Daniel Naber)
2. Memory leak in Sort code (bug #31240) eliminated.
(Rafal Krzewski via Christoph and Daniel)
3. FuzzyQuery now takes an additional parameter that specifies the
minimum similarity that is required for a term to match the query.
The QueryParser syntax for this is term~x, where x is a floating
point number >= 0 and < 1 (a bigger number means that a higher
similarity is required). Furthermore, a prefix can be specified
for FuzzyQuerys so that only those terms are considered similar that
start with this prefix. This can speed up FuzzyQuery greatly.
(Daniel Naber, Christoph Goller)
4. PhraseQuery and PhrasePrefixQuery now allow the explicit specification
of relative positions. (Christoph Goller)
5. QueryParser changes: Fix for ArrayIndexOutOfBoundsExceptions
(patch #9110); some unused method parameters removed; the ability
to specify a minimum similarity for FuzzyQuery has been added.
(Christoph Goller)
6. IndexSearcher optimization: a new ScoreDoc is no longer allocated
for every non-zero-scoring hit. This makes 'OR' queries that
contain common terms substantially faster. (cutting)
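Item 3 above describes a minimum similarity and a required prefix for fuzzy
matching. The sketch below shows how the two could interact. It assumes
similarity is defined as 1 - editDistance / (length of the shorter string),
computed over the whole terms; that formula, the >= comparison, and the names
FuzzySketch, similarity() and matches() are illustrative assumptions, not
taken from the changelog.

```java
/** Hypothetical fuzzy matcher, not Lucene code: prefix check plus edit-distance similarity. */
final class FuzzySketch {

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed definition: 1 - distance / length of the shorter string. */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    /** Rejects candidates that lack the prefix before paying for the edit distance. */
    static boolean matches(String queryTerm, String candidate, double minSimilarity, int prefixLength) {
        int n = Math.min(prefixLength, queryTerm.length());
        if (!candidate.startsWith(queryTerm.substring(0, n))) {
            return false; // cheap rejection: this is where the speed-up comes from
        }
        return similarity(queryTerm, candidate) >= minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(matches("lucene", "lucent", 0.5, 3)); // true
        System.out.println(matches("lucene", "mucene", 0.5, 1)); // false
    }
}
```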
1.4.1
1. Fixed a performance bug in hit sorting code, where values were not
correctly cached. (Aviran via cutting)
2. Fixed errors in file format documentation. (Daniel Naber)
1.4 final
1. Added "an" to the list of stop words in StopAnalyzer, to complement
the existing "a" there. Fix for bug 28960
(http://issues.apache.org/bugzilla/show_bug.cgi?id=28960). (Otis)
2. Added new class FieldCache to manage in-memory caches of field term
values. (Tim Jones)
3. Added overloaded getFieldQuery method to QueryParser which
accepts the slop factor specified for the phrase (or the default
phrase slop for the QueryParser instance). This allows overriding
methods to replace a PhraseQuery with a SpanNearQuery instead,
keeping the proper slop factor. (Erik Hatcher)
4. Changed the encoding of GermanAnalyzer.java and GermanStemmer.java to
UTF-8 and changed the build encoding to UTF-8, to make changed files
compile. (Otis Gospodnetic)
5. Removed synchronization from term lookup under IndexReader methods
termFreq(), termDocs() or termPositions() to improve
multi-threaded performance. (cutting)
6. Fix a bug where obsolete segment files were not deleted on Win32.
1.4 RC3
1. Fixed several search bugs introduced by the skipTo() changes in
release 1.4RC1. The index file format was changed a bit, so
collections must be re-indexed to take advantage of the skipTo()
optimizations. (Christoph Goller)
2. Added new Document methods, removeField() and removeFields().
(Christoph Goller)
3. Fixed inconsistencies with index closing. Indexes and directories
are now only closed automatically by Lucene when Lucene opened
them automatically. (Christoph Goller)
4. Added new class: FilteredQuery. (Tim Jones)
5. Added a new SortField type for custom comparators. (Tim Jones)
6. Lock obtain timed out message now displays the full path to the lock
file. (Daniel Naber via Erik)
7. Fixed a bug in SpanNearQuery when ordered. (Paul Elschot via cutting)
8. Fixed so that FSDirectory's locks still work when the
java.io.tmpdir system property is null. (cutting)
9. Changed FilteredTermEnum's constructor to take no parameters,
as the parameters were ignored anyway (bug #28858)
1.4 RC2
1. GermanAnalyzer now throws an exception if the stopword file
cannot be found (bug #27987). It now uses LowerCaseFilter
(bug #18410) (Daniel Naber via Otis, Erik)
2. Fixed a few bugs in the file format documentation. (cutting)
1.4 RC1
1. Changed the format of the .tis file, so that:
- it has a format version number, which makes it easier to
back-compatibly change file formats in the future.
- the term count is now stored as a long. This was the one aspect
of Lucene's file formats which limited index size.
- a few internal index parameters are now stored in the index, so
that they can (in theory) now be changed from index to index,
although there is not yet an API to do so.
These changes are back compatible. The new code can read old
indexes. But old code will not be able to read new indexes. (cutting)
2. Added an optimized implementation of TermDocs.skipTo(). A skip
table is now stored for each term in the .frq file. This only
adds a percent or two to overall index size, but can substantially
speed up many searches. (cutting)
3. Restructured the Scorer API and all Scorer implementations to take
advantage of an optimized TermDocs.skipTo() implementation. In
particular, PhraseQuerys and conjunctive BooleanQuerys are
faster when one clause has substantially fewer matches than the
others. (A conjunctive BooleanQuery is a BooleanQuery where all
clauses are required.) (cutting)
4. Added new class ParallelMultiSearcher. Combined with
RemoteSearchable this makes it easy to implement distributed
search systems. (Jean-Francois Halleux via cutting)
5. Added support for hit sorting. Results may now be sorted by any
indexed field. For details see the javadoc for
Searcher#search(Query, Sort). (Tim Jones via Cutting)
6. Changed FSDirectory to auto-create a full directory tree that it
needs by using mkdirs() instead of mkdir(). (Mladen Turk via Otis)
7. Added a new span-based query API. This implements, among other
things, nested phrases. See javadocs for details. (Doug Cutting)
8. Added new method Query.getSimilarity(Searcher), and changed
scorers to use it. This permits one to subclass a Query class so
that it can specify its own Similarity implementation, perhaps
one that delegates through that of the Searcher. (Julien Nioche
via Cutting)
9. Added MultiReader, an IndexReader that combines multiple other
IndexReaders. (Cutting)
10. Added support for term vectors. See Field#isTermVectorStored().
(Grant Ingersoll, Cutting & Dmitry)
11. Fixed the old bug with escaping of special characters in query
strings: http://issues.apache.org/bugzilla/show_bug.cgi?id=24665
(Jean-Francois Halleux via Otis)
12. Added support for overriding default values for the following,
using system properties:
- default commit lock timeout
- default maxFieldLength
- default maxMergeDocs
- default mergeFactor
- default minMergeDocs
- default write lock timeout
(Otis)
13. Changed QueryParser.jj to allow '-' and '+' within tokens:
http://issues.apache.org/bugzilla/show_bug.cgi?id=27491
(Morus Walter via Otis)
14. Changed so that the compound index format is used by default.
This makes indexing a bit slower, but vastly reduces the chances
of file handle problems. (Cutting)
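Items 2 and 3 above explain why skipTo() helps conjunctive queries when one
clause has far fewer matches than the others. A minimal sketch of the idea
follows, using sorted int arrays as posting lists. DocList, skipTo() and
intersect() are hypothetical names, and the binary search merely stands in
for the on-disk skip table mentioned in item 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Lucene code: intersecting two posting lists with skipTo(). */
final class SkipToSketch {

    /** Minimal iterator over a sorted array of document numbers. */
    static final class DocList {
        private final int[] docs;
        private int index = -1;

        DocList(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Moves to the next document; returns false when exhausted. */
        boolean next() {
            index++;
            return index < docs.length;
        }

        /** Moves to the first document >= target at or after the current one. */
        boolean skipTo(int target) {
            int lo = Math.max(index, 0);
            int hi = docs.length;
            while (lo < hi) { // binary search stands in for a skip table
                int mid = (lo + hi) >>> 1;
                if (docs[mid] < target) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            index = lo;
            return index < docs.length;
        }

        int doc() {
            return docs[index];
        }
    }

    /** Returns documents present in both lists, driven by the rarer list. */
    static List<Integer> intersect(int[] rare, int[] common) {
        List<Integer> hits = new ArrayList<>();
        DocList a = new DocList(rare);
        DocList b = new DocList(common);
        boolean more = a.next();
        while (more) {
            if (!b.skipTo(a.doc())) {
                break; // the common list is exhausted
            }
            if (b.doc() == a.doc()) {
                hits.add(a.doc());
                more = a.next();
            } else {
                more = a.skipTo(b.doc()); // leapfrog past documents that cannot match
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] rare = {3, 50, 900};
        int[] common = {1, 2, 3, 4, 50, 60, 70, 80};
        System.out.println(intersect(rare, common)); // [3, 50]
    }
}
```

The work is roughly proportional to the length of the rarer list rather than
the sum of both, which matches the speed-up the entry describes.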
1.3 final
1. Added catch of BooleanQuery$TooManyClauses in QueryParser to
throw ParseException instead. (Erik Hatcher)
2. Fixed a NullPointerException in Query.explain(). (Doug Cutting)
3. Added a new method IndexReader.setNorm(), that permits one to
alter the boosting of fields after an index is created.
4. Distinguish between the final position and length when indexing a
field. The length is now defined as the total number of tokens,
instead of the final position, as it was previously. Length is
used for score normalization (Similarity.lengthNorm()) and for
controlling memory usage (IndexWriter.maxFieldLength). In both of
these cases, the total number of tokens is a better value to use
than the final token position. Position is used in phrase
searching (see PhraseQuery and Token.setPositionIncrement()).
5. Fix StandardTokenizer's handling of CJK characters (Chinese,
Japanese and Korean ideograms). Previously contiguous sequences
were combined in a single token, which is not very useful. Now
each ideogram generates a separate token, which is more useful.
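The distinction drawn in item 4 above between a field's length and its final
position can be sketched in a few lines. This assumes that the first token
with a position increment of 1 sits at position 0; FieldLengthSketch and its
methods are hypothetical names used only for illustration.

```java
/** Hypothetical sketch, not Lucene code: token count versus final token position. */
final class FieldLengthSketch {

    /** Length is the number of tokens, regardless of where they are positioned. */
    static int length(int[] positionIncrements) {
        return positionIncrements.length;
    }

    /** Final position is the running sum of increments (first token at 0 for increment 1). */
    static int finalPosition(int[] positionIncrements) {
        int position = -1;
        for (int increment : positionIncrements) {
            position += increment;
        }
        return position;
    }

    public static void main(String[] args) {
        int[] stacked = {1, 0, 0, 1}; // three tokens share position 0
        int[] gapped = {1, 1, 5};     // a gap of four positions before the last token
        System.out.println(length(stacked) + " tokens, final position " + finalPosition(stacked)); // 4 tokens, final position 1
        System.out.println(length(gapped) + " tokens, final position " + finalPosition(gapped));   // 3 tokens, final position 6
    }
}
```

The two values diverge as soon as increments other than 1 are used, which is
why the token count is the more stable input for length normalization and
memory limits.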
1.3 RC3
1. Added minMergeDocs in IndexWriter. This can be raised to speed
indexing without altering the number of files, but only using more
memory. (Julien Nioche via Otis)
2. Fix bug #24786, in query rewriting. (bschneeman via Cutting)
3. Fix bug #16952, in demo HTML parser, skip comments in
javascript. (Christoph Goller)
4. Fix bug #19253, in demo HTML parser, add whitespace as needed to
output (Daniel Naber via Christoph Goller)
5. Fix bug #24301, in demo HTML parser, long titles no longer
hang things. (Christoph Goller)
6. Fix bug #23534, Replace use of file timestamp of segments file
with an index version number stored in the segments file. This
resolves problems when running on file systems with low-resolution
timestamps, e.g., HFS under MacOS X. (Christoph Goller)
7. Fix QueryParser so that TokenMgrError is not thrown, only
ParseException. (Erik Hatcher)
8. Fix some bugs introduced by change 11 of RC2. (Christoph Goller)
9. Fixed a problem compiling TestRussianStem. (Christoph Goller)
10. Cleaned up some build stuff. (Erik Hatcher)
1.3 RC2
1. Added getFieldNames(boolean) to IndexReader, SegmentReader, and
SegmentsReader. (Julien Nioche via otis)
2. Changed file locking to place lock files in
System.getProperty("java.io.tmpdir"), where all users are
permitted to write files. This way folks can open and correctly
lock indexes which are read-only to them.
3. IndexWriter: added a new method, addDocument(Document, Analyzer),
permitting one to easily use different analyzers for different
documents in the same index.
4. Minor enhancements to FuzzyTermEnum.
(Christoph Goller via Otis)
5. PriorityQueue: added insert(Object) method and adjusted IndexSearcher
and MultiIndexSearcher to use it.
(Christoph Goller via Otis)
6. Fixed a bug in IndexWriter that returned incorrect docCount().
(Christoph Goller via Otis)
7. Fixed SegmentsReader to eliminate the confusing and slightly different
behaviour of TermEnum when dealing with an enumeration of all terms,
versus an enumeration starting from a specific term.
This patch also fixes incorrect term document frequencies when the same term
is present in multiple segments.
(Christoph Goller via Otis)
8. Added CachingWrapperFilter and PerFieldAnalyzerWrapper. (Erik Hatcher)
9. Added support for the new "compound file" index format (Dmitry
Serebrennikov)
10. Added Locale setting to QueryParser, for use by date range parsing.
11. Changed IndexReader so that it can be subclassed by classes
outside of its package. Previously it had package-private
abstract methods. Also modified the index merging code so that it
can work on an arbitrary IndexReader implementation, and added a
new method, IndexWriter.addIndexes(IndexReader[]), to take
advantage of this. (cutting)
12. Added a limit to the number of clauses which may be added to a
BooleanQuery. The default limit is 1024 clauses. This should
stop most OutOfMemoryErrors caused by prefix, wildcard and fuzzy
queries which run amok. (cutting)
13. Add new method: IndexReader.undeleteAll(). This undeletes all
deleted documents which still remain in the index. (cutting)
1.3 RC1
1. Fixed PriorityQueue's clear() method.
Fix for bug 9454, http://nagoya.apache.org/bugzilla/show_bug.cgi?id=9454
(Matthijs Bomhoff via otis)
2. Changed StandardTokenizer.jj grammar for EMAIL tokens.
Fix for bug 9015, http://nagoya.apache.org/bugzilla/show_bug.cgi?id=9015
(Dale Anson via otis)
3. Added the ability to disable lock creation by using disableLuceneLocks
system property. This is useful for read-only media, such as CD-ROMs.
(otis)
4. Added id method to Hits to be able to access the index global id.
Required for sorting options.
(carlson)
5. Added support for new range query syntax to QueryParser.jj.
(briangoetz)
6. Added the ability to retrieve HTML documents' META tag values to
HTMLParser.jj.
(Mark Harwood via otis)
7. Modified QueryParser to make it possible to programmatically specify the
default Boolean operator (OR or AND).
(Péter Halácsy via otis)
8. Made many search methods and classes non-final, per requests.
This includes IndexWriter and IndexSearcher, among others.
(cutting)
9. Added class RemoteSearchable, providing support for remote
searching via RMI. The test class RemoteSearchableTest.java
provides an example of how this can be used. (cutting)
10. Added PhrasePrefixQuery (and supporting MultipleTermPositions). The
test class TestPhrasePrefixQuery provides the usage example.
(Anders Nielsen via otis)
11. Changed the German stemming algorithm to ignore case while
stripping. The new algorithm is faster and produces more equal
stems from nouns and verbs derived from the same word.
(gschwarz)
12. Added support for boosting the score of documents and fields via
the new methods Document.setBoost(float) and Field.setBoost(float).
Note: This changes the encoding of an indexed value. Indexes
should be re-created from scratch in order for search scores to
be correct. With the new code and an old index, searches will
yield very large scores for shorter fields, and very small scores
for longer fields. Once the index is re-created, scores will be
as before. (cutting)
13. Added new method Token.setPositionIncrement().
This permits, for the purpose of phrase searching, placing
multiple terms in a single position. This is useful with
stemmers that produce multiple possible stems for a word.
This also permits the introduction of gaps between terms, so that
terms which are adjacent in a token stream will not be matched by
an exact phrase query. This makes it possible, e.g., to build
an analyzer where phrases are not matched over stop words which
have been removed.
Finally, repeating a token with an increment of zero can also be
used to boost scores of matches on that token. (cutting)
14. Added new Filter class, QueryFilter. This constrains search
results to only match those which also match a provided query.
Results are cached, so that searches after the first on the same
index using this filter are very fast.
This could be used, for example, with a RangeQuery on a formatted
date field to implement date filtering. One could re-use a
single QueryFilter that matches, e.g., only documents modified
within the last week. The QueryFilter and RangeQuery would only
need to be reconstructed once per day. (cutting)
15. Added a new IndexWriter method, getAnalyzer(). This returns the
analyzer used when adding documents to this index. (cutting)
16. Fixed a bug with IndexReader.lastModified(). Before, document
deletion did not update this. Now it does. (cutting)
17. Added Russian Analyzer.
(Boris Okner via otis)
18. Added a public, extensible scoring API. For details, see the
javadoc for org.apache.lucene.search.Similarity.
19. Changed the return type of Hits.id() from float to int. (Terry Steichen via Peter)
20. Added getFieldNames() to IndexReader and Segment(s)Reader classes.
(Peter Mularien via otis)
21. Added getFields(String) and getValues(String) methods.
Contributed by Rasik Pandey on 2002-10-09
(Rasik Pandey via otis)
22. Revised internal search APIs. Changes include:
a. Queries are no longer modified during a search. This makes
it possible, e.g., to reuse the same query instance with
multiple indexes from multiple threads.
b. Term-expanding queries (e.g. PrefixQuery, WildcardQuery,
etc.) now work correctly with MultiSearcher, fixing bugs 12619
and 12667.
c. Boosting BooleanQuery's now works, and is supported by the
query parser (problem reported by Lee Mallabone). Thus a query
like "(+foo +bar)^2 +baz" is now supported and equivalent to
"(+foo^2 +bar^2) +baz".
d. New method: Query.rewrite(IndexReader). This permits a
query to re-write itself as an alternate, more primitive query.
Most of the term-expanding query classes (PrefixQuery,
WildcardQuery, etc.) are now implemented using this method.
e. New method: Searchable.explain(Query q, int doc). This
returns an Explanation instance that describes how a particular
document is scored against a query. An explanation can be
displayed as either plain text, with the toString() method, or
as HTML, with the toHtml() method. Note that computing an
explanation is as expensive as executing the query over the
entire index. This is intended to be used in developing
Similarity implementations, and, for good performance, should
not be displayed with every hit.
f. Scorer and Weight are public, not package protected. It is now
possible for someone to write a Scorer implementation that is
not in the org.apache.lucene.search package. This is still
fairly advanced programming, and I don't expect anyone to do
this anytime soon, but at least now it is possible.
g. Added public accessors to the primitive query classes
(TermQuery, PhraseQuery and BooleanQuery), permitting access to
their terms and clauses.
Caution: These are extensive changes and they have not yet been
tested extensively. Bug reports are appreciated.
(cutting)
23. Added convenience RAMDirectory constructors taking File and String
arguments, for easy FSDirectory to RAMDirectory conversion.
(otis)
24. Added code for manual renaming of files in FSDirectory, since it
has been reported that java.io.File's renameTo(File) method sometimes
fails on Windows JVMs.
(Matt Tucker via otis)
25. Refactored QueryParser to make it easier for people to extend it.
Added the ability to automatically lower-case Wildcard terms in
the QueryParser.
(Tatu Saloranta via otis)
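The caching behaviour described for QueryFilter in item 14 above (results are
computed once per index and reused afterwards) can be sketched as a small
generic class. CachingFilterSketch is a hypothetical name, the type parameter
I stands in for whatever object identifies an index, and the use of a
WeakHashMap is an assumption; the entry does not say how the cache is held.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical sketch, not Lucene code: caches one BitSet of matching documents per index. */
final class CachingFilterSketch<I> {
    private final Function<I, BitSet> compute; // runs the wrapped query against an index
    private final Map<I, BitSet> cache = new WeakHashMap<>();
    private int computations = 0;

    CachingFilterSketch(Function<I, BitSet> compute) {
        this.compute = compute;
    }

    /** Returns the cached bits for this index, computing them on first use only. */
    synchronized BitSet bits(I index) {
        BitSet cached = cache.get(index);
        if (cached == null) {
            cached = compute.apply(index);
            computations++;
            cache.put(index, cached);
        }
        return cached;
    }

    /** Number of times the wrapped query actually ran. */
    synchronized int computations() {
        return computations;
    }

    public static void main(String[] args) {
        CachingFilterSketch<String> filter = new CachingFilterSketch<>(indexName -> {
            BitSet bits = new BitSet();
            bits.set(0);
            bits.set(2); // pretend documents 0 and 2 match the query
            return bits;
        });
        filter.bits("main-index");
        filter.bits("main-index");
        System.out.println("query executed " + filter.computations() + " time(s)"); // query executed 1 time(s)
    }
}
```

Keying the cache weakly on the index object lets entries disappear once the
index is no longer referenced, so a long-lived filter does not pin old
indexes in memory.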
1.2 RC6
1. Changed QueryParser.jj to treat "?" as a special character, which
allows it to be used in wildcard terms. Updated the TestWildcard
unit test as well. (Ralf Hettesheimer via carlson)
1.2 RC5
1. Renamed build.properties to default.properties and updated
the BUILD.txt document to describe how to override the
default.property settings without having to edit the file. This
brings the build process closer to Scarab's build process.
(jon)
2. Added MultiFieldQueryParser class. (Kelvin Tan, via otis)
3. Updated "powered by" links. (otis)
4. Fixed instruction for setting up JavaCC - Bug #7017 (otis)
5. FSDirectory now throws an exception if it cannot create the directory
- Bug #6914 (Eugene Gluzberg via otis)
6. Update MultiSearcher, MultiFieldParse, Constants, DateFilter,
LowerCaseTokenizer javadoc (otis)
7. Added fix to avoid NullPointerException in results.jsp
(Mark Hayes via otis)
8. Changed Wildcard search to find 0 or more characters instead of 1 or more
(Lee Mallabone, via otis)
9. Fixed error in offset issue in GermanStemFilter - Bug #7412
(Rodrigo Reyes, via otis)
10. Added unit tests for wildcard search and DateFilter (otis)
11. Allow co-existence of indexed and non-indexed fields with the same name
(cutting/casper, via otis)
12. Add escape character to query parser.
(briangoetz)
13. Applied a patch that ensures that searches that use DateFilter
don't throw an exception when no matches are found. (David Smiley, via
otis)
14. Fixed bugs in DateFilter and wildcardquery unit tests. (cutting, otis, carlson)
1.2 RC4
1. Updated contributions section of website.
Add XML Document #3 implementation to Document Section.
Also added Term Highlighting to Misc Section. (carlson)
2. Fixed NullPointerException for phrase searches containing
unindexed terms, introduced in 1.2RC3. (cutting)
3. Changed document deletion code to obtain the index write lock,
enforcing the fact that document addition and deletion cannot be
performed concurrently. (cutting)
4. Various documentation cleanups. (otis, acoliver)
5. Updated "powered by" links. (cutting, jon)
6. Fixed a bug in the GermanStemmer. (Bernhard Messer, via otis)
7. Changed Term and Query to implement Serializable. (scottganyo)
8. Fixed to never delete indexes added with IndexWriter.addIndexes().
(cutting)
9. Upgraded to JUnit 3.7. (otis)
1.2 RC3
1. IndexWriter: fixed a bug where adding an optimized index to an
empty index failed. This was encountered using addIndexes to copy
a RAMDirectory index to an FSDirectory.
2. RAMDirectory: fixed a bug where RAMInputStream could not read
across more than a single buffer boundary.
3. Fix query parser so it accepts queries with unicode characters.
(briangoetz)
4. Fix query parser so that PrefixQuery is used in preference to
WildcardQuery when there's only an asterisk at the end of the
term. Previously PrefixQuery would never be used.
5. Fix tests so they compile; fix ant file so it compiles tests
properly. Added test cases for Analyzers and PriorityQueue.
6. Updated demos, added Getting Started documentation. (acoliver)
7. Added 'contributions' section to website & docs. (carlson)
8. Removed JavaCC from source distribution for copyright reasons.
Folks must now download this separately from metamata in order to
compile Lucene. (cutting)
9. Substantially improved the performance of DateFilter by adding the
ability to reuse TermDocs objects. (cutting)
10. Added IndexReader methods:
public static boolean indexExists(String directory);
public static boolean indexExists(File directory);
public static boolean indexExists(Directory directory);
public static boolean isLocked(Directory directory);
public static void unlock(Directory directory);
(cutting, otis)
11. Fixed bugs in GermanAnalyzer (gschwarz)
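Illustration for item 4 above: a minimal sketch of how a parser might
pick between a prefix query and a wildcard query. This is not the actual
query parser code. The class TermKindSketch, its Kind enum, and the
classify method are hypothetical names, and the rule that a lone "*" falls
back to a wildcard is an assumption of the sketch.

```java
// Hypothetical sketch; not Lucene's query parser implementation.
final class TermKindSketch {
    enum Kind { TERM, PREFIX, WILDCARD }

    // Classify a raw query term by the wildcard characters it contains.
    static Kind classify(String term) {
        int firstStar = term.indexOf('*');
        boolean hasQuestion = term.indexOf('?') >= 0;
        if (firstStar < 0 && !hasQuestion) {
            return Kind.TERM; // no wildcard characters at all
        }
        // Exactly one asterisk, at the very end, with text before it:
        // the cheaper prefix form is sufficient.
        // Assumption: a bare "*" is treated as a wildcard, not a prefix.
        if (!hasQuestion && firstStar > 0 && firstStar == term.length() - 1) {
            return Kind.PREFIX;
        }
        return Kind.WILDCARD; // wildcard characters elsewhere in the term
    }

    public static void main(String[] args) {
        System.out.println(classify("foo*")); // PREFIX
        System.out.println(classify("f*o"));  // WILDCARD
        System.out.println(classify("fo?*")); // WILDCARD
        System.out.println(classify("foo"));  // TERM
    }
}
```

Checking for the trailing-asterisk case before falling back to the
general wildcard case is what lets the prefix form be chosen at all.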
1.2 RC2
- added sources to distribution
- removed broken build scripts and libraries from distribution
- SegmentsReader: fixed potential race condition
- FSDirectory: fixed so that getDirectory(xxx,true) correctly
erases the directory contents, even when the directory
has already been accessed in this JVM.
- RangeQuery: Fix issue where an inclusive range query would
include the nearest term in the index above a nonexistent
specified upper term.
- SegmentTermEnum: Fix NullPointerException in clone() method
when the Term is null.
- JDK 1.1 compatibility fix: disabled lock files for JDK 1.1,
since they rely on a feature added in JDK 1.2.
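Illustration for the RangeQuery entry above: a minimal sketch of the
intended upper-bound behavior, using a sorted set of strings in place of
the index's term dictionary. This is not the RangeQuery implementation.
RangeSketch and termsInRange are hypothetical names, and the TreeSet
stand-in is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch; a TreeSet stands in for the sorted term dictionary.
final class RangeSketch {
    // Collect the terms between lower and upper, in sorted order.
    static List<String> termsInRange(TreeSet<String> terms, String lower,
                                     String upper, boolean inclusive) {
        List<String> matched = new ArrayList<>();
        for (String term : terms.tailSet(lower, inclusive)) {
            int cmp = term.compareTo(upper);
            // Stop at the first term past the upper bound. When the upper
            // term itself is absent, the next higher term must be rejected
            // even for an inclusive range.
            if (cmp > 0 || (cmp == 0 && !inclusive)) {
                break;
            }
            matched.add(term);
        }
        return matched;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
            Arrays.asList("apple", "banana", "cherry", "date"));
        // "cocoa" is not in the set; "date" must not be matched.
        System.out.println(termsInRange(terms, "banana", "cocoa", true));
        // prints [banana, cherry]
    }
}
```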
1.2 RC1
- first Apache release
- packages renamed from com.lucene to org.apache.lucene
- license switched from LGPL to Apache
- ant-only build -- no more makefiles
- addition of lock files--now fully thread & process safe
- addition of German stemmer
- MultiSearcher now supports low-level search API
- added RangeQuery, for term-range searching
- Analyzers can choose tokenizer based on field name
- misc bug fixes.
1.01b
. last Sourceforge release
. a few bug fixes
. new Query Parser
. new prefix query (search for "foo*" matches "food")
1.0
This release fixes a few serious bugs and also includes some
performance optimizations, a stemmer, and a few other minor
enhancements.
0.04
Lucene now includes a grammar-based tokenizer, StandardTokenizer.
The only tokenizer included in the previous release (LetterTokenizer)
identified terms consisting entirely of alphabetic characters. The
new tokenizer uses a regular-expression grammar to identify more
complex classes of terms, including numbers, acronyms, email
addresses, etc.
StandardTokenizer serves two purposes:
1. It is a much better general-purpose tokenizer for applications
to use as is.
The easiest way for applications to start using
StandardTokenizer is to use StandardAnalyzer.
2. It provides a good example of grammar-based tokenization.
If an application has special tokenization requirements, it can
implement a custom tokenizer by copying the directory containing
the new tokenizer into the application and modifying it
accordingly.
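As an illustration of grammar-based tokenization in general, the
following minimal sketch uses java.util.regex alternation to recognize a
few classes of terms. It is not the StandardTokenizer grammar. The class
GrammarTokenizerSketch, the four token types, and their patterns are
assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are illustrative, not Lucene's grammar.
final class GrammarTokenizerSketch {
    // Token types, listed in the order the alternatives are tried.
    private static final String[] TYPES = {"EMAIL", "ACRONYM", "NUM", "ALPHANUM"};

    // One named group per token class; more specific classes come first.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<EMAIL>[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+)"
        + "|(?<ACRONYM>(?:[A-Za-z]\\.){2,})"
        + "|(?<NUM>[0-9]+(?:[.,][0-9]+)*)"
        + "|(?<ALPHANUM>[A-Za-z0-9]+)");

    // Return each token as "TYPE:text"; unmatched characters are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(type + ":" + m.group());
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Mail bob@example.com about I.B.M. stock at 42.5 today")) {
            System.out.println(token);
        }
    }
}
```

A tokenizer that accepts only letters would split the address, the
acronym, and the number in this input into fragments. Adding a term class
here means adding one alternative to the pattern.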
0.01
First open source release.
The code has been re-organized into a new package and directory
structure for this release. It builds OK, but has not been tested
beyond that since the re-organization.