mirror of https://github.com/apache/lucene.git
split changes for 3.x and trunk
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@940818 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
cd320b5f57
commit
20a0dc280d
|
@ -70,6 +70,118 @@ Changes in backwards compatibility policy
|
||||||
TermAttribute/CharTermAttribute. If you want to further filter
|
TermAttribute/CharTermAttribute. If you want to further filter
|
||||||
or attach Payloads to NTS, use the new NumericTermAttribute.
|
or attach Payloads to NTS, use the new NumericTermAttribute.
|
||||||
|
|
||||||
|
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
|
||||||
|
creation. Previously, if you passed an empty Directory and set OpenMode to
|
||||||
|
CREATE*, IndexWriter would make a first empty commit. If you need that
|
||||||
|
behavior you can call writer.commit()/close() immediately after you create it.
|
||||||
|
(Shai Erera, Mike McCandless)
|
||||||
|
|
||||||
|
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
|
||||||
|
actual file's length if the file exists, and throws FileNotFoundException
|
||||||
|
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
|
||||||
|
you relied on that, make sure to catch the exception. (Shai Erera)
|
||||||
|
|
||||||
|
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
|
||||||
|
not unicode code units. For example, a Wildcard "?" represents any unicode
|
||||||
|
character. Furthermore, the rest of the automaton package and RegexpQuery use
|
||||||
|
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
|
||||||
|
|
||||||
|
Changes in runtime behavior
|
||||||
|
|
||||||
|
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
|
||||||
|
it cannot delete the lock file, since obtaining the lock does not fail if the
|
||||||
|
file is there. (Shai Erera)
|
||||||
|
|
||||||
|
API Changes
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: The postings APIs (TermEnum, TermDocsEnum,
|
||||||
|
TermPositionsEnum) have been deprecated in favor of the new flexible
|
||||||
|
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
|
||||||
|
DocsEnum, DocsAndPositionsEnum). One big difference is that field
|
||||||
|
and terms are now enumerated separately: a TermsEnum provides a
|
||||||
|
BytesRef (wraps a byte[]) per term within a single field, not a
|
||||||
|
Term. Another is that when asking for a Docs/AndPositionsEnum, you
|
||||||
|
now specify the skipDocs explicitly (typically this will be the
|
||||||
|
deleted docs, but in general you can provide any Bits).
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: IndexReader now directly exposes its
|
||||||
|
deleted docs (getDeletedDocs), providing a new Bits interface to
|
||||||
|
directly query by doc ID.
|
||||||
|
|
||||||
|
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
|
||||||
|
points too. If you use an IndexDeletionPolicy which holds onto index commits
|
||||||
|
(such as SnapshotDeletionPolicy), you can call this method to remove those
|
||||||
|
commit points when they are not needed anymore (instead of waiting for the
|
||||||
|
next commit). (Shai Erera)
|
||||||
|
|
||||||
|
New features
|
||||||
|
|
||||||
|
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
|
||||||
|
matches terms against a finite-state machine. Implement WildcardQuery
|
||||||
|
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
|
||||||
|
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
|
||||||
|
|
||||||
|
* LUCENE-1990: Adds internal packed ints implementation, to be used
|
||||||
|
for more efficient storage of int arrays when the values are
|
||||||
|
bounded, for example for storing the terms dict index. (Toke
|
||||||
|
Eskildsen via Mike McCandless)
|
||||||
|
|
||||||
|
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
|
||||||
|
representation for the in-memory terms dict index. (Mike
|
||||||
|
McCandless)
|
||||||
|
|
||||||
|
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
|
||||||
|
and DataOutput. IndexInput and IndexOutput extend these new classes.
|
||||||
|
(Michael Busch)
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
|
||||||
|
for an application to create its own postings codec, to alter how
|
||||||
|
fields, terms, docs and positions are encoded into the index. The
|
||||||
|
standard codec is the default codec. Both IndexWriter and
|
||||||
|
IndexReader accept a CodecProvider class to obtain codecs for newly
|
||||||
|
written segments as well as existing segments opened for reading.
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
|
||||||
|
for flexible indexing, including pulsing codec (inlines
|
||||||
|
low-frequency terms directly into the terms dict, avoiding seeking
|
||||||
|
for some queries), sep codec (stores docs, freqs, positions, skip
|
||||||
|
data and payloads in 5 separate files instead of the 2 used by
|
||||||
|
standard codec), and int block (really a "base" for using
|
||||||
|
block-based compressors like PForDelta for storing postings data).
|
||||||
|
|
||||||
|
* LUCENE-2302, LUCENE-1458, LUCENE-2111: Terms are no longer required
|
||||||
|
to be character based. Lucene views a term as an arbitrary byte[]:
|
||||||
|
during analysis, character-based terms are converted to UTF8 byte[],
|
||||||
|
but analyzers are free to directly create terms as byte[]
|
||||||
|
(NumericField does this, for example). The term data is buffered as
|
||||||
|
byte[] during indexing, written as byte[] into the terms dictionary,
|
||||||
|
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
|
||||||
|
searching.
|
||||||
|
|
||||||
|
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
|
||||||
|
can be used to prevent commits from ever getting deleted from the index.
|
||||||
|
(Shai Erera)
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
|
||||||
|
codec is more RAM efficient: terms data is stored as block byte
|
||||||
|
arrays and packed integers. Net RAM reduction for indexes that have
|
||||||
|
many unique terms should be substantial, and initial open time for
|
||||||
|
IndexReaders should be faster. These gains only apply for newly
|
||||||
|
written segments after upgrading.
|
||||||
|
|
||||||
|
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
|
||||||
|
byte[] during indexing, which uses half the RAM for ascii terms (and
|
||||||
|
also numeric fields). This can improve indexing throughput for
|
||||||
|
applications that have many unique terms, since it reduces how often
|
||||||
|
a new segment must be flushed given a fixed RAM buffer size.
|
||||||
|
|
||||||
|
* LUCENE-2398: Improve tests to work better from IDEs such as Eclipse.
|
||||||
|
(Paolo Castagna via Robert Muir)
|
||||||
|
|
||||||
|
======================= Lucene 3.x (not yet released) =======================
|
||||||
|
|
||||||
|
Changes in backwards compatibility policy
|
||||||
|
|
||||||
* LUCENE-1483: Removed utility class oal.util.SorterTemplate; this
|
* LUCENE-1483: Removed utility class oal.util.SorterTemplate; this
|
||||||
class is no longer used by Lucene. (Gunnar Wagenknecht via Mike
|
class is no longer used by Lucene. (Gunnar Wagenknecht via Mike
|
||||||
McCandless)
|
McCandless)
|
||||||
|
@ -117,22 +229,6 @@ Changes in backwards compatibility policy
|
||||||
of incrementToken(), tokenStream(), and reusableTokenStream().
|
of incrementToken(), tokenStream(), and reusableTokenStream().
|
||||||
(Uwe Schindler, Robert Muir)
|
(Uwe Schindler, Robert Muir)
|
||||||
|
|
||||||
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
|
|
||||||
creation. Previously, if you passed an empty Directory and set OpenMode to
|
|
||||||
CREATE*, IndexWriter would make a first empty commit. If you need that
|
|
||||||
behavior you can call writer.commit()/close() immediately after you create it.
|
|
||||||
(Shai Erera, Mike McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
|
|
||||||
actual file's length if the file exists, and throws FileNotFoundException
|
|
||||||
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
|
|
||||||
you relied on that, make sure to catch the exception. (Shai Erera)
|
|
||||||
|
|
||||||
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
|
|
||||||
not unicode code units. For example, a Wildcard "?" represents any unicode
|
|
||||||
character. Furthermore, the rest of the automaton package and RegexpQuery use
|
|
||||||
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
|
|
||||||
|
|
||||||
Changes in runtime behavior
|
Changes in runtime behavior
|
||||||
|
|
||||||
* LUCENE-1923: Made IndexReader.toString() produce something
|
* LUCENE-1923: Made IndexReader.toString() produce something
|
||||||
|
@ -141,10 +237,6 @@ Changes in runtime behavior
|
||||||
* LUCENE-2179: CharArraySet.clear() is now functional.
|
* LUCENE-2179: CharArraySet.clear() is now functional.
|
||||||
(Robert Muir, Uwe Schindler)
|
(Robert Muir, Uwe Schindler)
|
||||||
|
|
||||||
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
|
|
||||||
it cannot delete the lock file, since obtaining the lock does not fail if the
|
|
||||||
file is there. (Shai Erera)
|
|
||||||
|
|
||||||
API Changes
|
API Changes
|
||||||
|
|
||||||
* LUCENE-2076: Rename FSDirectory.getFile -> getDirectory. (George
|
* LUCENE-2076: Rename FSDirectory.getFile -> getDirectory. (George
|
||||||
|
@ -215,20 +307,6 @@ API Changes
|
||||||
FSDirectory to see a sample of how such tracking might look like, if needed
|
FSDirectory to see a sample of how such tracking might look like, if needed
|
||||||
in your custom Directories. (Earwin Burrfoot via Mike McCandless)
|
in your custom Directories. (Earwin Burrfoot via Mike McCandless)
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: The postings APIs (TermEnum, TermDocsEnum,
|
|
||||||
TermPositionsEnum) have been deprecated in favor of the new flexible
|
|
||||||
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
|
|
||||||
DocsEnum, DocsAndPositionsEnum). One big difference is that field
|
|
||||||
and terms are now enumerated separately: a TermsEnum provides a
|
|
||||||
BytesRef (wraps a byte[]) per term within a single field, not a
|
|
||||||
Term. Another is that when asking for a Docs/AndPositionsEnum, you
|
|
||||||
now specify the skipDocs explicitly (typically this will be the
|
|
||||||
deleted docs, but in general you can provide any Bits).
|
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: IndexReader now directly exposes its
|
|
||||||
deleted docs (getDeletedDocs), providing a new Bits interface to
|
|
||||||
directly query by doc ID.
|
|
||||||
|
|
||||||
* LUCENE-2302: Deprecated TermAttribute and replaced by a new
|
* LUCENE-2302: Deprecated TermAttribute and replaced by a new
|
||||||
CharTermAttribute. The change is backwards compatible, so
|
CharTermAttribute. The change is backwards compatible, so
|
||||||
mixed new/old TokenStreams all work on the same char[] buffer
|
mixed new/old TokenStreams all work on the same char[] buffer
|
||||||
|
@ -240,12 +318,6 @@ API Changes
|
||||||
expressions).
|
expressions).
|
||||||
(Uwe Schindler, Robert Muir)
|
(Uwe Schindler, Robert Muir)
|
||||||
|
|
||||||
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
|
|
||||||
points too. If you use an IndexDeletionPolicy which holds onto index commits
|
|
||||||
(such as SnapshotDeletionPolicy), you can call this method to remove those
|
|
||||||
commit points when they are not needed anymore (instead of waiting for the
|
|
||||||
next commit). (Shai Erera)
|
|
||||||
|
|
||||||
Bug fixes
|
Bug fixes
|
||||||
|
|
||||||
* LUCENE-2119: Don't throw NegativeArraySizeException if you pass
|
* LUCENE-2119: Don't throw NegativeArraySizeException if you pass
|
||||||
|
@ -290,28 +362,9 @@ Bug fixes
|
||||||
is fixed to return null if there are no segments. (Karthick
|
is fixed to return null if there are no segments. (Karthick
|
||||||
Sankarachary via Mike McCandless)
|
Sankarachary via Mike McCandless)
|
||||||
|
|
||||||
* LUCENE-2222: FixedIntBlockIndexInput incorrectly read one block of
|
|
||||||
0s before the actual data. (Renaud Delbru via Mike McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2344: PostingsConsumer.merge was failing to call finishDoc,
|
|
||||||
which caused corruption for sep codec. Also fixed several tests to
|
|
||||||
test all 4 core codecs. (Renaud Delbru via Mike McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
|
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
|
||||||
(Ruben Laguna, Shai Erera via Uwe Schindler)
|
(Ruben Laguna, Shai Erera via Uwe Schindler)
|
||||||
|
|
||||||
* LUCENE-2387: Don't hang onto Fieldables from the last doc indexed,
|
|
||||||
in IndexWriter, nor the Reader in Tokenizer after close is
|
|
||||||
called. (Ruben Laguna, Uwe Schindler, Mike McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2417: IndexCommit did not implement hashCode() and equals()
|
|
||||||
consistently. Now they both take Directory and version into consideration. In
|
|
||||||
addition, all of IndexCommit's methods which threw
|
|
||||||
UnsupportedOperationException are now abstract. (Shai Erera)
|
|
||||||
|
|
||||||
* LUCENE-2424: Fix FieldDoc.toString to actually return its fields
|
|
||||||
(Stephen Green via Mike McCandless)
|
|
||||||
|
|
||||||
New features
|
New features
|
||||||
|
|
||||||
* LUCENE-2128: Parallelized fetching document frequencies during weight
|
* LUCENE-2128: Parallelized fetching document frequencies during weight
|
||||||
|
@ -376,52 +429,6 @@ New features
|
||||||
files between FSDirectory instances. (Earwin Burrfoot via Mike
|
files between FSDirectory instances. (Earwin Burrfoot via Mike
|
||||||
McCandless).
|
McCandless).
|
||||||
|
|
||||||
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
|
|
||||||
matches terms against a finite-state machine. Implement WildcardQuery
|
|
||||||
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
|
|
||||||
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
|
|
||||||
|
|
||||||
* LUCENE-1990: Adds internal packed ints implementation, to be used
|
|
||||||
for more efficient storage of int arrays when the values are
|
|
||||||
bounded, for example for storing the terms dict index. (Toke
|
|
||||||
Eskildsen via Mike McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
|
|
||||||
representation for the in-memory terms dict index. (Mike
|
|
||||||
McCandless)
|
|
||||||
|
|
||||||
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
|
|
||||||
and DataOutput. IndexInput and IndexOutput extend these new classes.
|
|
||||||
(Michael Busch)
|
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
|
|
||||||
for an application to create its own postings codec, to alter how
|
|
||||||
fields, terms, docs and positions are encoded into the index. The
|
|
||||||
standard codec is the default codec. Both IndexWriter and
|
|
||||||
IndexReader accept a CodecProvider class to obtain codecs for newly
|
|
||||||
written segments as well as existing segments opened for reading.
|
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
|
|
||||||
for flexible indexing, including pulsing codec (inlines
|
|
||||||
low-frequency terms directly into the terms dict, avoiding seeking
|
|
||||||
for some queries), sep codec (stores docs, freqs, positions, skip
|
|
||||||
data and payloads in 5 separate files instead of the 2 used by
|
|
||||||
standard codec), and int block (really a "base" for using
|
|
||||||
block-based compressors like PForDelta for storing postings data).
|
|
||||||
|
|
||||||
* LUCENE-2302, LUCENE-1458, LUCENE-2111: Terms are no longer required
|
|
||||||
to be character based. Lucene views a term as an arbitrary byte[]:
|
|
||||||
during analysis, character-based terms are converted to UTF8 byte[],
|
|
||||||
but analyzers are free to directly create terms as byte[]
|
|
||||||
(NumericField does this, for example). The term data is buffered as
|
|
||||||
byte[] during indexing, written as byte[] into the terms dictionary,
|
|
||||||
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
|
|
||||||
searching.
|
|
||||||
|
|
||||||
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
|
|
||||||
can be used to prevent commits from ever getting deleted from the index.
|
|
||||||
(Shai Erera)
|
|
||||||
|
|
||||||
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
|
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
|
||||||
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
|
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
|
||||||
|
|
||||||
|
@ -492,18 +499,6 @@ Optimizations
|
||||||
because then it will make sense to make the RAM buffers as large as
|
because then it will make sense to make the RAM buffers as large as
|
||||||
possible. (Mike McCandless, Michael Busch)
|
possible. (Mike McCandless, Michael Busch)
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
|
|
||||||
codec is more RAM efficient: terms data is stored as block byte
|
|
||||||
arrays and packed integers. Net RAM reduction for indexes that have
|
|
||||||
many unique terms should be substantial, and initial open time for
|
|
||||||
IndexReaders should be faster. These gains only apply for newly
|
|
||||||
written segments after upgrading.
|
|
||||||
|
|
||||||
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
|
|
||||||
byte[] during indexing, which uses half the RAM for ascii terms (and
|
|
||||||
also numeric fields). This can improve indexing throughput for
|
|
||||||
applications that have many unique terms, since it reduces how often
|
|
||||||
a new segment must be flushed given a fixed RAM buffer size.
|
|
||||||
|
|
||||||
Build
|
Build
|
||||||
|
|
||||||
|
@ -518,12 +513,6 @@ Build
|
||||||
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
|
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
|
||||||
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
|
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
|
||||||
|
|
||||||
* LUCENE-1709: Tests are now parallelized by default (except for benchmark). You
|
|
||||||
can force them to run sequentially by passing -Drunsequential=1 on the command
|
|
||||||
line. The number of threads that are spawned per CPU defaults to '1'. If you
|
|
||||||
wish to change that, you can run the tests with -DthreadsPerProcessor=[num].
|
|
||||||
(Robert Muir, Shai Erera, Peter Kofler)
|
|
||||||
|
|
||||||
Test Cases
|
Test Cases
|
||||||
|
|
||||||
* LUCENE-2037 Allow Junit4 tests in our environment (Erick Erickson
|
* LUCENE-2037 Allow Junit4 tests in our environment (Erick Erickson
|
||||||
|
@ -558,9 +547,6 @@ Test Cases
|
||||||
access to "real" files from the test folder itself, can use
|
access to "real" files from the test folder itself, can use
|
||||||
LuceneTestCase(J4).getDataFile(). (Uwe Schindler)
|
LuceneTestCase(J4).getDataFile(). (Uwe Schindler)
|
||||||
|
|
||||||
* LUCENE-2398: Improve tests to work better from IDEs such as Eclipse.
|
|
||||||
(Paolo Castagna via Robert Muir)
|
|
||||||
|
|
||||||
================== Release 2.9.2 / 3.0.1 2010-02-26 ====================
|
================== Release 2.9.2 / 3.0.1 2010-02-26 ====================
|
||||||
|
|
||||||
Changes in backwards compatibility policy
|
Changes in backwards compatibility policy
|
||||||
|
|