split changes for 3.x and trunk

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@940818 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Uwe Schindler 2010-05-04 12:04:08 +00:00
parent cd320b5f57
commit 20a0dc280d
1 changed file with 114 additions and 128 deletions


@@ -70,6 +70,118 @@ Changes in backwards compatibility policy
TermAttribute/CharTermAttribute. If you want to further filter
or attach Payloads to NTS, use the new NumericTermAttribute.
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
creation. Previously, if you passed an empty Directory and set OpenMode to
CREATE*, IndexWriter would make a first empty commit. If you need that
behavior you can call writer.commit()/close() immediately after you create it.
(Shai Erera, Mike McCandless)
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
actual file's length if the file exists, and throws FileNotFoundException
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
you relied on that, make sure to catch the exception. (Shai Erera)
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
not Unicode code units. For example, a wildcard "?" represents any Unicode
character. Furthermore, the rest of the automaton package and RegexpQuery use
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
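The code point vs. code unit distinction above can be demonstrated with the JDK alone, since String.length() counts UTF-16 code units while codePointCount() counts Unicode code points (an illustrative snippet only; it does not touch Lucene APIs):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
        // needs a surrogate pair for it: two code units, one code point.
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

Under the new behavior, a wildcard "?" matches this entire character, even though it spans two UTF-16 code units.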
Changes in runtime behavior
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
it cannot delete the lock file, since obtaining the lock does not fail if the
file is there. (Shai Erera)
API Changes
* LUCENE-1458, LUCENE-2111: The postings APIs (TermEnum, TermDocsEnum,
TermPositionsEnum) have been deprecated in favor of the new flexible
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
DocsEnum, DocsAndPositionsEnum). One big difference is that field
and terms are now enumerated separately: a TermsEnum provides a
BytesRef (wraps a byte[]) per term within a single field, not a
Term. Another is that when asking for a Docs/AndPositionsEnum, you
now specify the skipDocs explicitly (typically this will be the
deleted docs, but in general you can provide any Bits).
* LUCENE-1458, LUCENE-2111: IndexReader now directly exposes its
deleted docs (getDeletedDocs), providing a new Bits interface to
directly query by doc ID.
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
points too. If you use an IndexDeletionPolicy which holds onto index commits
(such as SnapshotDeletionPolicy), you can call this method to remove those
commit points when they are not needed anymore (instead of waiting for the
next commit). (Shai Erera)
New features
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
matches terms against a finite-state machine. Implement WildcardQuery
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
* LUCENE-1990: Adds internal packed ints implementation, to be used
for more efficient storage of int arrays when the values are
bounded, for example for storing the terms dict index. (Toke
Eskildsen via Mike McCandless)
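To illustrate the packed-ints idea (a toy sketch, not Lucene's PackedInts implementation): when values are known to fit in b bits, each can be stored in exactly b bits inside a long[] instead of a full 32-bit int slot.

```java
public class PackedIntsDemo {
    final long[] blocks;
    final int bitsPerValue;
    final long mask;

    PackedIntsDemo(int count, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(count * bitsPerValue + 63) / 64];
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= (value & mask) << shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) { // value straddles two long blocks
            blocks[block + 1] |= (value & mask) >>> (bitsPerValue - spill);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) { // reassemble the bits from the next block
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return v & mask;
    }
}
```

For values bounded by 7 bits, for example, this uses under a quarter of the RAM of a plain int[].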
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
representation for the in-memory terms dict index. (Mike
McCandless)
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
and DataOutput. IndexInput and IndexOutput extend these new classes.
(Michael Busch)
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
for an application to create its own postings codec, to alter how
fields, terms, docs and positions are encoded into the index. The
standard codec is the default codec. Both IndexWriter and
IndexReader accept a CodecProvider class to obtain codecs for newly
written segments as well as existing segments opened for reading.
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
for flexible indexing, including pulsing codec (inlines
low-frequency terms directly into the terms dict, avoiding seeking
for some queries), sep codec (stores docs, freqs, positions, skip
data and payloads in 5 separate files instead of the 2 used by
standard codec), and int block (really a "base" for using
block-based compressors like PForDelta for storing postings data).
* LUCENE-2302, LUCENE-1458, LUCENE-2111: Terms are no longer required
to be character based. Lucene views a term as an arbitrary byte[]:
during analysis, character-based terms are converted to UTF8 byte[],
but analyzers are free to directly create terms as byte[]
(NumericField does this, for example). The term data is buffered as
byte[] during indexing, written as byte[] into the terms dictionary,
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
searching.
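The character-term to UTF-8 byte[] conversion described above can be sketched with the JDK alone (illustrative only; Lucene's internal term buffering differs):

```java
import java.nio.charset.StandardCharsets;

public class Utf8TermDemo {
    public static void main(String[] args) {
        // "caf\u00e9": 'c', 'a', 'f' encode to 1 byte each in UTF-8,
        // while U+00E9 needs 2 bytes, so the term occupies 5 bytes.
        byte[] term = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(term.length); // 5
    }
}
```

A useful property here is that byte-wise comparison of UTF-8 data agrees with Unicode code point order, so a byte[]-based terms dictionary can stay sorted without decoding terms back to chars.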
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
can be used to prevent commits from ever getting deleted from the index.
(Shai Erera)
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
codec is more RAM efficient: terms data is stored as block byte
arrays and packed integers. Net RAM reduction for indexes that have
many unique terms should be substantial, and initial open time for
IndexReaders should be faster. These gains only apply for newly
written segments after upgrading.
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
byte[] during indexing, which uses half the RAM for ASCII terms (and
also numeric fields). This can improve indexing throughput for
applications that have many unique terms, since it reduces how often
a new segment must be flushed given a fixed RAM buffer size.
* LUCENE-2398: Improve tests to work better from IDEs such as Eclipse.
(Paolo Castagna via Robert Muir)
======================= Lucene 3.x (not yet released) =======================
Changes in backwards compatibility policy
* LUCENE-1483: Removed utility class oal.util.SorterTemplate; this
class is no longer used by Lucene. (Gunnar Wagenknecht via Mike
McCandless)
@@ -117,22 +229,6 @@ Changes in backwards compatibility policy
of incrementToken(), tokenStream(), and reusableTokenStream().
(Uwe Schindler, Robert Muir)
* LUCENE-2386: IndexWriter no longer performs an empty commit upon new index
creation. Previously, if you passed an empty Directory and set OpenMode to
CREATE*, IndexWriter would make a first empty commit. If you need that
behavior you can call writer.commit()/close() immediately after you create it.
(Shai Erera, Mike McCandless)
* LUCENE-2316: Directory.fileLength contract was clarified - it returns the
actual file's length if the file exists, and throws FileNotFoundException
otherwise. Returning length=0 for a non-existent file is no longer allowed. If
you relied on that, make sure to catch the exception. (Shai Erera)
* LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
not Unicode code units. For example, a wildcard "?" represents any Unicode
character. Furthermore, the rest of the automaton package and RegexpQuery use
true Unicode codepoint representation. (Robert Muir, Mike McCandless)
Changes in runtime behavior
* LUCENE-1923: Made IndexReader.toString() produce something
@@ -141,10 +237,6 @@ Changes in runtime behavior
* LUCENE-2179: CharArraySet.clear() is now functional.
(Robert Muir, Uwe Schindler)
* LUCENE-2421: NativeFSLockFactory does not throw LockReleaseFailedException if
it cannot delete the lock file, since obtaining the lock does not fail if the
file is there. (Shai Erera)
API Changes
* LUCENE-2076: Rename FSDirectory.getFile -> getDirectory. (George
@@ -215,20 +307,6 @@ API Changes
FSDirectory to see a sample of how such tracking might look like, if needed
in your custom Directories. (Earwin Burrfoot via Mike McCandless)
* LUCENE-1458, LUCENE-2111: The postings APIs (TermEnum, TermDocsEnum,
TermPositionsEnum) have been deprecated in favor of the new flexible
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
DocsEnum, DocsAndPositionsEnum). One big difference is that field
and terms are now enumerated separately: a TermsEnum provides a
BytesRef (wraps a byte[]) per term within a single field, not a
Term. Another is that when asking for a Docs/AndPositionsEnum, you
now specify the skipDocs explicitly (typically this will be the
deleted docs, but in general you can provide any Bits).
* LUCENE-1458, LUCENE-2111: IndexReader now directly exposes its
deleted docs (getDeletedDocs), providing a new Bits interface to
directly query by doc ID.
* LUCENE-2302: Deprecated TermAttribute and replaced by a new
CharTermAttribute. The change is backwards compatible, so
mixed new/old TokenStreams all work on the same char[] buffer
@@ -240,12 +318,6 @@ API Changes
expressions).
(Uwe Schindler, Robert Muir)
* LUCENE-2402: IndexWriter.deleteUnusedFiles now deletes unreferenced commit
points too. If you use an IndexDeletionPolicy which holds onto index commits
(such as SnapshotDeletionPolicy), you can call this method to remove those
commit points when they are not needed anymore (instead of waiting for the
next commit). (Shai Erera)
Bug fixes
* LUCENE-2119: Don't throw NegativeArraySizeException if you pass
@@ -290,28 +362,9 @@ Bug fixes
is fixed to return null if there are no segments. (Karthick
Sankarachary via Mike McCandless)
* LUCENE-2222: FixedIntBlockIndexInput incorrectly read one block of
0s before the actual data. (Renaud Delbru via Mike McCandless)
* LUCENE-2344: PostingsConsumer.merge was failing to call finishDoc,
which caused corruption for sep codec. Also fixed several tests to
test all 4 core codecs. (Renaud Delbru via Mike McCandless)
* LUCENE-2074: Reduce buffer size of lexer back to default on reset.
(Ruben Laguna, Shai Erera via Uwe Schindler)
* LUCENE-2387: Don't hang onto Fieldables from the last doc indexed,
in IndexWriter, nor the Reader in Tokenizer after close is
called. (Ruben Laguna, Uwe Schindler, Mike McCandless)
* LUCENE-2417: IndexCommit did not implement hashCode() and equals()
consistently. Now they both take Directory and version into consideration. In
addition, all IndexCommit methods that threw
UnsupportedOperationException are now abstract. (Shai Erera)
* LUCENE-2424: Fix FieldDoc.toString to actually return its fields.
(Stephen Green via Mike McCandless)
New features
* LUCENE-2128: Parallelized fetching document frequencies during weight
@@ -376,52 +429,6 @@ New features
files between FSDirectory instances. (Earwin Burrfoot via Mike
McCandless).
* LUCENE-1606, LUCENE-2089: Adds AutomatonQuery, a MultiTermQuery that
matches terms against a finite-state machine. Implement WildcardQuery
and FuzzyQuery with finite-state methods. Adds RegexpQuery.
(Robert Muir, Mike McCandless, Uwe Schindler, Mark Miller)
* LUCENE-1990: Adds internal packed ints implementation, to be used
for more efficient storage of int arrays when the values are
bounded, for example for storing the terms dict index. (Toke
Eskildsen via Mike McCandless)
* LUCENE-2321: Cutover to a more RAM efficient packed-ints based
representation for the in-memory terms dict index. (Mike
McCandless)
* LUCENE-2126: Add new classes for data (de)serialization: DataInput
and DataOutput. IndexInput and IndexOutput extend these new classes.
(Michael Busch)
* LUCENE-1458, LUCENE-2111: With flexible indexing it is now possible
for an application to create its own postings codec, to alter how
fields, terms, docs and positions are encoded into the index. The
standard codec is the default codec. Both IndexWriter and
IndexReader accept a CodecProvider class to obtain codecs for newly
written segments as well as existing segments opened for reading.
* LUCENE-1458, LUCENE-2111: Some experimental codecs have been added
for flexible indexing, including pulsing codec (inlines
low-frequency terms directly into the terms dict, avoiding seeking
for some queries), sep codec (stores docs, freqs, positions, skip
data and payloads in 5 separate files instead of the 2 used by
standard codec), and int block (really a "base" for using
block-based compressors like PForDelta for storing postings data).
* LUCENE-2302, LUCENE-1458, LUCENE-2111: Terms are no longer required
to be character based. Lucene views a term as an arbitrary byte[]:
during analysis, character-based terms are converted to UTF8 byte[],
but analyzers are free to directly create terms as byte[]
(NumericField does this, for example). The term data is buffered as
byte[] during indexing, written as byte[] into the terms dictionary,
and iterated as byte[] (wrapped in a BytesRef) by IndexReader for
searching.
* LUCENE-2385: Moved NoDeletionPolicy from benchmark to core. NoDeletionPolicy
can be used to prevent commits from ever getting deleted from the index.
(Shai Erera)
* LUCENE-2074: Make StandardTokenizer fit for Unicode 4.0, if the
matchVersion parameter is Version.LUCENE_31. (Uwe Schindler)
@@ -492,24 +499,12 @@ Optimizations
because then it will make sense to make the RAM buffers as large as
possible. (Mike McCandless, Michael Busch)
* LUCENE-1458, LUCENE-2111: The in-memory terms index used by standard
codec is more RAM efficient: terms data is stored as block byte
arrays and packed integers. Net RAM reduction for indexes that have
many unique terms should be substantial, and initial open time for
IndexReaders should be faster. These gains only apply for newly
written segments after upgrading.
* LUCENE-1458, LUCENE-2111: Terms data are now buffered directly as
byte[] during indexing, which uses half the RAM for ASCII terms (and
also numeric fields). This can improve indexing throughput for
applications that have many unique terms, since it reduces how often
a new segment must be flushed given a fixed RAM buffer size.
Build
* LUCENE-2124: Moved the JDK-based collation support from contrib/collation
into core, and moved the ICU-based collation support into contrib/icu.
(Robert Muir)
* LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards
branch is now included in the svn repository using "svn copy"
@@ -518,12 +513,6 @@ Build
* LUCENE-2074: Regenerating StandardTokenizerImpl files now needs
JFlex 1.5 (currently only available on SVN). (Uwe Schindler)
* LUCENE-1709: Tests are now parallelized by default (except for benchmark). You
can force them to run sequentially by passing -Drunsequential=1 on the command
line. The number of threads that are spawned per CPU defaults to '1'. If you
wish to change that, you can run the tests with -DthreadsPerProcessor=[num].
(Robert Muir, Shai Erera, Peter Kofler)
Test Cases
* LUCENE-2037: Allow Junit4 tests in our environment (Erick Erickson
@@ -558,9 +547,6 @@ Test Cases
access to "real" files from the test folder itself, can use
LuceneTestCase(J4).getDataFile(). (Uwe Schindler)
* LUCENE-2398: Improve tests to work better from IDEs such as Eclipse.
(Paolo Castagna via Robert Muir)
================== Release 2.9.2 / 3.0.1 2010-02-26 ====================
Changes in backwards compatibility policy