mirror of https://github.com/apache/lucene.git
LUCENE-7322: Remove CompressionTools.
parent bd9005d562
commit 894a8b275c
@@ -1,90 +1,16 @@
 # Apache Lucene Migration Guide
 
 ## Query.hashCode and Query.equals are now abstract methods (LUCENE-7277)
 
 Any custom query subclasses should redeclare equivalence relationship according
 to the subclass's details. See code patterns used in existing core Lucene query
 classes for details.
 
-## The way how number of document calculated is changed (LUCENE-6711)
-The number of documents (numDocs) is used to calculate term specificity (idf) and average document length (avdl).
-Prior to LUCENE-6711, collectionStats.maxDoc() was used for the statistics.
-Now, collectionStats.docCount() is used whenever possible, if not maxDocs() is used.
+## CompressionTools removed (LUCENE-7322)
 
-Assume that a collection contains 100 documents, and 50 of them have "keywords" field.
-In this example, maxDocs is 100 while docCount is 50 for the "keywords" field.
-The total number of tokens for "keywords" field is divided by docCount to obtain avdl.
-Therefore, docCount which is the total number of documents that have at least one term for the field, is a more precise metric for optional fields.
+Per-field compression has been superseded by codec-level compression, which has
+the benefit of being able to compress several fields, or even documents at once,
+yielding better compression ratios. In case you would still like to compress on
+top of the codec, you can do it on the application side by using the utility
+classes from the java.util.zip package.
 
-DefaultSimilarity does not leverage avdl, so this change would have relatively minor change in the result list.
-Because relative idf values of terms will remain same.
-However, when combined with other factors such as term frequency, relative ranking of documents could change.
-Some Similarity implementations (such as the ones instantiated with NormalizationH2 and BM25) take account into avdl and would have notable change in ranked list.
-Especially if you have a collection of documents with varying lengths.
-Because NormalizationH2 tends to punish documents longer than avdl.
-
-## FunctionValues.exist() Behavior Changes due to ValueSource bug fixes (LUCENE-5961)
-
-Bugs fixed in several ValueSource functions may result in different behavior in
-situations where some documents do not have values for fields wrapped in other
-ValueSources. Users who want to preserve the previous behavior may need to wrap
-their ValueSources in a "DefFunction" along with a ConstValueSource of "0.0".
-
-## Removal of Filter and FilteredQuery (LUCENE-6301,LUCENE-6583)
-
-Filter and FilteredQuery have been removed. Regular queries can be used instead
-of filters as they have been optimized for the filtering case. And you can
-construct a BooleanQuery with one MUST clause for the query, and one FILTER
-clause for the filter in order to have similar behaviour to FilteredQuery.
-
-## PhraseQuery and BooleanQuery made immutable (LUCENE-6531 LUCENE-6570)
-
-PhraseQuery and BooleanQuery are now immutable and have a builder API to help
-construct them. For instance a BooleanQuery that used to be constructed like
-this:
-
-  BooleanQuery bq = new BooleanQuery();
-  bq.add(q1, Occur.SHOULD);
-  bq.add(q2, Occur.SHOULD);
-  bq.add(q3, Occur.MUST);
-  bq.setMinimumNumberShouldMatch(1);
-
-can now be constructed this way using its builder:
-
-  BooleanQuery bq = new BooleanQuery.Builder()
-      .add(q1, Occur.SHOULD)
-      .add(q2, Occur.SHOULD)
-      .add(q3, Occur.SHOULD)
-      .setMinimumNumberShouldMatch(1)
-      .build();
-
-## AttributeImpl now requires that reflectWith() is implemented (LUCENE-6651)
-
-AttributeImpl removed the default, reflection-based implementation of
-reflectWith(AtrributeReflector). The method was made abstract. If you have
-implemented your own attribute, make sure to add the required method sigature.
-See the Javadocs for an example.
-
-## Query.setBoost() and Query.clone() are removed (LUCENE-6590)
-
-Query.setBoost has been removed. In order to apply a boost to a Query, you now
-need to wrap it inside a BoostQuery. For instance,
-
-  Query q = ...;
-  float boost = ...;
-  q = new BoostQuery(q, boost);
-
-would be equivalent to the following code with the old setBoost API:
-
-  Query q = ...;
-  float boost = ...;
-  q.setBoost(q.getBoost() * boost);
-
-# PointValues replaces NumericField (LUCENE-6917)
-
-PointValues provides faster indexing and searching, a smaller
-index size, and less heap used at search time. See org.apache.lucene.index.PointValues
-for an introduction.
-
-Legacy numeric encodings from previous versions of Lucene are
-deprecated as LegacyIntField, LegacyFloatField, LegacyLongField, and LegacyDoubleField,
-and can be searched with LegacyNumericRangeQuery.
@@ -1,150 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.lucene.document;
-
-
-import java.util.zip.Deflater;
-import java.util.zip.Inflater;
-import java.util.zip.DataFormatException;
-import java.io.ByteArrayOutputStream;
-
-import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.UnicodeUtil;
-
-/** Simple utility class providing static methods to
- * compress and decompress binary data for stored fields.
- * This class uses java.util.zip.Deflater and Inflater
- * classes to compress and decompress.
- */
-
-public class CompressionTools {
-
-  // Export only static methods
-  private CompressionTools() {}
-
-  /** Compresses the specified byte range using the
-   * specified compressionLevel (constants are defined in
-   * java.util.zip.Deflater). */
-  public static byte[] compress(byte[] value, int offset, int length, int compressionLevel) {
-
-    /* Create an expandable byte array to hold the compressed data.
-     * You cannot use an array that's the same size as the orginal because
-     * there is no guarantee that the compressed data will be smaller than
-     * the uncompressed data. */
-    ByteArrayOutputStream bos = new ByteArrayOutputStream(length);
-
-    Deflater compressor = new Deflater();
-
-    try {
-      compressor.setLevel(compressionLevel);
-      compressor.setInput(value, offset, length);
-      compressor.finish();
-
-      // Compress the data
-      final byte[] buf = new byte[1024];
-      while (!compressor.finished()) {
-        int count = compressor.deflate(buf);
-        bos.write(buf, 0, count);
-      }
-    } finally {
-      compressor.end();
-    }
-
-    return bos.toByteArray();
-  }
-
-  /** Compresses the specified byte range, with default BEST_COMPRESSION level */
-  public static byte[] compress(byte[] value, int offset, int length) {
-    return compress(value, offset, length, Deflater.BEST_COMPRESSION);
-  }
-
-  /** Compresses all bytes in the array, with default BEST_COMPRESSION level */
-  public static byte[] compress(byte[] value) {
-    return compress(value, 0, value.length, Deflater.BEST_COMPRESSION);
-  }
-
-  /** Compresses the String value, with default BEST_COMPRESSION level */
-  public static byte[] compressString(String value) {
-    return compressString(value, Deflater.BEST_COMPRESSION);
-  }
-
-  /** Compresses the String value using the specified
-   * compressionLevel (constants are defined in
-   * java.util.zip.Deflater). */
-  public static byte[] compressString(String value, int compressionLevel) {
-    byte[] b = new byte[UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * value.length()];
-    final int len = UnicodeUtil.UTF16toUTF8(value, 0, value.length(), b);
-    return compress(b, 0, len, compressionLevel);
-  }
-
-  /** Decompress the byte array previously returned by
-   * compress (referenced by the provided BytesRef) */
-  public static byte[] decompress(BytesRef bytes) throws DataFormatException {
-    return decompress(bytes.bytes, bytes.offset, bytes.length);
-  }
-
-  /** Decompress the byte array previously returned by
-   * compress */
-  public static byte[] decompress(byte[] value) throws DataFormatException {
-    return decompress(value, 0, value.length);
-  }
-
-  /** Decompress the byte array previously returned by
-   * compress */
-  public static byte[] decompress(byte[] value, int offset, int length) throws DataFormatException {
-    // Create an expandable byte array to hold the decompressed data
-    ByteArrayOutputStream bos = new ByteArrayOutputStream(length);
-
-    Inflater decompressor = new Inflater();
-
-    try {
-      decompressor.setInput(value, offset, length);
-
-      // Decompress the data
-      final byte[] buf = new byte[1024];
-      while (!decompressor.finished()) {
-        int count = decompressor.inflate(buf);
-        bos.write(buf, 0, count);
-      }
-    } finally {
-      decompressor.end();
-    }
-
-    return bos.toByteArray();
-  }
-
-  /** Decompress the byte array previously returned by
-   * compressString back into a String */
-  public static String decompressString(byte[] value) throws DataFormatException {
-    return decompressString(value, 0, value.length);
-  }
-
-  /** Decompress the byte array previously returned by
-   * compressString back into a String */
-  public static String decompressString(byte[] value, int offset, int length) throws DataFormatException {
-    final byte[] bytes = decompress(value, offset, length);
-    final char[] result = new char[bytes.length];
-    final int len = UnicodeUtil.UTF8toUTF16(bytes, 0, bytes.length, result);
-    return new String(result, 0, len);
-  }
-
-  /** Decompress the byte array (referenced by the provided BytesRef)
-   * previously returned by compressString back into a String */
-  public static String decompressString(BytesRef bytes) throws DataFormatException {
-    return decompressString(bytes.bytes, bytes.offset, bytes.length);
-  }
-}
@@ -74,33 +74,4 @@ public class TestBinaryDocument extends LuceneTestCase {
     reader.close();
     dir.close();
   }
-
-  public void testCompressionTools() throws Exception {
-    StoredField binaryFldCompressed = new StoredField("binaryCompressed", CompressionTools.compress(binaryValCompressed.getBytes(StandardCharsets.UTF_8)));
-    StoredField stringFldCompressed = new StoredField("stringCompressed", CompressionTools.compressString(binaryValCompressed));
-
-    Document doc = new Document();
-
-    doc.add(binaryFldCompressed);
-    doc.add(stringFldCompressed);
-
-    /** add the doc to a ram index */
-    Directory dir = newDirectory();
-    RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-    writer.addDocument(doc);
-
-    /** open a reader and fetch the document */
-    IndexReader reader = writer.getReader();
-    Document docFromReader = reader.document(0);
-    assertTrue(docFromReader != null);
-
-    /** fetch the binary compressed field and compare its content with the original one */
-    String binaryFldCompressedTest = new String(CompressionTools.decompress(docFromReader.getBinaryValue("binaryCompressed")), StandardCharsets.UTF_8);
-    assertTrue(binaryFldCompressedTest.equals(binaryValCompressed));
-    assertTrue(CompressionTools.decompressString(docFromReader.getBinaryValue("stringCompressed")).equals(binaryValCompressed));
-
-    writer.close();
-    reader.close();
-    dir.close();
-  }
 }
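
The new migration note for LUCENE-7322 suggests compressing on the application side with the java.util.zip classes. The following is a minimal sketch of what that could look like, not code from this commit: the class name ZipFieldCodec and its two methods are hypothetical, and the loops simply mirror the Deflater/Inflater usage of the removed CompressionTools. The explicit check for a stalled Inflater is an addition of this sketch, so that truncated input raises DataFormatException instead of looping.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical application-side helper; not part of Lucene. */
final class ZipFieldCodec {

  private ZipFieldCodec() {}

  /** Compresses the given byte range at the given level (see the Deflater constants). */
  static byte[] compress(byte[] value, int offset, int length, int level) {
    Deflater deflater = new Deflater(level);
    try {
      deflater.setInput(value, offset, length);
      deflater.finish();
      // The output may be larger than the input, so use a growable buffer.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[1024];
      while (!deflater.finished()) {
        int count = deflater.deflate(buf);
        out.write(buf, 0, count);
      }
      return out.toByteArray();
    } finally {
      deflater.end(); // release native resources
    }
  }

  /** Decompresses a byte range previously produced by compress. */
  static byte[] decompress(byte[] value, int offset, int length) throws DataFormatException {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(value, offset, length);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[1024];
      while (!inflater.finished()) {
        int count = inflater.inflate(buf);
        if (count > 0) {
          out.write(buf, 0, count);
        } else if (!inflater.finished()) {
          // No progress and not finished: the input is incomplete or needs a dictionary.
          throw new DataFormatException("input is truncated or requires a preset dictionary");
        }
      }
      return out.toByteArray();
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) throws DataFormatException {
    byte[] raw = "this string would be stored compressed".getBytes(StandardCharsets.UTF_8);
    byte[] packed = compress(raw, 0, raw.length, Deflater.BEST_COMPRESSION);
    byte[] unpacked = decompress(packed, 0, packed.length);
    System.out.println(Arrays.equals(raw, unpacked));
  }
}
```

The compressed bytes can then be stored the same way the removed test did, by passing them to a StoredField. When reading back, the value returned by getBinaryValue is a BytesRef, so its bytes, offset and length should be passed to decompress rather than assuming the whole backing array.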