LUCENE-6682: StandardTokenizer performance bug: scanner buffer is unnecessarily copied when maxTokenLength doesn't change. Also stop silently maxing out buffer size (and effectively also max token length) at 1M chars, but instead throw an exception from setMaxTokenLength() when the given length is greater than 1M chars.

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1691604 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Steven Rowe 2015-07-17 16:16:36 +00:00
parent 2dfad693d6
commit e92618bbda
2 changed files with 27 additions and 4 deletions

View File

@ -261,6 +261,12 @@ Bug fixes
* LUCENE-6681: SortingMergePolicy must override MergePolicy.size(...).
(Christine Poerschke via Adrien Grand)
* LUCENE-6682: StandardTokenizer performance bug: scanner buffer is
unnecessarily copied when maxTokenLength doesn't change. Also stop silently
maxing out buffer size (and effectively also max token length) at 1M chars,
but instead throw an exception from setMaxTokenLength() when the given
length is greater than 1M chars. (Piotr Idzikowski, Steve Rowe)
Changes in Runtime Behavior
@ -385,6 +391,13 @@ Changes in Backwards Compatibility Policy
was removed. The implementation is free to reuse the internal BytesRef
or return a new one on each call. (Uwe Schindler)
* LUCENE-6682: StandardTokenizer.setMaxTokenLength() now throws an exception if
a length greater than 1M chars is given. Previously the effective max token
length (the scanner's buffer) was capped at 1M chars, but getMaxTokenLength()
incorrectly returned the previously requested length, even when it exceeded 1M.
(Piotr Idzikowski, Steve Rowe)
======================= Lucene 5.2.1 =======================
Bug Fixes

View File

@ -88,18 +88,28 @@ public final class StandardTokenizer extends Tokenizer {
"<HANGUL>"
};
public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
private int skippedPositions;
private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH;
/** Set the max allowed token length. Any token longer
* than this is skipped. */
/**
* Set the max allowed token length. No tokens longer than this are emitted.
*
* @throws IllegalArgumentException if the given length is outside of the
* range [1, {@value #MAX_TOKEN_LENGTH_LIMIT}].
*/
public void setMaxTokenLength(int length) {
if (length < 1) {
throw new IllegalArgumentException("maxTokenLength must be greater than zero");
} else if (length > MAX_TOKEN_LENGTH_LIMIT) {
throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
if (length != maxTokenLength) {
maxTokenLength = length;
scanner.setBufferSize(length);
}
this.maxTokenLength = length;
scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
}
/** @see #setMaxTokenLength */