LUCENE-4702: Improve performance for fuzzy queries.

Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those, we can trade very little space (a
couple MBs in the case of the wikibigall index) for better query efficiency.
Adrien Grand 2020-01-30 10:37:39 +01:00
parent a9482911a8
commit 13e2094804
1 changed files with 3 additions and 1 deletions

@@ -841,7 +841,9 @@ public final class BlockTreeTermsWriter extends FieldsConsumer {
 // If there are 2 suffix bytes or less per term, then we don't bother compressing as suffix are unlikely what
 // makes the terms dictionary large, and it also tends to be frequently the case for dense IDs like
 // auto-increment IDs, so not compressing in that case helps not hurt ID lookups by too much.
-if (suffixWriter.length() > 2L * numEntries) {
+// We also only start compressing when the prefix length is greater than 2 since blocks whose prefix length is
+// 1 or 2 always all get visited when running a fuzzy query whose max number of edits is 2.
+if (suffixWriter.length() > 2L * numEntries && prefixLength > 2) {
 // LZ4 inserts references whenever it sees duplicate strings of 4 chars or more, so only try it out if the
 // average suffix length is greater than 6.
 if (suffixWriter.length() > 6L * numEntries) {
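
For readers who want to see how the two thresholds and the new prefix-length guard interact, here is a minimal standalone sketch of the decision logic in the hunk above. It is not the Lucene implementation: the class `SuffixCompressionHeuristic`, the `Decision` enum and the `decide` method are hypothetical names introduced for illustration, and what the real writer does with blocks that pass the first check but not the LZ4 check lies outside the hunk, so the sketch only labels that case.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class SuffixCompressionHeuristic {

  /** Hypothetical labels for the three outcomes visible in the hunk. */
  enum Decision {
    SKIP,      // block is left uncompressed
    ELIGIBLE,  // passes the outer check, but LZ4 is not attempted
    TRY_LZ4    // average suffix length is large enough to attempt LZ4
  }

  /**
   * Mirrors the conditions in the patched code.
   *
   * @param suffixBytes  total suffix bytes in the block (suffixWriter.length() in the diff)
   * @param numEntries   number of entries in the block
   * @param prefixLength length of the prefix shared by the block's terms
   */
  static Decision decide(long suffixBytes, int numEntries, int prefixLength) {
    // Outer check: more than 2 suffix bytes per entry on average, and (new in
    // this commit) a prefix longer than 2, since blocks with a prefix length of
    // 1 or 2 are always visited by a fuzzy query allowing 2 edits.
    if (suffixBytes > 2L * numEntries && prefixLength > 2) {
      // Inner check: only attempt LZ4 above an average of 6 suffix bytes per entry.
      if (suffixBytes > 6L * numEntries) {
        return Decision.TRY_LZ4;
      }
      return Decision.ELIGIBLE;
    }
    return Decision.SKIP;
  }

  public static void main(String[] args) {
    System.out.println(decide(700, 100, 2)); // short prefix: SKIP
    System.out.println(decide(700, 100, 3)); // long suffixes, prefix > 2: TRY_LZ4
    System.out.println(decide(400, 100, 3)); // medium suffixes: ELIGIBLE
    System.out.println(decide(150, 100, 5)); // short suffixes: SKIP
  }
}
```

Before this commit, the first call would have reached the LZ4 check, because the outer condition looked only at the average suffix length.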