Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those blocks, we can trade very little space (a
couple of MB in the case of the wikibigall index) for better query efficiency.
Changes include:
- Removed LZ4 compression of suffix lengths, which didn't save much space
anyway.
- For stats, LZ4 was only really used for run-length compression of terms whose
docFreq is 1. This has been replaced by explicit run-length compression.
- Since we only use LZ4 for suffix bytes if the compression ratio is < 75%, we
now only try LZ4 if the average suffix length is greater than 6, in order
to reduce index-time overhead.
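As a rough illustration of the last item, the two thresholds could be expressed as below. This is a minimal sketch: the class and method names are hypothetical and are not the ones used in the terms dictionary writer; only the values 6 and 0.75 come from the description above.

```java
/** Hypothetical helper illustrating the two thresholds; not Lucene's actual code. */
final class SuffixCompressionHeuristic {
  // Both thresholds are taken from the commit message.
  static final int MIN_AVG_SUFFIX_LENGTH = 6;
  static final double MAX_LZ4_RATIO = 0.75;

  /** Only attempt LZ4 when the average suffix length in the block exceeds 6 bytes. */
  static boolean shouldTryLz4(long totalSuffixBytes, int numTerms) {
    // Compare totals instead of dividing, to avoid integer rounding of the average.
    return numTerms > 0 && totalSuffixBytes > (long) MIN_AVG_SUFFIX_LENGTH * numTerms;
  }

  /** Keep the LZ4 output only when it is smaller than 75% of the original size. */
  static boolean shouldKeepLz4(int compressedLength, int originalLength) {
    return compressedLength < MAX_LZ4_RATIO * originalLength;
  }
}
```

The cheap first check avoids running the compressor on blocks of short suffixes, where it is unlikely to reach the 75% bar.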
The issue is that MockDirectoryWrapper's disk full check is horribly
inefficient. On every writeByte/etc, it totally recomputes disk space
across all files. This means it calls listAll() on the underlying
Directory (which sorts all the underlying files), then sums up fileLength()
for each of those files.
This leads to many pathological cases in the disk full tests... but the
number of tests impacted by this is minimal, and the logic is scary.
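To make the cost concrete, here is a minimal sketch of the pattern described above, using a hypothetical in-memory stand-in rather than MockDirectoryWrapper itself. The class, its fields, and the exception type are assumptions for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a directory wrapper; not MockDirectoryWrapper's code. */
final class NaiveDiskFullCheck {
  // A sorted map mimics a listing that returns file names in sorted order.
  private final Map<String, Long> fileLengths = new TreeMap<>();
  private final long maxSizeInBytes;
  long fileLengthCalls = 0; // counts per-file length lookups done by the checks

  NaiveDiskFullCheck(long maxSizeInBytes) {
    this.maxSizeInBytes = maxSizeInBytes;
  }

  void createFile(String name) {
    fileLengths.put(name, 0L);
  }

  /** Recomputes total usage from scratch: one length lookup per file. */
  long sizeInBytes() {
    long total = 0;
    for (long length : fileLengths.values()) {
      fileLengthCalls++;
      total += length;
    }
    return total;
  }

  /** Every single-byte write pays for a full recomputation across all files. */
  void writeByte(String name) {
    if (sizeInBytes() + 1 > maxSizeInBytes) {
      throw new IllegalStateException("fake disk full");
    }
    fileLengths.merge(name, 1L, Long::sum);
  }
}
```

With 10 files in the directory, writing 100 bytes one at a time performs 1000 length lookups, so the work grows with the number of files times the number of write calls.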
WhitespaceTokenizer defaults to a maximum token length of 255, and WhitespaceAnalyzer
does not allow this to be changed. This commit adds an optional maxTokenLen parameter
to WhitespaceAnalyzer as well, and documents the existing token length restriction.
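The effect of the token length restriction can be sketched with plain Java, without depending on Lucene. This is a simplified model, not the tokenizer's code: it assumes that a run of non-whitespace characters longer than the limit is emitted as several tokens of at most maxTokenLen characters, and it ignores surrogate pairs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of whitespace tokenization with a length cap; not Lucene's code. */
final class WhitespaceSplitSketch {
  static final int DEFAULT_MAX_TOKEN_LEN = 255; // default cited in the commit message

  static List<String> tokenize(String text, int maxTokenLen) {
    if (maxTokenLen <= 0) {
      throw new IllegalArgumentException("maxTokenLen must be positive");
    }
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isWhitespace(c)) {
        flush(current, tokens); // whitespace ends the current token
      } else {
        current.append(c);
        if (current.length() >= maxTokenLen) {
          flush(current, tokens); // assumed behavior: cut the token at the limit
        }
      }
    }
    flush(current, tokens);
    return tokens;
  }

  /** Emits the buffered token, if any, and resets the buffer. */
  private static void flush(StringBuilder current, List<String> tokens) {
    if (current.length() > 0) {
      tokens.add(current.toString());
      current.setLength(0);
    }
  }
}
```

Under this model, `tokenize("ab cdefgh i", 3)` yields `ab`, `cde`, `fgh`, `i`: the six-character run is cut in two because it exceeds the limit.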
- added 'owasp' task to the root project. This depends on
dependencyCheckAggregate which seems to be a better fit for multi-module
projects than dependencyCheckAnalyze (the difference is vague to me
from the plugin's documentation).
- you can run the "gradlew owasp" task explicitly and it'll run the
validation without any flags.
- the owasp task is only added to the check task if the validation.owasp
property is true. I think it should stay off by default on non-CI systems
(developer defaults) because downloading and validating dependencies takes
a significant chunk of time.
- I'm not sure *all* configurations should be included in the check...
perhaps we should limit ourselves to actual runtime dependencies and
exclude build dependencies, solr-ref-guide, etc.
Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient:
- LowercaseAsciiCompression, which applies when all bytes are in the
`[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars in 3 bytes.
It is very often applicable to analyzed content and decompresses very quickly
thanks to auto-vectorization support in the JVM.
- LZ4, when the compression ratio is less than 0.75.
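The "4 chars in 3 bytes" arithmetic follows from the two ranges holding 32 values each, 64 in total, so every byte fits in 6 bits. The sketch below shows that idea with a straightforward bit-packing. It is an illustration only: the class name is hypothetical, and the bit layout is a simple one chosen for readability, not claimed to match the layout used by LowercaseAsciiCompression.

```java
/** Illustrative 6-bit packing for bytes in [0x1F,0x3F) or [0x5F,0x7F); not Lucene's layout. */
final class SixBitPackingSketch {
  static boolean isCompressible(byte b) {
    return (b >= 0x1F && b < 0x3F) || (b >= 0x5F && b < 0x7F);
  }

  /** Maps the first range to codes 0..31 and the second range to codes 32..63. */
  static int toCode(byte b) {
    return b < 0x3F ? b - 0x1F : b - 0x5F + 32;
  }

  static byte fromCode(int code) {
    return (byte) (code < 32 ? code + 0x1F : code - 32 + 0x5F);
  }

  /** Returns the packed bytes, or null if some byte falls outside the two ranges. */
  static byte[] compress(byte[] in) {
    for (byte b : in) {
      if (!isCompressible(b)) {
        return null;
      }
    }
    byte[] out = new byte[(in.length * 6 + 7) / 8];
    for (int i = 0; i < in.length; i++) {
      int code = toCode(in[i]);
      int bitPos = i * 6;
      int byteIdx = bitPos >>> 3;
      int shift = bitPos & 7;
      out[byteIdx] |= (byte) (code << shift);
      if (shift > 2) { // the 6 bits spill over into the next byte
        out[byteIdx + 1] |= (byte) (code >>> (8 - shift));
      }
    }
    return out;
  }

  /** Reverses compress; the original length must be stored separately by the caller. */
  static byte[] decompress(byte[] in, int length) {
    byte[] out = new byte[length];
    for (int i = 0; i < length; i++) {
      int bitPos = i * 6;
      int byteIdx = bitPos >>> 3;
      int shift = bitPos & 7;
      int bits = (in[byteIdx] & 0xFF) >>> shift;
      if (shift > 2) {
        bits |= (in[byteIdx + 1] & 0xFF) << (8 - shift);
      }
      out[i] = fromCode(bits & 0x3F);
    }
    return out;
  }
}
```

For example, the 8 bytes of `lucene-8` pack into 6 bytes, while `Lucene` is rejected because the uppercase `L` (0x4C) lies outside both ranges.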
I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it in order to only keep the logic that detects duplicate strings.
The logic about what to do in case overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.
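For readers unfamiliar with what "detecting duplicate strings" means here, the sketch below finds, for a given position, the longest earlier occurrence of the upcoming bytes. It is a brute-force illustration with hypothetical names; a real compressor would use a hash table rather than scanning every earlier position, and this is not the code that was kept.

```java
/** Brute-force longest-match search; illustrative only, not Lucene's LZ4 code. */
final class LongestMatchSketch {
  static final int MIN_MATCH = 4; // LZ4 matches are at least 4 bytes long

  /**
   * Returns {offset, length} of the longest earlier match for the bytes starting
   * at pos, or null if no match of at least MIN_MATCH bytes exists.
   */
  static int[] longestMatch(byte[] data, int pos) {
    int bestStart = -1;
    int bestLen = 0;
    for (int start = 0; start < pos; start++) {
      int len = 0;
      // start < pos, so start + len stays in bounds whenever pos + len does.
      while (pos + len < data.length && data[start + len] == data[pos + len]) {
        len++;
      }
      if (len > bestLen) {
        bestLen = len;
        bestStart = start;
      }
    }
    return bestLen >= MIN_MATCH ? new int[] {pos - bestStart, bestLen} : null;
  }
}
```

On the input `abcdXabcdY`, the search at position 5 reports a match of length 4 at offset 5, which an LZ-style encoder would emit as a back-reference instead of repeating the four bytes.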