This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` requires knowing the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files. These files never get fsynced, but they do have
index headers and footers to make sure that any corruption in them cannot
propagate to the index.
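For reference, here is a minimal sketch of how `DirectMonotonicWriter` is typically driven once the value count is known. The file names, sample values and block shift of 10 are made up for illustration, and the exact `getInstance` signature is an assumption rather than something taken from this change:

```java
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

// Sketch only: once the number of values is known (after the temporary
// buffering described above), monotonically increasing file pointers can be
// written as a DirectMonotonic array.
public final class DirectMonotonicWriteSketch {
  public static void main(String[] args) throws IOException {
    long[] filePointers = {0, 128, 300, 512}; // count is known up-front
    try (Directory dir = new ByteBuffersDirectory();
        IndexOutput meta = dir.createOutput("pointers.meta", IOContext.DEFAULT);
        IndexOutput data = dir.createOutput("pointers.data", IOContext.DEFAULT)) {
      DirectMonotonicWriter writer =
          DirectMonotonicWriter.getInstance(meta, data, filePointers.length, 10);
      for (long fp : filePointers) {
        writer.add(fp); // values are added in order
      }
      writer.finish();
    }
  }
}
```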
`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata to go to the `IndexInput` as rarely as possible. In the
common case it only needs to consult a single sub `DirectReader`, which,
combined with the block size of 1k values, helps bound the number of page
faults to 2.
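A standalone illustration of why the common case stays within a single block, as a hypothetical sketch rather than the actual `DirectMonotonicReader` or sub `DirectReader` code:

```java
// Hypothetical sketch: per-block minimum values kept in metadata are searched
// first (in memory), so the data itself is only consulted for one block.
final class MonotonicSearchSketch {
  /** Returns the index of {@code key}, or -1 if it is absent. */
  static long search(long[] blockMins, long[][] blocks, long key) {
    // Step 1: pick the candidate block from the in-memory metadata (no I/O).
    int block = 0;
    int lo = 0, hi = blockMins.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (blockMins[mid] <= key) {
        block = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // Step 2: binary-search within the single selected block.
    long[] values = blocks[block];
    int from = 0, to = values.length - 1;
    while (from <= to) {
      int mid = (from + to) >>> 1;
      if (values[mid] == key) {
        return (long) block * 1024 + mid; // assuming 1k values per block
      } else if (values[mid] < key) {
        from = mid + 1;
      } else {
        to = mid - 1;
      }
    }
    return -1; // not found
  }
}
```

Since step 1 runs entirely on metadata, only one or two pages of the underlying data need to be faulted in.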
SOLR-14095 introduced an issue for rolling restarts (incompatible Java serialization). This change fixes the compatibility issue while keeping the functionality of SOLR-14095.
The entire precommit task will still fail with an unsupported Java version
(subsequent checks do not support the newer javadoc format), but this allows
the ECJ linter to run, which checks for things such as unused imports.
This exercises various places in the Streaming Expressions code that use
background threads, to confirm that the expected credentials (or lack thereof)
are propagated along.
The test currently has comments + workarounds for 2 known client issues:
- SOLR-14226: SolrStream reports AuthN/AuthZ failures (401|403) as IOException w/o details
- SOLR-14222: CloudSolrClient converts (update) 403 error to 500 error
(cherry picked from commit 517438e356)
Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those, we can trade very little space (a
couple MBs in the case of the wikibigall index) for better query efficiency.
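Expressed as a condition, this is roughly the rule (a hypothetical sketch, not the actual terms-dictionary writer code):

```java
// Hypothetical sketch: shallow blocks (prefix length 1 or 2), which every
// fuzzy query with an edit distance of 1 or 2 has to visit, are left
// uncompressed so they stay cheap to read.
final class ShallowBlockRule {
  static boolean shouldCompressSuffixes(int prefixLength) {
    return prefixLength > 2; // simplified reading of the rule above
  }
}
```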
Adds some build parameters to tune how tests run. An example is shown by
"gradle helpLocalSettings".
C2 is off by default in tests, as it is wasteful locally and slows down test
runs. You can override this by setting tests.jvmargs for gradle, or args for
ant.
Some crazy Lucene stress tests may need to be toned down after the change, as
they may have been doing too many iterations by default... but this is not a
new problem.
The issue is that MockDirectoryWrapper's disk full check is horribly
inefficient. On every writeByte/etc, it totally recomputes disk space
across all files. This means it calls listAll() on the underlying
Directory (which sorts all the underlying files), then sums up fileLength()
for each of those files.
This leads to many pathological cases in the disk full tests... but the
number of tests impacted by this is minimal, and the logic is scary.
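For illustration, the accounting described above boils down to something like this simplified sketch (not the actual MockDirectoryWrapper code):

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;

// Simplified sketch of the per-write accounting described above: every byte
// written triggers a full directory listing plus one fileLength() call per
// file, which is what makes the disk-full checks so expensive.
final class DiskFullCheckSketch {
  static long usedBytes(Directory dir) throws IOException {
    long used = 0;
    for (String file : dir.listAll()) { // listAll() also sorts the file names
      used += dir.fileLength(file);
    }
    return used;
  }

  static void checkBeforeWrite(Directory dir, long maxUsableBytes, int bytesToWrite)
      throws IOException {
    if (usedBytes(dir) + bytesToWrite > maxUsableBytes) {
      throw new IOException("fake disk full"); // what the tests rely on
    }
  }
}
```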
Calming down individual test methods that showed double-digit execution times
after running the tests many times.
There are a few more issues remaining, but this solves the majority of them.
This is consistently the slowest test for me in all of Lucene core by far: it
takes around an entire minute. Mark it nightly: it should catch any issues
with RAM estimation but keep local builds fast.
Changes include:
- Removed LZ4 compression of suffix lengths, which didn't save much space
anyway.
- For stats, LZ4 was only really used for run-length compression of terms whose
docFreq is 1. This has been replaced by explicit run-length compression.
- Since we only use LZ4 for suffix bytes if the compression ratio is < 75%, we
now only try LZ4 out if the average suffix length is greater than 6, in order
to reduce index-time overhead (see the sketch after this list).
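A sketch of the resulting heuristic; the constants (6 and 75%) come from the text above, while the class and method names are made up:

```java
// Sketch of the index-time heuristic described above; names are hypothetical.
final class SuffixCompressionHeuristic {
  /** Only attempt LZ4 when the average suffix length is greater than 6. */
  static boolean worthTryingLz4(long totalSuffixBytes, int numTerms) {
    return totalSuffixBytes > 6L * numTerms;
  }

  /** Only keep the LZ4 output when it shrinks the block below 75% of its size. */
  static boolean keepLz4(int originalLength, int compressedLength) {
    return compressedLength < 0.75 * originalLength;
  }
}
```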
Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient (the sketch after the list below illustrates the choice):
- LowercaseAsciiCompression, which applies when all bytes are in the
`[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars in 3 bytes.
It is very often applicable to analyzed content and decompresses very quickly
thanks to auto-vectorization support in the JVM.
- LZ4, when the compression ratio is less than 0.75.
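A rough sketch of how a scheme could be picked for a block of suffix bytes, following the rules above; the names are hypothetical and raw storage is assumed as the fallback when neither scheme helps:

```java
// Hypothetical sketch of choosing a compression scheme for a block of suffix
// bytes, following the rules described above.
enum SuffixCompression { LOWERCASE_ASCII, LZ4, RAW }

final class CompressionChoiceSketch {
  /**
   * True if every byte falls in [0x1F,0x3F) or [0x5F,0x7F), i.e. digits,
   * lowercase ASCII letters, '.', '-', '_', etc.
   */
  static boolean lowercaseAsciiApplies(byte[] block, int len) {
    for (int i = 0; i < len; i++) {
      int b = block[i] & 0xFF;
      boolean inLow = b >= 0x1F && b < 0x3F;
      boolean inHigh = b >= 0x5F && b < 0x7F;
      if (!inLow && !inHigh) {
        return false;
      }
    }
    return true;
  }

  static SuffixCompression choose(byte[] block, int len, int lz4CompressedLen) {
    if (lowercaseAsciiApplies(block, len)) {
      return SuffixCompression.LOWERCASE_ASCII; // packs 4 chars into 3 bytes
    }
    if (lz4CompressedLen < 0.75 * len) {
      return SuffixCompression.LZ4;
    }
    return SuffixCompression.RAW;
  }
}
```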
I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it to only keep the logic that detects duplicate strings. The
logic about what to do when overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.
WhitespaceTokenizer defaults to a maximum token length of 255, and WhitespaceAnalyzer
does not allow this to be changed. This commit adds an optional maxTokenLen parameter
to WhitespaceAnalyzer as well, and documents the existing token length restriction.
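A short usage sketch, assuming the int-argument constructor this change describes; the field name, sample text, and limit of 4096 are made up:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Raise the maximum token length from the default of 255 to 4096 characters.
public final class WhitespaceMaxLenDemo {
  public static void main(String[] args) throws Exception {
    try (WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(4096);
        TokenStream ts = analyzer.tokenStream("field", "tokens split on whitespace only")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
```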