* hunspell: a couple of micro-optimizations to speed up dictionary loading
1. sort by the whole entry without searching for separators first: WordStorage doesn't require strict lexicographic order (only something close to it), and the separators sort before any usual word characters anyway
2. avoid stream overhead when adding an entry
I think that this optimization was introduced because `advanceShallow` may
advance skip lists and then never decode a block of postings. But actually
`IndexInput#seek` is cheap, including on `NIOFSDirectory`. So let's seek
immediately?
* #7820: add initial (failing) test case exposing the bug in CheckIndex
* #7820: initial fix to CheckIndex to detect 'on load' corruption of older segments files
* exclamation point police
* tidy
* add missing @Override in new test case
* fold feedback: merge SIS-loading logic into single for-loop; rename sis -> lastCommit
* tidy
* sidestep segments.gen file as well in case we are reading very old index
* tidy
* always load last commit point first so if it has an issue that will not be masked by issues with older commit points
When k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors.
One scenario where we may still explore the graph even though k >= numVectors is when not every document has a vector and there are deleted docs. Even so, this commit significantly improves things.
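The idea can be illustrated with a minimal, self-contained sketch. It is not the Lucene implementation: `ExactSearchSketch`, `shouldSkipGraph`, and `exactTopK` are hypothetical names, and dot product stands in for whichever similarity function the field uses. When `k` is at least the number of vectors, every vector ends up in the result, so scoring all of them directly avoids the graph traversal.

```java
import java.util.Arrays;

public class ExactSearchSketch {
  /** Hypothetical check: the graph cannot prune anything when k covers every vector. */
  static boolean shouldSkipGraph(int k, int numVectors) {
    return k >= numVectors;
  }

  /** Dot-product similarity (an assumption for this sketch); higher is better. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scores every vector and returns the ordinals of up to k best hits, best first. */
  static int[] exactTopK(float[][] vectors, float[] query, int k) {
    final float[] scores = new float[vectors.length];
    Integer[] ords = new Integer[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      scores[i] = dot(vectors[i], query);
      ords[i] = i;
    }
    // Sort ordinals by descending score; the sort is stable, so ties keep ordinal order.
    Arrays.sort(ords, (a, b) -> Float.compare(scores[b], scores[a]));
    int n = Math.min(k, ords.length);
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = ords[i];
    }
    return top;
  }
}
```

A caller would consult `shouldSkipGraph(k, numVectors)` first and only fall through to the graph search when it returns `false`.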
This was found by `testRandomExceptions()`: if an exception occurs when opening
the meta file, then the `rawVectorsReader` that is passed to the constructor
never gets closed.
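The general shape of such a fix can be sketched with plain `java.io` types. This is an illustration under assumptions, not the actual patch: `CloseOnFailureSketch`, `TrackingResource`, and `Reader` are hypothetical, and the `failOnOpen` flag simulates an exception while opening the meta file. The constructor takes ownership of a resource and closes it if initialization fails.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnFailureSketch {
  /** Hypothetical stand-in for the reader handed to the constructor. */
  static final class TrackingResource implements Closeable {
    boolean closed;

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Hypothetical reader that owns the delegate from the moment it is constructed. */
  static final class Reader implements Closeable {
    private final Closeable delegate;

    Reader(Closeable delegate, boolean failOnOpen) throws IOException {
      this.delegate = delegate;
      boolean success = false;
      try {
        if (failOnOpen) {
          // Simulates an exception thrown while opening the meta file.
          throw new IOException("simulated failure opening meta file");
        }
        success = true;
      } finally {
        if (!success) {
          try {
            // Release the delegate; the caller never gets a Reader to close.
            delegate.close();
          } catch (IOException ignored) {
            // Keep the original exception as the one that propagates.
          }
        }
      }
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }
}
```

Because the caller never receives a `Reader` instance when the constructor throws, the constructor is the only place left that can release the delegate.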
Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
This commit refactors the usage of the deprecated System::runFinalization in tests and benchmarks, so that narrowly targeted suppressions can be added.
Currently the HNSW codec does too many things: it not only indexes vectors, but also stores them and determines how to store them given the vector type.
This PR extracts the vector storage into a new format, `Lucene99FlatVectorsFormat`, and adds a new base class called `FlatVectorsFormat`. This allows for some additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats.
Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat`, and there is a new `Lucene99HnswScalarQuantizedVectorsFormat` that uses `Lucene99ScalarQuantizedVectorsFormat`.
* Skip document by docValues
* When the queue is full with only one Comparator, we can tune maxValueAsBytes/minValueAsBytes more tightly. For instance, if the sort is ascending and the bottom value is 5, we will use the range [MIN_VALUE, 4].
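
The range arithmetic in the example can be sketched as follows. This is a simplified illustration, not the comparator code itself: `CompetitiveRangeSketch` and `competitiveRange` are hypothetical names, values are modeled as `long`, and the descending case is an assumption mirrored from the ascending one described above. With a single comparator and a full queue, a value equal to the bottom cannot displace it, so the bound excludes the bottom value.

```java
public class CompetitiveRangeSketch {
  /**
   * Returns the inclusive {min, max} range of values that can still enter a full queue,
   * or null when no value can be competitive.
   *
   * @param bottom the current bottom (worst) value in the queue
   * @param reverse false for an ascending sort, true for a descending sort
   */
  static long[] competitiveRange(long bottom, boolean reverse) {
    if (!reverse) {
      // Ascending: only values strictly smaller than the bottom are competitive.
      if (bottom == Long.MIN_VALUE) {
        return null;
      }
      return new long[] {Long.MIN_VALUE, bottom - 1};
    }
    // Descending (assumed symmetric): only values strictly larger than the bottom.
    if (bottom == Long.MAX_VALUE) {
      return null;
    }
    return new long[] {bottom + 1, Long.MAX_VALUE};
  }
}
```

For an ascending sort with a bottom value of 5, `competitiveRange(5, false)` yields `[Long.MIN_VALUE, 4]`, matching the example above.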
---------
Co-authored-by: Adrien Grand <jpountz@gmail.com>
### Description
When using cosine similarity, the `ScalarQuantizer` normalizes vectors when calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does not normalize the vectors prior to quantizing them when producing the quantized vectors to write to disk. This PR normalizes vectors prior to quantizing them when writing them to disk.
Recall results on my M1 with the `glove-100-angular` data set (all using `maxConn`: 16, `beamWidth`: 100, `numCandidates`: 100, `k`: 10, single segment):
| Configuration | Recall | Average Query Duration |
|---------------|-------|-----------------|
| Pre-patch no quantization | 0.78762 | 0.68 ms |
| Pre-patch with quantization | 8.999999999999717E-5 | 0.45 ms |
| Post-patch no quantization | 0.78762 | 0.70 ms |
| Post-patch with quantization | 0.66742 | 0.66 ms |
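
To illustrate why the ordering matters, here is a minimal sketch of normalizing before quantizing. It is not the writer's actual code: `NormalizeThenQuantizeSketch`, the linear mapping onto `[0, 127]`, and the quantile bounds `-1` and `1` are assumptions chosen for the example. If the quantiles were computed from normalized vectors, an unnormalized vector mostly falls outside them and saturates.

```java
public class NormalizeThenQuantizeSketch {
  /** Returns a unit-length copy of v; a zero vector is returned unchanged. */
  static float[] l2Normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    float norm = (float) Math.sqrt(sumSquares);
    float[] out = v.clone();
    if (norm == 0f) {
      return out;
    }
    for (int i = 0; i < out.length; i++) {
      out[i] = v[i] / norm;
    }
    return out;
  }

  /**
   * Linearly maps each component from [min, max] onto [0, 127], clamping outliers.
   * The bounds play the role of the quantiles; max must be greater than min.
   */
  static byte[] quantize(float[] v, float min, float max) {
    float scale = 127f / (max - min);
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      float clamped = Math.max(min, Math.min(max, v[i]));
      out[i] = (byte) Math.round((clamped - min) * scale);
    }
    return out;
  }
}
```

With bounds of `-1` and `1`, the raw vector `{3, 4}` clamps to the maximum in both dimensions and loses its direction, while its normalized form `{0.6, 0.8}` keeps two distinct quantized values.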