When k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors.
One scenario where we may still explore the graph even though k >= numVectors is when not every document has a vector and there are deleted docs. Even so, this commit significantly improves things.
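As a rough illustration of the idea (not the actual Lucene change), the check and the exact fallback could look like the sketch below. All names here (`ExactSearchSketch`, `shouldSkipGraph`, `exactTopK`) are hypothetical, and a plain dot product stands in for the configured similarity function.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch; none of these names are Lucene APIs.
final class ExactSearchSketch {

  /** Dot product used as a stand-in similarity (higher is better). */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** The graph cannot narrow anything down when every vector fits in the top k. */
  static boolean shouldSkipGraph(int k, int numVectors) {
    return k >= numVectors;
  }

  /** Scores every vector and returns up to k ordinals, best score first. */
  static int[] exactTopK(float[][] vectors, float[] query, int k) {
    Integer[] ords = new Integer[vectors.length];
    for (int i = 0; i < ords.length; i++) {
      ords[i] = i;
    }
    // Negate the score so that the ascending sort puts the best match first.
    Arrays.sort(ords, Comparator.comparingDouble((Integer o) -> -dot(vectors[o], query)));
    int n = Math.min(k, ords.length);
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = ords[i];
    }
    return top;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0f, 1f}, {0.5f, 0.5f}};
    float[] query = {1f, 0f};
    int k = 10;
    if (shouldSkipGraph(k, vectors.length)) {
      System.out.println(Arrays.toString(exactTopK(vectors, query, k)));
    }
  }
}
```

With three vectors and k = 10, every vector ends up in the result anyway, so scoring them all directly avoids the graph traversal overhead.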
This was found by `testRandomExceptions()`: if an exception occurs when opening
the meta file, then the `rawVectorsReader` that is passed to the constructor
never gets closed.
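The general pattern for this kind of fix is to close the resource that was handed over if the constructor fails partway through. The sketch below is only an illustration under assumed names (`OwningReader`, `TrackingReader`, `openMeta` are hypothetical stand-ins, not the classes touched by this commit).

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch of "close the delegate if the constructor fails".
final class CloseOnFailureSketch {

  /** Stand-in for the delegate reader; records whether close() was called. */
  static final class TrackingReader implements Closeable {
    boolean closed = false;

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Stand-in for the reader that takes ownership of the delegate. */
  static final class OwningReader implements Closeable {
    private final Closeable delegate;

    OwningReader(Closeable delegate, boolean failOnOpen) throws IOException {
      this.delegate = delegate;
      boolean success = false;
      try {
        openMeta(failOnOpen); // may throw, e.g. on a corrupt or missing file
        success = true;
      } finally {
        if (!success) {
          // The caller never receives this object, so nobody else can close the delegate.
          closeQuietly(delegate);
        }
      }
    }

    private static void openMeta(boolean fail) throws IOException {
      if (fail) {
        throw new IOException("simulated failure opening meta file");
      }
    }

    private static void closeQuietly(Closeable c) {
      try {
        c.close();
      } catch (IOException ignored) {
        // Keep the original exception as the one that propagates.
      }
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }
  }

  public static void main(String[] args) {
    TrackingReader raw = new TrackingReader();
    try {
      new OwningReader(raw, true);
    } catch (IOException e) {
      System.out.println("constructor failed, delegate closed: " + raw.closed);
    }
  }
}
```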
Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
This commit refactors the usage of the deprecated System::runFinalization in tests and benchmarks, so that narrowly targeted suppressions can be added.
Currently the HNSW codec does too many things: it not only indexes vectors, but also stores them and determines how to store them given the vector type.
This PR extracts the vector storage into a new format, `Lucene99FlatVectorsFormat`, and adds a new base class called `FlatVectorsFormat`. This allows for some additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats.
Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat`, and a new `Lucene99HnswScalarQuantizedVectorsFormat` uses `Lucene99ScalarQuantizedVectorsFormat`.
* Skip document by docValues
* When the queue is full with only one Comparator, we can tune maxValueAsBytes/minValueAsBytes more tightly. For instance, if the sort is ascending and the bottom value is 5, we will use the range [MIN_VALUE, 4].
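
The range tightening in the second bullet can be sketched as follows. This is a minimal illustration with hypothetical names (`CompetitiveRangeSketch`, `competitiveRange`), working on `long` values rather than the byte-encoded bounds. The descending branch is an assumption that mirrors the ascending example; only the ascending case is stated above.

```java
// Hypothetical sketch; not the comparator code changed by this commit.
final class CompetitiveRangeSketch {

  /**
   * Returns the inclusive {min, max} range of values that can still beat the
   * current bottom value, or null when no value can. Ties with bottom are
   * excluded, which assumes a single comparator (no secondary sort to consult).
   */
  static long[] competitiveRange(long bottom, boolean ascending) {
    if (ascending) {
      if (bottom == Long.MIN_VALUE) {
        return null; // nothing is strictly smaller
      }
      return new long[] {Long.MIN_VALUE, bottom - 1};
    }
    // Assumed mirror image for a descending sort.
    if (bottom == Long.MAX_VALUE) {
      return null; // nothing is strictly larger
    }
    return new long[] {bottom + 1, Long.MAX_VALUE};
  }

  public static void main(String[] args) {
    long[] range = competitiveRange(5, true);
    System.out.println("[" + range[0] + ", " + range[1] + "]");
  }
}
```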
---------
Co-authored-by: Adrien Grand <jpountz@gmail.com>
### Description
When using cosine similarity, the `ScalarQuantizer` normalizes vectors when calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does not normalize the vectors prior to quantizing them when producing the quantized vectors to write to disk. This PR normalizes vectors prior to quantizing them when writing them to disk.
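
To make the mismatch concrete, here is a simplified sketch of normalizing before quantizing. The class and method names are hypothetical, and the 7-bit linear quantizer is a simplification for illustration, not a copy of `ScalarQuantizer`; the quantile bounds `min` and `max` are assumed to have been computed over normalized vectors.

```java
import java.util.Arrays;

// Hypothetical sketch; simplified relative to Lucene's quantizer.
final class NormalizeThenQuantizeSketch {

  /** Returns a unit-length copy of v (or an unchanged copy for the zero vector). */
  static float[] l2Normalize(float[] v) {
    double sumSq = 0;
    for (float x : v) {
      sumSq += x * x;
    }
    float norm = (float) Math.sqrt(sumSq);
    float[] out = v.clone();
    if (norm == 0f) {
      return out;
    }
    for (int i = 0; i < out.length; i++) {
      out[i] /= norm;
    }
    return out;
  }

  /** Clamps each component to [min, max] and maps it linearly onto 0..127. */
  static byte[] quantize(float[] v, float min, float max) {
    float scale = 127f / (max - min);
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      float clamped = Math.max(min, Math.min(max, v[i]));
      out[i] = (byte) Math.round((clamped - min) * scale);
    }
    return out;
  }

  /** For cosine similarity: normalize first so the values match the quantile range. */
  static byte[] quantizeForCosine(float[] v, float min, float max) {
    return quantize(l2Normalize(v), min, max);
  }

  public static void main(String[] args) {
    float[] v = {3f, 4f};
    System.out.println(Arrays.toString(quantize(v, -1f, 1f)));          // saturates at the upper bound
    System.out.println(Arrays.toString(quantizeForCosine(v, -1f, 1f))); // keeps the direction of v
  }
}
```

In this toy case the un-normalized vector clamps to the upper bound in every dimension, so its direction is lost, which is the kind of distortion that is consistent with the collapsed recall reported in this description.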
Recall results on my M1 with the `glove-100-angular` data set (all using `maxConn`: 16, `beamWidth`: 100, `numCandidates`: 100, `k`: 10, single segment):
| Configuration | Recall | Average Query Duration |
|---------------|-------|-----------------|
| Pre-patch no quantization | 0.78762 | 0.68 ms |
| Pre-patch with quantization | 8.999999999999717E-5 | 0.45 ms |
| Post-patch no quantization | 0.78762 | 0.70 ms |
| Post-patch with quantization | 0.66742 | 0.66 ms |
Clean-up from adding the Lucene99PostingsFormat in https://github.com/apache/lucene/pull/12741
These test cases were moved to the Lucene99 directory, and I forgot to copy the unmodified versions for backward_codecs.lucene90.
Both testEuclidean and testExplain have vectors that result
in equal scores. Since we no longer tie break on vector ordinal
as it doesn't make sense when building the graph, the vectors returned
might be slightly different. This commit fixes the flaky nature of the
test.
I noticed while testing lower dimensionality and quantization, we would explore the HNSW graph way too much. I was stuck figuring out why until I noticed the searcher checks for distance equality (not just if the distance is better) when exploring neighbors-of-neighbors. This seems like a bad heuristic, but to double check I looked at what nmslib does. This pointed me back to this commit: nmslib/nmslib#106
Seems like this performance hitch was discovered a while ago :).
This commit adjusts HNSW to only explore the graph layer if the distance is actually better.
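The difference between the two acceptance rules can be shown with a small sketch. The names below (`StrictImprovementSketch`, `shouldExplore`, `countExplored`) are hypothetical and the logic is reduced to the comparison itself; it is not the searcher code from this commit. Scores are treated as similarities, where higher is better.

```java
// Hypothetical sketch of the ">=" versus ">" acceptance check.
final class StrictImprovementSketch {

  /** Old rule (strict == false) also accepts ties; new rule requires a strictly better score. */
  static boolean shouldExplore(float candidateScore, float minAcceptedScore, boolean strict) {
    return strict ? candidateScore > minAcceptedScore : candidateScore >= minAcceptedScore;
  }

  /** Counts how many neighbor scores pass the check against a fixed threshold. */
  static int countExplored(float[] neighborScores, float minAcceptedScore, boolean strict) {
    int explored = 0;
    for (float score : neighborScores) {
      if (shouldExplore(score, minAcceptedScore, strict)) {
        explored++;
      }
    }
    return explored;
  }

  public static void main(String[] args) {
    // Low dimensionality or quantization makes equal scores common.
    float[] scores = {0.5f, 0.5f, 0.5f, 0.7f};
    System.out.println("ties accepted: " + countExplored(scores, 0.5f, false));
    System.out.println("strictly better only: " + countExplored(scores, 0.5f, true));
  }
}
```

When many neighbors tie with the current worst accepted score, accepting ties means exploring all of them, while the strict check explores only the ones that can actually improve the result set.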
* Change Postings back to using FOR in Lucene99PostingsFormat
We are still keeping PFOR for positions only.
This is a partial revert of https://github.com/apache/lucene/pull/69 which brings back ForDeltaUtil.
* fix merge commit
* Add forgotten forDeltaUtil calls to reader
* Addressing comments: adding Lucene90RWPostingsFormat + more
Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil
* Changes.txt edit in right place now
* Apply suggestions from code review: `90 -> 99 refactoring`
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Remove decodeTo32 from ForUtil and regenerate
---------
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>