Currently the HNSW codec does too many things, it not only indexes vectors, but stores them and determines how to store them given the vector type.
This PR extracts out the vector storage into a new format `Lucene99FlatVectorsFormat` and adds new base class called `FlatVectorsFormat`. This allows for some additional helper functions that allow an indexing codec (like HNSW) take advantage of the flat formats.
Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat` and a new `Lucene99HnswScalarQuantizedVectorsFormat` that uses `Lucene99ScalarQuantizedVectorsFormat`
* Skip document by docValues
*When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and bottom value is 5, we will use a range on [MIN_VALUE, 4].
---------
Co-authored-by: Adrien Grand <jpountz@gmail.com>
### Description
When using cosine similarity, the `ScalarQuantizer` normalizes vectors when calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does not normalize the vectors prior to quantizing them when producing the quantized vectors to write to disk. This PR normalizes vectors prior to quantizing them when writing them to disk.
Recall results on my M1 with the `glove-100-angular` data set (all using `maxConn`: 16, `beamWidth` 100, `numCandidates`: 100, `k`: 10, single segment):
| Configuration | Recall | Average Query Duration |
|---------------|-------|-----------------|
| Pre-patch no quantization | 0.78762 | 0.68 ms |
| Pre-patch with quantization | 8.999999999999717E-5 | 0.45 ms |
| Post-patch no quantization | 0.78762 | 0.70 ms |
| Post-patch with quantization | 0.66742 | 0.66 ms |
Clean-up from adding the Lucene99PostingsFormat in https://github.com/apache/lucene/pull/12741
These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90
Both testEuclidean and testExplain have vectors that result
in equal scores. Since we no longer tie break on vector ordinal
as it doesn't make sense when building the graph, the vectors returned
might be slightly different. This commit fixes the flaky nature of the
test.
I noticed while testing lower dimensionality and quantization, we would explore the HNSW graph way too much. I was stuck figuring out why until I noticed the searcher checks for distance equality (not just if the distance is better) when exploring neighbors-of-neighbors. This seems like a bad heuristic, but to double check I looked at what nmslib does. This pointed me back to this commit: nmslib/nmslib#106
Seems like this performance hitch was discovered awhile ago :).
This commit adjusts HNSW to only explore the graph layer if the distance is actually better.
* Change Postings back to using FOR in Lucene99PostingsFormat
We are still keeping PFOR for positions only.
This is a partial revert of https://github.com/apache/lucene/pull/69 which brings back ForDeltaUtil.
* fix merge commit
* Add forgotten forDeltaUtil calls to reader
* Addressing comments: adding Lucene90RWPostingsFormat + more
Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil
* Changes.txt edit in right place now
* Apply suggestions from code review: `90 -> 99 refactoring`
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Remove decodeTo32 from ForUtil and regenerate
---------
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Clean up ByteBlockPool
1. Move slice functionality to a separate class, ByteSlicePool.
2. Add exception case if the requested slice size is larger than the
block size.
3. Make pool buffers private instead of the comment asking not to modify it.
4. Consolidate setBytesRef methods with int offsets.
5. Simplify ramBytesUsed.
6. Update and expand comments and javadoc.
* Revert to long offsets in ByteBlockPool; Clean-up
* Remove slice functionality from TermsHashPerField
* First working test
* [Broken] Test interleaving slices
* Fixed randomized test
* Remove redundant tests
* Tidy
* Add CHANGES
* Move ByteSlicePool to oal.index
* Use value-based LRU cache in NodeHash (#12714)
* tidy code
* Add a nocommit about OffsetAndLength
* Fix the readBytes method
* Use List<byte[]> instead of ByteBlockPool
* Move nodesEqual to PagedGrowableHash
* Add generic type
* Fix the count variable
* Fix the RAM usage measurement
* Use PagedGrowableWriter instead of HashMap
* Remove unused generic type
* Update the ramBytesUsed formula
* Retain the FSTCompiler.addNode signature
* Switch back to ByteBlockPool
* Remove the unnecessary assertion
* Remove fstHashAddress
* Add some javadoc
* Fix the address offset when reading from fallback table
* tidy code
* Address comments
* Add assertions
This commit replaces the usage of the deprecated java.net.URL constructor with URI, later converting toURL where necessary to interoperate with the URLConnection API.
The default tests.multiplier passed from gradle was 1, but
LuceneTestCase tried to compute its default value from TESTS_NIGHTLY.
This could lead to subtle errors: nightly mode failures would not report
tests.multipler=1 and when started from the IDE, the tests.multiplier
would be set to 2 (leading to different randomness).
Since increasing the number of hits retrieved in nightly benchmarks from 10 to
100, the performance of sorting documents by title dropped back to the level it
had before introducing dynamic pruning. This is not too surprising given that
the `title` field is a unique field, so the optimization would only kick in
when the current 100th hit would have an ordinal that is less than 128 -
something that would only happen after collecting most hits.
This change increases the threshold to 1024, so that the optimization would
kick in when the current 100th hit has an ordinal that is less than 1024,
something that happens a bit sooner.