Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
This commit refactors the usage of the deprecated System::runFinalization in tests and benchmarks, so that narrowly targeted suppressions can be added.
Currently the HNSW codec does too many things, it not only indexes vectors, but stores them and determines how to store them given the vector type.
This PR extracts out the vector storage into a new format `Lucene99FlatVectorsFormat` and adds new base class called `FlatVectorsFormat`. This allows for some additional helper functions that allow an indexing codec (like HNSW) take advantage of the flat formats.
Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat` and a new `Lucene99HnswScalarQuantizedVectorsFormat` that uses `Lucene99ScalarQuantizedVectorsFormat`
* Skip document by docValues
*When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and bottom value is 5, we will use a range on [MIN_VALUE, 4].
---------
Co-authored-by: Adrien Grand <jpountz@gmail.com>
### Description
When using cosine similarity, the `ScalarQuantizer` normalizes vectors when calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does not normalize the vectors prior to quantizing them when producing the quantized vectors to write to disk. This PR normalizes vectors prior to quantizing them when writing them to disk.
Recall results on my M1 with the `glove-100-angular` data set (all using `maxConn`: 16, `beamWidth` 100, `numCandidates`: 100, `k`: 10, single segment):
| Configuration | Recall | Average Query Duration |
|---------------|-------|-----------------|
| Pre-patch no quantization | 0.78762 | 0.68 ms |
| Pre-patch with quantization | 8.999999999999717E-5 | 0.45 ms |
| Post-patch no quantization | 0.78762 | 0.70 ms |
| Post-patch with quantization | 0.66742 | 0.66 ms |
Clean-up from adding the Lucene99PostingsFormat in https://github.com/apache/lucene/pull/12741
These test cases were moved to Lucene99 dir and I forgot to copy the unmodified versions for the backward_codecs.lucene90
Both testEuclidean and testExplain have vectors that result
in equal scores. Since we no longer tie break on vector ordinal
as it doesn't make sense when building the graph, the vectors returned
might be slightly different. This commit fixes the flaky nature of the
test.
I noticed while testing lower dimensionality and quantization, we would explore the HNSW graph way too much. I was stuck figuring out why until I noticed the searcher checks for distance equality (not just if the distance is better) when exploring neighbors-of-neighbors. This seems like a bad heuristic, but to double check I looked at what nmslib does. This pointed me back to this commit: nmslib/nmslib#106
Seems like this performance hitch was discovered awhile ago :).
This commit adjusts HNSW to only explore the graph layer if the distance is actually better.
* Change Postings back to using FOR in Lucene99PostingsFormat
We are still keeping PFOR for positions only.
This is a partial revert of https://github.com/apache/lucene/pull/69 which brings back ForDeltaUtil.
* fix merge commit
* Add forgotten forDeltaUtil calls to reader
* Addressing comments: adding Lucene90RWPostingsFormat + more
Also:
* Change to Changes.txt
* Removal of dead code which was only used in unit tests
* Removal of test code from PForUtil
* Changes.txt edit in right place now
* Apply suggestions from code review: `90 -> 99 refactoring`
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Remove decodeTo32 from ForUtil and regenerate
---------
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
* Clean up ByteBlockPool
1. Move slice functionality to a separate class, ByteSlicePool.
2. Add exception case if the requested slice size is larger than the
block size.
3. Make pool buffers private instead of the comment asking not to modify it.
4. Consolidate setBytesRef methods with int offsets.
5. Simplify ramBytesUsed.
6. Update and expand comments and javadoc.
* Revert to long offsets in ByteBlockPool; Clean-up
* Remove slice functionality from TermsHashPerField
* First working test
* [Broken] Test interleaving slices
* Fixed randomized test
* Remove redundant tests
* Tidy
* Add CHANGES
* Move ByteSlicePool to oal.index