Exposing FeatureField#value is useful for deduplication. If you have a sparse vector model and need to handle multiple inputs from it, you want flexibility in how you de-duplicate the feature dimensions.
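As an illustration of the kind of flexibility meant here, the following is a minimal sketch, not Lucene code. It assumes the feature names and values have already been read out into plain (name, value) pairs; the class `SparseFeatureDedup` and its `dedupe` method are hypothetical names introduced only for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical helper, not part of Lucene: collapses repeated feature
// dimensions using a caller-supplied rule.
final class SparseFeatureDedup {
  static Map<String, Float> dedupe(
      List<Map.Entry<String, Float>> features, BinaryOperator<Float> rule) {
    // LinkedHashMap keeps the first-seen order of feature names.
    Map<String, Float> merged = new LinkedHashMap<>();
    for (Map.Entry<String, Float> f : features) {
      // First occurrence is stored as-is; later ones are combined via the rule.
      merged.merge(f.getKey(), f.getValue(), rule);
    }
    return merged;
  }
}
```

Passing `Math::max` keeps the largest weight seen for a dimension, while `Float::sum` accumulates the weights; which rule is appropriate depends on the model.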
The original tests assume particular document orders & scores. To make the tests more resilient to random flushes & merges, I adjusted the assertion conditions. In particular, we verify that each matching id maps to the expected score instead of relying on docIds.
closes: https://github.com/apache/lucene/issues/13057
In #13046, several changes broke the ability of the
addBackcompatIndexes.py script to properly add and test the unreleased
version. This updates the script so that it again adds the new version
properly.
Closes #13094
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
This enables the optional parent field in BWC tests from 9.10 on.
This will need to be forward-ported to the main branch, where the
parent field is required for these tests since they add document blocks
during tests.
The initial release of scalar quantization would periodically create a humongous allocation, which can put unwarranted pressure on the GC & on the heap usage as a whole.
This commit fixes this by allocating only a float array of `20 * dimensions` elements, computing quantiles for each such batch of vectors, and averaging the resulting quantiles.
Why does this work?
- Quantiles based on confidence intervals are (generally) unbiased and doing an average gives statistically good results
- The selector algorithm scales linearly, so the cost is just about the same
- We need to process more than `1` vector at a time to prevent extreme confidence intervals from interacting strangely with edge cases
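The batching idea above can be sketched as follows. This is a simplified illustration, not the code from the commit: the class and method names are hypothetical, the confidence interval is assumed to lie in (0, 1], and `Arrays.sort` stands in for the linear-time selection algorithm mentioned above to keep the example short.

```java
import java.util.Arrays;

// Hypothetical sketch: estimate lower/upper quantiles from small batches
// and average them, instead of copying every vector into one huge array.
final class ChunkedQuantiles {
  static final int CHUNK_VECTORS = 20;

  /** Returns {lower, upper}, averaged over batches of up to 20 vectors. */
  static float[] averagedQuantiles(float[][] vectors, float confidenceInterval) {
    if (vectors.length == 0) {
      throw new IllegalArgumentException("no vectors");
    }
    int dims = vectors[0].length;
    // The only sizeable allocation: 20 * dimensions floats, reused per batch.
    float[] scratch = new float[CHUNK_VECTORS * dims];
    double lowerSum = 0;
    double upperSum = 0;
    int chunks = 0;
    for (int start = 0; start < vectors.length; start += CHUNK_VECTORS) {
      int end = Math.min(start + CHUNK_VECTORS, vectors.length);
      int len = 0;
      for (int i = start; i < end; i++) {
        System.arraycopy(vectors[i], 0, scratch, len, dims);
        len += dims;
      }
      // Sorting is a simplification; a selection algorithm would be linear.
      Arrays.sort(scratch, 0, len);
      // Number of values trimmed from each end of the sorted batch.
      int tail = (int) (len * (1.0 - confidenceInterval) / 2.0);
      lowerSum += scratch[tail];
      upperSum += scratch[len - 1 - tail];
      chunks++;
    }
    return new float[] {(float) (lowerSum / chunks), (float) (upperSum / chunks)};
  }
}
```

With a confidence interval of `1.0f` nothing is trimmed, so the result is simply the average of the per-batch minima and maxima.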
I noticed while experimenting with brute-force search that our visitation limit is EXACTLY the number of filtered docs to hit. Consequently, if the approximate search ends up visiting exactly that number of vectors, which is effectively a brute-force search, we still fall back and run brute-force search a second time. This struck me as weird.
This commit adjusts the visit limit threshold for approximate search to account for this.
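The off-by-one can be illustrated with a toy model. The sketch below is not the Lucene implementation: it assumes that approximate search is treated as terminated early once the visited count reaches the limit, and that the adjustment is to raise the limit by one. The class and method names are hypothetical.

```java
// Hypothetical toy model of the fallback decision, not Lucene code.
final class VisitLimitSketch {
  /** Assumed rule: the search counts as cut short once `visited` reaches the limit. */
  static boolean earlyTerminated(int visited, int visitLimit) {
    return visited >= visitLimit;
  }

  /** Returns true if a second, exact pass over the filtered docs would run. */
  static boolean fallsBackToExact(int visited, int filteredDocs, boolean adjusted) {
    // Assumed adjustment: allow one more visit than there are filtered docs.
    int visitLimit = adjusted ? filteredDocs + 1 : filteredDocs;
    return earlyTerminated(visited, visitLimit);
  }
}
```

Under these assumptions, visiting all 100 of 100 filtered docs triggers a redundant exact pass with the old limit, but not with the adjusted one.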
Speed up concurrent multi-segment HNSW graph search by exchanging
the global top candidates collected so far across segments. These global top
candidates set the minimum threshold that new candidates need to pass
to be considered. This allows earlier stopping for segments that don't have
good candidates.
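The threshold-sharing idea can be sketched with a small shared structure. This is a simplified illustration under stated assumptions, not the collector used in Lucene: `GlobalTopK` is a hypothetical class, higher scores are assumed to be better, and coarse `synchronized` methods stand in for whatever synchronization the real code uses.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: segments share their best scores so that each one
// can skip candidates that cannot enter the global top k.
final class GlobalTopK {
  private final int k;
  // Min-heap: once full, the head is the k-th best score seen so far.
  private final PriorityQueue<Float> heap = new PriorityQueue<>();

  GlobalTopK(int k) {
    this.k = k;
  }

  /** Offers a score found in any segment; only the k best are kept. */
  synchronized void offer(float score) {
    if (heap.size() < k) {
      heap.add(score);
    } else if (score > heap.peek()) {
      heap.poll();
      heap.add(score);
    }
  }

  /** Score a new candidate must exceed; negative infinity until k scores are known. */
  synchronized float minCompetitiveScore() {
    return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
  }
}
```

A per-segment search loop would compare each candidate's score against `minCompetitiveScore()` and could stop exploring once its remaining candidates fall below it.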