When merging segments together, the `KnnVectorsWriter` creates a `VectorValues`
instance with a merged view of all the segments' vectors. This merged instance
is used when constructing the new HNSW graph. Graph building needs random
access, and the merged VectorValues support this by mapping from merged
ordinals to segments and segment ordinals. This mapping can add significant
overhead when building the graph.
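
To illustrate the kind of per-lookup work this mapping implies, here is a minimal sketch in plain Java. It is not the Lucene implementation: the class `MergedOrdinalMap` and its `resolve` method are hypothetical names, and it assumes each segment contributes a contiguous range of merged ordinals (no deleted documents, no index sorting).

```java
import java.util.Arrays;

/** Hypothetical illustration of a merged-ordinal lookup; not Lucene code. */
public class MergedOrdinalMap {
    // starts[i] is the first merged ordinal that belongs to segment i.
    private final int[] starts;
    private final int total;

    public MergedOrdinalMap(int[] segmentSizes) {
        starts = new int[segmentSizes.length];
        int sum = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = sum;
            sum += segmentSizes[i];
        }
        total = sum;
    }

    /** Returns {segmentIndex, segmentOrdinal} for the given merged ordinal. */
    public int[] resolve(int mergedOrd) {
        if (mergedOrd < 0 || mergedOrd >= total) {
            throw new IndexOutOfBoundsException("ordinal out of range: " + mergedOrd);
        }
        int idx = Arrays.binarySearch(starts, mergedOrd);
        if (idx < 0) {
            // Not an exact segment start: step back from the insertion point.
            idx = -idx - 2;
        }
        // Skip past empty segments that share the same start offset.
        while (idx + 1 < starts.length && starts[idx + 1] <= mergedOrd) {
            idx++;
        }
        return new int[] {idx, mergedOrd - starts[idx]};
    }
}
```

Every random access pays for a binary search like this before the vector can be read, which is the cost that a single flat file of vectors avoids.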
This change updates the HNSW merging logic to first write the combined segment
vectors to a file, then use that file to build the graph. This helps speed
up segment merging, and also lets us simplify `VectorValuesMerger`, which
provides the merged view of vector values.
The sort position parameter in SortField.getComparator() is only ever used
to determine whether or not skipping should be enabled on a given comparator,
so the parameter name should reflect that. This commit also explicitly disables
skipping in a number of cases where it is never used, in particular CheckIndex
and the grouping collectors.
1. Correct the remaining size for input files larger
than Integer.MAX_VALUE, as currently with every
iteration we try to map the next blockSize bytes
even if fewer than blockSize bytes are left in the file.
2. Correct java.lang.ClassCastException when retrieving
KnnGraphValues for stats printing.
3. Add an option for the Euclidean metric.
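
The idea behind the first fix can be sketched as follows. This is an illustrative helper, not the patched code: `ChunkSizes` and `chunkLengths` are hypothetical names. It computes the length of each region that would be passed to a call such as `FileChannel.map`, clamping the last region to the bytes that remain.

```java
/** Hypothetical illustration of clamping the final mapped block; not Lucene code. */
public class ChunkSizes {
    /** Returns the length of each chunk needed to cover fileSize bytes. */
    public static long[] chunkLengths(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        // Number of chunks, rounding up so a partial last chunk is included.
        int n = (int) ((fileSize + blockSize - 1) / blockSize);
        long[] lengths = new long[n];
        long position = 0;
        for (int i = 0; i < n; i++) {
            long remaining = fileSize - position;
            // Never request more bytes than are left in the file.
            lengths[i] = Math.min(blockSize, remaining);
            position += lengths[i];
        }
        return lengths;
    }
}
```

For a file of 5,000,000,000 bytes and a block size of `Integer.MAX_VALUE`, this yields two full blocks followed by a final block of 705,032,706 bytes.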
This test was occasionally failing on CI, as the test randomly installed a merge policy
that would force compound file creation while the goal of the test was not to do so.
Currently, when doing a knn search on a segment where all documents
with a knn field were deleted, we get the following error:
maxSize must be > 0 and < 2147483630; got: 0
java.lang.IllegalArgumentException: maxSize must be > 0 and < 2147483630; got: 0
at __randomizedtesting.SeedInfo.seed([43F1F124D7076A4E:1B860BFCCB9B0BB5]:0)
at org.apache.lucene.util.LongHeap.<init>(LongHeap.java:57)
at org.apache.lucene.util.LongHeap$1.<init>(LongHeap.java:69)
at org.apache.lucene.util.LongHeap.create(LongHeap.java:69)
at org.apache.lucene.util.hnsw.NeighborQueue.<init>(NeighborQueue.java:41)
at org.apache.lucene.util.hnsw.HnswGraph.search(HnswGraph.java:105)
This patch fixes this error and ensures empty TopDocs are returned when
the knn field doesn't have any documents left.
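
The general shape of such a guard can be sketched with the JDK's `PriorityQueue`, which, like the heap in the stack trace above, rejects a capacity of zero. This is only an analogy under stated assumptions, not the actual patch: `EmptySafeTopK` and `topK` are hypothetical names, and a plain array of scores stands in for the vectors of a segment.

```java
import java.util.PriorityQueue;

/** Hypothetical illustration of an empty-input guard; not Lucene code. */
public class EmptySafeTopK {
    /** Returns up to k of the highest scores in descending order. */
    public static float[] topK(float[] scores, int k) {
        int size = Math.min(k, scores.length);
        if (size <= 0) {
            // Nothing to search: return an empty result instead of
            // constructing a heap with an invalid capacity.
            return new float[0];
        }
        // Min-heap holding the best `size` scores seen so far.
        PriorityQueue<Float> heap = new PriorityQueue<>(size);
        for (float s : scores) {
            if (heap.size() < size) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll();
                heap.add(s);
            }
        }
        // Drain from smallest to largest, filling the array back to front.
        float[] out = new float[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll();
        }
        return out;
    }
}
```

Without the early return, `new PriorityQueue<>(0)` would throw an `IllegalArgumentException`, which mirrors the failure mode described above.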
ValueSource.asDoubleValues and asLongValues should not compute the score unless asked to, which is typically never. This fixes a performance regression present since 7.3 (LUCENE-8099), when some older boosting queries were replaced with this.