Performing a kNN search with very large k may return fewer than k documents.
This is due to the fact that the HNSW graph is not guaranteed to be connected.
This commit documents the behavior as part of a general warning that the results
of a kNN search may be approximate.
Before when creating a KnnVectorsWriter for merging, we consulted the existing
"PER_FIELD_SUFFIX_KEY" attribute to determine the format's per-field suffix.
This isn't correct since we could be using a new codec (that produces different
formats/ suffixes).
This commit modifies TestPerFieldDocValuesFormat#testMergeUsesNewFormat to
trigger the problem. Without the fix we it throws an error like
"java.nio.file.FileAlreadyExistsException: File
"_3_Lucene90HnswVectorsFormat_0.vem" was already written to."
This part of the change would call `ArrayUtil#getUnsignedComparator` on a
length that is rarely 4 or 8. In such cases it's better to use
`Arrays#compareUnsigned`.
When the reader has no live docs, `KnnVectorQuery` can error out. This happens
because `IndexReader#numDocs` is 0, and we end up passing an illegal value of
`k = 0` to the search method.
This commit removes the problematic optimization in `KnnVectorQuery` and
replaces with a lower-level based on the total number of vectors in the segment.
When we load a query into the query cache we always calculate the count
of matching documents. This uses that count to power the new `O(1)`
`Weight#count` method.
This patch adds getPointValues to NumericLeafComparatorsimilar how it
has getNumericDocValues.
Numeric Sort optimization with points relies on the assumption that
points and doc values record the same information, as we substitute
iterator over doc_values with one over points.
If we override getNumericDocValues it almost certainly means that whatever
PointValues NumericComparator is going to look at shouldn't be used to
skip non-competitive documents. Returning null for pointValues in this
case will force comparator NOT to use sort optimization with points,
and continue with a traditional way of iterating over doc values.
LZ4 is interesting because it used to read data in little-endian order even
though Directory APIs were big endian. So most calls to LZ4 in backward-codecs
have been changed to change the endianness of the input/output.
While VectorSimilarityFunction#COSINE is helpful when you need to preserve the
original vectors, it is significantly slower than DOT_PRODUCT. This commit adds
javadocs to COSINE explaining that dot product is the fastest option.