TestKnnVectorQuery#testDeletes assumes that if there are n total documents, we
can perform a kNN search with k=n and retrieve all documents. This isn't true
with our implementation -- due to randomization we may select fewer than n entry
points and never visit some vectors.
Currently HNSW has only a single layer.
This is the first step toward making it multi-layered.
To keep changes small, this PR only adds
multiple layers to the HnswGraph class.
TODO for following PRs:
- modify graph construction and search algorithm for a hierarchical
graph.
- modify Lucene90HnswVectorsWriter and Lucene90HnswVectorsReader to
write and read multiple layers.
This PR extends VectorReader#search to take a parameter specifying the live
docs. LeafReader#searchNearestVectors then always returns the k nearest
undeleted docs.
To implement this, the HNSW algorithm will only add a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.
In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively.
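As a rough illustration of the check described above (a sketch only; the class and method names here are hypothetical and the real logic lives in the query cache), the decision compares the clause's estimated cost against the lead iterator's cost scaled by the skip factor:

```java
// Hypothetical helper illustrating the skip-caching decision; not Lucene's API.
public class CachingSkipPolicy {

  /**
   * Returns true if the clause should NOT be cached because its estimated cost
   * is more than skipCacheFactor times the lead iterator's cost.
   */
  static boolean skipCaching(long clauseCost, long leadCost, double skipCacheFactor) {
    return clauseCost / skipCacheFactor > leadCost;
  }
}
```

For example, with a clause cost of 5000 and a lead cost of 100, a factor of 250 still allows caching (5000 / 250 = 20, which is below 100), while a factor of 10 skips it (5000 / 10 = 500, which is above 100).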
6 main improvements:
1) Iterate through all output.InputNodes since dest gaps can exist.
2) freeBefore the minimum input node instead of the first input node (which was usually, but not always, the minimum).
3) Don't freeBefore from a hole source node. Bookkeeping may not be correct and could result in an early free.
4) When adding an output node after hole recovery, calculate its new position increment instead of adding it to the end of the output graph.
5) Nodes after holes that have edges to their source will do the output re-mapping that the deleted node would have done.
6) If a disconnected input node swaps order with another node in the output, then map them to the same output node.
Co-authored-by: Lawson <geoffrl@amazon.com>
Provide leaf sorter for directory readers opened from IndexCommit
LUCENE-9507 made it possible to provide a leaf sorter for directory readers.
One API that was missed is the one that opens a directory reader
from an index commit.
This patch addresses this by adding an extra parameter, a custom
comparator for sorting leaf readers, to the DirectoryReader open API
that takes an indexCommit and minSupportedMajorVersion.
Relates to PR #32
the last index commit
If the Lucene version was < 9 then use a StringField; otherwise,
if the index is fresh or was built using a
version >= 9, use a BDV field.
BDV field with a different name
Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).
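The selection rule can be sketched like this. The class, enum and method names are hypothetical and only illustrate the decision; the two field names are the ones mentioned above.

```java
// Hypothetical sketch of the field choice; not the actual writer code.
public class FullPathFieldChoice {

  enum FieldKind { STRING_FIELD, BINARY_DOC_VALUES }

  /** A fresh index, or one built with version >= 9, uses the BDV field. */
  static FieldKind fieldKind(boolean freshIndex, int createdVersionMajor) {
    if (freshIndex || createdVersionMajor >= 9) {
      return FieldKind.BINARY_DOC_VALUES;
    }
    return FieldKind.STRING_FIELD;
  }

  /** Distinct names keep the two field types from clashing on one field. */
  static String fieldName(FieldKind kind) {
    return kind == FieldKind.BINARY_DOC_VALUES ? "$full_path_binary$" : "$full_path$";
  }
}
```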
This commit also enables the back-compat check that was disabled
earlier.
When sorting by low-cardinality fields, the same sub remains current for long
sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting
the sub that leads iteration.
When there's only one field, CombinedFieldQuery will ignore its weight while
scoring. This makes the scoring inconsistent, since the field weight is supposed
to multiply its term frequency.
This PR removes the optimizations around single-field scoring to make sure the
weight is always taken into account. These optimizations are not critical since
it should be uncommon to use CombinedFieldQuery with only one field.
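To make the inconsistency concrete, here is a small sketch of a weighted combined frequency. It assumes the combined frequency is the sum of each field's term frequency multiplied by its weight; the class and method names are hypothetical and this is not the CombinedFieldQuery implementation.

```java
// Hypothetical illustration of weight-scaled term frequencies across fields.
public class CombinedFieldFrequency {

  /** Sums weight[i] * termFreq[i] over all fields, including the one-field case. */
  static float combinedFrequency(float[] weights, int[] termFreqs) {
    float sum = 0f;
    for (int i = 0; i < weights.length; i++) {
      sum += weights[i] * termFreqs[i];
    }
    return sum;
  }
}
```

With a single field of weight 3 and a term frequency of 2, this gives 6 rather than the unweighted 2 that a single-field shortcut would score with.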