A couple of the data structures used in HNSW search are pretty large and
expensive to allocate. This commit creates a shared candidates queue and
visited set that are reused across calls to HnswGraph#searchLevel. Now the same
data structures are used for building the entire graph, which can cut down on
allocations during indexing. For graph building it also switches the visited
set to FixedBitSet for better performance.
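
A minimal sketch of the reuse pattern, assuming a hypothetical `SearchScratch` holder (not a Lucene class) and using `java.util.BitSet` as a stand-in for FixedBitSet so the example has no dependencies:

```java
import java.util.BitSet;
import java.util.PriorityQueue;

// Hypothetical scratch holder: allocated once, then reset before every search.
final class SearchScratch {
  private final PriorityQueue<Integer> candidates = new PriorityQueue<>();
  private final BitSet visited;

  SearchScratch(int maxNodes) {
    this.visited = new BitSet(maxNodes);
  }

  // Clears both structures in place instead of allocating new ones.
  void reset() {
    candidates.clear();
    visited.clear();
  }

  // Marks a node as visited and queues it; returns true only the first time.
  boolean visit(int node) {
    if (visited.get(node)) {
      return false;
    }
    visited.set(node);
    candidates.add(node);
    return true;
  }

  int candidateCount() {
    return candidates.size();
  }
}
```

A caller would create one `SearchScratch` per graph build and call `reset()` at the start of each level search.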
IndexSortSortedNumericDocValuesRangeQuery can implement its count method and compute the count through a binary search, the same binary search that is used to execute the query itself, whenever all the required conditions are met.
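
As an illustration of the idea only (this is not the query's actual code, and it does not model the required conditions), counting values in an inclusive range over an ascending-sorted array takes two binary searches. The class and method names here are hypothetical:

```java
// Hypothetical helper: counts values in [lower, upper] over a sorted array.
final class SortedRangeCount {
  // First index whose value is >= target, or values.length if there is none.
  static int lowerBound(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Number of values v with lower <= v <= upper; values must be sorted ascending.
  static int count(long[] values, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    int from = lowerBound(values, lower);
    // Avoid overflow of upper + 1 when the range is open at the top.
    int to = upper == Long.MAX_VALUE ? values.length : lowerBound(values, upper + 1);
    return to - from;
  }
}
```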
TestKnnGraph.testMultipleVectorFields sometimes breaks with
the following message:
java.lang.NullPointerException: Cannot invoke
"org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsReader.getGraphValues(String)"
because "vectorReader" is null
This happens in assertConsistentGraph.
This patch ensures that for a segment and a field where there are no
vectors indexed, we don't run the consistent-graph check.
This patch adds KNN vectors for testing backward compatible indices
- Add a KnnVectorField to documents when creating a new backward
compatible index
- Add knn vectors search and check for vector values to the testing
of search of backward compatible indices
- Add tests for knn vector search when changing backward compatible
indices (merging them and adding new documents to them)
Previously -Xlint:text-blocks and -Xlint:synchronization were enabled
conditionally, if the user had at least java 15 or java 16,
respectively. Enable them always.
Add new options so that the warnings list is fully configured:
* -Xlint:module (new in java 17)
* -Xlint:strictfp (new in java 17)
Disable "path" with -Xlint:-path rather than commenting it out, for
consistency.
Disable "missing-explicit-ctor" (new in java 17), as it is unlikely to
succeed right now.
Alphasort the flags and document how to get the updated list. This makes
the list easy to compare and keep up to date.
Currently the contract on `bound` is that it holds the score of the top of the
`results` priority queue. It means that a candidate is only considered if its
score is better than the bound *or* if fewer than `topK` results have been
accumulated so far. I think it would be simpler if `bound` always held
the minimum score that is required for a candidate to be considered. This would
also be more consistent with how our WAND support works, by trusting
`setMinCompetitiveScore` alone, instead of having to check whether the priority
queue is full as well.
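
A rough sketch of the proposed contract, using a hypothetical `TopKBound` class rather than the real search code: `bound` starts at negative infinity, so every candidate is accepted until `topK` results exist, and the acceptance test never has to look at the queue size.

```java
import java.util.PriorityQueue;

// Hypothetical collector: `bound` alone decides whether a candidate is considered.
final class TopKBound {
  private final int topK; // assumed to be >= 1
  // Min-heap: the worst score currently kept sits on top.
  private final PriorityQueue<Float> results = new PriorityQueue<>();
  private float bound = Float.NEGATIVE_INFINITY;

  TopKBound(int topK) {
    this.topK = topK;
  }

  // Returns true if the candidate was competitive and has been collected.
  boolean offer(float score) {
    if (score <= bound) {
      return false; // not competitive; no queue-size check needed
    }
    results.add(score);
    if (results.size() > topK) {
      results.poll(); // drop the previous worst result
    }
    if (results.size() == topK) {
      bound = results.peek(); // only raised once the queue is full
    }
    return true;
  }

  float bound() {
    return bound;
  }
}
```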
Before PR #608, this test used numSeed (the search queue size) equal to
100 when searching HnswGraph.
This patch restores the search queue size to 100
and takes the top topK results from it.
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.
This PR is based on the following PRs:
#250, #267, #287, #315, #536, #416
Main changes:
- Multi layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre-9.1 versions. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.
TODO: tests for KNN search for graphs built in pre-9.1 versions;
tests for merge of indices of pre-9.1 + current versions.
LUCENE-10384 and PR#615 introduced encoding f into NeighborQueue.
But one function, `nodes()`, was left without this encoding.
Also modify the test that would fail without this patch.
When only the number of matching documents is needed, IndexSearcher#search(Query, Collector) is commonly used, which does not use the executor that may have been set on the searcher. Calling `IndexSearcher#count(Query)` makes the code more concise and is also more correct, as it honours the executor that has been set on the searcher instance.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
In a previous commit, we updated HNSW merge to first write the combined segment
vectors to a file, then use that file to build the graph. This commit applies
the same strategy to flush, which lets us use the same logic for flush and
merge.
When merging segments together, the `KnnVectorsWriter` creates a `VectorValues`
instance with a merged view of all the segments' vectors. This merged instance
is used when constructing the new HNSW graph. Graph building needs random
access, and the merged VectorValues support this by mapping from merged
ordinals to segments and segment ordinals. This mapping can add significant
overhead when building the graph.
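
To illustrate the kind of per-lookup indirection described above, here is a simplified sketch with a hypothetical `MergedOrdinals` class. It is not Lucene's implementation and ignores deleted documents. Every random access has to search for the owning segment before it can read a vector:

```java
// Hypothetical mapping from a merged ordinal to (segment, ordinal within segment).
final class MergedOrdinals {
  private final int[] starts; // starts[i] = first merged ordinal of segment i

  MergedOrdinals(int[] segmentSizes) {
    starts = new int[segmentSizes.length];
    int next = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      starts[i] = next;
      next += segmentSizes[i];
    }
  }

  // Binary search for the last segment whose start is <= mergedOrd.
  int segmentOf(int mergedOrd) {
    int lo = 0;
    int hi = starts.length - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (starts[mid] <= mergedOrd) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  // Ordinal of the vector inside its own segment.
  int segmentOrd(int mergedOrd) {
    return mergedOrd - starts[segmentOf(mergedOrd)];
  }
}
```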
This change updates the HNSW merging logic to first write the combined segment
vectors to a file, then use that file to build the graph. This helps speed
up segment merging, and also lets us simplify `VectorValuesMerger`, which
provides the merged view of vector values.
The sort position parameter in SortField.getComparator() is only ever used
to determine whether or not skipping should be enabled on a given comparator,
so the parameter name should reflect that. This commit also explicitly disables
skipping in a number of cases where it is never used, in particular CheckIndex
and the grouping collectors.
1. Correct the remaining size for input files larger
than Integer.MAX_VALUE, as currently with every
iteration we try to map the next blockSize of bytes
even if fewer than blockSize bytes are left in the file.
2. Correct java.lang.ClassCastException when retrieving
KnnGraphValues for stats printing.
3. Add an option for euclidean metric
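
The fix in item 1 amounts to capping each mapped region at the bytes that remain. A minimal sketch, with a hypothetical `BlockPlan` helper that only computes the region sizes and does no actual file mapping:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sizes of consecutive regions covering a file of fileSize bytes.
final class BlockPlan {
  // blockSize is assumed to be positive.
  static List<Long> blockSizes(long fileSize, long blockSize) {
    List<Long> sizes = new ArrayList<>();
    long position = 0;
    while (position < fileSize) {
      long remaining = fileSize - position;
      // Never request more than what is left in the file.
      long size = Math.min(blockSize, remaining);
      sizes.add(size);
      position += size;
    }
    return sizes;
  }
}
```

For a 5,000,000,000-byte file and a block size of Integer.MAX_VALUE, this yields two full blocks followed by a final block of 705,032,706 bytes.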