This commit adds a new getDefaultStopwords() static method to
UkrainianMorfologikAnalyzer, which makes it possible to create an
analyzer with the default stop word set but a custom stem exclusion
set.
DocumentWriter#anyChanges() can return false after we process an update
operation and generate its sequence number, but before we adjust
numDocsInRAM. In this window of time, refreshes are no-ops even though
the maxCompletedSequenceNumber has advanced.
This PR proposes some renames to clarify the code structure. The top-level
`KnnGraphValues` is renamed to `HnswGraph`, since it now represents a
hierarchical graph. It's also moved from `org.apache.lucene.index` to the
`hnsw` package.
Other renames:
* The old `HnswGraph` -> `OnHeapHnswGraph`
* `IndexedKnnGraphValues` -> `OffHeapHnswGraph` (to match
`OffHeapVectorValues`)
ContainedByIntervalIterator and OverlappingIntervalIterator set their 'is the filter
interval exhausted' flag to `false` once the filter has returned NO_MORE_POSITIONS on
a document, so that subsequent calls to `startPosition()` will also return
NO_MORE_POSITIONS. ContainingIntervalIterator does not do this, and so it can
incorrectly report matches, for example when used in a disjunction. This commit
fixes that omission.
A couple of the data structures used in HNSW search are pretty large and
expensive to allocate. This commit creates a shared candidates queue and
visited set that are reused across calls to HnswGraph#searchLevel. Now the same
data structures are used for building the entire graph, which can cut down on
allocations during indexing. For graph building it also switches the visited
set to FixedBitSet for better performance.
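The reuse pattern described above can be sketched as follows. This is a
simplified breadth-first traversal, not the actual HnswGraph#searchLevel
logic: the class name `ReusableGraphSearcher` and the method `reachableFrom`
are hypothetical, and java.util.BitSet stands in for FixedBitSet. The point
is only that the queue and the visited set are allocated once and cleared
between calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical searcher whose scratch structures are allocated once and reused. */
final class ReusableGraphSearcher {
  // Shared across calls; a stand-in for the shared candidates queue.
  private final ArrayDeque<Integer> candidates = new ArrayDeque<>();
  // Shared across calls; java.util.BitSet stands in for Lucene's FixedBitSet.
  private final BitSet visited;

  ReusableGraphSearcher(int maxNodes) {
    this.visited = new BitSet(maxNodes);
  }

  /** Returns all nodes reachable from entryPoint; neighbors[i] lists the neighbors of node i. */
  List<Integer> reachableFrom(int entryPoint, int[][] neighbors) {
    // Reset the shared structures instead of allocating new ones.
    candidates.clear();
    visited.clear();

    List<Integer> result = new ArrayList<>();
    candidates.add(entryPoint);
    visited.set(entryPoint);
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      result.add(node);
      for (int neighbor : neighbors[node]) {
        if (!visited.get(neighbor)) {
          visited.set(neighbor);
          candidates.add(neighbor);
        }
      }
    }
    return result;
  }
}
```

A single instance can then serve every insertion while a graph is built,
so the per-call cost is a clear() rather than a fresh allocation.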
IndexSortSortedNumericDocValuesRangeQuery can implement its count method and compute the count through a binary search, the same binary search that is used to execute the query itself, whenever all the required conditions are met.
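As a rough illustration of counting through binary search, the sketch below
counts the values that fall in an inclusive range of an ascending-sorted
array. It assumes one value per document and an in-memory array, which is
not how the query reads doc values; `SortedRangeCounter` and its methods are
hypothetical names.

```java
/** Hypothetical helper: counts values in [lower, upper] with two binary searches. */
final class SortedRangeCounter {
  /** Index of the first element >= target, or values.length if there is none. */
  static int firstAtLeast(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Index of the first element > target, or values.length if there is none. */
  static int firstGreaterThan(long[] values, long target) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] <= target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Number of elements v with lower <= v <= upper; the array must be sorted ascending. */
  static int count(long[] sortedValues, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    return firstGreaterThan(sortedValues, upper) - firstAtLeast(sortedValues, lower);
  }
}
```

The two boundaries found here are the same ones a range query over sorted
values would iterate between, so the count costs O(log n) instead of a scan.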
TestKnnGraph.testMultipleVectorFields sometimes breaks with
the following message:
java.lang.NullPointerException: Cannot invoke
"org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsReader.getGraphValues(String)"
because "vectorReader" is null
This happens in assertConsistentGraph.
This patch ensures that for a segment and a field where no vectors
are indexed, we skip the graph consistency check.
This patch adds KNN vectors to the tests for backward compatible indices:
- Add a KnnVectorField to documents when creating a new backward
compatible index
- Add KNN vector search and checks on vector values to the search
tests for backward compatible indices
- Add tests for KNN vector search when changing backward compatible
indices (merging them and adding new documents to them)
Previously -Xlint:text-blocks and -Xlint:synchronization were enabled
conditionally, if the user had at least java 15 or java 16,
respectively. Enable them always.
Add new options so that the warnings list is fully configured:
* -Xlint:module (new in java 17)
* -Xlint:strictfp (new in java 17)
Disable "path" with -Xlint:-path rather than commenting it out, for
consistency.
Disable "missing-explicit-ctor" (new in java 17), as it is unlikely to
succeed right now.
Sort the flags alphabetically and document how to get the updated list;
this makes it easy to compare and keep up to date.
Currently the contract on `bound` is that it holds the score of the top of the
`results` priority queue. It means that a candidate is only considered if its
score is better than the bound *or* if fewer than `topK` results have been
accumulated so far. I think it would be simpler if `bound` always held
the minimum score that is required for a candidate to be considered. This would
also be more consistent with how our WAND support works, by trusting
`setMinCompetitiveScore` alone, instead of having to check whether the priority
queue is full as well.
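A minimal sketch of the proposed contract, under the assumption that higher
scores are better: `bound` starts at negative infinity and is only raised
once `topK` results are held, so a single comparison decides whether a
candidate is considered. `TopKCollector` and `offer` are hypothetical names,
not the classes touched by this change.

```java
import java.util.PriorityQueue;

/** Hypothetical collector where `bound` alone decides whether a candidate is considered. */
final class TopKCollector {
  private final int topK;
  // Min-heap: the head is the worst score currently kept.
  private final PriorityQueue<Float> results = new PriorityQueue<>();
  // Minimum score a candidate must exceed; stays at -infinity until the queue is full.
  private float bound = Float.NEGATIVE_INFINITY;

  TopKCollector(int topK) {
    if (topK < 1) {
      throw new IllegalArgumentException("topK must be >= 1");
    }
    this.topK = topK;
  }

  /** Returns true if the candidate was accepted into the results. */
  boolean offer(float score) {
    if (score <= bound) {
      return false; // no separate "is the queue full?" check is needed
    }
    results.add(score);
    if (results.size() > topK) {
      results.poll(); // evict the worst kept score
    }
    if (results.size() == topK) {
      bound = results.peek(); // raise the bar only once topK results are held
    }
    return true;
  }

  float bound() {
    return bound;
  }
}
```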
Before PR #608, this test used numSeed (the search queue size) equal
to 100 when searching HnswGraph.
This patch restores the search queue size to 100
and takes the top topK results from it.
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.
This PR is based on the following PRs:
#250, #267, #287, #315, #536, #416
Main changes:
- Multi layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre-9.1 versions. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.
TODO: tests for KNN search for graphs built in pre-9.1 versions;
tests for merge of indices of pre-9.1 + current versions.
LUCENE-10384 and PR#615 introduced encoding f into NeighborQueue.
However, one method, `nodes()`, was not updated to account for this encoding.
This patch fixes it and modifies the test so that it would fail without the fix.
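To illustrate why every reader of the queue has to decode, here is a sketch
that assumes the encoding packs a float score and an int node id into one
long, with the score in the high bits so that comparing longs compares
scores first. The exact bit layout used by NeighborQueue is not stated in
this message; `PackedNeighbor` and all of its methods are hypothetical.

```java
/** Hypothetical packing of (score, node) into a long; the layout is an assumption. */
final class PackedNeighbor {
  /** Maps a float to an int whose signed ordering matches the float ordering. */
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  /** Inverse of floatToSortableInt (the bit flip is its own inverse). */
  static float sortableIntToFloat(int sortable) {
    return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
  }

  /** Score goes in the high 32 bits, node id in the low 32 bits. */
  static long encode(int node, float score) {
    return ((long) floatToSortableInt(score) << 32) | (node & 0xFFFFFFFFL);
  }

  static int decodeNode(long packed) {
    return (int) packed; // low 32 bits
  }

  static float decodeScore(long packed) {
    return sortableIntToFloat((int) (packed >> 32)); // high 32 bits
  }

  /** A nodes()-style accessor must decode each entry rather than return raw heap values. */
  static int[] nodes(long[] packedHeap) {
    int[] nodes = new int[packedHeap.length];
    for (int i = 0; i < packedHeap.length; i++) {
      nodes[i] = decodeNode(packedHeap[i]);
    }
    return nodes;
  }
}
```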
When only the number of matching documents is needed, IndexSearcher#search(Query, Collector) is commonly used, which does not use the executor that may have been set on the searcher. Calling `IndexSearcher#count(Query)` makes the code more concise and is also more correct, as it honours the executor that has been set on the searcher instance.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
In a previous commit, we updated HNSW merge to first write the combined segment
vectors to a file, then use that file to build the graph. This commit applies
the same strategy to flush, which lets us use the same logic for flush and
merge.