This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.
Co-authored-by: Joel Bernstein <jbernste@apache.org>
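The fallback described above can be sketched without any Lucene dependency. This is a minimal sketch, not the PR's code: a single-layer greedy graph search stands in for HNSW, and every name in it (`FilteredKnnSketch`, `graphSearch`, `exactSearch`, `visitedLimit`, the `neighbors` adjacency array) is a hypothetical chosen for illustration.

```java
import java.util.BitSet;
import java.util.PriorityQueue;

// Illustrative sketch only: a simplified graph search stands in for HNSW.
final class FilteredKnnSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Drains a max-heap (worst first) into an array ordered nearest first. */
  static int[] drain(PriorityQueue<Integer> worstFirst) {
    int[] out = new int[worstFirst.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = worstFirst.poll();
    }
    return out;
  }

  /** Greedy graph search; returns null once more than visitedLimit nodes were visited. */
  static int[] graphSearch(float[][] vectors, int[][] neighbors, float[] query, int k,
                           BitSet acceptDocs, int entryPoint, int visitedLimit) {
    float[] dist = new float[vectors.length];
    BitSet visited = new BitSet(vectors.length);
    PriorityQueue<Integer> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(dist[a], dist[b]));
    PriorityQueue<Integer> results =
        new PriorityQueue<>((a, b) -> Float.compare(dist[b], dist[a]));
    int numVisited = 0;
    int[] frontier = {entryPoint};
    while (true) {
      for (int node : frontier) {
        if (visited.get(node)) continue;
        visited.set(node);
        if (++numVisited > visitedLimit) return null; // give up; caller runs exact search
        dist[node] = squaredDistance(vectors[node], query);
        candidates.add(node);
        if (acceptDocs.get(node)) { // filtered-out docs are traversed but never collected
          results.add(node);
          if (results.size() > k) results.poll();
        }
      }
      if (candidates.isEmpty()) break;
      int closest = candidates.poll();
      if (results.size() == k && dist[closest] > dist[results.peek()]) break;
      frontier = neighbors[closest];
    }
    return drain(results);
  }

  /** Brute-force scan over only the documents that match the filter. */
  static int[] exactSearch(float[][] vectors, float[] query, int k, BitSet acceptDocs) {
    float[] dist = new float[vectors.length];
    PriorityQueue<Integer> results =
        new PriorityQueue<>((a, b) -> Float.compare(dist[b], dist[a]));
    for (int doc = acceptDocs.nextSetBit(0); doc >= 0; doc = acceptDocs.nextSetBit(doc + 1)) {
      dist[doc] = squaredDistance(vectors[doc], query);
      results.add(doc);
      if (results.size() > k) results.poll();
    }
    return drain(results);
  }

  /** Tries the graph search first, capped at the number of filter matches. */
  static int[] search(float[][] vectors, int[][] neighbors, float[] query, int k,
                      BitSet acceptDocs, int entryPoint) {
    int visitedLimit = acceptDocs.cardinality();
    int[] approximate =
        graphSearch(vectors, neighbors, query, k, acceptDocs, entryPoint, visitedLimit);
    return approximate != null
        ? approximate
        : exactSearch(vectors, query, k, acceptDocs);
  }
}
```

In this sketch the graph search visits at most `visitedLimit` nodes before giving up and the exact search scores exactly that many documents, which is where the rough 2x bound comes from.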
Since all documents are required to use the same features (LUCENE-9334) we can
rewrite DocValuesFieldExistsQuery to a MatchAllDocsQuery whenever terms or
points have a docCount that is equal to maxDoc.
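The rewrite condition can be illustrated with a minimal sketch. `LeafStats` is a hypothetical stand-in for the per-segment statistics (in Lucene these would come from the segment's terms or points), and a value of `-1` is an assumed convention meaning the field has no such structure in that segment.

```java
import java.util.List;

// Illustrative sketch only; LeafStats is a hypothetical holder, not a Lucene class.
final class FieldExistsRewriteSketch {

  static final class LeafStats {
    final int maxDoc;
    final int termsDocCount;  // -1 if the field has no terms in this segment (assumed convention)
    final int pointsDocCount; // -1 if the field has no points in this segment (assumed convention)

    LeafStats(int maxDoc, int termsDocCount, int pointsDocCount) {
      this.maxDoc = maxDoc;
      this.termsDocCount = termsDocCount;
      this.pointsDocCount = pointsDocCount;
    }
  }

  /** True if, in every segment, terms or points show that all docs have the field. */
  static boolean canRewriteToMatchAll(List<LeafStats> leaves) {
    for (LeafStats leaf : leaves) {
      boolean allDocsHaveField =
          leaf.termsDocCount == leaf.maxDoc || leaf.pointsDocCount == leaf.maxDoc;
      if (!allDocsHaveField) {
        return false; // at least one segment cannot prove the field is on every doc
      }
    }
    return true;
  }
}
```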
Better encoding of doc IDs in Lucene91HnswVectorsFormat
for the dense case where all docs have vectors.
Currently we write the doc IDs of all documents that have vectors
not very efficiently.
This improves the encoding for the case when all documents
have vectors: we don't write document IDs at all, but just write a
single short value, a dense marker.
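A minimal sketch of the idea follows. The marker values and the sparse layout (a count followed by fixed-width ints) are assumptions made for illustration and are not the actual Lucene91HnswVectorsFormat layout; the sketch also assumes the doc IDs are distinct and lie in `[0, maxDoc)`.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; the on-disk layout here is hypothetical.
final class DenseMarkerSketch {
  static final short DENSE = -1; // hypothetical marker: every doc has a vector
  static final short SPARSE = 0; // hypothetical marker: explicit doc IDs follow

  static byte[] encode(int[] docsWithVectors, int maxDoc) {
    if (docsWithVectors.length == maxDoc) {
      // Dense case: the doc IDs are implied (0..maxDoc-1), so only the marker is written.
      return ByteBuffer.allocate(Short.BYTES).putShort(DENSE).array();
    }
    ByteBuffer buf =
        ByteBuffer.allocate(Short.BYTES + Integer.BYTES * (1 + docsWithVectors.length));
    buf.putShort(SPARSE).putInt(docsWithVectors.length);
    for (int doc : docsWithVectors) {
      buf.putInt(doc);
    }
    return buf.array();
  }

  static int[] decode(byte[] encoded, int maxDoc) {
    ByteBuffer buf = ByteBuffer.wrap(encoded);
    if (buf.getShort() == DENSE) {
      int[] docs = new int[maxDoc];
      for (int i = 0; i < maxDoc; i++) {
        docs[i] = i; // reconstruct the implied IDs
      }
      return docs;
    }
    int[] docs = new int[buf.getInt()];
    for (int i = 0; i < docs.length; i++) {
      docs[i] = buf.getInt();
    }
    return docs;
  }
}
```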
Implement Weight#count for PointRangeQuery to provide a faster way to calculate
the number of matching range docs when each doc has at most one point and the
points are 1-dimensional.
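One way such a count can avoid visiting every value is sketched below. This is an assumption about the general technique, not the actual PointRangeQuery code: sorted, non-empty `long[]` blocks stand in for leaf cells of the points index, and the class and method names are hypothetical. Because each doc has at most one point, the number of matching points equals the number of matching docs.

```java
// Illustrative sketch only; blocks stand in for leaf cells of a 1D points index.
final class PointRangeCountSketch {

  /** Counts values in [lower, upper]; each block must be sorted and non-empty. */
  static int count(long[][] blocks, long lower, long upper) {
    int count = 0;
    for (long[] block : blocks) {
      long min = block[0];
      long max = block[block.length - 1];
      if (max < lower || min > upper) {
        continue; // cell is entirely outside the range
      }
      if (min >= lower && max <= upper) {
        count += block.length; // cell is entirely inside: add its size, skip its values
        continue;
      }
      for (long value : block) { // cell crosses a range boundary: check each value
        if (value >= lower && value <= upper) {
          count++;
        }
      }
    }
    return count;
  }
}
```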
The recently introduced testCount verifies that the Weight#count optimization kicks in. When the SimpleText codec is used, `DocValues#unwrapSingleton` returns null, which disables the optimization and makes the test fail.
This commit adds a new getDefaultStopwords() static method to
UkrainianMorfologikAnalyzer, which makes it possible to create an
analyzer with the default stop word set but a custom stem exclusion
set.
DocumentsWriter#anyChanges() can return false after we process and
generate a sequence number for an update operation, but before we adjust
the numDocsInRAM. In this window of time, refreshes are no-ops, although
the maxCompletedSequenceNumber has advanced.
This PR proposes some renames to clarify the code structure. The top-level
`KnnGraphValues` is renamed to `HnswGraph`, since it now represents a
hierarchical graph. It's also moved from `org.apache.lucene.index` to the
`hnsw` package.
Other renames:
* The old `HnswGraph` -> `OnHeapHnswGraph`
* `IndexedKnnGraphValues` -> `OffHeapHnswGraph` (to match
`OffHeapVectorValues`)
ContainedByIntervalIterator and OverlappingIntervalIterator set their 'is the filter
interval exhausted' flag to `false` once the filter has returned NO_MORE_POSITIONS on
a document, so that subsequent calls to `startPosition()` will also return
NO_MORE_POSITIONS. ContainingIntervalIterator omits to do this, and so it can
incorrectly report matches, for example when used in a disjunction. This commit
fixes that omission.
A couple of the data structures used in HNSW search are pretty large and
expensive to allocate. This commit creates a shared candidates queue and
visited set that are reused across calls to HnswGraph#searchLevel. Now the same
data structures are used for building the entire graph, which can cut down on
allocations during indexing. For graph building it also switches the visited
set to FixedBitSet for better performance.
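The reuse pattern can be sketched as follows. This is a minimal illustration under stated assumptions: `ReusableGraphSearcher` and `countReachable` are hypothetical names, a plain `long[]` bit set stands in for FixedBitSet, and a simple FIFO queue with a reachability count stands in for the real candidates queue and nearest-neighbor search.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Illustrative sketch only: state is allocated once and cleared per search.
final class ReusableGraphSearcher {
  private final long[] visited; // fixed-size bit set, allocated once
  private final ArrayDeque<Integer> candidates = new ArrayDeque<>(); // reused queue

  ReusableGraphSearcher(int numNodes) {
    visited = new long[(numNodes + 63) >>> 6];
  }

  /** Marks the node as visited and returns whether it was already marked. */
  private boolean getAndSet(int node) {
    long mask = 1L << (node & 63);
    boolean wasSet = (visited[node >>> 6] & mask) != 0;
    visited[node >>> 6] |= mask;
    return wasSet;
  }

  /** Counts nodes reachable from entryPoint, clearing rather than reallocating state. */
  int countReachable(int[][] neighbors, int entryPoint) {
    Arrays.fill(visited, 0L); // reset shared state instead of allocating new structures
    candidates.clear();
    getAndSet(entryPoint);
    candidates.add(entryPoint);
    int count = 1;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      for (int neighbor : neighbors[node]) {
        if (!getAndSet(neighbor)) {
          count++;
          candidates.add(neighbor);
        }
      }
    }
    return count;
  }
}
```

A single instance can then serve every search issued while building a graph, so the per-call cost is a clear rather than an allocation.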
IndexSortSortedNumericDocValuesRangeQuery can implement its count method and compute the count through a binary search, the same binary search that is used to execute the query itself, whenever all the required conditions are met.
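The counting step can be sketched as two binary searches over sorted values. This is a minimal sketch, assuming one value per document and an index sorted in ascending order on that field; the class and method names are hypothetical and the real query operates on doc values rather than a `long[]`.

```java
// Illustrative sketch only; a sorted long[] stands in for index-sorted doc values.
final class SortedRangeCountSketch {

  /** Returns the first index whose value is >= target, or sorted.length if none. */
  static int lowerBound(long[] sorted, long target) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Counts values in [lower, upper] without visiting the matches themselves. */
  static int count(long[] sortedValues, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    int from = lowerBound(sortedValues, lower);
    // Guard against overflow when computing upper + 1.
    int to = upper == Long.MAX_VALUE
        ? sortedValues.length
        : lowerBound(sortedValues, upper + 1);
    return to - from;
  }
}
```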