The current query is returning parent-id's based off of the nearest child-id score. However, its difficult to invert that relationship (meaning determining what exactly the nearest child was during search).
So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.
Now its easy to determine the nearest child vector as that is what the query is returning. To determine its parent, its as simple as using the previously provided parent bit set.
Related to: https://github.com/apache/lucene/pull/12434
we should delete this comment since this constructor parameters already removed from LUCENE-2876 , it's description of 'given Similarity' is a lit bit confuse to reader.
Scorer always provide non-negative
This is a follow up to: https://github.com/apache/lucene/pull/12434
Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.
On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).
Relates #12358
Periodically, the random indexer will force merge on close, this means that what was originally indexed as the zeroth document could no longer be the zeroth document.
This commit adjusts the assertion to ensure the to string format is as expected for `DocAndScoreQuery`, regardless of the matching doc-id in the test.
This seed shows the issue:
```
./gradlew test --tests TestKnnByteVectorQuery.testToString -Dtests.seed=B78CDB966F4B8FC5
```
* hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration
avoid the automaton access on definitely absent characters;
count the scores for all substring lengths together
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.
However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.
This commit adds this ability through some significant changes:
- New leaf reader function that allows a collector for knn results
- The knn results can then utilize bit-sets to join back to the parent id
This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.
This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).
BooleanScorer aligns windows to multiples of 2048, but it doesn't have to.
Actually, not aligning windows can help evaluate fewer windows overall and
speed up query evaluation.
The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the
competitive iterator beyond the end of the window. This may cause bugs with
bulk scorers such as `BooleanScorer` that sometimes delegate to the single
clause that has matches in a given window of doc IDs. We should then make sure
to not advance the competitive iterator beyond the end of the window based on
this clause, as other clauses may have matches as well.
Partitioning scorers is an optimization problem: the optimal set of
non-essential scorers is the subset of scorers whose sum of max window scores
is less than the minimum competitive score that maximizes the sum of costs.
The current approach consists of sorting scorers by maximum score within the
window and computing the set of non-essential clauses as the first scorers
whose sum of max scores is less than the minimum competitive score, ie. you
cannot have a competitive hit by matching only non-essential clauses.
This sorting logic works well in the common case when costs are inversely
correlated with maximum scores and gives an optimal solution: the above
algorithm will also optimize the cost of non-essential clauses and thus
minimize the cost of essential clauses, in-turn further improving query
runtimes. But this isn't true for all queries. E.g. fuzzy queries compute
scores based on artificial term statistics, so scores are no longer inversely
correlated with maximum scores. This was especially visible with the query
`titel~2` on the wikipedia dataset, as `title` matches this query and is a
high-frequency term. Yet the score contribution of this term is in the same
order as the contribution of most other terms, so query runtime gets much
improved if this clause gets considered non-essential rather than essential.
This commit optimize the partitioning logic a bit by sorting clauses by
`max_score / cost` instead of just `max_score`. This will not change anything
in the common case when max scores are inversely correlated with costs, but can
significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my
machine and the wikimedium10m dataset with this change.
Depending whether a document with dimensions > maxDims created
on a new segment or already existing segment, we may get
different error messages. This fix adds another possible
error message we may get.
Relates to #12436
Move vector max dimension limits enforcement into the default Codec's
KnnVectorsFormat implementation. This allows different implementation
of knn search algorithms define their own limits of a maximum
vector dimensions that they can handle.
Closes#12309
Resolving TODO to use UnicodeUtil instead of a copy of its code here.
Maybe slightly slower from the extra check for high-surrogate but that
may be outweigh or better by more compact code and saving the capturing lambda
that might not inline.
Reading ints/floats/longs one-by-one from a heap-byte-buffer, including
doing our own bounds checks is not very efficient. We can use the
ability to translate the buffer and read in bulk while taking turns with
one-off reading/refilling instead.
When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query.
To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed.
I have verified that bumping the search candidate pool to 60 fixes the failure.
The total number of vectors still out numbers the requested number of candidates, so the search is still hitting the graph.
I verified further by running the test again over a couple thousand seeds and it didn't fail again.
Lucene's scorers that can dynamically prune on score provide great speedups
when they manage to skip many hits. Unfortunately, there are also cases when
they cannot skip hits efficiently, one example case being when there are many
clauses in the query. In this case, exhaustively evaluating the set of matches
with `BooleanScorer` (BS1) may perform several times faster.
This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of
collecting hits into a bitset to save the overhead of reordering priority
queues. This helps make performance degrade much more gracefully when dynamic
pruning cannot help much.
Closes#12439
* hunspell: speed up the dictionary enumeration
cache each word's case and the lowercase form
group the words by lengths to avoid even visiting entries with unneeded lengths
* [WIP] Move IntBlockPool slices to MemoryIndex
* [WIP] Working TestMemoryIndex
* [WIP} Working TestSlicedIntBlockPool
* Working many allocations tests
* Add basic IntBlockPool test
* SlicedIntBlockPool inherits from IntBlockPool
* Tidy
`ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors.
This commit fixes that bug.
We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer.
While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.
Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.
However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of things on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search
is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs
tasks concurrently and waits for them to be completed and handles
eventual exceptions.
This commit shares code among these two scenarios, to reduce code
duplicate as well as to ensure that furhter improvements can be shared among them.