Move vector max dimension limits enforcement into the default Codec's
KnnVectorsFormat implementation. This allows different implementation
of knn search algorithms define their own limits of a maximum
vector dimensions that they can handle.
Closes#12309
Resolving TODO to use UnicodeUtil instead of a copy of its code here.
Maybe slightly slower from the extra check for high-surrogate but that
may be outweigh or better by more compact code and saving the capturing lambda
that might not inline.
Reading ints/floats/longs one-by-one from a heap-byte-buffer, including
doing our own bounds checks is not very efficient. We can use the
ability to translate the buffer and read in bulk while taking turns with
one-off reading/refilling instead.
When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query.
To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed.
I have verified that bumping the search candidate pool to 60 fixes the failure.
The total number of vectors still out numbers the requested number of candidates, so the search is still hitting the graph.
I verified further by running the test again over a couple thousand seeds and it didn't fail again.
Lucene's scorers that can dynamically prune on score provide great speedups
when they manage to skip many hits. Unfortunately, there are also cases when
they cannot skip hits efficiently, one example case being when there are many
clauses in the query. In this case, exhaustively evaluating the set of matches
with `BooleanScorer` (BS1) may perform several times faster.
This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of
collecting hits into a bitset to save the overhead of reordering priority
queues. This helps make performance degrade much more gracefully when dynamic
pruning cannot help much.
Closes#12439
* hunspell: speed up the dictionary enumeration
cache each word's case and the lowercase form
group the words by lengths to avoid even visiting entries with unneeded lengths
* [WIP] Move IntBlockPool slices to MemoryIndex
* [WIP] Working TestMemoryIndex
* [WIP} Working TestSlicedIntBlockPool
* Working many allocations tests
* Add basic IntBlockPool test
* SlicedIntBlockPool inherits from IntBlockPool
* Tidy
`ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors.
This commit fixes that bug.
We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer.
While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.
Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.
However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of things on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search
is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs
tasks concurrently and waits for them to be completed and handles
eventual exceptions.
This commit shares code among these two scenarios, to reduce code
duplicate as well as to ensure that furhter improvements can be shared among them.
`AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may
not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of
`BulkScorer#score`, a reasonable implementation should never have return values
in this range, as it would suggest that more matches need collecting when we're
already out of the range of valid doc IDs. So this generally indicates a bug.
`MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the
requested window of doc IDs, when the sum of maximum scores would be less than
the minimum competitive score. In that case, the best information it has is
that there are no matches in the window, but it cannot give a good estimate of
the next potential match.
This assertion in `AssertingBulkScorer` looks sane to me, so I made a small
change to `MaxScoreBulkScorer` to make sure it meets `AssertingScorer`'s
expectations. This is done in a place that is only called once per scored
window, so it should not have a noticeable performance impact.
This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance.
This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached.
The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added.
* Re-use NeighborQueue during build's search
* improve javadoc for OnHeapHnswGraphSearcher
* assert that results parameter is minheap as expected
* update CHANGES
We have recently increased the likelihood of leveraging inter-segment search
concurrency in tests when `newSearcher` is used to create the index
searcher (see #959). When parallel execution is enabled though, it is
dependent on the number of documents and segments. That means
that out of 1000 test runs that use `RandomIndexWriter` to index a random
number of docs up to 100, we will effectively parallelize only a couple
of times.
This commit increases the likelihood of running concurrent searches by
randomly forcing 1 max segments per slice as well as 1 max doc per slice.
PR #12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.