This PR involves the refactoring of the HNSW builder and searcher, aiming to create an abstraction for the random access and vector comparisons conducted during graph traversal.
The newly added RandomVectorScorer provides a means to directly compare ordinals, eliminating the need to expose the raw vector primitive type.
This scorer takes charge of vector retrieval and comparison during the graph's construction and search processes.
The primary purpose of this abstraction is to enable the implementation of various strategies.
For example, it opens the door to constructing the graph using the original float vectors while performing searches using their quantized int8 vector counterparts.
* Document why we need `lastPosBlockOffset`
* Let ./gradlew tidy fix the formatting
* Fix '<' with <
---------
Co-authored-by: Tony Xu <tonyx@amazon.com>
There are a few places where tests don't close index readers. This has
not caused problems so far, but it becomes an issue when the reader gets
an executor, because its shutdown happens as a closing listener of the
reader. This has become more evident since we now offload sequential
execution to the executor. If there's an executor, but it's never used,
no threads are created, and no threads are leaked. If we do use the
executor, and the reader is not closed, the test leaks threads.
When an executor is set to the IndexSearcher, we should try and offload
most of the computation to such executor. Ideally, the caller thread
would only do light coordination work, and the executor is responsible
for the heavier workload. If we don't offload sequential execution to
the executor, it becomes very difficult to make any distinction about
the type of workload performed on the two sides.
Closes#12498
When performing concurrent search, we may get an execution exception
from one or more slices. In that case, we'd like to rethrow the cause of
the execution exception, which we do by wrapping it into a new runtime
exception. Instead, we can rethrow runtime exceptions as-is, and the
same is true for io exceptions. Any other exception is still wrapped
into a new runtime exception. This unifies the exceptions that get
thrown between sequential codepath (when no executor is provided) and
concurrent codepath (when an executor is provided).
This commit removes the QueueSizeBasedExecutor (package private) in favour of simply offloading concurrent execution to the provided executor. In need of specific behaviour, it can all be included in the executor itself.
This removes an instanceof check that determines which type of executor wrapper is used, which means that some tasks may be executed on the caller thread depending on queue size, whenever a rejection happens, or always for the last slice. This behaviour is not configurable in any way, and is too rigid. Rather than making this pluggable, I propose to make Lucene less opinionated about concurrent tasks execution and require that users include their own execution strategy directly in the executor that they provide to the index searcher.
Relates to #12498
The current query is returning parent-id's based off of the nearest child-id score. However, its difficult to invert that relationship (meaning determining what exactly the nearest child was during search).
So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.
Now its easy to determine the nearest child vector as that is what the query is returning. To determine its parent, its as simple as using the previously provided parent bit set.
Related to: https://github.com/apache/lucene/pull/12434
we should delete this comment since this constructor parameters already removed from LUCENE-2876 , it's description of 'given Similarity' is a lit bit confuse to reader.
Scorer always provide non-negative
This is a follow up to: https://github.com/apache/lucene/pull/12434
Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.
On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).
Relates #12358
Periodically, the random indexer will force merge on close, this means that what was originally indexed as the zeroth document could no longer be the zeroth document.
This commit adjusts the assertion to ensure the to string format is as expected for `DocAndScoreQuery`, regardless of the matching doc-id in the test.
This seed shows the issue:
```
./gradlew test --tests TestKnnByteVectorQuery.testToString -Dtests.seed=B78CDB966F4B8FC5
```
* hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration
avoid the automaton access on definitely absent characters;
count the scores for all substring lengths together
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.
However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.
This commit adds this ability through some significant changes:
- New leaf reader function that allows a collector for knn results
- The knn results can then utilize bit-sets to join back to the parent id
This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.
This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).
BooleanScorer aligns windows to multiples of 2048, but it doesn't have to.
Actually, not aligning windows can help evaluate fewer windows overall and
speed up query evaluation.
The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the
competitive iterator beyond the end of the window. This may cause bugs with
bulk scorers such as `BooleanScorer` that sometimes delegate to the single
clause that has matches in a given window of doc IDs. We should then make sure
to not advance the competitive iterator beyond the end of the window based on
this clause, as other clauses may have matches as well.
Partitioning scorers is an optimization problem: the optimal set of
non-essential scorers is the subset of scorers whose sum of max window scores
is less than the minimum competitive score that maximizes the sum of costs.
The current approach consists of sorting scorers by maximum score within the
window and computing the set of non-essential clauses as the first scorers
whose sum of max scores is less than the minimum competitive score, ie. you
cannot have a competitive hit by matching only non-essential clauses.
This sorting logic works well in the common case when costs are inversely
correlated with maximum scores and gives an optimal solution: the above
algorithm will also optimize the cost of non-essential clauses and thus
minimize the cost of essential clauses, in-turn further improving query
runtimes. But this isn't true for all queries. E.g. fuzzy queries compute
scores based on artificial term statistics, so scores are no longer inversely
correlated with maximum scores. This was especially visible with the query
`titel~2` on the wikipedia dataset, as `title` matches this query and is a
high-frequency term. Yet the score contribution of this term is in the same
order as the contribution of most other terms, so query runtime gets much
improved if this clause gets considered non-essential rather than essential.
This commit optimize the partitioning logic a bit by sorting clauses by
`max_score / cost` instead of just `max_score`. This will not change anything
in the common case when max scores are inversely correlated with costs, but can
significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my
machine and the wikimedium10m dataset with this change.
Depending whether a document with dimensions > maxDims created
on a new segment or already existing segment, we may get
different error messages. This fix adds another possible
error message we may get.
Relates to #12436
Move vector max dimension limits enforcement into the default Codec's
KnnVectorsFormat implementation. This allows different implementation
of knn search algorithms define their own limits of a maximum
vector dimensions that they can handle.
Closes#12309
Resolving TODO to use UnicodeUtil instead of a copy of its code here.
Maybe slightly slower from the extra check for high-surrogate but that
may be outweigh or better by more compact code and saving the capturing lambda
that might not inline.
Reading ints/floats/longs one-by-one from a heap-byte-buffer, including
doing our own bounds checks is not very efficient. We can use the
ability to translate the buffer and read in bulk while taking turns with
one-off reading/refilling instead.
When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query.
To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed.
I have verified that bumping the search candidate pool to 60 fixes the failure.
The total number of vectors still out numbers the requested number of candidates, so the search is still hitting the graph.
I verified further by running the test again over a couple thousand seeds and it didn't fail again.
Lucene's scorers that can dynamically prune on score provide great speedups
when they manage to skip many hits. Unfortunately, there are also cases when
they cannot skip hits efficiently, one example case being when there are many
clauses in the query. In this case, exhaustively evaluating the set of matches
with `BooleanScorer` (BS1) may perform several times faster.
This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of
collecting hits into a bitset to save the overhead of reordering priority
queues. This helps make performance degrade much more gracefully when dynamic
pruning cannot help much.
Closes#12439
* hunspell: speed up the dictionary enumeration
cache each word's case and the lowercase form
group the words by lengths to avoid even visiting entries with unneeded lengths
* [WIP] Move IntBlockPool slices to MemoryIndex
* [WIP] Working TestMemoryIndex
* [WIP} Working TestSlicedIntBlockPool
* Working many allocations tests
* Add basic IntBlockPool test
* SlicedIntBlockPool inherits from IntBlockPool
* Tidy
`ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors.
This commit fixes that bug.
We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer.
While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.
Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.
However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of things on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search
is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs
tasks concurrently and waits for them to be completed and handles
eventual exceptions.
This commit shares code among these two scenarios, to reduce code
duplicate as well as to ensure that furhter improvements can be shared among them.
`AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may
not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of
`BulkScorer#score`, a reasonable implementation should never have return values
in this range, as it would suggest that more matches need collecting when we're
already out of the range of valid doc IDs. So this generally indicates a bug.
`MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the
requested window of doc IDs, when the sum of maximum scores would be less than
the minimum competitive score. In that case, the best information it has is
that there are no matches in the window, but it cannot give a good estimate of
the next potential match.
This assertion in `AssertingBulkScorer` looks sane to me, so I made a small
change to `MaxScoreBulkScorer` to make sure it meets `AssertingScorer`'s
expectations. This is done in a place that is only called once per scored
window, so it should not have a noticeable performance impact.
This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance.
This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached.
The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added.
* Re-use NeighborQueue during build's search
* improve javadoc for OnHeapHnswGraphSearcher
* assert that results parameter is minheap as expected
* update CHANGES
We have recently increased the likelihood of leveraging inter-segment search
concurrency in tests when `newSearcher` is used to create the index
searcher (see #959). When parallel execution is enabled though, it is
dependent on the number of documents and segments. That means
that out of 1000 test runs that use `RandomIndexWriter` to index a random
number of docs up to 100, we will effectively parallelize only a couple
of times.
This commit increases the likelihood of running concurrent searches by
randomly forcing 1 max segments per slice as well as 1 max doc per slice.
PR #12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.
When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.
This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.
Fixes#12356
* TestHunspell: reduce the flakiness probability
We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.
Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.
The concurrent query rewrite for knn vectory query introduced with #12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.
Leverage accelerated vector hardware instructions in Vector Search.
Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).
Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>
The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.
QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs needs updating.
There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.
The term member of TermAndBoost used to be a Term instance and became a
BytesRef with #11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.
Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is undefined. This commit changes to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.
QueryProfilerWeight should override matches and delegate to the
subQueryWeight. Another way to fix this issue is to make it extend
ProfileWeight and override only methods that need to have a different
behaviour than delegating to the sub weight.