If a caller requires the Weight, it has to keep track of the Weight with which the Scorer was created in the first place, instead of relying on the Scorer.
Closes #13410
* Remove hppc dependency
* Change fork version to 0.10.0
* Add @lucene.internal
* Move hppc classes to oal.internal.hppc but export it.
* Delete hppc license since it's no longer a dependency.
---------
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
Add LongObjectHashMap and replace Map<Long, Object>.
Add LongIntHashMap and replace Map<Long, Int>.
Add HPPC dependency to join and spatial modules for primitive values float and double.
Follow up to: #13181
I noticed the quantized interface had a slightly different name.
Additionally, testing showed we are inconsistent when there aren't any vectors to score. This makes the response consistent (e.g. null when there aren't any vectors).
With quantized vectors, and with current vectors, we separate out the "scoring" vs. "iteration", requiring the user to always iterate the raw vectors and provide their own similarity function.
While this is flexible, it creates frustration in:
- Just iterating and scoring, especially since the field already has a similarity function stored. Why can't we just know which one to use and use it?
- Iterating and scoring quantized vectors. By default it would be good to be able to iterate and score quantized vectors (e.g. without going through the HNSW graph).
This significantly hampers support for true exact kNN search.
This commit extends the vector value iterators so they can return a scorer for a given vector value (which is what this PR demonstrates). The scorer contains a copy of the originating iterator and allows iteration and scoring in the most optimized way the codec can provide.
Users can still iterate vector values directly, read them on heap, and score any way they please.
* Add timeout support to graph searches in AbstractKnnVectorQuery
* Also timeout exact searches
* Return partial KNN results
* Add tests for partial KNN results
- Refactor tests to base classes
- Also timeout exact searches in Lucene99HnswVectorsReader
* Add CHANGES.txt entry and fix some comments
---------
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
I repeatedly saw some test failures related to `TestParentBlockJoin[Byte|Float]KnnVectorQuery#testVectorEncodingMismatch`. This commit fixes those test failures and actually checks the field type.
Currently, when performing KNN exact search, we consistently set the HitQueue size to k. However, there may be instances where the number of candidates is actually lower than k.
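The idea can be illustrated with a minimal sketch. It uses a plain `java.util.PriorityQueue` rather than Lucene's `HitQueue`, and the class and method names are hypothetical; the only point is that the queue is sized to `min(k, numCandidates)` instead of always `k`.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical sketch, not Lucene's HitQueue: a top-k heap sized to the candidate count. */
class BoundedTopK {
  /** Returns the best scores in descending order, keeping at most min(k, candidates) entries. */
  static float[] topK(float[] candidateScores, int k) {
    int queueSize = Math.min(k, candidateScores.length);
    if (queueSize <= 0) {
      return new float[0]; // nothing to collect
    }
    // Min-heap: the head is the weakest score currently kept.
    PriorityQueue<Float> minHeap = new PriorityQueue<>(queueSize);
    for (float score : candidateScores) {
      if (minHeap.size() < queueSize) {
        minHeap.add(score);
      } else if (score > minHeap.peek()) {
        minHeap.poll();
        minHeap.add(score);
      }
    }
    // Drain weakest-first into the tail so the result is in descending order.
    float[] result = new float[minHeap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = minHeap.poll();
    }
    return result;
  }

  public static void main(String[] args) {
    // Only 3 candidates for k = 10: the heap holds 3 entries, not 10.
    System.out.println(Arrays.toString(topK(new float[] {0.9f, 0.1f, 0.5f}, 10)));
  }
}
```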
This adds multi-leaf optimizations for diversified children collector. This means as children vectors are collected within a block join, we can share information between leaves to speed up vector search.
To make this happen, I refactored the multi-leaf collector slightly. Now, instead of inheriting from TopKnnCollector, we inject an inner collector.
The original tests assume particular document orders & scores. To make the test more resilient to random flushes & merges, I adjusted the assertion conditions. Particularly, we verify matching ids -> score instead of relying on docIds.
closes: https://github.com/apache/lucene/issues/13057
Speed up concurrent multi-segment HNSW graph search by exchanging
the global top candidates collected so far across segments. These global top
candidates set the minimum threshold that new candidates need to pass
to be considered. This allows earlier stopping for segments that don't have
good candidates.
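A simplified sketch of the threshold sharing described above. It assumes a single synchronized heap shared by all segment searches; the class name and methods are hypothetical, and the actual change may synchronize less eagerly than this does.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: a global top-k shared across concurrent per-segment searches. */
class GlobalTopScores {
  private final int k;
  private final PriorityQueue<Float> minHeap = new PriorityQueue<>();

  GlobalTopScores(int k) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be >= 1");
    }
    this.k = k;
  }

  /** Called by any segment search when it collects a candidate. */
  synchronized void offer(float score) {
    if (minHeap.size() < k) {
      minHeap.add(score);
    } else if (score > minHeap.peek()) {
      minHeap.poll();
      minHeap.add(score);
    }
  }

  /** Threshold a new candidate must pass; no threshold until k candidates were seen. */
  synchronized float minCompetitiveScore() {
    return minHeap.size() < k ? Float.NEGATIVE_INFINITY : minHeap.peek();
  }

  public static void main(String[] args) {
    GlobalTopScores global = new GlobalTopScores(2);
    global.offer(0.5f); // from segment A
    global.offer(0.8f); // from segment B
    // A segment whose best remaining candidate scores below this can stop early.
    System.out.println(global.minCompetitiveScore());
  }
}
```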
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc
IDs on merge using a `BPIndexReorderer`.
- Reordering always runs on forced merges.
- A `minNaturalMergeNumDocs` parameter helps only enable reordering on the
larger merged segments. This way, small merges retain all merging
optimizations like bulk copying of stored fields, and only the larger
segments - which are the most important for search performance - get
reordered.
- If not enough RAM is available to perform reordering, reordering is skipped.
To make this work, I had to add the ability for any merge to reorder doc IDs of
the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the
test framework randomly reverses the order of documents in a merged segment to
make sure this logic is properly exercised.
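The three rules above can be summarized as a small decision function. This is only a sketch: the class, method, and the RAM parameters are hypothetical, and treating `minNaturalMergeNumDocs` as an inclusive lower bound is an assumption.

```java
/** Hypothetical sketch of the reordering decision; not the BPReorderingMergePolicy code. */
class ReorderDecision {
  static boolean shouldReorder(
      boolean forcedMerge,
      int mergedNumDocs,
      int minNaturalMergeNumDocs,
      long ramNeededBytes,
      long ramBudgetBytes) {
    // Reordering is skipped when not enough RAM is available.
    if (ramNeededBytes > ramBudgetBytes) {
      return false;
    }
    // Forced merges always reorder; natural merges only when large enough
    // (inclusive bound assumed here).
    return forcedMerge || mergedNumDocs >= minNaturalMergeNumDocs;
  }

  public static void main(String[] args) {
    // A small natural merge keeps its bulk-copy optimizations.
    System.out.println(shouldReorder(false, 1_000, 1_000_000, 10, 100));
    // A forced merge is reordered regardless of size.
    System.out.println(shouldReorder(true, 1_000, 1_000_000, 10, 100));
  }
}
```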
* Skip document by docValues
* When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and bottom value is 5, we will use a range on [MIN_VALUE, 4].
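A minimal sketch of the range computation from the example above, on plain `long` values. The class and method names are hypothetical, and the descending case is an extrapolation that the text does not spell out.

```java
import java.util.Arrays;

/** Hypothetical sketch: the range of values that can still beat the queue's bottom value. */
class CompetitiveRange {
  /** Returns {min, max} (both inclusive), or null when no value can beat the bottom. */
  static long[] betterThanBottom(long bottom, boolean ascending) {
    if (ascending) {
      if (bottom == Long.MIN_VALUE) {
        return null; // nothing sorts before the bottom
      }
      return new long[] {Long.MIN_VALUE, bottom - 1};
    }
    // Descending sort (assumed symmetric to the ascending case).
    if (bottom == Long.MAX_VALUE) {
      return null;
    }
    return new long[] {bottom + 1, Long.MAX_VALUE};
  }

  public static void main(String[] args) {
    // Ascending sort with bottom value 5: only [MIN_VALUE, 4] is competitive.
    System.out.println(Arrays.toString(betterThanBottom(5, true)));
  }
}
```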
---------
Co-authored-by: Adrien Grand <jpountz@gmail.com>
`ImpactsDISI` is nice: you give it an `ImpactsEnum`, typically coming from the
`PostingsFormat` and it will automatically skip hits whose score cannot be
greater than the minimum competitive score. This is the class that yields 10x
or more speedups on top-level `TermQuery`s compared to exhaustive evaluation.
However, when nested under a disjunction or a conjunction, `ImpactsDISI`
typically adds more overhead than it enables skipping. The reason is that on a
disjunction `a OR b`, the minimum competitive score of `a` is the minimum score
for the disjunction minus the maximum score of `b`. While this sort of
propagation of minimum competitive scores down the query tree sometimes helps,
it does hurt more than it helps on average, because `ImpactsDISI` adds quite
some overhead and the per-clauses minimum scores are usually so low that they
don't actually enable skipping hits. I looked into reducing this overhead, but
a big part of it is the additional virtual call, so the only way to get rid of
this overhead is to not wrap with an `ImpactsDISI` at all.
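To make the arithmetic concrete, here is a small sketch of how the threshold for clause `a` is derived in `a OR b`. The class and method names are hypothetical, and treating a block as skippable only when its maximum score is below the threshold is an assumption of this sketch.

```java
/** Hypothetical sketch of minimum competitive score propagation in a two-clause disjunction. */
class ClauseMinScore {
  /** Threshold for clause a in `a OR b`: disjunction threshold minus the max score of b. */
  static float minCompetitiveForClause(float disjunctionMinScore, float otherClauseMaxScore) {
    return disjunctionMinScore - otherClauseMaxScore;
  }

  /** A block of clause hits can be skipped only if even its best score is not competitive. */
  static boolean canSkipBlock(float blockMaxScore, float clauseMinScore) {
    return blockMaxScore < clauseMinScore;
  }

  public static void main(String[] args) {
    // The disjunction needs 2.0, but b alone can contribute up to 2.5,
    // so a's threshold is negative and no block of a can ever be skipped.
    float thresholdForA = minCompetitiveForClause(2.0f, 2.5f);
    System.out.println(thresholdForA + " -> skip? " + canSkipBlock(0.25f, thresholdForA));
  }
}
```

With typical scores, the per-clause threshold is often at or below zero as in this example, which is why the wrapper tends to cost more than it saves when nested.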
This means that scorers need a way to know whether they are producing the
top-level score, or whether they are producing a partial score that then gets
combined into the top-level score. Term queries would then only wrap with
`ImpactsDISI` when they produce the top-level score. Note that this does not
only include top-level term queries, but also conjunctions that have a single
scoring clause (`a #b`) or combinations of a term query and one or more
prohibited clauses (`a -b`).
The current query returns parent ids based on the nearest child-id score. However, it's difficult to invert that relationship (that is, to determine which child was actually the nearest during search).
So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.
Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.
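A minimal sketch of that lookup, using `java.util.BitSet` as a stand-in for the parent bit set (the class and method names are hypothetical). It relies on the block-join layout, in which children are indexed immediately before their parent, so the parent of a child is the next set bit at or after the child's doc id.

```java
import java.util.BitSet;

/** Hypothetical sketch: resolve the parent of a child doc id from a parent bit set. */
class ParentLookup {
  /** Returns the parent doc id for the given child, or -1 if no parent follows it. */
  static int parentOf(BitSet parentBits, int childDoc) {
    return parentBits.nextSetBit(childDoc);
  }

  public static void main(String[] args) {
    // Block 1: children 0..2, parent 3. Block 2: children 4..5, parent 6.
    BitSet parents = new BitSet();
    parents.set(3);
    parents.set(6);
    System.out.println(parentOf(parents, 1)); // child 1 belongs to parent 3
    System.out.println(parentOf(parents, 4)); // child 4 belongs to parent 6
  }
}
```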
Related to: https://github.com/apache/lucene/pull/12434
This is a follow up to: https://github.com/apache/lucene/pull/12434
Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE.
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.
However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.
This commit adds this ability through some significant changes:
- New leaf reader function that allows a collector for knn results
- The knn results can then utilize bit-sets to join back to the parent id
This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.
This does not replace the need for multi-valued vectors, which are important for other ranking methods (e.g. ColBERT token embeddings). But it could be used when metadata about the passage embedding must be stored (e.g. the related passage).
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 terms or fewer match the query. Otherwise postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery.
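The term-splitting step of the mixed approach can be sketched as follows. This is not the actual rewrite code: the class, the `Split` holder, and representing terms by their index into a `docFreqs` array are all hypothetical simplifications. It keeps the 16 terms with the highest document frequency for the disjunction and sends the rest to the bit-set side.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of splitting matching terms for a blended rewrite. */
class BlendedRewriteSketch {
  static final int MAX_DISJUNCTION_TERMS = 16;

  /** Term indices that go to the disjunction vs. to the DocIdSetBuilder-like side. */
  static final class Split {
    final List<Integer> disjunctionTerms = new ArrayList<>();
    final List<Integer> bitSetTerms = new ArrayList<>();
  }

  static Split split(int[] docFreqs) {
    // Min-heap on docFreq: the head is the lowest-frequency term kept so far.
    PriorityQueue<Integer> highest =
        new PriorityQueue<>(Comparator.comparingInt((Integer i) -> docFreqs[i]));
    Split split = new Split();
    for (int term = 0; term < docFreqs.length; term++) {
      highest.add(term);
      if (highest.size() > MAX_DISJUNCTION_TERMS) {
        // Evict the rarest term; its postings would be collected into the bit set.
        split.bitSetTerms.add(highest.poll());
      }
    }
    split.disjunctionTerms.addAll(highest);
    return split;
  }

  public static void main(String[] args) {
    int[] docFreqs = new int[20];
    for (int i = 0; i < docFreqs.length; i++) {
      docFreqs[i] = i + 1; // term i matches i + 1 documents
    }
    Split split = split(docFreqs);
    System.out.println("disjunction: " + split.disjunctionTerms.size()
        + ", bit set: " + split.bitSetTerms);
  }
}
```

On a zipfian field the evicted terms are the rare ones, so the bit-set side only ever sees short postings lists.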
Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
* Grammar: Remove incidents of "the the" in comments.
* fixes formatting, as per helpful comment from Mike
* Running ./gradlew :lucene:misc:spotlessApply again made more changes.
* It keeps finding new things ... what's up with this?
* Fixing more nits that gradlew finds. Sorry, folks. I am new at this.