333 Commits

Author SHA1 Message Date
zhouhui 057cbf3c86
Use getAndSet, getAndClear instead of split operations. (#13507) 2024-06-19 11:53:12 +02:00
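A rough illustration of the idea behind #13507: a combined get-and-set reads the previous bit value and updates it in one step, instead of a separate `get` followed by `set` or `clear`. The `TinyBitSet` class below is a hypothetical stand-in written for this sketch, not Lucene's own bit set implementation.

```java
/** Hypothetical minimal bit set, for illustration only (not Lucene's implementation). */
final class TinyBitSet {
  private final long[] words;

  TinyBitSet(int numBits) {
    words = new long[(numBits + 63) >>> 6];
  }

  boolean get(int index) {
    // Java masks the shift count of a long to its low 6 bits.
    return (words[index >>> 6] & (1L << index)) != 0;
  }

  void set(int index) {
    words[index >>> 6] |= 1L << index;
  }

  /** Sets the bit and returns its previous value, locating the word only once. */
  boolean getAndSet(int index) {
    int wordIndex = index >>> 6;
    long mask = 1L << index;
    boolean previous = (words[wordIndex] & mask) != 0;
    words[wordIndex] |= mask;
    return previous;
  }

  /** Clears the bit and returns its previous value. */
  boolean getAndClear(int index) {
    int wordIndex = index >>> 6;
    long mask = 1L << index;
    boolean previous = (words[wordIndex] & mask) != 0;
    words[wordIndex] &= ~mask;
    return previous;
  }

  public static void main(String[] args) {
    TinyBitSet seen = new TinyBitSet(128);
    // Split form would be: if (!seen.get(70)) { seen.set(70); ... }
    if (!seen.getAndSet(70)) {
      System.out.println("first visit to doc 70");
    }
  }
}
```

The combined form also makes "mark as visited if not already visited" a single expression at the call site.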
Sanjay Dutt d0d2aa274f
Removed Scorer#getWeight (#13440)
If a caller requires the Weight, it has to keep track of the Weight with which the Scorer was created in the first place instead of relying on the Scorer.

Closes #13410
2024-06-06 16:03:19 +02:00
Bruno Roustant f394c9418e
Remove the HPPC dependency from all modules and move the HPPC fork to internal. (#13422)
* Remove hppc dependency
* Change fork version to 0.10.0
* Add @lucene.internal
* Move hppc classes to oal.internal.hppc but export it.
* Delete hppc license since it's no longer a dependency.

---------

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
2024-05-27 12:09:25 +02:00
Bruno Roustant 444d4e7c42
Replace List<Integer> by IntArrayList and List<Long> by LongArrayList. (#13406) 2024-05-25 19:53:42 +02:00
Bruno Roustant f70999980c
Replace Map<Long, Object> by primitive LongObjectHashMap. (#13392)
Add LongObjectHashMap and replace Map<Long, Object>.
Add LongIntHashMap and replace Map<Long, Int>.
Add HPPC dependency to join and spatial modules for primitive values float and double.
2024-05-21 17:11:34 +02:00
Sanjay Dutt 838b23ebed
Make Weight#scorerSupplier abstract, Weight#scorer final (#13319)
Co-authored-by: iamsanjay <sanjaydutt.india@yahoo.com>
2024-05-14 17:44:30 +02:00
Benjamin Trent fc12cc1847
Fix vector scorer interface consistency (#13365)
Follow up to: #13181

I noticed the quantized interface had a slightly different name.

Additionally, testing showed we are inconsistent when there aren't any vectors to score. This makes the response consistent (e.g. null when there aren't any vectors).
2024-05-13 13:02:38 -04:00
Benjamin Trent b60e86c4b9
Add new VectorScorer interface to vector value iterators (#13181)
With quantized vectors, and with current vectors, we separate out the "scoring" vs. "iteration", requiring the user to always iterate the raw vectors and provide their own similarity function.

While this is flexible, it creates frustration in:

 - Just iterating and scoring, especially since the field already has a similarity function stored...Why can't we just know which one to use and use it!
 - Iterating and scoring quantized vectors. By default it would be good to be able to iterate and score quantized vectors (e.g. without going through the HNSW graph).

This significantly hampers support for true exact kNN search.

This commit extends the vector value iterators to be able to return a scorer given some vector value (what this PR demonstrates). The scorer contains a copy of the originating iterator and allows for iteration and scoring the most optimized way the provided codec can give. 

Users can still iterate vector values directly, read them on heap, and score any way they please.
2024-05-09 16:30:14 -04:00
Sanjay Dutt 66b121f9b0
Remove deprecated code (#13286)
Co-authored-by: iamsanjay <sanjaydutt.india@yahoo.com>
2024-04-11 09:44:29 +02:00
Kaival Parikh 13cf882677
Fix failing BaseKnnVectorQueryTestCase#testTimeout (#13283)
* Fix failing BaseKnnVectorQueryTestCase#testTimeout

* Also fix ParentBlockJoinKnnVectorQueryTestCase#testTimeout

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
2024-04-10 20:43:01 -07:00
Benjamin Trent 88c0909210
Test mute for issue #13272 (#13273) 2024-04-04 16:21:32 -04:00
Kaival Parikh df154cdc22
Add timeout support to AbstractKnnVectorQuery (#13202)
* Add timeout support to graph searches in AbstractKnnVectorQuery

* Also timeout exact searches

* Return partial KNN results

* Add tests for partial KNN results

- Refactor tests to base classes
- Also timeout exact searches in Lucene99HnswVectorsReader

* Add CHANGES.txt entry and fix some comments

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
2024-04-03 06:53:24 -07:00
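A minimal sketch of the "time out and return partial results" behaviour described in #13202. It assumes a caller-supplied `shouldExit` check, modelled here with a plain `BooleanSupplier`; the `TimeLimitedScan` class and its `Result` type are hypothetical names for this illustration, not the classes changed by the commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: a scan that stops on timeout but keeps what it has gathered. */
final class TimeLimitedScan {
  /** Visited doc IDs plus a flag telling whether the scan stopped early. */
  static final class Result {
    final List<Integer> docs;
    final boolean partial;

    Result(List<Integer> docs, boolean partial) {
      this.docs = docs;
      this.partial = partial;
    }
  }

  /** Visits docs 0..maxDoc-1, checking the timeout before each document. */
  static Result scan(int maxDoc, BooleanSupplier shouldExit) {
    List<Integer> visited = new ArrayList<>();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (shouldExit.getAsBoolean()) {
        return new Result(visited, true); // partial results instead of nothing
      }
      visited.add(doc);
    }
    return new Result(visited, false);
  }

  public static void main(String[] args) {
    long deadline = System.nanoTime() + 1_000_000L; // 1 ms budget, arbitrary for the demo
    Result r = scan(1_000_000, () -> System.nanoTime() > deadline);
    System.out.println("visited=" + r.docs.size() + " partial=" + r.partial);
  }
}
```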
Benjamin Trent c0d81932df
Fix vector type check for diversified knn search (#13235)
I repeatedly saw some test failures related to `TestParentBlockJoin[Byte|Float]KnnVectorQuery#testVectorEncodingMismatch`. This commit fixes those test failures and actually checks the field type.
2024-03-29 13:49:58 -04:00
panguixin 7a08eeab47
Make the HitQueue size more appropriate for KNN exact search (#13184)
Currently, when performing KNN exact search, we consistently set the HitQueue size to k. However, there may be instances where the number of candidates is actually lower than k.
2024-03-19 12:24:36 -04:00
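The sizing rule from #13184 can be sketched as follows: cap the queue at the smaller of `k` and the number of candidates. The example uses `java.util.PriorityQueue` as a stand-in for Lucene's `HitQueue`, and the class and method names are hypothetical.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch of sizing a top-k queue for exact search; not Lucene code. */
final class ExactSearchQueueSize {
  /** The queue never needs more slots than there are candidates. */
  static int queueSize(int k, int candidateCount) {
    return Math.min(k, candidateCount);
  }

  /** Keeps the best scores in a min-heap and returns them in descending order. */
  static float[] topScores(float[] candidateScores, int k) {
    int size = queueSize(k, candidateScores.length);
    if (size <= 0) {
      return new float[0];
    }
    PriorityQueue<Float> heap = new PriorityQueue<>(size);
    for (float score : candidateScores) {
      if (heap.size() < size) {
        heap.add(score);
      } else if (score > heap.peek()) {
        heap.poll(); // evict the current worst of the kept scores
        heap.add(score);
      }
    }
    float[] out = new float[heap.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = heap.poll(); // the heap yields the smallest score first
    }
    return out;
  }

  public static void main(String[] args) {
    float[] scores = {0.2f, 0.9f, 0.5f};
    System.out.println(queueSize(10, scores.length));
    System.out.println(java.util.Arrays.toString(topScores(scores, 10)));
  }
}
```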
panguixin 5b5815a26d
Fix NPE when LeafReader returns null VectorValues (#13162)
### Description
`LeafReader#getXXXVectorValues` may return null value.

**Reproduction**:
```java
public class TestKnnByteVectorQuery extends BaseKnnVectorQueryTestCase {
  public void testVectorEncodingMismatch() throws IOException {
    try (Directory indexStore =
        getIndexStore("field", new float[] {0, 1}, new float[] {1, 2}, new float[] {0, 0});
        IndexReader reader = DirectoryReader.open(indexStore)) {
      AbstractKnnVectorQuery query =
          new KnnFloatVectorQuery("field", new float[] {0, 1}, 10);
      IndexSearcher searcher = newSearcher(reader);
      searcher.search(query, 10);
    }
  }
}
```
**Output**:
```
java.lang.NullPointerException: Cannot invoke "org.apache.lucene.index.FloatVectorValues.size()" because the return value of "org.apache.lucene.index.LeafReader.getFloatVectorValues(String)" is null
```
2024-03-11 08:07:04 -04:00
Benjamin Trent 012b959b05
Add multi-leaf optimizations for diversify children collector (#13121)
This adds multi-leaf optimizations for diversified children collector. This means as children vectors are collected within a block join, we can share information between leaves to speed up vector search.

To make this happen, I refactored the multi-leaf collector slightly. Now, instead of inheriting from TopKnnCollector, we inject an inner collector.
2024-03-05 09:02:49 -05:00
Benjamin Trent af0d0edf5e
Fix test failure in #13057 (#13102)
The original tests assume particular document orders & scores. To make the test more resilient to random flushes & merges, I adjusted the assertion conditions. Particularly, we verify matching ids -> score instead of relying on docIds.

closes: https://github.com/apache/lucene/issues/13057
2024-02-16 09:33:34 -05:00
Benjamin Trent 810e43c58a
Fix test failure TestParentBlockJoinFloatKnnVectorQuery.testSkewedIndex (#13082) 2024-02-07 12:55:43 -05:00
Mayya Sharipova d095ed02a2
Speed up concurrent multi-segment HNSW graph search (#12962)
Speed up concurrent multi-segment HNSW graph search by exchanging
the global top candidates collected so far across segments. These global top
candidates set the minimum threshold that new candidates need to pass
to be considered. This allows earlier stopping for segments that don't have
good candidates.
2024-02-06 09:16:06 -05:00
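The threshold sharing described in #12962 can be approximated with a small shared structure. This is a simplified sketch under stated assumptions: `GlobalTopScores` is a hypothetical class, segments are processed one after another for clarity, and a real HNSW search applies the threshold during graph traversal rather than over a flat array of scores.

```java
import java.util.PriorityQueue;

/** Hypothetical shared holder of the best k scores seen across all segments. */
final class GlobalTopScores {
  private final int k;
  private final PriorityQueue<Float> heap = new PriorityQueue<>();

  GlobalTopScores(int k) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be >= 1");
    }
    this.k = k;
  }

  /** Records a score found by any segment, keeping only the best k. */
  synchronized void offer(float score) {
    if (heap.size() < k) {
      heap.add(score);
    } else if (score > heap.peek()) {
      heap.poll();
      heap.add(score);
    }
  }

  /** Threshold a new candidate must beat; negative infinity until k scores are known. */
  synchronized float minCompetitiveScore() {
    return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
  }

  /** Collects one segment's candidates, skipping those below the global threshold. */
  static int collectSegment(float[] segmentScores, GlobalTopScores global) {
    int collected = 0;
    for (float score : segmentScores) {
      if (score > global.minCompetitiveScore()) {
        global.offer(score);
        collected++;
      }
    }
    return collected;
  }

  public static void main(String[] args) {
    GlobalTopScores global = new GlobalTopScores(2);
    System.out.println(collectSegment(new float[] {0.9f, 0.8f, 0.7f}, global));
    System.out.println(collectSegment(new float[] {0.1f, 0.2f, 0.85f}, global));
  }
}
```

A segment whose candidates all fall below the shared threshold collects nothing, which is the early-stopping effect the message refers to.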
sabi0 91272f45da
Replace println(String.format(...)) with printf(...) (#12976) 2023-12-28 19:32:06 +01:00
sabi0 02722eeb69
Add missing spaces in concatenated strings (#12967) 2023-12-23 20:30:30 -05:00
Michael Sokolov e18f9b1eb0
Add forceMerge to test to fix intermittent failure; addresses #12896 (#12928) 2023-12-12 09:11:24 -05:00
Shubham Chaudhary 1630ed4bd8
Remove some redundant modifiers from code (#12880) 2023-12-11 10:17:47 -08:00
Adrien Grand f7cab16450
Add a merge policy wrapper that performs recursive graph bisection on merge. (#12622)
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc
IDs on merge using a `BPIndexReorderer`.
 - Reordering always runs on forced merges.
 - A `minNaturalMergeNumDocs` parameter helps only enable reordering on the
   larger merged segments. This way, small merges retain all merging
   optimizations like bulk copying of stored fields, and only the larger
   segments - which are the most important for search performance - get
   reordered.
 - If not enough RAM is available to perform reordering, reordering is skipped.

To make this work, I had to add the ability for any merge to reorder doc IDs of
the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the
test framework randomly reverts the order of documents in a merged segment to
make sure this logic is properly exercised.
2023-11-23 13:25:00 +01:00
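The three rules listed in #12622 can be condensed into one predicate. This is only a sketch: `ReorderDecision` and its parameters are hypothetical names, the inclusive comparison against `minNaturalMergeNumDocs` is an assumption, and so is the choice to let the RAM check take precedence over the forced-merge rule.

```java
/** Hypothetical decision helper mirroring the rules in the commit message; not Lucene code. */
final class ReorderDecision {
  static boolean shouldReorder(
      boolean forcedMerge,
      int mergedNumDocs,
      int minNaturalMergeNumDocs,
      long requiredRamBytes,
      long availableRamBytes) {
    // Assumption: without enough RAM, reordering is skipped for every kind of merge.
    if (requiredRamBytes > availableRamBytes) {
      return false;
    }
    // Forced merges always reorder; natural merges only when the merged segment is large enough.
    return forcedMerge || mergedNumDocs >= minNaturalMergeNumDocs;
  }

  public static void main(String[] args) {
    System.out.println(shouldReorder(false, 500, 10_000, 1L << 20, 1L << 30));
    System.out.println(shouldReorder(false, 50_000, 10_000, 1L << 20, 1L << 30));
  }
}
```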
Michael McCandless 1ebee9e611
Remove angry errant lurking semicolons (#12805)
* remove angry errant lurking semicolons

* tidy

* #12805: woops, put back ; from autogen'd QueryParser.java
2023-11-14 12:34:46 -05:00
Lu Xugang a71d64a598
Skip docs with Docvalues in NumericLeafComparator (#12405)
* Skip document by docValues

* When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and the bottom value is 5, we will use a range of [MIN_VALUE, 4].

---------

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2023-11-09 13:05:28 +08:00
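The range tuning described in #12405 can be worked through with a small helper. The sketch assumes `long` sort values and the situation the message names (full queue, single comparator), so a value equal to the bottom is no longer competitive. `CompetitiveRange` is a hypothetical name, not the comparator's actual code.

```java
/** Hypothetical helper computing which values can still enter a full top-N queue. */
final class CompetitiveRange {
  /**
   * Returns inclusive {min, max} bounds of values that beat the bottom value,
   * or null when no value can beat it.
   *
   * @param reverse false for an ascending sort, true for a descending sort
   */
  static long[] competitiveRange(long bottom, boolean reverse) {
    if (!reverse) {
      if (bottom == Long.MIN_VALUE) {
        return null; // nothing sorts before the smallest possible value
      }
      return new long[] {Long.MIN_VALUE, bottom - 1};
    }
    if (bottom == Long.MAX_VALUE) {
      return null; // nothing sorts after the largest possible value
    }
    return new long[] {bottom + 1, Long.MAX_VALUE};
  }

  public static void main(String[] args) {
    long[] range = competitiveRange(5, false);
    System.out.println("[" + range[0] + ", " + range[1] + "]");
  }
}
```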
Kevin Risden de3b294be4
GITHUB#12655: gradle tidy after google java format update for jdk 21 and regen
* tidy whitespace changes from googleJavaFormat upgrade
* generateForUtil fixed and regened https://bugs.python.org/issue39350
* generateAntlr
* generateClassicTokenizer
* generateWikipediaTokenizer
2023-10-11 16:12:09 -04:00
Adrien Grand 37a42219fc
Reduce the overhead of ImpactsDISI. (#12490)
`ImpactsDISI` is nice: you give it an `ImpactsEnum`, typically coming from the
`PostingsFormat` and it will automatically skip hits whose score cannot be
greater than the minimum competitive score. This is the class that yields 10x
or more speedups on top-level `TermQuery`s compared to exhaustive evaluation.

However, when nested under a disjunction or a conjunction, `ImpactsDISI`
typically adds more overhead than it enables skipping. The reason is that on a
disjunction `a OR b`, the minimum competitive score of `a` is the minimum score
for the disjunction minus the maximum score of `b`. While this sort of
propagation of minimum competitive scores down the query tree sometimes helps,
it does hurt more than it helps on average, because `ImpactsDISI` adds quite
some overhead and the per-clauses minimum scores are usually so low that they
don't actually enable skipping hits. I looked into reducing this overhead, but
a big part of it is the additional virtual call, so the only way to get rid of
this overhead is to not wrap with an `ImpactsDISI` at all.

This means that scorers need a way to know whether they are producing the
top-level score, or whether they are producing a partial score that then gets
combined into the top-level score. Term queries would then only wrap with
`ImpactsDISI` when they produce the top-level score. Note that this does not
only include top-level term queries, but also conjunctions that have a single
scoring clause (`a #b`) or combinations of a term query and one or more
prohibited clauses (`a -b`).
2023-09-12 15:23:28 +02:00
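The score propagation described in #12490 is simple arithmetic, sketched here for a two-clause disjunction `a OR b`. The helper name is hypothetical, and clamping the result at zero is an assumption of this sketch (scores are taken to be non-negative).

```java
/** Hypothetical helper illustrating the arithmetic from the commit message; not Lucene code. */
final class DisjunctionMinScore {
  /**
   * A hit matching only clause a can be competitive only if its score reaches the
   * disjunction's minimum competitive score minus the most that clause b could add.
   */
  static float minCompetitiveScoreOfA(float disjunctionMinScore, float maxScoreOfB) {
    return Math.max(0f, disjunctionMinScore - maxScoreOfB);
  }

  public static void main(String[] args) {
    // b can contribute up to 2.5, so a only needs 0.5 to stay competitive.
    System.out.println(minCompetitiveScoreOfA(3.0f, 2.5f));
    // When b alone can exceed the threshold, a's bound drops to zero and enables no skipping.
    System.out.println(minCompetitiveScoreOfA(1.0f, 2.5f));
  }
}
```

The second case shows why per-clause minimum scores are often too low to skip anything, which is the overhead-versus-benefit trade-off the message discusses.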
Benjamin Trent 4174b521dd
Rename ToParentBlockJoin[Byte|Float]KnnVectorQuery and adjust to return highest score child doc ID by parent id (#12510)
The current query returns parent-ids based on the nearest child-id score. However, it's difficult to invert that relationship (meaning determining what exactly the nearest child was during search).

So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.

Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.

Related to: https://github.com/apache/lucene/pull/12434
2023-08-16 13:44:49 -04:00
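Resolving a parent from a returned child ID, as #12510 suggests, can be sketched with a bit set lookup. The example assumes the usual block-join layout in which children are indexed immediately before their parent, and it uses `java.util.BitSet` as a stand-in for the parent bit set. `ParentLookup` is a hypothetical name.

```java
import java.util.BitSet;

/** Hypothetical helper: maps a child doc ID to its parent in a block-join layout. */
final class ParentLookup {
  /** Returns the first parent at or after the child, or -1 if the block has no parent. */
  static int parentOf(int childDoc, BitSet parents) {
    return parents.nextSetBit(childDoc);
  }

  public static void main(String[] args) {
    // Docs 0-2 are children of parent 3; docs 4-6 are children of parent 7.
    BitSet parents = new BitSet();
    parents.set(3);
    parents.set(7);
    System.out.println(parentOf(5, parents));
  }
}
```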
Benjamin Trent 18b56bd002
ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing (#12504)
This is a follow up to: https://github.com/apache/lucene/pull/12434

Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE.
2023-08-14 09:24:25 -04:00
Benjamin Trent a65cf8960a
Add ParentJoin KNN support (#12434)
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. 

However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.

This commit adds this ability through some significant changes:
 - New leaf reader function that allows a collector for knn results
 - The knn results can then utilize bit-sets to join back to the parent id
 
This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.

This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).
2023-08-07 14:46:42 -04:00
Alan Woodward 79e8c9c8b9
Fix edge case in TestJoinUtil
TestJoinUtil.checkBoost() needs to check to see if there are
any results to validate, otherwise we can get an array-out-of-bounds
exception
2023-06-26 13:28:09 +01:00
Alan Woodward edd799824f
Enable boosts on JoinUtil queries (#12388)
Boosts should not be ignored by queries returned from JoinUtil
2023-06-26 09:47:14 +01:00
Marcus 963ed7ce88
`ToParentBlockJoinQuery` Explain Support Score Mode (#12245)
* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Mikhail Khludnev <mkhl@apache.org>
2023-05-10 19:10:37 +03:00
Vigya Sharma 4e88118a35
Fix typo in CheckJoinIndex (#12231) 2023-04-14 14:06:19 -07:00
Greg Miller 3809106602
Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery (#12156) 2023-03-01 05:20:28 -08:00
Adrien Grand c6667e709f
Better skipping for multi-term queries with a FILTER rewrite. (#12055)
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 terms or fewer match the query. Otherwise postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
 - Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
 - Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery. 

Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>
2023-02-27 15:11:31 -08:00
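The split described in #12055 can be sketched as a partition of matching terms by document frequency: the 16 most frequent terms stay as disjunction clauses and the rest are pooled. `BlendedRewriteSplit` is a hypothetical helper that only shows the partitioning step. It does not build queries or doc ID sets, and the tie-breaking by term text is an assumption added for determinism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not Lucene code: splits terms the way the message describes. */
final class BlendedRewriteSplit {
  // Threshold taken from the commit message.
  static final int DISJUNCTION_LIMIT = 16;

  final List<String> disjunctionTerms = new ArrayList<>();
  final List<String> docIdSetTerms = new ArrayList<>();

  /** The highest-frequency terms become clauses; all other terms are pooled. */
  static BlendedRewriteSplit split(Map<String, Integer> docFreqByTerm) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqByTerm.entrySet());
    // Highest document frequency first; ties broken by term text.
    entries.sort((a, b) -> {
      int byFreq = Integer.compare(b.getValue(), a.getValue());
      return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
    });
    BlendedRewriteSplit result = new BlendedRewriteSplit();
    for (int i = 0; i < entries.size(); i++) {
      String term = entries.get(i).getKey();
      if (i < DISJUNCTION_LIMIT) {
        result.disjunctionTerms.add(term);
      } else {
        result.docIdSetTerms.add(term);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    BlendedRewriteSplit s = split(Map.of("the", 900, "lucene", 40, "zipf", 3));
    System.out.println(s.disjunctionTerms + " | " + s.docIdSetTerms);
  }
}
```

With a Zipfian distribution, the few very frequent terms land in the disjunction, so the pooled set only has to absorb rare terms.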
Robert Muir 47f8c1baa2
Migrate away from per-segment-per-threadlocals on SegmentReader (#11998)
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-12-13 09:10:21 -05:00
Robert Muir 06f9179295
Enable LongDoubleConversion error-prone check (#12010) 2022-12-12 20:55:39 -05:00
Robert Muir 4e93f29318
fix bad shift amounts and enable check (#11979) 2022-11-25 11:47:25 -05:00
Patrick Zhai 6cde41c9fd
GITHUB-11838 Change API to allow concurrent query rewrite (#11840)
Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)
2022-10-19 09:49:40 -07:00
Mayya Sharipova 554fabf682
LUCENE-10633 Disable sort optimization for SortedSetSortField (#3125)
Add ability to SortedSetSortField to disable sort optimization
2022-08-30 16:52:28 -04:00
Adrien Grand eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
Stefan Vodita dd4e8b82d7
LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004) 2022-07-07 09:54:41 -07:00
Greg Miller 5f2a4998a0
LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code (#995) 2022-06-30 14:01:14 -07:00
Lu Xugang 78b7b17f93
LUCENE-10600: SortedSetDocValues#docValueCount should be an int, not long (#960) 2022-06-16 12:22:05 +08:00
Robert Muir 3edfeb5eb2
LUCENE-10532: remove @Slow annotation (#832)
Remove `@Slow` annotation, for more consistency with CI and local jobs. All tests can be fast!
2022-05-09 23:03:55 -04:00
spike.liu d9d2cb6f09
LUCENE-10188: Give SortedSetDocValues a docValueCount() (#663)
Co-authored-by: vlc刘诚 <chengliu@trip.com>
2022-05-02 10:41:12 -04:00
Rich Bowen 0a069ed454
LUCENE-10512: Grammar: Remove incidents of "the the" in comments. (#807)
* Grammar: Remove incidents of "the the" in comments.

* fixes formatting, as per helpful comment from Mike

* Running ./gradlew :lucene:misc:spotlessApply again made more changes.

* It keeps finding new things ... what's up with this?

* Fixing more nits that gradlew finds. Sorry, folks. I am new at this.
2022-04-11 11:11:10 -04:00
zacharymorn 94fe7e314f
LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery (#790) 2022-04-07 00:53:29 -07:00