lucene

Commit Graph

Author	SHA1	Message	Date
zhouhui	057cbf3c86	Use getAndSet, getAndClear instead split operations. (#13507 )	2024-06-19 11:53:12 +02:00
Sanjay Dutt	d0d2aa274f	Removed Scorer#getWeight (#13440 ) If Caller requires Weight then they have to keep track of Weight with which Scorer was created in the first place instead of relying on Scorer. Closes #13410	2024-06-06 16:03:19 +02:00
Bruno Roustant	f394c9418e	Remove the HPPC dependency from all modules and move the HPPC fork to internal. (#13422 ) * Remove hppc dependency * Change fork version to 0.10.0 * Add @lucene.internal * Move hppc classes to oal.internal.hppc but export it. * Delete hppc license since it's no longer a dependency. --------- Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>	2024-05-27 12:09:25 +02:00
Bruno Roustant	444d4e7c42	Replace List<Integer> by IntArrayList and List<Long> by LongArrayList. (#13406 )	2024-05-25 19:53:42 +02:00
Bruno Roustant	f70999980c	Replace Map<Long, Object> by primitive LongObjectHashMap. (#13392 ) Add LongObjectHashMap and replace Map<Long, Object>. Add LongIntHashMap and replace Map<Long, Int>. Add HPPC dependency to join and spatial modules for primitive values float and double.	2024-05-21 17:11:34 +02:00
Sanjay Dutt	838b23ebed	Make Weight#scorerSupplier abstract, Weight#scorer final (#13319 ) Co-authored-by: iamsanjay <sanjaydutt.india@yahoo.com>	2024-05-14 17:44:30 +02:00
Benjamin Trent	fc12cc1847	Fix vector scorer interface consistency (#13365 ) Follow up to: #13181 I noticed the quantized interface had a slightly different name. Additionally, testing showed we are inconsistent when there aren't any vectors to score. This makes the response consistent (e.g. null when there aren't any vectors).	2024-05-13 13:02:38 -04:00
Benjamin Trent	b60e86c4b9	Add new VectorScorer interface to vector value iterators (#13181 ) With quantized vectors, and with current vectors, we separate out the "scoring" vs. "iteration", requiring the user to always iterate the raw vectors and provide their own similarity function. While this is flexible, it creates frustration in: - Just iterating and scoring, especially since the field already has a similarity function stored...Why can't we just know which one to use and use it! - Iterating and scoring quantized vectors. By default it would be good to be able to iterate and score quantized vectors (e.g. without going through the HNSW graph). This significantly hampers support for true exact kNN search. This commit extends the vector value iterators to be able to return a scorer given some vector value (what this PR demonstrates). The scorer contains a copy of the originating iterator and allows for iteration and scoring the most optimized way the provided codec can give. Users can still iterate vector values directly, read them on heap, and score any way they please.	2024-05-09 16:30:14 -04:00
Sanjay Dutt	66b121f9b0	Remove deprecated code (#13286 ) Co-authored-by: iamsanjay <sanjaydutt.india@yahoo.com>	2024-04-11 09:44:29 +02:00
Kaival Parikh	13cf882677	Fix failing BaseKnnVectorQueryTestCase#testTimeout (#13283 ) * Fix failing BaseKnnVectorQueryTestCase#testTimeout * Also fix ParentBlockJoinKnnVectorQueryTestCase#testTimeout --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>	2024-04-10 20:43:01 -07:00
Benjamin Trent	88c0909210	Test mute for issue #13272 (#13273 )	2024-04-04 16:21:32 -04:00
Kaival Parikh	df154cdc22	Add timeout support to AbstractKnnVectorQuery (#13202 ) * Add timeout support to graph searches in AbstractKnnVectorQuery * Also timeout exact searches * Return partial KNN results * Add tests for partial KNN results - Refactor tests to base classes - Also timeout exact searches in Lucene99HnswVectorsReader * Add CHANGES.txt entry and fix some comments --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>	2024-04-03 06:53:24 -07:00
Benjamin Trent	c0d81932df	Fix vector type check for diversified knn search (#13235 ) I repeatably saw some test failures related to `TestParentBlockJoin[Byte\|Float]KnnVectorQuery#testVectorEncodingMismatch`. This commit fixes those test failures and actually checks the field type.	2024-03-29 13:49:58 -04:00
panguixin	7a08eeab47	Make the HitQueue size more appropriate for KNN exact search (#13184 ) Currently, when performing KNN exact search, we consistently set the HitQueue size to k. However, there may be instances where the number of candidates is actually lower than k.	2024-03-19 12:24:36 -04:00
panguixin	5b5815a26d	Fix NPE when LeafReader return null VectorValues (#13162 ) ### Description `LeafReader#getXXXVectorValues` may return null value. Reproduction: ``` public class TestKnnByteVectorQuery extends BaseKnnVectorQueryTestCase { public void testVectorEncodingMismatch() throws IOException { try (Directory indexStore = getIndexStore("field", new float[] {0, 1}, new float[] {1, 2}, new float[] {0, 0}); IndexReader reader = DirectoryReader.open(indexStore)) { AbstractKnnVectorQuery query = new KnnFloatVectorQuery("field", new float[] {0, 1}, 10); IndexSearcher searcher = newSearcher(reader); searcher.search(query, 10); } } } ``` Output: ``` java.lang.NullPointerException: Cannot invoke "org.apache.lucene.index.FloatVectorValues.size()" because the return value of "org.apache.lucene.index.LeafReader.getFloatVectorValues(String)" is null ```	2024-03-11 08:07:04 -04:00
Benjamin Trent	012b959b05	Add mult-leaf optimizations for diversify children collector (#13121 ) This adds multi-leaf optimizations for diversified children collector. This means as children vectors are collected within a block join, we can share information between leaves to speed up vector search. To make this happen, I refactored the multi-leaf collector slightly. Now, instead of inheriting from TopKnnCollector, we inject a inner collector.	2024-03-05 09:02:49 -05:00
Benjamin Trent	af0d0edf5e	Fix test failure in #13057 (#13102 ) The original tests assume particular document orders & scores. To make the test more resilient to random flushes & merges, I adjusted the assertion conditions. Particularly, we verify matching ids -> score instead of relying on docIds. closes: https://github.com/apache/lucene/issues/13057	2024-02-16 09:33:34 -05:00
Benjamin Trent	810e43c58a	Fix test failure TestParentBlockJoinFloatKnnVectorQuery.testSkewedIndex (#13082 )	2024-02-07 12:55:43 -05:00
Mayya Sharipova	d095ed02a2	Speedup concurrent multi-segment HNWS graph search (#12962 ) Speedup concurrent multi-segment HNWS graph search by exchanging the global top candidated collected so far across segments. These global top candidates set the minimum threshold that new candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.	2024-02-06 09:16:06 -05:00
sabi0	91272f45da	Replace println(String.format(...)) with printf(...) (#12976 )	2023-12-28 19:32:06 +01:00
sabi0	02722eeb69	Add missing spaces in concatenated strings (#12967 )	2023-12-23 20:30:30 -05:00
Michael Sokolov	e18f9b1eb0	Add forceMerge to test to fix intermittent failure; addresses #12896 (#12928 )	2023-12-12 09:11:24 -05:00
Shubham Chaudhary	1630ed4bd8	Remove some redundant modifiers from code (#12880 )	2023-12-11 10:17:47 -08:00
Adrien Grand	f7cab16450	Add a merge policy wrapper that performs recursive graph bisection on merge. (#12622 ) This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc IDs on merge using a `BPIndexReorderer`. - Reordering always run on forced merges. - A `minNaturalMergeNumDocs` parameter helps only enable reordering on the larger merged segments. This way, small merges retain all merging optimizations like bulk copying of stored fields, and only the larger segments - which are the most important for search performance - get reordered. - If not enough RAM is available to perform reordering, reordering is skipped. To make this work, I had to add the ability for any merge to reorder doc IDs of the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the test framework randomly reverts the order of documents in a merged segment to make sure this logic is properly exercised.	2023-11-23 13:25:00 +01:00
Michael McCandless	1ebee9e611	Remove angry errant lurking semicolons (#12805 ) * remove angry errant lurking semicolons * tidy * #12805: woops, put back ; from autogen'd QueryParser.java	2023-11-14 12:34:46 -05:00
Lu Xugang	a71d64a598	Skip docs with Docvalues in NumericLeafComparator (#12405 ) * Skip document by docValues *When the queue is full with only one Comparator, we could better tune the maxValueAsBytes/minValueAsBytes. For instance, if the sort is ascending and bottom value is 5, we will use a range on [MIN_VALUE, 4]. --------- Co-authored-by: Adrien Grand <jpountz@gmail.com>	2023-11-09 13:05:28 +08:00
Kevin Risden	de3b294be4	GITHUB#12655: gradle tidy after google java format update for jdk 21 and regen * tidy whitespace changes from googleJavaFormat upgrade * generateForUtil fixed and regened https://bugs.python.org/issue39350 * generateAntlr * generateClassicTokenizer * generateWikipediaTokenizer	2023-10-11 16:12:09 -04:00
Adrien Grand	37a42219fc	Reduce the overhead of ImpactsDISI. (#12490 ) `ImpactsDISI` is nice: you give it an `ImpactsEnum`, typically coming from the `PostingsFormat` and it will automatically skip hits whose score cannot be greater than the minimum competitive score. This is the class that yields 10x or more speedups on top-level `TermQuery`s compared to exhaustive evaluation. However, when nested under a disjunction or a conjunction, `ImpactsDISI` typically adds more overhead than it enables skipping. The reason is that on a disjunction `a OR b`, the minimum competitive score of `a` is the minimum score for the disjunction minus the maximum score of `b`. While this sort of propagation of minimum competitive scores down the query tree sometimes helps, it does hurt more than it helps on average, because `ImpactsDISI` adds quite some overhead and the per-clauses minimum scores are usually so low that they don't actually enable skipping hits. I looked into reducing this overhead, but a big part of it is the additional virtual call, so the only way to get rid of this overhead is to not wrap with an `ImpactsDISI` at all. This means that scorers need a way to know whether they are producing the top-level score, or whether they are producing a partial score that then gets combined into the top-level score. Term queries would then only wrap with `ImpactsDISI` when they produce the top-level score. Note that this does not only include top-level term queries, but also conjunctions that have a single scoring clause (`a #b`) or combinations of a term query and one or more prohibited clauses (`a -b`).	2023-09-12 15:23:28 +02:00
Benjamin Trent	4174b521dd	Rename ToParentBlockJoin[Byte\|Float]KnnVectorQuery and adjust to return highest score child doc ID by parent id (#12510 ) The current query is returning parent-id's based off of the nearest child-id score. However, its difficult to invert that relationship (meaning determining what exactly the nearest child was during search). So, I changed the new `ToParentBlockJoin[Byte\|Float]KnnVectorQuery` to `DiversifyingChildren[Byte\|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id. Now its easy to determine the nearest child vector as that is what the query is returning. To determine its parent, its as simple as using the previously provided parent bit set. Related to: https://github.com/apache/lucene/pull/12434	2023-08-16 13:44:49 -04:00
Benjamin Trent	18b56bd002	ToParentBlockJoin[Byte\|Float]KnnVectorQuery needs to handle the case when parents are missing (#12504 ) This is a follow up to: https://github.com/apache/lucene/pull/12434 Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE	2023-08-14 09:24:25 -04:00
Benjamin Trent	a65cf8960a	Add ParentJoin KNN support (#12434 ) A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent. This commit adds this ability through some significant changes: - New leaf reader function that allows a collector for knn results - The knn results can then utilize bit-sets to join back to the parent id This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this. This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).	2023-08-07 14:46:42 -04:00
Alan Woodward	79e8c9c8b9	Fix edge case in TestJoinUtil TestJoinUtil.checkBoost() needs to check to see if there are any results to validate, otherwise we can get an array-out-of-bounds exception	2023-06-26 13:28:09 +01:00
Alan Woodward	edd799824f	Enable boosts on JoinUtil queries (#12388 ) Boosts should not be ignored by queries returned from JoinUtil	2023-06-26 09:47:14 +01:00
Marcus	963ed7ce88	`ToParentBlockJoinQuery` Explain Support Score Mode (#12245 ) * `ToParentBlockJoinQuery` Explain Support Score Mode --------- Co-authored-by: Mikhail Khludnev <mkhl@apache.org>	2023-05-10 19:10:37 +03:00
Vigya Sharma	4e88118a35	Fix typo in CheckJoinIndex (#12231 )	2023-04-14 14:06:19 -07:00
Greg Miller	3809106602	Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery (#12156 )	2023-03-01 05:20:28 -08:00
Adrien Grand	c6667e709f	Better skipping for multi-term queries with a FILTER rewrite. (#12055 ) This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 terms or less match the query. Otherwise postings lists of matching terms are collected into a `DocIdSetBuilder`. This change replaces the latter with a mixed approach where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the `DocIdSetBuilder` that collects all other terms. On fields that have a zipfian distribution, it's quite likely that no high-frequency terms make it to the `DocIdSetBuilder`. This provides two main benefits: - Queries are less likely to allocate a FixedBitSet of size `maxDoc`. - Queries are better at skipping or early terminating. On the other hand, queries that need to consume most or all matching documents may get a slowdown, so users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new default for PrefixQuery, WildcardQuery and TermRangeQuery. Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>	2023-02-27 15:11:31 -08:00
Robert Muir	47f8c1baa2	Migrate away from per-segment-per-threadlocals on SegmentReader (#11998 ) Add new stored fields and termvectors interfaces: IndexReader.storedFields() and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector(). The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments. Co-authored-by: Adrien Grand <jpountz@gmail.com>	2022-12-13 09:10:21 -05:00
Robert Muir	06f9179295	Enable LongDoubleConversion error-prone check (#12010 )	2022-12-12 20:55:39 -05:00
Robert Muir	4e93f29318	fix bad shift amounts and enable check (#11979 )	2022-11-25 11:47:25 -05:00
Patrick Zhai	6cde41c9fd	GITHUB-11838 Change API to allow concurrent query rewrite (#11840 ) Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)	2022-10-19 09:49:40 -07:00
Mayya Sharipova	554fabf682	LUCENE-10633 Disable sort optimization for SortedSetSortField (#3125 ) Add ability to SortedSetSortField to disable sort optimization	2022-08-30 16:52:28 -04:00
Adrien Grand	eb7b7791ba	LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023 ) This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields by using postings to filter competitive documents.	2022-07-29 11:12:32 +02:00
Stefan Vodita	dd4e8b82d7	LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004 )	2022-07-07 09:54:41 -07:00
Greg Miller	5f2a4998a0	LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code (#995 )	2022-06-30 14:01:14 -07:00
Lu Xugang	78b7b17f93	LUCENE-10600: SortedSetDocValues#docValueCount should be an int, not long (#960 )	2022-06-16 12:22:05 +08:00
Robert Muir	3edfeb5eb2	LUCENE-10532: remove @Slow annotation (#832 ) Remove `@Slow` annotation, for more consistency with CI and local jobs. All tests can be fast!	2022-05-09 23:03:55 -04:00
spike.liu	d9d2cb6f09	LUCENE-10188: Give SortedSetDocValues a docValueCount() (#663 ) Co-authored-by: vlc刘诚 <chengliu@trip.com>	2022-05-02 10:41:12 -04:00
Rich Bowen	0a069ed454	LUCENE-10512: Grammar: Remove incidents of "the the" in comments. (#807 ) * Grammar: Remove incidents of "the the" in comments. * fixes formatting, as per helpful comment from Mike * Running ./gradlew :lucene:misc:spotlessApply again made more changes. * It keeps finding new things ... what's up with this? * Fixing more nits that gradlew finds. Sorry, folks. I am new at this.	2022-04-11 11:11:10 -04:00
zacharymorn	94fe7e314f	LUCENE-10436: Remove deprecated DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery (#790 )	2022-04-07 00:53:29 -07:00

333 Commits