lucene

mirror of https://github.com/apache/lucene.git synced 2025-02-06 01:58:44 +00:00

Author	SHA1	Message	Date
Stefan Vodita	34721f9439	Assert IdxOrDvQuery subqueries and document useful fields (#12442 )	2023-07-24 16:36:48 +02:00
Benjamin Trent	59c56a0aed	Fix sorted&unsorted graph test flakiness (#12452 ) When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query. To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed. I have verified that bumping the search candidate pool to 60 fixes the failure. The total number of vectors still outnumbers the requested number of candidates, so the search is still hitting the graph. I verified further by running the test again over a couple thousand seeds and it didn't fail again.	2023-07-20 14:08:21 -04:00
Adrien Grand	17c13a76c8	Add BS1 optimization to MaxScoreBulkScorer. (#12444 ) Lucene's scorers that can dynamically prune on score provide great speedups when they manage to skip many hits. Unfortunately, there are also cases when they cannot skip hits efficiently, one example case being when there are many clauses in the query. In this case, exhaustively evaluating the set of matches with `BooleanScorer` (BS1) may perform several times faster. This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of collecting hits into a bitset to save the overhead of reordering priority queues. This helps make performance degrade much more gracefully when dynamic pruning cannot help much. Closes #12439	2023-07-19 13:51:22 +02:00
Martin Demberger	55f2f9958b	LUCENE-8183: Added the ability to get noSubMatches and noOverlapping Matches (#12437 ) --------- Co-authored-by: Martin Demberger <martin.demberger@root-nine.de>	2023-07-19 13:15:33 +02:00
Peter Gromov	f05adff4ca	hunspell: speed up the dictionary enumeration (#12447 ) * hunspell: speed up the dictionary enumeration: cache each word's case and the lowercase form; group the words by lengths to avoid even visiting entries with unneeded lengths	2023-07-18 21:25:26 +02:00
Stefan Vodita	b4619d87ed	Move sliced int buffer functionality to MemoryIndex (#11248 ) (#12409 ) * [WIP] Move IntBlockPool slices to MemoryIndex * [WIP] Working TestMemoryIndex * [WIP] Working TestSlicedIntBlockPool * Working many allocations tests * Add basic IntBlockPool test * SlicedIntBlockPool inherits from IntBlockPool * Tidy	2023-07-10 10:18:28 -04:00
Benjamin Trent	d03c8f16d9	Have byte[] vectors also trigger a timeout in ExitableDirectoryReader (#12423 ) `ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors. This commit fixes that bug.	2023-07-07 12:29:55 -04:00
Benjamin Trent	861153020a	Fix HNSW graph visitation limit bug (#12413 ) We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer. While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset. Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results. However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.	2023-07-06 15:46:36 -04:00
Adrien Grand	f527eb3b12	Remove Scorable#docID. (#12407 ) `Scorable#docID()` exposes the document that is being collected, which makes it impossible to bulk-collect multiple documents at once. Relates #12358	2023-07-05 10:40:06 +02:00
Uwe Schindler	9ffc625b2e	Followup on #12410 : Fix caller class check to use string literals to allow private/pkg-private classes	2023-07-03 17:52:10 +02:00
Uwe Schindler	f668cfd1cd	Fix forUtil.gradle to actually execute python script and also fix type error in script (#12411 )	2023-07-03 16:22:03 +02:00
Uwe Schindler	fde2e50d9e	Refactor vectorization support in Lucene (#12410 )	2023-07-03 11:39:19 +02:00
Perdjesk	907883f701	Correct Javadocs using SimpleBindings (#12402 ) Javadocs were still referring the old `SortField` API, which has been replaced with methods that use `DoubleValuesSource` instead.	2023-06-30 15:14:24 +01:00
Adrien Grand	8811f31b9c	Add a post-collection hook to LeafCollector. (#12380 ) This adds `LeafCollector#finish` as a per-segment post-collection hook. While it was already possible to do this sort of thing on top of the collector API before, a downside is that the last leaf would need to be post-collected in the current thread instead of using the executor, which is a missed opportunity for making queries concurrent.	2023-06-30 15:19:35 +02:00
Sagar	40ee6e583e	Assign a dummy simScorer in TermsWeight if score is not needed (#12383 )	2023-06-30 15:14:33 +02:00
Sorabh	223eecca33	Add a thread safe CachingLeafSlicesSupplier to compute and cache the LeafSlices used with concurrent segment search (#12374 ) It uses the protected method `slices` by default to compute the slices, which can be overridden by subclasses of IndexSearcher	2023-06-30 14:57:34 +02:00
zhangchao	01200b5804	Speed up NumericDocValuesWriter with index sorting (#12381 )	2023-06-30 14:56:56 +02:00
Uwe Schindler	e503805758	Remove usage and add some legacy java.util classes to forbiddenapis (Stack, Hashtable, Vector) (#12404 )	2023-06-29 16:56:41 +02:00
Luca Cavanna	f44cc45cf8	Share concurrent execution code into TaskExecutor (#12398 ) Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs tasks concurrently and waits for them to be completed and handles eventual exceptions. This commit shares code among these two scenarios, to reduce code duplication as well as to ensure that further improvements can be shared among them.	2023-06-28 13:52:01 +02:00
Adrien Grand	4029cc37a7	Fix MaxScoreBulkScorer#score's return value. (#12400 ) `AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of `BulkScorer#score`, a reasonable implementation should never have return values in this range, as it would suggest that more matches need collecting when we're already out of the range of valid doc IDs. So this generally indicates a bug. `MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the requested window of doc IDs, when the sum of maximum scores would be less than the minimum competitive score. In that case, the best information it has is that there are no matches in the window, but it cannot give a good estimate of the next potential match. This assertion in `AssertingBulkScorer` looks sane to me, so I made a small change to `MaxScoreBulkScorer` to make sure it meets `AssertingScorer`'s expectations. This is done in a place that is only called once per scored window, so it should not have a noticeable performance impact.	2023-06-28 09:17:41 +02:00
yixunx	6bb8cc0235	Let hard link wrapper fallback to delegate.copyFrom (#12384 ) Co-authored-by: Yixun Xu <yixunx@palantir.com>	2023-06-28 09:17:20 +02:00
Stefan Vodita	b88d3e1988	Catch offset overflows in byte pool (#9660 ) (#12392 )	2023-06-28 09:16:56 +02:00
Tomas Eduardo Fernandez Lobbe	b50b969d25	Update comment about IndexOptions ordinals (#12360 ) FieldInfos no longer accepts changes in IndexOptions, however, different IndexOptions are still compared using their ordinals	2023-06-26 16:29:13 -07:00
Alan Woodward	79e8c9c8b9	Fix edge case in TestJoinUtil TestJoinUtil.checkBoost() needs to check to see if there are any results to validate, otherwise we can get an array-out-of-bounds exception	2023-06-26 13:28:09 +01:00
Adrien Grand	4fec812cc5	Add back-compat indices for 9.7.0	2023-06-26 14:24:27 +02:00
Luca Cavanna	3e0dc2b572	Add missing change entries for 9.8	2023-06-26 11:23:36 +02:00
Adrien Grand	edc2cf5cd1	DOAP changes for release 9.7.0	2023-06-26 11:05:46 +02:00
Alan Woodward	edd799824f	Enable boosts on JoinUtil queries (#12388 ) Boosts should not be ignored by queries returned from JoinUtil	2023-06-26 09:47:14 +01:00
Luca Cavanna	7f10dca1e5	Revert "Parallelize knn query rewrite across slices rather than segments (#12325 )" (#12385 ) This reverts commit 10bebde26936298c1909dd2ea0b5706b08d2face. Based on a recent discussion in https://github.com/apache/lucene/pull/12183#discussion_r1235739084 we agreed it makes more sense to parallelize knn query vector rewrite across leaves rather than leaf slices.	2023-06-26 10:41:18 +02:00
Michael Sokolov	cb195bd96e	github-12386: set java.io.tmpdir in replicator tests' forked processes (#12387 )	2023-06-23 08:38:06 -04:00
Jonathan Ellis	fe0278e36e	Reuse neighborqueue during hnsw index build (attempt 2) (#12372 ) This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance. This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached. The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added. * Re-use NeighborQueue during build's search * improve javadoc for OnHeapHnswGraphSearcher * assert that results parameter is minheap as expected * update CHANGES	2023-06-20 15:05:37 -04:00
Adrien Grand	8703e449ce	Change the MAXSCORE scorer to a bulk scorer. (#12361 )	2023-06-20 18:55:03 +02:00
zhangchao	37b92adf6a	Avoid redundant loop for compute min value in DirectMonotonicWriter (#12377 ) * Avoid redundant loop for get min value * update CHANGES.txt	2023-06-20 09:12:15 -04:00
Alan Woodward	6d4314d46f	Add back-compat indices for 9.6.0	2023-06-16 17:58:31 +02:00
Adrien Grand	6ee7b2b9f6	Add next minor version 9.8.0	2023-06-16 13:39:30 +02:00
Uwe Schindler	148236a50b	This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (#12376 )	2023-06-16 12:28:29 +02:00
Luca Cavanna	bb6ec50d4c	Increased the likelihood of leveraging inter-segment concurrency in tests (#12369 ) We have recently increased the likelihood of leveraging inter-segment search concurrency in tests when `newSearcher` is used to create the index searcher (see #959). When parallel execution is enabled though, it is dependent on the number of documents and segments. That means that out of 1000 test runs that use `RandomIndexWriter` to index a random number of docs up to 100, we will effectively parallelize only a couple of times. This commit increases the likelihood of running concurrent searches by randomly forcing 1 max segments per slice as well as 1 max doc per slice.	2023-06-15 11:35:10 +02:00
Alessandro Benedetti	af1afc8cb6	* GITHUB#12252 CHANGES.txt fix	2023-06-14 16:00:41 +01:00
Elia Porciani	14c18d8624	GITHUB-12252: Add function queries for computing similarity scores between knn vectors (#12253 ) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>	2023-06-14 15:49:00 +01:00
Adrien Grand	a8baa47733	Move TermAndBoost back to its original location. (#12366 ) PR #12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location.	2023-06-14 11:54:10 +02:00
Chaitanya Gohel	65447c8388	Add CHANGES.txt for #12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (#12368 ) Signed-off-by: gashutos <gashutos@amazon.com>	2023-06-14 10:17:06 +02:00
Chris Hegarty	1090928c14	Implement VectorUtilProvider with Java 21 Project Panama Vector API (#12363 ) This commit enables the Panama Vector API for Java 21. The version of VectorUtilPanamaProvider for Java 21 is identical to that of Java 20. As such, there is no specific 21 version - the Java 20 version will be loaded from the MRJAR.	2023-06-13 09:44:58 +01:00
Jonathan Ellis	071461ece5	Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (#12281 ) --------- Co-authored-by: Uwe Schindler <uschindler@apache.org>	2023-06-13 10:40:03 +02:00
gf2121	30eba6df56	Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (#12324 )	2023-06-13 14:24:26 +08:00
Uwe Schindler	c8e05c8cd6	Implement MMapDirectory with Java 21 Project Panama Preview API (#12294 )	2023-06-12 21:07:04 +02:00
Chris Fournier	41baf23ad9	Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (#12249 )	2023-06-12 18:20:10 +02:00
Uwe Schindler	ef35e6edf4	Work around SecurityManager issues during initialization of vector api (JDK-8309727) (#12362 )	2023-06-09 22:07:31 +02:00
Alan Woodward	a51241e4c9	Better paging when random reads go backwards (#12357 ) When reading data from outside the buffer, BufferedIndexInput always resets its buffer to start at the new read position. If we are reading backwards (for example, using an OffHeapFSTStore for a terms dictionary) then this can have the effect of re-reading the same data over and over again. This commit changes BufferedIndexInput to use paging when reading backwards, so that if we ask for a byte that is before the current buffer, we read a block of data of bufferSize that ends at the previous buffer start. Fixes #12356	2023-06-09 11:57:59 +01:00
fudongying	2934899ca6	feat: soft delete optimize (#12339 )	2023-06-09 11:41:28 +02:00
Ignacio Vera	9a2d19324f	[Tessellator] Improve the checks that validate the diagonal between two polygon nodes (#12353 )	2023-06-09 08:10:33 +02:00

36595 Commits