lucene

Commit Graph

Author	SHA1	Message	Date
Jim Ferenczi	c26b0180bd	Introduce a random vector scorer in HNSW builder/searcher (#12529 ) This PR involves the refactoring of the HNSW builder and searcher, aiming to create an abstraction for the random access and vector comparisons conducted during graph traversal. The newly added RandomVectorScorer provides a means to directly compare ordinals, eliminating the need to expose the raw vector primitive type. This scorer takes charge of vector retrieval and comparison during the graph's construction and search processes. The primary purpose of this abstraction is to enable the implementation of various strategies. For example, it opens the door to constructing the graph using the original float vectors while performing searches using their quantized int8 vector counterparts.	2023-09-12 13:57:07 +01:00
Tony-X	d77195d705	Document why we need `lastPosBlockOffset` (#12541 ) * Document why we need `lastPosBlockOffset` * Let ./gradlew tidy fix the formatting * Fix '<' with &lt; --------- Co-authored-by: Tony Xu <tonyx@amazon.com>	2023-09-12 07:32:58 -04:00
Michael McCandless	57dd5a4bda	The bitsRequired passed during NodeHash rehash (when building an FST) (#12545 )	2023-09-09 19:14:18 -04:00
Luca Cavanna	a7202e2e6f	Close index readers in tests (#12544 ) There are a few places where tests don't close index readers. This has not caused problems so far, but it becomes an issue when the reader gets an executor, because its shutdown happens as a closing listener of the reader. This has become more evident since we now offload sequential execution to the executor. If there's an executor, but it's never used, no threads are created, and no threads are leaked. If we do use the executor, and the reader is not closed, the test leaks threads.	2023-09-08 14:55:01 +02:00
Christine Poerschke	ef42af65f2	clarify QueryVisitor.acceptField javadoc w.r.t. not being term-specific (#12540 )	2023-09-06 11:18:46 +01:00
Luca Cavanna	d62ca4a01f	add missing changelog entry for #12498	2023-09-05 16:41:38 +02:00
Luca Cavanna	da894151a6	Offload single slice to executor (#12515 ) When an executor is set to the IndexSearcher, we should try and offload most of the computation to such executor. Ideally, the caller thread would only do light coordination work, and the executor is responsible for the heavier workload. If we don't offload sequential execution to the executor, it becomes very difficult to make any distinction about the type of workload performed on the two sides. Closes #12498	2023-09-05 16:30:20 +02:00
Luca Cavanna	947b2c5e5a	Unwrap execution exceptions cause and rethrow as is when possible (#12516 ) When performing concurrent search, we may get an execution exception from one or more slices. In that case, we'd like to rethrow the cause of the execution exception, which we do by wrapping it into a new runtime exception. Instead, we can rethrow runtime exceptions as-is, and the same is true for io exceptions. Any other exception is still wrapped into a new runtime exception. This unifies the exceptions that get thrown between sequential codepath (when no executor is provided) and concurrent codepath (when an executor is provided).	2023-09-05 15:55:48 +02:00
Chaitanya Gohel	d631615665	Honor topvalue while determining isMissingvalueCompetitive in case bottom is not set (#12520 )	2023-09-04 18:25:34 +02:00
zhangchao	a52161b131	Update outdated comment about maxPointsInLeafNode in BKD tree (#12532 )	2023-09-04 14:37:16 +02:00
Jack Wang	9fd45e3951	Enhancement 11236 lazy compute similarity score (#12480 )	2023-09-01 11:05:49 -07:00
Benjamin Trent	d1c3531161	Use panama vector for l2normalize (#12518 ) Use panama vector for l2normalize	2023-08-29 08:33:49 -04:00
zhangchao	16e4874bb9	Remove unused variable in BKDWriter (#12512 )	2023-08-22 15:50:47 +08:00
Luca Cavanna	bb62720526	Simplify task executor for concurrent operations (#12499 ) This commit removes the QueueSizeBasedExecutor (package private) in favour of simply offloading concurrent execution to the provided executor. In need of specific behaviour, it can all be included in the executor itself. This removes an instanceof check that determines which type of executor wrapper is used, which means that some tasks may be executed on the caller thread depending on queue size, whenever a rejection happens, or always for the last slice. This behaviour is not configurable in any way, and is too rigid. Rather than making this pluggable, I propose to make Lucene less opinionated about concurrent tasks execution and require that users include their own execution strategy directly in the executor that they provide to the index searcher. Relates to #12498	2023-08-21 21:54:37 +02:00
Jakub Slowinski	fb8183332b	Fix stack overflow in RegExp for long string (#12462 )	2023-08-16 22:45:20 -07:00
Shubham Chaudhary	368dbffef3	Replace consecutive close() calls and close() calls with null checks with IOUtils.close() (#12428 )	2023-08-16 17:12:34 -07:00
tang donghai	ec1367862d	Fix UTF32toUTF8 will produce invalid transition (#12472 )	2023-08-16 13:59:07 -07:00
Benjamin Trent	4174b521dd	Rename ToParentBlockJoin[Byte|Float]KnnVectorQuery and adjust to return highest score child doc ID by parent id (#12510 ) The current query is returning parent-id's based off of the nearest child-id score. However, it's difficult to invert that relationship (meaning determining what exactly the nearest child was during search). So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id. Now it's easy to determine the nearest child vector as that is what the query is returning. To determine its parent, it's as simple as using the previously provided parent bit set. Related to: https://github.com/apache/lucene/pull/12434	2023-08-16 13:44:49 -04:00
Benjamin Trent	5a5aa2c8fa	GITHUB#12342 Add new maximum inner product vector similarity method (#12479 ) The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. See: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222 for a good explanation for the need. To avoid breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added. Because the similarity from a `dotProduct` function call could be negative, this similarity scorer will scale negative dotProducts to between 0-1 and then all positive dotProduct values are from 1-MAX. One concern with adding this similarity function is that it breaks the triangle inequality. It is assumed that this is needed to build graph structures. But, there is conflicting research here when it comes to real-world data. See: - For: https://github.com/apache/lucene/issues/12342#issuecomment-1618258984 - Against: https://github.com/apache/lucene/issues/12342#issuecomment-1631577657, https://github.com/apache/lucene/issues/12342#issuecomment-1631808301 To check if any transformation of the input is required to satisfy the triangle inequality, many tests have been run. See: - https://github.com/apache/lucene/issues/12342#issuecomment-1653420640 - https://github.com/apache/lucene/issues/12342#issuecomment-1656112434 - https://github.com/apache/lucene/issues/12342#issuecomment-1656718447 If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users. closes: https://github.com/apache/lucene/issues/12342	2023-08-16 12:15:25 -04:00
Lu Xugang	71f6f59a75	Remove outdated comment in Scorer (#12494 ) We should delete this comment since this constructor parameter was already removed in LUCENE-2876, and its description of 'given Similarity' is a little bit confusing to the reader. Scorer always provide non-negative	2023-08-16 11:36:20 +08:00
Benjamin Trent	18b56bd002	ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing (#12504 ) This is a follow up to: https://github.com/apache/lucene/pull/12434 Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE	2023-08-14 09:24:25 -04:00
Adrien Grand	47258cc9e9	Move changes of #12415 to 9.8	2023-08-11 22:39:36 +02:00
Adrien Grand	4d26cb2219	Optimize disjunction counts. (#12415 ) This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates #12358	2023-08-11 22:37:37 +02:00
Benjamin Trent	df8745e59e	Fix flaky testToString method for Knn Vector queries (#12500 ) Periodically, the random indexer will force merge on close, this means that what was originally indexed as the zeroth document could no longer be the zeroth document. This commit adjusts the assertion to ensure the to string format is as expected for `DocAndScoreQuery`, regardless of the matching doc-id in the test. This seed shows the issue: ``` ./gradlew test --tests TestKnnByteVectorQuery.testToString -Dtests.seed=B78CDB966F4B8FC5 ```	2023-08-11 07:26:49 -04:00
Peter Gromov	13e747f95f	hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration (#12491 ) * hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration avoid the automaton access on definitely absent characters; count the scores for all substring lengths together	2023-08-08 22:40:42 +02:00
Benjamin Trent	dd4e66dad6	Fix test failure with zero-length vector (#12493 ) This adds assertions around the random test vector dimension count and continues to generate random vectors until it has a `squareSum > 0`	2023-08-08 08:38:46 -04:00
Benjamin Trent	a65cf8960a	Add ParentJoin KNN support (#12434 ) A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent. This commit adds this ability through some significant changes: - New leaf reader function that allows a collector for knn results - The knn results can then utilize bit-sets to join back to the parent id This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this. This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).	2023-08-07 14:46:42 -04:00
Adrien Grand	03ab02157a	Revert "Stop aligning windows in BooleanScorer. (#12488 )" This reverts commit `09e3b43331`.	2023-08-06 22:10:09 +02:00
Adrien Grand	09e3b43331	Stop aligning windows in BooleanScorer. (#12488 ) BooleanScorer aligns windows to multiples of 2048, but it doesn't have to. Actually, not aligning windows can help evaluate fewer windows overall and speed up query evaluation.	2023-08-05 11:29:34 +02:00
Adrien Grand	df3632cb03	Fix `DefaultBulkScorer` to not advance the competitive iterator beyond the end of the window. (#12481 ) The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the competitive iterator beyond the end of the window. This may cause bugs with bulk scorers such as `BooleanScorer` that sometimes delegate to the single clause that has matches in a given window of doc IDs. We should then make sure to not advance the competitive iterator beyond the end of the window based on this clause, as other clauses may have matches as well.	2023-08-03 07:19:27 +02:00
Adrien Grand	acffcfaaf0	Reduce overhead of disabling scoring on `BooleanScorer`. (#12475 ) This is a subset of #12415, which I'm extracting to its own pull request in order to have separate data points in nightly benchmarks. Results on `wikimedium10m` and `wikinightly` counting tasks: ``` CountTerm 4624.91 (6.4%) 4581.34 (6.4%) -0.9% ( -12% - 12%) 0.640 CountAndHighMed 280.03 (4.5%) 280.15 (4.4%) 0.0% ( -8% - 9%) 0.974 CountPhrase 7.22 (3.0%) 7.24 (1.8%) 0.3% ( -4% - 5%) 0.728 CountAndHighHigh 52.84 (4.9%) 53.12 (5.6%) 0.5% ( -9% - 11%) 0.755 PKLookup 232.01 (3.6%) 235.45 (2.8%) 1.5% ( -4% - 8%) 0.144 CountOrHighHigh 42.37 (6.1%) 56.04 (9.1%) 32.3% ( 16% - 50%) 0.000 CountOrHighMed 30.56 (6.5%) 40.46 (9.8%) 32.4% ( 15% - 52%) 0.000 ```	2023-08-03 07:17:52 +02:00
Armin Braun	e78feb7809	Cleanup duplication in BKDWriter (#12469 ) The logic for creating the writer runnable could be deduplicated. Also, a couple of anonymous classes could be turned into lambdas.	2023-08-03 07:16:44 +02:00
Benjamin Trent	229dc7481e	Fix randomly failing field info format tests (#12473 )	2023-08-02 14:10:57 -04:00
Adrien Grand	5e725964a0	Improve MaxScoreBulkScorer partitioning logic. (#12457 ) Partitioning scorers is an optimization problem: the optimal set of non-essential scorers is the subset of scorers whose sum of max window scores is less than the minimum competitive score that maximizes the sum of costs. The current approach consists of sorting scorers by maximum score within the window and computing the set of non-essential clauses as the first scorers whose sum of max scores is less than the minimum competitive score, i.e. you cannot have a competitive hit by matching only non-essential clauses. This sorting logic works well in the common case when costs are inversely correlated with maximum scores and gives an optimal solution: the above algorithm will also optimize the cost of non-essential clauses and thus minimize the cost of essential clauses, in turn further improving query runtimes. But this isn't true for all queries. E.g. fuzzy queries compute scores based on artificial term statistics, so costs are no longer inversely correlated with maximum scores. This was especially visible with the query `titel~2` on the wikipedia dataset, as `title` matches this query and is a high-frequency term. Yet the score contribution of this term is in the same order as the contribution of most other terms, so query runtime gets much improved if this clause gets considered non-essential rather than essential. This commit optimizes the partitioning logic a bit by sorting clauses by `max_score / cost` instead of just `max_score`. This will not change anything in the common case when max scores are inversely correlated with costs, but can significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my machine and the wikimedium10m dataset with this change.	2023-07-29 21:02:28 +02:00
Peter Gromov	32ec38271e	hunspell: check for aff file wellformedness more strictly (#12468 ) * hunspell: check for .aff file well-formedness more strictly	2023-07-29 16:07:50 +02:00
Peter Gromov	1af68bf2d7	hunspell: make the hash table load factor customizable (#12464 ) * hunspell: make the hash table load factor customizable	2023-07-28 18:36:32 +02:00
Mayya Sharipova	155b2edbe3	Fix occasional failure in BaseKnnVectorsFormatTestCase#testIllegalDimensionTooLarge (#12467 ) Depending on whether a document with dimensions > maxDims is created on a new segment or an already existing segment, we may get different error messages. This fix adds another possible error message we may get. Relates to #12436	2023-07-28 09:37:16 -04:00
Mayya Sharipova	119635ad80	Make KnnVectorsFormat#getMaxDimensions abstract (#12466 ) - Backward codecs use 1024 as max dims - Test classes use the current KnnVectorsFormat#DEFAULT_MAX_DIMENSIONS Relates to PR#12436 Closes #12309	2023-07-28 08:34:17 -04:00
Mayya Sharipova	98320d7616	Move max vector dims limit to Codec (#12436 ) Move vector max dimension limits enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of knn search algorithms to define their own limits on the maximum vector dimensions that they can handle. Closes #12309	2023-07-27 14:50:33 -04:00
Armin Braun	538b7d0ffe	Clean up writing String to ByteBuffersDataOutput (#12455 ) Resolving TODO to use UnicodeUtil instead of a copy of its code here. Maybe slightly slower from the extra check for high-surrogate, but that may be offset or outweighed by more compact code and by saving the capturing lambda that might not inline.	2023-07-26 14:25:01 +02:00
Greg Miller	87944c2aa7	Move CHANGES entry for GITHUB#12408 under 10.0 A backport to 9.x will be somewhat tricky with the API surface area so planning to wait for a 10.0 release.	2023-07-25 13:02:36 -07:00
Greg Miller	179b45bc23	Initialize facet counting data structures lazily (#12408 ) This change covers: * Taxonomy faceting * FastTaxonomyFacetCounts * TaxonomyFacetIntAssociations * TaxonomyFacetFloatAssociations * SSDV faceting * SortedSetDocValuesFacetCounts * ConcurrentSortedSetDocValuesFacetCounts * StringValueFacetCounts * Range faceting: * LongRangeFacetCounts * DoubleRangeFacetCounts * Long faceting: * LongValueFacetCounts Left for a future iteration: * RangeOnRange faceting * FacetSet faceting	2023-07-25 12:20:42 -07:00
Greg Miller	2b3b028734	GITHUB#12451: Update TestStringsToAutomaton validation to work around GH#12458 (#12461 )	2023-07-25 11:56:18 -07:00
Armin Braun	20e97fbd00	Faster bulk numeric reads from BufferedIndexInput (#12453 ) Reading ints/floats/longs one-by-one from a heap-byte-buffer, including doing our own bounds checks is not very efficient. We can use the ability to translate the buffer and read in bulk while taking turns with one-off reading/refilling instead.	2023-07-24 17:20:32 +02:00
Stefan Vodita	34721f9439	Assert IdxOrDvQuery subqueries and document useful fields (#12442 )	2023-07-24 16:36:48 +02:00
Benjamin Trent	59c56a0aed	Fix sorted&unsorted graph test flakiness (#12452 ) When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query. To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed. I have verified that bumping the search candidate pool to 60 fixes the failure. The total number of vectors still outnumbers the requested number of candidates, so the search is still hitting the graph. I verified further by running the test again over a couple thousand seeds and it didn't fail again.	2023-07-20 14:08:21 -04:00
Adrien Grand	17c13a76c8	Add BS1 optimization to MaxScoreBulkScorer. (#12444 ) Lucene's scorers that can dynamically prune on score provide great speedups when they manage to skip many hits. Unfortunately, there are also cases when they cannot skip hits efficiently, one example case being when there are many clauses in the query. In this case, exhaustively evaluating the set of matches with `BooleanScorer` (BS1) may perform several times faster. This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of collecting hits into a bitset to save the overhead of reordering priority queues. This helps make performance degrade much more gracefully when dynamic pruning cannot help much. Closes #12439	2023-07-19 13:51:22 +02:00
Martin Demberger	55f2f9958b	LUCENE-8183: Added the ability to get noSubMatches and noOverlapping Matches (#12437 ) --------- Co-authored-by: Martin Demberger <martin.demberger@root-nine.de>	2023-07-19 13:15:33 +02:00
Peter Gromov	f05adff4ca	hunspell: speed up the dictionary enumeration (#12447 ) * hunspell: speed up the dictionary enumeration cache each word's case and the lowercase form group the words by lengths to avoid even visiting entries with unneeded lengths	2023-07-18 21:25:26 +02:00
Stefan Vodita	b4619d87ed	Move sliced int buffer functionality to MemoryIndex (#11248 ) (#12409 ) * [WIP] Move IntBlockPool slices to MemoryIndex * [WIP] Working TestMemoryIndex * [WIP] Working TestSlicedIntBlockPool * Working many allocations tests * Add basic IntBlockPool test * SlicedIntBlockPool inherits from IntBlockPool * Tidy	2023-07-10 10:18:28 -04:00
Benjamin Trent	d03c8f16d9	Have byte[] vectors also trigger a timeout in ExitableDirectoryReader (#12423 ) `ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors. This commit fixes that bug.	2023-07-07 12:29:55 -04:00
Benjamin Trent	861153020a	Fix HNSW graph visitation limit bug (#12413 ) We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer. While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset. Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results. However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.	2023-07-06 15:46:36 -04:00
Adrien Grand	f527eb3b12	Remove Scorable#docID. (#12407 ) `Scorable#docID()` exposes the document that is being collected, which makes it impossible to bulk-collect multiple documents at once. Relates #12358	2023-07-05 10:40:06 +02:00
Uwe Schindler	9ffc625b2e	Followup on #12410 : Fix caller class check to use string literals to allow private/pkg-private classes	2023-07-03 17:52:10 +02:00
Uwe Schindler	f668cfd1cd	Fix forUtil.gradle to actually execute python script and also fix type error in script (#12411 )	2023-07-03 16:22:03 +02:00
Uwe Schindler	fde2e50d9e	Refactor vectorization support in Lucene (#12410 )	2023-07-03 11:39:19 +02:00
Perdjesk	907883f701	Correct Javadocs using SimpleBindings (#12402 ) Javadocs were still referring to the old `SortField` API, which has been replaced with methods that use `DoubleValuesSource` instead.	2023-06-30 15:14:24 +01:00
Adrien Grand	8811f31b9c	Add a post-collection hook to LeafCollector. (#12380 ) This adds `LeafCollector#finish` as a per-segment post-collection hook. While it was already possible to do this sort of thing on top of the collector API before, a downside is that the last leaf would need to be post-collected in the current thread instead of using the executor, which is a missed opportunity for making queries concurrent.	2023-06-30 15:19:35 +02:00
Sagar	40ee6e583e	Assign a dummy simScorer in TermsWeight if score is not needed (#12383 )	2023-06-30 15:14:33 +02:00
Sorabh	223eecca33	Add a thread safe CachingLeafSlicesSupplier to compute and cache the LeafSlices used with concurrent segment (#12374 ) search. It uses the protected method `slices` by default to compute the slices which can be overridden by the sub classes of IndexSearcher	2023-06-30 14:57:34 +02:00
zhangchao	01200b5804	Speed up NumericDocValuesWriter with index sorting (#12381 )	2023-06-30 14:56:56 +02:00
Uwe Schindler	e503805758	Remove usage and add some legacy java.util classes to forbiddenapis (Stack, Hashtable, Vector) (#12404 )	2023-06-29 16:56:41 +02:00
Luca Cavanna	f44cc45cf8	Share concurrent execution code into TaskExecutor (#12398 ) Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs tasks concurrently and waits for them to be completed and handles eventual exceptions. This commit shares code among these two scenarios, to reduce code duplication as well as to ensure that further improvements can be shared among them.	2023-06-28 13:52:01 +02:00
Adrien Grand	4029cc37a7	Fix MaxScoreBulkScorer#score's return value. (#12400 ) `AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of `BulkScorer#score`, a reasonable implementation should never have return values in this range, as it would suggest that more matches need collecting when we're already out of the range of valid doc IDs. So this generally indicates a bug. `MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the requested window of doc IDs, when the sum of maximum scores would be less than the minimum competitive score. In that case, the best information it has is that there are no matches in the window, but it cannot give a good estimate of the next potential match. This assertion in `AssertingBulkScorer` looks sane to me, so I made a small change to `MaxScoreBulkScorer` to make sure it meets `AssertingScorer`'s expectations. This is done in a place that is only called once per scored window, so it should not have a noticeable performance impact.	2023-06-28 09:17:41 +02:00
yixunx	6bb8cc0235	Let hard link wrapper fallback to delegate.copyFrom (#12384 ) Co-authored-by: Yixun Xu <yixunx@palantir.com>	2023-06-28 09:17:20 +02:00
Stefan Vodita	b88d3e1988	Catch offset overflows in byte pool (#9660 ) (#12392 )	2023-06-28 09:16:56 +02:00
Tomas Eduardo Fernandez Lobbe	b50b969d25	Update comment about IndexOptions ordinals (#12360 ) FieldInfos no longer accepts changes in IndexOptions, however, different IndexOptions are still compared using their ordinals	2023-06-26 16:29:13 -07:00
Alan Woodward	79e8c9c8b9	Fix edge case in TestJoinUtil TestJoinUtil.checkBoost() needs to check to see if there are any results to validate, otherwise we can get an array-out-of-bounds exception	2023-06-26 13:28:09 +01:00
Adrien Grand	4fec812cc5	Add back-compat indices for 9.7.0	2023-06-26 14:24:27 +02:00
Luca Cavanna	3e0dc2b572	Add missing change entries for 9.8	2023-06-26 11:23:36 +02:00
Alan Woodward	edd799824f	Enable boosts on JoinUtil queries (#12388 ) Boosts should not be ignored by queries returned from JoinUtil	2023-06-26 09:47:14 +01:00
Luca Cavanna	7f10dca1e5	Revert "Parallelize knn query rewrite across slices rather than segments (#12325 )" (#12385 ) This reverts commit `10bebde269`. Based on a recent discussion in https://github.com/apache/lucene/pull/12183#discussion_r1235739084 we agreed it makes more sense to parallelize knn query vector rewrite across leaves rather than leaf slices.	2023-06-26 10:41:18 +02:00
Michael Sokolov	cb195bd96e	github-12386: set java.io.tmpdir in replicator tests' forked processes (#12387 )	2023-06-23 08:38:06 -04:00
Jonathan Ellis	fe0278e36e	Reuse neighborqueue during hnsw index build (attempt 2) (#12372 ) This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance. This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached. The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added. * Re-use NeighborQueue during build's search * improve javadoc for OnHeapHnswGraphSearcher * assert that results parameter is minheap as expected * update CHANGES	2023-06-20 15:05:37 -04:00
Adrien Grand	8703e449ce	Change the MAXSCORE scorer to a bulk scorer. (#12361 )	2023-06-20 18:55:03 +02:00
zhangchao	37b92adf6a	Avoid redundant loop for compute min value in DirectMonotonicWriter (#12377 ) * Avoid redundant loop for get min value * update CHANGES.txt	2023-06-20 09:12:15 -04:00
Alan Woodward	6d4314d46f	Add back-compat indices for 9.6.0	2023-06-16 17:58:31 +02:00
Adrien Grand	6ee7b2b9f6	Add next minor version 9.8.0	2023-06-16 13:39:30 +02:00
Uwe Schindler	148236a50b	This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (#12376 )	2023-06-16 12:28:29 +02:00
Luca Cavanna	bb6ec50d4c	Increased the likelihood of leveraging inter-segment concurrency in tests (#12369 ) We have recently increased the likelihood of leveraging inter-segment search concurrency in tests when `newSearcher` is used to create the index searcher (see #959). When parallel execution is enabled though, it is dependent on the number of documents and segments. That means that out of 1000 test runs that use `RandomIndexWriter` to index a random number of docs up to 100, we will effectively parallelize only a couple of times. This commit increases the likelihood of running concurrent searches by randomly forcing 1 max segments per slice as well as 1 max doc per slice.	2023-06-15 11:35:10 +02:00
Alessandro Benedetti	af1afc8cb6	* GITHUB#12252 CHANGES.txt fix	2023-06-14 16:00:41 +01:00
Elia Porciani	14c18d8624	GITHUB-12252: Add function queries for computing similarity scores between knn vectors (#12253 ) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>	2023-06-14 15:49:00 +01:00
Adrien Grand	a8baa47733	Move TermAndBoost back to its original location. (#12366 ) PR #12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location.	2023-06-14 11:54:10 +02:00
Chaitanya Gohel	65447c8388	Add CHANGES.txt for #12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (#12368 ) Signed-off-by: gashutos <gashutos@amazon.com>	2023-06-14 10:17:06 +02:00
Chris Hegarty	1090928c14	Implement VectorUtilProvider with Java 21 Project Panama Vector API (#12363 ) This commit enables the Panama Vector API for Java 21. The version of VectorUtilPanamaProvider for Java 21 is identical to that of Java 20. As such, there is no specific 21 version - the Java 20 version will be loaded from the MRJAR.	2023-06-13 09:44:58 +01:00
Jonathan Ellis	071461ece5	Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (#12281 ) --------- Co-authored-by: Uwe Schindler <uschindler@apache.org>	2023-06-13 10:40:03 +02:00
gf2121	30eba6df56	Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (#12324 )	2023-06-13 14:24:26 +08:00
Uwe Schindler	c8e05c8cd6	Implement MMapDirectory with Java 21 Project Panama Preview API (#12294 )	2023-06-12 21:07:04 +02:00
Chris Fournier	41baf23ad9	Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (#12249 )	2023-06-12 18:20:10 +02:00
Uwe Schindler	ef35e6edf4	Work around SecurityManager issues during initialization of vector api (JDK-8309727) (#12362 )	2023-06-09 22:07:31 +02:00
Alan Woodward	a51241e4c9	Better paging when random reads go backwards (#12357 ) When reading data from outside the buffer, BufferedIndexInput always resets its buffer to start at the new read position. If we are reading backwards (for example, using an OffHeapFSTStore for a terms dictionary) then this can have the effect of re-reading the same data over and over again. This commit changes BufferedIndexInput to use paging when reading backwards, so that if we ask for a byte that is before the current buffer, we read a block of data of bufferSize that ends at the previous buffer start. Fixes #12356	2023-06-09 11:57:59 +01:00
fudongying	2934899ca6	feat: soft delete optimize (#12339 )	2023-06-09 11:41:28 +02:00
Ignacio Vera	9a2d19324f	[Tessellator] Improve the checks that validate the diagonal between two polygon nodes (#12353 )	2023-06-09 08:10:33 +02:00
Peter Gromov	5b63a1879d	TestHunspell: reduce the flakiness probability (#12351 ) * TestHunspell: reduce the flakiness probability We need to check how the timeout interacts with custom exception-throwing checkCanceled. The default timeout seems not enough for some CI agents, so let's increase it. Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>	2023-06-07 14:10:44 +02:00
Patrick Zhai	0c293909c0	Add updateDocuments API which accept a query (reopen) (#12346 )	2023-06-03 20:16:16 -07:00
Greg Miller	52ace7eb35	Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (#12320 )	2023-06-02 09:34:52 -07:00
Petr Portnov | PROgrm_JARvis	45110a6a46	Make memory fence in `ByteBufferGuard` explicit (#12290 )	2023-06-01 13:41:06 +02:00
Uwe Schindler	40b582ab18	Revert "Add updateDocuments API which accept a query (#12341 )" (#12344 ) This reverts commit `52ab16731e`.	2023-06-01 13:37:36 +02:00
Patrick Zhai	52ab16731e	Add updateDocuments API which accept a query (#12341 )	2023-06-01 13:37:04 +02:00
Peter Gromov	4bf1b94209	hunspell (minor): reduce allocations when reading the dictionary's morphological data (#12323 ) there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation	2023-06-01 11:37:38 +02:00
Greg Miller	f79b316bd5	Add CHANGES entry for GH#12334	2023-05-31 15:18:34 -07:00
Chaitanya Gohel	d44be24025	Fix searchafter high latency when after value is out of range for segment (#12334 )	2023-05-31 15:07:53 -07:00
Daniele Antuzi	da36c24cb9	Use thread-safe search version of HnswGraphSearcher (#12246 ) Addressing comment received in the PR https://github.com/apache/lucene/pull/12246	2023-05-30 15:38:06 +01:00
Luca Cavanna	72b91156f3	Don't generate stacktrace for TimeExceededException (#12335 ) The exception is package private and never rethrown, we can avoid generating a stacktrace for it.	2023-05-30 10:29:46 +02:00
Patrick Zhai	d1850e44f3	Update TestVectorUtilProviders.java (#12338 )	2023-05-29 16:26:29 -07:00
Uwe Schindler	db0c21f25d	Cleanup and update changes and synchronize with 9.x	2023-05-26 18:22:51 +02:00
Jonathan Ellis	431dc7b415	add BitSet.clear() (#12268 )	2023-05-26 18:13:16 +02:00
Greg Miller	367b03bfc2	GH#12321: Reduce visibility of StringsToAutomaton (#12331 )	2023-05-26 08:55:02 -07:00
Uwe Schindler	f5f25777d8	Update changes to be correct with ARM (it is called NEON there)	2023-05-26 16:53:39 +02:00
Luca Cavanna	24712d7525	Move changes entry for #12328 to 9.7	2023-05-26 15:11:21 +02:00
Armin Braun	fd75807350	Optimize ConjunctionDISI.createConjunction (#12328 ) This method is showing up as a little hot when profiling some queries. Almost all the time spent in this method is just burnt on ceremony around stream indirections that don't inline. Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min cost) makes this method far cheaper and easier to read.	2023-05-26 13:44:39 +02:00
Luca Cavanna	0ce6b9a67b	Adjust changes entries for knn query concurrent rewrite Moved entry for #12160 to 9.7.0 as it's been backported. Added missing entry for #12325.	2023-05-26 09:25:54 +02:00
Luca Cavanna	10bebde269	Parallelize knn query rewrite across slices rather than segments (#12325 ) The concurrent query rewrite for knn vector query introduced with #12160 requests one thread per segment to the executor. To align this with the IndexSearcher parallel behaviour, we should rather parallelize across slices. Also, we can reuse the same slice executor instance that the index searcher already holds, in that way we are using a QueueSizeBasedExecutor when a thread pool executor is provided.	2023-05-26 09:17:25 +02:00
Uwe Schindler	c188d47a8b	Handle jdk.internal classes mentioned in vector superclass or interfaces during extraction (#12329 )	2023-05-25 17:21:03 +02:00
Michael McCandless	7da7c43638	#12276 : rename DaciukMihovAutomatonBuilder to StringsToAutomaton (#12310 ) Closes #12276	2023-05-25 10:18:41 -04:00
Chris Hegarty	f756f90644	Integrate the Incubating Panama Vector API (#12311 ) Leverage accelerated vector hardware instructions in Vector Search. Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Panama Foreign API. This change expands this mechanism to include the Incubating Panama Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously). Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21. --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Robert Muir <rmuir@apache.org>	2023-05-25 07:59:50 +01:00
Andrey Bozhko	c9c49bc553	[MINOR] Update javadoc in Query class (#12233 ) - add a few missing full stops - update wording in the description of Query#equals method	2023-05-23 12:16:50 +02:00
Patrick Zhai	8a602b5063	Add multi-thread searchability to OnHeapHnswGraph (#12257 )	2023-05-21 21:48:46 -07:00
Peter Gromov	a454388b80	hunspell (minor): reduce allocations when processing compound rules (#12316 )	2023-05-19 21:36:05 +02:00
Uwe Schindler	84e2e3afc3	Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (#12315 )	2023-05-19 18:42:55 +02:00
Uwe Schindler	a8a95e64ce	Forward port references to AccessController in VirtualMethod (#12308 )	2023-05-19 16:38:24 +02:00
Jerry Chin	04ef6de826	GITHUB-12291: Skip blank lines from stopwords list. (#12299 )	2023-05-18 16:58:32 +02:00
Michael Sokolov	6b51cce0b8	NeighborQueue.reset() now clears incomplete flag	2023-05-18 10:23:22 -04:00
Greg Miller	3e4ca4042c	Minor cleanup and improvements to DaciukMihovAutomatonBuilder (#12305 )	2023-05-18 07:01:19 -07:00
Michael Sokolov	2facb3ae0e	Revert "allocate one NeighborQueue per search for results (#12255 )" This reverts commit `9a7efe92c0`.	2023-05-18 13:42:17 +00:00
Petr Portnov | PROgrm_JARvis	0c6e8aec67	Seal `IndexReader` and `IndexReaderContext` (#12296 )	2023-05-17 08:47:47 +02:00
tang donghai	f53eb28af0	remove max recursion from Operations.java to AutomatonTestUtil.java (#12298 ) Co-authored-by: tangdonghai <tangdonghai@meituan.com>	2023-05-16 07:09:28 -04:00
Patrick Zhai	8af305892d	Optimize HNSW diversity calculation (#12235 )	2023-05-15 23:20:31 -07:00
tang donghai	0e172b0723	Update Javadoc for topoSortStates method after #12286 (#12292 )	2023-05-15 18:06:01 +02:00
tang donghai	5d203f8337	toposort use iterator to avoid stackoverflow (#12286 ) Co-authored-by: tangdonghai <tangdonghai@meituan.com>	2023-05-15 16:20:15 +02:00
Luca Cavanna	223e28ef16	Simplify SliceExecutor and QueueSizeBasedExecutor (#12285 ) The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.	2023-05-11 11:08:48 +02:00
Marcus	963ed7ce88	`ToParentBlockJoinQuery` Explain Support Score Mode (#12245 ) * `ToParentBlockJoinQuery` Explain Support Score Mode --------- Co-authored-by: Mikhail Khludnev <mkhl@apache.org>	2023-05-10 19:10:37 +03:00
Luca Cavanna	b6100d9787	Make TimeExceededException members final (#12271 ) TimeExceededException has three members that are set within its constructor and never modified. They can be made final.	2023-05-09 11:28:23 +02:00
Luca Cavanna	082c49a9ef	Update javadocs for QueryTimeout (#12272 ) QueryTimeout was introduced together with ExitableDirectoryReader but is now also optionally set to the IndexSearcher to wrap the bulk scorer with a TimeLimitingBulkScorer. Its javadocs need updating.	2023-05-09 11:27:47 +02:00
Luca Cavanna	10bad40ed3	Make query timeout members final in ExitableDirectoryReader (#12274 ) There's a couple of places in the Exitable wrapper classes where queryTimeout is set within the constructor and never modified. This commit makes such members final.	2023-05-09 11:27:06 +02:00
Luca Cavanna	1cd9c1d66a	add missing changelog entry for #12220	2023-05-09 10:57:28 +02:00
Luca Cavanna	67bb384f72	add missing changelog entry for #12260	2023-05-09 10:52:03 +02:00
Luca Cavanna	9579d2de76	Move changes entry for #12270 to 9.7.0 section	2023-05-09 10:28:22 +02:00
Armin Braun	add9aba16d	Don't generate stacktrace in CollectionTerminatedException (#12270 ) CollectionTerminatedException is always caught and never exposed to users so there's no point in filling in a stack-trace for it.	2023-05-09 10:18:52 +02:00
Jonathan Ellis	9a7efe92c0	allocate one NeighborQueue per search for results (#12255 )	2023-05-08 17:22:58 -04:00
Michael Sokolov	a39885fdab	GITHUB-12224: remove KnnGraphTester (moved to luceneutil) (#12238 )	2023-05-08 10:12:36 -04:00
Uwe Schindler	397c2e547a	Fix MMapDirectory documentation for Java 20 (#12265 )	2023-05-05 12:04:38 +02:00
Luca Cavanna	caeabf3930	Fix SynonymQuery equals implementation (#12260 ) The term member of TermAndBoost used to be a Term instance and became a BytesRef with #11941, which means its equals impl won't take the field name into account. The SynonymQuery equals impl needs to be updated accordingly to take the field into account as well, otherwise synonym queries with same term and boost across different fields are equal which is a bug.	2023-05-03 11:27:33 +02:00
Jonathan Ellis	3c163745bb	Use HashMap (was TreeMap) for OnHeapHnswGraph neighbors	2023-04-30 17:59:39 -04:00
Patrick Zhai	1fa2be90ea	Tidy the main branch	2023-04-26 21:21:57 -07:00
Alan Woodward	7374c200a1	Add next minor version 9.7.0	2023-04-26 16:44:47 +01:00
Christoph Büscher	f45e096304	Add ordering of files in compound files (#12241 ) Today there is no specific ordering of how files are written to a compound file. The current order is determined by iterating over the set of file names in SegmentInfo, which is undefined. This commit changes to an order based on file size. Colocating data from files that are smaller (typically metadata files like terms index, field info etc...) but accessed often can help when parts of these files are held in cache.	2023-04-26 14:01:02 +01:00
Luca Cavanna	b0befef912	QueryProfilerWeight to extend FilterWeight (#12242 ) QueryProfilerWeight should override matches and delegate to the subQueryWeight. Another way to fix this issue is to make it extend ProfileWeight and override only methods that need to have a different behaviour than delegating to the sub weight.	2023-04-26 10:24:57 +02:00
Alessandro Benedetti	4deb0003c4	Word2VecSynonymFilter constructor null check (#12169 )	2023-04-24 17:28:12 +02:00
Daniele Antuzi	1f4f2bf509	Introduced the Word2VecSynonymFilter (#12169 ) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>	2023-04-24 13:35:26 +02:00

13930 Commits