lucene

Commit Graph

Author	SHA1	Message	Date
Jakub Slowinski	fb8183332b	Fix stack overflow in RegExp for long string (#12462 )	2023-08-16 22:45:20 -07:00
Shubham Chaudhary	368dbffef3	Replace consecutive close() calls and close() calls with null checks with IOUtils.close() (#12428 )	2023-08-16 17:12:34 -07:00
tang donghai	ec1367862d	Fix UTF32toUTF8 will produce invalid transition (#12472 )	2023-08-16 13:59:07 -07:00
Benjamin Trent	4174b521dd	Rename ToParentBlockJoin[Byte|Float]KnnVectorQuery and adjust to return highest score child doc ID by parent id (#12510 ) The current query is returning parent-ids based on the nearest child-id score. However, it's difficult to invert that relationship (meaning determining what exactly the nearest child was during search). So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id. Now it's easy to determine the nearest child vector as that is what the query is returning. To determine its parent, it's as simple as using the previously provided parent bit set. Related to: https://github.com/apache/lucene/pull/12434	2023-08-16 13:44:49 -04:00
Benjamin Trent	5a5aa2c8fa	GITHUB#12342 Add new maximum inner product vector similarity method (#12479 ) The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. See: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222 for a good explanation for the need. To avoid breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added. Because the similarity from a `dotProduct` function call could be negative, this similarity scorer will scale negative dotProducts to between 0-1 and then all positive dotProduct values are from 1-MAX. One concern with adding this similarity function is that it breaks the triangle inequality. It is assumed that this is needed to build graph structures. But, there is conflicting research here when it comes to real-world data. See: - For: https://github.com/apache/lucene/issues/12342#issuecomment-1618258984 - Against: https://github.com/apache/lucene/issues/12342#issuecomment-1631577657, https://github.com/apache/lucene/issues/12342#issuecomment-1631808301 To check if any transformation of the input is required to satisfy the triangle inequality, many tests have been run. See: - https://github.com/apache/lucene/issues/12342#issuecomment-1653420640 - https://github.com/apache/lucene/issues/12342#issuecomment-1656112434 - https://github.com/apache/lucene/issues/12342#issuecomment-1656718447 If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users. closes: https://github.com/apache/lucene/issues/12342	2023-08-16 12:15:25 -04:00
Lu Xugang	71f6f59a75	Remove outdated comment in Scorer (#12494 ) We should delete this comment since these constructor parameters were already removed in LUCENE-2876; its description of 'given Similarity' is a little confusing to readers. Scorer always provide non-negative	2023-08-16 11:36:20 +08:00
Benjamin Trent	18b56bd002	ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing (#12504 ) This is a follow up to: https://github.com/apache/lucene/pull/12434 Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE	2023-08-14 09:24:25 -04:00
Adrien Grand	47258cc9e9	Move changes of #12415 to 9.8	2023-08-11 22:39:36 +02:00
Adrien Grand	4d26cb2219	Optimize disjunction counts. (#12415 ) This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates #12358	2023-08-11 22:37:37 +02:00
Benjamin Trent	df8745e59e	Fix flaky testToString method for Knn Vector queries (#12500 ) Periodically, the random indexer will force merge on close, this means that what was originally indexed as the zeroth document could no longer be the zeroth document. This commit adjusts the assertion to ensure the to string format is as expected for `DocAndScoreQuery`, regardless of the matching doc-id in the test. This seed shows the issue: ``` ./gradlew test --tests TestKnnByteVectorQuery.testToString -Dtests.seed=B78CDB966F4B8FC5 ```	2023-08-11 07:26:49 -04:00
Peter Gromov	13e747f95f	hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration (#12491 ) * hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration avoid the automaton access on definitely absent characters; count the scores for all substring lengths together	2023-08-08 22:40:42 +02:00
Benjamin Trent	dd4e66dad6	Fix test failure with zero-length vector (#12493 ) This adds assertions around the random test vector dimension count and continues to generate random vectors until it has a `squareSum > 0`	2023-08-08 08:38:46 -04:00
Benjamin Trent	a65cf8960a	Add ParentJoin KNN support (#12434 ) A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent. This commit adds this ability through some significant changes: - New leaf reader function that allows a collector for knn results - The knn results can then utilize bit-sets to join back to the parent id This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this. This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).	2023-08-07 14:46:42 -04:00
Adrien Grand	03ab02157a	Revert "Stop aligning windows in BooleanScorer. (#12488 )" This reverts commit `09e3b43331`.	2023-08-06 22:10:09 +02:00
Adrien Grand	09e3b43331	Stop aligning windows in BooleanScorer. (#12488 ) BooleanScorer aligns windows to multiples of 2048, but it doesn't have to. Actually, not aligning windows can help evaluate fewer windows overall and speed up query evaluation.	2023-08-05 11:29:34 +02:00
Adrien Grand	df3632cb03	Fix `DefaultBulkScorer` to not advance the competitive iterator beyond the end of the window. (#12481 ) The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the competitive iterator beyond the end of the window. This may cause bugs with bulk scorers such as `BooleanScorer` that sometimes delegate to the single clause that has matches in a given window of doc IDs. We should then make sure to not advance the competitive iterator beyond the end of the window based on this clause, as other clauses may have matches as well.	2023-08-03 07:19:27 +02:00
Adrien Grand	acffcfaaf0	Reduce overhead of disabling scoring on `BooleanScorer`. (#12475 ) This is a subset of #12415, which I'm extracting to its own pull request in order to have separate data points in nightly benchmarks. Results on `wikimedium10m` and `wikinightly` counting tasks: ``` CountTerm 4624.91 (6.4%) 4581.34 (6.4%) -0.9% ( -12% - 12%) 0.640 CountAndHighMed 280.03 (4.5%) 280.15 (4.4%) 0.0% ( -8% - 9%) 0.974 CountPhrase 7.22 (3.0%) 7.24 (1.8%) 0.3% ( -4% - 5%) 0.728 CountAndHighHigh 52.84 (4.9%) 53.12 (5.6%) 0.5% ( -9% - 11%) 0.755 PKLookup 232.01 (3.6%) 235.45 (2.8%) 1.5% ( -4% - 8%) 0.144 CountOrHighHigh 42.37 (6.1%) 56.04 (9.1%) 32.3% ( 16% - 50%) 0.000 CountOrHighMed 30.56 (6.5%) 40.46 (9.8%) 32.4% ( 15% - 52%) 0.000 ```	2023-08-03 07:17:52 +02:00
Armin Braun	e78feb7809	Cleanup duplication in BKDWriter (#12469 ) The logic for creating the writer runnable could be deduplicated. Also, a couple of anonymous classes could be turned into lambdas.	2023-08-03 07:16:44 +02:00
Benjamin Trent	229dc7481e	Fix randomly failing field info format tests (#12473 )	2023-08-02 14:10:57 -04:00
Adrien Grand	5e725964a0	Improve MaxScoreBulkScorer partitioning logic. (#12457 ) Partitioning scorers is an optimization problem: the optimal set of non-essential scorers is the subset of scorers whose sum of max window scores is less than the minimum competitive score that maximizes the sum of costs. The current approach consists of sorting scorers by maximum score within the window and computing the set of non-essential clauses as the first scorers whose sum of max scores is less than the minimum competitive score, i.e. you cannot have a competitive hit by matching only non-essential clauses. This sorting logic works well in the common case when costs are inversely correlated with maximum scores and gives an optimal solution: the above algorithm will also optimize the cost of non-essential clauses and thus minimize the cost of essential clauses, in turn further improving query runtimes. But this isn't true for all queries. E.g. fuzzy queries compute scores based on artificial term statistics, so costs are no longer inversely correlated with maximum scores. This was especially visible with the query `titel~2` on the wikipedia dataset, as `title` matches this query and is a high-frequency term. Yet the score contribution of this term is in the same order as the contribution of most other terms, so query runtime gets much improved if this clause gets considered non-essential rather than essential. This commit optimizes the partitioning logic a bit by sorting clauses by `max_score / cost` instead of just `max_score`. This will not change anything in the common case when max scores are inversely correlated with costs, but can significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my machine and the wikimedium10m dataset with this change.	2023-07-29 21:02:28 +02:00
Peter Gromov	32ec38271e	hunspell: check for aff file wellformedness more strictly (#12468 ) * hunspell: check for .aff file well-formedness more strictly	2023-07-29 16:07:50 +02:00
Peter Gromov	1af68bf2d7	hunspell: make the hash table load factor customizable (#12464 ) * hunspell: make the hash table load factor customizable	2023-07-28 18:36:32 +02:00
Mayya Sharipova	155b2edbe3	Fix occasional failure in BaseKnnVectorsFormatTestCase#testIllegalDimensionTooLarge (#12467 ) Depending on whether a document with dimensions > maxDims is created on a new segment or an already existing segment, we may get different error messages. This fix adds another possible error message we may get. Relates to #12436	2023-07-28 09:37:16 -04:00
Mayya Sharipova	119635ad80	Make KnnVectorsFormat#getMaxDimensions abstract (#12466 ) - Backward codecs use 1024 as max dims - Test classes use the current KnnVectorsFormat#DEFAULT_MAX_DIMENSIONS Relates to PR#12436 Closes #12309	2023-07-28 08:34:17 -04:00
Mayya Sharipova	98320d7616	Move max vector dims limit to Codec (#12436 ) Move vector max dimension limits enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of knn search algorithms to define their own limits on the maximum vector dimensions that they can handle. Closes #12309	2023-07-27 14:50:33 -04:00
Armin Braun	538b7d0ffe	Clean up writing String to ByteBuffersDataOutput (#12455 ) Resolving TODO to use UnicodeUtil instead of a copy of its code here. Maybe slightly slower from the extra check for high-surrogate, but that may be outweighed by more compact code and by saving the capturing lambda that might not inline.	2023-07-26 14:25:01 +02:00
Greg Miller	87944c2aa7	Move CHANGES entry for GITHUB#12408 under 10.0 A backport to 9.x will be somewhat tricky with the API surface area so planning to wait for a 10.0 release.	2023-07-25 13:02:36 -07:00
Greg Miller	179b45bc23	Initialize facet counting data structures lazily (#12408 ) This change covers: * Taxonomy faceting * FastTaxonomyFacetCounts * TaxonomyFacetIntAssociations * TaxonomyFacetFloatAssociations * SSDV faceting * SortedSetDocValuesFacetCounts * ConcurrentSortedSetDocValuesFacetCounts * StringValueFacetCounts * Range faceting: * LongRangeFacetCounts * DoubleRangeFacetCounts * Long faceting: * LongValueFacetCounts Left for a future iteration: * RangeOnRange faceting * FacetSet faceting	2023-07-25 12:20:42 -07:00
Greg Miller	2b3b028734	GITHUB#12451: Update TestStringsToAutomaton validation to work around GH#12458 (#12461 )	2023-07-25 11:56:18 -07:00
Armin Braun	20e97fbd00	Faster bulk numeric reads from BufferedIndexInput (#12453 ) Reading ints/floats/longs one-by-one from a heap-byte-buffer, including doing our own bounds checks is not very efficient. We can use the ability to translate the buffer and read in bulk while taking turns with one-off reading/refilling instead.	2023-07-24 17:20:32 +02:00
Houston Putman	1c70c40082	Enable search for site javadocs (#12430 )	2023-07-24 10:38:19 -04:00
Stefan Vodita	34721f9439	Assert IdxOrDvQuery subqueries and document useful fields (#12442 )	2023-07-24 16:36:48 +02:00
Benjamin Trent	59c56a0aed	Fix sorted&unsorted graph test flakiness (#12452 ) When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query. To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed. I have verified that bumping the search candidate pool to 60 fixes the failure. The total number of vectors still outnumbers the requested number of candidates, so the search is still hitting the graph. I verified further by running the test again over a couple thousand seeds and it didn't fail again.	2023-07-20 14:08:21 -04:00
Adrien Grand	17c13a76c8	Add BS1 optimization to MaxScoreBulkScorer. (#12444 ) Lucene's scorers that can dynamically prune on score provide great speedups when they manage to skip many hits. Unfortunately, there are also cases when they cannot skip hits efficiently, one example case being when there are many clauses in the query. In this case, exhaustively evaluating the set of matches with `BooleanScorer` (BS1) may perform several times faster. This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of collecting hits into a bitset to save the overhead of reordering priority queues. This helps make performance degrade much more gracefully when dynamic pruning cannot help much. Closes #12439	2023-07-19 13:51:22 +02:00
Martin Demberger	55f2f9958b	LUCENE-8183: Added the ability to get noSubMatches and noOverlapping Matches (#12437 ) --------- Co-authored-by: Martin Demberger <martin.demberger@root-nine.de>	2023-07-19 13:15:33 +02:00
Peter Gromov	f05adff4ca	hunspell: speed up the dictionary enumeration (#12447 ) * hunspell: speed up the dictionary enumeration cache each word's case and the lowercase form group the words by lengths to avoid even visiting entries with unneeded lengths	2023-07-18 21:25:26 +02:00
Stefan Vodita	b4619d87ed	Move sliced int buffer functionality to MemoryIndex (#11248 ) (#12409 ) * [WIP] Move IntBlockPool slices to MemoryIndex * [WIP] Working TestMemoryIndex * [WIP] Working TestSlicedIntBlockPool * Working many allocations tests * Add basic IntBlockPool test * SlicedIntBlockPool inherits from IntBlockPool * Tidy	2023-07-10 10:18:28 -04:00
Benjamin Trent	d03c8f16d9	Have byte[] vectors also trigger a timeout in ExitableDirectoryReader (#12423 ) `ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors. This commit fixes that bug.	2023-07-07 12:29:55 -04:00
Benjamin Trent	861153020a	Fix HNSW graph visitation limit bug (#12413 ) We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer. While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset. Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results. However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.	2023-07-06 15:46:36 -04:00
Adrien Grand	f527eb3b12	Remove Scorable#docID. (#12407 ) `Scorable#docID()` exposes the document that is being collected, which makes it impossible to bulk-collect multiple documents at once. Relates #12358	2023-07-05 10:40:06 +02:00
Uwe Schindler	9ffc625b2e	Followup on #12410 : Fix caller class check to use string literals to allow private/pkg-private classes	2023-07-03 17:52:10 +02:00
Uwe Schindler	f668cfd1cd	Fix forUtil.gradle to actually execute python script and also fix type error in script (#12411 )	2023-07-03 16:22:03 +02:00
Uwe Schindler	fde2e50d9e	Refactor vectorization support in Lucene (#12410 )	2023-07-03 11:39:19 +02:00
Perdjesk	907883f701	Correct Javadocs using SimpleBindings (#12402 ) Javadocs were still referring to the old `SortField` API, which has been replaced with methods that use `DoubleValuesSource` instead.	2023-06-30 15:14:24 +01:00
Adrien Grand	8811f31b9c	Add a post-collection hook to LeafCollector. (#12380 ) This adds `LeafCollector#finish` as a per-segment post-collection hook. While it was already possible to do this sort of thing on top of the collector API before, a downside is that the last leaf would need to be post-collected in the current thread instead of using the executor, which is a missed opportunity for making queries concurrent.	2023-06-30 15:19:35 +02:00
Sagar	40ee6e583e	Assign a dummy simScorer in TermsWeight if score is not needed (#12383 )	2023-06-30 15:14:33 +02:00
Sorabh	223eecca33	Add a thread safe CachingLeafSlicesSupplier to compute and cache the LeafSlices used with concurrent segment (#12374 ) search. It uses the protected method `slices` by default to compute the slices, which can be overridden by subclasses of IndexSearcher	2023-06-30 14:57:34 +02:00
zhangchao	01200b5804	Speed up NumericDocValuesWriter with index sorting (#12381 )	2023-06-30 14:56:56 +02:00
Uwe Schindler	e503805758	Remove usage and add some legacy java.util classes to forbiddenapis (Stack, Hashtable, Vector) (#12404 )	2023-06-29 16:56:41 +02:00
Luca Cavanna	f44cc45cf8	Share concurrent execution code into TaskExecutor (#12398 ) Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs tasks concurrently and waits for them to be completed and handles eventual exceptions. This commit shares code among these two scenarios, to reduce code duplication as well as to ensure that further improvements can be shared among them.	2023-06-28 13:52:01 +02:00
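
Note on commit 5a5aa2c8fa (#12479): the message says negative dot products are scaled into the 0-1 range and non-negative ones into 1-MAX. A minimal Java sketch of one mapping with that property follows. The class name, the method names, and the exact formulas (`1 / (1 - d)` for negative inputs, `d + 1` otherwise) are assumptions for illustration and are not taken from the commit text or from Lucene's API.

```java
public class MaxInnerProductScaling {

    // Hypothetical helper (not Lucene's API): maps a raw dot product to a
    // non-negative score. Negative inputs land in (0, 1); non-negative
    // inputs land in [1, +inf). The formulas are an assumed instance of
    // the ranges described in the commit message.
    static float scaleScore(float dotProduct) {
        if (dotProduct < 0) {
            return 1f / (1f - dotProduct);
        }
        return dotProduct + 1f;
    }

    // Plain dot product of two equal-length vectors (no normalization).
    static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1f, 2f};
        float[] doc = {3f, -4f};
        float raw = dot(query, doc);
        System.out.println("raw dot product: " + raw);
        System.out.println("scaled score:    " + scaleScore(raw));
    }
}
```

The mapping is monotonic, so ranking by scaled score preserves the ranking by raw dot product while keeping every score non-negative.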

36626 Commits

All Branches