lucene

Commit Graph

Author	SHA1	Message	Date
Jakub Slowinski	74865bb92a	Removing all deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) (#13617 ) These are already marked for deprecation in 9.x and we previously removed all internal use of these methods in 10.0. Closes #13499	2024-07-30 11:33:16 +02:00
Jakub Slowinski	8a7d4842cc	Remove usage of TopScoreDocCollector + TopFieldCollector deprecated methods (#create, #createSharedManager) (#13500 ) These methods were deprecated in #240 which is part of Lucene 10.0. Addresses #13499	2024-07-29 11:05:55 +02:00
Peter Gromov	481ca2d30f	hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions (#13612 ) * hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions	2024-07-27 21:39:00 +02:00
Adrien Grand	8d4f7a6e99	Bump the window size of disjunction from 2,048 to 4,096. (#13605 ) It's been pointed out multiple times that a difference between Tantivy and Lucene is the fact that Tantivy uses windows of 4,096 docs when Lucene has a 2x smaller window size of 2,048 docs and that this might explain part of the performance difference. luceneutil suggests that bumping the window size to 4,096 does indeed improve performance for counting queries, but not for top-k queries. I'm still suggesting to bump the window size across the board to keep our disjunction scorer consistent.	2024-07-25 15:38:21 +02:00
Chris Hegarty	b4fb425c43	Aggregate files from the same segment into a single Arena (#13570 ) This commit adds a ref counted shared arena to support aggregating segment files into a single Arena.	2024-07-25 10:12:02 +01:00
Armin Braun	4c1d50d8e8	Save allocating some zero length byte arrays (#13608 ) Something I found in a heap dump. For large numbers of `FieldReader` where the minimum term is an empty string, we allocate MBs worth of empty `byte[]` in ES. Worth adding the conditional here I think.	2024-07-24 21:53:30 +02:00
Adrien Grand	acbd714140	Further reduce the search concurrency overhead. (#13606 ) This iterates on #13546 to further reduce the overhead of search concurrency by caching whether the hit count threshold has been reached: once the threshold has been reached, it cannot get "un-reached" again, so we don't need to pay the cost of `LongAdder#longValue`.	2024-07-24 14:58:53 +02:00
Luca Cavanna	97d066dd6b	Update TestTopDocsMerge to not rely on search(Query, Collector) (#13601 ) Relates to #12892	2024-07-24 13:07:08 +02:00
Luca Cavanna	d491dfe131	Update TestTopDocsCollector to no longer rely on the deprecated search(Query, Collector) (#13600 )	2024-07-24 13:06:31 +02:00
Dzung Bui	97d89c661f	Refactor FST.saveMetadata() to FSTMetadata.save() (#13549 ) * lazily write the FST padding byte * Also write the pad byte when there is emptyOutput * add comment * Make Lucene90BlockTreeTermsWriter to write FST off-heap * Add change log * Tidy code & Add comments * use temp IndexOutput for FST writing * Use IOUtils to delete files * Update CHANGES.txt * Update CHANGES.txt	2024-07-22 12:14:53 -04:00
Michael McCandless	af9a2b9803	Add simple tool to diff entries in lucene's CHANGES.txt that should be identical (#12860 ) * add simple tool to diff entries in lucene's CHANGES.txt that should be identical * remove temporary debugging code	2024-07-22 11:37:34 -04:00
Mike McCandless	1c3925cb15	reconcile main's copy to match 9.10.0 released CHANGES.txt entry	2024-07-22 11:37:01 -04:00
Ignacio Vera	7709f575ef	Align doc value skipper interval boundaries when an interval contains a constant value (#13597 ) Keep adding documents to a skipper interval while it is dense and single valued.	2024-07-22 14:52:44 +02:00
Adrien Grand	cc3b412183	Fix test failure	2024-07-19 15:18:21 +02:00
Ignacio Vera	9f991ed07e	Add levels to DocValues skipper index (#13563 ) Adding levels to be able to skip several intervals in one step.	2024-07-19 11:20:16 +02:00
zhouhui	c245ed2fb4	Remove useless todo. (#13589 )	2024-07-19 10:12:52 +02:00
Nhat Nguyen	b42fd8e479	Avoid wrap readers without soft-deletes (#13588 ) I analyzed a heap dump of Elasticsearch where FixedBitSet uses more than 1GB of memory. Most of these FixedBitSets are used by soft-deletes reader wrappers, even though these segments have no deletes at all. I believe these segments previously had soft-deletes, but these deletes were pruned by merges. The reason we wrap soft-deletes is that the soft-deletes field exists. Since these segments had soft-deletes previously, we carried the field-infos into the new segment. Ideally, we should have ways to check whether the returned docValues iterator is empty or not so that we can avoid allocating FixedBitSet completely, or we should prune fields without values after merges.	2024-07-18 22:47:44 -07:00
Adrien Grand	00c9d9a03c	Fix test failures due to #13517 .	2024-07-18 11:52:16 +02:00
Adrien Grand	9f040864a6	Add a `targetSearchConcurrency` parameter to `LogMergePolicy`. (#13517 ) This adds the same `targetSearchConcurrency` parameter to `LogMergePolicy` that #13430 is adding to `TieredMergePolicy`.	2024-07-18 11:28:35 +02:00
Adrien Grand	fff997f801	Stop requiring MaxScoreBulkScorer's outer window from having at least INNER_WINDOW_SIZE docs. (#13582 ) Currently `MaxScoreBulkScorer` requires its "outer" window to be at least `WINDOW_SIZE`. The intuition there was that we should make sure we use the whole range of the bit set that we are using to collect matches. The downside is that it may force us to use an upper level in the skip list that has worse upper bounds for the scores.	2024-07-18 11:23:19 +02:00
Carlos Delgado	22ca695ef5	Add target search concurrency to TieredMergePolicy (#13430 )	2024-07-17 13:54:19 +02:00
Chris Hegarty	99488b2245	Ensure to use IOContext.READONCE when reading segment files (#13574 ) This commit uses IOContext.READONCE in more places where the index input is clearly being read once by the thread opening it. We can then enforce that segment files are only opened with READONCE, in the test specific Mock directory wrapper. Much of the changes in this PR update individual test usage, but there is one non-test change to Directory::copyFrom.	2024-07-17 08:56:11 +01:00
Mayya Sharipova	5e52b8094a	Add IntervalsSource for range and regexp queries (#13562 ) We already have convenient functions for constructing IntervalsSource for wildcard and fuzzy functions. This adds functions for regexp and range as well.	2024-07-14 09:48:16 -04:00
Christine Poerschke	c55d664b3e	in KnnVectorsWriter reduce code duplication w.r.t. MergedVectorValues.merge(Float|Byte)VectorValues (#13539 ) Co-authored-by: Vigya Sharma <vigyaspeaks@gmail.com>	2024-07-12 10:48:10 +01:00
Christine Poerschke	cc14555395	Lucene99HnswVectorsReader[.readFields] readability tweaks (#13532 ) * remove unnecessary readFields parameter * consistently use this. in constructor * align declare and init order	2024-07-12 09:49:32 +01:00
Michael Sokolov	8d1e624a67	Add HnswGraphBuilder.getCompletedGraph() and record completed state (#13561 )	2024-07-11 11:44:49 -04:00
Chris Hostetter	cc854f408f	WordBreakSpellChecker.suggestWordBreaks now does a breadth first search (#12100 )	2024-07-10 11:07:30 -07:00
Jakub Slowinski	49e781084a	Minor cleanup in some Facet tests (#13489 )	2024-07-10 17:05:09 +01:00
Benjamin Trent	428fdb5291	Reduce heap usage for knn index writers (#13538 ) * Reduce heap usage for knn index writers * iter * fixing heap usage & adding changes * javadocs	2024-07-10 10:28:48 -04:00
Adrien Grand	026d661e5f	Use `IndexInput#prefetch` for terms dictionary lookups. (#13359 ) This introduces `TermsEnum#prepareSeekExact`, which essentially calls `IndexInput#prefetch` at the right offset for the given term. Then it takes advantage of the fact that `BooleanQuery` already calls `Weight#scorerSupplier` on all clauses, before later calling `ScorerSupplier#get` on all clauses. So `TermQuery` now calls `TermsEnum#prepareSeekExact` on `Weight#scorerSupplier` (if scores are not needed), which in turn means that the I/O of all terms dictionary lookups gets parallelized across all term queries of a `BooleanQuery` on a given segment (intra-segment parallelism).	2024-07-10 15:36:35 +02:00
Chris Hegarty	da41215a67	Use a confined Arena for IOContext.READONCE (#13535 ) Use a confined Arena for IOContext.READONCE. This change will require inputs opened with READONCE to be consumed and closed on the creating thread. Further testing and assertions can be added as a follow up.	2024-07-10 09:39:35 +01:00
zhouhui	ef215d87ab	Lookup next when current doc is deleted in PerThreadPKLookup.lookup (#13556 )	2024-07-10 10:14:14 +02:00
Benjamin Trent	9bfde5514c	Fix quantized vector writer ram estimates (#13553 ) * Fix quantized vector writer ram estimates * add test & changes	2024-07-09 13:00:10 -04:00
Jakub Slowinski	295c5d3576	GITHUB#13175: Stop double-checking priority queue inserts in some FacetCount classes (#13488 ) * GITHUB#13175: Stop double-checking priority queue inserts Removing 2 cases of bottomX optimizations where insertWithOverflow already handles the check. Closes #13175 * Update CHANGES.txt --------- Co-authored-by: Jakub Slowinski <jslowins@amazon.com>	2024-07-09 10:25:02 -04:00
Ignacio Vera	392ddc154f	Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index (#13550 ) This commit makes it possible to dynamically configure the interval size for the doc values skipper for testing, and adds a new test suite that changes the interval size randomly.	2024-07-09 15:46:15 +02:00
ChrisHegarty	4baaedaa67	Add link to OpenJDK JIRA issue for VectorUtil::xorBitCount	2024-07-09 12:17:21 +01:00
Patrick Zhai	ceb4539609	Refactor and javadoc update for KNN vector writer classes (#13548 )	2024-07-08 13:04:27 -07:00
Chris Hegarty	3304b60c9c	Improve VectorUtil::xorBitCount perf on ARM (#13545 ) This commit improves the performance of VectorUtil::xorBitCount on ARM by ~4x. This change is effectively a workaround for the lack of vectorization of Long::bitCount on ARM. On x64 there is no issue, the long variant of xorBitCount outperforms the int variant by ~15%.	2024-07-08 17:30:45 +01:00
Armin Braun	9e04cb9c41	Override single byte writes to OutputStreamIndexOutput to remove locking (#13543 ) Single byte writes to BufferedOutputStream show up pretty hot in indexing benchmarks. We can save the locking overhead introduced by JEP374 by overriding and providing a no-lock fastpath.	2024-07-08 10:59:50 +02:00
Armin Braun	675772546c	Optimize MaxScoreBulkScorer (#13544 ) Don't use Comparator.comparingDouble(...) in a hotish loop here, it causes allocations that escape analysis is not able to remove. => let's just manually inline this to get predictable behavior and save up to 0.5% of all allocations in some benchmark runs.	2024-07-08 10:53:26 +02:00
Armin Braun	2a8d328ab2	Replace AtomicLong with LongAdder in HitsThresholdChecker (#13546 ) The value for the global count is incremented a lot more than it is read, the space overhead of LongAdder seems irrelevant => let's use LongAdder. The performance gain grows with the number of threads, and is already very visible in benchmarks at 4 threads.	2024-07-08 10:52:28 +02:00
Armin Braun	62e08f5f4b	TaskExecutor should not fork unnecessarily (#13472 ) When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1 to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no longer required. Previously, a separate executor was required to prevent deadlock, and all the tasks were offloaded to it unconditionally, wasting resources in some scenarios due to unnecessary forking, and the caller thread having to wait for all tasks to be completed anyway. It can now actively contribute to the execution as well.	2024-07-04 11:15:26 +02:00
Christine Poerschke	f4cd4b46fc	Lucene99HnswVectorsReader.search float-vs-byte variants: reduce code duplication (#13529 ) * Lucene99HnswVectorsReader.search float-vs-byte variants: reduce code duplication * action review feedback: use org.apache.lucene.util.IOSupplier	2024-07-01 17:32:04 +01:00
Christine Poerschke	0ad270d8b0	[Abstract]Knn[Byte|Float]VectorQuery tweaks: reduce duplicate method calls (#13528 ) * reduce LeafReaderContext.reader()[.maxDoc()] calls in AbstractKnnVectorQuery.getLeafResults * reduce IndexReader.leaves() calls in AbstractKnnVectorQuery.findSegmentStarts * reduce LeafReaderContext.reader() calls in Knn(Byte|Float)VectorQuery.approximateSearch	2024-07-01 17:31:02 +01:00
zhouhui	3cd406e783	Remove unused segNo calculation in IndexWriter.doFlush (#13491 )	2024-07-01 17:29:10 +01:00
Stefan Vodita	5f91d609ea	Make Gradle dashboard easy to find by adding a badge (#13476 )	2024-07-01 09:09:51 +01:00
Benjamin Trent	19fe1a56f7	Fix more vector similarity query tests (#13530 )	2024-06-29 13:54:32 -04:00
Adrien Grand	44ad4d95c6	Add bw tests for block-tree with inlined metadata. (#13527 ) The backport of #13524 found a hole in the testing of `Lucene40BlockTreeTerms` for versions before we moved metadata to its own file. This PR adds explicit bw testing for this version. Adding the correct if/else statements made the code extremely complicated so I opted for restoring the file as it was at the time when we bumped the version. This also fixes the bug that we introduced in #13524.	2024-06-28 21:29:00 +02:00
Ignacio Vera	f8ee339f64	Add back-compat indices for 9.11.1	2024-06-27 16:23:03 +02:00
Ignacio Vera	2aec233b5c	Sync CHANGES for 9.11.1	2024-06-27 16:14:36 +02:00
