- Ant is no longer used as the build system for Lucene
- JUnit is not packaged in a Lucene release
- The Float16Converter was removed before the PR it was used in was merged:
https://github.com/apache/lucene-solr/pull/2108
If we only have custom-case uART and capitalized UART, we shouldn't accept StandUart as a compound (although we keep hidden "Uart" dictionary entries for internal purposes).
After upgrading Elasticsearch to a recent Lucene snapshot, we observed a few
indexing slowdowns when indexing with low numbers of cores. This appears to be
because we lost too much of the bias towards larger DWPTs in
apache/lucene#12199. This change tries to add back more ordering by adjusting
the concurrency of `DWPTPool` to the number of cores available on the
local node.
Given an input text 'A B A C A B C' and the search ORDERED(A, B, C), we should
retrieve hits [0,3] and [4,6]; currently [4,6] is skipped.
After finding the first interval [0,3], the subintervals become A:[0,0], B:[1,1],
C:[3,3]; then the algorithm tries to minimize it and the subintervals
become A:[2,2], B:[5,5], C:[3,3] (after finding 5 > 3 it breaks out of the minimization).
When finding the next interval, it does advance(B) before checking whether
B is after A (the do-while loop), so the subintervals become A:[2,2], B:[inf,inf],
C:[3,3] and it returns NO_MORE_INTERVALS.
This commit instead continues advancing subintervals from where the last
`nextInterval` call stopped, rather than always advancing all subintervals.
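A minimal reproduction sketch, assuming a field named "body" whose analyzer lowercases terms; the query construction uses the standard intervals API:

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.Query;

// For a document containing "A B A C A B C", this query should produce both
// [0,3] and [4,6] once subintervals resume from the previous stopping point.
Query query = new IntervalQuery("body",
    Intervals.ordered(Intervals.term("a"), Intervals.term("b"), Intervals.term("c")));
```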
Currently we only partially reuse postings enums when flushing sorted indexes:
we still create new wrapper instances every time, which can be costly for
fields that have many terms.
Obtaining a DWPT and putting it back into the pool is subject to contention.
This change reduces contention by using 8 sub pools that are tried sequentially.
When applied on top of #12198, this reduces the time to index geonames with 20
threads from ~19s to ~16-17s.
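A minimal sketch of the striping idea, not the actual `DWPTPool` code; the class and field names are illustrative:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Split one contended pool into several sub pools and try them sequentially,
// starting at a per-thread offset so that threads spread across stripes.
final class StripedPool<T> {
  private static final int STRIPES = 8;
  private final ConcurrentLinkedQueue<T>[] stripes;

  @SuppressWarnings("unchecked")
  StripedPool() {
    stripes = new ConcurrentLinkedQueue[STRIPES];
    for (int i = 0; i < STRIPES; i++) {
      stripes[i] = new ConcurrentLinkedQueue<>();
    }
  }

  T poll() {
    int start = (int) (Thread.currentThread().getId() % STRIPES);
    for (int i = 0; i < STRIPES; i++) {
      T item = stripes[(start + i) % STRIPES].poll();
      if (item != null) {
        return item;
      }
    }
    return null; // all stripes empty: the caller creates a fresh DWPT
  }

  void offer(T item) {
    stripes[(int) (Thread.currentThread().getId() % STRIPES)].offer(item);
  }
}
```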
This switches from TimSorter to LSBRadixSorter for sorting postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference when the index order matches the sort order or its reverse, but
almost a 3x speedup for writing postings when the index order is mostly
random.
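A hedged sketch of how the radix sort is applied; `sortDocs` is an illustrative helper, while `LSBRadixSorter` and `PackedInts.bitsRequired` are existing Lucene utilities:

```java
import org.apache.lucene.util.LSBRadixSorter;
import org.apache.lucene.util.packed.PackedInts;

// Doc IDs are small non-negative ints, so an LSB radix sort over only the bits
// actually in use beats a comparison sort when the input order is mostly random.
static void sortDocs(int[] docs, int len, int maxDoc) {
  new LSBRadixSorter().sort(PackedInts.bitsRequired(maxDoc), docs, len);
}
```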
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with
many indexing threads, 20 in my case. The main offender is
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index
operation to update doc and RAM accounting.
This change reduces contention by only updating RAM accounting if the amount of
RAM consumption that has not been committed yet by a single DWPT is at least
0.1% of the total RAM buffer size. This effectively batches updates to RAM
accounting, similarly to what happens when using `IndexWriter#addDocuments` to
index multiple documents at once. Since updates to RAM accounting may be
batched, `FlushPolicy` can no longer distinguish between inserts, updates and
deletes, so all 3 methods got merged into a single one.
With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender
for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
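A simplified sketch of the batching idea, not the actual `DocumentsWriterFlushControl` code; names are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Each DWPT accumulates uncommitted bytes locally and only publishes them to
// the shared counter once they reach 0.1% of the global RAM buffer size, so
// the contended atomic update becomes rare.
final class BatchedRamAccounting {
  private final AtomicLong globalBytes = new AtomicLong();
  private final long commitThreshold; // 0.1% of the RAM buffer size
  private long uncommittedBytes; // per-DWPT, only touched by the owning thread

  BatchedRamAccounting(long ramBufferBytes) {
    this.commitThreshold = ramBufferBytes / 1000;
  }

  void afterDocument(long bytesUsed) {
    uncommittedBytes += bytesUsed;
    if (uncommittedBytes >= commitThreshold) {
      globalBytes.addAndGet(uncommittedBytes);
      uncommittedBytes = 0;
    }
  }
}
```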
Co-authored-by: Simon Willnauer <simonw@apache.org>
In Lucene 5.4.0, commit 62313b83ba9c69379e1f84dffc881a361713ce9 introduced some changes for immutability of queries. The setBoost() method was replaced with new BoostQuery(), but BoostQuery is not handled in the setSlop() function. This commit adds handling of BoostQuery in the setSlop() function.
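A hedged sketch of the shape of the fix; `applySlop` is an illustrative helper, not the parser's actual method:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

// If an earlier rewrite wrapped the phrase query in a BoostQuery, unwrap it,
// apply the slop to the inner query, and re-apply the original boost.
static Query applySlop(Query q, int slop) {
  if (q instanceof BoostQuery) {
    BoostQuery bq = (BoostQuery) q;
    return new BoostQuery(applySlop(bq.getQuery(), slop), bq.getBoost());
  } else if (q instanceof PhraseQuery) {
    PhraseQuery pq = (PhraseQuery) q;
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.setSlop(slop);
    Term[] terms = pq.getTerms();
    int[] positions = pq.getPositions();
    for (int i = 0; i < terms.length; i++) {
      builder.add(terms[i], positions[i]);
    }
    return builder.build();
  }
  return q;
}
```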
GH#11744 deprecated LongValueFacetCounts#getTopChildrenSortByCount in favor
of the standard Facets#getTopChildren. The issue is that #getTopChildrenSortByCount
didn't do any input validation and allowed for topN == 0, while #getTopChildren
does input validation. Randomized testing could produce topN values of 0, which
resulted in failed tests. This fixes those tests.
* Define inputs and outputs for task validateJarLicenses
* Lazily configure validateJarLicenses
* Move functionality from copyTestResources task into processTestResources task
* Lazily configure processTestResources
* Altered TestCustomAnalyzer.testStopWordsFromFile() to find resources in updated location
* Resolve "overlapping output" issue preventing processTestResources from being cached
* Provide system properties from CommandLineArgumentProviders
* Configure certain system properties as inputs to take advantage of UP-TO-DATE checking
* Apply the correct pathing strategies to take full advantage of caching even if builds are executed from different locations on disk
* Make validateSourcePatterns task cacheable by removing .gradle directory from its input
- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions
---------
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 or fewer terms match the query. Otherwise, the postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery.
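A usage sketch of the opt-in, assuming a Lucene version where the rewrite method can be passed at construction time (older versions use `MultiTermQuery#setRewriteMethod` instead); field and prefix values are illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

// Opt back into the full filter rewrite when the query is expected to consume
// most or all matching documents.
PrefixQuery query = new PrefixQuery(new Term("id", "2023"), MultiTermQuery.CONSTANT_SCORE_REWRITE);
```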
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
The default implementation of merging doc values resolves the ordinal of a
document in `nextDoc()`. But sometimes, doc values iterators are consumed
without retrieving ordinals, e.g. to write the set of documents that have a
value, so this may be wasteful.
With this change, ordinals get resolved lazily upon `ordValue()`.
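A simplified sketch of the laziness, with illustrative names rather than the actual merge code:

```java
import java.io.IOException;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.LongValues;

// nextDoc() only advances the underlying iterator; the ordinal remapping is
// deferred to the first ordValue() call for the current document.
final class LazyOrdResolver {
  private final SortedDocValues current;
  private final LongValues ordMap; // maps sub-reader ords to merged ords
  private boolean ordResolved;
  private int ord;

  LazyOrdResolver(SortedDocValues current, LongValues ordMap) {
    this.current = current;
    this.ordMap = ordMap;
  }

  int nextDoc() throws IOException {
    ordResolved = false; // do not touch ords here
    return current.nextDoc();
  }

  int ordValue() throws IOException {
    if (ordResolved == false) {
      ord = (int) ordMap.get(current.ordValue());
      ordResolved = true;
    }
    return ord;
  }
}
```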
`LogMergePolicy` has this boundary at the floor level that prevents merging
segments above the minimum segment size with segments below this size. I cannot
see a benefit from doing this, and no tests fail if I remove it, while this
boundary has the downside of not running merges that seem legit to me. Should
we remove this boundary check?
Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too, so even a little overhead can significantly impact indexing speed.
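For illustration, this is the kind of field the change targets; a `StringField` produces exactly one token, so the full `TokenStream` machinery is pure overhead for it:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// A single-token keyword field: indexing it amounts to one hash-map insert
// plus one postings append.
Document doc = new Document();
doc.add(new StringField("id", "product-42", Field.Store.NO));
```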
The two minor performance improvements are around `count` and `Weight#scorer`.
segmentStarts holds a monotonically increasing start offset for each segment's scored documents, indexed by leaf-segment ordinal. Consequently, if a segment's upper and lower starts are equal, no docs match that segment.
Count is similarly calculated as the difference between the upper and lower segmentStarts entries for the segment ordinal.
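A worked sketch of the bookkeeping, following the array name from the prose; the helper is illustrative:

```java
// segmentStarts[ord] is the offset of the first scored doc for the leaf with
// ordinal ord, so a leaf's hit count is a simple difference; zero means no
// docs match in that segment.
static int count(int[] segmentStarts, int leafOrd) {
  return segmentStarts[leafOrd + 1] - segmentStarts[leafOrd];
}
```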
Similar to the use of ScorerSupplier in #12129, implement it here too,
because creating a Scorer requires lookupTerm() operations in the DV
terms dictionary. This is wasted effort (random accesses) if, based on the cost(),
IndexOrDocValuesQuery decides not to use this query.
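A hedged sketch of the pattern; `buildScorer()` and `estimatedCost` are illustrative stand-ins:

```java
import java.io.IOException;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;

// cost() stays cheap (no terms-dictionary access); the expensive lookupTerm()
// calls are deferred to get(), which IndexOrDocValuesQuery only invokes for
// the clause it actually decides to execute.
ScorerSupplier supplier = new ScorerSupplier() {
  @Override
  public Scorer get(long leadCost) throws IOException {
    return buildScorer(); // performs the lookupTerm() calls lazily
  }

  @Override
  public long cost() {
    return estimatedCost;
  }
};
```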
The helper class DocAndScoreQuery implements advanceShallow to help skip
non-competitive documents. This method doesn't actually keep track of where it
has advanced, which means it can do extra work.
Overall the complexity here didn't seem worth it, given the low cost of
collecting matching kNN docs. This PR switches to a simpler approach, which uses
a fixed upper bound on the max score.
This change adjusts the cache policy to ensure that all segments in the
max tier can be cached. Before, we cached segments that have more than 3%
of the total documents in the index; now we cache segments that have more
than half of the average number of documents per leaf of the index.
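A worked example of the new threshold, with illustrative variable names: for an index of 1,000,000 docs across 20 leaves, the average leaf holds 50,000 docs, so leaves with more than 25,000 docs are now cacheable, whereas the old 3% rule required 30,000:

```java
// New rule: cache a segment if it holds more than half the average number of
// documents per leaf.
boolean cacheable = leafDocCount > totalDocCount / (2.0 * leafCount);
```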
Closes #12140
The test assumes a single segment is created (because only one scorer is created from the leaf contexts).
But force merging wasn't done before getting the reader. Force-merge to a single segment before getting the reader.
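A minimal sketch of the fix, with illustrative variable names:

```java
// Merge down to a single segment before opening the reader so that the test
// really sees exactly one leaf context.
writer.forceMerge(1);
DirectoryReader reader = DirectoryReader.open(writer);
```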
In TestUnifiedHighlighterTermVec, we have a special reader which counts the
number of times term vectors are accessed, so that we can assert that caching
works correctly here. There is some special logic in place to skip the check
when the test framework wraps readers with CheckIndex or ParallelReader;
however, this logic no longer works with ParallelReader in particular, because
term vectors are now accessed through an anonymous class.
A simpler solution here is to call newSearcher(reader, false), which disables
wrapping, meaning that we can remove this extra logic entirely.
Fixes #12115
* Remove implicit addition of vector 0
Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Enable out of order insertion of nodes in hnsw
Enables nodes to be added into OnHeapHnswGraph in an out-of-order fashion.
To do so, additional operations have to be taken to re-sort the
nodesByLevel array (see the sketch after this entry). Optimizations have
been made to avoid sorting whenever possible.
Signed-off-by: John Mazanec <jmazane@amazon.com>
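A hedged sketch of the re-sorting step mentioned above; this is illustrative, not the actual OnHeapHnswGraph code:

```java
import java.util.Arrays;

// Keep nodesByLevel sorted by inserting each node at its binary-search
// position; when nodes arrive in order this degenerates to an append.
static int insertSorted(int[] nodesByLevel, int size, int node) {
  int idx = Arrays.binarySearch(nodesByLevel, 0, size, node);
  int insertionPoint = idx < 0 ? -(idx + 1) : idx;
  System.arraycopy(nodesByLevel, insertionPoint, nodesByLevel, insertionPoint + 1, size - insertionPoint);
  nodesByLevel[insertionPoint] = node;
  return size + 1;
}
```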
* Add ability to initialize from graph
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Utilize merge with graph init in HNSWWriter
Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Minor modifications to Lucene95HnswVectorsWriter
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Use TreeMap for graph structure for levels > 0
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Refactor initializer to be in static create method
Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Address review comments
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Add change log entry
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Remove empty iterator for NeighborQueue
Signed-off-by: John Mazanec <jmazane@amazon.com>
---------
Signed-off-by: John Mazanec <jmazane@amazon.com>