* Hunspell: reduce suggestion set dependency on the hash table order
When adding words to a dictionary, suggestions for other words shouldn't change unless they're directly related to the added words.
Previously, GeneratingSuggester selected the 100 best matches encountered first in hash-table order, which can change significantly after adding any unrelated word.
That resulted in unexpected suggestion changes on seemingly unrelated dictionary edits.
If we only have custom-case uART and capitalized UART, we shouldn't accept StandUart as a compound (although we keep hidden "Uart" dictionary entries for internal purposes).
After upgrading Elasticsearch to a recent Lucene snapshot, we observed a few
indexing slowdowns when indexing with low numbers of cores. This appears to be
due to the fact that we lost too much of the bias towards larger DWPTs in
apache/lucene#12199. This change tries to add back more ordering by adjusting
the concurrency of `DWPTPool` to the number of cores that are available on the
local node.
Given an input text 'A B A C A B C' and search ORDERED(A, B, C), we should
retrieve hits [0,3] and [4,6]; currently [4,6] is skipped.
After finding the first interval [0,3], the subintervals will become A:[0,0], B:[1,1],
C:[3,3]; then the algorithm will try to minimize it and the subintervals will
become A:[2,2], B:[5,5], C:[3,3] (after finding 5 > 3 it breaks the minimization).
When finding the next interval, it will do advance(B) before checking whether
it is after A (the do-while loop), so the subintervals will become A:[2,2], B:[inf,inf],
C:[3,3] and return NO_MORE_INTERVALS.
This commit instead continues advancing subintervals from where the last
`nextInterval` call stopped, rather than always advancing all subintervals.
Currently we only partially reuse postings enums when flushing sorted indexes,
as we still create new wrapper instances every time, which can be costly for
fields that have many terms.
Obtaining a DWPT and putting it back into the pool is subject to contention.
This change reduces contention by using 8 sub pools that are tried sequentially.
When applied on top of #12198, this reduces the time to index geonames with 20
threads from ~19s to ~16-17s.
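A minimal sketch of the striping idea with made-up names (the real `DocumentsWriterPerThreadPool` change differs in detail): a single contended queue is split into eight sub-queues that are tried one after another.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative only: spread pool contention across 8 sub-queues that are tried sequentially.
final class StripedPool<T> {
  private static final int STRIPES = 8;

  @SuppressWarnings("unchecked")
  private final ConcurrentLinkedDeque<T>[] queues = new ConcurrentLinkedDeque[STRIPES];

  StripedPool() {
    for (int i = 0; i < STRIPES; i++) {
      queues[i] = new ConcurrentLinkedDeque<>();
    }
  }

  /** Try each stripe in turn, starting from a thread-dependent offset. */
  T poll() {
    int start = (int) (Thread.currentThread().getId() % STRIPES);
    for (int i = 0; i < STRIPES; i++) {
      T item = queues[(start + i) % STRIPES].pollFirst();
      if (item != null) {
        return item;
      }
    }
    return null; // caller would create a fresh DWPT in this case
  }

  void release(T item) {
    queues[(int) (Thread.currentThread().getId() % STRIPES)].addFirst(item);
  }
}
```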
This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with
many indexing threads, 20 in my case. The main offender is
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index
operation to update doc and RAM accounting.
This change reduces contention by only updating RAM accounting if the amount of
RAM consumption that has not been committed yet by a single DWPT is at least
0.1% of the total RAM buffer size. This effectively batches updates to RAM
accounting, similarly to what happens when using `IndexWriter#addDocuments` to
index multiple documents at once. Since updates to RAM accounting may be
batched, `FlushPolicy` can no longer distinguish between inserts, updates and
deletes, so all 3 methods got merged into a single one.
With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender
for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
Co-authored-by: Simon Willnauer <simonw@apache.org>
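A rough sketch of the batching idea with hypothetical names (not the actual `DocumentsWriterFlushControl` code): the per-DWPT delta is only published to the shared accounting once it reaches 0.1% of the total RAM buffer.

```java
// Illustrative only: batch RAM-accounting updates instead of publishing after every document.
final class RamAccountingBatcher {
  private final long totalBudgetBytes;
  private long uncommittedBytes; // RAM consumed by this DWPT but not yet published

  RamAccountingBatcher(long totalBudgetBytes) {
    this.totalBudgetBytes = totalBudgetBytes;
  }

  /** Returns the bytes to publish to the shared flush accounting, or 0 to keep batching. */
  long afterDocument(long bytesConsumed) {
    uncommittedBytes += bytesConsumed;
    if (uncommittedBytes >= totalBudgetBytes / 1000) { // 0.1% threshold
      long toPublish = uncommittedBytes;
      uncommittedBytes = 0;
      return toPublish;
    }
    return 0;
  }
}
```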
In Lucene 5.4.0, commit 62313b83ba9c69379e1f84dffc881a361713ce9 introduced some changes for immutability of queries. The setBoost() function was replaced with new BoostQuery(), but BoostQuery is not handled in the setSlop() function. This commit adds handling of BoostQuery in setSlop().
GH#11744 deprecated LongValueFacetCounts#getTopChildrenSortByCount in favor
of the standard Facets#getTopChildren. The issue is that #getTopChildrenSortByCount
didn't do any input validation and allowed for topN == 0, while #getTopChildren
does input validation. Randomized testing could produce topN values of 0, which
resulted in failed tests. This change fixes those tests.
* Define inputs and outputs for task validateJarLicenses
* Lazily configure validateJarLicenses
* Move functionality from copyTestResources task into processTestResources task
* Lazily configure processTestResources
* Altered TestCustomAnalyzer.testStopWordsFromFile() to find resources in updated location
* Resolve "overlapping output" issue preventing processTestResources from being cached
* Provide system properties from CommandLineArgumentProviders
* Configure certain system properties as inputs to take advantage of UP-TO-DATE checking
* Applies the correct pathing strategies to take full advantage of caching even if builds are executed from different locations on disk
* Make validateSourcePatterns task cacheable by removing .gradle directory from its input
- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions
---------
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 or fewer terms match the query. Otherwise postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
The default implementation of merging doc values resolves the ordinal of a
document in `nextDoc()`. But sometimes, doc values iterators are consumed
without retrieving ordinals, e.g. to write the set of documents that have a
value, so this may be wasteful.
With this change, ordinals get resolved lazily upon `ordValue()`.
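An illustrative sketch of the laziness (not the actual merge wrapper): the ordinal is only resolved and cached when `ordValue()` is called, not in `nextDoc()`.

```java
// Illustrative only: resolve the ordinal on demand instead of eagerly on nextDoc().
abstract class LazyOrdIterator {
  private int cachedOrd = -1;
  private boolean ordResolved;

  final int nextDoc() {
    ordResolved = false; // invalidate the cached ordinal for the new document
    return doNextDoc();
  }

  final int ordValue() {
    if (ordResolved == false) {
      cachedOrd = resolveOrd(); // only pay this cost if a caller actually needs the ordinal
      ordResolved = true;
    }
    return cachedOrd;
  }

  /** Advance the underlying doc values iterator. */
  protected abstract int doNextDoc();

  /** Resolve the ordinal of the current document, e.g. by remapping merged ordinals. */
  protected abstract int resolveOrd();
}
```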
`LogMergePolicy` has this boundary at the floor level that prevents merging
segments above the minimum segment size with segments below this size. I cannot
see a benefit from doing this, and no tests fail if I remove it, while this
boundary has the downside of not running merges that seem legit to me. Should
we remove this boundary check?
Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too so even some low overhead can significantly impact indexing speed.
The two minor performance improvements are around count and Weight#scorer.
segmentStarts holds, for each leaf segment indexed by its ordinal, a monotonically increasing start offset into the array of scored documents. Consequently, if the upper and lower bounds for a segment are equal, no docs match in that segment.
The count is similarly calculated as the difference between the upper and lower segmentStarts entries for the segment ordinal.
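A minimal sketch of that bookkeeping, with hypothetical method and array names (the actual DocAndScoreQuery code differs in detail):

```java
// Hypothetical sketch: segmentStarts[i] is the offset of the first scored doc that
// belongs to leaf i, so the per-leaf hit count is a simple difference.
static int countForLeaf(int[] segmentStarts, int leafOrd) {
  int lower = segmentStarts[leafOrd];
  int upper = segmentStarts[leafOrd + 1];
  return upper - lower; // 0 means no matching docs in this leaf, so Weight#scorer can return null
}
```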
Similar to use of ScorerSupplier in #12129, implement it here too,
because creation of a Scorer requires lookupTerm() operations in the DV
terms dictionary. This results in wasted effort and random accesses if
IndexOrDocValuesQuery decides, based on the cost(), not to use this query.
The helper class DocAndScoreQuery implements advanceShallow to help skip
non-competitive documents. This method doesn't actually keep track of where it
has advanced, which means it can do extra work.
Overall the complexity here didn't seem worth it, given the low cost of
collecting matching kNN docs. This PR switches to a simpler approach, which uses
a fixed upper bound on the max score.
This change adjusts the cache policy to ensure that all segments in the
max tier are cached. Before, we cached segments that have more than 3%
of the total documents in the index; now we cache segments that have more than
half of the average number of documents per leaf of the index.
Closes #12140
The test assumes a single segment is created (because only one scorer is created from the leaf contexts).
But force merging wasn't done before getting the reader. Force-merge to a single segment before getting the reader.
In TestUnifiedHighlighterTermVec, we have a special reader which counts the
number of times term vectors are accessed, so that we can assert that caching
works correctly here. There is some special logic in place to skip the check
when the test framework wraps readers with CheckIndex or ParallelReader;
however, this logic no longer works with ParallelReader in particular, because
term vectors are now accessed through an anonymous class.
A simpler solution here is to call newSearcher(reader, false), which disables
wrapping, meaning that we can remove this extra logic entirely.
Fixes #12115
* Remove implicit addition of vector 0
Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Enable out of order insertion of nodes in hnsw
Enables nodes to be added into OnHeapHnswGraph in an out-of-order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Add ability to initialize from graph
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Utilize merge with graph init in HNSWWriter
Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Minor modifications to Lucene95HnswVectorsWriter
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Use TreeMap for graph structure for levels > 0
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Refactor initializer to be in static create method
Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Address review comments
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Add change log entry
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Remove empty iterator for neighborqueue
Signed-off-by: John Mazanec <jmazane@amazon.com>
---------
Signed-off-by: John Mazanec <jmazane@amazon.com>
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
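A small sketch of the intended usage, assuming the `KeywordField(String, String, Field.Store)` constructor:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KeywordField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

final class KeywordFieldSketch {
  // Before: two parallel fields were needed for filtering plus sorting/faceting.
  static Document before() {
    Document doc = new Document();
    doc.add(new StringField("color", "blue", Field.Store.NO));
    doc.add(new SortedSetDocValuesField("color", new BytesRef("blue")));
    return doc;
  }

  // After: a single KeywordField indexes the term and the doc value together.
  static Document after() {
    Document doc = new Document();
    doc.add(new KeywordField("color", "blue", Field.Store.NO));
    return doc;
  }
}
```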
* Optimize the common case where docs have only a single value for the field
* In the multivalued case, terminate reading doc values once they exceed the maximum ordinal in the set
* Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery and newSlowExactQuery
Like other slow queries in these classes, it's currently only recommended for use in combination with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery(...), newSlowSetQuery(...))
LongHashSet is used for the set of numbers, but it has some issues:
* it tries too hard to extend AbstractSet, mostly for testing
* it causes boxing traps if you aren't careful
* it has complex hashCode/equals
Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration
Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.
In the worst case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
There's no need to make things abstract: DocValues does the right thing.
Optimizing for the case where the segment has no docs for the field is easy: a simple null check (replacing the existing one!)
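A rough sketch of the min/max checks described above, using hypothetical helper methods rather than the actual query code:

```java
import java.io.IOException;
import java.util.function.LongPredicate;
import org.apache.lucene.index.SortedNumericDocValues;

final class MinMaxSketch {
  // Single-valued case: a constant-time range check can avoid hashing entirely.
  static boolean matchesSingleValued(long value, long min, long max, LongPredicate set) {
    if (value < min || value > max) {
      return false;
    }
    return set.test(value);
  }

  // Multivalued case: values arrive in sorted order, so iteration can terminate early.
  static boolean matchesMultiValued(
      SortedNumericDocValues values, long min, long max, LongPredicate set) throws IOException {
    for (int i = 0, count = values.docValueCount(); i < count; i++) {
      long value = values.nextValue();
      if (value > max) {
        return false; // sorted order: nothing later can match
      }
      if (value >= min && set.test(value)) {
        return true;
      }
    }
    return false;
  }
}
```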
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
- If there is a problem, e.g. all of these 3 methods return null, it's
currently discovered late, when we already passed the responsibility of
writing data from IndexingChain to the codec.
- numericValue() is used both for numeric doc values and storage. This makes
it impossible to implement a `double` field that is stored and doc-valued,
as numericValue() needs to return simultaneously a number that consists of
the double for storage, and the long bits of the double for doc values.
- binaryValue() is used both for sorted(_set) doc values and storage. This
  makes it impossible to implement a `keyword` field that is stored and
  doc-valued, as the field returns a non-null value for both binaryValue() and
  stringValue(), and stored fields no longer know which value to store.
This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
Sometimes the randomized Lucene test searcher will wrap the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher, or the reader used to create the weight with the searcher may differ from the one whose leaf context is used for the scorer.
While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled.
A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries.
related to: https://github.com/apache/lucene/issues/11799
Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while
the variable used to store the value is an int. They can just return an int.
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addresses that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.
Note that while support is added to the BufferingKnnVectorsWriter base class, the 90, 91 and 92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when attempting to do so.
Follow-up of #12105 to remove the deprecated classes for the next major version.
Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
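A minimal sketch of the fix, wrapped in a hypothetical helper for illustration:

```java
import org.apache.lucene.index.IndexableField;

final class NumericDocValueSketch {
  static long readNumericDocValue(IndexableField field) {
    // Before: long value = (Long) field.numericValue(); // ClassCastException for IntField's Integer
    return ((Number) field.numericValue()).longValue(); // works for Integer, Long, etc.
  }
}
```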
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.
This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.
As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.
Relates to #11963
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
This method tries to expose an encoded view of vectors, but we shouldn't have
this as part of our user-facing API. With this change, the way vectors are encoded
is entirely up to the codec.
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()
This complements the existing docvalues-based range queries, with a set query.
Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
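A sketch of how the new factories might be used, assuming varargs signatures for the set-query factories (the field name is made up for illustration):

```java
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

final class SetQuerySketch {
  static Query priceInSetConvenience() {
    // Convenience factory added in this change; internally it combines the points-based
    // and doc-values-based set queries with IndexOrDocValuesQuery:
    return LongField.newSetQuery("price", 5L, 10L, 42L);
  }

  static Query priceInSetManual() {
    // Roughly what the factory does under the hood:
    return new IndexOrDocValuesQuery(
        LongPoint.newSetQuery("price", 5L, 10L, 42L),
        SortedNumericDocValuesField.newSlowSetQuery("price", 5L, 10L, 42L));
  }
}
```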
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().
FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.
This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This makes sense when assuming that there is
likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`). However it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
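A sketch of the combination logic described here, written as a standalone helper rather than the actual MultiCollector code:

```java
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreMode;

final class ScoreModeSketch {
  static ScoreMode combine(Collector[] collectors) {
    ScoreMode scoreMode = null;
    for (Collector collector : collectors) {
      if (scoreMode == null) {
        scoreMode = collector.scoreMode();
      } else if (scoreMode != collector.scoreMode()) {
        // Sub-collectors disagree: only require scores if at least one of them needs scores.
        return needsScores(collectors) ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
      }
    }
    return scoreMode;
  }

  private static boolean needsScores(Collector[] collectors) {
    for (Collector collector : collectors) {
      if (collector.scoreMode().needsScores()) {
        return true;
      }
    }
    return false;
  }
}
```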
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This single check covers both whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
When #672 was introduced, it added many nice rewrite optimizations. However, in the case when there are many multiple nested Boolean queries under a top level Boolean#filter clause, its runtime grows exponentially.
The key issue was how BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.
This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.
Closes #12069
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.
This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.
Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.
Closes #12068
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.
This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher factor. Until now we
were allocating a fixed budget of `maxDoc/64` for doing these merges with
external memory. If this is not enough, sorted slices would be merged in place.
I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
- Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
in-place merges for all postings up to `docFreq = maxDoc / 4`.
- Make this RAM budget lazily allocated, rather than eagerly like today. This
would help not allocate memory in O(maxDoc) for fields like primary keys
that only have a couple postings per term.
So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
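A tiny sketch of the optimization (hypothetical helper, not the actual field code):

```java
import java.util.Arrays;

final class SortIfNeeded {
  // Only pay for Arrays#sort when a document actually has more than one value.
  static void sortValues(long[] values, int count) {
    if (count > 1) {
      Arrays.sort(values, 0, count);
    }
  }
}
```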
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md
"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."
It allows us to move forward with newer antlr but hopefully prevent the associated headaches.
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
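For example, callers are expected to obtain the accessors once per thread and reuse them (a sketch, with a made-up helper method):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.index.TermVectors;

final class ReaderAccessSketch {
  static void readDoc(IndexReader reader, int docId) throws IOException {
    StoredFields storedFields = reader.storedFields(); // keep and reuse within a single thread
    Document doc = storedFields.document(docId);

    TermVectors termVectors = reader.termVectors();
    Fields vectors = termVectors.get(docId); // null if no term vectors were indexed for this doc
    System.out.println(doc.getFields().size() + " stored fields, has vectors: " + (vectors != null));
  }
}
```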
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
- `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
- `BufferingKnnVectorsWriter` no longer has a dependency on
`RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
it's more of a utility class for KNN vector file formats than an index API.
Maybe we should think of moving it near each file format that uses it
instead.
- `SortingCodecReader` no longer has a dependency on
`RandomAccessVectorValues`.
Closes #10623
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only supports moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.
This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)
OffsetsFromMatchIterator and OffsetsFromPositions both have package-private
constructors, which makes them difficult to use as components in a
separate highlighter implementation.
`VectorValues` have a `cost()` method that reports an approximate number of
documents that have a vector, but also a `size()` method that reports the
accurate number of vectors in the field. Since KNN vectors only support
single-valued fields we should enforce that `cost()` returns the `size()`.
QueryBuilder#newSynonymQuery takes an array of TermAndBoost objects as a
parameter and uses the field of the first term in the array as its field. However,
there are cases where this array may be empty, which will result in an
ArrayIndexOutOfBoundsException.
This commit reworks QueryBuilder so that TermAndBoost contains plain
BytesRefs, and passes the field as a separate parameter. This guards against
accidental calls to newSynonymQuery with an empty list - in this case, an
empty synonym query is generated rather than an exception. It also
refactors SynonymQuery itself to hold BytesRefs rather than Terms, which
needlessly repeat the field for each entry.
Fixes #11864
This test intentionally does a ton of filesystem operations: currently
about 20% of the time you can get really unlucky and get virus checker
simulated, against a real filesystem, which makes things really slow.
Instead use a ByteBuffersDirectory for local runs so that it doesn't
take minutes. The test can still be pretty slow even with this
implementation, so tone down the runtime so that it takes ~ 1.5s
locally.
The default codec has a number of small and hot files, that actually used to be
fully loaded in memory before we moved them off-heap. In the general case,
these files are expected to fully fit into the page cache for things to work
well. Should we give control over preloading to codecs? This is what this
commit does for the following files:
- Terms index (`tip`)
- Points index (`kdi`)
- Stored fields index (`fdx`)
- Terms vector index (`tvx`)
This only has an effect on `MMapDirectory`.
Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput. This also adds the invalid position while seeking or reading to the exception message.
hunspell: introduce FragmentChecker to speed up ModifyingSuggester
add NGramFragmentChecker to quickly check whether insertions/replacements produce strings that are even possible in the language
Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
This test often takes several minutes with normal runs (no NIGHTLY/multiplier/etc). Tone it down so that it isn't slow: CI builds can work it harder by passing those parameters
Adds createLatLonShapeDocValues and createXYShapeDocValues factory methods
to LatLonShape and XYShape factory classes, respectively.
Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
When reading large segments, the vectors format can fail with a validation
error:
java.lang.IllegalStateException: Vector data length 3070061568 not matching
size=999369 * dim=768 * byteSize=4 = -1224905728
The problem is that we use an integer to represent the size, which is too small
to hold it. The bug snuck in during the work to enable int8 values, which
switched a long value to an int.
When running the `reindex` task with KnnGraphTester, exceptionally large
datasets can be used. Since mmap is used to read the data, we need to know the
buffer size. This size is limited to Integer.MAX_VALUE, which is inadequate for
larger datasets.
So, this commit adjusts the reading to only read a single vector at a time.
We recently introduced support for kNN vectors to `ExitableDirectoryReader`.
Previously, we checked for cancellation not only on sampled calls to `advance`,
but on every single call to `vectorValue`. This can cause significant overhead
when a query scans many vector values (for example the case where you're doing
an exact scan and computing a vector similarity for every matching document).
This PR removes the cancellation checks on `vectorValue`, since having them on
`advance` is already enough.
Move minimum TieredMergePolicy delete percentage from 20% to 5%
and change deletePctAllowed default to 20%
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
Since QueryVisitor added the ability to signal multi-term queries, the query rewrite
call in UnifiedHighlighter has been essentially useless, and with more aggressive
rewriting this is now causing bugs like #11490. We can safely remove this call.
Fixes #11490
This uses Gradle's auto-provisioning to compile Java 19 classes and build a multi-release JAR from them. Please make sure to regenerate gradle.properties (delete it) or change "org.gradle.java.installations.auto-download" to "true"
* Upgrade several build dependencies.
* Update error prone rules (those are off but they do trigger warnings/ errors)
* A few corrections I made before I turned off new warnings. Let's do another issue to fix them.
This method is recursive: to avoid eating too much stack we apply a
small limit. This means it can't really be used on any largish automata
without hitting the exception.
But the benefit of knowing finite vs infinite in AutomatonTermsEnum is
minor: let's not auto-compute this. FuzzyQuery still gets the finite
optimization because it's finite by definition. PrefixQuery is always
infinite. Wildcard/Regex just assume infinite which is safe to do.
Remove the auto-computation and the "trilean" Boolean parameter. If you
don't know that your automaton is finite, pass false to
CompiledAutomaton, it is safe.
Move this method to AutomatonTestUtil so we can still use it in test
asserts.
Closes #11809
The UnifiedHighlighter can throw exceptions when highlighting terms that are longer
than the maximum size the DaciukMihovAutomatonBuilder accepts. Rather than throwing
a confusing exception, we can instead filter out the long terms when building the
MemoryIndexOffsetStrategy. Very long terms are likely to be junk input in any case.
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a segment, as well as merged away, the field info may report
that there are points yet the corresponding point values are null.
With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.
Relates #11393
Introduction of dynamic pruning for string sorts (#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.
This commit fixes the issue and adds tests for ghost fields across all sort
types.
Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (#11393).
IntervalBuilder.NO_INTERVALS should return -1 when unpositioned,
not NO_MORE_DOCS. This can trigger exceptions when an empty
IntervalQuery is combined in a conjunction.
Fixes #11759
This PR removes the recently added function on LeafReader to exhaustively search
through vectors, plus the helper function KnnVectorsReader#searchExhaustively.
Instead it performs the exact search within KnnVectorQuery, using a new helper
class called VectorScorer.
If ConcurrentMergeScheduler is used, and the merge hits a fatal exception (such as disk full) after prepareCommit()'s ensureOpen() check, then startCommit() will throw IllegalStateException instead of AlreadyClosedException.
The test is currently not prepared to handle this: the logic is only geared around exceptions coming from addDocument().
Closes #11755
When indexing term vectors for a very large document, the automatic computation
of the dictionary size based on the overall size of the block might yield a
size that exceeds the maximum window size that is supported by LZ4. This commit
addresses the issue by automatically taking the minimum of the result of this
computation and the maximum window size (64kB).
* Remove usages of System.currentTimeMillis() from tests
- Use Random from `RandomizedRunner` to be able to use a Seed to
reproduce tests, instead of a seed coming from wall clock.
- Replace time-based tests, which used the wall clock to determine periods,
  with a counter of repetitions, to have consistent reproduction.
Closes: #11459
* address comments
* tune iterations
* tune iterations for nightly