lucene

Commit Graph

Author	SHA1	Message	Date
Kaival Parikh	e0d92eef98	Concurrent rewrite for KnnVectorQuery (#12160 ) - Reduce overhead of non-concurrent search by preserving original execution - Improve readability by factoring into separate functions --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>	2023-03-04 01:12:11 -08:00
Greg Miller	569533bd76	Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod (#12175 )	2023-03-01 08:25:44 -08:00
Greg Miller	00910cd6a4	Actually remove TermInSetQuery#getTermData (follow-up on #12173 )	2023-03-01 05:44:24 -08:00
Greg Miller	c8741f7c58	Deprecate TermInSetQuery#getTermData (#12173 )	2023-03-01 05:42:39 -08:00
Greg Miller	3809106602	Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery (#12156 )	2023-03-01 05:20:28 -08:00
Greg Miller	001acaf882	Clone the BytesRef[] values in KeywordField#newSetQuery (#12158 )	2023-02-28 18:30:38 -08:00
Greg Miller	b23b7475e1	Follow up on GH#12055 to remove un-referenced test methods	2023-02-27 16:03:44 -08:00
Adrien Grand	c6667e709f	Better skipping for multi-term queries with a FILTER rewrite. (#12055 ) This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 terms or less match the query. Otherwise postings lists of matching terms are collected into a `DocIdSetBuilder`. This change replaces the latter with a mixed approach where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the `DocIdSetBuilder` that collects all other terms. On fields that have a zipfian distribution, it's quite likely that no high-frequency terms make it to the `DocIdSetBuilder`. This provides two main benefits: - Queries are less likely to allocate a FixedBitSet of size `maxDoc`. - Queries are better at skipping or early terminating. On the other hand, queries that need to consume most or all matching documents may get a slowdown, so users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new default for PrefixQuery, WildcardQuery and TermRangeQuery. Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>	2023-02-27 15:11:31 -08:00
Adrien Grand	38d253fb53	Lazily resolve ordinals when merging. (#12170 ) The default implementation of merging doc values resolves the ordinal of a document in `nextDoc()`. But sometimes, doc values iterators are consumed without retrieving ordinals, e.g. to write the set of documents that have a value, so this may be wasteful. With this change, ordinals get resolved lazily upon `ordValue()`.	2023-02-27 17:38:59 +01:00
Adrien Grand	2d157bd348	Remove LogMergePolicy's boundary at the floor level. (#12113 ) `LogMergePolicy` has this boundary at the floor level that prevents merging segments above the minimum segment size with segments below this size. I cannot see a benefit from doing this, and no tests fail if I remove it, while this boundary has the downside of not running merges that seem legit to me. Should we remove this boundary check?	2023-02-27 17:38:34 +01:00
Adrien Grand	cce33b07e4	Skip the TokenStream overhead when indexing simple keywords. (#12139 ) Indexing simple keywords through a `TokenStream` abstraction introduces a bit of overhead due to attribute management. Not much, but indexing keywords boils down to adding to a hash map and appending to a postings list, which is quite cheap too so even some low overhead can significantly impact indexing speed.	2023-02-21 14:00:11 +01:00
Benjamin Trent	dbfca9a62b	Minor vector search matching doc optimizations (#12152 ) The two minor performance improvements are around count and Weight#scorer. segmentStarts is a monotonically increasing start for each scored document indexed by leaf-segment ordinal. Consequently, if the upper and lower segments are equivalent, that means no docs match for this segment. Count is similarly calculated by the difference between upper and lower segmentStarts according to the segment ordinal.	2023-02-21 07:51:03 -05:00
Greg Miller	7506f8462f	Speed up DocValuesRewriteMethod by making use of sortedness (#12155 )	2023-02-19 08:33:26 -08:00
Robert Muir	3ad2ede395	Implement ScorerSupplier for Sorted(Set)DocValuesField#newSlowRangeQuery (#12132 ) Similar to use of ScorerSupplier in #12129, implement it here too, because creation of a Scorer requires lookupTerm() operations in the DV terms dictionary. This results in wasted effort/random accesses, if, based on the cost(), IndexOrDocValuesQuery decides not to use this query.	2023-02-17 08:25:17 -05:00
Julie Tibshirani	8340b01c3c	Simplify max score for kNN vector queries (#12146 ) The helper class DocAndScoreQuery implements advanceShallow to help skip non-competitive documents. This method doesn't actually keep track of where it has advanced, which means it can do extra work. Overall the complexity here didn't seem worth it, given the low cost of collecting matching kNN docs. This PR switches to a simpler approach, which uses a fixed upper bound on the max score.	2023-02-16 12:03:59 -08:00
Nhat Nguyen	8e15c665be	Ensure caching all leaves from the upper tier (#12147 ) This change adjusts the cache policy to ensure that all segments in the max tier are cached. Before, we cached segments that have more than 3% of the total documents in the index; now we cache segments that have more than half of the average documents per leaf of the index. Closes #12140	2023-02-16 10:41:28 -08:00
Julie Tibshirani	54044a82a0	Improve DocAndScoreQuery#toString (#12148 ) Tiny improvements to DocAndScoreQuery: * Make toString more informative * Remove unnecessary 'k' parameter	2023-02-15 09:50:55 -08:00
Benjamin Trent	7baa01b3c2	Force merge into a single segment before getting the directory reader (#12138 ) The test assumes a single segment is created (because only one scorer is created from the leaf contexts). But, force merging wasn't done before getting the reader. Forcemerge to a single segment before getting the reader.	2023-02-09 11:27:08 -05:00
Alan Woodward	f38d51ee89	Don't wrap readers when checking for term vector access in test (#12136 ) In TestUnifiedHighlighterTermVec, we have a special reader which counts the number of times term vectors are accessed, so that we can assert that caching works correctly here. There is some special logic in place to skip the check when the test framework wraps readers with CheckIndex or ParallelReader; however, this logic no longer works with ParallelReader in particular, because term vectors are now accessed through an anonymous class. A simpler solution here is to call newSearcher(reader, false), which disables wrapping, meaning that we can remove this extra logic entirely. Fixes #12115	2023-02-09 09:29:41 +00:00
John Mazanec	776149f0f6	Reuse HNSW graph for initialization during merge (#12050 ) * Remove implicit addition of vector 0 Removes logic to add 0 vector implicitly. This is in preparation for adding nodes from other graphs to initialize a new graph. Having the implicit addition of node 0 complicates this logic. Signed-off-by: John Mazanec <jmazane@amazon.com> * Enable out of order insertion of nodes in hnsw Enables nodes to be added into OnHeapHnswGraph in out of order fashion. To do so, additional operations have to be taken to re-sort the nodesByLevel array. Optimizations have been made to avoid sorting whenever possible. Signed-off-by: John Mazanec <jmazane@amazon.com> * Add ability to initialize from graph Adds method to initialize an HNSWGraphBuilder from another HNSWGraph. Initialization can only happen when the builder's graph is empty. Signed-off-by: John Mazanec <jmazane@amazon.com> * Utilize merge with graph init in HNSWWriter Uses HNSWGraphBuilder initialization from graph functionality in Lucene95HnswVectorsWriter. Selects the largest graph to initialize the new graph produced by the HNSWGraphBuilder for merge. Signed-off-by: John Mazanec <jmazane@amazon.com> * Minor modifications to Lucene95HnswVectorsWriter Signed-off-by: John Mazanec <jmazane@amazon.com> * Use TreeMap for graph structure for levels > 0 Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of levels greater than 0. Refactors NodesIterator to support set representation of nodes. Signed-off-by: John Mazanec <jmazane@amazon.com> * Refactor initializer to be in static create method Refactors initialization from graph to be accessible via a create static method in HnswGraphBuilder. Signed-off-by: John Mazanec <jmazane@amazon.com> * Address review comments Signed-off-by: John Mazanec <jmazane@amazon.com> * Add change log entry Signed-off-by: John Mazanec <jmazane@amazon.com> * Remove empty iterator for neighborqueue Signed-off-by: John Mazanec <jmazane@amazon.com> --------- Signed-off-by: John Mazanec <jmazane@amazon.com>	2023-02-07 14:42:03 -05:00
Adrien Grand	ab074d5483	Introduce a new `KeywordField`. (#12054 ) `KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`, similarly to how `LongField` is a combination of `LongPoint` and `SortedNumericDocValuesField`. This makes it easier for users to create fields that can be used for filtering, sorting and faceting.	2023-02-07 18:19:09 +01:00
Adrien Grand	d69326408c	Fix javadoc references.	2023-02-07 11:06:04 +01:00
Adrien Grand	8b572f074e	Fix nightly compatibility tests after #12116 .	2023-02-07 10:46:30 +01:00
Uwe Schindler	bc09c2a0d9	Add tests for size() and contains() to LongHashSet; fix size bug with MISSING (#12134 )	2023-02-07 00:52:29 +01:00
Uwe Schindler	57403e26e0	Simplify LongHashSet by completely removing java.util.Set APIs (#12133 )	2023-02-06 22:43:20 +01:00
Uwe Schindler	8564da434d	Generate gradle.properties from gradlew (#12131 ) * SOLR-16641 - Generate gradle.properties from gradlew (#1320) * Adapt for Lucene * Remove localSettings from smoker; thanks @colvinco * Print properties at end for debugging * Add CHANGES.txt entry --------- Co-authored-by: Colvin Cowie <colvin.cowie.dev@gmail.com> Co-authored-by: Colvin Cowie <51863265+colvinco@users.noreply.github.com>	2023-02-06 19:47:15 +01:00
Robert Muir	02b202866c	Add CHANGES.txt for #12129	2023-02-06 12:49:49 -05:00
Robert Muir	0bc4135695	Speedup sandbox/DocValuesTermsQuery (#12129 ) * Optimize the common case that docs only have single values for the field * In the multivalued case, terminate reading docvalues if they are > maximum set ordinal * Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost() * Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery, newSlowExactQuery Like other slow queries in these classes, it's currently only recommended to use with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery)	2023-02-06 12:47:53 -05:00
Robert Muir	10d9c7440b	Speed up docvalues set query by making use of sortedness (#12128 ) LongHashSet is used for the set of numbers, but it has some issues: * tries too hard to extend AbstractSet, mostly for testing * causes traps with boxing if you aren't careful * complex hashcode/equals Practically we should take advantage of the fact numbers come in sorted order for multivalued fields: just like range queries do. So we use min/max to our advantage, including termination of docvalues iteration. Actually it is generally a win to just check min/max even in the single-valued case: these constant time comparisons are cheap and can avoid hashing, etc. In the worst-case, if all of your query Sets contain both the minimum and maximum possible values, then it won't help, but it doesn't hurt either.	2023-02-06 12:14:02 -05:00
Robert Muir	a6bceb7cf0	Remove useless abstractions in DocValues-based queries (#12127 ) There's no need to make things abstract: DocValues does the right thing Optimizing for where no docs for the field in the segment exist is easy, simple null check (replacing the existing one!)	2023-02-06 11:49:25 -05:00
Adrien Grand	684f31ef06	Fix ambiguous links. (#12116 )	2023-02-06 17:45:05 +01:00
Adrien Grand	96136b4282	Improve document API for stored fields. (#12116 ) Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues: - If there is a problem, e.g. all of these 3 methods return null, it's currently discovered late, when we already passed the responsibility of writing data from IndexingChain to the codec. - numericValue() is used both for numeric doc values and storage. This makes it impossible to implement a `double` field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values. - binaryValue() is used both for sorted(_set) doc values and storage. This makes it impossible to implement `keyword` fields that are stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store. This commit introduces `IndexableField#storedValue()`, which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue().	2023-02-06 17:08:13 +01:00
Benjamin Trent	c2bef381d1	Fix TestFeatureField#testBasicsNonScoringCase test (#12130 ) Sometimes the random search lucene test searcher will wrap the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher or the reader may be different between creating the weight with the searcher vs. accessing the leaf context for the scorer.	2023-02-06 10:15:00 -05:00
Benjamin Trent	4bbc273a43	Add `FeatureQuery` weight caching in non-scoring case (#12118 ) While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled. A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries. related to: https://github.com/apache/lucene/issues/11799	2023-02-02 13:00:33 -05:00
Ioana Tagirta	d591c9c37a	Always generate a polygon that has no self intersections (#12124 )	2023-02-02 16:57:42 +01:00
Jean-François BOEUF	5acca82633	Reduce bloom filter size by using the optimal count for hash functions. (#11900 )	2023-02-01 14:35:50 +01:00
Marc D'Mello	f9cb6a3b42	GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes (#11958 ) Co-authored-by: Marc D'Mello <dmellomd@amazon.com>	2023-02-01 14:11:12 +01:00
Luca Cavanna	57397f0cab	Adjust return type for VectorUtil methods (#12122 ) Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while the variable used to store the value is an int. They can just return an int.	2023-02-01 10:48:05 +01:00
Luca Cavanna	ce433e5449	Remove VectorUtil#toBytesRef (#12121 ) The method is currently only used in its corresponding test method.	2023-02-01 10:47:50 +01:00
Luca Cavanna	d7d07c453f	Release wizard: update folder name in stage artifacts command (#12117 )	2023-02-01 10:47:32 +01:00
Luca Cavanna	73e2ae2705	Add back-compat indices for 9.5.0	2023-01-30 20:50:01 +01:00
Luca Cavanna	102a4de32f	DOAP changes for release 9.5.0	2023-01-30 15:30:35 +01:00
Luca Cavanna	2bd87b7909	Fix formatting of Version.java	2023-01-25 13:49:23 +01:00
Luca Cavanna	59da15b0e5	Add minor version 9.6.0	2023-01-25 13:42:38 +01:00
Luca Cavanna	72c8b334a0	Add missing LUCENE_9_5_0 version	2023-01-25 13:37:48 +01:00
Benjamin Trent	d1fa52e62f	Fix flaky TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults test (#12110 )	2023-01-25 10:47:20 +01:00
Luca Cavanna	5a51ce1d5d	SimpleText codec to support writing byte vectors (#12111 ) A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written. This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter. Note that while support is added to the BufferingKnnVectorsWriter base class, 90, 91 and 92 writers don't need to support byte vectors and will throw unsupported operation exception when attempting to do that.	2023-01-25 10:43:35 +01:00
Luca Cavanna	95e2cfcc1e	Remove deprecated float vector classes and methods (#12107 ) Follow-up of #12105 to remove the deprecated classes for the next major version. Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.	2023-01-24 16:25:36 +01:00
Adrien Grand	ce8eaf138c	MemoryIndex should not fail integer fields that enable doc values. (#12109 ) When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast to `java.lang.Long`. However, the new `IntField` represents the value as a `java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with `IndexingChain` by casting to `Number` and calling `Number#longValue` instead of casting to `Long`.	2023-01-24 11:51:49 +01:00
Luca Cavanna	25623a63bf	format changelog entry and add missing author name	2023-01-24 10:57:52 +01:00
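
The `segmentStarts` bookkeeping described in commit dbfca9a62b (#12152) can be illustrated with a small sketch. This is not Lucene's code: the class and method names (`SegmentStartsSketch`, `count`, `hasMatches`) are hypothetical, and the sketch assumes that `segmentStarts` holds one entry per leaf segment plus a final entry equal to the total number of scored documents.

```java
public class SegmentStartsSketch {

    // Matching docs in a segment: difference between its upper and lower start offsets.
    // Assumption: segmentStarts has (number of segments + 1) entries, non-decreasing.
    public static int count(int[] segmentStarts, int segmentOrd) {
        return segmentStarts[segmentOrd + 1] - segmentStarts[segmentOrd];
    }

    // If the lower and upper offsets are equal, no docs match in this segment,
    // so a caller could skip building a scorer for it.
    public static boolean hasMatches(int[] segmentStarts, int segmentOrd) {
        return segmentStarts[segmentOrd] != segmentStarts[segmentOrd + 1];
    }

    public static void main(String[] args) {
        // Hypothetical layout: 3 segments holding 3, 0 and 2 scored docs respectively.
        int[] segmentStarts = {0, 3, 3, 5};
        for (int ord = 0; ord < segmentStarts.length - 1; ord++) {
            System.out.println("segment " + ord
                + ": count=" + count(segmentStarts, ord)
                + ", hasMatches=" + hasMatches(segmentStarts, ord));
        }
    }
}
```

Both operations are constant-time array reads, which is why the commit message describes them as minor optimizations.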

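The min/max idea from commit 10d9c7440b (#12128) can be sketched in isolation. This is an illustration under stated assumptions, not the Lucene implementation: the class `SortedSetMatchSketch` and its methods are hypothetical, the per-document values are assumed to arrive in ascending order, and membership is tested with a binary search over a sorted `long[]` rather than the `LongHashSet` mentioned in the commit message.

```java
import java.util.Arrays;

public class SortedSetMatchSketch {
    private final long[] sorted; // query values, ascending
    private final long min;
    private final long max;

    public SortedSetMatchSketch(long... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        sorted = values.clone();
        Arrays.sort(sorted);
        min = sorted[0];
        max = sorted[sorted.length - 1];
    }

    // Single-valued case: two cheap comparisons can avoid the lookup entirely.
    public boolean contains(long value) {
        if (value < min || value > max) {
            return false;
        }
        return Arrays.binarySearch(sorted, value) >= 0;
    }

    // Multi-valued case. Assumption: docValues is sorted ascending.
    public boolean matches(long[] docValues) {
        for (long value : docValues) {
            if (value < min) {
                continue;      // below the query range, keep reading
            }
            if (value > max) {
                return false;  // every later value is larger too, stop early
            }
            if (Arrays.binarySearch(sorted, value) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SortedSetMatchSketch query = new SortedSetMatchSketch(10L, 20L, 30L);
        System.out.println(query.matches(new long[] {1, 5, 20, 99}));
        System.out.println(query.matches(new long[] {31, 40}));
    }
}
```

The early return on `value > max` is only valid because the values are sorted; with unsorted input the loop would have to read every value.
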

36503 Commits
