`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
just as `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
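A minimal indexing sketch of what this buys (field name hypothetical; the constructor shape is assumed to mirror `StringField`'s):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KeywordField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
// Before: two parallel fields to get both an inverted index (filtering)
// and sorted-set doc values (sorting/faceting)
doc.add(new StringField("color", "blue", Field.Store.NO));
doc.add(new SortedSetDocValuesField("color", new BytesRef("blue")));

// After: a single KeywordField indexes both representations at once
doc.add(new KeywordField("color", "blue", Field.Store.NO));
```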
* Optimize the common case where docs have only a single value for the field
* In the multivalued case, stop reading doc values once they exceed the maximum set ordinal
* Implement ScorerSupplier, so that a (potentially large) number of ordinal lookups isn't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery, complementing the existing newSlowRangeQuery and newSlowExactQuery
Like the other slow queries in these classes, it is currently only recommended for use in combination with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery), as sketched below.
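A hedged sketch of that pairing for a keyword-style field (the exact newSlowSetQuery signature is assumed; TermInSetQuery plays the role of the lead "index" query here):

```java
import java.util.List;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

List<BytesRef> values = List.of(new BytesRef("red"), new BytesRef("blue"));
// Index side: term-dictionary driven, cheap to lead iteration
Query indexQuery = new TermInSetQuery("color", values);
// Doc-values side: cheap to verify when another clause leads iteration
Query dvQuery = SortedSetDocValuesField.newSlowSetQuery("color", values);
Query query = new IndexOrDocValuesQuery(indexQuery, dvQuery);
```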
LongHashSet is used for the set of numbers, but it has some issues:
* tries too hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals
Practically, we should take advantage of the fact that numbers come back in
sorted order for multivalued fields, just like range queries do: we can use the
set's min/max to our advantage, including early termination of doc-values
iteration. It is actually a win to check min/max even in the single-valued
case: these constant-time comparisons are cheap and can avoid hashing, etc.
In the worst case, if every query set contains both the minimum and maximum
possible values, the check won't help, but it doesn't hurt either.
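A hypothetical sketch of both ideas, the constant-time bounds check and the early termination over sorted multi-valued doc values (names and the LongPredicate stand-in for the set are illustrative):

```java
import java.io.IOException;
import java.util.function.LongPredicate;
import org.apache.lucene.index.SortedNumericDocValues;

static boolean matchesDoc(SortedNumericDocValues values, LongPredicate set,
                          long setMin, long setMax) throws IOException {
  int count = values.docValueCount();
  for (int i = 0; i < count; i++) {
    long v = values.nextValue(); // values are returned in sorted order
    if (v > setMax) {
      return false; // every remaining value is larger: stop iterating
    }
    if (v >= setMin && set.test(v)) { // cheap bounds check before hashing
      return true;
    }
  }
  return false;
}
```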
There's no need to make things abstract: DocValues does the right thing.
Optimizing for the case where the segment has no docs for the field is easy: a simple null check (replacing the existing one!).
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
- If there is a problem, e.g. all three of these methods return null, it is
  currently discovered late, after responsibility for writing the data has
  already passed from IndexingChain to the codec.
- numericValue() is used both for numeric doc values and storage. This makes
  it impossible to implement a `double` field that is both stored and
  doc-valued, as numericValue() would simultaneously need to return the double
  for storage and the long bits of the double for doc values.
- binaryValue() is used both for sorted(_set) doc values and storage. This
  makes it impossible to implement a `keyword` field that is both stored and
  doc-valued, as the field returns a non-null value for both binaryValue() and
  stringValue(), and stored fields no longer know which value to store.
This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
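A hedged sketch (class name hypothetical) of the `double` case from above under the new API: doc values consume numericValue() while stored fields consume storedValue(), so the two no longer collide.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StoredValue;
import org.apache.lucene.index.DocValuesType;

// Hypothetical field that is stored *and* numeric-doc-valued.
class StoredDoubleDocValuesField extends Field {
  private static final FieldType TYPE = new FieldType();

  static {
    TYPE.setDocValuesType(DocValuesType.NUMERIC);
    TYPE.setStored(true);
    TYPE.freeze();
  }

  private final double value;

  StoredDoubleDocValuesField(String name, double value) {
    super(name, TYPE);
    this.value = value;
    // Doc values consume numericValue(): the sortable long bits of the double
    fieldsData = Double.doubleToRawLongBits(value);
  }

  @Override
  public StoredValue storedValue() {
    // Stored fields consume storedValue(): the double itself
    return new StoredValue(value);
  }
}
```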
Sometimes the randomized test IndexSearcher will wrap the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher; otherwise the reader used when creating the Weight with the searcher may differ from the one whose leaf contexts we access for the Scorer.
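A sketch of the safe pattern in a test (assuming LuceneTestCase's newSearcher; `reader` and `query` are whatever the test set up):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

IndexSearcher searcher = newSearcher(reader); // may wrap `reader`
// Always go back through the searcher: its reader may not be `reader`
IndexReader searchReader = searcher.getIndexReader();
Weight weight =
    searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1f);
for (LeafReaderContext ctx : searchReader.leaves()) {
  Scorer scorer = weight.scorer(ctx); // same reader for weight and leaves
  // ...
}
```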
While FeatureQuery is a powerful tool in the scoring case, there are scenarios where caching should be allowed and scoring disabled.
A particular case is when FeatureQuery is used in conjunction with learned-sparse retrieval, where it is useful to iterate and compute the entire matching doc set in combination with various other queries.
Related to: https://github.com/apache/lucene/issues/11799
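For instance (a sketch; field, feature and `otherQuery` names are hypothetical), a FeatureField query can be folded into the match set as a FILTER clause, where scores aren't computed and caching is possible:

```java
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

Query feature = FeatureField.newLinearQuery("features", "sparse_token", 1f);
Query q = new BooleanQuery.Builder()
    // match-only contribution: no scoring, eligible for the query cache
    .add(feature, BooleanClause.Occur.FILTER)
    .add(otherQuery, BooleanClause.Occur.MUST) // some other clause (hypothetical)
    .build();
```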
Two of the methods that take byte arrays (squareDistance and dotProduct) are declared to return a float even though the accumulator variable is an int. They can simply return an int.
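The shape of the change, sketched (the real loop may differ):

```java
// After the change: byte dot products accumulate in an int and return it
// directly (previously the method was declared to return float, silently
// widening the int accumulator on the way out).
public static int dotProduct(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    total += a[i] * b[i];
  }
  return total;
}
```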
A recent test failure signaled that byte vectors could not be written when the simple text codec was randomly selected.
This commit addresses that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.
Note that while support is added to the BufferingKnnVectorsWriter base class, the Lucene90, Lucene91 and Lucene92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when asked to do so.
Follow-up of #12105 to remove the deprecated classes for the next major version.
Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
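The fix in miniature (variable names illustrative):

```java
Object value = field.numericValue();
// Before: unchecked cast; throws ClassCastException for IntField's Integer
// long bits = (Long) value;
// After: widen through Number, as IndexingChain does
long bits = ((Number) value).longValue();
```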
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variants of the vector field, vector query and vector values.
This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.
As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.
Relates to #11963
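The renames at a glance (the new classes are drop-in copies of the old ones):

```java
// Deprecated                                // Replacement
new KnnVectorField("vec", floatVector);      // new KnnFloatVectorField("vec", floatVector)
new KnnVectorQuery("vec", floatVector, 10);  // new KnnFloatVectorQuery("vec", floatVector, 10)
leafReader.getVectorValues("vec");           // leafReader.getFloatVectorValues("vec")
```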
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too:
* ByteVectorValues#vectorValue
* KnnVectorReader#search
* LeafReader#searchNearestVectors
* HNSWGraphSearcher#search
* VectorSimilarityFunction#compare
* VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
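A sketch of the resulting call shapes, with byte[] in and out (variables `a`, `b` and `byteVectorValues` assumed to exist):

```java
byte[] vector = byteVectorValues.vectorValue();             // was BytesRef
float sim = VectorSimilarityFunction.COSINE.compare(a, b);  // byte[] overload
float cos = VectorUtil.cosine(a, b);                        // byte[] overload
```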
This method tries to expose an encoded view of vectors, but that shouldn't be
part of our user-facing API. With this change, the way vectors are encoded
is entirely up to the codec.
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()
This complements the existing docvalues-based range queries with a set query.
Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, which uses IndexOrDocValuesQuery (sketched below)
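Roughly what the new helper composes (varargs signatures assumed, per the list above):

```java
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

// Hand-rolled equivalent: points lead iteration, doc values verify
Query q = new IndexOrDocValuesQuery(
    LongPoint.newSetQuery("price", 1L, 5L, 42L),
    SortedNumericDocValuesField.newSlowSetQuery("price", 1L, 5L, 42L));

// Or simply use the new field-level helper
Query q2 = LongField.newSetQuery("price", 1L, 5L, 42L);
```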
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().
FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.
This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
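A hedged sketch of the shape of the fix inside the extractor's big instanceof chain (the real method in WeightedSpanTermExtractor handles many more query types and its signature may differ slightly):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.search.FieldExistsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.WeightedSpanTerm;

void extract(Query query, float boost, Map<String, WeightedSpanTerm> terms)
    throws IOException {
  if (query instanceof FieldExistsQuery) {
    // Matches every doc that has the field and carries no terms to
    // highlight, so skip it instead of rewriting it against the delegating
    // reader (whose getFieldInfos() would throw).
    return;
  }
  // ... handle the many other query types / fall back to rewrite ...
}
```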
When sub-collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This makes sense under the assumption that one
collector computes top hits (`TOP_SCORES`) while another computes facets
(`COMPLETE_NO_SCORES`), in which case scores are indeed needed. However, it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
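A hedged sketch of the combination rule this implies (method name hypothetical): on disagreement, visit all docs, but only compute scores if some sub-collector actually needs them.

```java
import java.util.List;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreMode;

static ScoreMode combinedScoreMode(List<Collector> collectors) {
  ScoreMode scoreMode = null;
  for (Collector collector : collectors) {
    if (scoreMode == null) {
      scoreMode = collector.scoreMode();
    } else if (scoreMode != collector.scoreMode()) {
      // Sub-collectors disagree: every doc must be visited, but scores are
      // only required if at least one sub-collector needs them.
      boolean needsScores =
          collectors.stream().anyMatch(c -> c.scoreMode().needsScores());
      return needsScores ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
    }
  }
  return scoreMode; // all sub-collectors agree
}
```

So `TOP_DOCS` + `COMPLETE_NO_SCORES` now combines to `COMPLETE_NO_SCORES`, while `TOP_SCORES` + `COMPLETE_NO_SCORES` still combines to `COMPLETE`.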