lucene

Commit Graph

Author	SHA1	Message	Date
Luca Cavanna	5a51ce1d5d	SimpleText codec to support writing byte vectors (#12111 ) A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written. This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter. Note that while support is added to the BufferingKnnVectorsWriter base class, 90, 91 and 92 writers don't need to support byte vectors and will throw unsupported operation exception when attempting to do that.	2023-01-25 10:43:35 +01:00
Luca Cavanna	95e2cfcc1e	Remove deprecated float vector classes and methods (#12107 ) Follow-up of #12105 to remove the deprecated classes for the next major version. Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.	2023-01-24 16:25:36 +01:00
Adrien Grand	ce8eaf138c	MemoryIndex should not fail integer fields that enable doc values. (#12109 ) When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast to `java.lang.Long`. However, the new `IntField` represents the value as a `java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with `IndexingChain` by casting to `Number` and calling `Number#longValue` instead of casting to `Long`.	2023-01-24 11:51:49 +01:00
Luca Cavanna	25623a63bf	format changelog entry and add missing author name	2023-01-24 10:57:52 +01:00
Luca Cavanna	92e67ec626	Rename vector float classes and methods (#12105 ) We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values. This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones. As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues. Relates to #11963	2023-01-24 10:20:24 +01:00
Michael Gibney	d4006c0362	remove username from MANIFEST.MF in build artifacts (#12096 )	2023-01-23 16:45:38 -05:00
Michael Gibney	832552e0ac	buildAndPushRelease should optionally pause before assembleRelease (#12095 )	2023-01-23 16:39:10 -05:00
Luca Cavanna	f5bd28662f	Remove BytesRef usage from SortingCodecReader Follow-up of #12102	2023-01-23 22:29:10 +01:00
Luca Cavanna	4594400216	Replace BytesRef usages in byte vectors API with byte[] (#12102 ) The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore	2023-01-23 22:06:00 +01:00
Luca Cavanna	f8ee852696	add missing changelog for #12064	2023-01-23 21:22:15 +01:00
Adrien Grand	de94fa97fb	Remove binaryValue() on VectorValues and ByteVectorValues. (#12101 ) This method tries to expose an encoded view of vectors, but we shouldn't have this part of our user-facing API. With this change, the way vectors are encoded is entirely on the codec.	2023-01-23 14:42:49 +01:00
Alessandro Benedetti	77eca4bb38	Introduce getters for KnnVectorQuery (#12029 )	2023-01-23 12:35:08 +01:00
Chris Hostetter	9007f746a3	WordBreakSpellChecker now correctly respects maxEvaluations (#12077 )	2023-01-22 15:44:29 -07:00
Vigya Sharma	519adcc954	Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098 )	2023-01-19 16:16:42 -08:00
Michael Gibney	fa5dcbad9f	update releaseWizard.py to support offline gpg key (#12085 ) porting analogous change from solr: https://github.com/apache/solr/pull/1288	2023-01-19 14:41:35 +01:00
Lu Xugang	a9fd21b6af	Same bound with fallbackQuery (#12084 ) IndexSortSortedNumericDocValuesRangeQuery should have the same bound with fallbackQuery.	2023-01-19 14:33:58 +08:00
Vigya Sharma	dc33ade76d	Remove UTF8TaxonomyWriterCache (#12092 ) Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache	2023-01-18 13:20:26 -08:00
twosom	318b002e0b	fix typo in KoreanNumberFilter (#12045 ) * fix typo in KoreanNumberFilter * fix doc format	2023-01-17 22:32:59 -08:00
Robert Muir	4fe8424925	Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087 ) Clean up this query a bit and support: * NumericDocValuesField.newSlowSetQuery() * SortedNumericDocValuesField.newSlowSetQuery() This complements the existing docvalues-based range queries, with a set query. Add ScorerSupplier/cost estimation support to PointInSetQuery Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery	2023-01-16 09:38:08 -05:00
Alan Woodward	fc7b937aff	Don't throw UOE when highlighting FieldExistsQuery (#12088 ) WeightedSpanTermExtractor will try to rewrite queries that it doesn't know about, to see if they end up as something it does know about and that it can extract terms from. To support field merging, it rewrites against a delegating leaf reader that does not support getFieldInfos(). FieldExistsQuery uses getFieldInfos() in its rewrite, which means that if one is passed to WeightedSpanTermExtractor, we get an UnsupportedOperationException thrown. This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery, so that it can just ignore it and avoid throwing an exception.	2023-01-16 11:47:51 +00:00
Robert Muir	4f33aa8515	Upgrade to errorprone 2.18 (#12086 )	2023-01-14 14:39:23 -05:00
Robert Muir	e06f8c2e8b	Update to error-prone 2.17 (#12056 )	2023-01-14 11:38:39 -05:00
Robert Muir	8ca05967e8	remove non-NRT replication support (#12038 ) Remove non-NRT replication support in 10.x (to be deprecated in 9.5)	2023-01-14 11:14:46 -05:00
Adrien Grand	b5062a2858	MultiCollector shouldn't report that scores are needed when they're not. (#12083 ) When sub collectors don't agree on their `ScoreMode`, `MultiCollector` currently returns `COMPLETE`. This makes sense when assuming that there is likely one collector computing top hits (`TOP_SCORES`) and another one computing facets (`COMPLETE_NO_SCORES`) so `COMPLETE` makes sense. However it is also possible to have one collector computing top hits by field (`TOP_DOCS`) and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector` shouldn't report that scores are needed in that case.	2023-01-13 14:44:17 +01:00
Luca Cavanna	90a5a71448	fix typo in changelog	2023-01-13 14:12:06 +01:00
Lu Xugang	102622842b	Enhance XXXField#newRangeQuery (#12078 ) Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery	2023-01-13 18:12:26 +08:00
Adrien Grand	aaab028266	Speed up DocIdMerger on sorted indexes. (#12081 ) In the case when an index is sorted on a low-cardinality field, or the index sort order correlates with the order in which documents get ingested, we can optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on the next sub. This check covers at the same time whether the priority queue needs reordering and whether the current sub reached `NO_MORE_DOCS`.	2023-01-12 18:27:45 +01:00
Adrien Grand	729fedcbac	Speed up 1D BKD merging. (#12079 ) On the NYC taxis dataset on my local machine, switching from `Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15% speedup of BKD merging.	2023-01-12 18:14:15 +01:00
Benjamin Trent	59b17452aa	Fix exponential runtime for Boolean#rewrite (#12072 ) When #672 was introduced, it added many nice rewrite optimizations. However, in the case when there are many nested Boolean queries under a top level Boolean#filter clause, its runtime grows exponentially. The key issue was how the BooleanQuery#rewriteNoScoring redirected yet again to the ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively. This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite. closes: #12069	2023-01-12 16:50:05 +01:00
Adrien Grand	9ab324d2be	Allow reusing indexed binary fields. (#12053 ) Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and once for the `SortedDocValuesField`, while still reusing fields.	2023-01-12 09:52:12 +01:00
Adrien Grand	525a11091c	Revert "Allow reusing indexed binary fields. (#12053 )" This reverts commit `84778549af`.	2023-01-12 09:48:06 +01:00
Adrien Grand	84778549af	Allow reusing indexed binary fields. (#12053 ) Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and once for the `SortedDocValuesField`, while still reusing fields.	2023-01-12 09:32:13 +01:00
Patrick Zhai	d3d9ab0044	Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051 )	2023-01-11 10:44:06 -08:00
Adrien Grand	a8ef03d979	Never throttle creation of compound files. (#12070 ) `ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a proxy for CPU usage, in order to prevent merging from disrupting searches too much. However, creating compound files is lightweight CPU-wise and does not need throttling. Closes #12068	2023-01-11 09:57:13 +01:00
Adrien Grand	56ec51e558	Cut over Lucene Demo from LongPoint to LongField. (#12052 )	2023-01-11 09:43:43 +01:00
Benjamin Trent	cc29102a24	Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064 )	2023-01-11 09:20:47 +01:00
Erik Pellizzon	e14327288e	Documenting that IndexReaderContext#leaves() will never return a null value and remove the null checks from the method calls (#12034 )	2023-01-06 14:35:52 -08:00
Uwe Schindler	5fccaec166	Remove deprecated APIs after #12066 ; this also removes another one missed to be removed before	2023-01-05 11:53:16 +01:00
Uwe Schindler	7f483bd618	Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066 )	2023-01-04 19:17:03 +01:00
Uwe Schindler	2f602c01dd	Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062 )	2023-01-03 19:10:28 +01:00
Lu Xugang	19cc6cdf66	Out of boundary in CombinedFieldQuery#addTerm (#12046 )	2023-01-03 15:33:36 +08:00
Uwe Schindler	e2ee09d0c5	Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058 )	2023-01-02 00:05:36 +01:00
twosom	4676a735c1	fix typo analysis-kuromoji (#12047 )	2023-01-01 10:58:50 -05:00
Patrick Zhai	4eab1d74e8	Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049 )	2022-12-30 13:18:51 -08:00
Greg Miller	21107d811b	Add CHANGES entry for GITHUB#11869	2022-12-30 07:46:51 -08:00
Marc D'Mello	cbfed77fd3	Github#11869: Add RangeOnRangeFacetCounts (#11901 )	2022-12-30 07:38:13 -08:00
Adrien Grand	6f477e5831	Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037 ) This iterates on #399 to also optimize the case when an index sort is configured. When cutting over the NYC taxis benchmark to the new numeric fields, [flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times) stayed mostly the same when index sorting is disabled and increased by 7-8% when index sorting is enabled. I expect this change to address this slowdown.	2022-12-27 11:12:56 +01:00
Adrien Grand	ddd63d2da3	Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011 ) When flushing segments that have an index sort configured, postings lists get loaded into arrays and get reordered according to the index sort. This reordering is implemented with `TimSorter`, a variant of merge sort. Like merge sort, an important part of `TimSorter` consists of merging two contiguous sorted slices of the array into a combined sorted slice. This merging can be done either with external memory, which is the classical approach, or in-place, which still runs in linear time but with a much higher factor. Until now we were allocating a fixed budget of `maxDoc/64` for doing these merges with external memory. If this is not enough, sorted slices would be merged in place. I've been looking at some profiles recently for an index where a non-negligible chunk of the time was spent on in-place merges. So I would like to propose the following change: - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid in-place merges for all postings up to `docFreq = maxDoc / 4`. - Make this RAM budget lazily allocated, rather than eagerly like today. This would help not allocate memory in O(maxDoc) for fields like primary keys that only have a couple postings per term. So overall memory usage would never be more than 50% higher than what it is today, because `TimSorter` never needs more than X temporary slots if the postings list doesn't have at least 2X entries, and these 2X entries already get loaded into memory today. And for fields that have short postings, memory usage should actually be lower.	2022-12-27 11:11:18 +01:00
Adrien Grand	f5ea0412eb	Replace JIRA release instructions with GitHub. (#11968 )	2022-12-27 11:08:46 +01:00
Adrien Grand	e9dc4f9188	Avoid sorting values of multi-valued writers if there is a single value. (#12039 ) They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to range checks and some logic to determine the optimal sorting algorithm to use depending on the number of values. We can skip this overhead in the case when there is a single value.	2022-12-27 11:03:06 +01:00

36707 Commits All Branches Search