Commit Graph

36707 Commits

Author SHA1 Message Date
Luca Cavanna 5a51ce1d5d
SimpleText codec to support writing byte vectors (#12111)
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.

Note that while support is added to the BufferingKnnVectorsWriter base class, the 90, 91 and 92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when asked to write them.
2023-01-25 10:43:35 +01:00
Luca Cavanna 95e2cfcc1e
Remove deprecated float vector classes and methods (#12107)
Follow-up of #12105 to remove the deprecated classes for the next major version.

Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
2023-01-24 16:25:36 +01:00
Adrien Grand ce8eaf138c
MemoryIndex should not fail integer fields that enable doc values. (#12109)
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
2023-01-24 11:51:49 +01:00
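
The cast problem described in #12109 can be reproduced without Lucene. The following is a minimal sketch, not the actual `MemoryIndex` code: the class and method names are hypothetical, and it only shows why casting to `Number` and calling `Number#longValue` accepts both `Integer` and `Long` values while a direct cast to `Long` does not.

```java
final class NumericValueSketch {
  // Hypothetical helper (not Lucene code): widens any boxed numeric value to long.
  static long toLong(Object value) {
    // Casting to Number accepts Integer, Long, Short, etc.
    return ((Number) value).longValue();
  }

  // Reports whether the old-style unchecked cast to Long would fail for this value.
  static boolean castToLongFails(Object value) {
    try {
      long unused = (Long) value; // throws ClassCastException for an Integer
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }
}
```

With this sketch, `NumericValueSketch.toLong(Integer.valueOf(42))` succeeds, whereas `castToLongFails(Integer.valueOf(42))` reports that the direct cast throws.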
Luca Cavanna 25623a63bf format changelog entry and add missing author name 2023-01-24 10:57:52 +01:00
Luca Cavanna 92e67ec626
Rename vector float classes and methods (#12105)
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.

This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.

As a result, LeafReader#getVectorValues is also deprecated in favour of the newly introduced getFloatVectorValues method, which returns FloatVectorValues.

Relates to #11963
2023-01-24 10:20:24 +01:00
Michael Gibney d4006c0362
remove username from MANIFEST.MF in build artifacts (#12096) 2023-01-23 16:45:38 -05:00
Michael Gibney 832552e0ac
buildAndPushRelease should optionally pause before assembleRelease (#12095) 2023-01-23 16:39:10 -05:00
Luca Cavanna f5bd28662f Remove BytesRef usage from SortingCodecReader
Follow-up of #12102
2023-01-23 22:29:10 +01:00
Luca Cavanna 4594400216
Replace BytesRef usages in byte vectors API with byte[] (#12102)
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
2023-01-23 22:06:00 +01:00
Luca Cavanna f8ee852696
add missing changelog for #12064 2023-01-23 21:22:15 +01:00
Adrien Grand de94fa97fb
Remove binaryValue() on VectorValues and ByteVectorValues. (#12101)
This method tries to expose an encoded view of vectors, but we shouldn't have
this part of our user-facing API. With this change, the way vectors are encoded
is entirely on the codec.
2023-01-23 14:42:49 +01:00
Alessandro Benedetti 77eca4bb38
Introduce getters for KnnVectorQuery (#12029) 2023-01-23 12:35:08 +01:00
Chris Hostetter 9007f746a3 WordBreakSpellChecker now correctly respects maxEvaluations (#12077) 2023-01-22 15:44:29 -07:00
Vigya Sharma 519adcc954
Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098) 2023-01-19 16:16:42 -08:00
Michael Gibney fa5dcbad9f
update releaseWizard.py to support offline gpg key (#12085)
porting analogous change from solr: https://github.com/apache/solr/pull/1288
2023-01-19 14:41:35 +01:00
Lu Xugang a9fd21b6af
Same bound with fallbackQuery (#12084)
IndexSortSortedNumericDocValuesRangeQuery should use the same bounds as its fallbackQuery.
2023-01-19 14:33:58 +08:00
Vigya Sharma dc33ade76d
Remove UTF8TaxonomyWriterCache (#12092)
Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache
2023-01-18 13:20:26 -08:00
twosom 318b002e0b
fix typo in KoreanNumberFilter (#12045)
* fix typo in KoreanNumberFilter

* fix doc format
2023-01-17 22:32:59 -08:00
Robert Muir 4fe8424925
Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087)
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()

This complements the existing docvalues-based range queries, with a set query.

Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
2023-01-16 09:38:08 -05:00
Alan Woodward fc7b937aff
Don't throw UOE when highlighting FieldExistsQuery (#12088)
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().

FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.

This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
2023-01-16 11:47:51 +00:00
Robert Muir 4f33aa8515
Upgrade to errorprone 2.18 (#12086) 2023-01-14 14:39:23 -05:00
Robert Muir e06f8c2e8b
Update to error-prone 2.17 (#12056) 2023-01-14 11:38:39 -05:00
Robert Muir 8ca05967e8
remove non-NRT replication support (#12038)
Remove non-NRT replication support in 10.x (to be deprecated in 9.5)
2023-01-14 11:14:46 -05:00
Adrien Grand b5062a2858
MultiCollector shouldn't report that scores are needed when they're not. (#12083)
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This makes sense when assuming that there is
likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`) so `COMPLETE` makes sense. However it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
2023-01-13 14:44:17 +01:00
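
The rule described in #12083 can be sketched as follows. This is a simplified model rather than the `MultiCollector` implementation: the `Mode` enum is a hypothetical stand-in for Lucene's `ScoreMode` that only tracks whether scores are needed, and falling back to `COMPLETE_NO_SCORES` when no sub collector needs scores is an assumption based on the commit message.

```java
import java.util.List;

final class ScoreModeSketch {
  // Hypothetical stand-in for ScoreMode; only the needsScores flag is modeled.
  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(false),
    TOP_SCORES(true),
    TOP_DOCS(false);

    final boolean needsScores;

    Mode(boolean needsScores) {
      this.needsScores = needsScores;
    }
  }

  // Combines the modes of all sub collectors (the list must not be empty).
  static Mode combine(List<Mode> modes) {
    Mode first = modes.get(0);
    boolean allSame = true;
    boolean anyNeedsScores = false;
    for (Mode m : modes) {
      allSame &= (m == first);
      anyNeedsScores |= m.needsScores;
    }
    if (allSame) {
      return first; // every sub collector agrees, keep its mode
    }
    // On disagreement, only ask for scores if at least one sub collector needs them.
    return anyNeedsScores ? Mode.COMPLETE : Mode.COMPLETE_NO_SCORES;
  }
}
```

In this model, combining `TOP_SCORES` with `COMPLETE_NO_SCORES` still yields `COMPLETE`, while combining `TOP_DOCS` with `COMPLETE_NO_SCORES` no longer reports that scores are needed.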
Luca Cavanna 90a5a71448 fix typo in changelog 2023-01-13 14:12:06 +01:00
Lu Xugang 102622842b
Enhance XXXField#newRangeQuery (#12078)
Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery
2023-01-13 18:12:26 +08:00
Adrien Grand aaab028266
Speed up DocIdMerger on sorted indexes. (#12081)
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This check covers at the same time whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
2023-01-12 18:27:45 +01:00
Adrien Grand 729fedcbac
Speed up 1D BKD merging. (#12079)
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
2023-01-12 18:14:15 +01:00
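
The commit message for #12079 does not say why the dedicated comparator is faster. One plausible reason, shown here only as a sketch under that assumption, is that a comparator specialized for a fixed width can read the bytes as a single big-endian `long` and compare it in one step instead of going through a general range comparison. The class and method names below are hypothetical and are not Lucene's `ArrayUtil` code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Arrays;

final class UnsignedCompareSketch {
  // Reads 8 bytes from a byte[] as one big-endian long.
  private static final VarHandle BE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.BIG_ENDIAN);

  // General-purpose comparison that works for any length.
  static int compareGeneric(byte[] a, int aOff, byte[] b, int bOff, int len) {
    return Arrays.compareUnsigned(a, aOff, aOff + len, b, bOff, bOff + len);
  }

  // Comparison specialized for exactly 8 bytes: big-endian order makes the
  // unsigned long comparison equivalent to a byte-by-byte unsigned comparison.
  static int compare8(byte[] a, int aOff, byte[] b, int bOff) {
    long x = (long) BE_LONG.get(a, aOff);
    long y = (long) BE_LONG.get(b, bOff);
    return Long.compareUnsigned(x, y);
  }
}
```

Both methods agree on the sign of the result for 8-byte values; only the amount of work per comparison differs.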
Benjamin Trent 59b17452aa
Fix exponential runtime for Boolean#rewrite (#12072)
When #672 was introduced, it added many nice rewrite optimizations. However, when there are many nested Boolean queries under a top-level Boolean#filter clause, its runtime grows exponentially.

The key issue was how BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.

This causes exponential growth in rewrite time based on query depth. The change here aims to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.

closes: #12069
2023-01-12 16:50:05 +01:00
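
The growth described in #12072 can be illustrated with a toy call-count model. This is not the `BooleanQuery` code: it simply assumes that, before the fix, each nesting level ends up rewriting its nested query twice, and that after the fix it does so once. All names are hypothetical.

```java
final class RewriteCostSketch {
  // Toy model of the old behavior: every level triggers two rewrites of the level below.
  static long redundantCalls(int depth) {
    if (depth == 0) {
      return 1;
    }
    return 1 + 2 * redundantCalls(depth - 1);
  }

  // Toy model of the fixed behavior: every level triggers one rewrite of the level below.
  static long directCalls(int depth) {
    if (depth == 0) {
      return 1;
    }
    return 1 + directCalls(depth - 1);
  }
}
```

Under this model, a nesting depth of 10 costs 2047 calls with the redundant path and 11 calls with the direct one, which is the difference between exponential and linear growth in query depth.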
Adrien Grand 9ab324d2be Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:52:12 +01:00
Adrien Grand 525a11091c Revert "Allow reusing indexed binary fields. (#12053)"
This reverts commit 84778549af.
2023-01-12 09:48:06 +01:00
Adrien Grand 84778549af
Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:32:13 +01:00
Patrick Zhai d3d9ab0044
Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051) 2023-01-11 10:44:06 -08:00
Adrien Grand a8ef03d979
Never throttle creation of compound files. (#12070)
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.

Closes #12068
2023-01-11 09:57:13 +01:00
Adrien Grand 56ec51e558
Cut over Lucene Demo from LongPoint to LongField. (#12052) 2023-01-11 09:43:43 +01:00
Benjamin Trent cc29102a24
Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064) 2023-01-11 09:20:47 +01:00
Erik Pellizzon e14327288e
Documenting that IndexReaderContext#leaves() will never return a null value and remove the null checks from the method calls (#12034) 2023-01-06 14:35:52 -08:00
Uwe Schindler 5fccaec166 Remove deprecated APIs after #12066; this also removes another one that was missed earlier 2023-01-05 11:53:16 +01:00
Uwe Schindler 7f483bd618
Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066) 2023-01-04 19:17:03 +01:00
Uwe Schindler 2f602c01dd
Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062) 2023-01-03 19:10:28 +01:00
Lu Xugang 19cc6cdf66
Out of boundary in CombinedFieldQuery#addTerm (#12046) 2023-01-03 15:33:36 +08:00
Uwe Schindler e2ee09d0c5
Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058) 2023-01-02 00:05:36 +01:00
twosom 4676a735c1
fix typo analysis-kuromoji (#12047) 2023-01-01 10:58:50 -05:00
Patrick Zhai 4eab1d74e8
Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049) 2022-12-30 13:18:51 -08:00
Greg Miller 21107d811b Add CHANGES entry for GITHUB#11869 2022-12-30 07:46:51 -08:00
Marc D'Mello cbfed77fd3
Github#11869: Add RangeOnRangeFacetCounts (#11901) 2022-12-30 07:38:13 -08:00
Adrien Grand 6f477e5831
Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037)
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
2022-12-27 11:12:56 +01:00
Adrien Grand ddd63d2da3
Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011)
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.

This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher factor. Until now we
were allocating a fixed budget of `maxDoc/64` for doing these merges with
external memory. If this is not enough, sorted slices would be merged in place.

I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
 - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
   in-place merges for all postings up to `docFreq = maxDoc / 4`.
 - Make this RAM budget lazily allocated, rather than eagerly like today. This
   would help not allocate memory in O(maxDoc) for fields like primary keys
   that only have a couple postings per term.

So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
2022-12-27 11:11:18 +01:00
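
The budget proposed in #12011 can be sketched as follows. This is a hypothetical illustration rather than the `TimSorter` code: it assumes, as the message states, that sorting a postings list of `docFreq` entries needs at most `docFreq / 2` temporary slots, caps the buffer at `maxDoc / 8`, and grows it only on demand.

```java
final class SortBudgetSketch {
  private final int maxTempSlots; // upper bound on the temporary buffer
  private int[] tmp = new int[0]; // lazily grown, starts empty

  SortBudgetSketch(int maxDoc) {
    this.maxTempSlots = maxDoc / 8;
  }

  // Returns true if a merge needing this many slots can use external memory;
  // false means the caller would have to fall back to an in-place merge.
  boolean reserve(int slotsNeeded) {
    if (slotsNeeded > maxTempSlots) {
      return false;
    }
    if (tmp.length < slotsNeeded) {
      tmp = new int[slotsNeeded]; // allocate only what is actually required
    }
    return true;
  }

  int allocatedSlots() {
    return tmp.length;
  }
}
```

For `maxDoc = 1024` the cap is 128 slots, so a postings list with `docFreq = 256` (that is, `maxDoc / 4`) still fits, while a primary-key field with two postings per term only ever allocates a single slot.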
Adrien Grand f5ea0412eb
Replace JIRA release instructions with GitHub. (#11968) 2022-12-27 11:08:46 +01:00
Adrien Grand e9dc4f9188
Avoid sorting values of multi-valued writers if there is a single value. (#12039)
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
2022-12-27 11:03:06 +01:00
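
The shortcut described in #12039 amounts to a guard around the sort call. A minimal sketch, with a hypothetical class and method name rather than the actual doc-values writer code:

```java
import java.util.Arrays;

final class SingleValueSortSketch {
  // Sorts the first `count` values of a document, skipping the call entirely
  // when there are zero or one values and the order is already trivially correct.
  static void sortValues(long[] values, int count) {
    if (count > 1) {
      Arrays.sort(values, 0, count);
    }
  }
}
```

The result is the same as always calling `Arrays#sort`; the guard only avoids the range checks and algorithm selection for the single-value case.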