Commit Graph

36420 Commits

Author SHA1 Message Date
Benjamin Trent 4bbc273a43
Add `FeatureQuery` weight caching in non-scoring case (#12118)
While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled.

A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries.

related to: https://github.com/apache/lucene/issues/11799
2023-02-02 13:00:33 -05:00
Ioana Tagirta d591c9c37a
Always generate a polygon that has no self intersections (#12124) 2023-02-02 16:57:42 +01:00
Jean-François BOEUF 5acca82633
Reduce bloom filter size by using the optimal count for hash functions. (#11900) 2023-02-01 14:35:50 +01:00
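The commit title above does not say how the optimal count is computed. As a rough illustration only, the textbook optimum for a bloom filter with m bits and n keys is k = (m / n) * ln 2 hash functions. The class and method names below are hypothetical and are not Lucene's actual code.

```java
// Hypothetical helper illustrating the textbook formula; not Lucene's bloom filter code.
final class BloomSizing {
  private BloomSizing() {}

  /** Returns the hash-function count k that minimizes false positives for m bits and n keys. */
  static int optimalHashCount(long numBits, long numKeys) {
    if (numBits <= 0 || numKeys <= 0) {
      throw new IllegalArgumentException("numBits and numKeys must be positive");
    }
    // Textbook optimum: k = (m / n) * ln 2, rounded to the nearest integer, never below 1.
    long k = Math.round(((double) numBits / numKeys) * Math.log(2));
    return (int) Math.max(1L, k);
  }

  /** Approximate false-positive rate (1 - e^(-k * n / m))^k for the given sizing. */
  static double falsePositiveRate(long numBits, long numKeys, int numHashes) {
    return Math.pow(1 - Math.exp(-(double) numHashes * numKeys / numBits), numHashes);
  }
}
```

With 10 bits per key, for example, `BloomSizing.optimalHashCount(10_000, 1_000)` yields 7, and that count gives a lower approximate false-positive rate than a smaller count such as 2 at the same size.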
Marc D'Mello f9cb6a3b42
GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes (#11958)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2023-02-01 14:11:12 +01:00
Luca Cavanna 57397f0cab
Adjust return type for VectorUtil methods (#12122)
Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while
the variable used to store the value is an int. They can just return an int.
2023-02-01 10:48:05 +01:00
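To illustrate the reasoning in #12122 above: when both inputs are byte arrays, every product and difference is an integer, so an int accumulator and an int return type lose nothing. This is a minimal sketch with a hypothetical class name, not the actual VectorUtil source.

```java
// Hypothetical stand-in for the byte[] variants discussed above; not Lucene's VectorUtil.
final class ByteVectorMath {
  private ByteVectorMath() {}

  /** Dot product of two byte vectors, accumulated and returned as an int. */
  static int dotProduct(byte[] a, byte[] b) {
    checkSameLength(a, b);
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // bytes are promoted to int before multiplying
    }
    return total;
  }

  /** Squared Euclidean distance between two byte vectors, returned as an int. */
  static int squareDistance(byte[] a, byte[] b) {
    checkSameLength(a, b);
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      int diff = a[i] - b[i];
      total += diff * diff;
    }
    return total;
  }

  private static void checkSameLength(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
    }
  }
}
```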
Luca Cavanna ce433e5449
Remove VectorUtil#toBytesRef (#12121)
The method is currently only used in its corresponding test method.
2023-02-01 10:47:50 +01:00
Luca Cavanna d7d07c453f
Release wizard: update folder name in stage artifacts command (#12117) 2023-02-01 10:47:32 +01:00
Luca Cavanna 73e2ae2705 Add back-compat indices for 9.5.0 2023-01-30 20:50:01 +01:00
Luca Cavanna 102a4de32f DOAP changes for release 9.5.0 2023-01-30 15:30:35 +01:00
Luca Cavanna 2bd87b7909 Fix formatting of Version.java 2023-01-25 13:49:23 +01:00
Luca Cavanna 59da15b0e5 Add minor version 9.6.0 2023-01-25 13:42:38 +01:00
Luca Cavanna 72c8b334a0 Add missing LUCENE_9_5_0 version 2023-01-25 13:37:48 +01:00
Benjamin Trent d1fa52e62f
Fix flaky TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults test (#12110) 2023-01-25 10:47:20 +01:00
Luca Cavanna 5a51ce1d5d
SimpleText codec to support writing byte vectors (#12111)
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.

Note that while support is added to the BufferingKnnVectorsWriter base class, the 90, 91 and 92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when asked to write them.
2023-01-25 10:43:35 +01:00
Luca Cavanna 95e2cfcc1e
Remove deprecated float vector classes and methods (#12107)
Follow-up of #12105 to remove the deprecated classes for the next major version.

Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
2023-01-24 16:25:36 +01:00
Adrien Grand ce8eaf138c
MemoryIndex should not fail integer fields that enable doc values. (#12109)
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
2023-01-24 11:51:49 +01:00
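The failure mode described in #12109 above can be reproduced with plain Java, independent of Lucene. The sketch below uses a hypothetical helper class: casting a boxed `Integer` to `Long` throws, while going through `Number#longValue` works for any boxed numeric type.

```java
// Hypothetical helper showing the two casting strategies; not MemoryIndex code.
final class NumericValues {
  private NumericValues() {}

  /** Safe conversion: Integer, Long, Short, etc. all extend Number. */
  static long toLong(Object value) {
    return ((Number) value).longValue();
  }

  /** Returns true if the direct cast to Long throws for the given value. */
  static boolean uncheckedLongCastFails(Object value) {
    try {
      long ignored = (Long) value; // throws ClassCastException for an Integer
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }
}
```

Here `NumericValues.toLong(Integer.valueOf(42))` returns 42, whereas `NumericValues.uncheckedLongCastFails(Integer.valueOf(42))` reports that the direct cast fails.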
Luca Cavanna 25623a63bf format changelog entry and add missing author name 2023-01-24 10:57:52 +01:00
Luca Cavanna 92e67ec626
Rename vector float classes and methods (#12105)
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.

This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.

As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.

Relates to #11963
2023-01-24 10:20:24 +01:00
Michael Gibney d4006c0362
remove username from MANIFEST.MF in build artifacts (#12096) 2023-01-23 16:45:38 -05:00
Michael Gibney 832552e0ac
buildAndPushRelease should optionally pause before assembleRelease (#12095) 2023-01-23 16:39:10 -05:00
Luca Cavanna f5bd28662f Remove BytesRef usage from SortingCodecReader
Follow-up of #12102
2023-01-23 22:29:10 +01:00
Luca Cavanna 4594400216
Replace BytesRef usages in byte vectors API with byte[] (#12102)
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
2023-01-23 22:06:00 +01:00
Luca Cavanna f8ee852696
add missing changelog for #12064 2023-01-23 21:22:15 +01:00
Adrien Grand de94fa97fb
Remove binaryValue() on VectorValues and ByteVectorValues. (#12101)
This method tries to expose an encoded view of vectors, but we shouldn't have
this part of our user-facing API. With this change, the way vectors are encoded
is entirely on the codec.
2023-01-23 14:42:49 +01:00
Alessandro Benedetti 77eca4bb38
Introduce getters for KnnVectorQuery (#12029) 2023-01-23 12:35:08 +01:00
Chris Hostetter 9007f746a3 WordBreakSpellChecker now correctly respects maxEvaluations (#12077) 2023-01-22 15:44:29 -07:00
Vigya Sharma 519adcc954
Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098) 2023-01-19 16:16:42 -08:00
Michael Gibney fa5dcbad9f
update releaseWizard.py to support offline gpg key (#12085)
porting analogous change from solr: https://github.com/apache/solr/pull/1288
2023-01-19 14:41:35 +01:00
Lu Xugang a9fd21b6af
Same bound with fallbackQuery (#12084)
IndexSortSortedNumericDocValuesRangeQuery should use the same bounds as its fallbackQuery.
2023-01-19 14:33:58 +08:00
Vigya Sharma dc33ade76d
Remove UTF8TaxonomyWriterCache (#12092)
Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache
2023-01-18 13:20:26 -08:00
twosom 318b002e0b
fix typo in KoreanNumberFilter (#12045)
* fix typo in KoreanNumberFilter

* fix doc format
2023-01-17 22:32:59 -08:00
Robert Muir 4fe8424925
Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087)
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()

This complements the existing docvalues-based range queries, with a set query.

Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
2023-01-16 09:38:08 -05:00
Alan Woodward fc7b937aff
Don't throw UOE when highlighting FieldExistsQuery (#12088)
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().

FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.

This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
2023-01-16 11:47:51 +00:00
Robert Muir 4f33aa8515
Upgrade to errorprone 2.18 (#12086) 2023-01-14 14:39:23 -05:00
Robert Muir e06f8c2e8b
Update to error-prone 2.17 (#12056) 2023-01-14 11:38:39 -05:00
Robert Muir 8ca05967e8
remove non-NRT replication support (#12038)
Remove non-NRT replication support in 10.x (to be deprecated in 9.5)
2023-01-14 11:14:46 -05:00
Adrien Grand b5062a2858
MultiCollector shouldn't report that scores are needed when they're not. (#12083)
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This is reasonable when there is
likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`). However, it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
2023-01-13 14:44:17 +01:00
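The rule described in #12083 above can be sketched as follows. The `Mode` enum is a hypothetical stand-in that mirrors the score mode names from the commit message; it is not Lucene's `ScoreMode`, and the handling of an empty collector list is an assumption of this sketch.

```java
// Hypothetical stand-in for the score-mode combination logic; not MultiCollector itself.
final class ScoreModeSketch {
  private ScoreModeSketch() {}

  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(false),
    TOP_SCORES(true),
    TOP_DOCS(false);

    final boolean needsScores;

    Mode(boolean needsScores) {
      this.needsScores = needsScores;
    }
  }

  /** Combines the modes of several sub collectors into a single mode. */
  static Mode combine(Mode... modes) {
    if (modes.length == 0) {
      throw new IllegalArgumentException("at least one mode is required"); // assumption
    }
    boolean allSame = true;
    boolean anyNeedsScores = false;
    for (Mode mode : modes) {
      allSame &= (mode == modes[0]);
      anyNeedsScores |= mode.needsScores;
    }
    if (allSame) {
      return modes[0]; // collectors agree, keep their mode
    }
    // Collectors disagree: visit all hits, but only ask for scores if someone needs them.
    return anyNeedsScores ? Mode.COMPLETE : Mode.COMPLETE_NO_SCORES;
  }
}
```

Under this sketch, `TOP_SCORES` combined with `COMPLETE_NO_SCORES` still yields `COMPLETE`, while `TOP_DOCS` combined with `COMPLETE_NO_SCORES` yields `COMPLETE_NO_SCORES`.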
Luca Cavanna 90a5a71448 fix typo in changelog 2023-01-13 14:12:06 +01:00
Lu Xugang 102622842b
Enhance XXXField#newRangeQuery (#12078)
Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery
2023-01-13 18:12:26 +08:00
Adrien Grand aaab028266
Speed up DocIdMerger on sorted indexes. (#12081)
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This check covers at the same time whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
2023-01-12 18:27:45 +01:00
Adrien Grand 729fedcbac
Speed up 1D BKD merging. (#12079)
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
2023-01-12 18:14:15 +01:00
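The commit message for #12079 above names the two methods but does not explain why one is faster. One plausible reason, stated here as an assumption, is that a comparator specialized for a fixed length can compare whole words instead of looping over a generic range. The sketch below shows that idea for 8-byte values using only the JDK; the class is hypothetical and is not Lucene's `ArrayUtil`.

```java
// Hypothetical illustration of a length-specialized unsigned comparison; not ArrayUtil code.
final class UnsignedCompare {
  private UnsignedCompare() {}

  /** Reads 8 bytes starting at offset as a big-endian long. */
  static long readLongBE(byte[] bytes, int offset) {
    long value = 0;
    for (int i = 0; i < 8; i++) {
      value = (value << 8) | (bytes[offset + i] & 0xFFL);
    }
    return value;
  }

  /** Specialized path: compares exactly 8 bytes as one unsigned 64-bit word. */
  static int compare8(byte[] a, int aOffset, byte[] b, int bOffset) {
    return Long.compareUnsigned(readLongBE(a, aOffset), readLongBE(b, bOffset));
  }

  /** Generic path: range-based unsigned comparison from the JDK (Java 9+). */
  static int compareGeneric(byte[] a, int aOffset, byte[] b, int bOffset, int length) {
    return java.util.Arrays.compareUnsigned(
        a, aOffset, aOffset + length, b, bOffset, bOffset + length);
  }
}
```

Both paths agree on the sign of the result for 8-byte inputs, because comparing big-endian bytes lexicographically as unsigned values is equivalent to comparing the assembled word as an unsigned long.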
Benjamin Trent 59b17452aa
Fix exponential runtime for Boolean#rewrite (#12072)
When #672 was introduced, it added many nice rewrite optimizations. However, when there are many nested Boolean queries under a top-level Boolean#filter clause, its runtime grows exponentially.

The key issue was how BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.

This causes exponential growth in rewrite time based on query depth. The change here aims to short-circuit that and grow only (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.

closes: #12069
2023-01-12 16:50:05 +01:00
Adrien Grand 9ab324d2be Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:52:12 +01:00
Adrien Grand 525a11091c Revert "Allow reusing indexed binary fields. (#12053)"
This reverts commit 84778549af.
2023-01-12 09:48:06 +01:00
Adrien Grand 84778549af
Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:32:13 +01:00
Patrick Zhai d3d9ab0044
Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051) 2023-01-11 10:44:06 -08:00
Adrien Grand a8ef03d979
Never throttle creation of compound files. (#12070)
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.

Closes #12068
2023-01-11 09:57:13 +01:00
Adrien Grand 56ec51e558
Cut over Lucene Demo from LongPoint to LongField. (#12052) 2023-01-11 09:43:43 +01:00
Benjamin Trent cc29102a24
Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064) 2023-01-11 09:20:47 +01:00
Erik Pellizzon e14327288e
Document that IndexReaderContext#leaves() never returns null and remove the null checks from its callers (#12034) 2023-01-06 14:35:52 -08:00