Commit Graph

36485 Commits

Author SHA1 Message Date
Alan Woodward f38d51ee89
Don't wrap readers when checking for term vector access in test (#12136)
In TestUnifiedHighlighterTermVec, we have a special reader which counts the
number of times term vectors are accessed, so that we can assert that caching
works correctly here. There is some special logic in place to skip the check
when the test framework wraps readers with CheckIndex or ParallelReader;
however, this logic no longer works with ParallelReader in particular, because
term vectors are now accessed through an anonymous class.

A simpler solution here is to call newSearcher(reader, false), which disables
wrapping, meaning that we can remove this extra logic entirely.

Fixes #12115
2023-02-09 09:29:41 +00:00
John Mazanec 776149f0f6
Reuse HNSW graph for intialization during merge (#12050)
* Remove implicit addition of vector 0

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Enable out of order insertion of nodes in hnsw

Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add ability to initialize from graph

Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Utilize merge with graph init in HNSWWriter

Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Minor modifications to Lucene95HnswVectorsWriter

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Use TreeMap for graph structure for levels > 0

Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Refactor initializer to be in static create method

Refeactors initialization from graph to be accessible via a create
static method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Address review comments

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add change log entry

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Remove empty iterator for neighborqueue

Signed-off-by: John Mazanec <jmazane@amazon.com>

---------

Signed-off-by: John Mazanec <jmazane@amazon.com>
2023-02-07 14:42:03 -05:00
Adrien Grand ab074d5483
Introduce a new `KeywordField`. (#12054)
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
2023-02-07 18:19:09 +01:00
Adrien Grand d69326408c Fix javadoc references. 2023-02-07 11:06:04 +01:00
Adrien Grand 8b572f074e Fix nightly compatibility tests after #12116. 2023-02-07 10:46:30 +01:00
Uwe Schindler bc09c2a0d9
Add tests for size() and contains() to LongHashSet; fix size bug with MISSING (#12134) 2023-02-07 00:52:29 +01:00
Uwe Schindler 57403e26e0
Simplify LongHashSet by completely removing java.util.Set APIs (#12133) 2023-02-06 22:43:20 +01:00
Uwe Schindler 8564da434d
Generate gradle.properties from gradlew (#12131)
* SOLR-16641 - Generate gradle.properties from gradlew (#1320)
* Adapt for Lucene
* Remove localSettings from smoker; thanks @colvinco
* Print properties at end for debugging
* Add CHANGES.txt entry

---------

Co-authored-by: Colvin Cowie <colvin.cowie.dev@gmail.com>
Co-authored-by: Colvin Cowie <51863265+colvinco@users.noreply.github.com>
2023-02-06 19:47:15 +01:00
Robert Muir 02b202866c
Add CHANGES.txt for #12129 2023-02-06 12:49:49 -05:00
Robert Muir 0bc4135695
Speedup sandbox/DocValuesTermsQuery (#12129)
* Optimize the common case that docs only have single values for the field
* In the multivalued case, terminate reading docvalues if they are > maximum set ordinal
* Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery, newSlowExactQuery

Like other slow queries in these classes, it's currently only recommended to use with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery)
2023-02-06 12:47:53 -05:00
Robert Muir 10d9c7440b
Speed up docvalues set query by making use of sortedness (#12128)
LongHashSet is used for the set of numbers, but it has some issues:
* tries to hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration

Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
2023-02-06 12:14:02 -05:00
Robert Muir a6bceb7cf0
Remove useless abstractions in DocValues-based queries (#12127)
There's no need to make things abstract: DocValues does the right thing
Optimizing for where no docs for the field in the segment exist is easy, simple null check (replacing the existing one!)
2023-02-06 11:49:25 -05:00
Adrien Grand 684f31ef06 Fix ambiguous links. (#12116) 2023-02-06 17:45:05 +01:00
Adrien Grand 96136b4282
Improve document API for stored fields. (#12116)
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
 - If there is a problem, e.g. all of these 3 methods return null, it's
   currently discovered late, when we already passed the responsibility of
   writing data from IndexingChain to the codec.
 - numericValue() is used both for numeric doc values and storage. This makes
   it impossible to implement a `double` field that is stored and doc-valued,
   as numericValue() needs to return simultaneously a number that consists of
   the double for storage, and the long bits of the double for doc values.
 - binaryValue() is used both for sorted(_set) doc values and storage. This
   makes it impossible to implement `keyword` fields that is stored and
   doc-valued, as the field returns a non-null value for both binaryValue() and
   stringValue() and stored fields no longer know which field to store.

This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
2023-02-06 17:08:13 +01:00
Benjamin Trent c2bef381d1
Fix TestFeatureField#testBasicsNonScoringCase test (#12130)
Sometimes the random search lucene test searcher will wrap the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher or the reader may be different between creating the weight with the searcher vs. accessing the leaf context for the scorer.
2023-02-06 10:15:00 -05:00
Benjamin Trent 4bbc273a43
Add `FeatureQuery` weight caching in non-scoring case (#12118)
While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled.

A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries.

related to: https://github.com/apache/lucene/issues/11799
2023-02-02 13:00:33 -05:00
Ioana Tagirta d591c9c37a
Always generate a polygon that has no self intersections (#12124) 2023-02-02 16:57:42 +01:00
Jean-François BOEUF 5acca82633
Reduce bloom filter size by using the optimal count for hash functions. (#11900) 2023-02-01 14:35:50 +01:00
Marc D'Mello f9cb6a3b42
GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes (#11958)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2023-02-01 14:11:12 +01:00
Luca Cavanna 57397f0cab
Adjust return type for VectorUtil methods (#12122)
Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while
the variable used to store the value is an int. They can just return an int.
2023-02-01 10:48:05 +01:00
Luca Cavanna ce433e5449
Remove VectorUtil#toBytesRef (#12121)
The method is currently only used in its corresponding test method.
2023-02-01 10:47:50 +01:00
Luca Cavanna d7d07c453f
Release wizard: update folder name in stage artifacts command (#12117) 2023-02-01 10:47:32 +01:00
Luca Cavanna 73e2ae2705 Add back-compat indices for 9.5.0 2023-01-30 20:50:01 +01:00
Luca Cavanna 102a4de32f DOAP changes for release 9.5.0 2023-01-30 15:30:35 +01:00
Luca Cavanna 2bd87b7909 Fix formatting of Version.java 2023-01-25 13:49:23 +01:00
Luca Cavanna 59da15b0e5 Add minor version 9.6.0 2023-01-25 13:42:38 +01:00
Luca Cavanna 72c8b334a0 Add missing LUCENE_9_5_0 version 2023-01-25 13:37:48 +01:00
Benjamin Trent d1fa52e62f
Fix flaky TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults test (#12110) 2023-01-25 10:47:20 +01:00
Luca Cavanna 5a51ce1d5d
SimpleText codec to support writing byte vectors (#12111)
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.

Note that while support is added to the BufferingKnnVectorsWriter base class, 90, 91 and 92 writers don't need to support
byte vectors and will throw unsupported operation exception when attempting to do that.
2023-01-25 10:43:35 +01:00
Luca Cavanna 95e2cfcc1e
Remove deprecated float vector classes and methods (#12107)
Follow-up of #12105 to remove the deprecated classes for the next major version.

Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
2023-01-24 16:25:36 +01:00
Adrien Grand ce8eaf138c
MemoryIndex should not fail integer fields that enable doc values. (#12109)
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
2023-01-24 11:51:49 +01:00
Luca Cavanna 25623a63bf format changelog entry and add missing author name 2023-01-24 10:57:52 +01:00
Luca Cavanna 92e67ec626
Rename vector float classes and methods (#12105)
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectoryQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.

This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.

As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.

Relates to #11963
2023-01-24 10:20:24 +01:00
Michael Gibney d4006c0362
remove username from MANIFEST.MF in build artifacts (#12096) 2023-01-23 16:45:38 -05:00
Michael Gibney 832552e0ac
buildAndPushRelease should optionally pause before assembleRelease (#12095) 2023-01-23 16:39:10 -05:00
Luca Cavanna f5bd28662f Remove BytesRef usage from SortingCodeReader
Follow-up of #12102
2023-01-23 22:29:10 +01:00
Luca Cavanna 4594400216
Replace BytesRef usages in byte vectors API with byte[] (#12102)
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
2023-01-23 22:06:00 +01:00
Luca Cavanna f8ee852696
add missing changelog for #12064 2023-01-23 21:22:15 +01:00
Adrien Grand de94fa97fb
Remove binaryValue() on VectorValues and ByteVectorValues. (#12101)
This method tries to expose an encoded view of vectors, but we shouldn't have
this part of our user-facing API. With this change, the way vectors are encoded
is entirely on the codec.
2023-01-23 14:42:49 +01:00
Alessandro Benedetti 77eca4bb38
Introduce getters for KnnVectorQuery(#12029) 2023-01-23 12:35:08 +01:00
Chris Hostetter 9007f746a3 WordBreakSpellChecker now correctly respects maxEvaluations (#12077) 2023-01-22 15:44:29 -07:00
Vigya Sharma 519adcc954
Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098) 2023-01-19 16:16:42 -08:00
Michael Gibney fa5dcbad9f
update releaseWizard.py to support offline gpg key (#12085)
porting analogous change from solr: https://github.com/apache/solr/pull/1288
2023-01-19 14:41:35 +01:00
Lu Xugang a9fd21b6af
Same bound with fallbackQuery (#12084)
IndexSortSortedNumericDocValuesRangeQuery should have the same bound with fallbackQuery.
2023-01-19 14:33:58 +08:00
Vigya Sharma dc33ade76d
Remove UTF8TaxonomyWriterCache (#12092)
Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache
2023-01-18 13:20:26 -08:00
twosom 318b002e0b
fix typo in KoreanNumberFilter (#12045)
* fix typo in KoreanNumberFilter

* fix doc format
2023-01-17 22:32:59 -08:00
Robert Muir 4fe8424925
Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087)
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()

This complements the existing docvalues-based range queries, with a set query.

Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
2023-01-16 09:38:08 -05:00
Alan Woodward fc7b937aff
Don't throw UOE when highlighting FieldExistsQuery (#12088)
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().

FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.

This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
2023-01-16 11:47:51 +00:00
Robert Muir 4f33aa8515
Upgrade to errorprone 2.18 (#12086) 2023-01-14 14:39:23 -05:00
Robert Muir e06f8c2e8b
Update to error-prone 2.17 (#12056) 2023-01-14 11:38:39 -05:00