Some of our checks relied on doc IDs corresponding to the order in which docs
were passed to IndexWriter. This is fragile and sometimes resulted in failures.
Now we check against an "id" field instead.
As part of #716 I moved the test to use a collector manager, but I forgot to update one of the assertions.
We can't rely on totalHits being accurate when the search is executed my multiple threads and early terminated.
Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.
Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
Instead of caching dictionary strings and building multiple redundant DictionaryLemmatizer objects.
Co-authored-by: Michael Gibney <michael@michaelgibney.net>
Allowing users to mutate MultiTermQuery can give rise to odd bugs, for example
in wrapper queries such as BooleanQuery which lazily calculate their hashcodes
and then cache the result. This commit deprecates the setRewriteMethod()
method on MultiTermQuery, in preparation for removing it entirely, and adds
constructor parameters to the various MTQ implementations as a preferred
way to set the rewrite method.
This computes a pop count on a sample of the longs that back the bitset.
Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.
In the effort or replacing usages of IndexSearcher#search(Query, Collector) with IndexSearcher#search(Query, CollectorManager), this commit replaces many test usages of TopScoreDocCollector with its corresponding CollectorManager created by calling TopScoreDocCollector#createSharedManager.
`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexContextReader#id` a while back for a similar purpose with `TermStates`,
let's reuse it?
Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.
I made the same change to how we encode nodes on levels for the same reason.
The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used
This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.