Commit Graph

35795 Commits

Author SHA1 Message Date
Dawid Weiss 25c4310bd5
LUCENE-10461: fix windows launch script for luke so that it works with integration tests AND actual command line. Cmd escaping rules and the start command line are absolutely insane. (#743) 2022-03-12 19:39:31 +09:00
Dawid Weiss 9e9c457f80
LUCENE-10459: Update smoke tester for 9.1 (#744)
Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.

Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-03-11 10:22:17 -08:00
Dawid Weiss e999056c19 LUCENE-10311: avoid division by zero on small sets. 2022-03-09 11:41:01 +01:00
Mayya Sharipova e5717cddfd
LUCENE-10408 Test correction checksum (#734)
Use double instead of float to test vector values checksum
2022-03-09 08:02:40 +00:00
Spyros Kapnissis 8afec33e74
LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer objects (#380)
Instead of caching dictionary strings and building multiple redundant DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney <michael@michaelgibney.net>
2022-03-08 12:47:16 -05:00
Daniel Doubrovkine (dB.) 7aec489945
Fix: typo + +minScore. (#735) 2022-03-08 08:21:22 -05:00
Yuting Gan 10b78714c1
Fixed a typo in the Javadoc of TaxonomyReader (#732)
Co-authored-by: Yuting Gan <ganyutingi@gmail.com>
2022-03-06 19:21:34 -05:00
Adrien Grand ae16917c1d
LUCENE-10311: Remove pop_XXX helpers from `BitUtil`. (#724)
As @rmuir noted, it would be as simple and create less cognitive overhead to
use `Long#bitCount` directly.
2022-03-05 18:38:57 +01:00
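A minimal sketch of what calling `Long#bitCount` directly looks like when counting set bits over a `long[]`. The class and method names here are hypothetical and are not taken from `BitUtil` or any other Lucene class.

```java
// Hypothetical illustration, not Lucene code.
final class PopCountSketch {

  /** Counts the set bits across all words using the JDK's Long.bitCount. */
  static long popCount(long[] words) {
    long count = 0;
    for (long word : words) {
      count += Long.bitCount(word);
    }
    return count;
  }

  public static void main(String[] args) {
    long[] words = {0b1011L, -1L, 0L};
    System.out.println(popCount(words)); // 3 + 64 + 0 = 67
  }
}
```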
Adrien Grand 8086ef9f45 LUCENE-10455: CHANGES entry. 2022-03-05 18:32:41 +01:00
Adrien Grand 9d732380ae
LUCENE-10453: Speed up euclidean distances. (#725) 2022-03-05 18:31:56 +01:00
Chris Lu 2700c6b525
LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext) (#729) 2022-03-05 18:27:29 +01:00
Alan Woodward e049e426dd
LUCENE-10431: Remove MultiTermQuery.setRewriteMethod() (#726) 2022-03-04 11:54:02 +00:00
Dawid Weiss 81ab1e598f
LUCENE-10447: always use utf8 for forked process encoding. Use the sa… (#717) 2022-03-03 20:53:20 +01:00
Alan Woodward 3f994dec53
LUCENE-10431: Deprecate MultiTermQuery.setRewriteMethod() (#722)
Allowing users to mutate MultiTermQuery can give rise to odd bugs, for example
in wrapper queries such as BooleanQuery which lazily calculate their hashcodes
and then cache the result. This commit deprecates the setRewriteMethod()
method on MultiTermQuery, in preparation for removing it entirely, and adds
constructor parameters to the various MTQ implementations as a preferred
way to set the rewrite method.
2022-03-03 11:08:39 +00:00
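The hazard described above can be reproduced with a small sketch. `MutableQuery` and `CachingWrapper` are hypothetical stand-ins (they are not `MultiTermQuery` or `BooleanQuery`), and the integer `rewriteMethodId` is an assumption made only to keep the hash values deterministic.

```java
// Hypothetical illustration of a stale cached hash code, not Lucene code.
final class MutableQuerySketch {

  /** Stand-in for a query whose rewrite method can be changed after construction. */
  static final class MutableQuery {
    int rewriteMethodId;

    MutableQuery(int rewriteMethodId) {
      this.rewriteMethodId = rewriteMethodId;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof MutableQuery && ((MutableQuery) o).rewriteMethodId == rewriteMethodId;
    }

    @Override
    public int hashCode() {
      return Integer.hashCode(rewriteMethodId);
    }
  }

  /** Stand-in for a wrapper query that computes its hash code lazily and caches it. */
  static final class CachingWrapper {
    final MutableQuery inner;
    private int cachedHash; // 0 means "not computed yet"

    CachingWrapper(MutableQuery inner) {
      this.inner = inner;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof CachingWrapper && ((CachingWrapper) o).inner.equals(inner);
    }

    @Override
    public int hashCode() {
      if (cachedHash == 0) {
        cachedHash = 31 * inner.hashCode() + 7;
      }
      return cachedHash;
    }
  }

  /** Returns true when two equal wrappers end up with different hash codes. */
  static boolean hashIsStaleAfterMutation() {
    MutableQuery query = new MutableQuery(1);
    CachingWrapper wrapper = new CachingWrapper(query);
    wrapper.hashCode(); // caches a hash based on rewriteMethodId == 1
    query.rewriteMethodId = 2; // mutation after the hash was cached
    CachingWrapper fresh = new CachingWrapper(new MutableQuery(2));
    return wrapper.equals(fresh) && wrapper.hashCode() != fresh.hashCode();
  }
}
```

Passing the rewrite method through the constructor, as the commit proposes, removes the mutation step that makes the cached value go stale.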
Adrien Grand bff4246476 LUCENE-10002: Fix test failure.
When IndexSearcher is created with a threadpool it becomes impossible to assert
on the number of evaluated hits overall.
2022-03-03 10:10:35 +01:00
Adrien Grand 44a2a82319
LUCENE-10428: Avoid infinite loop under error conditions. (#711)
Co-authored-by: dblock <dblock@dblock.org>
2022-03-03 09:42:12 +01:00
Adrien Grand ca73ed1c28
LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate). (#710)
This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.
2022-03-03 08:48:44 +01:00
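A rough sketch of the sampling idea, assuming a fixed stride over the backing `long[]`. The class name, the `stride` parameter, and the fallback for short arrays are assumptions for illustration; the actual sampling scheme in `FixedBitSet` may differ.

```java
// Hypothetical illustration of a sampled pop count, not Lucene code.
final class ApproxCardinalitySketch {

  /** Exact pop count over every word. */
  static long exactCardinality(long[] bits) {
    long count = 0;
    for (long word : bits) {
      count += Long.bitCount(word);
    }
    return count;
  }

  /** Pop count over every stride-th word, scaled up to the full array length. */
  static long approximateCardinality(long[] bits, int stride) {
    if (stride <= 0) {
      throw new IllegalArgumentException("stride must be positive");
    }
    if (bits.length < stride) {
      return exactCardinality(bits); // too small to be worth sampling
    }
    long sampledCount = 0;
    int sampledWords = 0;
    for (int i = 0; i < bits.length; i += stride) {
      sampledCount += Long.bitCount(bits[i]);
      sampledWords++;
    }
    return sampledCount * bits.length / sampledWords;
  }
}
```

The estimate is only as good as the sample: a bitset whose set bits cluster in the skipped words will be miscounted, which is the trade-off for touching fewer words.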
Peter Gromov 9ed526b70e [hunspell] make SuggestionTimeoutException public
to make it easier for custom checkCanceled implementations to throw it depending on their ad-hoc conditions and get partial results
2022-03-02 21:24:24 +01:00
Adrien Grand 46f9a25216 LUCENE-10237: Move CHANGES entry to 9.1. 2022-03-02 09:39:54 +01:00
Lu Xugang e996f1d8e7
LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery could be rewritten to MatchAllDocsQuery (#720) 2022-03-02 09:28:40 +01:00
Lu Xugang e8e522a52b
LUCENE-10439: update CHANGES.txt (#714) 2022-03-02 09:23:44 +01:00
Anand 14726dec51
LUCENE-10237 : Add MergeOnCommitTieredMergePolicy to sandbox (#446) 2022-03-02 09:19:25 +01:00
Luca Cavanna 1b083ea039
LUCENE-10002: Replace test usages of TopScoreDocCollector with a corresponding collector manager (#716)
In the effort of replacing usages of IndexSearcher#search(Query, Collector) with IndexSearcher#search(Query, CollectorManager), this commit replaces many test usages of TopScoreDocCollector with its corresponding CollectorManager created by calling TopScoreDocCollector#createSharedManager.
2022-03-02 09:14:36 +01:00
Greg Miller 51797dc7f1
LUCENE-10440: Reduce visibility of TaxonomyFacets and FloatTaxonomyFacets (#712) 2022-03-01 06:02:40 -08:00
Lu Xugang 6224d0b157
LUCENE-10442: When indexQuery and/or dvQuery is a MatchAllDocsQuery, IndexOrDocValuesQuery should be rewritten to MatchAllDocsQuery (#715) 2022-02-28 18:46:32 +01:00
Robert Muir 466278e149
LUCENE-10421: use Constant instead of relying upon timestamp (#686) 2022-02-25 00:38:13 -05:00
Greg Miller 4af516a149 Remove TODO for LUCENE-9952 since that issue was fixed 2022-02-24 12:03:55 -08:00
Adrien Grand d47ff38d70
LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)
`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?
2022-02-24 13:38:02 +01:00
Adrien Grand 44d7d962ae
LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)
Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.
2022-02-24 13:36:10 +01:00
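The size trade-off between the two encodings can be sketched as follows. This is a self-contained illustration with hypothetical names; the variable-length scheme shown (7 payload bits per byte, high bit as a continuation flag) is the usual vint layout, and the byte order used for the fixed-width ints is an arbitrary choice for the sketch.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of vint versus fixed-width int sizes, not Lucene code.
final class VIntVsIntSketch {

  /** Variable-length encoding: 7 payload bits per byte, high bit set on all but the last byte. */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Fixed-width encoding: always four bytes. */
  static void writeInt(ByteArrayOutputStream out, int value) {
    out.write(value >>> 24);
    out.write(value >>> 16);
    out.write(value >>> 8);
    out.write(value);
  }

  static int vIntBytes(int[] docIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int docId : docIds) {
      writeVInt(out, docId);
    }
    return out.size();
  }

  static int intBytes(int[] docIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int docId : docIds) {
      writeInt(out, docId);
    }
    return out.size();
  }

  public static void main(String[] args) {
    int[] docIds = {0, 127, 128, 20000, 3000000};
    System.out.println(vIntBytes(docIds)); // 1 + 1 + 2 + 3 + 4 = 11
    System.out.println(intBytes(docIds)); // 5 * 4 = 20
  }
}
```

Fixed-width ints take more space for small values, but they can be read back into an `int[]` without decoding each value byte by byte.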
Lu Xugang 550d1305db
LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery (#705) 2022-02-24 10:13:03 +01:00
gf2121 b0ca227862
LUCENE-10417: Revert "LUCENE-10315" (#706) 2022-02-24 16:41:17 +08:00
Julie Tibshirani d9c2e46824 LUCENE-10382: Fix testSearchWithVisitedLimit failures 2022-02-23 19:56:38 -08:00
Lu Xugang 7ec89603e3
LUCENE-10435: add CHANGES.txt entry (#704) 2022-02-23 15:41:02 -08:00
Julie Tibshirani b40a750aa8
LUCENE-10382: Ensure kNN filtering works with other codecs (#700)
The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test to `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used

This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.
2022-02-23 14:58:27 -08:00
Julie Tibshirani 4364bdd63e
LUCENE-10054: Make sure to use Lucene90 codec in unit tests (#699)
Before we were using the default Lucene91 codec, so we weren't exercising the
old format.
2022-02-23 08:22:59 -08:00
Lu Xugang 43e89d6a29
LUCENE-10435: Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery (#701) 2022-02-23 13:53:56 +01:00
Ignacio Vera ab47db4fee
LUCENE-10437: Improve error message in the Tessellator for polygon with all points collinear (#703)
Polygon tessellator throws a more informative error message when the provided polygon does not contain enough non-collinear points.
2022-02-23 13:51:44 +01:00
Tomoko Uchida f8040d565f LUCENE-10416: move changes entry to v10.0.0 2022-02-22 20:29:14 +09:00
Tomoko Uchida c7602a425c
migrate to temurin (#697) 2022-02-21 17:09:21 +09:00
Lu Xugang 36a2149d43
LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (#691) 2022-02-21 07:08:23 +01:00
Tomoko Uchida 76c9fd4e38 LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori 2022-02-20 21:39:03 +09:00
Tomoko Uchida e7a29c4c4c
Remove deprecated constructors in Nori (#695) 2022-02-20 17:40:45 +09:00
Tomoko Uchida 58fa95deea
LUCENE-10400: revise binary dictionaries' constructor in nori (#693) 2022-02-20 16:16:56 +09:00
Julie Tibshirani f0d17e94d9
LUCENE-10408: Fix vector values iteration bug (#690)
Now that there is special logic to handle the dense case, we need to adjust some
assertions in VectorValues#advance.
2022-02-18 11:36:22 -08:00
Julie Tibshirani cdb74e155a Temporarily mute TestKnnVectorQuery#testRandomWithFilter 2022-02-17 14:50:01 -08:00
Julie Tibshirani 8ca372573d
LUCENE-10382: Support filtering in KnnVectorQuery (#656)
This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>
2022-02-17 11:35:25 -08:00
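The control flow described above (approximate search with a visit budget, then a fallback to exact search) might be sketched like this. Everything here is hypothetical: `ApproximateSearch` is a stand-in for the HNSW graph search, it signals "limit exceeded" by returning `null`, and the filter is a plain `boolean[]` rather than a bit set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of filtered kNN with an exact-search fallback, not Lucene code.
final class FilteredKnnSketch {

  /** Stand-in for a graph search; returns null if it visits more than visitedLimit nodes. */
  interface ApproximateSearch {
    int[] search(float[] query, int k, boolean[] accepted, int visitedLimit);
  }

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Brute-force search over only the documents accepted by the filter. */
  static int[] exactSearch(float[][] vectors, float[] query, int k, boolean[] accepted) {
    List<Integer> docs = new ArrayList<>();
    for (int doc = 0; doc < vectors.length; doc++) {
      if (accepted[doc]) {
        docs.add(doc);
      }
    }
    docs.sort(Comparator.comparingDouble((Integer doc) -> squaredDistance(vectors[doc], query)));
    int[] top = new int[Math.min(k, docs.size())];
    for (int i = 0; i < top.length; i++) {
      top[i] = docs.get(i);
    }
    return top;
  }

  /** Tries the approximate search first, bounded by the number of filter matches. */
  static int[] search(
      float[][] vectors, float[] query, int k, boolean[] accepted, ApproximateSearch approximate) {
    int matching = 0;
    for (boolean match : accepted) {
      if (match) {
        matching++;
      }
    }
    int[] hits = approximate.search(query, k, accepted, matching);
    // A null result means the visit budget was exhausted, so fall back to exact search.
    return hits != null ? hits : exactSearch(vectors, query, k, accepted);
  }
}
```

Because the budget equals the number of matching documents and the fallback scans exactly those documents, the worst case visits about twice as many documents as the exact search alone.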
Vigya Sharma c132bbf677
LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (#677)
Since all documents are required to use the same features (LUCENE-9334) we can
rewrite DocValuesFieldExistsQuery to a MatchAllDocsQuery whenever terms or
points have a docCount that is equal to maxDoc.
2022-02-17 11:20:06 -08:00
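The rewrite condition can be sketched as a check over per-segment statistics. `LeafStats` and `canRewriteToMatchAll` are hypothetical names; in this sketch `docCount` stands for the number of documents that have the field in a segment and `maxDoc` for the segment's total document count.

```java
// Hypothetical illustration of the docCount == maxDoc check, not Lucene code.
final class FieldExistsRewriteSketch {

  /** Assumed per-segment statistics for one field. */
  static final class LeafStats {
    final int docCount; // documents in the segment that have the field
    final int maxDoc; // total documents in the segment

    LeafStats(int docCount, int maxDoc) {
      this.docCount = docCount;
      this.maxDoc = maxDoc;
    }
  }

  /** True only if every segment reports that all of its documents have the field. */
  static boolean canRewriteToMatchAll(LeafStats[] leaves) {
    for (LeafStats leaf : leaves) {
      if (leaf.docCount != leaf.maxDoc) {
        return false; // one segment with a gap is enough to rule out the rewrite
      }
    }
    return true;
  }
}
```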
Greg Miller 00029f1ec4 Add CHANGES entry for LUCENE-10398 2022-02-17 09:26:11 -08:00
spike.liu fc3c790ab4
LUCENE-10398: Add static method for getting Terms from LeafReader (#678)
Co-authored-by: chengliu@ctrip.com <chengliu@ctrip.com>
2022-02-17 09:21:51 -08:00
Mayya Sharipova f8c5408be7
LUCENE-10408 Better encoding of doc Ids in vectors (#649)
Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc IDs of all documents that have vectors,
which is not very efficient.
This improves the encoding for the case where all documents
have vectors: we don't write document IDs, but just write a
single short value, a dense marker.
2022-02-17 11:34:42 +01:00
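A sketch of the dense-marker idea, under stated assumptions: the marker values (`-1` for dense, `0` for sparse), the layout of the sparse branch, and the class name are all hypothetical and do not describe the actual `Lucene91HnswVectorsFormat` file layout. The doc IDs are assumed to be unique values in `[0, maxDoc)`.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of a dense marker, not the real on-disk format.
final class DenseDocIdsSketch {
  static final short DENSE = -1; // assumed marker: every document has a vector
  static final short SPARSE = 0; // assumed marker: explicit doc IDs follow

  static byte[] encode(int[] docIds, int maxDoc) {
    if (docIds.length == maxDoc) {
      // Dense case: the IDs are implied as 0..maxDoc-1, so only the marker is written.
      return ByteBuffer.allocate(Short.BYTES).putShort(DENSE).array();
    }
    ByteBuffer buffer = ByteBuffer.allocate(Short.BYTES + Integer.BYTES * (1 + docIds.length));
    buffer.putShort(SPARSE);
    buffer.putInt(docIds.length);
    for (int docId : docIds) {
      buffer.putInt(docId);
    }
    return buffer.array();
  }
}
```

In the dense case the encoded size stays at two bytes regardless of how many documents the segment holds.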