35738 Commits

Author SHA1 Message Date
Dawid Weiss 1bf3cbc0b9 gradle 7.3.3 quick upgrade (#856) 2022-04-29 21:04:16 +02:00
Dawid Weiss 47ca4bc21c LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255. 2022-04-29 09:48:23 +02:00
Dawid Weiss cb206196e4 LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851) 2022-04-28 21:28:50 +02:00
Nhat Nguyen 62ebb22cc0 LUCENE-10518: Relax field consistency check for old indices (#842)
This change relaxes the field consistency check for old indices as we
didn't enforce that in the previous versions. This commit also disables
the optimization that relies on the field consistency for old indices.
2022-04-28 14:19:24 -04:00
Julie Tibshirani 9ae2181be5 Fix rare failures in TestVectorUtil cosine tests
If one of the vectors is zero, the cosine is not defined. This change makes sure
the test vectors are non-zero.
2022-04-08 09:46:33 -07:00
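To illustrate why the TestVectorUtil fix above requires non-zero vectors, here is a minimal sketch of a cosine computation. The class and method names (`CosineSketch`, `cosine`, `isNonZero`) are hypothetical and are not taken from Lucene's `VectorUtil`; the sketch only shows that a zero vector leads to a 0/0 division.

```java
final class CosineSketch {
  /** Cosine of the angle between a and b; NaN when either vector has zero norm. */
  public static float cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    // With a zero vector this is 0.0 / 0.0, which is NaN in IEEE 754 arithmetic.
    return (float) (dot / Math.sqrt(normA * normB));
  }

  /** Hypothetical test helper: true when at least one component is non-zero. */
  public static boolean isNonZero(float[] v) {
    for (float x : v) {
      if (x != 0f) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(cosine(new float[] {1f, 2f}, new float[] {2f, 4f}));
    System.out.println(cosine(new float[] {0f, 0f}, new float[] {1f, 0f})); // prints NaN
  }
}
```

A randomized test could call a check like `isNonZero` on each generated vector and regenerate it when the check fails.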
Adrien Grand ded9db7786 Add back-compat indices for 9.1.0. 2022-03-22 16:07:02 +01:00
Adrien Grand 53188e98e6 Add next bugfix version. 2022-03-22 15:57:31 +01:00
Adrien Grand 8a44234833 DOAP changes for release 9.1.0 2022-03-22 15:23:21 +01:00
Adrien Grand 1b890ab5f9 LUCENE-10473: Make tests a bit faster when running nightly. (#754) 2022-03-21 10:38:18 +01:00
Julie Tibshirani fcacd22a80 LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat
Before the assertion checked if two sets were equal, which resulted in rare
failures. Now we use 'contains' from hamcrest matchers.
2022-03-19 21:10:34 -07:00
Julie Tibshirani 22a9e45f09 LUCENE-9614: Fix rare TestKnnVectorQuery failures
Some of our checks relied on doc IDs corresponding to the order in which docs
were passed to IndexWriter. This is fragile and sometimes resulted in failures.
Now we check against an "id" field instead.
2022-03-18 15:23:27 -07:00
Luca Cavanna 9b4003236f LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (#753)
As part of #716 I moved the test to use a collector manager, but I forgot to update one of the assertions.
We can't rely on totalHits being accurate when the search is executed by multiple threads and terminated early.
2022-03-18 18:49:44 +01:00
Adrien Grand 5b522487ba LUCENE-10469: Fix score mode propagation in ConstantScoreQuery. (#750) 2022-03-16 13:19:30 +01:00
Dawid Weiss ea989fe8f3 LUCENE-10311: avoid division by zero on small sets. 2022-03-15 12:02:34 -07:00
Luca Cavanna a6114b532a Revert "LUCENE-10385: Implement Weight#count on IndexSortSortedNumeri… (#745)
In LUCENE-10458 we identified a bug in the logic. We're reverting on the 9.1
branch to avoid holding up the release.
2022-03-14 13:41:29 -07:00
Dawid Weiss a796e08b1f LUCENE-10461: fix windows launch script for luke so that it works with integration tests AND actual command line. Cmd escaping rules and start command line is absolutely insane. (#743) 2022-03-12 19:44:17 +09:00
Dawid Weiss a3a058de6d LUCENE-10459: Update smoke tester for 9.1 (#744)
Add demo dependencies to third party modules. Add an IT that checks whether
demo classes are loadable.

Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-03-11 10:46:35 -08:00
Julie Tibshirani 28f4baa511 Add missing 8.11.1 release to DOAP file 2022-03-09 14:25:00 -08:00
Mayya Sharipova 8f399572c9 LUCENE-10408 Test correction checksum (#734)
Use double instead of float to test vector values checksum
2022-03-09 12:16:40 +00:00
Spyros Kapnissis a033067246 LUCENE-10171: OpenNLPOpsFactory should directly cache DictionaryLemmatizer objects (#380)
Instead of caching dictionary strings and building multiple redundant DictionaryLemmatizer objects.

Co-authored-by: Michael Gibney <michael@michaelgibney.net>
2022-03-08 12:51:44 -05:00
Adrien Grand b5fe307c6f LUCENE-10311: Remove pop_XXX helpers from `BitUtil`. (#724)
As @rmuir noted, it would be as simple and create less cognitive overhead to
use `Long#bitCount` directly.
2022-03-05 18:39:16 +01:00
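As a rough illustration of the point made in the commit above, a population count over the intersection of two bitsets needs nothing beyond `Long#bitCount`. The class and method names below are hypothetical and do not reproduce the removed `BitUtil` helpers.

```java
final class BitCountSketch {
  /** Number of bits set in both a and b, over the first numWords words. */
  public static long intersectionCount(long[] a, long[] b, int numWords) {
    long count = 0;
    for (int i = 0; i < numWords; i++) {
      // Long.bitCount returns the number of one-bits in the 64-bit word.
      count += Long.bitCount(a[i] & b[i]);
    }
    return count;
  }
}
```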
Adrien Grand 1818ae9de3 LUCENE-10453: Speed up euclidean distances. (#725) 2022-03-05 18:33:30 +01:00
Adrien Grand 29282fa315 LUCENE-10455: CHANGES entry. 2022-03-05 18:29:36 +01:00
Chris Lu ffdb246702 LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery should implement Weight#scorerSupplier(LeafReaderContext) (#729) 2022-03-05 18:27:54 +01:00
Alan Woodward 5e539bc50d LUCENE-10431: Don't include rewriteMethod in MTQ hash calculation (#727)
BooleanQuery assumes that its children's hashcodes are stable, and has some
assertions to this effect. This did not apply to MultiTermQuery, which has a
mutable RewriteMethod member variable that was included in its hash calculation.
Changing the rewrite method would change the hash, leading to assertion failures
being tripped. This commit removes rewriteMethod from the hash calculation,
meaning that the hashcode will be stable even under mutability.
2022-03-04 11:54:44 +00:00
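The failure mode described in the commit above can be sketched with two small classes. `MutableQuery` and `CachingParent` are hypothetical stand-ins, not Lucene's `MultiTermQuery` or `BooleanQuery`; the sketch assumes the parent caches a hash derived from its child, as the commit message describes.

```java
import java.util.Objects;

final class StableHashSketch {
  /** Hypothetical query whose rewrite method can change after construction. */
  public static final class MutableQuery {
    private final String field;
    private final String pattern;
    private String rewriteMethod;

    public MutableQuery(String field, String pattern, String rewriteMethod) {
      this.field = field;
      this.pattern = pattern;
      this.rewriteMethod = rewriteMethod;
    }

    public void setRewriteMethod(String rewriteMethod) {
      this.rewriteMethod = Objects.requireNonNull(rewriteMethod);
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof MutableQuery)) {
        return false;
      }
      MutableQuery other = (MutableQuery) o;
      return field.equals(other.field)
          && pattern.equals(other.pattern)
          && rewriteMethod.equals(other.rewriteMethod);
    }

    @Override
    public int hashCode() {
      // The mutable rewriteMethod is left out, so the hash never changes.
      // Equal objects still have equal hashes, so the equals/hashCode contract holds.
      return Objects.hash(field, pattern);
    }
  }

  /** Hypothetical wrapper that computes its hash lazily and caches it. */
  public static final class CachingParent {
    private final MutableQuery child;
    private int cachedHash;
    private boolean computed;

    public CachingParent(MutableQuery child) {
      this.child = child;
    }

    public int hash() {
      if (!computed) {
        cachedHash = 31 * child.hashCode();
        computed = true;
      }
      return cachedHash;
    }

    /** True while the cached value still matches a fresh computation. */
    public boolean cacheIsConsistent() {
      return hash() == 31 * child.hashCode();
    }
  }
}
```

If `hashCode` included `rewriteMethod`, calling `setRewriteMethod` after the parent had cached its hash would make `cacheIsConsistent` return false, which is the kind of assertion failure the commit message mentions.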
Dawid Weiss 8f92ec157f LUCENE-10447: always use utf8 for forked process encoding. Use the sa… (#717) 2022-03-03 20:57:09 +01:00
Alan Woodward 63454b83ad LUCENE-10431: Deprecate MultiTermQuery.setRewriteMethod() (#722)
Allowing users to mutate MultiTermQuery can give rise to odd bugs, for example
in wrapper queries such as BooleanQuery which lazily calculate their hashcodes
and then cache the result. This commit deprecates the setRewriteMethod()
method on MultiTermQuery, in preparation for removing it entirely, and adds
constructor parameters to the various MTQ implementations as a preferred
way to set the rewrite method.
2022-03-03 11:17:02 +00:00
Adrien Grand 2a6b2ca143 LUCENE-10002: Fix test failure.
When IndexSearcher is created with a threadpool it becomes impossible to assert
on the number of evaluated hits overall.
2022-03-03 10:09:35 +01:00
Adrien Grand 0d35e38b93 LUCENE-10428: Avoid infinite loop under error conditions. (#711)
Co-authored-by: dblock <dblock@dblock.org>
2022-03-03 09:42:39 +01:00
Adrien Grand bb10e62dff LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate). (#710)
This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.
2022-03-03 08:49:14 +01:00
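One way to picture the sampling idea in the commit above is the sketch below. It is an assumption-laden illustration, not the actual `FixedBitSet#approximateCardinality` code: the fixed `stride` parameter, the class name, and the fallback to an exact count for small arrays are all choices made here for the example.

```java
final class ApproxCardinalitySketch {
  /** Exact count: pop count over every backing word. */
  public static long cardinality(long[] words) {
    long count = 0;
    for (long word : words) {
      count += Long.bitCount(word);
    }
    return count;
  }

  /** Estimate: pop count over every stride-th word, scaled to the full length. */
  public static long approximateCardinality(long[] words, int stride) {
    if (stride <= 0) {
      throw new IllegalArgumentException("stride must be positive");
    }
    if (words.length < stride) {
      // Too few words to sample; counting exactly also avoids dividing by zero.
      return cardinality(words);
    }
    long sampledPopCount = 0;
    int sampledWords = 0;
    for (int i = 0; i < words.length; i += stride) {
      sampledPopCount += Long.bitCount(words[i]);
      sampledWords++;
    }
    // Extrapolate from the sampled words to the whole bitset.
    return sampledPopCount * words.length / sampledWords;
  }
}
```

The estimate is exact when bits are spread uniformly across words and can be far off when set bits cluster in words that the sample skips.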
Lu Xugang 1967942861 LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery could be rewritten to MatchAllDocsQuery (#720) 2022-03-02 09:29:28 +01:00
Lu Xugang c0d5022d5a LUCENE-10439: update CHANGES.txt (#714) 2022-03-02 09:25:12 +01:00
Anand 11e2fb8e0b LUCENE-10237 : Add MergeOnCommitTieredMergePolicy to sandbox (#446) 2022-03-02 09:21:50 +01:00
Luca Cavanna bfe7096565 LUCENE-10002: Replace test usages of TopScoreDocCollector with a corresponding collector manager (#716)
As part of the effort to replace usages of IndexSearcher#search(Query, Collector) with IndexSearcher#search(Query, CollectorManager), this commit replaces many test usages of TopScoreDocCollector with its corresponding CollectorManager, created by calling TopScoreDocCollector#createSharedManager.
2022-03-02 09:20:36 +01:00
Greg Miller 4c9c1c0746 LUCENE-10440: Mark TaxonomyFacets and FloatTaxonomyFacets as deprecated (#713) 2022-03-01 06:02:05 -08:00
Lu Xugang 9497524cc2 LUCENE-10442: When indexQuery and/or dvQuery is a MatchAllDocsQuery, IndexOrDocValuesQuery should be rewritten to MatchAllDocsQuery (#715) 2022-02-28 18:46:57 +01:00
Robert Muir 5972b495ba LUCENE-10421: use Constant instead of relying upon timestamp (#686) 2022-02-25 00:39:06 -05:00
Greg Miller 81ab1d6ab6 Remove TODO for LUCENE-9952 since that issue was fixed 2022-02-24 13:46:18 -08:00
Adrien Grand d952b3a581 LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)
`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexReaderContext#id` a while back for a similar purpose with `TermStates`,
let's reuse it?
2022-02-24 13:38:13 +01:00
Adrien Grand d4cb6d0a30 LUCENE-10408: Write doc IDs of KNN vectors as ints rather than vints. (#708)
Since doc IDs with a vector are loaded as an int[] in memory, this changes the
on-disk format of vectors to align with the in-memory representation by using
ints instead of vints to represent doc IDs. This might make vectors a bit
larger on disk, but also a bit faster to open.

I made the same change to how we encode nodes on levels for the same reason.
2022-02-24 13:36:36 +01:00
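The size and speed trade-off described in the commit above can be sketched as follows. The variable-length layout assumed here is the usual 7-bits-per-byte scheme with a continuation bit; the class and method names are hypothetical, and the code uses `java.nio.ByteBuffer` rather than Lucene's own input and output classes.

```java
import java.nio.ByteBuffer;

final class VIntSketch {
  /** Bytes needed for a variable-length int: 7 payload bits per byte, 1 to 5 bytes. */
  public static int vIntLength(int value) {
    int length = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      length++;
    }
    return length;
  }

  /** Fixed-width ints can be bulk-copied into an int[] with no per-value decoding. */
  public static int[] readFixedInts(ByteBuffer buffer, int count) {
    int[] docIds = new int[count];
    buffer.asIntBuffer().get(docIds);
    return docIds;
  }
}
```

Small doc IDs cost one or two bytes as vints but always four bytes as ints, which matches the commit's note that vectors may get a bit larger on disk while becoming a bit faster to open.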
Lu Xugang 6acf16a2e3 LUCENE-10439: Support multi-valued and multiple dimensions for count query in PointRangeQuery (#705) 2022-02-24 10:13:21 +01:00
gf2121 ad48203b55 LUCENE-10417: Revert "LUCENE-10315" (#706) (#707) 2022-02-24 16:57:35 +08:00
Julie Tibshirani a3b136573f LUCENE-10382: Fix testSearchWithVisitedLimit failures 2022-02-23 20:11:04 -08:00
Lu Xugang 5aab8a8e40 LUCENE-10435: add CHANGES.txt entry (#704) 2022-02-23 15:42:13 -08:00
Julie Tibshirani 29d4adfe60 LUCENE-10382: Ensure kNN filtering works with other codecs (#700)
The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test to `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used

This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.
2022-02-23 14:59:16 -08:00
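The difference between checking a visit limit in an outer loop and checking it before every new node can be sketched on a plain breadth-first traversal. This is a hypothetical illustration over an adjacency array, not the HNSW search in `Lucene91HnswVectorsReader`; all names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class VisitLimitSketch {
  /** Visits nodes reachable from entry and returns how many were visited, at most visitedLimit. */
  public static int traverse(int[][] neighbors, int entry, int visitedLimit) {
    boolean[] seen = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    int visited = 0;
    if (visited >= visitedLimit) {
      return visited;
    }
    seen[entry] = true;
    visited++;
    queue.add(entry);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int neighbor : neighbors[node]) {
        if (seen[neighbor]) {
          continue;
        }
        // Check before each new node, so one node with many neighbors cannot overshoot.
        if (visited >= visitedLimit) {
          return visited;
        }
        seen[neighbor] = true;
        visited++;
        queue.add(neighbor);
      }
    }
    return visited;
  }
}
```

With the check placed only at the top of the `while` loop, a start node with three unvisited neighbors and a limit of 2 would still visit all four nodes before the limit was noticed.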
Lu Xugang 701e40132b LUCENE-10435: Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery (#701) 2022-02-23 18:13:28 +01:00
Julie Tibshirani 458fb1abed LUCENE-10054: Make sure to use Lucene90 codec in unit tests (#699)
Before we were using the default Lucene91 codec, so we weren't exercising the
old format.
2022-02-23 08:23:54 -08:00
Ignacio Vera fb8d79d96a LUCENE-10437: Improve error message in the Tessellator for polygon with all points collinear (#703)
The polygon tessellator now throws a more informative error message when the provided polygon does not contain enough non-collinear points.
2022-02-23 13:53:37 +01:00
Tomoko Uchida c22d6d09d9 Revert "LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori"
This reverts commit b2b3596466.
2022-02-22 20:21:07 +09:00
Lu Xugang ec4d20ac3c LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (#691) 2022-02-21 07:09:42 +01:00