35546 Commits

Author SHA1 Message Date
Ignacio Vera 5ba87f9efa LUCENE-10310: Fix test error in TestXYDocValuesQueries#testRandomDistanceHuge (#537)
We create random circles using ShapeTestUtils which is safe.
2021-12-13 12:02:13 +01:00
Tomoko Uchida e0a6e1c662 LUCENE-10309: Minimum KnnVector codec support in Luke (#535) 2021-12-12 15:32:10 +09:00
Tomoko Uchida 140f48e267 LUCENE-10303: Upgrade log4j to 2.15.0 2021-12-11 10:45:00 +09:00
Tomoko Uchida f046b59a5b LUCENE-10305: Ensure line endings of versions.props is LF 2021-12-11 10:12:34 +09:00
Dawid Weiss cf5a7337e2 LUCENE-10229: change the wording a bit. 2021-12-09 17:35:18 +01:00
Patrick Zhai 2a47bbe8be LUCENE-10229: Unify behaviour of match offsets for interval queries (#521) 2021-12-09 17:35:18 +01:00
Ignacio Vera 5e0d8dc87a Revert "LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)" (#532)
This reverts commit af1e68b891.
2021-12-09 13:55:36 +01:00
Dawid Weiss 17aaab654e LUCENE-10294: Avoid compiling javadocs twice in 'gradlew check'. 2021-12-09 09:56:32 +01:00
Julie Tibshirani 394472d4b8 LUCENE-10040: Add test for vector search with skewed deletions (#527)
This exercises a challenging case where the documents to skip all happen to
be closest to the query vector. In many cases, HNSW appears to be robust to this
case and maintains good recall.
2021-12-08 11:26:45 -08:00
Robert Muir c74642d9a7 remove unnecessary "dependencies" in versions.props (#526)
Looks like stray cats from back when it was shared with solr
2021-12-07 21:23:33 -05:00
Adrien Grand 85caa4364e DOAP changes for release 9.0.0 2021-12-07 14:38:56 +01:00
Ignacio Vera 1eb935229f LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520) 2021-12-07 07:42:04 +01:00
Tomoko Uchida 3eadfd4596 LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522) 2021-12-07 15:34:24 +09:00
gf2121 892e324d02 LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510) 2021-12-07 07:27:11 +01:00
Robert Muir 4d48dc87f7 speed up TestSimpleExplanationsWithFillerDocs (#516)
This is the slowest test suite, runs for ~ 60s, because between every
document it adds 2048 "filler docs". This just adds up to a ton of
indexing across all the test methods.

Use 2048 for Nightly, and instead a smaller number (4) for local builds.
2021-12-06 22:13:04 -05:00
Robert Muir 9000dfc382 simplify jflex grammars by using difference rather than negation (#515)
JFlex grammars now avoid using the complement operator twice as a De Morgan workaround for "macros in char classes". With the latest version of JFlex, we can do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` by around 3x.
2021-12-06 21:59:40 -05:00
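The set identity behind the commit above, A \ B = ¬(¬A ∪ B), can be checked with a small sketch. This is plain Java over `java.util.BitSet`, not JFlex syntax; the class name, method names, and the 128-character universe are assumptions made for illustration only.

```java
import java.util.BitSet;

// Illustrative only: models character classes as bit sets over ASCII.
final class CharClassDifference {
    static final int UNIVERSE = 128; // assumed universe size (ASCII)

    // Complement of s within the universe.
    static BitSet complement(BitSet s) {
        BitSet r = (BitSet) s.clone();
        r.flip(0, UNIVERSE);
        return r;
    }

    // A \ B written with two complements: not(not(A) or B).
    static BitSet viaDoubleComplement(BitSet a, BitSet b) {
        BitSet union = complement(a);
        union.or(b);
        return complement(union);
    }

    // A \ B written as a direct subtraction.
    static BitSet viaDifference(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }
}
```

Both methods yield the same set; the direct form performs one operation instead of three.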
Uwe Schindler d36c70cdd6 LUCENE-10287: Add changes entry 2021-12-06 20:28:46 +01:00
Uwe Schindler 8e7fbcaf5b LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517) 2021-12-06 20:24:56 +01:00
gf2121 ebee531df7 LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512)
Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2021-12-06 07:34:41 +01:00
Robert Muir aa6a78c28c tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly (#506)
this test currently indexes up to 600 docs for each thread.
2021-12-04 14:23:03 -05:00
Robert Muir 3e06e2338e tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly (#505)
This test runs across every IndexOptions, indexing hundreds of fields.
It is slow for some implementations (e.g. SimpleText).

Use fewer fields for normal runs.
2021-12-04 14:22:57 -05:00
Robert Muir 401d6209fb Make TestNRTReplication.testCrashReplica nightly (#504)
This test forks and crashes JVMs, and always runs for over 10 seconds.
2021-12-04 14:22:48 -05:00
Dawid Weiss d2563e6f1f LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514) 2021-12-04 09:57:02 +01:00
Robert Muir eff5430e58 LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465)
* Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars
* regenerate emoji conformance tests for unicode 12.1
* modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties)
* regenerate wordbreak conformance tests
* Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric
* Use jflex emoji properties rather than ICU-generated ones
2021-12-03 20:34:29 -05:00
gf2121 eadc146e08 LUCENE-10233: Use AND NOT for inverse intersector (#499)
When docIds are stored as a BitSet, use andNot to speed up collecting them.
2021-12-03 09:24:54 +01:00
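A minimal sketch of the idea in the commit above: when the doc ids to exclude are already available as a bit set, they can be cleared from the result in one bulk `andNot` instead of one bit at a time. `java.util.BitSet` stands in here for Lucene's own bit set classes, and the class and method names are hypothetical.

```java
import java.util.BitSet;

// Illustrative only: two ways to remove a block of doc ids from a result set.
final class InverseCollect {
    // Clears each excluded doc id individually.
    static void clearOneByOne(BitSet result, int[] excludedDocIds) {
        for (int doc : excludedDocIds) {
            result.clear(doc);
        }
    }

    // Clears the whole block at once when it is itself stored as a bit set.
    static void clearBulk(BitSet result, BitSet excludedDocs) {
        result.andNot(excludedDocs);
    }
}
```

Both paths leave `result` in the same state; the bulk path works a machine word at a time rather than a bit at a time.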
Ignacio Vera e2264cd7ef LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507) 2021-12-03 07:24:40 +01:00
Misha Tiurin 074a233244 Remove duplicate entries in SpanishPluralStemmer invariants list (#508)
* Remove duplicate entries in SpanishPluralStemmer invariants list
Add assertion to prevent duplicates in the future

Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>
2021-12-02 14:23:39 -05:00
Robert Muir cbd306f87b LUCENE-10278: don't write zero-sized array in this test (#501)
DocIdsWriter is not prepared for this.
2021-12-02 15:54:28 +01:00
gf2121 dd14424817 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) 2021-12-02 15:54:18 +01:00
Adrien Grand 1b04807440 Make EndiannessReverser(Data|Index)Input always reverse byte order. (#502)
Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for
`readLongs` and `readFloats`. The reasoning is that these two methods replaced
`readLELongs` and `readLEFloats`, so the byte order doesn't need changing.

However this is creating some confusing situations where new code expects a
consistent byte order on the write and read sides and gets longs in the wrong
byte order. So this commit suggests a different approach, where
EndiannessReverser(Data|Index)Input always changes the byte order, and former
call sites of `readLELongs` and `readLEFloats` are changed to manually reverse
the byte order on top of `readLongs` and `readFloats`.

This is making old codecs a bit slower, but I think it's fair since these are
old codecs. And this makes the endianness reversing backward compatibility layer
easier to reason about?
2021-12-02 14:01:37 +01:00
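The call-site change described above, reading values and then reversing their byte order by hand, might look roughly like the following sketch. It uses `java.nio.ByteBuffer` as a stand-in for Lucene's input classes; the class and method names are hypothetical and do not reproduce the actual patch.

```java
import java.nio.ByteBuffer;

// Illustrative only: "read, then reverse" on top of a plain bulk read.
final class ReverseOnRead {
    // Stand-in for a bulk read that uses the input's own byte order.
    static void readLongs(ByteBuffer in, long[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = in.getLong();
        }
    }

    // Call-site fix: reverse each value after the bulk read.
    static void readLongsReversed(ByteBuffer in, long[] dst) {
        readLongs(in, dst);
        for (int i = 0; i < dst.length; i++) {
            dst[i] = Long.reverseBytes(dst[i]);
        }
    }
}
```

Floats can be handled the same way by reversing the raw int bits, for example `Float.intBitsToFloat(Integer.reverseBytes(Float.floatToRawIntBits(f)))`.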
Ignacio Vera c0f0686f74 LUCENE-10279: Fix equals in MultiRangeQuery (#503) 2021-12-02 13:34:47 +01:00
Robert Muir f33ae4e81f improve term vector merging tests (#500)
Use fewer iterations locally so that term vector merging doesn't dominate
the list of slowest tests.

Split out deletes/no-deletes into separate methods to improve
debuggability.

Remove nightly from SimpleText term vectors merging tests, now that they
run much faster.
2021-12-02 05:40:36 -05:00
Ignacio Vera a580e29539 LUCENE-10275: Speed up MultiRangeQuery by using an interval tree 2021-12-02 09:54:32 +01:00
Robert Muir 072b775199 LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497)
* LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer

The new SpanishPluralStemmer is in fact more "minimal", with less aggressive
stemming and normalization. For the user that wants only plural
stemming, it is the better choice.
2021-12-01 14:27:48 -05:00
Tomoko Uchida d551e128de backport a7ebf6618c 2021-12-01 20:13:23 +09:00
Robert Muir a201fde054 LUCENE-10270: Improve MIGRATE.md (#491)
* LUCENE-10270: Improve MIGRATE.md

* Separate sections for 9.0 and 9.1
* Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`.
* Specify "java" for text blocks to get syntax highlighting
* When provided, consistently put JIRA issue in the same place
* Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading)
* More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc)
* Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc)

* LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link

* LUCENE-10270: fix things found by msokolov
2021-12-01 05:30:26 -05:00
Dawid Weiss 025b4b3106 LUCENE-10234: update smoke-tester with new module names. 2021-12-01 10:39:28 +01:00
Greg Miller da48e47f3b Move CHANGES entry for LUCENE-10232 to 9.0 2021-11-30 14:37:49 -08:00
Xavier Sanchez Loro 0293fc896d LUCENE-10248: Spanish Plural Stemmer (#461)
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373

This approach is based on rules specified in Wikilengua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)

Some characteristics:

* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin: e.g. complots, bits, punks, robots
* Support for invariant words, where the plural and singular forms are the same or a plural does not make sense: e.g. crisis, jueves, lapsus, abrebotellas, etc.
* Support for special cases: e.g. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in form of singular
* Avoids strange stems like “amig”: stemmers are not required to generate grammatically correct tokens, but generating correct stems decreases the possibility of collisions with other words
2021-11-30 16:07:09 -05:00
Robert Muir 468aceff0c LUCENE-10248: add CHANGES.txt entry 2021-11-30 16:07:09 -05:00
Dawid Weiss 26257292c3 LUCENE-10234: Change module prefix to org.apache.* (#487) 2021-11-30 22:04:24 +01:00
Robert Muir c89c78cee0 LUCENE-10272: cross-check norms with postings in checkindex (#493)
Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.
2021-11-30 15:40:35 -05:00
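The cross-check described above can be sketched as follows. This is not the CheckIndex code: the inputs (a bit set of documents expected to have a norm, derived from postings, and the doc ids actually seen while iterating norms) and all names are assumptions for illustration.

```java
import java.util.BitSet;

// Illustrative only: detect norms that should exist but were never seen.
final class NormsCrossCheck {
    // Returns the expected docs for which no norm was encountered.
    static BitSet missingNorms(BitSet docsExpectedToHaveNorms, int[] docsWithNorms) {
        BitSet missing = (BitSet) docsExpectedToHaveNorms.clone();
        for (int doc : docsWithNorms) {
            missing.clear(doc);
        }
        return missing;
    }

    // Fails when the number of norms seen falls short of the expected count.
    static void check(BitSet docsExpectedToHaveNorms, int[] docsWithNorms) {
        BitSet missing = missingNorms(docsExpectedToHaveNorms, docsWithNorms);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                "expected " + docsExpectedToHaveNorms.cardinality()
                    + " norms, missing for docs " + missing);
        }
    }
}
```

Validating each norm individually would pass on an index where some norms are simply absent; comparing against an expected set catches that case.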
Greg Miller 51e023bf7a LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match 2021-11-30 12:01:34 -08:00
Alan Woodward b697745407 LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477)
If all documents in the segment have a value, then `Reader.getDocCount()` will
equal `maxDoc` and we can return `numDocs` as a shortcut.
2021-11-30 10:07:39 +00:00
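The shortcut in the commit above can be written as a small sketch. The method below is hypothetical and takes plain integers rather than Lucene reader objects; it assumes the convention that -1 means "the count cannot be computed cheaply".

```java
// Illustrative only: the counting shortcut expressed on plain integers.
final class CountShortcut {
    // docCount: documents in the segment that have a value for the field
    //           (deleted documents included).
    // maxDoc:   all documents in the segment, deleted documents included.
    // numDocs:  live (non-deleted) documents in the segment.
    static int count(int docCount, int maxDoc, int numDocs) {
        if (docCount == maxDoc) {
            // Every document has a value, so every live document matches.
            return numDocs;
        }
        return -1; // unknown without visiting documents
    }
}
```

For example, `count(10, 10, 8)` yields 8: all ten documents have a value, and eight of them are live.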
Greg Miller 8a03d2ffc9 Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490) 2021-11-29 18:02:26 -08:00
Robert Muir c5b5fd641b support tables in generated html documentation (#489)
Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.
2021-11-29 17:44:26 -05:00
Robert Muir 278316377c Improve MIGRATE.md around analyzers artifacts. (#488)
* Improve MIGRATE.md around analyzers artifacts.

Move this to the very top of MIGRATE, the user needs to first be able to
pull in the artifacts, before doing anything else like trying to
compile, deal with renamed classes, etc.

Add a table of each package that got moved, with explicit old and new
names. Hopefully it helps search engines and users.

Link to MIGRATE.md explicitly from README.md
2021-11-29 17:44:26 -05:00
Ignacio Vera 70243ea811 LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428)
Detect self-intersections so it can provide a more meaningful error to the users.
2021-11-29 11:06:06 +01:00
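For readers unfamiliar with what a self-intersection check involves, here is a brute-force sketch using orientation tests. It is not the Tessellator's algorithm: it compares every pair of non-adjacent edges in O(n²) time, detects only proper crossings (not touching or collinear overlap), and every name in it is hypothetical.

```java
// Illustrative only: brute-force detection of properly crossing polygon edges.
final class SelfIntersection {
    // Twice the signed area of triangle (a, b, c); the sign gives the turn direction.
    static double orient(double ax, double ay, double bx, double by, double cx, double cy) {
        return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
    }

    // True when segment a-b and segment c-d cross at a single interior point.
    static boolean properCross(double ax, double ay, double bx, double by,
                               double cx, double cy, double dx, double dy) {
        double d1 = orient(ax, ay, bx, by, cx, cy);
        double d2 = orient(ax, ay, bx, by, dx, dy);
        double d3 = orient(cx, cy, dx, dy, ax, ay);
        double d4 = orient(cx, cy, dx, dy, bx, by);
        return d1 * d2 < 0 && d3 * d4 < 0;
    }

    // xs/ys describe an open ring: the last vertex connects back to the first.
    static boolean selfIntersects(double[] xs, double[] ys) {
        int n = xs.length;
        for (int i = 0; i < n; i++) {
            int i2 = (i + 1) % n;
            for (int j = i + 1; j < n; j++) {
                int j2 = (j + 1) % n;
                if (j == i2 || j2 == i) {
                    continue; // adjacent edges share a vertex
                }
                if (properCross(xs[i], ys[i], xs[i2], ys[i2], xs[j], ys[j], xs[j2], ys[j2])) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A bow-tie ring such as (0,0), (1,1), (1,0), (0,1) is reported as self-intersecting, while the unit square is not.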
Ignacio Vera 62084d7138 LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478)
Fixes a race condition introduced in LUCENE-9820.
2021-11-29 09:21:27 +01:00
Robert Muir 8d0103724d Speed up ECJ tasks by avoiding --release (#484)
LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag.

Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.
2021-11-28 15:11:02 -05:00