Commit Graph

35572 Commits

Author SHA1 Message Date
Tomoko Uchida e111182e12 LUCENE-10303: Upgrade log4j to 2.15.0 2021-12-11 10:43:03 +09:00
Tomoko Uchida cb788d8e9e LUCENE-10305: Ensure line endings of versions.props is LF 2021-12-11 10:10:44 +09:00
Dawid Weiss b2b52ca92a LUCENE-10229: change the wording a bit. 2021-12-09 17:34:54 +01:00
Patrick Zhai 53099e01de
LUCENE-10229: Unify behaviour of match offsets for interval queries (#521) 2021-12-09 17:19:18 +01:00
Ignacio Vera 40c213d873
Revert "LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)" (#532)
This reverts commit af1e68b891.
2021-12-09 13:54:40 +01:00
Dawid Weiss 8367f700c7 LUCENE-10294: Avoid compiling javadocs twice in 'gradlew check'. 2021-12-09 09:56:11 +01:00
Robert Muir 7a872c7a5c
LUCENE-10296: Stop minimizing regexp (#528)
In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the expression. Neither the parser nor the low-level executors implicitly call exponential-time algorithms anymore.

But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize(), which is much crazier and much more expensive.

We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than in the early days when we found trouble: the representation has gotten a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also, we added this fancy UTF32-to-UTF8 automata converter, which makes the worst-case space per state significantly lower than it was before. So why minimize()?

Let's just replace minimize() calls with determinize() calls. I've already swapped them out for all of src/test, to get Jenkins looking for issues ahead of time.

This change moves Hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step; any tests that truly need minimization should be fine with Brzozowski's algorithm.
2021-12-08 21:44:26 -05:00
Julie Tibshirani 5d39bca87a
LUCENE-10040: Add test for vector search with skewed deletions (#527)
This exercises a challenging case where the documents to skip all happen to
be closest to the query vector. In many cases, HNSW appears to be robust to this
case and maintains good recall.
2021-12-08 11:24:12 -08:00
Adrien Grand b9287c8ce0 Fix precommit. 2021-12-08 18:51:35 +01:00
Adrien Grand f190cc3509 Re-enable tests. 2021-12-08 17:52:17 +01:00
Adrien Grand ecc38495ab Add back-compat indices for 9.0.0. 2021-12-08 17:43:06 +01:00
Robert Muir 84e4b85b09
LUCENE-10010: don't determinize/minimize in RegExp (#513)
Previously, RegExp called minimize() at every parsing step. There is little point in adding NFA execution while it does this: minimize() implies exponential determinize().

Moreover, some minimize() calls are missing, and in fact in rare cases RegExp can already return an NFA today (for certain syntax).

RegExp parsing should do none of this; instead, it may return a DFA or an NFA. NOTE: many simple regexps happen to be still returned as DFA, just because of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time.

All src/java callsites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary, but let's approach removing minimization as a separate PR. src/test was fixed to just use determinize() in preparation for this.

Add new unit test for RegExp parsing

New test tries to test each symbol/node independently, to make it easier to maintain this code.
The new test case now exceeds 90% coverage of the regexp parser.
2021-12-07 21:39:13 -05:00
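The exponential cost of determinization mentioned in the commit above can be illustrated with a minimal subset-construction sketch. This is not Lucene's implementation: the class `SubsetConstructionSketch`, its methods, and the `workLimit` parameter are hypothetical names chosen for illustration, and the NFA is encoded as plain bitmasks (at most 62 states, no epsilon transitions). The example NFA accepts strings over {a, b} whose n-th symbol from the end is `a`; it has n + 1 states, while the equivalent DFA needs 2^n.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class SubsetConstructionSketch {

  /**
   * Counts the DFA states reachable by subset construction.
   * transitions[state][symbol] is a bitmask of NFA target states; state 0 is the start.
   * workLimit (hypothetical) caps the number of DFA states before giving up.
   */
  static int countDfaStates(long[][] transitions, int alphabetSize, int workLimit) {
    Set<Long> seen = new HashSet<>();
    Deque<Long> queue = new ArrayDeque<>();
    long start = 1L; // the set containing only NFA state 0
    seen.add(start);
    queue.add(start);
    while (!queue.isEmpty()) {
      long current = queue.remove();
      for (int symbol = 0; symbol < alphabetSize; symbol++) {
        // Union of the targets of every NFA state in the current set.
        long target = 0L;
        for (int s = 0; s < transitions.length; s++) {
          if ((current & (1L << s)) != 0) {
            target |= transitions[s][symbol];
          }
        }
        if (target != 0L && !seen.contains(target)) {
          if (seen.size() >= workLimit) {
            throw new IllegalStateException("determinization exceeded work limit");
          }
          seen.add(target);
          queue.add(target);
        }
      }
    }
    return seen.size();
  }

  /** NFA with n + 1 states for "the n-th symbol from the end is 'a'" (symbol 0 = a, 1 = b). */
  static long[][] nthFromLast(int n) {
    long[][] t = new long[n + 1][2];
    t[0][0] = 1L | 2L; // on 'a': stay in state 0, or guess this is the marked 'a'
    t[0][1] = 1L;      // on 'b': stay in state 0
    for (int s = 1; s < n; s++) {
      t[s][0] = 1L << (s + 1); // any symbol advances toward the accepting state n
      t[s][1] = 1L << (s + 1);
    }
    return t; // state n has no outgoing transitions
  }
}
```

With this encoding, `countDfaStates(nthFromLast(10), 2, 10_000)` explores 1024 subsets for an 11-state NFA, which is the kind of blow-up a parser would want to leave to the caller's discretion.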
Robert Muir 5a1fdd8865
remove unnecessary "dependencies" in versions.props (#526)
Looks like stray cats from back when it was shared with solr
2021-12-07 21:22:54 -05:00
Adrien Grand 68e94c9597 DOAP changes for release 9.0.0 2021-12-07 14:38:41 +01:00
Ignacio Vera af1e68b891
LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520) 2021-12-07 07:41:09 +01:00
Tomoko Uchida 35eff443a7
LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522) 2021-12-07 15:32:12 +09:00
gf2121 8525356c8a
LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510) 2021-12-07 07:26:03 +01:00
Robert Muir 4fda0766da
speed up TestSimpleExplanationsWithFillerDocs (#516)
This is the slowest test suite, runs for ~ 60s, because between every
document it adds 2048 "filler docs". This just adds up to a ton of
indexing across all the test methods.

Use 2048 for Nightly, and instead a smaller number (4) for local builds.
2021-12-06 22:12:36 -05:00
Robert Muir 5c746db53e
simplify jflex grammars by using difference rather than negation (#515)
JFlex grammars now avoid using the complement operator twice as a De Morgan workaround for "macros in char classes". With the latest version of JFlex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.
2021-12-06 21:59:13 -05:00
Uwe Schindler ec57641ea5 LUCENE-10287: Add changes entry 2021-12-06 20:27:59 +01:00
Uwe Schindler 9cb16df215
LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517) 2021-12-06 20:23:48 +01:00
gf2121 459388cfe1
LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512)
Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2021-12-06 07:33:32 +01:00
Robert Muir adaf610671
tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly (#506)
This test currently indexes up to 600 docs for each thread.
2021-12-04 14:21:45 -05:00
Robert Muir 2c750bbedf
tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly (#505)
This test runs across every IndexOptions, indexing hundreds of fields.
It is slow for some implementations (e.g. SimpleText).

Use fewer fields for normal runs.
2021-12-04 14:21:28 -05:00
Robert Muir b9934a3dcf
Make TestNRTReplication.testCrashReplica nightly (#504)
This test is forking and crashing JVMs, always runs over 10 seconds
2021-12-04 14:21:07 -05:00
Dawid Weiss d2b7e7a441
LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514) 2021-12-04 09:56:28 +01:00
Robert Muir a39337e595
LUCENE-10010: fix TestMockAnalyzer to determinize
Test would randomly fail, if RegExp parsing returned an NFA, because it
wasn't explicitly determinizing itself.

This is a bit of a trap in RegExp: it calls minimize()-as-it-parses,
so at least most of the time, it returns a DFA. This may be
unnecessary...
2021-12-03 20:44:45 -05:00
Robert Muir c8f5b9127d
LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465)
* Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars
* regenerate emoji conformance tests for unicode 12.1
* modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties)
* regenerate wordbreak conformance tests
* Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric
* Use jflex emoji properties rather than ICU-generated ones
2021-12-03 20:20:57 -05:00
Robert Muir b2e866b703
LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton (#485)
Instead, require that the incoming automaton is determinized by the caller, throwing an exception if it isn't.

This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.

In the meantime, it cleans up plenty of APIs by not having to plumb `determinizeWorkLimit` parameters down to the guts.
2021-12-03 19:48:33 -05:00
gf2121 5f09bb3fab
LUCENE-10233: Use AND NOT for inverse intersector (#499)
When docIds are stored as a BitSet, use andNot to speed up collecting them.
2021-12-03 09:23:16 +01:00
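The and-not idea in the commit above can be sketched with the JDK's `java.util.BitSet` as a stand-in for Lucene's own bit set classes. The class and method names (`AndNotSketch`, `inverseIntersect`) are hypothetical and this is not the code from the change; it only shows how a single bulk `andNot` replaces a per-document membership check.

```java
import java.util.BitSet;

final class AndNotSketch {

  /**
   * Returns the doc IDs that are set in candidates but not in excluded.
   * The inputs are left untouched; the work is done by one bulk andNot call.
   */
  static BitSet inverseIntersect(BitSet candidates, BitSet excluded) {
    BitSet result = (BitSet) candidates.clone();
    result.andNot(excluded); // clears every bit of result that is set in excluded
    return result;
  }
}
```

For example, with candidates {1, 3, 5, 7} and excluded {3, 7, 9}, the result holds {1, 5}.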
Ignacio Vera ec7a79eb56
LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507) 2021-12-03 07:23:31 +01:00
Misha Tiurin a26ea57ec7
Remove duplicate entries in SpanishPluralStemmer invariants list (#508)
* Remove duplicate entries in SpanishPluralStemmer invariants list
Add assertion to prevent duplicates in the future

Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>
2021-12-02 14:11:15 -05:00
Robert Muir 77563c2c15
LUCENE-10278: don't write zero-sized array in this test (#501)
DocIdsWriter is not prepared for this.
2021-12-02 15:53:09 +01:00
gf2121 3d61ff2cf6 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) 2021-12-02 15:23:42 +01:00
Adrien Grand 0ec217b753
Make EndiannessReverser(Data|Index)Input always reverse byte order. (#502)
Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for
`readLongs` and `readFloats`. The reasoning is that these two methods replaced
`readLELongs` and `readLEFloats`, so the byte order doesn't need changing.

However this is creating some confusing situations where new code expects a
consistent byte order on the write and read sides and gets longs in the wrong
byte order. So this commit suggests a different approach, where
EndiannessReverser(Data|Index)Input always changes the byte order, and former
call sites of `readLELongs` and `readLEFloats` are changed to manually reverse
the byte order on top of `readLongs` and `readFloats`.

This is making old codecs a bit slower, but I think it's fair since these are
old codecs. And this makes the endianness reversing backward compatibility layer
easier to reason about.
2021-12-02 14:00:46 +01:00
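What "manually reverse the byte order" on top of a bulk read might look like can be sketched with JDK primitives only. This is an illustration under assumptions, not the code from the commit: `ByteOrderSketch` and its methods are hypothetical names, and `ByteBuffer` stands in for the actual input classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {

  /** Reverses the byte order of values[offset .. offset + length) in place. */
  static void reverseLongs(long[] values, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      values[i] = Long.reverseBytes(values[i]);
    }
  }

  /** Decodes the first 8 bytes as a long in the given byte order. */
  static long decode(byte[] bytes, ByteOrder order) {
    return ByteBuffer.wrap(bytes).order(order).getLong();
  }
}
```

Decoding the bytes 01 02 03 04 05 06 07 08 as little-endian and then passing the value through `reverseLongs` yields the same long as decoding them as big-endian, which is the invariant a caller relies on when it compensates for a reversing wrapper.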
Ignacio Vera 704193f6bf
LUCENE-10279: Fix equals in MultiRangeQuery (#503) 2021-12-02 13:33:49 +01:00
Robert Muir d74255a96c
improve term vector merging tests (#500)
Use fewer iterations locally so that term vector merging doesn't dominate
the list of slowest tests.

Split out deletes/no-deletes into separate methods to improve
debuggability.

Remove nightly from SimpleText term vectors merging tests, now that they
run much faster.
2021-12-02 05:29:41 -05:00
Ignacio Vera efc713c9c5
LUCENE-10275: Speed up MultiRangeQuery by using an interval tree 2021-12-02 09:53:23 +01:00
Adrien Grand ffb58f6e75 Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438)"
This reverts commit 5eb575f8ab.
2021-12-02 08:33:38 +01:00
Robert Muir 387e67ec87
LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497)
* LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer

The new SpanishPluralStemmer is in fact more "minimal", with less aggressive
stemming and normalization. For the user that wants only plural
stemming, it is the better choice.
2021-12-01 14:24:58 -05:00
Adrien Grand f605b4a692
LUCENE-10253: Remove the BadApple annotation. (#468) 2021-12-01 18:03:02 +01:00
gf2121 5eb575f8ab
LUCENE-10233: Store docIds as bitset to speed up addAll (#438) 2021-12-01 15:31:05 +01:00
Tomoko Uchida a7ebf6618c
move build related changes entry to the 'Build' section from 'Other' section (#496) 2021-12-01 20:06:34 +09:00
Robert Muir 4dc3e8ab01
LUCENE-10270: Improve MIGRATE.md (#491)
* LUCENE-10270: Improve MIGRATE.md

* Separate sections for 9.0 and 9.1
* Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`.
* Specify "java" for text blocks to get syntax highlighting
* When provided, consistently put JIRA issue in the same place
* Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading)
* More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc)
* Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc)

* LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link

* LUCENE-10270: fix things found by msokolov
2021-12-01 05:29:33 -05:00
Adrien Grand 939054e4a0 Add 8.11 indices to the list of backward indices. 2021-12-01 11:23:50 +01:00
Dawid Weiss a37b74a630 LUCENE-10234: update smoke-tester with new module names. 2021-12-01 09:59:15 +01:00
Greg Miller bd68624639 Move CHANGES entry for LUCENE-10232 to 9.0 2021-11-30 14:37:07 -08:00
Dawid Weiss 20cb6817db
LUCENE-10234: Change module prefix to org.apache.* (#487) 2021-11-30 22:03:33 +01:00
Robert Muir 5d18596d3d
LUCENE-10248: add CHANGES.txt entry 2021-11-30 15:57:24 -05:00
Xavier Sanchez Loro edb936f090
LUCENE-10248: Spanish Plural Stemmer (#461)
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373

This approach is based on rules specified in Wikilengua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)

Some characteristics:

* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin, e.g. complots, bits, punks, robots
* Support for invariant words, where the plural and singular forms are the same or a plural does not make sense, e.g. crisis, jueves, lapsus, abrebotellas
* Support for special cases, e.g. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in singular form
* No strange stems like “amig”: it’s true that stemmers need not generate grammatically correct tokens, but generating correct stems decreases the possibility of collisions with other words
2021-11-30 15:51:10 -05:00
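The order of checks described in the list above (invariant words first, then special cases, then a general rule) can be sketched as a toy stemmer. This is not the SpanishPluralStemmer algorithm: `ToyPluralStemmer` is a hypothetical class, its word lists contain only a few of the examples named above, and its single suffix rule (drop a final "s" after "a" or "o") is a deliberate simplification that does not cover foreign-origin plurals or "-es" plurals.

```java
import java.util.Map;
import java.util.Set;

final class ToyPluralStemmer {

  // Words whose plural and singular forms coincide. Set.of rejects duplicate
  // elements with an IllegalArgumentException, so the list cannot contain repeats.
  private static final Set<String> INVARIANTS =
      Set.of("crisis", "jueves", "lapsus", "abrebotellas");

  // Irregular plurals mapped directly to their singular form (illustrative subset).
  private static final Map<String, String> SPECIAL_CASES =
      Map.of("clubes", "club", "yoes", "yo");

  /** Returns a singular form for the given lowercase word, keeping the gender vowel. */
  static String stem(String word) {
    if (INVARIANTS.contains(word)) {
      return word;
    }
    String special = SPECIAL_CASES.get(word);
    if (special != null) {
      return special;
    }
    int n = word.length();
    // Toy general rule: "amigos" -> "amigo", "amigas" -> "amiga".
    if (n > 2 && word.charAt(n - 1) == 's' && "ao".indexOf(word.charAt(n - 2)) >= 0) {
      return word.substring(0, n - 1);
    }
    return word;
  }
}
```

Because only the final "s" is removed, "amigos" and "amigas" stay distinct and remain real words, in contrast to a stem such as “amig”. The invariant check has to run before the suffix rule; otherwise "abrebotellas" would lose its final "s".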