lucene

Commit Graph

Author	SHA1	Message	Date
Dawid Weiss	1bcdc600b3	LUCENE-10304: exclude module-info.java from all sourcesets for Eclipse, otherwise things break (predictably).	2021-12-10 19:56:55 +01:00
Dawid Weiss	003fa44357	LUCENE-10307: correct module descriptor so that exported packages test passes.	2021-12-10 19:15:54 +01:00
Dawid Weiss	51d93635aa	LUCENE-10307: add exported packages consistency check.	2021-12-10 17:53:16 +01:00
Dawid Weiss	8511def95b	LUCENE-10307: add distribution sanity tests.	2021-12-10 17:25:47 +01:00
Dawid Weiss	458c0486c0	LUCENE-10304: a workaround for intellij's problem with runtime scopes on dependencies.	2021-12-10 17:16:19 +01:00
Dawid Weiss	aee191d878	LUCENE-10300: rewrite how resources are read in ukrainian morfologik analyzer (module vs. classpath lookup).	2021-12-10 17:16:19 +01:00
Dawid Weiss	768adb99d6	LUCENE-10300: add morfologik.tests and check if the ukrainian analyzer loads properly.	2021-12-10 17:16:19 +01:00
Dawid Weiss	600d8345f8	LUCENE-10306: set up module configurations to consume full JARs for test projects.	2021-12-10 17:16:19 +01:00
Dawid Weiss	328b3cc55f	LUCENE-10255: add support for .tests subprojects which contain module tests.	2021-12-10 17:16:19 +01:00
Dawid Weiss	6d83c2e08e	LUCENE-10255: add gradle compilation and module descriptor support for the java module system. Adds module descriptors to all Lucene subprojects.	2021-12-10 17:16:19 +01:00
Dawid Weiss	b9c22fdb49	LUCENE-9871: minor cleanups of extra semicolons and solr build remnants.	2021-12-10 10:29:35 +01:00
Dawid Weiss	b2b52ca92a	LUCENE-10229: change the wording a bit.	2021-12-09 17:34:54 +01:00
Patrick Zhai	53099e01de	LUCENE-10229: Unify behaviour of match offsets for interval queries (#521 )	2021-12-09 17:19:18 +01:00
Ignacio Vera	40c213d873	Revert "LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520 )" (#532 ) This reverts commit `af1e68b891`.	2021-12-09 13:54:40 +01:00
Dawid Weiss	8367f700c7	LUCENE-10294: Avoid compiling javadocs twice in 'gradlew check'.	2021-12-09 09:56:11 +01:00
Robert Muir	7a872c7a5c	LUCENE-10296: Stop minimizing regepx (#528 ) In current trunk, we let caller (e.g. RegExpQuery) try to "reduce" the expression. The parser nor the low-level executors don't implicitly call exponential-time algorithms anymore. But now that we have cleaned this up, we can see it is even worse than just calling determinize(). We still call minimize() which is much crazier and much more. We stopped doing this for all other AutomatonQuery subclasses a long time ago, as we determined that it didn't help performance. Additionally, minimization vs. determinization is even less important than early days where we found trouble: the representation got a lot better. Today when you finishState we do a lot of practical sorting/coalescing on-the-fly. Also we added this fancy UTF32-to-UTF8 automata convertor, that makes the worst-case-space-per-state significantly lower than it was before? So why minimize() ? Let's just replace minimize() calls with determinize() calls? I've already swapped them out for all of src/test, to get jenkins looking for issues ahead of time. This change moves hopcroft minimization (MinimizeOperations) to src/test for now. I'd like to explore nuking it from there as a next step, any tests that truly need minimization should be fine with brzozowski's algorithm.	2021-12-08 21:44:26 -05:00
Julie Tibshirani	5d39bca87a	LUCENE-10040: Add test for vector search with skewed deletions (#527 ) This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.	2021-12-08 11:24:12 -08:00
Adrien Grand	b9287c8ce0	Fix precommit.	2021-12-08 18:51:35 +01:00
Adrien Grand	f190cc3509	Re-enable tests.	2021-12-08 17:52:17 +01:00
Adrien Grand	ecc38495ab	Add back-compat indices for 9.0.0.	2021-12-08 17:43:06 +01:00
Robert Muir	84e4b85b09	LUCENE-10010: don't determinize/minimize in RegExp (#513 ) Previously, RegExp called minimize() at every parsing step. There is little point to making an NFA execution when it is doing this: minimize() implies exponential determinize(). Moreover, some minimize() calls are missing, and in fact in rare cases RegExp can already return an NFA today (for certain syntax) Instead, RegExp parsing should do none of this, instead it may return a DFA or NFA. NOTE: many simple regexps happen to be still returned as DFA, just because of the algorithms in use. Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time. All src/java callsites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary, but let's approach removing minimization as a separate PR. src/test was fixed to just use determinize() in preparation for this. Add new unit test for RegExp parsing New test tries to test each symbol/node independently, to make it easier to maintain this code. The new test case now exceeds 90% coverage of the regexp parser.	2021-12-07 21:39:13 -05:00
Robert Muir	5a1fdd8865	remove unnecessary "dependencies" in versions.props (#526 ) Looks like stray cats from back when it was shared with solr	2021-12-07 21:22:54 -05:00
Adrien Grand	68e94c9597	DOAP changes for release 9.0.0	2021-12-07 14:38:41 +01:00
Ignacio Vera	af1e68b891	LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520 )	2021-12-07 07:41:09 +01:00
Tomoko Uchida	35eff443a7	LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522 )	2021-12-07 15:32:12 +09:00
gf2121	8525356c8a	LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510 )	2021-12-07 07:26:03 +01:00
Robert Muir	4fda0766da	speed up TestSimpleExplanationsWithFillerDocs (#516 ) This is the slowest test suite, runs for ~ 60s, because between every document it adds 2048 "filler docs". This just adds up to a ton of indexing across all the test methods. Use 2048 for Nightly, and instead a smaller number (4) for local builds.	2021-12-06 22:12:36 -05:00
Robert Muir	5c746db53e	simplify jflex grammars by using difference rather than negation (#515 ) Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.	2021-12-06 21:59:13 -05:00
Uwe Schindler	ec57641ea5	LUCENE-10287: Add changes entry	2021-12-06 20:27:59 +01:00
Uwe Schindler	9cb16df215	LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517 )	2021-12-06 20:23:48 +01:00
gf2121	459388cfe1	LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512 ) Co-authored-by: guofeng.my <guofeng.my@bytedance.com>	2021-12-06 07:33:32 +01:00
Robert Muir	adaf610671	tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly (#506 ) this test currently indexes up to 600 docs for each thread.	2021-12-04 14:21:45 -05:00
Robert Muir	2c750bbedf	tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly (#505 ) This test runs across every IndexOptions, indexing hundreds of fields. It is slow for some implementations (e.g. SimpleText). Use less fields for normal runs.	2021-12-04 14:21:28 -05:00
Robert Muir	b9934a3dcf	Make TestNRTReplication.testCrashReplica nightly (#504 ) This test is forking and crashing JVMs, always runs over 10 seconds	2021-12-04 14:21:07 -05:00
Dawid Weiss	d2b7e7a441	LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514 )	2021-12-04 09:56:28 +01:00
Robert Muir	a39337e595	LUCENE-10010: fix TestMockAnalyzer to determinize Test would randomly fail, if RegExp parsing returned an NFA, because it wasn't explicitly determinizing itself. This is a bit of a trap in RegExp, it calls minimize()-as-it-parses, so at least most of the time, it returns a DFA. This may be unnecessary...	2021-12-03 20:44:45 -05:00
Robert Muir	c8f5b9127d	LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465 ) * Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars * regenerate emoji conformance tests for unicode 12.1 * modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties) * regenerate wordbreak conformance tests * Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric * Use jflex emoji properties rather than ICU-generated ones	2021-12-03 20:20:57 -05:00
Robert Muir	b2e866b703	LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton (#485 ) Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't. This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums. In the meantime, it cleans up plenty of APIs by not having to plumb `determinizeWorkLimit` parameters down to the guts.	2021-12-03 19:48:33 -05:00
gf2121	5f09bb3fab	LUCENE-10233: Use AND NOT for inverse intersector (#499 ) When docIds are stored as a BitSet, use andNot to speed up collecting them.	2021-12-03 09:23:16 +01:00
Ignacio Vera	ec7a79eb56	LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507 )	2021-12-03 07:23:31 +01:00
Misha Tiurin	a26ea57ec7	Remove duplicate entries in SpanishPluralStemmer invariants list (#508 ) * Remove duplicate entries in SpanishPluralStemmer invariants list Add assertion to prevent duplicates in the future Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>	2021-12-02 14:11:15 -05:00
Robert Muir	77563c2c15	LUCENE-10278: don't write zero-sized array in this test (#501 ) DocIdsWriter is not prepared for this.	2021-12-02 15:53:09 +01:00
gf2121	3d61ff2cf6	LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )	2021-12-02 15:23:42 +01:00
Adrien Grand	0ec217b753	Make EndiannessReverser(Data\|Index)Input always reverse byte order. (#502 ) Currently EndiannessReverser(Data\|Index)Input doesn't reverse the byte order for `readLongs` and `readFloats`. The reasoning is that these two method replaced `readLELongs` and `readLEFloats`, so the byte order doesn't need changing. However this is creating some confusing situations where new code expects a consistent byte order on the write and read sides and gets longs in the wrong byte order. So this commit suggests a different approach, where EndiannessReverser(Data\|Index)Input always changes the byte order, and former call sites of `readLELongs` and `readLEFloats` are changed to manually reverse the byte order on top of `readLongs` and `readFloats`. This is making old codecs a bit slower, but I think it's fair since these are old codecs. And this makes the endianness reversing backward compatibility layer easier to reason about?	2021-12-02 14:00:46 +01:00
Ignacio Vera	704193f6bf	LUCENE-10279: Fix equals in MultiRangeQuery (#503 )	2021-12-02 13:33:49 +01:00
Robert Muir	d74255a96c	improve term vector merging tests (#500 ) Use less iterations locally so that term vector merging doesn't dominate the list of slowest tests. Split out deletes/no-deletes into separate methods to improve debuggability. Remove nightly from SimpleText term vectors merging tests, now that they run much faster.	2021-12-02 05:29:41 -05:00
Ignacio Vera	efc713c9c5	LUCENE-10275: Speed up MultiRangeQuery by using an interval tree	2021-12-02 09:53:23 +01:00
Adrien Grand	ffb58f6e75	Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )" This reverts commit `5eb575f8ab`.	2021-12-02 08:33:38 +01:00
Robert Muir	387e67ec87	LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497 ) * LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer The new SpanishPluralStemmer is in fact more "minimal", less agressive stemming and normalization. For the user that wants only plural stemming, it is the better choice.	2021-12-01 14:24:58 -05:00
Adrien Grand	f605b4a692	LUCENE-10253: Remove the BadApple annotation. (#468 )	2021-12-01 18:03:02 +01:00

1 2 3 4 5 ...

35581 Commits All Branches Search

35581 Commits

All Branches