lucene

Commit Graph

Author	SHA1	Message	Date
Adrien Grand	68e94c9597	DOAP changes for release 9.0.0	2021-12-07 14:38:41 +01:00
Ignacio Vera	af1e68b891	LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520 )	2021-12-07 07:41:09 +01:00
Tomoko Uchida	35eff443a7	LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522 )	2021-12-07 15:32:12 +09:00
gf2121	8525356c8a	LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510 )	2021-12-07 07:26:03 +01:00
Robert Muir	4fda0766da	speed up TestSimpleExplanationsWithFillerDocs (#516 ) This is the slowest test suite, runs for ~ 60s, because between every document it adds 2048 "filler docs". This just adds up to a ton of indexing across all the test methods. Use 2048 for Nightly, and instead a smaller number (4) for local builds.	2021-12-06 22:12:36 -05:00
Robert Muir	5c746db53e	simplify jflex grammars by using difference rather than negation (#515 ) Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.	2021-12-06 21:59:13 -05:00
Uwe Schindler	ec57641ea5	LUCENE-10287: Add changes entry	2021-12-06 20:27:59 +01:00
Uwe Schindler	9cb16df215	LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517 )	2021-12-06 20:23:48 +01:00
gf2121	459388cfe1	LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512 ) Co-authored-by: guofeng.my <guofeng.my@bytedance.com>	2021-12-06 07:33:32 +01:00
Robert Muir	adaf610671	tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly (#506 ) this test currently indexes up to 600 docs for each thread.	2021-12-04 14:21:45 -05:00
Robert Muir	2c750bbedf	tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly (#505 ) This test runs across every IndexOptions, indexing hundreds of fields. It is slow for some implementations (e.g. SimpleText). Use less fields for normal runs.	2021-12-04 14:21:28 -05:00
Robert Muir	b9934a3dcf	Make TestNRTReplication.testCrashReplica nightly (#504 ) This test is forking and crashing JVMs, always runs over 10 seconds	2021-12-04 14:21:07 -05:00
Dawid Weiss	d2b7e7a441	LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514 )	2021-12-04 09:56:28 +01:00
Robert Muir	a39337e595	LUCENE-10010: fix TestMockAnalyzer to determinize Test would randomly fail, if RegExp parsing returned an NFA, because it wasn't explicitly determinizing itself. This is a bit of a trap in RegExp, it calls minimize()-as-it-parses, so at least most of the time, it returns a DFA. This may be unnecessary...	2021-12-03 20:44:45 -05:00
Robert Muir	c8f5b9127d	LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465 ) * Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars * regenerate emoji conformance tests for unicode 12.1 * modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties) * regenerate wordbreak conformance tests * Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric * Use jflex emoji properties rather than ICU-generated ones	2021-12-03 20:20:57 -05:00
Robert Muir	b2e866b703	LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton (#485 ) Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't. This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums. In the meantime, it cleans up plenty of APIs by not having to plumb `determinizeWorkLimit` parameters down to the guts.	2021-12-03 19:48:33 -05:00
gf2121	5f09bb3fab	LUCENE-10233: Use AND NOT for inverse intersector (#499 ) When docIds are stored as a BitSet, use andNot to speed up collecting them.	2021-12-03 09:23:16 +01:00
Ignacio Vera	ec7a79eb56	LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507 )	2021-12-03 07:23:31 +01:00
Misha Tiurin	a26ea57ec7	Remove duplicate entries in SpanishPluralStemmer invariants list (#508 ) * Remove duplicate entries in SpanishPluralStemmer invariants list Add assertion to prevent duplicates in the future Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>	2021-12-02 14:11:15 -05:00
Robert Muir	77563c2c15	LUCENE-10278: don't write zero-sized array in this test (#501 ) DocIdsWriter is not prepared for this.	2021-12-02 15:53:09 +01:00
gf2121	3d61ff2cf6	LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )	2021-12-02 15:23:42 +01:00
Adrien Grand	0ec217b753	Make EndiannessReverser(Data|Index)Input always reverse byte order. (#502 ) Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for `readLongs` and `readFloats`. The reasoning is that these two method replaced `readLELongs` and `readLEFloats`, so the byte order doesn't need changing. However this is creating some confusing situations where new code expects a consistent byte order on the write and read sides and gets longs in the wrong byte order. So this commit suggests a different approach, where EndiannessReverser(Data|Index)Input always changes the byte order, and former call sites of `readLELongs` and `readLEFloats` are changed to manually reverse the byte order on top of `readLongs` and `readFloats`. This is making old codecs a bit slower, but I think it's fair since these are old codecs. And this makes the endianness reversing backward compatibility layer easier to reason about?	2021-12-02 14:00:46 +01:00
Ignacio Vera	704193f6bf	LUCENE-10279: Fix equals in MultiRangeQuery (#503 )	2021-12-02 13:33:49 +01:00
Robert Muir	d74255a96c	improve term vector merging tests (#500 ) Use less iterations locally so that term vector merging doesn't dominate the list of slowest tests. Split out deletes/no-deletes into separate methods to improve debuggability. Remove nightly from SimpleText term vectors merging tests, now that they run much faster.	2021-12-02 05:29:41 -05:00
Ignacio Vera	efc713c9c5	LUCENE-10275: Speed up MultiRangeQuery by using an interval tree	2021-12-02 09:53:23 +01:00
Adrien Grand	ffb58f6e75	Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )" This reverts commit `5eb575f8ab`.	2021-12-02 08:33:38 +01:00
Robert Muir	387e67ec87	LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497 ) * LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer The new SpanishPluralStemmer is in fact more "minimal", less aggressive stemming and normalization. For the user that wants only plural stemming, it is the better choice.	2021-12-01 14:24:58 -05:00
Adrien Grand	f605b4a692	LUCENE-10253: Remove the BadApple annotation. (#468 )	2021-12-01 18:03:02 +01:00
gf2121	5eb575f8ab	LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )	2021-12-01 15:31:05 +01:00
Tomoko Uchida	a7ebf6618c	move build related changes entry to the 'Build' section from 'Other' section (#496 )	2021-12-01 20:06:34 +09:00
Robert Muir	4dc3e8ab01	LUCENE-10270: Improve MIGRATE.md (#491 ) * LUCENE-10270: Improve MIGRATE.md * Separate sections for 9.0 and 9.1 * Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`. * Specify "java" for text blocks to get syntax highlighting * When provided, consistently put JIRA issue in the same place * Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading) * More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc) * Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc) * LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link * LUCENE-10270: fix things found by msokolov	2021-12-01 05:29:33 -05:00
Adrien Grand	939054e4a0	Add 8.11 indices to the list of backward indices.	2021-12-01 11:23:50 +01:00
Dawid Weiss	a37b74a630	LUCENE-10234: update smoke-tester with new module names.	2021-12-01 09:59:15 +01:00
Greg Miller	bd68624639	Move CHANGES entry for LUCENE-10232 to 9.0	2021-11-30 14:37:07 -08:00
Dawid Weiss	20cb6817db	LUCENE-10234: Change module prefix to org.apache.* (#487 )	2021-11-30 22:03:33 +01:00
Robert Muir	5d18596d3d	LUCENE-10248: add CHANGES.txt entry	2021-11-30 15:57:24 -05:00
Xavier Sanchez Loro	edb936f090	LUCENE-10248: Spanish Plural Stemmer (#461 ) Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches. See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373 This approach is based on rules specified in WikiLingua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n) Some characteristics: * Designed to stem just plural to singular form * Distinguishes between masculine and feminine forms * It will increase recall but precision can be reduced depending on the use case/information need * Stems plural words of foreign origin: i.e. complots, bits, punks, robots * Support for invariant words: same plural and singular form or plural does not make sense: i.e. crisis, jueves, lapsus, abrebotellas, etc * Support for special cases: i.e. yoes, clubes, itemes, faralaes * Use it when the distinction between singular and plural is not relevant but gender is relevant * Produces meaningful tokens in form of singular * Not strange stems like “amig”: it’s true that stemmers must not generate grammatically correct tokens, but if we generate correct stems we decrease the possibility of collisions with other words	2021-11-30 15:51:10 -05:00
Greg Miller	f48a430f35	LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match (#437 )	2021-11-30 11:58:38 -08:00
Robert Muir	46a5a57724	LUCENE-10272: cross-check norms with postings in checkindex (#493 ) Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.	2021-11-30 14:21:40 -05:00
Alan Woodward	749b744c0c	LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477 ) If all documents in the segment have a value, then `Reader.getDocCount()` will equal `maxDoc` and we can return `numDocs` as a shortcut.	2021-11-30 10:00:38 +00:00
Greg Miller	4f5b41a71c	Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490 )	2021-11-29 18:00:52 -08:00
Robert Muir	453168ec76	support tables in generated html documentation (#489 ) Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.	2021-11-29 17:38:14 -05:00
Robert Muir	5aa9da9ead	Improve MIGRATE.md around analyzers artifacts. (#488 ) * Improve MIGRATE.md around analyzers artifacts. Move this to the very top of MIGRATE, the user needs to first be able to pull in the artifacts, before doing anything else like trying to compile, deal with renamed classes, etc. Add a table of each package that got moved, with explicit old and new names. Hopefully it helps search engines and users. Link to MIGRATE.md explicitly from README.md	2021-11-29 17:04:15 -05:00
Ignacio Vera	78c8d7b7ea	LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428 ) Detect self-intersections so it can provide a more meaningful error to the users.	2021-11-29 11:05:03 +01:00
Ignacio Vera	634c22c527	LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478 ) Fixes a race condition introduced in LUCENE-9820.	2021-11-29 09:20:20 +01:00
Robert Muir	63c89f678d	Speed up ECJ tasks by avoiding --release (#484 ) LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag. Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.	2021-11-28 15:10:32 -05:00
Robert Muir	1fb45da7bb	upgrade ecj linter from 3.25.0 -> 3.27.0 (#483 ) The newest version has a significant performance increase for our use-case.	2021-11-28 12:05:19 -05:00
Robert Muir	3772ff563a	speed up extremely slow test methods (runtime 15-30s) (#471 )	2021-11-28 09:40:43 -05:00
Tomoko Uchida	cb5f1b6ca0	Use the same analysis chain to StandardAnalyzer (a follow-up of #480 ) (#482 )	2021-11-28 21:22:28 +09:00
Tomoko Uchida	c041517304	set group to 'run' benchmark task (#481 )	2021-11-28 21:22:07 +09:00


35609 Commits, All Branches
