35647 Commits

Author SHA1 Message Date
Dawid Weiss d2b7e7a441
LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514) 2021-12-04 09:56:28 +01:00
Robert Muir a39337e595
LUCENE-10010: fix TestMockAnalyzer to determinize
Test would randomly fail, if RegExp parsing returned an NFA, because it
wasn't explicitly determinizing itself.

This is a bit of a trap in RegExp: it calls minimize() as it parses,
so most of the time it returns a DFA. This may be
unnecessary...
2021-12-03 20:44:45 -05:00
Robert Muir c8f5b9127d
LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465)
* Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars
* regenerate emoji conformance tests for unicode 12.1
* modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties)
* regenerate wordbreak conformance tests
* Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric
* Use jflex emoji properties rather than ICU-generated ones
2021-12-03 20:20:57 -05:00
Robert Muir b2e866b703
LUCENE-10010: don't determinize in CompiledAutomaton/RunAutomaton (#485)
Instead, require that the incoming automaton is determinized by the caller, throwing an exception if it isn't.

This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.

In the meantime, it cleans up plenty of APIs by not having to plumb `determinizeWorkLimit` parameters down to the guts.
2021-12-03 19:48:33 -05:00
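The "require a determinized automaton, otherwise throw" contract from LUCENE-10010 can be sketched as follows. This is a simplified illustration and not Lucene's code: the class `DeterminismSketch`, its methods, and the `{from, label, to}` transition encoding are all hypothetical, and real Lucene automata use label ranges rather than single labels.

```java
import java.util.HashSet;
import java.util.Set;

final class DeterminismSketch {
  /**
   * Each transition is a hypothetical triple {fromState, label, toState}.
   * Any repeated (fromState, label) pair is treated as non-deterministic,
   * including an exact duplicate of the same transition.
   */
  static boolean isDeterministic(int[][] transitions) {
    Set<Long> seen = new HashSet<>();
    for (int[] t : transitions) {
      // Pack (fromState, label) into a single long key.
      long key = ((long) t[0] << 32) | (t[1] & 0xFFFFFFFFL);
      if (!seen.add(key)) {
        return false;
      }
    }
    return true;
  }

  /** The caller must determinize first; this method never does it for them. */
  static void requireDeterministic(int[][] transitions) {
    if (!isDeterministic(transitions)) {
      throw new IllegalArgumentException("automaton must be deterministic");
    }
  }
}
```

Rejecting the input instead of silently determinizing it keeps the cost visible to the caller, which is what removes the need to pass a work limit down through the API.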
gf2121 5f09bb3fab
LUCENE-10233: Use AND NOT for inverse intersector (#499)
When docIds are stored as a BitSet, use andNot to speed up collecting them.
2021-12-03 09:23:16 +01:00
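The andNot idea in LUCENE-10233 can be sketched with `java.util.BitSet`. This is only an illustration under assumptions: Lucene uses its own bit set classes, and the names `InverseCollectSketch` and `collectInverse` are hypothetical. The sketch assumes the "inverse" case means starting from all documents and removing the ones that do not match.

```java
import java.util.BitSet;

final class InverseCollectSketch {
  /** Returns the docs in [0, maxDoc) that are not set in nonMatching. */
  static BitSet collectInverse(int maxDoc, BitSet nonMatching) {
    BitSet result = new BitSet(maxDoc);
    result.set(0, maxDoc);      // start with every document
    result.andNot(nonMatching); // clear all non-matching docs in one bulk operation
    return result;
  }
}
```

The bulk `andNot` works on whole words of bits at a time, which is why it can beat clearing doc IDs one by one.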
Ignacio Vera ec7a79eb56
LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507) 2021-12-03 07:23:31 +01:00
Misha Tiurin a26ea57ec7
Remove duplicate entries in SpanishPluralStemmer invariants list (#508)
* Remove duplicate entries in SpanishPluralStemmer invariants list
Add assertion to prevent duplicates in the future

Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>
2021-12-02 14:11:15 -05:00
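A duplicate guard like the one this commit describes could look roughly like the sketch below. The class and method names are hypothetical, and the sketch throws an explicit exception rather than using a Java `assert`, so the check fires even when assertions are disabled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InvariantsSketch {
  /** Builds a set from the list and fails if any entry appears twice. */
  static Set<String> toUniqueSet(List<String> words) {
    Set<String> set = new HashSet<>(words);
    // A smaller set means at least one entry was collapsed as a duplicate.
    if (set.size() != words.size()) {
      throw new IllegalArgumentException("duplicate entries in list");
    }
    return set;
  }
}
```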
Robert Muir 77563c2c15
LUCENE-10278: don't write zero-sized array in this test (#501)
DocIdsWriter is not prepared for this.
2021-12-02 15:53:09 +01:00
gf2121 3d61ff2cf6 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) 2021-12-02 15:23:42 +01:00
Adrien Grand 0ec217b753
Make EndiannessReverser(Data|Index)Input always reverse byte order. (#502)
Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for
`readLongs` and `readFloats`. The reasoning is that these two methods replaced
`readLELongs` and `readLEFloats`, so the byte order doesn't need changing.

However this is creating some confusing situations where new code expects a
consistent byte order on the write and read sides and gets longs in the wrong
byte order. So this commit suggests a different approach, where
EndiannessReverser(Data|Index)Input always changes the byte order, and former
call sites of `readLELongs` and `readLEFloats` are changed to manually reverse
the byte order on top of `readLongs` and `readFloats`.

This is making old codecs a bit slower, but I think it's fair since these are
old codecs. And this makes the endianness reversing backward compatibility layer
easier to reason about?
2021-12-02 14:00:46 +01:00
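The "manually reverse the byte order on top of `readLongs`" step from the commit above can be sketched with the JDK's `Long.reverseBytes`. The helper below is hypothetical and only shows the post-processing of an already filled array, not Lucene's actual input classes.

```java
final class ByteOrderSketch {
  /** Reverses the byte order of values[offset .. offset + length) in place. */
  static void reverseInPlace(long[] values, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      values[i] = Long.reverseBytes(values[i]);
    }
  }
}
```

A former `readLELongs` call site would, under this assumption, do a bulk read first and then call a helper of this kind on the destination array.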
Ignacio Vera 704193f6bf
LUCENE-10279: Fix equals in MultiRangeQuery (#503) 2021-12-02 13:33:49 +01:00
Robert Muir d74255a96c
improve term vector merging tests (#500)
Use fewer iterations locally so that term vector merging doesn't dominate
the list of slowest tests.

Split out deletes/no-deletes into separate methods to improve
debuggability.

Remove nightly from SimpleText term vectors merging tests, now that they
run much faster.
2021-12-02 05:29:41 -05:00
Ignacio Vera efc713c9c5
LUCENE-10275: Speed up MultiRangeQuery by using an interval tree 2021-12-02 09:53:23 +01:00
Adrien Grand ffb58f6e75 Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438)"
This reverts commit 5eb575f8ab.
2021-12-02 08:33:38 +01:00
Robert Muir 387e67ec87
LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497)
* LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer

The new SpanishPluralStemmer is in fact more "minimal", with less aggressive
stemming and normalization. For the user that wants only plural
stemming, it is the better choice.
2021-12-01 14:24:58 -05:00
Adrien Grand f605b4a692
LUCENE-10253: Remove the BadApple annotation. (#468) 2021-12-01 18:03:02 +01:00
gf2121 5eb575f8ab
LUCENE-10233: Store docIds as bitset to speed up addAll (#438) 2021-12-01 15:31:05 +01:00
Tomoko Uchida a7ebf6618c
move build related changes entry to the 'Build' section from 'Other' section (#496) 2021-12-01 20:06:34 +09:00
Robert Muir 4dc3e8ab01
LUCENE-10270: Improve MIGRATE.md (#491)
* LUCENE-10270: Improve MIGRATE.md

* Separate sections for 9.0 and 9.1
* Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`.
* Specify "java" for text blocks to get syntax highlighting
* When provided, consistently put JIRA issue in the same place
* Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading)
* More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc)
* Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc)

* LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link

* LUCENE-10270: fix things found by msokolov
2021-12-01 05:29:33 -05:00
Adrien Grand 939054e4a0 Add 8.11 indices to the list of backward indices. 2021-12-01 11:23:50 +01:00
Dawid Weiss a37b74a630 LUCENE-10234: update smoke-tester with new module names. 2021-12-01 09:59:15 +01:00
Greg Miller bd68624639 Move CHANGES entry for LUCENE-10232 to 9.0 2021-11-30 14:37:07 -08:00
Dawid Weiss 20cb6817db
LUCENE-10234: Change module prefix to org.apache.* (#487) 2021-11-30 22:03:33 +01:00
Robert Muir 5d18596d3d
LUCENE-10248: add CHANGES.txt entry 2021-11-30 15:57:24 -05:00
Xavier Sanchez Loro edb936f090
LUCENE-10248: Spanish Plural Stemmer (#461)
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373

This approach is based on rules specified in Wikilengua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)

Some characteristics:

* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin: e.g. complots, bits, punks, robots
* Support for invariant words, where the plural and singular forms are the same or a plural does not make sense: e.g. crisis, jueves, lapsus, abrebotellas, etc
* Support for special cases: e.g. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in form of singular
* No strange stems like “amig”: it’s true that stemmers need not generate grammatically correct tokens, but generating correct stems decreases the possibility of collisions with other words
2021-11-30 15:51:10 -05:00
Greg Miller f48a430f35
LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match (#437) 2021-11-30 11:58:38 -08:00
Robert Muir 46a5a57724
LUCENE-10272: cross-check norms with postings in checkindex (#493)
Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.
2021-11-30 14:21:40 -05:00
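The cross-check described for LUCENE-10272 boils down to comparing an expected count against the number of norms actually seen. The sketch below is a loose illustration, not CheckIndex itself: the class name, the inputs, and the assumption that the expected count is the number of docs with postings for the field are all hypothetical.

```java
import java.util.BitSet;

final class NormsCheckSketch {
  /**
   * docsWithPostings: docs expected to carry a norm (assumption for this sketch).
   * docsWithNorms: doc IDs actually encountered while iterating norms.
   */
  static void checkNormsCount(BitSet docsWithPostings, int[] docsWithNorms) {
    int expected = docsWithPostings.cardinality();
    int seen = docsWithNorms.length;
    // Validating each norm individually cannot notice a norm that is absent;
    // comparing the totals can.
    if (seen != expected) {
      throw new IllegalStateException(
          "expected " + expected + " norms but saw " + seen);
    }
  }
}
```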
Alan Woodward 749b744c0c
LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477)
If all documents in the segment have a value, then `Reader.getDocCount()` will
equal `maxDoc` and we can return `numDocs` as a shortcut.
2021-11-30 10:00:38 +00:00
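The shortcut in LUCENE-10263 can be written as a small pure function. The class and parameter names are hypothetical; returning -1 for "cannot be counted cheaply" follows the convention of `Weight.count()` but is stated here as an assumption of the sketch.

```java
final class CountShortcutSketch {
  /**
   * docCount: number of docs with a value for the field in this segment.
   * maxDoc: total number of docs in the segment, including deleted ones.
   * numDocs: number of live (non-deleted) docs in the segment.
   */
  static int count(int docCount, int maxDoc, int numDocs) {
    if (docCount == maxDoc) {
      // Every doc has a value, so every live doc matches.
      return numDocs;
    }
    return -1; // no cheap answer; the caller must fall back to iterating
  }
}
```

Returning `numDocs` rather than `maxDoc` keeps deleted documents out of the count.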
Greg Miller 4f5b41a71c
Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490) 2021-11-29 18:00:52 -08:00
Robert Muir 453168ec76
support tables in generated html documentation (#489)
Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.
2021-11-29 17:38:14 -05:00
Robert Muir 5aa9da9ead
Improve MIGRATE.md around analyzers artifacts. (#488)
* Improve MIGRATE.md around analyzers artifacts.

Move this to the very top of MIGRATE, the user needs to first be able to
pull in the artifacts, before doing anything else like trying to
compile, deal with renamed classes, etc.

Add a table of each package that got moved, with explicit old and new
names. Hopefully it helps search engines and users.

Link to MIGRATE.md explicitly from README.md
2021-11-29 17:04:15 -05:00
Ignacio Vera 78c8d7b7ea
LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428)
Detect self-intersections so it can provide a more meaningful error to the users.
2021-11-29 11:05:03 +01:00
Ignacio Vera 634c22c527
LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478)
Fixes a race condition introduced in LUCENE-9820.
2021-11-29 09:20:20 +01:00
Robert Muir 63c89f678d
Speed up ECJ tasks by avoiding --release (#484)
LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag.

Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.
2021-11-28 15:10:32 -05:00
Robert Muir 1fb45da7bb
upgrade ecj linter from 3.25.0 -> 3.27.0 (#483)
The newest version has a significant performance increase for our
use-case.
2021-11-28 12:05:19 -05:00
Robert Muir 3772ff563a
speed up extremely slow test methods (runtime 15-30s) (#471) 2021-11-28 09:40:43 -05:00
Tomoko Uchida cb5f1b6ca0
Use the same analysis chain as StandardAnalyzer (a follow-up of #480) (#482) 2021-11-28 21:22:28 +09:00
Tomoko Uchida c041517304
set group to 'run' benchmark task (#481) 2021-11-28 21:22:07 +09:00
Tomoko Uchida 9eb7857199 fix typo in documentation 2021-11-28 10:11:49 +09:00
Uwe Schindler aed47c1862 Fix wrong path in documentation 2021-11-28 00:55:28 +01:00
Tomoko Uchida 57f695b14d
LUCENE-10261: clean up reflection stuff in luke module and make minor adjustments (#480) 2021-11-27 15:36:38 +09:00
Dawid Weiss 1029651d12 Don't log warnings from ant (different class loader, I guess). Makes Alan happier. 2021-11-26 11:39:55 +01:00
Dawid Weiss 651755aab7
LUCENE-10260: Luke's about window no longer shows version number (#473) 2021-11-26 08:32:23 +01:00
Ignacio Vera a590c6d2a0
LUCENE-10262: Lift up restrictions for navigating PointValues#PointTree (#476)
This change allows random navigation of a PointValues#PointTree.
2021-11-26 07:42:13 +01:00
Uwe Schindler d973e50c15
LUCENE-10259: Fix startup scripts to allow whitespace in path names and use /bin/sh only (#472) 2021-11-25 16:07:23 +01:00
Tomoko Uchida 40b38438c8
LUCENE-10261: Remove preset analyzer panel from Luke Analysis UI. (#475) 2021-11-25 20:30:36 +09:00
Ignacio Vera 800f002e44
LUCENE-9820: PointTree#size() should handle the case of balanced tree in pre-8.6 indexes (#462)
Properly handle the case where trees are fully balanced and the number of dimensions is > 1
2021-11-25 11:03:16 +01:00
Adrien Grand 8710252116 Fix test failures with testIndexUpgraderCommandLineArgs and ExtraFS. 2021-11-25 08:51:56 +01:00
Adrien Grand f80d816ce7
Speed up TestBackwardsCompatibility#testCommandLineArgs. (#467)
This test unzips files that we already unzipped. This commit copies the already
uncompressed files instead.
2021-11-24 08:25:22 +01:00
Adrien Grand 24fcd80a37
LUCENE-10168: Only test N-2 codecs on nightly runs. (#466)
In order for tests to keep running fast, this annotates all tests of N-2 codecs
with `@Nightly`. To keep good coverage of releases, the smoke tester is now
configured to run nightly tests.
2021-11-24 08:20:04 +01:00