Test would randomly fail, if RegExp parsing returned an NFA, because it
wasn't explicitly determinizing itself.
This is a bit of a trap in RegExp, it calls minimize()-as-it-parses,
so at least most of the time, it returns a DFA. This may be
unnecessary...
* Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars
* regenerate emoji conformance tests for unicode 12.1
* modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties)
* regenerate wordbreak conformance tests
* Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric
* Use jflex emoji properties rather than ICU-generated ones
Instead, require that incoming automata is determinized by the caller, throwing an exception if it isn't.
This paves the way for NFA execution in the future: if you pass an NFA to AutomatonQuery, we should use the NFA algorithm on it. No need for lots of booleans or enums.
In the meantime, it cleans up plenty of APIs by not having to plumb `determinizeWorkLimit` parameters down to the guts.
* Remove duplicate entries in SpanishPluralStemmer invariants list
Add assertion to prevent duplicates in the future
Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>
Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for
`readLongs` and `readFloats`. The reasoning is that these two method replaced
`readLELongs` and `readLEFloats`, so the byte order doesn't need changing.
However this is creating some confusing situations where new code expects a
consistent byte order on the write and read sides and gets longs in the wrong
byte order. So this commit suggests a different approach, where
EndiannessReverser(Data|Index)Input always changes the byte order, and former
call sites of `readLELongs` and `readLEFloats` are changed to manually reverse
the byte order on top of `readLongs` and `readFloats`.
This is making old codecs a bit slower, but I think it's fair since these are
old codecs. And this makes the endianness reversing backward compatibility layer
easier to reason about?
Use less iterations locally so that term vector merging doesn't dominate
the list of slowest tests.
Split out deletes/no-deletes into separate methods to improve
debuggability.
Remove nightly from SimpleText term vectors merging tests, now that they
run much faster.
* LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer
The new SpanishPluralStemmer is in fact more "minimal", less agressive
stemming and normalization. For the user that wants only plural
stemming, it is the better choice.
* LUCENE-10270: Improve MIGRATE.md
* Separate sections for 9.0 and 9.1
* Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`.
* Specify "java" for text blocks to get syntax highlighting
* When provided, consistently put JIRA issue in the same place
* Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading)
* More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc)
* Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc)
* LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link
* LUCENE-10270: fix things found by msokolov
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.
See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373
This approach is based on rules specified in WikiLingua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)
Some characteristics:
* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin: i.e. complots, bits, punks, robots
* Support for invariant words: same plural and singular form or plural does not make sense: i.e. crisis, jueves, lapsus, abrebotellas, etc
* Support for special cases: i.e. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in form of singular
* Not strange stems like “amig”: it’s true that stemmers must not generate grammatically correct tokens, but if we generate correct stems we decrease the possibility of collisions with other words
Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.
* Improve MIGRATE.md around analyzers artifacts.
Move this to the very top of MIGRATE, the user needs to first be able to
pull in the artifacts, before doing anything else like trying to
compile, deal with renamed classes, etc.
Add a table of each package that got moved, with explicit old and new
names. Hopefully it helps search engines and users.
Link to MIGRATE.md explicitly from README.md
LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag.
Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.
In order for tests to keep running fast, this annotates all tests of N-2 codecs
with `@Nightly`. To keep good coverage of releases, the smoke tester is now
configured to run nightly tests.