Five times every 10 000 tests, we did not index any documents with i
between 0 and 10 (inclusive), which caused the deleted tests to fail.
With this commit, we make sure that we always index at least one
document between 0 and 10.
This exercises a challenging case where the documents to skip all happen to
be closest to the query vector. In many cases, HNSW appears to be robust to this
case and maintains good recall.
This is the slowest test suite, runs for ~ 60s, because between every
document it adds 2048 "filler docs". This just adds up to a ton of
indexing across all the test methods.
Use 2048 for Nightly, and instead a smaller number (4) for local builds.
Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.
This test runs across every IndexOptions, indexing hundreds of fields.
It is slow for some implementations (e.g. SimpleText).
Use less fields for normal runs.
* Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars
* regenerate emoji conformance tests for unicode 12.1
* modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties)
* regenerate wordbreak conformance tests
* Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric
* Use jflex emoji properties rather than ICU-generated ones
* Remove duplicate entries in SpanishPluralStemmer invariants list
Add assertion to prevent duplicates in the future
Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>
Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for
`readLongs` and `readFloats`. The reasoning is that these two method replaced
`readLELongs` and `readLEFloats`, so the byte order doesn't need changing.
However this is creating some confusing situations where new code expects a
consistent byte order on the write and read sides and gets longs in the wrong
byte order. So this commit suggests a different approach, where
EndiannessReverser(Data|Index)Input always changes the byte order, and former
call sites of `readLELongs` and `readLEFloats` are changed to manually reverse
the byte order on top of `readLongs` and `readFloats`.
This is making old codecs a bit slower, but I think it's fair since these are
old codecs. And this makes the endianness reversing backward compatibility layer
easier to reason about?
Use less iterations locally so that term vector merging doesn't dominate
the list of slowest tests.
Split out deletes/no-deletes into separate methods to improve
debuggability.
Remove nightly from SimpleText term vectors merging tests, now that they
run much faster.