Roughly five times in every 10,000 test runs, we did not index any
document with i between 0 and 10 (inclusive), which caused the delete
tests to fail. With this commit, we make sure that we always index at
least one document with i between 0 and 10.
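A minimal sketch of the fix (indexDoc, numDocs, and maxId are hypothetical names; random() is LuceneTestCase's seeded random):

```java
// Sketch: guarantee one document whose id falls in [0, 10] before the
// random indexing, so the delete tests always have something to delete.
indexDoc(random().nextInt(11)); // hypothetical helper; id in [0, 10] inclusive
for (int i = 0; i < numDocs; i++) {
  indexDoc(random().nextInt(maxId)); // the rest stay random, as before
}
```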
In current trunk, we let the caller (e.g. RegExpQuery) try to "reduce" the expression. Neither the parser nor the low-level executors implicitly call exponential-time algorithms anymore.
But now that we have cleaned this up, we can see it is even worse than just calling determinize(): we still call minimize(), which is far costlier.
We stopped doing this for all other AutomatonQuery subclasses a long time ago, after determining that it didn't help performance. Additionally, minimization vs. determinization matters far less than in the early days when we ran into trouble: the representation has gotten a lot better. Today, finishState does a lot of practical sorting/coalescing on the fly, and the fancy UTF32-to-UTF8 automaton converter makes the worst-case space per state significantly lower than it used to be. So why minimize()?
Let's just replace minimize() calls with determinize() calls. I've already swapped them out for all of src/test, to get Jenkins looking for issues ahead of time.
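Roughly, the swap looks like this (a sketch assuming the Operations/MinimizationOperations APIs and the DEFAULT_DETERMINIZE_WORK_LIMIT constant from recent Lucene versions):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.MinimizationOperations;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

class MinimizeToDeterminize {
  static Automaton before(RegExp re) {
    // Old pattern: Hopcroft minimization, which implies determinization anyway
    return MinimizationOperations.minimize(
        re.toAutomaton(), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
  }

  static Automaton after(RegExp re) {
    // New pattern: determinize only; a minimal DFA is not needed for correctness
    return Operations.determinize(
        re.toAutomaton(), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
  }
}
```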
This change moves Hopcroft minimization (MinimizationOperations) to src/test for now. I'd like to explore nuking it from there as a next step; any tests that truly need minimization should be fine with Brzozowski's
algorithm.
This exercises a challenging case where the documents to skip all happen to
be the closest to the query vector. Even so, HNSW appears to be robust to this
scenario in many cases and maintains good recall.
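A minimal sketch of the adversarial setup (field names and values are hypothetical): the filter excludes exactly the documents nearest to the query vector, forcing HNSW to search past them:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SkipNearestSketch {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < 100; i++) {
        Document doc = new Document();
        // Vectors with small i are nearest to the query vector below
        doc.add(new KnnFloatVectorField("vector", new float[] {i, 0}));
        doc.add(new StringField("keep", i < 10 ? "no" : "yes", Field.Store.NO));
        w.addDocument(doc);
      }
      w.commit();
      try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // The 10 nearest docs are filtered out; HNSW must dig past them
        Query filter = new TermQuery(new Term("keep", "yes"));
        TopDocs hits = searcher.search(
            new KnnFloatVectorQuery("vector", new float[] {0, 0}, 10, filter), 10);
        System.out.println(hits.totalHits);
      }
    }
  }
}
```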
Previously, RegExp called minimize() at every parsing step. There is little point in adding NFA execution while it does this: minimize() implies an exponential-time determinize().
Moreover, some minimize() calls are missing, and in fact in rare cases RegExp can already return an NFA today (for certain syntax).
Instead, RegExp parsing should do none of this; it may return either a DFA or an NFA. NOTE: many simple regexps still happen to come back as DFAs, simply because of the algorithms in use.
Callers can decide whether to determinize or minimize. RegExp parsing should not run in exponential time.
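A sketch of the caller-side pattern under this change (work-limit constant from recent Lucene versions):

```java
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

class CallerDecides {
  static Automaton toDfa(String pattern) {
    Automaton a = new RegExp(pattern).toAutomaton(); // may be a DFA or an NFA
    if (!a.isDeterministic()) {
      // Only the caller pays the (potentially exponential) determinization cost
      a = Operations.determinize(a, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    }
    return a;
  }
}
```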
All src/java call sites were modified to call minimize(), to prevent any performance problems. minimize() seems unnecessary, but let's tackle removing minimization in a separate PR. src/test was fixed to just use determinize() in preparation for this.
Add new unit test for RegExp parsing
The new test exercises each symbol/node independently, making this code easier to maintain.
It brings coverage of the regexp parser above 90%.
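A sketch of the per-symbol style (class and method names are hypothetical; the exact LuceneTestCase package varies by version):

```java
import org.apache.lucene.tests.util.LuceneTestCase;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;
import org.apache.lucene.util.automaton.RegExp;

public class TestRegExpParsingSketch extends LuceneTestCase {
  // One focused test per symbol: a single literal character
  public void testChar() {
    Automaton actual = Operations.determinize(
        new RegExp("a").toAutomaton(), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    assertTrue(Operations.sameLanguage(Automata.makeChar('a'), actual));
  }

  // ... and one for the any-character symbol
  public void testAnyChar() {
    Automaton actual = Operations.determinize(
        new RegExp(".").toAutomaton(), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    assertTrue(Operations.sameLanguage(Automata.makeAnyChar(), actual));
  }
}
```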
This is the slowest test suite, running for ~60s, because between every
document it adds 2048 "filler docs". That just adds up to a ton of
indexing across all the test methods.
Use 2048 filler docs for Nightly runs, and a much smaller number (4) for local builds.
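A sketch using LuceneTestCase's standard TEST_NIGHTLY flag (the constant name here is hypothetical):

```java
// Heavy filler-doc count only on Nightly; keep local runs fast
private static final int NUM_FILLER_DOCS = TEST_NIGHTLY ? 2048 : 4;
```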
JFlex grammars now avoid applying the complement operator twice as a De Morgan workaround for "macros in char classes". With the latest version of JFlex, we can just do the subtraction directly and avoid unnecessary NFA-to-DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` by around 3x.
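An illustrative before/after for the grammar (macro names are hypothetical, not the actual UAX29 macros):

```
// Old workaround: double complement (De Morgan) to subtract one class from another:
//   LetterMinusKeep = !(!{Letter} | {Keep})
// Newer JFlex supports character-class subtraction directly:
//   LetterMinusKeep = [{Letter} -- {Keep}]
```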