Even though it was not the driver for the slowdown, in LUCENE-10125 we
identified that the move to PFOR had slowed down indexing significantly
for fields indexed with indexOptions=DOCS. This patch gets some of the
performance back by using the `LongHeap` that we introduced for vectors
instead of sorting the same array over and over again.
On the NYC Taxis benchmark, I observed ~8% faster merges of postings
with this change.
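As a rough illustration of why a heap helps here, the sketch below merges several sorted runs of doc IDs with a min-heap instead of re-sorting one array after every step. It is not the patch's code: java.util.PriorityQueue stands in for Lucene's LongHeap, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: java.util.PriorityQueue stands in for Lucene's LongHeap,
// and HeapMergeSketch is a hypothetical name, not a Lucene class.
final class HeapMergeSketch {

  /** Merges already-sorted inputs into one sorted list using a min-heap. */
  static List<Long> merge(long[][] sortedInputs) {
    // Each heap entry is {value, inputIndex, positionInInput}, ordered by value.
    PriorityQueue<long[]> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
    for (int i = 0; i < sortedInputs.length; i++) {
      if (sortedInputs[i].length > 0) {
        heap.add(new long[] {sortedInputs[i][0], i, 0});
      }
    }
    List<Long> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      long[] top = heap.poll(); // smallest remaining value, O(log n)
      out.add(top[0]);
      int input = (int) top[1];
      int next = (int) top[2] + 1;
      if (next < sortedInputs[input].length) {
        // Re-insert the next value from the same input; no full re-sort needed.
        heap.add(new long[] {sortedInputs[input][next], input, next});
      }
    }
    return out;
  }
}
```

Each step costs a logarithmic heap update rather than a full sort of the array, which is the kind of saving the change relies on.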
It seems that VectorFormat merge creates a lot of these bitsets. We don't
need to do any fancy reflection here via shallowSizeOf(Object), when we
can call sizeOf(long[]) which is fast.
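To show why the array-specific call can be cheap, here is a minimal sketch of sizing a long[] with plain arithmetic and no reflection. The header size and alignment below are assumed values for illustration only; the real estimator derives them from the running JVM, and the class name is hypothetical.

```java
// Sketch only: the constants are assumptions, not values read from the JVM.
final class ArraySizeSketch {
  static final int ASSUMED_ARRAY_HEADER_BYTES = 16;
  static final int ASSUMED_OBJECT_ALIGNMENT = 8;

  /** Rounds a size up to the next multiple of the assumed alignment. */
  static long alignObjectSize(long size) {
    long a = ASSUMED_OBJECT_ALIGNMENT;
    return (size + a - 1) / a * a;
  }

  /** Estimates the heap size of a long[]: header plus 8 bytes per element. */
  static long sizeOf(long[] arr) {
    return alignObjectSize(ASSUMED_ARRAY_HEADER_BYTES + (long) Long.BYTES * arr.length);
  }
}
```

The estimate is a constant-time computation on the array length, whereas a reflective shallow size has to inspect the object's class first.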
We may want to revisit this RamUsageEstimator API in the future to
prevent traps like this.
Lucene90HnswVectorsFormat has a default 'beam width' of 16. This is quite low
and produces poor recall on typical-sized datasets.
This commit bumps it to 100. This new default tries to balance good search
performance with indexing speed. Most runs in ann-benchmarks set the parameter
between ~400 and 800, but they are heavily optimizing search over index speed.
We should not set single sort when the search_after is non-null;
otherwise, we will incorrectly skip documents whose values are equal to
the value from the search_after and whose docIDs are greater than the
docID from the search_after.
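The tie-breaking rule can be sketched as follows, assuming an ascending sort on a single long value. This is a simplified model with hypothetical names, not the collector's actual code: it contrasts a value-only check, which drops ties, with a check that falls back to the docID.

```java
// Sketch only: hypothetical helper modelling search_after on an ascending sort.
final class SearchAfterSketch {

  /** Value-only check: wrongly rejects documents that tie on the sort value. */
  static boolean valueOnlyCheck(long value, long afterValue) {
    return value > afterValue;
  }

  /** Full check: on a tie in the sort value, the docID decides. */
  static boolean sortsAfter(long value, int doc, long afterValue, int afterDoc) {
    if (value != afterValue) {
      return value > afterValue;
    }
    return doc > afterDoc;
  }
}
```

For example, with search_after at (value 5, doc 7), a document at (value 5, doc 12) belongs on the next page, but the value-only check rejects it.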
This commit adds the QueryParserBase::getFuzzyDistance protected method, which
can be overridden by subclasses to provide customisation of how the similarity distance
is determined. The default implementation retains the current behaviour.
The test fails randomly because HNSW can sometimes miss results when k is close
to the total number of docs. While we wait for a fix, this commit decreases k to
prevent failures.
The first documents of subsequent segments are mistakenly skipped when
sort optimization is enabled. We should initialize maxDocVisited in
NumericComparator to -1 instead of 0.
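A simplified model of the off-by-one, assuming the comparator treats any segment-local docID less than or equal to maxDocVisited as already seen. The class and method names are hypothetical and the real comparator does considerably more.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: models a "docID > maxDocVisited" filter at the start of a segment.
final class MaxDocVisitedSketch {

  /** Returns the segment-local docIDs that pass the "not yet visited" filter. */
  static List<Integer> notYetVisited(int[] segmentDocs, int initialMaxDocVisited) {
    List<Integer> kept = new ArrayList<>();
    for (int doc : segmentDocs) {
      if (doc > initialMaxDocVisited) {
        kept.add(doc);
      }
    }
    return kept;
  }
}
```

Starting from 0, docID 0 of the new segment is treated as visited and dropped; starting from -1, every document is still eligible.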
SloppyMath had a deprecated haversin() function that returned its values in
km, which has been replaced by a haversinMeters() function that is explicit
about its units. As part of removing this function, we changed the expressions
module haversin function to point instead to haversinMeters. However, this
may silently change the behaviour of expressions on upgrade.
This commit instead adds a haversinKilometers method to the expressions
module and maps the haversin function to it. It also adds a new
haversinMeters expression function to be more explicit for future users.
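The unit difference is easy to see in a small standalone sketch of the haversine formula. This is not the SloppyMath implementation: it uses plain java.lang.Math, the Earth radius constant is an assumption for illustration, and the class name is hypothetical.

```java
// Sketch only: plain-Math haversine, not Lucene's SloppyMath.
final class HaversinSketch {
  // Assumed mean Earth radius in meters, for illustration.
  static final double ASSUMED_EARTH_RADIUS_METERS = 6_371_008.7714;

  /** Great-circle distance in meters between two points given in degrees. */
  static double haversinMeters(double lat1, double lon1, double lat2, double lon2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double sinHalfDLat = Math.sin((phi2 - phi1) / 2);
    double sinHalfDLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);
    double a = sinHalfDLat * sinHalfDLat
        + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLon * sinHalfDLon;
    // Clamp to 1 to guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * ASSUMED_EARTH_RADIUS_METERS * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** Same distance in kilometers, matching the old haversin() units. */
  static double haversinKilometers(double lat1, double lon1, double lat2, double lon2) {
    return haversinMeters(lat1, lon1, lat2, lon2) / 1000.0;
  }
}
```

An expression that compares the result against a threshold such as 10 means 10 km under the old function but 10 m under haversinMeters, which is why silently remapping the name would change behaviour.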
This commit moves the responsibility to disable
the numeric sort optimization on comparators to the SortField.
This way we don't need to apply the logic in every top field collector.
LUCENE-10098: add note/link to GermanAnalyzer for decompounding nouns.
We can't do this out of the box with the analyzer, due to incompatible
licenses. But we can make it easy for the user to do this, by linking to
a repo that has sample code, documentation, and the required data files.
CachingWrapperWeight always returns -1 from its count() method, which
disables the fast path for TermQuery, MatchAllDocsQuery, etc., when running
IndexSearcher.count(Query). This commit makes it delegate the method
to its wrapped Weight.
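The delegation pattern can be sketched with a stand-in interface. CountingWeight and both wrapper classes are hypothetical and much simpler than Lucene's Weight, whose count method also takes a leaf context; the convention assumed here is that -1 means "no cheap count available".

```java
// Sketch only: CountingWeight is a hypothetical stand-in for Lucene's Weight.
interface CountingWeight {
  /** Returns the exact match count, or -1 if it cannot be computed cheaply. */
  int count();
}

/** Wrapper that does not delegate: callers always fall back to the slow path. */
final class NonDelegatingWrapperSketch implements CountingWeight {
  @Override
  public int count() {
    return -1;
  }
}

/** Wrapper that forwards count() so the wrapped weight's fast path survives. */
final class DelegatingWrapperSketch implements CountingWeight {
  private final CountingWeight in;

  DelegatingWrapperSketch(CountingWeight in) {
    this.in = in;
  }

  @Override
  public int count() {
    return in.count();
  }
}
```

With delegation, a wrapped weight that knows its count cheaply still reports it, and one that does not still returns -1.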
If we set numSeed = 10, this test fails sometimes because it may mark
expected result docs (from 0 to 9) as deleted, so they are not
retrieved, resulting in low recall.
- set numSeed to 10 to ensure 10 results are returned
- add startIndex parameter to createRandomAcceptOrds that allows
documents before startIndex to be NOT deleted
- use startIndex equal to 10 for createRandomAcceptOrds
Relates to #239
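A minimal sketch of the startIndex idea from the list above, using java.util.BitSet in place of Lucene's bit set classes. The signature and body are assumptions for illustration, not the test's actual helper: a set bit means the document is accepted (not deleted).

```java
import java.util.BitSet;
import java.util.Random;

// Sketch only: hypothetical version of the test helper, using java.util.BitSet.
final class AcceptOrdsSketch {

  /** Randomly accepts ords, but always accepts every ord below startIndex. */
  static BitSet createRandomAcceptOrds(int startIndex, int numDocs, Random random) {
    BitSet accept = new BitSet(numDocs);
    for (int ord = 0; ord < numDocs; ord++) {
      // Ords before startIndex are never deleted; the rest are kept at random.
      if (ord < startIndex || random.nextBoolean()) {
        accept.set(ord);
      }
    }
    return accept;
  }
}
```

With startIndex equal to 10, docs 0 to 9 are always live, so the expected results can always be retrieved.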