This PR extends VectorReader#search to take a parameter specifying the live
docs. LeafReader#searchNearestVectors then always returns the k nearest
undeleted docs.
To implement this, the HNSW algorithm will only add a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.
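A minimal sketch of the approach, with the graph, scoring, and live-docs lookups abstracted behind functional interfaces (none of these names are Lucene's actual HNSW internals):

```java
import java.util.*;
import java.util.function.*;

// Sketch only: deleted docs are still visited and expanded as candidates,
// but only live docs are collected into the top-k result set.
final class FilteredGraphSearch {
  record Scored(int node, double score) {}

  static List<Scored> search(int entryPoint, int k,
                             IntFunction<int[]> neighbors,   // node -> neighbor ids
                             IntToDoubleFunction similarity, // node -> score vs. query
                             IntPredicate isLive) {          // node -> not deleted
    PriorityQueue<Scored> candidates =  // best candidate first
        new PriorityQueue<>(Comparator.comparingDouble(Scored::score).reversed());
    PriorityQueue<Scored> results =     // worst result first, capped at k
        new PriorityQueue<>(Comparator.comparingDouble(Scored::score));
    Set<Integer> visited = new HashSet<>();

    visited.add(entryPoint);
    candidates.add(new Scored(entryPoint, similarity.applyAsDouble(entryPoint)));

    while (!candidates.isEmpty()) {
      Scored c = candidates.poll();
      if (results.size() >= k && c.score() < results.peek().score()) {
        break; // no remaining candidate can improve the top-k
      }
      if (isLive.test(c.node())) { // deleted docs are traversed, never returned
        results.add(c);
        if (results.size() > k) {
          results.poll();
        }
      }
      for (int n : neighbors.apply(c.node())) {
        if (visited.add(n)) {
          candidates.add(new Scored(n, similarity.applyAsDouble(n)));
        }
      }
    }
    return new ArrayList<>(results);
  }
}
```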
In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its estimated cost is more than 250x the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively.
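Conceptually, the heuristic looks like the following sketch; the constant is the knob being changed, while the method and parameter names are illustrative, not the actual LRUQueryCache code:

```java
// Illustrative sketch of the skip-caching heuristic.
static final int SKIP_CACHE_FACTOR = 10; // previously 250

static boolean worthCaching(long clauseCost, long leadIteratorCost) {
  // Building a cache entry evaluates the clause against all of its matches,
  // so skip clauses that are far more expensive than the query's lead
  // iterator; caching them would dominate the query's latency.
  return clauseCost <= (long) SKIP_CACHE_FACTOR * leadIteratorCost;
}
```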
6 main improvements:
1) Iterate through all output.InputNodes since dest gaps can exist.
2) freeBefore the minimum input node instead of the first input node (which was usually, but not always, the minimum); see the sketch after this list.
3) Don't freeBefore from a hole source node. Bookkeeping may not be correct and could result in an early free.
4) When adding an output node after hole recovery, calculate its new position increment instead of adding it to the end of the output graph.
5) Nodes after holes that have edges to their source will do the output re-mapping that the deleted node would have done.
6) If a disconnected input node swaps order with another node in the output, then map them to the same output node.
Co-authored-by: Lawson <geoffrl@amazon.com>
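A rough sketch of improvements 1) and 2); `OutputNode` and `freeBefore` here are hypothetical stand-ins for FlattenGraphFilter's internal bookkeeping, not its real API:

```java
import java.util.List;

// Hypothetical sketch of points 1) and 2).
record OutputNode(List<Integer> inputNodes) {}

static int minReferencedInputNode(List<OutputNode> outputNodes) {
  int min = Integer.MAX_VALUE;
  for (OutputNode out : outputNodes) {   // 1) visit every output node, since
    for (int in : out.inputNodes()) {    //    dest gaps can exist
      min = Math.min(min, in);
    }
  }
  return min; // 2) the caller does freeBefore(min), not freeBefore(firstInputNode)
}
```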
Provide leaf sorter for directory readers opened from IndexCommit
LUCENE-9507 made it possible to provide a leaf sorter for directory readers.
One API that was missed is providing a leaf sorter for directory readers
opened from an index commit.
This patch addresses this by adding an extra parameter, a custom
comparator for sorting leaf readers, to the DirectoryReader open API
that takes an indexCommit and minSupportedMajorVersion.
Relates to PR #32
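Assuming the new overload mirrors the existing `DirectoryReader.open(IndexCommit, int)` plus the comparator, usage might look like this (the comparator choice is an arbitrary example):

```java
import java.io.IOException;
import java.util.Comparator;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.LeafReader;

static DirectoryReader openSorted(IndexCommit commit, int minSupportedMajorVersion)
    throws IOException {
  // Example order: largest segments first; any Comparator<LeafReader> works.
  Comparator<LeafReader> byMaxDoc =
      Comparator.comparingInt(LeafReader::maxDoc).reversed();
  return DirectoryReader.open(commit, minSupportedMajorVersion, byMaxDoc);
}
```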
Check the Lucene version of the last index commit: if the version was < 9,
use a StringField; otherwise, if the index is fresh or was built using a
version >= 9, use a BDV field.
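A sketch of that decision (the field names come from this commit; the `-1` sentinel for a fresh index is an assumption):

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

static IndexableField fullPathField(String fullPath, int createdVersionMajor) {
  boolean fresh = createdVersionMajor == -1; // assumed sentinel: no commit yet
  if (fresh || createdVersionMajor >= 9) {
    return new BinaryDocValuesField("$full_path_binary$", new BytesRef(fullPath));
  }
  return new StringField("$full_path$", fullPath, Field.Store.YES); // pre-9 format
}
```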
BDV field with a different name
Using a different name, "$full_path_binary$", for the BDV field
ensures that it does not share a name with the earlier "$full_path$"
StringField, and hence they don't violate the field type consistency check
(LUCENE-9334).
This commit also enables the back-compat check that was disabled
earlier.
When sorting by low-cardinality fields, the same sub remains current for long
sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting
the sub that leads iteration.
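The shape of the optimization, as a sketch with made-up types rather than the actual merger code: the leading sub stays out of the priority queue, so long runs of doc IDs from one sub skip the queue entirely.

```java
import java.util.*;

// Sketch: each Sub emits ascending doc IDs; peek() returns the next doc
// (Integer.MAX_VALUE when exhausted) and pop() emits it and advances.
interface Sub {
  int peek();
  int pop();
}

final class SortedMerger {
  private final PriorityQueue<Sub> queue =
      new PriorityQueue<>(Comparator.comparingInt(Sub::peek));
  private Sub current;

  SortedMerger(List<Sub> subs) {
    queue.addAll(subs);
    current = queue.poll();
  }

  int nextDoc() {
    Sub top = queue.peek();
    if (top == null || current.peek() < top.peek()) {
      return current.pop(); // fast path: the same sub keeps leading
    }
    queue.add(current);     // slow path: re-enter the queue and pull
    current = queue.poll(); // whichever sub leads now
    return current.pop();
  }
}
```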
When there's only one field, CombinedFieldQuery will ignore its weight while
scoring. This makes the scoring inconsistent, since the field weight is supposed
to multiply its term frequency.
This PR removes the optimizations around single-field scoring to make sure the
weight is always taken into account. These optimizations are not critical since
it should be uncommon to use CombinedFieldQuery with only one field.
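For context, the combined term frequency that CombinedFieldQuery feeds into the similarity is a weighted sum over fields, so even with one field the weight must be applied (plain arithmetic illustrating the model, not the actual scorer code):

```java
// Combined frequency: a weighted sum of per-field term frequencies.
// With a single field this must still be weights[0] * freqs[0], not freqs[0].
static double combinedFreq(double[] weights, double[] freqs) {
  double freq = 0;
  for (int i = 0; i < weights.length; i++) {
    freq += weights[i] * freqs[i];
  }
  return freq;
}
```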
This change fixes a bug in `MultiNormsLeafSimScorer` that assumed each
field has a norm for every term/document.
As part of the fix, it adds validation that the fields have consistent norms
settings.
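A sketch of the kind of validation added, assuming a hypothetical helper over the fields' `FieldInfo`s (the real check may live elsewhere):

```java
import java.util.Collection;
import org.apache.lucene.index.FieldInfo;

// Hypothetical validation: every field in the combined query must agree on
// whether it indexes norms, otherwise a document can have a norm for one
// field but not another and per-document norm lookups become inconsistent.
static void validateConsistentNorms(Collection<FieldInfo> fields) {
  Boolean omitNorms = null;
  for (FieldInfo fi : fields) {
    if (omitNorms == null) {
      omitNorms = fi.omitsNorms();
    } else if (omitNorms != fi.omitsNorms()) {
      throw new IllegalArgumentException(
          "All fields must have the same norms setting, got a mismatch on: " + fi.name);
    }
  }
}
```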
The previous equals and hashCode methods only compared query terms. This meant
that queries on different fields, or with different field weights, were
considered the same.
During boolean query rewrites, duplicate clauses are removed. Because
equals/hashCode was incorrect, rewrites could accidentally drop
CombinedFieldQuery clauses.
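A sketch of the fix, assuming the query stores a field-to-weight map and a term array (the field names here are illustrative):

```java
// Equality must cover the fields and their weights, not just the terms,
// or distinct queries are deduplicated away during BooleanQuery rewrite.
@Override
public boolean equals(Object o) {
  if (this == o) return true;
  if (o == null || getClass() != o.getClass()) return false;
  CombinedFieldQuery other = (CombinedFieldQuery) o;
  return fieldAndWeights.equals(other.fieldAndWeights) // fields + weights
      && Arrays.equals(terms, other.terms);            // query terms
}

@Override
public int hashCode() {
  return 31 * fieldAndWeights.hashCode() + Arrays.hashCode(terms);
}
```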
The previous svn-based link no longer works. Instead, point at the
license file on GitHub: it is for icu4c, but per the repo, the user is
explicitly directed to this license file for both icu4c and icu4j.
It's a good case for having a correct link, as the ICU license is complicated. It
even has "if (version > X)" conditionals in the legalese!!!
Re-enable the randomized testing here, but with a separate test for each
mode rather than all in one method. This gives better coverage and makes
failures easier to debug.
Normalization-inert characters need not be required as boundaries
for incremental processing. It is sufficient to check `hasBoundaryAfter`
and `hasBoundaryBefore`, substantially improving worst-case performance.
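In ICU4J terms, `Normalizer2.hasBoundaryBefore`/`hasBoundaryAfter` are real methods; the loop below is one illustrative way to find a safe split point without demanding the stronger `isInert` property:

```java
import com.ibm.icu.text.Normalizer2;

// Find the last safe point to cut a buffer for incremental normalization.
// A normalization boundary is enough; requiring a normalization-inert
// character can force much longer scans in the worst case.
static int lastSafeBoundary(Normalizer2 norm, String buffer) {
  for (int i = buffer.length(); i > 0; ) {
    int c = buffer.codePointBefore(i);
    i -= Character.charCount(c);
    if (norm.hasBoundaryBefore(c)) {
      return i; // normalization cannot cross this point
    }
  }
  return 0; // no boundary yet; wait for more input
}
```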