Provide leaf sorter for directory readers opened from IndexCommit
LUCENE-9507 allowed to provide a leaf sorter for directory readers.
One API that was missed is to allow to provide a leaf sorter
for directory readers opened from an index commit.
This patch address this by adding an extra parameter: a custom
comparator for sorting leaf readers to the Directory reader open API
from indexCommit and minSupportedMajorVersion.
Relates to PR #32
the last index commit
If the Lucene version was < 9 then use a StringField or else
if the index is fresh or if the index is was built using a
version >= 9, then use a BDV field.
BDV field with a different name
Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).
This commit also enables the back-compat check that was disabled
earlier.
When sorting by low-cardinality fields, the same sub remains current for long
sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting
the sub that leads iteration.
When there's only one field, CombinedFieldQuery will ignore its weight while
scoring. This makes the scoring inconsistent, since the field weight is supposed
to multiply its term frequency.
This PR removes the optimizations around single-field scoring to make sure the
weight is always taken into account. These optimizations are not critical since
it should be uncommon to use CombinedFieldQuery with only one field.
This change fixes a bug in `MultiNormsLeafSimScorer` that assumes that each
field should have a norm for every term/document.
As part of the fix, it adds validation that the fields have consistent norms
settings.
The previous equals and hashCode methods only compared query terms. This meant
that queries on different fields, or with different field weights, were
considered the same
During boolean query rewrites, duplicate clauses are removed. So because equals/
hashCode was incorrect, rewrites could accidentally drop CombinedFieldQuery
clauses.
The previous svn-based link no longer works. Instead point at the
license file in github: it is for icu4c, but see the repo: user is
explicitly directed at this license file for both icu4c and icu4j.
Good case to have a correct link, as the ICU license is complicated. It
even has "if (version > X)" conditionals in the legalese!!!
Re-enable the randomized testing here, but with a separate test for each
mode rather than all in one method. It gives better testing and also easier-to-debug
testing.
Normalization-inert characters need not be required as boundaries
for incremental processing. It is sufficient to check `hasBoundaryAfter`
and `hasBoundaryBefore`, substantially improving worst-case performance.
It's possible to create a DisjunctionMaxQuery with no clauses. This is now
rewritten to MatchNoDocsQuery, matching the approach we take for BooleanQuery.
DocComparator should not skip docs with the same docID on multiple
sorts with search after.
Because of the optimization introduced in LUCENE-9449, currently when
searching with sort on [_doc, other fields] with search after,
DocComparator can efficiently skip all docs before and including
the provided [search after docID]. This is a desirable behaviour
in a single index search. But in a distributed search, where multiple
indices have docs with the same docID, and when searching on
[_doc, other fields], the sort optimization should NOT skip
documents with the same docIDs.
This PR fixes this.
Relates to LUCENE-9449
* Added a explicit Flush Task to flush data at Thread level once it completes the processing
* Included explicit flush per Thread level
* Done changes for parallel processing
* Removed extra brace
* Removed unused variable
* Removed unused variable initialization
* Did the required formating
* Refactored the code and added required comments & checks
This helps simplify the code, and also adds some optimizations to ordinals like
better compression for long runs of equal values or fields that are used in
index sorts.