This commit adds a new interface to all MergeScheduler classes that allows the scheduler to provide an Executor for intra-merge parallelism. The first sub-class to satisfy this new interface is the ConcurrentMergeScheduler (CMS). In particular, the CMS will limit the amount of parallelism based on the current number of mergeThreads and the configured maxThread options.
Parallelism is now added by default to the SegmentMerger where each merging task is executed in parallel with the others.
Additionally, the Lucene99 HNSW codec will use the provided MergeScheduler executor if no executor is provided directly.
Relates to: https://github.com/apache/lucene/issues/12740
Relates to: https://github.com/apache/lucene/issues/9626
The current logic for reordering splits a slice of doc IDs into a left side and
a right side, and for each document it computes the expected gain of moving to
the other side. Then it swaps documents from both sides as long as the sum of
the gain of moving the left doc to the right and the right doc to the left is
positive.
This works well, but I would like to extend BP reordering to also work with
blocks, and the swapping logic is challenging to modify as two parent documents
may have different numbers of children.
One of the follow-up papers on BP suggested to use a different logic, where one
would compute a bias for all documents that is negative when a document is
attracted to the left and positive otherwise. Then we only have to partition doc
IDs around the mid point, e.g. with quickselect.
A benefit of this change is that it will make it easier to generalize BP
reordering to indexes that have blocks, e.g. by using a stable sort on biases.
### Description
The boosted doc scores were the same due to floating point precision limitations, when multiplying the original scores with tiny differences with a fairly large multiplier.
My change removes the assertion expecting the exact order to be the same (AFAIU this won't be possible for every scores/boost configurations due to precision limitations), but rather asserts that the `original doc score * boost factor` and the `boosted doc score` are equals within a certain delta.
Closes https://github.com/apache/lucene/issues/13173.
This adds multi-leaf optimizations for diversified children collector. This means as children vectors are collected within a block join, we can share information between leaves to speed up vector search.
To make this happen, I refactored the multi-leaf collector slightly. Now, instead of inheriting from TopKnnCollector, we inject a inner collector.
* Hunspell: don't proceed with other suggestions if we found good REP ones
This makes "rep" and "ph" from TestHunspellRepositoryTestCases pass on latest revisions of the original Hunspell (after b88f9ea57b)
The current logic for reordering splits a slice of doc IDs into a left side and
a right side, and for each document it computes the expected gain of moving to
the other side. Then it swaps documents from both sides as long as the sum of
the gain of moving the left doc to the right and the right doc to the left is
positive.
This works well, but I would like to extend BP reordering to also work with
blocks, and the swapping logic is challenging to modify as two parent documents
may have different numbers of children.
One of the follow-up papers on BP suggested to use a different logic, where one
would compute a bias for all documents that is negative when a document is
attracted to the left and positive otherwise. Then we only have to partition doc
IDs around the mid point, e.g. with quickselect.
A benefit of this change is that it will make it easier to generalize BP
reordering to indexes that have blocks, e.g. by using a stable sort on biases.
This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilartyFunction enum.
The VectorSimilartyFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilartyFunction. The format should be in charge of the translation of similarity function to format ordinal (and visa versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.