Increase the default number of merge threads. (#13294)

You need enough merge threads for merges to keep up with indexing, but the
right number depends on the data that you are indexing. If you are only
indexing stored fields, merges can copy compressed data directly and account
for only a small fraction of the total indexing+flushing+merging cost. But if
you primarily index knn vectors, merging N docs may require about as much work
as flushing N docs. Since documents typically go through multiple rounds of
merging, the merging cost can end up being more than half of the total
indexing+flushing+merging cost.
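
As a rough illustration of the reasoning above (a simplified model that is not
part of the original commit message): normalize the indexing+flushing cost of a
document to 1, let r be the assumed cost of merging that document once relative
to flushing it, and let k be the assumed number of merge rounds it goes through.
The share of total work spent merging is then:

```latex
\[
  \text{merge share} = \frac{k\,r}{1 + k\,r}
\]
% Stored fields only (r close to 0):        share is close to 0.
% knn vectors (r \approx 1) with k = 2:     share = 2/3, i.e. more than half.
% k r = 1:                                  share = 1/2.
```

Both r and k are hypothetical parameters introduced only for this sketch; real
values depend on the index contents and the merge policy.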

This change updates the default number of merge threads assuming an
intermediate scenario where merges perform about half of the total
indexing+flushing+merging work, i.e. it gives half the threads of the system to
merges.
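
To make the effect of the new default concrete, here is a minimal sketch that
compares the old and new formulas for a few core counts. The class
MergeThreadDefaults and its method names are hypothetical helpers written for
this illustration; they are not part of Lucene and only mirror the two
expressions that appear in the diff.

```java
// Hypothetical helper, not part of Lucene: it only mirrors the old and new
// default formulas from this change so they can be compared side by side.
final class MergeThreadDefaults {
  private MergeThreadDefaults() {}

  /** Previous default: half the cores, capped at 4 threads, never below 1. */
  static int oldMaxThreadCount(int coreCount) {
    return Math.max(1, Math.min(4, coreCount / 2));
  }

  /** New default: half the cores, never below 1, with no upper cap. */
  static int newMaxThreadCount(int coreCount) {
    return Math.max(1, coreCount / 2);
  }

  /** In both versions, maxMergeCount is derived from the thread count. */
  static int maxMergeCount(int maxThreadCount) {
    return maxThreadCount + 5;
  }

  public static void main(String[] args) {
    int[] coreCounts = {1, 2, 8, 64};
    for (int cores : coreCounts) {
      int oldThreads = oldMaxThreadCount(cores);
      int newThreads = newMaxThreadCount(cores);
      System.out.printf(
          "cores=%d oldThreads=%d newThreads=%d newMaxMergeCount=%d%n",
          cores, oldThreads, newThreads, maxMergeCount(newThreads));
    }
  }
}
```

The two formulas agree up to 8 cores and only diverge on larger machines, where
the old cap of 4 threads applied.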

One goal of this change is to no longer have to configure a custom number of
merge threads on nightly benchmarks, which run on a highly concurrent machine.
Adrien Grand 2024-04-11 21:21:28 +02:00 committed by GitHub
parent fbea47b4f4
commit e19238a7bd
2 changed files with 11 additions and 1 deletions


@@ -180,6 +180,9 @@ Changes in Runtime Behavior
* GITHUB#13293: Auto I/O throttling is now disabled by default on ConcurrentMergeScheduler.
(Adrien Grand)
+* GITHUB#13293: ConcurrentMergeScheduler now allows up to 50% of the threads of the host to be used
+for merging. (Adrien Grand)
Changes in Backwards Compatibility Policy
-----------------------------------------


@@ -181,7 +181,14 @@ public class ConcurrentMergeScheduler extends MergeScheduler {
Throwable ignored) {
}
-maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
+// If you are indexing at full throttle, how many merge threads do you need to keep up? It
+// depends: for most data structures, merging is cheaper than indexing/flushing, but for knn
+// vectors, merges can require about as much work as the initial indexing/flushing. Plus
+// documents are indexed/flushed only once, but may be merged multiple times.
+// Here, we assume an intermediate scenario where merging requires about as much work as
+// indexing/flushing overall, so we give half the core count to merges.
+maxThreadCount = Math.max(1, coreCount / 2);
maxMergeCount = maxThreadCount + 5;
}
}