Improve code clarity for OrdinalMap (#11729)

Closes #11728
Marc D'Mello 2024-01-08 14:00:53 +01:00 committed by Adrien Grand
parent 3c235bb7b4
commit 376bd24693
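
The delta encoding and priority-queue merge that the Javadoc in this commit describes can be illustrated with a small standalone sketch. This is not Lucene's implementation: `Cursor`, `buildDeltas` and `globalOrd` are hypothetical names, plain sorted lists stand in for `TermsEnum`, and the segment reordering and packed structures mentioned in the Javadoc are left out. It assumes each segment's terms are already sorted and unique.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; names and structure are assumptions, not Lucene's OrdinalMap. */
final class OrdinalDeltaSketch {

  /** Hypothetical stand-in for a TermsEnum: a cursor over one segment's sorted terms. */
  static final class Cursor {
    final int segment;
    final List<String> terms;
    int ord = 0; // local segment ord of the current term

    Cursor(int segment, List<String> terms) {
      this.segment = segment;
      this.terms = terms;
    }

    String term() {
      return terms.get(ord);
    }
  }

  /** Returns deltas[segment][segmentOrd] = globalOrd - segmentOrd. */
  static long[][] buildDeltas(List<List<String>> segments) {
    long[][] deltas = new long[segments.size()][];
    // The queue is ordered by each cursor's current term.
    PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
    for (int i = 0; i < segments.size(); i++) {
      deltas[i] = new long[segments.get(i).size()];
      if (!segments.get(i).isEmpty()) {
        pq.add(new Cursor(i, segments.get(i)));
      }
    }
    long globalOrd = 0;
    while (!pq.isEmpty()) {
      String current = pq.peek().term();
      // Every segment positioned on the current term shares one global ord.
      while (!pq.isEmpty() && pq.peek().term().equals(current)) {
        Cursor top = pq.poll();
        deltas[top.segment][top.ord] = globalOrd - top.ord;
        top.ord++;
        if (top.ord < top.terms.size()) {
          pq.add(top); // re-insert so the queue reorders on the cursor's new term
        }
      }
      globalOrd++; // no segment holds the current term any more; move to the next one
    }
    return deltas;
  }

  /** Recovers the global ord from a (segment, segmentOrd) pair. */
  static long globalOrd(long[][] deltas, int segment, int segmentOrd) {
    return segmentOrd + deltas[segment][segmentOrd];
  }

  public static void main(String[] args) {
    // Segment 0: bar, foo. Segment 1: cat, dog.
    long[][] deltas = buildDeltas(List.of(List.of("bar", "foo"), List.of("cat", "dog")));
    System.out.println(Arrays.deepToString(deltas)); // prints [[0, 2], [1, 1]]
    System.out.println(globalOrd(deltas, 0, 1)); // prints 3 (foo)
  }
}
```

Storing deltas rather than global ords keeps the values small when segment ords track global ords closely, which is what makes a packed representation worthwhile.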

@@ -196,6 +196,50 @@ public class OrdinalMap implements Accountable {
// ram usage
final long ramBytesUsed;
/**
* Here is how the OrdinalMap encodes the mapping from global ords to local segment ords. Assume
* we have the following global mapping for a doc values field: <br>
* bar -&gt; 0, cat -&gt; 1, dog -&gt; 2, foo -&gt; 3 <br>
* And our index is split into 2 segments with the following local mappings for that same doc
* values field: <br>
* Segment 0: bar -&gt; 0, foo -&gt; 1 <br>
* Segment 1: cat -&gt; 0, dog -&gt; 1 <br>
* We then encode the delta between the global and local ords in a packed 2d array keyed by
* (segmentIndex, segmentOrd). So the following 2d array will be created by OrdinalMap: <br>
* [[0, 2], [1, 1]]
*
* <p>The general algorithm for creating an OrdinalMap (skipping over some implementation details
* and optimizations) is as follows:
*
* <p>[1] Create and populate a PQ with ({@link TermsEnum}, index) tuples where index is the
* position of the TermsEnum in an array of TermsEnums sorted by descending size. The PQ itself
* will be ordered by {@link TermsEnum#term()}.
*
* <p>[2] We then iterate through every term in the index. To do so, we start with the first term
* at the top of the PQ. We keep track of a global ord, and track the difference between the
* global ord and {@link TermsEnum#ord()} in ordDeltas, which maps: <br>
* (segmentIndex, {@link TermsEnum#ord()}) -&gt; globalTermOrdinal - {@link TermsEnum#ord()} <br>
* We then call {@link TermsEnum#next()} and update the PQ to advance the iteration (remember that
* the PQ maintains an order based on {@link TermsEnum#term()}, which changes on each next() call).
* If the current term exists in some other segment, the top of the queue will contain that
* segment. If not, the top of the queue will contain a segment with the next term in the index,
* and the global ord is incremented.
*
* <p>[3] We use some information gathered in the previous step to optimize memory usage and
* building time in the following steps. For more detail on those optimizations, see the code.
*
* <p>[4] We will then populate segmentToGlobalOrds, which maps (segmentIndex, segmentOrd) -&gt;
* globalOrd. Using the information we tracked in ordDeltas, we can construct this information
* relatively easily.
*
* @param owner For caching purposes
* @param subs A TermsEnum[], where each index corresponds to a segment
* @param segmentMap Provides two maps: newToOld, which lists segments in descending 'weight'
* order (see {@link SegmentMap} for more details), and oldToNew, which maps each original
* segment index to its position in newToOld
* @param acceptableOverheadRatio Acceptable overhead memory usage for some packed data structures
* @throws IOException if reading terms from one of the TermsEnums fails
*/
OrdinalMap(
IndexReader.CacheKey owner,
TermsEnum[] subs,