diff --git a/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java b/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java index bcee2f43214..5785c5dc938 100644 --- a/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java +++ b/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java @@ -196,6 +196,50 @@ public class OrdinalMap implements Accountable { // ram usage final long ramBytesUsed; + /** + * Here is how the OrdinalMap encodes the mapping from global ords to local segment ords. Assume + * we have the following global mapping for a doc values field:
+ * bar -> 0, cat -> 1, dog -> 2, foo -> 3
+ * And our index is split into 2 segments with the following local mappings for that same doc + * values field:
+ * Segment 0: bar -> 0, foo -> 1
+ * Segment 1: cat -> 0, dog -> 1
+ * We then encode the delta between the global and local ords (globalOrd - segmentOrd) in a packed 2d array keyed by + * (segmentIndex, segmentOrd). For the example above, OrdinalMap creates the following 2d array:
+ * [[0, 2], [1, 1]] + * + *

The general algorithm for creating an OrdinalMap (skipping over some implementation details + * and optimizations) is as follows: + * + *

[1] Create and populate a PQ with ({@link TermsEnum}, index) tuples where index is the + * position of the TermsEnum in an array of TermsEnums sorted by descending size. The PQ itself + * is ordered by {@link TermsEnum#term()}. + * + *

[2] We then iterate through every term in the index, starting + * with the first term at the top of the PQ. We keep track of a global ord, and record the + * difference between the global ord and {@link TermsEnum#ord()} in ordDeltas, which maps:
+ * (segmentIndex, {@link TermsEnum#ord()}) -> globalTermOrdinal - {@link TermsEnum#ord()}
+ * We then call {@link TermsEnum#next()} and update the PQ to advance (remember that the PQ maintains + * an order based on {@link TermsEnum#term()}, which changes on each next() call). If the current + * term exists in some other segment, the top of the queue will contain that segment. If not, the + * top of the queue will contain a segment with the next term in the index and the global ord will + * be incremented. + * + *

[3] We use some information gathered in the previous step to optimize memory + * usage and build time in the following steps. For more detail on those optimizations, see the code. + * + *

[4] We then populate segmentToGlobalOrds, which maps (segmentIndex, segmentOrd) -> + * globalOrd. Since ordDeltas stores globalOrd - segmentOrd, each globalOrd can be recovered as + * segmentOrd plus its delta. + * + * @param owner For caching purposes + * @param subs A TermsEnum[], where each index corresponds to a segment + * @param segmentMap Provides two maps, newToOld which lists segments in descending 'weight' order + * (see {@link SegmentMap} for more details) and an oldToNew map which maps each original + * segment index to its position in newToOld + * @param acceptableOverheadRatio Acceptable overhead memory usage for some packed data structures + * @throws IOException if an I/O error occurs while iterating the terms + */ OrdinalMap( IndexReader.CacheKey owner, TermsEnum[] subs,
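For readers of this diff, the delta encoding and the PQ-driven merge described in the Javadoc can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual OrdinalMap code: `OrdDeltaSketch`, `Cursor`, `buildOrdDeltas`, and `toGlobalOrd` are hypothetical names, each segment is modeled as a sorted `String[]` of unique terms instead of a `TermsEnum`, terms are compared with `String.compareTo` rather than Lucene's byte order, and the deltas are kept in a plain `long[][]` rather than a packed structure. The size-based segment reordering and the optimizations from step [3] are omitted.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class OrdDeltaSketch {

  /** Hypothetical stand-in for a (TermsEnum, index) tuple: a cursor over one segment's sorted terms. */
  static final class Cursor {
    final int segment;
    final String[] terms;
    int ord = 0; // local segment ord of the current term

    Cursor(int segment, String[] terms) {
      this.segment = segment;
      this.terms = terms;
    }

    String term() {
      return terms[ord];
    }
  }

  /** Returns deltas[segment][segmentOrd] = globalOrd - segmentOrd. */
  static long[][] buildOrdDeltas(String[][] segments) {
    long[][] deltas = new long[segments.length][];
    // The queue is ordered by each cursor's current term, as in step [1].
    PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
    for (int i = 0; i < segments.length; i++) {
      deltas[i] = new long[segments[i].length];
      if (segments[i].length > 0) {
        pq.add(new Cursor(i, segments[i]));
      }
    }

    long globalOrd = 0;
    while (!pq.isEmpty()) {
      String current = pq.peek().term();
      // Drain every segment currently positioned on the same term (step [2]).
      while (!pq.isEmpty() && pq.peek().term().equals(current)) {
        Cursor top = pq.poll();
        deltas[top.segment][top.ord] = globalOrd - top.ord;
        top.ord++; // analogous to TermsEnum#next()
        if (top.ord < top.terms.length) {
          pq.add(top); // re-insert so the queue reorders on the new term
        }
      }
      // No remaining segment holds this term, so move to the next global ord.
      globalOrd++;
    }
    return deltas;
  }

  /** Recovers the global ord from a segment ord and its stored delta (step [4]). */
  static long toGlobalOrd(long[][] deltas, int segment, int segmentOrd) {
    return segmentOrd + deltas[segment][segmentOrd];
  }

  public static void main(String[] args) {
    String[][] segments = {{"bar", "foo"}, {"cat", "dog"}};
    long[][] deltas = buildOrdDeltas(segments);
    System.out.println(Arrays.deepToString(deltas)); // prints [[0, 2], [1, 1]]
    System.out.println(toGlobalOrd(deltas, 0, 1)); // prints 3 (foo)
  }
}
```

With the two segments from the Javadoc example, the sketch reproduces the `[[0, 2], [1, 1]]` array, and adding a delta back to its segment ord yields the global ord.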