Improve code clarity for OrdinalMap (#11729)

Closes #11728
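
For readers who want to check the example in the new Javadoc by hand, here is a minimal standalone sketch of the delta encoding it describes. It is not Lucene's implementation: the class OrdinalDeltaSketch, the Cursor helper, and the methods buildOrdDeltas and globalOrd are hypothetical names, plain sorted String arrays stand in for TermsEnum, and the packed storage, the segment reordering by weight, and the other optimizations mentioned in the Javadoc are left out.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Illustrative only: a simplified model of the ordDeltas construction, not Lucene code.
final class OrdinalDeltaSketch {

  /** Cursor over one segment's sorted, unique terms; a hypothetical stand-in for a TermsEnum. */
  private static final class Cursor {
    final int segment;
    final String[] terms;
    int ord = 0;

    Cursor(int segment, String[] terms) {
      this.segment = segment;
      this.terms = terms;
    }

    String term() {
      return terms[ord];
    }
  }

  /** Returns deltas keyed by (segmentIndex, segmentOrd), where delta = globalOrd - segmentOrd. */
  static long[][] buildOrdDeltas(String[][] segmentTerms) {
    long[][] ordDeltas = new long[segmentTerms.length][];
    // The queue is ordered by each cursor's current term, so its top is the smallest unseen term.
    PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> a.term().compareTo(b.term()));
    for (int i = 0; i < segmentTerms.length; i++) {
      ordDeltas[i] = new long[segmentTerms[i].length];
      if (segmentTerms[i].length > 0) {
        pq.add(new Cursor(i, segmentTerms[i]));
      }
    }

    long globalOrd = 0;
    while (!pq.isEmpty()) {
      String current = pq.peek().term();
      // Every segment that contains the current term receives the same global ord.
      while (!pq.isEmpty() && pq.peek().term().equals(current)) {
        Cursor top = pq.poll();
        ordDeltas[top.segment][top.ord] = globalOrd - top.ord;
        top.ord++; // the equivalent of calling next() on the segment's terms
        if (top.ord < top.terms.length) {
          pq.add(top); // re-insert so the queue is re-ordered by the new current term
        }
      }
      globalOrd++;
    }
    return ordDeltas;
  }

  /** Recovers the global ord from a segment ord by adding the stored delta. */
  static long globalOrd(long[][] ordDeltas, int segment, int segmentOrd) {
    return segmentOrd + ordDeltas[segment][segmentOrd];
  }

  public static void main(String[] args) {
    String[][] segments = {{"bar", "foo"}, {"cat", "dog"}};
    long[][] deltas = buildOrdDeltas(segments);
    System.out.println(Arrays.deepToString(deltas)); // prints [[0, 2], [1, 1]]
    System.out.println(globalOrd(deltas, 0, 1)); // "foo" in segment 0, prints 3
  }
}
```

With the two segments from the Javadoc example, the sketch yields the same [[0, 2], [1, 1]] array, and adding a delta back to its segment ord gives the global ord.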
2025-03-07 00:39:21 +00:00 · 2024-01-08 14:00:53 +01:00 · 2024-01-08 14:00:53 +01:00 · 376bd24693
commit 376bd24693
parent 3c235bb7b4
1 changed files with 44 additions and 0 deletions
--- a/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java
+++ b/lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java
@@ -196,6 +196,50 @@ public class OrdinalMap implements Accountable {
  // ram usage
  final long ramBytesUsed;

+  /**
+   * Here is how the OrdinalMap encodes the mapping from global ords to local segment ords. Assume
+   * we have the following global mapping for a doc values field: <br>
+   * bar -&gt; 0, cat -&gt; 1, dog -&gt; 2, foo -&gt; 3 <br>
+   * And our index is split into 2 segments with the following local mappings for that same doc
+   * values field: <br>
+   * Segment 0: bar -&gt; 0, foo -&gt; 1 <br>
+   * Segment 1: cat -&gt; 0, dog -&gt; 1 <br>
+   * We then encode the delta between the global and local ords in a packed 2d array keyed by
+   * (segmentIndex, segmentOrd). For this example, OrdinalMap creates the following 2d array: <br>
+   * [[0, 2], [1, 1]]
+   *
+   * <p>The general algorithm for creating an OrdinalMap (skipping over some implementation details
+   * and optimizations) is as follows:
+   *
+   * <p>[1] Create and populate a PQ with ({@link TermsEnum}, index) tuples where index is the
+   * position of the TermsEnum in an array of TermsEnums sorted by descending size. The PQ itself
+   * is ordered by {@link TermsEnum#term()}.
+   *
+   * <p>[2] We then iterate through every term in the index, starting with the term at the top of
+   * the PQ. We keep track of a global ord, and record the difference between the global ord and
+   * {@link TermsEnum#ord()} in ordDeltas, which maps: <br>
+   * (segmentIndex, {@link TermsEnum#ord()}) -&gt; globalTermOrdinal - {@link TermsEnum#ord()} <br>
+   * We then call {@link TermsEnum#next()} and update the PQ to advance (remember that the PQ
+   * maintains an order based on {@link TermsEnum#term()}, which changes on each next() call). If
+   * the current term exists in some other segment, the top of the queue will contain that segment.
+   * If not, the top of the queue will contain a segment with the next term in the index, and the
+   * global ord is incremented.
+   *
+   * <p>[3] We use some information gathered in the previous step to optimize memory usage and
+   * building time in the following steps. For more detail on those optimizations, see the code.
+   *
+   * <p>[4] We will then populate segmentToGlobalOrds, which maps (segmentIndex, segmentOrd) -&gt;
+   * globalOrd. Using the information we tracked in ordDeltas, we can construct this information
+   * relatively easily.
+   *
+   * @param owner For caching purposes
+   * @param subs A TermsEnum[], where each index corresponds to a segment
+   * @param segmentMap Provides two maps: newToOld, which lists segments in descending 'weight'
+   *     order (see {@link SegmentMap} for more details), and oldToNew, which maps each original
+   *     segment index to its position in newToOld
+   * @param acceptableOverheadRatio Acceptable overhead memory usage for some packed data structures
+   * @throws IOException if an I/O error occurs while reading terms from the given TermsEnums
+   */
  OrdinalMap(
      IndexReader.CacheKey owner,
      TermsEnum[] subs,