If a LogByteSizeMergePolicy is used, then it might decide not to merge the two
one-document segments if their on-disk sizes are too different. Using a
LogDocMergePolicy addresses the issue as both segments are always considered
the same size.
This updates TieredMergePolicy and Log(Doc|ByteSize)MergePolicy to only ever
consider merges where the resulting segment would be at least 50% bigger than
the biggest input segment. While a merge that only grows the biggest segment by
50% is still quite inefficient, this constraint is good enough to prevent
pathological O(N^2) merging.
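
The constraint above can be sketched roughly as follows. This is an
illustration only, not the code from the merge policies: the class and method
names are hypothetical, and the merged size is assumed to be the plain sum of
the input sizes (deletes are ignored).

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene's API. */
final class MergeGrowthCheck {
  // Assumption from the description above: the merged segment must be at
  // least 50% bigger than the biggest input segment.
  static final double MIN_GROWTH = 1.5;

  /** Returns true if merging the given segment sizes satisfies the growth constraint. */
  static boolean isWorthMerging(List<Long> segmentSizes) {
    if (segmentSizes.size() < 2) {
      return false; // nothing to merge
    }
    long total = 0;
    long biggest = 0;
    for (long size : segmentSizes) {
      total += size; // assumed merged size: sum of the inputs
      biggest = Math.max(biggest, size);
    }
    return total >= MIN_GROWTH * biggest;
  }

  public static void main(String[] args) {
    // A big segment plus a one-document segment: rejected (101 < 150).
    System.out.println(isWorthMerging(List.of(100L, 1L)));
    // The big segment grows by 60%: accepted (160 >= 150).
    System.out.println(isWorthMerging(List.of(100L, 30L, 30L)));
  }
}
```

Rejecting the first case is what avoids repeatedly rewriting a large segment
just to absorb one tiny segment at a time.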
- Removed dependency on LineFileDocs to improve reproducibility.
- Relaxed the expected exception type: any exception is ok.
- Ignore rare cases when a file still appears to have a well-formed footer
after truncation.
The original HNSW paper (https://arxiv.org/pdf/1603.09320.pdf) suggests
using a different maxConn for the upper layers vs. the bottom one
(which contains the full neighborhood graph). Specifically, they
suggest using maxConn=M for upper layers and maxConn=2*M for the bottom.
This patch ensures that we follow this recommendation and use
maxConn=2*M for the bottom layer.
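
As a minimal sketch of the rule above (the class and method names are
hypothetical, and level 0 is assumed to denote the bottom layer):

```java
/** Hypothetical helper for illustration; not the actual graph builder code. */
final class HnswMaxConn {
  /** Returns the connection limit for a layer: 2*M on the bottom layer, M above it. */
  static int maxConnForLevel(int level, int m) {
    if (level < 0 || m <= 0) {
      throw new IllegalArgumentException("level must be >= 0 and m must be > 0");
    }
    // Assumption: level 0 is the bottom layer holding the full neighborhood graph.
    return level == 0 ? 2 * m : m;
  }

  public static void main(String[] args) {
    int m = 16;
    for (int level = 0; level < 3; level++) {
      System.out.println("level " + level + " maxConn=" + maxConnForLevel(level, m));
    }
  }
}
```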
Doc values terms dictionaries keep the first term of each block uncompressed so
that they can somewhat efficiently perform binary searches across blocks.
Suffixes of the other 63 terms are compressed together using LZ4 to leverage
redundancy across suffixes. This change improves compression a bit by using the
first (uncompressed) term of each block as a dictionary when compressing
suffixes of the 63 other terms. This helps with compressing the first few
suffixes when there's not much context yet that can be leveraged to find
duplicates.
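
To illustrate the effect of a preset dictionary, here is a rough sketch that
uses the JDK's DEFLATE implementation (java.util.zip.Deflater) rather than
LZ4, since the JDK ships no LZ4. The class name and the sample terms are
hypothetical; only the general idea carries over: data that resembles the
dictionary can be encoded as back-references from the very first byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical illustration; uses DEFLATE, not the LZ4 code path described above. */
final class PresetDictionarySketch {
  /** Compresses input with raw DEFLATE, optionally primed with a preset dictionary. */
  static byte[] compress(byte[] input, byte[] dictionary) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);
    try {
      if (dictionary != null) {
        // Back-references may now point into the dictionary bytes.
        deflater.setDictionary(dictionary);
      }
      deflater.setInput(input);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[256];
      while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } finally {
      deflater.end();
    }
  }

  public static void main(String[] args) {
    // Made-up terms standing in for the first term of a block and later data.
    byte[] firstTerm = "international_shipping_rate".getBytes(StandardCharsets.UTF_8);
    byte[] data = "international_shipping_zone".getBytes(StandardCharsets.UTF_8);
    System.out.println("without dictionary: " + compress(data, null).length + " bytes");
    System.out.println("with dictionary:    " + compress(data, firstTerm).length + " bytes");
  }
}
```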
Currently, for sparse cases, the doc IDs of all vectors of all fields are fully loaded into memory.
This happens not only when we do vector search, but also when we open an index to
load meta info for vector readers.
This patch instead uses IndexedDISI to store docIds and DirectMonotonicWriter/Reader to
handle ordToDoc mapping. Benefits are reduced memory usage, and faster loading of
meta info for vector readers.
The skipping logic relies on the points index telling us by how much we can
reduce the candidate set by applying a filter that only matches documents that
compare better than the bottom value.
Some randomized points formats have large numbers of points per leaf, and
produce estimates of point counts for range queries that are far above the
actual value, which in turn doesn't enable skipping when we think it should. To
avoid running into this corner case, this change forces the default codec on
this test.
The method moved from DocValuesFieldExistsQuery to DocValuesIterator, but the latter
is a package-private utility class, making it invisible to client code. This commit moves it
back onto FieldExistsQuery, meaning that the upgrade path will be the same as for all other
uses of DocValuesFieldExistsQuery.
* LUCENE-9848 Sort HNSW graph neighbors for construction
Sort HNSW graph neighbors when applying diversity criterion
During HNSW graph construction, when a node already has more
connections than the maximum allowed (maxConn), we need to prune
its connections using a diversity criterion to limit the number of
connections to maxConn.
Currently, when we add reverse connections to already existing nodes,
we don't keep them sorted. Thus later, when we apply the diversity
criterion, we may fail to prune the worst (most distant) non-diverse nodes.
This patch makes sure that neighbor connections are always sorted
from best (closest) to worst (most distant), and that the diversity
criterion is applied to nodes from worst to best.
This patch does the following:
- enhance NeighborArray to always keep neighbor nodes sorted according
to their scores (in desc or asc order). Make NeighborArray aware in
which order the nodes should be sorted.
- make OnHeapHnswGraph aware of the order of similarity function
- make HnswGraphBuilder apply the diversity criterion from worst to
best nodes
- create Lucene90NeighborArray to keep the previous logic of
NeighborArray for Lucene90Codec
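
The sorted-insertion idea can be sketched as below. This is a simplified
stand-in, not the actual NeighborArray: the class and method names are
hypothetical, and it assumes a similarity where a higher score means a closer
neighbor (descending order only).

```java
import java.util.Arrays;

/** Hypothetical, simplified stand-in for a sorted neighbor array. */
final class SortedNeighborsSketch {
  private int[] nodes = new int[4];
  private float[] scores = new float[4];
  private int size;

  /** Inserts a neighbor so that entries stay ordered from best (highest score) to worst. */
  void insertSorted(int node, float score) {
    if (size == nodes.length) {
      nodes = Arrays.copyOf(nodes, size * 2);
      scores = Arrays.copyOf(scores, size * 2);
    }
    int pos = size;
    // Shift worse-scoring entries one slot to the right to make room.
    while (pos > 0 && scores[pos - 1] < score) {
      nodes[pos] = nodes[pos - 1];
      scores[pos] = scores[pos - 1];
      pos--;
    }
    nodes[pos] = node;
    scores[pos] = score;
    size++;
  }

  /** Returns a copy of the neighbor nodes, best first. */
  int[] nodesBestToWorst() {
    return Arrays.copyOf(nodes, size);
  }

  /** Returns the worst neighbor, i.e. the first candidate to examine when pruning. */
  int worstNode() {
    if (size == 0) {
      throw new IllegalStateException("no neighbors");
    }
    return nodes[size - 1];
  }

  public static void main(String[] args) {
    SortedNeighborsSketch neighbors = new SortedNeighborsSketch();
    neighbors.insertSorted(7, 0.2f);
    neighbors.insertSorted(3, 0.9f);
    neighbors.insertSorted(5, 0.5f);
    System.out.println(Arrays.toString(neighbors.nodesBestToWorst())); // [3, 5, 7]
    System.out.println(neighbors.worstNode()); // 7
  }
}
```

With the array kept in this order, pruning can scan from the end and examine
the most distant neighbors first.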