Instead of using a fixed number of bits per value, the group-varint benchmark
now tries to reproduce the distribution of the number of bits per value that
can be observed on tail postings of wikibigall.
When we moved to group-varint for tail postings, we stopped interleaving docs and
freqs and instead wrote all docs first, then all freqs. This means that we can
now skip decoding frequencies when they are not needed.
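A minimal sketch of why the docs-first layout allows skipping frequencies. It uses plain variable-length ints rather than group-varint to stay short, and the class and method names are hypothetical, not Lucene's:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class DocsThenFreqsSketch {

  // Writes a non-negative int as a sequence of 7-bit groups, low bits first.
  public static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  // Reads one int written by writeVInt.
  public static int readVInt(ByteBuffer in) {
    int b = in.get();
    int v = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.get();
      v |= (b & 0x7F) << shift;
    }
    return v;
  }

  // Layout: all doc deltas first, then all freqs (no interleaving).
  public static byte[] encode(int[] docs, int[] freqs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int doc : docs) {
      writeVInt(out, doc - prev);
      prev = doc;
    }
    for (int freq : freqs) {
      writeVInt(out, freq);
    }
    return out.toByteArray();
  }

  // Decodes doc IDs only; the freq bytes that follow are never touched.
  public static int[] decodeDocsOnly(byte[] block, int count) {
    ByteBuffer in = ByteBuffer.wrap(block);
    int[] docs = new int[count];
    int prev = 0;
    for (int i = 0; i < count; i++) {
      prev += readVInt(in);
      docs[i] = prev;
    }
    return docs;
  }
}
```

With an interleaved layout, the reader would have to decode (or at least step over) each frequency to reach the next doc delta.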
Make TaskExecutor public; it is currently pkg-private. At indexing time we build the HNSW graph concurrently (Concurrent HNSW Merge #12660), and the TaskExecutor implementation could do this for us.
Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
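A rough sketch of the "submit all workers, wait for all results" pattern. The JDK's `ExecutorService#invokeAll` stands in for `TaskExecutor#invokeAll` here, and the worker body and all names are hypothetical placeholders rather than the actual merge builder code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentBuildSketch {

  // Runs numWorkers workers that share a node counter; returns how many nodes were handled.
  public static int buildConcurrently(int numNodes, int numWorkers) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(numWorkers);
    try {
      AtomicInteger nextNode = new AtomicInteger();
      List<Callable<Integer>> workers = new ArrayList<>();
      for (int i = 0; i < numWorkers; i++) {
        workers.add(() -> {
          int added = 0;
          // Each worker claims the next unprocessed node until none are left.
          while (nextNode.getAndIncrement() < numNodes) {
            added++; // a real worker would insert the node into the shared graph here
          }
          return added;
        });
      }
      int total = 0;
      // invokeAll blocks until every worker has completed.
      for (Future<Integer> result : executor.invokeAll(workers)) {
        total += result.get();
      }
      return total;
    } finally {
      executor.shutdown();
    }
  }
}
```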
* hunspell: a couple micro-optimizations to speed up dictionary loading
1. sort by the whole entry without searching for separators first: WordStorage doesn't require strict lexicographic order (only something close to it), and the separators sort before any usual word characters anyway
2. avoid stream overhead when adding an entry
I think that this optimization was introduced because `advanceShallow` may
advance skip lists and then never decode a block of postings. But actually
`IndexInput#seek` is cheap, including on `NIOFSDirectory`. So let's seek
immediately?
* #7820: add initial (failing) test case exposing the bug in CheckIndex
* #7820: initial fix to CheckIndex to detect 'on load' corruption of older segments files
* exclamation point police
* tidy
* add missing @Override in new test case
* fold feedback: merge SIS-loading logic into single for-loop; rename sis -> lastCommit
* tidy
* sidestep segments.gen file as well in case we are reading a very old index
* tidy
* always load last commit point first so if it has an issue that will not be masked by issues with older commit points
When requesting k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors.
One scenario where we may still explore the graph even though k >= numVectors is when not every document has a vector and there are deleted docs. But this commit significantly improves things regardless.
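A minimal sketch of the idea, assuming a brute-force scan is used once the graph is known to be too small. The class, method names, and the squared-distance metric are all hypothetical and do not mirror the actual reader code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SmallGraphSearchSketch {

  // Graph traversal cannot prune anything when every vector is a result anyway.
  public static boolean canSkipGraph(int k, int numVectors) {
    return k >= numVectors;
  }

  // Scores every vector directly and returns up to k ordinals, nearest first.
  public static List<Integer> exactSearch(float[][] vectors, float[] query, int k) {
    List<Integer> ords = new ArrayList<>();
    for (int i = 0; i < vectors.length; i++) {
      ords.add(i);
    }
    ords.sort(Comparator.comparingDouble(ord -> squaredDistance(vectors[ord], query)));
    return ords.subList(0, Math.min(k, ords.size()));
  }

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

A caller would check `canSkipGraph` first and only fall back to graph traversal when it returns false.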
This was found by `testRandomExceptions()`: if an exception occurs when opening
the meta file, then the `rawVectorsReader` that is passed to the constructor
never gets closed.
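The usual remedy for this kind of leak is a success flag in the constructor, sketched below with plain JDK types. `TrackedResource`, `Reader`, and the `failOnMeta` switch are hypothetical stand-ins used only to simulate the failure:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnFailureSketch {

  // Stand-in for the reader that is handed to the constructor.
  public static class TrackedResource implements Closeable {
    public boolean closed = false;

    @Override
    public void close() {
      closed = true;
    }
  }

  public static class Reader implements Closeable {
    private final Closeable rawVectorsReader;

    public Reader(Closeable rawVectorsReader, boolean failOnMeta) throws IOException {
      this.rawVectorsReader = rawVectorsReader;
      boolean success = false;
      try {
        if (failOnMeta) {
          // Simulates an exception while opening the meta file.
          throw new IOException("simulated failure opening meta file");
        }
        success = true;
      } finally {
        if (!success) {
          // The caller never receives this object, so release what we were given.
          rawVectorsReader.close();
        }
      }
    }

    @Override
    public void close() throws IOException {
      rawVectorsReader.close();
    }
  }
}
```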
Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
This commit refactors the usage of the deprecated System::runFinalization in tests and benchmarks, so that narrowly targeted suppressions can be added.
Currently the HNSW codec does too many things: it not only indexes vectors, but also stores them and determines how to store them given the vector type.
This PR extracts the vector storage into a new format, `Lucene99FlatVectorsFormat`, and adds a new base class called `FlatVectorsFormat`. This enables some additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats.
Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`.
Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat`, and a new `Lucene99HnswScalarQuantizedVectorsFormat` uses `Lucene99ScalarQuantizedVectorsFormat`.