* CheckIndex - Making -fast the default behaviour
1. Making -fast the new default.
2. The previous -slow is moved to -slower
3. The previous default behavior (checksum + segment file content) is activated by -slow.
* gradlew tidy
* Add changes.txt
* Moved change to Lucene 10.0, now using -detailLevel param
* Fix failing test
* Add MIGRATE.md note and comment to remove deprecated params
* Fix failing unit test
* Changing detailLevel -> level
* catch invalid API calls
* Update lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
Co-authored-by: Adrien Grand <jpountz@gmail.com>
---------
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc
IDs on merge using a `BPIndexReorderer`.
- Reordering always runs on forced merges.
- A `minNaturalMergeNumDocs` parameter helps only enable reordering on the
larger merged segments. This way, small merges retain all merging
optimizations like bulk copying of stored fields, and only the larger
segments - which are the most important for search performance - get
reordered.
- If not enough RAM is available to perform reordering, reordering is skipped.
To make this work, I had to add the ability for any merge to reorder doc IDs of
the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the
test framework randomly reverses the order of documents in a merged segment to
make sure this logic is properly exercised.
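The rules listed above can be read as a small decision function. What follows is a minimal sketch with hypothetical names (`ReorderDecision`, `shouldReorder`, `estimatedRamBytes`, `availableRamBytes`). It is not the `BPReorderingMergePolicy` code, and it assumes that natural merges are reordered once the merged segment reaches `minNaturalMergeNumDocs` documents.

```java
/** Illustrative only: hypothetical names, not Lucene's BPReorderingMergePolicy. */
final class ReorderDecision {
  private ReorderDecision() {}

  /**
   * @param forcedMerge whether the merge comes from a forced merge
   * @param mergedNumDocs number of documents in the merged segment
   * @param minNaturalMergeNumDocs threshold under which natural merges are left alone
   * @param estimatedRamBytes assumed estimate of the RAM that reordering would need
   * @param availableRamBytes assumed RAM budget available for reordering
   * @return true if doc IDs of the merged segment should be reordered
   */
  static boolean shouldReorder(
      boolean forcedMerge,
      int mergedNumDocs,
      int minNaturalMergeNumDocs,
      long estimatedRamBytes,
      long availableRamBytes) {
    if (estimatedRamBytes > availableRamBytes) {
      return false; // not enough RAM: skip reordering, even on forced merges
    }
    if (forcedMerge) {
      return true; // forced merges are always reordered
    }
    // Natural merges: only reorder the larger merged segments (assumed inclusive bound).
    return mergedNumDocs >= minNaturalMergeNumDocs;
  }
}
```

With this reading, a small natural merge returns `false` and keeps optimizations such as bulk copying of stored fields.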
Instead of using a fixed number of bits per value, the group-varint benchmark
now tries to reproduce the distribution of the number of bits per value that
can be observed on tail postings of wikibigall.
When we moved to group-varint for tail postings, we stopped interleaving docs and
freqs and instead wrote all docs first, then all freqs. This means that we can
now skip decoding frequencies when they are not needed.
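To illustrate why writing all docs first makes frequencies skippable, here is a minimal group-varint sketch. The class `GroupVarintSketch`, its byte layout (one selector byte with two bits per length, then little-endian payload bytes) and the multiple-of-4 restriction are assumptions made for the example. This is not Lucene's postings format.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative group-varint codec; layout and names are assumptions, not Lucene's format. */
final class GroupVarintSketch {
  private GroupVarintSketch() {}

  /** Number of bytes (1 to 4) needed to store v as an unsigned little-endian value. */
  private static int numBytes(int v) {
    if ((v >>> 8) == 0) return 1;
    if ((v >>> 16) == 0) return 2;
    if ((v >>> 24) == 0) return 3;
    return 4;
  }

  /** Encodes values in groups of 4: one selector byte, then the payload bytes of the group. */
  static void encode(int[] values, ByteArrayOutputStream out) {
    if (values.length % 4 != 0) {
      throw new IllegalArgumentException("length must be a multiple of 4");
    }
    for (int i = 0; i < values.length; i += 4) {
      int selector = 0;
      for (int j = 0; j < 4; j++) {
        selector |= (numBytes(values[i + j]) - 1) << (2 * j);
      }
      out.write(selector);
      for (int j = 0; j < 4; j++) {
        int v = values[i + j];
        for (int b = 0, n = numBytes(v); b < n; b++) {
          out.write(v >>> (8 * b)); // write(int) keeps only the low 8 bits
        }
      }
    }
  }

  /** Decodes count values (a multiple of 4) from offset; returns the offset just past them. */
  static int decode(byte[] in, int offset, int[] dst, int count) {
    for (int i = 0; i < count; i += 4) {
      int selector = in[offset++] & 0xFF;
      for (int j = 0; j < 4; j++) {
        int n = ((selector >>> (2 * j)) & 3) + 1;
        int v = 0;
        for (int b = 0; b < n; b++) {
          v |= (in[offset++] & 0xFF) << (8 * b);
        }
        dst[i + j] = v;
      }
    }
    return offset;
  }

  /** Writes all doc deltas first, then all freqs, so a reader can stop after the docs. */
  static byte[] writeBlock(int[] docDeltas, int[] freqs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    encode(docDeltas, out);
    encode(freqs, out);
    return out.toByteArray();
  }
}
```

A reader that only needs doc IDs calls `decode` once on the block and never touches the bytes that follow. With interleaved docs and freqs, it would have to decode both to find the next doc.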
Make TaskExecutor, which is currently package-private, public. At indexing time we create the HNSW graph concurrently (Concurrent HNSW Merge #12660). We could use the TaskExecutor implementation to do this for us.
Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
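The general pattern of submitting a batch of workers and waiting for all of them can be sketched with the JDK alone. The example below uses `ExecutorService#invokeAll` as a stand-in, since it does not depend on Lucene. `ConcurrentWorkersSketch`, `runWorkers` and the range-splitting scheme are hypothetical and do not reflect how `HnswConcurrentMergeBuilder` divides its work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: the JDK ExecutorService stands in for Lucene's TaskExecutor. */
final class ConcurrentWorkersSketch {
  private ConcurrentWorkersSketch() {}

  /**
   * Splits [0, numItems) into numWorkers ranges, runs one worker per range concurrently and
   * returns the total number of items processed. numWorkers must be positive.
   */
  static int runWorkers(int numItems, int numWorkers) {
    ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
    try {
      List<Callable<Integer>> workers = new ArrayList<>();
      int chunk = (numItems + numWorkers - 1) / numWorkers; // ceiling division
      for (int w = 0; w < numWorkers; w++) {
        final int from = Math.min(numItems, w * chunk);
        final int to = Math.min(numItems, from + chunk);
        // Hypothetical worker: a real one would add nodes [from, to) to the graph.
        workers.add(() -> to - from);
      }
      int total = 0;
      // invokeAll blocks until every worker has completed.
      for (Future<Integer> future : pool.invokeAll(workers)) {
        total += future.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```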
* hunspell: a couple micro-optimizations to speed up dictionary loading
1. sort by the whole entry without searching for separators first: WordStorage doesn't require strict lexicographic order (only something close to it), and the separators sort before any usual word characters anyway
2. avoid stream overhead when adding an entry
I think that this optimization was introduced because `advanceShallow` may
advance skip lists and then never decode a block of postings. But actually
`IndexInput#seek` is cheap, including on `NIOFSDirectory`. So let's seek
immediately?
* #7820: add initial (failing) test case exposing the bug in CheckIndex
* #7820: initial fix to CheckIndex to detect 'on load' corruption of older segments files
* exclamation point police
* tidy
* add missing @Override in new test case
* fold feedback: merge SIS-loading logic into single for-loop; rename sis -> lastCommit
* tidy
* sidestep segments.gen file as well in case we are reading very old index
* tidy
* always load last commit point first so if it has an issue that will not be masked by issues with older commit points
When requesting k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors.
One scenario where we may still explore the graph if k >= numVectors is when not every document has a vector and there are deleted docs. But, this commit significantly improves things regardless.
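As a rough illustration of the idea, the sketch below checks `k >= numVectors` and, in that case, scores every vector directly instead of traversing a graph. All names (`ExactSearchSketch`, `canSkipGraph`, `exactTopK`) are hypothetical, dot product is an arbitrary choice of similarity, and the sketch assumes every vector belongs to a live document.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical names, not Lucene's KNN query code. */
final class ExactSearchSketch {
  private ExactSearchSketch() {}

  /** If k covers every vector, a graph traversal cannot prune anything. */
  static boolean canSkipGraph(int k, int numVectors) {
    return k >= numVectors;
  }

  /** Scores every vector by dot product and returns up to k ordinals, best first. */
  static int[] exactTopK(float[][] vectors, float[] query, int k) {
    final float[] scores = new float[vectors.length];
    Integer[] ords = new Integer[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      ords[i] = i;
      float dot = 0f;
      for (int d = 0; d < query.length; d++) {
        dot += vectors[i][d] * query[d];
      }
      scores[i] = dot;
    }
    // Sort ordinals by descending score; the sort is stable, so ties keep ordinal order.
    Arrays.sort(ords, (a, b) -> Float.compare(scores[b], scores[a]));
    int n = Math.min(k, ords.length);
    int[] top = new int[n];
    for (int i = 0; i < n; i++) {
      top[i] = ords[i];
    }
    return top;
  }
}
```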
This was found by `testRandomExceptions()`: if an exception occurs when opening
the meta file, then the `rawVectorsReader` that is passed to the constructor
never gets closed.
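The usual remedy for this kind of leak is to close the resource that was handed to the constructor when the constructor itself fails. Here is a minimal sketch of that pattern. `OwningReaderSketch` and the `failOpeningMeta` flag are hypothetical and only simulate the failure; this is not the actual reader class.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative only: hypothetical class showing close-on-failure in a constructor. */
final class OwningReaderSketch implements Closeable {
  private final Closeable rawReader;

  /** Takes ownership of rawReader; failOpeningMeta simulates an error on the meta file. */
  OwningReaderSketch(Closeable rawReader, boolean failOpeningMeta) throws IOException {
    this.rawReader = rawReader;
    boolean success = false;
    try {
      if (failOpeningMeta) {
        throw new IOException("simulated failure while opening the meta file");
      }
      success = true;
    } finally {
      if (!success) {
        // The caller never gets an instance to close, so release the reader here.
        try {
          rawReader.close();
        } catch (IOException | RuntimeException ignored) {
          // Keep the original exception as the one that propagates.
        }
      }
    }
  }

  @Override
  public void close() throws IOException {
    rawReader.close();
  }
}
```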
Prevent contention on the ReentrantReadWriteLock at search time by
ensuring that `commit` and `purgeCache` are never both trying to
hold the write lock.
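One possible way to get this property, shown purely as an assumption about the approach and not as the code of this change, is to serialize the two writers on a separate mutex before they take the write lock. At most one thread then waits for the write lock at any time. `SerializedWritersSketch` and its fields are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: hypothetical class, not the code changed by this commit. */
final class SerializedWritersSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Object writerMutex = new Object(); // serializes commit and purgeCache
  private int commits;
  private int purges;

  /** Searches only take the read lock, so they can run concurrently with each other. */
  int search() {
    lock.readLock().lock();
    try {
      return commits + purges;
    } finally {
      lock.readLock().unlock();
    }
  }

  void commit() {
    synchronized (writerMutex) { // only one writer ever queues on the write lock
      lock.writeLock().lock();
      try {
        commits++;
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  void purgeCache() {
    synchronized (writerMutex) {
      lock.writeLock().lock();
      try {
        purges++;
      } finally {
        lock.writeLock().unlock();
      }
    }
  }
}
```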