### Description
Background in #12579
Add support for getting "all vectors within a radius", as opposed to the "topK closest vectors" returned by the current system.
### Considerations
I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius
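The two criteria above can be sketched on a single graph level as follows. This is a minimal illustration and not the code in this PR: the class name, the plain adjacency arrays, the use of Euclidean distance (rather than Lucene's similarity scores), and the FIFO exploration order are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: radius-based traversal over one level of a proximity graph.
class RadiusGraphSearchSketch {

  /** Euclidean distance between two vectors of equal length. */
  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Explores nodes reachable from entryNode. A node's neighbors are expanded only if the node
   * lies within traversalRadius of the query; a node is collected only if it lies within
   * resultRadius. Results are returned unsorted, in visit order.
   */
  static List<Integer> search(
      float[][] vectors,
      int[][] neighbors,
      int entryNode,
      float[] query,
      double traversalRadius,
      double resultRadius) {
    List<Integer> results = new ArrayList<>();
    boolean[] visited = new boolean[vectors.length];
    Deque<Integer> toExplore = new ArrayDeque<>();
    visited[entryNode] = true;
    toExplore.add(entryNode);
    while (!toExplore.isEmpty()) {
      int node = toExplore.poll();
      double dist = distance(vectors[node], query);
      if (dist > traversalRadius) {
        continue; // outside the outer radius: do not expand this node
      }
      if (dist <= resultRadius) {
        results.add(node); // inside the inner radius: part of the result
      }
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          toExplore.add(next);
        }
      }
    }
    return results;
  }
}
```

Note that a node within the result radius is missed if every path to it passes through nodes outside the traversal radius, which is why the outer radius is the knob that trades recall for work in this sketch.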
### Advantages
1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`
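To illustrate the third point, here is a rough comparison of the two merge strategies. The `Hit` record and both method names are hypothetical and only stand in for segment-level results; they are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merging per-segment results for radius search vs. topK search.
class SegmentMergeSketch {
  record Hit(int doc, float score) {}

  static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(Hit::score);

  /** Radius-style merge: each segment-level hit is already final, so concatenate. */
  static List<Hit> mergeRadius(List<List<Hit>> perSegment) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : perSegment) {
      all.addAll(hits);
    }
    return all;
  }

  /** topK-style merge: a min-heap keeps the k best hits seen across all segments. */
  static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
    PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE);
    for (List<Hit> hits : perSegment) {
      for (Hit hit : hits) {
        heap.add(hit);
        if (heap.size() > k) {
          heap.poll(); // evict the current worst hit
        }
      }
    }
    List<Hit> top = new ArrayList<>(heap);
    top.sort(BY_SCORE.reversed()); // best score first
    return top;
  }
}
```

In the radius case no hit from one segment can displace a hit from another, which is what makes the per-segment searches independent.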
On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?
Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)
### Caveats
I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`
For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.
* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895
* whoops, forgot to tidy up
* #12901: Ignore failing test; reflow text to work around spotless' poor text formatting skills
* Rewrite Javascript expression compiler to use hidden classes and MethodHandles for functions
* Use dynamic constants for MethodHandles
* Remove invokestatic code and handle everything through dynamic constants
* Rewrite code to patch stack trace (keep Expressions class unmodified)
* Improve generating of constant names
* Remove classloader test (no longer needed)
* Add benchmark
* use better exception in benchmark
* Add documentation, migration guide and a utility method to convert legacy function maps
* also ignore SecurityException here while checking compatibility (if it happens only an imprecise error message is thrown)
* Use Map.copyOf to not clone the map each time we compile an expression
* Add another test with same method multiple times
* Update ASM to 9.6 and set classfile version to Java 17
* Cleanup classloader permissions, unfortunately "createClassLoader" is still needed for Jacoco for God knows what
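As a rough illustration of two items in the list above (functions exposed as `MethodHandle`s, and `Map.copyOf` for the function map), the sketch below uses only JDK APIs. The class and method names are hypothetical, and the `Map<String, MethodHandle>` shape is an assumption for the example rather than a statement about the expressions module's API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a function table backed by MethodHandles.
class FunctionMapSketch {

  /** Builds an unmodifiable name-to-handle map containing a handle for Math.sqrt. */
  static Map<String, MethodHandle> defaultFunctions() {
    try {
      MethodHandles.Lookup lookup = MethodHandles.publicLookup();
      Map<String, MethodHandle> functions = new HashMap<>();
      functions.put(
          "sqrt",
          lookup.findStatic(
              Math.class, "sqrt", MethodType.methodType(double.class, double.class)));
      // Map.copyOf returns an unmodifiable map; per its Javadoc, calling it on a map that is
      // already unmodifiable will generally not create another copy.
      return Map.copyOf(functions);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Invokes a single-argument (double) -> double function by name. */
  static double call(Map<String, MethodHandle> functions, String name, double arg) {
    try {
      return (double) functions.get(name).invokeExact(arg);
    } catch (Throwable t) {
      throw new IllegalStateException(t);
    }
  }
}
```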
This commit fixes the intermittently failing TestParallelLeafReader.
The ParallelLeafReader requires the document order to be consistent across indexes - each document contains the union of the fields of all documents with the same document number. The test asserts this. But now, with MockRandomMergePolicy potentially reversing the doc ID order while merging, this invalidates the assumption of the test indexes and assertions. The solution is to just ensure that no merging actually happens in these tiny test indexes.
This commit fixes the intermittently failing TestSortedSetFieldSource.
The test assertions depend on doc order which may be affected by merging. The fix is to trivially avoid merging for the very small index, with just two docs.
* Report the time it took for building the FST
* Update CHANGES
* Change ramBytesUsed to numBytes
* Report the verification time
* Rename to fstSizeInBytes
* CheckIndex - Making -fast the default behaviour
1. Making -fast the new default.
2. The previous -slow is moved to -slower
3. The previous default behavior (checksum + segment file content) is activated by -slow.
* gradlew tidy
* Add changes.txt
* Moved change to Lucene 10.0, now using -detailLevel param
* Fix failing test
* Add MIGRATE.md note and comment to remove deprecated params
* Fix failing unit test
* Changing detailLevel -> level
* catch invalid API calls
* Update lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
Co-authored-by: Adrien Grand <jpountz@gmail.com>
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc
IDs on merge using a `BPIndexReorderer`.
- Reordering always runs on forced merges.
- A `minNaturalMergeNumDocs` parameter helps only enable reordering on the
larger merged segments. This way, small merges retain all merging
optimizations like bulk copying of stored fields, and only the larger
segments - which are the most important for search performance - get
reordered.
- If not enough RAM is available to perform reordering, reordering is skipped.
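The three rules above amount to a small decision function, sketched below. This is not the code of `BPReorderingMergePolicy`: the class, the method, and the RAM parameters are hypothetical, and treating the `minNaturalMergeNumDocs` threshold as inclusive is an assumption.

```java
// Hypothetical sketch of the reordering decision described above.
class ReorderDecisionSketch {

  static boolean shouldReorder(
      boolean forcedMerge,
      int mergedNumDocs,
      int minNaturalMergeNumDocs,
      long requiredRamBytes,
      long availableRamBytes) {
    if (requiredRamBytes > availableRamBytes) {
      return false; // not enough RAM: skip reordering for this merge
    }
    if (forcedMerge) {
      return true; // forced merges are always reordered
    }
    // Natural merges are reordered only when the merged segment is large enough
    // (inclusive threshold assumed here).
    return mergedNumDocs >= minNaturalMergeNumDocs;
  }
}
```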
To make this work, I had to add the ability for any merge to reorder doc IDs of
the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the
test framework randomly reverses the order of documents in a merged segment to
make sure this logic is properly exercised.
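For intuition, reversing the document order of a merged segment can be expressed as a new-to-old doc ID map, as in the sketch below. The names are hypothetical and the arrays stand in for real segment data; this does not use the `OneMerge#reorder` API.

```java
// Hypothetical sketch: reversing doc order via a new-to-old doc ID mapping.
class DocIdReversalSketch {

  /** Returns a map where entry newDoc holds the old doc ID that moves to that position. */
  static int[] reverseDocMap(int maxDoc) {
    int[] newToOld = new int[maxDoc];
    for (int newDoc = 0; newDoc < maxDoc; newDoc++) {
      newToOld[newDoc] = maxDoc - 1 - newDoc;
    }
    return newToOld;
  }

  /** Rearranges per-document values according to the given new-to-old map. */
  static String[] apply(String[] valuesInOldOrder, int[] newToOld) {
    String[] reordered = new String[newToOld.length];
    for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
      reordered[newDoc] = valuesInOldOrder[newToOld[newDoc]];
    }
    return reordered;
  }
}
```

Any code or test that assumes doc IDs follow insertion order after a merge will observe a different order once such a map is applied.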