Real-world data exhibits patterns that the compression logic takes advantage
of but that are hard to reproduce with randomized data, so this new test adds
meaningful coverage.
It takes one second to run on my machine, so I did not mark it `@Nightly`.
* Fix PathHierarchyTokenizer positions
PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it emits multiple prefixes/suffixes of
the input string), we now output every token with a position length of 1
and with positions incrementing by 1.
* Fix ReversePathHierarchyTokenizer positions
Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.
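
To illustrate the intended position behavior, here is a minimal, self-contained sketch. It is not the Lucene implementation: `PathPrefixSketch` and its `Token` class are hypothetical names, and the splitting logic is a simplified stand-in that only shows each emitted prefix getting its own position and a position length of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual PathHierarchyTokenizer code.
final class PathPrefixSketch {

  /** Hypothetical token holder: term text, absolute position, position length. */
  static final class Token {
    final String term;
    final int position;
    final int positionLength;

    Token(String term, int position, int positionLength) {
      this.term = term;
      this.position = position;
      this.positionLength = positionLength;
    }
  }

  /** Emits every path prefix that ends just before a delimiter or at end of input. */
  static List<Token> tokenize(String path, char delimiter) {
    List<Token> tokens = new ArrayList<>();
    int position = 0;
    for (int i = 1; i <= path.length(); i++) {
      boolean atEnd = i == path.length();
      if (atEnd || path.charAt(i) == delimiter) {
        // Each prefix advances the position by 1 instead of stacking on the same position.
        tokens.add(new Token(path.substring(0, i), position++, 1));
      }
    }
    return tokens;
  }
}
```

With this sketch, `PathPrefixSketch.tokenize("/a/b/c", '/')` yields `/a`, `/a/b` and `/a/b/c` at positions 0, 1 and 2.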
---------
Co-authored-by: Michael Froh <froh@amazon.com>
While quantization generally works well, when the number of dimensions is tiny (just two, as in our tests), we are indexing a circle, and merge policies are random, the resulting vectors can end up in an unexpected order.
closes: https://github.com/apache/lucene/issues/12940
* ensure Nori/Kuromoji shipped binary FST is the latest version (closes #12911)
* fold feedback from @uschindler: sharpen test failure methods to give the specific gradlew command to regenerate the precise FST (not everything)
* add javadoc for FSTMetadata.getVersion
* Simple patch to prevent the common zero-width code points in our source and some types of resource files
* Validate correct UTF-8 input and fix buggy CSS file (ISO-8859-x encoded)
* add a bit of context
* Add CHANGES.txt
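
The two checks mentioned above (zero-width code points and valid UTF-8) could look roughly like the following sketch. This is not the actual build check: `ZeroWidthScanSketch` is a hypothetical name, and the set of code points it flags is an assumption, since the commit does not list which ones it rejects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real validation lives in the build scripts.
final class ZeroWidthScanSketch {

  /** Assumed set of "common" zero-width code points; the real list may differ. */
  static boolean isZeroWidth(int cp) {
    return cp == 0x200B // ZERO WIDTH SPACE
        || cp == 0x200C // ZERO WIDTH NON-JOINER
        || cp == 0x200D // ZERO WIDTH JOINER
        || cp == 0x2060 // WORD JOINER
        || cp == 0xFEFF; // ZERO WIDTH NO-BREAK SPACE (BOM)
  }

  /** Returns the char offsets at which a zero-width code point starts. */
  static List<Integer> findZeroWidthOffsets(String text) {
    List<Integer> offsets = new ArrayList<>();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (isZeroWidth(cp)) {
        offsets.add(i);
      }
      i += Character.charCount(cp);
    }
    return offsets;
  }

  /** Strict decode: malformed or unmappable input is reported instead of replaced. */
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

A file saved as ISO-8859-1 that contains a non-ASCII character such as `é` (the single byte `0xE9`) fails `isValidUtf8`, which is the kind of problem the CSS fix above addresses.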
Discovered in #12921, and introduced in #12679
The first issue is that we weren't advancing the `VectorScorer` [here](cf13a92950/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L257-L262)) -- so it was still un-positioned while trying to compute the similarity score
Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](cad565439b/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L107))) -- so we didn't need to explicitly advance it
Later, we decided to maintain parity with `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790f28)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](5096790f28/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L255))) and I missed advancing the `VectorScorer` explicitly.
After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search
Relaxed the parameters of the test to fix this (making the filter less restrictive, and visiting fewer nodes so that approximate search completes without hitting its limit)
Sorry for missing this earlier!
This commit adds coverage of `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.
With this change, `TestLucene90PostingsFormat` now exhibits #12895.
This commit reflows the code in the method body of computeCommonPrefixLengthAndBuildHistogram to avoid a JVM JIT crash. Working around the JVM bug this way is somewhat fragile, but it is the best we can do for now and appears to be working well.
### Description
Background in #12579
Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system
### Considerations
I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius
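
The two criteria above can be sketched on a toy graph as follows. This is a simplified illustration, not the code in this PR: `RadiusSearchSketch`, the adjacency-array graph, and the precomputed `similarity` array are all hypothetical, and the sketch assumes both radii are expressed as similarity thresholds where a higher score means closer, with `traversalThreshold <= resultThreshold`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not the HNSW searcher used by Lucene.
final class RadiusSearchSketch {

  /**
   * @param neighbors adjacency lists of a single graph level (assumed input shape)
   * @param similarity precomputed similarity of each node to the query (assumed)
   * @param entry node from which the traversal starts
   * @param traversalThreshold outer radius: only nodes at or above it are expanded
   * @param resultThreshold inner radius: only nodes at or above it are collected
   */
  static List<Integer> search(
      int[][] neighbors,
      float[] similarity,
      int entry,
      float traversalThreshold,
      float resultThreshold) {
    List<Integer> results = new ArrayList<>(); // plain list, no heap and no sorting
    boolean[] visited = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    visited[entry] = true;
    queue.add(entry);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      if (similarity[node] >= resultThreshold) {
        results.add(node);
      }
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          // Nodes outside the traversal radius are scored once but never expanded.
          if (similarity[next] >= traversalThreshold) {
            queue.add(next);
          }
        }
      }
    }
    return results;
  }
}
```

Note that a node inside the result radius is still missed if every path to it passes through nodes outside the traversal radius, which is why the outer radius is looser than the inner one.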
### Advantages
1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`
On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?
Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)
### Caveats
I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`
For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.