These internal versions only make sense within a codec definition, and aren't
meant to be exposed and compared across codecs. Since this method is only used
in tests, we can move the check to the test classes instead.
This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/copying behavior.
This is a small simplification related to LUCENE-9583, but does not address the
main issue.
The base spatial test case may create invalid self-crossing polygons. These
polygons are cleaned by the tessellator, which may result in an inconsistent
bounding box between the tessellated shape and the original, invalid geometry.
This commit fixes the shape doc value test case to compute the bounding box from
the cleaned geometry instead of relying on the potentially invalid original
geometry.
Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
We currently compute the partition point for a set of points by multiplying the number of nodes that need to be on
the left of the BKD tree by maxPointsInLeafNode. This multiplication is done in int arithmetic, so if the partition point is bigger than Integer.MAX_VALUE it overflows. This commit moves the multiplication to long arithmetic so it doesn't overflow.
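
A minimal sketch of the overflow described above. The class name, method names, and values are hypothetical and chosen only for illustration; they are not taken from the BKD code.

```java
public class PartitionPointOverflow {
    // Buggy variant: both operands are int, so the product wraps around
    // before it is widened to long for the return value.
    static long partitionPointInt(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return numLeftLeafNodes * maxPointsInLeafNode;
    }

    // Fixed variant: widening one operand first makes the whole
    // multiplication happen in long arithmetic.
    static long partitionPointLong(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return (long) numLeftLeafNodes * maxPointsInLeafNode;
    }

    public static void main(String[] args) {
        int numLeftLeafNodes = 1 << 22; // illustrative: 4,194,304 leaves on the left
        int maxPointsInLeafNode = 512;  // illustrative leaf size
        // 2^22 * 2^9 = 2^31, which is one more than Integer.MAX_VALUE.
        System.out.println(partitionPointInt(numLeftLeafNodes, maxPointsInLeafNode));  // -2147483648
        System.out.println(partitionPointLong(numLeftLeafNodes, maxPointsInLeafNode)); // 2147483648
    }
}
```

Casting the result instead of an operand, as in `(long) (a * b)`, would not help, because the wrap-around has already happened by the time the cast is applied.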
Adds a new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.
DocValue queries are performed on the serialized tree using component relation
logic similar to that in SpatialQuery for BKD indexed shapes. To make this
possible, some of the relation logic is refactored to make it accessible to the
doc value query counterpart.
Note this does not support the following:
* Multi Geometries or Collections - This will be investigated by exploring
the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow-on improvement.
Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
* add Comment on Lev & pretty the toDot
* use auto generate scripts to add comment
* update checksum
* update checksum
* restore toDot
* add removeDeadStates in levAutomata
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester to be
able to compare pre- and post-filtering benchmarks.
`filterSelectivity` expresses the selectivity of a filter as the proportion of
docs that pass it; the passing docs are selected at random. We store them in a
FixedBitSet and use it to calculate true KNN as well as in HNSW search.
In the post-filter case, we over-select results as `topK / filterSelectivity` to
get a final hit count close to the requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as the prefilter argument to KnnVectorQuery.
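
The post-filter arithmetic can be sketched as follows. This is an illustration only, not the KnnGraphTester code: it uses `java.util.BitSet` as a stand-in for FixedBitSet, all names are hypothetical, and it assumes that each doc passes independently with probability `filterSelectivity` and that the over-selected count is rounded up.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

public class FilterSelectivitySketch {
    // Assumption: each doc is accepted independently with probability `selectivity`.
    static BitSet randomFilter(int numDocs, double selectivity, Random random) {
        BitSet accepted = new BitSet(numDocs);
        for (int doc = 0; doc < numDocs; doc++) {
            if (random.nextDouble() < selectivity) {
                accepted.set(doc);
            }
        }
        return accepted;
    }

    // Post-filtering: request topK / selectivity hits so that roughly topK
    // of them are expected to survive the filter (rounding up is an assumption).
    static int overSelectedTopK(int topK, double selectivity) {
        return (int) Math.ceil(topK / selectivity);
    }

    // Keep only the hits that pass the filter, preserving rank order,
    // and truncate to the requested topK.
    static int[] postFilter(int[] rankedDocs, BitSet accepted, int topK) {
        return Arrays.stream(rankedDocs).filter(accepted::get).limit(topK).toArray();
    }

    public static void main(String[] args) {
        BitSet accepted = randomFilter(1000, 0.25, new Random(42));
        System.out.println("accepted docs: " + accepted.cardinality());
        System.out.println("over-selected k: " + overSelectedTopK(10, 0.25)); // 40
    }
}
```

With a low selectivity the over-selected count grows quickly, which is part of what the pre-filter versus post-filter comparison is meant to expose.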
If there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID passed to explain is relative to the
segment, i.e. it does not include the segment's docBase, whereas the doc IDs
stored in KnnVectorQuery.DocAndScoreQuery are offset by each segment's
docBase. So in 'DocAndScoreQuery.explain', the segment's docBase needs to be
added to the doc ID before looking it up.
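
A minimal sketch of the lookup, assuming the hits are kept as a sorted array of index-wide doc IDs. The class, method, and values are hypothetical and only illustrate the offset; this is not the Lucene implementation.

```java
import java.util.Arrays;

public class ExplainDocBaseSketch {
    // `globalDocs` holds the doc IDs of the hits in ascending order, each
    // already offset by its segment's docBase. `segmentDoc` is relative to
    // one segment, so it is shifted by that segment's docBase before searching.
    // Returns the index of the hit, or -1 if the doc is not among the hits.
    static int findHit(int[] globalDocs, int segmentDoc, int docBase) {
        int found = Arrays.binarySearch(globalDocs, segmentDoc + docBase);
        return found >= 0 ? found : -1;
    }

    public static void main(String[] args) {
        // Illustrative index: segment 0 has docBase 0, segment 1 has docBase 100.
        int[] globalDocs = {7, 103};
        // Doc 3 of segment 1 is doc 103 index-wide.
        System.out.println(findHit(globalDocs, 3, 0));   // -1: lookup without the docBase misses
        System.out.println(findHit(globalDocs, 3, 100)); // 1: lookup with the docBase finds the hit
    }
}
```

With a single segment the docBase is 0, so the missing offset goes unnoticed, which matches the bug only appearing when there are multiple segments.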
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
This test occasionally fails if knn search returns only 1 document
from the index, as we have an assertion that the doc IDs returned from
the sorted and unsorted indexes must be different.
This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
Currently, when indexing knn vectors, we buffer them in memory and
build an HNSW graph on flush, during segment construction.
As building an HNSW graph is very expensive, this makes the flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable: some indexing operations return
almost instantly while others that trigger a flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory usage, the presence of concurrent searches, etc.
Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly across indexing.
Co-authored-by: Adrien Grand <jpountz@gmail.com>