Commit Graph

36216 Commits

Author SHA1 Message Date
Michael Sokolov 0a58318e16
Fix for bad cast when sorting a KnnVectors index over BytesRef (#1074) 2022-08-20 17:23:47 -04:00
Michael Sokolov 798c02dd70
fix VectorUtil.dotProductScore normalization (#1073) 2022-08-20 09:15:38 -04:00
Michael Sokolov 60fa19d509
don't call BitSet.cardinality() more than needed (#1075) 2022-08-20 08:40:50 -04:00
Michael Sokolov f9680c6807
Add safety checks to KnnVectorField; fixed issue with copying BytesRef (#1076) 2022-08-20 08:38:42 -04:00
Tomoko Uchida 9ae3498f82 add notes about labels' color code 2022-08-20 13:22:50 +09:00
Julie Tibshirani 8308688d78
LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)
This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.
2022-08-19 18:04:05 -07:00
Yuting Gan 0914b537db
LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013) 2022-08-18 10:38:49 -07:00
Julie Tibshirani 7912ed02c4 Move Lucene91HnswGraphBuilder to test folder
It's only used in unit tests so it can live in the backwards_codecs tests.
2022-08-17 17:10:38 -07:00
Tomoko Uchida 8b3303b25f .asf.yaml 2022-08-16 20:02:47 +09:00
Michael Sokolov bc214d4958 standardize exception text for vector dimension mismatch (in SimpleText codec) 2022-08-13 13:12:11 -04:00
Nick Knize 543910d900
LUCENE-10654: Fix ShapeDocValue Bounding Box failure (#1066)
The base spatial test case may create invalid self crossing polygons. These
polygons are cleaned by the tessellator which may result in an inconsistent
bounding box between the tessellated shape and the original, invalid, geometry.
This commit fixes the shape doc value test case to compute the bounding box from
the cleaned geometry instead of relying on the, potentially invalid, original
geometry.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-12 10:54:22 -05:00
Ignacio Vera fe8d11254a
LUCENE-10678: Fix potential overflow when computing the partition point on the BKD tree (#1065)
We currently compute the partition point for a set of points by multiplying the number of nodes that need to be on
the left of the BKD tree by the maxPointsInLeafNode. This multiplication is done in the integer space, so if the partition point is bigger than Integer.MAX_VALUE it will overflow. This commit moves the multiplication to the long space so it doesn't overflow.
2022-08-11 15:25:53 +02:00
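
The overflow described in LUCENE-10678 above can be illustrated with a minimal sketch. The class and method names here are hypothetical and are not taken from the Lucene source; only the idea (widening to long before multiplying) comes from the commit message.

```java
public class PartitionPointSketch {
    // Hypothetical helper: the product is computed in int arithmetic and only
    // then widened to long, so it wraps around past Integer.MAX_VALUE.
    static long partitionPointIntMath(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return numLeftLeafNodes * maxPointsInLeafNode;
    }

    // Hypothetical helper: casting one operand to long first makes the whole
    // multiplication happen in long arithmetic.
    static long partitionPointLongMath(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return (long) numLeftLeafNodes * maxPointsInLeafNode;
    }

    public static void main(String[] args) {
        int numLeftLeafNodes = 4_194_304; // 2^22, illustrative value
        int maxPointsInLeafNode = 512;    // 2^9, illustrative value
        System.out.println(partitionPointIntMath(numLeftLeafNodes, maxPointsInLeafNode));  // -2147483648
        System.out.println(partitionPointLongMath(numLeftLeafNodes, maxPointsInLeafNode)); // 2147483648
    }
}
```

The two inputs multiply to exactly 2^31, one past Integer.MAX_VALUE, which is why the int version wraps to a negative value.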
Michael Sokolov a693fe819b
LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)
* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 bit precision
2022-08-10 17:09:07 -04:00
Vigya Sharma 59a0917e25
Fix typo in PostingsReaderBase docstring (#948)
* remove extra PostingsEnum from docstring
* add ImpactsEnum to docstring
2022-08-09 16:20:51 -07:00
Nick Knize d7fd48c950
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-09 12:51:45 -05:00
Tomoko Uchida 0eba72f625 disable GH issue 2022-08-09 15:26:55 +09:00
tang donghai b08e34722d
LUCENE-10646: Add some comment on LevenshteinAutomata (#1016)
* add Comment on Lev & pretty the toDot

* use auto generate scripts to add comment

* update checksum

* update checksum

* restore toDot

* add removeDeadStates in levAutomata

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2022-08-07 10:01:30 -04:00
Ignacio Vera bd0718f071
LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox (#1056) 2022-08-04 06:47:27 +02:00
luyuncheng 34154736c6
LUCENE-10627: Use ByteBuffersDataInput to reduce memory copies when compressing data (#987) 2022-08-01 18:34:41 +02:00
Adrien Grand 04e4f317cb LUCENE-10629: Fix NullPointerException.
I hit a NPE while running tests. `Weight#scorer` may return `null`, but not
`Scorer#iterator`.
2022-08-01 14:13:22 +02:00
Shai Erera 7ac75135b9
[LUCENE-10629]: Add fast match query support to FacetSets (#1015) 2022-07-31 07:50:03 +03:00
Dawid Weiss f93e52e5bb
LUCENE-10669: The build should be more helpful when generated resources are touched (#1053) 2022-07-30 20:45:32 +02:00
Adrien Grand 7c9d3cd6ff LUCENE-10633: Fix handling of missing values in reverse sorts. 2022-07-29 21:36:35 +02:00
Kaival Parikh 1ad28a3136
LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
passing docs that are randomly selected. We store these in a FixedBitSet and
use this to calculate true KNN as well as in HNSW search.

In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
2022-07-29 11:21:34 -07:00
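
The post-filter over-selection described in LUCENE-10559 above amounts to a small calculation. The following is a sketch only: the class and method names are hypothetical, and rounding up with Math.ceil is an assumption, since the commit message gives just the ratio `topK / filterSelectivity`.

```java
public class PostFilterOverSelectSketch {
    // Hypothetical helper: how many hits to request before post-filtering so
    // that roughly topK survive a filter passing the given fraction of docs.
    static int overSelectedK(int topK, double filterSelectivity) {
        if (filterSelectivity <= 0.0 || filterSelectivity > 1.0) {
            throw new IllegalArgumentException("filterSelectivity must be in (0, 1]");
        }
        // Rounding up is an assumption made for this sketch.
        return (int) Math.ceil(topK / filterSelectivity);
    }

    public static void main(String[] args) {
        // With a filter that passes a quarter of the docs, ask for four times as many hits.
        System.out.println(overSelectedK(10, 0.25)); // 40
        // A filter that passes everything needs no over-selection.
        System.out.println(overSelectedK(10, 1.0));  // 10
    }
}
```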
Adrien Grand eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
iverase e1d2005df4 Add back-compat indices for 9.3.0 2022-07-29 10:13:20 +02:00
iverase 52a41d702f DOAP changes for release 9.3.0 2022-07-29 09:40:53 +02:00
Greg Miller 4ebc249dbc
Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto (#1020) 2022-07-28 11:12:21 -07:00
Shiming Li bb752c774c
LUCENE-10663: Fix KnnVectorQuery explain with multiple segments (#1050)
If there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID passed to explain is segment-local,
without the docBase. In KnnVectorQuery.DocAndScoreQuery, the doc IDs in `docs`
are offset by the docBase of each segment. So in `DocAndScoreQuery.explain`,
the segment's docBase needs to be added to the doc ID.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-07-28 10:31:49 -07:00
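
The docBase adjustment described in LUCENE-10663 above can be sketched as follows. This is an illustration under stated assumptions, not the Lucene implementation: the class and helper are hypothetical, and it assumes the hits are held as a sorted array of index-wide doc IDs while explain receives a segment-local doc ID.

```java
import java.util.Arrays;

public class ExplainDocBaseSketch {
    // Hypothetical helper: returns the position of the hit in the sorted array
    // of index-wide doc IDs, or a negative value if the doc is not a hit.
    static int findHit(int[] sortedGlobalDocIds, int docBase, int segmentLocalDocId) {
        return Arrays.binarySearch(sortedGlobalDocIds, docBase + segmentLocalDocId);
    }

    public static void main(String[] args) {
        int[] hits = {2, 7, 12}; // index-wide doc IDs, illustrative values
        int docBase = 10;        // first doc ID of the second segment, illustrative
        int localDoc = 2;        // segment-local doc ID handed to explain

        // Adding the docBase locates doc 12, the intended hit.
        System.out.println(findHit(hits, docBase, localDoc)); // 2
        // Forgetting the docBase silently matches doc 2 from another segment.
        System.out.println(findHit(hits, 0, localDoc));       // 0
    }
}
```

With a single segment the docBase is 0, so both lookups agree, which is consistent with the bug only showing up when there are multiple segments.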
Adrien Grand 0ff987562a LUCENE-10661: Move CHANGES entry to 9.4. 2022-07-27 16:20:20 +02:00
luyuncheng 107747f359
LUCENE-10661: Reduce memory copy in BytesStore (#1047) 2022-07-27 16:17:08 +02:00
Weiming Wu 2cf12b8cdc
Cache decoded bytes for TFIDFSimilarity scorer. (#1042)
Co-authored-by: Weiming Wu <wweiming@amazon.com>
2022-07-26 13:47:52 +02:00
tang donghai 94960a0aff
precompute maxlevel in LogMergePolicy (#1045) 2022-07-26 13:42:32 +02:00
Mayya Sharipova 2efc204a39 LUCENE-10592 Strengthen TestHnswGraph::testSortedAndUnsortedIndicesReturnSameResults
This test occasionally fails if knn search returns only 1 document
in the index, as we have an assertion that returned doc IDs from
sorted and unsorted index must be different.

This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
2022-07-25 09:48:43 -04:00
Greg Miller f943a57ebe Fix another TestDisiPriorityQueue bug 2022-07-22 14:32:08 -07:00
Mayya Sharipova bd06cebfc2 Add change log for LUCENE-10592 2022-07-22 12:14:58 -04:00
Mayya Sharipova fdbb76a8d7 Add next minor version 9.3.0 2022-07-22 12:01:08 -04:00
Mayya Sharipova ba4bc04271
LUCENE-10592 Build HNSW Graph on indexing (#992)
Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches, etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-07-22 11:29:28 -04:00
Mayya Sharipova bd360f9b3e
Create Lucene94 Codec and move Lucene92 to backwards_codecs (#1041) 2022-07-22 10:04:10 -04:00
Michael Sokolov 6bdeb141b7 Revert "Create Lucene93 Codec and move Lucene92 to backwards_codecs (#924)"
This reverts commit f4f4a159b7.
2022-07-21 12:52:42 -04:00
Vigya Sharma 25a842d871
LUCENE-10583: Add docstring warning to not lock on Lucene objects (#963)
* add locking warning to docstring

* git tidy
2022-07-21 06:35:17 -04:00
Greg Miller 1bc38b7d1f Fix TestDisiPriorityQueue test bug 2022-07-20 11:33:14 -07:00
Lu Xugang 39e7597f6e
LUCENE-10656: It is unnecessary to use `limit` to check the boundary (#1027) 2022-07-20 10:00:06 +08:00
Zach Chen 28ce8abb51
LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions (#1018) 2022-07-19 18:59:19 -07:00
Greg Miller 3d7d85f245
LUCENE-10653: Heapify in BMMScorer (#1022) 2022-07-19 13:49:31 -07:00
Greg Miller a35dee5b27
Small tweak to IntervalQuery#visit logic (#1007) 2022-07-19 12:27:41 -07:00
Adrien Grand 11e7fe6618 LUCENE-10657: Move CHANGES entry to 9.3. 2022-07-19 09:39:18 +02:00
luyuncheng e5bf76b843
LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput (#1034)
The abstract method copyBytes needs to copy from the input to a buffer and then write into ByteBuffersDataOutput. I think this is unnecessary: we can override it and copy directly from the input into the output.
2022-07-19 09:37:07 +02:00
hcqs33 9f80fea502
Fix error in TieredMergePolicy (#1028)
Fix error when comparing the bytes of candidates with the bytes of the max merge.
It's wrong to compare candidateSize, rather than currentCandidateBytes, with maxMergeBytes.
2022-07-19 09:21:09 +02:00
Tomoko Uchida 781edf442b
LUCENE-10557: Refine issue label texts (#1036) 2022-07-19 13:41:42 +09:00