36184 Commits

Author SHA1 Message Date
Michael Sokolov 1649964f07 Forward-port CHANGES entry for quantized HNSW vectors from 9.x branch 2022-09-01 09:53:46 -04:00
Tomoko Uchida fd86968fee
remove a link to old Jira in README. 2022-09-01 00:41:56 +09:00
Mayya Sharipova 554fabf682
LUCENE-10633 Disable sort optimization for SortedSetSortField (#3125)
Add ability to SortedSetSortField to disable sort optimization
2022-08-30 16:52:28 -04:00
Michael Sokolov 61ef031f7f
SimpleText knn vectors; fix searchExhaustively and suppress a byte format test case (#11725) 2022-08-29 11:49:52 -04:00
Tomoko Uchida 29f94b0404
a bit of clarification about GitHub Milestone 2022-08-28 13:52:58 +09:00
Tomoko Uchida 6d664ccd95 adjust wording 2022-08-27 13:02:48 +09:00
Tomoko Uchida 09a7f9aa53 clarify the relation between CHANGES and Milestone 2022-08-27 12:58:33 +09:00
Tomoko Uchida 224953304c
Document about Milestone for release planning (#11723) 2022-08-27 12:29:40 +09:00
Tomoko Uchida e61958e4fd links to github should be '/issues' 2022-08-27 11:54:20 +09:00
Dawid Weiss 4f7543725c
#11720 Upgrade randomizedtesting to 2.8.1 (#11721) 2022-08-26 00:01:57 +02:00
Mike Drob dbc7a9764a
Add Integer awareness to RamUsageEstimator.sizeOf (#11715)
Additionally, update comments to reflect that we have not been VM cache-aware for a long time now.
2022-08-25 15:18:08 -05:00
Uwe Schindler 1d54299011
Fix classloading deadlock in analysis factories / AnalysisSPILoader initialization. This closes #11701 (#11718) 2022-08-25 18:16:04 +02:00
Tomoko Uchida 53b1ce7504
update contributing guide for GH issue (#11716) 2022-08-25 04:06:09 +09:00
Greg Miller 1529606763
Optimize TermInSetQuery for terms that match all docs in a segment (#1062) 2022-08-23 08:37:44 -07:00
Michael Sokolov 8021c2db4e Don't throw an exception for byte-encoded vectors in SimpleText codec 2022-08-22 08:29:58 -04:00
Julie Tibshirani df67223497 Disable byte encoding in TestSimpleTextKnnVectorsFormat 2022-08-21 17:00:57 -07:00
Julie Tibshirani 653d2ebf71
Remove KnnVectorsFormat#currentVersion (#1077)
These internal versions only make sense within a codec definition, and aren't
meant to be exposed and compared across codecs. Since this method is only used
in tests, we can move the check to the test classes instead.
2022-08-21 13:09:07 -07:00
Michael Sokolov daa56d30f0 Fix TestHnswGraph rare failure 2022-08-20 17:26:50 -04:00
Michael Sokolov 0a58318e16
Fix for bad cast when sorting a KnnVectors index over BytesRef (#1074) 2022-08-20 17:23:47 -04:00
Michael Sokolov 798c02dd70
fix VectorUtil.dotProductScore normalization (#1073) 2022-08-20 09:15:38 -04:00
Michael Sokolov 60fa19d509
don't call BitSet.cardinality() more than needed (#1075) 2022-08-20 08:40:50 -04:00
Michael Sokolov f9680c6807
Add safety checks to KnnVectorField; fixed issue with copying BytesRef (#1076) 2022-08-20 08:38:42 -04:00
Tomoko Uchida 9ae3498f82 add notes about labels' color code 2022-08-20 13:22:50 +09:00
Julie Tibshirani 8308688d78
LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)
This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/ copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.
2022-08-19 18:04:05 -07:00
Yuting Gan 0914b537db
LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013) 2022-08-18 10:38:49 -07:00
Julie Tibshirani 7912ed02c4 Move Lucene91HnswGraphBuilder to test folder
It's only used in unit tests so it can live in the backwards_codecs tests.
2022-08-17 17:10:38 -07:00
Tomoko Uchida 8b3303b25f .asf.yaml 2022-08-16 20:02:47 +09:00
Michael Sokolov bc214d4958 standardize exception text for vector dimension mismatch (in SimpleText codec) 2022-08-13 13:12:11 -04:00
Nick Knize 543910d900
LUCENE-10654: Fix ShapeDocValue Bounding Box failure (#1066)
The base spatial test case may create invalid self crossing polygons. These
polygons are cleaned by the tessellator which may result in an inconsistent
bounding box between the tessellated shape and the original, invalid, geometry.
This commit fixes the shape doc value test case to compute the bounding box from
the cleaned geometry instead of relying on the, potentially invalid, original
geometry.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-12 10:54:22 -05:00
Ignacio Vera fe8d11254a
LUCENE-10678: Fix potential overflow when computing the partition point on the BKD tree (#1065)
We currently compute the partition point for a set of points by multiplying the number of nodes that need to be on the left of the BKD tree by maxPointsInLeafNode. This multiplication is done in int arithmetic, so it overflows if the partition point is bigger than Integer.MAX_VALUE. This commit moves the multiplication to long arithmetic so it doesn't overflow.
2022-08-11 15:25:53 +02:00
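
The overflow described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the actual BKD code: the class name, method names, and sample values are hypothetical, and only `maxPointsInLeafNode` is taken from the message. It shows that widening the result to `long` after an `int * int` multiplication is too late, and that one operand must be cast first.

```java
public class PartitionPointOverflow {

    // Hypothetical "before" version: both operands are int, so the product
    // is computed in 32 bits and wraps before it is widened to long.
    static long partitionPointInt(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return numLeftLeafNodes * maxPointsInLeafNode;
    }

    // Hypothetical "after" version: casting one operand to long first makes
    // the whole multiplication happen in 64 bits.
    static long partitionPointLong(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return (long) numLeftLeafNodes * maxPointsInLeafNode;
    }

    public static void main(String[] args) {
        int numLeftLeafNodes = 5_000_000; // illustrative value only
        int maxPointsInLeafNode = 512;    // illustrative value only

        // 5,000,000 * 512 = 2,560,000,000, which exceeds Integer.MAX_VALUE.
        System.out.println("int math:  " + partitionPointInt(numLeftLeafNodes, maxPointsInLeafNode));
        System.out.println("long math: " + partitionPointLong(numLeftLeafNodes, maxPointsInLeafNode));
    }
}
```

With these sample values the int version yields a negative number, while the long version yields 2,560,000,000.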
Michael Sokolov a693fe819b
LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)
* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 bit precision
2022-08-10 17:09:07 -04:00
Vigya Sharma 59a0917e25
Fix typo in PostingsReaderBase docstring (#948)
* remove extra PostingsEnum from docstring
* add ImpactsEnum to docstring
2022-08-09 16:20:51 -07:00
Nick Knize d7fd48c950
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-09 12:51:45 -05:00
Tomoko Uchida 0eba72f625 disable GH issue 2022-08-09 15:26:55 +09:00
tang donghai b08e34722d
LUCENE-10646: Add some comment on LevenshteinAutomata (#1016)
* add Comment on Lev & pretty the toDot

* use auto generate scripts to add comment

* update checksum

* update checksum

* restore toDot

* add removeDeadStates in levAutomata

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2022-08-07 10:01:30 -04:00
Ignacio Vera bd0718f071
LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox (#1056) 2022-08-04 06:47:27 +02:00
luyuncheng 34154736c6
LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data (#987) 2022-08-01 18:34:41 +02:00
Adrien Grand 04e4f317cb LUCENE-10629: Fix NullPointerException.
I hit a NPE while running tests. `Weight#scorer` may return `null`, but not
`Scorer#iterator`.
2022-08-01 14:13:22 +02:00
Shai Erera 7ac75135b9
[LUCENE-10629]: Add fast match query support to FacetSets (#1015) 2022-07-31 07:50:03 +03:00
Dawid Weiss f93e52e5bb
LUCENE-10669: The build should be more helpful when generated resources are touched (#1053) 2022-07-30 20:45:32 +02:00
Adrien Grand 7c9d3cd6ff LUCENE-10633: Fix handling of missing values in reverse sorts. 2022-07-29 21:36:35 +02:00
Kaival Parikh 1ad28a3136
LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as proportion of
passing docs that are randomly selected. We store these in a FixedBitSet and
use this to calculate true KNN as well as in HNSW search.

In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
2022-07-29 11:21:34 -07:00
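
The post-filter over-selection rule in this commit message (`topK / filterSelectivity`) can be sketched as below. This is an illustration under assumptions, not the KnnGraphTester code: the class and method names are hypothetical, and rounding up and clamping to `Integer.MAX_VALUE` are choices made for this sketch that the message does not specify.

```java
public class PostFilterOverSelect {

    // Returns how many hits to request from the unfiltered search so that,
    // after dropping hits rejected by the filter, roughly topK remain.
    // filterSelectivity is the proportion of docs that pass the filter.
    static int overSelectedK(int topK, float filterSelectivity) {
        if (topK <= 0) {
            throw new IllegalArgumentException("topK must be positive");
        }
        if (!(filterSelectivity > 0f && filterSelectivity <= 1f)) {
            throw new IllegalArgumentException("filterSelectivity must be in (0, 1]");
        }
        // Assumption of this sketch: round up, then clamp to the int range.
        double k = Math.ceil(topK / (double) filterSelectivity);
        return (int) Math.min(k, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Half of the docs pass the filter, so ask for twice as many hits.
        System.out.println(overSelectedK(10, 0.5f));
        // A quarter of the docs pass, so ask for four times as many.
        System.out.println(overSelectedK(10, 0.25f));
    }
}
```

The over-selected count is only an expectation: because the passing docs are chosen at random, the number of hits that survive the filter can still fall slightly above or below `topK`.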
Adrien Grand eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
iverase e1d2005df4 Add back-compat indices for 9.3.0 2022-07-29 10:13:20 +02:00
iverase 52a41d702f DOAP changes for release 9.3.0 2022-07-29 09:40:53 +02:00
Greg Miller 4ebc249dbc
Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto (#1020) 2022-07-28 11:12:21 -07:00
Shiming Li bb752c774c
LUCENE-10663: Fix KnnVectorQuery explain with multiple segments (#1050)
If there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID passed to explain is relative to the
segment and does not include the segment's docBase, whereas the doc IDs held in
KnnVectorQuery.DocAndScoreQuery are offset by each segment's docBase. So
'DocAndScoreQuery.explain' needs to add the segment's docBase to the doc ID.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-07-28 10:31:49 -07:00
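
The doc ID mismatch described in this commit message can be illustrated without any Lucene classes. The sketch below is hypothetical throughout (class name, method name, sample IDs) and assumes that hits are kept as a sorted array of index-wide doc IDs and looked up by binary search. It shows why a segment-relative ID must be offset by the segment's docBase before the lookup.

```java
import java.util.Arrays;

public class SegmentDocIdLookup {

    // Looks up a hit by converting the segment-relative doc ID to an
    // index-wide one first. Returns the position in sortedGlobalDocs,
    // or a negative value if the doc is not among the hits.
    static int findHit(int[] sortedGlobalDocs, int docBase, int segmentDoc) {
        return Arrays.binarySearch(sortedGlobalDocs, docBase + segmentDoc);
    }

    public static void main(String[] args) {
        // Hits stored as index-wide doc IDs (illustrative values only).
        int[] sortedGlobalDocs = {3, 105, 210};

        // Second segment starts at docBase 100; explain is asked about
        // segment-relative doc 5, which is index-wide doc 105.
        int docBase = 100;
        int segmentDoc = 5;

        // Without the offset, the lookup searches for doc 5 and misses.
        System.out.println(Arrays.binarySearch(sortedGlobalDocs, segmentDoc) >= 0);
        // With the offset, the lookup finds doc 105 at position 1.
        System.out.println(findHit(sortedGlobalDocs, docBase, segmentDoc));
    }
}
```

In a single-segment index the docBase is 0, so both lookups agree, which is consistent with the bug only appearing when there are multiple segments.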
Adrien Grand 0ff987562a LUCENE-10661: Move CHANGES entry to 9.4. 2022-07-27 16:20:20 +02:00
luyuncheng 107747f359
LUCENE-10661: Reduce memory copy in BytesStore (#1047) 2022-07-27 16:17:08 +02:00
Weiming Wu 2cf12b8cdc
Cache decoded bytes for TFIDFSimilarity scorer. (#1042)
Co-authored-by: Weiming Wu <wweiming@amazon.com>
2022-07-26 13:47:52 +02:00