36126 Commits

Author SHA1 Message Date
Tomoko Uchida
e61958e4fd links to github should be '/issues' 2022-08-27 11:54:20 +09:00
Dawid Weiss
4f7543725c
#11720 Upgrade randomizedtesting to 2.8.1 (#11721) 2022-08-26 00:01:57 +02:00
Mike Drob
dbc7a9764a
Add Integer awareness to RamUsageEstimator.sizeOf (#11715)
Additionally, update comments to reflect that we have not been VM cache-aware for a long time now.
2022-08-25 15:18:08 -05:00
Uwe Schindler
1d54299011
Fix classloading deadlock in analysis factories / AnalysisSPILoader initialization. This closes #11701 (#11718) 2022-08-25 18:16:04 +02:00
Tomoko Uchida
53b1ce7504
update contributing guide for GH issue (#11716) 2022-08-25 04:06:09 +09:00
Greg Miller
1529606763
Optimize TermInSetQuery for terms that match all docs in a segment (#1062) 2022-08-23 08:37:44 -07:00
Michael Sokolov
8021c2db4e Don't throw an exception for byte-encoded vectors in SimpleText codec 2022-08-22 08:29:58 -04:00
Julie Tibshirani
df67223497 Disable byte encoding in TestSimpleTextKnnVectorsFormat 2022-08-21 17:00:57 -07:00
Julie Tibshirani
653d2ebf71
Remove KnnVectorsFormat#currentVersion (#1077)
These internal versions only make sense within a codec definition, and aren't
meant to be exposed and compared across codecs. Since this method is only used
in tests, we can move the check to the test classes instead.
2022-08-21 13:09:07 -07:00
Michael Sokolov
daa56d30f0 Fix TestHnswGraph rare failure 2022-08-20 17:26:50 -04:00
Michael Sokolov
0a58318e16
Fix for bad cast when sorting a KnnVectors index over BytesRef (#1074) 2022-08-20 17:23:47 -04:00
Michael Sokolov
798c02dd70
fix VectorUtil.dotProductScore normalization (#1073) 2022-08-20 09:15:38 -04:00
Michael Sokolov
60fa19d509
don't call BitSet.cardinality() more than needed (#1075) 2022-08-20 08:40:50 -04:00
Michael Sokolov
f9680c6807
Add safety checks to KnnVectorField; fixed issue with copying BytesRef (#1076) 2022-08-20 08:38:42 -04:00
Tomoko Uchida
9ae3498f82 add notes about labels' color code 2022-08-20 13:22:50 +09:00
Julie Tibshirani
8308688d78
LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)
This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/ copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.
2022-08-19 18:04:05 -07:00
Yuting Gan
0914b537db
LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013) 2022-08-18 10:38:49 -07:00
Julie Tibshirani
7912ed02c4 Move Lucene91HnswGraphBuilder to test folder
It's only used in unit tests so it can live in the backwards_codecs tests.
2022-08-17 17:10:38 -07:00
Tomoko Uchida
8b3303b25f .asf.yaml 2022-08-16 20:02:47 +09:00
Michael Sokolov
bc214d4958 standardize exception text for vector dimension mismatch (in SimpleText codec) 2022-08-13 13:12:11 -04:00
Nick Knize
543910d900
LUCENE-10654: Fix ShapeDocValue Bounding Box failure (#1066)
The base spatial test case may create invalid self crossing polygons. These
polygons are cleaned by the tessellator which may result in an inconsistent
bounding box between the tessellated shape and the original, invalid, geometry.
This commit fixes the shape doc value test case to compute the bounding box from
the cleaned geometry instead of relying on the, potentially invalid, original
geometry.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-12 10:54:22 -05:00
Ignacio Vera
fe8d11254a
LUCENE-10678: Fix potential overflow when computing the partition point on the BKD tree (#1065)
We currently compute the partition point for a set of points by multiplying the number of nodes that need to be on
the left of the BKD tree by the maxPointsInLeafNode. This multiplication is done in int arithmetic, so if the partition point is bigger than Integer.MAX_VALUE it will overflow. This commit moves the multiplication to long arithmetic so it doesn't overflow.
2022-08-11 15:25:53 +02:00
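
The overflow described in the LUCENE-10678 entry above can be reproduced in isolation. The following is a minimal sketch, not the actual BKD tree code: the class name, method names, and the sample values are hypothetical and chosen only to push the product past Integer.MAX_VALUE.

```java
public class PartitionPointOverflow {

    // Hypothetical helper: both operands are int, so the product is computed
    // in int arithmetic and silently wraps around when it exceeds Integer.MAX_VALUE.
    static int partitionPointInt(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return numLeftLeafNodes * maxPointsInLeafNode;
    }

    // Hypothetical helper: widening one operand to long before multiplying
    // makes the whole multiplication happen in long arithmetic.
    static long partitionPointLong(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return (long) numLeftLeafNodes * maxPointsInLeafNode;
    }

    public static void main(String[] args) {
        int numLeftLeafNodes = 5_000_000;   // assumed value, for illustration only
        int maxPointsInLeafNode = 512;      // assumed value, for illustration only

        // The int version wraps to a negative number; the long version does not.
        System.out.println("int:  " + partitionPointInt(numLeftLeafNodes, maxPointsInLeafNode));
        System.out.println("long: " + partitionPointLong(numLeftLeafNodes, maxPointsInLeafNode));
    }
}
```

Note that the cast must apply to an operand, not to the finished product: `(long) (a * b)` would still overflow before widening.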
Michael Sokolov
a693fe819b
LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)
* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 bit precision
2022-08-10 17:09:07 -04:00
Vigya Sharma
59a0917e25
Fix typo in PostingsReaderBase docstring (#948)
* remove extra PostingsEnum from docstring
* add ImpactsEnum to docstring
2022-08-09 16:20:51 -07:00
Nick Knize
d7fd48c950
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-09 12:51:45 -05:00
Tomoko Uchida
0eba72f625 disable GH issue 2022-08-09 15:26:55 +09:00
tang donghai
b08e34722d
LUCENE-10646: Add some comment on LevenshteinAutomata (#1016)
* add Comment on Lev & pretty the toDot

* use auto generate scripts to add comment

* update checksum

* update checksum

* restore toDot

* add removeDeadStates in levAutomata

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2022-08-07 10:01:30 -04:00
Ignacio Vera
bd0718f071
LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox (#1056) 2022-08-04 06:47:27 +02:00
luyuncheng
34154736c6
LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data (#987) 2022-08-01 18:34:41 +02:00
Adrien Grand
04e4f317cb LUCENE-10629: Fix NullPointerException.
I hit an NPE while running tests. `Weight#scorer` may return `null`, but not
`Scorer#iterator`.
2022-08-01 14:13:22 +02:00
Shai Erera
7ac75135b9
[LUCENE-10629]: Add fast match query support to FacetSets (#1015) 2022-07-31 07:50:03 +03:00
Dawid Weiss
f93e52e5bb
LUCENE-10669: The build should be more helpful when generated resources are touched (#1053) 2022-07-30 20:45:32 +02:00
Adrien Grand
7c9d3cd6ff LUCENE-10633: Fix handling of missing values in reverse sorts. 2022-07-29 21:36:35 +02:00
Kaival Parikh
1ad28a3136
LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
passing docs, which are randomly selected. We store these in a FixedBitSet and
use it to calculate true KNN as well as in HNSW search.

In the post-filter case, we over-select results as `topK / filterSelectivity` to
get a final hit count close to the requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as the prefilter argument to KnnVectorQuery.
2022-07-29 11:21:34 -07:00
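
The post-filter over-selection in the LUCENE-10559 entry above comes down to one division. This is a rough sketch under assumptions, not the KnnGraphTester code: the class and method names are hypothetical, and rounding up with Math.ceil is an assumption (the commit message only states `topK / filterSelectivity`).

```java
public class PostFilterOverSelection {

    // Hypothetical helper: how many hits to request before filtering so that
    // roughly topK of them are expected to survive a filter that passes a
    // fraction filterSelectivity of all docs.
    static int overSelectedTopK(int topK, double filterSelectivity) {
        if (filterSelectivity <= 0.0 || filterSelectivity > 1.0) {
            throw new IllegalArgumentException("filterSelectivity must be in (0, 1]");
        }
        // Rounding up is an assumption made for this sketch.
        return (int) Math.ceil(topK / filterSelectivity);
    }

    public static void main(String[] args) {
        // With a filter that passes a quarter of the docs, request four times as many hits.
        System.out.println(overSelectedTopK(10, 0.25));
        // A filter that passes everything needs no over-selection.
        System.out.println(overSelectedTopK(10, 1.0));
    }
}
```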
Adrien Grand
eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
iverase
e1d2005df4 Add back-compat indices for 9.3.0 2022-07-29 10:13:20 +02:00
iverase
52a41d702f DOAP changes for release 9.3.0 2022-07-29 09:40:53 +02:00
Greg Miller
4ebc249dbc
Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto (#1020) 2022-07-28 11:12:21 -07:00
Shiming Li
bb752c774c
LUCENE-10663: Fix KnnVectorQuery explain with multiple segments (#1050)
If there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID passed to explain is segment-local,
without the segment's docBase, while the doc IDs held by
KnnVectorQuery.DocAndScoreQuery are offset by each segment's docBase. So
`DocAndScoreQuery.explain` needs to add the segment's docBase to the doc ID.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-07-28 10:31:49 -07:00
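
The docBase arithmetic behind the LUCENE-10663 entry above can be illustrated without any Lucene classes. This is a minimal sketch assuming the top hits are kept as a sorted array of index-wide doc IDs; the class name, method names, and sample values are hypothetical and are not taken from KnnVectorQuery.

```java
import java.util.Arrays;

public class DocBaseLookup {

    // Converts a segment-local doc ID to an index-wide doc ID by adding
    // the segment's docBase (the number of docs in all preceding segments).
    static int toGlobalDocId(int docBase, int segmentLocalDocId) {
        return docBase + segmentLocalDocId;
    }

    // Hypothetical lookup: sortedGlobalDocIds holds index-wide doc IDs of the
    // top hits in ascending order. Returns the position of the hit, or a
    // negative value if the doc is not among the hits.
    static int findHit(int[] sortedGlobalDocIds, int docBase, int segmentLocalDocId) {
        return Arrays.binarySearch(sortedGlobalDocIds, toGlobalDocId(docBase, segmentLocalDocId));
    }

    public static void main(String[] args) {
        int[] hits = {3, 105, 260};   // assumed index-wide doc IDs of the top hits
        int docBase = 100;            // assumed docBase of the second segment
        int localDoc = 5;             // doc ID relative to that segment

        // Searching with the segment-local ID alone misses the hit.
        System.out.println(Arrays.binarySearch(hits, localDoc) >= 0);
        // Adding the docBase first finds it.
        System.out.println(findHit(hits, docBase, localDoc) >= 0);
    }
}
```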
Adrien Grand
0ff987562a LUCENE-10661: Move CHANGES entry to 9.4. 2022-07-27 16:20:20 +02:00
luyuncheng
107747f359
LUCENE-10661: Reduce memory copy in BytesStore (#1047) 2022-07-27 16:17:08 +02:00
Weiming Wu
2cf12b8cdc
Cache decoded bytes for TFIDFSimilarity scorer. (#1042)
Co-authored-by: Weiming Wu <wweiming@amazon.com>
2022-07-26 13:47:52 +02:00
tang donghai
94960a0aff
precompute maxlevel in LogMergePolicy (#1045) 2022-07-26 13:42:32 +02:00
Mayya Sharipova
2efc204a39 LUCENE-10592 Strengthen TestHnswGraph::testSortedAndUnsortedIndicesReturnSameResults
This test occasionally fails if knn search returns only 1 document
in the index, as we have an assertion that returned doc IDs from
sorted and unsorted index must be different.

This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
2022-07-25 09:48:43 -04:00
Greg Miller
f943a57ebe Fix another TestDisiPriorityQueue bug 2022-07-22 14:32:08 -07:00
Mayya Sharipova
bd06cebfc2 Add change log for LUCENE-10592 2022-07-22 12:14:58 -04:00
Mayya Sharipova
fdbb76a8d7 Add next minor version 9.3.0 2022-07-22 12:01:08 -04:00
Mayya Sharipova
ba4bc04271
LUCENE-10592 Build HNSW Graph on indexing (#992)
Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches, etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-07-22 11:29:28 -04:00
Mayya Sharipova
bd360f9b3e
Create Lucene94 Codec and move Lucene92 to backwards_codecs (#1041) 2022-07-22 10:04:10 -04:00
Michael Sokolov
6bdeb141b7 Revert "Create Lucene93 Codec and move Lucene92 to backwards_codecs (#924)"
This reverts commit f4f4a159b77ffca974c003aba5d6b33a3b40be97.
2022-07-21 12:52:42 -04:00