Commit Graph

36105 Commits

Author SHA1 Message Date
Ignacio Vera fe8d11254a
LUCENE-10678: Fix potential overflow when computing the partition point on the BKD tree (#1065)
We currently compute the partition point for a set of points by multiplying the number of nodes that need to be on the left of the BKD tree by maxPointsInLeafNode. This multiplication is done in integer space, so if the partition point is bigger than Integer.MAX_VALUE it overflows. This commit moves the multiplication to long space so that it doesn't overflow.
2022-08-11 15:25:53 +02:00
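The overflow described in this commit can be reproduced in isolation. The sketch below is only an illustration, not the BKD writer code: the class name, method names and sample values are hypothetical, and only the identifier maxPointsInLeafNode comes from the commit message.

```java
final class PartitionPointSketch {
    // Hypothetical helper: both operands are int, so the product is computed
    // in int space and wraps around before it is widened to long.
    static long partitionPointInt(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return numLeftLeafNodes * maxPointsInLeafNode;
    }

    // Widening one operand first makes the whole multiplication happen in long space.
    static long partitionPointLong(int numLeftLeafNodes, int maxPointsInLeafNode) {
        return (long) numLeftLeafNodes * maxPointsInLeafNode;
    }

    public static void main(String[] args) {
        int numLeftLeafNodes = 4_194_304; // 2^22, an assumed value for illustration
        int maxPointsInLeafNode = 512;    // 2^9, an assumed value for illustration
        // 2^22 * 2^9 = 2^31, which is one more than Integer.MAX_VALUE.
        System.out.println(partitionPointInt(numLeftLeafNodes, maxPointsInLeafNode));  // -2147483648
        System.out.println(partitionPointLong(numLeftLeafNodes, maxPointsInLeafNode)); // 2147483648
    }
}
```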
Michael Sokolov a693fe819b
LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)
* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 bit precision
2022-08-10 17:09:07 -04:00
Vigya Sharma 59a0917e25
Fix typo in PostingsReaderBase docstring (#948)
* remove extra PostingsEnum from docstring
* add ImpactsEnum to docstring
2022-08-09 16:20:51 -07:00
Nick Knize d7fd48c950
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-09 12:51:45 -05:00
Tomoko Uchida 0eba72f625 disable GH issue 2022-08-09 15:26:55 +09:00
tang donghai b08e34722d
LUCENE-10646: Add some comment on LevenshteinAutomata (#1016)
* add Comment on Lev & pretty the toDot

* use auto generate scripts to add comment

* update checksum

* update checksum

* restore toDot

* add removeDeadStates in levAutomata

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2022-08-07 10:01:30 -04:00
Ignacio Vera bd0718f071
LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox (#1056) 2022-08-04 06:47:27 +02:00
luyuncheng 34154736c6
LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data (#987) 2022-08-01 18:34:41 +02:00
Adrien Grand 04e4f317cb LUCENE-10629: Fix NullPointerException.
I hit an NPE while running tests. `Weight#scorer` may return `null`, but
`Scorer#iterator` may not.
2022-08-01 14:13:22 +02:00
Shai Erera 7ac75135b9
[LUCENE-10629]: Add fast match query support to FacetSets (#1015) 2022-07-31 07:50:03 +03:00
Dawid Weiss f93e52e5bb
LUCENE-10669: The build should be more helpful when generated resources are touched (#1053) 2022-07-30 20:45:32 +02:00
Adrien Grand 7c9d3cd6ff LUCENE-10633: Fix handling of missing values in reverse sorts. 2022-07-29 21:36:35 +02:00
Kaival Parikh 1ad28a3136
LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
Added `prefilter` and `filterSelectivity` arguments to KnnGraphTester to be
able to compare pre- and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
passing docs, which are randomly selected. We store these in a FixedBitSet and
use this to calculate true KNN as well as in HNSW search.

In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
2022-07-29 11:21:34 -07:00
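The post-filter over-selection rule from this commit message, `topK / filterSelectivity`, is simple arithmetic. The sketch below is a minimal illustration and not the KnnGraphTester code: the class and method names are hypothetical, and rounding up plus the range check are assumptions the commit message does not state.

```java
final class PostFilterSketch {
    // Number of hits to request before post-filtering, so that roughly topK
    // hits remain once a filter with the given selectivity has been applied.
    static int overSelectedTopK(int topK, double filterSelectivity) {
        if (filterSelectivity <= 0.0 || filterSelectivity > 1.0) {
            throw new IllegalArgumentException("filterSelectivity must be in (0, 1]");
        }
        // Rounding up is an assumption made for this sketch.
        return (int) Math.ceil(topK / filterSelectivity);
    }

    public static void main(String[] args) {
        // With a filter that passes a quarter of the docs, ask for four times as many hits.
        System.out.println(overSelectedTopK(10, 0.25)); // 40
    }
}
```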
Adrien Grand eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
iverase e1d2005df4 Add back-compat indices for 9.3.0 2022-07-29 10:13:20 +02:00
iverase 52a41d702f DOAP changes for release 9.3.0 2022-07-29 09:40:53 +02:00
Greg Miller 4ebc249dbc
Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto (#1020) 2022-07-28 11:12:21 -07:00
Shiming Li bb752c774c
LUCENE-10663: Fix KnnVectorQuery explain with multiple segments (#1050)
When there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID passed to explain is relative to the
segment and does not include the docBase, whereas the doc IDs stored in
KnnVectorQuery.DocAndScoreQuery are offset by each segment's docBase. So
'DocAndScoreQuery.explain' needs to add the segment's docBase to the doc ID.

Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-07-28 10:31:49 -07:00
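The doc ID mismatch described in this commit can be sketched with plain arrays. This is a hypothetical illustration rather than the KnnVectorQuery code: the class, method and parameter names are assumptions, and the only idea taken from the commit message is that the segment's docBase must be added before the lookup.

```java
import java.util.Arrays;

final class ExplainDocBaseSketch {
    // globalDocs holds index-wide doc IDs in ascending order; scores is parallel to it.
    // segmentDocId is relative to the segment, so docBase is added before the lookup.
    static float scoreFor(int[] globalDocs, float[] scores, int docBase, int segmentDocId) {
        int found = Arrays.binarySearch(globalDocs, docBase + segmentDocId);
        return found < 0 ? Float.NaN : scores[found]; // NaN stands for "not a hit" in this sketch
    }

    public static void main(String[] args) {
        int[] globalDocs = {3, 12, 25};
        float[] scores = {0.9f, 0.8f, 0.7f};
        // Doc 2 of a segment whose docBase is 10 is global doc 12.
        System.out.println(scoreFor(globalDocs, scores, 10, 2)); // 0.8
        // Omitting the docBase looks up global doc 2, which is not a hit.
        System.out.println(scoreFor(globalDocs, scores, 0, 2));  // NaN
    }
}
```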
Adrien Grand 0ff987562a LUCENE-10661: Move CHANGES entry to 9.4. 2022-07-27 16:20:20 +02:00
luyuncheng 107747f359
LUCENE-10661: Reduce memory copy in BytesStore (#1047) 2022-07-27 16:17:08 +02:00
Weiming Wu 2cf12b8cdc
Cache decoded bytes for TFIDFSimilarity scorer. (#1042)
Co-authored-by: Weiming Wu <wweiming@amazon.com>
2022-07-26 13:47:52 +02:00
tang donghai 94960a0aff
precompute maxlevel in LogMergePolicy (#1045) 2022-07-26 13:42:32 +02:00
Mayya Sharipova 2efc204a39 LUCENE-10592 Strengthen TestHnswGraph::testSortedAndUnsortedIndicesReturnSameResults
This test occasionally fails if knn search returns only 1 document
in the index, as we have an assertion that returned doc IDs from
sorted and unsorted index must be different.

This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
2022-07-25 09:48:43 -04:00
Greg Miller f943a57ebe Fix another TestDisiPriorityQueue bug 2022-07-22 14:32:08 -07:00
Mayya Sharipova bd06cebfc2 Add change log for LUCENE-10592 2022-07-22 12:14:58 -04:00
Mayya Sharipova fdbb76a8d7 Add next minor version 9.3.0 2022-07-22 12:01:08 -04:00
Mayya Sharipova ba4bc04271
LUCENE-10592 Build HNSW Graph on indexing (#992)
Currently, when indexing knn vectors, we buffer them in memory and
build an HNSW graph on flush, during segment construction.
As building an HNSW graph is very expensive, this makes the flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable: some indexing operations return
almost instantly while others that trigger a flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-07-22 11:29:28 -04:00
Mayya Sharipova bd360f9b3e
Create Lucene94 Codec and move Lucene92 to backwards_codecs (#1041) 2022-07-22 10:04:10 -04:00
Michael Sokolov 6bdeb141b7 Revert "Create Lucene93 Codec and move Lucene92 to backwards_codecs (#924)"
This reverts commit f4f4a159b7.
2022-07-21 12:52:42 -04:00
Vigya Sharma 25a842d871
LUCENE-10583: Add docstring warning to not lock on Lucene objects (#963)
* add locking warning to docstring

* git tidy
2022-07-21 06:35:17 -04:00
Greg Miller 1bc38b7d1f Fix TestDisiPriorityQueue test bug 2022-07-20 11:33:14 -07:00
Lu Xugang 39e7597f6e
LUCENE-10656: It is unnecessary to use `limit` to check the boundary (#1027) 2022-07-20 10:00:06 +08:00
Zach Chen 28ce8abb51
LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions (#1018) 2022-07-19 18:59:19 -07:00
Greg Miller 3d7d85f245
LUCENE-10653: Heapify in BMMScorer (#1022) 2022-07-19 13:49:31 -07:00
Greg Miller a35dee5b27
Small tweak to IntervalQuery#visit logic (#1007) 2022-07-19 12:27:41 -07:00
Adrien Grand 11e7fe6618 LUCENE-10657: Move CHANGES entry to 9.3. 2022-07-19 09:39:18 +02:00
luyuncheng e5bf76b843
LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput (#1034)
The abstract method copyBytes needs to copy from the input to a buffer and then write into ByteBuffersDataOutput. I think this is unnecessary: we can override it and copy directly from the input into the output.
2022-07-19 09:37:07 +02:00
hcqs33 9f80fea502
Fix error in TieredMergePolicy (#1028)
Fix an error in the comparison between the bytes of candidates and the maximum merge bytes.
It is wrong to compare candidateSize, rather than currentCandidateBytes, with maxMergeBytes.
2022-07-19 09:21:09 +02:00
Tomoko Uchida 781edf442b
LUCENE-10557: Refine issue label texts (#1036) 2022-07-19 13:41:42 +09:00
Adrien Grand 216e38a159
Synchronize FieldInfos#verifyFieldInfos. (#1019)
This method is called from `addIndexes` and should be synchronized so that it
would see consistent data structures in case of concurrent indexing that would
be introducing new fields.

I hit a rare test failure of `TestIndexRearranger` that I can only explain by this lack of locking:

```
15:40:14    >     java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14    >         at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
15:40:14    >         at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
15:40:14    >         at org.apache.lucene.misc.index.IndexRearranger.execute(IndexRearranger.java:98)
15:40:14    >         at org.apache.lucene.misc.index.TestIndexRearranger.testRearrangeUsingBinaryDocValueSelector(TestIndexRearranger.java:97)
15:40:14    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
15:40:14    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
15:40:14    >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15:40:14    >         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14    >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
15:40:14    >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
15:40:14    >         at java.base/java.lang.Thread.run(Thread.java:833)
15:40:14    >
15:40:14    >         Caused by:
15:40:14    >         java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:459)
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifyFieldInfo(FieldInfos.java:359)
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:3149)
15:40:14    >             at org.apache.lucene.misc.index.IndexRearranger.addOneSegment(IndexRearranger.java:139)
15:40:14    >             at org.apache.lucene.misc.index.IndexRearranger.lambda$execute$0(IndexRearranger.java:92)
15:40:14    >             at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
15:40:14    >             at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
15:40:14    >             at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
15:40:14    >             ... 1 more
```
2022-07-18 16:17:29 +02:00
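The locking argument in this commit can be illustrated with a small registry whose reader and writer share one monitor. This is a hypothetical sketch and not the FieldInfos code: the class, fields and methods are assumptions chosen only to show why a verification method that reads several related maps should be synchronized with the method that updates them.

```java
import java.util.HashMap;
import java.util.Map;

final class FieldRegistrySketch {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private final Map<Integer, Integer> dimensionsByNumber = new HashMap<>();

    // Both maps are updated under the same lock, so no reader that also holds
    // the lock can observe a field number without its properties.
    synchronized int addField(String name, int numDimensions) {
        Integer existing = numberByName.get(name);
        if (existing != null) {
            return existing;
        }
        int number = numberByName.size();
        numberByName.put(name, number);
        dimensionsByNumber.put(number, numDimensions);
        return number;
    }

    // Without synchronized, a concurrent addField could be seen half-applied here.
    synchronized void verifyField(String name, int numDimensions) {
        Integer number = numberByName.get(name);
        if (number == null) {
            return; // unknown field, nothing to verify in this sketch
        }
        int known = dimensionsByNumber.get(number);
        if (known != numDimensions) {
            throw new IllegalArgumentException(
                "field " + name + " has " + known + " dimensions, not " + numDimensions);
        }
    }
}
```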
Adrien Grand 0364402b30 Fix rare test failures of TestIndexSorting.
Sometimes the final merge might not require sorting depending on the merge
policy configuration.
2022-07-18 15:26:08 +02:00
Vigya Sharma 30a7c52e6c
LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025) 2022-07-18 14:32:38 +02:00
Tomoko Uchida 8938e6a3fa
LUCENE-10557: Add GitHub issue templates (#1024) 2022-07-18 15:33:00 +09:00
Greg Miller 9b185b99c4
LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition (#1021) 2022-07-13 20:02:17 -07:00
Vigya Sharma ca7917472b
LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions (#1012)
* Fix failures in TestAssertingPointsFormat.testWithExceptions

* remove redundant finally block

* tidy

* remove TODO as it is done now
2022-07-13 13:55:08 -04:00
Christine Poerschke 56462b5f96
LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821) 2022-07-13 18:43:31 +01:00
Greg Miller 7c35311f29
Specialize ordinal encoding for SortedSetDocValues (#1010) 2022-07-12 18:55:54 -07:00
tang donghai d7c2def019
LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966) 2022-07-12 17:14:37 +02:00
Greg Miller d6dbe4374a Move LUCENE-10614 CHANGES entry to 10.0 and add MIGRATE entry 2022-07-11 09:10:58 -07:00
Yuting Gan 5ef7e5025d
LUCENE-10614: Properly support getTopChildren in RangeFacetCounts (#974) 2022-07-11 09:04:46 -07:00