Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.
`filterSelectivity` expresses the selectivity of a filter as proportion of
passing docs that are randomly selected. We store these in a FixedBitSet and
use this to calculate true KNN as well as in HNSW search.
In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
If there are multiple segments. KnnVectorQuery explain has a bug in locating
the doc ID. This is because the doc ID in explain is the docBase without the
segment. In KnnVectorQuery.DocAndScoreQuery docs docid is increased in each
segment of the docBase. So, in the 'DocAndScoreQuery.explain', needs to be
added with the segment's docBase.
Co-authored-by: Julie Tibshirani <julietibs@apache.org>
This test occasionally fails if knn search returns only 1 document
in the index, as we have an assertion that returned doc IDs from
sorted and unsorted index must be different.
This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and trigged
by memory used, presence of concurrent searches etc.
Building an HNSW graph as we index documents avoid these problems,
as the load of HNSW graph construction is spread evenly during indexing.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Abstract method copyBytes need to copy from input to a buffer and then write into ByteBuffersDataOutput, i think there is unnecessary, we can override it, copy directly from input into output
Fix error in comparing between bytes of candidates and bytes of max merge.
It's wrong to use candidateSize rather than currentCandidateBytes comparing with maxMergeBytes.
This method is called from `addIndexes` and should be synchronized so that it
would see consistent data structures in case of concurrent indexing that would
be introducing new fields.
I hit a rare test failure of `TestIndexRearranger` that I can only explain by this lack of locking:
```
15:40:14 > java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14 > at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
15:40:14 > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
15:40:14 > at org.apache.lucene.misc.index.IndexRearranger.execute(IndexRearranger.java:98)
15:40:14 > at org.apache.lucene.misc.index.TestIndexRearranger.testRearrangeUsingBinaryDocValueSelector(TestIndexRearranger.java:97)
15:40:14 > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
15:40:14 > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
15:40:14 > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15:40:14 > at java.base/java.lang.reflect.Method.invoke(Method.java:568)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14 > at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14 > at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
15:40:14 > at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14 > at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
15:40:14 > at java.base/java.lang.Thread.run(Thread.java:833)
15:40:14 >
15:40:14 > Caused by:
15:40:14 > java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14 > at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:459)
15:40:14 > at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifyFieldInfo(FieldInfos.java:359)
15:40:14 > at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:3149)
15:40:14 > at org.apache.lucene.misc.index.IndexRearranger.addOneSegment(IndexRearranger.java:139)
15:40:14 > at org.apache.lucene.misc.index.IndexRearranger.lambda$execute$0(IndexRearranger.java:92)
15:40:14 > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
15:40:14 > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
15:40:14 > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
15:40:14 > ... 1 more
```
* Use merge policy and merge scheduler to run addIndexes merges
* wrapped reader does not see deletes - debug
* Partially fixed tests in TestAddIndexes
* Use writer object to invoke addIndexes merge
* Use merge object info
* Add javadocs for new methods
* TestAddIndexes passing
* verify field info schemas upfront from incoming readers
* rename flag to track pooled readers
* Keep addIndexes API transactional
* Maintain transactionality - register segments with iw after all merges complete
* fix checkstyle
* PR comments
* Fix pendingDocs - numDocs mismatch bug
* Tests with 1-1 merges and partial merge failures
* variable renaming and better comments
* add test for partial merge failures. change tests to use 1-1 findmerges
* abort pending merges gracefully
* test null and empty merge specs
* test interim files are deleted
* test with empty readers
* test cascading merges triggered
* remove nocommits
* gradle check errors
* remove unused line
* remove printf
* spotless apply
* update TestIndexWriterOnDiskFull to accept mergeException from failing addIndexes calls
* return singleton reader mergespec in NoMergePolicy
* rethrow exceptions seen in merge threads on failure
* spotless apply
* update test to new exception type thrown
* spotlessApply
* test for maxDoc limit in IndexWriter
* spotlessApply
* Use DocValuesIterator instead of DocValuesFieldExistsQuery for counting soft deletes
* spotless apply
* change exception message for closed IW
* remove non-essential comments
* update api doc string
* doc string update
* spotless
* Changes file entry
* simplify findMerges API, add 1-1 merges to MockRandomMergePolicy
* update merge policies to new api
* remove unused imports
* spotless apply
* move changes entry to end of list
* fix testAddIndicesWithSoftDeletes
* test with 1-1 merge policy always enabled
* please spotcheck
* tidy
* test - never use 1-1 merge policy
* use 1-1 merge policy randomly
* Remove concurrent addIndexes findMerges from MockRandomMergePolicy
* Bug Fix: RuntimeException in addIndexes
Aborted pending merges were slipping through the merge exception check in
API, and getting caught later in the RuntimeException check.
* tidy
* Rebase on main. Move changes to 10.0
* Synchronize IW.AddIndexesMergeSource on outer class IW object
* tidy
I noticed some minor bugs in the original PR #927 that this PR should fix:
- When a timeout is set, we would no longer catch
`CollectionTerminatedException`.
- I added randomization to `LuceneTestCase` to randomly set a timeout, it
would have caught the above bug.
- Fixed visibility of `TimeLimitingBulkScorer`.