Commit Graph

13930 Commits

Author SHA1 Message Date
Jim Ferenczi c26b0180bd
Introduce a random vector scorer in HNSW builder/searcher (#12529)
This PR involves the refactoring of the HNSW builder and searcher, aiming to create an abstraction for the random access and vector comparisons conducted during graph traversal.

The newly added RandomVectorScorer provides a means to directly compare ordinals, eliminating the need to expose the raw vector primitive type.
This scorer takes charge of vector retrieval and comparison during the graph's construction and search processes.

The primary purpose of this abstraction is to enable the implementation of various strategies.
For example, it opens the door to constructing the graph using the original float vectors while performing searches using their quantized int8 vector counterparts.
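
As a rough illustration of the idea, the sketch below scores vectors by ordinal so that traversal-style code never touches the raw `float[]` or `byte[]` values. The names `OrdinalScorer`, `OrdinalScorerSketch`, and `bestOrdinal` are hypothetical and are not the API added by this PR; a plain dot product stands in for whatever similarity is configured.

```java
// Hypothetical sketch; none of these names come from Lucene.
interface OrdinalScorer {
  // Similarity between the query and the vector stored at the given ordinal.
  float score(int ordinal);
}

final class OrdinalScorerSketch {
  // Scorer backed by the original float vectors.
  static OrdinalScorer overFloats(float[][] vectors, float[] query) {
    return ordinal -> {
      float sum = 0f;
      for (int i = 0; i < query.length; i++) {
        sum += vectors[ordinal][i] * query[i];
      }
      return sum;
    };
  }

  // Scorer backed by int8 (quantized) vectors.
  static OrdinalScorer overBytes(byte[][] vectors, byte[] query) {
    return ordinal -> {
      int sum = 0;
      for (int i = 0; i < query.length; i++) {
        sum += vectors[ordinal][i] * query[i];
      }
      return (float) sum;
    };
  }

  // Traversal-style code: it only sees ordinals and scores, never the vector type.
  static int bestOrdinal(OrdinalScorer scorer, int[] candidates) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int candidate : candidates) {
      float s = scorer.score(candidate);
      if (s > bestScore) {
        bestScore = s;
        best = candidate;
      }
    }
    return best;
  }
}
```

Because `bestOrdinal` depends only on the interface, the same traversal code could be handed a float-backed scorer while building and a byte-backed scorer while searching.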
2023-09-12 13:57:07 +01:00
Tony-X d77195d705
Document why we need `lastPosBlockOffset` (#12541)
* Document why we need `lastPosBlockOffset`

* Let ./gradlew tidy fix the formatting

* Fix '<' with &lt;

---------

Co-authored-by: Tony Xu <tonyx@amazon.com>
2023-09-12 07:32:58 -04:00
Michael McCandless 57dd5a4bda
The bitsRequired passed during NodeHash rehash (when building an FST) (#12545) 2023-09-09 19:14:18 -04:00
Luca Cavanna a7202e2e6f
Close index readers in tests (#12544)
There are a few places where tests don't close index readers. This has
not caused problems so far, but it becomes an issue when the reader gets
an executor, because its shutdown happens as a closing listener of the
reader. This has become more evident since we now offload sequential
execution to the executor. If there's an executor, but it's never used,
no threads are created, and no threads are leaked. If we do use the
executor, and the reader is not closed, the test leaks threads.
2023-09-08 14:55:01 +02:00
Christine Poerschke ef42af65f2
clarify QueryVisitor.acceptField javadoc w.r.t. not being term-specific (#12540) 2023-09-06 11:18:46 +01:00
Luca Cavanna d62ca4a01f add missing changelog entry for #12498 2023-09-05 16:41:38 +02:00
Luca Cavanna da894151a6
Offload single slice to executor (#12515)
When an executor is set to the IndexSearcher, we should try and offload
most of the computation to such executor. Ideally, the caller thread
would only do light coordination work, and the executor is responsible
for the heavier workload. If we don't offload sequential execution to
the executor, it becomes very difficult to make any distinction about
the type of workload performed on the two sides.

Closes #12498
2023-09-05 16:30:20 +02:00
Luca Cavanna 947b2c5e5a
Unwrap execution exceptions cause and rethrow as is when possible (#12516)
When performing concurrent search, we may get an execution exception
from one or more slices. In that case, we'd like to rethrow the cause of
the execution exception, which we do by wrapping it into a new runtime
exception. Instead, we can rethrow runtime exceptions as-is, and the
same is true for io exceptions. Any other exception is still wrapped
into a new runtime exception. This unifies the exceptions that get
thrown between sequential codepath (when no executor is provided) and
concurrent codepath (when an executor is provided).
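
The rethrow rule described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the PR; the class and method names (`ExecutionExceptionUnwrapSketch`, `getUnwrapped`) are hypothetical, and the handling of interruption is an assumption.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper; not Lucene's actual implementation.
final class ExecutionExceptionUnwrapSketch {
  static <T> T getUnwrapped(Future<T> future) throws IOException {
    try {
      return future.get();
    } catch (InterruptedException e) {
      // Assumption: restore the interrupt flag and surface a runtime exception.
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause; // rethrown as-is
      }
      if (cause instanceof IOException) {
        throw (IOException) cause; // rethrown as-is
      }
      throw new RuntimeException(cause); // anything else is still wrapped
    }
  }
}
```

With this shape, a caller sees the same exception type whether a task ran on the calling thread or on an executor.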
2023-09-05 15:55:48 +02:00
Chaitanya Gohel d631615665
Honor topvalue while determining isMissingvalueCompetitive in case bottom is not set (#12520) 2023-09-04 18:25:34 +02:00
zhangchao a52161b131
Update outdated comment about maxPointsInLeafNode in BKD tree (#12532) 2023-09-04 14:37:16 +02:00
Jack Wang 9fd45e3951
Enhancement 11236 lazy compute similarity score (#12480) 2023-09-01 11:05:49 -07:00
Benjamin Trent d1c3531161
Use panama vector for l2normalize (#12518)
2023-08-29 08:33:49 -04:00
zhangchao 16e4874bb9
Remove unused variable in BKDWriter (#12512) 2023-08-22 15:50:47 +08:00
Luca Cavanna bb62720526
Simplify task executor for concurrent operations (#12499)
This commit removes the QueueSizeBasedExecutor (package private) in favour of simply offloading concurrent execution to the provided executor. If specific behaviour is needed, it can be implemented in the executor itself.

This removes an instanceof check that determines which type of executor wrapper is used, which means that some tasks may be executed on the caller thread depending on queue size, whenever a rejection happens, or always for the last slice. This behaviour is not configurable in any way, and is too rigid. Rather than making this pluggable, I propose to make Lucene less opinionated about concurrent tasks execution and require that users include their own execution strategy directly in the executor that they provide to the index searcher.

Relates to #12498
2023-08-21 21:54:37 +02:00
Jakub Slowinski fb8183332b
Fix stack overflow in RegExp for long string (#12462) 2023-08-16 22:45:20 -07:00
Shubham Chaudhary 368dbffef3
Replace consecutive close() calls and close() calls with null checks with IOUtils.close() (#12428) 2023-08-16 17:12:34 -07:00
tang donghai ec1367862d
Fix UTF32toUTF8 will produce invalid transition (#12472) 2023-08-16 13:59:07 -07:00
Benjamin Trent 4174b521dd
Rename ToParentBlockJoin[Byte|Float]KnnVectorQuery and adjust to return highest score child doc ID by parent id (#12510)
The current query returns parent IDs based on the nearest child-id score. However, it's difficult to invert that relationship (that is, to determine which child was actually the nearest during search).

So, I changed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery` and now it returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id.

Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.
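
A minimal sketch of that lookup, assuming the usual block-join layout in which each parent document is indexed immediately after its children. `java.util.BitSet` stands in for the parent bit set here, and `ParentLookupSketch` is a hypothetical name:

```java
import java.util.BitSet;

// Hypothetical sketch; java.util.BitSet stands in for the parent bit set.
final class ParentLookupSketch {
  // Returns the parent doc ID for a child doc ID, or -1 if no parent follows it.
  static int parentOf(BitSet parentBits, int childDoc) {
    // In a block, children come first and the parent is the next set bit.
    return parentBits.nextSetBit(childDoc);
  }
}
```

For example, with parents at doc IDs 3 and 7, children 0 to 2 map to parent 3 and children 4 to 6 map to parent 7.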

Related to: https://github.com/apache/lucene/pull/12434
2023-08-16 13:44:49 -04:00
Benjamin Trent 5a5aa2c8fa
GITHUB#12342 Add new maximum inner product vector similarity method (#12479)
The current dot-product score scaling and similarity implementation assumes normalized vectors. This disregards information that the model may store within the magnitude. 

See: https://github.com/apache/lucene/issues/12342#issuecomment-1658640222 for a good explanation for the need.

To avoid breaking current scoring assumptions in Lucene, a new `MAXIMUM_INNER_PRODUCT` similarity function is added.

Because the similarity from a `dotProduct` function call could be negative, this similarity scorer scales negative dotProduct values into the range 0-1, and all positive dotProduct values into the range 1-MAX.
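
One monotonic mapping that matches this description is sketched below. The exact formula is an assumption made for illustration, and `MaxInnerProductScalingSketch` is a hypothetical name; the only properties taken from the text above are that negative values land between 0 and 1 and non-negative values land at 1 or above.

```java
// Hypothetical sketch of a score scaling consistent with the description above.
final class MaxInnerProductScalingSketch {
  static float scale(float dotProduct) {
    if (dotProduct < 0) {
      // Negative dot products map into (0, 1): the more negative, the closer to 0.
      return 1f / (1f - dotProduct);
    }
    // Non-negative dot products map into [1, MAX).
    return dotProduct + 1f;
  }
}
```

Both branches meet at 1 when the dot product is 0, so the ordering of results by raw dot product is preserved while scores stay non-negative.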

One concern with adding this similarity function is that it breaks the triangle inequality. It is assumed that this is needed to build graph structures. But, there is conflicting research here when it comes to real-world data.

See:
 - For: https://github.com/apache/lucene/issues/12342#issuecomment-1618258984
 - Against: https://github.com/apache/lucene/issues/12342#issuecomment-1631577657, https://github.com/apache/lucene/issues/12342#issuecomment-1631808301

To check if any transformation of the input is required to satisfy the triangle inequality, many tests have been run.

See:

 - https://github.com/apache/lucene/issues/12342#issuecomment-1653420640
 - https://github.com/apache/lucene/issues/12342#issuecomment-1656112434
 - https://github.com/apache/lucene/issues/12342#issuecomment-1656718447

If there are any additional tests, or issues with the provided tests & scripts, please let me know. We want to make sure this works well for our users.

closes: https://github.com/apache/lucene/issues/12342
2023-08-16 12:15:25 -04:00
Lu Xugang 71f6f59a75
Remove outdated comment in Scorer (#12494)
We should delete this comment: the constructor parameter it refers to was already removed in LUCENE-2876, and its description of the 'given Similarity' is a little confusing to the reader.

Scorer always provide non-negative
2023-08-16 11:36:20 +08:00
Benjamin Trent 18b56bd002
ToParentBlockJoin[Byte|Float]KnnVectorQuery needs to handle the case when parents are missing (#12504)
This is a follow up to: https://github.com/apache/lucene/pull/12434

Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE
2023-08-14 09:24:25 -04:00
Adrien Grand 47258cc9e9 Move changes of #12415 to 9.8 2023-08-11 22:39:36 +02:00
Adrien Grand 4d26cb2219
Optimize disjunction counts. (#12415)
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.
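
The difference between the two counting strategies can be sketched as follows, using a plain `long[]` as a stand-in for the window's bit set of matches. `WindowBitCountSketch` and its method names are hypothetical and not part of the change itself.

```java
// Hypothetical sketch; a long[] stands in for the bit set of matches in a window.
final class WindowBitCountSketch {
  // Naive approach: visit every bit position and test it.
  static int countByIteration(long[] words) {
    int count = 0;
    for (long word : words) {
      for (int bit = 0; bit < Long.SIZE; bit++) {
        if ((word & (1L << bit)) != 0) {
          count++;
        }
      }
    }
    return count;
  }

  // Batched approach: one population count per 64-bit word.
  static int countByPopcount(long[] words) {
    int count = 0;
    for (long word : words) {
      count += Long.bitCount(word);
    }
    return count;
  }
}
```

Both methods return the same number; the second does one operation per word instead of one per potential match.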

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates #12358
2023-08-11 22:37:37 +02:00
Benjamin Trent df8745e59e
Fix flaky testToString method for Knn Vector queries (#12500)
Periodically, the random indexer will force merge on close. This means that what was originally indexed as the zeroth document may no longer be the zeroth document.

This commit adjusts the assertion to ensure the to string format is as expected for `DocAndScoreQuery`, regardless of the matching doc-id in the test.

This seed shows the issue:
```
./gradlew test --tests TestKnnByteVectorQuery.testToString -Dtests.seed=B78CDB966F4B8FC5
```
2023-08-11 07:26:49 -04:00
Peter Gromov 13e747f95f
hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration (#12491)
* hunspell: simplify TrigramAutomaton to speed up the suggestion enumeration

avoid the automaton access on definitely absent characters;
count the scores for all substring lengths together
2023-08-08 22:40:42 +02:00
Benjamin Trent dd4e66dad6
Fix test failure with zero-length vector (#12493)
This adds assertions around the random test vector dimension count and continues to generate random vectors until it has a `squareSum > 0`
2023-08-08 08:38:46 -04:00
Benjamin Trent a65cf8960a
Add ParentJoin KNN support (#12434)
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing. 

However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.

This commit adds this ability through some significant changes:
 - New leaf reader function that allows a collector for knn results
 - The knn results can then utilize bit-sets to join back to the parent id
 
This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.

This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).
2023-08-07 14:46:42 -04:00
Adrien Grand 03ab02157a Revert "Stop aligning windows in BooleanScorer. (#12488)"
This reverts commit 09e3b43331.
2023-08-06 22:10:09 +02:00
Adrien Grand 09e3b43331
Stop aligning windows in BooleanScorer. (#12488)
BooleanScorer aligns windows to multiples of 2048, but it doesn't have to.
Actually, not aligning windows can help evaluate fewer windows overall and
speed up query evaluation.
2023-08-05 11:29:34 +02:00
Adrien Grand df3632cb03
Fix `DefaultBulkScorer` to not advance the competitive iterator beyond the end of the window. (#12481)
The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the
competitive iterator beyond the end of the window. This may cause bugs with
bulk scorers such as `BooleanScorer` that sometimes delegate to the single
clause that has matches in a given window of doc IDs. We should then make sure
to not advance the competitive iterator beyond the end of the window based on
this clause, as other clauses may have matches as well.
2023-08-03 07:19:27 +02:00
Adrien Grand acffcfaaf0
Reduce overhead of disabling scoring on `BooleanScorer`. (#12475)
This is a subset of #12415, which I'm extracting to its own pull request in
order to have separate data points in nightly benchmarks.

Results on `wikimedium10m` and `wikinightly` counting tasks:

```
                       CountTerm     4624.91      (6.4%)     4581.34      (6.4%)   -0.9% ( -12% -   12%) 0.640
                 CountAndHighMed      280.03      (4.5%)      280.15      (4.4%)    0.0% (  -8% -    9%) 0.974
                     CountPhrase        7.22      (3.0%)        7.24      (1.8%)    0.3% (  -4% -    5%) 0.728
                CountAndHighHigh       52.84      (4.9%)       53.12      (5.6%)    0.5% (  -9% -   11%) 0.755
                        PKLookup      232.01      (3.6%)      235.45      (2.8%)    1.5% (  -4% -    8%) 0.144
                 CountOrHighHigh       42.37      (6.1%)       56.04      (9.1%)   32.3% (  16% -   50%) 0.000
                  CountOrHighMed       30.56      (6.5%)       40.46      (9.8%)   32.4% (  15% -   52%) 0.000
```
2023-08-03 07:17:52 +02:00
Armin Braun e78feb7809
Clean up duplication in BKDWriter (#12469)
The logic for creating the writer runnable could be deduplicated.
Also, a couple of anonymous classes could be turned into lambdas.
2023-08-03 07:16:44 +02:00
Benjamin Trent 229dc7481e
Fix randomly failing field info format tests (#12473) 2023-08-02 14:10:57 -04:00
Adrien Grand 5e725964a0
Improve MaxScoreBulkScorer partitioning logic. (#12457)
Partitioning scorers is an optimization problem: the optimal set of
non-essential scorers is the subset of scorers whose sum of max window scores
is less than the minimum competitive score that maximizes the sum of costs.

The current approach consists of sorting scorers by maximum score within the
window and computing the set of non-essential clauses as the first scorers
whose sum of max scores is less than the minimum competitive score, i.e. you
cannot have a competitive hit by matching only non-essential clauses.

This sorting logic works well in the common case when costs are inversely
correlated with maximum scores and gives an optimal solution: the above
algorithm will also optimize the cost of non-essential clauses and thus
minimize the cost of essential clauses, in turn further improving query
runtimes. But this isn't true for all queries. E.g. fuzzy queries compute
scores based on artificial term statistics, so costs are no longer inversely
correlated with maximum scores. This was especially visible with the query
`titel~2` on the wikipedia dataset, as `title` matches this query and is a
high-frequency term. Yet the score contribution of this term is in the same
order as the contribution of most other terms, so query runtime gets much
improved if this clause gets considered non-essential rather than essential.

This commit optimizes the partitioning logic a bit by sorting clauses by
`max_score / cost` instead of just `max_score`. This will not change anything
in the common case when max scores are inversely correlated with costs, but can
significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my
machine and the wikimedium10m dataset with this change.
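
The effect of the two sort orders can be sketched as follows. This is an illustration under simplifying assumptions, not the `MaxScoreBulkScorer` code: `PartitionSketch`, `Clause`, and the example numbers are hypothetical, and the non-essential set is taken as the longest sorted prefix whose summed max scores stay below the minimum competitive score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the partitioning step; not Lucene's implementation.
final class PartitionSketch {
  static final class Clause {
    final String name;
    final double maxScore;
    final long cost;

    Clause(String name, double maxScore, long cost) {
      this.name = name;
      this.maxScore = maxScore;
      this.cost = cost;
    }
  }

  // Previous ordering: by max score only.
  static final Comparator<Clause> BY_MAX_SCORE =
      Comparator.comparingDouble(c -> c.maxScore);

  // New ordering: by max score per unit of cost.
  static final Comparator<Clause> BY_SCORE_PER_COST =
      Comparator.comparingDouble(c -> c.maxScore / c.cost);

  // Longest sorted prefix whose summed max scores stay below the threshold.
  static List<Clause> nonEssential(
      List<Clause> clauses, double minCompetitiveScore, Comparator<Clause> order) {
    List<Clause> sorted = new ArrayList<>(clauses);
    sorted.sort(order);
    List<Clause> result = new ArrayList<>();
    double sum = 0;
    for (Clause clause : sorted) {
      if (sum + clause.maxScore >= minCompetitiveScore) {
        break;
      }
      sum += clause.maxScore;
      result.add(clause);
    }
    return result;
  }
}
```

With clauses `title` (max score 2.0, cost 1,000,000), `rare1` (1.5, 100), and `rare2` (2.5, 50) and a minimum competitive score of 2.2, sorting by max score alone makes only `rare1` non-essential, while sorting by `max_score / cost` makes the expensive `title` clause non-essential and leaves the two cheap clauses to drive iteration.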
2023-07-29 21:02:28 +02:00
Peter Gromov 32ec38271e
hunspell: check for aff file wellformedness more strictly (#12468)
* hunspell: check for .aff file well-formedness more strictly
2023-07-29 16:07:50 +02:00
Peter Gromov 1af68bf2d7
hunspell: make the hash table load factor customizable (#12464)
* hunspell: make the hash table load factor customizable
2023-07-28 18:36:32 +02:00
Mayya Sharipova 155b2edbe3
Fix occasional failure in BaseKnnVectorsFormatTestCase#testIllegalDimensionTooLarge (#12467)
Depending on whether a document with dimensions > maxDims is created
on a new segment or an already existing segment, we may get
different error messages. This fix adds another possible
error message that we may get.

Relates to #12436
2023-07-28 09:37:16 -04:00
Mayya Sharipova 119635ad80
Make KnnVectorsFormat#getMaxDimensions abstract (#12466)
- Backward codecs use 1024 as max dims
- Test classes use the current KnnVectorsFormat#DEFAULT_MAX_DIMENSIONS

Relates to PR#12436
Closes #12309
2023-07-28 08:34:17 -04:00
Mayya Sharipova 98320d7616
Move max vector dims limit to Codec (#12436)
Move vector max dimension limits enforcement into the default Codec's
KnnVectorsFormat implementation. This allows different implementations
of knn search algorithms to define their own limits on the maximum
vector dimensions that they can handle.

Closes #12309
2023-07-27 14:50:33 -04:00
Armin Braun 538b7d0ffe
Clean up writing String to ByteBuffersDataOutput (#12455)
Resolving TODO to use UnicodeUtil instead of a copy of its code here.
This may be slightly slower because of the extra check for a high surrogate,
but that may be outweighed by the more compact code and by avoiding the
capturing lambda, which might not inline.
2023-07-26 14:25:01 +02:00
Greg Miller 87944c2aa7 Move CHANGES entry for GITHUB#12408 under 10.0
A backport to 9.x will be somewhat tricky with the API surface area
so planning to wait for a 10.0 release.
2023-07-25 13:02:36 -07:00
Greg Miller 179b45bc23
Initialize facet counting data structures lazily (#12408)
This change covers:
* Taxonomy faceting
  * FastTaxonomyFacetCounts
  * TaxonomyFacetIntAssociations
  * TaxonomyFacetFloatAssociations
* SSDV faceting
  * SortedSetDocValuesFacetCounts
  * ConcurrentSortedSetDocValuesFacetCounts
  * StringValueFacetCounts
* Range faceting:
  * LongRangeFacetCounts
  * DoubleRangeFacetCounts
* Long faceting:
  * LongValueFacetCounts

Left for a future iteration:
* RangeOnRange faceting
* FacetSet faceting
2023-07-25 12:20:42 -07:00
Greg Miller 2b3b028734
GITHUB#12451: Update TestStringsToAutomaton validation to work around GH#12458 (#12461) 2023-07-25 11:56:18 -07:00
Armin Braun 20e97fbd00
Faster bulk numeric reads from BufferedIndexInput (#12453)
Reading ints/floats/longs one-by-one from a heap-byte-buffer, including
doing our own bounds checks is not very efficient. We can use the
ability to translate the buffer and read in bulk while taking turns with
one-off reading/refilling instead.
2023-07-24 17:20:32 +02:00
Stefan Vodita 34721f9439
Assert IdxOrDvQuery subqueries and document useful fields (#12442) 2023-07-24 16:36:48 +02:00
Benjamin Trent 59c56a0aed
Fix sorted&unsorted graph test flakiness (#12452)
When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query.

To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but in one particular seed: C8AAF5E4648B4226, it failed.

I have verified that bumping the search candidate pool to 60 fixes the failure.

The total number of vectors still outnumbers the requested number of candidates, so the search is still hitting the graph.

I verified further by running the test again over a couple thousand seeds and it didn't fail again.
2023-07-20 14:08:21 -04:00
Adrien Grand 17c13a76c8
Add BS1 optimization to MaxScoreBulkScorer. (#12444)
Lucene's scorers that can dynamically prune on score provide great speedups
when they manage to skip many hits. Unfortunately, there are also cases when
they cannot skip hits efficiently, one example case being when there are many
clauses in the query. In this case, exhaustively evaluating the set of matches
with `BooleanScorer` (BS1) may perform several times faster.

This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of
collecting hits into a bitset to save the overhead of reordering priority
queues. This helps make performance degrade much more gracefully when dynamic
pruning cannot help much.

Closes #12439
2023-07-19 13:51:22 +02:00
Martin Demberger 55f2f9958b
LUCENE-8183: Added the ability to get noSubMatches and noOverlapping Matches (#12437)
---------

Co-authored-by: Martin Demberger <martin.demberger@root-nine.de>
2023-07-19 13:15:33 +02:00
Peter Gromov f05adff4ca
hunspell: speed up the dictionary enumeration (#12447)
* hunspell: speed up the dictionary enumeration

cache each word's case and the lowercase form
group the words by lengths to avoid even visiting entries with unneeded lengths
2023-07-18 21:25:26 +02:00
Stefan Vodita b4619d87ed
Move sliced int buffer functionality to MemoryIndex (#11248) (#12409)
* [WIP] Move IntBlockPool slices to MemoryIndex

* [WIP] Working TestMemoryIndex

* [WIP] Working TestSlicedIntBlockPool

* Working many allocations tests

* Add basic IntBlockPool test

* SlicedIntBlockPool inherits from IntBlockPool

* Tidy
2023-07-10 10:18:28 -04:00
Benjamin Trent d03c8f16d9
Have byte[] vectors also trigger a timeout in ExitableDirectoryReader (#12423)
`ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors.

This commit fixes that bug.
2023-07-07 12:29:55 -04:00
Benjamin Trent 861153020a
Fix HNSW graph visitation limit bug (#12413)
We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer.

While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.

Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.

However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
2023-07-06 15:46:36 -04:00
Adrien Grand f527eb3b12
Remove Scorable#docID. (#12407)
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates #12358
2023-07-05 10:40:06 +02:00
Uwe Schindler 9ffc625b2e Followup on #12410: Fix caller class check to use string literals to allow private/pkg-private classes 2023-07-03 17:52:10 +02:00
Uwe Schindler f668cfd1cd
Fix forUtil.gradle to actually execute python script and also fix type error in script (#12411) 2023-07-03 16:22:03 +02:00
Uwe Schindler fde2e50d9e
Refactor vectorization support in Lucene (#12410) 2023-07-03 11:39:19 +02:00
Perdjesk 907883f701
Correct Javadocs using SimpleBindings (#12402)
Javadocs were still referring to the old `SortField` API, which
has been replaced with methods that use `DoubleValuesSource`
instead.
2023-06-30 15:14:24 +01:00
Adrien Grand 8811f31b9c
Add a post-collection hook to LeafCollector. (#12380)
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of thing on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
2023-06-30 15:19:35 +02:00
Sagar 40ee6e583e
Assign a dummy simScorer in TermsWeight if score is not needed (#12383) 2023-06-30 15:14:33 +02:00
Sorabh 223eecca33
Add a thread safe CachingLeafSlicesSupplier to compute and cache the LeafSlices used with concurrent segment (#12374)
search. It uses the protected method `slices` by default to compute the slices, which can be
overridden by subclasses of IndexSearcher
2023-06-30 14:57:34 +02:00
zhangchao 01200b5804
Speed up NumericDocValuesWriter with index sorting (#12381) 2023-06-30 14:56:56 +02:00
Uwe Schindler e503805758
Remove usage and add some legacy java.util classes to forbiddenapis (Stack, Hashtable, Vector) (#12404) 2023-06-29 16:56:41 +02:00
Luca Cavanna f44cc45cf8
Share concurrent execution code into TaskExecutor (#12398)
Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search
is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs
tasks concurrently, waits for them to be completed, and handles
any exceptions.

This commit shares code between these two scenarios, to reduce code
duplication as well as to ensure that further improvements can be shared between them.
2023-06-28 13:52:01 +02:00
Adrien Grand 4029cc37a7
Fix MaxScoreBulkScorer#score's return value. (#12400)
`AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may
not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of
`BulkScorer#score`, a reasonable implementation should never have return values
in this range, as it would suggest that more matches need collecting when we're
already out of the range of valid doc IDs. So this generally indicates a bug.

`MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the
requested window of doc IDs, when the sum of maximum scores would be less than
the minimum competitive score. In that case, the best information it has is
that there are no matches in the window, but it cannot give a good estimate of
the next potential match.

This assertion in `AssertingBulkScorer` looks sane to me, so I made a small
change to `MaxScoreBulkScorer` to make sure it meets `AssertingScorer`'s
expectations. This is done in a place that is only called once per scored
window, so it should not have a noticeable performance impact.
2023-06-28 09:17:41 +02:00
yixunx 6bb8cc0235
Let hard link wrapper fallback to delegate.copyFrom (#12384)
Co-authored-by: Yixun Xu <yixunx@palantir.com>
2023-06-28 09:17:20 +02:00
Stefan Vodita b88d3e1988
Catch offset overflows in byte pool (#9660) (#12392) 2023-06-28 09:16:56 +02:00
Tomas Eduardo Fernandez Lobbe b50b969d25
Update comment about IndexOptions ordinals (#12360)
FieldInfos no longer accepts changes in IndexOptions, however, different IndexOptions are still compared using their ordinals
2023-06-26 16:29:13 -07:00
Alan Woodward 79e8c9c8b9 Fix edge case in TestJoinUtil
TestJoinUtil.checkBoost() needs to check to see if there are
any results to validate, otherwise we can get an array-out-of-bounds
exception
2023-06-26 13:28:09 +01:00
Adrien Grand 4fec812cc5 Add back-compat indices for 9.7.0 2023-06-26 14:24:27 +02:00
Luca Cavanna 3e0dc2b572 Add missing change entries for 9.8 2023-06-26 11:23:36 +02:00
Alan Woodward edd799824f
Enable boosts on JoinUtil queries (#12388)
Boosts should not be ignored by queries returned from JoinUtil
2023-06-26 09:47:14 +01:00
Luca Cavanna 7f10dca1e5
Revert "Parallelize knn query rewrite across slices rather than segments (#12325)" (#12385)
This reverts commit 10bebde269.

Based on a recent discussion in
https://github.com/apache/lucene/pull/12183#discussion_r1235739084 we
agreed it makes more sense to parallelize knn query vector rewrite
across leaves rather than leaf slices.
2023-06-26 10:41:18 +02:00
Michael Sokolov cb195bd96e
github-12386: set java.io.tmpdir in replicator tests' forked processes (#12387) 2023-06-23 08:38:06 -04:00
Jonathan Ellis fe0278e36e
Reuse neighborqueue during hnsw index build (attempt 2) (#12372)
This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance.

This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached.

The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added.

* Re-use NeighborQueue during build's search

* improve javadoc for OnHeapHnswGraphSearcher

* assert that results parameter is minheap as expected

* update CHANGES
2023-06-20 15:05:37 -04:00
Adrien Grand 8703e449ce
Change the MAXSCORE scorer to a bulk scorer. (#12361) 2023-06-20 18:55:03 +02:00
zhangchao 37b92adf6a
Avoid redundant loop for compute min value in DirectMonotonicWriter (#12377)
* Avoid redundant loop for get min value

* update CHANGES.txt
2023-06-20 09:12:15 -04:00
Alan Woodward 6d4314d46f Add back-compat indices for 9.6.0 2023-06-16 17:58:31 +02:00
Adrien Grand 6ee7b2b9f6 Add next minor version 9.8.0 2023-06-16 13:39:30 +02:00
Uwe Schindler 148236a50b
This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (#12376) 2023-06-16 12:28:29 +02:00
Luca Cavanna bb6ec50d4c
Increased the likelihood of leveraging inter-segment concurrency in tests (#12369)
We have recently increased the likelihood of leveraging inter-segment search
concurrency in tests when `newSearcher` is used to create the index
searcher (see #959). When parallel execution is enabled though, it is
dependent on the number of documents and segments. That means
that out of 1000 test runs that use `RandomIndexWriter` to index a random
number of docs up to 100, we will effectively parallelize only a couple
of times.

This commit increases the likelihood of running concurrent searches by
randomly forcing 1 max segments per slice as well as 1 max doc per slice.
2023-06-15 11:35:10 +02:00
Alessandro Benedetti af1afc8cb6 * GITHUB#12252 CHANGES.txt fix 2023-06-14 16:00:41 +01:00
Elia Porciani 14c18d8624
GITHUB-12252: Add function queries for computing similarity scores between knn vectors (#12253)
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
2023-06-14 15:49:00 +01:00
Adrien Grand a8baa47733
Move TermAndBoost back to its original location. (#12366)
PR #12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.
2023-06-14 11:54:10 +02:00
Chaitanya Gohel 65447c8388
Add CHANGES.txt for #12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (#12368)
Signed-off-by: gashutos <gashutos@amazon.com>
2023-06-14 10:17:06 +02:00
Chris Hegarty 1090928c14
Implement VectorUtilProvider with Java 21 Project Panama Vector API (#12363)
This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.
2023-06-13 09:44:58 +01:00
Jonathan Ellis 071461ece5
Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (#12281)
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
2023-06-13 10:40:03 +02:00
gf2121 30eba6df56
Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (#12324) 2023-06-13 14:24:26 +08:00
Uwe Schindler c8e05c8cd6
Implement MMapDirectory with Java 21 Project Panama Preview API (#12294) 2023-06-12 21:07:04 +02:00
Chris Fournier 41baf23ad9
Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (#12249) 2023-06-12 18:20:10 +02:00
Uwe Schindler ef35e6edf4
Work around SecurityManager issues during initialization of vector api (JDK-8309727) (#12362) 2023-06-09 22:07:31 +02:00
Alan Woodward a51241e4c9
Better paging when random reads go backwards (#12357)
When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.
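
The repositioning rule can be sketched as a small pure function. This is an illustration, not the `BufferedIndexInput` code: `BackwardPagingSketch` and `newBufferStart` are hypothetical names, and restarting the buffer at the requested position for reads that fall further back than one page is an assumption.

```java
// Hypothetical sketch of choosing where the refilled buffer should start.
final class BackwardPagingSketch {
  static long newBufferStart(long pos, long bufferStart, int bufferLength, int bufferSize) {
    if (pos >= bufferStart && pos < bufferStart + bufferLength) {
      return bufferStart; // already buffered, nothing to refill
    }
    if (pos < bufferStart) {
      // Reading backwards: try a page of bufferSize that ends at the old buffer start.
      long pagedStart = Math.max(0L, bufferStart - bufferSize);
      if (pos >= pagedStart) {
        return pagedStart;
      }
    }
    // Forward reads, and (by assumption) far-away backward reads, restart at pos.
    return pos;
  }
}
```

With a 100-byte buffer starting at 1000, a read at position 999 refills from 900, so the next 99 backward reads are served from memory instead of triggering a refill each time.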

Fixes #12356
2023-06-09 11:57:59 +01:00
fudongying 2934899ca6
feat: soft delete optimize (#12339) 2023-06-09 11:41:28 +02:00
Ignacio Vera 9a2d19324f
[Tessellator] Improve the checks that validate the diagonal between two polygon nodes (#12353) 2023-06-09 08:10:33 +02:00
Peter Gromov 5b63a1879d
TestHunspell: reduce the flakiness probability (#12351)
* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2023-06-07 14:10:44 +02:00
Patrick Zhai 0c293909c0
Add updateDocuments API which accept a query (reopen) (#12346) 2023-06-03 20:16:16 -07:00
Greg Miller 52ace7eb35
Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (#12320) 2023-06-02 09:34:52 -07:00
Petr Portnov | PROgrm_JARvis 45110a6a46
Make memory fence in `ByteBufferGuard` explicit (#12290) 2023-06-01 13:41:06 +02:00
Uwe Schindler 40b582ab18
Revert "Add updateDocuments API which accept a query (#12341)" (#12344)
This reverts commit 52ab16731e.
2023-06-01 13:37:36 +02:00
Patrick Zhai 52ab16731e
Add updateDocuments API which accept a query (#12341) 2023-06-01 13:37:04 +02:00
Peter Gromov 4bf1b94209
hunspell (minor): reduce allocations when reading the dictionary's morphological data (#12323)
there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation
2023-06-01 11:37:38 +02:00
Greg Miller f79b316bd5 Add CHANGES entry for GH#12334 2023-05-31 15:18:34 -07:00
Chaitanya Gohel d44be24025
Fix searchafter high latency when after value is out of range for segment (#12334) 2023-05-31 15:07:53 -07:00
Daniele Antuzi da36c24cb9
Use thread-safe search version of HnswGraphSearcher (#12246)
Addressing comment received in the PR https://github.com/apache/lucene/pull/12246
2023-05-30 15:38:06 +01:00
Luca Cavanna 72b91156f3
Don't generate stacktrace for TimeExceededException (#12335)
The exception is package private and never rethrown, so we can avoid
generating a stacktrace for it.
2023-05-30 10:29:46 +02:00
Patrick Zhai d1850e44f3
Update TestVectorUtilProviders.java (#12338) 2023-05-29 16:26:29 -07:00
Uwe Schindler db0c21f25d Cleanup and update changes and synchronize with 9.x 2023-05-26 18:22:51 +02:00
Jonathan Ellis 431dc7b415
add BitSet.clear() (#12268) 2023-05-26 18:13:16 +02:00
Greg Miller 367b03bfc2
GH#12321: Reduce visibility of StringsToAutomaton (#12331) 2023-05-26 08:55:02 -07:00
Uwe Schindler f5f25777d8 Update changes to be correct with ARM (it is called NEON there) 2023-05-26 16:53:39 +02:00
Luca Cavanna 24712d7525 Move changes entry for #12328 to 9.7 2023-05-26 15:11:21 +02:00
Armin Braun fd75807350
Optimize ConjunctionDISI.createConjunction (#12328)
This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.
2023-05-26 13:44:39 +02:00
Luca Cavanna 0ce6b9a67b Adjust changes entries for knn query concurrent rewrite
Moved entry for #12160 to 9.7.0 as it's been backported.
Added missing entry for #12325.
2023-05-26 09:25:54 +02:00
Luca Cavanna 10bebde269
Parallelize knn query rewrite across slices rather than segments (#12325)
The concurrent query rewrite for knn vector query introduced with #12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.
2023-05-26 09:17:25 +02:00
Uwe Schindler c188d47a8b
Handle jdk.internal classes mentioned in vector superclass or interfaces during extraction (#12329) 2023-05-25 17:21:03 +02:00
Michael McCandless 7da7c43638
#12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton (#12310)
Closes #12276
2023-05-25 10:18:41 -04:00
Chris Hegarty f756f90644
Integrate the Incubating Panama Vector API (#12311)
Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Panama Foreign API. This change expands this mechanism to include the Incubating Panama Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>
2023-05-25 07:59:50 +01:00
Andrey Bozhko c9c49bc553
[MINOR] Update javadoc in Query class (#12233)
- add a few missing full stops
- update wording in the description of Query#equals method
2023-05-23 12:16:50 +02:00
Patrick Zhai 8a602b5063
Add multi-thread searchability to OnHeapHnswGraph (#12257) 2023-05-21 21:48:46 -07:00
Peter Gromov a454388b80
hunspell (minor): reduce allocations when processing compound rules (#12316) 2023-05-19 21:36:05 +02:00
Uwe Schindler 84e2e3afc3
Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (#12315) 2023-05-19 18:42:55 +02:00
Uwe Schindler a8a95e64ce Forward port references to AccessController in VirtualMethod (#12308) 2023-05-19 16:38:24 +02:00
Jerry Chin 04ef6de826
GITHUB-12291: Skip blank lines from stopwords list. (#12299) 2023-05-18 16:58:32 +02:00
Michael Sokolov 6b51cce0b8 NeighborQueue.reset() now clears incomplete flag 2023-05-18 10:23:22 -04:00
Greg Miller 3e4ca4042c
Minor cleanup and improvements to DaciukMihovAutomatonBuilder (#12305) 2023-05-18 07:01:19 -07:00
Michael Sokolov 2facb3ae0e Revert "allocate one NeighborQueue per search for results (#12255)"
This reverts commit 9a7efe92c0.
2023-05-18 13:42:17 +00:00
Petr Portnov | PROgrm_JARvis 0c6e8aec67
Seal `IndexReader` and `IndexReaderContext` (#12296) 2023-05-17 08:47:47 +02:00
tang donghai f53eb28af0
remove max recursion from Operations.java to AutomatonTestUtil.java (#12298)
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2023-05-16 07:09:28 -04:00
Patrick Zhai 8af305892d
Optimize HNSW diversity calculation (#12235) 2023-05-15 23:20:31 -07:00
tang donghai 0e172b0723
Update Javadoc for topoSortStates method after #12286 (#12292) 2023-05-15 18:06:01 +02:00
tang donghai 5d203f8337
toposort use iterator to avoid stackoverflow (#12286)
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2023-05-15 16:20:15 +02:00
Luca Cavanna 223e28ef16
Simplify SliceExecutor and QueueSizeBasedExecutor (#12285)
The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.
2023-05-11 11:08:48 +02:00
Marcus 963ed7ce88
`ToParentBlockJoinQuery` Explain Support Score Mode (#12245)
* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Mikhail Khludnev <mkhl@apache.org>
2023-05-10 19:10:37 +03:00
Luca Cavanna b6100d9787
Make TimeExceededException members final (#12271)
TimeExceededException has three members that are set within its constructor and never modified. They can be made final.
2023-05-09 11:28:23 +02:00
Luca Cavanna 082c49a9ef
Update javadocs for QueryTimeout (#12272)
QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs need updating.
2023-05-09 11:27:47 +02:00
Luca Cavanna 10bad40ed3
Make query timeout members final in ExitableDirectoryReader (#12274)
There are a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.
2023-05-09 11:27:06 +02:00
Luca Cavanna 1cd9c1d66a add missing changelog entry for #12220 2023-05-09 10:57:28 +02:00
Luca Cavanna 67bb384f72 add missing changelog entry for #12260 2023-05-09 10:52:03 +02:00
Luca Cavanna 9579d2de76 Move changes entry for #12270 to 9.7.0 section 2023-05-09 10:28:22 +02:00
Armin Braun add9aba16d
Don't generate stacktrace in CollectionTerminatedException (#12270)
CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.
2023-05-09 10:18:52 +02:00
Jonathan Ellis 9a7efe92c0
allocate one NeighborQueue per search for results (#12255) 2023-05-08 17:22:58 -04:00
Michael Sokolov a39885fdab
GITHUB-12224: remove KnnGraphTester (moved to luceneutil) (#12238) 2023-05-08 10:12:36 -04:00
Uwe Schindler 397c2e547a
Fix MMapDirectory documentation for Java 20 (#12265) 2023-05-05 12:04:38 +02:00
Luca Cavanna caeabf3930
Fix SynonymQuery equals implementation (#12260)
The term member of TermAndBoost used to be a Term instance and became a
BytesRef with #11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with the same term and boost across different fields are equal, which
is a bug.
2023-05-03 11:27:33 +02:00
Jonathan Ellis 3c163745bb Use HashMap (was TreeMap) for OnHeapHnswGraph neighbors 2023-04-30 17:59:39 -04:00
Patrick Zhai 1fa2be90ea Tidy the main branch 2023-04-26 21:21:57 -07:00
Alan Woodward 7374c200a1 Add next minor version 9.7.0 2023-04-26 16:44:47 +01:00
Christoph Büscher f45e096304
Add ordering of files in compound files (#12241)
Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is undefined. This commit changes to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.
2023-04-26 14:01:02 +01:00
Luca Cavanna b0befef912
QueryProfilerWeight to extend FilterWeight (#12242)
QueryProfilerWeight should override matches and delegate to the
subQueryWeight. Another way to fix this issue is to make it extend
FilterWeight and override only methods that need to have a different
behaviour than delegating to the sub weight.
2023-04-26 10:24:57 +02:00
Alessandro Benedetti 4deb0003c4 Word2VecSynonymFilter constructor null check (#12169) 2023-04-24 17:28:12 +02:00
Daniele Antuzi 1f4f2bf509
Introduced the Word2VecSynonymFilter (#12169)
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
2023-04-24 13:35:26 +02:00