37089 Commits

Author SHA1 Message Date
Kaival Parikh cd195980ec
Add support for similarity-based vector searches (#12679)
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius
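
The two-radius traversal above can be sketched on a toy graph. This is only an illustration under stated assumptions, not the code from this change: `RadiusSearchSketch`, `search`, `traversalThreshold` and `resultThreshold` are hypothetical names, the graph is a plain adjacency array rather than a Lucene `HnswGraph`, and similarity is a dot product where a higher score means a closer vector.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy illustration only: hypothetical names, not Lucene's HNSW searcher or its API. */
final class RadiusSearchSketch {

  /** Dot-product similarity; a higher score means a closer vector. */
  static float similarity(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Breadth-first walk from entryNode over one graph level. Nodes scoring at least
   * traversalThreshold are expanded; nodes scoring at least resultThreshold are collected.
   * Results are returned in visit order, unsorted.
   */
  static List<Integer> search(
      float[][] vectors,
      int[][] neighbors,
      int entryNode,
      float[] query,
      float traversalThreshold,
      float resultThreshold) {
    boolean[] visited = new boolean[vectors.length];
    Deque<Integer> toExplore = new ArrayDeque<>();
    List<Integer> results = new ArrayList<>();
    visited[entryNode] = true;
    toExplore.add(entryNode);
    while (!toExplore.isEmpty()) {
      int node = toExplore.poll();
      float score = similarity(query, vectors[node]);
      if (score >= resultThreshold) {
        results.add(node); // inside the inner "result" radius
      }
      if (score < traversalThreshold) {
        continue; // outside the outer "traversal" radius: do not expand this node
      }
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          toExplore.add(next);
        }
      }
    }
    return results;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0.9f, 0.1f}, {0.5f, 0.5f}, {0.95f, 0f}, {0f, 1f}};
    int[][] neighbors = {{1, 4}, {0, 2}, {1, 3}, {2}, {0}};
    System.out.println(search(vectors, neighbors, 0, new float[] {1f, 0f}, 0.4f, 0.8f));
  }
}
```

In this toy graph, node 3 is only reachable through node 2, which scores below the result threshold. Keeping the traversal threshold looser than the result threshold is what lets the walk pass through node 2 and still reach node 3.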

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`
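
The merging difference in point 3 can be illustrated with a small sketch. It assumes a hypothetical `Hit` holder and two hypothetical helpers, `mergeTopK` and `mergeRadius`; none of these are Lucene classes, and Lucene's real merging code is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy illustration only: hypothetical types, not Lucene's result-merging code. */
final class MergeSketch {

  /** Minimal (doc, score) pair. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** topK merge: a bounded min-heap keeps the k best hits across all segments. */
  static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
    PriorityQueue<Hit> heap = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    for (List<Hit> segment : perSegment) {
      for (Hit hit : segment) {
        heap.add(hit);
        if (heap.size() > k) {
          heap.poll(); // evict the current worst hit
        }
      }
    }
    List<Hit> out = new ArrayList<>(heap);
    out.sort((a, b) -> Float.compare(b.score, a.score)); // best first
    return out;
  }

  /** Radius merge: every per-segment hit already qualifies, so the lists are concatenated. */
  static List<Hit> mergeRadius(List<List<Hit>> perSegment) {
    List<Hit> out = new ArrayList<>();
    for (List<Hit> segment : perSegment) {
      out.addAll(segment);
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<Hit>> segments =
        List.of(List.of(new Hit(1, 0.9f), new Hit(2, 0.3f)), List.of(new Hit(7, 0.8f)));
    System.out.println(mergeTopK(segments, 2).size() + " hits from topK merge");
    System.out.println(mergeRadius(segments).size() + " hits from radius merge");
  }
}
```

The radius merge never compares scores across segments, which is the independence between segments that the paragraphs below rely on.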

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
2023-12-11 14:18:36 -05:00
Shubham Chaudhary 1630ed4bd8
Remove some redundant modifiers from code (#12880) 2023-12-11 10:17:47 -08:00
Adrien Grand c0fd4404ea
Add tests for the 9.0->9.8 block tree terms dict format back. (#12908)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.
2023-12-11 18:14:28 +01:00
Jakub Slowinski 4d57070bf7
Removing @lucene.experimental tags in testXXX methods in CheckIndex (#12893) 2023-12-11 11:31:18 -05:00
Michael McCandless af11ea2562 #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 (#12913)
* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895

* woops, forgot to tidy up

* #12901: Ignore failing test; reflow text to workaround spotless' poor text formatting skills
2023-12-11 08:53:53 -05:00
Adrien Grand 069c048770
Add tests for Lucene90PostingsFormat back (#12904)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.
2023-12-11 13:09:40 +01:00
Usman Shaikh 6a56b2ea7d
Fix typo Levenstein -> Levenshtein (#12519)
Please do not lose sight of the exquisite irony of mis-spelling this particular word ;)
2023-12-11 05:49:36 -05:00
Greg Miller a9b5ef4749
Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored (#12853) 2023-12-08 15:25:57 -08:00
Dzung Bui fb269c9e64
Fix NPE on off-heap test and FST is null (#12894) 2023-12-08 09:59:39 -08:00
Stefan Vodita 1c4c1c831d
Quick exit for non-zero slice buffers (#12812) 2023-12-08 09:41:58 -08:00
Greg Miller 18a6033a7e
Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook (#12855) 2023-12-08 09:37:51 -08:00
Mike McCandless 9ec938f2ef Revert "SlowCompositeCodecWrapper: remove redundant FieldInfo lookup in its PointsReader.getValues method"
This reverts commit 0aff413797.
2023-12-08 08:14:16 -05:00
Dawid Weiss ba81826951
Make unified highlighter tests avoid mock random merge policy's document reordering (#12889) 2023-12-08 08:27:45 +01:00
Stefan Vodita ad9eff3c27
Shorten getOrdinal synchronized loop (#12870) 2023-12-07 16:27:49 -08:00
Chris Hegarty 5e1e6c9e68
Upgrade ECJ to 3.36.0 (#12888)
This commit upgrades ECJ to 3.36.0, as it has support for more recent Java versions, like Java 21.
2023-12-07 21:13:10 +00:00
Zach Chen fcc6f71f0e Move change entry for GITHUB#11041 from version 10.0.0 to version 9.10.0 2023-12-07 11:01:46 -08:00
Dawid Weiss 755de5aae3
Performance improvements to MatchHighlighter and MatchRegionRetriever (#12881) 2023-12-07 18:42:13 +01:00
Dzung Bui b699981847
Allow FST builder to use different writer (#12543) (#12624)
* Move size() to FSTStore

* Remove size() completely

* Allow FST builder to use different DataOutput

* access BytesStore byte[] directly for copying

* Rename BytesStore

* Change class to final

* Reorder methods

* Remove unused methods

* Rename truncate to setPosition() and remove skipBytes()

* Simplify the writing operations

* Update comment

* remove unused parameter

* Simplify BytesStore operation

* tidy code

* Rename copyBytes to writeTo

* Simplify BytesStore operations

* Embed writeBytes() to FSTCompiler

* Fix the write bytes method

* Remove the default block bits constant

* add assertion

* Rename method parameter names

* Move reverse to FSTCompiler

* Revert setPosition call

* Address comments

* Return immediately when writing 0 bytes

* Add comment &

* Rename variables

* Fix the compile error

* Remove isReadable()

* Remove isReadable()

* Optimize ReadWriteDataOutput

* tidy code

* Freeze the DataOutput once finished()

* Refactor

* freeze the DataOutput before use

* Address comments and add off-heap FST tests

* Remove the hardcoded random

* Ignore the Test2BFSTOffHeap test

* Simplify ReadWriteDataOutput

* Update Javadoc

* Address comments
2023-12-07 09:26:03 -05:00
Mike McCandless f55f2af0d0 SlowCompositeCodecWrapper: remove redundant FieldInfo lookup in its PointsReader.getValues method 2023-12-07 09:03:10 -05:00
Mike McCandless 4f1b967be9 #11023: move CHANGES entry down to 9.10 section 2023-12-07 06:45:40 -05:00
Jakub Slowinski 37cca614fc
CheckIndex - Removal of some dead code (#12876)
* CheckIndex - Remove some dead code

* Add back testPostings and testTermVectors
2023-12-07 06:34:33 -05:00
Uwe Schindler af2aea5eb2
Fix the declared Exceptions of Expression#evaluate() to match those of DoubleValues#doubleValue() (#12878) 2023-12-06 11:56:37 +01:00
Jakub Slowinski b22a9b8998
Removing TermInSetQuery varargs ctor (#12837) 2023-12-05 14:35:46 -08:00
Uwe Schindler 880d0ba1a8
Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873)
* Rewrite JavaScript expression compiler to use hidden classes and MethodHandles for functions

* Use dynamic constants for MethodHandles

* Remove invokestatic code and handle everything through dynamic constants

* Rewrite code to patch stack trace (keep Expressions class unmodified)

* Improve generating of constant names

* Remove classloader test (no longer needed)

* Add benchmark

* use better exception in benchmark

* Add documentation, migration guide and a utility method to convert legacy function maps

* also ignore SecurityException here while checking compatibility (if it happens only an imprecise error message is thrown)

* Use Map.copyOf to not clone the map each time we compile an expression

* Add another test with same method multiple times

* Update ASM to 9.6 and set classfile version to Java 17

* Cleanup classloader permissions, unfortunately "createClassLoader" is still needed for Jacoco for God knows what
2023-12-05 11:53:57 +01:00
Chris Hegarty e852dfe550
Fix intermittently failing TestParallelLeafReader (#12865)
This commit fixes the intermittently failing TestParallelLeafReader.

The ParallelLeafReader requires the document order to be consistent across indexes - each document contains the union of the fields of all documents with the same document number. The test asserts this. But now, with MockRandomMergePolicy potentially reversing the doc ID order while merging, this invalidates the assumption of the test indexes and assertions. The solution is to just ensure that no merging actually happens in these tiny test indexes.
2023-12-02 10:02:13 +00:00
Kaival Parikh 65d30ca1af
Prevent extra similarity computation for single-level graphs (#12866)
### Description

[`#findBestEntryPoint`](4bc7850465/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java (L151)) is used to determine the entry point for the last level of HNSW search

It finds the single best-scoring node from [all upper levels](4bc7850465/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java (L159)) - but performs an [unnecessary computation](4bc7850465/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java (L157)) (along with [recording one visited node](4bc7850465/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java (L154))) when the graph just has 1 level (so the entry node is just the overall graph's entry node)

Also added a test to demonstrate this (fails without the changes in PR) -- where we visit `graph.size() + 1` nodes when the `topK` is high (should be a maximum of `graph.size()`)
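
The idea can be sketched on a toy multi-level graph. This is a simplified illustration, not the code in `HnswGraphSearcher`: `EntryPointSketch`, `similarityCalls` and the array-based graph layout are all hypothetical, and the greedy descent is a rough approximation of how upper levels are searched.

```java
/** Toy illustration only: hypothetical names, not Lucene's HnswGraphSearcher. */
final class EntryPointSketch {

  /** Counts similarity computations so the saving is observable. */
  int similarityCalls = 0;

  float score(float[] query, float[] vector) {
    similarityCalls++;
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
      sum += query[i] * vector[i];
    }
    return sum;
  }

  /**
   * neighborsByLevel[level][node] lists the neighbors of node on that level (level 0 is the
   * bottom level). Returns the node from which the bottom-level search should start.
   */
  int findBestEntryPoint(
      float[][] vectors, int[][][] neighborsByLevel, int entryNode, float[] query) {
    int numLevels = neighborsByLevel.length;
    if (numLevels == 1) {
      // No upper levels to descend through: the graph's entry node is the answer,
      // so there is nothing to score or record as visited.
      return entryNode;
    }
    int best = entryNode;
    float bestScore = score(query, vectors[best]);
    for (int level = numLevels - 1; level >= 1; level--) {
      boolean improved = true;
      while (improved) { // greedy walk: keep moving to a better-scoring neighbor
        improved = false;
        for (int neighbor : neighborsByLevel[level][best]) {
          float s = score(query, vectors[neighbor]);
          if (s > bestScore) {
            bestScore = s;
            best = neighbor;
            improved = true;
          }
        }
      }
    }
    return best;
  }

  public static void main(String[] args) {
    EntryPointSketch sketch = new EntryPointSketch();
    float[][] vectors = {{1f, 0f}, {0f, 1f}};
    int[][][] singleLevel = {{{1}, {0}}};
    int entry = sketch.findBestEntryPoint(vectors, singleLevel, 0, new float[] {0f, 1f});
    System.out.println("entry=" + entry + " similarityCalls=" + sketch.similarityCalls);
  }
}
```

With the early return, a single-level graph costs no similarity computations in this step, so the bottom-level search is the only place where nodes are scored.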

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
2023-12-01 13:02:00 -05:00
asubbu90 0e96b9cd8c
TruncateTokenFilterFactory now accepts lengths > 127 (#12507) 2023-12-01 12:01:29 -05:00
Lukáš Vlček ad0c9cca97
Improve Javadoc (#12508)
Remove duplicated line.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
2023-12-01 11:45:17 -05:00
Chris Hegarty b231e5b213
Reconcile changelog 9.9.0 section (#12867)
Reconcile the changelog between branch_9_9 and main. This change just reorders a number of entries in main to match that of branch_9_9.
2023-12-01 14:15:01 +00:00
Jakub Slowinski 4bc7850465
Fix bug in UnescapedCharSequence and add basic unit tests (#12849) 2023-11-30 10:25:24 -08:00
Michael McCandless 00de0aef63 Avoid null PointValues when merging points in SlowCompositeCodecReaderWrapper (#12859)
* avoid null PointValues when merging points in SlowCompositeCodecReaderWrapper

* null check in SortingCodecReader

---------

Co-authored-by: ChrisHegarty <chegar999@gmail.com>
2023-11-30 09:01:19 -05:00
Chris Hegarty 7919133858
Fix intermittently failing TestSortedSetFieldSource (#12850)
This commit fixes the intermittently failing TestSortedSetFieldSource.

The test assertions depend on doc order which may be affected by merging. The fix is to trivially avoid merging for the very small index, with just two docs.
2023-11-30 08:39:26 +00:00
Mike McCandless 6ba38c13eb #12847: move CHANGES.txt entry to 9.9.0 to match backport 2023-11-29 06:34:24 -05:00
Dzung Bui bbf56f9419
Report the time it took for building the FST in Test2BFST (#12847)
* Report the time it took for building the FST

* Update CHANGES

* Change ramBytesUsed to numBytes

* Report the verification time

* Rename to fstSizeInBytes
2023-11-29 06:17:54 -05:00
Mike McCandless 8703b541a5 #240: disable threads in the random IndexSearcher from newSearcher for this test class 2023-11-28 19:37:21 -05:00
Benjamin Trent 2bb69f3246
Fix *HnswVectorsFormat.testIndexedValueNotAliased test flakiness (#12848) 2023-11-28 14:50:48 -05:00
ThomasDC 502f15a89b
Let WordDelimiterGraphFilterFactory propagate ignoreKeywords flag (#12525)
* Let WordDelimiterGraphFilterFactory propagate ignoreKeywords flag

fixes https://github.com/apache/lucene/issues/12522

* Document changes

* Align with default in code
2023-11-28 12:28:07 -05:00
Jakub Slowinski 203f506130
CheckIndex - Adding a `-level` parameter to give ability to control index check detail programmatically (#12797)
* CheckIndex - Making -fast the default behaviour

1. Making -fast the new default.
2. The previous -slow is moved to -slower
3. The previous default behavior (checksum + segment file content) is activated by -slow.

* gradlew tidy

* Add changes.txt

* Moved change to Lucene 10.0, now using -detailLevel param

* Fix failing test

* Add MIGRATE.md note and comment to remove deprecated params

* Fix failing unit test

* Changing detailLevel -> level

* catch invalid API calls

* Update lucene/core/src/java/org/apache/lucene/index/CheckIndex.java

Co-authored-by: Adrien Grand <jpountz@gmail.com>

---------

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2023-11-28 12:19:31 -05:00
gf2121 9574cbd1f1
Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum (#12699) 2023-11-28 13:04:41 +08:00
Zach Chen 38ca8d3e42
LUCENE-10002: Deprecate IndexSearcher#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (#240) 2023-11-27 16:16:31 -08:00
Uwe Schindler 17bb73332c
Only enable support for tests.profile if jdk.jfr module is available in Gradle runtime (#12845) 2023-11-25 20:16:09 +01:00
Ignacio Vera 74085cd1b0
Hide the internal data structure of HeapPointWriter (#12762) 2023-11-24 13:58:39 +01:00
Peter Gromov f460d612b5
hunspell: allow in-memory entry sorting for faster dictionary loading (#12834)
* hunspell: allow in-memory entry sorting for faster dictionary loading

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2023-11-24 08:21:43 +01:00
Simon Willnauer 981339be04
Move MergeState.DocMap to a FunctionalInterface (#12836)
This change converts MergeState.DocMap to a functional interface so that it can be implemented with lambda expressions.
2023-11-23 20:39:15 +01:00
Adrien Grand 0ec7fdb3b5 Revert "Simplify advancing on postings/impacts enums (#12810)"
This reverts commit 5aa401e7d8.
2023-11-23 18:19:12 +01:00
Adrien Grand 14cee15c08 Fix test to preserve index order. 2023-11-23 13:37:56 +01:00
Adrien Grand 5aa401e7d8
Simplify advancing on postings/impacts enums (#12810) 2023-11-23 13:32:08 +01:00
Adrien Grand f7cab16450
Add a merge policy wrapper that performs recursive graph bisection on merge. (#12622)
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc
IDs on merge using a `BPIndexReorderer`.
 - Reordering always runs on forced merges.
 - A `minNaturalMergeNumDocs` parameter helps only enable reordering on the
   larger merged segments. This way, small merges retain all merging
   optimizations like bulk copying of stored fields, and only the larger
   segments - which are the most important for search performance - get
   reordered.
 - If not enough RAM is available to perform reordering, reordering is skipped.

To make this work, I had to add the ability for any merge to reorder doc IDs of
the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the
test framework randomly reverts the order of documents in a merged segment to
make sure this logic is properly exercised.
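
The three rules above can be read as one small decision function. The sketch below is an assumption-laden illustration, not the logic inside `BPReorderingMergePolicy`: the class and method names are hypothetical, the RAM check is reduced to a boolean, and treating the `minNaturalMergeNumDocs` threshold as inclusive is a guess.

```java
/** Toy illustration only: hypothetical helper, not part of BPReorderingMergePolicy. */
final class ReorderDecisionSketch {

  /**
   * Decides whether a merged segment would be reordered under the rules listed above.
   *
   * @param forcedMerge true for a forced merge, false for a natural merge
   * @param mergedNumDocs number of documents in the merged segment
   * @param minNaturalMergeNumDocs size threshold applied to natural merges only
   * @param enoughRam whether enough RAM is available to perform reordering
   */
  static boolean shouldReorder(
      boolean forcedMerge, int mergedNumDocs, int minNaturalMergeNumDocs, boolean enoughRam) {
    if (!enoughRam) {
      return false; // reordering is skipped when RAM is insufficient
    }
    if (forcedMerge) {
      return true; // forced merges are always reordered
    }
    // Assumption: the threshold is inclusive. Small natural merges stay unordered so they
    // keep merging optimizations such as bulk copying of stored fields.
    return mergedNumDocs >= minNaturalMergeNumDocs;
  }

  public static void main(String[] args) {
    System.out.println(shouldReorder(false, 1_000, 1_000_000, true));
    System.out.println(shouldReorder(false, 5_000_000, 1_000_000, true));
  }
}
```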
2023-11-23 13:25:00 +01:00
gf2121 76fe6bdbc1
Improve DirectReader java doc (#12835) 2023-11-23 17:19:57 +08:00
Michael Sokolov 74fe7f7fdf change-log entry for #12817 2023-11-22 17:13:01 -05:00