Apache Lucene open-source search software
Go to file
Kaival Parikh cd195980ec
Add support for similarity-based vector searches (#12679)
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
2023-12-11 14:18:36 -05:00
.github Build: build scans on ge.apache.org to benefit from deep build insights (#12293) 2023-10-24 12:32:18 -04:00
buildSrc Upgrade ECJ to 3.36.0 (#12888) 2023-12-07 21:13:10 +00:00
dev-docs a bit of clarification about GitHub Milestone 2022-08-28 13:52:58 +09:00
dev-tools script to run microbenchmarks across different ec2 instance types (#12787) 2023-11-10 12:31:10 -05:00
gradle Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873) 2023-12-05 11:53:57 +01:00
help Add downloading binutils instructions for the macos. (#12804) 2023-11-14 05:51:51 -05:00
lucene Add support for similarity-based vector searches (#12679) 2023-12-11 14:18:36 -05:00
.asf.yaml .asf.yaml 2022-08-16 20:02:47 +09:00
.dir-locals.el LUCENE-9322: Add Lucene90 codec, including VectorFormat 2020-10-18 07:49:36 -04:00
.git-blame-ignore-revs GITHUB#12655: Add google java format upgrade tidy / regen to blame ignore 2023-10-11 16:15:42 -04:00
.gitattributes LUCENE-10305: Ensure line endings of versions.props is LF 2021-12-11 10:10:44 +09:00
.gitignore LUCENE-9920: Remove binary gradle-wrapper.jar from the repository 2021-04-10 16:08:39 +02:00
.hgignore LUCENE-2792: add FST impl 2010-12-12 15:36:08 +00:00
.lift.toml Disable liftbot, we have our own tools 2022-05-05 22:27:57 +02:00
CONTRIBUTING.md Fix type in CONTRIBUTING.md (#11879) 2022-11-01 20:10:05 +00:00
LICENSE.txt LUCENE-10163 Move LICENSE and NOTICE file to top level (#388) 2021-10-18 01:24:11 +02:00
NOTICE.txt Cleanup NOTICE.txt (#12227) 2023-04-18 15:58:09 -04:00
README.md Allow building with java 18 now that gradle supports it (#11889) 2022-10-28 23:41:09 -04:00
build.gradle Only enable support for tests.profile if jdk.jfr module is available in Gradle runtime (#12845) 2023-11-25 20:16:09 +01:00
gradlew GITHUB#12655: Upgrade to Gradle 8.4 2023-10-11 16:11:53 -04:00
gradlew.bat GITHUB#12655: Upgrade to Gradle 8.4 2023-10-11 16:11:53 -04:00
settings.gradle Build: build scans on ge.apache.org to benefit from deep build insights (#12293) 2023-10-24 12:32:18 -04:00
versions.lock Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873) 2023-12-05 11:53:57 +01:00
versions.props Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873) 2023-12-05 11:53:57 +01:00

README.md

Apache Lucene

Lucene Logo

Apache Lucene is a high-performance, full-featured text search engine library written in Java.

Build Status

Online Documentation

This README file only contains basic setup instructions. For more comprehensive documentation, visit:

Building

Basic steps:

  1. Install OpenJDK 17 or 18.
  2. Clone Lucene's git repository (or download the source distribution).
  3. Run gradle launcher script (gradlew).

We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README.

See Contributing Guide for details.

Contributing

Bug fixes, improvements and new features are always welcome! Please review the Contributing to Lucene Guide for information on contributing.

Discussion and Support