Apache Lucene open-source search software

backend information-retrieval java lucene nosql search search-engine

Kaival Parikh cd195980ec Add support for similarity-based vector searches (#12679)	2023-12-11 14:18:36 -05:00

### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:

1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes? Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
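The two-radius traversal described in this commit message can be illustrated with a toy sketch. This is not Lucene's implementation and uses none of its classes: `RadiusSearchSketch`, `search`, the dot-product `similarity`, the single-level adjacency-list graph, and the two threshold parameters are all hypothetical names chosen for illustration. It also simplifies by treating the entry node like any other node, whereas a real HNSW search would first descend through upper levels to pick the entry point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy illustration of radius-based graph search; not Lucene code. */
public class RadiusSearchSketch {

  /** Dot product, used here as the similarity score (higher means closer). */
  static float similarity(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Expands every reachable node scoring at least traversalThreshold (the outer
   * radius) and collects those scoring at least resultThreshold (the inner radius).
   * Results go into a plain, unsorted list rather than a heap.
   */
  static List<Integer> search(
      float[][] vectors,
      int[][] neighbors,
      int entryNode,
      float[] query,
      float traversalThreshold,
      float resultThreshold) {
    List<Integer> results = new ArrayList<>();
    boolean[] visited = new boolean[vectors.length];
    Deque<Integer> toExplore = new ArrayDeque<>();
    toExplore.push(entryNode);
    visited[entryNode] = true;

    while (!toExplore.isEmpty()) {
      int node = toExplore.pop();
      float score = similarity(query, vectors[node]);
      if (score < traversalThreshold) {
        continue; // outside the outer radius: do not expand its neighbors
      }
      if (score >= resultThreshold) {
        results.add(node); // inside the inner radius: part of the answer
      }
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          toExplore.push(next);
        }
      }
    }
    return results;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0.9f, 0.1f}, {0.5f, 0.5f}, {0f, 1f}, {0.95f, 0f}};
    int[][] neighbors = {{1, 2}, {0, 2}, {0, 1, 3}, {2, 4}, {3}};
    float[] query = {1f, 0f};
    // Node 4 is similar to the query but only reachable through node 3, which
    // lies outside the traversal radius, so this approximate search misses it.
    System.out.println(search(vectors, neighbors, 0, query, 0.4f, 0.8f)); // prints [0, 1]
  }
}
```

In this sketch, results from several independent graphs (standing in for segments) could simply be concatenated, since no node's inclusion depends on scores found elsewhere. The missed node in the example shows why the outer traversal radius matters: widening it trades extra visits for better recall.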
.github	Build: build scans on ge.apache.org to benefit from deep build insights (#12293)	2023-10-24 12:32:18 -04:00
buildSrc	Upgrade ECJ to 3.36.0 (#12888)	2023-12-07 21:13:10 +00:00
dev-docs	a bit of clarification about GitHub Milestone	2022-08-28 13:52:58 +09:00
dev-tools	script to run microbenchmarks across different ec2 instance types (#12787)	2023-11-10 12:31:10 -05:00
gradle	Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873)	2023-12-05 11:53:57 +01:00
help	Add downloading binutils instructions for the macos. (#12804)	2023-11-14 05:51:51 -05:00
lucene	Add support for similarity-based vector searches (#12679)	2023-12-11 14:18:36 -05:00
.asf.yaml	.asf.yaml	2022-08-16 20:02:47 +09:00
.dir-locals.el	LUCENE-9322: Add Lucene90 codec, including VectorFormat	2020-10-18 07:49:36 -04:00
.git-blame-ignore-revs	GITHUB#12655: Add google java format upgrade tidy / regen to blame ignore	2023-10-11 16:15:42 -04:00
.gitattributes	LUCENE-10305: Ensure line endings of versions.props is LF	2021-12-11 10:10:44 +09:00
.gitignore	LUCENE-9920: Remove binary gradle-wrapper.jar from the repository	2021-04-10 16:08:39 +02:00
.hgignore	LUCENE-2792: add FST impl	2010-12-12 15:36:08 +00:00
.lift.toml	Disable liftbot, we have our own tools	2022-05-05 22:27:57 +02:00
CONTRIBUTING.md	Fix type in CONTRIBUTING.md (#11879)	2022-11-01 20:10:05 +00:00
LICENSE.txt	LUCENE-10163 Move LICENSE and NOTICE file to top level (#388)	2021-10-18 01:24:11 +02:00
NOTICE.txt	Cleanup NOTICE.txt (#12227)	2023-04-18 15:58:09 -04:00
README.md	Allow building with java 18 now that gradle supports it (#11889)	2022-10-28 23:41:09 -04:00
build.gradle	Only enable support for tests.profile if jdk.jfr module is available in Gradle runtime (#12845)	2023-11-25 20:16:09 +01:00
gradlew	GITHUB#12655: Upgrade to Gradle 8.4	2023-10-11 16:11:53 -04:00
gradlew.bat	GITHUB#12655: Upgrade to Gradle 8.4	2023-10-11 16:11:53 -04:00
settings.gradle	Build: build scans on ge.apache.org to benefit from deep build insights (#12293)	2023-10-24 12:32:18 -04:00
versions.lock	Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873)	2023-12-05 11:53:57 +01:00
versions.props	Rewrite JavaScriptCompiler to use modern JVM features (Java 17) (#12873)	2023-12-05 11:53:57 +01:00

README.md

Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written in Java.

Online Documentation

This README file only contains basic setup instructions. For more comprehensive documentation, visit:

Latest Releases: https://lucene.apache.org/core/documentation.html
Nightly: https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/javadoc/
Build System Documentation: help/
Developer Documentation: dev-docs/
Migration Guide: lucene/MIGRATE.md

Building

Basic steps:

Install OpenJDK 17 or 18.
Clone Lucene's git repository (or download the source distribution).
Run the Gradle launcher script (gradlew).

We assume that you know how to obtain and set up the JDK. If you don't, we suggest starting at https://jdk.java.net/ and learning more about Java before returning to this README.

See the Contributing Guide for details.

Contributing

Bug fixes, improvements and new features are always welcome! Please review the Contributing to Lucene Guide for information on contributing.

Discussion and Support

Users Mailing List
Developers Mailing List
IRC: #lucene and #lucene-dev on freenode.net