35693 Commits

Author SHA1 Message Date
Mayya Sharipova b0d6fe68d1
LUCENE-10054 Make HnswGraph hierarchical (#608)
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.

This PR is based on the following PRs:
 #250, #267, #287, #315, #536, #416

Main changes:
- Multiple layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre-9.1 versions. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.

TODO: tests for KNN search on graphs built in pre-9.1 versions;
tests for merging indices from pre-9.1 and current versions.
2022-01-25 13:53:55 -05:00
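
To illustrate the multi-layer idea behind the LUCENE-10054 entry above, here is a minimal sketch of a greedy top-down search over a layered graph. It is not Lucene's HnswGraph API: the class `LayeredGraphSketch`, the `Map`-based layer representation, and the helper names are all hypothetical, and it uses a beam width of 1 on every layer, whereas HNSW implementations typically keep a larger candidate set on the bottom layer.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; names and data layout are assumptions, not Lucene code. */
public class LayeredGraphSketch {

  private static final int[] NO_NEIGHBORS = new int[0];

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * layers.get(l) maps a node id to its neighbor ids on layer l.
   * Layer 0 holds every node; higher layers hold progressively fewer.
   */
  static int greedySearch(
      float[][] vectors, List<Map<Integer, int[]>> layers, int entryPoint, float[] query) {
    int current = entryPoint;
    // Start on the sparsest (top) layer and descend to layer 0.
    for (int level = layers.size() - 1; level >= 0; level--) {
      Map<Integer, int[]> layer = layers.get(level);
      boolean improved = true;
      while (improved) {
        improved = false;
        // Move to any neighbor that is closer to the query than the current node.
        for (int neighbor : layer.getOrDefault(current, NO_NEIGHBORS)) {
          if (squaredDistance(vectors[neighbor], query)
              < squaredDistance(vectors[current], query)) {
            current = neighbor;
            improved = true;
          }
        }
      }
      // The best node found on this layer becomes the entry point for the next one.
    }
    return current;
  }

  /** Six 1-D points in a chain on layer 0, with a shortcut 0 <-> 4 on layer 1. */
  static int demo() {
    float[][] vectors = {{0f}, {1f}, {2f}, {3f}, {4f}, {5f}};
    Map<Integer, int[]> layer0 = Map.of(
        0, new int[] {1}, 1, new int[] {0, 2}, 2, new int[] {1, 3},
        3, new int[] {2, 4}, 4, new int[] {3, 5}, 5, new int[] {4});
    Map<Integer, int[]> layer1 = Map.of(0, new int[] {4}, 4, new int[] {0});
    return greedySearch(vectors, List.of(layer0, layer1), 0, new float[] {4.8f});
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In the demo, the upper layer lets the search jump from node 0 to node 4 in one hop before the bottom layer refines the result, which is the intuition for why extra layers can shorten the search path.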
Mayya Sharipova 1a4f838fe2
LUCENE-10384: Simplify LongHeap small addition (#623)
LUCENE-10384 and PR#615 introduced encoding f into NeighborQueue.
But one function, `nodes()`, still lacked this encoding.

Also modify the test that would fail without this patch.
2022-01-25 11:43:40 -05:00
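
The entry above does not spell out the encoding, so the following is only a sketch of one common approach, under the assumption that a score and a node id are packed into a single `long` so that a heap of primitive longs orders entries by score. The class `ScoreNodeEncodingSketch` and all of its method names are hypothetical; a `nodes()`-style accessor then has to decode the node id from each packed value rather than return the raw longs.

```java
/** Hypothetical illustration only; not the NeighborQueue or LongHeap source. */
public class ScoreNodeEncodingSketch {

  /** Maps float bits to an int whose signed order matches the float order. */
  static int sortableBits(float f) {
    int bits = Float.floatToIntBits(f);
    // For negative floats, flip the 31 value bits so larger magnitude sorts lower.
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  /** Score goes in the high 32 bits, node id in the low 32 bits. */
  static long encode(float score, int node) {
    return ((long) sortableBits(score) << 32) | (node & 0xFFFFFFFFL);
  }

  static int decodeNode(long encoded) {
    return (int) encoded; // low 32 bits
  }

  static float decodeScore(long encoded) {
    int sortable = (int) (encoded >> 32);
    // The bit flip is its own inverse, so applying it again restores the float bits.
    return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
  }

  /** A nodes()-style accessor must decode each entry instead of exposing the packed value. */
  static int[] nodes(long[] encodedEntries) {
    int[] result = new int[encodedEntries.length];
    for (int i = 0; i < encodedEntries.length; i++) {
      result[i] = decodeNode(encodedEntries[i]);
    }
    return result;
  }

  public static void main(String[] args) {
    long a = encode(0.5f, 7);
    long b = encode(0.9f, 3);
    System.out.println(a < b);          // comparison of packed longs follows the scores
    System.out.println(decodeNode(a));  // node id recovered from the low bits
  }
}
```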
Luca Cavanna 11006fba59
LUCENE-10002: Replace simple usages of TotalHitCountCollector with IndexSearcher#count (#612)
When only the number of matching documents is needed, IndexSearcher#search(Query, Collector) is commonly used, which does not use the executor that may have been set on the searcher. Calling `IndexSearcher#count(Query)` makes the code more concise and is also more correct, as it honours the executor set on the searcher instance.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-01-25 16:11:19 +01:00
Tomoko Uchida fd817b6fb1 LUCENE-8930: increase timeout to 1 minute for the launched luke (it seems to occasionally take a long time on Windows VMs) 2022-01-25 22:56:11 +09:00
Tomoko Uchida 0aa526d256 LUCENE-10076: fix the assertion in luke module to only check if the optional has a value 2022-01-25 22:05:27 +09:00
Adrien Grand 07fe46ff86
LUCENE-10384: Simplify LongHeap. (#615)
The min/max ordering logic moves to NeighborQueue.
2022-01-25 09:04:52 +01:00
Greg Miller eaf3cb6739 Fix minor bug that snuck in with LUCENE-9952 2022-01-24 06:58:31 -08:00
Greg Miller 9e560c1af1
LUCENE-9952: Fix dim count inaccuracies in SSDV faceting when a dim is multi-valued (#611) 2022-01-24 06:48:20 -08:00
Greg Miller 10ca531ddc
LUCENE-10381: Require users to provide FacetsConfig for SSDV faceting (#613) 2022-01-24 06:46:22 -08:00
Julie Tibshirani fb09ae1f7c Undo accidental change to build.gradle 2022-01-23 16:26:16 -08:00
Julie Tibshirani 7ece8145bc
LUCENE-10375: Write vectors to file in flush (#617)
In a previous commit, we updated HNSW merge to first write the combined segment
vectors to a file, then use that file to build the graph. This commit applies
the same strategy to flush, which lets us use the same logic for flush and
merge.
2022-01-23 16:19:23 -08:00
Dawid Weiss 08d6633d94 LUCENE-8930: increase timeout for the launched luke. 2022-01-20 16:51:05 +01:00
Ignacio Vera 4ec8f865c8
LUCENE-10288: Check BKD tree shape for lucene pre-8.6 1D indexes (#607)
Adds efficient logic to compute if a tree is balanced or unbalanced for indexes 
created before Lucene 8.6
2022-01-20 07:49:29 +01:00
Dawid Weiss 72ba7ae2ee
LUCENE-8930: script testing in the distribution (#550) 2022-01-20 00:09:15 +09:00
Julie Tibshirani 9b6d417d1c LUCENE-10040: Update HnswGraph javadoc related to deletions
Previously it claimed the search method did not handle deletions.
2022-01-18 15:36:00 -08:00
Julie Tibshirani dfca9a5608
LUCENE-10375: Write merged vectors to file before building graph (#601)
When merging segments together, the `KnnVectorsWriter` creates a `VectorValues`
instance with a merged view of all the segments' vectors. This merged instance
is used when constructing the new HNSW graph. Graph building needs random
access, and the merged VectorValues support this by mapping from merged
ordinals to segments and segment ordinals. This mapping can add significant
overhead when building the graph.

This change updates the HNSW merging logic to first write the combined segment
vectors to a file, then use that file to build the graph. This helps speed
up segment merging, and also lets us simplify `VectorValuesMerger`, which
provides the merged view of vector values.
2022-01-18 13:53:05 -08:00
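
As a rough illustration of the overhead described in the LUCENE-10375 entry above, the sketch below contrasts resolving a merged ordinal to a (segment, segment ordinal) pair on every access with copying all vectors into one contiguous array first. It is an in-memory analogy, not the `KnnVectorsWriter` code: the class `MergedOrdinalSketch`, its methods, and the use of arrays in place of a file are assumptions, and it assumes every segment holds at least one vector.

```java
import java.util.Arrays;

/** Hypothetical illustration only; arrays stand in for segments and for the temporary file. */
public class MergedOrdinalSketch {

  /** starts[i] is the first merged ordinal that belongs to segment i. */
  static int[] segmentStarts(float[][][] segments) {
    int[] starts = new int[segments.length];
    int next = 0;
    for (int i = 0; i < segments.length; i++) {
      starts[i] = next;
      next += segments[i].length;
    }
    return starts;
  }

  /** Merged view: every random access pays for a binary search plus an indirection. */
  static int[] resolve(int[] starts, int mergedOrd) {
    int idx = Arrays.binarySearch(starts, mergedOrd);
    // A negative result encodes the insertion point; the owning segment is the one before it.
    int segment = idx >= 0 ? idx : -idx - 2;
    return new int[] {segment, mergedOrd - starts[segment]};
  }

  static float[] readMerged(float[][][] segments, int[] starts, int mergedOrd) {
    int[] location = resolve(starts, mergedOrd);
    return segments[location[0]][location[1]];
  }

  /** Flattened copy: pay once up front, after which each access is a plain index. */
  static float[][] flatten(float[][][] segments) {
    int total = 0;
    for (float[][] segment : segments) {
      total += segment.length;
    }
    float[][] flat = new float[total][];
    int ord = 0;
    for (float[][] segment : segments) {
      for (float[] vector : segment) {
        flat[ord++] = vector;
      }
    }
    return flat;
  }

  public static void main(String[] args) {
    float[][][] segments = {
      {{0f}, {1f}, {2f}},
      {{3f}, {4f}},
      {{5f}, {6f}, {7f}, {8f}}
    };
    int[] starts = segmentStarts(segments);
    System.out.println(Arrays.toString(resolve(starts, 4)));
    System.out.println(flatten(segments)[4][0] == readMerged(segments, starts, 4)[0]);
  }
}
```

Graph construction reads vectors many times in no particular order, so a one-time sequential copy can pay for itself when the per-access mapping cost is repeated for every distance computation.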
Alan Woodward 2e2c4818d1
LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator() (#603)
The sort position parameter in SortField.getComparator() is only ever used
to determine whether or not skipping should be enabled on a given comparator,
so the parameter name should reflect that.  This commit also explicitly disables
skipping in a number of cases where it is never used, in particular CheckIndex
and the grouping collectors.
2022-01-17 10:44:57 +00:00
Adrien Grand 457367e9b7 LUCENE-10168: Fix typo that would _not_ run nightly tests. 2022-01-14 13:51:16 +01:00
Greg Miller 2f5e3c323b
LUCENE-10379: Count directly into the dense values array in FastTaxonomyFacetCounts#countAll (#605)
Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2022-01-13 09:17:55 -08:00
Mayya Sharipova bd2cc4124d
Small edits for KnnGraphTester (#575)
1. Correct the remaining size for input files larger
than Integer.MAX_VALUE, as currently with every
iteration we try to map the next blockSize of bytes
even if fewer than blockSize bytes are left in the file.

2. Correct java.lang.ClassCastException when retrieving
KnnGraphValues for stats printing.

3. Add an option for euclidean metric
2022-01-12 17:23:10 -05:00
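
Item 1 of the entry above can be sketched as follows: when a large file is processed in fixed-size blocks, the last block must be clamped to the bytes that remain. This is a minimal sketch, not the KnnGraphTester code; the class `BlockSizesSketch` and the method `blockLengths` are hypothetical names, and only the `blockSize` term comes from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; computes the length of each block to map or read. */
public class BlockSizesSketch {

  static List<Long> blockLengths(long fileSize, long blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<Long> lengths = new ArrayList<>();
    // long offsets keep the arithmetic correct for files larger than Integer.MAX_VALUE.
    for (long offset = 0; offset < fileSize; offset += blockSize) {
      long remaining = fileSize - offset;
      // Never request more bytes than are left in the file.
      lengths.add(Math.min(blockSize, remaining));
    }
    return lengths;
  }

  public static void main(String[] args) {
    System.out.println(blockLengths(10, 4));
  }
}
```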
gf2121 8d9fa6dba1
revert LUCENE-10355 (#597)
Trying to find the source of taxo-facet performance regression. See also LUCENE-10374

Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2022-01-12 10:23:13 -08:00
Adrien Grand 71dfa9e9cd
addBackcompatIndexes.py should use Gradle, not Ant. (#531) 2022-01-12 18:55:59 +01:00
Uwe Schindler 636d42e032 Fix wrong project name 2022-01-11 17:42:21 +01:00
Nikola Grcevski bad65c53c9
LUCENE-10369: Move DelegatingCacheHelper to FilterDirectoryReader (#596) 2022-01-11 15:22:06 +01:00
Adrien Grand 308ddd7502
Add documentation on file formats. (#598) 2022-01-11 15:16:05 +01:00
Adrien Grand f81c760cc8 LUCENE-10370: Fix precommit. 2022-01-11 10:13:10 +01:00
Dawid Weiss 9b54fbaa01 LUCENE-10370: temporarily ignore TestStressNRTReplication 2022-01-11 09:25:31 +01:00
Greg Miller 82703757fe
LUCENE-10245: Addition of MultiDoubleValues(Source) and MultiLongValues(Source) along with faceting capabilities (#543) 2022-01-10 13:48:36 -08:00
Dawid Weiss bff930c1bf LUCENE-10370: temporarily ignore TestNRTReplication. 2022-01-10 22:18:12 +01:00
Greg Miller cf12b46092
LUCENE-10356: Further optimize facet counting for single-valued TaxonomyFacetCounts (#585) 2022-01-10 10:23:46 -08:00
Greg Miller eb0b1bf9f1 Add CHANGES entry for LUCENE-10250 2022-01-10 08:57:28 -08:00
Marc D'mello b4e27f2c63
LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets (#509) 2022-01-10 08:52:14 -08:00
gf2121 e750f6cd37
LUCENE-10350: Avoid some null checking for FastTaxonomyFacetCounts#countAll() (#578) 2022-01-10 07:43:09 -08:00
Adrien Grand 2ebc57a465
LUCENE-10283: Bump minimum required Java version to 17. (#579)
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
2022-01-10 15:42:15 +01:00
Adrien Grand 74698994a9
Simplify some exception handling with try-with-resources. (#589) 2022-01-10 15:40:47 +01:00
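
For readers unfamiliar with the construct named in the entry above, here is a generic sketch of the kind of simplification try-with-resources allows. It is not taken from the commit; the class `TryWithResourcesSketch` and its `Resource` type are hypothetical.

```java
/** Hypothetical illustration only; shows the same cleanup written two ways. */
public class TryWithResourcesSketch {

  static final class Resource implements AutoCloseable {
    boolean closed;

    int read() {
      return 42;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Manual cleanup: the close call lives in an explicit finally block. */
  static int readManually(Resource resource) {
    try {
      return resource.read();
    } finally {
      resource.close();
    }
  }

  /** try-with-resources: close() runs automatically, even if read() throws. */
  static int readWithResources(Resource resource) {
    try (Resource r = resource) {
      return r.read();
    }
  }

  public static void main(String[] args) {
    Resource resource = new Resource();
    System.out.println(readWithResources(resource));
    System.out.println(resource.closed);
  }
}
```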
Yannick Welsch d9d65ab849
LUCENE-10291: Don't use CFS in testMinimalCodec (#593)
This test was occasionally failing on CI, as the test randomly installed a merge policy
that would force compound file creation while the goal of the test was not to do so.
2022-01-10 12:17:45 +00:00
Uwe Schindler 42fe2d5620
LUCENE-10364: Prepare and update errorprone plugin for Java 17 (#590) 2022-01-07 19:19:46 +01:00
zacharymorn d0ad9f5bfc
LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues (#534) 2022-01-06 22:14:41 -08:00
Robert Muir f2e00bb9e0
LUCENE-10353: add random null injection to TestRandomChains (#586)
Co-authored-by: Uwe Schindler <uschindler@apache.org>, Robert Muir <rmuir@apache.org>
2022-01-06 16:56:49 +01:00
Adrien Grand 603a43f668
Fix path of docs for import into the website. (#524)
The current `svn import` looks for docs where they used to be produced by the
`Ant` build, but `Gradle` now puts them in a different place.
2022-01-06 09:26:45 +01:00
Dawid Weiss b8da9f32c8 LUCENE-10328: open up certain packages for junit and the test framework (reflective access). 2022-01-05 21:02:51 +01:00
Dawid Weiss ff547e7bbd
LUCENE-10328: Module path for compiling and running tests is wrong (#571) 2022-01-05 20:42:02 +01:00
Adrien Grand c8651afde7
LUCENE-10354: Clarify contract of codec APIs with missing/disabled fields. (#583) 2022-01-05 18:47:35 +01:00
Adrien Grand 7fdba36941 LUCENE-10291: Bug fix. 2022-01-05 16:37:37 +01:00
Adrien Grand f9ff620ec6 LUCENE-10291: CHANGES entry 2022-01-05 16:30:58 +01:00
Yannick Welsch 8fa7412dec
LUCENE-10291: Only read/write postings when there is at least one indexed field (#539) 2022-01-05 16:28:00 +01:00
Adrien Grand 65296e5f84
Use CDN to download source release. (#529) 2022-01-05 15:54:33 +01:00
Adrien Grand 6149387f7c
Modernize release announcement text. (#525)
It currently reads as if Lucene were only a full-text search library, when it
can do much more than that nowadays.
2022-01-05 15:53:49 +01:00
Uwe Schindler 475fbd0bdd
LUCENE-10352: Convert TestAllAnalyzersHaveFactories and TestRandomChains to a global integration test and discover classes to check from module system (#582)
Co-authored-by: Robert Muir <rmuir@apache.org>
2022-01-05 15:35:02 +01:00
gf2121 238119224a
LUCENE-10343: Remove MyRandom in favor of test framework random (#573) 2022-01-05 15:31:00 +01:00