Commit Graph

35812 Commits

Author SHA1 Message Date
Julie Tibshirani 57d9515eff
LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls (#641)
A couple of the data structures used in HNSW search are pretty large and
expensive to allocate. This commit creates a shared candidates queue and
visited set that are reused across calls to HnswGraph#searchLevel. Now the same
data structures are used for building the entire graph, which can cut down on
allocations during indexing. For graph building it also switches the visited
set to FixedBitSet for better performance.
2022-02-03 16:00:09 -08:00
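The reuse pattern described in the LUCENE-10391 commit above can be sketched in plain Java. This is only an illustration under stated assumptions: `ReusableGraphSearcher` and `countReachable` are hypothetical names, `java.util.BitSet` stands in for Lucene's FixedBitSet, and the traversal is a simple breadth-first walk rather than the actual HnswGraph#searchLevel logic. The point is that the visited set and candidates queue are allocated once and cleared per call.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

// Hypothetical illustration: not Lucene code.
final class ReusableGraphSearcher {
  // Allocated once and shared by every call, instead of once per search.
  private final BitSet visited;
  private final ArrayDeque<Integer> candidates = new ArrayDeque<>();

  ReusableGraphSearcher(int numNodes) {
    this.visited = new BitSet(numNodes);
  }

  // Counts the nodes reachable from entryPoint; neighbors[i] lists the neighbors of node i.
  int countReachable(int[][] neighbors, int entryPoint) {
    // Reset the shared state rather than allocating new structures.
    visited.clear();
    candidates.clear();

    visited.set(entryPoint);
    candidates.add(entryPoint);
    int count = 0;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      count++;
      for (int neighbor : neighbors[node]) {
        if (!visited.get(neighbor)) {
          visited.set(neighbor);
          candidates.add(neighbor);
        }
      }
    }
    return count;
  }
}
```

Calling `countReachable` repeatedly on the same instance reuses the same two objects, which is the allocation saving the commit message refers to.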
Luca Cavanna bade484998
LUCENE-10385: Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery (#635)
IndexSortSortedNumericDocValuesRangeQuery can implement its count method and compute the count through a binary search, the same binary search that is used to execute the query itself, whenever all the required conditions are met.
2022-02-03 17:19:05 +01:00
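The idea behind the LUCENE-10385 commit above, counting matches in sorted data with a binary search, can be sketched as follows. This is a simplified stand-in, not the query's implementation: it assumes the values are already in a sorted `long[]` with one value per document, and `SortedRangeCount` and its methods are hypothetical names.

```java
// Hypothetical illustration: not Lucene code.
final class SortedRangeCount {
  // Index of the first element >= key, or sorted.length if none.
  static int lowerBound(long[] sorted, long key) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Index of the first element > key, or sorted.length if none.
  static int upperBound(long[] sorted, long key) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Number of values in the inclusive range [lower, upper], found without visiting each value.
  static int count(long[] sorted, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    return upperBound(sorted, upper) - lowerBound(sorted, lower);
  }
}
```

The two boundary positions that delimit the matching documents also give the count directly as their difference.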
Luca Cavanna ee7a8d6918
LUCENE-10002: Replace some IndexSearcher#search(Query, Collector) in tests (#639)
Also use LuceneTestCase#newSearcher
2022-02-03 17:17:02 +01:00
Dawid Weiss 9a28c91a5a LUCENE-10283: bump the minimum source/release in javadoc settings. 2022-02-02 17:25:50 +01:00
Dawid Weiss 87bba4152c LUCENE-10283: bump the minimum source/release in ecj linter settings. 2022-02-02 17:25:41 +01:00
Mike Drob 56f49257ed
null check on infoStream (#637) 2022-02-02 09:44:31 -06:00
Mayya Sharipova c8e1c08cc8
Small fix for assertConsistentGraph (#631)
TestKnnGraph.testMultipleVectorFields sometimes breaks with
the following message:

java.lang.NullPointerException: Cannot invoke
 "org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsReader.getGraphValues(String)"
because "vectorReader" is null

This happens in assertConsistentGraph.

This patch ensures that for a segment and a field where no
vectors are indexed, we don't run the consistent graph check.
2022-02-01 10:21:48 -05:00
Dawid Weiss f103cca565 LUCENE-10255: Add the required unnamed modules in benchmarks subproject to module-info so that they are explicit. 2022-02-01 12:15:01 +01:00
Dawid Weiss e7212fa47d LUCENE-10283: bump minimum JDK version to 17 in buildSrc. 2022-02-01 12:09:35 +01:00
Mayya Sharipova 8dfdb261e7
LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#616)
This patch adds KNN vectors for testing backward compatible indices

- Add a KnnVectorField to documents when creating a new backward
  compatible index
- Add knn vectors search and check for vector values to the testing
  of search of backward compatible indices
- Add tests for knn vector search when changing backward compatible
 indices (merging them and adding new documents to them)
2022-01-31 09:20:53 -05:00
Luca Cavanna df12e2b195
LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) 2022-01-31 14:45:35 +01:00
Luca Cavanna 933c54fe87
Improve Weight#count and IndexSearcher#count javadocs (#625) 2022-01-28 16:47:25 +01:00
Robert Muir 61edacee5d
update javac flags for java 17 (#628)
Previously -Xlint:text-blocks and -Xlint:synchronization were enabled
conditionally, if the user had at least java 15 or java 16,
respectively. Enable them always.

Add new options so that the warnings list is fully configured:
* -Xlint:module (new in java 17)
* -Xlint:strictfp (new in java 17)

Disable "path" with -Xlint:-path rather than commenting it out, for
consistency.

Disable "missing-explicit-ctor" (new in java 17), as it is unlikely to
succeed right now.

Alphasort the flags and document how to get the updated list; this makes it
easy to compare and keep up to date.
2022-01-28 05:48:58 -05:00
Adrien Grand 09ddac1fe5
Simplify HnswGraph#search. (#627)
Currently the contract on `bound` is that it holds the score of the top of the
`results` priority queue. It means that a candidate is only considered if its
score is better than the bound *or* if fewer than `topK` results have been
accumulated so far. I think it would be simpler if `bound` always held
the minimum score that is required for a candidate to be considered. This would
also be more consistent with how our WAND support works, by trusting
`setMinCompetitiveScore` alone, instead of having to check whether the priority
queue is full as well.
2022-01-27 18:08:06 +01:00
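The contract proposed in the HnswGraph#search commit above can be sketched like this. It is a minimal illustration, not the Lucene code: `TopKBound`, `offer`, and `bound()` are hypothetical names, scores are plain floats where higher is better, and candidates that tie with the bound are rejected as an arbitrary choice. The bound starts at negative infinity, so no separate "is the queue full" check is needed.

```java
import java.util.PriorityQueue;

// Hypothetical illustration: not Lucene code.
final class TopKBound {
  private final int topK;
  // Min-heap of the best scores seen so far; its head is the weakest kept result.
  private final PriorityQueue<Float> results = new PriorityQueue<>();
  // Minimum score a candidate must exceed to be considered.
  private float bound = Float.NEGATIVE_INFINITY;

  TopKBound(int topK) {
    this.topK = topK;
  }

  // Returns true if the candidate was accepted; only `bound` is consulted.
  boolean offer(float score) {
    if (score <= bound) {
      return false;
    }
    results.add(score);
    if (results.size() > topK) {
      results.poll(); // drop the weakest result
    }
    if (results.size() == topK) {
      bound = results.peek(); // raise the bound only once topK results exist
    }
    return true;
  }

  float bound() {
    return bound;
  }
}
```

Until `topK` results have been collected the bound stays at negative infinity, so every finite score is accepted through the same single comparison.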
Greg Miller 4323848469
LUCENE-10368: Make IntTaxonomyFacets pkg-private (#600) 2022-01-27 08:56:42 -08:00
Mayya Sharipova dcd9e3d6f7
LUCENE-10389 Adjust TestHnswGraph.testRandom (#626)
Before PR #608, this test searched HnswGraph with
numSeed (the search queue size) equal to 100.
This patch restores the search queue size to 100
and takes the top topK results from it.
2022-01-27 09:06:48 -05:00
gf2121 eda9c29b8c
LUCENE-10388: Remove MultiLevelSkipListReader#SkipBuffer (#620) 2022-01-26 16:36:33 +08:00
gf2121 3ad4c1b3c9
clean lastPayloadByteUpto (#619) 2022-01-26 14:03:32 +08:00
Tomoko Uchida e18dfea8bd LUCENE-10389: temporarily disable TestHnswGraph.testRandom() 2022-01-26 10:45:55 +09:00
Mayya Sharipova b0d6fe68d1
LUCENE-10054 Make HnswGraph hierarchical (#608)
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.

This PR is based on the following PRs:
 #250, #267, #287, #315, #536, #416

Main changes:
- Multi layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre 9.1 version. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.

TODO: tests for KNN search for graphs built in pre 9.1 version;
tests for merge of indices of pre 9.1 + current versions.
2022-01-25 13:53:55 -05:00
Mayya Sharipova 1a4f838fe2
LUCENE-10384: Simplify LongHeap small addition (#623)
LUCENE-10384 and PR#615 introduced encoding f into NeighborQueue.
But one function, `nodes()`, was left without this encoding.

Also modify the test that would fail without this patch.
2022-01-25 11:43:40 -05:00
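The commit above does not spell out the encoding it refers to. As a hedged sketch, one common way to store a (node, score) pair in a heap of longs is to pack a sortable form of the float score into the high 32 bits and the node id into the low 32 bits, so that comparing the longs compares the scores. All names below (`ScoreNodeEncoding`, `encode`, `decodeNode`, `decodeScore`) are hypothetical, and the layout is an assumption rather than a description of NeighborQueue.

```java
// Hypothetical illustration: not Lucene code.
final class ScoreNodeEncoding {
  // Maps a float to an int whose signed ordering matches the float ordering.
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // Inverse of floatToSortableInt (the bit transform is its own inverse).
  static float sortableIntToFloat(int sortable) {
    return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
  }

  // Score in the high 32 bits, node id in the low 32 bits.
  static long encode(int node, float score) {
    return (((long) floatToSortableInt(score)) << 32) | (node & 0xFFFFFFFFL);
  }

  static int decodeNode(long encoded) {
    return (int) encoded;
  }

  static float decodeScore(long encoded) {
    return sortableIntToFloat((int) (encoded >> 32));
  }
}
```

With such a layout, any accessor that returns node ids has to decode the stored longs, which is the kind of step a method like `nodes()` could miss.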
Luca Cavanna 11006fba59
LUCENE-10002: Replace simple usages of TotalHitCountCollector with IndexSearcher#count (#612)
When only the number of matching documents is collected, IndexSearcher#search(Query, Collector) is commonly used, which does not use the executor that may have been set on the searcher. Calling `IndexSearcher#count(Query)` makes the code more concise and is also more correct, as it honours the executor that has been set on the searcher instance.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-01-25 16:11:19 +01:00
Tomoko Uchida fd817b6fb1 LUCENE-8930: increase timeout to 1 minute for the launched luke (it seems to occasionally take a long time on Windows VMs) 2022-01-25 22:56:11 +09:00
Tomoko Uchida 0aa526d256 LUCENE-10076: fix the assertion in luke module to only check if the optional has a value 2022-01-25 22:05:27 +09:00
Adrien Grand 07fe46ff86
LUCENE-10384: Simplify LongHeap. (#615)
The min/max ordering logic moves to NeighborQueue.
2022-01-25 09:04:52 +01:00
Greg Miller eaf3cb6739 Fix minor bug that snuck in with LUCENE-9952 2022-01-24 06:58:31 -08:00
Greg Miller 9e560c1af1
LUCENE-9952: Fix dim count inaccuracies in SSDV faceting when a dim is multi-valued (#611) 2022-01-24 06:48:20 -08:00
Greg Miller 10ca531ddc
LUCENE-10381: Require users to provide FacetsConfig for SSDV faceting (#613) 2022-01-24 06:46:22 -08:00
Julie Tibshirani fb09ae1f7c Undo accidental change to build.gradle 2022-01-23 16:26:16 -08:00
Julie Tibshirani 7ece8145bc
LUCENE-10375: Write vectors to file in flush (#617)
In a previous commit, we updated HNSW merge to first write the combined segment
vectors to a file, then use that file to build the graph. This commit applies
the same strategy to flush, which lets us use the same logic for flush and
merge.
2022-01-23 16:19:23 -08:00
Dawid Weiss 08d6633d94 LUCENE-8930: increase timeout for the launched luke. 2022-01-20 16:51:05 +01:00
Ignacio Vera 4ec8f865c8
LUCENE-10288: Check BKD tree shape for lucene pre-8.6 1D indexes (#607)
Adds efficient logic to compute if a tree is balanced or unbalanced for indexes 
created before Lucene 8.6
2022-01-20 07:49:29 +01:00
Dawid Weiss 72ba7ae2ee
LUCENE-8930: script testing in the distribution (#550) 2022-01-20 00:09:15 +09:00
Julie Tibshirani 9b6d417d1c LUCENE-10040: Update HnswGraph javadoc related to deletions
Previously it claimed the search method did not handle deletions.
2022-01-18 15:36:00 -08:00
Julie Tibshirani dfca9a5608
LUCENE-10375: Write merged vectors to file before building graph (#601)
When merging segments together, the `KnnVectorsWriter` creates a `VectorValues`
instance with a merged view of all the segments' vectors. This merged instance
is used when constructing the new HNSW graph. Graph building needs random
access, and the merged VectorValues support this by mapping from merged
ordinals to segments and segment ordinals. This mapping can add significant
overhead when building the graph.

This change updates the HNSW merging logic to first write the combined segment
vectors to a file, then use that file to build the graph. This helps speed
up segment merging, and also lets us simplify `VectorValuesMerger`, which
provides the merged view of vector values.
2022-01-18 13:53:05 -08:00
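The ordinal mapping that the LUCENE-10375 commit above describes as a source of overhead can be pictured with a simplified sketch. It assumes every segment is non-empty and ignores deleted documents, and `MergedOrdinalMap` with its methods is a hypothetical name, not the `VectorValuesMerger` code. Each random access pays for a binary search before the vector can be read, which is the cost that writing the merged vectors to a single file avoids.

```java
import java.util.Arrays;

// Hypothetical illustration: not Lucene code. Assumes all segments are non-empty.
final class MergedOrdinalMap {
  // starts[i] is the first merged ordinal that belongs to segment i.
  private final int[] starts;

  MergedOrdinalMap(int[] segmentSizes) {
    starts = new int[segmentSizes.length];
    int total = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      starts[i] = total;
      total += segmentSizes[i];
    }
  }

  // Segment that holds the given merged ordinal (binary search over the start offsets).
  int segmentOf(int mergedOrd) {
    int idx = Arrays.binarySearch(starts, mergedOrd);
    // A miss returns -(insertionPoint) - 1; the owning segment precedes the insertion point.
    return idx >= 0 ? idx : -idx - 2;
  }

  // Ordinal of the vector within its own segment.
  int segmentOrdinal(int mergedOrd) {
    return mergedOrd - starts[segmentOf(mergedOrd)];
  }
}
```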
Alan Woodward 2e2c4818d1
LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator() (#603)
The sort position parameter in SortField.getComparator() is only ever used
to determine whether or not skipping should be enabled on a given comparator,
so the parameter name should reflect that.  This commit also explicitly disables
skipping in a number of cases where it is never used, in particular CheckIndex
and the grouping collectors.
2022-01-17 10:44:57 +00:00
Adrien Grand 457367e9b7 LUCENE-10168: Fix typo that would _not_ run nightly tests. 2022-01-14 13:51:16 +01:00
Greg Miller 2f5e3c323b
LUCENE-10379: Count directly into the dense values array in FastTaxonomyFacetCounts#countAll (#605)
Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2022-01-13 09:17:55 -08:00
Mayya Sharipova bd2cc4124d
Small edits for KnnGraphTester (#575)
1. Correct the remaining size for input files larger
than Integer.MAX_VALUE, as currently with every
iteration we try to map the next blockSize bytes
even if fewer than blockSize bytes are left in the file.

2. Correct java.lang.ClassCastException when retrieving
KnnGraphValues for stats printing.

3. Add an option for euclidean metric
2022-01-12 17:23:10 -05:00
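Item 1 of the KnnGraphTester commit above concerns the size of the last block when a large file is read in chunks. A minimal sketch of the corrected arithmetic, using the hypothetical names `BlockPlanner` and `blockLengths` and leaving out the actual file mapping, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: not the KnnGraphTester code.
final class BlockPlanner {
  // Returns the length of each block needed to cover fileSize bytes.
  static List<Integer> blockLengths(long fileSize, int blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    List<Integer> lengths = new ArrayList<>();
    for (long offset = 0; offset < fileSize; offset += blockSize) {
      long remaining = fileSize - offset;
      // The final block is capped at the bytes that are actually left.
      lengths.add((int) Math.min(blockSize, remaining));
    }
    return lengths;
  }
}
```

Keeping `offset` and `remaining` as `long` lets the same loop handle files larger than Integer.MAX_VALUE, while each individual block length still fits in an `int`.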
gf2121 8d9fa6dba1
revert LUCENE-10355 (#597)
Trying to find the source of taxo-facet performance regression. See also LUCENE-10374

Co-authored-by: guofeng.my <guofeng.my@bytedance.com>
2022-01-12 10:23:13 -08:00
Adrien Grand 71dfa9e9cd
addBackcompatIndexes.py should use Gradle, not Ant. (#531) 2022-01-12 18:55:59 +01:00
Uwe Schindler 636d42e032 Fix wrong project name 2022-01-11 17:42:21 +01:00
Nikola Grcevski bad65c53c9
LUCENE-10369: Move DelegatingCacheHelper to FilterDirectoryReader (#596) 2022-01-11 15:22:06 +01:00
Adrien Grand 308ddd7502
Add documentation on file formats. (#598) 2022-01-11 15:16:05 +01:00
Adrien Grand f81c760cc8 LUCENE-10370: Fix precommit. 2022-01-11 10:13:10 +01:00
Dawid Weiss 9b54fbaa01 LUCENE-10370: temporarily ignore TestStressNRTReplication 2022-01-11 09:25:31 +01:00
Greg Miller 82703757fe
LUCENE-10245: Addition of MultiDoubleValues(Source) and MultiLongValues(Source) along with faceting capabilities (#543) 2022-01-10 13:48:36 -08:00
Dawid Weiss bff930c1bf LUCENE-10370: temporarily ignore TestNRTReplication. 2022-01-10 22:18:12 +01:00
Greg Miller cf12b46092
LUCENE-10356: Further optimize facet counting for single-valued TaxonomyFacetCounts (#585) 2022-01-10 10:23:46 -08:00
Greg Miller eb0b1bf9f1 Add CHANGES entry for LUCENE-10250 2022-01-10 08:57:28 -08:00