35246 Commits

Author SHA1 Message Date
Mayya Sharipova 257d256def
LUCENE-10054 Make HnswGraph hierarchical (#250)
Currently HNSW has only a single layer.
This is the first part to make it multi-layered.

To keep changes small, this PR only adds
multiple layers in the HnswGraph class.

TODO for following PRs:
- modify graph construction and search algorithm for a hierarchical
graph.
- modify Lucene90HnswVectorsWriter and Lucene90HnswVectorsReader to
write and read multiple layers
2021-08-23 15:54:26 -04:00
Greg Miller 46fa09d265
LUCENE-5309: Optimize facet counting for single-valued SSDV / StringValueFacetCounts (#255) 2021-08-23 10:01:23 -07:00
51search 191ee3ad3e
LUCENE-10058: fix gradle lucene:benchmark:run error (#253) 2021-08-23 10:36:33 -04:00
Uwe Schindler 5813292de2 LUCENE-10055: Update Subversion folder for Javadocs 2021-08-22 13:34:09 +02:00
Michael Sokolov 054b444c14 Fix off-by-one in TestDemo.testKnnVectorSearch 2021-08-21 14:22:47 -04:00
Mike Drob c36495dce7
LUCENE-10017 Less verbose exception on IndexFormatTooOld (#200) 2021-08-20 15:40:52 -05:00
Dzung Bui 0c3c8ec09a
LUCENE-10059: Fix an AssertionError when JapaneseTokenizer tries to backtrace from and to the same position (#254)
Co-authored-by: Anh Dung Bui <buidun@amazon.com>
2021-08-20 08:21:58 -04:00
Michael Sokolov 5896e5389a
LUCENE-10057: Use Lucene abstractions to store demo KnnVectorDict (Dawid Weiss) 2021-08-19 16:14:06 -04:00
Michael Sokolov eeb296ce90
LUCENE-8638: remove LegacyBM25Similarity 2021-08-18 15:44:56 -04:00
Michael Sokolov b8210dee7a Close vector dictionary when exiting the demo 2021-08-18 15:43:33 -04:00
Michael Sokolov d1d60e2db6
LUCENE-8638: remove unused deprecated methods and related tests (#248) 2021-08-18 08:19:49 -04:00
Michael Sokolov 666c7a2590
LUCENE-8638: remove deprecated FST get by output 2021-08-18 08:15:31 -04:00
Michael Sokolov a37844aedd
LUCENE-10016: Added KnnVector index/query support to demo 2021-08-18 08:13:59 -04:00
Michael Sokolov 4213f9d3cd
LUCENE-8638: remove long-deprecated Jaspell suggester 2021-08-17 17:45:22 -04:00
Michael McCandless 65a53450dc
LUCENE-10052: first cut at LTC.newBytesRef methods, and switching a few test cases over (#245)
* LUCENE-10052: first cut at LTC.newBytesRef methods, to randomize the offset/length of a BytesRef, and switching a few test cases over
2021-08-17 16:18:40 -04:00
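
The idea behind LUCENE-10052 above (randomizing the offset/length of a byte slice so tests do not silently assume offset 0) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual LTC.newBytesRef code: the `RandomSliceSketch` and `Slice` names, and the padding bound of 8, are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: wraps test bytes in a larger array at a random offset.
class RandomSliceSketch {

  // Stand-in for a (bytes, offset, length) slice; not Lucene's BytesRef.
  static final class Slice {
    final byte[] bytes;
    final int offset;
    final int length;

    Slice(byte[] bytes, int offset, int length) {
      this.bytes = bytes;
      this.offset = offset;
      this.length = length;
    }

    // Copies out only the valid region of the slice.
    byte[] copy() {
      return Arrays.copyOfRange(bytes, offset, offset + length);
    }
  }

  // Places content inside a larger array, surrounded by random junk bytes.
  static Slice newSlice(byte[] content, Random random) {
    int before = random.nextInt(8); // random leading padding (assumed bound)
    int after = random.nextInt(8);  // random trailing padding (assumed bound)
    byte[] padded = new byte[before + content.length + after];
    random.nextBytes(padded);       // junk bytes around the real content
    System.arraycopy(content, 0, padded, before, content.length);
    return new Slice(padded, before, content.length);
  }
}
```

Code under test that reads `bytes[0]` instead of `bytes[offset]`, or uses `bytes.length` instead of `length`, is likely to fail against such a slice.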
Michael Sokolov 2d21a600ba
LUCENE-8638: remove deprecated code (#243) 2021-08-17 13:51:04 -04:00
Julie Tibshirani 29ed3908ea LUCENE-9614: Small fixes to KnnVectorQuery hashCode and toString 2021-08-16 09:10:53 -07:00
Julie Tibshirani e48be684b2
LUCENE-9614: Prevent TestKnnVectorQuery from using simple text codec (#244)
The simple text codec doesn't support kNN searches, so the test will fail when
we randomly choose to use it.
2021-08-16 09:11:03 -07:00
Julie Tibshirani 6993fb9a99
LUCENE-10040: Handle deletions in nearest vector search (#239)
This PR extends VectorReader#search to take a parameter specifying the live
docs. LeafReader#searchNearestVectors then always returns the k nearest
undeleted docs.

To implement this, the HNSW algorithm will only add a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.
2021-08-16 07:44:17 -07:00
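
The behavior described in LUCENE-10040 above (deleted docs are still traversed, but only live docs enter the result set) can be illustrated with a simplified sketch. This is not the Lucene HNSW implementation: it uses one-dimensional "vectors", a single-layer graph, and an exhaustive best-first traversal with no early termination. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of live-docs filtering during a graph search.
class LiveDocsSearchSketch {

  // Returns up to k live node ids nearest to query, closest first.
  // A null liveDocs means "no deletions".
  static List<Integer> search(
      float[] values, int[][] neighbors, int entry, float query, int k, BitSet liveDocs) {
    boolean[] visited = new boolean[values.length];
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(n -> Math.abs(values[n] - query));
    // Candidates are expanded closest first.
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance);
    // Results keep the farthest node on top so it can be evicted.
    PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed());

    candidates.add(entry);
    visited[entry] = true;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      // Only live docs are added to the result set ...
      if (liveDocs == null || liveDocs.get(node)) {
        results.add(node);
        if (results.size() > k) {
          results.poll(); // drop the current farthest result
        }
      }
      // ... but neighbors are explored even when the node itself is deleted.
      for (int nb : neighbors[node]) {
        if (!visited[nb]) {
          visited[nb] = true;
          candidates.add(nb);
        }
      }
    }
    List<Integer> out = new ArrayList<>(results);
    out.sort(byDistance);
    return out;
  }
}
```

In a chain graph 0-1-2-3 where node 2 is deleted, a search starting at node 0 can still reach node 3, because node 2 is traversed even though it is never returned.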
Mike McCandless 19e5c00a4f LUCENE-10014: fix performance bug: when writing doc values with block GCD compression we were unnecessarily wasting index storage by failing to take full advantage of the GCD compression 2021-08-16 08:40:02 -04:00
Mike McCandless b18f714096 LUCENE-10008: add CHANGES entry 2021-08-13 14:47:53 -04:00
Vigya Sharma cb4c8ae07f
LUCENE-10008: Respect ignoreCase flag in CommonGramsFilterFactory and factor out a common abstract base class AbstractWordsFileFilterFactory.java (#188) 2021-08-13 14:45:58 -04:00
Michael Sokolov 624560a3d7
LUCENE-9614: add KnnVectorQuery implementation 2021-08-13 12:15:40 -04:00
Julie Tibshirani a9fb5a965d
LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 (#232)
In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively.
2021-08-11 13:29:12 +03:00
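
The skip rule described in LUCENE-10043 above can be sketched as a single comparison. This is a minimal illustration under assumptions, not the LRUQueryCache source: the class name, method name, and exact form of the comparison are hypothetical; only the factor values 250 and 10 come from the commit message.

```java
// Hypothetical sketch of the "skip caching when too expensive" check.
class SkipCacheSketch {

  // Returns true when the clause is considered too costly to cache,
  // i.e. its estimated cost exceeds skipCacheFactor times the lead cost.
  static boolean shouldSkipCaching(long clauseCost, long leadCost, float skipCacheFactor) {
    return clauseCost / skipCacheFactor > leadCost;
  }
}
```

For example, a clause with estimated cost 1000 and a lead iterator cost of 10 is 100x more expensive than the lead: with a factor of 250 it would still be cached, while with a factor of 10 it is skipped.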
Mike McCandless 931ff63232 LUCENE-9963: add CHANGES entry 2021-08-09 16:11:31 -04:00
Geoffrey Lawson 647255b4d2
LUCENE-9963 Improve FlattenGraphFilter's robustness when handling incoming token graphs with holes (#157)
6 main improvements:
    1) Iterate through all output.InputNodes since dest gaps can exist.
    2) freeBefore the minimum input node instead of the first input node (which was usually, but not always, the minimum).
    3) Don't freeBefore from a hole source node. Bookkeeping may not be correct and could result in an early free.
    4) When adding an output node after hole recovery, calculate its new position increment instead of adding it to the end of the output graph.
    5) Nodes after holes that have edges to their source will do the output re-mapping that the deleted node would have done.
    6) If a disconnected input node swaps order with another node in the output, then map them to the same output node.

Co-authored-by: Lawson <geoffrl@amazon.com>
2021-08-09 16:06:53 -04:00
Greg Miller a11457b4e6
LUCENE-10047: Fix value de-duping check in LongValueFacetCounts and RangeFacetCounts (#237) 2021-08-07 10:20:49 -07:00
Greg Miller e937e739f3
LUCENE-10046: Fix counting bug in StringValueFacetCounts (#236) 2021-08-07 07:32:50 -07:00
Greg Miller 3037e33025
Slight improvement/optimization to duplicate facet value checking (ref: LUCENE-9964) (#234) 2021-08-06 12:57:09 -07:00
Greg Miller 645b64ef4e Update CHANGES entry for LUCENE-9945 after backporting 2021-08-02 16:38:10 -07:00
Sejal Pawar a76f2f8072
LUCENE-9945: Extend DrillSidewaysResult to expose drillDowns and drillSideways (#159) 2021-08-02 16:01:08 -07:00
Greg Miller 7450a7e64b Update CHANGES entry for LUCENE-10030 after backporting 2021-08-01 12:39:11 -07:00
Dawid Weiss b016c8dc2a LUCENE-10042: JAR minimal manifest JDK entries are incorrectly set to build-JVM 2021-08-01 14:14:42 +02:00
Mayya Sharipova 597398439c LUCENE-10027 Changes for Dir Open with leafSorter
Adjust changes to the Directory Open API from the commit with
leafSorter in accordance with v. 8.10.

Relates to PR #214
2021-07-30 13:42:29 -04:00
Mayya Sharipova 1daf7e7c74
LUCENE-10027 provide leaf sorter from commit (#214)
Provide leaf sorter for directory readers opened from IndexCommit

LUCENE-9507 made it possible to provide a leaf sorter for directory readers.
One API that was missed is providing a leaf sorter
for directory readers opened from an index commit.
This patch addresses this by adding an extra parameter: a custom
comparator for sorting leaf readers to the Directory reader open API
from indexCommit and minSupportedMajorVersion.

Relates to PR #32
2021-07-30 09:15:21 -04:00
Gautam Worah 56eb76dbaf Simplify some code 2021-07-29 13:12:27 -04:00
Gautam Worah bd3174de10 PR fixes 1. Change negation to 2. Move statement inside if condition 2021-07-29 13:12:27 -04:00
Gautam Worah cec19125fa Fix minor logic 2021-07-29 13:12:27 -04:00
Gautam Worah be0a3e5721 Move the version check to a final variable that is initialized in the
constructor
2021-07-29 13:12:27 -04:00
Gautam Worah 162131ecf8 Use BDV or a StoredField based on the Lucene version that has created
the last index commit

If the Lucene version was < 9 then use a StringField or else
if the index is fresh or if the index was built using a
version >= 9, then use a BDV field.
2021-07-29 13:12:27 -04:00
Gautam Worah 7cb696041c Category documents added in the Lucene 9.0 taxonomy index use a
BDV field with a different name

Using BDV fields with a different "$full_path_binary$" name
ensures that the earlier "$full_path$" StringField does not have the same name as the
BDV field and hence they don't violate the field type consistency check
(LUCENE-9334).

This commit also enables the back-compat check that was disabled
earlier.
2021-07-29 13:12:27 -04:00
Nhat Nguyen ba417b593f
LUCENE-10032: Remove leafDocMaps from MergeState (#222)
These maps are no longer useful after LUCENE-8505.
2021-07-29 08:28:39 -04:00
Adrien Grand 0e6c3146d7
LUCENE-10031: Speed up SortedDocIdMerger on low-cardinality sort fields. (#221)
When sorting by low-cardinality fields, the same sub remains current for long
sequences of doc IDs. This speeds up SortedDocIdMerger a bit by extracting
the sub that leads iteration.
2021-07-29 08:46:10 +02:00
Shintaro Murakami 03b1db91f9
LUCENE-9304: Remove assertion in DocumentsWriterFlushControl (#228)
This assertion becomes obvious after LUCENE-9304.
2021-07-28 10:05:00 -04:00
Julie Tibshirani e8663b30b8
LUCENE-10039: Fix single-field scoring for CombinedFieldQuery (#229)
When there's only one field, CombinedFieldQuery will ignore its weight while
scoring. This makes the scoring inconsistent, since the field weight is supposed
to multiply its term frequency.

This PR removes the optimizations around single-field scoring to make sure the
weight is always taken into account. These optimizations are not critical since
it should be uncommon to use CombinedFieldQuery with only one field.
2021-07-28 15:43:56 +03:00
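
The point made in LUCENE-10039 above, that a field's weight is supposed to multiply its term frequency even when there is only one field, can be shown with a small sketch. This is an illustrative assumption about how weighted frequencies are combined, not the CombinedFieldQuery scoring code; the names are hypothetical.

```java
// Hypothetical sketch: combine per-field term frequencies using field weights.
class WeightedFreqSketch {

  // Sums weight * frequency over all fields, with no single-field shortcut,
  // so the weight is applied the same way for one field as for many.
  static float combinedFreq(float[] weights, int[] freqs) {
    float total = 0f;
    for (int i = 0; i < weights.length; i++) {
      total += weights[i] * freqs[i];
    }
    return total;
  }
}
```

With a single field of weight 2 and term frequency 3, the combined frequency is 6; a shortcut that ignored the weight would yield 3 and make single-field scores inconsistent with multi-field ones.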
Greg Miller e44636c280
LUCENE-10036: Add factory method to ScoreCachingWrappingScorer that ensures unnecessary wrapping doesn't occur (#226) 2021-07-27 07:53:36 -07:00
Greg Miller 736d114901 Add CHANGES entry for LUCENE-10030 2021-07-26 13:11:32 -07:00
Grigoriy Troitskiy 61f8517000
LUCENE-10030: Lazily evaluate score in DrillSidewaysScorer.doQueryFirstScoring (#217) 2021-07-26 13:04:51 -07:00
Michael Sokolov 0ec93b632c LUCENE-10016: fix test case to use the same similarity in both cases 2021-07-24 15:22:39 -04:00
Tomoko Uchida df807dbe8f
LUCENE-9855: Rename knn search vector format (#218) 2021-07-24 12:03:15 +09:00