35812 Commits

Author SHA1 Message Date
Julie Tibshirani b40a750aa8
LUCENE-10382: Ensure kNN filtering works with other codecs (#700)
The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used

This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.
2022-02-23 14:58:27 -08:00
Julie Tibshirani 4364bdd63e
LUCENE-10054: Make sure to use Lucene90 codec in unit tests (#699)
Before we were using the default Lucene91 codec, so we weren't exercising the
old format.
2022-02-23 08:22:59 -08:00
Lu Xugang 43e89d6a29
LUCENE-10435: Break loop early while checking whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery (#701) 2022-02-23 13:53:56 +01:00
Ignacio Vera ab47db4fee
LUCENE-10437: Improve error message in the Tessellator for polygon with all points collinear (#703)
Polygon tessellator throws a more informative error message when the provided polygon does not contain enough non-collinear points.
2022-02-23 13:51:44 +01:00
Tomoko Uchida f8040d565f LUCENE-10416: move changes entry to v10.0.0 2022-02-22 20:29:14 +09:00
Tomoko Uchida c7602a425c
migrate to temurin (#697) 2022-02-21 17:09:21 +09:00
Lu Xugang 36a2149d43
LUCENE-10424: Optimize the "everything matches" case for count query in PointRangeQuery (#691) 2022-02-21 07:08:23 +01:00
Tomoko Uchida 76c9fd4e38 LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori 2022-02-20 21:39:03 +09:00
Tomoko Uchida e7a29c4c4c
Remove deprecated constructors in Nori (#695) 2022-02-20 17:40:45 +09:00
Tomoko Uchida 58fa95deea
LUCENE-10400: revise binary dictionaries' constructor in nori (#693) 2022-02-20 16:16:56 +09:00
Julie Tibshirani f0d17e94d9
LUCENE-10408: Fix vector values iteration bug (#690)
Now that there is special logic to handle the dense case, we need to adjust some
assertions in VectorValues#advance.
2022-02-18 11:36:22 -08:00
Julie Tibshirani cdb74e155a Temporarily mute TestKnnVectorQuery#testRandomWithFilter 2022-02-17 14:50:01 -08:00
Julie Tibshirani 8ca372573d
LUCENE-10382: Support filtering in KnnVectorQuery (#656)
This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>
2022-02-17 11:35:25 -08:00
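
The fallback described in the commit above can be illustrated with a minimal sketch. This is not Lucene's implementation: the class `FilteredKnnSketch`, its method names, the plain adjacency-array graph, and the simple best-first traversal (standing in for HNSW) are all assumptions made for illustration. The graph search aborts once it would visit more nodes than there are documents matching the filter, and an exact scan over the matching documents runs instead.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene code: a filtered nearest-neighbor search
// that falls back to an exact scan when the graph search visits too many nodes.
final class FilteredKnnSketch {

  /** Squared Euclidean distance between two vectors of equal length. */
  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Exact search: scores only the docs accepted by the filter, nearest first. */
  static List<Integer> exactSearch(float[][] vectors, BitSet filter, float[] query, int k) {
    List<Integer> docs = new ArrayList<>();
    for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
      docs.add(doc);
    }
    docs.sort(Comparator.comparingDouble(d -> distance(vectors[d], query)));
    return new ArrayList<>(docs.subList(0, Math.min(k, docs.size())));
  }

  /**
   * Best-first graph search. Non-matching docs are still traversed but never
   * collected. Returns null if visiting another node would exceed visitLimit.
   */
  static List<Integer> graphSearch(float[][] vectors, int[][] neighbors, int entryPoint,
      BitSet filter, float[] query, int k, int visitLimit) {
    Comparator<Integer> nearestFirst =
        Comparator.comparingDouble(d -> distance(vectors[d], query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(nearestFirst);
    PriorityQueue<Integer> results = new PriorityQueue<>(nearestFirst.reversed());
    BitSet visited = new BitSet(vectors.length);
    if (visitLimit < 1) {
      return null;
    }
    visited.set(entryPoint);
    int visitedCount = 1;
    candidates.add(entryPoint);
    if (filter.get(entryPoint)) {
      results.add(entryPoint);
    }
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the closest remaining candidate is worse than the worst result.
      if (results.size() >= k
          && distance(vectors[current], query) > distance(vectors[results.peek()], query)) {
        break;
      }
      for (int next : neighbors[current]) {
        if (visited.get(next)) {
          continue;
        }
        // Check the limit before visiting each new node.
        if (visitedCount >= visitLimit) {
          return null;
        }
        visited.set(next);
        visitedCount++;
        candidates.add(next);
        if (filter.get(next)) {
          results.add(next);
          if (results.size() > k) {
            results.poll(); // drop the current worst result
          }
        }
      }
    }
    List<Integer> top = new ArrayList<>(results);
    top.sort(nearestFirst);
    return top;
  }

  /** Graph search bounded by the filter's cardinality, with an exact fallback. */
  static List<Integer> search(float[][] vectors, int[][] neighbors, int entryPoint,
      BitSet filter, float[] query, int k) {
    List<Integer> approximate =
        graphSearch(vectors, neighbors, entryPoint, filter, query, k, filter.cardinality());
    return approximate != null ? approximate : exactSearch(vectors, filter, query, k);
  }
}
```

With a very selective filter the visit limit is tiny, so the graph search gives up almost immediately and the exact scan over the few matching documents does the work. With a permissive filter the limit is rarely reached and the graph search returns on its own.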
Vigya Sharma c132bbf677
LUCENE-10084: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery when all docs have the field (#677)
Since all documents are required to use the same features (LUCENE-9334) we can
rewrite DocValuesFieldExistsQuery to a MatchAllDocsQuery whenever terms or
points have a docCount that is equal to maxDoc.
2022-02-17 11:20:06 -08:00
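
The rewrite condition in the commit above can be sketched as a per-segment check. The class `FieldExistsRewriteSketch` and the `LeafStats` holder are hypothetical stand-ins for the per-leaf statistics Lucene reads from terms or points; the only rule taken from the commit message is that the rewrite applies when the field's doc count equals `maxDoc`. Treating that as a requirement on every segment is an assumption of this sketch.

```java
import java.util.List;

// Hypothetical sketch, not Lucene code: decide whether a "field exists" query
// can be replaced by a match-all query.
final class FieldExistsRewriteSketch {

  /** Per-segment statistics: total docs and docs that have the field. */
  static final class LeafStats {
    final int maxDoc;
    final int docCount;

    LeafStats(int maxDoc, int docCount) {
      this.maxDoc = maxDoc;
      this.docCount = docCount;
    }
  }

  /** True only if, in every segment, all docs have the field. */
  static boolean canRewriteToMatchAll(List<LeafStats> leaves) {
    for (LeafStats leaf : leaves) {
      if (leaf.docCount != leaf.maxDoc) {
        return false; // one segment with a missing value rules out the rewrite
      }
    }
    return true;
  }
}
```

Returning as soon as one segment fails the check avoids reading statistics for the remaining segments.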
Greg Miller 00029f1ec4 Add CHANGES entry for LUCENE-10398 2022-02-17 09:26:11 -08:00
spike.liu fc3c790ab4
LUCENE-10398: Add static method for getting Terms from LeafReader (#678)
Co-authored-by: chengliu@ctrip.com <chengliu@ctrip.com>
2022-02-17 09:21:51 -08:00
Mayya Sharipova f8c5408be7
LUCENE-10408 Better encoding of doc Ids in vectors (#649)
Better encoding of doc Ids in Lucene91HnswVectorsFormat
for a dense case where all docs have vectors.

Currently we write the doc Ids of all documents that have vectors,
which is not very efficient.
This change improves the encoding: when all documents
have vectors, we don't write document IDs at all, but just write a
single short value (a dense marker).
2022-02-17 11:34:42 +01:00
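
The dense-marker idea from the commit above can be shown with a small round-trip sketch. It does not reproduce the on-disk layout of `Lucene91HnswVectorsFormat`: the class `DenseDocIdsSketch`, the marker values `DENSE` and `SPARSE`, and the plain `int` list used for the sparse case are all assumptions. Doc IDs are assumed to be sorted, unique, and in the range `[0, maxDoc)`.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Lucene's file format: skip writing doc IDs when
// every document has a vector, and write a single short marker instead.
final class DenseDocIdsSketch {
  static final short DENSE = -1; // assumed marker value: all docs have vectors
  static final short SPARSE = 0; // assumed marker value: explicit IDs follow

  /** Encodes the IDs of docs that have vectors. */
  static byte[] encode(int[] docIds, int maxDoc) {
    if (docIds.length == maxDoc) {
      // Dense case: the IDs are implied to be 0..maxDoc-1, so 2 bytes suffice.
      return ByteBuffer.allocate(Short.BYTES).putShort(DENSE).array();
    }
    ByteBuffer out =
        ByteBuffer.allocate(Short.BYTES + Integer.BYTES + Integer.BYTES * docIds.length);
    out.putShort(SPARSE);
    out.putInt(docIds.length);
    for (int id : docIds) {
      out.putInt(id);
    }
    return out.array();
  }

  /** Decodes the IDs, regenerating them from maxDoc in the dense case. */
  static int[] decode(byte[] data, int maxDoc) {
    ByteBuffer in = ByteBuffer.wrap(data);
    if (in.getShort() == DENSE) {
      int[] ids = new int[maxDoc];
      for (int i = 0; i < maxDoc; i++) {
        ids[i] = i;
      }
      return ids;
    }
    int[] ids = new int[in.getInt()];
    for (int i = 0; i < ids.length; i++) {
      ids[i] = in.getInt();
    }
    return ids;
  }
}
```

In this sketch a fully dense segment always encodes to two bytes regardless of `maxDoc`, while the sparse encoding grows with the number of documents that have vectors.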
Ignacio Vera 84e34dc468
LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count. (#685)
These query wrappers do not modify the set of matching documents so they can delegate Weight#count.
2022-02-17 08:03:47 +01:00
Gautam Worah dd25fabb03
LUCENE-10378 Implement Weight#count for PointRangeQuery (#658)
Implement Weight#count for PointRangeQuery to provide a faster way to calculate
the number of matching range docs when each doc has at most one point and the
points are 1-dimensional.
2022-02-16 07:23:49 +01:00
Patrick Zhai 6157854523
LUCENE-10371 Make IndexRearranger able to arrange segment in a determined order (#630) 2022-02-15 10:52:40 -08:00
Uwe Schindler 70c152bf32
LUCENE-10420: Remove deprecated interfaces and methods in IOUtils in main (#680) 2022-02-14 17:05:34 +01:00
Tomoko Uchida db8fcb84bb
LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces (#673)
Co-authored-by: Uwe Schindler <uschindler@apache.org>
2022-02-15 00:12:28 +09:00
Dawid Weiss 8aa4763070 LUCENE-10419: fix rat thread safety bug. 2022-02-13 18:43:13 +01:00
Dawid Weiss a861ff8df2 LUCENE-10419: revert debugging changes. 2022-02-13 18:34:57 +01:00
Dawid Weiss 50b7e2970f LUCENE-10419: more debugging code. The message from AbstractStringBuilder suggests a concurrency issue somewhere, but I just can't see it! 2022-02-12 20:22:49 +01:00
Dawid Weiss 21c5b42063 LUCENE-10419: upgrade rat to 0.13. 2022-02-10 17:37:06 +01:00
Tomoko Uchida 4cb55a7e9c
trivial updates on github actions (#674) 2022-02-11 01:13:18 +09:00
Luca Cavanna ea170c9fab
Avoid SimpleText codec in TestIndexSortSortedNumericDocValuesRangeQuery (#675)
The recently introduced testCount verifies that the Weight#count optimization kicks in. When SimpleText codec is used, `DocValues#unwrapSingleton` returns null which disables the optimization and makes the test fail.
2022-02-10 17:06:31 +01:00
Dawid Weiss f6cebac333
LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser (#668) 2022-02-10 12:18:13 +01:00
Dawid Weiss 1f1da12c89 LUCENE-10419: add debugging code. 2022-02-10 12:03:54 +01:00
Adrien Grand 69d3a1d6af
LUCENE-10412: Improve handling of MatchNoDocsQuery in rewrites. (#664) 2022-02-09 19:02:54 +01:00
Alan Woodward 2183756f1c
LUCENE-10413: Make default Ukrainian stopword set available (#665)
This commit adds a new getDefaultStopwords() static method to
UkrainianMorfologikAnalyzer, which makes it possible to create an
analyzer with the default stop word set but a custom stem exclusion
set.
2022-02-09 14:37:44 +00:00
Greg Miller 8178ffda00
LUCENE-10403: Add ArrayUtil#grow(T[]) (#644) 2022-02-08 09:43:55 -08:00
Adrien Grand ce93d45532 LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of matching clauses is a constant. 2022-02-08 17:25:53 +01:00
Nhat Nguyen bcb70fd742
LUCENE-10190: Ensure changes are visible before advancing seqno (#640)
DocumentsWriter#anyChanges() can return false after we process and
generate a sequence number for an update operation, but before we adjust
the numDocsInRAM. In this window of time, refreshes are no-ops, although
the maxCompletedSequenceNumber has advanced.
2022-02-08 10:29:20 -05:00
gf2121 5250186bd1
LUCENE-10410: Add more tests for legacy decoding logic in DocIdsWriter (#654) 2022-02-08 16:59:32 +08:00
Tomoko Uchida 20f7f33c8d
LUCENE-10400: cleanup obsolete APIs in kuromoji (#655) 2022-02-08 09:32:33 +09:00
Julie Tibshirani eb5bdd7d15
Rename KnnGraphValues -> HnswGraph (#645)
This PR proposes some renames to clarify the code structure. The top-level
`KnnGraphValues` is renamed to `HnswGraph`, since it now represents a
hierarchical graph. It's also moved from `org.apache.lucene.index` to the
`hnsw` package.

Other renames:
* The old `HnswGraph` -> `OnHeapHnswGraph`
* `IndexedKnnGraphValues` -> `OffHeapHnswGraph` (to match
`OffHeapVectorValues`)
2022-02-07 13:21:15 -08:00
Tomoko Uchida e7546c2427
LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643) 2022-02-07 19:31:22 +09:00
gf2121 e93b08f471
LUCENE-10315: Add CHANGES for #541 (#653) 2022-02-07 16:23:34 +08:00
gf2121 8c67a3816b
LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (#541) 2022-02-07 15:35:54 +08:00
Ignacio Vera 4c578017af
LUCENE-10405: binary and Sorted doc values are stored as BytesRef instead of BytesRefHash in memory index (#647)
When using the MemoryIndex, binary and sorted doc values are stored
as BytesRef instead of BytesRefHash so they don't have a limit on size.
2022-02-07 07:33:07 +01:00
Greg Miller deef3c704e
Update github hunspell regression test to use JDK 17 (#651) 2022-02-06 08:00:31 -08:00
Gautam Worah de4eccbb55
LUCENE-10050 Remove DrillSideways#search(DrillDownQuery,Collector) in favor of DrillSideways#search(DrillDownQuery,CollectorManager) (#632) 2022-02-04 15:25:52 -08:00
Mayya Sharipova ff2189c477 Add changes item for LUCENE-10054 2022-02-04 14:51:48 -05:00
Mayya Sharipova ea4ab26e52 LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#636)
Update index.9.0.0-cfs.zip and index.9.0.0-nocfs.zip
to include knn vector field.
2022-02-04 14:42:44 -05:00
Alan Woodward 6b64f4b556
LUCENE-10407: Set bpos flag to true when containing filter is exhausted (#648)
ContainedByIntervalIterator and OverlappingIntervalIterator set their 'is the filter
interval exhausted' flag to `false` once it has returned NO_MORE_POSITIONS on
a document, so that subsequent calls to `startPosition()` will also return
NO_MORE_POSITIONS. ContainingIntervalIterator omits to do this, and so it can
incorrectly report matches, for example when used in a disjunction. This commit
fixes that omission.
2022-02-04 16:44:57 +00:00
Alan Woodward 9ebee5a058 LUCENE-10402: Changes entry 2022-02-04 15:28:44 +00:00
Alan Woodward e72d796e96
LUCENE-10402: Prefix interval automaton should be declared binary (#646) 2022-02-04 15:27:03 +00:00
Adrien Grand ed6c1b5aea
LUCENE-10401: Fix lookups on empty doc-values terms dictionaries. (#642) 2022-02-04 09:28:35 +01:00