Commit Graph

37646 Commits

Author SHA1 Message Date
Luca Cavanna cb4cd024cc Make SegmentInfos#readCommit(Directory, String, int) public (#14027)
The corresponding readLatestCommit method is public and can be used to
read segment infos from indices that are older than N - 1.
The same should be possible for readCommit, but that requires the method
that takes the minimum supported version as an argument to be public.
2024-12-02 10:39:13 +01:00
Nhat Nguyen 80a69348ce
Avoid allocating liveDocs for no soft-deletes (#13895) (#13903) (#14001)
Backport of #13895 to 10_0

This is a continuation of #13588, where we avoided allocating liveDocs 
for segments that have the __soft_deletes field but no values in it.
However, that PR only addressed the reading side. This change fixes the
writing scenario with IndexWriter.

Relates #13588
2024-11-18 11:18:08 -08:00
Adrien Grand 7a0365d4d3 Add MIGRATE entry about the fact that readVLong() may now read up to 10 bytes. (#13956)
This may be of interest for custom `DataInput`/`IndexInput` implementations
that extend `readVLong()`.
2024-10-25 13:38:59 +02:00
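A rough illustration of where the 10-byte figure in the commit above comes from, assuming the usual variable-length layout of 7 data bits per byte with the high bit used as a continuation flag: a full 64-bit value needs ceil(64 / 7) = 10 bytes. The class `VLongSketch` and its methods are hypothetical stand-ins written for this note, not Lucene's `DataInput` code.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a 7-bits-per-byte variable-length long codec.
final class VLongSketch {
  // Writes the low 7 bits first; the high bit of each byte means "more bytes follow".
  static byte[] encode(long value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7FL) != 0L) {
      out.write((int) ((value & 0x7FL) | 0x80L));
      value >>>= 7; // unsigned shift, so negative values terminate too
    }
    out.write((int) value);
    return out.toByteArray();
  }

  // Reads bytes until one without the continuation bit is found.
  static long decode(byte[] bytes) {
    long result = 0L;
    int shift = 0;
    for (byte b : bytes) {
      if (shift > 63) {
        throw new IllegalArgumentException("more than 10 bytes in a variable-length long");
      }
      result |= (b & 0x7FL) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
    throw new IllegalArgumentException("truncated variable-length long");
  }

  public static void main(String[] args) {
    System.out.println(encode(Long.MAX_VALUE).length); // 63 significant bits
    System.out.println(encode(-1L).length);            // 64 significant bits
  }
}
```

In this sketch `Long.MAX_VALUE` encodes to 9 bytes and `-1L` to 10, so a reader that stops after 9 bytes would cut off values with the top bit set.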
Luca Cavanna 5851033047
Align TestGenerateBwcIndices.java with AddBackcompatindices.py (#13911)
We updated TestGenerateBwcIndices to create int7 HNSW indices instead of int8 with #13874.
The corresponding Python code in the release wizard needs to be updated accordingly.
2024-10-15 13:39:47 +02:00
Luca Cavanna e821a1b4f8 Add back-compat indices for 10.0.0 2024-10-14 22:54:40 +02:00
Luca Cavanna d8e76916d6 Add next bugfix version 10.0.1 2024-10-14 22:19:14 +02:00
Luca Cavanna df581bd828 Add 10.0.1 section to CHANGES.txt 2024-10-14 22:11:23 +02:00
Luca Cavanna 17831fa0af Update year in copyright 2024-10-14 18:00:41 +02:00
Luca Cavanna 133a5b1a7c DOAP changes for release 10.0.0 2024-10-14 14:58:58 +02:00
Michael McCandless eadc07cc6a Fix 9.12.0 backcompat break (Lucene 9.12.0 cannot read 9.11.x indices written with quantized HNSW, `Lucene99HnswScalarQuantizedVectorsFormat`) (#13874)
* carefully regenerate the int8_hnsw bwc indices so that they do in fact use Lucene99ScalarQuantizedVectorsFormat ... when running TestInt8HnswBackwardsCompatibility it now fails (as expected) on 9.11.0 and 9.11.1 bwc indices, but not on 9.10.0

* rename int8 -> int7 bwc tests since we are actually testing 7 bit quantization

* actually fix the bwc bug: only allow compress=true when bits is 7 or 8 in HNSW scalar quantization

* tidy

* Revert "rename int8 -> int7 bwc tests since we are actually testing 7 bit quantization"

This reverts commit eeb3f8a668.

* Reapply "rename int8 -> int7 bwc tests since we are actually testing 7 bit quantization"

This reverts commit 3487c4210b.

* #13880: add test to verify the int7 quantized indices are in fact using quantized vectors not float32

* bump 9.12.x version to 9.12.1 and add bwc indices for 9.12.0

* remove duplicate 9.12.0 Version constant

* revert changes to index.9.12.0-cfs.zip, index.9.12.0-nocfs.zip, sorted.9.12.0.zip

* remove unused bwc index

Closes #13867
Closes #13880
2024-10-09 18:11:41 -04:00
Adrien Grand e6bb5e2c54 Fix flakiness issues with TestTieredMergePolicy. (#13881)
The two seeds at #13818 had different root causes:
 - The test allows the number of segments to go above the limit, only if none
   of the merges are legal. But there are multiple reasons why a merge may be
   illegal: because it exceeds the max doc count or because it is too imbalanced.
   However these two things were checked independently, so you could run into
   cases when the test would think that there are legal merges from the doc count
   perspective and from the balance perspective, but all legal merges from the doc
   count perspective are illegal from the balance perspective and vice-versa. The
   test now checks that there are merges that are good wrt these two criteria
   at once.
 - `TieredMergePolicy` allows at least `targetSearchConcurrency` segments in an
   index. There was a bug in `TieredMergePolicy` where this condition was
   applied after "too big" segments had been removed, so it effectively allowed
   more segments than necessary in the index.

Closes #13818
2024-10-09 17:29:22 +02:00
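The first root cause in the commit above can be pictured with a small sketch, under the assumption that each candidate merge is described only by a doc count and a balance flag. The `Candidate` type and both method names are hypothetical and are not taken from `TestTieredMergePolicy`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: checking two legality criteria separately vs. on the same merge.
final class MergeLegalitySketch {
  static final class Candidate {
    final int docCount;
    final boolean balanced;

    Candidate(int docCount, boolean balanced) {
      this.docCount = docCount;
      this.balanced = balanced;
    }
  }

  // Flawed: each criterion may be satisfied by a different candidate.
  static boolean hasLegalMergeIndependent(List<Candidate> candidates, int maxDocs) {
    boolean anyWithinDocCount = candidates.stream().anyMatch(c -> c.docCount <= maxDocs);
    boolean anyBalanced = candidates.stream().anyMatch(c -> c.balanced);
    return anyWithinDocCount && anyBalanced;
  }

  // Fixed: a single candidate must satisfy both criteria at once.
  static boolean hasLegalMergeJoint(List<Candidate> candidates, int maxDocs) {
    return candidates.stream().anyMatch(c -> c.docCount <= maxDocs && c.balanced);
  }

  public static void main(String[] args) {
    List<Candidate> candidates =
        Arrays.asList(new Candidate(100, false), new Candidate(1000, true));
    System.out.println(hasLegalMergeIndependent(candidates, 500));
    System.out.println(hasLegalMergeJoint(candidates, 500));
  }
}
```

With one small but imbalanced candidate and one balanced but oversized candidate, the independent check reports a legal merge while the joint check does not, which is the mismatch the commit message describes.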
Adrien Grand ad0d2802ec Disable CFS in TestDefaultCodecParallelizesIO. (#13875)
`SerialIODirectory` doesn't count reads to files that are open with
`ReadAdvice#RANDOM_PRELOAD` as these files are expected to be loaded in memory.
Unfortunately, we cannot detect such files on compound segments, so this test
now disables compound segments.

Closes #13854
2024-10-09 16:27:48 +02:00
Ignacio Vera bc478d85a1
Avoid performance regression by constructing lazily the PointTree in NumericComparator (#13498) (#13877) 2024-10-09 12:33:34 +02:00
Dawid Weiss 67f7729b3c Adds changes entry and migration comments #13820. 2024-10-09 11:49:35 +02:00
Zhang Chao c335f3e0f3 Move DataInput.readGroupVInts into GroupVIntUtil (#13830) 2024-10-09 11:49:30 +02:00
Christine Poerschke 0a573099eb
PR 13757 follow-up: add missing with-discountOverlaps Similarity constructor variants, CHANGES.txt entries (#13845) (#13858)
(cherry picked from commit dab731175c)
(cherry picked from commit cbd8b5218a)

Resolved Conflicts:
	lucene/CHANGES.txt
2024-10-09 11:30:56 +02:00
Luca Cavanna a4c0f741cc
Revert "Disjunction as CompetitiveIterator for numeric dynamic pruning (#13221)" (#13857)
This reverts commit 1ee4f8a111.

We have observed performance regressions that can be linked to #13221. We will need to revise the logic that this change introduced in main and branch_10x. While we do so, I propose that we back it out of branch_10_0 and release Lucene 10 without it.

Relates to #13856
2024-10-04 16:11:35 +01:00
Luca Cavanna f76fdb293e Adjust command to remove uploaded artifacts upon respin (#13853)
The command to remove uploaded artifacts from svn is missing a dash, hence it
fails as it does not match the name of the artifacts uploaded at the previous steps.
2024-10-03 00:19:41 +02:00
Luca Cavanna 36d936dd88 Reassign the knn values iterator in TestSortingCodecReader (#13852) 2024-10-02 23:45:35 +02:00
Luca Cavanna 628e2cf6f3 Add missing authors in Lucene 10 changelog 2024-10-02 20:49:12 +02:00
Chris Hegarty 25f49d4f86 Add test for float vector values in FlatVectorsScorer impls (#13851)
This is a test-only change that verifies the behaviour when float vector values are passed to our FlatVectorsScorer implementations. This would have caught the bug causing #13844, subsequently fixed by #13850.
2024-10-02 16:06:47 +01:00
Benjamin Trent 4f6eb799bb Fix bug where off-heap scorer would kick on even for float vectors (#13850)
introduced in the major refactor #13779

Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors; otherwise, the slice iterations and scoring are completely nonsensical, causing HNSW graph building to run until the heat-death of the universe.
2024-10-02 09:35:25 -04:00
Benjamin Trent 4461bc1eff
Fix simple text byte vector iteration (#13842) 2024-10-01 09:31:26 -04:00
Luca Cavanna f411adfe43 Release wizard to split clean and check calls to separate calls
This works around a gradle issue when calling `clean check` from a single command.
2024-10-01 13:58:43 +02:00
Chris Hegarty 31a58cecf5 Override iterator in Empty off heap vector values (#13837)
This commit overrides the iterator method in the empty off-heap vector values. The implementation is just the dense iterator, which handles empty values just fine. We use it elsewhere for similar cases too.
2024-10-01 10:33:27 +01:00
Armin Braun ab07bad1ba Speedup GlobalHitsThresholdChecker a little (#13836)
Even though this field is not `volatile`, writing it isn't free and
causes needless cache thrashing at some frequency. We can speed things
up by only writing the `true` value and never the `false` value.
2024-10-01 10:34:22 +02:00
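A minimal sketch of the write-only-`true` idea from the commit above, assuming a shared hit counter and a plain (non-volatile) flag. The class `HitsThresholdSketch` and its members are hypothetical and do not reproduce the actual `GlobalHitsThresholdChecker` code.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: a flag that only ever moves from false to true.
final class HitsThresholdSketch {
  private final long threshold;
  private final LongAdder hitCount = new LongAdder();
  // Deliberately not volatile. A stale read of false only costs one extra counter check.
  private boolean thresholdReached;

  HitsThresholdSketch(long threshold) {
    this.threshold = threshold;
  }

  void incrementHitCount() {
    if (!thresholdReached) {
      hitCount.increment();
    }
  }

  boolean isThresholdReached() {
    if (thresholdReached) {
      return true;
    }
    if (hitCount.sum() > threshold) {
      thresholdReached = true; // the only write to the field; false is never stored
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    HitsThresholdSketch checker = new HitsThresholdSketch(2);
    for (int i = 0; i < 3; i++) {
      checker.incrementHitCount();
    }
    System.out.println(checker.isThresholdReached());
  }
}
```

Because the field is never rewritten with `false`, calls made before the threshold is reached do not dirty the cache line that holds it, which is the effect the commit message attributes the speedup to.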
Adrien Grand d4a52fd3a6 Bump the codec version to 10.0. (#13815)
Bump the codec version to 10.0.

Lucene100Codec is the exact same file format as Lucene912Codec. This codec
dance just makes things slightly easier to reason about since our backward
compatibility guarantees are aligned with major versions: once we drop support
for 9.x indices, we can remove all `Lucene9XXCodec`s.
2024-10-01 08:58:32 +01:00
Uwe Schindler 524ea208c8 Add changes entries for CVE-2024-45772 and related commits 2024-09-30 17:29:50 +02:00
Michael Sokolov 638d489038 CHANGES/MIGRATE entries for KnnVectorValues API change (#13833) 2024-09-30 10:44:31 -04:00
ChrisHegarty 22ac47c07a Fix bwc tests by adding next minor version - even though there will never be a 9.13.0 2024-09-29 20:27:06 +01:00
zhouhui 7eea8fbd0d
Fix comment on compare suffix and target. (#13787) 2024-09-29 11:12:32 -04:00
Michael Sokolov 0a8604d908 Merge remote-tracking branch 'origin/main' into main 2024-09-28 17:45:05 -04:00
Michael Sokolov c8eeacb39a fix safety check in KnnVectorsWriter 2024-09-28 17:44:33 -04:00
Michael Sokolov f21ec8414c SlowCompositeCodecReaderWrapper KnnVectorValues handles binary search over array with repeated values 2024-09-28 17:43:01 -04:00
ChrisHegarty a38443bc6a Add back-compat indices for 9.12.0 2024-09-28 21:47:03 +01:00
ChrisHegarty dafe006652 DOAP changes for release 9.12.0 2024-09-28 21:14:49 +01:00
Michael Sokolov 6053e1e313
First-class random access API for KnnVectorValues (#13779) 2024-09-28 09:14:01 -04:00
ChrisHegarty 7b4b0238d7 Fix back compat 8.11.4 indices in main 2024-09-26 14:54:17 +01:00
ChrisHegarty ff57fa7b42 Add RDF entries for 8.11.4 2024-09-25 16:35:24 +01:00
Chris Hegarty 64371c14d6
Add 8.11.4 backward compat indices (#13824)
This commit adds 8.11.4 back compat indices.
2024-09-25 16:20:54 +01:00
ChrisHegarty 53d1c2bd2f Test fix: make float vector dims even for SQ testing 2024-09-20 15:28:54 +01:00
Adrien Grand 73d71acedd Also fix formats that are only tested nightly. 2024-09-20 15:24:04 +02:00
Adrien Grand da1f954601
Improve testing of mismatched field numbers. (#13812)
This improves testing of mismatched field numbers by
 - improving `AssertingDocValuesProducer` to detect mismatched field numbers,
 - introducing a `MismatchedCodecReader` to actually test mismatched field
   numbers on `DocValuesProducer` (a `MismatchedLeafReader` wrapping a
`SlowCodecReaderWrapper` doesn't work since `SlowCodecReaderWrapper` implicitly
resolves the correct `FieldInfo` object),
 - introducing an explicit test for mismatched field numbers for doc values, points,
postings and knn vectors.

These new tests uncovered a bug when merging sorted doc values, which would
call the underlying doc values producer with the merged field info.

Closes #13805
2024-09-20 14:37:45 +02:00
Chris Hegarty 7ef7122eba
Revert "Replace Map<String,Object> with IntObjectHashMap for DV producer (#13686)" (#13810)
Reverts "Replace Map<String,Object> with IntObjectHashMap for DV producer (#13686)"

relates #13809
2024-09-20 11:06:43 +01:00
Christoph Büscher e4ac57746e
Add BytesRefIterator to TermInSetQuery (#13806)
TermInSetQuery used to have an accessor to its terms that was removed in #12173
to avoid leaking internal encoding details. This introduces an accessor to the
term data in the query that doesn't expose internals but merely allows iterating
over the decoded BytesRef, making inspection of the query's content possible again.

Closes #13804
2024-09-19 11:51:42 +02:00
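The accessor pattern described in the commit above could look roughly like the following sketch, assuming the terms are stored concatenated in one byte array with an offsets table. `PackedTermsSketch` and `termIterator()` are hypothetical names chosen for this note; they are not the `TermInSetQuery` API, and the storage layout is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: iterate decoded terms without exposing the packed storage.
final class PackedTermsSketch {
  private final byte[] packed; // all terms concatenated
  private final int[] offsets; // term i spans offsets[i] .. offsets[i + 1]

  PackedTermsSketch(String... terms) {
    offsets = new int[terms.length + 1];
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < terms.length; i++) {
      byte[] bytes = terms[i].getBytes(StandardCharsets.UTF_8);
      out.write(bytes, 0, bytes.length);
      offsets[i + 1] = offsets[i] + bytes.length;
    }
    packed = out.toByteArray();
  }

  // Returns copies of each term, so callers never see or mutate the internal arrays.
  Iterator<byte[]> termIterator() {
    return new Iterator<byte[]>() {
      private int index = 0;

      @Override
      public boolean hasNext() {
        return index < offsets.length - 1;
      }

      @Override
      public byte[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        byte[] term = Arrays.copyOfRange(packed, offsets[index], offsets[index + 1]);
        index++;
        return term;
      }
    };
  }

  public static void main(String[] args) {
    Iterator<byte[]> it = new PackedTermsSketch("a", "bc").termIterator();
    while (it.hasNext()) {
      System.out.println(new String(it.next(), StandardCharsets.UTF_8));
    }
  }
}
```

The design choice illustrated here is that the internal encoding can change later without affecting callers, since they only ever receive decoded terms.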
Benjamin Trent 6d987e1ce1
Disable intra-merge parallelism for all structures but kNN vectors (#13799)
After adjusting tests that truly exercise intra-merge parallelism, more issues have arisen. See: https://github.com/apache/lucene/issues/13798

To be risk averse, and because Lucene 10 and 9.12 are about to be frozen and released, I am reverting all intra-merge parallelism, except for the parallelism when merging HNSW graphs.

Merging other structures was never really enabled in a release (we disabled it in a bugfix for Lucene 9.11). While this is frustrating, as it seems like we are leaving lots of perf on the floor, I am erring on the side of safety here.

In Lucene 10, we can work on incrementally reenabling intra-merge parallelism.

closes: https://github.com/apache/lucene/issues/13798
2024-09-18 08:36:11 -04:00
Alan Woodward dbceba77a4
Correct Point file extensions in Codec javadocs (#13801) 2024-09-18 12:33:31 +01:00
Adrien Grand b59a357e58
Change docValuesSkipIndex from a boolean to an enum. (#13784)
At the moment, our skip indexes record min/max ordinal/value per range
of doc IDs. It would be natural to extend it to other pre-aggregated
data such as a sum and value count, which facets could take advantage
of. This change switches `docValuesSkipIndex` from a boolean to an enum
so that we could release such changes in the future in an additive
fashion, by adding constants to this enum and new methods to
`DocValuesSkipper`.
2024-09-17 14:35:30 +02:00
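To make the boolean-to-enum argument in the commit above concrete, here is a minimal sketch, assuming an enum with one "no skip index" constant and one "range" constant. The names `SkipIndexTypeSketch` and `FieldConfigSketch` are hypothetical and are not claimed to match the types added by this change.

```java
// Hypothetical sketch: an enum leaves room for additive evolution that a boolean does not.
enum SkipIndexTypeSketch {
  NONE,  // no skip index recorded for the field
  RANGE; // min/max ordinal or value per range of doc IDs
  // A future release could append a constant that also records, say, sums and value counts.
}

final class FieldConfigSketch {
  private final SkipIndexTypeSketch skipIndexType;

  FieldConfigSketch(SkipIndexTypeSketch skipIndexType) {
    this.skipIndexType = skipIndexType;
  }

  SkipIndexTypeSketch skipIndexType() {
    return skipIndexType;
  }

  // The old boolean question can still be answered from the enum.
  boolean hasSkipIndex() {
    return skipIndexType != SkipIndexTypeSketch.NONE;
  }

  public static void main(String[] args) {
    System.out.println(new FieldConfigSketch(SkipIndexTypeSketch.RANGE).hasSkipIndex());
    System.out.println(new FieldConfigSketch(SkipIndexTypeSketch.NONE).hasSkipIndex());
  }
}
```

Callers that only care whether any skip index exists keep working through `hasSkipIndex()`, while new constants can be added without changing the method signatures that carry the setting.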
Armin Braun 644feeb02a
Cleanup redundant allocations and code around Comparator use (#13795)
Noticed some visible allocations in CompetitiveImpactAccumulator
during benchmarking and fixed the needless allocation for the comparator
in that class as well as a couple other similar spots where needless
classes and/or objects could easily be replaced by more lightweight
solutions.
2024-09-17 14:34:31 +02:00
Christine Poerschke a817426511
add RawTFSimilarity class (#13749) 2024-09-17 13:11:25 +01:00