Commit Graph

35959 Commits

Author SHA1 Message Date
Adrien Grand 5e9dfbed27
LUCENE-10574: Keep allowing unbalanced merges if they would reclaim lots of deletes. (#905)
`TestTieredMergePolicy` caught this special case: if a segment has lots of
deletes, we should still allow unbalanced merges.
2022-05-20 10:06:38 +02:00
Adrien Grand 8e777a1320 Fix precommit. 2022-05-19 09:49:11 +02:00
Adrien Grand 3960c16296 LUCENE-10574: More test failures.
- MergeOnFlushMergePolicy doesn't try to avoid O(n^2) merges, so I'm disabling
  the test on it for now.
- TestUpgradeIndexMergePolicy would sometimes wrap a non-standard merge
  policy like the alcoholic merge policy, so I forced it to wrap a
  TieredMergePolicy.
2022-05-19 09:35:17 +02:00
Adrien Grand bf07d98f13 LUCENE-10574: Fix test failure.
LogDocMergePolicy would previously always force-merge an index that has 10
segments of size 1 to 10, due to the min doc count. This is not the case
anymore, but the test was assuming that such an index would get merged, so I
fixed the test's expectations.

Also changed the merge policy to keep working when RAM buffers are flushed in
such a way that segments do not appear in decreasing size order, using the same
logic as LogMergePolicy.
2022-05-19 09:24:54 +02:00
Adrien Grand 4240159b44 LUCENE-10574: Fix test failure.
If a LogByteSizeMergePolicy is used, then it might decide to not merge the two
one-document segments if their on-disk sizes are too different. Using a
LogDocMergePolicy addresses the issue as both segments are always considered
the same size.
2022-05-18 23:33:08 +02:00
Adrien Grand 268d29b845
LUCENE-10574: Prevent pathological merging. (#900)
This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever
consider merges where the resulting segment would be at least 50% bigger than
the biggest input segment. While a merge that only grows the biggest segment by
50% is still quite inefficient, this constraint is good enough to prevent
pathological O(N^2) merging.
2022-05-18 23:05:54 +02:00
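
The 50% rule described in this commit can be illustrated with a minimal sketch. It is not Lucene's implementation: the class and method names are hypothetical, and it assumes the merged segment's size is simply the sum of the input sizes, ignoring deletes (which the follow-up commit 5e9dfbed27 treats as a special case).

```java
import java.util.List;

/** Hypothetical helper, for illustration only; not part of Lucene. */
final class MergeGrowthCheck {
  // The merged segment must be at least 50% bigger than the biggest input.
  static final double MIN_GROWTH_FACTOR = 1.5;

  /**
   * Returns true if merging the given segments would produce a segment at
   * least 1.5x the size of the biggest input segment.
   * Assumption: merged size == sum of input sizes (deletes are ignored).
   */
  static boolean growsEnough(List<Long> inputSegmentSizes) {
    long mergedSize = 0;
    long biggestInput = 0;
    for (long size : inputSegmentSizes) {
      mergedSize += size;
      biggestInput = Math.max(biggestInput, size);
    }
    return mergedSize >= MIN_GROWTH_FACTOR * biggestInput;
  }
}
```

Under these assumptions, merging segments of sizes 100 and 1 is rejected, while merging segments of sizes 100 and 60 is accepted.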
Alan Woodward ac2267035a Add next minor version 9.2.0 2022-05-18 16:37:10 +01:00
Adrien Grand 62189b2e85
LUCENE-9409: Reenable TestAllFilesDetectTruncation. (#896)
- Removed dependency on LineFileDocs to improve reproducibility.
- Relaxed the expected exception type: any exception is ok.
- Ignore rare cases when a file still appears to have a well-formed footer
  after truncation.
2022-05-18 15:52:55 +02:00
Tomoko Uchida 34446c40c4 LUCENE-10531: small follow-up for b911d1d47 2022-05-18 09:44:06 +09:00
Tomoko Uchida b911d1d47c
LUCENE-10531: Add @RequiresGUI test group for GUI tests (#893)
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
2022-05-18 09:26:06 +09:00
Adrien Grand e65c0c777b
LUCENE-9356: Change test to detect mismatched checksums instead of byte flips. (#876)
This makes the test more robust and gives a good sense of whether file formats
are implementing `checkIntegrity` correctly.
2022-05-17 14:29:51 +02:00
Alan Woodward 8921b23bcd
LUCENE-10575: Fix some visibility issues (#894) 2022-05-16 14:25:36 +01:00
sunwq de8a6998d7
LUCENE-10568: fix javadocs errors in IndexWriter.DocStats (#884) 2022-05-16 13:34:29 +09:00
Tomoko Uchida c577508630 correct pr number in changes 2022-05-16 10:38:31 +09:00
Uwe Schindler 24ae064234 Correct issue numbers 2022-05-15 17:48:09 +02:00
Uwe Schindler fcc6de1a1f Add Github PR/Issue numbers to CHANGES.txt 2022-05-15 11:19:32 +02:00
Uwe Schindler 7a8071c9d4
Detect CI builds and enable errorprone by default for those CI builds (#890) 2022-05-14 20:49:50 +02:00
Rushabh Shah 694d797526
LUCENE-10561 Reduce class/member visibility of all normalizer and stemmer classes (#883)
Co-authored-by: Rushabh Shah <shahrs87@apache.org>
Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
2022-05-14 12:01:19 +09:00
Greg Miller e01b65d284 CHANGES entry for LUCENE-10488 2022-05-13 16:02:57 -07:00
Yuting Gan f0ec226167
LUCENE-10488: Optimize Facets#getTopDims in FloatTaxonomyFacets (#806) 2022-05-13 15:54:41 -07:00
Yuting Gan 57f8cb2fd6
LUCENE-10488: Optimize Facets#getTopDims in IntTaxonomyFacets (#779) 2022-05-13 15:54:31 -07:00
Yuting Gan ef43242d77
LUCENE-10488: Optimized getTopDims in ConcurrentSSDVFacetCounts (#777) 2022-05-13 15:54:18 -07:00
Julie Tibshirani 2cca0e8441 LUCENE-10564: Fix errorprone warning
This slipped through in the original commit because we only enable errorprone on
nightly runs.
2022-05-12 17:31:55 -07:00
Julie Tibshirani 802f5422c0 Add CHANGES entry for LUCENE-10564 2022-05-12 13:37:47 -07:00
Julie Tibshirani 3afc9fa966
LUCENE-10564: Make sure SparseFixedBitSet#or updates memory usage (#882)
Before, it didn't update the estimated memory usage, so calls to ramBytesUsed
could be totally off.
2022-05-12 13:29:07 -07:00
Mayya Sharipova ea5c40686f
LUCENE-10527 Use 2*maxConn for last layer in HNSW (#872)
The original HNSW paper (https://arxiv.org/pdf/1603.09320.pdf) suggests
to use a different maxConn for the upper layers vs. the bottom one
(which contains the full neighborhood graph). Specifically, they
suggest using maxConn=M for upper layers and maxConn=2*M for the bottom.

This patch ensures that we follow this recommendation and use
maxConn=2*M for the bottom layer.
2022-05-12 15:22:25 -04:00
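
As a rough sketch of the rule this commit describes, the per-layer connection limit could be chosen as below. The class and method names are hypothetical, and level 0 is assumed to denote the bottom layer; this is not the actual Lucene code.

```java
/** Hypothetical helper, for illustration only; not part of Lucene. */
final class HnswMaxConn {
  /**
   * Returns the maximum number of connections for a node on the given level.
   * Assumption: level 0 is the bottom layer, which contains the full
   * neighborhood graph; all higher levels are the "upper layers".
   */
  static int maxConnForLevel(int level, int m) {
    if (level < 0 || m <= 0) {
      throw new IllegalArgumentException("level must be >= 0 and m must be > 0");
    }
    // Bottom layer gets 2*M connections, upper layers get M.
    return level == 0 ? 2 * m : m;
  }
}
```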
Adrien Grand 8f89db8048
LUCENE-10536: Slightly better compression of doc values' terms dictionaries. (#838)
Doc values terms dictionaries keep the first term of each block uncompressed so
that they can somewhat efficiently perform binary searches across blocks.
Suffixes of the other 63 terms are compressed together using LZ4 to leverage
redundancy across suffixes. This change improves compression a bit by using the
first (uncompressed) term of each block as a dictionary when compressing
suffixes of the 63 other terms. This helps with compressing the first few
suffixes when there's not much context yet that can be leveraged to find
duplicates.
2022-05-12 10:32:58 +02:00
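
The idea of priming a compressor with a dictionary can be sketched with the JDK alone. Note the swap: Lucene uses LZ4 here, whereas this sketch uses `java.util.zip.Deflater`, only because the JDK exposes preset dictionaries for it out of the box. The class and method names are hypothetical, and the "dictionary" stands in for the first (uncompressed) term of a block.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper, for illustration only; uses DEFLATE, not LZ4. */
final class DictionaryCompression {
  /** Compresses data, letting the compressor reference bytes of the dictionary. */
  static byte[] compress(byte[] data, byte[] dictionary) {
    Deflater deflater = new Deflater();
    try {
      deflater.setDictionary(dictionary); // must be set before compressing
      deflater.setInput(data);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      while (!deflater.finished()) {
        int n = deflater.deflate(buf);
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      deflater.end();
    }
  }

  /** Decompresses data that was compressed with the same dictionary. */
  static byte[] decompress(byte[] compressed, byte[] dictionary) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0) {
          if (inflater.needsDictionary()) {
            // The stream header signals that a preset dictionary is required.
            inflater.setDictionary(dictionary);
          } else if (inflater.needsInput()) {
            throw new IllegalStateException("truncated input");
          }
        }
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt input", e);
    } finally {
      inflater.end();
    }
  }
}
```

With a dictionary such as the bytes of a block's first term, the first few suffixes can already be encoded as references into the dictionary rather than as literals, which is the effect the commit message describes.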
zacharymorn 96036bca9f
LUCENE-10411: Add NN vectors support to ExitableDirectoryReader (#833) 2022-05-11 22:26:35 -07:00
Lu Xugang a06460a538
LUCENE-10502: add changes entry (#881) 2022-05-11 21:24:22 -04:00
Lu Xugang 6040d1648f
LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc (#880)
Currently, all docs of all vector fields are fully loaded into memory (for
sparse cases). This happens not only when we do vector search, but also when
we open an index to load meta info for vector readers.

This patch instead uses IndexedDISI to store docIds and
DirectMonotonicWriter/Reader to handle the ordToDoc mapping. Benefits are
reduced memory usage and faster loading of meta info for vector readers.
2022-05-11 13:18:10 -04:00
Adrien Grand 54595611ae LUCENE-10496: CHANGES entry. 2022-05-11 11:38:44 +02:00
xiaoping e49708e01d
LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction (#780) 2022-05-11 11:32:07 +02:00
xiaoping 6c6bb00cec
LUCENE-10555: fix iteratorCost initial logic error (#878) 2022-05-11 08:36:24 +02:00
Adrien Grand 8476ac1f6a Fix rare test failures in TestSortOptimization.
The skipping logic relies on the points index telling us by how much we can
reduce the candidate set by applying a filter that only matches documents that
compare better than the bottom value.

Some randomized points formats have large numbers of points per leaf, and
produce estimates of point counts for range queries that are way above the
actual value, which in turn doesn't enable skipping when we think it should. To
avoid running into this corner case, this change forces the default codec on
this test.
2022-05-10 17:16:42 +02:00
xiaoping 102483bc57
fix bkd test logic error and doc error (#863) 2022-05-10 13:10:00 +02:00
xiaoping f431511cb7 LUCENE-10555: avoid NumericLeafComparator#iteratorCost repeated initialization when NumericLeafComparator#setScorer is called (#864) 2022-05-10 13:07:41 +02:00
Robert Muir 3edfeb5eb2
LUCENE-10532: remove @Slow annotation (#832)
Remove `@Slow` annotation, for more consistency with CI and local jobs. All tests can be fast!
2022-05-09 23:03:55 -04:00
Ramin ALirezaee 111d6b186e
LUCENE-10312: Add PersianStemmer (#540)
Co-authored-by: Tomoko Uchida <tomoko.uchida.1111@gmail.com>
2022-05-07 17:09:56 +09:00
Uwe Schindler 8aa4a56491
LUCENE-10558: Implement URL ctor to support classpath/module usage in Kuromoji and Nori dictionaries (main branch) (#871) 2022-05-06 16:49:56 +02:00
Alan Woodward 5f832c64bf
LUCENE-10436: Reinstate public getDocValuesDocIdSetIterator method on DocValues (#869)
The method moved from DocValuesFieldExistsQuery to DocValuesIterator, but the latter
is a package-private utility class, making it invisible to client code.  This commit moves it
back onto FieldExistsQuery, meaning that the upgrade path will be the same as for all other
uses of DocValuesFieldExistsQuery.
2022-05-06 09:41:28 +01:00
Uwe Schindler 14dcc9c9ce Disable liftbot, we have our own tools 2022-05-05 22:27:57 +02:00
Adrien Grand 26301898b2
LUCENE-10553: Fix WANDScorer's handling of 0 and +Infty. (#860)
The computation of the scaling factor has special cases for these two values,
but the current logic is backwards.
2022-05-05 10:24:28 +01:00
Tomoko Uchida a89c57f35f
Make CONTRIBUTING.md a bit more succinct (#866) 2022-05-05 10:35:33 +09:00
Michael Sokolov 7fbaa63dd1
LUCENE-10504: KnnGraphTester to use KnnVectorQuery (#796)
2022-05-04 18:22:48 -04:00
Mayya Sharipova 87255c117d Add change line for LUCENE-9848 2022-05-04 14:22:31 -04:00
Mayya Sharipova dc6a7f9468
LUCENE-9848 Sort HNSW graph neighbors for construction (#862)
Sort HNSW graph neighbors when applying diversity criterion

During HNSW graph construction, when a node already has more connections
than the maximum allowed (maxConn), we need to prune its connections
using a diversity criterion to limit the number of connections to
maxConn.

Currently, when we add reverse connections to already existing nodes,
we don't keep them sorted. Thus later, when we apply the diversity
criterion, we may fail to prune the worst (most distant) non-diverse
nodes.

This patch makes sure that neighbours' connections are always sorted
from best (closest) to worst (most distant), and that the diversity
criterion is applied by processing nodes from worst to best.

This patch does the following:
- enhance NeighborArray to always keep neighbour nodes sorted according
  to their scores (in desc or asc order). Make NeighborArray aware of
  the order in which the nodes should be sorted.
- make OnHeapHnswGraph aware of the order of the similarity function
- make HnswGraphBuilder apply the diversity criterion from worst to
  best nodes
- create Lucene90NeighborArray to keep the previous logic of
  NeighborArray for Lucene90Codec
2022-05-04 14:15:14 -04:00
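
A minimal sketch of the "keep neighbours sorted on insertion" idea follows. It is not Lucene's NeighborArray: the class and method names are hypothetical, scores are assumed to be "higher is better", and the diversity check itself is left out. It only shows sorted insertion and removal from the worst end.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical structure, for illustration only; not Lucene's NeighborArray. */
final class SortedNeighbors {
  private final List<Integer> nodes = new ArrayList<>();
  private final List<Float> scores = new ArrayList<>();

  /** Inserts a neighbour so that scores stay in descending order (best first). */
  void insertSorted(int node, float score) {
    int pos = 0;
    // Skip past all neighbours that score at least as well as the new one.
    while (pos < scores.size() && scores.get(pos) >= score) {
      pos++;
    }
    nodes.add(pos, node);
    scores.add(pos, score);
  }

  /** Removes and returns the worst-scoring neighbour, which is always last. */
  int removeWorst() {
    int last = nodes.size() - 1;
    scores.remove(last);
    return nodes.remove(last);
  }

  int size() {
    return nodes.size();
  }

  /** Returns the node at the given rank; rank 0 is the best neighbour. */
  int node(int rank) {
    return nodes.get(rank);
  }
}
```

Because the list is sorted at all times, pruning can walk it from the last element backwards and is guaranteed to see the most distant neighbours first.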
Gautam Worah c3d47507e9
LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md (#853) 2022-05-03 12:53:20 +09:00
Lu Xugang fe9d26178d
LUCENE-10552: KnnVectorQuery has incorrect equals/ hashCode (#859)
* LUCENE-10552: KnnVectorQuery now includes filter in equals/ hashCode
2022-05-02 17:58:47 -04:00
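
The kind of bug fixed here can be sketched with a stand-in class. This is not KnnVectorQuery: the class name and fields are hypothetical, and the filter is modelled as a plain `Object`. The point is that every field that affects results, including an optional filter, has to take part in both `equals` and `hashCode`.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in, for illustration only; not Lucene's KnnVectorQuery. */
final class VectorQueryKey {
  private final String field;
  private final float[] target;
  private final int k;
  private final Object filter; // may be null when no filter is set

  VectorQueryKey(String field, float[] target, int k, Object filter) {
    this.field = Objects.requireNonNull(field);
    this.target = target.clone();
    this.k = k;
    this.filter = filter;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof VectorQueryKey)) {
      return false;
    }
    VectorQueryKey other = (VectorQueryKey) obj;
    return k == other.k
        && field.equals(other.field)
        && Arrays.equals(target, other.target)
        && Objects.equals(filter, other.filter); // the filter is part of identity
  }

  @Override
  public int hashCode() {
    // Combine the same fields that equals() compares.
    return 31 * Objects.hash(field, k, filter) + Arrays.hashCode(target);
  }
}
```

Leaving the filter out of `equals` would make two queries with different filters look identical, for example to a cache keyed on the query.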
Kevin Risden 7efac761f4
LUCENE-10534: MinFloatFunction / MaxFloatFunction calls exists twice (#837) 2022-05-02 13:13:45 -04:00
spike.liu d9d2cb6f09
LUCENE-10188: Give SortedSetDocValues a docValueCount() (#663)
Co-authored-by: vlc刘诚 <chengliu@trip.com>
2022-05-02 10:41:12 -04:00