Commit Graph

36461 Commits

Author SHA1 Message Date
Adrien Grand 6f477e5831
Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037)
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
2022-12-27 11:12:56 +01:00
Adrien Grand ddd63d2da3
Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011)
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.

This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher factor. Until now we
were allocating a fixed budget of `maxDoc/64` for doing these merges with
external memory. If this is not enough, sorted slices would be merged in place.

I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
 - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
   in-place merges for all postings up to `docFreq = maxDoc / 4`.
 - Make this RAM budget lazily allocated, rather than eagerly like today. This
   would help not allocate memory in O(maxDoc) for fields like primary keys
   that only have a couple postings per term.

So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
2022-12-27 11:11:18 +01:00
Adrien Grand f5ea0412eb
Replace JIRA release instructions with GitHub. (#11968) 2022-12-27 11:08:46 +01:00
Adrien Grand e9dc4f9188
Avoid sorting values of multi-valued writers if there is a single value. (#12039)
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
2022-12-27 11:03:06 +01:00
Zach Chen 008a0d4206
Remove IOContext from Directory#openChecksumInput (#12027) 2022-12-26 11:45:42 -08:00
Uwe Schindler c9401bf064
Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033) 2022-12-26 10:07:44 +01:00
Uwe Schindler 92f08aff9f Make childLog final to fix compilation on Java 20. This closes #12041 2022-12-25 14:55:33 +01:00
Lu Xugang 3bc8cd5c20
Aggressive `count` in BooleanWeight (#12017) 2022-12-22 23:48:05 +08:00
twosom ad22fb2879
Fix typo in AbstractQueryConfig javadocs (#12031) 2022-12-22 13:57:29 +01:00
twosom 5c78e04a17
fix typo in BaseSynonymParserTestCase (#12030)
Co-authored-by: hope <hope@gravylab.co.kr>
2022-12-21 13:28:52 -05:00
Egor Potemkin d18e3f1d45
Issue #11582 Update Faceting user guide (#12025)
Update faceting user guide to modern times.

Co-authored-by: Egor Potemkin <epotyom@amazon.com>
2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño 57201aa967
Add IntField, LongField, FloatField and DoubleField (#11997)
This commit adds new IndexableFields that index both points and doc
values at once.

Closes #11199
2022-12-20 18:19:46 +01:00
Benjamin Trent 1412e559d9
Clean up KNN related backward-codecs changes (#12019) 2022-12-20 14:04:42 +01:00
Robert Muir 3ac71adbdf
Ban use of Math.fma across the entire codebase (#12014)
When FMA is not supported by the hardware, these methods fall back to
BigDecimal usage which causes them to be 2500x slower.

While most hardware in the last 10 years may have the support, out of
box both VirtualBox and QEMU don't pass thru FMA support (for the latter
at least you can tweak it with e.g. -cpu host or similar to fix this).

This creates a terrible undocumented performance trap. Prevent it from
sneaking into our codebase.
2022-12-17 08:01:22 -05:00
Andriy Redko 945d7fe027
Upgrade ANTLR to version 4.11.1 (#12016)
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
2022-12-15 22:40:35 -05:00
Craig Taverner 3e8ef57e3f
Fix flat polygons incorrectly containing intersecting geometries (#12022) 2022-12-15 14:56:09 +01:00
Benjamin Trent 11f2bc2056
Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024) 2022-12-15 14:49:47 +01:00
Benjamin Trent 72968d30ba
Move byte vector queries into new KnnByteVectorQuery (#12004) 2022-12-14 09:53:10 +01:00
Robert Muir 9eeab8c4a6
Remove deprecated API in 10.x (#11998) 2022-12-13 10:32:15 -05:00
Robert Muir 47f8c1baa2
Migrate away from per-segment-per-threadlocals on SegmentReader (#11998)
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-12-13 09:10:21 -05:00
Ignacio Vera ef5766aa81
Fix algorithm that chooses the bridge between a polygon and a hole (#11988) 2022-12-13 10:16:53 +01:00
Dawid Weiss 486003833f
Run spotless after javac (#12012) (#12015) 2022-12-13 08:42:04 +01:00
Robert Muir 06f9179295
Enable LongDoubleConversion error-prone check (#12010) 2022-12-12 20:55:39 -05:00
Greg Miller e34234ca6c
Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008) 2022-12-11 12:55:21 -08:00
Greg Miller 8671e29929
Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003)
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
2022-12-10 12:23:31 -08:00
gf2121 54e00df7f6
Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006) 2022-12-11 02:30:17 +08:00
gf2121 9ff989ec00
Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880) 2022-12-08 23:51:08 +08:00
Alan Woodward 66127f6e69
Add support for stored fields to MemoryIndex (#11999) 2022-12-08 09:56:24 +00:00
Adrien Grand a971120d05
Make RandomAccessVectorValues an implementation detail of HNSW implementations rather than a proper API. (#11964)
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
 - `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
 - `BufferingKnnVectorsWriter` no longer has a dependency on
   `RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
   it's more of a utility class for KNN vector file formats than an index API.
   Maybe we should think of moving it near each file format that uses it
   instead.
 - `SortingCodecReader` no longer has a dependency on
   `RandomAccessVectorValues`.

Closes #10623
2022-12-08 08:49:37 +01:00
Adrien Grand 95df7e8109
Generalize range query optimization on sorted indexes to descending sorts. (#11972)
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only support moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
2022-12-08 08:38:53 +01:00
Benjamin Trent d0be9ab57c
GITHUB-11830 Better optimize storage for vector connections (#11860) 2022-12-07 08:51:54 +01:00
Karl David Wright 108462a005 Followup work for #11883 2022-12-03 08:07:10 -05:00
Costin Leau 4eba6a1284
Add exponential growth to TimeLimitingBulkScorer (#11984)
Increase the timeout check inside TimeLimitBulkScorer at exponential rate.

Fix #11676
2022-12-02 09:20:48 -08:00
Dawid Weiss 1f741ff63c
Upgrade gradle to 7.6. (#11993) 2022-12-02 09:18:38 +01:00
Robert Muir fad3108b27
fix wrong serialization by ShapeDocValues (#11974)
Closes #11973
2022-12-01 20:32:42 -05:00
Robert Muir 0a9bb6e2ac
Disable useless error-prone checks (libraries/frameworks we do not use) (#11971)
These are easy/obvious ones to disable since we don't use the
functionality at all: the checks are literally useless.

This gives some performance boost to the error-prone, although it is
still pretty slow.

triage most of the previously disabled checks into TODO, noisy, etc
2022-12-01 08:46:23 -05:00
Alan Woodward 72ff140f5a
Don't let merged passages push out lower-scoring ones (#11990)
PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.

This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
2022-12-01 12:25:29 +00:00
Luca Cavanna bd168ac2a8 Add changes entry for #11985 2022-11-30 10:13:39 +01:00
Luca Cavanna 343d888b30
ExitableTerms to override getMin and getMax (#11985)
ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)
2022-11-30 10:06:31 +01:00
Alan Woodward 0cc6f69536
Give OffsetsRetrievalStrategy implementations public constructors (#11983)
OffsetsFromMatchIterator and OffsetsFromPositions both have package-
private constructors, which makes them difficult to use as components in a
separate highlighter implementation.
2022-11-28 16:22:46 +00:00
Karl David Wright 5c4896321d Merge branch 'GITHUB-11883' into main
Pulling in changes to address ticket 11883.
2022-11-25 16:32:02 -05:00
Karl David Wright 74e8b94796 Fix for 11883. 2022-11-25 16:17:18 -05:00
Karl David Wright 6dc6b5b0dd As part of GITHUB-11883, develop new primitive Plane constructors to build boundary planes specific for each polygon edge. 2022-11-25 14:56:38 -05:00
Greg Miller 2e83c3b40f
Fix NPE in BinaryRangeFieldRangeQuery when field does not exist or is of wrong type (#11950) 2022-11-25 11:38:41 -08:00
Robert Muir 4e93f29318
fix bad shift amounts and enable check (#11979) 2022-11-25 11:47:25 -05:00
Robert Muir 545c93a394
fix use of wrong array toString() method in test, enable check (#11978) 2022-11-25 11:47:04 -05:00
Robert Muir 4885b5f856
fix use of wrong array equals() method in test, enable check (#11977) 2022-11-25 11:46:48 -05:00
Robert Muir f4286493d1
fix variable assigned to itself in test and enable check (#11980) 2022-11-25 11:45:45 -05:00
Karl David Wright b5f94b6754 Add test that tweaks identical planes in intersections bug 2022-11-25 07:40:45 -05:00
Karl David Wright b5dd71198d Refactor, restoring isWithinSection and making sure it is properly called. 2022-11-24 02:47:06 -05:00