36428 Commits

Author SHA1 Message Date
Adrien Grand 9ab324d2be Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that would fail when `setBytesValue` is called on
a field that is indexed and tokenized, but we currently don't have such checks
e.g. on numeric values, so it did not feel consistent.

This change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:52:12 +01:00
Adrien Grand 525a11091c Revert "Allow reusing indexed binary fields. (#12053)"
This reverts commit 84778549af.
2023-01-12 09:48:06 +01:00
Adrien Grand 84778549af
Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that would fail when `setBytesValue` is called on
a field that is indexed and tokenized, but we currently don't have such checks
e.g. on numeric values, so it did not feel consistent.

This change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:32:13 +01:00
Patrick Zhai d3d9ab0044
Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051) 2023-01-11 10:44:06 -08:00
Adrien Grand a8ef03d979
Never throttle creation of compound files. (#12070)
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not
need throttling.

Closes #12068
2023-01-11 09:57:13 +01:00
Adrien Grand 56ec51e558
Cut over Lucene Demo from LongPoint to LongField. (#12052) 2023-01-11 09:43:43 +01:00
Benjamin Trent cc29102a24
Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064) 2023-01-11 09:20:47 +01:00
Erik Pellizzon e14327288e
Document that IndexReaderContext#leaves() never returns null and remove the null checks from its callers (#12034) 2023-01-06 14:35:52 -08:00
Uwe Schindler 5fccaec166 Remove deprecated APIs after #12066; this also removes another one that was missed earlier 2023-01-05 11:53:16 +01:00
Uwe Schindler 7f483bd618
Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066) 2023-01-04 19:17:03 +01:00
Uwe Schindler 2f602c01dd
Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062) 2023-01-03 19:10:28 +01:00
Lu Xugang 19cc6cdf66
Out of boundary in CombinedFieldQuery#addTerm (#12046) 2023-01-03 15:33:36 +08:00
Uwe Schindler e2ee09d0c5
Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058) 2023-01-02 00:05:36 +01:00
twosom 4676a735c1
fix typo analysis-kuromoji (#12047) 2023-01-01 10:58:50 -05:00
Patrick Zhai 4eab1d74e8
Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049) 2022-12-30 13:18:51 -08:00
Greg Miller 21107d811b Add CHANGES entry for GITHUB#11869 2022-12-30 07:46:51 -08:00
Marc D'Mello cbfed77fd3
Github#11869: Add RangeOnRangeFacetCounts (#11901) 2022-12-30 07:38:13 -08:00
Adrien Grand 6f477e5831
Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037)
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
2022-12-27 11:12:56 +01:00
Adrien Grand ddd63d2da3
Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011)
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.

This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher constant factor. Until
now we were allocating a fixed budget of `maxDoc/64` for doing these merges
with external memory. If this was not enough, sorted slices would be merged in
place.

I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
 - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
   in-place merges for all postings up to `docFreq = maxDoc / 4`.
 - Make this RAM budget lazily allocated, rather than eagerly like today. This
   would help not allocate memory in O(maxDoc) for fields like primary keys
   that only have a couple postings per term.

So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
2022-12-27 11:11:18 +01:00
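
The budget reasoning in the commit above can be illustrated with a minimal sketch. This is not Lucene's `TimSorter`: the class `BudgetedMerge` and its method names are hypothetical, and the in-place fallback is a simple insertion-style stand-in rather than the in-place merge Lucene uses. It shows two ideas from the message: the temporary buffer is allocated lazily, and merging two adjacent runs needs at most as many temporary slots as the shorter run holds, which is never more than half of the merged entries.

```java
/** Hypothetical sketch of a budgeted merge; not Lucene's TimSorter. */
final class BudgetedMerge {
  private final int maxTempSlots; // assumed budget of maxDoc / 8, as proposed in the commit
  private int[] tmp = new int[0]; // grown lazily, only when a merge needs it

  BudgetedMerge(int maxDoc) {
    this.maxTempSlots = maxDoc / 8;
  }

  /** Number of temporary slots allocated so far. */
  int allocatedSlots() {
    return tmp.length;
  }

  /** Merges the sorted runs a[lo, mid) and a[mid, hi) into one sorted run a[lo, hi). */
  void merge(int[] a, int lo, int mid, int hi) {
    int len1 = mid - lo;
    int len2 = hi - mid;
    int needed = Math.min(len1, len2); // only the shorter run is copied out
    if (needed > maxTempSlots) {
      mergeInPlace(a, lo, mid, hi); // over budget: no extra memory
      return;
    }
    if (tmp.length < needed) {
      tmp = new int[needed]; // lazy allocation
    }
    if (len1 <= len2) {
      // Copy the left run out and merge forward.
      System.arraycopy(a, lo, tmp, 0, len1);
      int i = 0, j = mid, dest = lo;
      while (i < len1 && j < hi) {
        a[dest++] = (a[j] < tmp[i]) ? a[j++] : tmp[i++];
      }
      while (i < len1) {
        a[dest++] = tmp[i++];
      }
      // Any remaining right-run entries are already in their final position.
    } else {
      // Copy the right run out and merge backward.
      System.arraycopy(a, mid, tmp, 0, len2);
      int i = mid - 1, j = len2 - 1, dest = hi - 1;
      while (i >= lo && j >= 0) {
        a[dest--] = (a[i] > tmp[j]) ? a[i--] : tmp[j--];
      }
      while (j >= 0) {
        a[dest--] = tmp[j--];
      }
      // Any remaining left-run entries are already in their final position.
    }
  }

  /** Simple stand-in for an in-place merge: inserts each right-run entry into the left run. */
  private static void mergeInPlace(int[] a, int lo, int mid, int hi) {
    for (int j = mid; j < hi; j++) {
      int v = a[j];
      int i = j - 1;
      while (i >= lo && a[i] > v) {
        a[i + 1] = a[i];
        i--;
      }
      a[i + 1] = v;
    }
  }
}
```

Under these assumptions, a budget of `maxDoc / 8` slots covers any merge whose two runs together hold up to `maxDoc / 4` entries, which matches the `docFreq = maxDoc / 4` bound stated in the message.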
Adrien Grand f5ea0412eb
Replace JIRA release instructions with GitHub. (#11968) 2022-12-27 11:08:46 +01:00
Adrien Grand e9dc4f9188
Avoid sorting values of multi-valued writers if there is a single value. (#12039)
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
2022-12-27 11:03:06 +01:00
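
As a rough illustration of the shortcut described above, the following sketch guards the `Arrays#sort` call with a count check. The class and method names are hypothetical and the buffer layout (a `long[]` plus a count of valid entries) is an assumption; the actual writers in Lucene may be structured differently.

```java
import java.util.Arrays;

/** Hypothetical sketch; the real doc-values writers may look different. */
final class SingleValueSortSkip {
  /** Sorts values[0, count) only when there is more than one value to order. */
  static void sortIfNeeded(long[] values, int count) {
    if (count > 1) {
      // Arrays#sort performs range checks and picks an algorithm based on length,
      // which is wasted work for zero or one value.
      Arrays.sort(values, 0, count);
    }
  }
}
```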
Zach Chen 008a0d4206
Remove IOContext from Directory#openChecksumInput (#12027) 2022-12-26 11:45:42 -08:00
Uwe Schindler c9401bf064
Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033) 2022-12-26 10:07:44 +01:00
Uwe Schindler 92f08aff9f Make childLog final to fix compilation on Java 20. This closes #12041 2022-12-25 14:55:33 +01:00
Lu Xugang 3bc8cd5c20
Aggressive `count` in BooleanWeight (#12017) 2022-12-22 23:48:05 +08:00
twosom ad22fb2879
Fix typo in AbstractQueryConfig javadocs (#12031) 2022-12-22 13:57:29 +01:00
twosom 5c78e04a17
fix typo in BaseSynonymParserTestCase (#12030)
Co-authored-by: hope <hope@gravylab.co.kr>
2022-12-21 13:28:52 -05:00
Egor Potemkin d18e3f1d45
Issue #11582 Update Faceting user guide (#12025)
Update faceting user guide to modern times.

Co-authored-by: Egor Potemkin <epotyom@amazon.com>
2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño 57201aa967
Add IntField, LongField, FloatField and DoubleField (#11997)
This commit adds new IndexableFields that index both points and doc
values at once.

Closes #11199
2022-12-20 18:19:46 +01:00
Benjamin Trent 1412e559d9
Clean up KNN related backward-codecs changes (#12019) 2022-12-20 14:04:42 +01:00
Robert Muir 3ac71adbdf
Ban use of Math.fma across the entire codebase (#12014)
When FMA is not supported by the hardware, these methods fall back to
BigDecimal usage which causes them to be 2500x slower.

While most hardware from the last 10 years may have the support, out of the
box neither VirtualBox nor QEMU passes FMA support through (for the latter
at least you can tweak it with e.g. -cpu host or similar to fix this).

This creates a terrible undocumented performance trap. Prevent it from
sneaking into our codebase.
2022-12-17 08:01:22 -05:00
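
To make the alternative concrete, here is a minimal sketch of a dot product written with a plain multiply followed by an add instead of `Math.fma`. The class name `PlainDotProduct` is hypothetical and this is not code from the Lucene codebase. Note that `Math.fma` rounds once while multiply-then-add rounds twice, so the two forms may differ in the last bits on some inputs.

```java
/** Hypothetical sketch of a dot product that avoids Math.fma. */
final class PlainDotProduct {
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ: " + a.length + " != " + b.length);
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      // Plain multiply-then-add, deliberately not Math.fma(a[i], b[i], sum),
      // so performance does not depend on hardware FMA support.
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```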
Andriy Redko 945d7fe027
Upgrade ANTLR to version 4.11.1 (#12016)
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
2022-12-15 22:40:35 -05:00
Craig Taverner 3e8ef57e3f
Fix flat polygons incorrectly containing intersecting geometries (#12022) 2022-12-15 14:56:09 +01:00
Benjamin Trent 11f2bc2056
Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024) 2022-12-15 14:49:47 +01:00
Benjamin Trent 72968d30ba
Move byte vector queries into new KnnByteVectorQuery (#12004) 2022-12-14 09:53:10 +01:00
Robert Muir 9eeab8c4a6
Remove deprecated API in 10.x (#11998) 2022-12-13 10:32:15 -05:00
Robert Muir 47f8c1baa2
Migrate away from per-segment-per-threadlocals on SegmentReader (#11998)
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-12-13 09:10:21 -05:00
Ignacio Vera ef5766aa81
Fix algorithm that chooses the bridge between a polygon and a hole (#11988) 2022-12-13 10:16:53 +01:00
Dawid Weiss 486003833f
Run spotless after javac (#12012) (#12015) 2022-12-13 08:42:04 +01:00
Robert Muir 06f9179295
Enable LongDoubleConversion error-prone check (#12010) 2022-12-12 20:55:39 -05:00
Greg Miller e34234ca6c
Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008) 2022-12-11 12:55:21 -08:00
Greg Miller 8671e29929
Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003)
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
2022-12-10 12:23:31 -08:00
gf2121 54e00df7f6
Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006) 2022-12-11 02:30:17 +08:00
gf2121 9ff989ec00
Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880) 2022-12-08 23:51:08 +08:00
Alan Woodward 66127f6e69
Add support for stored fields to MemoryIndex (#11999) 2022-12-08 09:56:24 +00:00
Adrien Grand a971120d05
Make RandomAccessVectorValues an implementation detail of HNSW implementations rather than a proper API. (#11964)
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
 - `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
 - `BufferingKnnVectorsWriter` no longer has a dependency on
   `RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
   it's more of a utility class for KNN vector file formats than an index API.
   Maybe we should think of moving it near each file format that uses it
   instead.
 - `SortingCodecReader` no longer has a dependency on
   `RandomAccessVectorValues`.

Closes #10623
2022-12-08 08:49:37 +01:00
Adrien Grand 95df7e8109
Generalize range query optimization on sorted indexes to descending sorts. (#11972)
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only supports moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
2022-12-08 08:38:53 +01:00
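
The reason a last matching doc ID is needed can be sketched on a plain array. When documents are sorted by a field in descending order, the documents whose value falls in `[lower, upper]` form one contiguous doc ID range, so a range query reduces to finding the two ends of that range. The sketch below uses binary search over a `long[]` as a stand-in; it does not walk a BKD tree or use `PointTree`, and the class and method names are hypothetical.

```java
/** Hypothetical sketch: locating the matching doc ID range on a descending sort. */
final class DescendingRange {
  /**
   * Returns {firstDoc, endDocExclusive} for values within [lower, upper].
   * valuesDesc[doc] is the field value of doc, sorted in descending order.
   */
  static int[] matchingDocRange(long[] valuesDesc, long lower, long upper) {
    int first = firstIndexAtMost(valuesDesc, upper); // first doc with value <= upper
    int end = firstIndexBelow(valuesDesc, lower);    // first doc with value < lower
    return new int[] {first, Math.max(first, end)};  // empty range when nothing matches
  }

  /** First index whose value is <= bound, or values.length if there is none. */
  private static int firstIndexAtMost(long[] values, long bound) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] <= bound) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    return lo;
  }

  /** First index whose value is < bound, or values.length if there is none. */
  private static int firstIndexBelow(long[] values, long bound) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < bound) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    return lo;
  }
}
```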
Benjamin Trent d0be9ab57c
GITHUB-11830 Better optimize storage for vector connections (#11860) 2022-12-07 08:51:54 +01:00
Karl David Wright 108462a005 Followup work for #11883 2022-12-03 08:07:10 -05:00
Costin Leau 4eba6a1284
Add exponential growth to TimeLimitingBulkScorer (#11984)
Increase the interval between timeout checks inside TimeLimitingBulkScorer at an exponential rate.

Fix #11676
2022-12-02 09:20:48 -08:00
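
A minimal sketch of the idea, assuming an initial window of 100 documents and a growth of 50% per window (both values are assumptions made for illustration, as are the class and method names; this is not the `TimeLimitingBulkScorer` source). Documents are scored in windows, the timeout is checked once before each window, and the window size grows so that large segments trigger far fewer checks.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of timeout checks at exponentially growing intervals. */
final class ExponentialTimeoutChecks {
  static final int INITIAL_INTERVAL = 100; // assumed starting window size

  /** Returns {docsScored, timeoutChecks}; scoring stops early once timedOut reports true. */
  static int[] scoreAll(int maxDoc, BooleanSupplier timedOut) {
    int interval = INITIAL_INTERVAL;
    int doc = 0;
    int checks = 0;
    while (doc < maxDoc) {
      checks++;
      if (timedOut.getAsBoolean()) {
        break; // give up before scoring the next window
      }
      int end = (int) Math.min((long) doc + interval, (long) maxDoc);
      // A real scorer would score the window [doc, end) here.
      doc = end;
      interval += interval >> 1; // assumed growth: 50% larger window each time
    }
    return new int[] {doc, checks};
  }
}
```

With these assumed values, scoring 1000 documents without a timeout takes 5 checks (windows of 100, 150, 225, 337 and the remaining 188 documents), compared with 10 checks for a fixed window of 100.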