lucene

Commit Graph

Author	SHA1	Message	Date
Adrien Grand	b5062a2858	MultiCollector shouldn't report that scores are needed when they're not. (#12083 ) When sub collectors don't agree on their `ScoreMode`, `MultiCollector` currently returns `COMPLETE`. This makes sense when assuming that there is likely one collector computing top hits (`TOP_SCORES`) and another one computing facets (`COMPLETE_NO_SCORES`) so `COMPLETE` makes sense. However it is also possible to have one collector computing top hits by field (`TOP_DOCS`) and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector` shouldn't report that scores are needed in that case.	2023-01-13 14:44:17 +01:00
Luca Cavanna	90a5a71448	fix typo in changelog	2023-01-13 14:12:06 +01:00
Lu Xugang	102622842b	Enhance XXXField#newRangeQuery (#12078 ) Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery	2023-01-13 18:12:26 +08:00
Adrien Grand	aaab028266	Speed up DocIdMerger on sorted indexes. (#12081 ) In the case when an index is sorted on a low-cardinality field, or the index sort order correlates with the order in which documents get ingested, we can optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on the next sub. This check covers at the same time whether the priority queue needs reordering and whether the current sub reached `NO_MORE_DOCS`.	2023-01-12 18:27:45 +01:00
Adrien Grand	729fedcbac	Speed up 1D BKD merging. (#12079 ) On the NYC taxis dataset on my local machine, switching from `Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15% speedup of BKD merging.	2023-01-12 18:14:15 +01:00
Benjamin Trent	59b17452aa	Fix exponential runtime for Boolean#rewrite (#12072 ) When #672 was introduced, it added many nice rewrite optimizations. However, in the case when there are many nested Boolean queries under a top level Boolean#filter clause, its runtime grows exponentially. The key issue was how the BooleanQuery#rewriteNoScoring redirected yet again to the ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively. This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite. closes: #12069	2023-01-12 16:50:05 +01:00
Adrien Grand	9ab324d2be	Allow reusing indexed binary fields. (#12053 ) Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and once for the `SortedDocValuesField`, while still reusing fields.	2023-01-12 09:52:12 +01:00
Adrien Grand	525a11091c	Revert "Allow reusing indexed binary fields. (#12053 )" This reverts commit `84778549af`.	2023-01-12 09:48:06 +01:00
Adrien Grand	84778549af	Allow reusing indexed binary fields. (#12053 ) Today Lucene allows creating indexed binary fields, e.g. via `StringField(String, BytesRef, Field.Store)`, but not reusing them: calling `setBytesValue` on a `StringField` throws. This commit removes the check that prevents reusing fields with binary values. I considered an alternative that consisted of failing if calling `setBytesValue` on a field that is indexed and tokenized, but we currently don't have such checks e.g. on numeric values, so it did not feel consistent. Doing this change would help improve the [nightly benchmarks for the NYC taxis dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html) by doing the String -> UTF-8 conversion only once for keywords, instead of once for the `StringField` and once for the `SortedDocValuesField`, while still reusing fields.	2023-01-12 09:32:13 +01:00
Patrick Zhai	d3d9ab0044	Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051 )	2023-01-11 10:44:06 -08:00
Adrien Grand	a8ef03d979	Never throttle creation of compound files. (#12070 ) `ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a proxy for CPU usage, in order to prevent merging from disrupting searches too much. However, creating compound files is lightweight CPU-wise and does not need throttling. Closes #12068	2023-01-11 09:57:13 +01:00
Adrien Grand	56ec51e558	Cut over Lucene Demo from LongPoint to LongField. (#12052 )	2023-01-11 09:43:43 +01:00
Benjamin Trent	cc29102a24	Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064 )	2023-01-11 09:20:47 +01:00
Erik Pellizzon	e14327288e	Documenting that IndexReaderContext#leaves() will never return a null value and remove the null checks from the method calls (#12034 )	2023-01-06 14:35:52 -08:00
Uwe Schindler	5fccaec166	Remove deprecated APIs after #12066 ; this also removes another one that was missed earlier	2023-01-05 11:53:16 +01:00
Uwe Schindler	7f483bd618	Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066 )	2023-01-04 19:17:03 +01:00
Uwe Schindler	2f602c01dd	Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062 )	2023-01-03 19:10:28 +01:00
Lu Xugang	19cc6cdf66	Out of boundary in CombinedFieldQuery#addTerm (#12046 )	2023-01-03 15:33:36 +08:00
Uwe Schindler	e2ee09d0c5	Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058 )	2023-01-02 00:05:36 +01:00
twosom	4676a735c1	fix typo analysis-kuromoji (#12047 )	2023-01-01 10:58:50 -05:00
Patrick Zhai	4eab1d74e8	Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049 )	2022-12-30 13:18:51 -08:00
Greg Miller	21107d811b	Add CHANGES entry for GITHUB#11869	2022-12-30 07:46:51 -08:00
Marc D'Mello	cbfed77fd3	Github#11869: Add RangeOnRangeFacetCounts (#11901 )	2022-12-30 07:38:13 -08:00
Adrien Grand	6f477e5831	Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037 ) This iterates on #399 to also optimize the case when an index sort is configured. When cutting over the NYC taxis benchmark to the new numeric fields, [flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times) stayed mostly the same when index sorting is disabled and increased by 7-8% when index sorting is enabled. I expect this change to address this slowdown.	2022-12-27 11:12:56 +01:00
Adrien Grand	ddd63d2da3	Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011 ) When flushing segments that have an index sort configured, postings lists get loaded into arrays and get reordered according to the index sort. This reordering is implemented with `TimSorter`, a variant of merge sort. Like merge sort, an important part of `TimSorter` consists of merging two contiguous sorted slices of the array into a combined sorted slice. This merging can be done either with external memory, which is the classical approach, or in-place, which still runs in linear time but with a much higher factor. Until now we were allocating a fixed budget of `maxDoc/64` for doing these merges with external memory. If this is not enough, sorted slices would be merged in place. I've been looking at some profiles recently for an index where a non-negligible chunk of the time was spent on in-place merges. So I would like to propose the following change: - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid in-place merges for all postings up to `docFreq = maxDoc / 4`. - Make this RAM budget lazily allocated, rather than eagerly like today. This would help not allocate memory in O(maxDoc) for fields like primary keys that only have a couple postings per term. So overall memory usage would never be more than 50% higher than what it is today, because `TimSorter` never needs more than X temporary slots if the postings list doesn't have at least 2X entries, and these 2X entries already get loaded into memory today. And for fields that have short postings, memory usage should actually be lower.	2022-12-27 11:11:18 +01:00
Adrien Grand	f5ea0412eb	Replace JIRA release instructions with GitHub. (#11968 )	2022-12-27 11:08:46 +01:00
Adrien Grand	e9dc4f9188	Avoid sorting values of multi-valued writers if there is a single value. (#12039 ) They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to range checks and some logic to determine the optimal sorting algorithm to use depending on the number of values. We can skip this overhead in the case when there is a single value.	2022-12-27 11:03:06 +01:00
Zach Chen	008a0d4206	Remove IOContext from Directory#openChecksumInput (#12027 )	2022-12-26 11:45:42 -08:00
Uwe Schindler	c9401bf064	Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033 )	2022-12-26 10:07:44 +01:00
Uwe Schindler	92f08aff9f	Make childLog final to fix compilation on Java 20. This closes #12041	2022-12-25 14:55:33 +01:00
Lu Xugang	3bc8cd5c20	Aggressive `count` in BooleanWeight (#12017 )	2022-12-22 23:48:05 +08:00
twosom	ad22fb2879	Fix typo in AbstractQueryConfig javadocs (#12031 )	2022-12-22 13:57:29 +01:00
twosom	5c78e04a17	fix typo in BaseSynonymParserTestCase (#12030 ) Co-authored-by: hope <hope@gravylab.co.kr>	2022-12-21 13:28:52 -05:00
Egor Potemkin	d18e3f1d45	Issue #11582 Update Faceting user guide (#12025 ) Update faceting user guide to modern times. Co-authored-by: Egor Potemkin <epotyom@amazon.com>	2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño	57201aa967	Add IntField, LongField, FloatField and DoubleField (#11997 ) This commit adds new IndexableFields that index both points and doc values at once. Closes #11199	2022-12-20 18:19:46 +01:00
Benjamin Trent	1412e559d9	Clean up KNN related backward-codecs changes (#12019 )	2022-12-20 14:04:42 +01:00
Robert Muir	3ac71adbdf	Ban use of Math.fma across the entire codebase (#12014 ) When FMA is not supported by the hardware, these methods fall back to BigDecimal usage which causes them to be 2500x slower. While most hardware in the last 10 years may have the support, out of box both VirtualBox and QEMU don't pass thru FMA support (for the latter at least you can tweak it with e.g. -cpu host or similar to fix this). This creates a terrible undocumented performance trap. Prevent it from sneaking into our codebase.	2022-12-17 08:01:22 -05:00
Andriy Redko	945d7fe027	Upgrade ANTLR to version 4.11.1 (#12016 ) Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md "Q: What are the main design decisions in ANTLR4? Ease-of-use over performance. I will worry about performance later." It allows us to move forward with newer antlr but hopefully prevent the associated headaches. Signed-off-by: Andriy Redko <andriy.redko@aiven.io> Co-authored-by: Robert Muir <rmuir@apache.org>	2022-12-15 22:40:35 -05:00
Craig Taverner	3e8ef57e3f	Fix flat polygons incorrectly containing intersecting geometries (#12022 )	2022-12-15 14:56:09 +01:00
Benjamin Trent	11f2bc2056	Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024 )	2022-12-15 14:49:47 +01:00
Benjamin Trent	72968d30ba	Move byte vector queries into new KnnByteVectorQuery (#12004 )	2022-12-14 09:53:10 +01:00
Robert Muir	9eeab8c4a6	Remove deprecated API in 10.x (#11998 )	2022-12-13 10:32:15 -05:00
Robert Muir	47f8c1baa2	Migrate away from per-segment-per-threadlocals on SegmentReader (#11998 ) Add new stored fields and termvectors interfaces: IndexReader.storedFields() and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector(). The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments. Co-authored-by: Adrien Grand <jpountz@gmail.com>	2022-12-13 09:10:21 -05:00
Ignacio Vera	ef5766aa81	Fix algorithm that chooses the bridge between a polygon and a hole (#11988 )	2022-12-13 10:16:53 +01:00
Dawid Weiss	486003833f	Run spotless after javac (#12012 ) (#12015 )	2022-12-13 08:42:04 +01:00
Robert Muir	06f9179295	Enable LongDoubleConversion error-prone check (#12010 )	2022-12-12 20:55:39 -05:00
Greg Miller	e34234ca6c	Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008 )	2022-12-11 12:55:21 -08:00
Greg Miller	8671e29929	Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003 ) * Leverage DISI static factory methods more over custom DISI impl where possible. * Assert points field is a single-dim. * Bound cost estimate by the cost of the doc values field (for sparse fields).	2022-12-10 12:23:31 -08:00
gf2121	54e00df7f6	Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006 )	2022-12-11 02:30:17 +08:00
gf2121	9ff989ec00	Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880 )	2022-12-08 23:51:08 +08:00
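
The `ScoreMode` reasoning in commit b5062a2858 above can be illustrated with a small self-contained sketch. It does not use Lucene and is not the patch itself: `ScoreModeSketch`, the local `Mode` enum, and the `combine` helper are hypothetical stand-ins, and only the four modes named in the commit message are modeled. The assumption is that, when sub collectors disagree, the combined mode requests scores only if at least one sub collector needs them.

```java
import java.util.List;

public class ScoreModeSketch {

  // Local stand-in for a score mode (assumption: only the "needs scores" flag matters here).
  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(false),
    TOP_SCORES(true),
    TOP_DOCS(false);

    final boolean needsScores;

    Mode(boolean needsScores) {
      this.needsScores = needsScores;
    }
  }

  // Hypothetical helper: combines the modes reported by several sub collectors.
  // Returns null for an empty list.
  static Mode combine(List<Mode> subModes) {
    Mode combined = null;
    for (Mode mode : subModes) {
      if (combined == null) {
        combined = mode;
      } else if (combined != mode) {
        // Sub collectors disagree: fall back to a COMPLETE mode, and ask for
        // scores only if at least one side actually needs them.
        combined = (combined.needsScores || mode.needsScores)
            ? Mode.COMPLETE
            : Mode.COMPLETE_NO_SCORES;
      }
    }
    return combined;
  }

  public static void main(String[] args) {
    // Top hits by score plus facets: scores are needed.
    System.out.println(combine(List.of(Mode.TOP_SCORES, Mode.COMPLETE_NO_SCORES)));
    // Top hits by field plus facets: no sub collector needs scores.
    System.out.println(combine(List.of(Mode.TOP_DOCS, Mode.COMPLETE_NO_SCORES)));
  }
}
```

Under these assumptions, the first call prints `COMPLETE` and the second prints `COMPLETE_NO_SCORES`, which is the case the commit message says should no longer report that scores are needed.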

36384 Commits

All Branches