lucene

Commit Graph

Author	SHA1	Message	Date
Adrien Grand	6f477e5831	Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037 ) This iterates on #399 to also optimize the case when an index sort is configured. When cutting over the NYC taxis benchmark to the new numeric fields, [flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times) stayed mostly the same when index sorting is disabled and increased by 7-8% when index sorting is enabled. I expect this change to address this slowdown.	2022-12-27 11:12:56 +01:00
Adrien Grand	ddd63d2da3	Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011 ) When flushing segments that have an index sort configured, postings lists get loaded into arrays and get reordered according to the index sort. This reordering is implemented with `TimSorter`, a variant of merge sort. Like merge sort, an important part of `TimSorter` consists of merging two contiguous sorted slices of the array into a combined sorted slice. This merging can be done either with external memory, which is the classical approach, or in-place, which still runs in linear time but with a much higher factor. Until now we were allocating a fixed budget of `maxDoc/64` for doing these merges with external memory. If this is not enough, sorted slices would be merged in place. I've been looking at some profiles recently for an index where a non-negligible chunk of the time was spent on in-place merges. So I would like to propose the following change: - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid in-place merges for all postings up to `docFreq = maxDoc / 4`. - Make this RAM budget lazily allocated, rather than eagerly like today. This would help not allocate memory in O(maxDoc) for fields like primary keys that only have a couple postings per term. So overall memory usage would never be more than 50% higher than what it is today, because `TimSorter` never needs more than X temporary slots if the postings list doesn't have at least 2X entries, and these 2X entries already get loaded into memory today. And for fields that have short postings, memory usage should actually be lower.	2022-12-27 11:11:18 +01:00
Adrien Grand	f5ea0412eb	Replace JIRA release instructions with GitHub. (#11968 )	2022-12-27 11:08:46 +01:00
Adrien Grand	e9dc4f9188	Avoid sorting values of multi-valued writers if there is a single value. (#12039 ) They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to range checks and some logic to determine the optimal sorting algorithm to use depending on the number of values. We can skip this overhead in the case when there is a single value.	2022-12-27 11:03:06 +01:00
Zach Chen	008a0d4206	Remove IOContext from Directory#openChecksumInput (#12027 )	2022-12-26 11:45:42 -08:00
Uwe Schindler	c9401bf064	Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033 )	2022-12-26 10:07:44 +01:00
Uwe Schindler	92f08aff9f	Make childLog final to fix compilation on Java 20. This closes #12041	2022-12-25 14:55:33 +01:00
Lu Xugang	3bc8cd5c20	Aggressive `count` in BooleanWeight (#12017 )	2022-12-22 23:48:05 +08:00
twosom	ad22fb2879	Fix typo in AbstractQueryConfig javadocs (#12031 )	2022-12-22 13:57:29 +01:00
twosom	5c78e04a17	fix typo in BaseSynonymParserTestCase (#12030 ) Co-authored-by: hope <hope@gravylab.co.kr>	2022-12-21 13:28:52 -05:00
Egor Potemkin	d18e3f1d45	Issue #11582 Update Faceting user guide (#12025 ) Update faceting user guide to modern times. Co-authored-by: Egor Potemkin <epotyom@amazon.com>	2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño	57201aa967	Add IntField, LongField, FloatField and DoubleField (#11997 ) This commit adds new IndexableFields that index both points and doc values at once. Closes #11199	2022-12-20 18:19:46 +01:00
Benjamin Trent	1412e559d9	Clean up KNN related backward-codecs changes (#12019 )	2022-12-20 14:04:42 +01:00
Robert Muir	3ac71adbdf	Ban use of Math.fma across the entire codebase (#12014 ) When FMA is not supported by the hardware, these methods fall back to BigDecimal usage which causes them to be 2500x slower. While most hardware in the last 10 years may have the support, out of box both VirtualBox and QEMU don't pass thru FMA support (for the latter at least you can tweak it with e.g. -cpu host or similar to fix this). This creates a terrible undocumented performance trap. Prevent it from sneaking into our codebase.	2022-12-17 08:01:22 -05:00
Andriy Redko	945d7fe027	Upgrade ANTLR to version 4.11.1 (#12016 ) Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md "Q: What are the main design decisions in ANTLR4? Ease-of-use over performance. I will worry about performance later." It allows us to move forward with newer antlr but hopefully prevent the associated headaches. Signed-off-by: Andriy Redko <andriy.redko@aiven.io> Co-authored-by: Robert Muir <rmuir@apache.org>	2022-12-15 22:40:35 -05:00
Craig Taverner	3e8ef57e3f	Fix flat polygons incorrectly containing intersecting geometries (#12022 )	2022-12-15 14:56:09 +01:00
Benjamin Trent	11f2bc2056	Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024 )	2022-12-15 14:49:47 +01:00
Benjamin Trent	72968d30ba	Move byte vector queries into new KnnByteVectorQuery (#12004 )	2022-12-14 09:53:10 +01:00
Robert Muir	9eeab8c4a6	Remove deprecated API in 10.x (#11998 )	2022-12-13 10:32:15 -05:00
Robert Muir	47f8c1baa2	Migrate away from per-segment-per-threadlocals on SegmentReader (#11998 ) Add new stored fields and termvectors interfaces: IndexReader.storedFields() and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector(). The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly reduce RAM requirements when there are many threads and/or segments. Co-authored-by: Adrien Grand <jpountz@gmail.com>	2022-12-13 09:10:21 -05:00
Ignacio Vera	ef5766aa81	Fix algorithm that chooses the bridge between a polygon and a hole (#11988 )	2022-12-13 10:16:53 +01:00
Dawid Weiss	486003833f	Run spotless after javac (#12012 ) (#12015 )	2022-12-13 08:42:04 +01:00
Robert Muir	06f9179295	Enable LongDoubleConversion error-prone check (#12010 )	2022-12-12 20:55:39 -05:00
Greg Miller	e34234ca6c	Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008 )	2022-12-11 12:55:21 -08:00
Greg Miller	8671e29929	Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003 ) * Leverage DISI static factory methods more over custom DISI impl where possible. * Assert points field is a single-dim. * Bound cost estimate by the cost of the doc values field (for sparse fields).	2022-12-10 12:23:31 -08:00
gf2121	54e00df7f6	Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006 )	2022-12-11 02:30:17 +08:00
gf2121	9ff989ec00	Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880 )	2022-12-08 23:51:08 +08:00
Alan Woodward	66127f6e69	Add support for stored fields to MemoryIndex (#11999 )	2022-12-08 09:56:24 +00:00
Adrien Grand	a971120d05	Make RandomAccessVectorValues an implementation detail of HNSW implementations rather than a proper API. (#11964 ) `RandomAccessVectorValues` is internally used in our HNSW implementation to provide random access to vectors, both at index and search time. In order to better reflect this, this change does the following: - `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`. - `BufferingKnnVectorsWriter` no longer has a dependency on `RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since it's more of a utility class for KNN vector file formats than an index API. Maybe we should think of moving it near each file format that uses it instead. - `SortingCodecReader` no longer has a dependency on `RandomAccessVectorValues`. Closes #10623	2022-12-08 08:49:37 +01:00
Adrien Grand	95df7e8109	Generalize range query optimization on sorted indexes to descending sorts. (#11972 ) This generalizes #687 to indexes that are sorted in descending order. The main challenge with descending sorts is that they require being able to compute the last doc ID that matches a value, which would ideally require walking the BKD tree in reverse order, but the API only supports moving forward. This is worked around by maintaining a stack of `PointTree` clones to perform the search.	2022-12-08 08:38:53 +01:00
Benjamin Trent	d0be9ab57c	GITHUB-11830 Better optimize storage for vector connections (#11860 )	2022-12-07 08:51:54 +01:00
Karl David Wright	108462a005	Followup work for #11883	2022-12-03 08:07:10 -05:00
Costin Leau	4eba6a1284	Add exponential growth to TimeLimitingBulkScorer (#11984 ) Increase the interval between timeout checks inside TimeLimitingBulkScorer at an exponential rate. Fix #11676	2022-12-02 09:20:48 -08:00
Dawid Weiss	1f741ff63c	Upgrade gradle to 7.6. (#11993 )	2022-12-02 09:18:38 +01:00
Robert Muir	fad3108b27	fix wrong serialization by ShapeDocValues (#11974 ) Closes #11973	2022-12-01 20:32:42 -05:00
Robert Muir	0a9bb6e2ac	Disable useless error-prone checks (libraries/frameworks we do not use) (#11971 ) These are easy/obvious ones to disable since we don't use the functionality at all: the checks are literally useless. This gives some performance boost to error-prone, although it is still pretty slow. Triage most of the previously disabled checks into TODO, noisy, etc.	2022-12-01 08:46:23 -05:00
Alan Woodward	72ff140f5a	Don't let merged passages push out lower-scoring ones (#11990 ) PassageScorer uses a priority queue of size maxPassages to keep track of which highlighted passages are worth returning to the user. Once all passages have been collected, we go through and merge overlapping passages together, but this reduction in the number of passages is not compensated for by re-adding the highest-scoring passages that were pushed out of the queue by passages which have been merged away. This commit increases the size of the priority queue to try and account for overlapping passages that will subsequently be merged together.	2022-12-01 12:25:29 +00:00
Luca Cavanna	bd168ac2a8	Add changes entry for #11985	2022-11-30 10:13:39 +01:00
Luca Cavanna	343d888b30	ExitableTerms to override getMin and getMax (#11985 ) ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)	2022-11-30 10:06:31 +01:00
Alan Woodward	0cc6f69536	Give OffsetsRetrievalStrategy implementations public constructors (#11983 ) OffsetsFromMatchIterator and OffsetsFromPositions both have package-private constructors, which makes them difficult to use as components in a separate highlighter implementation.	2022-11-28 16:22:46 +00:00
Karl David Wright	5c4896321d	Merge branch 'GITHUB-11883' into main Pulling in changes to address ticket 11883.	2022-11-25 16:32:02 -05:00
Karl David Wright	74e8b94796	Fix for 11883.	2022-11-25 16:17:18 -05:00
Karl David Wright	6dc6b5b0dd	As part of GITHUB-11883, develop new primitive Plane constructors to build boundary planes specific for each polygon edge.	2022-11-25 14:56:38 -05:00
Greg Miller	2e83c3b40f	Fix NPE in BinaryRangeFieldRangeQuery when field does not exist or is of wrong type (#11950 )	2022-11-25 11:38:41 -08:00
Robert Muir	4e93f29318	fix bad shift amounts and enable check (#11979 )	2022-11-25 11:47:25 -05:00
Robert Muir	545c93a394	fix use of wrong array toString() method in test, enable check (#11978 )	2022-11-25 11:47:04 -05:00
Robert Muir	4885b5f856	fix use of wrong array equals() method in test, enable check (#11977 )	2022-11-25 11:46:48 -05:00
Robert Muir	f4286493d1	fix variable assigned to itself in test and enable check (#11980 )	2022-11-25 11:45:45 -05:00
Karl David Wright	b5f94b6754	Add test that tweaks identical planes in intersections bug	2022-11-25 07:40:45 -05:00
Karl David Wright	b5dd71198d	Refactor, restoring isWithinSection and making sure it is properly called.	2022-11-24 02:47:06 -05:00
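
Note on commit ddd63d2da3 in the table above: its message argues that raising the merge budget to `maxDoc / 8` avoids in-place merges for postings up to `docFreq = maxDoc / 4`, and that lazy allocation keeps memory low for short postings. The arithmetic can be sketched roughly as below. This is not Lucene's code: the class and method names are hypothetical, and the sketch assumes, following the message's "X temporary slots for 2X entries" remark, that a postings list with `docFreq` entries needs at most `docFreq / 2` temporary slots.

```java
// Hypothetical sketch of the budget arithmetic described in the commit message.
// None of these names exist in Lucene.
final class SortBudgetSketch {

  /** Previous behaviour per the message: a fixed budget of maxDoc/64 slots, allocated eagerly. */
  static int eagerSlots(int maxDoc) {
    return maxDoc / 64;
  }

  /**
   * Proposed behaviour per the message: the budget is capped at maxDoc/8, but only the
   * slots a postings list can actually use are allocated (assumed here to be docFreq/2).
   */
  static int lazySlots(int maxDoc, int docFreq) {
    return Math.min(maxDoc / 8, docFreq / 2);
  }

  /** True when every merge fits in the temporary slots, so no in-place merge is needed. */
  static boolean avoidsInPlaceMerge(int budgetSlots, int docFreq) {
    return docFreq / 2 <= budgetSlots;
  }

  public static void main(String[] args) {
    int maxDoc = 1_000_000;
    int docFreq = maxDoc / 4;
    System.out.println("old budget sufficient: " + avoidsInPlaceMerge(eagerSlots(maxDoc), docFreq));
    System.out.println("new budget sufficient: " + avoidsInPlaceMerge(maxDoc / 8, docFreq));
    System.out.println("slots for a primary-key term: " + lazySlots(maxDoc, 1));
  }
}
```

Under these assumptions, a term with a single posting allocates no temporary slots at all, whereas the eager scheme would reserve `maxDoc / 64` slots regardless of postings length.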

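Note on commit 4eba6a1284 in the table above: the general idea of checking a time limit at exponentially growing intervals can be sketched as follows. This is only an illustration, not the code from that commit: the class name, the 1.5x growth factor, the initial window, and the exception type are all assumptions chosen for the example.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: visit doc IDs in windows and check the time limit once per
// window, growing the window by 50% each time. Not Lucene's implementation.
final class ExponentialTimeoutCheckSketch {

  /**
   * Walks doc IDs in [0, maxDoc) window by window and returns the number of
   * timeout checks performed. Throws IllegalStateException if shouldExit is true.
   */
  static int scoreAll(int maxDoc, int initialWindow, BooleanSupplier shouldExit) {
    if (initialWindow < 1) {
      throw new IllegalArgumentException("initialWindow must be >= 1");
    }
    int checks = 0;
    long window = initialWindow; // long, so growth cannot overflow
    int doc = 0;
    while (doc < maxDoc) {
      checks++;
      if (shouldExit.getAsBoolean()) {
        throw new IllegalStateException("time limit exceeded at doc " + doc);
      }
      int end = (int) Math.min(doc + window, (long) maxDoc);
      // A real scorer would score the docs in [doc, end) here.
      doc = end;
      window += window >> 1; // grow the window by 50%
    }
    return checks;
  }

  public static void main(String[] args) {
    // Windows of 100, 150, 225, 337 and 505 docs cover 1000 docs with 5 checks;
    // a fixed window of 100 would need 10 checks.
    System.out.println(scoreAll(1000, 100, () -> false));
  }
}
```

The trade-off in such a scheme is that fewer checks lower the per-document overhead, while a timeout that fires late in a large window is noticed later.
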
36361 Commits
