lucene

Commit Graph

Author	SHA1	Message	Date
Benjamin Trent	f16007c3ec	Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat (#13027 ) When merging `Lucene99HnswScalarQuantizedVectorsFormat`, an NPE is possible when deleted documents are present. `ScalarQuantizer#fromVectors` doesn't take deleted documents into account, so `FloatVectorValues#size` may be larger than the number of live documents. Consequently, iteration during sampling can run too far and throw an NPE.	2024-01-23 07:54:32 -05:00
Stefan Vodita	3674e779cb	[Minor] Document operation costs for stale workflow (#13000 )	2024-01-22 09:40:25 +00:00
Michael Froh	2a0b7f2056	Split taxonomy arrays across chunks (#12995 ) Taxonomy ordinals are added in an append-only way. Instead of reallocating a single big array when loading new taxonomy ordinals and copying all the values from the previous arrays over individually, we can keep blocks of ordinals and reuse blocks from the previous arrays.	2024-01-22 09:11:20 +00:00
Simon Willnauer	24d557a4f6	Run BWC indices generation code together with unittest (#13023 ) This change runs the current BWC indices generation code together with the unittest to catch issues with the generated indices early. Each generation method runs a sanity check on the generated indices.	2024-01-19 15:47:01 +01:00
Simon Willnauer	c661c88fc8	Fix BWC test generation after modernizing LineDocFile (#13021 ) The changes in #12929 broke the generation code for BWC indices since it expects certain fields created by LineDocFile. This change also adds some sanity checks that run with unittest to ensure the generated BWC indices are at least readable with the current version. Relates to #12929	2024-01-17 15:03:30 +01:00
Simon Willnauer	340d9d2f9c	Prevent parent field from being added to existing index (#13016 ) This change prevents users from adding a parent field to an existing index. The parent field must be added before any documents are added to the index, to prevent documents without the parent field from being indexed and later being treated as child documents upon merge. Relates to #12829	2024-01-16 10:53:43 +01:00
gf2121	ed7c78ce6d	LUCENE-10366: Override #readVInt and #readVLong for ByteBufferDataInput to avoid the abstraction confusion of #readByte. (#592 )	2024-01-16 17:30:09 +08:00
Sreehari Guruprasad	c746bea233	Fixed Field.java documentation to refer to new IntField/FloatField/LongField/DoubleField (#13012 ) Co-authored-by: Harshitha-g-06 <harshitha.grs@gmail.com> Closes #12125	2024-01-15 09:14:02 +01:00
Patrick Zhai	d72686be94	Suppress SimpleTextCodec for VectorSimilarityQueryTestCase (#13010 )	2024-01-14 20:00:10 -08:00
Adrien Grand	e0daca1eb4	Make sure `DocumentsWriterPerThread#getAndLock` never returns `null` on a non-empty queue. (#12959 ) Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `DocumentsWriterPerThread#getAndLock` was not the main source of contention. Closes #12649 #12916	2024-01-12 16:21:01 +01:00
Michael Froh	7dfef017e3	Output binary doc values as hex array in SimpleTextCodec (#12987 ) Binary doc values were being written directly in SimpleTextCodec, though they may not be valid UTF-8 (i.e. they may not be "text"). This change encodes them as a string representing an array of hexadecimal bytes.	2024-01-12 16:09:18 +01:00
kuramitsu	8bee41880e	Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji (#12885 )	2024-01-11 11:33:16 -08:00
Dzung Bui	2a851401a1	Clean up unused code & variables (#12994 )	2024-01-11 18:43:25 +01:00
Simon Willnauer	df6bd25ce4	Add support for index sorting with document blocks (#12829 ) Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block will not be reordered by the sorting algorithm and will sort alongside their parent documents. Relates to #12711	2024-01-11 16:11:15 +01:00
Michael Froh	b7728c5657	Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos (#12897 ) The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed.	2024-01-11 15:45:48 +01:00
Zhang Chao	75e1a0b96c	Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false (#12997 )	2024-01-11 15:37:52 +01:00
zhouhui	4b1180372e	Copy collected acc(maxFreqs) into empty acc, rather than merge them. (#12846 )	2024-01-11 14:44:11 +01:00
Dzung Bui	701619d35a	Lazily write the FST padding byte (#12981 ) * lazily write the FST padding byte * Also write the pad byte when there is emptyOutput * add comment * Add more comments	2024-01-11 07:31:24 -05:00
sabi0	09837bae73	Cleanup comments and asserts in TestIndexWriter (#13006 )	2024-01-10 20:47:03 +01:00
sabi0	f67b1b3d1f	Simplify asserts in TestWordBreakSpellChecker (#13007 )	2024-01-10 20:33:40 +01:00
Michael Gibney	89a02fa4e3	Use Automaton for SurroundQuery prefix/pattern matching (#12999 )	2024-01-10 13:14:21 -05:00
sabi0	d7a14257ce	Get rid of deprecated assertThat() usages (#12982 )	2024-01-10 16:31:29 +01:00
Andrew Ross	872702d828	Remove outdated comment from TaskExecutor (#12993 ) A previous iteration of this code used an AtomicInteger and required this comment. The committed version uses a self-documenting boolean and the comment is not needed.	2024-01-10 10:35:23 +01:00
Simon Willnauer	4d916a754b	Fix test to also take into account minor versions for BWC	2024-01-09 12:21:13 +01:00
Simon Willnauer	ea327220a8	Remove stale BWC tests (#12874 ) Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex` looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.	2024-01-09 11:49:53 +01:00
sabi0	5442748995	Fix missing variable assignment in testAllVersionHaveCfsAndNocfs() and other minor code cleanups (#12969 )	2024-01-09 11:04:31 +01:00
sabi0	0fc1e2c2f7	Code cleanups in EscapeQuerySyntaxImpl (#12973 )	2024-01-08 22:18:37 +01:00
Jakub Slowinski	6d27c20579	Fix only use of .toLowerCase() with no Locale (#12856 )	2024-01-08 22:04:04 +01:00
sabi0	a32f6acadf	Remove unnecessary fields loop from extractWeightedSpanTerms() (#12965 )	2024-01-08 22:01:56 +01:00
Marc D'Mello	376bd24693	Improve code clarity for OrdinalMap (#11729 ) Closes #11728	2024-01-08 14:00:53 +01:00
Michael McCandless	3c235bb7b4	LockVerifyServer does not need to reuse addresses nor set accept timeout (#12535 )	2024-01-08 13:53:08 +01:00
gf2121	67be0189bc	clean up sleep (#12914 )	2024-01-08 13:48:26 +01:00
Adrien Grand	40060f8b70	Reduce contention on flushControl.isFullFlush(). (#12958 ) `flushControl.isFullFlush()` is a surprising source of contention with documents that are cheap to index and many indexing threads. If I slightly modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing buffer and disable `TextField` fields, which are more costly to index than `KeywordField` or `IntField` fields, this brings the time to load all the dataset in the `IndexWriter` buffers from 8.0s to 7.0s.	2024-01-08 13:23:05 +01:00
Stefan Vodita	115a30d462	Increase stale PRs action budget and mark not debug-only (#12998 )	2024-01-08 07:20:59 -05:00
Stefan Vodita	564b2ebecc	Introduce workflow for stale PRs (#12813 ) * Introduce stale workflow * Exempt draft PRs * Tune the action to our needs 1. Don't mark issues stale, only PRs. 2. Don't close anything automatically. 3. Keep the default Stale label. 4. Run in debug-only mode to start.	2024-01-08 06:22:19 -05:00
Dzung Bui	4c883a414c	Optimize FST on-heap BytesReader (#12879 ) * Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Improvement of ReadWriteDataOutput * tidy code * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Do not expose blockBits * tidy code * Remove 0 initialization * Add assertion and comment	2024-01-06 07:47:19 -05:00
sabi0	7b8aece125	Use Collections.addAll() instead of manual array copy and misc. code cleanups (#12977 )	2024-01-04 22:27:36 +01:00
sabi0	1a939410dd	Misc code cleanups (#12974 )	2024-01-04 08:37:49 +01:00
Kaival Parikh	248f067d52	Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery (#12988 ) ### Description Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph. This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions the same as [`BaseKnnVectorQueryTestCase.testRandom`](`dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)`) (a similar test for KNN with random vectors). ### Command to reproduce ``` ./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361 ```	2024-01-02 13:06:12 -05:00
sabi0	78b4f75a2c	Replace .collect(toList()) with .toList() and misc. code cleanups (#12978 )	2023-12-30 17:04:11 +01:00
sabi0	ec9e593dc4	Remove obsolete 'mappingRules' in Tokenizer tests (#12972 )	2023-12-30 16:59:59 +01:00
sabi0	67d866c586	Minor code cleanups (intellij inspections).	2023-12-30 16:55:49 +01:00
Uwe Schindler	346f4ff7d2	Move changes entry to 9.10 (#12841 )	2023-12-29 13:08:49 +01:00
sabi0	64cf54a4bf	Replace "UTF-8" with StandardCharsets.UTF_8 and other typo and minor cleanups (#12979 )	2023-12-28 19:42:22 +01:00
sabi0	91272f45da	Replace println(String.format(...)) with printf(...) (#12976 )	2023-12-28 19:32:06 +01:00
sabi0	57b104e806	Get rid of inefficient Stream.count() (#12975 )	2023-12-28 19:30:01 +01:00
sabi0	9c9949b2bc	Remove unused imports (#12970 )	2023-12-28 19:28:24 +01:00
Patrick Zhai	948970be58	Fix bug where NFARunAutomaton#getTransition does not set Transition correctly (#12909 )	2023-12-27 22:49:35 -08:00
sabi0	02722eeb69	Add missing spaces in concatenated strings (#12967 )	2023-12-23 20:30:30 -05:00
Zhang Chao	dc9f154aa5	Move group-varint encoding/decoding logic to DataOutput/DataInput (#12841 )	2023-12-23 13:18:34 +01:00
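
The idea described in commit 2a0b7f2056 (Split taxonomy arrays across chunks) can be sketched roughly as follows. This is not the Lucene implementation: the class name `ChunkedIntArraySketch`, the tiny `CHUNK_SIZE`, and the `append` and `sharesChunkWith` methods are all hypothetical, chosen only to illustrate how an append-only array split into fixed-size blocks can reuse already-full blocks instead of copying every value when it grows.

```java
// Hypothetical sketch, not Lucene code: an immutable, append-only int array
// stored as fixed-size chunks. Growing it shares the full chunks of the
// previous instance by reference and only copies the partially filled tail.
final class ChunkedIntArraySketch {
  // Deliberately tiny so the behavior is easy to follow; a real chunk
  // size would be much larger.
  private static final int CHUNK_SIZE = 4;

  private final int[][] chunks;
  private final int size;

  private ChunkedIntArraySketch(int[][] chunks, int size) {
    this.chunks = chunks;
    this.size = size;
  }

  static ChunkedIntArraySketch empty() {
    return new ChunkedIntArraySketch(new int[0][], 0);
  }

  int size() {
    return size;
  }

  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
  }

  // Returns a new array holding the old values followed by the new ones.
  ChunkedIntArraySketch append(int... values) {
    int newSize = size + values.length;
    int newChunkCount = (newSize + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int[][] newChunks = new int[newChunkCount][];

    // Full chunks can never change again, so they are shared, not copied.
    int fullChunks = size / CHUNK_SIZE;
    System.arraycopy(chunks, 0, newChunks, 0, fullChunks);

    // Every remaining chunk is freshly allocated.
    for (int c = fullChunks; c < newChunkCount; c++) {
      newChunks[c] = new int[CHUNK_SIZE];
    }

    // Copy the values of the old, partially filled tail chunk (if any).
    int tail = size % CHUNK_SIZE;
    if (tail > 0) {
      System.arraycopy(chunks[fullChunks], 0, newChunks[fullChunks], 0, tail);
    }

    // Write the appended values after the old ones.
    for (int i = 0; i < values.length; i++) {
      int pos = size + i;
      newChunks[pos / CHUNK_SIZE][pos % CHUNK_SIZE] = values[i];
    }
    return new ChunkedIntArraySketch(newChunks, newSize);
  }

  // True if both arrays point at the very same chunk object.
  boolean sharesChunkWith(ChunkedIntArraySketch other, int chunkIndex) {
    return chunks[chunkIndex] == other.chunks[chunkIndex];
  }

  public static void main(String[] args) {
    ChunkedIntArraySketch first = empty().append(1, 2, 3, 4, 5);
    ChunkedIntArraySketch second = first.append(6, 7, 8, 9);
    System.out.println("size after growth: " + second.size());
    System.out.println("chunk 0 reused: " + second.sharesChunkWith(first, 0));
  }
}
```

In this sketch the cost of growing is proportional to the number of new values plus at most one chunk, rather than to the total array size, which is the trade-off the commit message points at.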

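Commit 7dfef017e3 (Output binary doc values as hex array in SimpleTextCodec) describes encoding arbitrary bytes as a string representing an array of hexadecimal bytes. A minimal sketch of that kind of encoding is shown below. The class `HexArraySketch` and the exact bracketed, space-separated text layout are assumptions made for illustration; the format SimpleTextCodec actually writes may differ.

```java
import java.util.Locale;

// Hypothetical sketch, not Lucene code: round-trips arbitrary bytes through
// a plain-text form such as "[ff 00 c3 28]", so bytes that are not valid
// UTF-8 can still be stored in a text file.
final class HexArraySketch {

  // Encodes each byte as two lowercase hex digits, separated by spaces.
  static String encode(byte[] bytes) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < bytes.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      // Mask with 0xff so negative bytes are not sign-extended.
      sb.append(String.format(Locale.ROOT, "%02x", bytes[i] & 0xff));
    }
    return sb.append(']').toString();
  }

  // Parses the text form produced by encode() back into bytes.
  static byte[] decode(String text) {
    if (text.length() < 2
        || text.charAt(0) != '['
        || text.charAt(text.length() - 1) != ']') {
      throw new IllegalArgumentException("not a hex array: " + text);
    }
    String inner = text.substring(1, text.length() - 1).trim();
    if (inner.isEmpty()) {
      return new byte[0];
    }
    String[] parts = inner.split(" ");
    byte[] out = new byte[parts.length];
    for (int i = 0; i < parts.length; i++) {
      out[i] = (byte) Integer.parseInt(parts[i], 16);
    }
    return out;
  }

  public static void main(String[] args) {
    // 0xc3 0x28 is an invalid UTF-8 sequence, which is the problematic case.
    byte[] raw = {(byte) 0xff, 0x00, (byte) 0xc3, 0x28};
    System.out.println(encode(raw));
  }
}
```

The text form is longer than the raw bytes, but it is always plain ASCII and can be decoded without loss.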
37124 Commits (All Branches)