lucene

mirror of https://github.com/apache/lucene.git synced 2025-02-10 20:15:18 +00:00

Author	SHA1	Message	Date
Adrien Grand	e0daca1eb4	Make sure `DocumentsWriterPerThread#getAndLock` never returns `null` on a non-empty queue. (#12959 ) Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `DocumentsWriterPerThread#getAndLock` was not the main source of contention. Closes #12649 #12916	2024-01-12 16:21:01 +01:00
Michael Froh	7dfef017e3	Output binary doc values as hex array in SimpleTextCodec (#12987 ) Binary doc values were being written directly in SimpleTextCodec, though they may not be valid UTF-8 (i.e. they may not be "text"). This change encodes them as a string representing an array of hexadecimal bytes.	2024-01-12 16:09:18 +01:00
kuramitsu	8bee41880e	Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji (#12885 )	2024-01-11 11:33:16 -08:00
Dzung Bui	2a851401a1	Clean up unused code & variables (#12994 )	2024-01-11 18:43:25 +01:00
Simon Willnauer	df6bd25ce4	Add support for index sorting with document blocks (#12829 ) Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block will not be reordered by the sorting algorithm and will sort alongside their parent documents. Relates to #12711	2024-01-11 16:11:15 +01:00
Michael Froh	b7728c5657	Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos (#12897 ) The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed.	2024-01-11 15:45:48 +01:00
Zhang Chao	75e1a0b96c	Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false (#12997 )	2024-01-11 15:37:52 +01:00
zhouhui	4b1180372e	Copy collected acc(maxFreqs) into empty acc, rather than merge them. (#12846 )	2024-01-11 14:44:11 +01:00
Dzung Bui	701619d35a	Lazily write the FST padding byte (#12981 ) * lazily write the FST padding byte * Also write the pad byte when there is emptyOutput * add comment * Add more comments	2024-01-11 07:31:24 -05:00
sabi0	09837bae73	Cleanup comments and asserts in TestIndexWriter (#13006 )	2024-01-10 20:47:03 +01:00
sabi0	f67b1b3d1f	Simplify asserts in TestWordBreakSpellChecker (#13007 )	2024-01-10 20:33:40 +01:00
Michael Gibney	89a02fa4e3	Use Automaton for SurroundQuery prefix/pattern matching (#12999 )	2024-01-10 13:14:21 -05:00
sabi0	d7a14257ce	Get rid of deprecated assertThat() usages (#12982 )	2024-01-10 16:31:29 +01:00
Andrew Ross	872702d828	Remove outdated comment from TaskExecutor (#12993 ) A previous iteration of this code used an AtomicInteger and required this comment. The committed version uses a self-documenting boolean and the comment is not needed.	2024-01-10 10:35:23 +01:00
Simon Willnauer	4d916a754b	Fix test to also take into account minor versions for BWC	2024-01-09 12:21:13 +01:00
Simon Willnauer	ea327220a8	Remove stale BWC tests (#12874 ) Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex` looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.	2024-01-09 11:49:53 +01:00
sabi0	5442748995	Fix missing variable assignment in testAllVersionHaveCfsAndNocfs() and other minor code cleanups (#12969 )	2024-01-09 11:04:31 +01:00
sabi0	0fc1e2c2f7	Code cleanups in EscapeQuerySyntaxImpl (#12973 )	2024-01-08 22:18:37 +01:00
Jakub Slowinski	6d27c20579	Fix only use of .toLowerCase() with no Locale (#12856 )	2024-01-08 22:04:04 +01:00
sabi0	a32f6acadf	Remove unnecessary fields loop from extractWeightedSpanTerms() (#12965 )	2024-01-08 22:01:56 +01:00
Marc D'Mello	376bd24693	Improve code clarity for OrdinalMap (#11729 ) Closes #11728	2024-01-08 14:00:53 +01:00
Michael McCandless	3c235bb7b4	LockVerifyServer does not need to reuse addresses nor set accept timeout (#12535 )	2024-01-08 13:53:08 +01:00
gf2121	67be0189bc	clean up sleep (#12914 )	2024-01-08 13:48:26 +01:00
Adrien Grand	40060f8b70	Reduce contention on flushControl.isFullFlush(). (#12958 ) `flushControl.isFullFlush()` is a surprising source of contention with documents that are cheap to index and many indexing threads. If I slightly modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing buffer and disable `TextField` fields, which are more costly to index than `KeywordField` or `IntField` fields, this brings the time to load all the dataset in the `IndexWriter` buffers from 8.0s to 7.0s.	2024-01-08 13:23:05 +01:00
Stefan Vodita	115a30d462	Increase stale PRs action budget and mark not debug-only (#12998 )	2024-01-08 07:20:59 -05:00
Stefan Vodita	564b2ebecc	Introduce workflow for stale PRs (#12813 ) * Introduce stale workflow * Exempt draft PRs * Tune the action to our needs 1. Don't mark issues stale, only PRs. 2. Don't close anything automatically. 3. Keep the default Stale label. 4. Run in debug-only mode to start.	2024-01-08 06:22:19 -05:00
Dzung Bui	4c883a414c	Optimize FST on-heap BytesReader (#12879 ) * Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Improvement of ReadWriteDataOutput * tidy code * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Do not expose blockBits * tidy code * Remove 0 initialization * Add assertion and comment	2024-01-06 07:47:19 -05:00
sabi0	7b8aece125	Use Collections.addAll() instead of manual array copy and misc. code cleanups (#12977 )	2024-01-04 22:27:36 +01:00
sabi0	1a939410dd	Misc code cleanups (#12974 )	2024-01-04 08:37:49 +01:00
Kaival Parikh	248f067d52	Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery (#12988 ) Description: Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph. This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions same as `BaseKnnVectorQueryTestCase.testRandom` (`lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java`, line 470 at commit `dc9f154aa5`), a similar test for KNN with random vectors. Command to reproduce: `./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361`	2024-01-02 13:06:12 -05:00
sabi0	78b4f75a2c	Replace .collect(toList()) with .toList() and misc. code cleanups (#12978 )	2023-12-30 17:04:11 +01:00
sabi0	ec9e593dc4	Remove obsolete 'mappingRules' in Tokenizer tests (#12972 )	2023-12-30 16:59:59 +01:00
sabi0	67d866c586	Minor code cleanups (intellij inspections).	2023-12-30 16:55:49 +01:00
Uwe Schindler	346f4ff7d2	Move changes entry to 9.10 (#12841 )	2023-12-29 13:08:49 +01:00
sabi0	64cf54a4bf	Replace "UTF-8" with StandardCharsets.UTF_8 and other typo and minor cleanups (#12979 )	2023-12-28 19:42:22 +01:00
sabi0	91272f45da	Replace println(String.format(...)) with printf(...) (#12976 )	2023-12-28 19:32:06 +01:00
sabi0	57b104e806	Get rid of inefficient Stream.count() (#12975 )	2023-12-28 19:30:01 +01:00
sabi0	9c9949b2bc	Remove unused imports (#12970 )	2023-12-28 19:28:24 +01:00
Patrick Zhai	948970be58	Fix bug where NFARunAutomaton#getTransition does not set Transition correctly (#12909 )	2023-12-27 22:49:35 -08:00
sabi0	02722eeb69	Add missing spaces in concatenated strings (#12967 )	2023-12-23 20:30:30 -05:00
Zhang Chao	dc9f154aa5	Move group-varint encoding/decoding logic to DataOutput/DataInput (#12841 )	2023-12-23 13:18:34 +01:00
sabi0	9359a9dcff	Update contributing guide: autocrlf and build dependencies (#12963 )	2023-12-22 09:28:53 +01:00
sabi0	f6b2006195	Fix typo in help/formatting.txt (#12960 )	2023-12-21 19:58:53 +01:00
Adrien Grand	91002d04d3	Fix CheckIndex to correctly flag the automaton as binary.	2023-12-20 14:39:32 +01:00
Zhang Chao	5152051f68	Improve Javadoc for DocValuesConsumer (#12952 )	2023-12-20 13:40:44 +01:00
Adrien Grand	bcc7e120ba	Modernize LineFileDocs. (#12929 ) This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and `IntPoint`/`NumericDocValuesField` with `IntField`.	2023-12-19 11:25:26 +01:00
Adrien Grand	5c084fcd6e	Add a stored fields test that indexes LineFileDocs. (#12927 ) Real-world data exhibits patterns that are taken advantage of by the compression logic, but also hardly reproducible in a randomized way. This makes this new test introduce interesting coverage. It takes one second to run on my machine, so I did not mark it `@Nightly`.	2023-12-19 11:20:14 +01:00
Adrien Grand	bf45ab79ec	Beef up `Terms#intersect` checks in `CheckIndex`. (#12926 ) Now also testing what happens with a non-null `startTerm`. This found bugs in `DirectPostingsFormat`.	2023-12-19 11:17:38 +01:00
Lukáš Vlček	5d6086e199	Fix position increment in (Reverse)PathHierarchyTokenizer (#12875 ) * Fix PathHierarchyTokenizer positions PathHierarchyTokenizer was emitting multiple tokens in the same position with changing offsets. To be consistent with EdgeNGramTokenizer (which is conceptually similar -- it's emitting multiple prefixes/suffixes off the input string), we can output every token with length 1 with positions incrementing by 1. * Fix ReversePathHierarchyTokenizer positions Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer. --------- Co-authored-by: Michael Froh <froh@amazon.com>	2023-12-18 08:48:22 -05:00
Dawid Weiss	6bb244a932	An improved check for ignoring the c2-crash test if running on a client compiler. (#12953 )	2023-12-18 12:37:57 +01:00

37230 Commits