lucene

Commit Graph

Author	SHA1	Message	Date
Michael Froh	7dfef017e3	Output binary doc values as hex array in SimpleTextCodec (#12987 ) Binary doc values were being written directly in SimpleTextCodec, though they may not be valid UTF-8 (i.e. they may not be "text"). This change encodes them as a string representing an array of hexadecimal bytes.	2024-01-12 16:09:18 +01:00
kuramitsu	8bee41880e	Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji (#12885 )	2024-01-11 11:33:16 -08:00
Dzung Bui	2a851401a1	Clean up unused code & variables (#12994 )	2024-01-11 18:43:25 +01:00
Simon Willnauer	df6bd25ce4	Add support for index sorting with document blocks (#12829 ) Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block will not be reordered by the sorting algorithm and will sort alongside their parent documents. Relates to #12711	2024-01-11 16:11:15 +01:00
Michael Froh	b7728c5657	Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos (#12897 ) The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed.	2024-01-11 15:45:48 +01:00
Zhang Chao	75e1a0b96c	Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false (#12997 )	2024-01-11 15:37:52 +01:00
zhouhui	4b1180372e	Copy collected acc(maxFreqs) into empty acc, rather than merge them. (#12846 )	2024-01-11 14:44:11 +01:00
Dzung Bui	701619d35a	Lazily write the FST padding byte (#12981 ) * lazily write the FST padding byte * Also write the pad byte when there is emptyOutput * add comment * Add more comments	2024-01-11 07:31:24 -05:00
sabi0	09837bae73	Cleanup comments and asserts in TestIndexWriter (#13006 )	2024-01-10 20:47:03 +01:00
sabi0	f67b1b3d1f	Simplify asserts in TestWordBreakSpellChecker (#13007 )	2024-01-10 20:33:40 +01:00
Michael Gibney	89a02fa4e3	Use Automaton for SurroundQuery prefix/pattern matching (#12999 )	2024-01-10 13:14:21 -05:00
sabi0	d7a14257ce	Get rid of deprecated assertThat() usages (#12982 )	2024-01-10 16:31:29 +01:00
Andrew Ross	872702d828	Remove outdated comment from TaskExecutor (#12993 ) A previous iteration of this code used an AtomicInteger and required this comment. The committed version uses a self-documenting boolean and the comment is not needed.	2024-01-10 10:35:23 +01:00
Simon Willnauer	4d916a754b	Fix test to also take into account minor versions for BWC	2024-01-09 12:21:13 +01:00
Simon Willnauer	ea327220a8	Remove stale BWC tests (#12874 ) Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex` looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.	2024-01-09 11:49:53 +01:00
sabi0	5442748995	Fix missing variable assignment in testAllVersionHaveCfsAndNocfs() and other minor code cleanups (#12969 )	2024-01-09 11:04:31 +01:00
sabi0	0fc1e2c2f7	Code cleanups in EscapeQuerySyntaxImpl (#12973 )	2024-01-08 22:18:37 +01:00
Jakub Slowinski	6d27c20579	Fix only use of .toLowerCase() with no Locale (#12856 )	2024-01-08 22:04:04 +01:00
sabi0	a32f6acadf	Remove unnecessary fields loop from extractWeightedSpanTerms() (#12965 )	2024-01-08 22:01:56 +01:00
Marc D'Mello	376bd24693	Improve code clarity for OrdinalMap (#11729 ) Closes #11728	2024-01-08 14:00:53 +01:00
Michael McCandless	3c235bb7b4	LockVerifyServer does not need to reuse addresses nor set accept timeout (#12535 )	2024-01-08 13:53:08 +01:00
gf2121	67be0189bc	clean up sleep (#12914 )	2024-01-08 13:48:26 +01:00
Adrien Grand	40060f8b70	Reduce contention on flushControl.isFullFlush(). (#12958 ) `flushControl.isFullFlush()` is a surprising source of contention with documents that are cheap to index and many indexing threads. If I slightly modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing buffer and disable `TextField` fields, which are more costly to index than `KeywordField` or `IntField` fields, this brings the time to load all the dataset in the `IndexWriter` buffers from 8.0s to 7.0s.	2024-01-08 13:23:05 +01:00
Stefan Vodita	115a30d462	Increase stale PRs action budget and mark not debug-only (#12998 )	2024-01-08 07:20:59 -05:00
Stefan Vodita	564b2ebecc	Introduce workflow for stale PRs (#12813 ) * Introduce stale workflow * Exempt draft PRs * Tune the action to our needs 1. Don't mark issues stale, only PRs. 2. Don't close anything automatically. 3. Keep the default Stale label. 4. Run in debug-only mode to start.	2024-01-08 06:22:19 -05:00
Dzung Bui	4c883a414c	Optimize FST on-heap BytesReader (#12879 ) * Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Improvement of ReadWriteDataOutput * tidy code * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Do not expose blockBits * tidy code * Remove 0 initialization * Add assertion and comment	2024-01-06 07:47:19 -05:00
sabi0	7b8aece125	Use Collections.addAll() instead of manual array copy and misc. code cleanups (#12977 )	2024-01-04 22:27:36 +01:00
sabi0	1a939410dd	Misc code cleanups (#12974 )	2024-01-04 08:37:49 +01:00
Kaival Parikh	248f067d52	Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery (#12988 ) Description: Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph. This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions the same as `BaseKnnVectorQueryTestCase.testRandom` (`dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java`, L470), a similar test for KNN with random vectors. Command to reproduce: `./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361`	2024-01-02 13:06:12 -05:00
sabi0	78b4f75a2c	Replace .collect(toList()) with .toList() and misc. code cleanups (#12978 )	2023-12-30 17:04:11 +01:00
sabi0	ec9e593dc4	Remove obsolete 'mappingRules' in Tokenizer tests (#12972 )	2023-12-30 16:59:59 +01:00
sabi0	67d866c586	Minor code cleanups (intellij inspections).	2023-12-30 16:55:49 +01:00
Uwe Schindler	346f4ff7d2	Move changes entry to 9.10 (#12841 )	2023-12-29 13:08:49 +01:00
sabi0	64cf54a4bf	Replace "UTF-8" with StandardCharsets.UTF_8 and other typo and minor cleanups (#12979 )	2023-12-28 19:42:22 +01:00
sabi0	91272f45da	Replace println(String.format(...)) with printf(...) (#12976 )	2023-12-28 19:32:06 +01:00
sabi0	57b104e806	Get rid of inefficient Stream.count() (#12975 )	2023-12-28 19:30:01 +01:00
sabi0	9c9949b2bc	Remove unused imports (#12970 )	2023-12-28 19:28:24 +01:00
Patrick Zhai	948970be58	Fix bug where NFARunAutomaton#getTransition does not set Transition correctly (#12909 )	2023-12-27 22:49:35 -08:00
sabi0	02722eeb69	Add missing spaces in concatenated strings (#12967 )	2023-12-23 20:30:30 -05:00
Zhang Chao	dc9f154aa5	Move group-varint encoding/decoding logic to DataOutput/DataInput (#12841 )	2023-12-23 13:18:34 +01:00
sabi0	9359a9dcff	Update contributing guide: autocrlf and build dependencies (#12963 )	2023-12-22 09:28:53 +01:00
sabi0	f6b2006195	Fix typo in help/formatting.txt (#12960 )	2023-12-21 19:58:53 +01:00
Adrien Grand	91002d04d3	Fix CheckIndex to correctly flag the automaton as binary.	2023-12-20 14:39:32 +01:00
Zhang Chao	5152051f68	Improve Javadoc for DocValuesConsumer (#12952 )	2023-12-20 13:40:44 +01:00
Adrien Grand	bcc7e120ba	Modernize LineFileDocs. (#12929 ) This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and `IntPoint`/`NumericDocValuesField` with `IntField`.	2023-12-19 11:25:26 +01:00
Adrien Grand	5c084fcd6e	Add a stored fields test that indexes LineFileDocs. (#12927 ) Real-world data exhibits patterns that are taken advantage of by the compression logic, but also hardly reproducible in a randomized way. This makes this new test introduce interesting coverage. It takes one second to run on my machine, so I did not mark it `@Nightly`.	2023-12-19 11:20:14 +01:00
Adrien Grand	bf45ab79ec	Beef up `Terms#intersect` checks in `CheckIndex`. (#12926 ) Now also testing what happens with a non-null `startTerm`. This found bugs in `DirectPostingsFormat`.	2023-12-19 11:17:38 +01:00
Lukáš Vlček	5d6086e199	Fix position increment in (Reverse)PathHierarchyTokenizer (#12875 ) * Fix PathHierarchyTokenizer positions PathHierarchyTokenizer was emitting multiple tokens in the same position with changing offsets. To be consistent with EdgeNGramTokenizer (which is conceptually similar -- it's emitting multiple prefixes/suffixes off the input string), we can output every token with length 1 with positions incrementing by 1. * Fix ReversePathHierarchyTokenizer positions Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer. --------- Co-authored-by: Michael Froh <froh@amazon.com>	2023-12-18 08:48:22 -05:00
Dawid Weiss	6bb244a932	An improved check for ignoring the c2-crash test if running on a client compiler. (#12953 )	2023-12-18 12:37:57 +01:00
ChrisHegarty	f6582ce048	Add back-compat indices for 9.9.1	2023-12-17 09:39:46 +00:00

37014 Commits
