37425 Commits

Author SHA1 Message Date
Michael Froh b7728c5657
Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos (#12897)
The SimpleTextSegmentInfoFormat was writing the random byte array used
as a segment's ID directly -- not converting to a simple text
representation of the byte array. As a result, the segment infos were
often malformed.
2024-01-11 15:45:48 +01:00
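The failure mode described in the commit above can be illustrated with a minimal, self-contained sketch. This is not the committed Lucene code: the class `SegmentIdText` and its helpers `toText` and `fromText` are hypothetical, and hex is only one possible "simple text representation" of a byte array. The sketch shows that random bytes written directly are generally not valid UTF-8 and do not survive a decode/encode round trip, whereas a textual encoding does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SegmentIdText {
  // Hypothetical helper: render the ID as lowercase hex, which is plain ASCII.
  static String toText(byte[] id) {
    StringBuilder sb = new StringBuilder(id.length * 2);
    for (byte b : id) {
      sb.append(Character.forDigit((b >> 4) & 0xF, 16));
      sb.append(Character.forDigit(b & 0xF, 16));
    }
    return sb.toString();
  }

  // Hypothetical helper: parse the hex text back into the original bytes.
  static byte[] fromText(String text) {
    byte[] id = new byte[text.length() / 2];
    for (int i = 0; i < id.length; i++) {
      int hi = Character.digit(text.charAt(2 * i), 16);
      int lo = Character.digit(text.charAt(2 * i + 1), 16);
      id[i] = (byte) ((hi << 4) | lo);
    }
    return id;
  }

  public static void main(String[] args) {
    // Stand-in for a random segment ID; 0xC3 0x28 and 0xFF are invalid UTF-8.
    byte[] id = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00, 0x7F};

    // Treating the raw bytes as UTF-8 is lossy: malformed input becomes U+FFFD.
    String raw = new String(id, StandardCharsets.UTF_8);
    boolean rawSurvives = Arrays.equals(id, raw.getBytes(StandardCharsets.UTF_8));
    System.out.println("raw bytes round-trip: " + rawSurvives);

    // The text form is ASCII and converts back to the same bytes.
    String text = toText(id);
    System.out.println("text form: " + text);
    System.out.println("text round-trip: " + Arrays.equals(id, fromText(text)));
  }
}
```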
Zhang Chao 75e1a0b96c
Avoid reset BlockDocsEnum#freqBuffer when indexHasFreq is false (#12997) 2024-01-11 15:37:52 +01:00
zhouhui 4b1180372e
Copy collected acc(maxFreqs) into empty acc, rather than merge them. (#12846) 2024-01-11 14:44:11 +01:00
Dzung Bui 701619d35a
Lazily write the FST padding byte (#12981)
* lazily write the FST padding byte

* Also write the pad byte when there is emptyOutput

* add comment

* Add more comments
2024-01-11 07:31:24 -05:00
sabi0 09837bae73
Cleanup comments and asserts in TestIndexWriter (#13006) 2024-01-10 20:47:03 +01:00
sabi0 f67b1b3d1f
Simplify asserts in TestWordBreakSpellChecker (#13007) 2024-01-10 20:33:40 +01:00
Michael Gibney 89a02fa4e3
Use Automaton for SurroundQuery prefix/pattern matching (#12999) 2024-01-10 13:14:21 -05:00
sabi0 d7a14257ce
Get rid of deprecated assertThat() usages (#12982) 2024-01-10 16:31:29 +01:00
Andrew Ross 872702d828
Remove outdated comment from TaskExecutor (#12993)
A previous iteration of this code used an AtomicInteger and
required this comment. The committed version uses a self-documenting
boolean and the comment is not needed.
2024-01-10 10:35:23 +01:00
Simon Willnauer 4d916a754b Fix test to also take into account minor versions for BWC 2024-01-09 12:21:13 +01:00
Simon Willnauer ea327220a8
Remove stale BWC tests (#12874)
Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex`
looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.
2024-01-09 11:49:53 +01:00
sabi0 5442748995
Fix missing variable assignment in testAllVersionHaveCfsAndNocfs() and other minor code cleanups (#12969) 2024-01-09 11:04:31 +01:00
sabi0 0fc1e2c2f7
Code cleanups in EscapeQuerySyntaxImpl (#12973) 2024-01-08 22:18:37 +01:00
Jakub Slowinski 6d27c20579
Fix only use of .toLowerCase() with no Locale (#12856) 2024-01-08 22:04:04 +01:00
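For context on why a locale-less `toLowerCase()` is usually flagged, here is a small sketch. The commit message does not say which locale the fix passes; `Locale.ROOT` is assumed here, and the class `LowerCaseLocale` and its method `normalize` are hypothetical names used only for illustration.

```java
import java.util.Locale;

class LowerCaseLocale {
  // Locale-independent lowercasing (assumption: Locale.ROOT is the intended choice).
  static String normalize(String s) {
    return s.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // Under a Turkish locale, 'I' lowercases to dotless 'ı' (U+0131), not 'i'.
    // A no-argument toLowerCase() uses the JVM default locale and would behave
    // like this on a machine configured for Turkish.
    Locale turkish = Locale.forLanguageTag("tr-TR");
    System.out.println("TITLE".toLowerCase(turkish));

    // With an explicit root locale the result does not depend on the machine.
    System.out.println(normalize("TITLE"));
  }
}
```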
sabi0 a32f6acadf
Remove unnecessary fields loop from extractWeightedSpanTerms() (#12965) 2024-01-08 22:01:56 +01:00
Marc D'Mello 376bd24693 Improve code clarity for OrdinalMap (#11729)
Closes #11728
2024-01-08 14:00:53 +01:00
Michael McCandless 3c235bb7b4
LockVerifyServer does not need to reuse addresses nor set accept timeout (#12535) 2024-01-08 13:53:08 +01:00
gf2121 67be0189bc
clean up sleep (#12914) 2024-01-08 13:48:26 +01:00
Adrien Grand 40060f8b70
Reduce contention on flushControl.isFullFlush(). (#12958)
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. If I slightly
modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing
buffer and disable `TextField` fields, which are more costly to index than
`KeywordField` or `IntField` fields, this brings the time to load all the
dataset in the `IndexWriter` buffers from 8.0s to 7.0s.
2024-01-08 13:23:05 +01:00
Stefan Vodita 115a30d462
Increase stale PRs action budget and mark not debug-only (#12998) 2024-01-08 07:20:59 -05:00
Stefan Vodita 564b2ebecc
Introduce workflow for stale PRs (#12813)
* Introduce stale workflow

* Exempt draft PRs

* Tune the action to our needs

1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
2024-01-08 06:22:19 -05:00
Dzung Bui 4c883a414c
Optimize FST on-heap BytesReader (#12879)
* Move size() to FSTStore

* Remove size() completely

* Allow FST builder to use different DataOutput

* access BytesStore byte[] directly for copying

* Rename BytesStore

* Change class to final

* Reorder methods

* Remove unused methods

* Rename truncate to setPosition() and remove skipBytes()

* Simplify the writing operations

* Update comment

* remove unused parameter

* Simplify BytesStore operation

* tidy code

* Rename copyBytes to writeTo

* Simplify BytesStore operations

* Embed writeBytes() to FSTCompiler

* Fix the write bytes method

* Remove the default block bits constant

* add assertion

* Rename method parameter names

* Move reverse to FSTCompiler

* Revert setPosition call

* Address comments

* Return immediately when writing 0 bytes

* Add comment &

* Rename variables

* Fix the compile error

* Remove isReadable()

* Remove isReadable()

* Optimize ReadWriteDataOutput

* tidy code

* Freeze the DataOutput once finished()

* Refactor

* freeze the DataOutput before use

* Improvement of ReadWriteDataOutput

* tidy code

* Address comments and add off-heap FST tests

* Remove the hardcoded random

* Ignore the Test2BFSTOffHeap test

* Simplify ReadWriteDataOutput

* Do not expose blockBits

* tidy code

* Remove 0 initialization

* Add assertion and comment
2024-01-06 07:47:19 -05:00
sabi0 7b8aece125
Use Collections.addAll() instead of manual array copy and misc. code cleanups (#12977) 2024-01-04 22:27:36 +01:00
sabi0 1a939410dd
Misc code cleanups (#12974) 2024-01-04 08:37:49 +01:00
Kaival Parikh 248f067d52
Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery (#12988)
### Description

Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph

This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions same as [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (similar test for KNN with random vectors)

### Command to reproduce

```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```
2024-01-02 13:06:12 -05:00
sabi0 78b4f75a2c
Replace .collect(toList()) with .toList() and misc. code cleanups (#12978) 2023-12-30 17:04:11 +01:00
sabi0 ec9e593dc4
Remove obsolete 'mappingRules' in Tokenizer tests (#12972) 2023-12-30 16:59:59 +01:00
sabi0 67d866c586
Minor code cleanups (intellij inspections). 2023-12-30 16:55:49 +01:00
Uwe Schindler 346f4ff7d2 Move changes entry to 9.10 (#12841) 2023-12-29 13:08:49 +01:00
sabi0 64cf54a4bf
Replace "UTF-8" with StandardCharsets.UTF_8 and other typo and minor cleanups (#12979) 2023-12-28 19:42:22 +01:00
sabi0 91272f45da
Replace println(String.format(...)) with printf(...) (#12976) 2023-12-28 19:32:06 +01:00
sabi0 57b104e806
Get rid of inefficient Stream.count() (#12975) 2023-12-28 19:30:01 +01:00
sabi0 9c9949b2bc
Remove unused imports (#12970) 2023-12-28 19:28:24 +01:00
Patrick Zhai 948970be58
Fix bug where NFARunAutomaton#getTransition does not set Transition correctly (#12909) 2023-12-27 22:49:35 -08:00
sabi0 02722eeb69
Add missing spaces in concatenated strings (#12967) 2023-12-23 20:30:30 -05:00
Zhang Chao dc9f154aa5
Move group-varint encoding/decoding logic to DataOutput/DataInput (#12841) 2023-12-23 13:18:34 +01:00
sabi0 9359a9dcff
Update contributing guide: autocrlf and build dependencies (#12963) 2023-12-22 09:28:53 +01:00
sabi0 f6b2006195
Fix typo in help/formatting.txt (#12960) 2023-12-21 19:58:53 +01:00
Adrien Grand 91002d04d3 Fix CheckIndex to correctly flag the automaton as binary. 2023-12-20 14:39:32 +01:00
Zhang Chao 5152051f68
Improve Javadoc for DocValuesConsumer (#12952) 2023-12-20 13:40:44 +01:00
Adrien Grand bcc7e120ba
Modernize LineFileDocs. (#12929)
This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and
`IntPoint`/`NumericDocValuesField` with `IntField`.
2023-12-19 11:25:26 +01:00
Adrien Grand 5c084fcd6e
Add a stored fields test that indexes LineFileDocs. (#12927)
Real-world data exhibits patterns that are taken advantage of by the
compression logic, but also hardly reproducible in a randomized way. This makes
this new test introduce interesting coverage.

It takes one second to run on my machine, so I did not mark it `@Nightly`.
2023-12-19 11:20:14 +01:00
Adrien Grand bf45ab79ec
Beef up `Terms#intersect` checks in `CheckIndex`. (#12926)
Now also testing what happens with a non-null `startTerm`. This found bugs in
`DirectPostingsFormat`.
2023-12-19 11:17:38 +01:00
Lukáš Vlček 5d6086e199
Fix position increment in (Reverse)PathHierarchyTokenizer (#12875)
* Fix PathHierarchyTokenizer positions

PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.

* Fix ReversePathHierarchyTokenizer positions

Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.

---------

Co-authored-by: Michael Froh <froh@amazon.com>
2023-12-18 08:48:22 -05:00
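The position scheme described in the commit above can be sketched without any Lucene dependency. The code below is a simplified stand-in, not the tokenizer itself: `PathPrefixPositions`, `Token`, and `tokenize` are hypothetical names, and offsets, skip counts, and replacement characters are ignored. It emits each path prefix as its own token and advances the position by 1 per token, where the earlier behaviour described in the message would have kept every prefix at the same position.

```java
import java.util.ArrayList;
import java.util.List;

class PathPrefixPositions {
  // Hypothetical token holder: the prefix text and its absolute position.
  static final class Token {
    final String text;
    final int position;

    Token(String text, int position) {
      this.text = text;
      this.position = position;
    }
  }

  // Emit one token per path prefix; each token's position is one more than the last.
  static List<Token> tokenize(String path, char delimiter) {
    List<Token> tokens = new ArrayList<>();
    int position = 0;
    for (int i = 1; i <= path.length(); i++) {
      // A prefix ends just before a delimiter or at the end of the input.
      if (i == path.length() || path.charAt(i) == delimiter) {
        tokens.add(new Token(path.substring(0, i), position++));
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    for (Token t : tokenize("/a/b/c", '/')) {
      System.out.println(t.position + " " + t.text);
    }
  }
}
```

For the input `/a/b/c` this yields the prefixes `/a`, `/a/b`, and `/a/b/c` at positions 0, 1, and 2.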
Dawid Weiss 6bb244a932
An improved check for ignoring the c2-crash test if running on a client compiler. (#12953) 2023-12-18 12:37:57 +01:00
ChrisHegarty f6582ce048 Add back-compat indices for 9.9.1 2023-12-17 09:39:46 +00:00
ChrisHegarty 08728bf202 Add bugfix version 9.9.1 2023-12-17 09:20:34 +00:00
ChrisHegarty 1f1d0735c8 DOAP changes for release 9.9.1 2023-12-16 22:55:20 +00:00
Michael Sokolov 49d521145d
Use hppc IntIntHashMap to avoid Integer box/unbox when remapping vector ordinals during merge (#12950) 2023-12-15 13:24:05 -05:00
Benjamin Trent 423f8279f0
Fix flaky tests that are caused by small float vectors (#12943)
While quantization generally works well, when the number of dimensions is tiny (just two like in our tests), and we are indexing a circle, and we have random merge policies, we can end up getting unexpected ordering on the resulting vectors.

closes: https://github.com/apache/lucene/issues/12940
2023-12-14 14:38:22 -05:00